From 8dae82dede05e818786ab9276271e37a6be92d96 Mon Sep 17 00:00:00 2001 From: JakubPietrakIntel Date: Wed, 7 Dec 2022 15:49:18 +0100 Subject: [PATCH] Squashed commit of the following: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit commit 63ebc8d6a000199e963d29b6c8a0f54d3150872b Author: Jakub Pietrak Date: Thu Dec 1 13:32:03 2022 +0100 rm print commit 2c8ffeaf1b2168ed9ad4ca6b192a1231fb036760 Author: Jakub Pietrak Date: Thu Dec 1 11:35:02 2022 +0100 pytorch_sparse.matmul to torch.sparse.matmul commit ee0e184a1ce5dc6ad7005a67621fac19d6fdbb0b Merge: 4562359b9f 3a858ba8e3 Author: Jakub Pietrak Date: Mon Nov 28 14:09:42 2022 +0100 Merge branch 'gh/mingfeima/85/head' of https://github.com/pytorch/pytorch into pyg-36 commit 4562359b9fb3de301690334a892d44911eda45c8 Merge: deba083400 b5616cd5f4 Author: Jakub Pietrak Date: Mon Nov 28 12:22:11 2022 +0000 Merge branch 'master' of https://github.com/pytorch/pytorch into pyg-36 commit deba0834008ad95af7e3a6603223a0f8a5555967 Merge: 0e1a8522bb a97d0508cb Author: Jakub Pietrak Date: Mon Nov 28 12:19:25 2022 +0000 Merge branch 'pyg-36' of https://github.com/JakubPietrakIntel/pytorch into pyg-36 commit 0e1a8522bb695387816a29bbfcf182962429b3ab Merge: 059a238619 75bfbc35ca Author: Jakub Pietrak Date: Mon Nov 28 12:16:35 2022 +0000 Merge remote-tracking branch 'origin/gh/mingfeima/85/head' into pyg-36 commit b5616cd5f4fc150138b79d3396a603eda6a7a8a8 Author: Michael Voznesensky Date: Mon Nov 28 05:12:37 2022 +0000 Add simple assert to detect fake tensors on modules (#89723) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89723 Approved by: https://github.com/ezyang commit db1f1144f1303db45e0b9d96e4bb6bdd87c80e5a Author: Edward Z. Yang Date: Sat Nov 26 13:52:28 2022 -0800 Beef up AOTAutograd logging with aot_id and input descriptions (#89710) A few things in this PR, that I found useful while debugging some recent issues: - We now allocate an aot_id to each aot_function/aot_module invocation, and print it whenever we report error messages and graph output logging. Check the comment for why this sort of thing is useful, and also why it's different from nth_graph. This number is now incorporated into aot_graph_name - I noticed that nth_graph only gets incremented when backwards is compiled. Because backwards is compiled lazily, this means that multiple forward graphs would have gotten the same ID! I change nth_graph to always increment to avoid confusion here. - I added a simple describe_input function, which makes use of num_params_buffers to tell the user if the input index they're looking at is a param/buffer or an input. With the help of https://github.com/pytorch/pytorch/pull/89709 we could give even more detailed information about inputs (we could also easily give detailed information about parameters if we stored a mapping of index to parameter name, but I didn't need this when debugging so I'll let someone else add it if they need it.) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89710 Approved by: https://github.com/bdhirsh commit 5f8848f32901e35cead64d520885f718679c2bbe Author: Edward Z. Yang Date: Thu Nov 24 15:26:55 2022 -0500 Don't suppress log messages for dynamo CI config (#89653) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89653 Approved by: https://github.com/albanD, https://github.com/kit1980 commit 1a2dd6b15e0089a9e45ba4feb90c2d0dfac19238 Author: Edward Z. 
Yang Date: Sun Nov 27 19:27:45 2022 -0500 Add single process version of dynamo distributed hf_Bert tests (#89721) It's a lot easier to debug problems in the Dynamo optimization pass if you aren't actually triggering a multiprocessing run. Keep these tests around. I think the other tests can probably get this treatment too, leaving this to future work. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89721 Approved by: https://github.com/voznesenskym commit 0e7c100c9b7417efb1a8f65778a1e3c9ad10ef3e Author: Edward Z. Yang Date: Sat Nov 26 11:25:24 2022 -0800 Add debug asserts to AOTAutograd for input consistency with compilation (#89702) Fixes https://github.com/pytorch/torchdynamo/issues/1927 Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89702 Approved by: https://github.com/bdhirsh commit 1f95f24d3003a35568a00b5e5e18439846089b0f Author: Edward Z. Yang Date: Sat Nov 26 11:25:24 2022 -0800 Factor input deduplication into a separate function (#89701) It turns out that instead of having a giant blobby aot_dispatch_autograd function, we can factor it into a series of wrapper functions, each of which successively guarantees more invariants on the inner compilation function until the final inner function is quite trivial. How exactly you have to wrap the input user functions and the output compiled functions can be expressed concisely in Haskell, so I've included the Haskell formulation in code comments. This PR shows how to do this for input deduplication. Dealing with the rest of the view handling is left to future work. This PR should also be a slight performance improvement as deduplicating is skipped entirely when there are no duplicate inputs. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89701 Approved by: https://github.com/bdhirsh commit dcefc8f90fbc86041a7abcce4f227d15c59bd96c Author: Edward Z. Yang Date: Sat Nov 26 14:28:56 2022 -0500 Implement guard_source on RandomValueSource (#89711) I audited the pattern matches on the enum and it didn't look like this one should apply there. Sorry, no test, I know this matters on symbolic-shapes branch but I haven't had time to extract out a minimal reproducer. Take my word for it. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89711 Approved by: https://github.com/jansel commit 1da633f98a5da000083c0c47d9e192b2689f867b Author: Edward Z. Yang Date: Thu Nov 24 13:57:17 2022 +0000 Access named parameters/buffers/etc via getattr rather than index (#89625) I'm not sure why this never caused problems before. The error manifests as `TypeError: 'MyModule' object is not subscriptable` Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89625 Approved by: https://github.com/albanD commit e36d68af8885f27d8c0b4727ab078bf53e55e7a0 Author: Horace He Date: Thu Nov 24 02:17:37 2022 +0000 Don't allow recomputing a node that *must* be materialized in the backwards pass (#89171) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89171 Approved by: https://github.com/ngimel commit b709078dc673cbd5025a1df3eae7f5c60acc2698 Author: Taylor Robie Date: Sat Nov 26 10:33:21 2022 -0800 [Profiler] Memory profiler part 11: Mark tensors created in the backward pass which don't correspond to parameters. (#88926) There are various Tensors created in the backward pass which do not correspond to parameters. 
We don't want to mark these as gradients, but we do still want to convey as much information as possible. Thus, this PR introduces an AUTOGRAD_DETAIL category. (Which can be grouped with GRADIENT in visualization if one wishes to take a coarse grained view of the world.) Differential Revision: [D40868661](https://our.internmc.facebook.com/intern/diff/D40868661/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88926 Approved by: https://github.com/chaekit commit 143d2881a844934c95c4ada63b38179d97e65af3 Author: Taylor Robie Date: Sat Nov 26 10:33:19 2022 -0800 [Profiler] Memory profiler part 10: Mark optimizer state (#88925) This is also a fairly simple pass, since we're simply collecting values from the python tracer. Differential Revision: [D40868664](https://our.internmc.facebook.com/intern/diff/D40868664/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88925 Approved by: https://github.com/chaekit commit ae725d501e33ed6f823997bea03d99cdc8dae5ff Author: Taylor Robie Date: Sat Nov 26 10:33:18 2022 -0800 [Profiler] Memory profiler part 9: Mark activations (#88924) This is a fairly straightforward pass: start at inputs and flood fill until we reach the backward pass. Differential Revision: [D40868662](https://our.internmc.facebook.com/intern/diff/D40868662/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88924 Approved by: https://github.com/chaekit commit 56e40fe054ecb7700142ea9ae7fe37e77800a2da Author: Yuxin Wu Date: Sun Nov 27 05:55:24 2022 +0000 Let SyncBatchNorm fallback to BN if not using distributed training (#89706) Fixes #63662 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89706 Approved by: https://github.com/soumith commit 39449ea61d9a6644731687219282f610cbf7cf54 Author: PyTorch MergeBot Date: Sun Nov 27 02:59:04 2022 +0000 [vision hash update] update the pinned vision hash (#89692) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89692 Approved by: https://github.com/pytorchbot commit 483d3a3d07e6694757c5158bc21f7f757f8c82c3 Author: Taylor Robie Date: Sat Nov 26 10:33:16 2022 -0800 [Profiler] E2E expecttests for category assignment (#88653) Up until now the unit tests for category assignment have been narrowly scoped to specific checks on specific Tensors. However as we start to reach reasonable levels of category assignment it's useful to supplement those tests with higher level summary tests to inspect the larger graph and confirm that it makes sense. (It will also be necessary for some categories like activations where it is tedious to record all relevant Tensors.) The general structure of these tests is to capture a model invocation with `__torch_dispatch__` and then cross reference those inputs and outputs with the categories assigned by the memory profiler. Differential Revision: [D40868659](https://our.internmc.facebook.com/intern/diff/D40868659/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88653 Approved by: https://github.com/chaekit commit 0435894bb3b2d60e5da9f993c2a56d95fb03a971 Author: Taylor Robie Date: Sat Nov 26 10:33:14 2022 -0800 [Profiler] Memory profiler part 8: Mark parameters. (#87568) Following the pattern of earlier PRs, we use two methods to extract parameters. The primary one is the Python tracer; both nn.Module and optim.Optimizer collect parameters and in most cases that is sufficient. 
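As a rough illustration of that primary collection path (a minimal sketch, not the profiler's actual tracer code; the variable names are made up for the example):

```python
import torch

# Sketch: gather candidate parameter Tensors from the same two sources the
# python tracer records -- nn.Module parameters and optimizer param_groups.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

param_ptrs = {p.data_ptr() for p in model.parameters()}
for group in opt.param_groups:
    param_ptrs.update(p.data_ptr() for p in group["params"])

print(len(param_ptrs))  # 2: the Linear weight and bias are seen by both sources
```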
As a fallback we can analyze the data flow graph and deduce likely parameters based on gradient computation and updates. Parameter identification has a circular interaction with input identification. Inputs are defined as "not part of the core forward-backward-update loop", but we need inputs for the parameter identification fallback to give us a proxy for the forward pass. Thus, we mark parameters from the python tracer which limits which Tensors get marked as inputs. While not necessary, it adds a bit of robustness. (As shown by the strengthening of the input unit tests.) Differential Revision: [D40238619](https://our.internmc.facebook.com/intern/diff/D40238619/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87568 Approved by: https://github.com/chaekit commit 17fa6bf1f57cbbe84a14566efcf00f21e1abe489 Author: Taylor Robie Date: Sat Nov 26 10:33:13 2022 -0800 [Profiler] Memory profiler part 7: Mark inputs (#87567) It is surprisingly difficult to identify the leaves of the data flow graph. The issue is that inputs and pre-existing parameters look identical until parameter identification takes place. It's not too bad for training since Autograd lets us differentiate between them however I still want the tool to do something reasonable in inference. Some of this will be ameliorated when a later PR pulls in parameters from python tracing. The current approach is passable, but I will continue to mull over refinements. Differential Revision: [D40220388](https://our.internmc.facebook.com/intern/diff/D40220388/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87567 Approved by: https://github.com/chaekit commit 64c5c77cd47212da719eb29c3b0a2b07cebb3705 Author: Taylor Robie Date: Sat Nov 26 10:33:11 2022 -0800 [Profiler] Memory profiler part 6: Mark gradients and temporary intermediates. (#87566) Semantic assignment will be built up as a series of passes which gradually pin down the regions of a trace. For this reason it is important to be very meticulous in the assignment of categories. We begin with gradients as they are both straightforward to identify and foundational to subsequent analysis. There are two mechanisms that the profiler can use to tag gradients, each with their own advantages and limitations. The first is direct inspection of the op graph which is generic but predicated on certain features of the Autograd engine. (And therefore not necessarily exhaustive.) The second approach is direct instrumentation via the python tracer. This method relies requires that gradients be attached to an nn.Module parameter and can miss corner cases such as `set_to_none=True` due to the cache structure of the python tracer. Combined these two approaches provide very high coverage. Temporaries are more straightforward; we can easily add them by trivial local inspection of a data flow node. Because this is the first PR in the end-to-end section most of the code is building the scaffolding for category bookkeeping and unit testing. (The actual gradient extraction was covered in an earlier PR.) Differential Revision: [D40220389](https://our.internmc.facebook.com/intern/diff/D40220389/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87566 Approved by: https://github.com/chaekit commit 5f09a6d573a2a07c00c76c3cbdbffe0fafe2436d Author: Taylor Robie Date: Sat Nov 26 10:33:09 2022 -0800 [Profiler] Memory profiler part 5: Data flow graph (#87006) The semantic meaning of a Tensor is tightly coupled to its lineage. 
The data flow graph allows us to identify temporary Tensors, masks, inputs, activations, and more. However one important nuance is that Tensors must be versioned; operations which mutate their inputs can also change the semantic meaning of said inputs. It is challenging to assemble a complete picture of the data flow in a PyTorch model because ops can, and often do, recursively call into other ops. For the purpose of memory profiling this is an implementation detail, so instead we traverse the op tree to identify top level ops and allocations and then coalesce their children, folding inputs and outputs into the top level Node. Differential Revision: [D40220391](https://our.internmc.facebook.com/intern/diff/D40220391/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87006 Approved by: https://github.com/chaekit commit c3116dd78b294f1bd3f6424dc1bfb7ff86bb0a66 Author: Taylor Robie Date: Sat Nov 26 10:33:08 2022 -0800 [Profiler] Memory profiler part 4: Select top level torch ops (#86880) In a later PR we will walk the children of these nodes and formulate a node from the entire bundle to build a data flow graph. This PR simply defines what a "top level" op is. Differential Revision: [D40220387](https://our.internmc.facebook.com/intern/diff/D40220387/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86880 Approved by: https://github.com/chaekit commit bb77accb4c996e3aab9ae4b665fb8464400c8194 Author: Jiong Gong Date: Sat Nov 26 14:06:44 2022 +0000 [Inductor] Record cpp kernel in PyTorch Profiler (#89367) Add an option `config.cpp.enable_kernel_profile` to record individual cpp kernel time in PyTorch Profiler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89367 Approved by: https://github.com/jansel commit 36018a6ee63f140b95ad644d09920798b0c624f8 Author: Edward Z. Yang Date: Fri Nov 25 13:48:35 2022 -0800 Don't suppress exceptions from backends (#89656) Taken from voz's https://github.com/pytorch/pytorch/pull/89392 Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89656 Approved by: https://github.com/voznesenskym commit 3e20d023b1f442ebe59e76604395cd8d4abed52a Author: Natalia Gimelshein Date: Sat Nov 26 03:08:23 2022 +0000 put descriptive kernel names behind config (#89697) Per title, generated kernel names are often long and confusing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89697 Approved by: https://github.com/Chillee commit 591dfffa38848de54b7f5f4e49260847024c9281 Author: jlukehubbard <58089207+jlukehubbard@users.noreply.github.com> Date: Fri Nov 25 21:31:53 2022 +0000 update docstring for torch.linalg.lstsq (#89383) Previous documentation lacked details about the handling of over- and underdetermined systems, and made incorrect mention of MAGMA. Fixes #85021 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89383 Approved by: https://github.com/lezcano commit c9a0cc86407d7ec20524b0e26305109d0cf2b5c2 Author: Edward Z. Yang Date: Fri Nov 25 03:31:20 2022 +0000 Simplify aot_module_simplified by removing top_args/top_kwargs (#89666) This makes good on Chillee's CR comment at https://github.com/pytorch/functorch/pull/660/files/af30d351cc93dfafb5a94dbcb32983c5ef65fd6a#r843315222 which was never done in the original PR. There is no logic change, just unpack the args/kwargs at the top level and remove the inner function indirection. Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89666 Approved by: https://github.com/voznesenskym commit 6168f22fae66da5703e087bcd10076921ca157e7 Author: Edward Z. Yang Date: Fri Nov 25 03:31:19 2022 +0000 Don't support kwargs at runtime in aot_module_simplified (#89664) The preexisting logic here added in https://github.com/pytorch/functorch/pull/970 was very peculiar: if top_kwargs was non-empty, then the inner compiled function supports kwargs. Naively, this would leave you to expect that there is some sort of correlation between top_kwargs and kwargs. But in fact, they're completely unrelated! top_kwargs is the AOTAutograd configuration knobs (e.g., fw_compiler/bw_compiler), but kwargs is the RUNTIME kwargs that are to be passed to the compiled function. But (1) we don't support this (the function to be compiled only takes a list of tensors) and (2) even if we did support it, conditioning on whether or not you had passed AOTAutograd configuration kwargs to support kwargs at runtime is bonkers. So delete it. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89664 Approved by: https://github.com/voznesenskym commit b04dda4291f1d30b064572e4521e82fa2573af77 Author: Edward Z. Yang Date: Fri Nov 25 03:31:19 2022 +0000 Delay verify correctness wrapping to call site. (#89662) There is only one call site for compiler_fn, so we can safely delay wrapping verify correctness to here. This will help later when we change the backend compiler calling convention to pass fake tensors (but I need to pass real tensors here.) This is adapted from voz's changes at https://github.com/pytorch/pytorch/pull/89392 but with less changes to the substantive logic. I only moved the relevant inner implementation; there are no changes otherwise. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89662 Approved by: https://github.com/voznesenskym commit 61a3fe4b6409965223273c1098f9a77ff071efe1 Author: Natalia Gimelshein Date: Fri Nov 25 19:42:38 2022 +0000 make inductor correctly propagate nans for maximum and minimum (#89612) Partially fixes https://github.com/pytorch/torchdynamo/issues/594 Also, small cleanup for `where` codegen Pull Request resolved: https://github.com/pytorch/pytorch/pull/89612 Approved by: https://github.com/soumith, https://github.com/jansel commit 70c0a3006ee96b3db1f531109fc383f8159e2d2f Author: Ikko Ashimine Date: Fri Nov 25 19:26:18 2022 +0000 Fix typo in segment_reduction_op_gpu.cu (#89647) menber -> member Pull Request resolved: https://github.com/pytorch/pytorch/pull/89647 Approved by: https://github.com/kit1980 commit 2c0bd85c755043d696452ddab354f3ff6775738b Author: kshitij12345 Date: Fri Nov 25 14:53:57 2022 +0000 complex: register c10::complex with py::cast (#89680) Fixes #77134 TODO: * [x] Add test (tested locally with script below) (Are there similar tests in the test-suite?) 
```c++
namespace py = pybind11;

int main() {
  py::scoped_interpreter guard{}; // start the interpreter

  auto casted_cdouble = py::cast(c10::complex<double>(1.0, 2.0));
  assert(
      (c10::complex<double>(1.0, 2.0) ==
       py::cast<c10::complex<double>>(casted_cdouble)));

  auto casted_cfloat = py::cast(c10::complex<float>(1.0, 2.0));
  assert(
      (c10::complex<float>(1.0, 2.0) ==
       py::cast<c10::complex<float>>(casted_cfloat)));

  auto casted_chalf = py::cast(c10::complex<c10::Half>(1.0, 2.0));
  assert(
      (c10::complex<c10::Half>(1.0, 2.0) ==
       py::cast<c10::complex<c10::Half>>(casted_chalf)));
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89680 Approved by: https://github.com/ezyang
commit a97d0508cb5259951bc48300fb914cebdf322bb9 Merge: 849be586e6 abb446af8c Author: Jakub Pietrak Date: Fri Nov 25 15:24:54 2022 +0100 Merge branch 'master' of https://github.com/pytorch/pytorch into pyg-36
commit 849be586e649421ba58182feb9067a4ac65479e3 Merge: 059a238619 75bfbc35ca Author: Jakub Pietrak Date: Fri Nov 25 14:25:40 2022 +0100 Merge branch 'gh/mingfeima/85/head' into pyg-36
commit abb446af8c65a49bbc3767e14605a73d244c176b Author: Alvaro Gaona Date: Fri Nov 25 11:09:28 2022 +0000 Implement old windows in Python (#87082) Relates to #85366 - Bartlett, Blackman, Hamming, Hann. - Except Kaiser which will be in a different PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/87082 Approved by: https://github.com/mruberry, https://github.com/lezcano
commit 059a238619b122f922c569c618919a277420e483 Merge: 26ba2e9751 95ea47ef0c Author: Jakub Pietrak <97102979+JakubPietrakIntel@users.noreply.github.com> Date: Fri Nov 25 10:00:53 2022 +0100 Merge branch 'pytorch:master' into jpietrak/pyg-36
commit 95ea47ef0c1cffe1fe05cc36bdc47c26cc72f13e Author: Jason Ansel Date: Fri Nov 25 04:28:36 2022 +0000 torchdynamo to torch._dynamo in aot_autograd.py (#89385) Test Plan: Run torchbench models Differential Revision: D41429573 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89385 Approved by: https://github.com/soumith, https://github.com/malfet
commit 69043247819042db18ac9526c2d747fa61fe8880 Author: Edward Z. Yang Date: Thu Nov 24 12:00:13 2022 -0800 Remove fake_tensor_propagation (#89646) You always have to run dynamo with fake tensors. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89646 Approved by: https://github.com/soumith
commit 1aa1014b262b75d4269d9a4d8b562c6ee43a0991 Author: Edward Z. Yang Date: Thu Nov 24 12:00:12 2022 -0800 xfail maml test, instead of running it without fake tensor prop (#89645) A previous version of this patch graph breaks when torch.tensor fails, but that causes
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/nn/test_embedding.py -k test_embedding_bag_1D_padding_idx_cpu_float32
```
to start failing. Probably another latent bug that needs investigating. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89645 Approved by: https://github.com/albanD
commit a048913e2530442360c36a48420079ca9ebca149 Author: PyTorch MergeBot Date: Fri Nov 25 03:03:41 2022 +0000 [vision hash update] update the pinned vision hash (#89667) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89667 Approved by: https://github.com/pytorchbot commit 3b3ebcd031b68762938806f541d7247a1521bb11 Author: XiaobingSuper Date: Thu Nov 24 02:33:01 2022 -0500 TorchDynamo: weight prepack for single conv (#89209) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89209 Approved by: https://github.com/jgong5, https://github.com/jansel commit 0c4f3db7bf24e94125c6802718a1105ee548c953 Author: XiaobingSuper Date: Thu Nov 24 02:32:59 2022 -0500 TorchDynamo: weight prepack for mkl linear (#89109) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89109 Approved by: https://github.com/jgong5, https://github.com/jansel commit 07151a6bd62e308b6b32e2e0edfc4d5f0563576e Author: XiaobingSuper Date: Thu Nov 24 02:32:55 2022 -0500 TorchDynamo: weight prepack for onednn convolution external call (#88988) This PR is about enabled weight prepack using the MKLDNN tensor: 1. enable fake tensor mode for MKLDNN tensor input. 2. make convolution fusion kernel support MKLDNN tensor input. 3. do the weight prepack at FX fusion step. For better performance, we always use channels_last for CPU convolution path. because we test that the channels_last path can get a better performance than block input path, and also avoid the activation's layout conversion(plain to block, block to plain), currently, there only need plain to plain format conversion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88988 Approved by: https://github.com/jgong5, https://github.com/jansel commit 0884fdaba0280e3f3ad2abc34c0940587f744886 Author: Edward Z. Yang Date: Thu Nov 24 14:31:00 2022 -0500 Revert "Dont clone unmutated args in triton autotuning (#89519)" (#89652) This reverts commit f18f0c70ab10c400947e71be30794e04dcc22acf. Testing to see if this fixes gmixer_24_224 mixer_b16_224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89652 Approved by: https://github.com/eellison commit 4a16f8cdb26be3561742e86f184e59f65418fe63 Author: Edward Z. Yang Date: Thu Nov 24 09:00:09 2022 -0800 Reenable fake_tensor_propagation on test_cudnn_rnn (#89644) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89644 Approved by: https://github.com/anjali411 commit fc7dcb684aa38da5b1534fc701657ee63af8909c Author: Edward Z. Yang Date: Thu Nov 24 09:00:09 2022 -0800 Run optimizer tests with fake tensors (#89643) This is a slight regression: RAdam and Adagrad don't appear to trace at all under fake tensors. But I think this is a more accurate reflection of the current state of affairs. Along the way fix some problems on the fake tensor path. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89643 Approved by: https://github.com/anjali411 commit 9b13508ef3a4e858fbbbf068b3a825f1632e8daa Author: Edward Z. Yang Date: Thu Nov 24 09:00:08 2022 -0800 Force test_rng_state to run with fake tensor prop (#89641) I'm not really sure what desertfire's intended follow up was on https://github.com/pytorch/pytorch/pull/87490 because when I remove the unsupported() call, dynamo tests pass. But the change here is conservative and I think strictly better than the current situation. The idea is to force fake tensor pop on for the test, and then just observe that we are doing a graph break. Clearly, export doesn't work, so I manually xfail it. Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89641 Approved by: https://github.com/anjali411 commit c6be06d93ab911a3fbb185451c8cf42bcedad0c1 Author: Edward Z. Yang Date: Thu Nov 24 09:00:08 2022 -0800 Easy: These tests work with fake_tensor_propagation on (#89640) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89640 Approved by: https://github.com/anjali411, https://github.com/albanD commit 6fb6eb0a7498839e69302da7bf8c04205c64e0f3 Author: Edward Z. Yang Date: Thu Nov 24 08:11:48 2022 -0800 Support unspecialized integers with dynamic shapes (#89639) Previously, we hackily wrapped unspecialized integers into tensors and treated them as tensor inputs. Sometimes, downstream operations would not be able to deal with the tensor input. Now, we wrap them into SymInt, so more correct overload selection occurs. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89639 Approved by: https://github.com/anjali411 commit 0c96841a20f0ae9380ef26657914276a42c9c9d7 Author: Edward Z. Yang Date: Thu Nov 24 08:11:47 2022 -0800 Cond capture with fake tensors actually works; don't raise in this case (#89638) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89638 Approved by: https://github.com/anjali411 commit d3c012f409a4e4d5a11070a90b5578da82778030 Author: kshitij12345 Date: Thu Nov 24 21:41:20 2022 +0000 [test_nn] split pruning tests from test_nn (#89590) Ref: https://github.com/pytorch/pytorch/issues/63085 Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType). Pull Request resolved: https://github.com/pytorch/pytorch/pull/89590 Approved by: https://github.com/albanD commit 83666f167dcf023d301f16fad82b9afb374ad836 Author: Aleksandar Samardžić Date: Thu Nov 24 14:44:12 2022 +0000 Added vectorized CPU code for uint8_t datatype. (#89284) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89284 Approved by: https://github.com/lezcano, https://github.com/peterbell10 commit 9497552771ca59c68509398ab3094e590a3047c5 Author: Howard Huang Date: Thu Nov 24 19:41:17 2022 +0000 Update SyncBatchNorm _all_gather_base to all_gather_into_tensor (#89521) Summary: Fixes https://github.com/pytorch/pytorch/issues/88568 `_all_gather_base` is deprecated. So replacing its usage with `all_gather_into_tensor` Test Plan: CI Differential Revision: D41479983 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89521 Approved by: https://github.com/wz337 commit 94a88b53ed37854379813abf9641d1637fe2688b Author: Edward Z. Yang Date: Thu Nov 24 08:11:46 2022 -0800 Remove fake_tensors_available (#89637) As we are one repo now, they are always available. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89637 Approved by: https://github.com/anjali411 commit 1c8b0779de76d0c76d34835047106ab37b41790b Author: Emilio Castillo Date: Thu Nov 24 18:25:26 2022 +0000 Fix segfault when swapping custom allocator (#89613) Just screwed it before merging ... Pull Request resolved: https://github.com/pytorch/pytorch/pull/89613 Approved by: https://github.com/albanD commit fd279fe85b8f5a8e74c615436f0b180621b6ef52 Author: Edward Z. Yang Date: Thu Nov 24 09:23:05 2022 -0500 Make pytest work again on test/dynamo (#89631) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89631 Approved by: https://github.com/anjali411
commit c3e85d879cdbd3973754760c6767c75276b1dca8 Author: albanD Date: Thu Nov 24 17:11:42 2022 +0000 Mention discrepency between original impl and our impl of RAdam (#89575) Fixes https://github.com/pytorch/pytorch/issues/88836 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89575 Approved by: https://github.com/mruberry
commit 860bae49e4925868a0221ec4345d08407280bac7 Author: Edward Z. Yang Date: Wed Nov 23 08:04:31 2022 -0800 Suppress guards on as_strided call only. (#89569) See comment in meta_utils.py for the whole story. This doesn't have a substantive impact yet, but will in the next PR on the stack. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89569 Approved by: https://github.com/albanD
commit 1588ea0dbf16f37ce14cfc8764666985c16ccbf9 Author: mfkasim1 Date: Thu Nov 24 11:11:51 2022 +0000 Added log1p for complex in c10 (#89214) One PR towards #89205. The content is mostly from PR #38465, but slightly changed the expression to make it faster. Here are some benchmarking code:
```c++
// main.cc
template <typename T>
inline std::complex<T> log1p_v0(const std::complex<T> &z) {
  // this PR
  T x = z.real();
  T y = z.imag();
  T theta = std::atan2(y, x + T(1));
  T r = x * (x + T(2)) + y * y;
  return {T(0.5) * std::log1p(r), theta};
}

template <typename T>
inline std::complex<T> log1p_v1(const std::complex<T> &z) {
  // PR #38465
  T x = z.real();
  T y = z.imag();
  std::complex<T> p1 = z + T(1);
  T r = std::abs(p1);
  T a = std::arg(p1);
  T rm1 = (x * x + y * y + x * T(2)) / (r + 1);
  return {std::log1p(rm1), a};
}

template <typename T>
inline std::complex<T> log1p_v2(const std::complex<T> &z) {
  // naive, but numerically inaccurate
  return std::log(T(1) + z);
}

int main() {
  int n = 1000000;
  std::complex<double> res(0.0, 0.0);
  std::complex<double> input(0.5, 2.0);

  auto start = std::chrono::system_clock::now();
  for (int i = 0; i < n; i++) {
    res += log1p_v0(input);
  }
  auto end = std::chrono::system_clock::now();
  auto elapsed = end - start;
  std::cout << "time for v0: " << elapsed.count() << '\n';

  start = std::chrono::system_clock::now();
  for (int i = 0; i < n; i++) {
    res += log1p_v1(input);
  }
  end = std::chrono::system_clock::now();
  elapsed = end - start;
  std::cout << "time for v1: " << elapsed.count() << '\n';

  start = std::chrono::system_clock::now();
  for (int i = 0; i < n; i++) {
    res += log1p_v2(input);
  }
  end = std::chrono::system_clock::now();
  elapsed = end - start;
  std::cout << "time for v2: " << elapsed.count() << '\n';

  std::cout << res << '\n';
}
```
Compiling the script with command `g++ main.cc` produces the following results:
```
time for v0: 237812271
time for v1: 414524941
time for v2: 360585994
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89214 Approved by: https://github.com/lezcano
commit 4f5c4c022a8365d06ac401582958bbf0fd3f8337 Author: Jiewen Tan Date: Thu Nov 24 10:57:01 2022 +0000 [LTC] Refine MetricsArena::Reset (#89608) Summary: After counters are reset, getters' behaviors are inconsistent. To improve that, here I 1) move the validation of CounterData into CounterData::IsValid such that it's better encapsulated, 2) divide getters into two groups: a) MetricsArena::GetCounter() and b) MetricsArena::ForEachCounter(), and route MetricsArena::GetCounterNames() and CreateMetricReport() to use b. This is paired with pytorch/xla#4217.
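For reference, a small Python sanity check of the stable complex log1p formula used in the benchmarked `log1p_v0` above (illustrative only; not part of the PR):

```python
import cmath
import math

def log1p_complex(z: complex) -> complex:
    # log1p(z) = 0.5 * log1p(x*(x+2) + y*y) + i * atan2(y, x + 1)
    x, y = z.real, z.imag
    return complex(0.5 * math.log1p(x * (x + 2) + y * y), math.atan2(y, x + 1))

z = 0.5 + 2.0j
# away from cancellation-prone inputs the stable form matches the naive log(1 + z)
assert abs(log1p_complex(z) - cmath.log(1 + z)) < 1e-12
```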
Test Plan: PJRT_DEVICE=CPU python xla/test/test_metrics.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/89608 Approved by: https://github.com/JackCaoG commit a8629a1c18fd13300ce69c1d6042004038885cf0 Author: Jithun Nair Date: Thu Nov 24 10:53:20 2022 +0000 Upgrade nightly wheels to ROCm5.3 (#89101) Dependent on PR https://github.com/pytorch/builder/pull/1193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89101 Approved by: https://github.com/kit1980 commit c0d81aa70ce45a0c2e7ced6c9f42a92d15523188 Author: Ivan Yashchuk Date: Thu Nov 24 09:37:10 2022 +0000 Use fx.replace_pattern for removing empty_like+fill in nvFuser+PrimTorch execution (#89132) I learned about `torch.fx.replace_pattern` and it's a cleaner way of removing unnecessary tensor materialization from the graph coming from tracing C++ code `1 - tensor`. Test: ``` python -m pytest test/test_prims.py -k "test_silu_backward_no_filled_tensor" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89132 Approved by: https://github.com/mruberry, https://github.com/jjsjann123 commit b515c1d96082214e81cc57ce2a1de9164b50206f Author: Hao Guan <10684225+hguandl@users.noreply.github.com> Date: Thu Nov 24 08:14:24 2022 +0000 [QAT] Check the value of numel to avoid segfault (#81547) Fixes #78123 Segmentation fault RuntimeError: numel is out of the bound of input tensor Pull Request resolved: https://github.com/pytorch/pytorch/pull/81547 Approved by: https://github.com/kit1980 commit 22a1b5e243e852e1c423c697e51975d1545d2a1b Author: Vasiliy Kuznetsov Date: Wed Nov 23 13:01:15 2022 -0800 quantization: deprecate observer compute_dtype and replace with is_dynamic (#85431) Summary: This PR deprecates the `compute_dtype` field on observers, and replaces it with the `is_dynamic` field on observers. This is better aligned with the reference model spec. Test plan: ``` python test/test_quantization.py TestQuantizeFx python test/test_quantization.py TestQuantizeFxOps ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85431 Approved by: https://github.com/jerryzh168 commit e4ccec6ecab9b48e804d58f60135f0950fca864f Author: Yanbo Liang Date: Thu Nov 24 05:28:58 2022 +0000 [Dynamo] Fix bug of using customized torch.autograd.Function (#89397) Fixes https://github.com/pytorch/torchdynamo/issues/1899 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89397 Approved by: https://github.com/jansel commit 903ae4570e401e5c4e42dc4a44cae37f805044a4 Author: Michael Lazos Date: Thu Nov 24 04:15:34 2022 +0000 Disable optimizer tracing, enable for tests only (#89500) Disabling optimizer tracing before launch until it can be added to the benchmark suites without increasing compile times Pull Request resolved: https://github.com/pytorch/pytorch/pull/89500 Approved by: https://github.com/anijain2305 commit c79489c8e69f965f3e5af8f3f39df78e7d4732ba Author: albanD Date: Thu Nov 24 03:39:55 2022 +0000 Expose to python the backward AD view_func (#89586) This will be useful for other systems (AOTAutograd) that want to replay autograd views. 
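Referring back to the `fx.replace_pattern` change above, a minimal sketch of how that API rewrites a traced graph (the actual pattern in that PR differs; this toy pattern is an assumption for illustration):

```python
import torch
import torch.fx as fx

def pattern(x):
    # toy stand-in for the materialized "1 - tensor" produced by tracing C++ code
    return torch.empty_like(x).fill_(1) - x

def replacement(x):
    return 1 - x

def f(x):
    return (torch.empty_like(x).fill_(1) - x) * 2

gm = fx.symbolic_trace(f)
fx.replace_pattern(gm, pattern, replacement)  # rewrites matching subgraphs in place
print(gm.code)
```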
FYI @bdhirsh Pull Request resolved: https://github.com/pytorch/pytorch/pull/89586 Approved by: https://github.com/soulitzer commit 4cb6bbbe27162c7b0835879131991d2155329718 Author: Nikita Karetnikov Date: Thu Nov 24 01:02:28 2022 +0100 Symintify `embedding` (#89327) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89327 Approved by: https://github.com/ezyang commit 9c867eae1a7fffb6f893717073150cff04a923a4 Author: Wu, Chunyuan Date: Wed Nov 23 20:10:41 2022 +0000 nnc: fix Store if value is fp32 while buf is bf16 (#86788) Fixes https://github.com/pytorch/pytorch/issues/86533. For the below graph: ```bash [DUMP kernel.cpp:1690] TensorExprKernel graph: [DUMP kernel.cpp:1690] graph(%x.1 : BFloat16(10, strides=[1], requires_grad=0, device=cpu)): [DUMP kernel.cpp:1690] %1 : int = prim::Constant[value=0]() [DUMP kernel.cpp:1690] %2 : BFloat16(10, strides=[1], requires_grad=0, device=cpu) = aten::pow(%x.1, %1) # test/test_tensorexpr.py:1330:29 [DUMP kernel.cpp:1690] %3 : BFloat16(10, strides=[1], requires_grad=0, device=cpu) = aten::sin(%2) # test/test_tensorexpr.py:1330:19 [DUMP kernel.cpp:1690] return (%3) ``` **Loop stmt before the fix:** The store value `0.8414709568023682f` is float while the scalar_type of the store buf `aten_sin` is bf16. ```bash [DEBUG llvm_codegen.cpp:489] After HalfRewriter { [DEBUG llvm_codegen.cpp:489] aten_sin[Ramp(0ll, 1ll, 8)] = Broadcast(0.8414709568023682f, 8); [DEBUG llvm_codegen.cpp:489] for (int64_t i_1_tail_tail = 0ll; i_1_tail_tail < 2ll; i_1_tail_tail++) { [DEBUG llvm_codegen.cpp:489] aten_sin[i_1_tail_tail + 8ll] = 0.8414709568023682f; [DEBUG llvm_codegen.cpp:489] } [DEBUG llvm_codegen.cpp:489] } ``` **Loop stmt after the fix:** ```bash [DEBUG llvm_codegen.cpp:489] After HalfRewriter { [DEBUG llvm_codegen.cpp:489] aten_sin[Ramp(0ll, 1ll, 8)] = bfloat16(Broadcast(0.8414709568023682f, 8)); [DEBUG llvm_codegen.cpp:489] for (int64_t i_1_tail_tail = 0ll; i_1_tail_tail < 2ll; i_1_tail_tail++) { [DEBUG llvm_codegen.cpp:489] aten_sin[i_1_tail_tail + 8ll] = bfloat16(0.8414709568023682f); [DEBUG llvm_codegen.cpp:489] } [DEBUG llvm_codegen.cpp:489] } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/86788 Approved by: https://github.com/EikanWang, https://github.com/kit1980 commit f0e5bc4b9f231b438f76ddd13b2c21b7cb8a09ac Author: Zhijing Li (Accelerator Enablement) Date: Thu Nov 24 02:18:32 2022 +0000 Symintified layer_norm (#89466) Summary: As titled. Test Plan: ``` buck2 run mode/opt scripts/wwei6:test_executorch ``` Differential Revision: D41451390 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89466 Approved by: https://github.com/frank-wei, https://github.com/ezyang commit fdb2dd113d3aec0acb2a473de6be49940ab6a115 Author: Alexander Grund Date: Thu Nov 24 01:52:11 2022 +0000 Install missing VSX headers (POWER) (#85547) E.g. `test_cpp_extensions_aot_ninja` fails as it includes `vec.h` which requires the vec/vsx/* headers and `sleef.h`. The latter is also required for AVX512 builds on non MSVC compilers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85547 Approved by: https://github.com/kit1980 commit e922bd4e523b0a30f6607f6497ac458571e00131 Author: Wei-Sheng Chin Date: Thu Nov 24 01:30:09 2022 +0000 [ONNX] Move two headers from .h to .cc (#86852) As title. Header dependency should be as small as possible. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86852 Approved by: https://github.com/titaiwangms, https://github.com/BowenBao commit 23fe2ff910fd1577281a2210d1184aff705191b8 Author: Shunting Zhang Date: Thu Nov 24 01:28:10 2022 +0000 verify the number of outputs of xla graph (#89536) This PR add tests to verify the behavior of number of outputs returns by an XLA graph. The understanding from this PR will help us fix https://github.com/pytorch/torchdynamo/issues/1908 and enable training for dynamo/torchxla integration eventually. Send this PR separately so Jack could help verify if the behavior is expected and play with it. List some code snippets here since their behavior is not straightforward at a first glance: ``` def forward(self, a, b, c): """ The XLA graph will only return the first 2 items """ return a + b, a + c, b ``` ``` def forward(self, a, b, c): """ Inplace update on b cause it to be returned in XLA graph """ b.zero_() return a + b, a + c, b ``` ``` def forward(self, a, b, c): """ Even if we return b twice, the XLA graph only return b once. """ b.zero_() return a + b, a + c, b, b ``` Here are what observed by the added tests: 1. XLA does not return outputs that are also inputs -- if the tensor is not inplace updated. At first glance people may feel curious why should we consider this kind of 'non-realistic' corner case. But this kind of graphs indeed shows up in AOTAutograd. The main reason is AOTAutograd lift all model parameters/buffers as graph input and may return some of them. Check ***test_direct_return*** 2. if a tensor is inplace updated, XLA will still return it as graph output even if it's also an input. The only difference compared to item 1 is, the inplace updating on the tensor cause it being returned. This happens for BatchNorm2d since the running_mean/variance tensors will be inplace updated during training. 
Check ***test_direct_return_with_inplace_update*** Pull Request resolved: https://github.com/pytorch/pytorch/pull/89536 Approved by: https://github.com/jansel commit 0bde5149819e9854bca1363aa6c9f52f7db2496e Author: Nikita Shulga Date: Thu Nov 24 00:57:17 2022 +0000 Add `c10::` namespace in front of `optional` (#89605) Prep change for moving the codebase to C++17 standard Was part of https://github.com/pytorch/pytorch/pull/85969 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89605 Approved by: https://github.com/weiwangmeta, https://github.com/kit1980 commit e19a7165fd1a9a35fcac42706c20e658776c10ab Author: foram-chandra <96388449+foram-chandra@users.noreply.github.com> Date: Thu Nov 24 00:34:26 2022 +0000 [nn] Remove deprecation warning from nn.functional.{tanh, sigmoid} (#86905) Fixes #65909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86905 Approved by: https://github.com/albanD, https://github.com/kit1980 commit a00bd6f686d7a485f7bea5f971b7e793118842b8 Author: clee2000 <44682903+clee2000@users.noreply.github.com> Date: Wed Nov 23 23:48:32 2022 +0000 Don't run auto request review on forked PRs (#89583) tested on https://github.com/pytorch/pytorch/pull/89581 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89583 Approved by: https://github.com/albanD, https://github.com/malfet commit 0a1a53083e331b3648ad4cb6f750d130e3530731 Author: Nikita Karetnikov Date: Wed Nov 23 20:42:55 2022 +0000 [primTorch] Enable regex error testing for some refs (#87765) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87765 Approved by: https://github.com/mruberry commit 3ad2a032f4924d58c556b80840f6d51aa8a4472b Author: Nikita Shulga Date: Wed Nov 23 23:23:24 2022 +0000 Update default cmake to 3.18 (#89570) Set `cmake.dir` to `/usr/local` in `.circleci/scripts/build_android_gradle.sh ` Prep change for raising compiler standard to C++17: cmake-3.18 is the first one to support CUDA17 language Pull Request resolved: https://github.com/pytorch/pytorch/pull/89570 Approved by: https://github.com/atalman commit 8695f0cced016d43298b43a4baf30315061fdacd Author: Jane Xu Date: Wed Nov 23 23:23:17 2022 +0000 Rectify `native_batch_norm` schema by splitting it into two legit schemas (#88697) Using the same repro from the issue (but with BatchNorm2D) Rectifies native_batch_norm schema by splitting the schema into 2: 1. one will have NON-optional alias-able running_mean and running_var inputs 2. the other will just not have those parameters at all (no_stats variation) **Calling for name suggestions!** I've added tests in test_functionalization.py as well as an entry in common_method_invocations.py for `native_batch_norm_legit` CI should pass. Because of bc/fc reasons, we reroute native_batch_norm to call our new schemas ONLY through the python dispatcher, but in 2 weeks or so, we should make `native_batch_norm_legit` the official batch_norm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88697 Approved by: https://github.com/albanD commit a00efe55c3790789b967facf10c3f426faa98155 Author: Everton Constantino Date: Wed Nov 23 22:46:29 2022 +0000 Fix CheckOutputStreamSetting on JitLoggingTest as it failed if logging wasn't enabled. (#82722) `JIT_LOG` checks if logging was enabled for that particular file and when it isn't it doesn't output anything. Since the test checks for the size of `test_stream` it fails. 
I believe forcing the file to have logging enabled to see if the stream is being correctly set during test makes no sense so this patches just forcibly outputs and checks if it worked. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82722 Approved by: https://github.com/davidberard98 commit b8d3afd88665de5f01f696333d0ff291bd94a57b Author: Huy Do Date: Wed Nov 23 22:39:36 2022 +0000 Skip upload test stats for test reports from rerun disabled tests workflow (#89548) I have found the reason why uploading tests stats fails for rerun disabled workflow, for example https://github.com/pytorch/pytorch/actions/runs/3522896778/jobs/5917765699. The problem is that the pytest XML file is now too big to be processed quickly (x50 bigger). Unlike unittest, `pytest-flakefinder` used by rerun disabled tests for test_ops includes skipped messages multiple times (50 times by default, retrying and skipping). This slows down the upload test stats script too much (O(n)) because it tries to gather all the stats. On the other hand, `check_disabled_tests` doesn't suffer from the same issue because it ignores all these skipped messages. This is a quick fix to skip test reports from rerun disabled tests workflow when trying to upload test stats. I'll try to fix this properly later in the way we use pytest-flakefinder. From what I see, a zipped test report from rerun disabled test is only few MB ([example](https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3521687954/1/artifact/test-reports-test-default-1-2-linux.2xlarge_9636028803.zip)), but will balloon up to a much bigger XML file after extracting from a dozen to a few hundred MB (text). The size of the zipped file is not a big immediate problem [3521687954](https://github.com/pytorch/pytorch/actions/runs/3521687954) is an example workflow with rerun disabled tests and mem leak check. The script can now finish when running locally: * `upload_test_stats` finishes around 3+ minutes ``` time python -m tools.stats.upload_test_stats --workflow-run-id 3521687954 --workflow-run-attempt 1 --head-branch master ... Writing 8925 documents to S3 Done! Writing 1760 documents to S3 Done! Writing 1675249 documents to S3 Done! python3 -m tools.stats.upload_test_stats --workflow-run-id 3521687954 1 185.69s user 12.89s system 75% cpu 4:22.82 total ``` * `check_disabled_tests` finishes within 3 minutes ``` time python -m tools.stats.check_disabled_tests --workflow-run-id 3521687954 --workflow-run-attempt 1 --repo pytorch/pytorch ... python -m tools.stats.check_disabled_tests --workflow-run-id 3521687954 1 154.19s user 4.17s system 97% cpu 2:42.50 total ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89548 Approved by: https://github.com/clee2000 commit f18f0c70ab10c400947e71be30794e04dcc22acf Author: Elias Ellison Date: Wed Nov 23 19:02:51 2022 +0000 Dont clone unmutated args in triton autotuning (#89519) Improves first memory compression on pytorch struct from .55 -> .73. However, it doesn't totally eliminate the overhead from autotuning. Any other pointers on where the overhead is coming from in autotuning would be great. 
Edit: i think it's just the triton cache clearing https://github.com/openai/triton/blob/44f577984d28ee979f704e2c28a1dcbac9639840/python/triton/testing.py#L159 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89519 Approved by: https://github.com/ngimel, https://github.com/jansel commit ac19c5be82febc2140d4601c98daf45646a399ab Author: Peter Bell Date: Tue Nov 22 22:26:21 2022 +0000 FFT: disable dimension wrapping for scalar tensors (#89234) Fixes #88985 By default, `maybe_wrap_dim` allows through `dim=0` or `dim=-1` for scalar tensors which leads to an invalid dimension being used to index into `tensor.sizes()` as in the code sample from the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89234 Approved by: https://github.com/mruberry commit 50e2e4faf38c6ebafacc43b72c40333f1f7b401e Author: Pearu Peterson Date: Wed Nov 23 12:05:37 2022 +0200 Sparse CSC/BSR/BSC serialization and pickle support (#89553) Fixes https://github.com/pytorch/pytorch/issues/89497 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89553 Approved by: https://github.com/cpuhrsch commit a8d6b82167ef417e21c807cb29d7eabea15014da Author: Elias Ellison Date: Wed Nov 23 16:47:43 2022 +0000 Fix norm decomp when dtype is passed in (#89508) Fix for https://github.com/pytorch/torchdynamo/issues/1889. The wrapper was doing a downcast even when the dtype was explicitly passed in. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89508 Approved by: https://github.com/anijain2305 commit 72110d783344c4121730b032ca0d269896604dcf Author: Elias Ellison Date: Wed Nov 23 17:03:09 2022 +0000 Fix Upsample Decomp Striding For Small Channels (#89528) Fix for https://github.com/pytorch/torchdynamo/issues/623. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89528 Approved by: https://github.com/ngimel, https://github.com/anijain2305 commit b7483be06afe8d4242adeb559cfbe6e0e89419d0 Author: Jerry Zhang Date: Wed Nov 23 11:03:45 2022 -0800 [quant][docs] Add docstrings for operators defined in torch.ops.quantized_decomposed namespace (#89547) Summary: no functionality changes Test Plan: NA Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/89547 Approved by: https://github.com/vkuzo commit a188f05e8c1788d393c072868421991dfcb55b02 Author: Natalia Gimelshein Date: Wed Nov 23 20:18:54 2022 +0000 Reland #89031 Added conv constraint that infers layouts (#89530) Relands #89031 Per title. We now set strides from fx graph only for convolutions and mm, which is a hack, but bmm in some cases caused extra copy, and there is no obvious way to fix that, we should rethink the strides anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89530 Approved by: https://github.com/Chillee commit e800d27b10137727c68cb71bccabe3a93cf38e9e Author: William Wen Date: Wed Nov 23 20:11:39 2022 +0000 [dashboard] Add graphs for all summary metrics, add additional testing flags (#89580) Title. Test post: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1325572179 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89580 Approved by: https://github.com/davidberard98 commit 953f39578a7019c4c34bc1dbd6cb0facb554af79 Author: Charlie West-Taylor Date: Wed Nov 23 19:51:50 2022 +0000 Mark IPU device as not supports_as_strided (#89130) Currently causes issues in calls to `.to`. 
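For context on the FFT scalar-tensor change above, a tiny repro of the situation being guarded against (illustrative; the exact error type and message are assumptions):

```python
import torch

t = torch.tensor(1.0)        # 0-d tensor: there is no dimension to transform over
try:
    torch.fft.fft(t, dim=0)  # an explicit dim on a scalar tensor should be rejected
except (IndexError, RuntimeError) as e:
    print(type(e).__name__, e)
```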
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89130 Approved by: https://github.com/albanD commit 37e46a503502cdeda791cf684522ef83b5655328 Author: Yanbo Liang Date: Wed Nov 23 19:44:46 2022 +0000 [Dynamo] Fix several bugs & code refactor in RangeVariable (#89322) Fix bug in [7k github models](https://github.com/pytorch/torchdynamo/issues/1884): https://github.com/jansel/pytorch-jit-paritybench/blob/master/generated/test_clovaai_stargan_v2.py ``` E TypeError: 'list' object cannot be interpreted as an integer E E from user code: E File "/scratch/ybliang/work/repos/pytorch-jit-paritybench/generated/test_clovaai_stargan_v2.py", line 335, in forward E idx = torch.LongTensor(range(y.size(0))) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89322 Approved by: https://github.com/jansel commit 91dcef41ae96ede3f07375c2d38cb28d534e97f8 Author: Xilun Wu <12968408+XilunWu@users.noreply.github.com> Date: Wed Nov 23 19:43:28 2022 +0000 Thread PG: add allreduce to threaded pg (#89043) Summary: Goal Add `all_reduce` collective to multi-threaded ProcessGroup added in D40236769 (https://github.com/pytorch/pytorch/commit/6663ae5537f3c61030ba4d425bd57a097c51430a). Code Motion Added `allreduce` collective to ProcessLocalGroup (a subclass of c10d ProcessGroup). What's Next Add a DDP test utilizing the new allreduce op. Generalize `allreduce` to allow other `ReduceOp`s besides `SUM`. Test Plan: cd fbcode/caffe2 buck2 test mode/dev //caffe2/test/distributed:multi_threaded Differential Revision: D41046606 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89043 Approved by: https://github.com/wanchaol commit 27db806888c36b029f51197a40e5196cc10792db Author: Charlie West-Taylor Date: Wed Nov 23 19:41:07 2022 +0000 Handle Tensor.__deepcopy__ via clone(), on IPU (#89129) Currently it falls through to a call to `storage()`, which the IPU doesn't support. I've made the minimal change here for ease of merging (this'd help us if it was in for 1.13.1), however... **QUESTION**: Is there any reason why `not torch._C._has_storage(self)` needs to *also* be guarded on `self.device.type == privateuseone`? in other words, could the condition for using `clone` not be this? ```python self.is_sparse or self.device.type in ["lazy", "xla", "mps", "ort", "meta", "hpu", "ipu"] or not torch._C._has_storage(self) or (type(self) is not Tensor and self.data_ptr() == 0) ``` If the condition fails, the very next thing is a call to `self._typed_storage()` which will fail, so it feels to me like *any* case without storage shouldn't fall through to the `storage()` call. The original PR for adding the 'no storage and device is `PrivateUse1`' condition ([86557](https://github.com/pytorch/pytorch/pull/86557)) doesn't discuss whether this could be broadened. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89129 Approved by: https://github.com/albanD commit fa7a963f6536dd05c381fbf23270f4f009f9f113 Author: Sergii Dymchenko Date: Wed Nov 23 19:39:47 2022 +0000 Remove BaseException TODO (#89540) After discussion in https://github.com/pytorch/pytorch/pull/88461#issuecomment-1318965664 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89540 Approved by: https://github.com/H-Huang commit 9eed6b7f9aa4f5fc65075de3189acc9add221660 Author: Yanbo Liang Date: Wed Nov 23 19:39:43 2022 +0000 [Dynamo] Several fixes on TensorVariable & TorchVariable (#89486) This is a group of bug fixes for [7k github models](https://github.com/pytorch/torchdynamo/issues/1884), it would fix 30+ model tests. 
* Support ```tensor.type()```. * Support ```tensor.get_device()```. * Support ```torch.nn.functional._Reduction.get_enum```. * Support ```torch._utils._get_device_index()```. * Fallback ```tensor.data_ptr()```. * ```FakeTensor``` always returns 0 * For no fake tensor propagation, we ```clone``` the input tensor, which makes no sense to track the original ```data_ptr```. And I don't think this is a very popular API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89486 Approved by: https://github.com/jansel commit f03e6672fb6a694d6f03980e3f34d8181c7cc663 Author: Iris Date: Wed Nov 23 19:39:01 2022 +0000 [Checkpoint][2D] Minor update for dedup_tensors.py (#89542) Rename variables for better readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89542 Approved by: https://github.com/H-Huang commit 74703eb50299b26082bc2a357770739a68460199 Author: Iris Date: Wed Nov 23 19:36:01 2022 +0000 [Checkpoint] Add a logger to dedup_tensors (#89503) Add a logger to dedup_tensors to log the duplicate keys to remove in global plan (List of SavePlan). Pull Request resolved: https://github.com/pytorch/pytorch/pull/89503 Approved by: https://github.com/fduwjj commit 57353c9608263df98156a73aaa6ed35a2a2306ad Author: Brian Hirsh Date: Wed Nov 23 08:29:08 2022 -0800 first draft of input mutation handling for aot autograd (#88817) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88817 Approved by: https://github.com/ezyang, https://github.com/wconstab commit 902e4e3926a9333178510f032580e4acd56c40da Author: PyTorch MergeBot Date: Wed Nov 23 19:05:13 2022 +0000 Revert "Fix the kineto daemon build condition (#89174)" This reverts commit 9fd00f194ae4e28948a9a03a6382c20dde04e4fd. Reverted https://github.com/pytorch/pytorch/pull/89174 on behalf of https://github.com/robieta due to For some reason this is interacting badly with NVFuser. I think it is instability in kineto, but until we figure out what's going on reverting is a necessary evil. 
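Circling back to the Dynamo `TensorVariable` fixes listed above, a small smoke test of the newly handled Tensor methods (a sketch; the expected values assume a CPU float tensor and the eager backend):

```python
import torch
import torch._dynamo as dynamo

@dynamo.optimize("eager")
def f(x):
    # tensor.get_device() and tensor.type() are among the newly supported calls
    return x.get_device(), x.type()

print(f(torch.randn(2)))  # expected: (-1, 'torch.FloatTensor') on CPU
```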
commit 049a0f2cd5916c8392c6bd1adc41c709de892f3a Author: Bin Bao Date: Wed Nov 23 02:00:44 2022 +0000 [inductor] Update CI model tests (#89499) Summary: 1) Add model inference test 2) Switch model training test to use AMP Pull Request resolved: https://github.com/pytorch/pytorch/pull/89499 Approved by: https://github.com/bertmaher commit 95474e00a9477b1333e13fa95887a2ce05c4a6a6 Author: Jerry Zhang Date: Tue Nov 22 20:29:26 2022 -0800 [quant][be] Remove unused util code (#89272) Summary: att Test Plan: python test/test_quantization.py TestQuantizeFx Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/89272 Approved by: https://github.com/andrewor14 commit 128faf2b69f62b55d3ae1b4cb3e24ec594af0009 Author: Jerry Zhang Date: Tue Nov 22 20:29:26 2022 -0800 [quant][be] Refactor the error checking code for quantize_per_channel op (#89271) Summary: at Test Plan: make sure it compiles Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/89271 Approved by: https://github.com/andrewor14 commit 71c0e84914b74bc30178292e02f67bc47c0bee21 Author: Catherine Lee Date: Wed Nov 23 18:27:37 2022 +0000 Gate leak check and reruns on schedule (#89504) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/89504 Approved by: https://github.com/huydhn commit c9d4390d1328f7d57070106bae035bd77a76452b Author: Emilio Castillo Date: Wed Nov 23 17:54:33 2022 +0000 Add Pluggable CUDA allocator backend (#86786) Fixes #43144 This uses the Backend system added by [82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during the code execution. This will allow us to use RMM, use CUDA managed memory for some portions of the code that do not fit in GPU memory. Write static memory allocators to reduce fragmentation while training models and improve interoperability with external DL compilers/libraries. For example, we could have the following allocator in c++ ```c++ extern "C" { void* my_malloc(ssize_t size, int device, cudaStream_t stream) { void *ptr; std::cout<<"alloc "<< size< Date: Wed Nov 23 17:27:40 2022 +0000 [test_nn] split parametrization test from test_nn (#89552) Ref: https://github.com/pytorch/pytorch/issues/63085 Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType). Pull Request resolved: https://github.com/pytorch/pytorch/pull/89552 Approved by: https://github.com/albanD commit 347a7d97a5e4855e0648fdcc194e28d3019276b6 Author: albanD Date: Wed Nov 23 16:51:42 2022 +0000 Deprecate decorating classes with torch.no_grad and similar (#89522) Fixes https://github.com/pytorch/pytorch/issues/89450 I would have completely removed it but I don't think this is particularly urgent and there are some use of it in the wild: https://github.com/search?q=%2Ftorch%5C.no_grad%5C%28%5C%29%5Cnclass%2F&type=code So we might as well take one release to do it. 
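For context on the class-decoration pattern deprecated by #89522, a minimal sketch of what is being removed and what remains supported (the module and shapes here are made up for illustration):

```python
import torch

# Deprecated by #89522: decorating an entire class. Only the construction of
# the instance ends up running under no_grad, which is rarely what users expect.
#
# @torch.no_grad()
# class Evaluator:
#     ...

# Still supported: decorating functions/methods, or using the context manager.
@torch.no_grad()
def evaluate(model, x):
    return model(x)

model = torch.nn.Linear(4, 2)
x = torch.randn(1, 4)
with torch.no_grad():
    out = model(x)
print(out.requires_grad)  # False
```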
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89522 Approved by: https://github.com/lezcano, https://github.com/soulitzer, https://github.com/janeyx99 commit 2de38a0714da1ddc3625e6e794e1c3ef869c841a Author: Nikita Shulga Date: Wed Nov 23 16:33:13 2022 +0000 Add `torch._dynamo` to docs (#89510) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89510 Approved by: https://github.com/msaroufim commit de0dee30d021b4546709dd7b785daba335f44942 Author: fduwjj Date: Wed Nov 23 05:29:53 2022 +0000 [PT-D][3/N] Sync TP API change to Pytorch (#89535) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89535 Approved by: https://github.com/wanchaol commit 795473ff5e861f21c6b5e0611177fcfa9e1c4e0c Author: Yukio Siraichi Date: Wed Nov 23 15:56:54 2022 +0000 Call `symint::sizes()` instead of `sizes()` on convolution error messages. (#89549) This PR fixes convolution when using `torchdynamo` with dynamic shapes. **Problem:** there are some `tensor.sizes()` calls in a few error messages. As a result, an uninformative error message was being displayed. ```python @torch._dynamo.optimize("eager") def foo(inp, w): return F.conv2d(inp, w) inp = torch.rand((1, 1, 32, 32)) w = torch.rand((1, 2, 3, 3)) foo(inp, w) ``` ----- **Before this PR:** ```python Traceback (most recent call last): File "torch/_dynamo/utils.py", line 1076, in run_node return node.target(*args, **kwargs) File "torch/_subclasses/fake_tensor.py", line 867, in __torch_dispatch__ op_impl_out = op_impl(self, func, *args, **kwargs) File "torch/_subclasses/fake_tensor.py", line 445, in conv conv_backend = torch._C._select_conv_backend(**kwargs) RuntimeError: Cannot call sizes() on tensor with symbolic sizes/strides ``` **After this PR:** ```python Traceback (most recent call last): File "torch/_dynamo/utils.py", line 1076, in run_node return node.target(*args, **kwargs) File "torch/_subclasses/fake_tensor.py", line 867, in __torch_dispatch__ op_impl_out = op_impl(self, func, *args, **kwargs) File "torch/_subclasses/fake_tensor.py", line 445, in conv conv_backend = torch._C._select_conv_backend(**kwargs) RuntimeError: Given groups=1, weight of size [1, s1, s2, s2], expected input[1, 1, s0, s0] to have s1 channels, but got 1 channels instead ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89549 Approved by: https://github.com/ezyang commit 3a858ba8e3b6f398f3b981d258e8309d1c93ba39 Merge: 685d432634 724c74d85a Author: mingfeima Date: Wed Nov 23 21:20:11 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300 **GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes: * COO: the hotspot is `scatter_reduce` * CSR: the hotspot is `spmm_reduce` The reduce type can be choose from: "max", "mean", "max", "min". `spmm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_spmm_reduce` which has dual outputs: * `out` - the actual output * `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min", this is only useful for training. So for inference, it will not be calculated. 
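For readers unfamiliar with the CSR path described above, a minimal sketch of the sparse-times-dense aggregation being optimized, using only public ops (the internal `_spmm_reduce` interface is not called here, the sizes are made up, and exact CSR op coverage may vary by version):

```python
import torch

crow = torch.tensor([0, 2, 3, 4])   # CSR row pointers (EdgeIndex in CSR form)
col  = torch.tensor([1, 2, 0, 1])   # column indices
val  = torch.ones(4)                # edge weights
adj  = torch.sparse_csr_tensor(crow, col, val, size=(3, 3))

feat = torch.randn(3, 8)            # node features
out  = torch.mm(adj, feat)          # "sum" reduction over each node's neighbors
```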
Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch. Performance benefit for training will be bigger, the original backward impl for `sum|mean` is sequential; the original backward impl for `max|min` is not fused. ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9 aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9 aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9 aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9 aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6 aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6 aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9 aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3 ``` ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9 aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9 aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9 aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9 aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6 aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6 aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9 aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4 aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3 ``` [ghstack-poisoned] commit 724c74d85ac47dcbe8975e07bd8d82cb6ec1d3d3 Author: mingfeima Date: Wed Nov 23 21:20:11 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300 **GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes: * COO: the hotspot is `scatter_reduce` * CSR: the hotspot is `spmm_reduce` The reduce type can be choose from: "max", "mean", "max", "min". `spmm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_spmm_reduce` which has dual outputs: * `out` - the actual output * `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min", this is only useful for training. So for inference, it will not be calculated. Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch. Performance benefit for training will be bigger, the original backward impl for `sum|mean` is sequential; the original backward impl for `max|min` is not fused. 
``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9 aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9 aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9 aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9 aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6 aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6 aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9 aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3 ``` ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9 aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9 aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9 aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9 aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6 aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6 aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9 aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4 aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3 ``` [ghstack-poisoned] commit 39772a6a01c1579497f68f916e7abb56aaee1c1e Author: Jerry Zhang Date: Tue Nov 22 20:29:25 2022 -0800 [quant] Add support for quantize_per_channel in the reference flow with decomposed tensor (#89270) Summary: att, after this PR we can produce quantize_per_channel and dequantize_per_channel ops (typically used for quantizing weights) in the reference flow using decomposed tensor Test Plan: python test/test_quantization.py -k test__convert_to_reference_decomposed_fx_per_channel_quant Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/89270 Approved by: https://github.com/vkuzo commit c651944f9226661ad41fa201c61300030c1c2e18 Author: Kshiteej K Date: Wed Nov 23 08:39:45 2022 +0000 [test_nn] split hooks test from test_nn (#89201) Ref: https://github.com/pytorch/pytorch/issues/63085 Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType). Pull Request resolved: https://github.com/pytorch/pytorch/pull/89201 Approved by: https://github.com/albanD commit dd140fc351e322303229ad2a5713b7ee51d35673 Author: Kshiteej K Date: Wed Nov 23 08:30:51 2022 +0000 [test_nn] move init tests from test_nn (#89202) Ref: https://github.com/pytorch/pytorch/issues/63085 Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType). Pull Request resolved: https://github.com/pytorch/pytorch/pull/89202 Approved by: https://github.com/albanD commit 685d432634a7e01aa6f58cff7aeaf3f894b1e4f3 Merge: 2c3d1877fb 89de8ac645 Author: mingfeima Date: Wed Nov 23 16:28:38 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. 
https://github.com/pytorch/pytorch/issues/71300 **GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes: * COO: the hotspot is `scatter_reduce` * CSR: the hotspot is `spmm_reduce` The reduce type can be choose from: "max", "mean", "max", "min". `spmm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_spmm_reduce` which has dual outputs: * `out` - the actual output * `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min", this is only useful for training. So for inference, it will not be calculated. Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch. Performance benefit for training will be bigger, the original backward impl for `sum|mean` is sequential; the original backward impl for `max|min` is not fused. ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9 aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9 aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9 aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9 aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6 aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6 aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9 aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3 ``` ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9 aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9 aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9 aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9 aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6 aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6 aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9 aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4 aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3 ``` [ghstack-poisoned] commit 89de8ac64544dd298ac0a4e648f2e166a5a6f0c0 Merge: c0dbc6488f 7594e043b8 Author: mingfeima Date: Wed Nov 23 16:28:38 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300 **GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes: * COO: the hotspot is `scatter_reduce` * CSR: the hotspot is `spmm_reduce` The reduce type can be choose from: "max", "mean", "max", "min". 
`spmm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_spmm_reduce` which has dual outputs: * `out` - the actual output * `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min", this is only useful for training. So for inference, it will not be calculated. Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch. Performance benefit for training will be bigger, the original backward impl for `sum|mean` is sequential; the original backward impl for `max|min` is not fused. ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9 aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9 aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9 aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9 aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6 aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6 aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9 aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3 ``` ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9 aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9 aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9 aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9 aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6 aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6 aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9 aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4 aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3 ``` [ghstack-poisoned] commit 7594e043b85b6e2c0cf4b2f257ac9606313cec90 Author: Alexander Grund Date: Wed Nov 23 06:50:05 2022 +0000 Fix Use-after-Free in qembeddingbag_byte_prepack_out (#84750) When FBGEMM is not used (either manually disabled or on platforms such as POWER where it isn't supported at all) the fallback code requests a `data_ptr` on a `Tensor` object returned by `to(ScalarType::Float)` in the same line. This object will be destroyed at the end of the line leading to a dangling pointer. On some platforms this manifests in wrong results being returned as the memory gets overwritten. On other platforms anything may happen due to this being undefined behavior, although most likely it will just crash or continue to return semi-random results which may even happen to be correct (when the memory is not reused yet) Fix this by binding the temporary object (or initial object) to a const value reference which extents its lifetime and getting the `data_ptr` from that. 
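The fix itself is on the C++ side (binding the temporary to a const reference), but the same lifetime hazard is easy to illustrate from Python with `data_ptr()` on a temporary; this is a hedged sketch, not the patched code:

```python
import torch

x = torch.arange(4, dtype=torch.int32)

# Hazardous pattern, analogous to the bug described above: the tensor returned
# by .to() is a temporary, so the integer address may point at freed memory
# once that temporary is garbage-collected.
dangling = x.to(torch.float32).data_ptr()

# Safe pattern: keep the converted tensor alive for as long as the pointer is used.
x_float = x.to(torch.float32)
ptr = x_float.data_ptr()
```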
Fixes #84748 This bug was introduced by a seemingly unrelated change in #64081 hence ccing @d1jang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84750 Approved by: https://github.com/kimishpatel commit 07dd2fe6c32948e5ca0a2871e5eb31602a9684cf Author: Nikita Karetnikov Date: Wed Nov 23 00:49:43 2022 +0100 Symintify `select` (#89326) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89326 Approved by: https://github.com/ezyang commit 29742786f38d4873576c73917e8509908132dae2 Author: Jerry Zhang Date: Mon Nov 21 14:19:04 2022 -0800 [quant] Add dequantize_per_channel in quantized_decomposed op library (#89269) Summary: att Test Plan: python test/test_quantization.py -k test_decomposed_dequantize_per_channel Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/89269 Approved by: https://github.com/vkuzo commit 52669534438db3d680def4c70cb03b7e27566d7e Author: Edward Z. Yang Date: Tue Nov 22 07:47:48 2022 -0800 Add crossref debug mode for functionalization, catches stride errors (#89498) The idea is to add a custom handler to Functionalize key in Python dispatcher that runs the functionalized version along side a non functionalized version, and checks that their outputs agree in the end. (Technically, for metadata mutation we should also check the inputs, but for now we're relying on those functions returning self.) I turned this on for test_functionalize.py (new TestCrossRefFunctionalize) and found a bunch of failures that look legit. This probably doesn't interact that nicely if you're also tracing at the same time, probably need more special logic for that (directly, just disabling tracing for when we create the nested fake tensor mode, but IDK if there's a more principled way to organize this.) There are some misc fixups which I can split if people really want. - xfail_inherited_tests moved to test common_utils - Bindings for _dispatch_tls_set_dispatch_key_included, _dispatch_tls_is_dispatch_key_included and _functionalization_reapply_views_tls - Type stubs for _enable_functionalization, _disable_functionalization - all_known_overloads utility to let you iterate over all OpOverloads in all namespaces. Iterator support on all torch._ops objects to let you iterate over their members. - suspend_functionalization lets you temporarily disable functionalization mode in a context - check_metadata_matches for easily comparing outputs of functions and see if they match (TODO: there are a few copies of this logic, consolidate!) - _fmt for easily printing the metadata of a tensor without its data - _uncache_dispatch for removing a particular dispatch key from the cache, so that we force it to regenerate - check_significant_strides new kwarg only_cuda to let you also do stride test even when inputs are not CUDA - Functionalize in torch._C.DispatchKey Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89498 Approved by: https://github.com/malfet commit fe990c8db92abce3d22b24c61958c844bb4834f0 Author: Nikita Shulga Date: Wed Nov 23 03:31:17 2022 +0000 [BE] Add more `ssh` instructions (#89516) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/89516 Approved by: https://github.com/huydhn commit 5b51ca6808191e9f3dcea1d43fa731488cc688bb Author: Alexander Grund Date: Wed Nov 23 03:07:22 2022 +0000 Update CUDA compiler matrix (#86360) Switch GCC/Clang max versions to be exclusive as the `include/crt/host_config.h` checks the major version only for the upper bound. 
This allows to be less restrictive and match the checks in the aforementioned header. Also update the versions using that header in the CUDA SDKs. Follow up to #82860 I noticed this as PyTorch 1.12.1 with CUDA 11.3.1 and GCC 10.3 was failing in the `test_cpp_extensions*` tests. Example for CUDA 11.3.1 from the SDK header: ``` // Error out ... // Error out ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/86360 Approved by: https://github.com/ezyang commit 504570d577366f309bc7fc63fa7909f9d372d722 Author: Sergii Dymchenko Date: Wed Nov 23 02:59:25 2022 +0000 Delete unused variable assignment in _refs/__init__.py (#89538) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89538 Approved by: https://github.com/huydhn commit ed32511974daafa256457784820c42f75d83d300 Author: Edward Z. Yang Date: Tue Nov 22 12:02:59 2022 -0800 Don't use explain() for --explain; instead read it off the counters (#89518) Fixes huggingface problem where example_inputs is not actually the args. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89518 Approved by: https://github.com/albanD commit f5d18574a33d2b9421f724a023676281e2076fce Author: Shen Li Date: Tue Nov 22 22:32:49 2022 +0000 Allow Module forward-pre and forward hooks to take kwargs (#89389) closes #35643 This PR is mostly borrowed from #82042. Thanks @Padarn for implementing the first version and debugging into the errors. Based on the discussion in #82042 this PR adds a with_kwargs argument to register_forward_pre_hook and register_forward_hook methods. When the arg is set to true, the provided hook must accept kwargs args. Under the hook, this PR adds a `_forward_pre_hooks_with_kwargs` and a `_forward_hook_with_kwargs` set to keep track of which hooks accept kwargs. Differential Revision: [D41431111](https://our.internmc.facebook.com/intern/diff/D41431111) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89389 Approved by: https://github.com/soulitzer commit 4935b597ac0c93e023637ed4755db84398ccc41b Author: Thomas <37830237+thomaslin2020@users.noreply.github.com> Date: Wed Nov 23 02:18:03 2022 +0000 Added implementation and tests for MPS Hardswish (#87952) Fixes issue #86807 by adding MPS backend support for aten::hardswish. Registered mps hardswish functions in native_functions.yaml, and added the code implementation to Activations.mm. 
Added functions: - hardswish_mps - hardswish_mps_ - hardswish_backward_mps - hardswish_out_mps Added test in test/test_mps.py and tested code using the command `python3 test/test_mps.py -k test_hardswish` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87952 Approved by: https://github.com/kulinseth, https://github.com/kit1980 commit 1cfd3858ac54fe3883534309081631a0a892ba3f Author: Animesh Jain Date: Wed Nov 23 00:48:00 2022 +0000 [inductor] Use dense masks for indirect indexing (#89524) Fixes https://github.com/pytorch/torchdynamo/issues/1654 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89524 Approved by: https://github.com/jansel commit 26322544b87071dabee2f47242584fd3c8a9fbd7 Author: Will Constable Date: Tue Nov 22 19:24:00 2022 +0000 Add limited FSDP correctness to torchdynamo benchmark (#89469) - Does not do recursive wrapping - Only supports accuracy bench - Mainly useful for sweeping over models for correctness, in part to evaluate whether dynamo support for FSDP is breaking anywhere Pull Request resolved: https://github.com/pytorch/pytorch/pull/89469 Approved by: https://github.com/davidberard98, https://github.com/aazzolini commit 7f4b4d282702265f8e1da337d3027df7a3ba17d9 Author: Nikita Shulga Date: Wed Nov 23 00:07:59 2022 +0000 [Inductor] Limit g++12 installation to Linux (#89472) According to https://anaconda.org/conda-forge/gxx/ its only available on Linux Pull Request resolved: https://github.com/pytorch/pytorch/pull/89472 Approved by: https://github.com/soumith, https://github.com/jgong5 commit b50699f24733e53779112b56eafe39f2cc369521 Author: Will Constable Date: Tue Nov 22 19:24:00 2022 +0000 Fix inductor fallback_random for dropout/rand_like (#89515) - Avoid fx graph rewrite that replaces certain ops with ones using triton random - Keep track of replacement ops using triton random, so it is possible to not disable all replacements when using fallback_random Pull Request resolved: https://github.com/pytorch/pytorch/pull/89515 Approved by: https://github.com/ngimel commit 8bf8e4d71e8fb125a3bbf6cc951e661e453598bb Author: William Wen Date: Tue Nov 22 23:42:09 2022 +0000 [dashboard] Add metric graphs back to dashboard (#89531) Title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89531 Approved by: https://github.com/davidberard98 commit ce856cee7eeea9a6eb5ed30fa512b38b3d8f3edf Author: Kshiteej K Date: Tue Nov 22 22:55:41 2022 +0000 [test_nn] fix missing class attributes for NNTestCase (#89200) Missed setting these class variable 😓 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89200 Approved by: https://github.com/albanD commit 391b593ca262432ccba1939f7448275cfd4f62e6 Author: Jerry Zhang Date: Mon Nov 21 14:19:03 2022 -0800 [quant] Add quantize_per_channel in quantized_decomposed op library (#89268) Summary: att Test Plan: python test/test_quantization.py -k test_decomposed_quantize_per_channel Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/89268 Approved by: https://github.com/vkuzo commit 5bba783d2170ddca8f4dd781d287dedb69de312a Author: Animesh Jain Date: Tue Nov 22 22:25:30 2022 +0000 [dashboard] Remove aot_cudagraphs and nvprims_nvfuser (#89514) Helps speeding up Dashboard runs We will bring these back when the backends are ready to be tested on full model suite. 
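Returning to the kwargs-aware hooks added in #89389 above, a minimal sketch of the described behavior (the module and the `scale` kwarg are made up; the hook signatures follow the commit's description rather than being quoted from the PR):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    def forward(self, x, scale=1.0):
        return x * scale

def pre_hook(module, args, kwargs):
    # With with_kwargs=True the pre-hook also receives (and may rewrite)
    # the keyword arguments of the forward call.
    kwargs["scale"] = 2.0
    return args, kwargs

def post_hook(module, args, kwargs, output):
    return output + 1.0

m = Scale()
m.register_forward_pre_hook(pre_hook, with_kwargs=True)
m.register_forward_hook(post_hook, with_kwargs=True)
print(m(torch.ones(2)))  # tensor([3., 3.])
```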
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89514 Approved by: https://github.com/SherlockNoMad commit ea920a11156cb7a037feb45285c0ce6520b3801c Author: Manuel Candales Date: Tue Nov 22 22:15:54 2022 +0000 [Vulkan][TCC] Add tests for quantize_per_tensor and dequantize (#89496) Summary: Add tests for quantize per tensor and dequantize Test Plan: On Mac ``` cd ~/fbsource buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 ``` On Android ``` cd ~/fbsource buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test adb shell "/data/local/tmp/vulkan_quantized_api_test" ``` Reviewed By: salilsdesai Differential Revision: D41047097 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89496 Approved by: https://github.com/digantdesai commit 74e62a1fefb7100689169dc12fd70095de54079d Author: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Date: Tue Nov 22 22:15:38 2022 +0000 [ROCm] Optimize layer norm backward kernel for ROCm (#87635) We observed that the native PyTorch LayerNormBackwardKernelImplInternal has suboptimal performance for certain input sizes on AMD GPUs especially when `fs` (=`config_m` in our benchmark script) is large and `bs` (=`config_n` in our benchmark script) is small (commonly seen in [the CvT model](https://arxiv.org/abs/2103.15808)) in the benchmark script of [PR #68238](https://github.com/pytorch/pytorch/pull/68238#issue-1051621716) on AMD GPUs. This PR is to replace `GammaBetaBackwardCUDAKernel` with the Apex layernorm backward kernel with some ROCm-specific parameter tuning when `fs` (=`config_m`) is larger than 512 on AMD GPUs. There are a few PRs for LayerNorm kernel: - https://github.com/pytorch/pytorch/pull/26201 - https://github.com/pytorch/pytorch/pull/27634 - https://github.com/pytorch/pytorch/pull/68238 Therefore, we have tested and compared the kernel before and at this PR with the input shapes in the last two PRs along with those commonly used in the CvT model on AMD MI100. 
--- **Current** M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float) -- | -- | -- | -- | -- | -- 50432 | 384 | 0.387256 | 1.372758 | 0.378975 | 1.47892 50176 | 384 | 0.38231 | 1.362416 | 0.378084 | 1.473886 200704 | 192 | 0.997859 | 4.315875 | 0.989306 | 4.560827 802816 | 64 | 3.671828 | 16.68013 | 3.613515 | 16.827946 200 | 256 | 0.066503 | 0.332096 | 0.071422 | 0.325349 1000 | 256 | 0.071848 | 0.333355 | 0.073038 | 0.334753 6000 | 256 | 0.086334 | 0.345139 | 0.086834 | 0.347429 6272 | 256 | 0.088601 | 0.347906 | 0.087855 | 0.351245 200 | 512 | 0.071626 | 0.329726 | 0.073798 | 0.326878 1000 | 512 | 0.073975 | 0.330226 | 0.074166 | 0.332751 6000 | 512 | 0.099617 | 0.362367 | 0.100095 | 0.378313 6272 | 512 | 0.100378 | 0.358066 | 0.099857 | 0.395982 200 | 1024 | 0.072954 | 0.326382 | 0.073899 | 0.333007 1000 | 1024 | 0.0743 | 0.325532 | 0.071126 | 0.330991 6000 | 1024 | 0.127025 | 0.390084 | 0.128692 | 0.471504 6272 | 1024 | 0.130704 | 0.403536 | 0.135244 | 0.487133 200 | 1536 | 0.070331 | 0.339169 | 0.070086 | 0.331015 1000 | 1536 | 0.075085 | 0.330042 | 0.076295 | 0.328778 6000 | 1536 | 0.148889 | 0.44949 | 0.155781 | 0.659987 6272 | 1536 | 0.154939 | 0.478871 | 0.17673 | 0.716025 200 | 2048 | 0.070269 | 0.335585 | 0.072804 | 0.334655 1000 | 2048 | 0.080094 | 0.326991 | 0.080426 | 0.32685 6000 | 2048 | 0.187888 | 0.623023 | 0.245762 | 0.981635 6272 | 2048 | 0.195431 | 0.65244 | 0.262574 | 1.008141 200 | 3072 | 0.068205 | 0.339428 | 0.073068 | 0.344034 1000 | 3072 | 0.087554 | 0.328899 | 0.09218 | 0.346433 6000 | 3072 | 0.240352 | 0.905058 | 0.368135 | 1.280462 6272 | 3072 | 0.26179 | 0.959387 | 0.387782 | 1.476524 128 | 2097152 | 5.905976 | 22.724793 | 10.287974 | 30.242092 256 | 1048576 | 4.561596 | 19.554308 | 10.223171 | 29.42371 512 | 524288 | 4.146751 | 22.7247 | 11.404285 | 39.175902 1024 | 262144 | 5.193135 | 23.403325 | 11.334512 | 38.947192 2048 | 131072 | 4.992907 | 23.377801 | 11.400286 | 40.889191 4096 | 65536 | 5.429488 | 24.275701 | 11.196778 | 41.4751 8192 | 32768 | 5.35758 | 21.360312 | 10.535418 | 42.875646 16384 | 16384 | 5.44947 | 20.852605 | 10.357685 | 34.603408 32768 | 8192 | 4.688925 | 17.379392 | 9.635596 | 31.188271 --------- **At this PR** M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float) -- | -- | -- | -- | -- | -- 50432 | 384 | 0.38797 | 0.93103 | 0.37966 | 1.15283 50176 | 384 | 0.3874 | 0.96417 | 0.38462 | 1.18595 200704 | 192 | 1.00002 | 2.40876 | 0.99224 | 2.55579 802816 | 64 | 3.67348 | 7.98658 | 3.61871 | 7.72404 200 | 256 | 0.07292 | 0.35119 | 0.07195 | 0.32602 1000 | 256 | 0.07354 | 0.33325 | 0.07237 | 0.33742 6000 | 256 | 0.08819 | 0.33283 | 0.08453 | 0.3279 6272 | 256 | 0.0886 | 0.33446 | 0.08774 | 0.33426 200 | 512 | 0.0701 | 0.33505 | 0.07072 | 0.33018 1000 | 512 | 0.07042 | 0.33442 | 0.074 | 0.33206 6000 | 512 | 0.09931 | 0.34956 | 0.09895 | 0.3572 6272 | 512 | 0.10103 | 0.32976 | 0.10041 | 0.36635 200 | 1024 | 0.07144 | 0.33579 | 0.07209 | 0.33216 1000 | 1024 | 0.0736 | 0.32803 | 0.07286 | 0.32936 6000 | 1024 | 0.12584 | 0.38916 | 0.12852 | 0.48273 6272 | 1024 | 0.13053 | 0.38804 | 0.13464 | 0.49545 200 | 1536 | 0.07159 | 0.3396 | 0.07062 | 0.33545 1000 | 1536 | 0.07443 | 0.33239 | 0.07366 | 0.33204 6000 | 1536 | 0.14959 | 0.45043 | 0.15826 | 0.69119 6272 | 1536 | 0.1542 | 0.47644 | 0.18249 | 0.72208 200 | 2048 | 0.07258 | 0.33982 | 0.07412 | 0.33859 1000 | 2048 | 0.0793 | 0.32816 | 0.07864 | 0.32583 6000 | 2048 | 0.18973 | 0.571 | 0.25506 | 0.91796 6272 | 2048 | 0.19719 | 0.64208 | 0.26445 | 0.95055 200 | 3072 | 
0.07092 | 0.33867 | 0.07104 | 0.34695 1000 | 3072 | 0.08727 | 0.33144 | 0.09144 | 0.36633 6000 | 3072 | 0.24683 | 0.87275 | 0.37761 | 1.3289 6272 | 3072 | 0.26437 | 0.91178 | 0.38496 | 1.53694 128 | 2097152 | 6.27936 | 23.69425 | 10.40004 | 30.13699 256 | 1048576 | 4.5404 | 19.47675 | 10.28494 | 29.36936 512 | 524288 | 4.13951 | 18.78771 | 10.09557 | 32.67083 1024 | 262144 | 4.47576 | 18.00411 | 9.56488 | 31.47117 2048 | 131072 | 4.28026 | 16.95619 | 9.40297 | 30.82845 4096 | 65536 | 4.2653 | 16.5018 | 9.03315 | 30.08392 8192 | 32768 | 4.25613 | 16.13583 | 8.9258 | 30.75296 16384 | 16384 | 4.20256 | 16.38207 | 9.52587 | 31.31113 32768 | 8192 | 4.20231 | 16.19452 | 9.31478 | 31.03514 --------- **Performance Improvement (%)**
M | N | fwdbwd, torch.float16 | fwdbwd, torch.float32
-- | -- | -- | --
50432 | 384 | 32.178 | 22.049
50176 | 384 | 29.231 | 19.536
200704 | 192 | 44.188 | 43.962
802816 | 64 | 52.119 | 54.100
200 | 256 | -5.750 | -0.206
1000 | 256 | 0.031 | -0.797
6000 | 256 | 3.566 | 5.621
6272 | 256 | 3.865 | 4.836
200 | 512 | -1.615 | -1.010
1000 | 512 | -1.270 | 0.208
6000 | 512 | 3.534 | 5.581
6272 | 512 | 7.905 | 7.483
200 | 1024 | -2.883 | 0.254
1000 | 1024 | -0.767 | 0.493
6000 | 1024 | 0.237 | -2.381
6272 | 1024 | 3.840 | -1.707
200 | 1536 | -0.127 | -1.340
1000 | 1536 | -0.711 | -0.992
6000 | 1536 | -0.209 | -4.728
6272 | 1536 | 0.508 | -0.846
200 | 2048 | -1.262 | -1.176
1000 | 2048 | -0.358 | 0.312
6000 | 2048 | 8.350 | 6.487
6272 | 2048 | 1.588 | 5.713
200 | 3072 | 0.223 | -0.848
1000 | 3072 | -0.773 | -5.743
6000 | 3072 | 3.570 | -3.783
6272 | 3072 | 4.962 | -4.092
128 | 2097152 | -4.266 | 0.348
256 | 1048576 | 0.397 | 0.185
512 | 524288 | 17.325 | 16.605
1024 | 262144 | 23.070 | 19.195
2048 | 131072 | 27.469 | 24.605
4096 | 65536 | 32.023 | 27.465
8192 | 32768 | 24.459 | 28.274
16384 | 16384 | 21.439 | 9.514
32768 | 8192 | 6.818 | 0.491
---------

**Benchmark script of this PR**

```
from distutils.command.config import config
import torch
from torch.nn import LayerNorm
import timeit

number_runs = 1000 # TODO: Modify this to save time!

def test_forward(layer_norm_cuda, input_cuda):
    layer_norm_cuda(input_cuda); torch.cuda.synchronize()

def test_backward(out_cuda, layer_norm_grad_cuda, create_graph):
    out_cuda.backward(layer_norm_grad_cuda, retain_graph=True, create_graph=create_graph); torch.cuda.synchronize()

def test_fwdbwd(input_cuda, layer_norm_cuda, gO):
    input_cuda.grad = None
    layer_norm_cuda.zero_grad(set_to_none=True)
    out = layer_norm_cuda(input_cuda)
    out.backward(gO)
    torch.cuda.synchronize()

def benchmark(config_m, config_n):
    print("M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float)")
    if len(config_m) != len(config_n):
        print("Please make sure the lengths of config_m and config_m are the same.")
    for i in range(len(config_m)):
        normalized_shape = config_n[i]
        results = [config_m[i], config_n[i]]
        for dtype in (torch.half, torch.float):
            if dtype == torch.half:
                layer_norm_cuda = LayerNorm(normalized_shape).half().cuda()
            else:
                layer_norm_cuda = LayerNorm(normalized_shape).cuda()
            input_cuda = torch.randn(config_m[i], config_n[i], device='cuda', dtype=dtype, requires_grad=True)
            result_fwd = timeit.timeit(lambda: test_forward(layer_norm_cuda, input_cuda), number=number_runs)
            results.append(result_fwd / number_runs * 1000)
            gO = torch.rand_like(input_cuda)
            result_fwdbwd = timeit.timeit(lambda: test_fwdbwd(input_cuda, layer_norm_cuda, gO), number=number_runs)
            results.append(result_fwdbwd / number_runs * 1000)
        print('{:09d}|{:09d}|{:9.5f}|{:9.5f}|{:9.5f}|{:9.5f}'.format(results[0], results[1], results[2], results[3], results[4], results[5]))
    print("Times are in microseconds (us).")

config_m_cvt = [50432, 50176, 200704, 802816]
config_n_cvt = [384, 384, 192, 64]
config_m_68238 = [200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272]
config_n_68238 = [256,256,256,256,512,512,512,512,1024,1024,1024,1024,1536,1536,1536,1536,2048,2048,2048,2048,3072,3072,3072,3072]
config_m_27634 = [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
config_n_27634 = [2097152, 1048576, 524288, 262144, 131072, 65536, 32768, 16384, 8192]
config_m = config_m_cvt + config_m_68238 + config_m_27634
config_n = config_n_cvt + config_n_68238 + config_n_27634
benchmark(config_m, config_n)
```

CC: @jeffdaily Pull Request resolved: https://github.com/pytorch/pytorch/pull/87635 Approved by: https://github.com/jataylo, https://github.com/jeffdaily, https://github.com/ezyang commit 00b7d8ef237f4f0fc3d247e016d504095b415d1f Author: Catherine Lee Date: Tue Nov 22 21:52:50 2022 +0000 Shard windows periodic job more (#89455) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/89455 Approved by: https://github.com/huydhn commit 77d7f2c65945438e0292b270998cea07c0d9d3d8 Author: William Wen Date: Tue Nov 22 21:17:36 2022 +0000 [dashboard] Add commit date & fix date related issues (#89517) Add commit date to build summary of dashboard. Make the date of the run reflective of when the run started, not when the run ended. Use PST (UTC -8) to determine day, rather than GMT (UTC +0).
Test comment: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1324176119 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89517 Approved by: https://github.com/anijain2305 commit 177baf366ad16b868ab19a8776ae0e636f9d1951 Author: Alexander Grund Date: Tue Nov 22 20:29:07 2022 +0000 Fix vectorized trigonometric functions for VSX (#86453) Replace the remaining hand-written code in vec256_float_vsx.h by calls to Sleef functions similar to what was done in #59382 & #82646 after #41541 This fixes wrong results for e.g. `sin(1e20)`. Fixes #85978 To fix #85978 I only needed to do the sin/cos functions to make the test pass but to not encounter the same issue again and again (see the previous PRs and issues) I checked the whole file for similar functions where a Sleef function could be used and changed those too. In the diff I've noticed the faulty whitespace so to make this complete I fixed that too, so it should now be done. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86453 Approved by: https://github.com/malfet commit ac3004757ef64b1ed1ff884a39d2a34cdfb5f772 Author: Alexander Grund Date: Tue Nov 22 20:27:27 2022 +0000 Relax tolerance for test_out_addbmm_cpu_float32 (#86365) The test may fail due to slightly different values caused by different order of matrizes in SGEMM: > Mismatched elements: 1 / 50 (2.0%) > Greatest absolute difference: 1.430511474609375e-05 at index (4, 5) (up to 1e-05 allowed) > Greatest relative difference: 4.65393206065873e-06 at index (4, 5) (up to 1.3e-06 allowed) Observed on POWER (ppc64le) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86365 Approved by: https://github.com/mruberry, https://github.com/kit1980 commit d053d513432bea75ae783529bf9f639f977a47d2 Author: Alexander Grund Date: Tue Nov 22 20:25:38 2022 +0000 (Further) limit world size in test_fsdp_pure_fp16 (#86280) Test still fails when run on 5 A100 GPUs, although it works with 5 V100s. Using 4 GPUs seems to be fine. Followup to #85957 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86280 Approved by: https://github.com/awgu, https://github.com/kit1980 commit c2ce79f06eb4a8cec2f9cfbdf3a1a4021a0a4cfa Author: Li-Huai (Allan) Lin Date: Tue Nov 22 19:33:21 2022 +0000 Fix dev-discuss link in the maintainer docs (#89493) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/89493 Approved by: https://github.com/H-Huang commit ef8b91fec73884f3043da8f541176ab7b4c57364 Author: Fuzzkatt Date: Tue Nov 22 19:05:56 2022 +0000 enable previously failing UCC distributed_test.py tests (#89023) Enables previously failing UCC distributed_test.py tests that are now fixed due to either ProcessGroupUCC barrier blocking fix (https://github.com/pytorch/pytorch/pull/86961) or UCC-side timeout error handling fix: (https://github.com/openucx/ucc/pull/679/files). Bump upstream UCC version to build UCC with timeout error handling fix merged in. 
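As a quick way to see the class of problem fixed by the VSX change in #86453 above, the vectorized result can be compared against libm on the same float32 input; this is a hedged sketch and does not assert a particular reference value:

```python
import math
import torch

x = torch.full((8,), 1e20, dtype=torch.float32)

# Before the fix the hand-written VSX path returned wrong values for large
# arguments; with the Sleef-backed path the two printed values should agree
# closely.
print(torch.sin(x)[0].item())
print(math.sin(float(x[0])))
```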
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89023 Approved by: https://github.com/kwen2501, https://github.com/malfet commit f281f435a8c60cf5781688bee3e4ff258c52344f Author: Animesh Jain Date: Tue Nov 22 18:42:13 2022 +0000 Fix benchmarks - xla tensor test (#89509) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/89509 Approved by: https://github.com/ngimel, https://github.com/shunting314 commit 7c0bb61291d62c449b78ce4930c27cbbd8ffac92 Author: mantaionut Date: Tue Nov 22 18:37:14 2022 +0000 Force numpy prod to use 64 bit integers on Windows in some tests (#88089) This fixes some prod and masked.prod tests on Windows. np.prod uses int32 on Windows so it overflows. On Linux it uses by default int64. Fixes #77305 Fixes #77320 Fixes #77334 Fixes #77335 Fixes #77336 Fixes #77337 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88089 Approved by: https://github.com/mruberry commit f4898daaeeea130a009330674726b5492982af85 Author: PratsBhatt Date: Tue Nov 22 18:00:01 2022 +0000 Add cached conda env file for Buck CI workflow (#89422) Fixes - T137631262 Caching conda dependencies for build workflows. Conda dependencies have been gathered from the workflow https://github.com/pytorch/pytorch/blob/master/.github/workflows/_buck-build-test.yml The pull request updates the action from `conda-incubator/setup-miniconda@v2` to `pytorch/test-infra/.github/actions/setup-miniconda@main` as it supports caching. Test Plan: Running the `ciflow/periodic` which runs the ci builds `buck-build-test` workflow. Expected output is to have all the conda dependencies cached. Screenshot 2022-11-22 at 15 44 20 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89422 Approved by: https://github.com/huydhn commit 9c0bf9387c1e39efda268a1fb300e8f87b7ef0e6 Author: anjali411 Date: Tue Nov 22 13:33:55 2022 +0000 Meta impl for linalg_cholesky and linalg_cholesky_ex (#89430) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89430 Approved by: https://github.com/ezyang commit c4e08387c1542eca67dc6e40661a50006bc879ff Author: Jerry Zhang Date: Mon Nov 21 14:19:02 2022 -0800 [quant][fx] Support producing reference quantized patterns for dynamic quantization (#89248) Summary: split the is_decomposed logic for `_replace_observer_with_quantize_dequantize_node` in a separate function and added support for dynamic quantization in the decomposed version of this function. 
In case of dynamic quantization, we'll produce the following reference quantized pattern in decomposed mode: ``` x -> choose_qparams -> quantize_per_tensor -> dequantize_per_tensor -> linear ``` Test Plan: python test/test_quantization.py -k test__convert_to_reference_decomposed_fx_dynamic_quant Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/89248 Approved by: https://github.com/vkuzo commit 2823fc5e4c73a36ae1859889d34f4cc0d4145ae5 Author: Bin Bao Date: Tue Nov 22 00:30:12 2022 +0000 [inductor] generate nan in the cpp backend (#89289) Summary: Fixes https://github.com/pytorch/torchdynamo/issues/1797 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89289 Approved by: https://github.com/ngimel, https://github.com/jansel, https://github.com/jgong5 commit 5797f74924d1f19cbb10e689a0f8112665fc07d9 Author: Howard Huang Date: Mon Nov 21 11:05:39 2022 -0800 [19/N] Add monitored_barrier custom op with CPU implementation (#89318) Differential Revision: [D41415324](https://our.internmc.facebook.com/intern/diff/D41415324) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89318 Approved by: https://github.com/kwen2501 commit be22b5d39f37aa501d07fa3ff3b9448826d48eca Author: Howard Huang Date: Mon Nov 21 11:05:38 2022 -0800 [18/N] Add allgather_coalesced custom op with CPU/CUDA implementations (#89317) Differential Revision: [D41415321](https://our.internmc.facebook.com/intern/diff/D41415321) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89317 Approved by: https://github.com/kwen2501 commit 2c3d1877fbb10736f142fcb85952890a69ce3047 Merge: 744f52223a c0dbc6488f Author: mingfeima Date: Tue Nov 22 21:41:47 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300 **GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes: * COO: the hotspot is `scatter_reduce` * CSR: the hotspot is `spmm_reduce` The reduce type can be choose from: "max", "mean", "max", "min". `spmm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_spmm_reduce` which has dual outputs: * `out` - the actual output * `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min", this is only useful for training. So for inference, it will not be calculated. Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch. Performance benefit for training will be bigger, the original backward impl for `sum|mean` is sequential; the original backward impl for `max|min` is not fused. 
``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9 aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9 aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9 aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9 aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6 aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6 aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9 aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3 ``` ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9 aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9 aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9 aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9 aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6 aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6 aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9 aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4 aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3 ``` [ghstack-poisoned] commit c0dbc6488f3baa5d413ce36d2e93e7b7db21806a Author: mingfeima Date: Tue Nov 22 21:41:47 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300 **GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes: * COO: the hotspot is `scatter_reduce` * CSR: the hotspot is `spmm_reduce` The reduce type can be choose from: "max", "mean", "max", "min". `spmm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_spmm_reduce` which has dual outputs: * `out` - the actual output * `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min", this is only useful for training. So for inference, it will not be calculated. Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch. Performance benefit for training will be bigger, the original backward impl for `sum|mean` is sequential; the original backward impl for `max|min` is not fused. 
``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9 aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9 aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9 aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9 aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6 aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6 aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9 aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3 ``` ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9 aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9 aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9 aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9 aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6 aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6 aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9 aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4 aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3 ``` [ghstack-poisoned] commit d9cbe7764e1af938d7edc23ffa873703d960df6c Author: Edward Z. Yang Date: Tue Nov 22 05:02:45 2022 -0800 Make aten.copy preserve strides (hf_Longformer) (#89464) Fixes https://github.com/pytorch/torchdynamo/issues/1888 Signed-off-by: Edward Z. Yang Differential Revision: [D41460986](https://our.internmc.facebook.com/intern/diff/D41460986) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89464 Approved by: https://github.com/bdhirsh commit 2d94fd3b198a31f28df10b7d9b3fcd526a82f24a Author: Manuel Candales Date: Tue Nov 22 11:05:58 2022 +0000 [Vulkan][TCC] Fix quantized shaders (#89456) Summary: Fix rounding issue in quantized shaders Test Plan: On Mac ``` cd ~/fbsource buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 ``` On Android ``` cd ~/fbsource buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test adb shell "/data/local/tmp/vulkan_quantized_api_test" ``` Reviewed By: salilsdesai Differential Revision: D41047095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89456 Approved by: https://github.com/kirklandsign, https://github.com/digantdesai commit 0f7dca17332152fdd28270eb95398efbe8212ca2 Author: Aleksandar Samardžić Date: Mon Nov 21 04:22:00 2022 +0000 Vectorized CPU code implementing right shift operator. 
(#88990) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88990 Approved by: https://github.com/lezcano, https://github.com/peterbell10 commit 1d6a188d08829b1aee28eb1e6255d5bf43a77f16 Author: lezcano Date: Sat Nov 19 01:00:03 2022 +0000 Reland Dispatch torch.norm to linalg.vector_norm and linalg.matrix_norm (#81761) (#84624) Reland https://github.com/pytorch/pytorch/pull/81761 Differential Revision: [D39332292](https://our.internmc.facebook.com/intern/diff/D39332292) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84624 Approved by: https://github.com/kit1980 commit 6b085d5cadffb10591c450623f93a21dd3dd786d Author: Iris Date: Tue Nov 22 07:49:06 2022 +0000 [Checkpoint][2D][2/N] Add traverse for distributed checkpoint to core distributed (#89398) This PR moves traverse and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint. This is used when flatten nested dict and flatten sharded tensors. Docstring and comments will be added in the following PRs. Test: ``` python3 test/distributed/_tensor/parallel/test_2d_parallel.py ``` and CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/89398 Approved by: https://github.com/wanchaol commit 7b0650d5cf4897089f32c011504d2b2d185cc60a Author: Mike Iovine Date: Tue Nov 22 06:26:10 2022 +0000 Back out "[static-runtime] change the backend for permute_copy" (#89463) Summary: This permute copy change seems to be causing huge regressions on machines without AVX512. Revert to mitigate. This shouldn't be problematic since the improvement from changing it was super small anyways. Differential Revision: D41450088 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89463 Approved by: https://github.com/hlu1 commit f2cf1b0f5e98094cf7a97439ebdf3679ceee04b0 Author: Nikita Shulga Date: Tue Nov 22 05:48:43 2022 +0000 Revert submodule updates introduced by #89157 (#89449) Reverts updates that were introduced by https://github.com/pytorch/pytorch/pull/89157 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89449 Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/clee2000 commit 40cf214f2d18b3b8af5354ddc5dad8156ea32520 Author: Wang, Eikan Date: Mon Nov 21 03:31:51 2022 +0000 Support masked_fill to address the GPT2 performance issue (#89274) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89274 Approved by: https://github.com/jgong5, https://github.com/jansel commit e545caa50f3cd893ca0419543e57af08a7de85b5 Author: Shunting Zhang Date: Tue Nov 22 03:57:01 2022 +0000 dynamo/torchxla integration: trace on xla rather than eager (#88904) In #87741 we added the inference support for dynamo/torchxla integration. Later on in #88449 we attempt to add the training support. That attempt is not smooth because - we try 2 things together 1. let dynamo trace the model on xla rather than eager 2. enable training - It turns out neither of these two tasks are trivial enough. Furthermore, item 2 (enable training) depends on item 1 (tracing on xla). We enable training via AOTAutograd. AOTAutograd lift all model parameters/buffers as graph inputs. Without item 1 being done, we would need copy all graph inputs (including model parameters/buffers) from eager device to xla devices. That hurts performance a lot. Have a cache to map eager parameter to XLA parameter does not solve the problem since the update on either will not sync automatically to the other. They will easily go out of sync. 
This PR lets dynamo trace the model on XLA rather than eager. This is a preparation step to enabling training. Also, tracing on XLA makes the data movement more efficient. We see a 1.5x geomean speedup compared to the previous 1.38x.
```
+-------------------------+--------------------+-------------------------+
| Model                   | XLA (trace once)   | XLA (trace everytime)   |
+=========================+====================+=========================+
| resnet18                | 1.38               | 1.008                   |
+-------------------------+--------------------+-------------------------+
| resnet50                | 1.227              | 0.998                   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         | 1.544              | 1.008                   |
+-------------------------+--------------------+-------------------------+
| alexnet                 | 1.085              | 1.045                   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            | 2.028              | 1.013                   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              | 1.516              | 0.995                   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           | 0.868              | 1.01                    |
+-------------------------+--------------------+-------------------------+
| vgg16                   | 1.099              | 1.008                   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            | 3.26               | 1.027                   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer | 2.182              | 1.015                   |
+-------------------------+--------------------+-------------------------+
| geomean                 | 1.50389            | 1.01261                 |
+-------------------------+--------------------+-------------------------+
```
Example command
```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --only resnet18 --backend=torchxla_trace_once
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88904 Approved by: https://github.com/wconstab, https://github.com/JackCaoG, https://github.com/jansel commit 1dae59ba168fe3c4c11c102f935101c3e4f3b105 Author: Iris Date: Tue Nov 22 03:52:32 2022 +0000 [Checkpoint][2D][1/N] Add dedup_tensors for distributed checkpoint to core distributed (#89399) This PR moves dedup_tensors and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint. This removes duplicated shards in a list of SavePlan. It is used when saving DT with replicated placement. Docstring and comments will be added in the following PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89399 Approved by: https://github.com/wanchaol commit ce342ed2d3a4a0dd8151abe80bfe0bb06a7b0ae9 Author: Huy Do Date: Tue Nov 22 03:39:15 2022 +0000 Fix retrying logic for successful unittest tests under --rerun-disabled-tests mode (#89454) When looking into Rockset data for a disabled unittest, for example `testAdd`, I see that it's re-run only 3 times instead of 50+ times as expected under rerun-disabled-tests mode
```
[
  {
    "name": "testAdd",
    "classname": "TestLazyReuseIr",
    "filename": "lazy/test_reuse_ir.py",
    "flaky": false,
    "num_green": 3,
    "num_red": 0
  }
]
```
It turns out that I made a mistake mixing `RERUN_DISABLED_TESTS` and `report_only` into `(RERUN_DISABLED_TESTS or report_only) and num_retries_left < MAX_NUM_RETRIES` in https://github.com/pytorch/pytorch/pull/88646. The retrying logic for successful tests under rerun-disabled-tests mode is never executed because num_retries_left would be equal to MAX_NUM_RETRIES (not smaller) if the very first run succeeds.
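To make the failure mode concrete, here is a hedged walk-through of the quoted condition (variable names mirror the text above; the snippet is illustrative, not the test-runner code):
```python
# On the very first successful run, num_retries_left still equals MAX_NUM_RETRIES,
# so the combined check below is False and no rerun is ever scheduled.
RERUN_DISABLED_TESTS, report_only = True, False
MAX_NUM_RETRIES = 50
num_retries_left = MAX_NUM_RETRIES

retry = (RERUN_DISABLED_TESTS or report_only) and num_retries_left < MAX_NUM_RETRIES
print(retry)  # False
```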
Thus, the sample test `testAdd` finishes right away (1 success count) * `report_only` and `RERUN_DISABLED_TESTS` are 2 different things and shouldn't be mixed together. RERUN_DISABLED_TESTS has the higher priority. * We also don't want to retry skipped tests under rerun-disabled-tests mode because they are only skipped due to `check_if_enable` check `Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run` * CI https://github.com/pytorch/pytorch/actions/runs/3518228784 generates https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3518228784/1/artifact/test-reports-test-default-4-4-linux.4xlarge.nvidia.gpu_9627285587.zip in which `testAdd` is correctly called multiple times and `TestLazyReuseIr` is skipped correctly * Locally ``` $ python test/run_test.py --verbose -i lazy/test_reuse_ir Ignoring disabled issues: [] Selected tests: lazy/test_reuse_ir Prioritized test from test file changes. reordering tests for PR: prioritized: [] the rest: ['lazy/test_reuse_ir'] Downloading https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/slow-tests.json to /Users/huydo/Storage/mine/pytorch/test/.pytorch-slow-tests.json Downloading https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/disabled-tests-condensed.json to /Users/huydo/Storage/mine/pytorch/test/.pytorch-disabled-tests.json parallel (file granularity) tests: lazy/test_reuse_ir serial (file granularity) tests: Ignoring disabled issues: [] Ignoring disabled issues: [] Running lazy/test_reuse_ir ... [2022-11-21 13:21:07.165877] Executing ['/Users/huydo/miniconda3/envs/py3.9/bin/python', '-bb', 'lazy/test_reuse_ir.py', '-v', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2022-11-21 13:21:07.166279] Expand the folded group to see the log file of lazy/test_reuse_ir Running tests... ---------------------------------------------------------------------- Test results will be stored in test-reports/python-unittest/lazy.test_reuse_ir testAdd (__main__.TestLazyReuseIr) ... ok (1.215s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 50 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 49 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 48 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 47 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 46 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 45 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 44 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 43 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 42 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 41 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 40 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 39 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 38 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 37 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 36 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... 
testAdd succeeded - num_retries_left: 35 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 34 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 33 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 32 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 31 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 30 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 29 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 28 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 27 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 26 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 25 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 24 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 23 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 22 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 21 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 20 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 19 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 18 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 17 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 16 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 15 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 14 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 13 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 12 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 11 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 10 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 9 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 8 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 7 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 6 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 5 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 4 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 3 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 2 ok (0.001s) testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 1 ok (0.001s) testAddSub (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 0 skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s) testAddSubFallback (__main__.TestLazyReuseIr) ... 
skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s) testBatchNorm (__main__.TestLazyReuseIr) ... skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s) ---------------------------------------------------------------------- Ran 54 tests in 1.264s OK (skipped=3) ``` Here is the sample rockset query ``` WITH added_row_number AS ( SELECT *, ROW_NUMBER() OVER(PARTITION BY name, classname, filename ORDER BY _event_time DESC) AS row_number FROM commons.rerun_disabled_tests ) SELECT name, classname, filename, flaky, num_green, num_red FROM added_row_number WHERE row_number = 1 AND name = 'testAdd' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89454 Approved by: https://github.com/clee2000 commit 338f61904421bef1b46c9d614470b523c0696654 Author: PyTorch MergeBot Date: Tue Nov 22 03:38:53 2022 +0000 [vision hash update] update the pinned vision hash (#89471) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89471 Approved by: https://github.com/pytorchbot commit 00b9473ad68da319a1dc3f655cc1a97490ae9669 Author: fduwjj Date: Tue Nov 22 03:05:50 2022 +0000 [PT-D][Tensor Parallelism][2/N] Sync TP API change to PT prod (#89467) This is part of TP Beta Release efforts. ref: https://github.com/pytorch/tau/issues/576 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89467 Approved by: https://github.com/wanchaol commit 82713a1cc4589f084ecbcb591d1f9b12570cac43 Author: Animesh Jain Date: Tue Nov 22 02:23:21 2022 +0000 [inductor][compilation time] Fallback when kernel size for avg/max pool is large (#89448) This fixes compilation time for yolov3 from 400 seconds to 48 seconds. yolov3 has a 13x13 max_pool2d kernel, which was creating really large Triton code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89448 Approved by: https://github.com/ngimel commit 496c8ae760bf646d7a45aad0c2e0320a67b66fd2 Author: maxren Date: Mon Nov 21 10:58:05 2022 -0800 [xnnpack][lite-int] Handle Constant Data (#89445) Handling constant data for xnnpack delegation. This allows us to handle new modules like such: ``` class Module(torch.nn.Module): def __init__(self): super().__init__() self._constant = torch.ones(4, 4, 4) def forward(self, x): return x + self._constant ``` this is the precursor work to handling convolution, as we need to serialize constant data(weights) Differential Revision: [D41050349](https://our.internmc.facebook.com/intern/diff/D41050349/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89445 Approved by: https://github.com/digantdesai commit 120d200620159597f416f9142f1d5708182ca047 Author: Animesh Jain Date: Tue Nov 22 02:20:45 2022 +0000 Revert "Added conv constraint that infers layouts (#89031)" (#89451) This reverts commit 716f70f19a4b63268da2a753afdbe9b385a831ab. Fixes performance regression and compilation latency increase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89451 Approved by: https://github.com/soumith, https://github.com/jansel commit 06dffb3319a38bf909939f64320e0fde88679b94 Author: Edward Z. 
Yang Date: Mon Nov 21 17:54:25 2022 -0500 dont clone symints, dont clobber symint proxies (#88230) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88230 Approved by: https://github.com/albanD commit 58a74f34f981de2c24b8f57c37687421c87a782a Author: Howard Huang Date: Mon Nov 21 11:05:38 2022 -0800 [17/N] Add _reduce_scatter_base custom op with CPU/CUDA implementation (#88903) Differential Revision: [D41415325](https://our.internmc.facebook.com/intern/diff/D41415325) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88903 Approved by: https://github.com/kwen2501 commit 7174572b1ef4cff545e4ca8fc77c135e58fcbefb Author: Will Constable Date: Mon Nov 21 21:37:32 2022 +0000 Add torchvis support to dist bench (#89324) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89324 Approved by: https://github.com/davidberard98, https://github.com/albanD commit 57ed94804e8195f227c7a75899a319cc0a3b833a Author: Edward Z. Yang Date: Mon Nov 21 16:04:46 2022 -0500 Bind DispatchKey.Functionalonalize in pybind11 (#89452) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89452 Approved by: https://github.com/albanD, https://github.com/bdhirsh commit b189a7444da8b17c535e7d04c9ab705289ec53e1 Author: Khushi Date: Tue Nov 22 00:15:30 2022 +0000 [fix] tril & tril : out of bound check (#89384) Fixes #83326 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89384 Approved by: https://github.com/ngimel commit dbc354b262f7f5e49aa781785cfce6299fdc2aa8 Author: Huy Do Date: Tue Nov 22 00:13:38 2022 +0000 Mitigate flaky test_ops_fwd_gradients on macOS (#89410) This has been flaky on macOS for a while ([hud](https://hud.pytorch.org/failure/RuntimeError%3A%20test_ops_fwd_gradients%20failed)) and I can reproduce this locally. The issue was raised by https://github.com/pytorch/pytorch/issues/66033 and it seems to point to macos itself https://github.com/graphia-app/graphia/issues/33. So switching to single thread when running `test_ops_fwd_gradients` on macOS as a mitigation for the flaky tests. `pytest test_ops_fwd_gradients.py -k test_fn_fwgrad_bwgrad -vv --flake-finder` to run all `test_fn_fwgrad_bwgrad` tests 50 times to make sure they all pass (no flaky anymore) https://hud.pytorch.org/tests shows that `test_ops_fwd_gradients` on macOS takes about 15m to finish or 8 minute if using 2 shards like in the test. There is no obvious difference in the test duration: ``` 2022-11-21T21:34:18.6078080Z Running test_ops_fwd_gradients ... [2022-11-21 21:34:18.600663] 2022-11-21T21:34:21.6805770Z Executing ['/Users/runner/work/_temp/conda_environment_3517515737/bin/python', '-bb', 'test_ops_fwd_gradients.py', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2', '--shard-id=0', '--num-shards=2', '-k=not _linalg_cholesky_', '--import-slow-tests', '--import-disabled-tests'] ... [2022-11-21 21:34:21.680156] 2022-11-21T21:34:21.6806380Z Ignoring disabled issues: [] 2022-11-21T21:34:21.6815250Z Executing ['/Users/runner/work/_temp/conda_environment_3517515737/bin/python', '-bb', 'test_ops_fwd_gradients.py', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2', '--shard-id=1', '--num-shards=2', '-k=not _linalg_cholesky_', '--import-slow-tests', '--import-disabled-tests'] ... [2022-11-21 21:34:21.681174] 2022-11-21T21:34:21.6815830Z Ignoring disabled issues: [] ..... 2022-11-21T21:40:42.2422700Z =============================== warnings summary =============================== ..... 
2022-11-21T21:40:42.2424670Z - generated xml file: /Users/runner/work/pytorch/pytorch/test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-47b619449ea7db1f.xml - 2022-11-21T21:40:42.2424850Z = 831 passed, 596 skipped, 5 deselected, 17 xfailed, 1 warning in 374.54s (0:06:14) = ..... 2022-11-21T21:42:00.1923310Z =============================== warnings summary =============================== ..... 2022-11-21T21:42:00.1925370Z - generated xml file: /Users/runner/work/pytorch/pytorch/test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-d24ee6419a602a6e.xml - 2022-11-21T21:42:00.1925540Z = 828 passed, 603 skipped, 7 deselected, 20 xfailed, 1 warning in 452.94s (0:07:32) = .... 2022-11-21T21:42:09.9035670Z FINISHED PRINTING LOG FILE of test_ops_fwd_gradients (/Users/runner/work/pytorch/pytorch/test/test-reports/test_ops_fwd_gradients_ha_3rfhb) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89410 Approved by: https://github.com/soulitzer commit ea50549ce62aeeccfe27035a0a975e83b9c2c987 Author: Edward Z. Yang Date: Mon Nov 21 18:12:21 2022 -0500 Suppress guards when creating fake tensors (#89349) When we create fake tensors, we may call operators that introduce guards, to accurately reconstruct views. But these guards are spurious: if a user is able to present a tensor that "looks the same", they have implicitly fulfilled the contract that the view is creatable. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89349 Approved by: https://github.com/voznesenskym commit fa4980cd5e7581b5195ed4d02d63bf73497549d0 Author: William Wen Date: Mon Nov 21 22:56:13 2022 +0000 Add commit hash to dynamo dashboard (#89462) Title - also fix a small bug with dashboard outputs. Sample: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1322732698 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89462 Approved by: https://github.com/anijain2305 commit 186192bb26a71ec9b0131a6c49fdf19e76d208d7 Author: Yanbo Liang Date: Mon Nov 21 22:43:58 2022 +0000 [Dynamo] Fix bugs when calling tensor.data and tensor.layout (#89257) Fix bugs in [7k github models](https://github.com/pytorch/torchdynamo/issues/1884). * Legacy code still use ```tensor.data```, I think we can use ```tensor.detach``` to rewrite, not sure if there is anything I didn't anticipate. * Support ```tensor.layout```. The root cause of these issues are: dynamo wraps unimplemented ```tensor.x``` call into ```GetAttrVariable(TensorVariable, x)```, but this op was not inserted into FX graph. Hence, during the fake tensor propagation, it throws ```KeyError: 'example_value` ```. For these two popular attributes, Dynamo should support them anyway. However, if dynamo should support ___all___ ```tensor.x``` call and not fallback to ```GetAttrVariable```, I think it's debatable. If I turn off fake tensor propagation, it works well even not including this fix. So I'm curious if we should improve the fake propagation to cover similar cases. 
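A small, hedged illustration of the two attribute accesses this commit is about (the `.data` -> `.detach()` rewrite is the suggestion in the text, not a claim that every legacy call site should change):
```python
import torch

x = torch.randn(2, 3, requires_grad=True)
legacy = x.data         # legacy alias that dynamo previously wrapped as GetAttrVariable
preferred = x.detach()  # same storage, detached from autograd; the suggested rewrite
assert x.layout == torch.strided  # .layout is the other attribute now supported
```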
cc @mlazos @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire @jansel @eellison ``` Traceback (most recent call last): File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/convert_frame.py", line 404, in _compile out_code = transform_code_object(code, transform) File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/bytecode_transformation.py", line 341, in transform_code_object transformations(instructions, code_options) File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/convert_frame.py", line 392, in transform tracer.run() File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 1523, in run super().run() File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 389, in run and self.step() File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 359, in step getattr(self, inst.opname)(inst) File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 193, in wrapper return inner_fn(self, inst) File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 865, in CALL_FUNCTION_KW self.call_function(fn, args, kwargs) File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 301, in call_function self.push(fn.call_function(self, args, kwargs)) File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/variables/torch.py", line 407, in call_function tensor_variable = wrap_fx_proxy( File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/variables/builder.py", line 636, in wrap_fx_proxy return wrap_fx_proxy_cls( File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/variables/builder.py", line 676, in wrap_fx_proxy_cls example_value = get_fake_value(proxy.node, tx) File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1024, in get_fake_value args, kwargs = torch.fx.node.map_arg((node.args, node.kwargs), visit) File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 613, in map_arg return map_aggregate(a, lambda x: fn(x) if isinstance(x, Node) else x) File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 621, in map_aggregate t = tuple(map_aggregate(elem, fn) for elem in a) File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 621, in t = tuple(map_aggregate(elem, fn) for elem in a) File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 627, in map_aggregate return immutable_dict((k, map_aggregate(v, fn)) for k, v in a.items()) File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 627, in return immutable_dict((k, map_aggregate(v, fn)) for k, v in a.items()) File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 631, in map_aggregate return fn(a) File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 613, in return map_aggregate(a, lambda x: fn(x) if isinstance(x, Node) else x) File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1022, in visit return n.meta["example_value"] KeyError: 'example_value\n\nfrom user code:\n File "./generated/test_BayesWatch_pytorch_prunes.py", line 108, in forward\n return torch.zeros([x.size()[0], self.channels, x.size()[2] // self.spatial, x.size()[3] // self.spatial], dtype=x.dtype, layout=x.layout, device=x.device)\n\nSet torch._dynamo.config.verbose=True for more information\n\n\nYou can suppress this exception and fall back to eager by setting:\n 
torch._dynamo.config.suppress_errors = True\n' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89257 Approved by: https://github.com/jansel commit 821ba6b51beb1844f264fd57e1eccecb446e4870 Author: Wanchao Liang Date: Mon Nov 21 19:19:29 2022 +0000 [4/n] Thread PG: add reduce_scatter to threaded pg (#89442) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89442 Approved by: https://github.com/yhcharles, https://github.com/fduwjj commit 3e99d4db7671430901bb6292073f368ce1443e05 Author: Wanchao Liang Date: Mon Nov 21 19:19:29 2022 +0000 [3/n] Thread PG: add scatter to threaded pg (#89441) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89441 Approved by: https://github.com/XilunWu, https://github.com/yhcharles, https://github.com/fduwjj commit 3876f94c3d0eb329686d0699da2bab00849099b6 Author: Wanchao Liang Date: Mon Nov 21 19:19:29 2022 +0000 [2/n] Thread PG: add test for broadcast (#89440) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89440 Approved by: https://github.com/XilunWu, https://github.com/yhcharles, https://github.com/fduwjj commit deae450899eb048754f046999a18fbda8c9b2d68 Author: Wanchao Liang Date: Mon Nov 21 19:19:29 2022 +0000 [1/n] Thread PG: add test for allgather (#89439) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89439 Approved by: https://github.com/XilunWu, https://github.com/yhcharles, https://github.com/fduwjj commit 047e542a1a71448083d812783380b855e023eb14 Author: Mengwei Liu Date: Mon Nov 21 21:08:13 2022 +0000 [tools] expose selective build library (#89351) Change the base module and visibility of `tools:gen_oplist_lib` so that it can be reused. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89351 Approved by: https://github.com/cccclai commit c068fa900f1352240a04123a74d4d1f83b295222 Author: Peter Bell Date: Sun Nov 20 23:36:41 2022 +0000 [inductor] Misc division lowering fixes (#88603) 1. `aten.div.Tensor_mode` should allow broadcasting 2. `div` can use `ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT` 3. `prims.div` on integers should be truncating division 4. Add lowering for `true_divide` which is aliased to `div` 5. register lowering for inplace version of `div_mode` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88603 Approved by: https://github.com/ngimel commit 1267dcf2971b181d7379928f3452ce622add91e9 Author: Peter Bell Date: Sun Nov 20 23:19:24 2022 +0000 [inductor] Fix nan handling for aten.sign (#88937) ATen gives `sign(nan) == 0` but inductor's cuda codegen would give `sign(nan) == 1`. 
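For illustration (one way to obtain the described semantics, not necessarily the codegen change in the PR): a comparison-based sign yields 0 for NaN because NaN compares false in both directions, matching the ATen behavior quoted above.
```python
import torch

def comparison_sign(x: torch.Tensor) -> torch.Tensor:
    # (nan > 0) and (nan < 0) are both False, so NaN maps to 0 - 0 = 0.
    return (x > 0).to(x.dtype) - (x < 0).to(x.dtype)

print(comparison_sign(torch.tensor([float("nan"), -2.0, 3.0])))  # tensor([ 0., -1.,  1.])
```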
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88937 Approved by: https://github.com/ngimel commit 3d247a8bcd6f07ffae8c7144ac08ba9fdeeb2025 Author: Keval Morabia Date: Mon Nov 21 20:40:04 2022 +0000 Fix unconvertible_ops as per #89261 (#89299) Fixes #89261 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89299 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit 1d9e1fca97a2a01ea75b0938e38feee1d5288ebd Author: Driss Guessous Date: Mon Nov 21 20:02:09 2022 +0000 Update sdp dispatch logic to enable fused backward (#89154) Reorganizes how the sdp dispatch logic is down in order to enable backwards for fused kernels Pull Request resolved: https://github.com/pytorch/pytorch/pull/89154 Approved by: https://github.com/cpuhrsch commit cf9476554fce9a9c909eebd7439f4b3f4d208f6c Author: Taylor Robie Date: Mon Nov 21 09:23:16 2022 -0800 update kineto pinned commit (#89435) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89435 Approved by: https://github.com/malfet commit e4d9dbd7d236e86fac0055feb7dd8f64516d375e Author: Xu Zhao Date: Mon Nov 21 17:25:28 2022 +0000 Port torchdynamo's torchbench script to userbenchmark (#89239) Summary: This Diff ports the torchbench.py script from torchdynamo to torchbench to support the development of internal models. Currently, only works with the `--only` option, and can only test one model at a time. Note that the noisy logs are from upstream model code, not the benchmark code. In the internal environment, `torch._dynamo.config.base_dir` is not writable, so we add an option to specify the output directory. Test Plan: ``` $ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --performance --only ads_dhen_5x --part over --output-directory /tmp/tb-test/ cuda eval ads_dhen_5x 1/ 1 +0 frames 2s 1 graphs 1 graph calls 412/ 411 = 100% ops 100% time ``` ``` $ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --performance --only cmf_10x --part over --output-directory /tmp/tb-test/ cuda eval cmf_10x 1/ 1 +0 frames 1s 1 graphs 1 graph calls 306/ 305 = 100% ops 100% time ``` Reviewed By: jansel Differential Revision: D41294311 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89239 Approved by: https://github.com/jansel commit 9d209e78348ee5c3e1ead700d240fb476b3bc4de Author: PyTorch MergeBot Date: Mon Nov 21 16:48:26 2022 +0000 Revert "[ao] making _is_activation_post_process private (#87520)" This reverts commit 45c62a337756ff9db97cd64d2d42d9e65dda0a85. Reverted https://github.com/pytorch/pytorch/pull/87520 on behalf of https://github.com/bigfootjon due to Diff reverted internally commit f3db03612f9c6fb8717e1e13a9295da3c9c05193 Author: PyTorch MergeBot Date: Mon Nov 21 16:38:20 2022 +0000 Revert "[ao] maintain BC for is_activation_post_process (#89260)" This reverts commit c5fafb4e1694f141d8a1a31142cce4049d9057ed. Reverted https://github.com/pytorch/pytorch/pull/89260 on behalf of https://github.com/DanilBaibak due to breaking internal builds commit 6796979ee1063890fd04bbf21f298f669129df8f Author: Jiong Gong Date: Mon Nov 21 14:20:33 2022 +0000 [Inductor] Limit the number of compile threads to the available cpu cores (#89377) `config.compile_threads` gets the number of compile threads via `min(32,os.cpu_count())` while `os.cpu_count()` is the total number of cpu cores in the system, not the available ones. 
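The distinction can be seen directly from the standard library (illustrative; actual values depend on how the process was launched, e.g. under `taskset` or `numactl`):
```python
import os

print(os.cpu_count())                     # all cores in the system
if hasattr(os, "sched_getaffinity"):      # not available on every platform
    print(len(os.sched_getaffinity(0)))   # cores this process is allowed to use
```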
This would cause compile thread contention when the available cpu cores are less than `min(32,os.cpu_count())`, e.g., available cpu cores are limited with numactl or taskset, making the compilation very slow. This PR tries to use `len(os.sched_getaffinity(0))` if `os.sched_getaffinity` is available which returns the available number of cpu cores. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89377 Approved by: https://github.com/soumith commit c2cf0bde1f4e9bed642648f299db0f6d5ecb5996 Author: lezcano Date: Mon Nov 21 10:54:32 2022 +0000 Move the OpInfo same-storage error to the autograd test (#88306) This check was previously located at the `non_contiguous` test (quite and odd location). Even more, at https://github.com/pytorch/pytorch/pull/86378#discussion_r993658395, Kshiteej found that this assert was not doing anything really. We move it to the autograd test and make it a proper `self.assert`. We also disallow returning 1-tuples from sample_input functions, as they were breaking this assert. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88306 Approved by: https://github.com/mruberry commit a80e5e78137fb8adea6e7d638be483f866fe26e8 Author: yanbing-j Date: Mon Nov 21 09:52:34 2022 +0000 Update ideep for future performance improvement (#87966) **Summary** The update includes API changes and optimzations to reduce framework overhead, which will benefit all mkldnn (onednn) ops in JIT mode and inductor CPU backend, etc. These benefits will be seen after switching to new ideep API by future PRs. **Test plan** For correctness, all UTs that call mkldnn ops, including test_ops.py, test_mkldnn*.py, test_quantization.py, etc. For performance, TorchBench has been run and no regression is found. Results are shown below. - Intel (R) Xeon (R) IceLake with 40 cores - Use multi-instance - Using tcmalloc & Intel OMP ![image](https://user-images.githubusercontent.com/12522207/201631004-bb77468d-953b-4757-a001-94d44615b5f6.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87966 Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper commit 31708a731076b7feed3051b81d309a9babb4efc0 Author: XiaobingSuper Date: Sun Nov 20 20:46:03 2022 -0500 TorchDynamo: enable conv+silu fusion (#89278) This PR will improve the tf_efficientnet_b0 performance by fusing conv+silu. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89278 Approved by: https://github.com/jgong5, https://github.com/jansel commit bc716383a6a3063b35cedfe8d163c61a4ff8f301 Author: Wang, Eikan Date: Mon Nov 21 03:31:50 2022 +0000 Redefine the simdlen semantic (#89263) This PR is targeting to automatically enable vectorization optimization for TorchInductor. It refined the semantics of `config.cpp.simdlen`. Originally, `None` means to disable vectorization while a specific value means the number of elements to be vectorized once time. But it depends on the data. Regarding 256bit SVE/SIMD ISA for ARM and X86, the `simdlen` should be 16 for Float while 32 for BFloat. Hence, this PR defined the `simdlen` as the bit width. The detailed semantics are as follows. - **_simdlen = None_**: Automatically determine the SIMD bit width. Detect HW information and pick the proper vectorization ISA. Specific for X86, the priority of AVX512 is higher than AVX2. - **_simdlen <=1_**: Explicitly disable SIMD - **_simdlen > 1_**: Explicitly specify the SIMD bit width. It equals the disabled semantic if the bit width does not match the ISA width. 
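A hedged sketch of the bit-width semantics described above (the function and the supported widths are illustrative, not inductor internals):
```python
def pick_vector_bit_width(simdlen, supported=(512, 256)):
    if simdlen is None:
        return max(supported)   # auto: prefer the widest ISA, e.g. AVX512 over AVX2
    if simdlen <= 1:
        return None             # explicitly disable vectorization
    return simdlen if simdlen in supported else None  # mismatched width degrades to disabled

# The lane count then follows from the element size, e.g. 512 // 32 == 16 fp32 lanes.
```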
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89263 Approved by: https://github.com/jgong5, https://github.com/jansel commit 79770d3636626b2130e58d5acdf1d6a56953329d Author: XiaobingSuper Date: Sun Nov 20 20:46:02 2022 -0500 TorchDynamo: enable conv+relu6 fusion (#89265) This PR enables conv+relu6 fusion, which improves MobileNet's performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89265 Approved by: https://github.com/jgong5, https://github.com/jansel commit 744f52223a41a6de972728286c2db196a45e3df6 Merge: 4fec37d580 73a6cdb92f Author: mingfeima Date: Mon Nov 21 15:33:31 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300 **GAS** is the major step for Message Passing; the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex`, which records the connections of nodes: * COO: the hotspot is `scatter_reduce` * CSR: the hotspot is `spmm_reduce` The reduce type can be chosen from: "sum", "mean", "max", "min". `spmm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_spmm_reduce` which has dual outputs: * `out` - the actual output * `arg_out` - records the output indices of the non-zero elements if the reduce type is "max" or "min"; this is only useful for training, so for inference it will not be calculated. Benchmark on GCN for ogbn-products on a single Xeon socket: the workload is improved by `4.3x` with this patch. The performance benefit for training will be bigger: the original backward impl for `sum|mean` is sequential, and the original backward impl for `max|min` is not fused.
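A readable reference for the "sum" case of the kernel described above (a hedged Python equivalent over a CSR tensor; the actual operator is a fused C++ implementation):
```python
import torch

def spmm_sum_reference(csr: torch.Tensor, dense: torch.Tensor) -> torch.Tensor:
    crow, col, val = csr.crow_indices(), csr.col_indices(), csr.values()
    out = dense.new_zeros(csr.shape[0], dense.shape[1])
    for row in range(csr.shape[0]):
        for nz in range(crow[row], crow[row + 1]):
            out[row] += val[nz] * dense[col[nz]]   # gather -> apply -> scatter(sum)
    return out

a = torch.tensor([[0., 2., 0.], [1., 0., 3.]]).to_sparse_csr()
b = torch.randn(3, 4)
torch.testing.assert_close(spmm_sum_reference(a, b), a.to_dense() @ b)
```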
``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9 aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9 aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9 aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9 aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6 aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6 aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9 aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3 ``` ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9 aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9 aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9 aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9 aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6 aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6 aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9 aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4 aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3 ``` [ghstack-poisoned] commit 73a6cdb92f2ab305eec4ea400d2ad956b6a52365 Author: mingfeima Date: Mon Nov 21 15:33:31 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300 **GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes: * COO: the hotspot is `scatter_reduce` * CSR: the hotspot is `spmm_reduce` The reduce type can be choose from: "max", "mean", "max", "min". `spmm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_spmm_reduce` which has dual outputs: * `out` - the actual output * `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min", this is only useful for training. So for inference, it will not be calculated. Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch. Performance benefit for training will be bigger, the original backward impl for `sum|mean` is sequential; the original backward impl for `max|min` is not fused. 
``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9 aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9 aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9 aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9 aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6 aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6 aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9 aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3 ``` ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9 aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9 aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9 aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9 aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6 aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6 aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9 aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4 aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3 ``` [ghstack-poisoned] commit 4fec37d5800163dd8bff11210cbf4424700aeb7c Merge: 2f59c69ac7 27e02c0176 Author: mingfeima Date: Mon Nov 21 14:16:39 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300 **GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes: * COO: the hotspot is `scatter_reduce` * CSR: the hotspot is `spmm_reduce` The reduce type can be choose from: "max", "mean", "max", "min". `spmm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_spmm_reduce` which has dual outputs: * `out` - the actual output * `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min", this is only useful for training. So for inference, it will not be calculated. Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch. Performance benefit for training will be bigger, the original backward impl for `sum|mean` is sequential; the original backward impl for `max|min` is not fused. 
``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9 aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9 aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9 aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9 aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6 aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6 aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9 aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3 ``` ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9 aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9 aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9 aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9 aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6 aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6 aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9 aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4 aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3 ``` [ghstack-poisoned] commit 27e02c0176a5de350e629e858d19abff62649c2d Merge: f93ba52d25 e0251de42f Author: mingfeima Date: Mon Nov 21 14:16:39 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300 **GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes: * COO: the hotspot is `scatter_reduce` * CSR: the hotspot is `spmm_reduce` The reduce type can be choose from: "max", "mean", "max", "min". `spmm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_spmm_reduce` which has dual outputs: * `out` - the actual output * `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min", this is only useful for training. So for inference, it will not be calculated. Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch. Performance benefit for training will be bigger, the original backward impl for `sum|mean` is sequential; the original backward impl for `max|min` is not fused. 
``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9 aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9 aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9 aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9 aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6 aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6 aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9 aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3 ``` ``` ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9 aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9 aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9 aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9 aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6 aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6 aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9 aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4 aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3 ``` [ghstack-poisoned] commit e0251de42f56c8de0bd9b2783bfa2ae67e4813c5 Author: Shen Li Date: Sun Nov 20 22:54:45 2022 +0000 [Easy] Use prepend arg to register forward hooks in quantize.py (#89391) Differential Revision: [D41431110](https://our.internmc.facebook.com/intern/diff/D41431110) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89391 Approved by: https://github.com/awgu commit 1db5ce095fb0e721c92304bceca7798456929e73 Author: PyTorch MergeBot Date: Mon Nov 21 03:08:31 2022 +0000 [vision hash update] update the pinned vision hash (#89287) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89287 Approved by: https://github.com/pytorchbot commit 51e961dd7bb9abaf999e6028208b2778a57c32b2 Author: Natalia Gimelshein Date: Mon Nov 21 00:58:03 2022 +0000 use std/libdevice erf in inductor (#89388) By itself, libdevice version of erf has the same perf as our decomposition, but in real workloads it leads to better fusion groups (due to fewer ops in the fused kernel). Bonus: a few fp64 test skips removed, because our decomposition wasn't accurate enough for fp64, but libdevice version is. 
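To illustrate the accuracy point (using the Abramowitz & Stegun 7.1.26 polynomial as a stand-in; inductor's actual decomposition may differ):
```python
import torch

def erf_poly(x):
    # Max absolute error of this approximation is ~1.5e-7: fine at fp32
    # tolerances, but clearly above what fp64 tests expect.
    a1, a2, a3, a4, a5, p = 0.254829592, -0.284496736, 1.421413741, -1.453152027, 1.061405429, 0.3275911
    s, x = torch.sign(x), x.abs()
    t = 1.0 / (1.0 + p * x)
    poly = ((((a5 * t + a4) * t + a3) * t + a2) * t + a1) * t
    return s * (1.0 - poly * torch.exp(-x * x))

x = torch.linspace(-3, 3, 101, dtype=torch.float64)
print((erf_poly(x) - torch.erf(x)).abs().max())  # on the order of 1e-7
```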
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89388 Approved by: https://github.com/jansel commit 1856fa5df7fda9950da26eff2ef885e845bf6b6c Author: Huy Do Date: Sun Nov 20 23:36:47 2022 +0000 Temporary increase ASAN shard 5 to 4xlarge (#89387) ASAN shard 5 also see OOM now https://hud.pytorch.org/pytorch/pytorch/commit/7b0d577c226fae78f377b26feab4122c4203ad59, may be we should increase all 5 of them to 4xlarge until https://github.com/pytorch/pytorch/issues/88309 is resolved Pull Request resolved: https://github.com/pytorch/pytorch/pull/89387 Approved by: https://github.com/kit1980 commit e1d58b1928a9427f05e3f44ab9b8119000bced09 Author: PyTorch MergeBot Date: Sun Nov 20 22:14:38 2022 +0000 Revert "Update sdp dispatch logic to enable fused backward (#89154)" This reverts commit 2e72ec79823111e8dd8c5e82c5d1b56197cd52d3. Reverted https://github.com/pytorch/pytorch/pull/89154 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but the new test_sdp_math_gradcheck test breaks periodic slow gradcheck, i.e. https://hud.pytorch.org/pytorch/pytorch/commit/419ef2cdcfe84442de5232739284c6a51a18632f commit c09929659ce8ba2f1b7b2f6e50084ccbf854d44b Author: Edward Z. Yang Date: Sun Nov 20 09:13:30 2022 -0500 Also include MKL_THREAD_LIB in link libraries for caffe2::mkl (#89378) Actually fixes https://github.com/pytorch/audio/issues/2784 for real; in my previous testing I didn't check if I could import torchaudio; now torchaudio successfully imports. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89378 Approved by: https://github.com/soumith commit 7b0d577c226fae78f377b26feab4122c4203ad59 Author: Edward Z. Yang Date: Sat Nov 19 22:31:24 2022 -0500 Set INTERFACE_LINK_DIRECTORIES on caffe2::mkl (#89359) This ensures that subsequent link commands involving mkl libraries know where to find the libraries if they are in a non-standard location (which is the case if you installed mkl via conda, which is what our standard instructions recommend.) This is kind of a hack, because the MKL libraries are not actually guaranteed to be in $MKL_ROOT/lib (they are for the conda install though). The real fix is to properly use the MKL targets from FindMKL.cmake but thats its own can of fish. See https://github.com/pytorch/pytorch/issues/73008 This fixes https://github.com/pytorch/audio/issues/2784 Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89359 Approved by: https://github.com/soumith commit dbeacf11820e336e803bb719b7aaaf2125ae4d9c Author: Edward Z. Yang Date: Sat Nov 19 19:44:18 2022 -0500 Fix cat striding in PrimTorch (#89332) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89332 Approved by: https://github.com/ngimel commit caf3d5319f15e47363fe36856326f5e4ab3303e1 Author: Sherlock Huang Date: Sat Nov 19 23:10:34 2022 +0000 Symintify numel(), infer_size, prims.elementwise_meta (#88956) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88956 Approved by: https://github.com/ezyang commit 7c811efab70a3546f997e37178c93d1de24e0444 Author: Edward Z. Yang Date: Sat Nov 19 12:52:39 2022 -0500 Add support for dynamic kwarg to torch._dynamo.optimize (#89290) This is an easier way to enable dynamic shapes for a region. Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89290 Approved by: https://github.com/soumith, https://github.com/jansel, https://github.com/voznesenskym commit 8ad39536d741d9fc8c5d33f1344d23bd56f1c050 Author: PyTorch MergeBot Date: Sat Nov 19 21:47:55 2022 +0000 Revert "Symintify numel(), infer_size, prims.elementwise_meta (#88956)" This reverts commit ce2f8700bafcf44850402a39188ec121ba8b5486. Reverted https://github.com/pytorch/pytorch/pull/88956 on behalf of https://github.com/ezyang due to somehow breaks torch.numel commit 8ac58bc2e3449760bef7f36f600a40c96d5bc5ba Author: kvathupo Date: Sat Nov 19 21:40:07 2022 +0000 Add nullptr_t overload to c10::intrusive_ptr (#89196) __What?__ Fixes #82413 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89196 Approved by: https://github.com/ezyang commit 5582001bd5e5c66dcab8859ecb84cbaa42524fd4 Author: Edward Z. Yang Date: Sat Nov 19 12:51:53 2022 -0500 Reland 2 "Towards unifying symbolic and non symbolic fake tensor (#89038) (#89143)" (#89346) This reverts commit 8e4c9828f4c990f439179912159086aaed790493. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89346 Approved by: https://github.com/wconstab commit 6afe341276f9ffa660446c5fa15b68558791869a Author: fduwjj Date: Sat Nov 19 18:01:25 2022 +0000 [PT-D][1/N] Sync TP Beta change to prod (#89242) This is part of TP Beta Release efforts. ref: https://github.com/pytorch/tau/issues/576 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89242 Approved by: https://github.com/wanchaol commit 6b8c1b19b513ec3d82d588961f8a2b4a86e08f99 Author: Michael Voznesensky Date: Sat Nov 19 17:49:39 2022 +0000 RM expectedFailure UnspecReproTests.test_batch_norm_act_unspec (#89340) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89340 Approved by: https://github.com/bertmaher commit 6daf60be5abe4184121bc41e69e336015a268d6a Author: AllenTiTaiWang Date: Sat Nov 19 02:56:14 2022 +0000 [ONNX] Add setType from user into InferredType and Reliable in ConstantValueMap (#88622) `setType` API is not respected in current exporter because the graph-level shape type inference simply overrides every NOT ONNX Op shape we had from node-level shape type inference. To address this issue, this PR (1) makes custom Op with `setType` **reliable** in ConstantValueMap to secure its shape/type information in pass: _C._jit_pass_onnx. (2) If an invalid Op with shape/type in pass: _C._jit_pass_onnx_graph_shape_type_inference(graph-level), we recognize it as reliable. 1. In #62856, The refactor in onnx.cpp made regression on custom Op, as that was the step we should update custom Op shape/type information into ConstantValueMap for remaining Ops. 2. Add another condition besides IsValidONNXNode for custom Op setType in shape_type_inference.cpp. If all the node output has shape (not all dynamic), we say it's custom set type. 3. ~However, this PR won't solve the [issue](https://github.com/pytorch/pytorch/issues/87738#issuecomment-1292831219) that in the node-level shape type inference, exporter invokes the warning in terms of the unknow custom Op, since we process its symbolic_fn after this warning, but it would have shape/type if setType is used correctly. And that will be left for another issue to solve. #84661~ Add `no_type_warning` in UpdateReliable() and it only warns if non ONNX node with no given type appears. Fixes #81693 Fixes #87738 NOTE: not confident of this not breaking anything. Please share your thoughts if there is a robust test on your mind. 
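A hypothetical user-side symbolic (the domain, op name, and shape propagation here are illustrative assumptions) showing the `setType` call that this change makes the exporter respect instead of overwriting during graph-level shape inference:
```python
import torch
from torch.onnx import register_custom_op_symbolic

def custom_gelu_symbolic(g, x):
    out = g.op("my_domain::CustomGelu", x)
    # Tell ONNX shape inference the output's type/shape explicitly; previously
    # this user-provided information was discarded for non-ONNX ops.
    out.setType(x.type())
    return out

# Assumes a matching my_domain::custom_gelu custom op is registered with ATen.
register_custom_op_symbolic("my_domain::custom_gelu", custom_gelu_symbolic, opset_version=9)
```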
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88622 Approved by: https://github.com/BowenBao commit 940959ebbfa54204b3cd45f918c5ee65b5efc3d0 Author: Jerry Zhang Date: Fri Nov 18 22:46:47 2022 -0800 [quant][fix] Add quant_min/quant_max for default dynamic quantization observer (#89267) Summary: This is needed for choose qparams, but previously it is not configurable, and in the reference quantization flow with decomposed Tensor, we are making this explicit Test Plan: tested in future PR Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/89267 Approved by: https://github.com/vkuzo commit 808bdbab89e875abbbe9652bde675b4402eed532 Author: Michael Voznesensky Date: Sat Nov 19 07:16:29 2022 +0000 Fix try/except flow where DataDependentOutputException is getting wrapped in a RuntimeError (#89314) Repro fixed ``` def fn(a): return a.repeat_interleave(14, dim=0).repeat_interleave(14, dim=1) x = torch.ones(14, 14).to(dtype=torch.int64) opt_fn = torch._dynamo.optimize("eager")(fn) opt_fn(x) ``` Fixes [#1886](https://github.com/pytorch/torchdynamo/issues/1886) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89314 Approved by: https://github.com/anijain2305, https://github.com/eellison commit 419ef2cdcfe84442de5232739284c6a51a18632f Author: Horace He Date: Fri Nov 18 21:39:11 2022 +0000 Added utility to count memory reads/written in Inductor (#89203) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89203 Approved by: https://github.com/jansel, https://github.com/ngimel commit 7a2930b357a4e62bb0bab53bb0d23c607b6ede38 Author: kshitij12345 Date: Sat Nov 19 04:09:29 2022 +0000 add jvp test with non-contig inputs (#89131) Ref: https://github.com/pytorch/functorch/issues/1029 We update `test_jvp` to do contiguous and non-contiguous testing in a single test. Prev time for `test_jvp` : ~28s New time for `test_jvp`: ~45s Pull Request resolved: https://github.com/pytorch/pytorch/pull/89131 Approved by: https://github.com/zou3519 commit 631baecbcd821124296cfe40e5c297cf2def410c Author: Michael Voznesensky Date: Sat Nov 19 03:35:07 2022 +0000 Add --explain flag to bench (#89316) TORCHDYNAMO_DYNAMIC_SHAPES=1 AOT_DYNAMIC_SHAPES=1 time python benchmarks/dynamo/torchbench.py --accuracy --explain --backend aot_eager --train --only BERT_pytorch Dynamo produced 76 graphs with 75 graph break and 198 ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/89316 Approved by: https://github.com/ezyang commit e6996ea172b01fa6c6586379ccb26746c32e2b2c Author: Yuxin Wu Date: Sat Nov 19 02:24:18 2022 +0000 Don't redefine __STDC_FORMAT_MACROS (#89310) Similar to https://github.com/pytorch/pytorch/pull/39608 and https://github.com/pytorch/pytorch/pull/6676 This causes a compile error in our internal build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89310 Approved by: https://github.com/kit1980 commit 8c0515dbff04f03cae584e10100715e05f7cb32e Author: Nikolay Korovaiko Date: Sat Nov 19 02:18:03 2022 +0000 cast C++ py-bound SymNode to SymInt correctly (#89295) Unfortunately, it's a bit hard to test purely on the Pytorch core side, but it passes the XLA tests which are currently disabled. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89295 Approved by: https://github.com/ezyang commit 2e72ec79823111e8dd8c5e82c5d1b56197cd52d3 Author: Driss Guessous Date: Sat Nov 19 02:06:24 2022 +0000 Update sdp dispatch logic to enable fused backward (#89154) Reorganizes how the sdp dispatch logic is done in order to enable backward for fused kernels Pull Request resolved: https://github.com/pytorch/pytorch/pull/89154 Approved by: https://github.com/cpuhrsch commit 85a87e635c677e1c6d992bb9ea21c634e8b1d58f Author: Michael Lazos Date: Sat Nov 19 01:47:45 2022 +0000 [dynamo] mutable local caching to make dynamo faster at tracing mutation (#89170) Make mutation faster to speed up tracing optimizers, helps with https://github.com/pytorch/torchdynamo/issues/1803 `replace_all` no longer iterates over the entire variable tracker data structure every time a mutation is performed. Each variable tracker internally keeps a set of contained mutable variable trackers, to provide a hint to `replace_all`. This is populated with a call to `apply` from `__post_init__` in the base `VariableTracker` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89170 Approved by: https://github.com/jansel commit ea58955dda6452307ce43a5beef0a466b49f1bef Author: Nikita Shulga Date: Sat Nov 19 01:13:08 2022 +0000 Move bazel to c++17 (#89297) Splitting out various smaller pieces from https://github.com/pytorch/pytorch/pull/85969 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89297 Approved by: https://github.com/huydhn commit cad5772c2c2e2c719664765119172610eed9c590 Author: Animesh Jain Date: Sat Nov 19 00:22:43 2022 +0000 [dashboard][huggingface] skip accuracy checks for really large models… (#89273) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89273 Approved by: https://github.com/desertfire commit ee907375fa085fbc61bd087f7d459fd62fd1ae4f Author: Howard Huang Date: Sat Nov 19 00:21:11 2022 +0000 [small] Update error message (#89294) Summary: `RuntimeError: Invalid function argument. Expected parameter "tensor_list" to be of type List[torch.Tensor].` to `RuntimeError: Invalid function argument. Expected parameter "input_tensor_list" to be of type List[torch.Tensor].` Test Plan: sandcastle Differential Revision: D41405238 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89294 Approved by: https://github.com/awgu commit c3938bb97ab2bf0942bee2a97d30051733e839ca Author: zhxchen17 Date: Sat Nov 19 00:19:47 2022 +0000 [functorch] introduce an experimental map() op. (#88767) Summary: We want to introduce an experimental control flow op: map() to export some models as FX graphs correctly. Some clarification of the basic requirements we have in mind: 1. This op can nest cond() and other control flow primitives internally. 2. We don't necessarily need loop carried dependencies for the models we've seen. 3. This map() op can handle dynamically shaped tensors as input and return dynamically shaped output based on input shapes. 4. We should be able to pass through additional arguments to the loop body as extra arguments. In this diff we introduce a new control flow op `map()` which has the following semantics: ``` def map(f: Callable, xs: Tensor, *args): return torch.stack([f(x, *args) for x in xs]) ``` Test Plan: pytest functorch/test_control_flow.py CI Differential Revision: D41165796 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88767 Approved by: https://github.com/zou3519 commit 94b5c807fdb1fdf62bc2ab5f0161936f564b140c Author: Edward Z.
Yang Date: Fri Nov 18 13:14:40 2022 -0800 Detach fake tensors into val, so they aren't affected by metadata mutation (#89140) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89140 Approved by: https://github.com/bdhirsh commit 885f8a56d445796100f3ab6f806633890662021a Author: Nikita Shulga Date: Fri Nov 18 23:44:57 2022 +0000 [BE] Print backtraces from coredumps (#89309) By simply invoking `gdb python core -ex "bt" -ex "q"` Test plan: See: [linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)](https://github.com/pytorch/pytorch/actions/runs/3500498821/jobs/5863369649#step:14:39) Not sure why multiprocessing tests SEGFAULT, but they do Pull Request resolved: https://github.com/pytorch/pytorch/pull/89309 Approved by: https://github.com/clee2000, https://github.com/huydhn commit 0e1fcc8aa8790e54a85efdc81b959f46f089e3d3 Author: Tran Le Date: Fri Nov 18 23:19:14 2022 +0000 [FX] Add type annotation to `getitem` node before `split_module` (#88510) Summary: Some nodes lost the type annotation during `split_module`, causing the submodels to be un-scriptable. This is because compiler always infer Tensor type, which is wrong for non-Tensor types. We attempt to infer type annotation for `getitem` node to improve scriptability. Test Plan: ``` buck2 test //caffe2/test:fx_experimental ``` Differential Revision: D41037819 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88510 Approved by: https://github.com/xush6528 commit ecfb4e064ccedb42fd73d99f24cb749e05e28801 Author: Wei Wang Date: Fri Nov 18 23:05:50 2022 +0000 [Inductor CI] Use string format for cuda-arch-list input to prevent 8.0/9.0/10.0 etc from being interpreted as 8/9/10 (#89279) Currently or in future whenever we change the cuda-arch-list to num.0, github action or some agent would pass just num to TORCH_CUDA_ARCH_LIST This num is not regex matched during cuda arch analysis phase. (here: https://github.com/pytorch/pytorch/blob/c5fafb4e1694f141d8a1a31142cce4049d9057ed/cmake/Modules_CUDA_fix/upstream/FindCUDA/select_compute_arch.cmake#L229) Example failure: https://github.com/weiwangmeta/pytorch/actions/runs/3495656108/jobs/5852735299 Unknown CUDA Architecture Name 8 in CUDA_SELECT_NVCC_ARCH_FLAGS This change reminds us to use e.g. '8.0', '9.0', '10.0' etc instead of 8.0, 9.0, 10.0 as GHA or some other agent may erroneously truncate it to pure numbers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89279 Approved by: https://github.com/desertfire, https://github.com/atalman commit 7551136b81251fef0505a935ab614a44dd355479 Author: Bryce Long Date: Fri Nov 18 22:36:05 2022 +0000 Add NVTX markers that dump additional information for nvprim_nvfuser Dynamo graphs (#88259) dump information on graphs that NVFuser JIT compiles: - the markers show the list of ops, args, and inputs that make up the graph also dumps information on FX nodes that are not touched by NVFuser: - the markers show the op, name, and arg list of the node Pull Request resolved: https://github.com/pytorch/pytorch/pull/88259 Approved by: https://github.com/IvanYashchuk, https://github.com/jjsjann123, https://github.com/mruberry commit 35d5fc52f01f0314ab1bf1555ea27d6fedbb7d98 Author: Taylor Robie Date: Thu Nov 17 13:33:39 2022 -0800 [Profiler] Don't raise SOFT_ASSERT in debug builds. (#89240) Enough people are hitting this issue that we need to turn off hard failures until the fire rate is zero in steady state. (via scuba logging.) 
Differential Revision: [D41382914](https://our.internmc.facebook.com/intern/diff/D41382914/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89240 Approved by: https://github.com/aaronenyeshi commit bfffc8d8efc3247853d706148146a5fd62d5ef08 Author: Andrew Gu Date: Thu Nov 17 23:06:09 2022 +0000 [DDP][Docs] Add warning that `no_sync()` should include forward (#89244) The issue where the user only includes `loss.backward()` inside `no_sync()` but not the forward pass has arisen several times now. I think adding an explicit warning in the docs is worthwhile. Rendered doc: Screen Shot 2022-11-17 at 9 21 32 PM Pull Request resolved: https://github.com/pytorch/pytorch/pull/89244 Approved by: https://github.com/zhaojuanmao commit 304b5de1b01213b18947ffcb6f5782f89fcd0b2e Author: David Berard Date: Thu Nov 17 09:58:29 2022 -0800 Re-enable test_hf_bert_fsdp (#89223) It looks like this failure was actually caused by https://github.com/pytorch/pytorch/pull/88629, see the revert message on that PR. It probably just looked like a flaky test on CI because of how quickly the PR was reverted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89223 Approved by: https://github.com/voznesenskym commit ba605c3b0439fd5dfe062f42e60b990c88c061d4 Author: Edward Z. Yang Date: Fri Nov 18 06:59:21 2022 -0800 Don't trace when we track_tensor_tree (#89139) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89139 Approved by: https://github.com/bdhirsh commit e04dc35a6a1d1447f6e067db5f29f88adff91acf Author: Edward Z. Yang Date: Fri Nov 18 06:59:20 2022 -0800 Symintify obeys_layout_contract (#89138) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89138 Approved by: https://github.com/bdhirsh commit 837ca8f344380f2356b01662f215ff561b09401f Author: Zain Rizvi Date: Fri Nov 18 19:36:09 2022 +0000 Remove --retry-all-errors from environment with old curl (#89298) The version of curl on the `ubuntu-latest` box doesn't support the `--retry-all-errors` param and is breaking periodic builds Example: https://github.com/pytorch/pytorch/actions/runs/3495466804/jobs/5852265880 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89298 Approved by: https://github.com/huydhn commit ee2ce3fef6d6bd073eb31303808618db88cec2e1 Author: Huy Do Date: Fri Nov 18 18:55:33 2022 +0000 Set make max load when building libtorch (#89237) The nccl build is still OOM sometimes when using `$(MAKE)`: ``` virtual memory exhausted: Cannot allocate memory Makefile:73: recipe for target '/var/lib/jenkins/cpp-build/caffe2/build/nccl/obj/collectives/device/devlink.o' failed make[5]: *** [/var/lib/jenkins/cpp-build/caffe2/build/nccl/obj/collectives/device/devlink.o] Error 1 make[5]: Leaving directory '/var/lib/jenkins/workspace/third_party/nccl/nccl/src/collectives/device' ``` * https://github.com/pytorch/pytorch/actions/runs/3476485191/jobs/5811758058 * https://github.com/pytorch/pytorch/actions/runs/3422228421/jobs/5702153639 So trying to set the same limit here as when building with ninja Pull Request resolved: https://github.com/pytorch/pytorch/pull/89237 Approved by: https://github.com/malfet commit 7ec8a4d2a26f717d0a4073e6005f9edfdd7ab641 Author: vfdev-5 Date: Fri Nov 18 18:46:50 2022 +0000 Vectorized horizontal flip implementation (#88989) When we benchmarked image processing transforms in torchvision : tensor vs pillow we saw that horizontal flip on uint8 data `(3, X, X)` is 2-3x slower. 
Due to the fact that output's first stride is negative, implementation does a simple data copy using [`basic_loop`](https://github.com/pytorch/pytorch/blob/8371bb8a3dddbead709bc1e9d26715818a34fa8a/aten/src/ATen/native/cpu/Loops.h#L286). In this PR, a vectorized path is added for horizontal flip op for dtypes: uint8, int, float32, long and double and there is a speed-up that reduces the gap between PIL and tensor ops
```
CPU capability usage: AVX2
[------------------------------------------------------ Horizontal flip ------------------------------------------------------]
                                               |  torch (1.14.0a0+git2ed1d29) PR  |  Pillow (9.3.0)  |  torch (1.14.0.dev20221116+cu116) nightly
1 threads: ---------------------------------------------------------------------------------------------------------------------
      channels=3, size=256, dtype=torch.int64 | 101.307 (+-0.904) | | 111.364 (+-0.328)
      channels=3, size=520, dtype=torch.int64 | 462.369 (+-2.184) | | 505.602 (+-0.541)
      channels=3, size=712, dtype=torch.int64 | 1855.441 (+-6.528) | | 1828.370 (+-8.600)
      channels=1, size=256, dtype=torch.int32 | 22.282 (+-0.130) | 44.218 (+-0.936) | 34.651 (+-0.162)
      channels=1, size=520, dtype=torch.int32 | 72.180 (+-0.076) | 166.639 (+-1.180) | 118.820 (+-0.210)
      channels=1, size=712, dtype=torch.int32 | 129.621 (+-0.649) | 307.140 (+-2.221) | 216.104 (+-0.793)
      channels=3, size=256, dtype=torch.uint8 | 51.685 (+-0.200) | 44.171 (+-0.818) | 361.611 (+-0.276)
      channels=3, size=520, dtype=torch.uint8 | 223.320 (+-0.726) | 166.607 (+-2.256) | 1462.012 (+-4.917)
      channels=3, size=712, dtype=torch.uint8 | 423.298 (+-1.156) | 307.067 (+-1.999) | 2738.481 (+-1.715)
      channels=1, size=256, dtype=torch.float32 | 22.281 (+-0.056) | 44.149 (+-0.808) | 35.316 (+-0.028)
      channels=1, size=520, dtype=torch.float32 | 72.268 (+-0.106) | 166.631 (+-1.212) | 119.504 (+-0.340)
      channels=1, size=712, dtype=torch.float32 | 129.777 (+-0.632) | 307.078 (+-1.909) | 216.987 (+-0.185)
      channels=1, size=256, dtype=torch.float16 | 32.789 (+-0.081) | | 34.044 (+-0.039)
      channels=1, size=520, dtype=torch.float16 | 112.693 (+-0.478) | | 117.445 (+-0.125)
      channels=1, size=712, dtype=torch.float16 | 203.644 (+-0.791) | | 213.283 (+-0.397)
      channels=3, size=256, dtype=torch.float64 | 102.058 (+-0.333) | | 108.404 (+-0.346)
      channels=3, size=520, dtype=torch.float64 | 473.139 (+-1.327) | | 503.265 (+-0.365)
      channels=3, size=712, dtype=torch.float64 | 1854.489 (+-9.513) | | 1844.345 (+-1.371)
      channels=1, size=256, dtype=torch.int16 | 11.927 (+-0.056) | | 33.993 (+-0.037)
      channels=1, size=520, dtype=torch.int16 | 39.724 (+-0.148) | | 117.577 (+-0.153)
      channels=1, size=712, dtype=torch.int16 | 68.264 (+-0.133) | | 213.118 (+-0.157)
Times are in microseconds (us).
```
```
CPU capability usage: AVX512
[------------------------------------------------------ Horizontal flip ------------------------------------------------------]
                                               |  torch (1.14.0a0+git2ed1d29) PR  |  Pillow (9.3.0)  |  torch (1.14.0.dev20221118+cu116) nightly
1 threads: ---------------------------------------------------------------------------------------------------------------------
      channels=3, size=256, dtype=torch.int64 | 131.244 (+-1.954) | | 135.649 (+-4.066)
      channels=3, size=520, dtype=torch.int64 | 522.032 (+-4.660) | | 539.822 (+-10.420)
      channels=3, size=712, dtype=torch.int64 | 1041.111 (+-53.575) | | 1322.411 (+-80.017)
      channels=1, size=256, dtype=torch.int32 | 10.108 (+-0.414) | 49.164 (+-1.000) | 34.606 (+-0.865)
      channels=1, size=520, dtype=torch.int32 | 93.218 (+-1.417) | 191.985 (+-5.047) | 133.664 (+-5.372)
      channels=1, size=712, dtype=torch.int32 | 167.919 (+-2.854) | 353.574 (+-6.568) | 246.162 (+-5.753)
      channels=3, size=256, dtype=torch.uint8 | 34.710 (+-0.541) | 49.005 (+-0.923) | 136.603 (+-2.339)
      channels=3, size=520, dtype=torch.uint8 | 154.873 (+-3.049) | 191.729 (+-4.997) | 534.329 (+-10.754)
      channels=3, size=712, dtype=torch.uint8 | 290.319 (+-4.819) | 351.619 (+-6.978) | 997.119 (+-33.086)
      channels=1, size=256, dtype=torch.float32 | 10.345 (+-0.338) | 49.105 (+-0.942) | 35.478 (+-0.733)
      channels=1, size=520, dtype=torch.float32 | 81.131 (+-5.281) | 191.697 (+-4.555) | 133.554 (+-4.193)
      channels=1, size=712, dtype=torch.float32 | 169.581 (+-3.476) | 352.995 (+-10.792) | 251.089 (+-7.485)
      channels=1, size=256, dtype=torch.float16 | 35.259 (+-0.612) | | 35.154 (+-0.924)
      channels=1, size=520, dtype=torch.float16 | 132.407 (+-1.980) | | 131.850 (+-5.611)
      channels=1, size=712, dtype=torch.float16 | 240.192 (+-5.479) | | 239.555 (+-7.273)
      channels=3, size=256, dtype=torch.float64 | 129.649 (+-2.349) | | 130.429 (+-6.240)
      channels=3, size=520, dtype=torch.float64 | 548.534 (+-5.179) | | 622.568 (+-25.720)
      channels=3, size=712, dtype=torch.float64 | 1208.091 (+-77.095) | | 1679.204 (+-316.292)
      channels=1, size=256, dtype=torch.int16 | 7.801 (+-0.115) | | 34.517 (+-0.482)
      channels=1, size=520, dtype=torch.int16 | 36.010 (+-0.855) | | 131.001 (+-1.686)
      channels=1, size=712, dtype=torch.int16 | 87.395 (+-1.355) | | 237.731 (+-4.181)
Times are in microseconds (us).
```
[Source](https://gist.github.com/vfdev-5/c0421f54c8aed655b042dd1ce4cb621e) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88989 Approved by: https://github.com/lezcano, https://github.com/datumbox, https://github.com/peterbell10, https://github.com/ngimel commit 81a4aeabdf9d550ceda52a5060f19568de61b265 Author: Yanbo Liang Date: Fri Nov 18 18:43:15 2022 +0000 [Dynamo] Support Tensor.nelement & torch.cuda.is_available (#89164) Fix several errors in [7k github models](https://github.com/pytorch/torchdynamo/issues/1198). Pull Request resolved: https://github.com/pytorch/pytorch/pull/89164 Approved by: https://github.com/soumith commit 8a419cbffb939ef00ce723bbdf5bf1b8c62a7d74 Author: Horace He Date: Fri Nov 18 10:56:03 2022 +0000 Added partial decomposition of conv_backward and grad_bias computation (#89128) `convolution_backward` often just kicks off the `sum` as a separate kernel. Splitting it off in a decomp allows us to fuse it into other ops: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Convolution.cpp#L2150 Improves `convnext_base` from 373 img/s => 383 img/s Not sure what other models use convolution with bias haha.
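For illustration, a minimal sketch (assumed NCHW layout, hypothetical helper name) of the grad_bias reduction that the decomposition splits out so it can be fused with neighboring ops:
```python
import torch

def grad_bias_from_grad_output(grad_output: torch.Tensor) -> torch.Tensor:
    # For an NCHW grad_output, the bias gradient is just a sum over every
    # dimension except the output-channel dimension.
    return grad_output.sum(dim=(0, 2, 3))

g = torch.randn(8, 16, 32, 32)  # (N, C_out, H, W)
assert grad_bias_from_grad_output(g).shape == (16,)
```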
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89128 Approved by: https://github.com/ezyang commit 38ccd08f9b79bc2102050833948f5112aed2dfc4 Author: Jerry Zhang Date: Fri Nov 18 00:15:45 2022 -0800 [quant][fx][be] Refactor replace observer with q/dq op code (#89247) Summary: This is a refactor to prepare for future extensions, no functionality changes Test Plan: python test/test_quantization.py TestQuantizeFx Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/89247 Approved by: https://github.com/vkuzo, https://github.com/andrewor14 commit c219b55b5f8d5718d382735628e9eb8a46caee9f Author: zhxchen17 Date: Thu Nov 17 21:35:51 2022 -0800 Use standard __func__ macro in symbolic shape. (#89264) Summary: I saw the following issue only on Windows build in PR #88767: ``` RuntimeError: AttributeError: 'SymNode' object has no attribute 'torch::impl::PythonSymNodeImpl::ge' ``` It's only on Windows because we get the attributes of SymNode in C++ with `__FUNCTION__` macro, which is not in C++ standard, therefore has platform specific behavior. In this case, MSVC will include a function's namespace and class name, which is not intended here. Instead we should use `__func__`. see: https://en.cppreference.com/w/cpp/language/function#Function_definition godbolt example to show the difference: https://godbolt.org/z/PGfvecxPx Test Plan: CI Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/89264 Approved by: https://github.com/ezyang commit 12a97444c3f5b640be54f3307895cd0e0c18085a Author: Richard Howell Date: Fri Nov 18 16:30:53 2022 +0000 [xplat] remove -weak_framework (#89233) Summary: The `-weak_framework` flag is no longer necessary, Buck will weakly link frameworks depending on the `target_sdk_version` of the binary being linked. 
Test Plan: Compare IG load commands before and after change with P553208168 ``` load command difference in Instagram.app/Frameworks/InstagramXplatFramework.framework/InstagramXplatFramework --- /tmp/tmpvd97s2v0 2022-11-16 12:13:54.082910598 -0800 +++ /tmp/tmpj20r_4ca 2022-11-16 12:13:54.082910598 -0800 @@ -9,7 +9,7 @@ /System/Library/Frameworks/CoreHaptics.framework/CoreHaptics (compatibility version 1.0.0, current version 1.0.0, weak) /System/Library/Frameworks/CoreImage.framework/CoreImage (compatibility version 1.0.0, current version 5.0.0) /System/Library/Frameworks/CoreLocation.framework/CoreLocation (compatibility version 1.0.0, current version 2780.0.17) - /System/Library/Frameworks/CoreML.framework/CoreML (compatibility version 1.0.0, current version 1.0.0, weak) + /System/Library/Frameworks/CoreML.framework/CoreML (compatibility version 1.0.0, current version 1.0.0) /System/Library/Frameworks/CoreMedia.framework/CoreMedia (compatibility version 1.0.0, current version 1.0.0) /System/Library/Frameworks/CoreServices.framework/CoreServices (compatibility version 1.0.0, current version 1226.0.0) /System/Library/Frameworks/CoreTelephony.framework/CoreTelephony (compatibility version 1.0.0, current version 0.0.0) @@ -33,9 +33,9 @@ /System/Library/Frameworks/Security.framework/Security (compatibility version 1.0.0, current version 60420.40.34) /System/Library/Frameworks/SystemConfiguration.framework/SystemConfiguration (compatibility version 1.0.0, current version 1241.40.2) /System/Library/Frameworks/UIKit.framework/UIKit (compatibility version 1.0.0, current version 6109.1.108) - /System/Library/Frameworks/UserNotifications.framework/UserNotifications (compatibility version 1.0.0, current version 1.0.0, weak) + /System/Library/Frameworks/UserNotifications.framework/UserNotifications (compatibility version 1.0.0, current version 1.0.0) /System/Library/Frameworks/VideoToolbox.framework/VideoToolbox (compatibility version 1.0.0, current version 1.0.0) - /System/Library/Frameworks/WebKit.framework/WebKit (compatibility version 1.0.0, current version 614.2.9, weak) + /System/Library/Frameworks/WebKit.framework/WebKit (compatibility version 1.0.0, current version 614.2.9) /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.0.0) /usr/lib/libbz2.1.0.dylib (compatibility version 1.0.0, current version 1.0.8) /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1300.32.0) ``` Both these changes are correct, WebKit is available from 8.0, UserNotifications from 10.0 and CoreML from 11.0. Instagram has a deployment target of 12.4. Reviewed By: ebgraham Differential Revision: D41348639 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89233 Approved by: https://github.com/malfet commit 19e66fcec235fe46a23186a59446bcfe70ad4f6d Author: andrewor14 Date: Thu Nov 17 12:47:33 2022 -0800 [Quant] Allow setting fixed qparams for inner LSTM ops (#88456) Summary: In both eager and FX graph mode quantization, `torch.ao.nn.quantizable.LSTM` is used as an observed custom module, which is responsible for inserting its own observers. By default, the user specifies a single QConfig for the custom module (either through QConfigMapping or by setting the "qconfig" attribute"), and all inner ops will [inherit this QConfig](https://github.com/pytorch/pytorch/blob/dc00bb51b8d370bf3891f0edb2c6e0c2914e329a/torch/ao/nn/quantizable/modules/rnn.py#L366-L378) and use the same observer/fake_quantize constructors. 
Today, users who wish to override this behavior must extend `torch.ao.nn.quantizable.LSTM` and write a lot of custom code to manually assign the QConfigs to the inner ops. This commit alleviates this burden on the user by providing a helper function to assign QConfigs with custom observers. An example use case of this is providing a reference implementation for a backend kernel that hardcodes qparams for efficiency. Example usage: ``` import torch from torch.ao.quantization import get_default_qconfig_mapping from torch.ao.quantization.fx.custom_config import ( PrepareCustomConfig, ConvertCustomConfig, ) class MyModel(torch.nn.Module): ... class UserLSTM(torch.ao.nn.quantizable.LSTM): @classmethod def from_float(cls, other): assert isinstance(other, cls._FLOAT_MODULE) linear_output_obs_ctr = FixedQParamsObserver.with_args( scale=2 ** -11, zero_point=2 ** 15, dtype=torch.qint32) sigmoid_obs_ctr = FixedQParamsObserver.with_args( scale=2 ** -16, zero_point=0, dtype=torch.qint32) tanh_obs_ctr = FixedQParamsObserver.with_args( scale=2 ** -15, zero_point=2 ** 15, dtype=torch.qint32) cell_state_obs_ctr = FixedQParamsObserver.with_args( scale=2 ** -11, zero_point=0, dtype=torch.qint32) hidden_state_obs_ctr = FixedQParamsObserver.with_args( scale=2 ** -7, zero_point=2 ** 7, dtype=torch.quint8) return torch.ao.quantization.utils._get_lstm_with_individually_observed_parts( float_lstm=other, linear_output_obs_ctr=linear_output_obs_ctr, sigmoid_obs_ctr=sigmoid_obs_ctr, tanh_obs_ctr=tanh_obs_ctr, cell_state_obs_ctr=cell_state_obs_ctr, hidden_state_obs_ctr=hidden_state_obs_ctr, ) qconfig_mapping = get_default_qconfig_mapping() example_inputs = (torch.rand(5, 3, 50), torch.rand(1, 3, 50), torch.randn(1, 3, 50)) prepare_custom_config = PrepareCustomConfig() \ .set_float_to_observed_mapping(torch.nn.LSTM, UserLSTM) convert_custom_config = ConvertCustomConfig() \ .set_observed_to_quantized_mapping(UserLSTM, torch.ao.nn.quantized.LSTM) model = MyModel() model = prepare_fx(model, qconfig_mapping, example_inputs, prepare_custom_config=prepare_custom_config) model(*example_inputs) # calibrate model = convert_fx(model, convert_custom_config=convert_custom_config) model(*example_inputs) ``` Test Plan: python test/test_quantization.py TestQuantizeFx.test_static_lstm_with_custom_fixed_qparams Reviewers: jerryzh168, vkuzo Subscribers: jerryzh168, vkuzo Pull Request resolved: https://github.com/pytorch/pytorch/pull/88456 Approved by: https://github.com/jerryzh168, https://github.com/vkuzo commit 19fcb80551854431e7e05c422690751037a18488 Author: Bin Bao Date: Fri Nov 18 16:15:55 2022 +0000 [inductor] Skip DALLE2_pytorch in torchbench (#89288) Summary: DALLE2_pytorch fails in eager as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89288 Approved by: https://github.com/Krovatkin commit 1f7c0ff6e799e7bde94975f7a5bbec39a69ab8f6 Author: Bin Bao Date: Fri Nov 18 13:41:51 2022 +0000 [inductor] Temporarily disable functorch_dp_cifar10 test in TorchBench (#89281) Summary: The failure wasn't caught because of a land race. Skip the test for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89281 Approved by: https://github.com/Krovatkin commit 55e55d95ea9a6f64bba50cdc9e243808cb534202 Author: Howard Huang Date: Fri Nov 18 15:27:15 2022 +0000 Update torch.distributed.DistBackendError type (#89235) Summary: Update torch.distributed.DistBackendError type based on https://fb.workplace.com/groups/pyreqa/posts/5753993921357059 Test Plan: Pyre tests should pass? 
let sandcastle run Reviewed By: markkm Differential Revision: D41384130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89235 Approved by: https://github.com/awgu commit 154e58c03285f3d399b8818dd17e973d486efefa Author: lezcano Date: Fri Nov 18 11:25:36 2022 +0000 Add most in-place references/decompositions (#88117) We add most in-place references in a generic way. We also implement a wrapper to implement the annoying interface that `nn.functional` nonlinearities have. We fix along the way a couple decompositions for some non-linearities by extending the arguments that the references have. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88117 Approved by: https://github.com/mruberry commit 6741443c7ceae0201fd76b5e6fc59ebd8cd6876a Author: lezcano Date: Fri Nov 18 10:35:46 2022 +0000 Simplify maybe_resize_out (#88116) The previous behaviour would call `resize_` on 0-sized elements even when their size was correct. This would make some test fail, as resize_ may be an in-place operation and it's not supported by some subsystems Pull Request resolved: https://github.com/pytorch/pytorch/pull/88116 Approved by: https://github.com/mruberry commit ce0e22a81a2383c7c951310c9c0aa7638748687b Author: lezcano Date: Fri Nov 18 10:35:45 2022 +0000 Fix names of some reference functions (#88115) The `__name__` field of some binary reference functions was wrong. We fix this to be consistent with unary reference functions. In the future, we should probably make the binary reference wrapper return a wrapper itself to avoid all those calls to `partial`. This change helps performing some homogeneous treatment of functions by their name. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88115 Approved by: https://github.com/mruberry commit 2e358cc98fab728aad8775de28596d589358b3b2 Author: Jacob Hayes Date: Fri Nov 18 14:09:21 2022 +0000 Add platform markers for linux only extra_install_requires (#88826) Fixes #88049 https://github.com/pytorch/pytorch/pull/85097 added new extra dependencies on `nvidia-*`. They are linux (GPU) only packages, but were not marked as such, causing issues installing pytorch 1.13 via Poetry (and possibly other tools that follow PyPI's metadata API) on non-Linux systems. This "fixes" the issue by adding the `; platform_system = 'Linux'` marker on these dependencies, but the main problem of different metadata for different wheels is a [somewhat larger issue](https://github.com/pytorch/pytorch/issues/88049#issuecomment-1302555269). https://github.com/pytorch/pytorch/pull/85097 used `;` as a delimiter for splitting the different deps, but that is the delimiter used in markers, so I changed to split on `|`. 
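For context, a minimal sketch of what such a marker looks like (illustrative package and project names, not the actual PyTorch setup.py):
```python
from setuptools import setup

setup(
    name="example-package",  # placeholder project name
    version="0.0.1",
    install_requires=[
        # PEP 508 environment marker: only installed when the target platform
        # is Linux, so macOS/Windows installs simply skip the dependency.
        "nvidia-cuda-runtime-cu11; platform_system == 'Linux'",
    ],
)
```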
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88826 Approved by: https://github.com/neersighted, https://github.com/lalmei, https://github.com/malfet commit 5654fed23e7728eca717b23c97c1fca8c176112a Author: Nikita Shulga Date: Fri Nov 18 10:51:07 2022 +0000 Export c10/[macros|util] headers to be used by internal inductor builds (#89249) Summary: Fixes package boundary violation that existed in previous implementation Test Plan: CI Differential Revision: D41391862 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89249 Approved by: https://github.com/izaitsevfb commit 4c6724985d8b85c5719078a25255dbd7369c25e5 Author: Iris Date: Fri Nov 18 09:49:36 2022 +0000 [PT-D][Checkpoint] Update import and update docstring for distributed checkpoint (#89256) Update test import and docstring as we have moved distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (https://github.com/pytorch/pytorch/pull/88698). Test: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/89256 Approved by: https://github.com/fduwjj commit 2dcacc6b999a44e13a0dbb679ac17d767b05d898 Author: Jiewen Tan Date: Fri Nov 18 09:28:46 2022 +0000 [LTC] Upstream short_metrics (#89186) Summary: This pull request upstreams pytorch/xla#4148. Test Plan: xla CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89186 Approved by: https://github.com/JackCaoG commit c5fafb4e1694f141d8a1a31142cce4049d9057ed Author: HDCharles Date: Thu Nov 17 19:20:22 2022 -0800 [ao] maintain BC for is_activation_post_process (#89260) Summary: tests are failing due to code packaged with trained models calling now defunct function names (is_activation_post_process). this diff maintains BC temporarily until the cached code can be refreshed Test Plan: no functional change Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/89260 Approved by: https://github.com/jerryzh168 commit 30c3e5afb0c0ad22c1084a2064ebdc09f7808ecc Author: Michael Lazos Date: Fri Nov 18 07:46:35 2022 +0000 Disable tracing `zero_grad()` (#88731) Tracing through zero grad is slow, and doesn't provide any benefits. Helps https://github.com/pytorch/torchdynamo/issues/1803 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88731 Approved by: https://github.com/anijain2305 commit afdc48f843afab531a4315a1ca1a43f5f303c5b7 Author: Huy Do Date: Fri Nov 18 07:39:16 2022 +0000 Gate CUDA-only inductor tests by HAS_CUDA (#89251) This is to prevent these tests from running on platform where CUDA doesn't exist such as macos. 
And they are quite flaky https://hud.pytorch.org/failure/test_linear_permute_fusion_cpu, failing the CI from time to time Pull Request resolved: https://github.com/pytorch/pytorch/pull/89251 Approved by: https://github.com/soumith, https://github.com/desertfire commit 6a964c16e5125f485372418d129c3eabdec7e881 Author: kshitij12345 Date: Fri Nov 18 07:31:10 2022 +0000 [flaky] relax tolerance conv1d_vs_scipy (#89193) Fixes https://github.com/pytorch/pytorch/issues/89087 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89193 Approved by: https://github.com/kit1980 commit fc1c0cd3ef5af94e2b6cb262252cf97b61e5d3cb Author: PumeTu Date: Fri Nov 18 07:24:33 2022 +0000 Add support trace on MPS backend (#87910) Fixes [#87221](https://github.com/pytorch/pytorch/issues/87221) `trace` is now supported on MPS Pull Request resolved: https://github.com/pytorch/pytorch/pull/87910 Approved by: https://github.com/kulinseth, https://github.com/malfet commit 7beb1518896482596a0d35ec404338d430933250 Author: maxren Date: Thu Nov 17 14:31:43 2022 -0800 [xnnpack][executorch] remove unordered_set from xnn_compiler (#89231) Removing unordered_set from xnncompiler for executorch. While some STL libraries are unavoidable, and I think it should be ok for the delegate to pull in these libraries, unordered_set wasn't really needed, and we should be serializing the number of external ids anyway. After this, the backend classes should be good to hg copy into executorch Differential Revision: [D41227391](https://our.internmc.facebook.com/intern/diff/D41227391/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89231 Approved by: https://github.com/salilsdesai, https://github.com/cccclai commit ab75982d3a8d76052dbaf1eb37c5b9b729ac0dd8 Author: Zain Rizvi Date: Fri Nov 18 07:03:22 2022 +0000 Always retry curl downloads (#89157) Modify our curl commands so that they always retry downloads. By default, curl only retries what it considers to be "transient" errors, based on the server's response. However, curl's estimate of what's transient is very conservative. By adding the --retry-all-errors parameter we'll always retry curl commands. In particular, I'm hoping this mitigates errors where curl fails with the below error ([logs](https://github.com/pytorch/pytorch/actions/runs/3468758110/jobs/5794939941)) `curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to ossci-linux.s3.amazonaws.com:443` Some of the modified downloads didn't even have retries, so I added them in. More details: https://everything.curl.dev/usingcurl/downloads/retry Pull Request resolved: https://github.com/pytorch/pytorch/pull/89157 Approved by: https://github.com/kit1980, https://github.com/malfet commit 3bc78295c265df62983fcbcadb4a87ef7d0fbf2d Author: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com> Date: Fri Nov 18 05:08:45 2022 +0000 Fix consistency of histc on CPU and CUDA (#87832) Fixes #87657 The main reason why `histc` returns slightly different outputs is the difference in how the bin position is calculated. The CPU calculates it as https://github.com/pytorch/pytorch/blob/449778a939f2adc8867c5035b08be4e2d88339d8/aten/src/ATen/native/cpu/HistogramKernel.cpp#L168-L170, which is basically `(i - a) / (b - a) * N`, while the CUDA code https://github.com/pytorch/pytorch/blob/449778a939f2adc8867c5035b08be4e2d88339d8/aten/src/ATen/native/cuda/SummaryOps.cu#L41 uses `(i - a) * N / (b - a)`. For some cases like in #87657 the order of arithmetic operations matters due to the floating point round-off.
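A tiny numeric sketch of the difference (values picked arbitrarily; whether the two orderings actually diverge depends on the inputs):
```python
import torch

a = torch.tensor(0.1, dtype=torch.float32)    # histogram min (example value)
b = torch.tensor(0.9, dtype=torch.float32)    # histogram max (example value)
N = torch.tensor(100.0, dtype=torch.float32)  # number of bins (example value)
i = torch.tensor(0.3, dtype=torch.float32)    # sample value (example value)

cpu_style = (i - a) / (b - a) * N   # divide first, then scale (CPU kernel order)
cuda_style = (i - a) * N / (b - a)  # scale first, then divide (CUDA kernel order)
# The two results can differ in the last ULP, which is enough to flip the
# integer bin index when the value sits right on a bin boundary.
print(cpu_style.item(), cuda_style.item())
```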
________________ Not sure where would be the most appropriate place to put the unit test. Hope `test_reductions::test_histc` will do. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87832 Approved by: https://github.com/soumith commit f1fb586bc64b96264f4409421d758e9336f19eef Author: Sherlock Huang Date: Thu Nov 17 18:50:33 2022 +0000 Symintify repeat_interleave.self_int (#89111) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89111 Approved by: https://github.com/ezyang commit ba5e39e106caaf4e013fbfc4890d3df13e66d6c9 Author: Sherlock Huang Date: Thu Nov 17 18:10:40 2022 +0000 Fix tol for test_nvfuser_correctness__softmax_backward_data_cuda (#89178) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89178 Approved by: https://github.com/kit1980 commit 6f609dd0e03e11395cc637a34abd68472e5a1e12 Author: Yoni Chechik Date: Fri Nov 18 04:29:00 2022 +0000 docs: conv2d `padding` attribute- add `int` option (#85004) `padding: int` already exists but isn't mentioned in the genereted docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/85004 Approved by: https://github.com/albanD, https://github.com/kit1980 commit 6f4f69f54d181b34373e07dcb415f6c2af61868f Author: Jacob Szwejbka Date: Fri Nov 18 04:13:03 2022 +0000 [Executorch] [Quantization] New pattern for dynamic dequant (#89236) Summary: The op exposed should be qparams, and then we have concerns about prims not being supported so make q and dq ops that take in tensors Test Plan: unit test Differential Revision: D41382580 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89236 Approved by: https://github.com/jerryzh168 commit f4efc5e821259aee1b64ee32f992ea3458dcd546 Author: Jerry Zhang Date: Thu Nov 17 16:45:47 2022 -0800 [quant][be] Move some helper functions to the top level to reduce function length (#89246) Summary: att Test Plan: python test/test_quantization.py TestQuantizeFx Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/89246 Approved by: https://github.com/vkuzo commit 6ed14c7dcfb261e84016407d8025bf3e27999730 Author: PyTorch MergeBot Date: Fri Nov 18 03:45:53 2022 +0000 [vision hash update] update the pinned vision hash (#89102) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89102 Approved by: https://github.com/pytorchbot commit 3c2676de3d35fd22f79c46eaa770d03f1418c480 Author: Jiewen Tan Date: Fri Nov 18 03:37:14 2022 +0000 [LTC] Restore GetPythonFrames (#89122) Summary: pytorch/pytorch@936e930 delete the registration of GetPythonFramesFunction. Restore that and add a test case to prevent regression. Test Plan: python test/lazy/test_debug_util.py Fixes pytorch/xla#4206. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89122 Approved by: https://github.com/JackCaoG commit 65bcd1f88099dfeefccb6c6b7a0918e3a7ded606 Author: John Detloff Date: Fri Nov 18 03:17:35 2022 +0000 Add previously deleted circleci readme back to repo (#85598) This readme was deleted here: https://github.com/pytorch/pytorch/pull/73224 I chatted with the author, who doesn't remember exactly why it was deleted but suspects it was due either to out of date contents or because of the upcoming migration to github actions. 
With that said, we have references to this readme through our circleci directory, and since we do still have a lot of circleci workflows I feel this readme still adds a lot of value. (I recently did some CI tasks that required me to dig this readme up in order to solve a problem). I recommend we restore this file with a warning that its contents may be out of date, until our CircleCI workflows are entirely migrated to Github Actions Pull Request resolved: https://github.com/pytorch/pytorch/pull/85598 Approved by: https://github.com/clee2000, https://github.com/malfet commit 92f9214a311a6b94dff9e38836d5b0849a539647 Author: mikey dagitses Date: Thu Nov 17 16:20:45 2022 -0500 add -Wnarrowing as error to cmake builds (#89207) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89207 Approved by: https://github.com/wconstab, https://github.com/malfet commit fd0efb01a7a3a5b487d3d23c2c53a936620ba28a Author: Raman kumar Date: Fri Nov 18 02:53:39 2022 +0000 [MPS] Support for median with dim (#88807) **Aim**: Add support for aten::median for MPS backend (Fixes #87220) This is fresh clean PR from the previous [PR](https://github.com/pytorch/pytorch/pull/88554) - Implementing the new median function in aten/src/ATen/native/mps/operations/ReduceOps.mm - Adding it to aten/src/ATen/native/native_functions.yaml - Adding it to existing test_median median of entire input tensor on MPS `torch.median(mps_inputTensor)` median of along a dim `torch.median(mps_inputTensor, dim=[int], keepdim=[Bool])` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88807 Approved by: https://github.com/kulinseth commit 9fd00f194ae4e28948a9a03a6382c20dde04e4fd Author: Dmytro Dzhulgakov Date: Fri Nov 18 02:42:45 2022 +0000 Fix the kineto daemon build condition (#89174) If we're not building the lite interpreter we shouldn't be disabling Kineto. This eliminates a step from https://github.com/facebookincubator/dynolog/blob/main/docs/pytorch_profiler.md Pull Request resolved: https://github.com/pytorch/pytorch/pull/89174 Approved by: https://github.com/kimishpatel, https://github.com/malfet commit b652fbc57a331df5aa28b0bcd07f9e72db2fdbae Author: David Boetius Date: Fri Nov 18 01:57:38 2022 +0000 Fix torch.nn.functional.gelu docstring formatting (#89061) The docstring of `torch.nn.functional.gelu` is formatted incorrectly, so that part of the math isn't rendered and there are extra blocks when there shouldn't: https://pytorch.org/docs/stable/generated/torch.nn.functional.gelu.html I didn't build the docs, so I am not 100% sure that I got the formatting right, but I am confident. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89061 Approved by: https://github.com/bdhirsh, https://github.com/kit1980 commit 177621a0b28b931d9be6976c2c38cb57af7949d9 Author: Huy Do Date: Fri Nov 18 00:11:42 2022 +0000 Use pytest-flakefinder to rerun tests multiple times (#89106) Per title. The way re-run is handled in https://github.com/pytorch/pytorch/pull/88646 only applies to unittest. * https://github.com/pytorch/pytorch/actions/runs/3484930558 * https://github.com/pytorch/pytorch/actions/runs/3484930319 Manually download the test report artifacts and verify that that pytest test_ops is called multiple times. 
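For context, a rough sketch of what the flake-finder rerun does (assumes the pytest-flakefinder plugin is installed; the test path is illustrative): the plugin duplicates each collected test so it runs several times within a single pytest session.
```python
import pytest

# Equivalent to: pytest --flake-finder --flake-runs=3 test/test_ops.py
exit_code = pytest.main(["--flake-finder", "--flake-runs=3", "test/test_ops.py"])
print("pytest exit code:", exit_code)
```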
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89106 Approved by: https://github.com/clee2000 commit 57e05e822d0f53db04d2ee2216906f6fc01b4a4f Author: Dmitry Tomshin Date: Fri Nov 18 00:10:48 2022 +0000 Issue 68576 prefetch factor (#88972) Fixes #68576 This PR allows set the `prefetch_factor=None` making it really optional according to the documentation Pull Request resolved: https://github.com/pytorch/pytorch/pull/88972 Approved by: https://github.com/kit1980 commit 2b3ac879a7d68aca8a7608e97a7cfc713dbf5c6c Author: Sean Ross-Ross Date: Thu Nov 17 23:36:15 2022 +0000 feat: adding view_copy_batch_rule and opinfo for view_copy (#88150) to add view_copy to vmap dispatch and adding opinfo part of https://github.com/pytorch/functorch/issues/825 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88150 Approved by: https://github.com/kshitij12345, https://github.com/zou3519 commit 31b10e7d4083acd0eb689ae3873c13b8711770be Author: Bin Bao Date: Thu Nov 17 19:43:37 2022 +0000 Enable inductor CI for TorchBench (#87465) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87465 Approved by: https://github.com/malfet commit 3d8a853a87515a5e29e384396ff8769f4ee2f946 Author: erjia Date: Thu Nov 17 23:06:41 2022 +0000 [DataPipe] Add container template for _Fork and _Demux (#89216) - This would remove the hard-coded check within `_ChildDataPipe`. - Add `get_length_by_instance` to parent class to make sure there is a chance that child DataPipe can have different lengths - Prevent Error when `__del__` executed when the object has already been removed Pull Request resolved: https://github.com/pytorch/pytorch/pull/89216 Approved by: https://github.com/NivekT commit e2229a89b0618b58011a69a28e3d23cf7096e547 Author: keineahnung2345 Date: Thu Nov 17 22:28:20 2022 +0000 Fix typo in aten/src/README.md (#89175) remove redundant "have to" Pull Request resolved: https://github.com/pytorch/pytorch/pull/89175 Approved by: https://github.com/kit1980 commit a695fcf20103bb08ae660788d128cd924e6ec05b Author: Charlie Yan Date: Thu Nov 17 19:05:44 2022 +0000 Add tests for replicate multiple modules (#89099) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89099 Approved by: https://github.com/zhaojuanmao commit 767f6aa49fe20a2766b9843d01e3b7f7793df6a3 Author: Nikita Shulga Date: Thu Nov 17 22:05:27 2022 +0000 [JIT][Security] Do not blindly eval input string (#89189) Introduce `_eval_no_call` method, that evaluates statement only if it does not contain any calls(done by examining the bytecode), thus preventing command injection exploit Added simple unit test to check for that `torch.jit.annotations.get_signature` would not result in calling random code. Although, this code path exists for Python-2 compatibility, and perhaps should be simply removed. Fixes https://github.com/pytorch/pytorch/issues/88868 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89189 Approved by: https://github.com/suo commit fbbf3687453aed1b732eee6f6e9050258ce29561 Author: Huy Do Date: Thu Nov 17 21:33:59 2022 +0000 Fix distributed test paths when running periodic multigpu job (#89225) Some distributed tests are moved to a new location after https://github.com/pytorch/pytorch/pull/88698. 
This is currently failing periodic multigpu job: * https://github.com/pytorch/pytorch/actions/runs/3484486207/jobs/5829301159 * https://github.com/pytorch/pytorch/actions/runs/3484486207/jobs/5829301093 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89225 Approved by: https://github.com/clee2000 commit f057a45fafcd5869d8f6f7e687fad1d36749b9d0 Author: mikey dagitses Date: Thu Nov 17 06:09:55 2022 -0500 reland "support running test_mobile_profiler with buck1/buck2 and OSS (#89001)" (#89091) We modify this to no longer use std::experimental::filesystem::path and use our own custom type instead. This reverts commit c53a5ac6cca7e2e7d7c47b1a816c7eaa2e7a7704. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89091 Approved by: https://github.com/r-barnes, https://github.com/malfet commit e856a4d66bead8997a83f8714547c09fcbcdc263 Author: Xiao Wang <24860335+xwang233@users.noreply.github.com> Date: Thu Nov 17 20:10:52 2022 +0000 Add an env var to skip cudnn version compatibility check (#89184) skip the check by setting `PYTORCH_SKIP_CUDNN_COMPATIBILITY_CHECK=1` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89184 Approved by: https://github.com/ngimel commit 04169c5b6e53e89e339f02b61287154034ee9fca Author: Tugsbayasgalan (Tugsuu) Manlaibaatar Date: Mon Nov 14 23:26:15 2022 -0800 Rewrite assert statement with torch._assert under config (#88246) This diff rewrites assert statement in python with torch._assert under config. The resulting graph looks something like: ``` SOURCE CODE: def f(x): assert x[0] == 3 return x.cos() CAPTURED GRAPH: graph(): %arg0 : [#users=2] = placeholder[target=arg0] %getitem : [#users=1] = call_function[target=operator.getitem](args = (%arg0, 0), kwargs = {}) %eq : [#users=1] = call_function[target=operator.eq](args = (%getitem, 3), kwargs = {}) %_assert : [#users=0] = call_function[target=torch._assert](args = (%eq, "assertion_error"), kwargs = {}) %cos : [#users=1] = call_method[target=cos](args = (%arg0,), kwargs = {}) return cos ``` Note that this introduces side-effect as it could error out while executing graph, but the assertion can eliminated via DCE if we choose to ignore it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88246 Approved by: https://github.com/jansel commit af448e84eb2978062dc6ca4d3d538ed46b58f3d6 Author: William Wen Date: Thu Nov 17 19:20:49 2022 +0000 Fix bug in dynamo dashboard summary stats diff (#89226) Fixes issue where a suite may not be present in one of the logs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89226 Approved by: https://github.com/anijain2305 commit 706f791a1912af62e5a605bf93e246b457506627 Author: PyTorch MergeBot Date: Thu Nov 17 18:27:08 2022 +0000 Revert "Support masked_fill (#88736)" This reverts commit 2b131b1d43b10a2a005f3f042f920a62501e4e2d. Reverted https://github.com/pytorch/pytorch/pull/88736 on behalf of https://github.com/kit1980 due to Inductor tests are failing with AttributeError: module 'torch._inductor.codecache' has no attribute 'valid_vec_isa_list' commit 8e4c9828f4c990f439179912159086aaed790493 Author: PyTorch MergeBot Date: Thu Nov 17 17:02:36 2022 +0000 Revert "Reland "Towards unifying symbolic and non symbolic fake tensor (#89038)" (#89143)" This reverts commit e686b8c3ba93cb7caa314c78bf84dbd2d7df9683. 
Reverted https://github.com/pytorch/pytorch/pull/89143 on behalf of https://github.com/ZainRizvi due to This seems to be causing the test_make_fx_symbolic_exhaustive_rad2deg_cpu_float32 and test_make_fx_symbolic_exhaustive_inplace_rad2deg_cpu_float32 test to fail across multiple jobs commit cd81a700ecfb84a039257896af7b8398435b089e Author: Jiong Gong Date: Thu Nov 17 16:43:16 2022 +0000 Fix buffer overflow from AddressSanitizer checks due to inaccurate bfloat16 representation of large integer (#89210) Fixes #88939 The root cause of the issue is that BF16 cannot accurately represent big integer values. In the test case below, `539` as one of the corner pixel index is wrongly represented as `540` (from https://github.com/jgong5/pytorch/blob/fc60a1865eafc985217eccc0251f82014041e6a7/aten/src/ATen/native/UpSample.h#L271) and then the access out of the range with this index. Thanks to @malfet for the investigation and initial fix. I also reported an issue https://github.com/pytorch/pytorch/issues/89212 to track the issue of inaccurate integer representation of bf16 that need to be addressed in other places of PyTorch. ```python import torch def test(): arg_1 = torch.rand([1, 10, 540, 540], dtype=torch.bfloat16).clone() res = torch.nn.functional.interpolate(arg_1,2,mode='bilinear',align_corners=True) test() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89210 Approved by: https://github.com/malfet commit 2b131b1d43b10a2a005f3f042f920a62501e4e2d Author: Wang, Eikan Date: Thu Nov 17 03:33:32 2022 +0000 Support masked_fill (#88736) Support `masked_fill` to address the GPT2 performance issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88736 Approved by: https://github.com/jansel, https://github.com/jgong5 commit e686b8c3ba93cb7caa314c78bf84dbd2d7df9683 Author: Edward Z. Yang Date: Wed Nov 16 21:31:02 2022 -0800 Reland "Towards unifying symbolic and non symbolic fake tensor (#89038)" (#89143) This reverts commit cf6003f0469ae1440d4a8585860c2c5f4c738707. Differential Revision: [D41363992](https://our.internmc.facebook.com/intern/diff/D41363992) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89143 Approved by: https://github.com/albanD commit bdc9911575277848ccac56b344dd624aa97fb87d Author: Will Constable Date: Wed Nov 16 23:31:57 2022 +0000 Fix typo in dist_util.py (#89167) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89167 Approved by: https://github.com/davidberard98 commit 3beccbc29939f7a34346ed1a3646f6464086eeb4 Author: ecao Date: Thu Nov 17 08:15:49 2022 +0000 Add BFloat16 support and optimization for mish, hardtanh backward, and silu on CPU (#82460) * add BFloat16 support for mish and hardtanh backward on CPU. * optimize the performance for silu - optimize the performance for silu: bfloat16 single socket (28 cores): ``` before: 1x128x1024 forward 0.090 s backward 0.218 s 10x128x1024 forward 0.146 s backward 0.314 s after: 1x128x1024 forward 0.064 s backward 0.100 s 10x128x1024 forward 0.085 s backward 0.133 s ``` single core: ``` before: 1x128x1024 forward 0.300 s backward 0.606 s 10x128x1024 forward 2.825 s backward 5.834 s after: 1x128x1024 forward 0.156 s backward 0.239 s 10x128x1024 forward 1.447 s backward 2.165 s ``` - Add BFloat16 support for mish and backward of hardtanh on CPU. 
single socket (20 cores):

op | shape | fp32 / s | fp32 / s | bf16 / s | bf16 / s
-- | -- | -- | -- | -- | --
  |   | forward | backward | forward | backward
silu | [10, 128, 10, 10] | 4.41E-05 | 7.67E-05 | 5.32E-05 | 9.38E-05
  | [10, 128, 80, 80] | 0.0008 | 0.001788 | 0.00067 | 0.001031
mish | [10, 128, 10, 10] | 0.000356 | 0.000427 | 0.000367 | 0.000436
  | [10, 128, 80, 80] | 0.004527 | 0.005807 | 0.004757 | 0.005393
hardtanh | [10, 128, 10, 10] | / | 3.97E-05 | / | 4.45E-05
  | [10, 128, 80, 80] | / | 0.001748 | / | 0.000645

single core:

op | shape | fp32 / s | fp32 / s | bf16 / s | bf16 / s
-- | -- | -- | -- | -- | --
  |   | forward | backward | forward | backward
silu | [10, 128, 10, 10] | 1.17E-04 | 1.91E-04 | 1.35E-04 | 2.23E-04
  | [10, 128, 80, 80] | 0.007434 | 0.013141 | 0.008464 | 0.013044
mish | [10, 128, 10, 10] | 0.00103 | 0.00122 | 0.00106 | 0.001227
  | [10, 128, 80, 80] | 0.065629 | 0.078418 | 0.067779 | 0.077214
hardtanh | [10, 128, 10, 10] | / | 1.18E-04 | / | 9.30E-05
  | [10, 128, 80, 80] | / | 0.010773 | / | 0.005834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82460 Approved by: https://github.com/mingfeima, https://github.com/malfet commit 37c85cf5f2215da13d5836de46f44af72ed079ba Author: Mark Saroufim Date: Thu Nov 17 07:24:55 2022 +0000 Add warning if tensor cores are not used (#88844) Fixes https://github.com/pytorch/torchdynamo/issues/1839 Should I do this for all backends or just inductor? On a V100 I got from AWS
```python
from torch._dynamo import optimize
import torch

def fn(x, y):
    a = torch.cos(x)
    b = torch.sin(y)
    return a + b

new_fn = optimize("inductor")(fn)
a = new_fn(torch.Tensor(1),torch.Tensor(1))
print(a)
```
```
(sourcetorch) ubuntu@ip-172-31-31-152:~/test$ python test.py
/home/ubuntu/pytorch/torch/_dynamo/eval_frame.py:318: UserWarning: Tensor cores are available but not enabled. Consider setting torch.backends.cuda.matmul.allow_tf32 == True in your python script for speedups
warnings.warn(
tensor([1.3717])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88844 Approved by: https://github.com/ngimel, https://github.com/mlazos, https://github.com/anijain2305 commit b72f5b9ae3f7d1de74d9d2d40236fd09d606be0e Author: Yanbo Liang Date: Thu Nov 17 06:57:42 2022 +0000 [Dynamo] Support typing.Mapping & Support function as argument (#88963) These missing features come from https://github.com/pytorch/benchmark/pull/1302, where we'd like to enable E2E hf_bert dynamo train/eval. The dependent [HuggingFace accelerate library](https://huggingface.co/docs/accelerate/index) requires these improvements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88963 Approved by: https://github.com/jansel commit 126e44173d0dd4d942d8e20c73442048a46cfc24 Author: AllenTiTaiWang Date: Thu Nov 17 03:27:18 2022 +0000 [ONNX] Add onnx-script into ONNX docs (#89078) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89078 Approved by: https://github.com/BowenBao commit 74610a1cedbab64e813f3b49535cd8691a3ec5c7 Author: Animesh Jain Date: Thu Nov 17 06:14:21 2022 +0000 [dynamo][benchmarks] HF - Fix seq len and batch sizes (#89165) Fixes many models in https://github.com/pytorch/torchdynamo/issues/1842 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89165 Approved by: https://github.com/ngimel commit a41f70603aededc414da58523361773dbf13bde2 Author: Andrew M.
James Date: Thu Nov 17 02:01:13 2022 +0000 Round out rad2deg sparse support (#88442) - Add sparse coo dispatch - Modify backward to work with sparse compressed layouts - Enable sparse_compressed autograd testing - Correct layout support attributes on OpInfo Pull Request resolved: https://github.com/pytorch/pytorch/pull/88442 Approved by: https://github.com/cpuhrsch commit 70fb673e51decdd8bf4e55244d910a8e5680d12f Author: Rachel030219 <13704467+Rachel030219@users.noreply.github.com> Date: Thu Nov 17 05:55:25 2022 +0000 Use software approach to catch overflow ( `c10/utils/safe_numerics.h` ) on ARM devices (#89042) Fixes #89040 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89042 Approved by: https://github.com/malfet commit 54fca6a9da77b56b1a82373c814e61378b5d04c2 Author: Aaron Gokaslan Date: Thu Nov 17 05:01:08 2022 +0000 Fix: prefer .is_none() over .is(py::none()) for pybind11 in caffe2 (#88199) Follow up to #88051 . I noticed that I missed a few spots in the caffe2 folder. Prefer `.is_none()` over `.is(py::none())` as `.is_none()` is more efficient since it avoid reference counting increments and decrements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88199 Approved by: https://github.com/albanD, https://github.com/kit1980 commit 4e1d19c5a577b947a3dc84d9eec4a186ad3cd52f Author: PyTorch MergeBot Date: Thu Nov 17 04:58:53 2022 +0000 Revert "Redefine the simdlen semantic: (#88482)" This reverts commit fce6d6b3dcc879720bc45143426b86232106818a. Reverted https://github.com/pytorch/pytorch/pull/88482 on behalf of https://github.com/kit1980 due to Broke multiple tests in several trunk workflows, for example https://github.com/pytorch/pytorch/actions/runs/3485086792/jobs/5830429554 commit 81a8fdc40d7c504f99d5796a5b187551493685e4 Author: Lukas Hoenig Date: Thu Nov 17 04:54:23 2022 +0000 [MPS] Add binary operations dtype precedence test case (#87545) See https://github.com/pytorch/pytorch/pull/84742 and https://github.com/pytorch/pytorch/pull/78319. The test case tests that - for the binary operations (add, sub, mul, div), - for all data types (dtypes), - for a range of representative values and their combinations, - for various shapes and ways of creating the test tensors, the contents and dtype of the result tensor is identical for the MPS and CPU backends. It adds about 15-18s runtime to `test_mps.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87545 Approved by: https://github.com/kit1980 commit 44c9185f91699b74c7953eb912f37fb24991958d Author: ecao Date: Thu Nov 17 04:47:45 2022 +0000 Fix empty input issue of convolution for channels last memory format (#86521) Fixes empty input convolution issue : when input is empty e.g. shape of (0, 3, 3, 4) and weight is channels last format, at::_unsafe_view will raise "view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead." Pull Request resolved: https://github.com/pytorch/pytorch/pull/86521 Approved by: https://github.com/jgong5, https://github.com/malfet commit 637e764ec5d879a5cce0f63f747db3967b708517 Author: maxren Date: Wed Nov 16 10:46:30 2022 -0800 [xnnpack][executorch] Pass xnnexecutor pointer to compileModel() (#89090) Here we pass XNNExecutor* to compile model so that XNNExecutor can be allocated by runtime. 
This signature change is for executorch:
```
XNNExecutor compileModel(void* buffer) --> void compileModel(void* buffer, XNNExecutor* executor)
```
The intended use case for allocating the executor and compiling the serialized flatbuffer:
```
XNNExecutor* executor = runtime_allocator->allocateList(1);
XNNCompiler::compileModel(processed.buffer, executor);
```
Differential Revision: [D41208387](https://our.internmc.facebook.com/intern/diff/D41208387/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89090 Approved by: https://github.com/digantdesai commit 24b9890f0343a156a5785be859610316ecf8274e Author: Colin Taylor Date: Thu Nov 17 04:26:10 2022 +0000 [torchrec] [composable] update ShardedEmbeddingBagCollection to be use registered EBCs with shardedTensors as registered modules (#758) (#88026) Summary: X-link: https://github.com/pytorch/torchrec/pull/758 This PR fixes a bug in FSDP/DDP, where ShardedTensors are not supported even if passed in as params to ignore. This is important for composability because TorchRec named_parameters() will return FQNs of ShardedTensors (as defined in goals). It defines the device of a ShardedTensor to be None when local_tensor() does not exist on a rank, and updates ShardedEmbeddingBagCollection to be composable according to https://docs.google.com/document/d/1TBJSd5zgEg6cRcXv3Okuj7bBkqQwGS2IPh4TLWNNzFI/edit Differential Revision: D40458625 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88026 Approved by: https://github.com/wanchaol, https://github.com/rohan-varma commit 1cd6ebe0958ab8eff2b7ba715d9544f067dfe59e Author: Kazuaki Ishizaki Date: Thu Nov 17 04:18:10 2022 +0000 Fix typos in messages under torch (#89049) This PR fixes typos of messages in `.py` files under the torch directory. Only in `torch/onnx/symbolic_opset16.py`, fix a typo in a comment to make the operator name correct. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89049 Approved by: https://github.com/lezcano commit d1f48f05cef9e2b3b01c64a21a6e2abc3ddab323 Author: maxren Date: Wed Nov 16 10:46:28 2022 -0800 [xnnpack][Bug Fix] Pass serialized model by reference (#89089) Two changes: - Remove XNNCompiler dependence on std::string by passing void*. - Grab ser_model by reference: this bug was causing data pointers given to xnn_runtime to be freed because ser_model was on the stack. Differential Revision: [D41208380](https://our.internmc.facebook.com/intern/diff/D41208380/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89089 Approved by: https://github.com/digantdesai commit 366f1b2c2f6b273fcba5f071bf2297a963051894 Author: maxren Date: Wed Nov 16 10:46:27 2022 -0800 [xnnpack][lite-int] Freeze/Inline module to remove reference to self (#88863) We need to inline the graph before converting from torchscript to the xnnpack flatbuffer. Remove graph dependence on self. This will later help us work with constant data.
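For a rough illustration of the freeze/inline step in general TorchScript terms (a minimal sketch with a hypothetical `TinyConv` module, not the internal XNNPACK delegate path):
```python
import torch
import torch.nn as nn

class TinyConv(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)

    def forward(self, x):
        return self.conv(x)

# Scripting keeps references to `self` and its attributes; freezing inlines
# submodules and parameters as constants, so the resulting graph no longer
# depends on the module instance.
scripted = torch.jit.script(TinyConv().eval())
frozen = torch.jit.freeze(scripted)
print(frozen.graph)
```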
Differential Revision: [D41049858](https://our.internmc.facebook.com/intern/diff/D41049858/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88863 Approved by: https://github.com/digantdesai commit 1adb7b9b845603a834f452da0e99790779740d83 Author: Jerry Zhang Date: Tue Nov 15 16:01:29 2022 -0800 [nn][utils] Preserve requires_grad from original weight and bias in fuse conv/linear bn weights (#89100) Summary: As titled; previously we just called nn.Parameter, which has requires_grad=True by default. After this PR we preserve the original requires_grad. Test Plan: python test/test_nn.py TestFusionUtils Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D41343694](https://our.internmc.facebook.com/intern/diff/D41343694) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89100 Approved by: https://github.com/ngimel commit a5f04e9a915104692ae67ccd79768e8147cc0d2d Author: Kazuaki Ishizaki Date: Thu Nov 17 03:36:59 2022 +0000 Fix typos in .md and .rst files (#88962) This PR fixes the typo `Github` in `.md` and `.rst` files: `Github` -> `GitHub`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88962 Approved by: https://github.com/kit1980 commit 573eaf12258df8e87434ffa19a42b04fb873c6dc Author: Huy Do Date: Thu Nov 17 03:36:56 2022 +0000 Analyze and upload disabled tests rerun to S3 (#89083) Analyze and upload disabled tests rerun to S3. Note that this only picks up `test-reports` from `rerun_disable_tests` workflows. Running the script manually (`python -m tools.stats.check_disabled_tests --workflow-run-id 3473068035 --workflow-run-attempt 1 --repo pytorch/pytorch`) shows the files successfully uploaded to s3://ossci-raw-job-status/rerun_disabled_tests/3473068035/1. Rockset collection created: https://console.rockset.com/collections/details/commons.rerun_disabled_tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/89083 Approved by: https://github.com/clee2000 commit fce6d6b3dcc879720bc45143426b86232106818a Author: Wang, Eikan Date: Wed Nov 16 23:58:11 2022 +0000 Redefine the simdlen semantic: (#88482) This PR targets automatically enabling vectorization optimization for TorchInductor. It refined the semantics of `config.cpp.simdlen`. Originally, `None` meant disabling vectorization, while a specific value meant the number of elements to be vectorized at a time. But that depends on the data type: for a 256-bit SVE/SIMD ISA on ARM and X86, the `simdlen` should be 16 for Float but 32 for BFloat16. Hence, this PR defines `simdlen` as the bit width. The detailed semantics are as follows.
- **_simdlen = None_**: Automatically determine the SIMD bit width. Detect HW information and pick the proper vectorization ISA. Specifically for X86, the priority of AVX512 is higher than AVX2.
- **_simdlen <= 1_**: Explicitly disable SIMD.
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. It equals the disabled semantic if the bit width does not match the ISA width.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88482 Approved by: https://github.com/jgong5, https://github.com/jansel commit c3acb9c8859fb5cfa1959ee49849f07942c40ccc Author: AllenTiTaiWang Date: Wed Nov 16 19:50:02 2022 +0000 [ONNX] Add Internal Utils: onnx_proto_utils.py for onnx/onnx-script/onnx_proto (#88376) Added `onnx_proto_utils.py` for onnx/onnx-script related process.
The idea is like jit_utils.py, and to simplify what we have in `torch/onnx/utils.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88376 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit f3af5ba48effeb7785df2049348d83467c5fb986 Author: Charlie Yan Date: Tue Nov 15 23:33:05 2022 +0000 [WIP] Composable API: `replicate` and `DistributedState` (#87649) This PR adds the first version of the `replicate()` composable API. For this prototype version, I try to reuse as much code from existing `DistributedDataParallel` as possible, and iterate on it in later changes. The basic idea of this prototype is: - create a `ReplicateState` object. It internally uses a `ParameterList` module to hold all parameters of modules marked by `replicate()` API. - create an internal `_ddp` object, which reuses existing `DistributedDataParallel` implementation, and wraps the `ParameterList` object - install pre-forward and after-forward hooks on the root module, which calls methods of `_ddp` to run initialization and forward Pull Request resolved: https://github.com/pytorch/pytorch/pull/87649 Approved by: https://github.com/zhaojuanmao commit f73d9a79fe8d52be27c3c28cd93ce690bdc4f9b7 Author: Riley Dulin Date: Thu Nov 17 02:43:33 2022 +0000 [torch][fx] Fix PassManager to not use a class variable mutable list (#89108) Summary: I found a confusing bug in the PassManager that only happens when you instantiate one multiple times: it will use old passes and constraints! This occurs because the class-level declarations initialize it to an empty list, but the problem is that class initializers only run once, and are creating class variables. This means the same empty list was being reused every time, except after the first time it isn't empty. The empty list has to be created in `__init__` newly each time or else it'll be shared. Note that this is the same type of bug as using an empty list as a default parameter, where it'll reuse the same list pointer and not make it empty each time. The better way to do this is with either: * An immutable default parameter like an empty tuple, that you create a new list from: `self.passes = list(passes)` * Use None and then create the empty list inside `__init__` I chose the latter as it's less likely to cause a behavior change due to the changed default. Note that for immutable values like `False` and `1` this doesn't apply as you can't mutate that value for everyone. Test Plan: Added a test to ensure that the pass state is not saved. Without my change, this test would fail as it would run all of the `2 * x` passes first, then all of the `3 * x` passes. 
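For illustration, a minimal sketch of the class-level mutable list pitfall described above (hypothetical `BuggyPassManager`/`FixedPassManager` classes, not the actual torch.fx implementation):
```python
class BuggyPassManager:
    passes = []  # class attribute: one shared list for every instance

    def add_pass(self, p):
        self.passes.append(p)  # mutates the shared class-level list


class FixedPassManager:
    def __init__(self, passes=None):
        # fresh list per instance; avoids the shared-mutable-default trap
        self.passes = list(passes) if passes is not None else []


a, b = BuggyPassManager(), BuggyPassManager()
a.add_pass("2x")
print(b.passes)  # ['2x'] -- state leaked into a brand-new instance

c, d = FixedPassManager(), FixedPassManager()
c.passes.append("3x")
print(d.passes)  # [] -- each instance owns its own list
```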
Differential Revision: D41327056 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89108 Approved by: https://github.com/angelayi commit ac0a6f381de06b58aa583daf7771c410c69709fd Author: Wanchao Liang Date: Wed Nov 16 22:28:36 2022 +0000 [dtensor] disable op db tests for now (#89162) context: https://github.com/pytorch/pytorch/issues/89160 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89162 Approved by: https://github.com/fduwjj commit 2f59c69ac7cf027e14012dfeba6b65506787682d Merge: 473970e8b4 f93ba52d25 Author: mingfeima Date: Thu Nov 17 10:16:48 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" cc jgong5 @XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned] commit f93ba52d252cc158cf98e40fc5dc20a114903821 Merge: d6f3ee98ff f5e2cb5249 Author: mingfeima Date: Thu Nov 17 10:16:48 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" cc jgong5 @XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned] commit 30d9fb9157b59db27cd2c0c6e6b0b6221efda571 Author: Animesh Jain Date: Thu Nov 17 02:03:45 2022 +0000 [dynamo][reland] API Support for nn.Module (#89113) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/89113 Approved by: https://github.com/ezyang commit f5e2cb52496ab51edaa25ac35908b6832e23dadb Author: William Wen Date: Thu Nov 17 02:02:26 2022 +0000 Add comprehensive minifier tests (#88022) Adds tests for https://github.com/pytorch/torchdynamo/issues/1241. To run: `pytest test/dynamo/test_minifier.py`. Actually runs minifier launcher script and repro scripts, rather than just checking for existence of the minifier launcher script. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88022 Approved by: https://github.com/mlazos, https://github.com/anijain2305 commit 473970e8b46b164bf684561c9ad41549b55c53d8 Merge: 07cea67d12 d6f3ee98ff Author: mingfeima Date: Thu Nov 17 10:01:49 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit d6f3ee98ffbf9a9338ba80d8668177bf248a8f7f Author: mingfeima Date: Thu Nov 17 10:01:49 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 088f2fa567fcf74aa746886e3e90fd3e6c58fa61 Author: Kazuaki Ishizaki Date: Thu Nov 17 01:55:03 2022 +0000 Fix typos in messages under test (#89121) This PR fixes typos of messages in `.cpp` and `.py` files under test directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89121 Approved by: https://github.com/mruberry, https://github.com/kit1980 commit 716f70f19a4b63268da2a753afdbe9b385a831ab Author: Horace He Date: Wed Nov 16 19:58:30 2022 +0000 Added conv constraint that infers layouts (#89031) The core problem that we often have with contiguous/channels-last layouts and convolutions is that Inductor often doesn't do a great job of "preserving" the eager-mode layouts. So, for example, we'll often have something like ``` a: channels-last b = foo(a) c = convolution(a) ``` In eager-mode, `a` would stay channels-last, and we would avoid two transpose copies (one into NHWC and one back into NCHW) within the convolution kernel. However, Inductor currently sometimes loses the "correct" layout of `b` (not in this simple example, but others). Then, not only will we do a transpose within `foo`, but we'll then immediately transpose it back to do the convolution (and then again once the convolution is done). 
This is particularly egregious in `convnext_base`, where there's a lot of mixing of non-channels-last tensors and channels-last tensors. The solution in this PR is to constrain the inputs to `aten.convolution`/`aten.convolution_backward` to match the layouts from eager-mode. This ensures that we'll never do extra transposes *within* `aten.convolution`, which are particularly bad (since Inductor can't fuse them). Pull Request resolved: https://github.com/pytorch/pytorch/pull/89031 Approved by: https://github.com/ngimel, https://github.com/jansel commit 251fdda77b8f60667e016c89f65f798ea5f3eaea Author: Huy Do Date: Thu Nov 17 01:45:48 2022 +0000 Add pytest-flakefinder as a test dependency (#89103) This is used to re-run tests multiple times to determine their flakiness status. The way re-run is handled in https://github.com/pytorch/pytorch/pull/88646 only applies to unittest. Per their documentation, `pytest-repeat` doesn't seem to work with `unittest.TestCase`, so trying https://github.com/dropbox/pytest-flakefinder instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89103 Approved by: https://github.com/clee2000 commit 0d87a4fec89fc78e568224935897ec585a6368a6 Author: keineahnung2345 Date: Thu Nov 17 01:09:55 2022 +0000 Fix typo in Dispatcher.h (#89045) Fix typo in Dispatcher.h: hamespace -> namespace Pull Request resolved: https://github.com/pytorch/pytorch/pull/89045 Approved by: https://github.com/bdhirsh, https://github.com/kit1980 commit 80b6761863407a8cf1ca780fcf97d135743f7812 Author: John Detloff Date: Thu Nov 17 01:06:12 2022 +0000 Update README.md (#85534) Our Jenkins builds are gone, so this badge is broken and should be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85534 Approved by: https://github.com/ngimel, https://github.com/kit1980 commit 3af5cf4de16e4e9256be6439a3539e3e52e3a879 Author: R Max Espinoza Date: Thu Nov 17 01:03:31 2022 +0000 doc(typo): memroy -> memory (#89126) Minor typo in comments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89126 Approved by: https://github.com/kit1980 commit cfd552547f106f4a7841976dad8b795b82d161c8 Author: Charlie West-Taylor Date: Thu Nov 17 00:59:12 2022 +0000 Use the Python frame safely in _pythonCallstack (#88993) Currently, the result of `PyEval_GetFrame()` is piped straight to `Py_INCREF`. However, `PyEval_GetFrame` [may return null](https://docs.python.org/3/c-api/reflection.html#c.PyEval_GetFrame), which seems to be the case sometimes when calling `_pythonCallstack` from another thread. This is handled in the subsequent `while (nullptr != frame)` block, but `Py_INCREF`, called before it, [doesn't handle this case](https://docs.python.org/3/c-api/refcounting.html#c.Py_INCREF), so the program segfaults. The safe form of `Py_INCREF` is `Py_XINCREF`, so use that instead ([docs](https://docs.python.org/3/c-api/refcounting.html#c.Py_XINCREF)).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88993 Approved by: https://github.com/albanD commit 8506b305df531f7567a430854cbe7fcfa539416a Author: Nikolay Korovaiko Date: Thu Nov 17 00:38:44 2022 +0000 handle scatter(Scalar) overload in inductor (#88894) Relanding https://github.com/pytorch/pytorch/pull/88210 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88894 Approved by: https://github.com/desertfire commit 0c835e25bbde7869101023ebfaab9b7ec01ece25 Author: atalman Date: Thu Nov 17 00:30:12 2022 +0000 Fix nightly build binary errors (#89153) This issue is pretty much self-explanatory: two typos in the binary workflow generation script caused workflows to be generated with invalid parameters: 1. .generated-linux-binary-libtorch-pre-cxx11-master.yml 2. .generated-macos-arm64-binary-wheel-nightly.yml Pull Request resolved: https://github.com/pytorch/pytorch/pull/89153 Approved by: https://github.com/malfet commit 98379a3949ed4b4f4a76bd9fed2806f82b6c0aa0 Author: AllenTiTaiWang Date: Wed Nov 16 19:50:02 2022 +0000 [ONNX] Add onnx-script test cases (#86907) The test cases for #86906 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86907 Approved by: https://github.com/BowenBao commit f920bfaf2a6bfb4bc7966f8417309d94164ff86f Author: Will Constable Date: Wed Nov 16 18:40:41 2022 +0000 Use torchrun for dynamo/distributed.py (#89149) Mainly wanted to confirm torchrun works fine with dynamo/ddp, but it is also a better system than manually launching processes. Partially addresses issue #1779. New run commands:
- single process: `python benchmarks/dynamo/distributed.py [args]`
- multi-GPU (e.g. 2 GPUs on one host): `torchrun --nproc_per_node 2 benchmarks/dynamo/distributed.py [args]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89149 Approved by: https://github.com/aazzolini commit 8ba62bdff5441b65938ad27e944aa91e4f7eb61a Author: Fuzzkatt Date: Wed Nov 16 22:50:11 2022 +0000 add test_c10d_spawn_ucc.py (#86508) Initial PR to create the UCC equivalent of https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_spawn_gloo.py and https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_spawn_nccl.py. Currently only added common ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86508 Approved by: https://github.com/kwen2501 commit ec61951f0771e70de12e6e46bd131ace98486238 Author: Mikayla Gawarecki Date: Wed Nov 16 19:17:08 2022 +0000 Fix inaccuracy in nt constructor documentation + broken rendering (#89152) Rendering was broken and the docstring seemed to be inaccurate. ![Screen Shot 2022-11-16 at 2 16 28 PM](https://user-images.githubusercontent.com/35276741/202273588-a2da5b7b-1a6d-46bb-a74e-c0de9a0fd064.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89152 Approved by: https://github.com/cpuhrsch commit 5848704ef8feba9fff3ec4f8ce7d1d3189ec5af8 Author: Mikayla Gawarecki Date: Wed Nov 16 19:00:49 2022 +0000 Removed unecessary check in `select_nested` (#89150) Implementation in #88585 should work for all dimensions.
Removed unnecessary check that constrained select to dims 0 and 1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89150 Approved by: https://github.com/cpuhrsch commit ee1d375bf98f6e4c69b2d6f3aa1c702cb652d2f2 Author: Andrew Gu Date: Wed Nov 16 18:36:24 2022 +0000 [FSDP] Add fast path for `NO_SHARD` `clip_grad_norm_()` (#89137) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89137 Approved by: https://github.com/rohan-varma commit e70f446a16f25b7f344d256c8fa0b78769920d00 Author: Yanbo Liang Date: Wed Nov 16 21:59:31 2022 +0000 [Dynamo] Fix bug in NamedTupleVariable (#89110) Fixes https://github.com/pytorch/torchdynamo/issues/1866 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89110 Approved by: https://github.com/jansel commit 640af8d70a3adc7727661c15260d42fe931e9de4 Author: William Wen Date: Wed Nov 16 21:54:24 2022 +0000 More dynamo dashboard improvements (#89155) A number of dashboard improvements: - Add accuracy failures to warnings section - Add regression detection to all metrics (speedup, compile time, peak memory), not just accuracy - Add testing flag to update-dashboard to prevent image/comment uploads - Add section for comparing summary statistics (passrate, speedup) between 2 most recent reports - Show names of reports for summary stats diff and regression detection sections - Remove metric graphs from the comment (they can still be found in the generated text file) Sample comment: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1317565972 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89155 Approved by: https://github.com/anijain2305 commit 305b9b1f0e5802437a7ed8169e0ff3fb5c06d4ec Author: Nikolay Korovaiko Date: Wed Nov 16 21:54:20 2022 +0000 Fix XLASymNode.str() no str() attribute error (#89093) This fixes https://github.com/pytorch/xla/issues/4199 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89093 Approved by: https://github.com/ezyang commit 4908a12542798a3e8641faae6b74f068fdfc6778 Author: Edward Z. Yang Date: Wed Nov 16 11:59:40 2022 -0500 Reland "SymIntify convolution backend calculation (#89069)"" (#89142) This reverts commit 90db86be108184a6c86c73e1b01012352c72e66b. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89142 Approved by: https://github.com/albanD, https://github.com/malfet commit 45c62a337756ff9db97cd64d2d42d9e65dda0a85 Author: HDCharles Date: Wed Nov 16 10:07:14 2022 -0800 [ao] making _is_activation_post_process private (#87520) Summary: same function in observer and quantize, consolidated to a single function. Note the definitions were slightly different, I've changed the definition to be maximally inclusive so that the name of the function is more accurate Test Plan: python test/test_public_bindings.py python test/test_quantization.py Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D40709276](https://our.internmc.facebook.com/intern/diff/D40709276) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87520 Approved by: https://github.com/jcaip commit aee96bbf5a34b7d9b12b8d03fa1904e595c6a329 Author: Iris Date: Wed Nov 16 21:06:35 2022 +0000 [PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (#88698) Context in RFC: https://github.com/pytorch/pytorch/issues/86620 .rst file will be finalized in subsequent PRs. 
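A hedged sketch of what the namespace move looks like from user code, assuming the existing public helpers (e.g. `save_state_dict`, `FileSystemWriter`) keep their names after the move:
```python
# Old, private location (pre-move):
# from torch.distributed._shard.checkpoint import save_state_dict, FileSystemWriter

# New location after this PR (helper names assumed unchanged by the move):
from torch.distributed.checkpoint import save_state_dict, FileSystemWriter

# Typical call shape, commented out because it needs an initialized process group:
# save_state_dict(state_dict=model.state_dict(),
#                 storage_writer=FileSystemWriter("/tmp/ckpt"))
```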
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88698 Approved by: https://github.com/wanchaol commit 6b521bbf3589d763f9ad348ee24e54be12c44356 Author: soulitzer Date: Wed Nov 16 11:22:58 2022 -0500 Prevent module full_backward_hook from erroring in double backward (#88357) Also clarifies documentation to say "execute if and only if gradients wrt outputs are computed" (previously, "execute every time gradients wrt inputs are computed") See https://docs.google.com/document/d/1tFZKYdsSzRBJ7Di7SWt8X8fSg-E3eiUPwomMF10UyhM/edit for more details regarding the question: 'should module full_backward_hooks be called every time the gradients wrt module inputs are called, or should module full_backward_hooks only be called when the "backward for the module" have been computed?' Fixes https://github.com/pytorch/pytorch/issues/88312 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88357 Approved by: https://github.com/albanD commit 0581331963cb3dc18fa59a800661c800ebff92c2 Author: BowenBao Date: Mon Nov 14 13:31:23 2022 -0800 [ONNX] Document ONNX diagnostics (#88371) Reference pages: - Landing page: https://docs-preview.pytorch.org/88371/onnx_diagnostics.html - Individual rule: https://docs-preview.pytorch.org/88371/generated/onnx_diagnostics_rules/POE0004%3Aoperator-supported-in-newer-opset-version.html An initial PR to setup the document generation for ONNX diagnostics. * Add document page for ONNX diagnostics. * Add document generation for diagnostics rules from `rules.yaml`. * Add dependency on `myst-parser` for markdown to rst parsing. More content to be added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88371 Approved by: https://github.com/abock, https://github.com/justinchuby, https://github.com/malfet, https://github.com/kit1980 commit 848e7240a11c9fd82298bc5b5ae14534e1307627 Author: Yanbo Liang Date: Wed Nov 16 19:08:49 2022 +0000 [Dynamo] Add a dummy profiler to avoid activating real profiler (#88930) See context at https://github.com/pytorch/torchdynamo/issues/1721#issuecomment-1312396059 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88930 Approved by: https://github.com/jansel commit 61801799a0a6a2fe0b577450c1fdd55af6063664 Author: andrewor14 Date: Tue Nov 15 13:27:57 2022 -0800 [Quant][bc-breaking] Remove overwrite_output_observer (#88620) Summary: When the BackendConfig was first introduced, `overwrite_output_observer` and `overwrite_output_fake_quantize` were added to ensure fixed qparams ops like `torch.nn.Sigmoid` and `torch.nn.Tanh` used the correct observers and fake quantizes. However, this is hacky because the BackendConfig should not set the observer constructors themselves, but should instead specify only requirements on the observers. Later, https://github.com/pytorch/pytorch/pull/80184 added the correct observers to `get_default_qconfig_mapping` along with validation logic that throws an error if incorrect observers were specified. With this change, we no longer need to overwrite the observers from the BackendConfig, since we expect the user to pass in the correct observers for these ops. This commit removes these overwrite observer settings in the BackendConfig. Instead, we represent the observer constraints for fixed qparams ops through the existing DTypeWithConstraints mechanism. Note that, however, to be consistent with other DTypeWithConstraints checks, we no longer throw an error if an incorrect observer is specified, but simply ignore the offending QConfig and log a warning instead. 
This is the BC-breaking part of the change. BC-breaking notes:
```
from torch.ao.quantization.qconfig import default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx

model = ModelWithFixedQParamsOps()
qconfig_mapping = QConfigMapping().set_global(default_qconfig)
example_inputs = ...
prepare_fx(model, qconfig_mapping, example_inputs)
```
Before this commit, running the above leads to an exception because the wrong observers are used for fixed qparams ops. After this commit, the above will only encounter a warning, and the fixed qparams ops will not be quantized. In both cases, switching to `get_default_qconfig_mapping` will cause the fixed qparams ops to be quantized. Test Plan: python test/test_quantization.py TestQuantizeFx python test/test_quantization.py TestQuantizeFxOps Reviewers: jerryzh168, vkuzo Subscribers: jerryzh168, vkuzo Pull Request resolved: https://github.com/pytorch/pytorch/pull/88620 Approved by: https://github.com/jerryzh168 commit a6ef2c7634e2a77fe698d5335d29e10ca24cdf2b Author: Huy Do Date: Wed Nov 16 18:25:38 2022 +0000 Support test-config filter logic for rocm (#89046) The logic used by `mem_leak_check` https://github.com/pytorch/pytorch/pull/88373 is currently not applied to rocm, i.e. https://hud.pytorch.org/pytorch/pytorch/commit/06486cd0087200e08ebb8a9518e064251c7c5309, because its workflows don't have the test-config filtering logic yet (linux, mac, and windows all have it already). In other words, rocm tests always run with mem leak check disabled at the moment. We want that, but we also want to run the tests with mem leak check enabled periodically, once per day. This PR closes that gap. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89046 Approved by: https://github.com/clee2000 commit 7b0adc290a744de42e875822a1be4fa2b8d96147 Author: Peter Bell Date: Wed Nov 16 12:40:27 2022 +0000 Run tests from test/inductor in inductor CI job (#88957) CUDA inductor tests are currently not run in CI because the only jobs that have triton installed don't actually run these tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88957 Approved by: https://github.com/ngimel, https://github.com/seemethere commit 58ebf92cf06bd68ca7aba0e29526e9004d53f08d Author: lezcano Date: Wed Nov 16 14:09:59 2022 +0000 Add bfloat16 support to torch.prod to align with torch.cumprod (#87205) As per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/87205 Approved by: https://github.com/mruberry commit 33209153035ef60f84014983186f9eefde7dab72 Author: lezcano Date: Wed Nov 16 14:09:59 2022 +0000 Fix decomp for embedding_backward and simplify the decomposition of embedding_dense and embedding_dense_backward (#87204) See the title Pull Request resolved: https://github.com/pytorch/pytorch/pull/87204 Approved by: https://github.com/Chillee commit e1ecf53d8480899b5b41c295e52eafb7347f0141 Author: lezcano Date: Wed Nov 16 14:09:58 2022 +0000 Simplify linspace decomp and increase its tolerance (#87203) This is an interesting one. Since this is an operation that's intrinsically defined on the reals, we should always perform the ops in that dtype and just cast to the desired dtype at the end. This simplifies the decomposition. Now, I started looking at this one when I started seeing failures on a test that's added in a later PR. What's going on here is that, by doing an upcast to a higher dtype and then casting down to integers, sometimes there's an off-by-one error.
I think this is fine, as the decomposition is more accurate than the original function, which goes in line with the whole PrimTorch effort. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87203 Approved by: https://github.com/mruberry commit d2d22d89d92bf7d6bb02417dab04027d7fcc80d3 Author: bmedishe Date: Wed Nov 16 17:42:26 2022 +0000 test_unary_ufuncs few tests enabled on rocm which are passing (#89007) This PR enables tests which are currently skipped on rocm, but passing, from the test package test_unary_ufuncs.py::TestUnaryUfuncsCUDA:
test_file | test_name | test_class
-- | -- | --
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_2_cuda_float16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_2_cuda_float32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_2_cuda_float64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_2_cuda_int16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_2_cuda_int32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_2_cuda_int64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_4_cuda_float16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_4_cuda_float32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_4_cuda_float64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_4_cuda_int16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_4_cuda_int32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_4_cuda_int64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_tan_cuda_float64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_bfloat16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_float16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_float32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_float64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_int16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_int32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_int64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_int8 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_uint8 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_float16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_float32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_float64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_int16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_int32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_int64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_int8 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_uint8 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_float16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_float32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_float64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_int16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_int32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_int64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_int8 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_uint8 | (__main__.TestUnaryUfuncsCUDA)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89007 Approved by: https://github.com/mruberry commit 7f55db4fb0fb12ed593c7f23de01bfb9330b7dd5 Author: Jacob Szwejbka Date: Wed Nov 16 16:59:36 2022 +0000 add quantize_decomposed_dynamic to op lib (#88855) Summary: Needed for dynamic quant reference pattern graphs. Test Plan: added unittest Differential Revision: D41205030 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88855 Approved by: https://github.com/jerryzh168 commit cf6003f0469ae1440d4a8585860c2c5f4c738707 Author: PyTorch MergeBot Date: Wed Nov 16 16:52:47 2022 +0000 Revert "Towards unifying symbolic and non symbolic fake tensor (#89038)" This reverts commit 37d54239c7ea88fd9c98dcac3fcc9b98a6f9e9d1. Reverted https://github.com/pytorch/pytorch/pull/89038 on behalf of https://github.com/ezyang due to executorch segfaults commit fe276ea0f9b4cce9c7d32157f831897fbbd1c85a Author: Kirtesh Patil Date: Wed Nov 16 16:40:24 2022 +0000 [UCC] Add pre & post processing for CPU collectives (#89030) Summary: The CPU block in `collective_post` was missing pre & post processing. The reduce-scatter implementation expects use of the pre-processing callback to flatten the input tensors; the missing invocation meant garbage values were being passed. Test Plan: Tested the reduce-scatter collective using PARAM Reviewed By: eastzone Differential Revision: D41291592 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89030 Approved by: https://github.com/kingchc, https://github.com/kwen2501 commit 90db86be108184a6c86c73e1b01012352c72e66b Author: PyTorch MergeBot Date: Wed Nov 16 16:36:27 2022 +0000 Revert "SymIntify convolution backend calculation (#89069)" This reverts commit 09ed8b67e24cfe29f3fa7b5dd28eaa7749229f12. Reverted https://github.com/pytorch/pytorch/pull/89069 on behalf of https://github.com/DanilBaibak due to breaking internal builds commit cf4b4b1b060fd48d4103acb4d0422e88c7e3b69e Author: Angel Avila Date: Wed Nov 16 16:30:56 2022 +0000 Fix python types in pybind function signatures (#89115) Fixes #88958 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89115 Approved by: https://github.com/ezyang commit abe41aee776e7ab39c34f28a88f03a03dc6f1479 Author: AllenTiTaiWang Date: Wed Nov 16 06:30:03 2022 +0000 [ONNX] Support custom Op with onnx-script local function (#86906) Extend `register_custom_op` to support onnx-script local functions. The FunctionProto from onnx-script is represented as a custom op and inserted into the ModelProto for op execution. NOTE: I did experiments on the >2GB case with a simple model with large initializers:
```python
import torch

class Net(torch.nn.Module):
    def __init__(self, B, C):
        super().__init__()
        self.layer_norm = torch.nn.LayerNorm((B, C), eps=1e-3)

    def forward(self, x):
        return self.layer_norm(x)

N, B, C = 3, 25000, 25000
model = Net(B, C)
x = torch.randn(N, B, C)
torch.onnx.export(model, x, "large_model.onnx", opset_version=12)
```
It turns out we won't get model_bytes > 2GB after the `_export_onnx` pybind C++ function, as we split initializers into external files in that function and serialize before returning the model bytes; protobuf is not allowed to be larger than 2GB under any circumstances. The test cases can be found in the next PR #86907.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86906 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit 9fe36a02146c57ed8165bb8914708437043899ab Author: mindest Date: Wed Nov 16 15:08:41 2022 +0000 [ONNX] Extra support for bernoulli export (#88655) * add opset 15 support for `bernoulli`. * add extra export options for different `bernoulli` cases: `x.bernoulli(p)` where `p` is a tensor or float. Fixes #88299 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88655 Approved by: https://github.com/BowenBao commit 37d54239c7ea88fd9c98dcac3fcc9b98a6f9e9d1 Author: Edward Z. Yang Date: Wed Nov 16 05:58:02 2022 -0800 Towards unifying symbolic and non symbolic fake tensor (#89038) Fake tensor behaves pretty differently depending on if you have symbolic shapes or not. This leads to bugs; for example, we weren't getting correct convolution_backward strides because we bypassed the correct stride logic in fake tensor on symbolic shapes. This PR attempts to unify the two codepaths. I don't manage to unify everything, but I get most of it. The algorithm is delicate and I'm still hosing down test failures. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89038 Approved by: https://github.com/anjali411 commit 09ed8b67e24cfe29f3fa7b5dd28eaa7749229f12 Author: Edward Z. Yang Date: Tue Nov 15 10:10:28 2022 -0800 SymIntify convolution backend calculation (#89069) We will need this to implement a convolution meta function that is SymInt aware. I use templates so that regular convolution code is not affected by the change. No tests for symbolic ints directly; that will come in a subsequent PR which also needs to refactor fake tensors. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89069 Approved by: https://github.com/SherlockNoMad commit 5e0c01330c76c003e55aec29bfb3e83926ee933a Author: Edward Z. Yang Date: Tue Nov 15 10:10:27 2022 -0800 SymIntArrayRef type caster (#89074) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89074 Approved by: https://github.com/SherlockNoMad commit 57af0c82454c199ab7a734c3d12df93c93f50812 Author: Nikita Karetnikov Date: Wed Nov 16 11:25:35 2022 +0100 Bug fix: make sure `copy_impl` doesn't read out of bounds (#88544) Fixes #88543. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88544 Approved by: https://github.com/lezcano commit dc40d3f93f849e467b2b56595a01f28e84ac7fa2 Author: anjali411 Date: Tue Nov 15 19:24:31 2022 +0000 Add meta impl for grid_sampler_2d_backward (#88745) TODO: add an OpInfo Pull Request resolved: https://github.com/pytorch/pytorch/pull/88745 Approved by: https://github.com/ezyang commit 52701227737489392e59fe57ded40226bf0811f6 Author: Jiawen Liu Date: Wed Nov 16 10:37:26 2022 +0000 [Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#89118) Summary: Build fx-based linear/matmul/bmm + permute/transpose vertical fusion in Inductor For an internal Ads model: **1.15x -> 1.36x speedup** Test Plan: CI Reviewed By: bertmaher, jansel, jianyuh Differential Revision: D41071665 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89118 Approved by: https://github.com/jianyuh commit 9d28775c1d28ab7c1dd93479a58bdafb9b626341 Author: PyTorch MergeBot Date: Wed Nov 16 09:45:49 2022 +0000 Revert "Rewrite assert statement with torch._assert under config (#88246)" This reverts commit 62ba15e10e875ce088dff26e872605ee70c8c04a. 
Reverted https://github.com/pytorch/pytorch/pull/88246 on behalf of https://github.com/DanilBaibak due to breaking internal builds commit 9d2f5a278414aeaa6f3277c5b15aee4938601fa6 Author: Animesh Jain Date: Wed Nov 16 08:51:30 2022 +0000 [dynamo] Support if cond on NNModuleVariable (#89095) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89095 Approved by: https://github.com/yanboliang, https://github.com/mlazos commit f20b3f2e5734b23a9e0a898196ddf77aa90323b8 Author: Wanchao Liang Date: Tue Nov 15 22:51:33 2022 +0000 [dtensor] PART 8: move tensor parallel api and tests to core distributed (#88180) This PR moves tensor/parallel folder and tests to torch.distributed. part of https://github.com/pytorch/pytorch/issues/88838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88180 Approved by: https://github.com/aazzolini commit 0230e52b541358cec075b9b9f3e6286d3964848f Author: Wanchao Liang Date: Tue Nov 15 22:51:33 2022 +0000 [dtensor] PART 7: move remaining DTensor tests to core distributed (#88179) This PR moves remaining tests, i.e. tensor_ops, op db tests to core distributed part of https://github.com/pytorch/pytorch/issues/88838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88179 Approved by: https://github.com/aazzolini commit 550a019fb85647f0bc7fe8ee231dc158b4f30d7c Author: Wanchao Liang Date: Tue Nov 15 22:51:32 2022 +0000 [dtensor] PART 6: move DTensor op tests to core distributed (#88551) This PR moves DTensor op tests to core distributed, including prop_rule, pointwise op, matrix op tests, etc. part of https://github.com/pytorch/pytorch/issues/88838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88551 Approved by: https://github.com/aazzolini commit 527c5bdb4574f12f5071b0466ce981ce1c129d75 Author: Wanchao Liang Date: Tue Nov 15 22:51:31 2022 +0000 [dtensor] PART 5: move DTensor basic tests to core distributed (#88178) This PR moves DTensor basic tests to torch.distributed, including dtensor, device_mesh tests part of https://github.com/pytorch/pytorch/issues/88838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88178 Approved by: https://github.com/fduwjj commit 1b88476320a99680a6e01f8f4afed5c5196cf39d Author: Wanchao Liang Date: Tue Nov 15 08:04:38 2022 +0000 [dtensor] PART 4: move remaining DTensor ops to core distributed (#88550) This PR moves the view related DTensor ops to core distributed, tests will be add in follow up PRs part of https://github.com/pytorch/pytorch/issues/88838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88550 Approved by: https://github.com/fduwjj commit 2dcf0978a249ae136c39e396200e5ed51407471d Author: Wanchao Liang Date: Tue Nov 15 08:04:38 2022 +0000 [dtensor] PART 3: move most DTensor ops to core distributed (#88177) This PR moves most DTensor ops to torch.distributed._tensor. We will add all tests in the following PRs. part of https://github.com/pytorch/pytorch/issues/88838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88177 Approved by: https://github.com/fduwjj commit 4b945967de2ae9a3c6df579a1541b822de46110c Author: Wanchao Liang Date: Tue Nov 15 08:04:38 2022 +0000 [dtensor] PART 2: move DTensor abstraction and APIs to core distributed (#88176) This PR moves the core DTensor abstraction and high level APIs to torch.distributed._tensor folder, which includes the following: 1. DTensor class 2. high level APIs (distribute_tensor/module) 3. dispatching logic 4. 
redistribute logic. Part of https://github.com/pytorch/pytorch/issues/88838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88176 Approved by: https://github.com/fduwjj commit 370fc5cb421f54fc9513237390e09cca0e06e01b Author: Wanchao Liang Date: Tue Nov 15 08:04:37 2022 +0000 [dtensor] PART 1: move DeviceMesh and placement to core distributed (#88549) This PR creates the `torch.distributed._tensor` package and moves DeviceMesh and PlacementTypes to it. Part of https://github.com/pytorch/pytorch/issues/88838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88549 Approved by: https://github.com/fduwjj commit 59ba15f37407294eed3ecdb9986b02c5c2d52a70 Author: Huy Do Date: Wed Nov 16 07:44:41 2022 +0000 Upload CSV test reports from inductor (#89112) Inductor test report artifacts are now on HUD, but their files are in CSV format instead of the default XML files from pytest or unittest that we expect, so this PR uploads both suffixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89112 Approved by: https://github.com/desertfire commit 7e66d1d6cdb4e8d854a8da160daeb910783f069d Author: Jiawen Liu Date: Wed Nov 16 06:27:13 2022 +0000 [Inductor] Support Shape Padding for aten.mm in Inductor (#89086) Summary: Support shape padding for aten.mm in Inductor (originally from [#88709](https://github.com/pytorch/pytorch/pull/88709)) Differential Revision: D41315078 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89086 Approved by: https://github.com/jianyuh commit e2f0648750f2d0d0ac648728ce4c514db178cfa1 Author: Kenichi Maehashi Date: Wed Nov 16 05:07:51 2022 +0000 Add an option to include actual license terms to the output (#85624) When building products using PyTorch, it is often required to display license terms for all dependencies. The feature itself has been implemented in #81500, but it seems there are no options to enable it. This PR implements the option. cc/ @mattip @rgommers Pull Request resolved: https://github.com/pytorch/pytorch/pull/85624 Approved by: https://github.com/rgommers, https://github.com/seemethere commit 8ebbd5a89a66bf84d7358f4d353ec2708d6c5429 Author: Johannes Pitz Date: Wed Nov 16 04:38:30 2022 +0000 Easier to understand event_dim computation (#81396) Fixes #81254 Only easier to understand, not a real fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81396 Approved by: https://github.com/fritzo, https://github.com/kit1980 commit ce2f8700bafcf44850402a39188ec121ba8b5486 Author: Sherlock Huang Date: Tue Nov 15 21:02:44 2022 +0000 Symintify numel(), infer_size, prims.elementwise_meta (#88956) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88956 Approved by: https://github.com/ezyang commit b291c1213ae18e89a5c616913f14b4bb8eda12a8 Author: Driss Guessous Date: Wed Nov 16 03:07:54 2022 +0000 Create native function for determining which implementation of SDP to call (#89029) Creates a callable native function that can determine which implementation of scaled dot product attention will get called. This allows us to re-order the runtime dispatch of SDP to enable autograd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89029 Approved by: https://github.com/cpuhrsch commit 397f10067200d9b77acb92952b4ea3741738c28b Author: Andrew Gu Date: Tue Nov 15 19:19:47 2022 +0000 [FSDP] Test `named_parameters()` in forward (`use_orig_params=True`) (#89066) This adds a unit test following the FSDP change in https://github.com/pytorch/pytorch/pull/88781.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89066 Approved by: https://github.com/fegin commit 46ba0150cbfb8d86c378f0f3ce2d816e530a933b Author: Huy Do Date: Wed Nov 16 02:39:22 2022 +0000 Increase slow grad check timeout (#89079) Now that periodic jobs are run under `mem_leak_check` mode with parallelization turned off, it's very easy for `linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck / test` to time out because one of the shards is very close to the 4h mark. * https://hud.pytorch.org/pytorch/pytorch/commit/2452e3f99a072760fc46d3f9025aaa37ca7ea2ab * https://hud.pytorch.org/pytorch/pytorch/commit/35e668b5ced25e735b6e523d557ed7fd60267914 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89079 Approved by: https://github.com/clee2000 commit 9f0b2c73f36b0f5276f84cdaaef4d54a60df61f5 Author: PyTorch MergeBot Date: Wed Nov 16 01:13:00 2022 +0000 Revert "[Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#88859)" This reverts commit d60abe4b9521e235c0e9beb00cda0d6c5673f4e0. Reverted https://github.com/pytorch/pytorch/pull/88859 on behalf of https://github.com/kit1980 due to Broke Mac OS testing, which were clearly shown in CI commit d96dd8ff09a9e35f8cce6745c3e015eb0082eb1b Author: Edward Z. Yang Date: Tue Nov 15 08:05:31 2022 -0800 Add int64_t, SymInt overloads for all binary operators in C++ (#89063) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89063 Approved by: https://github.com/SherlockNoMad commit 431642111f74a22ebb5edc98e32b1449b4b3e46b Author: Edward Z. Yang Date: Tue Nov 15 06:41:53 2022 -0800 Move ConvParams methods directly on struct (#89062) This reduces boilerplate. Also, I plan to add a template parameter to ConvParams; without moving the methods onto the struct, I would have to manually template every method. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89062 Approved by: https://github.com/SherlockNoMad commit 49f0be0762e8cac48ccf3b19d1c662be6b271581 Author: Edward Z. Yang Date: Tue Nov 15 06:32:36 2022 -0800 Hide ConvParams struct from ConvUtils.h (#89059) It isn't actually used outside of Convolution.cpp, so no reason to publish it. I intend to turn this into a template, so moving it with the method definitions is very convenient. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/89059 Approved by: https://github.com/SherlockNoMad commit 19cacecf34cf46f1c7ca3920979dcd6fd7709a61 Author: Salil Desai Date: Wed Nov 16 00:56:12 2022 +0000 Fix and Re-enable test_quantize_fx_lite_script_module.py (#88897) Summary: After D35984526 (https://github.com/pytorch/pytorch/commit/416899d1a9fcb9dbc8bb66ed796b86360f573903), `torch.ao.quantization.quantize_fx.prepare_fx` requires passing in `example_args`. This diff fixes the calls to `prepare_fx` in this test by adding in `example_args` as necessary.
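A minimal sketch of that calling convention (hypothetical `SmallConv` module; it mirrors the `prepare_fx(model, qconfig_mapping, example_inputs)` form shown in the BC-breaking notes above):
```python
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx

class SmallConv(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, kernel_size=3)

    def forward(self, x):
        return self.conv(x)

model = SmallConv().eval()
example_inputs = (torch.randn(1, 3, 8, 8),)
# prepare_fx now requires example inputs so FX tracing and shape propagation can run.
prepared = prepare_fx(model, get_default_qconfig_mapping(), example_inputs)
print(prepared)
```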
Test Plan: ``` buck test caffe2/test:fx_quantization_lite ``` ``` ✓ ListingSuccess: caffe2/test:fx_quantization_lite : 3 tests discovered (39.689) ✓ Pass: caffe2/test:fx_quantization_lite - test_conv2d (mobile.test_quantize_fx_lite_script_module.TestLiteFuseFx) (44.451) ✓ Pass: caffe2/test:fx_quantization_lite - test_embedding (mobile.test_quantize_fx_lite_script_module.TestLiteFuseFx) (45.462) ✓ Pass: caffe2/test:fx_quantization_lite - test_submodule (mobile.test_quantize_fx_lite_script_module.TestLiteFuseFx) (45.933) Summary Pass: 3 ListingSuccess: 1 Finished test run: https://www.internalfb.com/intern/testinfra/testrun/3096224827259146 ``` Differential Revision: D41227335 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88897 Approved by: https://github.com/dagitses commit 3bc327993f7182f3305b0aae854a26c83458c5a6 Author: Richard Zou Date: Tue Nov 15 08:12:03 2022 -0800 PyDispatcher integration with functorch (#88785) This PR teaches PyDispatcher and PyOperator about functorch transforms. It is important that PyDispatcher/PyOperator dispatch with functorch transforms, because this is our plan for higher-order operators (operators that accept functions as arguments). Examples of these include: - functorch transforms over the existing cond operator (control flow) - autograd.Function support for functorch (which I am working towards), - AOTDispatcher (should be a higher order operator) Concretely, the problem with teaching PyDispatcher/PyOperator about functorch is that the stack-based dispatching logic (DynamicLayerStack) is hidden inside the fallbacks for two dispatch keys (DynamicLayer{Front, Back}). PyDispatcher doesn't know about C++ boxed fallbacks, our plan on record for that is that we need to reimplement all of them in Python (but can call helper functions in C++ to make our lives easier). Instead of exposing all of what DynamicLayer{Front, Back} do to python, this PR takes the approach of re-implementing part of the stack-based dispatching in Python. The motivation is that this is more sane and follows what the "ideal" implementation of functorch would have been: - each transform should be a "mode" - there should be no TLS dispatch key set hackery. functorch needs to do this hackery today to re-use VariableType implementations. This PR: - exposes the DynamicLayerStack to Python - The DynamicLayerStack is a stack of Interpreters. These get exposed to Python as well. - Interpreters can run operations (Interpreter.process) or lower them to the next interpreter in the stack (Interpreter.lower) - To use a PyOperator with functorch transforms, a developer needs to register a rule for each transform (vmap, grad, jvp, ...). - The PyOperator API is NOT user-facing. Things like autograd.Function support for functorch will end up going through the autograd.Function API. Question for reviewers: - Does this design make sense? - I'm trying to split up the "functorch support for autograd.Function" work into logical pieces. Would it be better if I didn't? (the full thing is a bit long - 1000-2000 LOC). Test Plan: - new tests that construct PyOperator and compose them with functorch transforms Pull Request resolved: https://github.com/pytorch/pytorch/pull/88785 Approved by: https://github.com/samdow, https://github.com/soulitzer commit 2268a3215cdadbbbd561100a6368704ba9ef5f0d Author: Richard Zou Date: Mon Nov 14 11:00:15 2022 -0800 [functorch] add switch to enable autograd.Function (#88784) This is mostly a debug or "if you know what you're doing" switch for now. 
It is not public API. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/88784 Approved by: https://github.com/samdow, https://github.com/soulitzer commit 0ce22574b1aee4688e6ef56f66d6dfb31ae33b04 Author: PyTorch MergeBot Date: Wed Nov 16 00:45:41 2022 +0000 Revert "Enable correct supported activities for kineto on rocm (#88207)" This reverts commit 35093fc1ab9749e6b763acead007e56b54c6375b. Reverted https://github.com/pytorch/pytorch/pull/88207 on behalf of https://github.com/kit1980 due to Broke test_kineto on trunk / win-vs2019-cuda11.6-py3 / test (default, 4, 5, windows.8xlarge.nvidia.gpu) commit a13433940c4e8d7cc54d4fa5b3a9c0ff28fc0e8b Author: Shunting Zhang Date: Wed Nov 16 00:29:08 2022 +0000 allow loading model from a path in torchbench (#89028) Sometimes it's really convenient to run simple models through the torchbench.py script rather than those from pytorch/benchmark. This PR adds the ability to run any model from a specified path by overloading the --only argument. This PR is split out from #88904. Here is the usage: specify the path and class name of the model in a format like: --only=path:,class: Due to the fact that dynamo changes the current working directory, the path should be an absolute path. The class should have a method get_example_inputs to return the inputs for the model. An example looks like:
```
class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 10)

    def forward(self, x):
        return self.linear(x)

    def get_example_inputs(self):
        return (torch.randn(2, 10),)
```
Test command:
```
WARNING:common:torch.cuda.is_available() == False, using CPU
cpu eval LinearModel 0.824x p=0.00
```
Content of model_collection.py:
```
from torch import nn
import torch

class LinearModel(nn.Module):
    """
    AotAutogradStrategy.compile_fn ignores graphs with at most 1 call node.
    Make sure this model calls 2 linear layers to avoid being skipped.
    """
    def __init__(self, nlayer=2):
        super().__init__()
        layers = []
        for _ in range(nlayer):
            layers.append(nn.Linear(10, 10))
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

    def get_example_inputs(self):
        return (torch.randn(2, 10),)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89028 Approved by: https://github.com/jansel commit 60ffeb986648420810098cba6ac0ad1cee06bd95 Author: Michael Lazos Date: Wed Nov 16 00:08:34 2022 +0000 Don't iterate over graph when adding graph input (#89084) Helps with https://github.com/pytorch/torchdynamo/issues/1803 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89084 Approved by: https://github.com/jansel commit ee05f47bddfb97b4b292808543d928b3526fc0ca Author: Charlie Yan Date: Tue Nov 15 18:03:53 2022 +0000 Rebase and re-land thread PG (#88795) The previous PR (https://github.com/pytorch/pytorch/pull/88627) has been reverted due to a failed check. After rebasing and rerunning, all checks passed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88795 Approved by: https://github.com/huydhn, https://github.com/wanchaol commit 35093fc1ab9749e6b763acead007e56b54c6375b Author: Michael Wootton Date: Tue Nov 15 21:40:43 2022 +0000 Enable correct supported activities for kineto on rocm (#88207) A compile time guard was preventing ActivityType::CUDA from being available on rocm. This caused both the GPU_FALLBACK and CUDA modes to be active at the same time, so operators were being charged gpu time for the hipEventRecord ranges and the actual kernel execution times.
This caused incorrect (and often negative) cuda times, in e.g. table(). Pull Request resolved: https://github.com/pytorch/pytorch/pull/88207 Approved by: https://github.com/malfet, https://github.com/jeffdaily commit d0130cd21ee419fcb33a9ceefa3583aac1e736e1 Author: Bin Bao Date: Mon Nov 14 14:47:15 2022 +0000 Enable test_ops for inductor (#88994) Summary: skip several unsupported test cases Pull Request resolved: https://github.com/pytorch/pytorch/pull/88994 Approved by: https://github.com/Krovatkin commit 67af734adeebf448c54bbc294e115244c5c32f35 Author: mikey dagitses Date: Tue Nov 15 21:33:38 2022 +0000 skip test that is broken in head (#88759) Test Plan: Rely on CI. Differential Revision: D41156351 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88759 Approved by: https://github.com/zou3519 commit 175b7e1cde0eaaef0465aa9c760842e5ea07e104 Author: Catherine Lee Date: Tue Nov 15 21:27:14 2022 +0000 print xpass (#89020) Print unexpected success as XPASS. I will submit a PR to test-infra so that the log classifier can find these Ex: https://github.com/pytorch/pytorch/actions/runs/3466368885/jobs/5790424173 ``` test_import_hipify (__main__.TestHipify) ... ok (0.000s) test_check_onnx_broadcast (__main__.TestONNXUtils) ... ok (0.000s) test_prepare_onnx_paddings (__main__.TestONNXUtils) ... ok (0.000s) test_load_standalone (__main__.TestStandaloneCPPJIT) ... ok (16.512s) ====================================================================== XPASS [4.072s]: test_smoke (__main__.TestCollectEnv) ---------------------------------------------------------------------- ---------------------------------------------------------------------- Ran 31 tests in 24.594s FAILED (skipped=7, unexpected successes=1) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89020 Approved by: https://github.com/huydhn, https://github.com/seemethere commit 8dc3353b0b1c12f64ba790c7be85cfbc99448cb4 Author: Nikita Vedeneev Date: Tue Nov 15 21:16:15 2022 +0000 add `to(dtype)` support for all sparse compressed formats (#89055) Fixes [#88419](https://github.com/pytorch/pytorch/issues/88419) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89055 Approved by: https://github.com/cpuhrsch commit da2afcb1e0006354f78d5e56d2933382d7af9ebf Author: Nikita Shulga Date: Tue Nov 15 21:05:59 2022 +0000 Add test for out-of-bounds Tensor access on GPU (#39211) Since CUDA context can not recover safely from on-device assert, use `torch.multiprocessing.spawn` to execute a method in another context and verify that it raises unrecoverable error. As those types of tests are pretty slow (6 seconds on powerful linux box with one GPU) run it only in the slow shard. 
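A minimal sketch (not the PR's actual test) of the pattern described above: run the out-of-bounds access in a child process via `torch.multiprocessing.spawn` so the device-side assert cannot poison the parent's CUDA context. Assumes a CUDA device is available.
```python
# Minimal sketch of the pattern described above (not the PR's test): run an
# out-of-bounds CUDA access in a child process so the device-side assert
# cannot corrupt the parent's CUDA context. Assumes a CUDA device.
import torch
import torch.multiprocessing as mp

def _oob_access(rank):
    x = torch.zeros(4, device="cuda")
    # Index 10 is out of bounds for a 4-element tensor -> device-side assert.
    y = x[torch.tensor([10], device="cuda")]
    torch.cuda.synchronize()  # force the failure to surface in this process
    print(y)

if __name__ == "__main__":
    try:
        mp.spawn(_oob_access, nprocs=1)
    except Exception as e:
        print(f"child process failed as expected: {e}")
```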
Closes https://github.com/pytorch/pytorch/issues/38944 Pull Request resolved: https://github.com/pytorch/pytorch/pull/39211 Approved by: https://github.com/ezyang commit d47b94fa8e17ad805f1283943dd2b1bc46b309b8 Author: Fabio Rocha Date: Mon Nov 14 10:47:34 2022 +0000 [inductor] Added bucketize to decomp table (#88348) These are the benchmark results vs eager ``` [--------------------------- bucketize ----------------------------] | eager | decomp 32 threads: -------------------------------------------------------- ((16384, 1024), (16,)), (True, True) | 600 | 464 ((16384, 1024), (16,)), (True, False) | 542 | 464 ((16384, 1024), (16,)), (False, True) | 780 | 731 ((16384, 1024), (16,)), (False, False) | 777 | 731 ((16384, 1024), (64,)), (True, True) | 624 | 515 ((16384, 1024), (64,)), (True, False) | 603 | 515 ((16384, 1024), (64,)), (False, True) | 789 | 718 ((16384, 1024), (64,)), (False, False) | 786 | 718 ((16384, 1024), (256,)), (True, True) | 878 | 820 ((16384, 1024), (256,)), (True, False) | 891 | 830 ((16384, 1024), (256,)), (False, True) | 897 | 900 ((16384, 1024), (256,)), (False, False) | 900 | 900 ((16384, 1024), (1024,)), (True, True) | 2000 | 1890 ((16384, 1024), (1024,)), (True, False) | 1950 | 1892 ((16384, 1024), (1024,)), (False, True) | 1990 | 1962 ((16384, 1024), (1024,)), (False, False) | 1990 | 2060 ((16384, 1024), (4096,)), (True, True) | 3405 | 3155 ((16384, 1024), (4096,)), (True, False) | 3244 | 3154 ((16384, 1024), (4096,)), (False, True) | 3282 | 3219 ((16384, 1024), (4096,)), (False, False) | 3278 | 3220 ((16384, 1024), (16384,)), (True, True) | 4626 | 4672 ((16384, 1024), (16384,)), (True, False) | 4629 | 4671 ((16384, 1024), (16384,)), (False, True) | 4662 | 4829 ((16384, 1024), (16384,)), (False, False) | 4665 | 4824 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88348 Approved by: https://github.com/ngimel commit 9262d18e1bc1f31479677cbd2c121770f3f36522 Author: Fabio Rocha Date: Mon Nov 14 10:47:32 2022 +0000 [inductor] Introduce CSEVariable type and use it to track if Triton variables are scalar (#88347) This fixes https://github.com/pytorch/torchdynamo/issues/1515 To fix it, we need to keep track of whether a Triton variable is a scalar (so we can not use a mask when doing indirect loads through them). This requires a way of annotating variable names generated by CSE with properties. So now CSE will use CSEVariable class to keep track of variables and let backends subclass it so they can annotate them with whatever information they want. TritonCSEVariable is such a subclass that track the `is_scalar` property. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88347 Approved by: https://github.com/jgong5, https://github.com/ngimel commit edd2dea859613a9792cfd08a77cf6ae56a531644 Author: Colin Taylor Date: Tue Nov 15 20:46:00 2022 +0000 [torch] [analytics] add dynamo to analytics (#88915) Summary: as title. Differential Revision: D41237602 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88915 Approved by: https://github.com/jansel commit 3e2ba60ac0598c6d85ea83a25fd15df855b9f2f9 Author: Colin Taylor Date: Tue Nov 15 20:36:13 2022 +0000 [torch] [analytics] add pytorch event logger callsites to torch.save and torch.load (#89003) Summary: as title. 
Differential Revision: D41239419 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89003 Approved by: https://github.com/ezyang, https://github.com/dzhulgakov commit d8466964b348b6172317f70b8e52de02402bad54 Author: Nikita Shulga Date: Tue Nov 15 20:35:48 2022 +0000 Add range check to multi margin loss target (#89008) Fixes https://github.com/pytorch/pytorch/issues/88724 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89008 Approved by: https://github.com/ngimel commit 18c1f2f82eee51bf0e0061dc08d5416b6a7fe0cf Author: Colin Taylor Date: Tue Nov 15 20:35:34 2022 +0000 [torch] [analytics] add pytorch event logger callsites to transformers and encoder/decoders (#88896) Differential Revision: D41227275 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88896 Approved by: https://github.com/mikekgfb commit ff6d2a6d1b8245563c8122849144dddaa276483a Author: Driss Guessous Date: Tue Nov 15 20:22:54 2022 +0000 Add mem efficient backward (#88856) - Use gradcheck to test correctness. The kernel is not implemented for fp64 so run checks with bumped tolerances in fp32 - I also made updates based off of Xformer main branch and flash-attention cutlass branch. - This will enable the fused backward to be called for scaled dot product attention Pull Request resolved: https://github.com/pytorch/pytorch/pull/88856 Approved by: https://github.com/cpuhrsch commit d60abe4b9521e235c0e9beb00cda0d6c5673f4e0 Author: Jiawen Liu Date: Tue Nov 15 19:34:38 2022 +0000 [Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#88859) Summary: Build fx-based linear/matmul/bmm + permute/transpose vertical fusion in Inductor For an internal Ads model: **1.15x -> 1.36x speedup** Differential Revision: D41071665 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88859 Approved by: https://github.com/jianyuh, https://github.com/jansel commit f5df68509097c65263ccf100e5df6b1057e9a2fa Author: Xiao Wang <24860335+xwang233@users.noreply.github.com> Date: Tue Nov 15 19:25:53 2022 +0000 Enable channels_last_3d on SyncBatchNorm (#88401) This PR enabled the use of fast channels_last kernels on SyncBatchNorm with channels_last_3d memory format. With a small benchmark script here https://github.com/pytorch/pytorch/issues/88021#issuecomment-1299059859, on V100, I got master: ``` DDP channels_last=False, run_forward_backward, time: 0.8945400714874268 sec DDP channels_last=True, run_forward_backward, time: 1.4736433029174805 sec ``` This PR: ``` DDP channels_last=False, run_forward_backward, time: 0.8927242755889893 sec DDP channels_last=True, run_forward_backward, time: 0.48697471618652344 sec ``` This PR is a follow-up of https://github.com/pytorch/pytorch/pull/46906 Close https://github.com/pytorch/pytorch/issues/88021 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88401 Approved by: https://github.com/ngimel commit 8023c9dc6420bce8e37ad4e4e363cb7bed7f70de Author: Taylor Robie Date: Fri Nov 11 16:30:05 2022 -0800 [Profiler] Memory profiler part 3: Schema parsing and mutable arguments (#86854) The appropriate annotation for a block of memory is a function of time: an input can be mutated in-place to become an activation, a clever kernel might steal the memory of a detached input (such as a mask) to use as output memory, etc. We could pessimistically assume that all ops mutate all of their inputs, however inspection of schema allows us to significantly narrow that assumption with minimal effort. 
Checking schemas also allows us to distinguish between dispatcher ops (which have load-bearing semantics) and user annotations with reasonably high precision. Differential Revision: [D40220390](https://our.internmc.facebook.com/intern/diff/D40220390/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86854 Approved by: https://github.com/chaekit commit 2439bc1e9bab3721bb9f1c4853baf03b610c89da Author: Taylor Robie Date: Fri Nov 11 16:30:03 2022 -0800 [Profiler] Memory profiler part 2: Config validation (#86853) Memory profiling requires `record_shapes`, `profile_memory`, and `with_stack`. This PR just adds a skeleton endpoint with a good error message if certain flags are missing. Differential Revision: [D39920801](https://our.internmc.facebook.com/intern/diff/D39920801/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86853 Approved by: https://github.com/chaekit commit 279dcce702a56f5b3ce5e864fa4db2f882e01084 Author: mikey dagitses Date: Tue Nov 15 19:08:31 2022 +0000 disable test that fails in fbcode (#88786) Summary: caffe2/test:torch_cuda - test_advanced_indexing_assignment_lazy (test_view_ops.TestViewOpsLAZY) RuntimeError: TorchScript backend not yet supported in FBCODE/OVRSOURCE builds File "/usr/local/fbcode/platform010/lib/python3.8/unittest/suite.py", line 163, in _handleClassSetUp setUpClass() File "/re_cwd/fbcode/buck-out/opt/gen/caffe2/test/torch_cuda#binary,link-tree/torch/testing/_internal/common_device_type.py", line 506, in setUpClass torch._lazy.ts_backend.init() File "/re_cwd/fbcode/buck-out/opt/gen/caffe2/test/torch_cuda#binary,link-tree/torch/_lazy/ts_backend.py", line 6, in init torch._C._lazy_ts_backend._init() Test Plan: Rely on CI. Differential Revision: D41170545 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88786 Approved by: https://github.com/zou3519 commit 1db0f735e8fe14245e98e875c15ecf95ed2142ce Author: Taylor Robie Date: Fri Nov 11 16:30:01 2022 -0800 [Profiler] Account for caching when assigning IDs (#88917) The python tracer caches information about module and optimizer state. That means that for subsequent calls, the presence of a Tensor in these fields does not imply that the Tensor is still live; just that it was live during the first call. (I should perhaps rename the fields to something like `stale_parameters` to convey this.) Unless we discard subsequent calls, ID assignment gets tripped up when it sees a Tensor that was already released. Differential Revision: [D41226827](https://our.internmc.facebook.com/intern/diff/D41226827/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88917 Approved by: https://github.com/chaekit commit ee4412381ea3577fbf32858f35f8b76bdc548b49 Author: Zain Rizvi Date: Tue Nov 15 17:55:29 2022 +0000 Allow ROCm runners to have 2 or more gpus (#89011) [This run](https://github.com/pytorch/pytorch/actions/runs/3432340660/jobs/5721731207) failed claiming that it couldn't detect GPUs on the runner. Inspecting the rocminfo output (higher up in logs) shows that it in fact had three GPUs, but the workflow is currently set up to expect either 2 or 4 gpus. The workflow files currently have no way of specifying whether it'll get a 2 GPU or a 4 GPU machine, so really 2 is all any test can expect to get. [This old PR](https://github.com/pytorch/pytorch/pull/72142/files) shows that historically ROCm runners only had 4 gpus, then later the logic was extended to expect 2 GPU runners as well.
It's not clear how the ROCm runner ended up with 3 gpus instead of 2 or 4 (something for ROCm folks to look into) but there doesn't seem to be a good reason for ROCm workflows to fail if 3 (or 5) gpus ever show up on a machine. This PR makes the workflows resilient to ROCm having these alternate GPU counts Also filed https://github.com/pytorch/pytorch/issues/89012 against the ROCm team to explore why the runner only had 3 gpus Pull Request resolved: https://github.com/pytorch/pytorch/pull/89011 Approved by: https://github.com/huydhn commit 2819df9a19480feba72f9c613be25e56d4f05142 Author: Pruthvi Madugundu Date: Tue Nov 15 17:49:00 2022 +0000 [ROCm] Enable python ref executor UTs for ROCm (#88981) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88981 Approved by: https://github.com/mruberry commit 62ba15e10e875ce088dff26e872605ee70c8c04a Author: Tugsbayasgalan (Tugsuu) Manlaibaatar Date: Mon Nov 14 23:26:15 2022 -0800 Rewrite assert statement with torch._assert under config (#88246) This diff rewrites assert statement in python with torch._assert under config. The resulting graph looks something like: ``` SOURCE CODE: def f(x): assert x[0] == 3 return x.cos() CAPTURED GRAPH: graph(): %arg0 : [#users=2] = placeholder[target=arg0] %getitem : [#users=1] = call_function[target=operator.getitem](args = (%arg0, 0), kwargs = {}) %eq : [#users=1] = call_function[target=operator.eq](args = (%getitem, 3), kwargs = {}) %_assert : [#users=0] = call_function[target=torch._assert](args = (%eq, "assertion_error"), kwargs = {}) %cos : [#users=1] = call_method[target=cos](args = (%arg0,), kwargs = {}) return cos ``` Note that this introduces side-effect as it could error out while executing graph, but the assertion can eliminated via DCE if we choose to ignore it. 
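For reference, a small runnable illustration (not taken from the PR) of what the rewrite maps to at the Python level: `torch._assert` is the traceable stand-in for the plain `assert` statement shown in the captured graph above.
```python
# Illustration (not from the PR): the rewrite above corresponds to replacing
#   assert x[0] == 3
# with
#   torch._assert(x[0] == 3, "assertion_error")
import torch

def f(x):
    torch._assert(x[0] == 3, "assertion_error")
    return x.cos()

print(f(torch.tensor([3.0, 1.0, 2.0])))   # passes the assert
# f(torch.tensor([0.0]))                  # would raise AssertionError
```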
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88246 Approved by: https://github.com/jansel commit b815f1fc502387311a7b4da8c2f52ead56cbfff5 Author: anjali411 Date: Tue Nov 15 13:05:30 2022 +0000 Symintify view_as_complex and view_as_real (#89052) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #89052 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89052 Approved by: https://github.com/ezyang commit b9029fc4497a9453e76892c9cf56144add89faf7 Author: HDCharles Date: Fri Nov 11 08:55:40 2022 -0800 [ao] quant_type.py fixing public v private (#87519) Summary: made _get_quant_type_to_str private Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D40709282](https://our.internmc.facebook.com/intern/diff/D40709282) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87519 Approved by: https://github.com/jcaip commit 5faa2792fa3c46f2124d1d1c5f7b6a3865d47d7b Author: Sherlock Huang Date: Tue Nov 15 01:06:23 2022 +0000 Symintify decomps for split and upsample_bilinear; Fix decomp for _softmax_backward_data and native_dropout_backward (#88761) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88761 Approved by: https://github.com/ezyang commit 63e16216d8830b6340816c873b035e1a31ad4636 Author: Masaki Kozuki Date: Tue Nov 15 13:21:39 2022 +0000 [c10d] Implement `__instancecheck__` for `c10d::ReduceOp` (#88275) Summary: - Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of custom `__instancecheck__` - Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests Rel: - #81272 - #84243 - #87191 - #87303 - #87555 Ref: - https://github.com/pybind/pybind11/issues/2696 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88275 Approved by: https://github.com/wanchaol commit 2452e3f99a072760fc46d3f9025aaa37ca7ea2ab Author: Chen Lai Date: Mon Nov 14 20:16:45 2022 -0800 Update xnnpack graph schema to use xnode and xvalue (#89036) There are different nodes definition like [Node in autograd](https://www.internalfb.com/code/fbsource/fbcode/caffe2/torch/csrc/autograd/function.h?lines=108-609&reveal=108-609) and onnxnodes and etc. Understand namespace can be used where nodes from definition are used together, however it's still better to slightly differentiate the name. Differential Revision: [D41002324](https://our.internmc.facebook.com/intern/diff/D41002324/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89036 Approved by: https://github.com/mcr229 commit 8c46a5de3a2e72c5ffbb714fa4e2d44fc2e59951 Author: Chen Lai Date: Mon Nov 14 20:16:43 2022 -0800 Add debug handle to xnnpack schema (#89033) As title, add three things to the schema 1. debug handle for each node 2. file identifier, so we can sanity check we are getting the xnnpack schema flatbuffers file, instead of other random binary 3. extension, so the dumped binary will end up with its own extension like `myschema.xnnpack` (maybe can have a better name) instead of the default extension `.bin` Differential Revision: [D40906970](https://our.internmc.facebook.com/intern/diff/D40906970/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89033 Approved by: https://github.com/mcr229 commit 50c18217a3849c56a0fe5bdb923bd67fa70da31c Author: PyTorch MergeBot Date: Tue Nov 15 09:37:09 2022 +0000 Revert "Add mem efficient backward (#88856)" This reverts commit 35e668b5ced25e735b6e523d557ed7fd60267914. 
Reverted https://github.com/pytorch/pytorch/pull/88856 on behalf of https://github.com/DanilBaibak due to breaking internal builds commit 5314af5383e56376cd62da22ae07681656667e71 Author: Wenzhe Xue Date: Tue Nov 15 07:29:52 2022 +0000 Set correct size of `attr::output_layouts` when the graph has multiple outputs in JIT oneDNN fuser (#88496) Bug: Previously, `initOutputLayouts()` was called after creating a graph and before merging other nodes. It is a vector with one element. So when a graph contains multiple outputs, e.g. using AOTAutograd compile in my case, layout_propagation pass try to access out of range elements in the vector. Then it comes to the second bug in `useOpaqueLayout()`, the out of range checks the index with the updated output size instead of the size of the vector. Then used `[]` to access the element, which is out of range. Fixes the above two issues: 1. check the offset is within range with the size of `attr::output_layouts` vector instead of another variable. This check catches the error now. 2. change the place to initial `attr::output_layouts` after node merging. The graph may change with node merging. Thus we moved the initialization in layout_propagation with the complete graph. Added test time: `Ran 1 test in 0.383s` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88496 Approved by: https://github.com/jgong5, https://github.com/sanchitintel commit 60e59c075561068c7d1fe9e9fc40a2df3cd2d2d7 Author: peterjc123 Date: Tue Nov 15 06:36:24 2022 +0000 Fix get_default_qat_qconfig for PT 1.13 (#88876) See https://github.com/pytorch/pytorch/pull/84329/files#r1019916766 for more context Pull Request resolved: https://github.com/pytorch/pytorch/pull/88876 Approved by: https://github.com/jgong5, https://github.com/vkuzo commit 5ed90c40f874359aca13f7f50e6d115524937d02 Author: Natalia Gimelshein Date: Tue Nov 15 06:16:13 2022 +0000 enable index_put test (#89019) Per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/89019 Approved by: https://github.com/desertfire commit 68fd8f37063f0011f1c0589e8f38f7606e3f6748 Author: Iris Date: Tue Nov 15 06:13:15 2022 +0000 [BE] [c10d][send] Improve error message on dist.send() with destination rank as itself (#89004) This improves error msg on dist.send() and add corresponding test in test_c10d_common.py(https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_common.py). Context in issue#83912: https://github.com/pytorch/pytorch/issues/83912 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89004 Approved by: https://github.com/H-Huang commit 21dd311077d00ff5c3f930295ddc8cf915a262d7 Author: Huy Do Date: Tue Nov 15 05:08:26 2022 +0000 Add a mode to rerun all disabled tests (without running anything else) (#88646) Rerun all disabled test to gather their latest result so that we can close disabled tickets automatically. When running under this mode (RERUN_DISABLED_TESTS=true), only disabled tests are run while the rest are skipped `` The logic is roughly as follows, the test runs multiple times (n=50) * If the disabled test passes, and it's flaky, do nothing because it's still flaky. In the test report, we'll see the test passes with the following skipped message: ``` ``` * If the disabled test passes every single time, and it is not flaky anymore, mark it so that it can be closed later. We will see the test runs and passes, i.e. ``` ``` * If the disabled test fails after all retries, this is also expected. 
So only report this but don't fail the job (because we don't care about red signals here), we'll see the test is skipped (without the `flaky` field), i.e. ``` ``` This runs at the same schedule as `mem_leak_check` (daily). The change to update test stats, and (potentially) grouping on HUD will come in separated PRs. * pull https://github.com/pytorch/pytorch/actions/runs/3447434434 * trunk https://github.com/pytorch/pytorch/actions/runs/3447434928 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88646 Approved by: https://github.com/clee2000 commit 73d71ae3d62607f2e480af37c470375ea405eb1c Author: Elias Ellison Date: Tue Nov 15 00:21:52 2022 +0000 [WIP] Unwrap View in Reinterpret View (#89016) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89016 Approved by: https://github.com/ngimel commit dd6beca854be6cc0619d0b0693bc2fc558636217 Author: Everton Constantino Date: Tue Nov 15 04:10:49 2022 +0000 Changing the use from ASSERT_EQ to ASSERT_FLOAT_EQ on nn_utils test. (#83693) Changing the use from ASSERT_EQ to ASSERT_FLOAT_EQ on nn_utils.cpp:ClipGradNorm as this is the proper way to compare equality between floating point values. This avoids `test_api` ClipGradNorm failing for WoA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83693 Approved by: https://github.com/ngimel, https://github.com/kit1980 commit ce8a45c282c68abbf37f7af99d4bd7cb53fa020d Author: PyTorch MergeBot Date: Tue Nov 15 03:32:00 2022 +0000 [vision hash update] update the pinned vision hash (#89026) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89026 Approved by: https://github.com/pytorchbot commit 55b88cde0ab0e5457422777971af845842b2689b Author: Jiawen Liu Date: Tue Nov 15 03:10:36 2022 +0000 [Inductor] Build Shape Padding in Inductor (#88709) Summary: Build shape padding for matmul/bmm/addmm in Inductor Differential Revision: D41071282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88709 Approved by: https://github.com/bertmaher, https://github.com/Chillee commit cbdb683dc843f2d50617ad962d5e57501e5154d4 Author: Edward Z. Yang Date: Mon Nov 14 16:51:32 2022 -0500 Add test that bias gradient is properly tested in same_two_models (#88995) See https://github.com/pytorch/pytorch/pull/88629#issuecomment-1313850324 for why this got broken. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88995 Approved by: https://github.com/albanD commit 45d2daaf855d4e79f6e09c4d3f85743b955446e6 Author: William Wen Date: Tue Nov 15 02:32:55 2022 +0000 Fix lookup file update in dashboard (#89024) Lookup file should be updated before graphs are generated. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89024 Approved by: https://github.com/mlazos, https://github.com/anijain2305 commit 1f88b208acab2cf974849c9161d24f08486f592c Author: Michael Gschwind Date: Tue Nov 15 01:25:17 2022 +0000 Fix cuda/cpu check on NoneType (Unit test) (#88970) Summary: Fix cuda/cpu check on NoneType (unit test) Test Plan: sabdcastle/ github CI/CD Differential Revision: D41208798 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88970 Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch commit 35e668b5ced25e735b6e523d557ed7fd60267914 Author: Driss Guessous Date: Tue Nov 15 01:10:35 2022 +0000 Add mem efficient backward (#88856) - Use gradcheck to test correctness. The kernel is not implemented for fp64 so run checks with bumped tolerances in fp32 - I also made updates based off of Xformer main branch and flash-attention cutlass branch. - This will enable the fused backward to be called for scaled dot product attention Pull Request resolved: https://github.com/pytorch/pytorch/pull/88856 Approved by: https://github.com/cpuhrsch commit f3462833bdd1324d32ad9a78b5f142fb4d75f57c Author: Zain Rizvi Date: Tue Nov 15 01:01:37 2022 +0000 Use same retry logic as macos binary builds (#89014) Occasionally the command to download sccache via curl fails with network errors (example below). The default curl retry option only retries errors that are considered "transient", but but the set of actual transient commands is greater than what curl considers to be transient. This PR modifies the retry logic for downloading sccache to match what's in https://github.com/pytorch/pytorch/blob/master/.github/templates/macos_binary_build_workflow.yml.j2#L79-L89, using the retry action to ensure we both retry all transient errors, and including a longer retry delay to give the transient issue time to resolve itself. Example failure from [this run](https://github.com/pytorch/pytorch/actions/runs/3422664884/jobs/5700595220): ``` Run sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:05 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:06 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:07 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:08 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:10 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:11 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:12 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:13 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:14 --:--:-- 0 curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to s3.amazonaws.com:443 Error: Process completed with exit code 35. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/89014 Approved by: https://github.com/huydhn commit 7a37bbed15321fa121f628053ee3c93d516700f5 Author: XiaobingSuper Date: Mon Nov 14 07:40:32 2022 -0500 Take input striding for conv fusion op based on eager output (#88864) As https://github.com/pytorch/pytorch/pull/88706, we also change the input stride check using eager output. 
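As a small aside (not the PR's code), the input "striding" being checked can be observed directly from Python: the same logical conv input carries different strides depending on memory format, which is what a stride check against the eager output has to account for.
```python
# Not the PR's code: the same logical tensor can carry different strides
# depending on memory format, and eager conv may produce either layout.
import torch

x = torch.randn(8, 3, 32, 32)
print(x.stride())           # contiguous NCHW strides: (3072, 1024, 32, 1)

xcl = x.to(memory_format=torch.channels_last)
print(xcl.stride())         # channels-last strides:   (3072, 1, 96, 3)
print(torch.equal(x, xcl))  # True: same values, different physical layout
```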
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88864 Approved by: https://github.com/jgong5, https://github.com/jansel commit 0544a32ba35acd8648692a662197e3497654858e Author: Jongsoo Park Date: Tue Nov 15 00:48:49 2022 +0000 [inductor] fix could not find as_strided with config.triton.mm=triton (#88946) Summary: ReinterpretView doesn't seem to be handled properly with matrix multiply Triton kernels Reviewed By: bertmaher Differential Revision: D40836677 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88946 Approved by: https://github.com/jansel commit 92c78f37afca6c1ff6c40be7c7ed44b162b287b4 Author: wswartworth Date: Mon Nov 14 23:58:46 2022 +0000 improving torch.linalg.lstsq documentation formatting (#89013) Fixes #80441 The highlighting in the documentation for torch.linalg.lstsq was incorrect due to a newline that sphinx doesn't parse correctly. Instead of writing the tensors directly, I used randn to generate the tensors. This seems to be more consistent with how other documentation is written. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89013 Approved by: https://github.com/lezcano commit 8df64abc6d8cd1de7017096159a93bb9c7c02bc1 Author: Edward Z. Yang Date: Mon Nov 14 10:49:20 2022 -0500 Fix some naughty uses of reshape/flatten (#88999) Mutating after reshape/flatten is bad! And it turns out the corresponding view operations are guaranteed to work too. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88999 Approved by: https://github.com/albanD commit c53a5ac6cca7e2e7d7c47b1a816c7eaa2e7a7704 Author: PyTorch MergeBot Date: Mon Nov 14 23:36:17 2022 +0000 Revert "support running test_mobile_profiler with buck1/buck2 and OSS (#89001)" This reverts commit 3b33a2794e07b5216aa473da67755af3aa6e6433. Reverted https://github.com/pytorch/pytorch/pull/89001 on behalf of https://github.com/kit1980 due to Broke trunk / macos-12-py3-x86-64-lite-interpreter / build commit 3c3bd55bea3424cbfc0c319dcead9c1e5c55646d Author: Kazuaki Ishizaki Date: Mon Nov 14 23:24:31 2022 +0000 [testing] fix a key in parse_namespace() (#88969) This PR fixes an incorrect key name of `mappings` dict in `parse_namespace()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88969 Approved by: https://github.com/kit1980 commit 911a1349dd5d93b9de62d82f439b09eae9aedb92 Author: Yanbo Liang Date: Mon Nov 14 22:45:50 2022 +0000 [Dynamo] Fix torch.is_tensor and torch.overrides.is_tensor_like (#88704) Fixes error from 7k github models: https://github.com/jansel/pytorch-jit-paritybench/blob/master/generated/test_arashwan_matrixnet.py Error: ``` AssertionError: torch.* op returned non-Tensor bool call_function from user code: File "/scratch/ybliang/work/repos/pytorch-jit-paritybench/generated/test_arashwan_matrixnet.py", line 749, in scatter return scatter_map(inputs) File "/scratch/ybliang/work/repos/pytorch-jit-paritybench/generated/test_arashwan_matrixnet.py", line 741, in scatter_map assert not torch.is_tensor(obj), 'Tensors not supported in scatter.' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88704 Approved by: https://github.com/jansel commit 3b33a2794e07b5216aa473da67755af3aa6e6433 Author: mikey dagitses Date: Mon Nov 14 22:11:29 2022 +0000 support running test_mobile_profiler with buck1/buck2 and OSS (#89001) Summary: Internally we are switching to a new version of buck, but we also must keep this working in OSS. Test Plan: Rely on CI. 
Differential Revision: D41270673 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89001 Approved by: https://github.com/r-barnes, https://github.com/osalpekar, https://github.com/malfet commit 074278f393e1a31b7ee058479cd5906ae830f5ed Author: Nikita Shulga Date: Mon Nov 14 21:54:46 2022 +0000 [CI] Push `latest` and hash+CUDAver tags (#88971) For nightly docker build to simulate the behavior of `push_nightly_docker_ghcr.yml` Tested in https://github.com/pytorch/pytorch/actions/runs/3465221336/jobs/5787694933 Fixes https://github.com/pytorch/pytorch/issues/88833 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88971 Approved by: https://github.com/seemethere commit b2082833c6082cbb25caf48bdb8f58c490b2c8a7 Author: PyTorch MergeBot Date: Mon Nov 14 21:21:09 2022 +0000 Revert "woof (#89010)" This reverts commit 4570bd6030c97577d2fa994857d0a022ef7563a4. Reverted https://github.com/pytorch/pytorch/pull/89010 on behalf of https://github.com/ezyang due to whoops this actually landed commit 4570bd6030c97577d2fa994857d0a022ef7563a4 Author: Edward Z. Yang Date: Mon Nov 14 14:34:01 2022 -0500 woof (#89010) Signed-off-by: Edward Z. Yang Differential Revision: [D41276175](https://our.internmc.facebook.com/intern/diff/D41276175) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89010 Approved by: https://github.com/bigfootjon commit f80992217dd2ae5ca0af5e280388cba6078ef57b Author: anjali411 Date: Mon Nov 14 14:43:15 2022 +0000 Remove skip (#88979) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88979 Approved by: https://github.com/voznesenskym commit 540b42a1a883bb56235cdbf0bbbf103041c4dd8c Author: Jerry Zhang Date: Mon Nov 14 19:27:46 2022 +0000 [quant][executorch] Support quant fusion for cat in quant in executorch stack (#88960) Summary: * added cat in executorch backend config * added quant fusion for "dq - cat - q" pattern Test Plan: buck run executorch/exir/tests:quant_fusion_pass -- "executorch.exir.tests.test_quant_fusion_pass.TestQuantFusionPass.test_cat" Reviewed By: qihqi Differential Revision: D41111054 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88960 Approved by: https://github.com/JacobSzwejbka commit e0c194f10b20a5ab2ad8d2075bec81ca57320268 Author: Kazuaki Ishizaki Date: Mon Nov 14 19:06:38 2022 +0000 Fix typos in messages under torch (#88961) This PR fixes typos of messages and parms in c++ source and head files under `torch` directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88961 Approved by: https://github.com/albanD commit 3d79ced8cfb2ddd250f9a31dad9b990c120e6dab Author: Peter Bell Date: Sat Nov 12 14:20:41 2022 +0000 wrap_pybind_function: support member function pointers (#88932) This updates `wrap_pybind_function` to use `invoke` and adds the `invoke_traits` object which is analogous to `function_traits` but for member functions it includes the class as an explicit argument. To test this is working properly, I've also applied it to the `CUDAGraph` binding code. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88932 Approved by: https://github.com/albanD commit 36d87465fb9b34914e6db50638c0f5bf04e3d7d9 Author: William Wen Date: Mon Nov 14 18:43:50 2022 +0000 Fix long comment error on dashboard (#89002) Fix dashboard comment failure due to the following trace: ``` Traceback (most recent call last): File "/scratch/anijain/dashboard/work/pytorch/benchmarks/dynamo/runner.py", line 1180, in DashboardUpdater(args).update() File "/scratch/anijain/dashboard/work/pytorch/benchmarks/dynamo/runner.py", line 1119, in update self.comment_on_gh(comment) File "/scratch/anijain/dashboard/work/pytorch/benchmarks/dynamo/runner.py", line 1096, in comment_on_gh subprocess.check_call( File "/scratch/anijain/dashboard/env/lib/python3.9/subprocess.py", line 368, in check_call retcode = call(*popenargs, **kwargs) File "/scratch/anijain/dashboard/env/lib/python3.9/subprocess.py", line 349, in call with Popen(*popenargs, **kwargs) as p: File "/scratch/anijain/dashboard/env/lib/python3.9/subprocess.py", line 951, in __init__ self._execute_child(args, executable, preexec_fn, close_fds, File "/scratch/anijain/dashboard/env/lib/python3.9/subprocess.py", line 1821, in _execute_child raise child_exception_type(errno_num, err_msg, err_filename) OSError: [Errno 7] Argument list too long: '/data/home/anijain/miniconda/bin/gh' srun: error: a100-st-p4d24xlarge-27: task 0: Exited with exit code 1 ``` That is, we were trying to execute a gh command in the OS that was too long. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89002 Approved by: https://github.com/davidberard98 commit cdb798faefa2520b37938311bcef1c175581a0ff Author: Sean Ross-Ross Date: Mon Nov 14 18:39:45 2022 +0000 _get_nested_attr should return a value in the general case (#88822) Fixes https://github.com/pytorch/functorch/issues/1053 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88822 Approved by: https://github.com/zou3519 commit f1a5044de0639180f667d212800aa43f34026b3c Author: Khushi Agrawal Date: Mon Nov 14 18:18:45 2022 +0000 [primTorch] _refs & opinfo alpha_dropout (#87989) Add _refs and OpInfo for `nn.functional.alpha_dropout` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87989 Approved by: https://github.com/mruberry commit b0c86caa1d46a16195682e2afe5456f97265aa53 Author: Ivan Yashchuk Date: Mon Nov 14 17:49:30 2022 +0000 Remove cpu path from lobpcg's basis helper (#88984) Fixes https://github.com/pytorch/pytorch/issues/88650 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88984 Approved by: https://github.com/lezcano commit 06f1b52705ee360e5ac89e0f1f32f69ffde72b9a Author: Natalia Gimelshein Date: Mon Nov 14 17:37:24 2022 +0000 don't use prims.unsqueeze in group_norm (#88927) inductor doesn't have prims.squeeze lowering, so this breaks it. Longer term, `squeeze` with multiple dimensions is not a prim, nvfuser implements it with a loop, inductor uses `_squeeze_multiple` helper which turns it into a loop. Prim should accept only a single dimension. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88927 Approved by: https://github.com/eellison commit c8f3d1c13460bbaa85b7f423bfb7f414e825c757 Author: Peter Bell Date: Mon Nov 14 12:36:44 2022 +0000 Run test_torchinductor_opinfo CPU tests if triton not installed (#88934) These test are not run currently because normal CI workers don't have triton installed. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88934 Approved by: https://github.com/ngimel commit ec4eadac5baebcf094836108a25ef3af63d39f5d Author: Brian Hirsh Date: Fri Nov 11 14:13:01 2022 -0800 reland "Do not use unsafe restriding for subclasses (#87610)" (#88343) This reverts commit 5b75b19f51837e162cc0e5e5757dfd9bef437c67. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88343 Approved by: https://github.com/ezyang commit 9943d46aab4465b887039aa1a9b5d9ebc0a01a35 Author: XiaobingSuper Date: Sun Nov 13 22:09:58 2022 -0500 TorchDynamo: skip convolution fusion when convolution's padding is string (#88794) Currently, the fusion convolution doesn't support the case when padding is a string, we will support it at the next step. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88794 Approved by: https://github.com/jansel, https://github.com/jgong5 commit 15ef0660c553ebb50ad639f563062cab01e5e6dc Author: XiaobingSuper Date: Sun Nov 13 22:09:56 2022 -0500 Fake Tensor For (ConvFusion) Propagation (#88414) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88414 Approved by: https://github.com/jgong5, https://github.com/jansel commit 5e6cefd258dfdb4ddf2956c0b5631d84e97027e5 Author: PyTorch MergeBot Date: Mon Nov 14 12:02:43 2022 +0000 Revert "Run test_torchinductor_opinfo CPU tests if triton not installed (#88934)" This reverts commit 8371bb8a3dddbead709bc1e9d26715818a34fa8a. Reverted https://github.com/pytorch/pytorch/pull/88934 on behalf of https://github.com/peterbell10 due to Inductor tests failing on master commit 8371bb8a3dddbead709bc1e9d26715818a34fa8a Author: Peter Bell Date: Sun Nov 13 22:33:13 2022 +0000 Run test_torchinductor_opinfo CPU tests if triton not installed (#88934) These test are not run currently because normal CI workers don't have triton installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88934 Approved by: https://github.com/ngimel commit 072920c281bb4d9ca899c6c781a8374ab42a9a3f Author: XiaobingSuper Date: Sun Nov 13 22:09:54 2022 -0500 TorchDynamo: Add convolution binary+unary fusion for cpu in inference mode (#88412) This PR is about enabling the fusion of **conv+binary+relu**, which will improve the vision model's performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88412 Approved by: https://github.com/jgong5, https://github.com/jansel commit cb4842c9495a68d2a1d4a3ee3ffc9eab30dce28c Author: PyTorch MergeBot Date: Mon Nov 14 10:29:24 2022 +0000 [xla hash update] update the pinned xla hash (#88982) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88982 Approved by: https://github.com/pytorchbot commit 03296844aa0cb560401584545ba1412e52c87b37 Author: Kazuaki Ishizaki Date: Mon Nov 14 09:50:50 2022 +0000 Fix typos in messages under aten (#88964) This PR fixes typos of messages and parms in c++ source files under `aten` directory. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88964 Approved by: https://github.com/lezcano commit 4ad7b17fabd2a2b6873bc369bd223223ff1e628b Author: XiaobingSuper Date: Sun Nov 13 22:09:53 2022 -0500 TorchDynamo: Add convolution binary(inplace) fusion for cpu in inference mode (#88403) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88403 Approved by: https://github.com/jgong5, https://github.com/jansel commit 06486cd0087200e08ebb8a9518e064251c7c5309 Author: iLeGend <824040212@qq.com> Date: Mon Nov 14 03:39:43 2022 +0000 fix typo: AT_MKLDNN_EBABLED => AT_MKLDNN_ENABLED (#88952) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88952 Approved by: https://github.com/XiaobingSuper commit eea506aee12371a1fbde271c99fb30a8537d1db7 Author: PyTorch MergeBot Date: Mon Nov 14 01:58:47 2022 +0000 Revert "Symintify decomps for split and upsample_bilinear; Fix decomp for _softmax_backward_data and native_dropout_backward (#88761)" This reverts commit 9eabcc370f4c3a04be85cb1f878038f10716bdc3. Reverted https://github.com/pytorch/pytorch/pull/88761 on behalf of https://github.com/suo due to much broken https://hud.pytorch.org/pytorch/pytorch/commit/9eabcc370f4c3a04be85cb1f878038f10716bdc3 commit 48dc24ddceb5d048ceb38f00f6d4ec0cfc3e71d0 Author: Aaron Gokaslan Date: Sun Nov 13 22:05:41 2022 +0000 Fix: [ATen] Add some missing moves (#88514) Related to #88512 , but for ATen. This should reduce a number of copies and inefficient atomic smart pointer increments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88514 Approved by: https://github.com/jgong5, https://github.com/ezyang commit 9eabcc370f4c3a04be85cb1f878038f10716bdc3 Author: Sherlock Huang Date: Sun Nov 13 06:06:24 2022 +0000 Symintify decomps for split and upsample_bilinear; Fix decomp for _softmax_backward_data and native_dropout_backward (#88761) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88761 Approved by: https://github.com/ezyang commit 76af71444a43962ee3e1cef987ac2028f2b8f44d Author: Nikita Karetnikov Date: Sat Nov 12 20:06:12 2022 +0100 [primTorch] Add ref for `complex` (#88562) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88562 Approved by: https://github.com/ezyang commit 8f7e519f12d165c06ea3e20b994c2d3c5c44af2c Author: Jason Ansel Date: Sun Nov 13 19:42:42 2022 +0000 Skip dynamo benchmark tests under TSAN (#88895) Summary: Fixes T137546804 Test Plan: ``` buck2 test mode/opt-tsan //caffe2/benchmarks/dynamo:test buck2 test mode/opt //caffe2/benchmarks/dynamo:test ``` Differential Revision: D41226384 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88895 Approved by: https://github.com/anijain2305 commit 52be0c42abfcf566e730d927b6a3e90e4380017b Author: anjali411 Date: Sun Nov 13 15:56:16 2022 +0000 meta function for max_pool2d_with_indices_backward (#88743) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88743 Approved by: https://github.com/lezcano, https://github.com/ezyang commit 98bcb4acb651378d7eaae7532d52f08939464c06 Author: PyTorch MergeBot Date: Sun Nov 13 16:21:12 2022 +0000 Revert "[reland][dynamo] Better support for nn.Module (#88959)" This reverts commit e950afc3958c9bae5d61cbc99bc088309141df6d. 
Reverted https://github.com/pytorch/pytorch/pull/88959 on behalf of https://github.com/malfet due to Broke `test_accuracy_issue1` commit 897d029a738c831448c0984bc0ab91544ca04545 Author: Animesh Jain Date: Sun Nov 13 16:20:45 2022 +0000 [reland][dynamo] fixes dict changed during runtime error (#88877) Reland https://github.com/pytorch/pytorch/pull/87526 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88877 Approved by: https://github.com/ezyang commit 4284862db6e7c14494f27ef681036d909a5e8b67 Author: Andrew Gu Date: Sat Nov 12 19:26:28 2022 +0000 [Dynamo][FSDP] Migrate to `ModuleWrapPolicy` (#88453) Hello @wconstab! As you saw, `transformer_auto_wrap_policy()` is a misnomer and actually works for any module classes. The PR before this one tries to add a class `ModuleWrapPolicy` that takes in the `module_classes` in its constructor and works just like `transformer_auto_wrap_policy()` without requiring the `functools.partial()`. I hope you do not mind if we update the dynamo benchmarks util file with this migration. The PR before this one might require some back and forth within FSDP devs, so I apologize for any consequent updates to this PR, which in itself is an easy change. I will request review once we know the previous PR is good for land. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88453 Approved by: https://github.com/wconstab commit bca75fd2d36de72c2682b47d62eab01f6f897b75 Author: Chen Lai Date: Sat Nov 12 21:41:31 2022 -0800 Move xnnpack taget to fb code base (#88909) 1. Move the source file list to the `build_variables.bzl`, as it's the source of truth for both internal buck build and oss build 2. Move target definitions to `fb` internal folder 3. Some changes are triggered from auto format. Differential Revision: [D40906961](https://our.internmc.facebook.com/intern/diff/D40906961/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40906961/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/88909 Approved by: https://github.com/mcr229 commit 2b12bfce8800cfcc54222e913955914994bb4daf Author: Animesh Jain Date: Sun Nov 13 09:53:38 2022 +0000 [dynamo] Skip frame when graph break in a loop (#88857) This fixes excessing recompilation issue in tacotron2 but has few caveats - https://github.com/pytorch/torchdynamo/issues/330 For tacotron2, the repro is something like this ~~~ def inner(x): return torch.sin(x) def fn(x): for _ in range(100): inner(x) torch._dynamo.graph_break() return x ~~~ The problem here is that Dynamo has guards on the TUPLE_ITERATOR_LEN whenever a graph break happens. Therefore, we keep on recompiling. This PR checks if there is a backedge (helps with while loop) in presence of a graph break. If there is, Dynamo skips processing this frame. Therefore, Dynamo gets called when inner is called, and we compile only once. Note that, if there was no graph break, we will unroll the original loop, and see one graph with 100 sin operations (just as before, so no changes there). The caveat is - We are skipping the frame, so if we have something like this ~~~ def fn(x): for _ in range(100): torch._dynamo.graph_break() return x ~~~ Dynamo will skip processing this frame, and might miss on the optimization. Completely open for suggestions. Happy to re-implement if there is a better way to handle this. 
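A runnable sketch of the first repro above (behavior as described in this commit message; the "eager" backend and the exact recompile counts are assumptions that may vary across versions):
```python
# Runnable sketch of the repro above; the described effect (fn's frame is
# skipped, inner is compiled once) is taken from the commit message.
import torch
import torch._dynamo as dynamo

def inner(x):
    return torch.sin(x)

def fn(x):
    for _ in range(100):
        inner(x)
        dynamo.graph_break()
    return x

opt_fn = dynamo.optimize("eager")(fn)
opt_fn(torch.randn(4))
# With this change, fn's frame (graph break inside a loop with a backedge) is
# skipped, and inner is compiled once instead of recompiling on every break.
```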
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88857 Approved by: https://github.com/jansel, https://github.com/yanboliang commit e950afc3958c9bae5d61cbc99bc088309141df6d Author: Animesh Jain Date: Sun Nov 13 08:19:45 2022 +0000 [reland][dynamo] Better support for nn.Module (#88959) Relanding https://github.com/pytorch/pytorch/pull/88629 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88959 Approved by: https://github.com/msaroufim commit 06ce1338bced2d2cb933a383157b335f65a35e71 Author: Michael Voznesensky Date: Sun Nov 13 04:50:21 2022 +0000 [dynamo] Port all pytorch/dynamo and test/dynamo pieces over from symbolic-shapes branch (#88768) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88768 Approved by: https://github.com/jansel, https://github.com/ezyang commit 4f2639e56ad5b26d2f5383dcc14e0f91c250d355 Author: Andrew Gu Date: Sat Nov 12 20:27:00 2022 +0000 [FSDP] Fix `FSDP.clip_grad_norm_()` for `NO_SHARD` (#88955) This PR fixes `FSDP.clip_grad_norm_()` for `NO_SHARD`, which previously "double-counted" each gradient `world_size`-many times. This does not address any discrepancies between `FULL_SHARD` and DDP. (Note that the unit tests do show parity between `FULL_SHARD` and DDP when using `FSDP.clip_grad_norm_()` and `nn.utils.clip_grad_norm_()` respectively on one iteration.) The added unit test code path tests mixing nested FSDP instances with both `FULL_SHARD` and `NO_SHARD` to ensure that the `local_sharded_norm` and `local_nonsharded_norm` computations are interoperating correctly. I want to test non-FSDP root instance in the future, but this is BC breaking since we need to make `clip_grad_norm_()` a static method, which would require a different method call syntax (`FSDP.clip_grad_norm_(root_module, ...)` vs. `root_module.clip_grad_norm_(...)`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/88955 Approved by: https://github.com/zhaojuanmao commit 46796fe5e9b74602d45927304773fdcda1c3215a Author: Edward Z. Yang Date: Sat Nov 12 06:19:02 2022 -0800 Fix XLA symbolic shapes binding (#88928) Obsoletes https://github.com/pytorch/pytorch/pull/88772 Mostly revolves around NOT assuming that the inside is a SymNode, but instead duck-typed to be a SymNode. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88928 Approved by: https://github.com/SherlockNoMad commit 2aca97cc9ae7081f00ebc7d58367c443cd4528cf Author: Aleksandar Samardžić Date: Sun Nov 13 00:31:11 2022 +0000 Vectorized CPU code implementing left shift operator. (#88607) This PR adds vectorized implementation for CPU version of left shift operator. All of the tests run by `pytest test/test_ops.py -vk left_shift` pass. Here are some additional details:
Benchmarking script (written by Philip, with small tweaks by Mario) comparing left shifts with multiplications - on par now
```python
import torch
from torch import Tensor
from torch.utils.benchmark import Timer, Compare
from itertools import product
from functools import partial

def _num_value_bits(dtype):
    if dtype == torch.uint8:
        return 8
    else:  # torch.int32
        return 31

def _max_value(dtype):
    if dtype == torch.uint8:
        return 255
    else:  # torch.int32
        return 2147483647

def bitshift(image, dtype):
    num_value_bits_input = _num_value_bits(image.dtype)
    num_value_bits_output = _num_value_bits(dtype)
    return image.to(dtype).bitwise_left_shift_(num_value_bits_output - num_value_bits_input)

def mul(image, dtype):
    input_max = float(_max_value(image.dtype))
    output_max = float(_max_value(dtype))
    factor = int((output_max + 1) // (input_max + 1))
    image = image.to(dtype)
    return image * factor

size = 256
image = torch.randint(0, 256, (3, size, size), dtype=torch.uint8)
dtype = torch.int32

def gen_inputs():
    devices = ("cpu",)
    fns = (mul, bitshift)
    threads = (1,)
    for device, fn, threads in product(devices, fns, threads):
        yield f"Bitshift {device} {image.dtype}", str(tuple(image.shape)), threads, fn, image, dtype

def benchmark(label, sub_label, threads, f, *args, **kwargs):
    return Timer("f(*args, **kwargs)",
                 globals=locals(),
                 label=label,
                 description=f.__name__,
                 sub_label=sub_label,
                 num_threads=threads).blocked_autorange()

results = []
for args in gen_inputs():
    results.append(benchmark(*args))

compare = Compare(results)
compare.trim_significant_figures()
compare.print()
```
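As a side note (not from the PR), the in-place `bitwise_left_shift_` used in the script above and the `<<` operator timed in the scripts below compute the same element-wise shift:
```python
# Side note (not from the PR): the << operator and torch.bitwise_left_shift
# compute the same element-wise left shift that this PR vectorizes on CPU.
import torch

x = torch.tensor([1, 2, 3], dtype=torch.int32)
y = torch.tensor([4, 4, 4], dtype=torch.int32)
print(x << y)                          # tensor([16, 32, 48], dtype=torch.int32)
print(torch.bitwise_left_shift(x, y))  # same result
```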
Test script exercising a large number of combinations of left shift operands that I've used for further testing (validates results through comparing with results generated by NumPy)
```python
import numpy as np
import torch

def _create_inputs(dtype):
    info = torch.iinfo(dtype)
    if dtype == torch.int8 or dtype == torch.int16:
        ntests = info.max + 1
        x = torch.arange(info.max + 1, dtype=dtype, device="cpu", requires_grad=False)
    else:
        ntests = 100000
        x = torch.randint(info.max + 1 if dtype != torch.int64 else info.max, (ntests,), dtype=dtype, device="cpu", requires_grad=False)
    y = torch.tensor(range(info.bits), dtype=dtype, device="cpu", requires_grad=False)
    xy = torch.cartesian_prod(x, y)
    return (xy[:, 0], xy[:, 1])

torch.manual_seed(0)
for dtype in (torch.int8, torch.int16, torch.int32, torch.int64):
    (x, y) = _create_inputs(dtype)
    z = x << y
    xnp = x.numpy()
    ynp = y.numpy()
    znp = z.numpy()
    assert((znp == (xnp << ynp)).all())
```
Benchmarking script running the left shift operator on tensors of different length (and varying number of bits to shift)
```python
import torch
import pickle
import itertools
from torch.utils.benchmark import Timer, Compare

torch.manual_seed(0)

lengths = [1024, 4096, 16384, 65536]
rhss = [1, 2, 7, 8, 15, 16, 31, 32, 63, 64]
benchmark_name = "lshift"
label = ""
dtypes = [torch.int8, torch.int16, torch.int32, torch.int64]
results = []

def _make_args(dtype, length, rhs):
    info = torch.iinfo(dtype)
    imax = info.max
    return (torch.randint(info.max, (length,), dtype=dtype, device="cpu", requires_grad=False),
            rhs * torch.ones((length,), dtype=dtype, device="cpu", requires_grad=False))

for dtype, length, rhs in itertools.product(dtypes, lengths, rhss):
    x, y = _make_args(dtype, length, rhs)
    timer = Timer("x << y",
                  globals=globals(),
                  label=benchmark_name,
                  description=label,
                  sub_label=f"dtype={dtype},length={length}",
                  num_threads=1)
    results.append(timer.blocked_autorange())

compare = Compare(results)
compare.trim_significant_figures()
compare.print()

with open("{}.pickle".format(label), "wb") as f:
    pickle.dump(results, f)
```
Results of running above benchmarking script - results manually merged for runs of viable/strict (labeled "master" in the table below) and my branch (labeled "mybranch" in the table below)
```
[------------------- lshift -------------------------------]
                                      |  master  |  mybranch
1 threads: --------------------------------------------------
      dtype=torch.int8,length=1024    |     3    |      3
      dtype=torch.int8,length=4096    |     5    |      3
      dtype=torch.int8,length=16384   |    14    |      5
      dtype=torch.int8,length=65536   |    51    |     15
      dtype=torch.int16,length=1024   |     3    |      3
      dtype=torch.int16,length=4096   |     4    |      3
      dtype=torch.int16,length=16384  |    11    |      5
      dtype=torch.int16,length=65536  |    39    |     13
      dtype=torch.int32,length=1024   |     3    |      2
      dtype=torch.int32,length=4096   |     4    |      3
      dtype=torch.int32,length=16384  |    10    |      4
      dtype=torch.int32,length=65536  |    35    |     12
      dtype=torch.int64,length=1024   |     3    |      3
      dtype=torch.int64,length=4096   |     4    |      3
      dtype=torch.int64,length=16384  |    11    |      6
      dtype=torch.int64,length=65536  |    36    |     20

Times are in microseconds (us).
```
All of the testing/benchmarking was conducted on qpu3, which supports AVX2 only. For basic validation of the AVX-512 update of the left shift implementation for 8-bit operands (the only one that is non-trivial in the AVX-512 case), [Compiler Explorer](https://godbolt.org/) was used, with GCC trunk and the -mavx512f -mavx512bw flags added. Here are further details:
C program used for basic validation of AVX-512 vectorized version for 8-bit operands
```
// Required headers
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <immintrin.h>

static void print_m512i_int8(const __m512i* x) {
    int8_t val[64];
    memcpy(val, x, sizeof(val));
    for (int i = 0; i < 64; ++i) {
        if (i > 0) printf(", ");
        printf("%d", (int)val[i]);
    }
    printf("\n");
}

int main() {
    __m512i a = _mm512_set_epi8(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1);
    __m512i b = _mm512_set_epi8(7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6,
                                5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
                                3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2,
                                1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0);

    // ------- Copied code from vec512_int.h
    // Mask used to set upper 8 bits of each 16-bit value to 0, and keep
    // lower 8 bits.
    __m512i mask = _mm512_set_epi16(0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
                                    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
                                    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
                                    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff);

    // Convert 8-bit operands from lower lanes to 16-bit values, and
    // perform vectorized shift. Make sure that upper 8 bits of 16-bit
    // results are all 0.
    __m256i a_lo_8 = _mm512_extracti64x4_epi64(a, 0);
    __m256i b_lo_8 = _mm512_extracti64x4_epi64(b, 0);
    __m512i a_lo_16 = _mm512_cvtepi8_epi16(a_lo_8);
    __m512i b_lo_16 = _mm512_cvtepi8_epi16(b_lo_8);
    __m512i c_lo_16 = _mm512_and_si512(_mm512_sllv_epi16(a_lo_16, b_lo_16), mask);

    // Convert 8-bit operands from upper lanes to 16-bit values, and
    // perform vectorized shift. Make sure that upper 8 bits of 16-bit
    // results are all 0.
    __m256i a_hi_8 = _mm512_extracti64x4_epi64(a, 1);
    __m256i b_hi_8 = _mm512_extracti64x4_epi64(b, 1);
    __m512i a_hi_16 = _mm512_cvtepi8_epi16(a_hi_8);
    __m512i b_hi_16 = _mm512_cvtepi8_epi16(b_hi_8);
    __m512i c_hi_16 = _mm512_and_si512(_mm512_sllv_epi16(a_hi_16, b_hi_16), mask);

    // Cast 16-bit results back into 8-bit values and merge them
    // together (using unsigned saturation with higher 8 bits set to 0
    // above ensures that results are correct). Values are merged per
    // lanes, so this is not yet the final result.
    __m512i c_perm = _mm512_packus_epi16(c_lo_16, c_hi_16);

    // Permute values so that final result is produced.
    __m512i idx = _mm512_set_epi64(7, 5, 3, 1, 6, 4, 2, 0);
    __m512i c = _mm512_permutexvar_epi64(idx, c_perm);
    // ------- End copied

    print_m512i_int8(&c);
    // Expected output: 1(x8), 2(x8), 4(x8), 8(x8), 16(x8), 32(x8), 64(x8), 128(x8), -128(x8)

    return 0;
}
```
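A quick Python-side cross-check of the same 8-bit pattern (hypothetical snippet, not part of the PR; the wrap to -128 matches the expected output noted in the C snippet above):
```python
# Not part of the PR: sanity check of the 8-bit left shift pattern from Python.
# Shifting int8 ones by 0..7 gives 1, 2, 4, ..., 64 and wraps to -128 for the
# shift by 7, matching the expected output of the C validation program.
import torch

a = torch.ones(8, dtype=torch.int8)
b = torch.arange(8, dtype=torch.int8)
print(a << b)  # tensor([1, 2, 4, 8, 16, 32, 64, -128], dtype=torch.int8)
```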
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88607 Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/peterbell10 commit df1df9d10a7a2f4d7b1327fa85d0bb5fb6e9b693 Author: Howard Huang Date: Fri Nov 11 11:44:00 2022 -0800 [16/N] Add _allgather_base custom op with CPU/CUDA implementation (#88889) Differential Revision: [D41227739](https://our.internmc.facebook.com/intern/diff/D41227739) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88889 Approved by: https://github.com/kwen2501 commit 3765621356c645ead1d712c5b7e4d57d6803cc81 Author: ydwu4 Date: Sat Nov 12 20:00:51 2022 +0000 torchdynamo support self.modules() for nn_module (#88695) This PR allows models to call self.modules() during dynamo tracing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88695 Approved by: https://github.com/voznesenskym commit 27dc03e09b6b1948e416a9fd78e6ca2b0a0bb1c7 Author: soulitzer Date: Fri Nov 11 11:51:22 2022 -0500 Turn internal assert when saved tensor is detached inplace into torch check (#88860) Fixes https://github.com/pytorch/pytorch/issues/88809 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88860 Approved by: https://github.com/albanD commit 4270bb37dacf7e3b2b784fa4ff4002ee6bf87e56 Author: Nikita Karetnikov Date: Sat Nov 12 00:41:57 2022 +0100 [primTorch] Improve `narrow` and `narrow_copy`: refs, tests, docs (#87045) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87045 Approved by: https://github.com/mruberry commit 6e5f736d86be09bd86a5da276ce2f5dcbe0bfc09 Author: Howard Huang Date: Fri Nov 11 08:21:48 2022 -0800 [15/N] Add allreduce_coalesced custom op with CPU/CUDA implementations (#88846) Differential Revision: [D41227740](https://our.internmc.facebook.com/intern/diff/D41227740) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88846 Approved by: https://github.com/kwen2501 commit ae2c668cc044d841853e2672d96bfe0afb38a89c Author: PyTorch MergeBot Date: Sat Nov 12 07:52:53 2022 +0000 Revert "[dynamo][api] Better support of torch.nn.Module (#88629)" This reverts commit c83348597b195f2da1cca0e8318c878b104bce5d. Reverted https://github.com/pytorch/pytorch/pull/88629 on behalf of https://github.com/anijain2305 due to job failing on master https://github.com/pytorch/pytorch/actions/runs/3449914495/jobs/5758267231 commit 6b775c42dd2d40992611fb5636e787560663902c Author: Jerry Zhang Date: Sat Nov 12 07:52:44 2022 +0000 [quant][executorch] Support quant fusion for reshape in quant in executorch stack (#88858) Summary: This diff added support for fusing "dq - reshape - q" to a reshape op, the op is needed in wakeword model Test Plan: buck test executorch/exir/tests:quant_fusion_pass Reviewed By: qihqi, JacobSzwejbka Differential Revision: D41111069 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88858 Approved by: https://github.com/JacobSzwejbka commit 34641c4384328ad9a3d2dc928de5b60a239427ee Author: PyTorch MergeBot Date: Sat Nov 12 05:16:41 2022 +0000 Revert "Add comprehensive minifier tests (#88022)" This reverts commit 5ff600aa6e40c6b4d426594bbb1f446f005b7fb3. Reverted https://github.com/pytorch/pytorch/pull/88022 on behalf of https://github.com/wconstab due to Seems to be causing CI failures relating to minifier test and some /tmp/ path not existing commit c83348597b195f2da1cca0e8318c878b104bce5d Author: Animesh Jain Date: Sat Nov 12 04:45:17 2022 +0000 [dynamo][api] Better support of torch.nn.Module (#88629) This is an API change, so please review carefully. 
With this PR, torchdynamo returns an `OptimizedModule` class object, a subclass of `torch.nn.Module`, when asked to optimize a `nn.Module` object. Most of the methods are redirected to the original `nn.Module`, which is installed as `_mod` in the `OptimizedModule`. This is helpful for many cases:
```
mod = MockModule()
opt_mod = torch._dynamo.optimize()(mod)
print(opt_mod)  # Works

opt_mod = opt_mod.to(device="cuda")
print(opt_mod)  # Works
opt_mod(input)  # Triggers recompile if necessary; earlier we were shedding the TorchDynamo wrapper
opt_mod.parameters()  # Refers to the original module
```
Topics unclear to me:
* I have overridden many methods to raise NotImplementedError. A careful review of those will be good.
* hooks
* For the optimized forward, should we call torchdynamo optimization on `__call__` or `forward`?
* What else to test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88629 Approved by: https://github.com/Chillee, https://github.com/jansel, https://github.com/msaroufim

commit d01bf1d1f11ab1fb9ae21a007138e2c4ecc31b63 Author: Andrew Gu Date: Sat Nov 12 01:05:46 2022 +0000

[FSDP] Introduce `ModuleWrapPolicy` for simplicity (#88450)

**BC Breaking Change**

This renames `unwrapped_params` to `nonwrapped_numel`. I prefer `nonwrapped` over `unwrapped` because "unwrap" suggests that some wrapping has been undone. I prefer `numel` over `params` because that is the unit of measurement; I think we should keep "params" to refer to `nn.Parameter`s themselves. This only breaks anything that passes `unwrapped_params` as a keyword argument, but I did not see anything that did that (except the one internal benchmark file, but that does not actually depend on our `pytorch` code).

In a follow-up, I want to rename `min_num_params` to `min_nonwrapped_numel` in `size_based_auto_wrap_policy`, which is also BC breaking. Again, this is to differentiate between "params" being `nn.Parameter`s and "numel" being the unit for `param.numel()`.

**Overview**

This PR introduces `ModuleWrapPolicy` as a lightweight layer over the existing `transformer_auto_wrap_policy`. The most common auto wrapping paradigm is:
```
module_classes: Set[Type[nn.Module]] = ...
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls=module_classes,
)
fsdp_model = FSDP(model, auto_wrap_policy=auto_wrap_policy, ...)
```
Now, users can instead write:
```
auto_wrap_policy = ModuleWrapPolicy(module_classes)
fsdp_model = FSDP(model, auto_wrap_policy=auto_wrap_policy, ...)
```
This hides the unused arguments expected from the callable (`recurse` and `unwrapped_params`/`nonwrapped_numel`).

`ModuleWrapPolicy` inherits from an abstract base class `FSDPPolicy` that expects a `policy` property. This decouples the construction of such `FSDPPolicy` classes from their actual `policy`, which must abide by the `_recursive_wrap` interface. Any existing auto wrap policy can be rewritten as a class that inherits from `FSDPPolicy`, so this approach is fully backward compatible from a functionality perspective.

I call this base class `FSDPPolicy` to generalize over the cases where we may not want to actually perform any nested wrapping. In reality, the policy is meant for constructing `FlatParameter`s, which just happened to be induced by a nested wrapping before. Given this, I am changing the constructor argument in `fully_shard()` to simply `policy` instead of `auto_wrap_policy`.

This PR migrates usages of `transformer_auto_wrap_policy` within our unit test suite to `ModuleWrapPolicy` as much as possible.
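As a rough illustration of the layering described above (a sketch under assumed names, not the PR's actual implementation), such a policy class can simply wrap the existing callable and expose it as a property:
```python
import functools
from abc import ABC, abstractmethod
from typing import Iterable, Type

import torch.nn as nn
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


class FSDPPolicySketch(ABC):
    """Sketch of an abstract base exposing a `policy` callable that follows
    the _recursive_wrap-style interface."""

    @property
    @abstractmethod
    def policy(self):
        ...


class ModuleWrapPolicySketch(FSDPPolicySketch):
    """Sketch: wrap every module whose type is in `module_classes`."""

    def __init__(self, module_classes: Iterable[Type[nn.Module]]):
        self._policy = functools.partial(
            transformer_auto_wrap_policy,
            transformer_layer_cls=set(module_classes),
        )

    @property
    def policy(self):
        return self._policy
```
Constructing the policy object up front hides `recurse` and the numel argument from user code, which is the simplification the PR is after.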
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88450 Approved by: https://github.com/zhaojuanmao commit b2b0a0d3baf6258fbf728572719937810fd890ce Author: PyTorch MergeBot Date: Sat Nov 12 03:21:06 2022 +0000 [vision hash update] update the pinned vision hash (#88920) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88920 Approved by: https://github.com/pytorchbot commit ae4074669ecbf2a6d8bf99db745d29dce98d0c10 Author: Chien-Chin Huang Date: Thu Nov 10 21:19:22 2022 +0000 [FSDP][state_dict][6/N] Remove most FSDP module dependency from _optim_utils (#88638) **What** This PR removes most `FullyShardedDataParallel` dependencies from `optim_utils`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88638 Approved by: https://github.com/awgu commit 4108367123c1b44289b5c731c3bb7022976b816d Author: Bin Bao Date: Fri Nov 11 20:41:36 2022 +0000 Exclude poolformer_m36 from the inductor model test (#88908) Summary: The root cause is still to be investigated. Issue tracked at https://github.com/pytorch/torchdynamo/issues/1856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88908 Approved by: https://github.com/malfet commit 1e2327baf7a2d9c63bef08e5f996ef983e199429 Author: mikey dagitses Date: Sat Nov 12 02:23:48 2022 +0000 fix fx tests (#88886) Summary: Some source files are missing and TPX couldn't handle the default test names. Test Plan: Rely on CI. Differential Revision: D41218564 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88886 Approved by: https://github.com/zou3519 commit 66736ff425d7163df0eed48e9944c8539e92b577 Author: Edward Z. Yang Date: Fri Nov 11 09:33:41 2022 -0500 Fix bug in OptionalTensorList (#88887) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88887 Approved by: https://github.com/anjali411 commit 2b166532f7ac280232daf6c44620e96e258867cf Author: Edward Z. Yang Date: Fri Nov 11 09:00:55 2022 -0500 Remove incorrect assert about hermetic state. (#88885) I'm not sure why I thought this assert was valid in the first place, and there's no comment about it. The assert is tantamount to saying, "no tensor objects should become dead via SafePyObject when hermetic mode is on." But suppose we run a Python GC while we're inside hermetic mode. This could result in us disposing non-hermetic tensors, which would hit decref. So the assert seems invalid. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88885 Approved by: https://github.com/anjali411, https://github.com/malfet commit 2cd05a2818bacbc2e252052b6b71085e4de16b0d Author: Jiaxu Zhu Date: Sat Nov 12 01:20:52 2022 +0000 Support torch.qint32 in Convert (#88871) Enable the `torch.qint32` when creating `quantize_per_tensor` function call in `convert_fx` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88871 Approved by: https://github.com/jerryzh168 commit a3f3ec8fac98151f31373ba3bcfe2d601584a840 Author: Will Constable Date: Fri Nov 11 21:22:49 2022 +0000 [FSDP+dynamo]: forward treats parameter-views as params (#88781) Dynamo+AotAutograd needs a way to wrap all tensors (whether inputs or params/buffers) in FakeTensor wrappers, and FSDP's mangling of parameters hides them from this wrapping. 
This PR unblocks running hf_Bert and hf_T5 with FSDP under dynamo, whether using recursive wrapping around transformer layers or only applying FSDP around the whole model. Perf/memory validation and possibly optimization is the next step.

`python benchmarks/dynamo/distributed.py --torchbench_model hf_Bert --fsdp --dynamo aot_eager`
`python benchmarks/dynamo/distributed.py --torchbench_model hf_Bert --fsdp --dynamo aot_eager --fsdp_wrap`
`python benchmarks/dynamo/distributed.py --torchbench_model hf_T5 --fsdp --dynamo aot_eager`
`python benchmarks/dynamo/distributed.py --torchbench_model hf_T5 --fsdp --dynamo aot_eager --fsdp_wrap`

The problem: Dynamo (actually aot_autograd) trips up with FSDP because it must wrap all input tensors in FakeTensor wrappers, and it only knows to wrap graph inputs or named_(parameters, buffers). FSDP's pre_forward hook sets views (which are not nn.Parameters) into the flatparam as attrs on the module with the same name as the original param, but they will not show up in named_parameters.
- in use_orig_params mode, FSDP still de-registers params during the pre-forward hook, then re-registers them post-forward
- during forward (between the hooks), the params are setattr'd on the module as regular view tensors, not nn.Parameters
- note: use_orig_params is the recommended way to use FSDP, and use_orig_params=False is being deprecated, so I only consider use_orig_params=True for this enablement

The solution:
- adding them to named_buffers is not possible because it interferes with how FSDP's `_apply` works
- since they are not actual nn.Parameters, register_parameter will complain about registering them
- simply setting `module._parameters[name] = view` seems to be a viable workaround, despite being hacky, and FSDP code does modify _parameters directly already.

Note: Manual checkpointing still isn't working with FSDP+dynamo, so that will have to be addressed in a follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88781 Approved by: https://github.com/ezyang, https://github.com/awgu

commit 5ff600aa6e40c6b4d426594bbb1f446f005b7fb3 Author: William Wen Date: Sat Nov 12 00:22:25 2022 +0000

Add comprehensive minifier tests (#88022)

Adds tests for https://github.com/pytorch/torchdynamo/issues/1241. To run: `pytest test/dynamo/test_minifier.py`. Actually runs the minifier launcher script and repro scripts, rather than just checking for the existence of the minifier launcher script.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88022 Approved by: https://github.com/mlazos, https://github.com/anijain2305

commit 37c5b42fa6597ebf7dbfb6db4ada2c7803950555 Author: Horace He Date: Fri Nov 11 19:17:47 2022 +0000

Fix matmul decomp to use reshape instead of contiguous().view() (#88832)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88832 Approved by: https://github.com/bertmaher, https://github.com/ngimel

commit 7c3adddd6c3fe1bda4a9e5bfb9f992a802329551 Author: Richard Zou Date: Wed Nov 9 12:20:16 2022 -0800

[functorch] delete some unused files (#88763)

Some post-merge cleanup.
- packaging/ was for building standalone windows binaries
- our flake8 config got superseded by PyTorch's.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88763 Approved by: https://github.com/samdow

commit a7fa423f48af8af220e9286a6b4c374d533f77e0 Author: Peter Bell Date: Fri Nov 11 14:41:35 2022 +0000

copy_: Short-circuit when self and src view the same data (#88884)

This comes up if you use inplace operators on a slice, e.g.
```python import torch a = torch.rand(1000000, device="cuda") a[::2] *= 2 ``` The last line looks as if it should be fully inplace, but is actually equivalent to: ```python tmp = a[::2] tmp *= 2 a[::2] = tmp ``` Which results in `mul_` and `copy_` being called. With this PR, the redundant copy becomes a no-op and the above example is 2x faster. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88884 Approved by: https://github.com/ngimel commit 6fe47b682fe1ba2dd2c7da02ff1bb06f8670e3a7 Author: Yanbo Liang Date: Fri Nov 11 22:31:32 2022 +0000 [Dynamo] Fix str(Guard.obj_weakref) bug to re-ennable support overriding __getattr__ (#88564) See my inline comments! Pull Request resolved: https://github.com/pytorch/pytorch/pull/88564 Approved by: https://github.com/ezyang, https://github.com/anijain2305 commit be8d88f8d0c6825b1b19354ffbaa4466aae0d3b8 Author: Kevin Tse Date: Thu Nov 10 18:33:09 2022 -0500 [DataLoader] Removing DataLoader2 related code (#88848) Removing these lines of code as `DataLoader2` has been added to [TorchData](https://github.com/pytorch/data). I'm importing this to confirm it will not impact internal codes. Differential Revision: [D41201578](https://our.internmc.facebook.com/intern/diff/D41201578) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88848 Approved by: https://github.com/ejguan commit f39cad50b765b6fd2f4927a4d1552fff5928c61e Author: Nikita Shulga Date: Fri Nov 11 22:07:34 2022 +0000 Make InductorCPU usable in internally (#88870) Test Plan: `buck2 test mode/opt //caffe2/test:test_inductor -- --exact 'caffe2/test:test_inductor - test_dtype_mismatch_issue_cuda (caffe2.test.inductor.test_torchinductor.CudaTests)'` Differential Revision: D41206109 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88870 Approved by: https://github.com/izaitsevfb commit fbc1878265374a159639993269d40a6e08503278 Author: BowenBao Date: Tue Nov 8 10:22:32 2022 -0800 [ONNX] Pretty print diagnostic logging (#88261) Adds pretty print diagnostic logging. For example ```python import io import torch from torch.onnx._internal import diagnostics class CustomAdd(torch.autograd.Function): @staticmethod def forward(ctx, x, y): return x + y @staticmethod def symbolic(g, x, y): return g.op("custom::CustomAdd", x, y) class M(torch.nn.Module): def forward(self, x): return CustomAdd.apply(x, x) torch.onnx.export(M(), torch.randn(3, 4), io.BytesIO()) ``` By default, observe minimum summary of diagnostics ``` ========= Diagnostic Run torch.onnx.export version 1.14.0a0+git90a69c5 ========= verbose: False, log level: Level.ERROR ======================= 0 NONE 0 NOTE 3 WARNING 0 ERROR ======================== 3 WARNING were not printed due to the log level. ``` Adjusting the `verbose` and `level` argument. ```python diagnostics.engine.pretty_print(verbose=True, level=diagnostics.levels.WARNING) ``` Prints full log. ``` =============================== 1 Diagnostic Run =============================== ========= Diagnostic Run torch.onnx.export version 1.14.0a0+git90a69c5 ========= verbose: True, log level: Level.WARNING ======================= 0 NONE 0 NOTE 3 WARNING 0 ERROR ======================== WARNING: node-missing-onnx-shape-inference ========================================== The shape inference of custom::CustomAdd type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. 
--------------------------- Stack: Python call stack --------------------------- frame: diagnostic = ExportDiagnostic(rule, level, message, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/diagnostics/_diagnostic.py:151 frame: n, utils._params_dict, GLOBALS.export_onnx_opset_version /home/bowbao/pytorch_dev/torch/onnx/_patch_torch.py:82 frame: <@beartype(torch.onnx._patch_torch._graph_op) at 0x7f62184b6710>:78 frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81 frame: return function(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_deprecation.py:30 frame: return g.op("custom::CustomAdd", x, y) test_pretty_print.py:14 frame: return symbolic_fn(g, *args) /home/bowbao/pytorch_dev/torch/onnx/utils.py:1716 frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81 frame: graph = _C._jit_pass_onnx(graph, operator_export_type) /home/bowbao/pytorch_dev/torch/onnx/utils.py:663 frame: <@beartype(torch.onnx.utils._optimize_graph) at 0x7f62180e05f0>:85 frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81 frame: module=module, /home/bowbao/pytorch_dev/torch/onnx/utils.py:1123 frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81 frame: dynamic_axes=dynamic_axes, /home/bowbao/pytorch_dev/torch/onnx/utils.py:1539 frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81 frame: export_modules_as_functions=export_modules_as_functions, /home/bowbao/pytorch_dev/torch/onnx/utils.py:519 frame: <@beartype(torch.onnx.utils.export) at 0x7f62180e0170>:347 frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81 frame: torch.onnx.export(M(), torch.randn(3, 4), io.BytesIO()) test_pretty_print.py:22 ---------------------------- Stack: C++ call stack ----------------------------- frame: () frame: ( + 0x88411b (0x7f625b36011b in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: (torch::jit::UpdateReliable(torch::jit::Value*, std::pair const&) + 0x7d3 (0x7f625b351743 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: (torch::jit::UpdateReliable(torch::jit::Node*) + 0x4f (0x7f625b35198f in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: (torch::jit::ONNXShapeTypeInference(torch::jit::Node*, std::map, std::allocator >, c10::IValue, std::less, std::allocator > >, std::allocator, std::allocator > const, c10::IValue> > > const&, int) + 0xac9 (0x7f625b357179 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: ( + 0xabd026 (0x7f625b599026 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: ( + 0x3c0fda (0x7f625ae9cfda in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: () WARNING: node-missing-onnx-shape-inference ========================================== The shape inference of custom::CustomAdd type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. 
--------------------------- Stack: Python call stack --------------------------- frame: diagnostic = ExportDiagnostic(rule, level, message, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/diagnostics/_diagnostic.py:151 frame: graph, params_dict, GLOBALS.export_onnx_opset_version /home/bowbao/pytorch_dev/torch/onnx/utils.py:688 frame: <@beartype(torch.onnx.utils._optimize_graph) at 0x7f62180e05f0>:85 frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81 frame: module=module, /home/bowbao/pytorch_dev/torch/onnx/utils.py:1123 frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81 frame: dynamic_axes=dynamic_axes, /home/bowbao/pytorch_dev/torch/onnx/utils.py:1539 frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81 frame: export_modules_as_functions=export_modules_as_functions, /home/bowbao/pytorch_dev/torch/onnx/utils.py:519 frame: <@beartype(torch.onnx.utils.export) at 0x7f62180e0170>:347 frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81 frame: torch.onnx.export(M(), torch.randn(3, 4), io.BytesIO()) test_pretty_print.py:22 ---------------------------- Stack: C++ call stack ----------------------------- frame: () frame: ( + 0x88411b (0x7f625b36011b in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: (torch::jit::UpdateReliable(torch::jit::Value*, std::pair const&) + 0x7d3 (0x7f625b351743 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: (torch::jit::UpdateReliable(torch::jit::Node*) + 0x4f (0x7f625b35198f in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: (torch::jit::ONNXShapeTypeInference(torch::jit::Node*, std::map, std::allocator >, c10::IValue, std::less, std::allocator > >, std::allocator, std::allocator > const, c10::IValue> > > const&, int) + 0xac9 (0x7f625b357179 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: ( + 0x87d6d1 (0x7f625b3596d1 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: (torch::jit::ONNXShapeTypeInference(std::shared_ptr&, std::map, std::allocator >, c10::IValue, std::less, std::allocator > >, std::allocator, std::allocator > const, c10::IValue> > > const&, int) + 0x33 (0x7f625b359cf3 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: ( + 0xabdbae (0x7f625b599bae in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: ( + 0x3c0fda (0x7f625ae9cfda in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: () WARNING: node-missing-onnx-shape-inference ========================================== The shape inference of custom::CustomAdd type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. 
--------------------------- Stack: Python call stack --------------------------- frame: diagnostic = ExportDiagnostic(rule, level, message, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/diagnostics/_diagnostic.py:151 frame: graph, params_dict, GLOBALS.export_onnx_opset_version /home/bowbao/pytorch_dev/torch/onnx/utils.py:1179 frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81 frame: dynamic_axes=dynamic_axes, /home/bowbao/pytorch_dev/torch/onnx/utils.py:1539 frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81 frame: export_modules_as_functions=export_modules_as_functions, /home/bowbao/pytorch_dev/torch/onnx/utils.py:519 frame: <@beartype(torch.onnx.utils.export) at 0x7f62180e0170>:347 frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81 frame: torch.onnx.export(M(), torch.randn(3, 4), io.BytesIO()) test_pretty_print.py:22 ---------------------------- Stack: C++ call stack ----------------------------- frame: () frame: ( + 0x88411b (0x7f625b36011b in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: (torch::jit::UpdateReliable(torch::jit::Value*, std::pair const&) + 0x7d3 (0x7f625b351743 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: (torch::jit::UpdateReliable(torch::jit::Node*) + 0x4f (0x7f625b35198f in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: (torch::jit::ONNXShapeTypeInference(torch::jit::Node*, std::map, std::allocator >, c10::IValue, std::less, std::allocator > >, std::allocator, std::allocator > const, c10::IValue> > > const&, int) + 0xac9 (0x7f625b357179 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: ( + 0x87d6d1 (0x7f625b3596d1 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: (torch::jit::ONNXShapeTypeInference(std::shared_ptr&, std::map, std::allocator >, c10::IValue, std::less, std::allocator > >, std::allocator, std::allocator > const, c10::IValue> > > const&, int) + 0x33 (0x7f625b359cf3 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: ( + 0xabdbae (0x7f625b599bae in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: ( + 0x3c0fda (0x7f625ae9cfda in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so)) frame: () ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88261 Approved by: https://github.com/abock, https://github.com/justinchuby commit ea0ec9d71ca5428bedfcaf74990c109af8cb9a64 Author: efiks <5167930+efiks@users.noreply.github.com> Date: Fri Nov 11 21:58:23 2022 +0000 [tourch] BatchBoxCox - fix numerical issue in vectorized code (#88875) Summary: Usage of fast math in BatchBoxCox kernel provided different math results between dev and optimized versions which cause few internal test to fail. 
For now disabling the compiler optimized version and relying on ATEN vectors.

Differential Revision: D41211784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88875 Approved by: https://github.com/hyuen

commit dfb4b73e45896851d734e34a9902fd8b151797fe Author: Richard Barnes Date: Fri Nov 11 21:51:10 2022 +0000

Fix unused variable 'options' warning in RNN.cpp (#88753)

Fixes
```
/home/rbarnes/pytorch/aten/src/ATen/native/cudnn/RNN.cpp:73:17: warning: unused variable 'options' [-Wunused-variable]
  TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory);
                ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88753 Approved by: https://github.com/soumith

commit 7aa144ac54808419f7a702ef0c5a4445dba4c587 Author: Chien-Chin Huang Date: Thu Nov 10 21:19:21 2022 +0000

[FSDP][state_dict][5/N] Remove the FSDP module dependency from _state_dict_utils (#88637)

**What**
This PR completely removes the `FullyShardedDataParallel` dependency from `_state_dict_utils` -- `_state_dict_utils` now depends only on `_FSDPState` and all the utils modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88637 Approved by: https://github.com/awgu

commit 575e02df5357ef6216b2d2db2424d10432679df2 Author: Nikita Shulga Date: Fri Nov 11 21:19:26 2022 +0000

Fix CUDNN_PATH handling on Windows (#88898)

Fixes https://github.com/pytorch/pytorch/issues/88873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88898 Approved by: https://github.com/kit1980

commit f74946324e794d2332251d0497dc8ff4f831caa9 Author: kshitij12345 Date: Fri Nov 11 21:11:12 2022 +0000

[fix] allow saving python attr on Tensor and Parameter via torch.save (#81616)

Fixes: https://github.com/pytorch/pytorch/issues/72129

TODO:
* [x] Fix for Parameter

Benchmark (Measurable diff for small tensors)
```
[-------------- Save and Load --------------]
                    |  After PR  |  Before PR
1 threads: ----------------------------------
      ()            |    111.7   |    106.9
      (4, 4)        |    114.4   |    109.2
      (128, 128)    |    135.2   |    128.3
      (1024, 1024)  |   1431.9   |   1431.3

Times are in microseconds (us).
```
Benchmark Script
```python
import torch
from torch.testing._internal.common_utils import BytesIOContext
from torch.utils import benchmark
import pickle

shapes = ((), (4, 4), (128, 128), (1024, 1024))
sizes = [1, 64, 1024, 10000]
results = []

def save_load_fn(t):
    with BytesIOContext() as f:
        torch.save(t, f)
        f.seek(0)
        torch.load(f)

for shape in shapes:
    t = torch.randn(shape)
    label = 'Save and Load'
    sub_label = f'{shape}'
    results.append(benchmark.Timer(
        stmt='save_load_fn(t)',
        globals={'t': t, 'save_load_fn': save_load_fn},
        label=label,
        sub_label=sub_label,
        description='Before PR',
    ).blocked_autorange(min_run_time=2))

compare = benchmark.Compare(results)
compare.print()

with open('before_pr.pkl', 'wb') as f:
    pickle.dump(results, f)
```
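For context on what the change enables, a minimal usage sketch (assuming a plain Python attribute set on a tensor round-trips through `torch.save`/`torch.load`; the attribute name here is illustrative):
```python
import io
import torch

t = torch.randn(4, 4)
t.my_note = "calibration batch 7"  # plain Python attribute on a Tensor

buf = io.BytesIO()
torch.save(t, buf)
buf.seek(0)
loaded = torch.load(buf)

print(getattr(loaded, "my_note", None))  # expected: "calibration batch 7"
```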
NOTE : **BC-Breaking** : After this PR, all tensors (also regular tensors) will be serialised using `_rebuild_from_type_v2`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81616 Approved by: https://github.com/albanD, https://github.com/kurtamohler commit ba4d5aae06bde7c0ad045e54b7ad86f4542efb86 Author: PyTorch MergeBot Date: Fri Nov 11 19:13:05 2022 +0000 Revert "rename DisableTorchFunction to DisableTorchFunctionSubclass (#88218)" This reverts commit 7f28be10e5e71efda37800384fa897785499bed1. Reverted https://github.com/pytorch/pytorch/pull/88218 on behalf of https://github.com/izaitsevfb due to BC-breaking change, D41211901 commit 4e5d7afe84c01ed730f0f43395d7fa0542e81f3a Author: PyTorch MergeBot Date: Fri Nov 11 19:08:30 2022 +0000 Revert "add DisableTorchFunction that matches DisableTorchDispatch (#88219)" This reverts commit c0ecce15b5a54ff0185f9976e6bfb6f3a7de698d. Reverted https://github.com/pytorch/pytorch/pull/88219 on behalf of https://github.com/izaitsevfb due to BC-breaking change, D41211901 commit 9d7d21f5691979f728f42a709e1a47ab3e905342 Author: BowenBao Date: Tue Nov 8 10:22:31 2022 -0800 [ONNX] Add stack info to diagnostics (#87258) ~~Investigating strange bug releasing 'graph' right when returning from `_C._jit_pass_onnx`.~~ ~~Can be repro-ed locally via `test_cpp_diagnose`, with changes in this PR.~~ Resolved by https://github.com/pytorch/pytorch/pull/87829. This PR adds methods to record stack backtrace information to diagnostics. * #87830 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87258 Approved by: https://github.com/abock commit 3d1c5c89ed27ff16601aecf7834a6bd06f578c45 Author: Chien-Chin Huang Date: Thu Nov 10 21:19:21 2022 +0000 [FSDP][state_dict][4/N] Move the core logic of summon full parameters to _unshard_params_utils.py (#88636) **What** `_summon_full_parameters` is required for state_dict. To enable composable FSDP state_dict, `_summon_full_params` must be accessible without FullyShardedDataParall. This PR move the core logic of `_summon_full_params` to `_unshard_params_utils`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88636 Approved by: https://github.com/awgu commit 5f0783bd6d27a0a239263b943d626c533b8b9a90 Author: Thiago Crepaldi Date: Fri Nov 11 17:43:46 2022 +0000 Fix ATen Fallback for BUILD_CAFFE2=0 for ONNX-only ops (#88504) Follow-up for #87735 Once again, because BUILD_CAFFE2=0 is not tested for ONNX exporter, one scenario slipped through. A use case where the model can be exported without aten fallback when operator_export_type=ONNX_ATEN_FALLBACK and BUILD_CAFFE2=0 A new unit test has been added, but it won't prevent regressions if BUILD_CAFFE2=0 is not executed on CI again Fixes #87313 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88504 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit 8ff2e34ca6905404aba35a432acf667ee6a13c6e Author: Elias Ellison Date: Fri Nov 11 04:25:11 2022 +0000 Take input striding for conv forward based on eager output (#88706) From discussion with @Chillee and @ngimel we'll likely need further fixes to ensure that we hit channels last kernels but this is still worth landing in its own right. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88706 Approved by: https://github.com/ngimel commit adfbd831cf59111c3d3a4a50ba6372bba94b63d1 Author: PyTorch MergeBot Date: Fri Nov 11 17:03:25 2022 +0000 Revert "[Autograd] Use in-place input accumulation fast path for dense Tensors. 
(#88339)" This reverts commit 8f66ae413f8c9d7f2418d7f0b9f69d409c455b46. Reverted https://github.com/pytorch/pytorch/pull/88339 on behalf of https://github.com/mehtanirav due to Internal test failures commit 89a326ff7ea56a1d735d26800b07a10e35c2dff4 Author: Kurt Mohler Date: Fri Nov 11 16:57:05 2022 +0000 Explicitly check filelike arg of `torch.save` (#88867) Fixes #88793 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88867 Approved by: https://github.com/ezyang commit a6832b08a3f6c1b425a075fe204a1f21361f33d9 Author: Elias Ellison Date: Tue Nov 8 19:23:21 2022 +0000 Regularize bernouilli_ with bernouilli decomp (#88349) Fix for https://github.com/pytorch/torchdynamo/issues/1796. Just like the other [bernouilli decomp](https://github.com/pytorch/pytorch/blob/master/torch/_inductor/decomposition.py#L302) we need to pass `dtype=float32` to avoid `"check_uniform_bounds" not implemented` errors. Are we planning on enabling `TEST_WITH_TORCHINDUCTOR` ? Do I need to change anything with the tests ? Pull Request resolved: https://github.com/pytorch/pytorch/pull/88349 Approved by: https://github.com/desertfire commit 1e8f95ace16cb617d71f8f8254c1d5bafd9f586c Author: Nikita Karetnikov Date: Fri Nov 11 13:51:18 2022 +0100 Symintify `broadcast_to` (#88776) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88776 Approved by: https://github.com/ezyang commit d615d1228932eaa5e026f5399e099f2036d2379b Author: anjali411 Date: Fri Nov 11 15:24:28 2022 +0000 Add meta impl for topk (#88694) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88694 Approved by: https://github.com/ezyang commit 3c7f96665e784a793d2d1a120ea8fe370b3f6d81 Author: Chien-Chin Huang Date: Thu Nov 10 19:54:56 2022 +0000 [FSDP][state_dict][3/N] Change how state_dict utils access attributes in _FSDPState (#88635) **What This PR Does** _state_dict_utils currently accesses the FSDP states through module. To enable composable FSDP state_dict, these accesses need to go through _FSDPState. module is still required for most APIs as state_dict has to access per-module information. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88635 Approved by: https://github.com/awgu commit b92acee8f83c7852194d6979362aea0c240709da Author: soulitzer Date: Thu Nov 10 19:08:42 2022 -0500 Add context manager to allow mutation on saved tensors (#79056) Pull Request resolved: https://github.com/pytorch/pytorch/pull/79056 Approved by: https://github.com/albanD commit 91b71cdbe4f31006fad91f9dd460123677a7c625 Author: Bin Bao Date: Wed Nov 9 20:39:50 2022 +0000 [dynamo] Add torch.device to is_safe_constant (#88766) Test Plan: ``` PYTORCH_TEST_WITH_DYNAMO=1 python test/test_torch.py -k test_advancedindex_mixed_cpu_devices_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88766 Approved by: https://github.com/jansel commit 324ac93a43a93f671bb34b835926b22d13442735 Author: Chien-Chin Huang Date: Tue Nov 8 00:16:14 2022 +0000 [FSDP][state_dict][2/N] Move state_dict related enums/dataclasses/states to state_dict_utils.py, api.py and init_state_dict() (#88481) **Motivation**: Several Enums, Dataclasses and states defined in fully_sharded_data_paralle.py should be moved to a place where the composable FSDP can access. This PR does the move. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88481 Approved by: https://github.com/rohan-varma, https://github.com/awgu commit ee91c328da5739ce03b3127cd7c542ce505212b8 Author: Michael Gschwind Date: Fri Nov 11 12:19:31 2022 +0000 Fix cuda/cpu check on NoneType (#88854) Summary: Fix cuda/cpu check on NoneType Test Plan: sabdcastle/ github CI/CD Differential Revision: D41203955 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88854 Approved by: https://github.com/drisspg, https://github.com/ngimel commit d15a6b0c975b9e1e90ed4e951071e5269c10ac5b Author: kshitij12345 Date: Fri Nov 11 08:51:26 2022 +0000 Error on ZeroTensor serialization (#88803) Follow-up : https://github.com/pytorch/pytorch/pull/88182#issuecomment-1308628415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88803 Approved by: https://github.com/anjali411 commit b843f4db0a26aae6536e6b971f73bcc5af21c90a Author: AllenTiTaiWang Date: Wed Nov 9 17:41:10 2022 +0000 [ONNX] Add test case for onnx::Max scalar type (#88751) Referenced by minimum cases Pull Request resolved: https://github.com/pytorch/pytorch/pull/88751 Approved by: https://github.com/wschin, https://github.com/BowenBao commit 396c3b1d88d7624938a2bb0b287f2a19f1e89bb4 Author: Eddie Yan Date: Fri Nov 11 05:23:48 2022 +0000 Use `atomicAdd` for `bfloat16` in Ampere and above (#84981) WIP to fix extremely slow `scatter_add` issue vs. fp16. The current changes seem to improve performance, but it still appears to lag behind the fp16 equivalent. CC @ngimel @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/84981 Approved by: https://github.com/ngimel commit a6d72f44a4e8b6e9d2e878f30fd8b1d3e1197f0e Author: AllenTiTaiWang Date: Wed Nov 9 17:27:22 2022 +0000 [ONNX] Add onnx::Max into standard Op for scalar type alignment (#88750) Easy fix for onnx::Max ScalarType Pull Request resolved: https://github.com/pytorch/pytorch/pull/88750 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit 0de8f047c1cc950c59b0448b9b78dafc0202c43f Author: PyTorch MergeBot Date: Fri Nov 11 04:19:08 2022 +0000 Revert "[dynamo] fixes dict changed during runtime error (#87526)" This reverts commit cf04b36ce8f531730210b03eaa347977a1c2d75c. Reverted https://github.com/pytorch/pytorch/pull/87526 on behalf of https://github.com/anijain2305 due to error reported commit 310335de48ab9d8bcd33b98f3f71ef88ae4bd45c Author: Jane Xu Date: Fri Nov 11 04:02:44 2022 +0000 Update lr_scheduler.pyi to match lr_scheduler.py (#88818) Following #88503, we should also update the pyi file Pull Request resolved: https://github.com/pytorch/pytorch/pull/88818 Approved by: https://github.com/soulitzer commit 86b7aa26f0bb8878d925a625af45d16d4bb2f2af Author: Wei-Sheng Chin Date: Fri Nov 11 03:49:27 2022 +0000 Fix FakeTensorProp on Module with Parameters or Buffers (#88700) In `FakeTensorMode.__torch_dispatch__`, the output is now always computed by meta kernels in ```python try: with in_kernel_invocation_manager(self): r = func(*args, **kwargs) # <----- "r" can be a real tensor. except NotImplementedError as not_implemented_error: if not self.allow_fallback_kernels: raise not_implemented_error return run_fallback_kernel(self, func, args, kwargs, not_implemented_error) return self.wrap_meta_outputs_with_default_device_logic(r, func, args, kwargs) ``` For example, I observed a CPU tensor is generated when executing `aten.addmm` when running `FakeTensorProp`. 
Therefore, I'd like to allow `FakeTensorMode` to wrap real tensor as `FakeTensor` during the computation. Does this PR look a good direction to fix this problem? If yes, I can go ahead and add some tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88700 Approved by: https://github.com/eellison, https://github.com/ezyang commit c4fc5d372f3db37380fe213b5726403cb1330d5d Author: Chien-Chin Huang Date: Mon Nov 7 23:46:29 2022 +0000 [FSDP][state_dict][1/N] Moving state_dict logic to pre_state_dict_hook (#87900) This is one step toward the ultimate goal: remove the overwritten state_dict in FSDP. All the logic should be either in `pre_state_dict_hook` or `post_state_dict_hook`. Since current `nn.Module` does not support `pre_state_dict_hook`, this PR mimic `pre_state_dict_hook` by calling the pre hook inside post the hook, effectively ditching all the work done by `nn.Module.state_dict`. Once `pre_state_dict_hook` is supported by `nn.Module`, these pre hook calls can be moved out from the post hooks and be registered to `nn.Module.pre_state_dict_hook`. The major issue of this temporary solution is that `post_state_dict_hook` is called from the leaf node to the root node. This makes the `module._lazy_init()` invalid as FSDP assumes `_lazy_init()` to be called from the root. As a result, `FSDP.state_dict` currently contains only one logic -- calling `module._lazy_init()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87900 Approved by: https://github.com/rohan-varma commit 9d09968bbe05fc6d7d7c3d8b1acfbe1b1b1413a8 Author: Emil Lynegaard Date: Fri Nov 11 03:34:54 2022 +0000 Disable check for dropout in MultiheadAttention fast_path (#88831) Since we already enforce eval mode for the fast_path, we do not need to also check for a falsy dropout value, as a model trained with dropout will have a non-zero dropout during eval mode, even though it won't be applied. Fixes #88806 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88831 Approved by: https://github.com/drisspg commit 3082378701605884ff07f7ba7984864340b19b34 Author: PyTorch MergeBot Date: Fri Nov 11 03:33:55 2022 +0000 [vision hash update] update the pinned vision hash (#88853) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88853 Approved by: https://github.com/pytorchbot commit 495e7b1c729e64693e794ea22640b4552816f0ef Author: Sherlock Huang Date: Thu Nov 10 21:22:29 2022 +0000 Ref for aten.full; symint changes in prim (#88762) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88762 Approved by: https://github.com/ezyang commit 3fbf748f2109de408bd47efb1a43e3897d7a775c Author: Michael Voznesensky Date: Fri Nov 11 02:30:29 2022 +0000 Assert we have triton before scheduling on triton (#88849) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88849 Approved by: https://github.com/wconstab, https://github.com/ngimel, https://github.com/jansel commit fc9e36dd426d4747bb7c71ee93bcbaa700bda01d Author: anjali411 Date: Thu Nov 10 22:41:47 2022 +0000 Add meta support for scalar_tensor and argmax (#88590) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88590 Approved by: https://github.com/albanD commit c961e45ee559a61bfb4f1e8a548e574ef89d3102 Author: Nikolay Korovaiko Date: Thu Nov 10 12:21:50 2022 -0800 handle zero dims in reductions (#88280) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88280 Approved by: https://github.com/ngimel commit 534ae6ae4790aec1b148b7e878ae60828ae45ac0 Author: Ryan Spring Date: Fri Nov 11 01:08:16 2022 +0000 [primTorch] Implement group norm reference (#87054) Add group norm reference Split from #81191 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87054 Approved by: https://github.com/mruberry commit 072834d56dada58f99216ce398fb57cce57968a9 Author: HDCharles Date: Tue Nov 8 07:59:12 2022 -0800 [ao] qconfig_mapping.py fixing public v private (#87518) Summary: made _GLOBAL_DICT_KEY, _OBJECT_TYPE_DICT_KEY, _MODULE_NAME_REGEX_DICT_KEY, _MODULE_NAME_DICT_KEY, _MODULE_NAME_OBJECT_TYPE_ORDER_DICT_KEY private Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D40709278](https://our.internmc.facebook.com/intern/diff/D40709278) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87518 Approved by: https://github.com/jcaip commit f9221bf53b376d1284e2356b716c2cd47fcd65f2 Author: Ian Graves Date: Fri Nov 11 00:19:20 2022 +0000 [pytorch] Enable memory map file support for Android, Apple, and CXX (#88545) Summary: See title. Left Windows out so it still compiles. Test Plan: Add a `#fail` below [this line](https://fburl.com/code/p0mlhlw4) and build for various platforms and confirm it fails which proves the `#ifdef` was hit. ``` buck2 build xplat/langtech/tuna/cli:tuclixAndroid buck2 build xplat/langtech/tuna/cli:tuclix ``` CI/CD for the rest. Differential Revision: D41054824 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88545 Approved by: https://github.com/qihqi commit 8441443132106fd673a81cd8f6728b332d16f837 Author: PyTorch MergeBot Date: Thu Nov 10 23:56:49 2022 +0000 Revert "Add nondeterministic error for `scatter` (#88244)" This reverts commit e940a2f8e2a3aa9d98291e73b3d40fcffb6182c8. 
Reverted https://github.com/pytorch/pytorch/pull/88244 on behalf of https://github.com/mehtanirav due to Internal test failures commit 62ef15e320f4a0aaa2f39296e9299f56926fb7c9 Author: Nikita Shulga Date: Thu Nov 10 23:52:27 2022 +0000 [MPS] Fix `test_embedding_dense_backward` (#88847) By copying randomly initialized weights distribution from MPS `nn.Embedding` to `cpu` Test plan: `python test_mps.py -k test_embedding_dense_backward --repeat 150` Fixes https://github.com/pytorch/pytorch/issues/88679 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88847 Approved by: https://github.com/seemethere commit b30222e0c481f29fe0785dde518c590ac392e9a2 Author: Yanbo Liang Date: Thu Nov 10 23:47:21 2022 +0000 [Dynamo] Add complete support for Tensor.is_contiguous (#88407) Fixes https://github.com/pytorch/torchdynamo/issues/1783 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88407 Approved by: https://github.com/jansel commit ae01615d7558d02383efe673ec0b92e2abe40db5 Author: Dmytro Dzhulgakov Date: Thu Nov 10 23:44:49 2022 +0000 Fix cupti search path in CMake (#88657) Minor fix for when cuda is installed via conda. In this case the libraries are in `lib` and not `lib64`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88657 Approved by: https://github.com/kit1980, https://github.com/malfet commit d9ad08ce8a07a3d17df397051b32591f4446edfa Author: Sherlock Huang Date: Thu Nov 10 20:35:52 2022 +0000 Symbolic shape: sym_floor , sym_sqrt, sym_int (#88760) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88760 Approved by: https://github.com/ezyang commit cc04cf50bfb6110e4c1c5889ad7da626dafac384 Author: Yanbo Liang Date: Thu Nov 10 23:37:29 2022 +0000 [Inductor] Fix lowmem_dropout() missing 1 required positional argument: 'p' (#88716) Fixes error from 7k github models: https://github.com/jansel/pytorch-jit-paritybench/blob/master/generated/test_GuYuc_WS_DAN_PyTorch.py Error: ``` TypeError: lowmem_dropout() missing 1 required positional argument: 'p' While executing %lowmem_dropout : [#users=1] = call_function[target=torch._inductor.overrides.lowmem_dropout](args = (%avg_pool2d_9,), kwargs = {training: False}) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88716 Approved by: https://github.com/ngimel, https://github.com/jansel, https://github.com/desertfire commit 500fd65531e77deb7784d3ac4f78c5cbe21efe41 Author: BowenBao Date: Tue Nov 8 10:22:31 2022 -0800 [ONNX] Create common ExportTestCase base class (#88145) Refactor out a common base class `ExportTestCase`, for common things in `setUp`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88145 Approved by: https://github.com/justinchuby, https://github.com/abock, https://github.com/AllenTiTaiWang commit 20ae19aa1dd307f9bdde0754c327ffb69eef13c0 Author: BowenBao Date: Tue Nov 8 10:22:31 2022 -0800 [ONNX] Improve diagnostic message formatting (#87830) * Reflect required arguments in method signature for each diagnostic rule. Previous design accepts arbitrary sized tuple which is hard to use and prone to error. ![image](https://user-images.githubusercontent.com/9376104/200381982-d1e905f0-a159-4ef5-8d2e-070524e8f5bf.png) * Removed `DiagnosticTool` to keep things compact. * Removed specifying supported rule set for tool(context) and checking if rule of reported diagnostic falls inside the set, to keep things compact. * Initial overview markdown file. * Change `full_description` definition. Now `text` field should not be empty. 
And its markdown should be stored in `markdown` field. * Change `message_default_template` to allow only named fields (excluding numeric fields). `field_name` provides clarity on what argument is expected. * Added `diagnose` api to `torch.onnx._internal.diagnostics`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87830 Approved by: https://github.com/abock commit a6610faa93ac008c088bcbe26bdbb56de8275cf1 Author: HDCharles Date: Tue Nov 8 07:59:11 2022 -0800 [ao] qconfig_mapping_utils.py fixing public v private (#87517) Summary: made _get_object_type_qconfig, _get_module_name_regex_qconfig, _get_module_name_qconfig, _maybe_adjust_qconfig_for_module_type_or_name, _get_flattened_qconfig_dict _update_qconfig_for_qat private Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D40709279](https://our.internmc.facebook.com/intern/diff/D40709279) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87517 Approved by: https://github.com/jcaip commit c1553880de95845c5a194247c683872949d66cd6 Author: Michael Lazos Date: Thu Nov 10 21:38:04 2022 +0000 Have kernel names include fused ops (#88624) - Propagates origin fx nodes through inlining during lowering - Concatenates op names into kernel name - Adds config to cap the number of ops in the kernel name so they don't get too long Caveats: - The ordering in the name may not match the order that the ops are executed in the kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624 Approved by: https://github.com/anijain2305, https://github.com/jansel commit ad2eba802c04394875af0f00b985f7f338423f1e Author: HDCharles Date: Tue Nov 8 07:59:11 2022 -0800 [ao] fuser_method_mappings.py fixing public v private (#87516) Summary: made _get_valid_patterns, _DEFAULT_PATTERN_TO_FUSER_METHOD, _reverse3, _reverse2, _reverse_sequential_wrapper2, _DEFAULT_OP_LIST_TO_FUSER_METHOD, _sequential_wrapper2 private Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D40709281](https://our.internmc.facebook.com/intern/diff/D40709281) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87516 Approved by: https://github.com/jcaip commit 37b468ac777ba548a2808010fd2f1b146b779fe0 Author: maxren Date: Wed Nov 9 15:33:57 2022 -0800 [xnnpack][lite-int][on-device] rebuild serialized modules at runtime (#88780) This is the on-device runtime work. We modify the compile and execute from our hacky solution from before to what will actually be running at runtime. First we rebuild our graph from the serialized flatbuffer string. We also introduce a runtime wrapper that inherits CustomClassHolder that allows us to forward along the built xnngraph runtime to our execute function Once the subgraph object has been rebuilt by our we pass it along to the runtime wrapper for us to forward along to execute At execute we prep the input/outputs and invoke the runtime using our runtime wrapper. Finally we forward those results to our execution Differential Revision: [D39413031](https://our.internmc.facebook.com/intern/diff/D39413031/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39413031/)! 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88780 Approved by: https://github.com/digantdesai commit de38c8769835ab0efa055baaf7605be37e410417 Author: Catherine Lee Date: Thu Nov 10 21:32:41 2022 +0000 Use run_test in MPS (#88829) Run mps through run_test to get disable test infra, create xml files (which can then be used for flakiness detection), and reruns Also added the workflow steps for uploading the xml files Pull Request resolved: https://github.com/pytorch/pytorch/pull/88829 Approved by: https://github.com/malfet, https://github.com/huydhn commit 1ae772a663f772171f0c5d6d7d311792f331206a Author: Bert Maher Date: Thu Nov 10 06:56:26 2022 -0800 [inductor] Remove import check for fast_flush (#88812) https://github.com/pytorch/pytorch/pull/88557/ has a guard to make sure that triton's `do_bench` includes the `fast_flush` argument. Since we've updated Triton to a sufficiently recent revision, we can remove that guard. Differential Revision: [D41185280](https://our.internmc.facebook.com/intern/diff/D41185280/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88812 Approved by: https://github.com/soumith commit 3a4e8736ad66db2089cbcb3a24cf779aab3a7564 Author: maxren Date: Wed Nov 9 15:33:00 2022 -0800 [xnnpack][on-device] compiler --> executor object (#88779) This is purely to abstract away the subgraph rebuild from the flatbuffer object. CompileModel return an executor object which we can use to setup inputs and run forward with. We Include ATen/utils for torch_check, this will be changed when moving to executorch Differential Revision: [D40733163](https://our.internmc.facebook.com/intern/diff/D40733163/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88779 Approved by: https://github.com/digantdesai commit 394b998de2228a4b4730c52b50975a2ecf756049 Author: Mark Saroufim Date: Thu Nov 10 21:04:35 2022 +0000 sub setup.py install -> develop (#88507) If someone is building the project from source they're likely a contributor for which develop will be much more useful. For people that want to try the latest and greatest they can leverage the nightlies Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/88507 Approved by: https://github.com/malfet commit d5e1e2f0fcd4e0602295bfaf80b8aeb80c86a70d Author: maxren Date: Wed Nov 9 15:31:44 2022 -0800 [xnnpack][on-device] executor class (#88778) Executor object used to wrap our xnn_runtime object. The ideal flow of this object looks as such: ``` executor.set_inputs(vector inputs, vector outputs) executor.forward() ``` This will likely be returned by our delegate compile and given over to execute in order to run inference using the xnn runtime ``` ``` These Aten functions are included in order to use at::Tensor when setting the inputs, this will change when used for Executorch because we will be switching from at::Tensor to whatever tensor abstraction is used for ET. Seems like they have the same call for `.data_ptr()`, so realistically all logic here will be the same. ATen/Utils is used for TORCH_CHECK. We will switch to ET_CHECK_MESSAGE for executorch. 
Differential Revision: [D40733121](https://our.internmc.facebook.com/intern/diff/D40733121/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88778 Approved by: https://github.com/digantdesai commit 29550e2c1df4cf3ef949e8f1ef973fd5e103a2d3 Author: PyTorch MergeBot Date: Thu Nov 10 20:56:30 2022 +0000 Revert "[Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#88566)" This reverts commit 48b58930cbfa725ac25a9303d496c76bf983574d. Reverted https://github.com/pytorch/pytorch/pull/88566 on behalf of https://github.com/huydhn due to This change breaks trunk https://hud.pytorch.org/pytorch/pytorch/commit/48b58930cbfa725ac25a9303d496c76bf983574d commit 90cf14ddf691bfae2d5c793376c68921b7111fde Author: erjia Date: Thu Nov 10 19:54:19 2022 +0000 [DataPipe] Deprecating drop_empty_batches from Filter and other functional APIs (#88693) - Deprecating based on https://github.com/pytorch/data/issues/163 Corresponding PRs from TorchData: https://github.com/pytorch/data/pull/890 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88693 Approved by: https://github.com/NivekT commit 98ecd06580b667441a45bfe7a67bc95ddb8a9353 Author: Felix Divo <4403130+felixdivo@users.noreply.github.com> Date: Thu Nov 10 19:29:29 2022 +0000 Bring Unfold/Fold param doc order in line with code (#88819) Now the first parameter (if used as a positional argument) is the first that is listed in the docs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88819 Approved by: https://github.com/ngimel commit 1d54ce9d5d4e44416a55ad002b8dc9b984ecc906 Author: Howard Huang Date: Thu Nov 10 06:31:46 2022 -0800 [14/N] Refactor _new_process_group_helper() to remove repeated code (#88351) Changes: - refactor parts of `_new_process_group_helper()` to remove repeated code Differential Revision: [D41188274](https://our.internmc.facebook.com/intern/diff/D41188274) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88351 Approved by: https://github.com/kwen2501 commit 4bcf2c53e521f5c61615b0adb84312513ad583f2 Author: William Wen Date: Thu Nov 10 19:22:09 2022 +0000 Add warnings & regressions info text (#88837) Add text about what warnings and accuracy regressions dropdowns mean. Sample: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1310770285 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88837 Approved by: https://github.com/anijain2305 commit 3b8245ab12d54723b6e7bcceb176235f13f0348b Author: Jiewen Tan Date: Thu Nov 10 18:34:19 2022 +0000 [LTC] Make ComputePostOrder accept const T pointers (#88773) Summary: Since `c10::ArrayRef` now support `c10::ArrayRef`, let's restore `ComputePostOrder` to accept `const Node*` again, which is more suitable for the context of the given helpers. Test Plan: CI. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88773 Approved by: https://github.com/JackCaoG commit 48b58930cbfa725ac25a9303d496c76bf983574d Author: Jiawen Liu Date: Thu Nov 10 18:32:25 2022 +0000 [Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#88566) Summary: Build fx-based linear/matmul/bmm + permute/transpose vertical fusion in Inductor For an internal Ads model: 1.15x -> 1.36x speedup Differential Revision: D41071665 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88566 Approved by: https://github.com/jansel, https://github.com/jianyuh commit d157fca59c3f28b532f5e845c48df0e2bedbfa39 Author: PyTorch MergeBot Date: Thu Nov 10 18:19:51 2022 +0000 Revert "Symintify `broadcast_to` (#88776)" This reverts commit 3a09d9a129406a05ca7e82c1438f9aa83019f48d. Reverted https://github.com/pytorch/pytorch/pull/88776 on behalf of https://github.com/malfet due to Broke functorch/test_aotdispatch on M1, see https://hud.pytorch.org/pytorch/pytorch/commit/3a09d9a129406a05ca7e82c1438f9aa83019f48d commit 6bf2776ac1d16692778f052ba6796d3308ea97c6 Author: Andrew Gu Date: Thu Nov 10 15:17:51 2022 +0000 [FSDP][Perf] Do not call `pad` in no-padding case (#88769) - Calling `F.pad()` issues a pad kernel from the CPU even if there is no padding needed, which can incur some non-negligible overhead. This PR removes that unnecessary call for the no-padding case. - This PR also does not zero the newly-allocated sharded gradient tensor before the reduce-scatter if `use_orig_params=True` because there is no need. The reduce-scatter will fill the tensor anyway, and we do not care about the values in the padding. For `use_orig_params=False`, the padding is exposed to the user, so we preserve the existing semantics of zeroing it. I left a to-do to follow-up since we may optimize that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88769 Approved by: https://github.com/zhaojuanmao commit d3178465eed4895fa12430943db37d00dd2c483b Author: Bert Maher Date: Thu Nov 10 18:17:20 2022 +0000 [dynamo] `VariableTracker.call_method` requires a name (#88311) Summary: as title Test Plan: Before: N2743445, After: N2748186. Note there's a new error, but at least we got past the easy one. Differential Revision: D40938415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88311 Approved by: https://github.com/brad-mengchi commit 1e4079a4762f515406c7f4654e7a4340914898ef Author: Bert Maher Date: Thu Nov 10 04:42:37 2022 +0000 [nnc] Disable opaque pointers mode in LLVM backend to allow getPointerElementType (#88798) As of LLVM 15 typed pointers are going away: https://llvm.org/docs/OpaquePointers.html. Thus `getPointerElementType` is no longer legal, since pointers are all opaque. I don't totally remember why we use it so prolifically, or whether there's an easy change to get rid of it, or whether we'd need a significant refactor to carry around `Type`s alongside `Value`s. But in any case, NNC is deprecated (see: TorchInductor) and will hopefully be gone before LLVM 16 is a thing. For now, we can apply the hack of turning off opaque pointer mode on the LLVMContext. 
Differential Revision: [D41176215](https://our.internmc.facebook.com/intern/diff/D41176215) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88798 Approved by: https://github.com/desertfire commit 656d0de6c50c373c7da2960ae6e9ca07b262384f Author: Panagiotis Antoniadis Date: Thu Nov 10 18:11:29 2022 +0000 Change TORCH_INTERNAL_ASSERT to TORCH_CHECK and add a nice error message (#88804) Fixes #87672 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88804 Approved by: https://github.com/ezyang commit 79b049af5ecbd8619acb4196f8c59228832ec99b Author: Huy Do Date: Thu Nov 10 17:48:16 2022 +0000 Switch to setup-nvidia action (#88757) Use the new [setup-nvidia](https://github.com/pytorch/test-infra/blob/main/.github/actions/setup-nvidia/action.yml) action from test-infra. The new action is created so that it can be shared across different PyTorch repos. For examples: * [pytorch/pytorch](https://github.com/pytorch/pytorch/blob/master/.github/scripts/install_nvidia_utils_linux.sh) (fixed by this PR) * [pytorch/tau](https://github.com/pytorch/tau/blob/main/.github/workflows/install_nvidia_utils_linux.sh) (fixed by https://github.com/pytorch/tau/pull/595) * [pytorch/torchsnapshot](https://github.com/pytorch/torchsnapshot/blob/main/.github/scripts/install_nvidia_utils_linux.sh) (fixed by https://github.com/pytorch/torchsnapshot/pull/130) * [torch/multiply](https://github.com/pytorch/multipy/blob/main/.github/scripts/install_nvidia_utils_linux.sh) (fixed by https://github.com/pytorch/multipy/pull/264) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88757 Approved by: https://github.com/seemethere, https://github.com/atalman commit f98edfcc48c903d0d22a0105b0fafe4ca58121e6 Author: Nikita Shulga Date: Thu Nov 10 17:42:20 2022 +0000 Make TorchElastic timer importable on Windows (#88522) Also, add `torch.distributed` to test imports, so that we would not regress in the future Fixes https://github.com/pytorch/pytorch/issues/85427 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88522 Approved by: https://github.com/d4l3k commit 4b898a7304246275b250b159dd0ac8e68a6df95d Author: Nikita Karetnikov Date: Thu Nov 10 01:07:50 2022 +0100 Symintify `adaptive_avg_pool3d` (#88783) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88783 Approved by: https://github.com/ezyang commit 3a09d9a129406a05ca7e82c1438f9aa83019f48d Author: Nikita Karetnikov Date: Thu Nov 10 11:48:31 2022 +0100 Symintify `broadcast_to` (#88776) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88776 Approved by: https://github.com/ezyang commit c0ecce15b5a54ff0185f9976e6bfb6f3a7de698d Author: samdow Date: Mon Nov 7 15:43:39 2022 -0500 add DisableTorchFunction that matches DisableTorchDispatch (#88219) Closes #87990. This implements a new disable guard that matches DisableTorchDispatch (disables all subclasses and modes) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88219 Approved by: https://github.com/ezyang commit 7f28be10e5e71efda37800384fa897785499bed1 Author: samdow Date: Tue Nov 1 18:35:38 2022 -0400 rename DisableTorchFunction to DisableTorchFunctionSubclass (#88218) First half of #87990. 
This doesn't change any of the behavior and is just a rename Pull Request resolved: https://github.com/pytorch/pytorch/pull/88218 Approved by: https://github.com/ezyang, https://github.com/zou3519 commit 3e43ff279428e5d07932968fbd7792200fa15a4d Author: XiaobingSuper Date: Thu Nov 10 01:30:03 2022 -0500 torchdynamo: add convolution add(relu) inplace fusion kernel (#88048) This PR is about add convolution add(relu) inplace fusion kernel which works for **other.add_(conv)**. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88048 Approved by: https://github.com/jgong5, https://github.com/jansel commit e6561291b89ecfbe35990decfcf16db47419d429 Author: Philip Meier Date: Thu Nov 10 13:44:45 2022 +0000 add hack to allow hybrid compressed sparse comparison in assertEqual (#88749) Hybrid sparse CSR tensors can currently not be compared to strided ones since `.to_dense` does not work: ```py import torch from torch.testing._internal.common_utils import TestCase assertEqual = TestCase().assertEqual actual = torch.sparse_csr_tensor([0, 2, 4], [0, 1, 0, 1], [[1, 11], [2, 12] ,[3, 13] ,[4, 14]]) expected = torch.stack([actual[0].to_dense(), actual[1].to_dense()]) assertEqual(actual, expected) ``` ``` main.py:4: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at ../aten/src/ATen/SparseCsrTensorImpl.cpp:54.) actual = torch.sparse_csr_tensor([0, 2, 4], [0, 1, 0, 1], [[1, 11], [2, 12] ,[3, 13] ,[4, 14]]) Traceback (most recent call last): File "/home/philip/git/pytorch/torch/torch/testing/_comparison.py", line 1098, in assert_equal pair.compare() File "/home/philip/git/pytorch/torch/torch/testing/_comparison.py", line 619, in compare actual, expected = self._equalize_attributes(actual, expected) File "/home/philip/git/pytorch/torch/torch/testing/_comparison.py", line 706, in _equalize_attributes actual = actual.to_dense() if actual.layout != torch.strided else actual RuntimeError: sparse_compressed_to_dense: Hybrid tensors are not supported The above exception was the direct cause of the following exception: Traceback (most recent call last): File "main.py", line 10, in assertEqual(actual, expected) File "/home/philip/git/pytorch/torch/torch/testing/_internal/common_utils.py", line 2503, in assertEqual msg=(lambda generated_msg: f"{generated_msg}\n{msg}") if isinstance(msg, str) and self.longMessage else msg, File "/home/philip/git/pytorch/torch/torch/testing/_comparison.py", line 1112, in assert_equal ) from error RuntimeError: Comparing TensorOrArrayPair( id=(), actual=tensor(crow_indices=tensor([0, 2, 4]), col_indices=tensor([0, 1, 0, 1]), values=tensor([[ 1, 11], [ 2, 12], [ 3, 13], [ 4, 14]]), size=(2, 2, 2), nnz=4, layout=torch.sparse_csr), expected=tensor([[[ 1, 11], [ 2, 12]], [[ 3, 13], [ 4, 14]]]), rtol=0.0, atol=0.0, equal_nan=True, check_device=False, check_dtype=True, check_layout=False, check_stride=False, check_is_coalesced=False, ) resulted in the unexpected exception above. If you are a user and see this message during normal operation please file an issue at https://github.com/pytorch/pytorch/issues. If you are a developer and working on the comparison functions, please except the previous error and raise an expressive `ErrorMeta` instead. ``` This adds a temporary hack to `TestCase.assertEqual` to enable this. 
Basically, we are going through the individual CSR subtensors, call `.to_dense()` on them, and stack everything back together. I opted to not do this in the common machinery, since that way users are not affected by this (undocumented) hack. I also added an xfailed test that will trigger as soon as the behavior is supported natively so we don't forget to remove the hack when it is no longer needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88749 Approved by: https://github.com/mruberry, https://github.com/pearu commit 7c353eb39559f2c8897a0580700dd0a6f943d34f Author: Li-Huai (Allan) Lin Date: Thu Nov 10 09:40:05 2022 +0000 [MPS] Fix softplus (#88555) 1. Fixes #87780 2. Fixes mps graph cache issue 3. Adds proper tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/88555 Approved by: https://github.com/kulinseth commit 7ad87f63e248b629d435a199cb61f4ed1f3dfcab Author: Grigory Sizov Date: Thu Nov 10 08:12:56 2022 +0000 Support src_mask and src_key_padding_mask for Better Transformer (#88488) Fixes T135842750 (follow-up for #87377) At present, having both `src_key_padding_mask` and `src_mask` at the same time is not supported on the fastpath in Transformer and Multi-Head Attention. This PR enables using both masks on the fastpath on CPU and GPU: if both masks are passed, we merge them into a 4D mask in Python and change mask type to 2 before passing downstream. Downstream processing in native code is not changed, as it already supports 4D mask. Indeed, it is done depending on the device: - on CUDA, by `SoftMax.cu::masked_softmax_cuda`. When mask type is 2, it calls either `dispatch_softmax_forward` -> `softmax_warp_forward` or `at::softmax` (depending on the input size). In both cases 4D mask is supported. - on CPU, by `SoftMax.cpp::masked_softmax_cpp`. It calls `hosted_softmax` which supports 4D mask. - Extended `test_mask_check_fastpath` to check that fast path is indeed taken in Transformer when two masks are passed - Added `test_multihead_self_attn_two_masks_fast_path_mock` to check that fast path is taken in MHA when two masks are passed - Added `test_multihead_self_attn_two_masks_fast_path` to check that fast and slow paths give the same result when two masks are passed in MHA - `test_masked_softmax_mask_types` now covers mask type 2 - `test_transformerencoderlayer_fast_path` (CPU smoke test) is expanded to the case of both masks provided simultaneously - `test_masked_softmax_devices_parity` checks that mask type 2 is accepted by CPU and CUDA paths Pull Request resolved: https://github.com/pytorch/pytorch/pull/88488 Approved by: https://github.com/mikekgfb commit dcefea2706fb35ece5e49fc138d952a2acd15824 Author: efiks <5167930+efiks@users.noreply.github.com> Date: Thu Nov 10 06:11:05 2022 +0000 [caffe2][tourch] Optimize BatchBoxCox (#87585) Differential Revision: D40215424 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87585 Approved by: https://github.com/hyuen commit e87c79ca0cbab476a7d09853b5830b615a62f679 Author: PyTorch MergeBot Date: Thu Nov 10 03:04:57 2022 +0000 [vision hash update] update the pinned vision hash (#88742) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. 
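To make the mask-merging idea from #88488 above concrete, here is a rough, hedged sketch of combining a float `src_mask` with a boolean `src_key_padding_mask` into a single 4D additive mask via broadcasting; the sizes, dtype conventions, and variable names are illustrative assumptions and may differ from the actual PR:
```python
import torch

N, S, H = 2, 4, 3                                   # batch, sequence length, heads (made up)
src_mask = torch.zeros(S, S)                        # additive attention mask, 0 = attend
src_key_padding_mask = torch.tensor([[False, False, True,  True],
                                     [False, False, False, True]])  # True = padding position

# Turn the padding mask into an additive float mask, then broadcast both
# into a single (N, H, S, S) mask, the 4D layout the fastpath softmax consumes.
pad = torch.zeros(N, S).masked_fill(src_key_padding_mask, float("-inf"))
merged = src_mask.view(1, 1, S, S) + pad.view(N, 1, 1, S)
merged = merged.expand(N, H, S, S)
print(merged.shape)  # torch.Size([2, 3, 4, 4])
```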
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88742 Approved by: https://github.com/pytorchbot commit cf04b36ce8f531730210b03eaa347977a1c2d75c Author: Animesh Jain Date: Thu Nov 10 01:57:17 2022 +0000 [dynamo] fixes dict changed during runtime error (#87526) Fixes https://github.com/pytorch/torchdynamo/issues/1744 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87526 Approved by: https://github.com/ezyang commit 0b8889c724f52dd767564fcd51e8a0ee5e99b45f Author: William Wen Date: Thu Nov 10 01:48:04 2022 +0000 Do not flag models in dashboard due to NaN values (#88792) Title. Tested by running `python benchmarks/dynamo/runner.py --output-dir ../test-dynamo-runner-logs-4 --training --visualize_logs` on a copy of a recent set of logs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88792 Approved by: https://github.com/anijain2305 commit 6e3555edea3ec2f453d6dc2ddcba9c6313d5ced5 Author: William Wen Date: Thu Nov 10 01:45:52 2022 +0000 Add absolute latency to dashboard (#88790) Add absolute latency to dashboard, as requested by https://github.com/pytorch/torchdynamo/issues/1833#issuecomment-1302742914 Tested by setting `run.sh` to ``` rm -rf ../test-dynamo-runner-logs-7/ mkdir ../test-dynamo-runner-logs-7/ python benchmarks/dynamo/torchbench.py --performance --float32 -dcuda --output=../test-dynamo-runner-logs-7//inductor_torchbench_float32_training_cuda_performance.csv --training --inductor --no-skip --dashboard --only mobilenet_v2 --cold_start_latency python benchmarks/dynamo/torchbench.py --accuracy --float32 -dcuda --output=../test-dynamo-runner-logs-7//inductor_torchbench_float32_training_cuda_accuracy.csv --training --inductor --no-skip --dashboard --only mobilenet_v2 ``` and running `python benchmarks/dynamo/runner.py --output-dir ../test-dynamo-runner-logs-7/ --dashboard-archive-path /data/home/williamwen/dynamo-runner-logs-copy --training --run --compilers inductor --flag-compilers inductor --suites torchbench --update-dashboard` (need to comment out the `generate_commands` line and change the github issue ID from 681 to something else). Sample comment: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1309645562 NOTE: this change breaks processing old logs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88790 Approved by: https://github.com/anijain2305 commit 2381548071d01e1a3f22793a5e0bff1ad0f58a69 Author: Elias Ellison Date: Wed Nov 9 14:21:21 2022 -0800 add stride constraints to fallbacks (#88534) Add stride/contiguity constraints to fallbacks so that inputs will be in the right stride permutation for the fallback kernel. Improves perf of coat_lite_mini from 1.48415536054865 -> 2.010956856330101. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88534 Approved by: https://github.com/ngimel commit fb5c6ae61f1f622ec388ae9fa00e7683ce1729ce Author: Eddie Yan Date: Thu Nov 10 00:49:07 2022 +0000 [cuDNN][cuDNN V8 API] Match V7 API behavior for `channels_last` stride coercion for cuDNN (#88699) For ConvNeXt failure in https://github.com/pytorch/torchdynamo/issues/1833 cuDNN V7 has some stride "fixing" code to coerce cuDNN to use channels-last in cases when allowed by size 1 strides that was omitted in V8, which seems to seems to lead to performance regressions. This PR patches in the same fix for V8. 
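The size-1 stride ambiguity that motivates the cuDNN V8 stride coercion in #88699 above can be seen from Python; a small hedged illustration with arbitrary shapes:
```python
import torch

x = torch.randn(8, 1, 28, 28)   # size-1 channel dimension
print(x.stride())                                           # (784, 784, 28, 1)
print(x.is_contiguous())                                    # True
print(x.is_contiguous(memory_format=torch.channels_last))   # also True: size-1 dims make the layout ambiguous
# Both layouts are reported as contiguous, so the backend has to pick which
# stride order it presents to cuDNN; that choice is what the coercion handles.
```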
CC @ngimel @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/88699 Approved by: https://github.com/ngimel commit 59115e6139a475ec21d642e6f99798b8c37bcf4d Author: mikey dagitses Date: Thu Nov 10 00:27:59 2022 +0000 disable test that times out in fbcode (#88758) Test Plan: Rely on CI. Differential Revision: D41162966 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88758 Approved by: https://github.com/zou3519 commit 16bd363863cceb907118557289af70882ea68985 Author: William Wen Date: Thu Nov 10 00:26:58 2022 +0000 Fix dynamo dashboard passrate denominator (#88777) Before the dashboard improvements, the passrate table looked like this: ~~~ +------------------------+------------+-------------+-------------+ | Compiler | torchbench | huggingface | timm_models | +------------------------+------------+-------------+-------------+ | eager | 98%, 54/55 | 100%, 43/43 | 100%, 61/61 | | aot_eager | 95%, 52/55 | 100%, 43/43 | 97%, 59/61 | | aot_cudagraphs | 75%, 41/55 | 49%, 21/43 | 38%, 23/61 | | nvprims_nvfuser | 71%, 39/55 | 16%, 7/43 | 48%, 29/61 | | inductor | 87%, 48/55 | 93%, 40/43 | 95%, 58/61 | | inductor_no_cudagraphs | 93%, 51/55 | 93%, 40/43 | 95%, 58/61 | +------------------------+------------+-------------+-------------+ ~~~ After the change, the table looked like: ~~~ +------------------------+------------+-------------+-------------+ | Compiler | torchbench | huggingface | timm_models | +------------------------+------------+-------------+-------------+ | eager | 82%, 53/65 | 84%, 43/51 | 82%, 61/74 | | aot_eager | 83%, 54/65 | 84%, 43/51 | 82%, 61/74 | | aot_cudagraphs | 69%, 45/65 | 65%, 33/51 | 38%, 28/74 | | nvprims_nvfuser | 48%, 31/65 | 78%, 40/51 | 26%, 19/74 | | inductor | 75%, 49/65 | 82%, 42/51 | 81%, 60/74 | | inductor_no_cudagraphs | 82%, 53/65 | 82%, 42/51 | 82%, 61/74 | +------------------------+------------+-------------+-------------+ ~~~ There is no actual regression, but the passrate is lower since the denominator is wrong. Check fix by running locally (e.g. `python benchmarks/dynamo/runner.py --output-dir ../test-dynamo-runner-logs-5 --training --visualize_logs`) and comparing passrate table output to previously correct one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88777 Approved by: https://github.com/anijain2305 commit 4f18739bf05bffff85609f90e0c319d8110c5616 Author: Nikita Shulga Date: Thu Nov 10 00:06:31 2022 +0000 Fix Docker image generation (#88741) Pass install channel when building nightly images Pass `TRITON_VERSION` argument to install triton for nightly images Fix `generate_pytorch_version.py` to work with unannotated tags and avoid failures like the following: ``` % git checkout nightly % ./.github/scripts/generate_pytorch_version.py fatal: No annotated tags can describe '93f15b1b54ca5fb4a7ca9c21a813b4b86ebaeafa'. However, there were unannotated tags: try --tags. 
Traceback (most recent call last): File "/Users/nshulga/git/pytorch/pytorch-release/./.github/scripts/generate_pytorch_version.py", line 120, in main() File "/Users/nshulga/git/pytorch/pytorch-release/./.github/scripts/generate_pytorch_version.py", line 115, in main print(version_obj.get_release_version()) File "/Users/nshulga/git/pytorch/pytorch-release/./.github/scripts/generate_pytorch_version.py", line 75, in get_release_version if not get_tag(): File "/Users/nshulga/git/pytorch/pytorch-release/./.github/scripts/generate_pytorch_version.py", line 37, in get_tag dirty_tag = subprocess.check_output( File "/Users/nshulga/miniforge3/lib/python3.9/subprocess.py", line 424, in check_output return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, File "/Users/nshulga/miniforge3/lib/python3.9/subprocess.py", line 528, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['git', 'describe']' returned non-zero exit status 128. ``` After the change nightly is reported as(due to autolabelling issue, should be fixed by ttps://github.com/pytorch/test-infra/pull/1047 ): ``` % ./.github/scripts/generate_pytorch_version.py ciflow/inductor/26921+cpu ``` Even for tagged release commits version generation was wrong: ``` % git checkout release/1.13 % ./.github/scripts/generate_pytorch_version.py ciflow/periodic/79617-4848-g7c98e70d44+cpu ``` After the fix, it is as expected: ``` % ./.github/scripts/generate_pytorch_version.py 1.13.0+cpu ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88741 Approved by: https://github.com/dagitses, https://github.com/msaroufim commit 7006ac6ee509c00da54a0c90c38685a7adb61779 Author: Akshit Khurana Date: Tue Nov 8 10:29:39 2022 -0800 [Dynamo] Fix Tensor.T trace (#88642) Summary: Tensor.T considered T as a GetAttr and didn't progate "example_value" Via https://pytorch.org/docs/stable/tensors.html#torch.Tensor.T > If n is the number of dimensions in x, x.T is equivalent to > x.permute(n-1, n-2, ..., 0). Fixes pytorch/torchdynamo#1476 Test Plan: pytest test/dynamo/test_functions.py::FunctionTests::test_T Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D41130306](https://our.internmc.facebook.com/intern/diff/D41130306) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88642 Approved by: https://github.com/tugsbayasgalan, https://github.com/yanboliang, https://github.com/jansel commit c7fc7104594f19e263a525aa572f97e65b08c386 Author: PyTorch MergeBot Date: Wed Nov 9 22:38:41 2022 +0000 Revert "[3/n] Thread PG: add threaded PG implementation (#88627)" This reverts commit 6dd081846e3ae6192b375d658d4b4f3d6bd9df6e. Reverted https://github.com/pytorch/pytorch/pull/88627 on behalf of https://github.com/huydhn due to This breaks one macos m1 test https://hud.pytorch.org/pytorch/pytorch/commit/6dd081846e3ae6192b375d658d4b4f3d6bd9df6e in trunk. 
PR also fails with the same issue so I think trymerge code has a bug here letting this one merged commit 6fe4ccc7cbd5e953e5888947229945f7590e3bfe Author: HDCharles Date: Tue Nov 8 07:59:10 2022 -0800 [ao] qconfig.py fix public v private (#87515) Summary: made is_reuse_input_qconfig, _activation_is_memoryless, _partial_wrapper_equals, _obs_or_fq_ctr_equals, _add_module_to_qconfig_obs_ctr, _assert_valid_qconfig private Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D40709280](https://our.internmc.facebook.com/intern/diff/D40709280) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87515 Approved by: https://github.com/jcaip commit 3a3500fa082482f8131a22196566f89da3de4162 Author: Howard Huang Date: Wed Nov 9 06:47:53 2022 -0800 [13/N] Update gather with CPU/CUDA implementations (#86409) Differential Revision: [D40181612](https://our.internmc.facebook.com/intern/diff/D40181612) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86409 Approved by: https://github.com/kwen2501 commit 1af9b38a907cdd8a21f4e0a363af3f136fa4062a Author: anjali411 Date: Wed Nov 9 14:48:20 2022 +0000 Symintify embedding_sparse_backward (#88746) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88746 Approved by: https://github.com/ezyang commit b7aa22d6db889a9ae31aabae80abc3e99ebc37ee Author: Zhengxu Chen Date: Wed Nov 9 21:39:46 2022 +0000 [fx] Fix GraphModule.print_readable() (#88730) Summary: `__nested_code()` seems removed. Test Plan: CI Differential Revision: D41149662 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88730 Approved by: https://github.com/SherlockNoMad commit 6dd081846e3ae6192b375d658d4b4f3d6bd9df6e Author: Charlie Yan Date: Wed Nov 9 20:51:11 2022 +0000 [3/n] Thread PG: add threaded PG implementation (#88627) Summary: After the previous 2 diffs, finally we can add the threaded ProcessGroup implementation. Test Plan: TBD Reviewed By: XilunWu Differential Revision: D40992593 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88627 Approved by: https://github.com/XilunWu, https://github.com/H-Huang commit 93d3bd626ed9bb99ded7a4e269f7a1fa486ac5d3 Author: PyTorch MergeBot Date: Wed Nov 9 20:48:32 2022 +0000 Revert "[primTorch] Improve `narrow` and `narrow_copy`: refs, tests, docs (#87045)" This reverts commit aa8279bcb8687e025a666e18828a436eb7ef7b45. Reverted https://github.com/pytorch/pytorch/pull/87045 on behalf of https://github.com/izaitsevfb due to BC-breaking change, D41161182 commit 8523c45717b21a205ddc74ec0fa0d97e7c201388 Author: Charlie Yan Date: Wed Nov 9 20:29:34 2022 +0000 Delete stub file to enable mypy check (#4649) (#88701) Summary: X-link: https://github.com/facebookresearch/detectron2/pull/4649 Context in https://fburl.com/4irjskbe This change deletes distributed.pyi, so that lintrunner will run mypy on distributed.py for typing check. 
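A quick, hedged illustration of the API touched by the `GraphModule.print_readable()` fix (#88730) above; the traced module and its shapes are made-up assumptions:
```python
import torch
from torch import fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.lin(x))

gm = fx.symbolic_trace(M())
gm.print_readable()   # prints (and returns) the generated forward() source per module
```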
Test Plan: CI Differential Revision: D41028360 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88701 Approved by: https://github.com/zhaojuanmao commit 133e61af7a8f4b098daf7d34f848e3c2a6cb4ae4 Author: Sherlock Huang Date: Wed Nov 9 04:51:04 2022 +0000 OpOverload is_view (#88722) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88722 Approved by: https://github.com/ezyang commit 55df18e3da859024efd190d8b3145d25915adc5a Author: Howard Huang Date: Wed Nov 9 06:47:53 2022 -0800 [12/N] Update scatter with CPU/CUDA implementations (#86408) Differential Revision: [D40181613](https://our.internmc.facebook.com/intern/diff/D40181613) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86408 Approved by: https://github.com/kwen2501 commit 3a1bdfee67170103f621671ebd1b64d06863539d Author: mikey dagitses Date: Wed Nov 9 18:20:04 2022 +0000 skip environment collection test in fbcode (#88744) Summary: This runs pip, which we don't have in the fbcode environment. Test Plan: Rely on CI. Differential Revision: D41156589 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88744 Approved by: https://github.com/zou3519 commit de53d4143a3a6bb08eddd845ad7f824112283792 Author: Jason Ansel Date: Wed Nov 9 18:13:06 2022 +0000 Fix TorchInductor benchmarking in fbcode (#88689) Summary: Makes the C++ TorchInductor benchmarking work in fbcode plus some minor fixed to enable that. Test Plan: Test added Differential Revision: D41045910 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88689 Approved by: https://github.com/soumith commit c4a3aa8fe7306aa490959df35a5933187b170d56 Author: ssjia Date: Tue Nov 8 13:44:43 2022 -0800 [vulkan] Add option for buffer representations in vTensor (#87622) This diff adds the option to use a Buffer to store data for a `vTensor` by passing `StorageType::BUFFER` to the constructor of `vTensor`. To enable this change, the construction of `vTensor` and `vTensorStorage` had to be slightly refactored to properly support strides. To summarize the changes: * `vTensorStorage` now contains no Tensor metadata (such as tensor sizes, strides, and `TensorOptions`) - it now only contains the image extents (if texture storage is used) and the buffer length. Tensor metadata is now managed by `vTensor`. The reason for this is to allow multiple `vTensor` objects to point to the same `vTensorStorage` but with different metadata which may be a useful feature now that Buffer storage is enabled. * `vTensor` will now compute the strides upon construction based on the requested sizes and memory layout if Buffer storage is requested. Previously, strides were faked by setting them all to 0 as strides do not apply to image textures (this behavior is preserved for texture storage). Differential Revision: [D40604163](https://our.internmc.facebook.com/intern/diff/D40604163/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87622 Approved by: https://github.com/digantdesai commit d81797e845123b6f682b0fe1c4c6e6b905059c65 Author: Edward Z. Yang Date: Wed Nov 9 08:24:44 2022 -0500 Meta function for aten.sort and aten.scatter* (#88705) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88705 Approved by: https://github.com/ezyang commit 100b55637b28cf826a52613abd62d7d49825a0ac Author: Will Constable Date: Wed Nov 9 16:41:04 2022 +0000 Mark dynamo torchbench dlrm as unsupported (#88712) - DLRM requires special configuration of embedding layers which are sparse and not compatible with DDP. 
- I could mark the embedding params as ignored in DDP to make the benchmark pass, but this isn't a representative benchmark. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88712 Approved by: https://github.com/ezyang commit eb9b1560195a89df6a14ded05b3e76d97346a1f2 Author: kshitij12345 Date: Wed Nov 9 17:15:12 2022 +0000 [fix] MathBits: serialization (#88182) Fixes #81690 TODO: * [x] C++ Unpickler Fix (locally tested pickled in Python and unpickled in C++) * [x] C++ Pickler Fix (locally tested pickled in C++ and unpickled in Python) * [x] Do quant_tensor, sparse_tensor, etc require similar changes? (Sparse and Quant don't need this) * [x] Add Comments * [x] How to make sure C++ and Python are in sync? (Functions in `pickler.h` help in getting and setting Tensor Metadata (math-bits for now) on a tensor. They are the only place which should handle this.) Notes: Quant Tensor don't support complex dtypes and for float they segfault with `_neg_view` : https://github.com/pytorch/pytorch/issues/88484 Sparse Tensor: ```python >>> a = torch.tensor([[0, 2.], [3j, 0]]).to_sparse() >>> a.conj().is_conj() False >>> a._neg_view() Traceback (most recent call last): File "", line 1, in NotImplementedError: Cannot access storage of SparseTensorImpl ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88182 Approved by: https://github.com/ezyang, https://github.com/anjali411 commit 525fe53aa41b2d4c3f411f8d1e92b55d95b7b0a6 Author: Nikita Shulga Date: Wed Nov 9 16:13:56 2022 +0000 [BE] Delete push_nightly_docker_ghcr (#88748) As it seems to be duplicating the functionality of `docker-release.yml` and have not produced a valid build in last 16 days, according to https://github.com/pytorch/pytorch/actions/workflows/push_nightly_docker_ghcr.yml Pull Request resolved: https://github.com/pytorch/pytorch/pull/88748 Approved by: https://github.com/seemethere commit f11f0e4a033ab09c637870ce0fad6ac68ec81eb0 Author: Bin Bao Date: Wed Nov 9 13:05:32 2022 +0000 [inductor] Handle nested tuple/list output in fallback kernel (#88495) Summary: Currently fallback kernel in inductor assumes its output is either a tensor or a tuple/list of tensors. This PR makes it handle more generic output data structure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88495 Approved by: https://github.com/jansel commit 3150c9dc6f296d941ece9aa4f4c189f36393ef8f Author: mikey dagitses Date: Tue Nov 8 10:06:03 2022 -0500 extract out the clean workspace test to its own file (#88682) Summary: This test relies on what the root workspace is before any other code is run. However, some of the test cases change it. If the order the tests are run is randomized, then the test can fail if run after one of them. Having it on its own ensures that it always sees a pristine state. Test Plan: Verified locally and confirmed in internal and external CI. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/88682 Approved by: https://github.com/r-barnes, https://github.com/malfet commit c19bae9f8457ac9b8774369a0f0a7ea31c90c3e9 Author: Edward Z. Yang Date: Wed Nov 9 08:13:03 2022 -0500 Add SherlockNoMad to symbolic-shapes reviewer list (#88739) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88739 Approved by: https://github.com/anjali411 commit 44de7cdbc463d73b967d1157041b402c3106239d Author: Edward Z. 
Yang Date: Mon Nov 7 13:34:38 2022 -0500 Add voznesenskym to symbolic-shapes group, move wconstab to listener (#88593) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88593 Approved by: https://github.com/anjali411 commit c86cc68d23521f8d6956e49fcd214d314f98da35 Author: Edward Z. Yang Date: Tue Nov 8 13:33:51 2022 -0500 Mark diag.out composite (#88670) It's implementation just redispatches, it works for more than CPU/CUDA. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88670 Approved by: https://github.com/anjali411 commit 69b2352236bc07798dd3e57844c1e7f0aa262b42 Author: Ivan Yashchuk Date: Wed Nov 9 12:56:55 2022 +0000 Add min cut partitioner for AOT+nvFuser (#88204) Here we mark most of `torch.ops.nvprims` as something that can be recomputed in the backward passes (and hopefully fused). TODO: - [x] Add a test after https://github.com/pytorch/pytorch/pull/88186 is merged Pull Request resolved: https://github.com/pytorch/pytorch/pull/88204 Approved by: https://github.com/jjsjann123, https://github.com/jansel commit ff7c5b0df80f5b72cd3dbb3a372d481f989a6ef3 Author: Sean Ross-Ross Date: Tue Nov 8 12:35:40 2022 -0600 Changing as_strided_scatter to deterministic inputs (#85583) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85583 Approved by: https://github.com/mruberry commit fca6ed02b91e0d685cc9b8b504bac5d356d31876 Author: blzheng Date: Wed Nov 9 10:40:23 2022 +0000 [Inductor] fix c++ compile error with masked float value init (#88298) Fixes #88201 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88298 Approved by: https://github.com/jgong5, https://github.com/jansel commit 652af5ec15b81c39ec7413519d0ce9938d87bcf1 Author: Fabio Rocha Date: Tue Nov 8 19:25:30 2022 +0000 upsample_*.vec ops are now CompositeImplicit (#85638) It was previously CompositeExplicit but it was not really necessary. See discussion in https://github.com/pytorch/pytorch/issues/85405 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85638 Approved by: https://github.com/ezyang, https://github.com/lezcano, https://github.com/malfet, https://github.com/jansel commit aa8279bcb8687e025a666e18828a436eb7ef7b45 Author: Nikita Karetnikov Date: Wed Nov 9 00:53:37 2022 +0100 [primTorch] Improve `narrow` and `narrow_copy`: refs, tests, docs (#87045) Fixes #87019. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87045 Approved by: https://github.com/mruberry commit f6192b75c66cf5ac4591170106d8f58e7848bd07 Author: Xia, Weiwen Date: Wed Nov 9 08:08:11 2022 +0000 [Quant] Support lowering of channel shuffle in FX (#83731) Support lowering of channel shuffle in FX by adding its module and functional op to `is_copy_node` list in `torch/ao/quantization/fx/_lower_to_native_backend.py` UTs added to test - correctness of quantized `ChannelShuffle` module. - FX lowering of `ChannelShuffle` module and functional `channel_shuffle`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83731 Approved by: https://github.com/jerryzh168 commit ab9a19a95b628132bf0ad6474f245b4e596b9d74 Author: Nikita Shulga Date: Wed Nov 9 06:55:22 2022 +0000 [BE] Move `setup-ssh` step ahead of clone PyTorch (#88715) It allows one to SSH faster rather than having to wait for repo clone to finish. I.e. 
right now one usually have to wait for a few minutes fore PyTorch clone is finished, but with this change you can SSH ahead of time (thanks to `setup-ssh` being a composite action Pull Request resolved: https://github.com/pytorch/pytorch/pull/88715 Approved by: https://github.com/clee2000, https://github.com/izaitsevfb commit a7420d2ccb62d005f2e1853cfef8d25eb7748a90 Author: Eddie Yan Date: Wed Nov 9 01:49:50 2022 +0000 Hopper (`sm90`) support (#87736) Essentially a followup of #87436 CC @xwang233 @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/87736 Approved by: https://github.com/xwang233, https://github.com/malfet commit 19d7941e37cd4727ccb874ada7a310dc679ebaab Author: Wei-Sheng Chin Date: Wed Nov 9 01:31:42 2022 +0000 Fix Python-bound function signature (torch._C.Graph.addInput) (#88528) In pytorch/torch/_C/__init__.pyi, Graph.addInput has signature ```python def addInput(self, name: str) -> Value: ... ``` which doesn't match the corresponding function ```cpp Value* addInput(const std::string& name = "") { return block_->addInput(name); } ``` in python_ir.cpp. This PR aligns the bound function on both C++ and Python sides. Without this PR, mypy will compain whenever a change contains some calls to `addInput`; for example, ![image](https://user-images.githubusercontent.com/3524474/200092086-429b8d63-9321-4d03-b0d6-f4c9bd361756.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88528 Approved by: https://github.com/davidberard98 commit f0e6cea2ed2de9e5da9af8be4a243b9aae5aec06 Author: Edward Z. Yang Date: Tue Nov 8 13:47:27 2022 -0500 Meta registrations for inplace operators (#88678) Also, handle non-default alpha correctly. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88678 Approved by: https://github.com/SherlockNoMad, https://github.com/albanD commit a880ddc164203b6f49971f5af44cdb7d9b059f06 Author: Edward Z. Yang Date: Tue Nov 8 13:47:26 2022 -0500 Meta implementation for unsqueeze_ (#88675) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88675 Approved by: https://github.com/SherlockNoMad commit 1dab35ca1bea35ca7d069281490b709851fbcf95 Author: Edward Z. Yang Date: Tue Nov 8 13:47:26 2022 -0500 Meta implementation for bernoulli (#88676) For some reason bernoulli uses legacy memory format, see linked issue. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88676 Approved by: https://github.com/SherlockNoMad commit 6be426ca1aa857af7a148271ae4599f108b17a69 Author: Nikita Shulga Date: Wed Nov 9 01:04:29 2022 +0000 Update gloo submodule (#88530) Also, add an explicit cudart dependency to `torch_cuda` if Kineto is used with GPU support (it used to be somehow inherited from a wrong `gloo` setup) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88530 Approved by: https://github.com/osalpekar commit 08b2a251e122ff9ee3e5dc1af5513ab6cbd99db4 Author: Zhengxu Chen Date: Wed Nov 9 01:02:07 2022 +0000 [export] Preserve meta["val"] on placeholders in dynamo.export(). (#88651) Summary: Today when we transform the captured graph in the last step in export(aten_graph=True), we construct a new graph which doesn't have the all the metadata to be preserved, for example, node.meta["val"]. meta["val"] is important for writing passes and analysis on the graph later in the pipeline, we may want to preserve that on placeholder nodes. 
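A hedged sketch of what preserving `meta["val"]` on placeholders (#88651 above) enables for downstream passes, assuming the `torch._dynamo.export(..., aten_graph=True)` entry point referenced in the test plan; the traced function is an illustrative assumption:
```python
import torch
import torch._dynamo as dynamo

def f(x):
    return x.sin() + 1

gm, guards = dynamo.export(f, torch.randn(3), aten_graph=True)
for node in gm.graph.nodes:
    if node.op == "placeholder":
        # With the change, placeholder nodes carry a meta["val"] (a fake tensor)
        # that passes can query for shape/dtype without re-running the model.
        print(node.name, node.meta.get("val"))
```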
Test Plan: test_export.py:test_export_meta_val Differential Revision: D41110864 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88651 Approved by: https://github.com/tugsbayasgalan, https://github.com/jansel commit 5f876bfdc512a376c12ee15cb58037937d73cf38 Author: Bin Bao Date: Tue Nov 8 15:31:15 2022 +0000 Reduce the number of shards inductor uses for model tests (#88610) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88610 Approved by: https://github.com/huydhn commit 9f58e027a9b802f541ea1d9ad750be833db2c39c Author: Antoni Viros i Martin Date: Wed Nov 9 00:19:36 2022 +0000 Add implementation for irregular dimension selection for nested tensors. (#88585) Summary: This diff modifies the implementation of the select operator so slices of the irregular dimension can be selected (e.g. nt[:,0,:]). Test Plan: Added new unit tests to test that the new functions work as intended (see them in diff). To test, `buck test mode/dev-nosan //caffe2/test:nested` Differential Revision: D41083993 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88585 Approved by: https://github.com/cpuhrsch commit 87238e64914246f9f04ecec013fd6a78d87517b1 Author: Samantha Andow Date: Wed Nov 9 00:09:20 2022 +0000 [nn] add remove_duplicate flag to named_parameters (#759) (#88090) Summary: X-link: https://github.com/pytorch/torchrec/pull/759 Since the remove_duplicate flag was added to named_buffers in D39493161 (https://github.com/pytorch/pytorch/commit/c12f829cce29eb6971094a9bbb0f8971aed86f5c), this adds the same flag to named_parameters Test Plan: python test/test_nn.py -k test_buffers_and_named_buffers OSS Tests Differential Revision: D40801899 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88090 Approved by: https://github.com/albanD commit cef13ebea0ba604540d1cb16e13fbd2c36040f59 Author: Taylor Robie Date: Mon Nov 7 15:48:35 2022 -0800 [Profiler] Memory profiler part 1: Gradient identification (#86802) There are multiple ways to indentify that a Tensor is a gradient. (A subset of which also give additional context.) So to start off I've made a utility to handle that determination. Differential Revision: [D39920730](https://our.internmc.facebook.com/intern/diff/D39920730/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86802 Approved by: https://github.com/chaekit commit c0e6b4329fe2dd35bb0bf162f4203ad7e0162554 Author: Michael Suo Date: Mon Nov 7 22:23:01 2022 -0800 [dynamo] only error out on nested fx trace if dynamo is optimizing (#88640) I think this is the final resolution to issue caused by https://github.com/pytorch/pytorch/pull/87797. The nvfuser issue that PR tripped up was because, even though we're correctly disabling torchdynamo via a `DisableContext`, the nested fx trace check was still firing. This PR properly narrows it to only fire if we're not disabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88640 Approved by: https://github.com/yf225 commit a02ea655b5a5fbd615003d675c3e1765820298d2 Author: Mikayla Gawarecki Date: Tue Nov 8 19:14:48 2022 +0000 Slight fix in error message for check_for_seq_len_1_nested_tensor (#88690) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88690 Approved by: https://github.com/cpuhrsch commit 6e6f929b2c5308b8de2c82884b8fa70bd4778842 Author: Taylor Robie Date: Mon Nov 7 11:24:27 2022 -0800 [Profiler] Restructure inputs and capture TensorLists. (#87825) This PR unifies and rationalizes some of the input representation in Result. 
The current approach of storing separate types in separate vectors is tedious for two types (Tensors and scalars), but would be even more annoying with the addition of TensorLists. A similar disconnection exists with sizes and strides which the user is also expected to zip with tensor_metadata. I simplified things by moving inputs to a variant and moving sizes and strides into TensorMetadata. This also forced collection of sizes and strides in python tracer which helps to bring it in line with op profiling. Collection of TensorLists is fairly straightforward; `InputOutputEncoder` already has a spot for them (I actually collected them in the original TorchTidy prototype) so it was just a matter of plumbing things through. Differential Revision: [D40734451](https://our.internmc.facebook.com/intern/diff/D40734451/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87825 Approved by: https://github.com/slgong-fb, https://github.com/chaekit commit e132c45fd033842a677ce125a6d2657a500901a2 Author: Taylor Robie Date: Mon Nov 7 11:24:25 2022 -0800 [Profiler] Handle ABA for TensorImpl* when assigning IDs (#87133) Part of the current ID assingment algorithm groups any Storages which are associated with the same TensorImpl*. This isn't sound (which I knew but deferred until it actually became a problem) because pointers can be reused by different objects. (ABA problem) ABA is easy to handle for Storage because we see allocations and frees, but ~TensorImpl is very hot and cannot tolerate profiling code without significant increases in overhead. This PR narrows the conditions under which ID assignment will join on TensorImpl*. Two storages which are associated with the same TensorImpl* are grouped IFF they were live at the same time. (Note that this still allows storages with disjoint lifetimes to be joined transitively through a third storage which overlaps with both.) The need for this PR arose in memory profiling. The Python argument parser creates short lived Tensors for (some) scalar arguments which triggers this issue. (Which is stochastic and platform dependent since optimizations like reusing recently freed allocations is implementation defined.) Spurious connections can lead to confusing and long range interactions when building up the memory profile, so it makes sense to harden ID assignment to avoid any issues. 
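The ABA hazard described for #87133 above concerns `TensorImpl*` reuse inside the profiler, which is not directly visible from Python; as a loose, hedged analogy only, address reuse of freed allocations can be observed like this (the sizes are arbitrary and the equality is allocator-dependent, so it may not reproduce):
```python
import torch

a = torch.empty(1024)
addr_a = a.data_ptr()
del a                       # frees the allocation
b = torch.empty(1024)       # same size: the allocator often hands back the same block
print(addr_a == b.data_ptr())   # frequently True: an address alone cannot identify an object (ABA)
```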
Differential Revision: [D40445121](https://our.internmc.facebook.com/intern/diff/D40445121/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87133 Approved by: https://github.com/slgong-fb, https://github.com/chaekit commit 078c25df13b1c24da994245a9879fe4b6c23ce23 Author: Nikita Shulga Date: Tue Nov 8 21:10:07 2022 +0000 [MPS][BE] Code cleanup (#88529) Various code cleanup in MPS operations: - Per @kulinseth suggestion move `mpsSupportsCumsum` to `MPSDevice.h` and rename it to `is_macos_13_or_newer()` - Move Ventura MPSGraph new operators to `MPSGraphVenturaOps.h` header - Use `LookupAs` and `CreateCachedGraphAs` to make code more compact - Formatting Pull Request resolved: https://github.com/pytorch/pytorch/pull/88529 Approved by: https://github.com/kulinseth commit 1d82eba98b22fb987c2085ea1f85a78f8d9b6f28 Author: Sherlock Huang Date: Tue Nov 8 06:57:30 2022 +0000 PatternMatcher supports matching list-typed args (#88656) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88656 Approved by: https://github.com/jerryzh168 commit 8e2627d42fda299749b2d1e4b3899009824928c5 Author: Peter Bell Date: Mon Nov 7 22:00:15 2022 +0000 [inductor] Fix aten.fmod lowering (#88602) Currently the lowering for aten.fmod promotes integral types to float and calls `tl.libdevice.fmod` whereas the ATen behavior is to use the modulo operator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88602 Approved by: https://github.com/jansel commit f556d73574ca55f39c61d2519f27b2d35dbed77b Author: Mengwei Liu Date: Tue Nov 8 19:53:11 2022 +0000 [torch] Implement aten::native_batch_norm.out for CPU (#88604) Summary: Implement `native_batch_norm.out` for CPU. Reuses the main logic for `native_batch_norm` but extract out the Tensor creation logic for outputs. There are 3 outputs: `output`, `save_mean` and `save_var`. `batch_norm_cpu` calls `batch_norm_cpu_update_stats_template` to get `save_mean` and `save_var`, and then calls into `batch_norm_cpu_transform_input_template` which initializes `output`. In the implementation of `batch_norm_cpu_out`, I did the following: * Let `batch_norm_cpu_transform_input_template` to take another argument `output`, ask the call sites to pass in a output Tensor. * Overload `batch_norm_cpu_update_stats_template` to take `save_mean` and `save_var`, ask the call sites to pass in those Tensors. * In `batch_norm_cpu_out`, pass `output`, `save_mean` and `save_var` all the way to our new `batch_norm_cpu_transform_input_template` and `batch_norm_cpu_update_stats_template`. * In `batch_norm_cpu`, prepare for these outputs and call `batch_norm_cpu_out`. Test Plan: Enable unit tests for `native_batch_norm.out`. Differential Revision: D40992036 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88604 Approved by: https://github.com/iseeyuan, https://github.com/jjsjann123 commit 3e30a9ea1cfdb8db87aedc76ae1d3edd5dc8ace5 Author: Eddie Yan Date: Tue Nov 8 19:44:23 2022 +0000 Fix `CUDA_MAX_THREADS_PER_SM` for `sm_87` (#88644) CC @ngimel @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/88644 Approved by: https://github.com/ngimel commit 6bb7f4f29f0a36ce410ce53d824f531eaf74c76e Author: Edward Z. Yang Date: Tue Nov 8 06:12:03 2022 -0800 Minor error message improvements on meta functions (#88677) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88677 Approved by: https://github.com/SherlockNoMad commit d98a884b33ebf4ad6b34a19ee72499c7beb06893 Author: PyTorch MergeBot Date: Tue Nov 8 19:04:25 2022 +0000 Revert "[cuDNN] (re-open) Enable cuDNN Frontend v8 API by Default (#87669)" This reverts commit 3c6bddc3f6347ce7d1ed33aee94cdaa953cbc387. Reverted https://github.com/pytorch/pytorch/pull/87669 on behalf of https://github.com/eqy due to investigating convnext benchmark regressions commit 5eecfcf5f3d118a9e2d502dfb8689018c9591662 Author: Nikita Shulga Date: Tue Nov 8 18:52:56 2022 +0000 Run libtorch trunk build on linux.4xlarge (#88683) Add optional `runner` input to `_linux-build.yml` Move `libtorch-linux-bionic-cuda11_6-py3_7-gcc7-build` to `linux.4xlarge` as it occasionally OOMS on 2xlarge one Pull Request resolved: https://github.com/pytorch/pytorch/pull/88683 Approved by: https://github.com/atalman, https://github.com/weiwangmeta commit eaf4fe3d2b7096579b05b52d543756f74d0e91e7 Author: zyq8709 Date: Tue Nov 8 18:46:56 2022 +0000 Most recently used cache management for TorchDynamo (#88076) Modify the lookup procedure for TorchDynamo caches to keep the head of the single linked list as the most recently used cache entry, which may potentially improve probability for cache hitting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88076 Approved by: https://github.com/jansel commit 1b5373fc830f9dc58e98d26645fba91d96cc13da Author: Edward Z. Yang Date: Tue Nov 8 05:33:18 2022 -0800 Mark as_strided_ as supporting SymInt in C++ (#88674) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88674 Approved by: https://github.com/anjali411 commit dba887766b8b3924d6e39a65c88d8e554f76c861 Author: PyTorch MergeBot Date: Tue Nov 8 18:37:48 2022 +0000 Revert "torchdynamo support modules() for nn_module (#88023)" This reverts commit 96104c7b7e908634a473792b6b2e9279d79d23d8. Reverted https://github.com/pytorch/pytorch/pull/88023 on behalf of https://github.com/ydwu4 due to [Internal breakages] https://www.internalfb.com/intern/sandcastle/job/9007200067589062/ commit 860e354d1c3276bc445071b19e45357d129ed535 Author: Edward Z. Yang Date: Tue Nov 8 10:23:53 2022 -0800 Support diag_embed.out decomposition (#88671) This is a little tricky: there is a diag_embed.out, but its not bound in Python because it's autogenerated, see https://github.com/pytorch/pytorch/issues/88598 So I can't "just" add the out variant to the ref, as this makes it inconsistent with the torch API. To workaround this, I mark the ref as supporting out, but not the original function. This is useful to do, because it means that diag_embed.out now supports symbolic shapes. However, this cannot be easily tested because I can't mark the out variant as being supported in the normal OpInfo test. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88671 Approved by: https://github.com/mruberry commit 3f6a560184d90b19e298477775d05c5996e6abbc Author: Edward Z. Yang Date: Tue Nov 8 05:28:04 2022 -0800 Correctly test that dtype/device match in generated .out kernels for composites (#88672) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88672 Approved by: https://github.com/anjali411 commit 245144a6361ec3b89012a63a4956646718a4d080 Author: Edward Z. Yang Date: Tue Nov 8 05:30:06 2022 -0800 Propagate layout and pin memory in randint to inner constructor (#88673) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88673 Approved by: https://github.com/anjali411 commit 96104c7b7e908634a473792b6b2e9279d79d23d8 Author: Yidi Wu Date: Tue Nov 8 18:22:03 2022 +0000 torchdynamo support modules() for nn_module (#88023) Differential Revision: D40820879 This diff allows models to call self.modules() during dynamo tracing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88023 Approved by: https://github.com/tugsbayasgalan, https://github.com/voznesenskym, https://github.com/jansel commit ee28b865ee9c87cce4db0011987baf8d125cc857 Author: Kurt Mohler Date: Tue Nov 8 18:11:01 2022 +0000 Deprecate TypedStorage, its derived classes, and all of their public methods (#85303) Part of #85302 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85303 Approved by: https://github.com/ezyang commit 53ca5ad347451f3087dedc8df5c1a34663812a6b Author: Natalia Gimelshein Date: Tue Nov 8 17:06:28 2022 +0000 enable scalar reduction with dim=-1 (#88628) Tested with all samples for `sum`, but also fixes all samples errors on other reductions (amin, amax, any, all etc) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88628 Approved by: https://github.com/desertfire commit 89c5819626b5b4edd2f000c6baf2ef56fa93458f Author: Will Constable Date: Tue Nov 8 02:22:01 2022 +0000 Dynamo DDP accuracy bench uses find_unused_parameters (#88645) - find_unused_parameters adds a slight overhead, but is required in cases where users do not manually specify parameters to ignore which will not receive grads. In some models, some parameters do not receive grads, and this causes DDP to throw an exception as it waits for a grad for each parameter Pull Request resolved: https://github.com/pytorch/pytorch/pull/88645 Approved by: https://github.com/soumith commit fcc28834765fc4dbff85f8b3992f8a72fc739694 Author: albanD Date: Mon Nov 7 10:24:20 2022 -0500 Clean up SymFloat binding to cover all functions (#88370) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88370 Approved by: https://github.com/ezyang commit 6abaa5946dc21d7836d5d46b6acc84f61f38f970 Author: albanD Date: Mon Nov 7 10:23:18 2022 -0500 Fix categorization of sym_int method (#88369) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88369 Approved by: https://github.com/ezyang, https://github.com/bdhirsh, https://github.com/anjali411 commit bc66ddb5cb276f3ef0be4d73819f1b172e0872d1 Author: Howard Huang Date: Mon Nov 7 15:44:31 2022 -0800 Add torch.distributed.DistBackendError exception type, thrown from C10D_NCCL_CHECK (#88134) Currently all of the distributed errors are thrown from the `TORCH_CHECK` macro which throws a generic `RuntimeError`. This change introduced a new error type `DistBackendError` which derives from `RuntimeError` to signify there was an error with the backend communication library. This allows for better error handling and analysis at higher levels in the stack. 
Motivation: https://docs.google.com/document/d/1j6VPOkC6znscliFuiDWMuMV1_fH4Abgdq7TCHMcXai4/edit#heading=h.a9rc38misyx8 Changes: - introduce new error type - Update `C10D_NCCL_CHECK` Sample script to demonstrate new error type ```python import torch import torch.distributed as dist if __name__ == "__main__": dist.init_process_group("nccl") dist.broadcast(torch.tensor([1, 2, 3]).cuda(), 0) ``` Differential Revision: [D40998803](https://our.internmc.facebook.com/intern/diff/D40998803) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88134 Approved by: https://github.com/rohan-varma commit 1a7c4b0de71de290a1b35cd96fb2ca6e7d24b131 Author: lezcano Date: Mon Nov 7 16:32:25 2022 +0000 Create _make_alias to preserve the name of a function when creating an alias (#88114) Before, we would inherit the name of the aliased function, which was very confusing, and disallowed some homogeneous treatment of references, as we do later in this stack Pull Request resolved: https://github.com/pytorch/pytorch/pull/88114 Approved by: https://github.com/mruberry commit af09270e10bbd063f9cdede03ba0ef27f0607304 Author: jjsjann123 Date: Tue Nov 8 12:06:35 2022 +0000 nvprims bookend non compute (#88457) Cherry-pickeding: https://github.com/csarofeen/pytorch/pull/2099 1. enabling bookend non-compute-ops pass on nvfuser 2. fixing bookend op check on intermediate tensor as partition inputs 3. python tests added for: `getitem` special handling bookend_non_compute removal 4. patching dfs by excluding dfs within partition to avoid going over recursion limitation Pull Request resolved: https://github.com/pytorch/pytorch/pull/88457 Approved by: https://github.com/SherlockNoMad commit 8cb5c5543e92628782deb00dda78380076b89e66 Author: Huy Do Date: Tue Nov 8 08:32:45 2022 +0000 Revive static_runtime_benchmark build and test (#87660) This build uses the wrong BUILD_ENVIRONMENT `pytorch-linux-focal-py3`, thus it hasn't been run for a long time (forgotten). The name was probably the old name of the build environment we used in the past. The convention today doesn't have the `pytorch-` prefix. There is a TODO for this: > TODO: this condition is never (BUILD_ENVIRONMENT doesn't start with pytorch-), need to fix this. This is done as part of [T131829540](https://www.internalfb.com/intern/tasks/?t=131829540), where we want `static_runtime_benchmark` build and test jobs to run in OSS CI to avoid breaking internal * I also fix some compiler warning errors `-Werror=sign-compare`, `-Werror,-Wunused-const-variable`, and gcc7 compatibility issue along the way because this hasn't been run for a long time. * Reviving this test also reveals a small bug in `PrepackWeights` test in `test_static_runtime.cc` added recently in https://github.com/pytorch/pytorch/pull/85289. The test refers to an internal ops and should only be run internally. This has been fixed by https://github.com/pytorch/pytorch/pull/87799 (To be merged) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87660 Approved by: https://github.com/malfet commit 02c1a304fa801942258a15a7e50abaa92aca2ddf Author: Michael Suo Date: Tue Nov 8 06:29:11 2022 +0000 [ci] increase timeout time of ios test app build (#88611) We were timing out; 5 minutes seems a bit short. 
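Complementing the sample script for #88134 above, a hedged sketch of what catching the new exception type could look like on the user side, assuming it is exposed as `torch.distributed.DistBackendError` (per the PR it derives from `RuntimeError`, so existing handlers keep working):
```python
import torch
import torch.distributed as dist

try:
    dist.init_process_group("nccl")
    dist.broadcast(torch.tensor([1, 2, 3]).cuda(), src=0)
except dist.DistBackendError as err:
    # Backend/communication-library failure, as opposed to a generic RuntimeError.
    print(f"NCCL backend error: {err}")
```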
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88611 Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/ZainRizvi commit 8f66ae413f8c9d7f2418d7f0b9f69d409c455b46 Author: Taylor Robie Date: Mon Nov 7 16:07:13 2022 -0800 [Autograd] Use in-place input accumulation fast path for dense Tensors. (#88339) There is a fast path in InputBuffer to steal memory when use count is zero, however it is only used for sparse Tensors. According to Natalia, this is just because it wasn't obvious that there would be a benefit for dense Tensors so there was no reason to live dangerously. However I've noticed large Tensors in internal models which would benefit from this optimization as well. Differential Revision: [D40946601](https://our.internmc.facebook.com/intern/diff/D40946601/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88339 Approved by: https://github.com/ngimel commit ffb6e68962a6c376ffb658752877e939d14c2f6d Author: Charlie Yan Date: Tue Nov 8 05:12:18 2022 +0000 Add missing args to DDP constructor in distributed.pyi (#88209) Summary: As title. And remove all unnecessary `pyre-fixme` for the unknown arg in call-site. Test Plan: CI Differential Revision: D40874013 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88209 Approved by: https://github.com/zhaojuanmao commit ced71e8e82b8c1a035716c671da51f16b49f4eb5 Author: biubiuX <4338192+biubiuX@users.noreply.github.com> Date: Tue Nov 8 04:49:45 2022 +0000 [Pytorch] add an option to disable TORCH_WARN and TORCH_WARN_ONCE log (#87188) Summary: Add an option to disable TORCH_WARN, some op could trigger spammy TOCH_WARN log which is not desired under certain scenario. Test Plan: Tested with -pt.disable_warn = 1 and -pt.disable_warn = 0 verified TORCH_WARN and TORCH_WARN_ONCE are properly handled tested with -pt.strip_error_messages = 1, -pt.disable_warn = 0 verified strip error message is respected when warn is printed Differential Revision: D40321550 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87188 Approved by: https://github.com/kurtamohler, https://github.com/ezyang commit ed97e0aa2918e687309ee9a146c8294aefb237d2 Author: PyTorch MergeBot Date: Tue Nov 8 03:29:52 2022 +0000 [vision hash update] update the pinned vision hash (#88465) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88465 Approved by: https://github.com/pytorchbot commit 9f11ce7f67612d1c11f1a6a9b264779b27062e82 Author: BoringCrypto Date: Tue Nov 8 03:26:44 2022 +0000 Setting pickle_module isn't working (#88570) When setting the pickle_module it currently always gets overwritten by the pickle module. This should only happen when the pickle_module isn't specified. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88570 Approved by: https://github.com/kit1980 commit 825f4e602b766545a4ee6dfd971056e24c7dbbe8 Author: Edward Z. Yang Date: Mon Nov 7 11:13:07 2022 -0800 Add support for symbolic shapes to sparse tensor (#88573) Along the way, I undid making sparse/dense dim symint (they're dimensions, so they should be static.) Also symintify set_indices_and_values_unsafe There is a little bit of a nontrivial infra change here: previously, we didn't populate the strides field on sparse tensors. 
It is now populated with "empty" strides, and this meant that sparse tensors were falsely reporting they were non-overlapping dense/contiguous. I added in a hack to work around this case. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88573 Approved by: https://github.com/anjali411 commit c29502dd2fa38c79ada620fbde2f61d58df6e219 Author: Jiewen Tan Date: Tue Nov 8 02:22:02 2022 +0000 [LTC] Remove view (#88445) Summary: This pull request removes the last view op, the original view. Test Plan: ./build/bin/test_lazy --gtest_filter=LazyOpsTest.TestView* Pull Request resolved: https://github.com/pytorch/pytorch/pull/88445 Approved by: https://github.com/JackCaoG, https://github.com/antoniojkim, https://github.com/Krovatkin commit f2000842a864ed4c2287aa3a821ab8a9224ad52b Author: Nikita Shulga Date: Tue Nov 8 01:46:25 2022 +0000 Do not use double for single-prec upsample (#88277) I'm not sure what would be the best behaviour here, but it feels a bit strange to perform parts of `float32` computations as `float64` and then downcast them back to `float32`. Use `at::opmath_type` rather than `at::acc_type` as no accumulation is used in the op. I don't know much about double vs single precision scalar perf on x86 CPU, but before the change: ``` python -c "import timeit;import torch;x=torch.arange(100, dtype=torch.float32).reshape(1, 1, 10, 10); print(timeit.Timer(stmt='torch.nn.functional.interpolate(x, scale_factor=2.0, mode=\"bilinear\", align_corners=False)', globals={'x':x, 'torch':torch}).timeit())" 11.337517574429512 ``` After the change: ``` $ python -c "import timeit;import torch;x=torch.arange(100, dtype=torch.float32).reshape(1, 1, 10, 10); print(timeit.Timer(stmt='torch.nn.functional.interpolate(x, scale_factor=2.0, mode=\"bilinear\", align_corners=False)', globals={'x':x, 'torch':torch}).timeit())" 10.513805857859552 ``` I.e. roughly 7% perf improvement (measured on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz) NOTE: - `aten::acc_type` yields `double` - `aten::opmath_type` returns `float`. Fixes https://github.com/pytorch/pytorch/issues/87968 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88277 Approved by: https://github.com/mingfeima, https://github.com/ngimel, https://github.com/jgong5 commit 4ea2310f1e4410b439430e42450e176463a960c2 Author: Kazuaki Ishizaki Date: Tue Nov 8 01:33:36 2022 +0000 Fix typos used in documents under torch directory (#88483) This PR fixes typos, in comments of Python files, that are found via the search box at https://pytorch.org/docs/master/search.html. This is a follow-up of #88300. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88483 Approved by: https://github.com/kit1980 commit d25be63c05889250212249e3cd87e48d12c4f9c1 Author: Huy Do Date: Tue Nov 8 01:17:35 2022 +0000 [Reland] Use sudo when reset NVIDIA devices (#88605) I accidentally deleted my remote branch, so I need to create a new PR for this fix (instead of updating the reverted PR https://github.com/pytorch/pytorch/pull/88531) TIL, sudo echo doesn't do what I think it does; the correct syntax should be `echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset`, granting sudo permission to the latter tee command.
Due diligence: actually log in to `i-07e62045d15df3629` and make sure that the command works Pull Request resolved: https://github.com/pytorch/pytorch/pull/88605 Approved by: https://github.com/ZainRizvi commit c77368d41615835e5124affe79f88feed93e8855 Author: Antoni Viros i Martin Date: Tue Nov 8 00:03:14 2022 +0000 Implement a constructor for nested_tensor that is similar to torch.tensor() (#88213) Summary: This diff merges both previous implementations of constructors for nested tensors, the one from lists of tensors and the one with arbitrary python lists, and implements it in pytorch core so no extensions are needed to construct NT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88213 Approved by: https://github.com/cpuhrsch commit 72a7351993c953500bd8cdb1fb7a9e33aaa7ef9d Author: Huy Do Date: Mon Nov 7 23:53:17 2022 +0000 Pin linux ninja dep to 1.10.2 (#88548) The latest version 1.11.1 breaks PyTorch CI. A bunch of tests are failing now in master https://hud.pytorch.org/pytorch/pytorch/commit/d1ee0730410ac910760c0a21156e574093a0d15a. Curiously, the latest commit https://hud.pytorch.org/pytorch/pytorch/commit/81042d3a53335259c60e5aa8c9b9614c3d87b05f looks green, but it's good to pin this dependency anyway. https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt#L95-L97 has a curious note about ninja and why it's not part of the docker container (need to revisit this later on). This is one more reason to justify the effort of consolidating all pip and conda dependencies to get rid of this family of issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88548 Approved by: https://github.com/clee2000 commit fdf286510828e149b896235db07d48ab51cd1121 Author: Huy Do Date: Mon Nov 7 23:49:19 2022 +0000 Use test/test-reports for inductor (#88533) So that the test reports can be picked up automatically by the CI and uploaded to S3. Later on, this will allow querying these test reports from our Rockset DB. For example https://github.com/pytorch/pytorch/actions/runs/3382363153/jobs/5617382531 `Upload test statistics` shows: ``` + python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test No tests in reports found in test ``` https://hud.pytorch.org/pytorch/pytorch/commit/678d038001b0bd61501739ea97989d28f758343e inductor artifacts are also empty zip at the moment Pull Request resolved: https://github.com/pytorch/pytorch/pull/88533 Approved by: https://github.com/desertfire commit eb3f975c6e29104014fa9bbffe12ab32709672d9 Author: Peter Bell Date: Sun Nov 6 23:38:12 2022 +0000 Fix segfault in has_torch_function (#88559) Fixes #83908 `PySequence_Fast` may return `NULL` to indicate an error was raised, in which case `sequence_has_torch_function` will dereference a null pointer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88559 Approved by: https://github.com/ezyang, https://github.com/Skylion007, https://github.com/hameerabbasi commit 4796e23bbbdcbfa9110338af3c445ca366bd0b2b Author: Huy Do Date: Mon Nov 7 23:05:11 2022 +0000 Fix pull docs build running with a schedule and increase cpp doc timeout to 4h (#88589) * After https://github.com/pytorch/pytorch/pull/88373, pull workflow can now be triggered with a schedule. This changes the assumption in the doc build workflow where the schedule event is used to determine if the docs should be pushed * I'll create a follow-up issue to see if it's possible to improve the performance of the cpp doc build job.
At the moment, it uses a linux.12xlarge runner and still couldn't finish the job after 3h Pull Request resolved: https://github.com/pytorch/pytorch/pull/88589 Approved by: https://github.com/seemethere, https://github.com/ZainRizvi commit d453b3c4d4b1cc5a0c626221a1f389dfa862ca5e Author: lezcano Date: Mon Nov 7 19:40:25 2022 +0000 Add a note on the stability of linalg functions. (#88313) This was long-due, as it keeps comming up in issues. Fixes https://github.com/pytorch/pytorch/issues/85950 Fixes https://github.com/pytorch/pytorch/issues/59720 Fixes https://github.com/pytorch/pytorch/issues/59782 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88313 Approved by: https://github.com/soumith, https://github.com/mruberry commit b00c43b310e7544ed74daa84a9638fddbe190304 Author: PyTorch MergeBot Date: Mon Nov 7 22:29:56 2022 +0000 Revert "fallback for scatter_(scalar) (#88210)" This reverts commit 896fa8c5c9b0191c9621e04ab5e20057614d48ad. Reverted https://github.com/pytorch/pytorch/pull/88210 on behalf of https://github.com/suo due to this broke inductor tests, see: https://hud.pytorch.org/pytorch/pytorch/commit/896fa8c5c9b0191c9621e04ab5e20057614d48ad commit 0e67b2f7dd13db1fea421d860ede65a653738dfe Author: William Wen Date: Mon Nov 7 22:24:44 2022 +0000 Dynamo Dashboard Improvements (#88516) Implement various features in https://github.com/pytorch/torchdynamo/issues/1644: - Upload nightly run logs to /fsx before parsing - for backing up parsing failures. - Flag models with (1) < 0.95x speedup, (2) > 2min compile time, (3) < 0.9x compression ratio - Flag models that were passing yesterday but failed today. - Other small bug fixes. See https://github.com/pytorch/torchdynamo/issues/1831 for sample outputs. Also tested by running run.sh: ```bash rm -rf ../test-dynamo-runner-logs-3/ mkdir ../test-dynamo-runner-logs-3/ python benchmarks/dynamo/torchbench.py --performance --float32 -dcuda --output=../test-dynamo-runner-logs-3//inductor_torchbench_float32_training_cuda_performance.csv --training --inductor --no-skip --dashboard --only mobilenet_v2 --cold_start_latency python benchmarks/dynamo/torchbench.py --accuracy --float32 -dcuda --output=../test-dynamo-runner-logs-3//inductor_torchbench_float32_training_cuda_accuracy.csv --training --inductor --no-skip --dashboard --only mobilenet_v2 ``` with the command `python benchmarks/dynamo/runner.py --output-dir ../test-dynamo-runner-logs-3/ --dashboard-archive-path /data/home/williamwen/dynamo-runner-logs-copy --training --run --compilers inductor --flag-compilers inductor --suites torchbench --update-dashboard` (need to comment out the `generate_commands` line and change the github issue ID from 681 to something else). Pull Request resolved: https://github.com/pytorch/pytorch/pull/88516 Approved by: https://github.com/anijain2305 commit b14e06503a67ed72c2a84462d34e7494f3ead5b1 Author: Aaron Gokaslan Date: Mon Nov 7 22:17:10 2022 +0000 (fix): Add some missing std::moves to C10 (#88512) I saw some missed optimization opportunities in C10 using std::move and thought I would submit a PR to fix them. There are particularly a lot of them dealing with the symbolic operators which are used in quite a few places including in loops. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88512 Approved by: https://github.com/ezyang commit d8506ff42b3d0dd8d25ab989967daffba13268cd Author: lezcano Date: Mon Nov 7 19:21:24 2022 +0000 Generalize gesvdjBatched to run whith full_matrices==false (#88502) As brought up in https://github.com/pytorch/pytorch/issues/86234#issuecomment-1268296036, our heuristic for which SVD backend to choose was not great in some cases. The case in which there could be some improvements is when we have a large batch of very small non-square matrices. This PR, adapts the calling code to gesvdj by creating two temporary square buffers to allow to call gesvdjBatched, and then copies back the result into the output buffers. We then modify the heuristic that chooses between gesvdj and gesvdjBatched. Fixes https://github.com/pytorch/pytorch/issues/86234 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88502 Approved by: https://github.com/IvanYashchuk, https://github.com/nikitaved, https://github.com/mruberry, https://github.com/xwang233 commit 9dadf8fcc21413fe12ea2c81d970f4877a9235a3 Author: Vitaly Fedyunin Date: Mon Nov 7 10:30:55 2022 -0500 [DataPipes] Add group support to the sharding_filter (#88424) Differential Revision: [D41006747](https://our.internmc.facebook.com/intern/diff/D41006747) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88424 Approved by: https://github.com/ejguan commit 23a3eb37cfa52fcbfb766bd733cfa60b28b83f42 Author: Edward Z. Yang Date: Mon Nov 7 08:51:15 2022 -0800 SymIntify _copy functionalization kernels (and _copy_out too) (#88572) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88572 Approved by: https://github.com/anjali411, https://github.com/bdhirsh commit 896fa8c5c9b0191c9621e04ab5e20057614d48ad Author: Nikolay Korovaiko Date: Mon Nov 7 21:25:55 2022 +0000 fallback for scatter_(scalar) (#88210) `scatter_reduce_` overloads can only accept `Tensor src`. `scatter_`, on the other hand, can accept `Number src`. 
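For context, a minimal sketch of the signature difference described above (the example tensors are made up, not taken from the PR):
```python
import torch

x = torch.zeros(3, 5)
index = torch.tensor([[0, 1, 2]])

# scatter_ has an overload that accepts a plain Python number as src ...
x.scatter_(0, index, 1.0)

# ... while scatter_reduce_ only takes a Tensor src, so a scalar must be
# materialized as a tensor first.
x.scatter_reduce_(0, index, torch.ones_like(index, dtype=x.dtype), reduce="sum")
```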
Switching a fallback from `scatter_reduce_` to `scatter_` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88210 Approved by: https://github.com/desertfire commit 0a69c50a46d50ae265e2d1d826d0b4b69d4351fd Author: Jane Xu Date: Mon Nov 7 21:15:07 2022 +0000 Publicly expose _LRScheduler to LRScheduler (#88503) Fixes #61232 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88503 Approved by: https://github.com/soulitzer commit 05b9e8ec00274ffb8dc94b974d1335d5986f9620 Author: Huy Do Date: Mon Nov 7 21:04:02 2022 +0000 Upload test stats for inductor workflow (#88535) We miss this new workflow, so none of its test stats are uploaded to rockset Pull Request resolved: https://github.com/pytorch/pytorch/pull/88535 Approved by: https://github.com/desertfire commit a37524085df7685820f9c15c39d95da077d49be7 Author: Yu Guo Date: Fri Nov 4 16:51:35 2022 -0700 [torchdynamo] support torch.autograd._profiler_enabled (#88378) fix https://github.com/pytorch/torchdynamo/issues/1826 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88378 Approved by: https://github.com/voznesenskym commit 95d57b54e024c4d0442c0c76cb37b1b3ac06db26 Author: Sherlock Huang Date: Fri Nov 4 17:10:21 2022 +0000 Handle pin_memory in refs.randn (#88473) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88473 Approved by: https://github.com/mruberry commit bf49dada1e3b94621823f0d9017081683f107ece Author: Michael Suo Date: Mon Nov 7 08:57:51 2022 -0800 [nvfuser] skip extremal tests on rocm (#88587) Summary: These are failing in rocm so disable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88587 Approved by: https://github.com/ZainRizvi, https://github.com/huydhn commit 7bf9db81c5b19fb1fb5c2056e03f183a85ebfc5c Author: PyTorch MergeBot Date: Mon Nov 7 19:59:42 2022 +0000 Revert "Use sudo when reset NVIDIA devices (#88531)" This reverts commit 505486ce9321bc22d2156a1a9b97fe474a05b53b. Reverted https://github.com/pytorch/pytorch/pull/88531 on behalf of https://github.com/huydhn due to Wrong sudo echo usage, should use tee instead commit 78a0ca29d939fc3017c3281730ba19ece5162f5c Author: PyTorch MergeBot Date: Mon Nov 7 18:51:16 2022 +0000 Revert "[fix] allow saving python attr on Tensor and Parameter via torch.save (#81616)" This reverts commit 54b6188cc6dee45b775d688223b847dc8ea85bff. 
Reverted https://github.com/pytorch/pytorch/pull/81616 on behalf of https://github.com/mehtanirav due to Internal publishing is broken commit 91a403984255418142abcf0966f2aa02ff4ae5ef Author: Angela Yi Date: Mon Nov 7 18:42:41 2022 +0000 [exir][fx] PassManager error handling (#88520) Summary: * Added an error message for when the result is not a PassResult * Modified the error handling to capture exceptions that happen in the check() function * consolidated inplace_wrapper and pass_result_wrapper Test Plan: CI Differential Revision: D40950135 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88520 Approved by: https://github.com/SherlockNoMad commit bd1ffc6501376c6a00dec67d2dd8482470a140b5 Author: Yanbo Liang Date: Mon Nov 7 18:03:31 2022 +0000 [Dynamo] Fix bug: GradMode doesn't carry grad state correctly after graph break (#88537) Fixes https://github.com/pytorch/torchdynamo/issues/1446 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88537 Approved by: https://github.com/jansel commit 6663ae5537f3c61030ba4d425bd57a097c51430a Author: Rodrigo Kumpera Date: Mon Nov 7 17:56:40 2022 +0000 [2/n] Thread PG: add class _World to distributed_c10d.py (#781) (#88471) Summary: X-link: https://github.com/pytorch/torchrec/pull/781 Move a bunch of globals to instance methods and replace all use to them. We move all PG related globals under World and use a singleton instance under _world. This creates an undocumented extension point to inject full control of how how c10d state behaves. One simple hack is to change _world to an implementation that uses a threadlocal and enable per-thread PGs. It almost get DDP working and the PG is missing an implementation of all_reduce. This enables notebook usage of PTD, which is a big deal for learning it: https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68 This change ensures BC by keeping the global variables around and have the default _World wrap it. I have relinked this diff to a new github PR, so that I can update it. The original PR is > Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348 Differential Revision: D40236769 Pulled By: yhcharles Pull Request resolved: https://github.com/pytorch/pytorch/pull/88471 Approved by: https://github.com/gnadathur, https://github.com/rohan-varma commit fc8f2f66fecea51c80357c424ab6b336b744ca80 Author: Zain Rizvi Date: Mon Nov 7 17:38:42 2022 +0000 Clarify rules for which commit is used in CI (#88425) The old information was out of date. Updating it as per @janeyx99's feedback Pull Request resolved: https://github.com/pytorch/pytorch/pull/88425 Approved by: https://github.com/malfet commit c407a7b20330afb957944ad26633a388220a4e43 Author: Huy Do Date: Mon Nov 7 17:26:28 2022 +0000 Upgrade Linux NVIDIA driver to the latest prod version (#88517) The driver (515.76) is downloaded from https://www.nvidia.com/en-us/drivers/unix. This should help address the issue with A10G GPU on G5 runners according to NVIDIA. This is to address https://github.com/pytorch/pytorch/issues/88352 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88517 Approved by: https://github.com/ZainRizvi commit 505486ce9321bc22d2156a1a9b97fe474a05b53b Author: Huy Do Date: Mon Nov 7 17:19:02 2022 +0000 Use sudo when reset NVIDIA devices (#88531) Per title, I should have known, i.e. 
https://ossci-raw-job-status.s3.amazonaws.com/log/9307292415 ``` 2022-11-04T23:52:18.2921665Z + echo 1 2022-11-04T23:52:18.2921862Z Reseting 0000:00:1e.0 (enabled state: 0) 2022-11-04T23:52:18.2922186Z .github/scripts/install_nvidia_utils_linux.sh: line 77: /sys/bus/pci/devices/0000:00:1e.0/reset: Permission denied ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88531 Approved by: https://github.com/ZainRizvi commit cec4bd99b05a0beb548a821c5efc8a02833ba2c3 Author: Nikolay Korovaiko Date: Mon Nov 7 17:02:08 2022 +0000 allow XLA folks update the pin (#88527) This is one of the files the XLA team needs to update occasionally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88527 Approved by: https://github.com/wconstab commit a16ced03c93dcbc5b08d0f9a36f8feab583f129a Author: Brian Hirsh Date: Fri Nov 4 14:20:19 2022 -0700 reland "fix as_strided_scatter_backward (#87646)" (#88342) This reverts commit 71fb763e5452881cb3be8fefa9419b785d0a61e2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88342 Approved by: https://github.com/zou3519 commit dd43903fa99b8549225ec63c2e81ef4693436be0 Author: Mike Iovine Date: Mon Nov 7 14:36:39 2022 +0000 [Static Runtime] Fix tensor_split sections overload (#88113) Summary: D40798763 broke this op. Unfortunately, it wasn't caught at land time due to the recent OSS Static Runtime test problems. The problem is C++ overload resolution. After D40798763, the int that we were passing to `at::native::tensor_split` was getting implicitly converted to `IntArrayRef`. Fix this by converting the int to a `SymInt` and calling the correct overload. Test Plan: ``` buck2 test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Tensor_Split --run-disabled ``` Differential Revision: D40862394 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88113 Approved by: https://github.com/hlu1 commit 7076a6481d9f6d3ed40af1eac285fe5046a87531 Author: PyTorch MergeBot Date: Mon Nov 7 10:22:44 2022 +0000 [xla hash update] update the pinned xla hash (#88070) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88070 Approved by: https://github.com/pytorchbot commit ad27d762a7457c6a7f5b0c4c6778935c282df71b Author: Wang, Eikan Date: Fri Nov 4 05:28:18 2022 +0000 Support sign for HF models like ElectraForQuestionAnswering (#88160) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88160 Approved by: https://github.com/jansel commit a9d37ce8f50a3111cc9eaf4f633decd092b9d726 Author: Wang, Eikan Date: Fri Nov 4 05:28:17 2022 +0000 Support reduction vectorization (#87356) This PR is to optimize the reduction implementation by `at::vec`. The main idea is the same as the aten implementation. - Step1: Parallelize and vectorize the reduction implementation - Step2: Invoke `at::vec::vec_reduce_all` to reduce the vector generated at step 1 to a single scalar - Step3: Handle the tail elements For the implementation, we create two kernels - `CppVecKernel` and `CppKernel`. The code block generation is as follows, step by step.
- Gen the non-reduction loop - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1008-L1010) - Gen the reduction initialization both for vectorization and non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1015) - Gen the reduction loop for the vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1021-L1023) - Gen the code to reduce the vector to scalar - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1033) - Gen the reduction loop for the non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1042) - Do some post-reduction things like store reduction value - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1049) ```python for loop in CppVecKernel.NoneReductionLoop: CppVecKernel.ReductionPrefix for loop in CppVecKernel.ReductionLoop CppVecKernel.Loads CppVecKernel.Compute CppVecKernel.Stores CppVecKernel.ReductionSuffix for loop in CppKernel.ReductionLoop CppKernel.Loads CppKernel.Compute CppKernel.Stores CppKernel.ReductionSuffix ``` The code snippet for maximum reduction exemplifies the idea. More detailed comments are inlined. ```C++ { // Declare reduction for at::vec::Vectorized since it is not built-in data type. float tmp4 = 0; // tmp4_vec is used to vectorize the sum reduction for tmp4 auto tmp4_vec = at::vec::Vectorized(tmp4); float tmp6 = 0; // tmp6_vec is used to vectorize the sum reduction for tmp6 auto tmp6_vec = at::vec::Vectorized(tmp6); { // Parallelize the vectorized reduction for(long i0=0; i0<192; i0+=1) { auto tmp0 = at::vec::Vectorized::loadu(in_ptr0 + 8*i0); auto tmp1 = at::vec::Vectorized::loadu(in_ptr1 + 8*i0); auto tmp2 = tmp0 - tmp1; auto tmp3 = tmp2.abs(); auto tmp5 = tmp2 * tmp2; tmp4_vec += tmp3; tmp6_vec += tmp5; } // Reduce the tmp4_vec as a scalar and store at tmp4 tmp4 = at::vec::vec_reduce_all([](at::vec::Vectorized& x, at::vec::Vectorized&y) {return x + y;}, tmp4_vec); // Reduce the tmp6_vec as a scalar and store at tmp6 tmp6 = at::vec::vec_reduce_all([](at::vec::Vectorized& x, at::vec::Vectorized&y) {return x + y;}, tmp6_vec); // Handle the tail elements that could not be vectorized by aten. 
for(long i0=1536; i0<1536; i0+=1) { auto tmp0 = in_ptr0[i0]; auto tmp1 = in_ptr1[i0]; auto tmp2 = tmp0 - tmp1; auto tmp3 = std::abs(tmp2); auto tmp5 = tmp2 * tmp2; tmp4 += tmp3; tmp6 += tmp5; } } out_ptr0[0] = tmp4; out_ptr1[0] = tmp6; } ``` Performance (measured by operatorbench; the baseline for the speedup ratio is aten operator performance):

Softmax (1,16,384,384,dim=3) | Speedup ratio (simdlen=None) | Speedup ratio (simdlen=8) + this PR
-- | -- | --
24c | 0.37410838067524177 | 0.9036240100351164
4c | 0.24655829520907663 | 1.0255329993674518
1c | 0.21595768114988007 | 1.000587368005134

HW Configuration: SKU: SKX Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz MemTotal: 196708148 kB MemFree: 89318532 kB MemBandwidth: 112195.1MB/S Pull Request resolved: https://github.com/pytorch/pytorch/pull/87356 Approved by: https://github.com/jgong5, https://github.com/jansel commit 6541e51ffd84b044cfde81bb2ea241a75a87952d Author: Wang, Eikan Date: Fri Nov 4 05:28:15 2022 +0000 Explicit vectorization support for TorchInductor (#87068) In this PR, we replace OMP SIMD with `aten::vec` to optimize TorchInductor vectorization performance. Take `res=torch.exp(torch.add(x, y))` as the example. The generated code is as follows if `config.cpp.simdlen` is 8. ```C++ extern "C" void kernel(const float* __restrict__ in_ptr0, const float* __restrict__ in_ptr1, float* __restrict__ out_ptr0, const long ks0, const long ks1) { { for(long i0=0; i0<((ks0*ks1) / 8); ++i0) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0); auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0); auto tmp2 = tmp0 + tmp1; auto tmp3 = tmp2.exp(); tmp3.store(out_ptr0 + 8*i0); } for(long i0=8*(((ks0*ks1) / 8)); i0<(ks0*ks1); ++i0) { auto tmp0 = in_ptr0[i0]; auto tmp1 = in_ptr1[i0]; auto tmp2 = tmp0 + tmp1; auto tmp3 = std::exp(tmp2); out_ptr0[i0] = tmp3; } } } ``` Date: Mon Nov 7 05:48:22 2022 +0000 use faster cache flush in triton benchmarking (#88557) Speeds up autotuning a little bit more (about 90s -> 75s for coat_lite_mini) @bertmaher, I've put in a workaround so that internal doesn't break, but it can be removed once triton is updated internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88557 Approved by: https://github.com/anijain2305 commit eda247ee6ce2f8bc29d86ec94f3863f929a2ea6e Author: YJ Shi Date: Mon Nov 7 01:33:57 2022 +0000 [Dynamo] fix torchdynamo's TVM meta schedule backend (#88249) Note that the previous `optimize_torch` functionality of pytorch does not work with the default pytorch release (CXX11 ABI off), as TVM by default needs the CXX11 ABI for builds. Source: [1](https://discuss.tvm.apache.org/t/can-someone-please-give-me-the-steps-to-use-pt-tvmdsoop/12525), [2](https://discuss.pytorch.org/t/undefined-symbol-when-import-lltm-cpp-extension/32627). It would be easier for users to tune with meta schedule instead of finding a CXX11-compatible pytorch, turning on the `pt-tvmdsoop` flag in TVM and rebuilding it. This could be revisited once the `pt-tvmdsoop` flag is updated and turned on by default in TVM. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88249 Approved by: https://github.com/jansel commit 791d9ee2533d394dc26cff64de74df72d45835e4 Author: Peter Bell Date: Thu Nov 3 16:22:50 2022 +0000 [inductor] Add lowering for as_strided_scatter (#88379) Ref pytorch/torchdynamo#327 The use of as_strided does require in-memory manipulations, however this lowering allows those memory ops to be fused with any preceding calculations. e.g.
``` def f(a, b): return torch.as_strided_scatter( a * 8 + 10, b * 2 - 4, size=(a.numel() // 2,), stride=(2,)) ``` Before this compiles to two kernels and a call to `aten.as_strided_scatter` and with this PR it compiles to just two kernels and no additional operator calls. In theory I think this could be a decomposition, but in practice I saw the `output_view.copy_(src)` being optimized out in some cases when this was implemented as a decomposition. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88379 Approved by: https://github.com/jansel commit 81042d3a53335259c60e5aa8c9b9614c3d87b05f Author: PyTorch MergeBot Date: Sun Nov 6 02:29:53 2022 +0000 Revert "Reenable optimizer overlap tests (#88439)" This reverts commit da452bcadbc6f34989c6b3b0db6075a272aa9891. Reverted https://github.com/pytorch/pytorch/pull/88439 on behalf of https://github.com/huydhn due to This change breaks trunk due to a land race missing reason parameter to sandcastle_skip_if https://hud.pytorch.org/pytorch/pytorch/commit/da452bcadbc6f34989c6b3b0db6075a272aa9891 commit bbaa0637df93292eb372b355f01756437aed3ce9 Author: Nikita Karetnikov Date: Fri Nov 4 11:50:18 2022 +0100 Add error inputs to `gaussian_nll_loss` `OpInfo` (#88486) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88486 Approved by: https://github.com/lezcano commit 404f254e205a5aef6a21138d8db17f2ac9d031ae Author: Rohan Varma Date: Sat Nov 5 08:31:02 2022 +0000 Upstream apply_optim_in_backward from TorchRec (#87397) (#88539) Summary: Upstreaming this as part of sharing common APIs. This is just a plain move, any changes needed to support DDP / FSDP will come in follow up diffs. Test Plan: CI Reviewed By: zhaojuanmao Differential Revision: D40564646 fbshipit-source-id: 619c434e02196812f8d4db1e40d07290e08b18f9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88539 Approved by: https://github.com/awgu commit da452bcadbc6f34989c6b3b0db6075a272aa9891 Author: Rohan Varma Date: Thu Nov 3 18:33:14 2022 +0000 Reenable optimizer overlap tests (#88439) Closes https://github.com/pytorch/pytorch/issues/73259. Not sure the root cause but CI seems fine with these tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88439 Approved by: https://github.com/awgu commit d1ee0730410ac910760c0a21156e574093a0d15a Author: Edward Z. Yang Date: Wed Nov 2 16:39:49 2022 -0400 Handle case when candidate is empty (#88359) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88359 Approved by: https://github.com/wconstab commit 46730aec35ee047b92b288e0366da0f7e993e5ae Author: Sherlock Huang Date: Fri Nov 4 23:11:17 2022 +0000 [Reland] Fix primTorch compute_elementwise_output_strides (#88525) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88525 Approved by: https://github.com/desertfire commit 0e3031f7e76fbd84e62650642dc334c11cc3c511 Author: Edward Z. Yang Date: Fri Nov 4 12:31:51 2022 -0700 Functionalize and compute joint simultaneously. (#88063) This also comes with some bug fixes that were uncovered from doing this: - Forward device calls to inner tensor in FunctionalTensorWrapper - Make legacyExtractDispatchKey exclude Functionalize, so that it can get at the real device type key. This is noncontroversial. - Stop stripping dense from key set. The reason for this is FunctionalWrapperTensor may be used in contexts where people query if it is dense or not. If it doesn't report this correctly (from the dispatch key), it will cause errors. 
This caused some torchbench models to fail when I did one-pass tracing. - Save and restore reapply views TLS correctly Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88063 Approved by: https://github.com/bdhirsh commit 957a9b63c5c2953da3a1d1fc86c20703c96b2fa6 Author: Sherlock Huang Date: Fri Nov 4 05:01:27 2022 +0000 fx.replace_pattern accepts pattern/replacement as GraphModule (#88479) Symbolic tracer is no longer the default tracer to produce fx graph. SubgraphRewriter should thus accept a raw GraphModule, rather than use symbolic tracer by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88479 Approved by: https://github.com/jerryzh168 commit 4bb5c2c2051371bfed09f9ec46416f3dba550c14 Author: Will Constable Date: Fri Nov 4 22:05:21 2022 +0000 Add docstring to DDPOptimizer (#88521) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88521 Approved by: https://github.com/aazzolini commit 1f32c3c087503151e87e235e78ebd92fe5090d79 Author: Will Constable Date: Fri Nov 4 21:00:01 2022 +0000 Add single-process DDP accuracy support to dynamo benchmark suite (#88511) - does not intend to support multi-process, as that is more complex and we have torchbench scripts for that - currently only works in accuracy mode as this was the main goal, but could be extended for measuring single-gpu perf impact of graph breaks Run with `python benchmarks/dynamo/torchbench.py --inductor --training --accuracy --only hf_Bert --ddp` Example output ``` cuda train hf_Bert [2022-11-04 18:52:08,304] torch._inductor.compile_fx: [WARNING] skipping cudagraphs due to complex input striding PASS ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88511 Approved by: https://github.com/davidberard98, https://github.com/aazzolini commit 3fd0729bb663d039204cbcea0726e028541a25ad Author: Will Constable Date: Fri Nov 4 16:27:48 2022 +0000 DDPOptimizer replace debug=True/False with using torchdynamo logger (#88480) Example output: ``` 2022-11-04 05:09:29,525] torch._dynamo.optimizations.distributed: [INFO] DDPOptimizer bucket assignments ┌─────────┬────────────┬───────────────────┐ │ Index │ Size (b) │ Param Names │ ├─────────┼────────────┼───────────────────┤ │ 0 │ 100120020 │ self_net_6_weight │ ├─────────┼────────────┼───────────────────┤ │ │ │ self_net_6_bias │ ├─────────┼────────────┼───────────────────┤ │ │ │ self_net_4_weight │ ├─────────┼────────────┼───────────────────┤ │ │ │ self_net_4_bias │ ├─────────┼────────────┼───────────────────┤ │ 1 │ 100020000 │ self_net_2_weight │ ├─────────┼────────────┼───────────────────┤ │ │ │ self_net_2_bias │ ├─────────┼────────────┼───────────────────┤ │ 2 │ 220000 │ self_net_0_weight │ ├─────────┼────────────┼───────────────────┤ │ │ │ self_net_0_bias │ └─────────┴────────────┴───────────────────┘ [2022-11-04 05:09:29,527] torch._dynamo.optimizations.distributed: [DEBUG] ---orig graph--- graph(): %inputs : torch.Tensor [#users=1] = placeholder[target=inputs] %self_net_0 : [#users=1] = call_module[target=self_net_0](args = (%inputs,), kwargs = {}) %self_net_1 : [#users=1] = call_module[target=self_net_1](args = (%self_net_0,), kwargs = {}) %self_net_2 : [#users=1] = call_module[target=self_net_2](args = (%self_net_1,), kwargs = {}) %self_net_3 : [#users=1] = call_module[target=self_net_3](args = (%self_net_2,), kwargs = {}) %self_net_4 : [#users=1] = call_module[target=self_net_4](args = (%self_net_3,), kwargs = {}) %self_net_5 : [#users=1] = call_module[target=self_net_5](args = (%self_net_4,), 
kwargs = {}) %self_net_6 : [#users=1] = call_module[target=self_net_6](args = (%self_net_5,), kwargs = {}) %self_net_7 : [#users=1] = call_module[target=self_net_7](args = (%self_net_6,), kwargs = {}) return (self_net_7,) ---split graph--- graph(): %inputs : torch.Tensor [#users=1] = placeholder[target=inputs] %submod_0 : [#users=1] = call_module[target=submod_0](args = (%inputs,), kwargs = {}) %submod_1 : [#users=1] = call_module[target=submod_1](args = (%submod_0,), kwargs = {}) %submod_2 : [#users=1] = call_module[target=submod_2](args = (%submod_1,), kwargs = {}) return (submod_2,) ---submod_0 graph--- graph(): %inputs : [#users=1] = placeholder[target=inputs] %self_net_0 : [#users=1] = call_module[target=self_net_0](args = (%inputs,), kwargs = {}) %self_net_1 : [#users=1] = call_module[target=self_net_1](args = (%self_net_0,), kwargs = {}) return self_net_1 ---submod_1 graph--- graph(): %self_net_1 : [#users=1] = placeholder[target=self_net_1] %self_net_2 : [#users=1] = call_module[target=self_net_2](args = (%self_net_1,), kwargs = {}) %self_net_3 : [#users=1] = call_module[target=self_net_3](args = (%self_net_2,), kwargs = {}) return self_net_3 ---submod_2 graph--- graph(): %self_net_3 : [#users=1] = placeholder[target=self_net_3] %self_net_4 : [#users=1] = call_module[target=self_net_4](args = (%self_net_3,), kwargs = {}) %self_net_5 : [#users=1] = call_module[target=self_net_5](args = (%self_net_4,), kwargs = {}) %self_net_6 : [#users=1] = call_module[target=self_net_6](args = (%self_net_5,), kwargs = {}) %self_net_7 : [#users=1] = call_module[target=self_net_7](args = (%self_net_6,), kwargs = {}) return self_net_7 --------------- ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88480 Approved by: https://github.com/anj-s, https://github.com/davidberard98 commit 52375a0fd2a5d16109c1ed4d25e1210d0df382a5 Author: jjsjann123 Date: Sat Nov 5 02:22:27 2022 +0000 nvprims native batch norm patch (#88455) Cherry-picking: https://github.com/csarofeen/pytorch/pull/2104 - [x] Added explicit cast on inputs to nvprims.native_batch_norm. This avoids the explicit cast, which gives us issue on fusion definition. - [x] add python repro with dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/88455 Approved by: https://github.com/mruberry, https://github.com/IvanYashchuk commit b1116a51173f474d55798b82faeee92deef4f9a8 Author: Yanbo Liang Date: Sat Nov 5 00:17:15 2022 +0000 [Dynamo] Improve BuiltinVariable log when incorrect arg count happens (#88409) Fixes https://github.com/pytorch/torchdynamo/issues/1832 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88409 Approved by: https://github.com/mlazos commit 5220d07d2ca3dd094b1d7aa7de242184291d342f Author: Michael Lazos Date: Fri Nov 4 23:26:44 2022 +0000 Fix minifier accuracy msg (#88515) Fixes https://github.com/pytorch/torchdynamo/issues/1809 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88515 Approved by: https://github.com/yanboliang, https://github.com/williamwen42 commit dde9affeaafff957a6d5bf98e33e4119b14cd2d5 Author: Mergen Nachin Date: Fri Nov 4 13:03:00 2022 -0700 Populate self.export in InstructionTranslatorBase (#88508) Summary: This is a followup to https://github.com/pytorch/pytorch/pull/88354/files#diff-622913fdb49db90d6f3a8ab225b4badb7996023e6498e9f7c6d03fe9f32d0986R836 Reference to self.export got added to InstructionTranslatorBase (i.e. STORE_ATTR) but self.export is populated only for InstructionTranslators. 
Here's an example failure ``` File "/scratch/williamwen/work/pytorch/torch/_dynamo/symbolic_convert.py", line 322, in step getattr(self, inst.opname)(inst) File "/scratch/williamwen/work/pytorch/torch/_dynamo/symbolic_convert.py", line 844, in STORE_ATTR not self.export AttributeError: 'InliningInstructionTranslator' object has no attribute 'export' ``` Let's populate with the base class with export flag. Test Plan: python test/dynamo/test_export_mutations.py python test/dynamo/test_export.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/88508 Approved by: https://github.com/tugsbayasgalan commit afdc2283ef09cecd2476725d02a770c4c297a3ce Author: Digant Desai Date: Fri Nov 4 23:01:45 2022 +0000 [QNNPACK] Add unaligned attributes where asan fails (#88276) Summary: Bypass "Runtime error: store to misaligned address [...] for type 'uint16_t' (aka 'unsigned short'), which requires 2 byte alignment" Test Plan: One of the failing tests, now passes `buck test fbsource//arvr/mode/platform010/dev-asan fbsource//arvr/libraries/eye/engine:sys_test_eyetrackingenginevisioninterface` Reviewed By: kimishpatel, salilsdesai Differential Revision: D40918376 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88276 Approved by: https://github.com/manuelcandales commit 7560a7b27c431f194dd6e06d24f7b49757ea562d Author: andrewor14 Date: Fri Nov 4 09:01:23 2022 -0700 [Quant] Respect non_leaf_module_list for activation modules (#88498) Summary: This commit fixes the bug where `non_leaf_module_list` was not respected for activation modules like `torch.nn.Sigmoid` and `torch.nn.Tanh`. Today, these modules default to `default_fixed_qparams_range_0to1_fake_quant`, and there is no way to configure them to use any other activation_post_process (e.g. FixedQParamsObserver) (see this [mapping](https://github.com/pytorch/pytorch/blob/dc00bb51b8d370bf3891f0edb2c6e0c2914e329a/torch/ao/quantization/quantization_mappings.py#L188-L193)). `non_leaf_module_list` is a "list of non-leaf modules we want to add observer" (see prepare docstring). If the user explicitly specified to insert observers for these modules, we should respect that instead of continuing to use the default. Test Plan: python test/test_quantization.py TestQuantizeEagerPTQStatic.test_activations_in_non_leaf_module_list Reviewers: vkuzo, jerryzh168 Subscribers: vkuzo, jerryzh168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88498 Approved by: https://github.com/jerryzh168 commit 5af3feefab99a23df393962f664eee1e33619803 Author: Jane Xu Date: Fri Nov 4 21:48:26 2022 +0000 [BE] Update native_functions.yaml README; we do not support Tensor! (#88513) Just a doc update to minimize confusion Pull Request resolved: https://github.com/pytorch/pytorch/pull/88513 Approved by: https://github.com/bdhirsh commit 678d038001b0bd61501739ea97989d28f758343e Author: Will Constable Date: Fri Nov 4 16:27:48 2022 +0000 Support DDP ignored parameters in DDPOptimizer (#88460) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88460 Approved by: https://github.com/aazzolini commit ff6770a9a1db4bb19db24c88bfe7a666722b45d2 Author: Andrew M. James Date: Thu Nov 3 13:55:54 2022 -0500 enable backward for log1p (sparse layouts) (#88155) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88155 Approved by: https://github.com/cpuhrsch commit 6938dd0b2cdb80d503a5d84c7e0cb7969ea47d93 Author: Andrew M. 
James Date: Thu Nov 3 13:55:54 2022 -0500 Support sparse inputs to deg2rad (#88156) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88156 Approved by: https://github.com/cpuhrsch commit 1964d8c34fd4afc4c8fd9f749350c9f7d98861f3 Author: Andrew M. James Date: Thu Nov 3 13:55:53 2022 -0500 Enable sparse_csr autograd testing for relu (#88154) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88154 Approved by: https://github.com/cpuhrsch commit f03302ba49318b5d6eea55b509fd448be39070f9 Author: Andrew M. James Date: Thu Nov 3 13:55:53 2022 -0500 Add sparse layout support for torch.frac (#88153) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88153 Approved by: https://github.com/cpuhrsch commit d632d94cc7bc1d60ae90b68c31c920f2828c341c Author: Catherine Lee Date: Fri Nov 4 20:47:42 2022 +0000 Disable mem leak check (#88373) tbh at this point it might be easier to make a new workflow and copy the relevant jobs... Changes: * Disable cuda mem leak check except for on scheduled workflows * Make pull and trunk run on a schedule which will run the memory leak check * Periodic will always run the memory leak check -> periodic does not have parallelization anymore * Concurrency check changed to be slightly more generous Pull Request resolved: https://github.com/pytorch/pytorch/pull/88373 Approved by: https://github.com/ZainRizvi, https://github.com/huydhn commit 093e22083613dd4b92c1ced20201edf713484a23 Author: Huy Do Date: Fri Nov 4 20:35:11 2022 +0000 Re-enable inductor models tests as periodical jobs (#88509) Run every 4 hour same as periodic, but offset by an hour. This should give us some signals instead of completely disabling these jobs on master after https://github.com/pytorch/pytorch/pull/88374 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88509 Approved by: https://github.com/malfet commit 3e6579b8f66b62462b98066bc6d98ed8046d38da Author: Jane Xu Date: Fri Nov 4 20:34:23 2022 +0000 Don't print fatal:... in generate_torch_version.py (#88335) During build, users commonly see a message like ``` fatal: no tag exactly matches 'd8b4f33324b1eb6c1103874764116fb68e0d0af4' ``` which is usually ignored when builds succeed, but has confused users when build fails (due to a different issue). This PR removes the red herring, since this usually prints for local development when tags are not found. We catch the exception anyway and handle it under the hood, so we don't need to print it and confuse the user. Test plan: Note that builds on trunk current have this line, cmd-F 'fatal: no tag exactly matches' in https://github.com/pytorch/pytorch/actions/runs/3379162092/jobs/5610355820. Then check in the PR build to see that the line no longer appears. I also tagged my commit locally and printed what tag would be--this code and the old code printed the same results for what tag would be. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88335 Approved by: https://github.com/seemethere commit 955cbe610bc3fe6913f2041d5215e1bf23a8dbd0 Author: Bin Bao Date: Fri Nov 4 18:00:28 2022 +0000 [inductor] Handle the case where kwargs contains tensor (#88417) Summary: Fix https://github.com/pytorch/torchdynamo/issues/1805; currently inductor does not allow any tensor in kwargs. 
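As a loose illustration only (this is not the repro from the issue), a minimal sketch of the general pattern, i.e. an op inside the compiled region receiving a tensor through keyword arguments:
```python
import torch
import torch._dynamo as dynamo

def f(x, w):
    # A tensor (w) is passed to an op via a keyword argument.
    return torch.nn.functional.linear(x, weight=w)

opt_f = dynamo.optimize("inductor")(f)
out = opt_f(torch.randn(4, 8), torch.randn(16, 8))
```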
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88417 Approved by: https://github.com/ngimel commit e940a2f8e2a3aa9d98291e73b3d40fcffb6182c8 Author: Kurt Mohler Date: Fri Nov 4 20:23:56 2022 +0000 Add nondeterministic error for `scatter` (#88244) Fixes #88096 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88244 Approved by: https://github.com/ezyang, https://github.com/mruberry commit 6575174dcb67ebfa5300d0ff2941189543187a3f Author: Mor Tzur Date: Fri Nov 4 20:18:08 2022 +0000 [fx2ait] fixes for AITSplitter (#87805) Summary: propagate lower settings to AITSplitter settings. Reviewed By: yinghai, qxy11 Differential Revision: D40568216 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87805 Approved by: https://github.com/yinghai commit 7b419e8513a024e172eae767e24ec1b849976b13 Author: jjsjann123 Date: Wed Nov 2 01:14:05 2022 -0700 [NVFuser] Upstream push 1026 (#87779) Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Codegen changes include: * codegen improvement: i. allow non-root trivial reductions, allow empty/no-op fusion ii. fixes vectorization checks and size calculation iii. bank conflict handle improvement iv. enables transpose scheduler * misc: i. CI tests failure fixes ii. cpp tests file clean up iii. trivial forwarding supports added in codegen runtime iv. added factory methods support in codegen Commits that's in this PR from the devel branch: ``` 7117a7e37ebec372d9e802fdfb8abb7786960f4a patching nvfuser conv cudnn test numerics mismatch (#2048) 65af1a4e7013f070df1ba33701f2d524de79d096 Inserting sync for redundant parallel types is already done at the (#2023) 6ac74d181689c8f135f60bfc1ec139d88941c98c Fix sync map (#2047) f5bca333355e2c0033523f3402de5b8aac602c00 Bank conflict checker improvements (#2032) d2ca7e3fd203537946be3f7b435303c60fa7f51e Minor update on cp.async code generation. (#1901) d36cf61f5570c9c992a748126287c4e7432228e0 Test file cleanup (#2040) 0b8e83f49c2ea9f04a4aad5061c1e7f4268474c6 Allow non-root trivial reductions (#2037) a2dfe40b27cd3f5c04207596f0a1818fbd5e5439 Fix vectorize size calculation (#2035) e040676a317fe34ea5875276270c7be88f6eaa56 Use withPredicate to replace setPredicate to maintain Exprs immutable (#2025) 197221b847ad5eb347d7ec1cf2706733aacbf97c removing ci workflow (#2034) 40e2703d00795526e7855860aa00b9ab7160755f Reduction rand like patch (#2031) bc772661cbdb3b711d8e9854ae9b8b7052e3e4a3 Add utility for checking bank conflict of shared memory (#2029) ddd1cf7695f3fb172a0e4bcb8e4004573617a037 Add back FusionReductionWithTrivialReduction_CUDA (#2030) fbd97e5ef15fa0f7573800e6fbb5743463fd9e57 Revert "Cleanup trivial reduction workarounds (#2006)" (#2024) bca20c1dfb8aa8d881fc7973e7579ce82bc6a894 Cleanup trivial reduction workarounds (#2006) e4b65850eee1d70084105bb6e1f290651adde23e Trivial forwarding (#1995) 1a0e355b5027ed0df501989194ee8f2be3fdd37a Fix contiguity analysis of predicates to match updated contiguity. 
(#1991) a4effa6a5f7066647519dc56e854f4c8a2efd2a7 Enable output allocation cache (#2010) 35440b7953ed8da164a5fb28f87d7fd760ac5e00 Patching bn inference (#2016) 0f9f0b4060dc8ca18dc65779cfd7e0776b6b38e8 Add matmul benchmark (#2007) 45045cd05ea268f510587321dbcc8d7c2977cdab Enable tests previously disabled due to an aliasing bug (#2005) 967aa77d2c8e360c7c01587522eec1c1d377c87e Contiguous indexing for View operations (#1990) a43cb20f48943595894e345865bc1eabf58a5b48 Make inlining even more modular (#2004) dc458358c0ac91dfaf4e6655a9b3fc206fc0c897 Test util cleanup (#2003) 3ca21ebe4d213f0070ffdfa4ae5d7f6cb0b8e870 More strict validation (#2000) a7a7d573310c4707a9f381831d3114210461af01 Fix build problem (#1999) fc235b064e27921fa9d6dbb9dc7055e5bae1c222 Just fixes comments (#1998) 482386c0509fee6edb2964c5ae72074791f3e43a cleanup (#1997) 4cbe0db6558a82c3097d281eec9c85ad2ea0893a Improve divisible split detection (#1970) 42ccc52bdc18bab0330f4b93ed1399164e2980c9 Minor build fix. (#1996) fcf8c091f72d46f3055975a35afd06263324ede6 Cleanup of lower_utils.cpp: Isolate out GpuLower usage (#1989) 15f2f6dba8cbf408ec93c344767c1862c30f7ecc Move ConcretizedBroadcastDomains to shared_ptr in GpuLower. (#1988) 8f1c7f52679a3ad6acfd419d28a2f4be4a7d89e2 Minor cleanup lower_unroll.cpp (#1994) 1d9858c80319ca7f0037db7de5f04e47f540d76c Minor cleanup (#1992) f262d9cab59f41c669f53799c6d4a6b9fc4267eb Add support for uniform RNG (#1986) eb1dad10c73f855eb1ecb20a8b1f7b6edb0c9ea3 Remove non-const functions, remove GpuLower instance on build, pass in ca_map. (#1987) 634820c5e3586c0fe44132c51179b3155be18072 Add support for some empty fusion (#1981) eabe8d844ad765ee4973faa4821d451ef71b83c3 Segment self mapping fusions (#1954) e96aacfd9cf9b3c6d08f120282762489bdf540c8 Enable Transpose operation (#1882) 425dce2777420248e9f08893765b5402644f4161 Add a null scheduler that helps segmenting away no-op schedules (#1835) 306d4a68f127dd1b854b749855e48ba23444ba60 Fix canScheduleCompileTime check of transpose scheduler (#1969) b1bd32cc1b2ae7bbd44701477bddbcfa6642a9be Minor fix (#1967) bd93578143c1763c1e00ba613a017f8130a6b989 Enable transpose scheduler (#1927) b7a206e93b4ac823c791c87f12859cf7af264a4c Move scheduler vectorize utilities into their own file (#1959) d9420e4ca090489bf210e68e9912bb059b895baf View scheduling (#1928) c668e13aea0cf21d40f95b48e0163b812712cdf2 Upstream push ci fixes (#1965) c40202bb40ce955955bb97b12762ef3b6b612997 Fix dump effective bandwidth (#1962) 93505bcbb90a7849bd67090fe5708d867e8909e4 WAR on index mapping when exact and permissive maps differ (#1960) 45e95fd1d3c773ee9b2a21d79624c279d269da9f Allow splitting inner-most ID to create virtual innermost ID in transpose scheduler (#1930) a3ecb339442131f87842eb56955e4f17c544e99f Improve the comments at the beginning of index_compute.h (#1946) f7bc3417cc2923a635042cc6cc361b2f344248d6 Remove unused variables (#1955) df3393adbb5cb0309d091f358cfa98706bd4d313 Some cleanup (#1957) 7d1d7c8724ab5a226fad0f5a80feeac04975a496 TVDomainGuard factory (#1953) 357ba224c0fb41ed3e4e8594d95599c973f4a0ca Fill allocation with nan on tests (#1956) 8eafc54685d406f5ac527bcbacc475fda4492d7a Fix detection of unmappable root domains (#1952) 90a51f282601ba8ebd4c84b9334efd7762a234bc Some indexing cleanups, Add eye support (#1940) ddc01e4e16428aec92f9c84d698f959b6436a971 Exclude unsupported data types (#1951) 992e17c0688fe690c51b50e81a75803621b7e6aa test the groups the same order as they are merged (#1949) 208262b75d1fed0597a0329d61d57bc8bcd7ff14 Move detection of self mapping IDs to IterDomainGraph from 
(#1941) ac4de38c6ee53b366e85fdfe408c3642d32b57df Merge pull request #1945 from csarofeen/master_merge_0828 631094891a96f715d8c9925fb73d41013ca7f2e3 Add full, full_like, zeros, zeros_like, ones, ones_like (#1943) aab10bce4541204c46b91ff0f0ed9878aec1bfc4 Merge remote-tracking branch 'upstream/viable/strict' into HEAD 4c254c063bb55887b45677e3812357556a7aa80d Fix arange when step is negative (#1942) 89330aa23aa804340b2406ab58899d816e3dc3d2 Tensor factories must set the output shape as its input (#1939) ``` RUN_TORCHBENCH: nvfuser Differential Revision: [D40869846](https://our.internmc.facebook.com/intern/diff/D40869846) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87779 Approved by: https://github.com/davidberard98 commit 15e54293efb7d3ec58ded51e2854971d16b2fa66 Author: Li-Huai (Allan) Lin Date: Fri Nov 4 19:43:56 2022 +0000 [MPS] Fix embedding backward with scalar index (#82809) Previously the embedding backward always expands `-1` dim to indices, resulting in the following error when the indices is a scalar: ``` error: Rank of data array must equal number of outer dimensions in indices array + rank of slice to update, 2 != 1 + 0 -:8:10: note: see current operation: %5 = "mps.scatter_nd"(%0, %arg1, %4) {batch_dims = 0 : ui32, mode = 0 : i32} : (tensor<10x5xf16>, ``` Now makes it conditional. Reproducer: ```python def repro(): w = torch.tensor([[-2.6465, 2.5859, 0.4688, 1.7949, 3.2676], [-3.1641, 8.9375, 5.7578, -2.9453, -6.5469], [ 2.0469, 1.3516, -8.7344, 6.0000, 1.3906], [ 6.5781, 7.8438, 6.9766, 3.2891, -5.1172], [-7.9414, 7.7344, 4.1875, 2.8574, 2.9531], [-0.4844, -5.6328, -6.8359, -4.5156, 3.7891], [ 4.9375, 6.6094, 6.7031, 0.6719, -6.4219], [ 7.0469, 8.2031, 4.4453, 1.7129, -2.4688], [ 1.2207, -3.3750, -2.4531, 7.4062, -6.0469], [-8.9688, 2.2656, 2.4160, -1.0176, 8.4531]], dtype=torch.float32, requires_grad=True) x = torch.tensor(5) out = torch.nn.functional.embedding(x, w) out.sum().backward() w_mps = w.detach().clone().to("mps").requires_grad_() x_mps = x.to("mps") out = torch.nn.functional.embedding(x_mps, w_mps) out.sum().backward() # error ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/82809 Approved by: https://github.com/malfet commit 5b767d404e9d9e80c7f900bb28f9ccde1d76bdaa Author: Codrin Popa Date: Fri Nov 4 19:31:16 2022 +0000 Modified roundup_power2_divisions to specify the number of divisions for each power of two interval (#87290) Summary: Improved roundup_power2_divisions knob so it allows better control of rouding in the PyTorch CUDA Caching Allocator. This new version allows setting the number of divisions per power of two interval starting from 1MB and ending at 64GB and above. An example use case is when rouding is desirable for small allocations but there are also very large allocations which are persistent, thus would not benefit from rounding and take up extra space. Test Plan: Tested locally Differential Revision: D40103909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87290 Approved by: https://github.com/zdevito commit b78b8727ff39fd47e3d465e3e6e6e6cf5e578c62 Author: ssjia Date: Thu Nov 3 09:33:25 2022 -0700 [vulkan] enable prepacking for Batchnorm op (#88433) Adds a `BatchNormPackedContext` so that the `batchnorm` op can use prepacking. Differential Revision: [D40721546](https://our.internmc.facebook.com/intern/diff/D40721546/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88433 Approved by: https://github.com/manuelcandales commit 53eac1d48222becc46d0654648648fbf172a1214 Author: Edward Z. 
Yang Date: Fri Nov 4 06:18:25 2022 -0700 Revert "Revert "Put Python Dispatcher cache in dict, clear it on new registrations. (#88329)"" (#88489) The bug was that I was accidentally caching at the wrong key name, so we were never actually hitting the cache. I've renamed the resolved key to final_key to avoid shadowing in this way. This reverts commit 410ce96a23a3496a45478e0b25ffac53aa3c116f. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88489 Approved by: https://github.com/albanD commit 79abea5683254897fb49dc30d747914de474192c Author: jjsjann123 Date: Fri Nov 4 19:17:07 2022 +0000 nvprim python runtime dtype correctness patch (#88452) Cherry-picking: https://github.com/csarofeen/pytorch/pull/2133 - [x] casts FusionDefinition output to original dtype recorded in the GraphModule - [x] add a python repro with dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/88452 Approved by: https://github.com/IvanYashchuk, https://github.com/mruberry commit 8c1c6759b28c73cff623c7fef71e0eca00087414 Author: PyTorch MergeBot Date: Fri Nov 4 19:12:35 2022 +0000 Revert "remove assert_allclose from torch.testing (#87974)" This reverts commit 5669e10d37fa3cca21cf82c843ae4c4e79da1b89. Reverted https://github.com/pytorch/pytorch/pull/87974 on behalf of https://github.com/mehtanirav due to Internal breakages from method removal commit bda688c186658b6b018ca88ec592d17eafcb4b2b Author: Edward Z. Yang Date: Fri Nov 4 12:39:03 2022 -0400 Fix typo in clones (#88501) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88501 Approved by: https://github.com/wconstab commit 633f0d620dcfc7681739e39b018dd13cc4f0090d Author: Shiyan Deng Date: Fri Nov 4 17:35:12 2022 +0000 [torch package] Treat builtins as default extern module (#88385) Summary: When using torch deploy, if we do fx transformation and then try to pickle/unpickle a fx GraphModule, it's possible that the GraphModule's code depends on `builtins` but we didn't add it to extern module. Reviewed By: PaliC Differential Revision: D40958730 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88385 Approved by: https://github.com/PaliC commit ead36e5a907c9fbcd837835e52ce448d428f228e Author: John Detloff Date: Fri Nov 4 17:31:17 2022 +0000 Add dep on Accelerate framework to torch podspecs (#88422) A dep on Accelerate was added in https://github.com/pytorch/pytorch/pull/80449 We need to declare this dep in our podspec, otherwise users will have to add the Accelerate framework to their projects manually. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88422 Approved by: https://github.com/kimishpatel, https://github.com/malfet commit dc00bb51b8d370bf3891f0edb2c6e0c2914e329a Author: Manuel Candales Date: Fri Nov 4 12:07:12 2022 +0000 [Vulkan][TCC] Add tests for conv2d prepack context (#88316) Summary: Implement Vulkan tests for the create/run context functions in Convolution.cpp, their transposed versions and their backwards compatible versions: - create_conv2d_context - run_conv2d_context - create_tconv2d_context - run_tconv2d_context - conv2d_clamp_prepack - conv2d_clamp_run Test Plan: On Mac ``` cd ~/fbsource buck run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 ``` On Android ``` cd ~/fbsource buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test adb shell "/data/local/tmp/vulkan_api_test" ``` Reviewed By: salilsdesai Differential Revision: D40935343 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88316 Approved by: https://github.com/salilsdesai commit a171b0636a058d0cd059d39f39e37d5cc1d38df1 Author: Wonjoo Lee Date: Fri Nov 4 08:23:54 2022 +0000 Add use_lazy_shape flag to GenLazyIr class (#88444) Add use_lazy_shape flag to GenLazyIr class to allow XLA to use its custom shape class. The default value is kept to use lazy shape, so this PR does not introduce any new behaviors. PyTorch/XLA companion PR: https://github.com/pytorch/xla/pull/4111 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88444 Approved by: https://github.com/alanwaketan, https://github.com/wconstab commit b3206268ace6ebcb5d716ed6673876e62ef484f2 Author: XiaobingSuper Date: Thu Nov 3 00:46:02 2022 -0400 TorchDynamo: enable convolution and batchnorm folding for inference path (#87435) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87435 Approved by: https://github.com/jgong5, https://github.com/jansel commit fbd08fb358b643386edd4dd28b9c747aab4ba8c1 Author: Pruthvi Madugundu Date: Fri Nov 4 04:43:05 2022 +0000 Introduce TORCH_DISABLE_GPU_ASSERTS (#84190) - Asserts for CUDA are enabled by default - Disabled for ROCm by default by setting `TORCH_DISABLE_GPU_ASSERTS` to `ON` - Can be enabled for ROCm by setting above variable to`OFF` during build or can be forcefully enabled by setting `ROCM_FORCE_ENABLE_GPU_ASSERTS:BOOL=ON` This is follow up changes as per comment in PR #81790, comment [link](https://github.com/pytorch/pytorch/pull/81790#issuecomment-1215929021) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84190 Approved by: https://github.com/jeffdaily, https://github.com/malfet commit 70b00b13830c8adbaa2db8f61d475c2458b707c4 Author: Will Constable Date: Thu Nov 3 22:55:24 2022 +0000 Add hf_bert + DDP multigpu test (#88435) Spot-checks an e2e model working with ddp. 
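For readers unfamiliar with the wrapping pattern that the hf_bert + DDP spot-check exercises, here is a minimal, single-process sketch of DDP usage; the model, backend, and port are illustrative assumptions, not the actual test configuration:
```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process world for illustration only; the real test runs on multiple GPUs.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(16, 16)
ddp_model = DDP(model)                       # gradients are synchronized across ranks
loss = ddp_model(torch.randn(4, 16)).sum()
loss.backward()
dist.destroy_process_group()
```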
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88435 Approved by: https://github.com/davidberard98 commit 71f793d31265578e2df673cf838ec456bc501d77 Author: XiaobingSuper Date: Thu Nov 3 00:46:01 2022 -0400 TorchDynamo: Add linear binary fusion for cpu in BF16 inference mode (#87066) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87066 Approved by: https://github.com/jgong5, https://github.com/jansel commit 7d95b1e3444a2fdae7c1a5ebb24072167b923c0a Author: Elias Ellison Date: Thu Nov 3 23:10:28 2022 +0000 Run all fallback kernels with FakeTensor (#88248) This improves the memory compression of resnet18 from .84 -> .94 on inductor no-cudagraphs. It does mean that any extern kernel which incorrectly computes strides will be a hard error at runtime, but that's an issue we are going to have to face with dynamic shapes anyway. CC @ezyang, @SherlockNoMad Pull Request resolved: https://github.com/pytorch/pytorch/pull/88248 Approved by: https://github.com/ezyang commit e4efea4f14fd26c1ec83ab25d0197c3e3d40c7a4 Author: XiaobingSuper Date: Thu Nov 3 00:45:59 2022 -0400 TorchDynamo: Add linear unary fusion for cpu in BF16 inference mode (#87065) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87065 Approved by: https://github.com/jgong5, https://github.com/jansel commit 657f2e12f0e212b2f4afd89ab2c824c409dcc951 Author: Nikita Shulga Date: Fri Nov 4 01:22:41 2022 +0000 [MPS] Add native `cumsum` implementation (#88319) Using https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/4057333-cumulativesumwithtensor?language=objc Fall back to CPU if running on older MacOS versions In `unary_op` add output tensor dims/dtype to the graph key (as even in default op we check output graph type) Also, upcast int16 to int32 as MPS cumsum op on Ventura returns incorrect results for Int16 type (and it makes total sense for int8, as chances for overflow are very high) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88319 Approved by: https://github.com/kulinseth commit 52173188efb3a8b3e5053357c66fd5bde45dc929 Author: XiaobingSuper Date: Thu Nov 3 00:45:54 2022 -0400 TorchDynamo: Add convolution binary fusion for cpu in inference mode (#87064) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87064 Approved by: https://github.com/jgong5, https://github.com/jansel commit 2ce2fc133d5f06e2d563176a96bc0cc8fa207670 Author: Elias Ellison Date: Thu Nov 3 18:58:07 2022 +0000 Disable Current Modes when printing Tensor (#88344) Fix for https://github.com/pytorch/pytorch/issues/88087 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88344 Approved by: https://github.com/ezyang, https://github.com/samdow commit e804c7229490474230a15df8a6eb5f1712828df6 Author: Jiewen Tan Date: Fri Nov 4 00:06:07 2022 +0000 [LTC] Update merge_rules.yaml (#88291) Summary: Some of the LTC code-gen infra has been moved from codegen/ to torchgen/. Update the merge_rules.yaml to reflect that. Test Plan: New GH PRs... 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88291 Approved by: https://github.com/malfet commit a84d68cdfd3b2d7e9f43221ac0ecc646db63a1d4 Author: Andrew Gu Date: Thu Nov 3 16:26:56 2022 +0000 [FSDP][Docs] Reword `sharding_strategy` docs and other minor doc changes (#88431) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88431 Approved by: https://github.com/mrshenli commit ff23e07b2eabc95ed2d08d6aebbaa242425bd8df Author: Andrew Gu Date: Thu Nov 3 16:26:46 2022 +0000 [FSDP][Docs] Simplify CPU offload docs (#88430) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88430 Approved by: https://github.com/mrshenli commit 4de50b25215b71517831b9766c4655d56ef7946e Author: Chien-Chin Huang Date: Thu Nov 3 19:30:05 2022 +0000 [FSDP] Allow to use TorchDispatch with FSDP (#88014) Add `_no_dispatch_record_stream` to disable TorchDispatch before calling `record_stream()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88014 Approved by: https://github.com/awgu commit 31ebd3cc2fb4a9025d3a17b90400ea83125dc17c Author: Huy Do Date: Thu Nov 3 23:15:39 2022 +0000 Reset NVIDIA devices stuck in failed mode (#88459) Try to reset the NVIDIA devices if they get stuck in failed mode per comment in https://github.com/pytorch/pytorch/issues/88388 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88459 Approved by: https://github.com/malfet commit ab8f3333ff02d7a6260e616f87ab4f8ed3e1db4b Author: Andrew Gu Date: Thu Nov 3 16:26:36 2022 +0000 [FSDP][Docs] Simplify `mixed_precision` ctor docs (#88429) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88429 Approved by: https://github.com/mrshenli commit 36582574f3bef05f0822bbd6982062342cfcdab8 Author: Animesh Jain Date: Thu Nov 3 22:56:05 2022 +0000 [dynamo] Skip mutation detection for inference mode (#88406) Skip the mutation detection for inference_mode, and raise a warning. This helps one internal model Related to https://github.com/pytorch/torchdynamo/issues/1768 @ezyang What do you think about this? The issue that Dynamo mutation detector uses version counter to detect mutation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88406 Approved by: https://github.com/ezyang commit 410ce96a23a3496a45478e0b25ffac53aa3c116f Author: PyTorch MergeBot Date: Thu Nov 3 21:57:19 2022 +0000 Revert "Put Python Dispatcher cache in dict, clear it on new registrations. (#88329)" This reverts commit 86c7cd287caeb23c227d97d283e58bc123294746. Reverted https://github.com/pytorch/pytorch/pull/88329 on behalf of https://github.com/clee2000 due to test_decomp takes an extra 2 hours in some jobs, windows takes so long it times out commit 9946041a3edbfa3a9db1c38aa0436f0d6f1a29db Author: samdow Date: Thu Nov 3 21:50:52 2022 +0000 [functorch] make hessian docs actually use hessian function (#88451) I was going through the hessian docs to find an example and noticed that these docs don't actually use the hessian function.... Pull Request resolved: https://github.com/pytorch/pytorch/pull/88451 Approved by: https://github.com/zou3519, https://github.com/Skylion007 commit ce961b34430b52d0591fea4e485ffcb4633c4e90 Author: Elias Ellison Date: Thu Nov 3 18:22:44 2022 +0000 Dont hold onto references of saved tensors in backward (#88247) This improves memory compression of resnet18 on inductor non-cudagraphs from .78 -> .0.84. 
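The memory-compression figures quoted here and for #88248 compare peak GPU memory between runs; a hedged sketch of how such a ratio could be measured (the `run_eager`/`run_compiled` callables are assumptions for illustration, not part of these PRs):
```python
import torch

def peak_mem_mb(fn):
    """Peak CUDA memory (MiB) used by one call to fn (illustrative helper)."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

# compression_ratio = peak_mem_mb(run_eager) / peak_mem_mb(run_compiled)
# (run_eager / run_compiled are hypothetical callables executing the same workload)
```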
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88247 Approved by: https://github.com/ezyang commit 65de9a2b8119e765abeb893e37ab49ea3276e41c Author: Sam Tsai Date: Thu Nov 3 20:32:54 2022 +0000 Fix fuse_func method overwrite (#87791) (#88193) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/87791 Fixing the interface so that the fuse_func is honored and not replaced but the default fuse_known_method. Test Plan: Wait for sandcastle Reviewed By: jerryzh168 Differential Revision: D40722395 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88193 Approved by: https://github.com/jerryzh168 commit 433746300dd8f5362329bbb208a61584febbba11 Author: Po-Wei Chou Date: Thu Nov 3 20:20:49 2022 +0000 [pytorch] Expose EmbeddingPackedParamsBase::unpack to Python (#88362) Summary: User can't call `.unpack()` when they have a quantized Embedding layer because `&EmbeddingPackedParamsBase::unpack` was never exposed to Python through pybind. This diff fixes that. Test Plan: CI Reviewed By: jerryzh168 Differential Revision: D40606585 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88362 Approved by: https://github.com/jerryzh168 commit 23a6e1532142a2858d5e5445b5bcd2e468e80a66 Author: Justin Chu Date: Thu Nov 3 20:18:33 2022 +0000 [ONNX] Remove the INT64_MAX magic numbers (#88341) Remove the magic numbers in symbolic opsets and use a INT64_MAX global instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88341 Approved by: https://github.com/BowenBao commit 6d7eee04b8943dd371465e4f909eba8474ce0292 Author: Andrew Gu Date: Thu Nov 3 16:26:26 2022 +0000 [FSDP] Default to `BACKWARD_PRE` (#88428) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88428 Approved by: https://github.com/mrshenli commit c28022d96c12210328d7bab6add026699cf8e9ee Author: Author Name Date: Thu Nov 3 20:08:16 2022 +0000 [profiler] Add an option initialize kineto profiler on start up (#87226) (#88020) Summary: Overall this patch enables initializing the kineto profiling library on start-up. This is guarded by an env variable that is described a bit more later. The kineto profiler is otherwise initialized lazily when pytorch profiler is invoked. We are enabling on-demand profiling capability for pytorch. As users run large distributed training flows this will enable one to capture a pytorch profiler/GPU trace remotely, from outside the process. The kineto library and a monitoring daemon - dynolog- interact to achieve this. Dynolog will be open sourced by end of October, and has been dogfooded on Meta AI Research cluster. https://github.com/facebookincubator/dynolog Kineto library registers itself with the dynolog daemon running on the host over inter process communication ``` | kineto | --> (ipcfabric) --> | dynolog | * register() * poll for on-demand tracing configs() ``` This feature is currently enabled by setting the env variable `KINETO_USE_DAEMON`. However, it only works if we initialize kineto, else the thread to talk to dynolog is not spun up. Related PRs in kineto include https://github.com/pytorch/kineto/pull/637 https://github.com/pytorch/kineto/pull/653 Build pytorch from source (need to set USE_LITE_INTERPRETER_PROFILER=OFF) Run a simple linear model [example](https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html). 
``` export KINETO_CONFIG=/private/home/bcoutinho//libkineto.conf export KINETO_USE_DAEMON=1 python3 /private/home/bcoutinho/linear_model.py ``` Output ``` INFO:2022-10-18 09:01:12 4169946:4169946 init.cpp:98] Registering daemon config loader cuda:0 ``` We can trigger a trace using the dynolog client tool ``` response length = 147 response = {"activityProfilersBusy":0,"activityProfilersTriggered":[4116844],"eventProfilersBusy":0,"eventProfilersTriggered":[],"processesMatched":[4116844]} Matched 1 processes Trace output files will be written to: /tmp/gpu_trace_test_4116844.json ``` ``` python3 ../../linear_model.py cuda:0 99 1425.056884765625 10099 8.817168235778809 ``` Currently the environment should guard users from picking this change up unless intended. The libkineto_init does setup CUPTI APIs and spins up a thread to read on-demand configurations. This should not be problematic, we can provide a more granular init in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87226 Reviewed By: chaekit Differential Revision: D40558184 Pulled By: briancoutinho fbshipit-source-id: afea7502b1d72201c00994c87fde63a35783f4d5 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/88020 Approved by: https://github.com/chaekit commit 826b4a9c2dd856c11ca7d73bc0d617758bff6a5a Author: Max Ren Date: Thu Nov 3 20:05:53 2022 +0000 [coreml] delegate multiple outputs (#88345) Summary: https://www.internalfb.com/code/fbsource/[c0e4da0b5c7fff3b4e31e4611033c30cabdc6aef]/fbcode/caffe2/torch/csrc/jit/backends/backend_detail.cpp?lines=268-276 seems like the torchscript addition of `$unpack, = self.__backend.execute( ... ` the comma after unpack forces the result of execute to have only one item. So for this fix now when the size of the outputs > 1, execute returns a List List of outputs (basically put the outputs in another list before putting it into the list we return) ``` [[output1, output2, output3, ...]] ``` instead of ``` [output1, output2, output3, ...] ``` Do we want to fix this in backend_detail? Or should we make the change in our delegate to accomadate the torchscript? Proposing this q here. Requesting cccclai, kimishpatel for approval here Test Plan: unblocked models for chengxiangyin and models in pytorch playground all passing unit tests Reviewed By: kimishpatel, cccclai Differential Revision: D40328684 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88345 Approved by: https://github.com/jmdetloff, https://github.com/Skylion007 commit 9533fe9031cc82c2f833ef066f8dd2d5d2d1eebf Author: Kimish Patel Date: Wed Nov 2 08:55:14 2022 -0700 [pytorch][vulkan] Add bias storage type to template (#88324) To enable buffer based use for bias as well, this diff adds storage type for bias to template Differential Revision: [D40689003](https://our.internmc.facebook.com/intern/diff/D40689003/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88324 Approved by: https://github.com/jmdetloff commit 893f8e3790df47b165025c9e6b5b37b85bdfd501 Author: Kimish Patel Date: Wed Nov 2 08:55:08 2022 -0700 [PyTorch][Vulkan] Add template based codegen for shader generation (#88323) We would like to be able to parameterize kernels such that a parameterized algorithm can be implemented via templates. We can then profile performance of a kernel with different parameter values. This enables us to determine what parameters may work the best for a given kernel or a given device. 
In this diff one such kernel added in 1x1 conv which parameters across size of the tile being produced by each invocation. Few other options for parameters can be: - One can imagine dtype can also be a parameter such that we can do compute in fp16 or int8/int16. - Register blocking for input channels Differential Revision: [D40280336](https://our.internmc.facebook.com/intern/diff/D40280336/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40280336/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/88323 Approved by: https://github.com/jmdetloff commit 60925fcb7e0662e5dc925b0cd5f79615e336cb4b Author: Elias Ellison Date: Thu Nov 3 03:37:23 2022 +0000 Dont clone inputs if using fake tensor (#88208) Not sure that this will really reduce memory use but it is an extraneous copy in our stack right now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88208 Approved by: https://github.com/anijain2305 commit 192e806c265f8c90ffb34ab3787be5c153e84972 Author: Kimish Patel Date: Wed Nov 2 08:55:02 2022 -0700 [Pytorch][vulkan] Generate shader with parameters (#88322) Parametsr such as tile size and weight type and format is embedded within the shader code. This is used to generate ShaderInfo. For now we will maintain both ShaderSrc and ShaderInfo so as to transition from VK_KERNEL to VK_SHADER incremental. Otherwise we will have to switch multiple of them at the same time. Differential Revision: [D40280338](https://our.internmc.facebook.com/intern/diff/D40280338/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88322 Approved by: https://github.com/jmdetloff, https://github.com/mcr229 commit fe3a226d74008f7ce846198530c75e2df232934f Author: kshitij12345 Date: Thu Nov 3 19:28:33 2022 +0000 [minor] use set_default_dtype instead of try and finally (#88295) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88295 Approved by: https://github.com/mruberry commit f8b73340c85ca29fb47bf5056246f6edd1ec261e Author: Animesh Jain Date: Thu Nov 3 19:07:03 2022 +0000 [dashboard] Replace aot_nvfuser with nvprims_nvfuser (#88437) @IvanYashchuk @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/88437 Approved by: https://github.com/soumith commit 2bda2baad787923b064c747e619e62a6af969940 Author: Yanbo Liang Date: Thu Nov 3 18:03:36 2022 +0000 [Dynamo][Easy] Fix config.suppress_errors error log (#88402) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/88402 Approved by: https://github.com/williamwen42 commit 4d62ee1b36f895d9a2987f02ae9c34c6424e0faf Author: Michael Lazos Date: Thu Nov 3 17:59:05 2022 +0000 Verbose exc printing fix (#88387) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/88387 Approved by: https://github.com/tugsbayasgalan commit 0a274c4b6c916363ce3e3f75b315ac66156f8ce6 Author: Justin Chu Date: Thu Nov 3 17:41:48 2022 +0000 [ONNX] Default runtime type checking to raising errors (#86555) Default runtime type checking to raise by changing the default value to `GLOBALS.runtime_type_check_state` into ERRORS Pull Request resolved: https://github.com/pytorch/pytorch/pull/86555 Approved by: https://github.com/BowenBao commit d70bc222d8581bc4256119d51c9344472f71fe95 Author: XiaobingSuper Date: Sun Oct 30 22:23:51 2022 -0400 add parameters check for mkldnn_transpose (#85318) This PR is about add parameters check for mkldnn_transpose, fixed 
https://github.com/pytorch/pytorch/issues/85216. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85318 Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/leslie-fang-intel commit c1dd13fb2fb623986508a4cf9f1fe4cc1656a52f Author: Animesh Jain Date: Thu Nov 3 17:05:50 2022 +0000 [dynamo] Support compare op for userfunctionvariable (#88372) Helps reduce graph breaks for one of the training models Pull Request resolved: https://github.com/pytorch/pytorch/pull/88372 Approved by: https://github.com/jansel commit 2c46d5725e3b89d8f83ed2ba940225fa57a7156f Author: Mergen Nachin Date: Wed Nov 2 14:13:20 2022 -0700 Disallow module attribute mutation (#88354) Summary: See https://github.com/pytorch/torchdynamo/issues/1475 Not allowing any new mutations happen inside forward() function during export. Test Plan: Run `python test/dynamo/test_export.py` and make sure it passes Added new unit tests (3 positive tests and 4 negative tests) Here's what the actual error looks like ``` File "/home/mnachin/local/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 322, in step getattr(self, inst.opname)(inst) File "/home/mnachin/local/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 835, in STORE_ATTR assert not self.export, f"Mutating module attribute {inst.argval} during export." AssertionError: Mutating module attribute a during export. from user code: File "/data/users/mnachin/pytorch/test/dynamo/test_export_mutations.py", line 25, in forward self.a = self.a.to(torch.float64) Set torch._dynamo.config.verbose=True for more information ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88354 Approved by: https://github.com/tugsbayasgalan, https://github.com/jansel commit 2b117c843628e8f73d8fbb471eb045cf6805fdc3 Author: PyTorch MergeBot Date: Thu Nov 3 16:53:01 2022 +0000 Revert "Fix primTorch compute_elementwise_output_strides (#88175)" This reverts commit 1c8a0656d65412b83d3c00f2fc66ab958e991de8. Reverted https://github.com/pytorch/pytorch/pull/88175 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks cuda 11.6 in trunk. As the PR signal was green, this is probably a landrace commit 0f6304ef1ebb089e03b251bb90f886ec1bfd6194 Author: Nikolay Korovaiko Date: Thu Nov 3 16:52:37 2022 +0000 disable the out variants in test_cumprod test for inductor (#88328) `out=` variants aren't supported by autograd and it's not a must fix, so disabling the test (https://github.com/pytorch/torchdynamo/issues/1798) for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88328 Approved by: https://github.com/desertfire commit 529ba076c6ac898d3d236ffc9f018d74cf888a18 Author: Nikolay Korovaiko Date: Thu Nov 3 16:21:15 2022 +0000 add an exclude for test_constructor for inductor (#88143) This test (https://github.com/pytorch/torchdynamo/issues/1800) fails since none of the c-tor ops support `pin_memory=True`. Natalia suggests it's not a priority to fix. 
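For context, the kind of factory call the excluded constructor test exercises is simply a tensor constructor with `pin_memory=True`; a hedged illustration (pinned host memory requires a CUDA-capable build):
```python
import torch

# Eager mode handles this; the inductor lowerings at the time did not.
t = torch.ones(4, pin_memory=True)
print(t.is_pinned())  # True on a CUDA-enabled build
```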
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88143 Approved by: https://github.com/desertfire commit 002dad35f4cef9ee468e0b67e8765355be3e0689 Author: Nikolay Korovaiko Date: Thu Nov 3 16:20:14 2022 +0000 better error message for out= ops (#88367) In cases where a tensor kwarg is actually "out=", the following error message would look nicer than this : ``` Traceback (most recent call last): File "/fsx/users/binbao/pytorch/torch/_inductor/graph.py", line 241, in call_function out = lowerings[target](*args, **kwargs) File "/fsx/users/binbao/pytorch/torch/_inductor/lowering.py", line 168, in wrapped assert not any(isinstance(x, TensorBox) for x in kwargs.values()) AssertionError ``` https://github.com/pytorch/torchdynamo/issues/1798 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88367 Approved by: https://github.com/desertfire commit b4fcfe77b22257072234f5e0d76baeb6a7404427 Author: Natalia Gimelshein Date: Thu Nov 3 15:58:18 2022 +0000 reduce the number of autotuning iterations, don't autotune simple til… (#88386) …ed copies Partially fixes https://github.com/pytorch/torchdynamo/issues/1807, reduces compile time for me from 360 s to 90s. Kernels with multiple outputs sometimes autotune to unexpected configs, so I'm limiting the heuristic to relatively safe application. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88386 Approved by: https://github.com/jansel commit 5e6ceebccbafa6febf8c3fa8abc058f311319015 Author: Christian Puhrsch Date: Thu Nov 3 15:15:57 2022 +0000 Add support for neg to NestedTensor (#88131) Partially fixes #86889 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88131 Approved by: https://github.com/drisspg commit 35be73df094f02dd26562cf665a6158e80bc4045 Author: Andrew Gu Date: Wed Nov 2 18:06:05 2022 +0000 [FSDP()][Easy] Make `fully_shard()` only `FULL_SHARD` (#88260) We can have a separate API for each of the other sharding strategies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88260 Approved by: https://github.com/mrshenli commit fc743ec0595a03dd755bf44fd36d70f02e97dd25 Author: Andrew Gu Date: Wed Nov 2 18:06:05 2022 +0000 [FSDP()] Have `fully_shard()` abide by `@contract`! (#88235) We are making some progress on composability :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88235 Approved by: https://github.com/mrshenli commit 63cd5d7e2743bbbe86cc333adc6bc834228daef3 Author: Bin Bao Date: Wed Nov 2 15:10:37 2022 +0000 Add a shortcut in Makefile for updating triton (#88318) Summary: Local triton installation needs to be updated after we migrate to a newer version of triton, e.g. https://github.com/pytorch/pytorch/pull/88242. The Makefile shortcut makes that easier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88318 Approved by: https://github.com/ezyang commit f884e817d448228cb8b0685f774ede1d8207ff72 Author: Edward Z. Yang Date: Wed Nov 2 19:08:07 2022 -0700 Make Python op registration work with torchdeploy/multipy (#87162) See strategy at PythonOpRegistrationTrampoline.cpp for the big picture. Along the way, I made OperatorHandle support == and hashing, and slightly changed the low level python_dispatch impl API to disallow empty strings for dispatch key, which had the knock on effect of requiring us to explicitly make sure we pass in CompositeImplicitAutograd if we would have passed in "" (I didn't apply this to the rest of the file because I'm lazy.) 
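As a hedged illustration of what "pass in CompositeImplicitAutograd instead of an empty string" looks like at the Python registration level (the library and op names below are made up for the example, not taken from this PR):
```python
import torch
from torch.library import Library

my_lib = Library("my_example_ops", "DEF")            # hypothetical namespace
my_lib.define("double(Tensor x) -> Tensor")

def double_impl(x):
    return x * 2

# An explicit dispatch key where an empty string might previously have been passed.
my_lib.impl("double", double_impl, "CompositeImplicitAutograd")

print(torch.ops.my_example_ops.double(torch.ones(3)))
```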
Test strategy is we delete the logic for preventing Python op registrations in torch from being skipped in a torchdeploy context and show CI still works. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87162 Approved by: https://github.com/anjali411, https://github.com/bdhirsh commit 2f296cfdbb8063297a37cd54ba1ccf44022faa70 Author: Edward Z. Yang Date: Wed Nov 2 20:44:18 2022 -0700 Add a reshape_copy operator. (#88314) The semantics is "as if" you did a reshape, but it always copied even if the input was directly view'able. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88314 Approved by: https://github.com/albanD commit 86c7cd287caeb23c227d97d283e58bc123294746 Author: Edward Z. Yang Date: Wed Nov 2 20:44:17 2022 -0700 Put Python Dispatcher cache in dict, clear it on new registrations. (#88329) The motivation is that I am going to add the ability to temporarily install entries to the python dispatcher, and to do that, I need an easier way to clear the cache. Putting the cache in a dict centralizes cache clearing in one place. I then add some easy cache clearing. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88329 Approved by: https://github.com/albanD commit 97d3b200ca49b9434dd9e5de979c9d23a866a38e Author: Edward Z. Yang Date: Wed Nov 2 18:55:33 2022 -0700 Unconditionally enable python dispatcher in AOTAutograd (#88365) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88365 Approved by: https://github.com/Chillee commit a689502275529f78ba4a88c2e62ab897a96a040a Author: Andrew Gu Date: Wed Nov 2 20:34:41 2022 +0000 [FSDP] Do not include empty state in `_flatten_optim_state_dict()` (#88353) https://github.com/pytorch/pytorch/blob/983c0e7f3101f1543bed6c4ec1539a4d590a94c0/torch/optim/adam.py#L163 The above line requires that a candidate optimizer state dict being loaded via `load_state_dict()` has non-empty state for its 0th parameter (via `state_values[0]`). This PR changes FSDP to only include non-empty mappings in the state returned by `_flatten_optim_state_dict()`, which is the subroutine for both `shard_full_optim_state_dict()` and `flatten_sharded_optim_state_dict()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88353 Approved by: https://github.com/fegin commit 95a9721a15cb7be77b221ba5778d456880eaad20 Author: Andrew Gu Date: Wed Nov 2 18:06:04 2022 +0000 [FSDP()][Easy] Rename `_State` to `_FSDPState` (#88234) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88234 Approved by: https://github.com/mrshenli commit 0520131ed6ee7965e7b53b291974b589220cdf3a Author: Andrew Gu Date: Wed Nov 2 18:06:04 2022 +0000 [FSDP()] Rename to `fully_shard()` and move to `_composable/` (#88233) After internal discussion, we are currently preferring `fully_shard()` as the name of the composable FSDP API. - `FullyShardedDataParallel` (FSDP) has existing brand value, so the chosen name should try to preserve that. We think this takes precedence over the fact that composable FSDP may encompass than just the ZeRO-3 approach of _fully sharding_. - Given the refactoring efforts, it would also not be challenging to create a new frontend API like `hybrid_shard()` that calls into the same underlying initialization and runtime except for a different `ShardingStrategy`. In other words, we do not have to coalesce all sharding strategies under `fully_shard()`. 
- The other composable APIs are verbs (`replicate()`, `checkpoint()`), so the chosen name should be a verb. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88233 Approved by: https://github.com/mrshenli commit 54b6188cc6dee45b775d688223b847dc8ea85bff Author: Kshiteej K Date: Thu Nov 3 09:57:47 2022 +0000 [fix] allow saving python attr on Tensor and Parameter via torch.save (#81616) Fixes: https://github.com/pytorch/pytorch/issues/72129 TODO: * [x] Fix for Parameter Benchmark (Measurable diff for small tensors)
```
[-------------- Save and Load --------------]
                   |  After PR  |  Before PR
1 threads: ----------------------------------
      ()           |    111.7   |    106.9
      (4, 4)       |    114.4   |    109.2
      (128, 128)   |    135.2   |    128.3
      (1024, 1024) |   1431.9   |   1431.3

Times are in microseconds (us).
```
Benchmark Script
```python
import torch
from torch.testing._internal.common_utils import BytesIOContext
from torch.utils import benchmark
import pickle

shapes = ((), (4, 4), (128, 128), (1024, 1024))
sizes = [1, 64, 1024, 10000]

results = []

def save_load_fn(t):
    with BytesIOContext() as f:
        torch.save(t, f)
        f.seek(0)
        torch.load(f)

for shape in shapes:
    t = torch.randn(shape)
    label = 'Save and Load'
    sub_label = f'{shape}'
    results.append(benchmark.Timer(
        stmt='save_load_fn(t)',
        globals={'t': t, 'save_load_fn': save_load_fn},
        label=label,
        sub_label=sub_label,
        description='Before PR',
    ).blocked_autorange(min_run_time=2))

compare = benchmark.Compare(results)
compare.print()

with open('before_pr.pkl', 'wb') as f:
    pickle.dump(results, f)
```
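And, assuming the behavior this PR adds, a minimal round-trip illustrating the python attribute that now survives serialization (the attribute name is arbitrary, chosen for the example):
```python
import io
import torch

t = torch.randn(2, 2)
t.note = "saved with the tensor"   # plain python attribute on a Tensor

buf = io.BytesIO()
torch.save(t, buf)
buf.seek(0)
loaded = torch.load(buf)
print(loaded.note)                  # "saved with the tensor" once this PR is in place
```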
NOTE : **BC-Breaking** : After this PR, all tensors (also regular tensors) will be serialised using `_rebuild_from_type_v2`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81616 Approved by: https://github.com/albanD, https://github.com/kurtamohler commit 1c8a0656d65412b83d3c00f2fc66ab958e991de8 Author: Sherlock Huang Date: Thu Nov 3 06:02:37 2022 +0000 Fix primTorch compute_elementwise_output_strides (#88175) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88175 Approved by: https://github.com/ngimel commit 0efd4e92b5b8e29b083a91093c803e62c3507cf7 Author: Wonjoo Lee Date: Thu Nov 3 06:19:40 2022 +0000 Make GenLazyNativeFuncDefinition generator to be customizable in lazy codegen (#87823) As part of the ongoing LTC migration effort, PyTorch/XLA is updating its codegen to use `xla::Shape` instead of `torch::lazy::Shape`. To achieve this, this PR updates the codegen to make the `GenLazyNativeFuncDefinition` generator customizable. The existing `GenLazyNativeFuncDefinition` is kept by using the initial default values, so this change should not introduce any new behaviors to the existing codegen in PyTorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87823 Approved by: https://github.com/alanwaketan, https://github.com/wconstab commit a8f40b39ce4f9fa9ffd90400b7d10ea4051d623a Author: Thiago Crepaldi Date: Thu Nov 3 03:01:33 2022 +0000 Update all ONNX symbolics with new JitScalarType API (#87245) Fixes https://github.com/pytorch/pytorch/issues/84365 and more This PR addresses not only the issue above, but the entire family of issues related to `torch._C.Value.type()` parsing when `scalarType()` or `dtype()` is not available. This issue exists before `JitScalarType` was introduced, but the new implementation refactored the bug in because the new api `from_name` and `from_dtype` requires parsing `torch._C.Value.type()` to get proper inputs, which is exactly the root cause for this family of bugs. Therefore `from_name` and `from_dtype` must be called when the implementor knows the `name` and `dtype` without parsing a `torch._C.Value`. To handle the corner cases hidden within `torch._C.Value`, a new `from_value` API was introduced and it should be used in favor of the former ones for most cases. The new API is safer and doesn't require type parsing from user, triggering JIT asserts in the core of pytorch. Although CI is passing for all tests, please review carefully all symbolics/helpers refactoring to make sure the meaning/intetion of the old call are not changed in the new call Pull Request resolved: https://github.com/pytorch/pytorch/pull/87245 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit b013825c7d104dca2c6c11cd985453d8520577f7 Author: PyTorch MergeBot Date: Thu Nov 3 02:57:24 2022 +0000 [vision hash update] update the pinned vision hash (#88382) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88382 Approved by: https://github.com/pytorchbot commit 5fb9c113aee76dd0465a6ee7067eeb018929b922 Author: Aaron Gokaslan Date: Thu Nov 3 02:53:26 2022 +0000 Update pybind11 to v2.10.1 (#88332) I am one of the maintainers of pybind11, and a frequent PyTorch user. We added quite a lot of bugfixes and performance improvements in 2.10.1 (see the changelog for full details) and I wanted to upstream them to PyTorch. 
Our releases is tested throughout Google's codebase including on their global builds of PyTorch so there should be no surprises. The main new feature is optin in Eigen Tensor to Numpy casters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88332 Approved by: https://github.com/soumith commit e59d307e2f1d3be0395838acbd03085f2285c0eb Author: Richard Barnes Date: Thu Nov 3 02:48:41 2022 +0000 Improve perf by avoiding implicit string creation in c10_cuda_check_implementation (#88350) Test Plan: Sandcastle Differential Revision: D40949947 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88350 Approved by: https://github.com/Skylion007, https://github.com/soumith commit a0fb234b4523e06d3e4bd1f06fb421bcd09c8939 Author: Jerry Zhang Date: Wed Nov 2 15:42:08 2022 -0700 [codegen] using TORCH_LIBRARY_FRAGMENT for some namespaces (#88229) Summary: Sometimes we want to extend an existing custom namespace library, instead of creating a new one, but we don't have a namespace config right now, so we hardcode some custom libraries defined in pytorch today, i.e. quantized and quantized_decomposed Test Plan: ci Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/88229 Approved by: https://github.com/ezyang commit 7b8cc063acab584e6fe0f0a82ff246fab6691205 Author: Huy Do Date: Thu Nov 3 02:15:07 2022 +0000 Not run inductor test in trunk (#88374) Trying to not run in inductor tests in trunk at the moment because of CUDA issue with G5 runner: * CUDA GPU not found https://github.com/pytorch/pytorch/actions/runs/3379516207/jobs/5611539300 * NVIDIA driver installation fails https://github.com/pytorch/pytorch/actions/runs/3379922198/jobs/5612458360 * Docker fails to start https://github.com/pytorch/pytorch/actions/runs/3381276196/jobs/5615513348 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88374 Approved by: https://github.com/desertfire commit d979caa87c7810ea68845b86696d883452da9b8f Author: Mikayla Gawarecki Date: Wed Nov 2 20:28:39 2022 +0000 Added add/mul for nested dense [B, *, D], [B, 1, D] case (CUDA-only) (#88289) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88289 Approved by: https://github.com/cpuhrsch commit 4c20c0509d5cf8d4dea83cc330056044a6277b1b Author: soulitzer Date: Wed Nov 2 13:52:15 2022 -0400 Split out forward AD tests from test_ops_gradients and reenable slow gradcheck CI (#88216) Fixes: https://github.com/pytorch/pytorch/issues/88010 This PR does a couple things to stop slow gradcheck from timing out: - Splits out test_ops_fwd_gradients from test_ops_gradients, and factors out TestFwdGradients and TestBwdGradients which both inherit from TestGradients, now situated in common_utils (maybe there is a better place?) - Skips CompositeCompliance (and several other test files) for slow gradcheck CI since they do not use gradcheck - because test times for test_ops_fwd_gradients and test_ops_gradients are either unknown or wrong, we hardcode them for now to prevent them from being put together. We can undo the hack after we see actual test times are updated. ("def calculate_shards" randomly divides tests with unknown test times in a round-robin fashion.) - Updates references to test_ops_gradients and TestGradients - Test files that are skipped for slow gradcheck CI are now centrally located in in run_tests.py, this reduces how fine-grained we can be with the skips, so for some skips (one so far) we still use the old skipping mechanism, e.g. 
for test_mps Pull Request resolved: https://github.com/pytorch/pytorch/pull/88216 Approved by: https://github.com/albanD commit a8561c4571fe668d35e24c8f61bd296e23db807c Author: PyTorch MergeBot Date: Wed Nov 2 23:33:15 2022 +0000 Revert "[inductor] Handle the case where kwargs contains tensor (#88215)" This reverts commit 983c0e7f3101f1543bed6c4ec1539a4d590a94c0. Reverted https://github.com/pytorch/pytorch/pull/88215 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but I think it breaks trunk https://github.com/pytorch/pytorch/actions/runs/3380662072/jobs/5613987333 with a failure in test_torchinductor_opinfo.py commit 7354368fd5a8dec5c9fc26dddf5f7da37f1d2499 Author: Jiewen Tan Date: Wed Nov 2 23:31:26 2022 +0000 [LTC] Remove non-native view ops (#88031) Summary: LTC somehow implements a bunch of non-native view ops during the transition to functionalization. Let's remove them now that functionalization is final. Test Plan: CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88031 Approved by: https://github.com/JackCaoG, https://github.com/antoniojkim commit 72f3688029d0bfdd5f2926c8efeb9451135ae6da Author: Kimish Patel Date: Wed Nov 2 08:54:54 2022 -0700 [Pytorch][Vulkan] Update spv generation script to embed shader parameters (#88321) This diffs adds shader parameters such as tile size, weight storage type and format to the generated spv.cpp file. This is used in ShaderInfo struct that ops such as convolution will use to determine, the workgroup size and how to pack weights. Differential Revision: [D40280337](https://our.internmc.facebook.com/intern/diff/D40280337/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88321 Approved by: https://github.com/jmdetloff, https://github.com/mcr229 commit 6c858e37271472b2255e3358be97fd135a9fbe59 Author: Andrew Gu Date: Wed Nov 2 11:38:11 2022 +0000 [FSDP][Easy] Remove unneeded `TrainingState` transition (#88232) Follow-up from previous PR in the stack Pull Request resolved: https://github.com/pytorch/pytorch/pull/88232 Approved by: https://github.com/mrshenli commit 73de44fc561a202aba9d849fb8ada5adad030077 Author: Andrew Gu Date: Wed Nov 2 11:38:10 2022 +0000 [FSDP] Rename `unflat_param_name` -> `fqn` for consistency (#88123) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88123 Approved by: https://github.com/mrshenli commit f35d5145a1cc34c6e6dc3680e408344806aefbac Author: Andrew Gu Date: Wed Nov 2 11:38:10 2022 +0000 [FSDP] Simplify `_get_buffer_names()` (#88122) This is a follow-up from a previous PR in this stack. The PR simplifies the `_get_buffer_names()` implementation. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88122 Approved by: https://github.com/mrshenli commit 572a3d2d6efd5493df8cb43c7da98bcf0bf20129 Author: Andrew Gu Date: Wed Nov 2 11:38:10 2022 +0000 [FSDP] Remove unneeded `torch.no_grad()` context when offloading to CPU (#88121) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88121 Approved by: https://github.com/mrshenli commit c87f0501ab847ea900aff61be54bb67c3a27a4fe Author: Andrew Gu Date: Wed Nov 2 11:38:09 2022 +0000 [FSDP][Docs] Add note mentioning rate limiter for backward prefetch (#88120) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88120 Approved by: https://github.com/mrshenli commit 32d22edc676c176b8b247f66134b3b8913724818 Author: Andrew Gu Date: Wed Nov 2 11:38:09 2022 +0000 [FSDP()][27/N] Add forward hook registration (#88040) This PR adds the forward hook registration to composable FSDP and adds a unit test for the runtime. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88040 Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma commit 6fd416650ab2b5bd9be046b9ad8cccaf016e6538 Author: Christian Puhrsch Date: Wed Nov 2 23:24:33 2022 +0000 Add _foreach_addc(div/mul)(_).Tensor (#88157) Support passing value scalars as a flat 1D Tensor. Currently we can only pass either an individual scalar or a ScalarList. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88157 Approved by: https://github.com/ngimel, https://github.com/albanD commit 91a51fe9f487082e3f71055d3af41df2fb1bf88b Author: Henry Cheng <39224097+jazzysoggy@users.noreply.github.com> Date: Wed Nov 2 23:07:45 2022 +0000 [ONNX] Produce comprehensive assertion errors for quantized outputs (#87242) Fixes #83038 Currently _compare_ort_pytorch_outputs does not produce clearer error messages for differences in the zero point or scale of the two outputs. It also does not produce a clear error message for whether both are quantized. This pull request adds assertions to output whether the scales and zero points have differences, and whether each individual output is quantized. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87242 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit ca2dc8b4e7ee36c3d305642490f482505ff6ad37 Author: Charlie Yan Date: Wed Nov 2 23:02:08 2022 +0000 [1/n] Thread PG: fix pyre error of class ProcessGroup (#88281) Summary: Fix the typing stub of `ProcessGroup` in "torch/distributed/__init__.py", so that it won't confuse pyre, and we can remove a lot of pyre suppression comments. Test Plan: pyre check Differential Revision: D40921667 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88281 Approved by: https://github.com/wanchaol commit d1ba4c3a6d7a5007665419c57988b06d5b87e96e Author: Jiong Gong Date: Wed Nov 2 22:57:07 2022 +0000 Update Reviewers for CPU-related Modules (#87591) This PR updates the reviewers responsible for CPU related modules: "IDEEP", "oneDNN graph", "CPU ATen backend", "CPU frontend" and "Autocast". It also adds "NNC" and adds the corresponding reviewers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87591 Approved by: https://github.com/malfet commit b325c3fc25937f5fb9ba2fb1d3768cbfbefea6c6 Author: jjsjann123 Date: Wed Nov 2 22:47:30 2022 +0000 [nvFuser] patches profiling on scalar arguments for std/var (#88165) Fixes #86531 Added profiling on scalar values for aten::std & aten::var. 
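A hedged sketch of the kind of scripted code where the scalar argument to `std`/`var` comes into play (not the actual reproducer from #86531):
```python
import torch

@torch.jit.script
def scaled_std(x: torch.Tensor, unbiased: bool) -> torch.Tensor:
    # `unbiased` is the scalar argument whose value profiling now records for aten::std
    return x.std(unbiased) * 2.0

print(scaled_std(torch.randn(8), True))
```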
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88165 Approved by: https://github.com/kevinstephano commit bf7c996dcb2fac229abba7ba2e0bdb379ceb2ff2 Author: PyTorch MergeBot Date: Wed Nov 2 22:35:14 2022 +0000 Revert "torchdynamo support modules() for nn_module (#88023)" This reverts commit eb91e8a534f94127a6d744543f2080a44bca9e57. Reverted https://github.com/pytorch/pytorch/pull/88023 on behalf of https://github.com/mehtanirav due to [Internal breakages](https://www.internalfb.com/intern/sandcastle/job/13510799692855066/insights) commit 7dfa75546c998f384ed5210ba9ac87c591cb36a4 Author: Huy Do Date: Wed Nov 2 21:59:54 2022 +0000 Print only the driver version from the first GPU (#88364) For example, distributed test has more than one of them: ``` nvidia-smi --query-gpu=driver_version --format=csv,noheader 515.57 515.57 ``` while `--id=0` correctly prints: ``` nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0 515.57 ``` This is to avoid re-install the same driver as in https://github.com/pytorch/pytorch/actions/runs/3380662072/jobs/5613981088 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88364 Approved by: https://github.com/seemethere, https://github.com/ZainRizvi commit 943b20e7ae290d8e71f877eb700f197a9df56cbe Author: Christian Puhrsch Date: Wed Nov 2 21:51:40 2022 +0000 Use tensor cores for NT bmm (#86856) Copy of internal diff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86856 Approved by: https://github.com/drisspg commit 1c0d47cb17806fae3f368061f594997d87d7fd8d Author: Scott Wolchok Date: Mon Oct 31 16:17:19 2022 -0700 [PyTorch] Make c10::irange(x) generate the same assembly as for loop (#86841) `c10::irange(n)` generated an extra `sar` and `andn` instruction compared to a traditional `for` loop. now it doesn't. Differential Revision: [D40321009](https://our.internmc.facebook.com/intern/diff/D40321009/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86841 Approved by: https://github.com/r-barnes, https://github.com/malfet commit ef4ce6d4c6ce1bd5ec26d7f6f71f1c053da46945 Author: Richard Zou Date: Wed Nov 2 10:25:49 2022 -0700 Add [[noreturn]] attribute to operator() in DispatchKeyExtractor.h (#88333) Originally D40537408. Submitting this through the diff train workflow to get it merged faster. Test Plan: - Build PyTorch Pull Request resolved: https://github.com/pytorch/pytorch/pull/88333 Approved by: https://github.com/ezyang commit 983c0e7f3101f1543bed6c4ec1539a4d590a94c0 Author: Bin Bao Date: Wed Nov 2 01:23:57 2022 +0000 [inductor] Handle the case where kwargs contains tensor (#88215) Summary: Fix https://github.com/pytorch/torchdynamo/issues/1805; currently inductor does not allow any tensor in kwargs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88215 Approved by: https://github.com/ngimel commit 98f09c9ab3d0796ffd68c06daa408f0400835173 Author: albanD Date: Wed Nov 2 19:41:09 2022 +0000 [WIP] Add symnode magic method testing (#88119) There are failures that need to be addressed before landing: - Some issue with handling of booleans. - Most functions return wrong result when mixing int/float Pull Request resolved: https://github.com/pytorch/pytorch/pull/88119 Approved by: https://github.com/ezyang commit 99c07735e457a2961f2319b4ba19f0d04eb47967 Author: PyTorch MergeBot Date: Wed Nov 2 18:43:36 2022 +0000 Revert "Add support for neg to NestedTensor (#88131)" This reverts commit 6a75a0d1a197e378ebbf1f73f5ab93ce79cb873a. 
Reverted https://github.com/pytorch/pytorch/pull/88131 on behalf of https://github.com/mehtanirav due to [Internal breakages](https://www.internalfb.com/intern/sandcastle/job/13510799692239080/insights) commit 0fa23663ccd5350469c95615ddb7d2fd2a88abe3 Author: PyTorch MergeBot Date: Wed Nov 2 18:13:37 2022 +0000 Revert "Introduce TORCH_DISABLE_GPU_ASSERTS (#84190)" This reverts commit 1e2c4a6e0e60dda763b53f00f25ee5c1f1e5233d. Reverted https://github.com/pytorch/pytorch/pull/84190 on behalf of https://github.com/malfet due to Needs internal changes, has to be landed via co-dev commit 4a84d69f5098d04131d94f15cad92a46ea70b198 Author: Zachary DeVito Date: Tue Nov 1 11:35:23 2022 -0700 [functorch.dims] Fix corner cases with permute (#88226) Previously the permute function was extended to behave like the `order` function for first-class dimensions. However, unlike `permute`, `order` doesn't have a keyword argment `dims`, and there is no way to add it in a way that makes both permute an order to continue to have the same behavior. So this change just removes the extra functionality of permute, which wasn't documented anyway. Fixes #88187 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88226 Approved by: https://github.com/zou3519 commit 84a302e53401f74c88495928154637af49e06fb2 Author: soulitzer Date: Wed Nov 2 11:03:04 2022 -0400 Remove wrong internal assert in handle_view_on_rebase (#88243) Fixes: https://github.com/pytorch/pytorch/issues/88205 The `CreationMeta::NO_GRAD_MODE` path in handle_view_on_rebase wrongly assumes that the tensor would be a leaf, because tensors created in no_grad are always leaf tensors. However, due to creation_meta propagation, a view of a view created in no_grad also has `CreationMeta::NO_GRAD_MODE`, but DOES have grad_fn. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88243 Approved by: https://github.com/albanD commit 30dc6cee3aaa3fd30883f2953beaa3374ad0aab2 Author: Andrew Gu Date: Wed Nov 2 11:38:09 2022 +0000 [FSDP()][26/N] Move `_lazy_init()` into `_fsdp_root_pre_forward()` (#87941) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87941 Approved by: https://github.com/mrshenli commit 1e2c4a6e0e60dda763b53f00f25ee5c1f1e5233d Author: Pruthvi Madugundu Date: Wed Nov 2 17:41:57 2022 +0000 Introduce TORCH_DISABLE_GPU_ASSERTS (#84190) - Asserts for CUDA are enabled by default - Disabled for ROCm by default by setting `TORCH_DISABLE_GPU_ASSERTS` to `ON` - Can be enabled for ROCm by setting above variable to`OFF` during build or can be forcefully enabled by setting `ROCM_FORCE_ENABLE_GPU_ASSERTS:BOOL=ON` This is follow up changes as per comment in PR #81790, comment [link](https://github.com/pytorch/pytorch/pull/81790#issuecomment-1215929021) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84190 Approved by: https://github.com/jeffdaily, https://github.com/malfet commit b18d0f1dc9757be4ca58059ece28ac4e60bf6f0c Author: Huy Do Date: Wed Nov 2 17:39:04 2022 +0000 Add more debug information when installing NVIDIA driver (#88168) This calls `lspci`, `lsmod`, and `modinfo nvidia` before and after the installation to gather more data about the "No GPU available" transient issue on G5 runner, i.e. https://hud.pytorch.org/pytorch/pytorch/commit/59fe272c1e698989228af5ad197bdd2985e4e9b9 This also handles `nvidia-smi` call and tries to re-install the driver if the first call fails, i.e. 
`No devices were found` https://hud.pytorch.org/pytorch/pytorch/commit/8ea19c802e38c061e79176360c1ecaa81ce2088a Pull Request resolved: https://github.com/pytorch/pytorch/pull/88168 Approved by: https://github.com/clee2000, https://github.com/malfet commit 923a5e96850014c84e76244874f39d9cdd186a0b Author: Michael Suo Date: Tue Nov 1 14:44:17 2022 -0700 [dynamo] Error when user nests FX with dynamo (#87797) Today, this doesn't work and dynamo errors out in a very non-obvious way (see: https://gist.github.com/suo/dde04830372ab51a4a34ea760f14200a). Here, we detect the error early and exit with a nicer msg. Also add a config option to just no-op dynamo (which need to unblock internal enablement). Pull Request resolved: https://github.com/pytorch/pytorch/pull/87797 Approved by: https://github.com/yf225, https://github.com/soumith, https://github.com/jansel commit c503398828170549801660eba80fe04f07c5bd42 Author: Huy Do Date: Wed Nov 2 17:27:30 2022 +0000 Ignore macos usage log upload artifact failure (#88288) I'm not quite sure why GitHub starts to get flaky when we are trying to upload usage_log.txt to it (500 Internal server error). But we can live without it, so let's just ignore this for now, and follow up on this latter. The failures all come from M1 runner, so it seems to point to a connectivity issue between AWS and GitHub: * https://github.com/pytorch/pytorch/actions/runs/3373976793/jobs/5599310905 * https://github.com/pytorch/pytorch/actions/runs/3372858660/jobs/5597033598 * https://github.com/pytorch/pytorch/actions/runs/3371548201/jobs/5594274444 * https://github.com/pytorch/pytorch/actions/runs/3370877990/jobs/5592709210 * https://github.com/pytorch/pytorch/actions/runs/3370609384/jobs/5592008430 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88288 Approved by: https://github.com/clee2000 commit 5b882a34c42534a2a994da2e3504abae0a730126 Author: Huy Do Date: Wed Nov 2 17:21:59 2022 +0000 Consolidate macos pip dependencies (#88071) After conda, consolidating all macos pip dependencies to cache every dependencies that macos CI needs. Two small issues are found along the way in `_mac-test-mps` workflow: * It didn't have `Install macOS homebrew dependencies` to install libomp like the regular `_mac-test` workflow * It didn't install `scipy`, thus silently skipping some `signal.windows` tests Both are fixed in this PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/88071 Approved by: https://github.com/malfet commit f132c171ac542c8abe8f6bf54befd9f2e14ad9b6 Author: Andrew Gu Date: Wed Nov 2 11:38:08 2022 +0000 [FSDP()][25/N] Add `_post_forward_reshard()` (#87940) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87940 Approved by: https://github.com/mrshenli commit 5b75b19f51837e162cc0e5e5757dfd9bef437c67 Author: PyTorch MergeBot Date: Wed Nov 2 16:59:00 2022 +0000 Revert "Do not use unsafe restriding for subclasses (#87610)" This reverts commit 73379acaf3865379aed0a1bab1320616772152f3. 
Reverted https://github.com/pytorch/pytorch/pull/87610 on behalf of https://github.com/mehtanirav due to [Internal breakages](https://www.internalfb.com/intern/sandcastle/job/36028797828925790/insights) commit c00c34fb6939384c53cd9125de8e158f9276ee36 Author: Sherlock Huang Date: Wed Nov 2 01:32:09 2022 +0000 Fix meta for aten.upsample_bilinear2d.vec (#88158) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88158 Approved by: https://github.com/ngimel commit 71fb763e5452881cb3be8fefa9419b785d0a61e2 Author: PyTorch MergeBot Date: Wed Nov 2 16:54:36 2022 +0000 Revert "fix as_strided_scatter_backward (#87646)" This reverts commit f9d7985851f49c3b44383dae50cd77632e7e2245. Reverted https://github.com/pytorch/pytorch/pull/87646 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but I think this one or one of the PR in the stack break bionic-cuda11.7 on trunk https://hud.pytorch.org/pytorch/pytorch/commit/70782981f06a042796d4604df2ec1491f4f5b194 commit bf2819a836b2dac0448305be9447df0846b846b9 Author: Andrew Gu Date: Wed Nov 2 11:38:07 2022 +0000 [FSDP()][24/N] Refactor `_lazy_init()` (#87939) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87939 Approved by: https://github.com/zhaojuanmao commit bd5b4e6504bf487c313d0b85100242898ad85c8d Author: Rohan Varma Date: Wed Nov 2 16:31:16 2022 +0000 [Easy] Unused var in functional_adam (#88292) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/88292 Approved by: https://github.com/awgu commit 7382c88df2889bf58ef62fe52ed3e1361e384811 Author: Nikita Shulga Date: Wed Nov 2 16:27:40 2022 +0000 [BE][MPS] Do not use malloc/free in 2022 (#88307) Use `std::vector` to store tensor shapes and automatically free them when array goes out of scope Pull Request resolved: https://github.com/pytorch/pytorch/pull/88307 Approved by: https://github.com/kulinseth commit 4e6f5f22fd7585eb629cd884f10b4a016f6c8266 Author: Nikita Shulga Date: Wed Nov 2 16:26:11 2022 +0000 Run asan's shard 4 on `linux.4xlarge` (#88310) In attempt to mitigate OOMs, see https://github.com/pytorch/pytorch/issues/88309 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88310 Approved by: https://github.com/albanD commit 3d90788a58badb454d15601868a396853ce94ddb Author: AllenTiTaiWang Date: Tue Nov 1 17:20:37 2022 +0000 [ONNX] Add 0d-tensor test case in runtime check (#87212) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87212 Approved by: https://github.com/BowenBao commit 2aed6707100dc685b83c4b9575d9eb07f1c6fa3e Author: Thiago Crepaldi Date: Wed Nov 2 15:54:40 2022 +0000 Fix ONNX operator_export_type on the new registry (#87735) Fixes #87313 Our ONNX pipelines do not run with BUILD_CAFFE2=0, so tests for operator_export_type ONNX_ATEN and ONNX_ATEN_FALLBACK will not be fully tested, allowing regressions to happen again. We need to run the same set of tests for both BUILD_CAFFE2=0 and 1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87735 Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao commit b2679dc61cf792e922ba56b1ccb75982e6c20553 Author: Edward Z. Yang Date: Wed Nov 2 10:43:35 2022 -0400 Remove Krovatkin from dynamic shapes auto request review (#88315) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88315 Approved by: https://github.com/soumith commit dcbcf5b90e56dfb30d4f87d607f3f4b361f52077 Author: Digant Desai Date: Tue Nov 1 21:39:12 2022 -0700 [profiler] Expose experimental performance events to python (#87905) Reports total counts (includes time spent in all children), self counts can be calculated manully. Differential Revision: [D40282770](https://our.internmc.facebook.com/intern/diff/D40282770/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87905 Approved by: https://github.com/SS-JIA commit 47a542dc0601243e51231cd0d0a28a7ef0c89b2b Author: Digant Desai Date: Tue Nov 1 21:39:11 2022 -0700 Nested profiling support for Linux-perf Profiler (#87904) Add a stack of start counter values, and attribute each disable to the last enable Differential Revision: [D40539212](https://our.internmc.facebook.com/intern/diff/D40539212/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87904 Approved by: https://github.com/SS-JIA commit ebdaeaaa8c0a0e3089d7d16fa9d79a2f3185eba4 Author: Digant Desai Date: Tue Nov 1 21:39:09 2022 -0700 [edge profiler] Add e2e test for profiler event and chrometrace (#87877) * Runs an existing model and checks an aten op if it gets perf events generated in the chrometrace * Doesn't check for exact values since that's harder to do in a hardware independent way Differential Revision: [D40474957](https://our.internmc.facebook.com/intern/diff/D40474957/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87877 Approved by: https://github.com/SS-JIA commit 03346296dbfd1033cb0898983eebcd4c0af32afb Author: Digant Desai Date: Tue Nov 1 21:39:07 2022 -0700 [edge profiler] Add support for performance events counting (#87876) * Add support in lite_predictor benchmark binary to select event lists * Uses Linux perf through Kineto profiler Differential Revision: [D39837216](https://our.internmc.facebook.com/intern/diff/D39837216/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39837216/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/87876 Approved by: https://github.com/SS-JIA commit bc1e9a07a3644f300e3d27b377a152c330ca6dd9 Author: Digant Desai Date: Tue Nov 1 21:39:05 2022 -0700 [profiler] Add Performance events support in Kineto profiler (#87874) * Wiring to allow user to pass event names to profiler and reflect the count to the chrometrace * If not used, the runtime and size overhead should be neglegible * For now, primary user will be KinetoEdgeCPUProfiler but the impl does not assume that * Not exposed to python yet Differential Revision: [D40238032](https://our.internmc.facebook.com/intern/diff/D40238032/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40238032/)! 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87874 Approved by: https://github.com/SS-JIA commit 70782981f06a042796d4604df2ec1491f4f5b194 Author: Brian Hirsh Date: Tue Nov 1 20:06:45 2022 -0700 aot_dispatch test fix: always use functionalization in symbolic tests (#87647) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87647 Approved by: https://github.com/ezyang, https://github.com/Chillee commit f9d7985851f49c3b44383dae50cd77632e7e2245 Author: Brian Hirsh Date: Tue Nov 1 20:06:44 2022 -0700 fix as_strided_scatter_backward (#87646) as_strided_scatter's derivative formula was broken - instead of making a "mask" of 1's and 0's, it would effectively make a mask of 1's and uninitialized memory. Fixes https://github.com/pytorch/pytorch/issues/88105 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87646 Approved by: https://github.com/albanD commit b5a925ff2eac429d27ea55c0046dd68f24e409e2 Author: Brian Hirsh Date: Tue Nov 1 20:06:44 2022 -0700 propagate .meta info when replacing subgraphs in fx (#87255) Fixes https://github.com/pytorch/torchdynamo/issues/1708 Our FX subgraph partitioner works by taking all of the original output nodes from a subgraph, and replacing it with a new `call_module` node in the graph. If the original subgraph outputs had fake tensors and other metadata stored in their `.meta` attribute though, then this information was getting lost when we spliced in the subgraph. Losing metadata on an FX graph also seems like an easy trap to fall into, so I'm wondering if there are any better guardrails that we can add. I ended up fixing in this PR by adding an optional kwarg to propagate meta info directly in the `fx.Node.replace_all_uses_with`, just because propagating metadata seems like a pretty core thing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87255 Approved by: https://github.com/wconstab, https://github.com/SherlockNoMad commit 5669e10d37fa3cca21cf82c843ae4c4e79da1b89 Author: Philip Meier Date: Wed Nov 2 11:25:06 2022 +0100 remove assert_allclose from torch.testing (#87974) See #87969 or #86586 for the reasoning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87974 Approved by: https://github.com/mruberry commit b9c617838ab34d97cea4e773e34db4e2bd3a2526 Author: Philip Meier Date: Wed Nov 2 11:25:06 2022 +0100 remove make_non_contiguous from torch.testing (#87973) See #87969 or #86586 for the reasoning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87973 Approved by: https://github.com/mruberry commit 8893c6cd074682755d5f9e4219b86a0c7f13e76c Author: Philip Meier Date: Wed Nov 2 11:25:05 2022 +0100 remove deprecated dtype getters from torch.testing (#87972) See #87969 or #86586 for the reasoning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87972 Approved by: https://github.com/mruberry commit a360be50b50dfd6ecedaa835106b3fd45d571412 Author: Philip Meier Date: Wed Nov 2 11:25:05 2022 +0100 remove deprecated device getter from torch.testing (#87971) See #87969 or #86586 for the reasoning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87971 Approved by: https://github.com/mruberry commit 554cdc9a63f9d8471061265bc49f5fbf0a220364 Author: Philip Meier Date: Wed Nov 2 11:25:05 2022 +0100 remove deprecated rand and randn from torch.testing (#87970) See #87969 or #86586 for the reasoning. 
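For anyone whose test suite is hit by these removals, a minimal migration sketch (illustrative only, not part of the commits above): the deprecated `torch.testing.assert_allclose` maps onto the supported `torch.testing.assert_close`.

```python
import torch

a = torch.tensor([1.0, 2.0])
b = torch.tensor([1.0, 2.0 + 1e-7])

# Before (deprecated, removed by the stack above):
# torch.testing.assert_allclose(a, b)

# After: assert_close is the supported replacement; it raises an
# AssertionError with a detailed diff if the tensors do not match.
torch.testing.assert_close(a, b)                         # default tolerances
torch.testing.assert_close(a, b, rtol=1e-5, atol=1e-8)   # explicit tolerances
```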
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87970 Approved by: https://github.com/mruberry commit bc73affdade9f8315c8e5cc62211a67562877f8b Author: Philip Meier Date: Wed Nov 2 11:25:04 2022 +0100 prepare removal of deprecated functionality in torch.testing (#87969) _Redo of #86586 with all BC breaking changes granularly placed into separate commits._ --- Per title. Deprecation happened on Feb 25, 2022 in c6f1bbc0ac33be0c8ad9956e3fc15e78ddb6cb95, which made it into the 1.12 release. Since it is now 245 days later and the next release will be 1.14, the removals later in the stack comply with the [BC policy](https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#minimizing-the-disruption-of-bc-breaking-changes). Pull Request resolved: https://github.com/pytorch/pytorch/pull/87969 Approved by: https://github.com/mruberry commit 0fc7de398636f4b53e6c3fde38b4e48a5ff5b37d Author: Digant Desai Date: Tue Nov 1 21:39:03 2022 -0700 [profiler] Add Linux Perf support (#87866) * Add support to use Linux kernel perf subsystem via the profiler. * For now the perf configurability is quite limited to just event names. Threading etc. to come later. * Given we want to support variety of different cpu types, number of events list (in addition to the standard set of events) is also limited. * Rather than failing with unsupported feature for non-Linux platforms, it returns zeros for all the event counts. * For now, max event counts is capped at 4, time multiplexing is not allowed. * Threadpool recreate hack is restricted to mobile only - need to add better support for threading in general Differential Revision: [D40238033](https://our.internmc.facebook.com/intern/diff/D40238033/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40238033/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/87866 Approved by: https://github.com/SS-JIA commit d6b58d6924057ba38faa49d68cead72989641962 Author: Andrew Gu Date: Tue Nov 1 22:47:12 2022 +0000 [FSDP()][23/N] Refactor handle attr initialization (#87938) **`_init_param_attributes()` -> `init_flat_param_attributes()`** We move `_init_param_attributes()` to `FlatParamHandle.init_flat_param_attributes()` (as already marked as to-do during previous refactoring). **`_reset_lazy_init()`** We no longer delete `_local_shard` from each `FlatParameter` in `_reset_lazy_init()`. **Analysis** Thus, the two semantic differences are that we remove the initial `if hasattr(p, "_local_shard")` early return in `_init_param_attributes()` and the `delattr(p, "_local_shard")` in `_reset_lazy_init()`. This is safe because - If we never call `_reset_lazy_init()`, then `init_flat_param_attributes()` is only called once. There is no opportunity for an early return. - If we call `_reset_lazy_init()`, then `init_flat_param_attributes()` will be called again in the next `_lazy_init()`. However, since we removed the early return, all of the attributes initialized in `init_flat_param_attributes()` simply get re-initialized and override any existing attributes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87938 Approved by: https://github.com/mrshenli commit d172dcf3164bd45d12c67da32d8196987beff997 Author: Andrew Gu Date: Tue Nov 1 22:47:12 2022 +0000 [FSDP()][21/N] Refactor and fix `_cast_buffers()` (#87935) This PR refactors and fixes `_cast_buffers()`. 
**Before** Buffers were not correctly cast back to their original dtypes for submodules when using buffer mixed precision. - `_cast_buffers(recurse=False)` incorrectly casts all buffers, including those in submodules. This is because of this outer loop over `self.modules()`: https://github.com/pytorch/pytorch/blob/c40033be162db0f94d37e7ccbd2a89d67f8b8e47/torch/distributed/fsdp/fully_sharded_data_parallel.py#L700 - There was a unit test that checked that buffers were cast as expected (`test_mixed_precision_e2e_full_shard()`). The unit test _coincidentally_ passed because all modules shared the same buffer name `"buffer"`. In `_cast_buffers()`, the `dict` mapping buffer name to original dtype is populated lazily (during `_lazy_init()`). However, the keys are unprefixed: https://github.com/pytorch/pytorch/blob/c40033be162db0f94d37e7ccbd2a89d67f8b8e47/torch/distributed/fsdp/fully_sharded_data_parallel.py#L712-L717 - Thus, even though (1) `_cast_buffers(recurse=False)` was only called on the root and (2) `self._buffer_name_to_orig_dtype` had unprefixed names as keys, the unit test still passed because (1) `_cast_buffers()` still looped over all buffers despite `recurse=False` and (2) all submodules' buffers were named `"buffer"` and had the same original and low-precision dtypes and hence were cast correctly. If we change each submodule to have its own distinct buffer name, then the unit test fails. This PR makes such a change to showcase the progression granted by this PR. **After** This PR separates `_cast_buffers()` into three methods: `_get_buffers_and_dtypes_for_computation()`, `_get_buffers_and_dtypes_for_checkpoint()`, and `_cast_buffers_to_dtype_and_device()`. This is to separate the different use cases (casting for computation and casting for checkpointing) and the corresponding code paths. Plus, the signature for `_cast_buffers_to_dtype_and_device()` makes it clear exactly what buffers are being cast and to what dtype. Both `_get_...()` functions assume that they are called on the root only for now. This coincides with the construction of `_buffer_name_to_orig_dtype` in the FSDP constructor, which loops over all submodules. (This means that for non-root modules, their `_buffer_name_to_orig_dtype` is populated but not used.) The `dict`'s keys are clean since the buffer cast to original dtype happens in a `summon_full_params()` context, which cleans the names. **Follow-Ups** - We can try to move `_get_buffers_and_dtypes_for_checkpoint()` into `_state_dict_utils.py` in a follow-up. - We may want to move to per-module buffer casting (i.e. do not have the root module cast for all submodules). Pull Request resolved: https://github.com/pytorch/pytorch/pull/87935 Approved by: https://github.com/mrshenli commit b0b1e78e2ddccc07f94762bdfe33770c75d12db1 Author: Andrew Gu Date: Tue Nov 1 22:47:11 2022 +0000 [FSDP] Rename `dtype` to `buffer_name_to_dtype` (#87934) This PR is easy and only a rename. `dtype` does not convey that it is actually a `Dict[str, torch.dtype]` (when not `None`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/87934 Approved by: https://github.com/mrshenli commit d14fc0bc36601b4e72b09d50c2cfdcfdb61be4ad Author: Andrew Gu Date: Tue Nov 1 22:47:11 2022 +0000 [FSDP] Remove `device` arg from `_cast_buffers()` (#87933) This PR is easy. The `device` argument in `_cast_buffers()` is never used. 
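Referring back to the `_cast_buffers()` fix (#87935) above, a small self-contained sketch (illustrative only, not FSDP code) of why keying a dict on unprefixed buffer names hides the bug whenever every submodule names its buffer `"buffer"`:

```python
import torch

class Sub(torch.nn.Module):
    def __init__(self, dtype):
        super().__init__()
        self.register_buffer("buffer", torch.zeros(1, dtype=dtype))

class Root(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = Sub(torch.float32)
        self.b = Sub(torch.float64)

root = Root()
# Prefixed names are unique per submodule...
print([name for name, _ in root.named_buffers()])  # ['a.buffer', 'b.buffer']
# ...but a dict keyed on unprefixed names silently collapses them,
# so only one "original dtype" survives and the others are lost.
unprefixed = {name.split(".")[-1]: buf.dtype for name, buf in root.named_buffers()}
print(unprefixed)  # {'buffer': torch.float64}
```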
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87933 Approved by: https://github.com/mrshenli commit 19c7df89fbdf2d8a28ba67d10cfdfe7540bb0c55 Author: Andrew Gu Date: Tue Nov 1 22:47:11 2022 +0000 [FSDP()][20/N][Easy] Move functions in file (#87932) This PR is easy. I just wanted to group functions in the file according to the same logical order. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87932 Approved by: https://github.com/mrshenli commit 4635f56da170dbf25759b2128e57a387d01cb41c Author: Andrew Gu Date: Tue Nov 1 22:47:10 2022 +0000 [FSDP()][18/N] Refactor `pre_forward_unshard()` (#87931) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87931 Approved by: https://github.com/mrshenli commit 0a752688bda6cb217852068e15da5133a9d7e5b6 Author: Andrew Gu Date: Tue Nov 1 22:47:10 2022 +0000 [FSDP()][17/N] Refactor `_fsdp_root_pre_forward()` (#87930) This PR moves `_fsdp_root_pre_forward()` to `_runtime_utils.py`. Note: This PR includes a (temporary) fix for `NO_SHARD` + `CPUOffload(offload_params=True)`, where we set `non_blocking=False` when copying the gradient from device to host. It is only included in this PR since the test was **flaky** (but not consistently failing) on this PR , so I needed to fix to unblock land. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87930 Approved by: https://github.com/mrshenli commit 39d9d2ed706dd9737ee49828b63c8975f31a87fe Author: lezcano Date: Tue Nov 1 16:31:11 2022 +0000 Implement reference for lerp (#87424) We follow the vectorised CPU implementation for numerical accuracy Pull Request resolved: https://github.com/pytorch/pytorch/pull/87424 Approved by: https://github.com/ezyang commit 6b5d7fccc6140fc15a7e882e9c8de21477e31459 Author: Ivan Yashchuk Date: Wed Nov 2 11:11:28 2022 +0000 Add a basic test for "nvprims_nvfuser" Dynamo backend (#88186) Ref. https://github.com/pytorch/pytorch/pull/87797#issuecomment-1297635210 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88186 Approved by: https://github.com/ezyang commit 9ebb8d52320feb5c634ddde767d0404b94443443 Author: Ivan Yashchuk Date: Wed Nov 2 10:05:12 2022 +0000 Add ops.broadcast for nvFuser (#88080) Having nvFuser's `broadcast` available alongside `broadcast_in_dim` would allow easier experimentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88080 Approved by: https://github.com/jjsjann123, https://github.com/kevinstephano, https://github.com/mruberry commit 2ddefbdc3c4fab70b4c2898a0a25e403610741fc Author: Kazuaki Ishizaki Date: Wed Nov 2 09:38:13 2022 +0000 Fix typos used in documents under torch directory (#88300) This PR fixes typos, in comments of Python files, that are found from a search box at https://pytorch.org/docs/master/search.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/88300 Approved by: https://github.com/lezcano commit 4a8382b58eeca9eed09c7c3b801b81befc2f75ce Author: Ivan Yashchuk Date: Wed Nov 2 09:29:20 2022 +0000 Update caching of tensor arguments for nvFuser's fusion creation (#87860) Previously nvFuser's fusion definition was cached based on concrete shape and strides of tensor inputs for simplicity and correctness. This PR changes Python's cache to check the number of dimensions, size-1 dimensions, and contiguity information based on given strides and shapes. 
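As a rough illustration of the caching strategy described for #87860 (the helper below and its exact contiguity rule are assumptions for exposition, not the actual implementation): the key records rank, size-1 dimensions, and per-dimension contiguity derived from shapes and strides, so inputs that differ only in concrete extent can reuse one fusion definition.

```python
from typing import Sequence, Tuple

def fusion_cache_key(shapes: Sequence[Tuple[int, ...]],
                     strides: Sequence[Tuple[int, ...]]) -> Tuple:
    """Illustrative only: build a cache key from ndim, size-1 dims and
    per-dimension contiguity instead of concrete shapes/strides."""
    key = []
    for shape, stride in zip(shapes, strides):
        size_one_dims = tuple(i for i, s in enumerate(shape) if s == 1)
        # A dimension is "contiguous" with the next one if its stride
        # equals the next dimension's stride times that dimension's size.
        contiguity = tuple(
            stride[i] == stride[i + 1] * shape[i + 1]
            for i in range(len(shape) - 1)
        )
        key.append((len(shape), size_one_dims, contiguity))
    return tuple(key)

# Two inputs with different concrete sizes but the same layout structure
# map to the same key, so a cached fusion definition could be reused.
print(fusion_cache_key([(2, 3, 4)], [(12, 4, 1)]) ==
      fusion_cache_key([(5, 6, 7)], [(42, 7, 1)]))  # True
```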
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87860 Approved by: https://github.com/kevinstephano, https://github.com/jjsjann123, https://github.com/ngimel commit ccf6b558a4c58d1ae92689b2a5064916b42eff05 Author: Yanbo Liang Date: Wed Nov 2 06:58:02 2022 +0000 [Dynamo] UserFunctionVariable supports type & ABCMeta as arguments (#88257) Fixes https://github.com/pytorch/torchdynamo/issues/1785 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88257 Approved by: https://github.com/ezyang commit e763b7abebd7e3e9376a59b5f728916e0ca084a8 Author: Kshiteej K Date: Wed Nov 2 06:37:33 2022 +0000 [complex] conv_transpose3d : complex support (#87967) Reference: https://github.com/pytorch/pytorch/issues/71108 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87967 Approved by: https://github.com/anjali411 commit 7674af9ce7a3f5b210f16d0e935a89c76440434c Author: PyTorch MergeBot Date: Wed Nov 2 05:22:38 2022 +0000 [vision hash update] update the pinned vision hash (#88162) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88162 Approved by: https://github.com/pytorchbot commit 4ab5d79b286007eb126ca0002cdaed2305c05cc1 Author: Fabio Rocha Date: Tue Nov 1 19:29:17 2022 +0000 [inductor] Updated some triton.libdevice calls (#88242) triton master now does not require `d` or `f` suffix to some libdevice function calls - it dispatches to right library call based on argument type. triton pin updated to https://github.com/openai/triton/commit/f16138d447bccc54641a9c48ffedbd449a1a40a7 Also removed some xfails for some unrelated tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88242 Approved by: https://github.com/ngimel commit a51da28551e9f13a7afca5bbc829a8d9abced44e Author: Will Constable Date: Wed Nov 2 03:52:17 2022 +0000 Support multi-gpu CI for inductor-distributed (#87996) This test by itself isn't the end goal, but it is a minimal test that exercises multi-gpu and the focus of the PR is the infra behind enabling that. I'll follow up with more tests using actual models etc. and @malfet @desertfire for awareness/feedback on the infra side Pull Request resolved: https://github.com/pytorch/pytorch/pull/87996 Approved by: https://github.com/aazzolini commit 95fc0bcaaddc2d24e8759f24dbefa789d04e9e42 Author: Edward Z. Yang Date: Mon Oct 31 13:33:52 2022 -0700 Disable torchdynamo in backwards compiler harder (#88132) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88132 Approved by: https://github.com/bertmaher, https://github.com/malfet commit 3c6bddc3f6347ce7d1ed33aee94cdaa953cbc387 Author: eqy Date: Wed Nov 2 01:36:37 2022 +0000 [cuDNN] (re-open) Enable cuDNN Frontend v8 API by Default (#87669) Has a small tweak to a test that was breaking on A10 (CC @malfet). CC @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/87669 Approved by: https://github.com/ngimel commit dfa94757557da460b677dbcc7edcb19f0e7122d7 Author: Peter Bell Date: Tue Nov 1 18:03:06 2022 +0000 Check SM version before calling flash attention with BFloat16 (#86600) The flash attention code path requires sm80 or newer to run on BFloat16, so any OpInfo tests running with BFloat16 would fail with the error: ``` RuntimeError: Expected q_dtype == at::kHalf || (is_sm8x && q_dtype == at::kBFloat16) to be true, but got false. 
``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/86600 Approved by: https://github.com/ngimel commit bc9caafc7898a6df534f566f599f8f5a78d207d1 Author: Peter Bell Date: Tue Nov 1 17:50:07 2022 +0000 record_function: update to use custom_class API (#76420) Re-submit of gh-72302 This still has a small performance hit, but it much smaller. On my machine I see `_record_fucntion_exit._RecordFunction` takes 1.05 us compared to the `Tensor` overload taking 0.79 us. In an overall comparison, I see a 0.7 us slowdown from 6.0 us to 6.7 us for this timeit benchmark ```python import torch def foo(): with torch.profiler.record_function("foo"): return torch.eye(3) %timeit foo() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/76420 Approved by: https://github.com/robieta commit 0131a66ab6c8454c9ac7517641f63095b090e8cb Author: Kazuaki Ishizaki Date: Tue Nov 1 22:58:22 2022 +0000 Fix typos under torch directory (#88172) This PR fixes typos in '.md' files under torch directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/88172 Approved by: https://github.com/malfet commit 72958b9665a59c1fc53c2254c675530dcc2886dd Author: Yanbo Liang Date: Tue Nov 1 22:45:11 2022 +0000 [Dynamo] Update Dynamo benchmarks running commands (#87844) Fixes https://github.com/pytorch/torchdynamo/issues/1761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87844 Approved by: https://github.com/jansel commit a56beb2a822a07b8a933ef71f20813715a58e030 Author: jjsjann123 Date: Tue Nov 1 22:43:51 2022 +0000 [nvfuser] merge rule update (#88228) adding Kevin to NVFuser reviewer Pull Request resolved: https://github.com/pytorch/pytorch/pull/88228 Approved by: https://github.com/soumith commit fb1586fbcb23b3427c55b7f1c9bd554f4a1aa05d Author: Shiyan Deng Date: Tue Nov 1 22:42:04 2022 +0000 Make a copy of the submodule inputs (#87899) Summary: There might be inplace ops in the model that would change the saved inputs. To avoid that, we save a deepcopy version. Test Plan: CI Differential Revision: D40771290 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87899 Approved by: https://github.com/houseroad commit 73492645cfcee7f2b3b6f6803cca1baca814a901 Author: Charlie Yan Date: Tue Nov 1 17:46:00 2022 +0000 Copy DDP code to be reused in composable API (#87836) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87836 Approved by: https://github.com/mrshenli commit b2dfd2026034c8ca13d9ddef7fd990a3f2054a1e Author: Andrew M. James Date: Mon Oct 31 18:29:05 2022 -0500 Remove BSC conversion skip from TestSparseCompressed.test_consistency (#88152) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88152 Approved by: https://github.com/cpuhrsch commit d044b4cc58f1e088e19c0f731a18630607887389 Author: Andrew M. 
James Date: Mon Oct 31 18:29:02 2022 -0500 Update torch.abs and torch.positive opinfos to reflect sparse support (#88151) cc @nikitaved @pearu @cpuhrsch @bhosmer Pull Request resolved: https://github.com/pytorch/pytorch/pull/88151 Approved by: https://github.com/cpuhrsch commit ffd54def8fa52653d3a68bc00f0583e8d16d6acb Author: Nikita Shulga Date: Tue Nov 1 22:17:12 2022 +0000 [GHF] Remove CC line from commit message (#88252) This line is added by autoCCBot, but is not really meaningful as commit message Test Plan: ``` >>> from trymerge import GitHubPR, RE_PR_CC_LINE >>> import re >>> pr=GitHubPR("pytorch", "pytorch", 87809) >>> re.sub(RE_PR_CC_LINE, "", pr.get_body()) 'Fixes #ISSUE_NUMBER\r\n\n\n' >>> pr=GitHubPR("pytorch", "pytorch", 87913) >>> re.sub(RE_PR_CC_LINE, "", pr.get_body()) 'Parallel compilation warms the Threadpool when we call `torch._dynamo.optimize()`. In current benchmarks, we were setting up the TRITON_CACHE_DIR much later. Because of this parallel compilation artifacts were not used and compilation latency improvements were not visible in dashboard. This PR just prepones the setup of TRITON_CACHE_DIR.\n\n' >>> pr=GitHubPR("pytorch", "pytorch", 85692) >>> re.sub(RE_PR_CC_LINE, "", pr.get_body()) 'This PR sets CUDA_MODULE_LOADING if it\'s not set by the user. By default, it sets it to "LAZY".\r\n\r\nIt was tested using the following commands:\r\n```\r\npython -c "import torch; tensor=torch.randn(20, 16, 50, 100).cuda(); free, total = torch.cuda.cudart().cudaMemGetInfo(0); print(total-free)"\r\n```\r\nwhich shows a memory usage of: 287,047,680 bytes\r\n\r\nvs\r\n\r\n```\r\nCUDA_MODULE_LOADING="DEFAULT" python -c "import torch; tensor=torch.randn(20, 16, 50, 100).cuda(); free, total = torch.cuda.cudart().cudaMemGetInfo(0); print(total-free)"\r\n```\r\nwhich shows 666,632,192 bytes. 
\r\n\r\nC++ implementation is needed for the libtorch users (otherwise it could have been a pure python functionality).\r\n\r\n' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/88252 Approved by: https://github.com/xuzhao9, https://github.com/izaitsevfb commit ba643b4ddf3ef0e6ada3fdfd885ed18a71ed8e44 Author: Sean Ross-Ross Date: Tue Nov 1 21:42:51 2022 +0000 feature: adding batch support for narrow_copy operator (#88130) Implement batch support https://github.com/pytorch/functorch/issues/825 for narrow copy narrow_copy was already added as an opinfo cc @zou3519 @Chillee @samdow @soumith Pull Request resolved: https://github.com/pytorch/pytorch/pull/88130 Approved by: https://github.com/kshitij12345, https://github.com/zou3519 commit c40033be162db0f94d37e7ccbd2a89d67f8b8e47 Author: Manuel Candales Date: Tue Nov 1 21:01:31 2022 +0000 [Vulkan][TCC] Implement tests for cat_batch, cat_width and normalize_dim (#87633) Summary: Implement Vulkan tests for these untested functions in Concat.cpp: - cat_batch - cat_width - normalize_dim Test Plan: ```cd ~/fbsource buck run //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 ``` Differential Revision: D40605571 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87633 Approved by: https://github.com/salilsdesai, https://github.com/kirklandsign, https://github.com/SS-JIA commit e6ea0a4a4b4ae840ff441a2a7331030dee942766 Author: Elias Ellison Date: Tue Nov 1 08:56:06 2022 -0700 Don't Require contiguous For Extern Kernels (#87650) cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87650 Approved by: https://github.com/desertfire commit 8ef9bda1bf7df84483c593f55e704657887120d6 Author: Kevin Stephano Date: Tue Nov 1 19:02:40 2022 +0000 Fix nvFuser Fusion Definition printing of Squeeze and Permute (#88041) NM cc @jjsjann123 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88041 Approved by: https://github.com/IvanYashchuk, https://github.com/jjsjann123, https://github.com/mruberry commit 68f9f256a3ccae692204d42600e618fb2112b8cb Author: Jerry Zhang Date: Fri Oct 28 11:29:04 2022 -0700 [reland][fx][subgraph_rewriter] Change match_filter to be a List in replace_pattern_with_filters (#87998) Summary: att, this is experimental api so not marking it as bc-breaking. The match will be accepted only if all the filters in the list passes. Changing the filter arg to be list also allows us to pass in empty list that means no filter, which makes user code cleaner. Test Plan: python test/test_fx.py -k test_replace_pattern_with_filters Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D40810943](https://our.internmc.facebook.com/intern/diff/D40810943) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87998 Approved by: https://github.com/SherlockNoMad commit 2c7de4a14425759fdfdca7d0c5091ceafe564695 Author: Tugsbayasgalan Manlaibaatar Date: Mon Oct 31 10:58:36 2022 -0700 Add meta implementation for aten.max.dim (#88005) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88005 Approved by: https://github.com/Chillee, https://github.com/bdhirsh commit 97b3eeac90d6e3fdc093117422e031b8630528a1 Author: jjsjann123 Date: Tue Nov 1 18:07:17 2022 +0000 remove assert on tensor inputs to FusionGroup (#88018) Fixes #86530 #86227 #85872 All issues seem to be duplicate of each other. 
Removes the false positive assert Fixes come from @kevinstephano Pull Request resolved: https://github.com/pytorch/pytorch/pull/88018 Approved by: https://github.com/kevinstephano, https://github.com/soumith commit e1c123d29a40ae1f3eae312a118e22769b1db870 Author: Nikita Shulga Date: Tue Nov 1 17:59:35 2022 +0000 Add UBSAN to ASAN (#88055) Add undefined behavior sanitizer to `USE_ASAN` option. Added `torch._C._crash_if_vptr_ubsan()` that only fails if vptr belongs to a wrong class after typecast Deleted all ubsan supressions, but disabled `ProtoTest::Basic` as it fails above-mentioned vptr check. Fixes https://github.com/pytorch/pytorch/issues/88042 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88055 Approved by: https://github.com/ezyang commit 81f74eed75d24b63b5af8e818d74667647702dbf Author: Howard Huang Date: Mon Oct 31 16:33:12 2022 -0700 [11/N] Update all_to_all with CPU/CUDA implementations (#86407) * #83916 [7/N] [Dispatchable Collectives] Update reduce with CPU / CUDA implementations Pull Request resolved: https://github.com/pytorch/pytorch/pull/86407 Approved by: https://github.com/kwen2501 commit 90fa25705c0da7b50b88302626988267210186ba Author: Ivan Yashchuk Date: Tue Nov 1 17:46:52 2022 +0000 Rename 'nvfuser' to 'ts_nvfuser' indicating TorchScript usage (#88188) cc @kevinstephano @jjsjann123 @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/88188 Approved by: https://github.com/soumith, https://github.com/jansel commit bed8102741bdd62936cc743e83751e2fb91a5a3f Author: Howard Huang Date: Mon Oct 31 16:33:11 2022 -0700 [10/N] Update barrier with CPU/CUDA implementations (#86368) - Updates for the barrier collective - NOTE: current change will not achieve dispatching of barrier since there is no tensor to read from https://github.com/pytorch/pytorch/issues/86225 cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @kwen2501 @awgu Pull Request resolved: https://github.com/pytorch/pytorch/pull/86368 Approved by: https://github.com/kwen2501 commit 1f34067e9d83aa11f2f4d0ecd08cb0e0ed94dbd0 Author: Andrew Gu Date: Tue Nov 1 13:36:04 2022 +0000 [FSDP()][16/N] Refactor post-forward/pre-backward (#87929) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87929 Approved by: https://github.com/mrshenli commit 5a53f024e4a3a7958e97546a03fe224788a91df5 Author: Andrew Gu Date: Tue Nov 1 13:36:03 2022 +0000 [FSDP()][15/N] Refactor `_init_streams()` (#87928) This PR is easy. I think I move `_init_streams()` again in a later PR though :/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/87928 Approved by: https://github.com/mrshenli commit 90c5f856b2bd0b8c1776baac959219e9487ba4b1 Author: Andrew Gu Date: Tue Nov 1 13:36:03 2022 +0000 [FSDP()][14/N] Refactor pre-forward/post-backward (#87927) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87927 Approved by: https://github.com/mrshenli commit eb91e8a534f94127a6d744543f2080a44bca9e57 Author: Yidi Wu Date: Tue Nov 1 17:10:45 2022 +0000 torchdynamo support modules() for nn_module (#88023) Differential Revision: D40820879 This diff allows models to call self.modules() during dynamo tracing. 
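A toy example of the newly supported pattern (illustrative only; `torch._dynamo.optimize("eager")` is just one way to trigger tracing and is an assumption here, not taken from the diff):

```python
import torch
import torch._dynamo as dynamo

class Counter(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        # Iterating over self.modules() inside forward() is the kind of
        # call this change allows dynamo to trace through.
        n_modules = sum(1 for _ in self.modules())
        return self.linear(x) * n_modules

compiled = dynamo.optimize("eager")(Counter())
print(compiled(torch.randn(2, 4)).shape)  # torch.Size([2, 4])
```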
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/88023 Approved by: https://github.com/tugsbayasgalan, https://github.com/voznesenskym, https://github.com/jansel commit de1f641f11d3a486032cd2a63ac958ec23d2c92b Author: Sherlock Huang Date: Tue Nov 1 02:02:20 2022 +0000 Fix meta function for aten.addmm (#88068) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88068 Approved by: https://github.com/albanD commit fdc419786df8a6d86edf8c82b07bf4e9b8b551d0 Author: Thiago Crepaldi Date: Tue Nov 1 16:43:58 2022 +0000 Add unit test for torch_geometric library (#85937) Fixes #65138 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85937 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit 5c3666cb813f6ffa9e11580552c35435716703de Author: Han Qi (qihqi) Date: Tue Nov 1 16:11:30 2022 +0000 [codev] Make backport work with flatbuffer models (#88127) Summary: By adding flatbuffer as dependency of backport. Differential Revision: D40865452 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88127 Approved by: https://github.com/cccclai commit bb7e6254e4387e66beb938fb5b756d0f5c28d2a1 Author: Edward Z. Yang Date: Mon Oct 31 14:42:19 2022 -0700 Add ability to freeze storages inside functionalization (#88141) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88141 Approved by: https://github.com/albanD, https://github.com/bdhirsh commit 61f955dd83c0a6e12aca2a0a7c7bf267bcdd1bc5 Author: Edward Z. Yang Date: Mon Oct 31 14:42:15 2022 -0700 Inline Alias into FunctionalStorageImpl (#88140) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88140 Approved by: https://github.com/bdhirsh commit 73c9911fc001991809a6c90e2d61f71fc69ffde6 Author: Natalia Gimelshein Date: Tue Nov 1 15:47:43 2022 +0000 always realize output regardless of the number of reads (#88046) This improves hf_Bert 1.139x->1.21x, currently lowmem dropout doesn't work for nn.Dropout module, and before this change we were recomputing all the dropout masks in a very inefficient kernel. This change pushes dropout masks to be saved in the dropout kernels where they are first computed. 
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/88046 Approved by: https://github.com/Chillee commit c368c0faf08528fb73a2b74f905946268a3224a3 Author: Sherlock Huang Date: Tue Nov 1 02:02:19 2022 +0000 Fix meta for aten.fill, constant_pad_nd, _adaptive_avg_pool2d (#88069) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88069 Approved by: https://github.com/ngimel, https://github.com/malfet commit 82a9de16d4e1cb6da3626b26d6cbaaaa9258721c Author: Will Constable Date: Tue Nov 1 15:35:44 2022 +0000 Change dynamo/distributed tests to use cuda/nccl (#88133) - FSDP tests require nccl - also run in inductor shard and skip inductor in distributed shard - inductor shard has newer GPU and supports triton/inductor, but only runs on trunk - distributed shard runs on PR, but inductor shard only runs on trunk/opt-in Pull Request resolved: https://github.com/pytorch/pytorch/pull/88133 Approved by: https://github.com/davidberard98 commit 44f8efd5c1cd5c7641cb875e615ae480a730b9fa Author: Yanli Zhao Date: Tue Nov 1 15:27:40 2022 +0000 [BE]fix DDP when the number of output features is zero (#87793) Fixes #87280 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87793 Approved by: https://github.com/rohan-varma commit 20d849b98237f83ab4fcc5439d8e8c8f8fd71c8c Author: Howard Huang Date: Mon Oct 31 16:33:11 2022 -0700 [9/N] [Dispatchable Collectives] Update reduce_scatter with CPU / CUDA implementations (#86166) - Updates for the reduce_scatter collective https://github.com/pytorch/pytorch/issues/86225 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86166 Approved by: https://github.com/kwen2501 commit 1e5d33b6dfc0ed477fec57aca63427d038751207 Author: Edward Z. Yang Date: Mon Oct 31 17:48:35 2022 -0700 Reenable assert sanity testing with ADInplaceOrView reenable (#88102) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88102 Approved by: https://github.com/albanD commit bdb14238ec66640a9523a479fac60eda26a3b552 Author: AllenTiTaiWang Date: Mon Oct 31 23:44:23 2022 +0000 [Reland][ONNX] Move all torch.onnx.export related tests to test/onnx (#87292) Moving torch.onnx.export related tests to test/onnx integrates ONNX tests to the same CI machine, so the testing environment can be better managed. Fixes https://github.com/pytorch/pytorch/issues/87320 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87292 Approved by: https://github.com/thiagocrepaldi, https://github.com/BowenBao, https://github.com/kit1980, https://github.com/malfet commit 62988e4fe642561e82ac95114214cdd10273a936 Author: Charlie Yan Date: Tue Nov 1 13:51:06 2022 +0000 Update _distributed_c10d.pyi (#88088) Summary: `_distributed_c10d.pyi` is out of sync with the C++ binding. This change updates it. Test Plan: TBD Differential Revision: D40840836 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88088 Approved by: https://github.com/wanchaol commit b1750d0440e0bcc94de2295a8f24a2cd0cdcd886 Author: Andrew Gu Date: Tue Nov 1 01:16:26 2022 +0000 [FSDP()][13/N] Refactor unshard/reshard/grads (#87926) This PR is not too complicated. We just move unshard/reshard/grads out to `_runtime_utils.py` and make them take `state: _State` instead of `self`. 
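The shape of that refactor, in a minimal sketch (names and fields below are assumptions, not FSDP's actual API): helpers become free functions over an explicit state object, so the same logic can serve both the wrapper class and the composable API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class _State:
    sharded_params: List[str] = field(default_factory=list)
    unsharded: bool = False

def _unshard(state: _State) -> None:
    # Formerly a method closing over `self`; now a free function over `state`.
    state.unsharded = True

def _reshard(state: _State) -> None:
    state.unsharded = False

state = _State(sharded_params=["weight", "bias"])
_unshard(state)
assert state.unsharded
_reshard(state)
```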
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87926 Approved by: https://github.com/mrshenli commit 8039317c07874b62647457bac7bf5df499f41501 Author: Andrew Gu Date: Mon Oct 31 20:54:53 2022 +0000 [FSDP()][12/N] Easy cleanup (#87925) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87925 Approved by: https://github.com/mrshenli commit c1e28731b382ba5ea742cfc5a35bda6e4bcb35fc Author: Andrew Gu Date: Mon Oct 31 20:54:52 2022 +0000 [FSDP()][10/N][11/N] Introduce composable (ctor only) (#87924) This PR introduces the composable FSDP API (with constructor semantics only) along with some further constructor refactoring. A notable contribution here is `_get_submodule_to_states()`, which performs auto wrapping without actually wrapping. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87924 Approved by: https://github.com/mrshenli commit 78170701a348619745dc289bdf591384929a414a Author: Andrew Gu Date: Mon Oct 31 20:54:52 2022 +0000 [FSDP()][9/N] Refactor ctor (continued) (#87923) This PR makes a second pass over the constructor. The logic has been grouped into `_init_<...>` functions based on intent (e.g. `_init_prefetching_state()` or `_init_runtime_state()`). This makes the initialization code for composable FSDP much cleaner than having to re-write the same sequences of lower-level helper calls. This PR also moves `_ExecOrderData` into its own file `_exec_order_utils.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87923 Approved by: https://github.com/mrshenli commit 23fe6c8ca15ec2cf6ea74f93aa91cae343ea534f Author: Mike Iovine Date: Tue Nov 1 09:58:26 2022 +0000 [Static Runtime] Fix ReplaceWithMaybeCopy test in OSS (#88099) Summary: `ReplaceWithMaybeCopy` is guarded by `FBCODE_CAFFE` in `OptimizeGraph`. Run the pass manually to ensure it does the replacement. Test Plan: Existing tests Differential Revision: D40858743 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88099 Approved by: https://github.com/huydhn commit 7c6fe21a386213617d77b98be28729e6e32b29a0 Author: Huy Do Date: Tue Nov 1 05:58:42 2022 +0000 Fix monitoring script for macos (#88159) The monitoring script is currently failing with AccessDenied when trying to access uss memory on mac because [psutil.memory_full_info](https://psutil.readthedocs.io/en/latest/index.html?highlight=memory_full_info) requires higher user privileges Example failures: * https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3363066309/1/artifact/usage-log-test-default-2-2-macos-12_9208104847.zip * https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3363066309/1/artifact/usage-log-test-default-2-2-macos-m1-12_9207913759.zip I could also make this script run with sudo, effectively granting this permission. But I'm not entirely sure that we need uss memory for mac, so gracefully handling the error looks nicer Pull Request resolved: https://github.com/pytorch/pytorch/pull/88159 Approved by: https://github.com/clee2000 commit 323c646ca9e0f8eb452ed446b305382afcc7e270 Author: Kevin Stephano Date: Tue Nov 1 05:05:15 2022 +0000 Cleaned up the nvFuser Python Frontend Batch Norm printing (#88057) * Removed `define_null_tensor` usage in favor of using optional arguments for binding. * Re-ordered the non-State arguments for easier printing. 
* Added a printing function to include booleans `training` and `channels_last` * Fixed `define_tensor` to print `is_cpu` cc @jjsjann123 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88057 Approved by: https://github.com/IvanYashchuk, https://github.com/jjsjann123, https://github.com/mruberry commit a6acbad5c33a60109ba8373da8aa61a728ae4b20 Author: Nikita Shulga Date: Tue Nov 1 03:59:51 2022 +0000 [BE] Use default constructor in `LoggerVoidify` (#88054) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88054 Approved by: https://github.com/kit1980 commit 560786ac206278a48121feef2c4c55d71bdb9a77 Author: Driss Guessous Date: Tue Nov 1 03:14:24 2022 +0000 call contiguous on BMM inputs for NT on CUDA (#88108) Fixes #87713 BMM for cpu supports non-contiguous nested tensor inputs, while BMM for Cuda does not support currently non-contiguous inputs. The derivative for BMM: ``` - name: bmm(Tensor self, Tensor mat2) -> Tensor self: grad.bmm(mat2.transpose(1, 2).conj()) mat2: self.transpose(1, 2).conj().bmm(grad) result: self_t.bmm(mat2_p) + self_p.bmm(mat2_t) ``` When calling backward it was impossible for this function to succeed since the inputs were always discontiguous, regardless of the user input. This adds contiguous calls to BMM_cuda implementation for nested tensors. This was not caught by tests because grad_check is currently only done on CPU in test_nestedtensors. This PR updates the autograd test to also be run on GPU. As a result I found one more issue with the backward for to_padded_tensor erroring instead of calling the generic version. cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki Pull Request resolved: https://github.com/pytorch/pytorch/pull/88108 Approved by: https://github.com/cpuhrsch commit 0eea05b11e628eb4fd35a5664b3dd3812ab61461 Author: Ivan Yashchuk Date: Tue Nov 1 03:09:34 2022 +0000 Remove "prims_nvfuser" backend for TorchDynamo (#88083) Removing "prims_nvfuser" backend according to the discussion in https://github.com/pytorch/torchdynamo/pull/1281#discussion_r979468355. cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/88083 Approved by: https://github.com/ezyang commit a8aaee77bed496c3f0c80410c62c4aac5bff4296 Author: Sahan Paliskara Date: Mon Oct 31 12:40:30 2022 -0700 [torch::deploy] add gpu unit tests to CI (#88107) Adds `torch::deploy`'s GPU tests to core CI to make sure core changes don't break them. 
Overall, deploy tests take 11 min, so it shouldn't be much of a burden :) https://github.com/pytorch/pytorch/actions/runs/3364231795/jobs/5578861939 Differential Revision: [D40861442](https://our.internmc.facebook.com/intern/diff/D40861442) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88107 Approved by: https://github.com/d4l3k, https://github.com/anirbanr-fb-r2p commit 6a75a0d1a197e378ebbf1f73f5ab93ce79cb873a Author: Christian Puhrsch Date: Tue Nov 1 02:37:42 2022 +0000 Add support for neg to NestedTensor (#88131) Partially fixes #86889 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88131 Approved by: https://github.com/drisspg commit 708c050af9878f46a79088ab92e22dd0589fdcd4 Author: yanbing-j Date: Tue Nov 1 02:06:30 2022 +0000 Add labeler with cpu, mkldnn, amp, NNC and quantization paths to start (#87690) This PR is to dd labeler with `module: cpu`, `module: mkldnn`, `module: amp (automated mixed precision)`, `NNC` and `oncall: quantization' paths to start. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87690 Approved by: https://github.com/ezyang, https://github.com/malfet commit 3aa7a528553cca8d473805caa0e52fe63fb11f52 Author: maxren Date: Mon Oct 31 10:18:53 2022 -0700 [xnnpack][lite-int][4/n] introduce serialization to delegate (#87908) We introduced the serializer we created in the previous diff to our XNNGraph builder, the purpose of this is to serialize parts of the graph as we build this. At the end, we are able to finish and serialize the xnngraph into a std::string for use when we forward this along to on-device runtime. The next diff will rebuild the xnngraph from the serialization we introduce here, so testing the serialization of the graph will be done in the next diff Differential Revision: [D39335580](https://our.internmc.facebook.com/intern/diff/D39335580/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39335580/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/87908 Approved by: https://github.com/digantdesai commit 8287c1d96494805728b06e3c7b80d07b55897352 Author: maxren Date: Mon Oct 31 10:18:51 2022 -0700 [xnnpack][lite-int][3/n] flatbuffer serializer class (#87907) Creating a serializer class that allows us to serialize the xnnpack graph creation arguments. This essentially abstracts away the flatbuffer api manipulation and serialization that we deal with. As a result we can call ``` XNNSerializer::serializeAddNode() XNNSerializer::serializeTensorValue() XNNSerializer::finishAndSerialize ``` to serialize the graph Differential Revision: [D39196312](https://our.internmc.facebook.com/intern/diff/D39196312/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39196312/)! 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87907 Approved by: https://github.com/digantdesai commit 7bf819b181f3d4407e06b25d2f8fdd2230c44891 Author: maxren Date: Mon Oct 31 10:18:49 2022 -0700 [xnnpack][lite-int][2/n] flatbuffer xnn_value schema (#87906) Serializer schema for xnnpack graphs. Differential Revision: [D39003170](https://our.internmc.facebook.com/intern/diff/D39003170/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87906 Approved by: https://github.com/digantdesai commit 905d532d39526cf612857a4a802702af1805a71c Author: maxren Date: Mon Oct 31 10:18:46 2022 -0700 [xnnpack][lite-int][1/n] flatbuffer buck rules (#87826) Writing a placeholder schema.fbs file for now to set up the buck gen rules. The generated schema file will be used in the xnnpack namespace and be reserved for serialization/deserialization of our xnnpack lowered graph. Steps Accomplished: - Buck rules to compile flatbuffer schema - added header file to preprocess - everything compiles correctly Differential Revision: [D38999169](https://our.internmc.facebook.com/intern/diff/D38999169/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38999169/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/87826 Approved by: https://github.com/digantdesai commit aa1f9a1bd79da47d17f28a50f467a628728e68ac Author: maxren Date: Mon Oct 31 10:18:45 2022 -0700 [xnnpack][lite-int][graph-build] torchscript -> xnnpack graph (#87824) At this point we perform the conversion from TorchScript IR to the XNNPack graph. Currently we only support converting Add Nodes and fp32 tensor values. As a caveat, we are not building this at runtime. So for testing we just run the xnn graph once ahead of time with sample inputs and forward it to execute. This is only for testing, and will be changed in a later diff. This will allow us to check that graph creation is sound. Differential Revision: [D39838851](https://our.internmc.facebook.com/intern/diff/D39838851/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87824 Approved by: https://github.com/digantdesai, https://github.com/salilsdesai commit d596b048e5b7fba24e2dfb33413462c646950d93 Author: Edward Z. Yang Date: Mon Oct 31 09:25:34 2022 -0400 Also skip large models for normal --accuracy runs (#88086) Signed-off-by: Edward Z. Yang cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/88086 Approved by: https://github.com/albanD commit afd00673b6dedbdb811cfb1a9078deee1cb53f38 Author: Driss Guessous Date: Tue Nov 1 00:00:35 2022 +0000 Change Nested Tensor logging copy (#88104) Change the copy of how we log NestedTensor usage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88104 Approved by: https://github.com/mikaylagawarecki commit c0761a835b88eb2ed2186a9aaac73d471c2fb843 Author: PyTorch MergeBot Date: Mon Oct 31 23:49:37 2022 +0000 Revert "[dynamo] Error when user nests FX with dynamo (#87797)" This reverts commit 1da5aeb97b73664ff0fe2f4bb48379655cede969.
Reverted https://github.com/pytorch/pytorch/pull/87797 on behalf of https://github.com/ezyang due to breaks nvfuser stack, needs more investigation commit caaf37a1116cf4ce0f372bbd9241f8a827dc33b7 Author: Nikita Shulga Date: Mon Oct 31 23:38:03 2022 +0000 Fix `PyTorchStreamWriter` exception handling (#88128) Avoid a double exception in the destructor when attempting to serialize to a python object that does not have a `write` method. Use the `Finalizer` class in `PyTorchStreamWriter::writeEndOfFile()` to always set the `finalized_` property even if an exception occurs (as there isn't much one can do at this point). Add an explicit check for the attribute to `_open_zipfile_writer_buffer` and add unit tests. Modernize the code a bit by using the Python-3 `super()` method Fixes https://github.com/pytorch/pytorch/issues/87997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88128 Approved by: https://github.com/albanD commit ea8a5b09a9e1e08017c245799891496bfd40c7f6 Author: John Detloff Date: Mon Oct 31 23:36:00 2022 +0000 [IOS] Update Cocoapods for 1.13 release (#88075) Update the podspecs for libtorch and libtorch-lite to v 1.13 to prepare for the 1.13 pod release. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88075 Approved by: https://github.com/manuelcandales, https://github.com/salilsdesai, https://github.com/malfet commit bc03aa6013e101222c9652d04a2b08e48f626dfb Author: Masaki Kozuki Date: Mon Oct 31 22:45:23 2022 +0000 Store `autocast_gpu_dtype` in `custom_fwd` and `custom_bwd` for BFloat16 autocast (#88029) As per #87979, `custom_bwd` seems to forcefully use `torch.float16` for `torch.autograd.Function.backward` regardless of the `dtype` used in the forward. Changes: - store the `dtype` in `args[0]` - update tests to confirm the dtype of intermediate result tensors that are outputs of autocast compatible `torch` functions cc @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/88029 Approved by: https://github.com/ngimel commit f2b247f0d891f8ff5bcaa5276a51324f692e104c Author: Edward Z. Yang Date: Mon Oct 31 16:42:58 2022 -0400 Remove stale comment (#88135) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88135 Approved by: https://github.com/albanD commit 139afc50ecafcde5fb085e2cca78fed55e6b5aad Author: Christian Puhrsch Date: Mon Oct 31 21:31:54 2022 +0000 Fix links to tutorial in torch masked docs (#88129) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88129 Approved by: https://github.com/jisaacso commit 9fed04ba337680b7e6f2a6cde0b7bbc256a41eae Author: Catherine Lee Date: Mon Oct 31 21:12:52 2022 +0000 fix for auto labeler (#88100) followed https://lightrun.com/answers/actions-labeler-how-to-only-add-label-not-remove-when-pr-is-opened side note: should we move this logic to test-infra to be with the release notes labeler? Pull Request resolved: https://github.com/pytorch/pytorch/pull/88100 Approved by: https://github.com/huydhn commit ba26bc0fc266ddb58ec199349d2c93c7a905dfd0 Author: Radek Bartoň Date: Mon Oct 31 21:11:16 2022 +0000 Fix random "C1041: cannot open program database" errors when compiling on Windows (#88084) Adds `/FS` option to `CMAKE_CXX_FLAGS` and `CMAKE_CUDA_FLAGS`.
So far I've encountered this kind of errors: ``` C:\Users\MyUser\AppData\Local\Temp\tmpxft_00004728_00000000-7_cuda.cudafe1.cpp: fatal error C1041: cannot open program database 'C:\Projects\pytorch\build\third_party\gloo\gloo\CMakeFiles\gloo_cuda.dir\vc140.pdb'; if multiple CL.EXE write to the same .PDB file, please use /FS ``` when building with VS 2022. cc @peterjc123 @mszhanyi @skyline75489 @nbcsm Related issues: - https://github.com/pytorch/pytorch/issues/87691 - https://github.com/pytorch/pytorch/issues/39989 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88084 Approved by: https://github.com/ezyang commit 73379acaf3865379aed0a1bab1320616772152f3 Author: Brian Hirsh Date: Mon Oct 31 09:47:26 2022 -0700 Do not use unsafe restriding for subclasses (#87610) This helps convert some accuracy errors into runtime errors, which makes it easier to debug. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87610 Approved by: https://github.com/albanD commit 6fe41e76a928ae00ad7e7dfe1036461f7b0b301f Author: Christian Puhrsch Date: Mon Oct 31 20:10:05 2022 +0000 Create separate files for NT Unary, Binary and Matmul ops (#88091) Improves code organization and code share. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88091 Approved by: https://github.com/drisspg commit 1a9edc8136e0667ce59ae5ffbbd4930110be4ff1 Author: Sean Ross-Ross Date: Mon Oct 31 10:11:14 2022 -0500 Changing from sample_inputs to reference_inputs in test_compare_cpu (#86462) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86462 Approved by: https://github.com/lezcano, https://github.com/mruberry commit 4c78c7c82af08872f901076c5daaf8148f03b096 Author: Grigory Sizov Date: Mon Oct 31 19:59:35 2022 +0000 Enable `src_mask` in fast path of `TransformerEncoderLayer ` (#87377) Fixes https://github.com/pytorch/pytorch/issues/81129#issuecomment-1179435674 Passing a 2D attention mask `src_mask` into the fast path of `TransformerEncoderLayer` in CPU was causing an error and so was disabled in https://github.com/pytorch/pytorch/pull/81277. This PR unrolls this fix, enabling `src_mask` on the fast path: - Either attention mask `src_mask` of shape `(L, L)` or padding mask `src_key_padding_mask` of shape `(B, L)` are now allowed on the CPU fast path. If softmax is applied along the last dimension (as in multi-head attention), these masks are processed without expanding them to 4D. Instead, when iterating through the input, `Softmax.cpp::host_softmax` converts the index to match the mask dimensions, depending on the type. - If softmax is applied along the dimension other than the last, `Softmax.cpp::masked_softmax_cpu` expands masks to 4D, converting them to `mask_type=2`. 
Theoretically one could also add special optimized cases for `dim=0, 1, 2` and process them without mask expansion, but I don't know how often that is used - `test_transformerencoderlayer_fast_path` is extended to cover both attention mask and padding mask - `test_masked_softmax_mask_types_0_1` is added to ensure results from CPU softmax with attention and padding masks match the explicit slow calculation - `test_masked_softmax_devices_parity` is added to ensure results from masked softmax on CPU and CUDA match I had to replace `float` with `torch.get_default_dtype()` in a couple of tests for the following reason: - `test_nn.py` [sets the default type to `torch.double`](https://github.com/pytorch/pytorch/blob/master/test/test_nn.py#L24-L26) - If I execute `test_nn.py` and `test_transformers.py` in one `pytest` run, this default still holds for transformer tests - Some tests in `test_transformers.py` which were previously following the slow path now switched to the fast path, and hard-coded `float` started clashing with the default `double` Let me know if there is a better way around it - or maybe I'm not supposed to run tests with `pytest` like this Pull Request resolved: https://github.com/pytorch/pytorch/pull/87377 Approved by: https://github.com/mikekgfb, https://github.com/weiwangmeta, https://github.com/malfet commit e9599724fa25c3c2149f301f704fe90df6b591b0 Author: PyTorch MergeBot Date: Mon Oct 31 19:55:58 2022 +0000 Revert "[ONNX] Move all torch.onnx.export related tests to test/onnx (#87292)" This reverts commit e3e84830aade59722d819bc5fa01922239494790. Reverted https://github.com/pytorch/pytorch/pull/87292 on behalf of https://github.com/weiwangmeta due to breaking internal test relating to quantization eager tests, see test/quantization/eager/test_quantize_eager_ptq.py test_lower_graph_linear and test_lower_graph_conv2d commit e9cabef6631395c3dbb8d3d82b94e108e6b87db3 Author: KevinYuk Date: Mon Oct 31 19:46:01 2022 +0000 enable xpu group norm channels last support (#87680) XPU would support the channels-last format for the group norm operator; however, PyTorch converts all input tensors to contiguous format, including channels-last tensors. We need PyTorch to pass this memory format hint down to us. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87680 Approved by: https://github.com/albanD commit 7d2f1cd2115ec333767aef8087c8ea3ba6e90ea5 Author: Kazuaki Ishizaki Date: Mon Oct 31 19:31:56 2022 +0000 Fix typos under docs directory (#88033) This PR fixes typos in `.rst` and `.Doxyfile` files under the docs directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/88033 Approved by: https://github.com/soulitzer commit c7ac3334308adebd8037f2af7b66972f42311ab5 Author: Sherlock Huang Date: Mon Oct 31 16:38:23 2022 +0000 Fix args for meta__fused_moving_avg_obs_fq_helper (#88058) Fixes https://github.com/pytorch/torchdynamo/issues/1802 There are a few problems: 1. torch.fused_moving_avg_obs_fake_quant doesn't have an OpInfo test 2. self.empty_like() is not a valid call; it should be torch.empty_like(self) 3. the python meta function has some unexplained behavior for arguments with a default value of bool type? In particular, problem 3 is the most concerning one. **UPDATE: This is expected behavior, see discussion below for explanation.** Without setting the default value for `per_row_fake_quant` and `symmetric_quant`, it gets the following error when running with meta tensor.
``` meta__fused_moving_avg_obs_fq_helper() missing 2 required positional arguments: 'per_row_fake_quant' and 'symmetric_quant' ``` I can fix this by adding the default values to these two args. However, I observe something strange when examining the actual values in the meta function. ``` print("per_row_fake_quant", per_row_fake_quant) print("symmetric_quant", symmetric_quant) ``` When the default values are False, the printed value correctly reflects the arg value populated from the call site. When the default values are True, the printed value is ALWAYS True, regardless of the populated value from the call site. When the default values are None, the printed value is `None` when the call site sets the value to 'False', and 'True' when the call site sets the value to 'True'. I also verified that this bug affects other meta functions with default args.... My speculation is that this is something about pybind value packing when calling from the c++ dispatcher into the python meta function, and default value parsing for python meta functions (and other python dispatch functions)? I tried to find the c++ call stack, but gdb is missing symbols and the C++ stacktrace is not working properly... I'd appreciate anyone who can point me to the source file for pybind value packing. cc @ezyang cc @bdhirsh. I know you had a fix in the symbolic shape branch... cc @yanboliang who reported this bug Pull Request resolved: https://github.com/pytorch/pytorch/pull/88058 Approved by: https://github.com/bdhirsh, https://github.com/yanboliang commit 3eb379052dc898a4e380045ca8fcd4f8bc75a524 Author: Peter Bell Date: Mon Oct 31 14:22:55 2022 +0000 unfold_backward: Remove stride >= size kernel in favour of copy_ (#88061) unfold_backward has a dedicated kernel for `stride >= size` which uses temporary tensors created by `at::arange` to perform the mapping from unfolded to folded. This instead uses `unfold` to view the output, and does a direct copy from the gradient into the view (see the sketch below). In benchmarks I see either no difference or a marginal speed benefit from this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88061 Approved by: https://github.com/albanD commit ceddcf5434c41ba6e96d36f9e727bde0ee191220 Author: Peter Bell Date: Mon Oct 31 14:22:54 2022 +0000 istft: Use unfold_backward instead of col2im (#88060) `unfold_backward` implements the same operation as `col2im` but without support for 2d kernels or dilation. However, `istft` doesn't use any of those features and `unfold_backward` actually has a faster `TensorIterator` based implementation so we should use it here instead. In the example from #87353 I see a 2x speedup on both CPU and CUDA.
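A rough, self-contained sketch of the view-and-copy idea behind the `unfold_backward` change above (an illustration only, not the actual ATen kernel): when `step >= size` the unfolded windows are disjoint, so the incoming gradient can be copied straight through an `unfold` view of a zero tensor.
```python
import torch

x = torch.arange(10.0, requires_grad=True)
size, step = 3, 4                      # step >= size -> non-overlapping windows
windows = x.unfold(0, size, step)      # view of shape (2, 3)
grad_out = torch.ones_like(windows)    # pretend upstream gradient

# "unfold_backward" via a view + copy_: no arange-based index mapping needed.
grad_in = torch.zeros_like(x)
grad_in.unfold(0, size, step).copy_(grad_out)

# Cross-check against autograd.
windows.backward(grad_out)
assert torch.equal(grad_in, x.grad)
```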
On a wider variety of sizes and inputs I still see speedups across the board, especially on CPU since `col2im` isn't parallelized but `unfold_backward` is:

| device | shape | hop_length | Master (us) | This PR (us) | Speedup |
|--------|-----------------|------------|-------------|--------------|---------|
| CUDA | (1, 129, 33) | 256 | 147 | 136 | 1.08 |
| | | 128 | 153 | 128 | 1.20 |
| | (100, 129, 20) | 256 | 181 | 147 | 1.23 |
| | | 128 | 171 | 137 | 1.25 |
| | (1000, 129, 10) | 256 | 681 | 443 | 1.55 |
| | | 128 | 632 | 446 | 1.42 |
| CPU | (1, 129, 33) | 256 | 106 | 104 | 1.02 |
| | | 128 | 103 | 81 | 1.27 |
| | (100, 129, 20) | 256 | 2400 | 399 | 6.02 |
| | | 128 | 2150 | 313 | 6.87 |
| | (1000, 129, 10) | 256 | 13800 | 3740 | 3.69 |
| | | 128 | 12700 | 2110 | 6.02 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88060 Approved by: https://github.com/albanD commit ff9449464484b4ca48bd7c68d8adfd31e97a4263 Author: Edward Z. Yang Date: Mon Oct 31 06:48:39 2022 -0700 Revert "Revert "Unify meta tensor and fake tensor converter conversion (#87943)"" (#88045) This reverts commit bc64999b8382796199178cf480adf51512b5f139. Check torch/_subclasses/meta_utils.py for "This is very tricky" for the bugfix explanation. cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/88045 Approved by: https://github.com/kit1980, https://github.com/Chillee commit 2e1199d171359b7fbf4d3a6f2b9fcafeaf27e39e Author: Jerry Zhang Date: Fri Oct 28 16:42:29 2022 -0700 [quant][fx] Fix a typo in utils.py (#88024) Summary: att Test Plan: python test/test_quantization.py TestQuantizeFx.test__convert_to_reference_decomposed_fx Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/88024 Approved by: https://github.com/HDCharles, https://github.com/z-a-f commit 0a4ca9d08340fdba60d1ed73a52cdeebe5ac1b7e Author: Sherlock Huang Date: Mon Oct 31 04:12:36 2022 +0000 Fix meta for aten.angle and aten.index_copy (#88066) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88066 Approved by: https://github.com/albanD commit a3f8495b848fafe3ed792eed0cbd6b0db09586aa Author: Khushi Date: Mon Oct 31 17:08:52 2022 +0000 [primTorch fix] use _maybe_convert_to_dtype (#85163) Fixes #84561 - [x] fix lint tests cc: @Lezcano!!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85163 Approved by: https://github.com/lezcano, https://github.com/mruberry commit 2702aaffc01f8ae66a4341be81778a56d203951a Author: Catherine Lee Date: Mon Oct 31 16:52:56 2022 +0000 remove old label check functionality (#88007) no longer needed as we have check_labels.py to check if the pr has labels and it blocks merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/88007 Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi commit 83f31ffdfe617d67bcf90312c7e57804de6cb87e Author: Catherine Lee Date: Mon Oct 31 16:52:28 2022 +0000 Move check labels to separate workflow (#87999) * moves check labels to separate workflow that is triggered on the usual pull_request triggers as well as labeled and unlabeled * deletes comments when label is added Fixes https://github.com/pytorch/test-infra/issues/978 and https://github.com/pytorch/pytorch/issues/87865 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87999 Approved by: https://github.com/huydhn commit 5723fd503c22388654b66cf8e8634354b0867adb Author: Sherlock Huang Date: Mon Oct 31 03:24:48 2022 +0000 Fix meta function for aten.flip and aten.rot90 (#88065) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88065 Approved by: https://github.com/mruberry commit 9308cefbdfbe74f6a7be60d0a117e12a71198d0e Author: Andrew Gu Date: Mon Oct 31 01:43:05 2022 +0000 [FSDP()][8/N] Refactor limiter's `_FreeEventQueue` (#87922) This PR is easy. It just moves `_FreeEventQueue` into its own file `_limiter_utils.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87922 Approved by: https://github.com/rohan-varma, https://github.com/mrshenli commit d89cf2fdc9d73d0f3920ab31437a24a520628b03 Author: Andrew Gu Date: Mon Oct 31 01:43:05 2022 +0000 [FSDP()][7/N] Refactor most of ctor (#87921) The goal of this PR is to make one pass over the FSDP constructor and refactor each helper method call to not be `self.<...>`. Subsequent PRs will make further passes over the FSDP constructor. This PR looks like a lot of lines of code change, but it is only reorganization. Methods are moved to `_init_utils.py` and `_common_utils.py`. This also marks the beginning of moving methods from `_utils.py` to `_common_utils.py` -- they will be coalesced eventually. I am only using `_common_utils.py` as a staging ground to include the methods that have been affected by the refactoring. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87921 Approved by: https://github.com/mrshenli commit 9d9267c6f76a9d801b6d9c69fddc61e20f0f4b48 Author: Andrew Gu Date: Sun Oct 30 15:26:12 2022 +0000 [FSDP()][3/N] Refactor public APIs (#87917) - This PR defines a new `api.py` meant to hold the public API for FSDP (minus `FullyShardedDataParallel` itself). This is needed because several of the `_<...>_utils.py` files rely on the public API, and we cannot import from `torch.distributed.fsdp.fully_sharded_data_parallel` without a circular import. Calling the file `api.py` follows the convention used by `ShardedTensor`. - This PR cleans up the wording in the `BackwardPrefetch`, `ShardingStrategy`, `MixedPrecision`, and `CPUOffload` docstrings. - This PR adds the aforementioned classes to `fsdp.rst` to have them rendered in public docs. - To abide by the public bindings contract (`test_public_bindings.py`), the aforementioned classes are removed from `fully_sharded_data_parallel.py`'s `__all__`. 
This is technically BC breaking if someone uses `from torch.distributed.fsdp.fully_sharded_data_parallel import *`; however, that does not happen in any of our own external or internal code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87917 Approved by: https://github.com/mrshenli commit 59fe272c1e698989228af5ad197bdd2985e4e9b9 Author: Aaron Gokaslan Date: Mon Oct 31 16:41:24 2022 +0000 Fix: prefer .is_none() over .is(py::none()) for pybind11 (#88051) Fixes minor perf regression I saw in #85688 and replaced throughout the code base. `obj == Py_None` is directly equivalent to is_none(). Constructing a temporary py::none() object needlessly incref/decref the refcount of py::none, this method avoids that and therefore is more efficient. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88051 Approved by: https://github.com/albanD commit 75dbe3790938c30716463604ccfa68c0f9f6a7f5 Author: vasiliy Date: Fri Oct 28 12:15:49 2022 -0700 make autocast cache global instead of thread-local (#86492) Summary: There is a memory leak because `torch.clear_autocast_cache()` clears the autocast cache from the main thread, but autograd can write to this cache from a background thread, so whatever autograd writes will leak. With some offline discussion we decided that a global cache is a practical way to deal with this, and the performance impact of the lock should be negligible. Test Plan: I don't have a local repro of the original issue, need to look into how to get that. A toy example (https://gist.github.com/vkuzo/0d6318fe7f7cb1c505e370cd5c1a643b) does cache clearing as expected on forward and backward pass. local testing: ``` python test/test_cuda.py -k autocast python test/test_autocast.py ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86492 Approved by: https://github.com/ezyang commit 34f523b22158ca4a4a7974ec867084bab98bde83 Author: Andrew Gu Date: Sat Oct 29 21:14:50 2022 +0000 [FSDP] Enable `use_orig_params=True` test (#88034) I accidentally committed the `use_orig_params` PR with this test disabled. This PR simply re-enables it. It passes locally, so if CI is green, then this is an easy land. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88034 Approved by: https://github.com/H-Huang commit df1cc0ef473893ffde87513b5be69cc3e2306561 Author: Salil Desai Date: Sun Oct 30 20:30:57 2022 -0700 [Vulkan] Add Vulkan Rewrite to Transfer Inputs and Outputs to Vulkan and CPU Backends Respectively (#87432) With this change, we don't have to manually invoke transferring input and output backends when we run vulkan models. Graph rewrite code based off of: - https://github.com/pytorch/pytorch/commit/32efff45ba77f2bb4b1e709613b99070f119745a#diff-a473bddb458dc24225866a45092d6eca064eddd256245d93020e48e216eee4d5R160-R179 Differential Revision: [D39519168](https://our.internmc.facebook.com/intern/diff/D39519168/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39519168/)! 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87432 Approved by: https://github.com/mcr229, https://github.com/digantdesai commit bc6862515164a31d3a62e46a49977d54a618323c Author: Salil Desai Date: Sun Oct 30 20:30:55 2022 -0700 [Vulkan] Add support for Optimization Blocklist to Vulkan Rewrite (#87431) Optimization Blocklist will be used in a future diff (D40315730) to make the rewrite to transfer input/output backends optional Differential Revision: [D40315729](https://our.internmc.facebook.com/intern/diff/D40315729/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87431 Approved by: https://github.com/mcr229, https://github.com/digantdesai commit f717986f93f5e167a530867061cfa40d49c14316 Author: Edward Z. Yang Date: Mon Oct 31 09:20:49 2022 -0400 .gitignore log files (#88085) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88085 Approved by: https://github.com/albanD commit 8ea19c802e38c061e79176360c1ecaa81ce2088a Author: Edward Z. Yang Date: Sat Oct 29 21:43:09 2022 -0400 Make IValue::unsafeToTensorImpl a little less unsafe. (#88043) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88043 Approved by: https://github.com/anjali411, https://github.com/albanD commit e238752e20ae637c88e8534482f83a5074a82d43 Author: Edward Z. Yang Date: Sat Oct 29 17:25:42 2022 -0700 Simplify magic method definition code. (#88017) It turns out sym_float (and the hypothetical sym_int) can be defined in the same way as conventional magic methods. Do so. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88017 Approved by: https://github.com/albanD commit 2a47b107801569f7b21994d199d7b2fc6f8a25e7 Author: Edward Z. Yang Date: Sat Oct 29 08:45:32 2022 -0700 Get the magic method try reverse protocol correct (#88030) Signed-off-by: Edward Z. Yang cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/88030 Approved by: https://github.com/anjali411, https://github.com/albanD commit 12dd877395a47d4de382b06fda9623da37782226 Author: Horace He Date: Sat Oct 29 00:59:57 2022 +0000 Fix all references to torchdynamo from the merge (#87731) cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel Pull Request resolved: https://github.com/pytorch/pytorch/pull/87731 Approved by: https://github.com/yanboliang, https://github.com/ezyang, https://github.com/anijain2305, https://github.com/jansel commit 496acb6602644fee4db7c19df700f6224ce07f84 Author: Edward Z. Yang Date: Sun Oct 30 13:24:50 2022 -0400 Add fake tensor files to ciflow/inductor (#88052) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88052 Approved by: https://github.com/anijain2305 commit 6735bf21c70a5d0873036bc252e8a6873cb35291 Author: Kshiteej K Date: Mon Oct 31 04:42:45 2022 +0000 [test_nn] split convolution tests from test_nn (#87474) Ref #63085 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87474 Approved by: https://github.com/albanD commit 46ce92713dff83182f36b9f4d2a112f9e568825f Author: Jing Xu Date: Mon Oct 31 04:40:52 2022 +0000 fix github bug issue 87552 (#88059) Fixes #87552 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88059 Approved by: https://github.com/jgong5, https://github.com/ngimel commit e24ce484ed52f6441db159ef0479ff06c72f2efd Author: Driss Guessous Date: Mon Oct 31 04:06:31 2022 +0000 Use scaled_dot_product_attention within attention.cpp (#87312) Use the private _scaled_dot_product_attention to support _native_multiheaded_attention. _SDP provides access to fused kernels when certain conditions are meant enabling a speed up for MHA. cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki Pull Request resolved: https://github.com/pytorch/pytorch/pull/87312 Approved by: https://github.com/cpuhrsch commit d13f1e6ab4d20451f7e2acd87571ffa7fece0c32 Author: Fuzzkatt Date: Mon Oct 31 03:56:55 2022 +0000 Add sequence number support for UCC (#85047) Add sequence number support for UCC, mostly following format of ProcressGroupNCCL. Pass new test: `test_all_gather_object_subgroup` Add skips for gather tests: `test_gather_object` and `test_gather_object_subgroup` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu Pull Request resolved: https://github.com/pytorch/pytorch/pull/85047 Approved by: https://github.com/kwen2501 commit 9642a7c2f6f4b21c44bbc5709b9af396df4053dc Author: HAOCHENYE <21724054@zju.edu.cn> Date: Mon Oct 31 03:00:30 2022 +0000 [ONNX] Fix get wrong summary of the docstring in `torch.onnx._deprecation.deprecated` (#87194) The summary of the deprecated function could be multi-line. Therefore the code below: https://github.com/pytorch/pytorch/blob/9ac2a06acf75538a35751f785d5f509d6127d6cd/torch/onnx/_deprecation.py#L45 should be adjusted to ```python summary_and_body = docstring.split("\n\n", 1) ``` Otherwise, the multi-line summary will be separated wrongly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87194 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit d67b2edec34ef27956de0b2ebb5d7e50dbba9de3 Author: Animesh Jain Date: Mon Oct 31 02:30:29 2022 +0000 [dynamo][dashboard] minor fixes for a clean Dashboard (#88056) * better check for cold start latency * sort on inductor column for better readability. cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/88056 Approved by: https://github.com/ngimel commit 9109ecf9142064367566cf540fd2803a09318652 Author: Mengchi Zhang Date: Sun Oct 30 18:22:17 2022 +0000 Even "nvcc not found" should be commented out (#87959) Summary: Even "nvcc not found" should be commented out in minifier_launcher.py, cause there could be a case that PyTorch/minifier can find cuda path but nvcc is not explicitly included in env variable like PATH. 
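For illustration only, one way a launcher script could probe for `nvcc` without assuming it is on `PATH` (a hedged sketch; `nvcc_available` is a hypothetical helper, not the actual `minifier_launcher.py` code):
```python
import shutil
import subprocess

def nvcc_available() -> bool:
    """Return True only if an nvcc binary can be located and executed."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return False
    try:
        subprocess.run([nvcc, "--version"], check=True, capture_output=True)
    except (OSError, subprocess.CalledProcessError):
        return False
    return True

# A minifier script could then skip (or comment out) its nvcc-dependent
# environment dump when nvcc_available() is False instead of failing outright.
```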
Differential Revision: D40790023 cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/87959 Approved by: https://github.com/anijain2305, https://github.com/jianyuh commit 1b575782a0c307aae264714e3244afcf50bb365c Author: Animesh Jain Date: Sun Oct 30 17:10:17 2022 +0000 [dynamo][benchmarks] use fresh inductor cache and raise batch size wherever possible (#88044) cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/88044 Approved by: https://github.com/ngimel commit e7b854fae9ff8116eaf4aeb24e04cac550bed362 Author: Nikita Shulga Date: Sun Oct 30 04:31:45 2022 +0000 [BE] Do not package caffe2 in wheel (#87986) If PyTorch is built without caffe2 integration, do not package the unusable .py files/headers. The same is true about functorch - don't package it unless building with `functorch` (although, I wonder if we should remove this option at some point in the future) Followup after https://github.com/pytorch/builder/pull/1181 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87986 Approved by: https://github.com/seemethere commit 65e771959962156f434fad9b4fbe0c719813ab63 Author: PyTorch MergeBot Date: Sun Oct 30 03:02:55 2022 +0000 [vision hash update] update the pinned vision hash (#87948) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87948 Approved by: https://github.com/pytorchbot commit 621158cd7f3e1321e77d3312c39c258ad1f68d28 Author: Nikita Shulga Date: Sun Oct 30 01:04:55 2022 +0000 [BE] Do not assign string literal to `char *` (#87949) Not sure what I was thinking when writing something like: ``` auto foo = std::getenv("BAR"); if (!foo) { foo = "baz"; } ``` as `std::getenv` returns `char *` (i.e. a mutable string), but string literals are immutable (i.e. `const char *`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/87949 Approved by: https://github.com/kit1980 commit 59001d05b406bb00d5838f04ca972180e1a4946e Author: Yanbo Liang Date: Sat Oct 29 20:36:20 2022 +0000 [Inductor] Enable Inductor unspec inputs test for different dtypes (#87809) Fixes #ISSUE_NUMBER cc @jansel @mlazos @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/87809 Approved by: https://github.com/ngimel commit bc64999b8382796199178cf480adf51512b5f139 Author: PyTorch MergeBot Date: Sat Oct 29 18:39:28 2022 +0000 Revert "Unify meta tensor and fake tensor converter conversion (#87943)" This reverts commit baa715e790921e6498861e59556035de1a481cc5. Reverted https://github.com/pytorch/pytorch/pull/87943 on behalf of https://github.com/kit1980 due to Broke several inductor tests commit e4a8661ab84022c1bff622c6d2f6e679180b1df5 Author: Shunting Zhang Date: Sat Oct 29 17:52:26 2022 +0000 torchdynamo and xla integration (#87741) - torchdynamo and torchxla use different strategies to be a sound graph capture technique.
The former relies on guards; the latter relies on retracing - the guard system has quite low overhead, but torchxla tracing overhead is quite high. The main idea is to leverage the guard system in torchdynamo to avoid retracing in torchxla so that - we can integrate torchdynamo with XLA - we reduce or even completely avoid the tracing overhead of torchxla We found that different frameworks do not generate numerically identical results for the SAME model with the SAME input. By default, torchdynamo uses eager as baseline so the model will run with PyTorch. It would be tricky to compare a model running on XLA with this baseline: it's hard to check correctness. To make the comparison easier, we add a flag `--use-xla-baseline`. When it's enabled, the baseline will be run on XLA. We add 2 new dynamo backends, torchxla_trivial and torchxla_trace_once, to control the optimization targets. torchxla_trivial simply moves inputs/model parameters to XLA and runs the model on XLA. There is tracing overhead for each run. We should expect that result to be mostly neutral compared to the XLA baseline. torchxla_trace_once only traces once during AOT compiling time. Here are the steps: 1. dynamo captures guards and the subgraph 2. the torchxla_trace_once backend traces the graph with torchxla, lowers the graph, and records a hash of the graph for later lookup 3. at inference time, the hash is used directly to look up the optimized graph and run it. We cannot handle LTC/torchxla fallback right now. If an op is missing an LTC kernel, we raise an exception, which results in a dynamo fallback (or trying another compiler). People have brainstormed the idea of graph breaking and stitching the subgraphs together. But maybe it's easier to add those missing LTC kernels for those models. The models we tested are those not causing LTC fallback. We run the tests on **GPU**. We see **1.38x** geomean speedup for torchxla_trace_once, and torchxla_trivial is mostly neutral as expected.
```
| Model | XLA (trace once) | XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18 | 1.346 | 1.045 |
+-------------------------+--------------------+-------------------------+
| resnet50 | 1.153 | 1.007 |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d | 1.381 | 1.039 |
+-------------------------+--------------------+-------------------------+
| alexnet | 1.045 | 1.018 |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2 | 1.562 | 1.021 |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0 | 1.303 | 1.069 |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1 | 1.278 | 1.025 |
+-------------------------+--------------------+-------------------------+
| vgg16 | 1.076 | 1.008 |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch | 2.224 | 0.978 |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer | 1.81 | 1.025 |
+-------------------------+--------------------+-------------------------+
| geomean | 1.38101 | 1.02324 |
+-------------------------+--------------------+-------------------------+
```
The speedup is similar to what we see from previous work for LTC's TorchScript backend (we saw a 1.40x geomean speedup there): https://docs.google.com/presentation/d/1G09X8v41u_cLKLtSdf7v6R8G19-iZTPcW_VAdOnvYBI/edit#slide=id.g11bf989cb6b_1_5
- Use AOT autograd to enable training
- Share results on XLA devices
- Do more extensive tests on torchbench models
Example command: ``` GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --use-xla-baseline --only resnet18 --backend=torchxla_trace_once ``` Thanks to @JackCaoG from the torchxla team for helping debug various perf issues and merging the torchxla PR! That was critical for getting the results above.
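For reference, a sketch of how the two new backends would be selected from user code (assumptions: a working torch_xla install, the backend names registered by this PR, and the `torch._dynamo.optimize` entry point of that era; exact spellings may differ between versions):
```python
import torch
import torch._dynamo as dynamo
import torchvision

model = torchvision.models.resnet18().eval()
example = torch.randn(1, 3, 224, 224)

def run(x):
    return model(x)

# torchxla_trivial: move the model/inputs to XLA and retrace on every call.
run_trivial = dynamo.optimize("torchxla_trivial")(run)

# torchxla_trace_once: trace and lower once at compile time, then look the
# compiled graph up by its hash on later calls (no per-call tracing overhead).
run_trace_once = dynamo.optimize("torchxla_trace_once")(run)

with torch.no_grad():
    out = run_trace_once(example)
```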
torchxla side PR: https://github.com/pytorch/xla/pull/4119 topic: not user facing cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel Pull Request resolved: https://github.com/pytorch/pytorch/pull/87741 Approved by: https://github.com/wconstab commit 6cd25eb6de41ac05affa069e0d607ae8cdd54d6b Author: Richard Barnes Date: Sat Oct 29 17:48:23 2022 +0000 Use TORCH_CHECK instead of inappropriate CUDA_KERNEL_ASSERT (#87714) `CUDA_KERNEL_ASSERT` should only be used inside kernels; switch these bad usages to `TORCH_CHECK` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87714 Approved by: https://github.com/ezyang commit 384b84d6a601e6e7b9dab1f68e3498ba6d84e950 Author: Huy Do Date: Sat Oct 29 17:40:07 2022 +0000 [BE] Upload GHA artifacts to S3 (#87827) This is exclusively used by macOS, ROCM (and any other future workflows) that don't have direct access to S3 to upload their artifacts Running the script locally with the personal GITHUB_TOKEN: ``` python3 -m tools.stats.upload_artifacts --workflow-run-id 3342375847 --workflow-run-attempt 1 --repo pytorch/pytorch Using temporary directory: /var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb Downloading sccache-stats-macos-12-py3-arm64-runattempt1-9155493770 Downloading sccache-stats-macos-12-py3-lite-interpreter-x86-64-runattempt1-9155493303 Downloading sccache-stats-macos-12-py3-x86-64-runattempt1-9155493627 Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/sccache-stats-macos-12-py3-arm64-runattempt1-9155493770 to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/sccache-stats-macos-12-py3-arm64-9155493770 Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/sccache-stats-macos-12-py3-lite-interpreter-x86-64-runattempt1-9155493303 to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/sccache-stats-macos-12-py3-lite-interpreter-x86-64-9155493303 Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/sccache-stats-macos-12-py3-x86-64-runattempt1-9155493627 to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/sccache-stats-macos-12-py3-x86-64-9155493627 Downloading test-jsons-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip Downloading test-jsons-runattempt1-test-default-1-2-macos-12_9155944815.zip Downloading test-jsons-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip Downloading test-jsons-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip Downloading test-jsons-runattempt1-test-default-2-2-macos-12_9155944892.zip Downloading test-jsons-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-1-2-linux.rocm.gpu_9155913429.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-1-2-macos-12_9155944815.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-1-2-macos-12_9155944815.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-1-2-macos-m1-12_9155888061.zip Upload 
/private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-2-2-linux.rocm.gpu_9155913500.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-2-2-macos-12_9155944892.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-2-2-macos-12_9155944892.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-2-2-macos-m1-12_9155888182.zip Downloading test-reports-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip Downloading test-reports-runattempt1-test-default-1-2-macos-12_9155944815.zip Downloading test-reports-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip Downloading test-reports-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip Downloading test-reports-runattempt1-test-default-2-2-macos-12_9155944892.zip Downloading test-reports-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-1-2-linux.rocm.gpu_9155913429.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-1-2-macos-12_9155944815.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-1-2-macos-12_9155944815.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-1-2-macos-m1-12_9155888061.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-2-2-linux.rocm.gpu_9155913500.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-2-2-macos-12_9155944892.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-2-2-macos-12_9155944892.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-2-2-macos-m1-12_9155888182.zip Downloading usage-log-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip Downloading usage-log-runattempt1-test-default-1-2-macos-12_9155944815.zip Downloading usage-log-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip Downloading usage-log-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip Downloading usage-log-runattempt1-test-default-2-2-macos-12_9155944892.zip Downloading usage-log-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-1-2-linux.rocm.gpu_9155913429.zip 
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-1-2-macos-12_9155944815.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-1-2-macos-12_9155944815.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-1-2-macos-m1-12_9155888061.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-2-2-linux.rocm.gpu_9155913500.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-2-2-macos-12_9155944892.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-2-2-macos-12_9155944892.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-2-2-macos-m1-12_9155888182.zip ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87827 Approved by: https://github.com/clee2000 commit d9b6e41da9a24ad35b043cd79b581508c8c6304b Author: Shen Li Date: Sat Oct 29 04:07:56 2022 +0000 Add composable activation checkpointing (#87664) This is a composable activation checkpointing API. Unlike functional activation checkpointing APIs, this one does not require changing model source code. Unlike ``nn.Module`` wrapper activation checkpointing APIs, this one does not modify model structure or fully-qualified names either. Under the hood, it registers activation checkpointing logic as pre- and post-forward hooks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87664 Approved by: https://github.com/zhaojuanmao commit 19171a21ee8a9cc1a811ac46d3abd975f0b6fc3b Author: Sergey Lebedev Date: Sat Oct 29 16:33:18 2022 +0000 Make barrier blocking in UCC (#86961) Currently CUDA UCC barrier is nonblocking with respect to CPU and there is no flag to change it. To make UCC PG barrier behaviour consistent with NCCL PG in this PR barrier has changed to be always blocking. cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu Pull Request resolved: https://github.com/pytorch/pytorch/pull/86961 Approved by: https://github.com/kwen2501 commit baa715e790921e6498861e59556035de1a481cc5 Author: Edward Z. Yang Date: Fri Oct 28 13:28:39 2022 -0700 Unify meta tensor and fake tensor converter conversion (#87943) Meta tensor does a lot of work to make sure tensors "look" similar to the original parts; e.g., if the original was a non-leaf, meta converter ensures the meta tensor is a non-leaf too. Fake tensor destroyed some of these properties when it wraps it in a FakeTensor. This patch pushes the FakeTensor constructor into the meta converter itself, so that we first create a fake tensor, and then we do various convertibility bits to it to make it look right. The two tricky bits: - We need to have no_dispatch enabled when we allocate the initial meta tensor, or fake tensor gets mad at us for making a meta fake tensor. 
This necessitates the double-callback structure of the callback arguments: the meta construction happens *inside* the function so it is covered by no_dispatch - I can't store tensors for the storages anymore, as that will result in a leak. But we have untyped storage now, so I just store untyped storages instead. Signed-off-by: Edward Z. Yang cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/87943 Approved by: https://github.com/eellison, https://github.com/albanD commit 4210cebc166dd355a315034b2a5aecdffacf5f91 Author: AllenTiTaiWang Date: Fri Oct 28 19:54:52 2022 +0000 [ONNX] Add internal node kind parsing (#87638) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87638 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit cb05a4da3916469ec511c042b95e447ca395e8d7 Author: AllenTiTaiWang Date: Fri Oct 28 19:31:23 2022 +0000 [ONNX] Parametrized Avgpool2D test to have all test combinations (#87893) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87893 Approved by: https://github.com/BowenBao commit f2ae459311607b341779590e2e985c6b7c895f1d Author: AllenTiTaiWang Date: Fri Oct 28 19:31:23 2022 +0000 [ONNX] Disable ONNX ceil_mode and count_include_pad to aligntorch ceil_mode results in corner case (#87892) ONNX and PyTorch has different equation on pooling and different strategy on ceil_mode, which leads to discrepancy on corner case (#71549 ). Specifically, PyTorch avereage pooling is not following [the equation on documentation](https://pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html), it allows sliding window to go off-bound instead, if they start within the left padding or the input (in NOTE section). More details can be found in #57178. This PR changes avgpool in opset 10 and 11 back the way as opset 9, which it stops using ceil_mode and count_include_pad in onnx::AveragePool A comprehensive test for all combinations of parameters can be found in the next PR. #87893 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87892 Approved by: https://github.com/BowenBao commit c810489dd9549da64ef3610e150c9589f7217759 Author: Huy Do Date: Sat Oct 29 08:43:45 2022 +0000 Cleanup macos common conda installation (#87816) The conda dependencies have all been installed for `_mac-test` in https://github.com/pytorch/pytorch/pull/87541. I missed the same step for `_mac-build` and `_mac-test-mps` workflows, so both are also updated here. Note that arm64 is cross-compiled from x86, so the env file needs to be set explicitly in that case After this one, I have a WIP PR to consolidate macos pip dependencies next Pull Request resolved: https://github.com/pytorch/pytorch/pull/87816 Approved by: https://github.com/ZainRizvi commit 53fea905477e64960002def848a7d897d8ae52a4 Author: Huy Do Date: Sat Oct 29 08:34:13 2022 +0000 Store usage log on GitHub when S3 is not available (#87947) It turns out that we haven't uploaded the usage log to GitHub when S3 is not available (macos, rocm), for example, https://github.com/pytorch/pytorch/actions/runs/3325822440#artifacts only includes test-report, test-json, sccache stats, and build artifacts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87947 Approved by: https://github.com/clee2000 commit d3c01c722d95d9b386fa47078563687d2bffbdad Author: Edward Z. 
Yang Date: Fri Oct 28 17:20:10 2022 -0400 Fix pybind11 problems with c10::SymInt unregistered (#88011) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/88011 Approved by: https://github.com/weiwangmeta, https://github.com/albanD commit e667c0065657c29f42e592a4dcd810801cb83457 Author: Andrew Gu Date: Fri Oct 28 18:15:57 2022 +0000 [FSDP()][2/N] Refactor training state (#87916) This PR actually has meaningful changes. We stratify `TrainingState` into two levels: one is per FSDP instance and one is per `FlatParamHandle`/`FlatParameter`. - At the FSDP instance level, we only care about `IDLE`, FSDP computation (i.e. `FORWARD_BACKWARD`), or `SUMMON_FULL_PARAMS`. These dynamically modify behavior (e.g. `summon_full_params()` forces full precision). - At the `FlatParamHandle` level, we care about the training state for invariants and debugging. Hence, we keep `IDLE`, `FORWARD`, `BACKWARD_PRE`, `BACKWARD_POST`, and `SUMMON_FULL_PARAMS`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87916 Approved by: https://github.com/mrshenli commit cbc9faebfe962286ec8dd9cf8a5854613693f78a Author: Andrew Gu Date: Fri Oct 28 04:17:33 2022 +0000 [FSDP()][1/N] Start refactoring FSDP root pre-forward (#87915) Welcome! This PR starts the refactoring journey. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87915 Approved by: https://github.com/mrshenli commit edd6cf9996ce93ce11efe818a1e1f31a08920018 Author: PyTorch MergeBot Date: Sat Oct 29 06:48:12 2022 +0000 Revert "[ONNX] Deprecate operators.py (#87798)" This reverts commit 88eff1072290177221e7a09d792f7f135b4c83ca. Reverted https://github.com/pytorch/pytorch/pull/87798 on behalf of https://github.com/weiwangmeta due to breaking internal builds see D40797126 commit e3e84830aade59722d819bc5fa01922239494790 Author: AllenTiTaiWang Date: Fri Oct 28 19:54:52 2022 +0000 [ONNX] Move all torch.onnx.export related tests to test/onnx (#87292) Moving torch.onnx.export related tests to test/onnx integrates ONNX tests to the same CI machine, so the testing environment can be better managed. Fixes https://github.com/pytorch/pytorch/issues/87320 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87292 Approved by: https://github.com/thiagocrepaldi, https://github.com/BowenBao, https://github.com/kit1980 commit 1dad051b05f896a5958e33423ccd3baa10ad1072 Author: Loren Arthur Date: Sat Oct 29 04:52:01 2022 +0000 Move workspace related functions to separate file (#87651) Move workspace related functions to separate file Test Plan: Existing tests Differential Revision: D40657708 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87651 Approved by: https://github.com/malfet commit 0cf572ff6c7522fa89ad4816bed3c5667e7106ee Author: Iris Zhang Date: Sat Oct 29 04:38:34 2022 +0000 [C10D][BE] Add exception handlers to c10d collectives function (#87643) (#87988) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/87643 1. Add a decorator function exception_handlers to c10d collectives. 2. Update test(torch/distributed/distributed_c10d.py) to include mp tests for exception_handler. ``` python3 test/distributed/test_c10d_error_logger.py ``` Test Plan: Test in OSS. 
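A minimal sketch of the decorator idea described in point 1 (illustrative only; the real `exception_handlers` wrapper in `distributed_c10d.py` will differ in what it logs and how it is applied):
```python
import functools
import logging

logger = logging.getLogger(__name__)

def exception_handler(collective):
    """Log and re-raise any error escaping a collective call."""
    @functools.wraps(collective)
    def wrapper(*args, **kwargs):
        try:
            return collective(*args, **kwargs)
        except Exception:
            logger.exception("c10d collective %s failed", collective.__name__)
            raise
    return wrapper

# Usage sketch (hypothetical): wrap an existing collective.
# all_reduce = exception_handler(torch.distributed.all_reduce)
```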
Reviewed By: H-Huang Differential Revision: D40281632 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87988 Approved by: https://github.com/H-Huang commit 20e16c013fafe0f76d565434632b744300af05ea Author: Tovly Deutsch Date: Sat Oct 29 04:20:56 2022 +0000 Allow caffe2 to build with fbcode/mode/mac (#87293) Summary: The Mac contbuild builds under the `fbcode/mode/mac` which caffe2 fails to build under. This is due to that build mode enforcing protobuf v3. The caffe2 targets already account for this issue under `arvr` build modes by swapping out protobuf dependencies. They don't account for the same issue under `fbcode/mode/mac`. This diff fixes that by checking for `is_fbcode_mac` in these situations (in addition to `arvr`). Test Plan: ``` buck build --flagfile fbsource//fbcode/mode/mac fbsource//xplat/caffe2/... ``` Reviewed By: kimishpatel Differential Revision: D39552724 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87293 Approved by: https://github.com/kimishpatel commit 98354130096041224e9764a2f976d2d015d958ee Author: Elias Ellison Date: Fri Oct 28 12:33:37 2022 -0700 Fake Tensor For (Conv) Propagation (#87641) Resubmitting https://github.com/pytorch/pytorch/pull/87302 so it can be ghstack'd with the pr below. Incorrect strides in any meta impl would lead to runtime assertion errors for fallback kernels, so start by just enabling it for conv. Replaces https://github.com/pytorch/pytorch/pull/87588. cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87641 Approved by: https://github.com/jansel commit 14d5f139d205f924eb7ddd3e61215971bd194855 Author: Kazuaki Ishizaki Date: Sat Oct 29 01:26:15 2022 +0000 Fix typos under benchmarks, test, and tools directories (#87975) This PR fixes typos in `.md` files under benchmarks, test, and tools directories Pull Request resolved: https://github.com/pytorch/pytorch/pull/87975 Approved by: https://github.com/kit1980 commit 18f3db2963f3d0ac6b5eca0543cd51bbcd8e0428 Author: Richard Zou Date: Sat Oct 29 01:21:55 2022 +0000 Fix functorch tests (#87914) Test Plan: - Run tests Differential Revision: D40777145 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87914 Approved by: https://github.com/Chillee, https://github.com/osalpekar commit af0c339f00094c4c2f3c260b55e04e0e3654776a Author: Sergii Dymchenko Date: Sat Oct 29 00:23:47 2022 +0000 Disable slow-gradcheck tests (#88008) Disable because slow-gradcheck tests take > 4 hrs and time out. Will need to figure out if and how to re-enable later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88008 Approved by: https://github.com/seemethere, https://github.com/huydhn commit 785054d3a9d0cf3d528511c42d81a9f09e36f1c6 Author: Nikita Shulga Date: Fri Oct 28 23:59:47 2022 +0000 [CI] Report build errors in Windows build step (#88001) Should make failures like https://github.com/pytorch/pytorch/actions/runs/3346715682/jobs/5543900889 much more debuggable P.S. I don't know how to write batch, just hope its going to work Pull Request resolved: https://github.com/pytorch/pytorch/pull/88001 Approved by: https://github.com/seemethere commit 1eba3f220e04e347d0fd869b2118ddb7a49308d5 Author: Daniil Kutz Date: Fri Oct 28 23:51:53 2022 +0000 Fix bugs found by static analysis (#85705) These PR fixes a number of bugs found by Svace static analyzer: 1. 
DEREF_AFTER_FREE at qnnpack_utils.h: Pointer '&convolution->zero_buffer' is dereferenced at qnnpack_utils.h:258 after the referenced memory was deallocated at operator-delete.c:25 by passing as 1st parameter to function 'pytorch_qnnp_delete_operator' at qnnpack_utils.h:251.
2. DEREF_AFTER_NULL at impl.cpp: After having been compared to NULL value at impl.cpp:1892, pointer 'schema' is passed as 2nd parameter in call to function 'c10::operator<<' at impl.cpp:1921, where it is dereferenced at function_schema_inl.h:13.
3. DEREF_OF_NULL at stmt.h: After having been compared to NULL value at stmt.h:744, pointer 'body->_M_ptr' is passed in call to function 'torch::jit::tensorexpr::malformed_input::malformed_input' at stmt.h:745, where it is dereferenced at exceptions.h:67.
4. DEREF_OF_NULL at loopnest.h: Pointer 'f->ptr' that can have only NULL value (checked at loopnest.cpp:1482), is passed in call to function 'torch::jit::tensorexpr::malformed_input::malformed_input' at loopnest.cpp:1483, where it is dereferenced at exceptions.h:67. This is the same error as 3: forwarding a nullptr to malformed_input().
5. TAINTED_INT.LOOP in python_arg_parser: Integer value 'this->size' obtained from untrusted source at python_arg_parser.cpp:118 without checking its bounds is used as a loop bound at python_arg_parser.cpp:698 by calling function 'torch::FunctionParameter::set_default_str' at python_arg_parser.cpp:133.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85705 Approved by: https://github.com/kit1980 commit 376acf7625d741d031e1d2a147f8d68626e21d82 Author: BowenBao Date: Fri Oct 28 23:51:42 2022 +0000 Add 'share_from_this' to 'torch::jit::Graph' (#87343) Avoid passing a raw pointer of 'torch::jit::Graph' to python. Otherwise, it will corrupt the `internals::registered_instance` of pybind11, caching a holder for python w.r.t. the raw pointer of 'torch::jit::Graph', while not increasing the use count of the existing shared_ptr. The behavior afterwards is random and probably undefined. Most of the time it works, if the holder is deallocated in time on the python side and the cache is then cleared from `internals::registered_instance`; things are back to normal. Otherwise, it fails with either a segfault or a runtime error with the message "Unable to cast from non-held to held instance". One such scenario is normally and correctly returning a shared_ptr of that 'torch::jit::Graph' to python. Pybind finds the holder via the cache. Due to this, the shared_ptr use_count will not increase. If there is no other use on the C++ side, the graph will be freed, while python still has access via the holder created previously. @t-vi had a great analysis and solution to this exact problem at #51833, which I wish I had seen before debugging this issue... ~~I'm building the PR based on the original commit. @t-vi please let me know if you'd prefer otherwise.~~ Sending the PR separately due to CLA issues. Need to check in CI if adding `enable_shared_from_this` breaks other stuff. Fixes #51833, and CI issues in #87258, #86182. cc @malfet, @kit1980 for changes on JIT IR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87343 Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/malfet commit ecf277abeca4d0b09c2587ca6d9a37be602e889a Author: Jerry Zhang Date: Thu Oct 27 10:49:55 2022 -0700 [quant][improvement] Check the fixedqparam op qconfig based on backend_config (#87425) Summary: Previously we hardcoded the supported observers for fixedqparam ops, this PR changes that to take the information from BackendConfig, this allows users to customize the support for fixed qparam ops Test Plan: python test/test_quantization.py TestQuantizeFx.test_change_backend_config_for_fixed_qparam_ops Reviewers: Subscribers: Tasks: Tags: unlinked from diff since it's too hard to land Pull Request resolved: https://github.com/pytorch/pytorch/pull/87425 Approved by: https://github.com/andrewor14 commit c3c817c972b50066bec6ea14176b931039d8fbd6 Author: Eli Uriegas <1700823+seemethere@users.noreply.github.com> Date: Fri Oct 28 15:12:31 2022 -0700 Revert "ci: Switch merge / revert flow to our own infra" (#88016) commit a2ffc3be971aec9245e4beee0a65ecc73e71f870 Author: Andrew Gu Date: Fri Oct 28 02:02:25 2022 +0000 [AC] Add trailing "." to `_CHECKPOINT_PREFIX` like FSDP (#87951) This is for consistency with FSDP. - `_FSDP_WRAPPED_MODULE` and `_CHECKPOINT_WRAPPED_MODULE` are exactly the wrapped module variable name, meaning you can call `getattr(module, _FSDP_WRAPPED_MODULE)` or `getattr(module, _CHECKPOINT_WRAPPED_MODULE)`. - `_FSDP_PREFIX` and `_CHECKPOINT_PREFIX` include the trailing `"."` and are only used for FQNs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87951 Approved by: https://github.com/zhaojuanmao commit 4faf086e5f2c7743b45bcefa7f951f8faaa0e94d Author: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Date: Fri Oct 28 22:05:11 2022 +0000 Update build scripts for ninja and ROCm5.3 install (#87505) cc @jeffdaily @sunway513 @ROCmSupport Pull Request resolved: https://github.com/pytorch/pytorch/pull/87505 Approved by: https://github.com/seemethere commit 349ad23ffbcee88f2f0d590da1f8cf577d3a7627 Author: Eli Uriegas <1700823+seemethere@users.noreply.github.com> Date: Fri Oct 28 14:37:55 2022 -0700 ci: Switch merge / revert flow to our own infra (#88009) commit 9691ba2dbd8c1f6967d0d97a3679104368b329ed Author: Michael Lazos Date: Fri Oct 28 21:33:53 2022 +0000 Remove excess exception logging for minifier, cleanup backend failure exception format (#87537) Fixes https://github.com/pytorch/torchdynamo/issues/1376 Ensures exceptions are printed only in one place, once. implements some of the ideas from https://github.com/pytorch/torchdynamo/issues/1754 - Attaches a field to the exception which indicates that it's minified, a usage message is printed if this field is present cc @jansel @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87537 Approved by: https://github.com/anijain2305 commit 1c37119a1f735cac0cda0064dca7c69b658216aa Author: Andrew Gu Date: Fri Oct 28 02:02:25 2022 +0000 [FSDP] New fix for composing with other module wrappers (#87950) We change `.module` to pass through `ActivationWrapper` directly to the inner wrapped module. This should fix the state dict issues. Given the invariant that `.module` always returns the inner wrapped module, FSDP always registers the `FlatParameter` on the inner wrapped module, regardless of if there is an intermediate `ActivationWrapper` or not. 
This avoids casing on whether `ActivationWrapper` is added before or after FSDP construction. This PR removes the added unit test in `test_fsdp_misc.py` for changing the wrapped module because I would rather not complicated `_lazy_init()` logic just to support that kind of adversarial behavior. The user should not be swapping out the wrapped module arbitrarily or deleting the `FlatParameter`. I mainly had those tests to make sure that all branches of the code I added was correct. Differential Revision: [D40799961](https://our.internmc.facebook.com/intern/diff/D40799961) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87950 Approved by: https://github.com/zhaojuanmao commit c2c269c10aa7469b023894c9d5428316a4d36221 Author: Edward Z. Yang Date: Fri Oct 28 07:07:44 2022 -0700 Convert MetaConverter's tensor memo into a weak value dictionary. (#87911) This is in preparation for unifying fake tensor converter and meta converter's memo tables. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87911 Approved by: https://github.com/eellison commit e72962a34dc3d6d8e52f1d7b76e982e05885fdaa Author: Edward Z. Yang Date: Fri Oct 28 07:07:44 2022 -0700 Force people to call from_meta_and_device directly (#87903) It was pretty hard to tell at call site if I was doing device meta convert or not. This gets rid of the "dual" API and forces people to call the method manually for the device case. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87903 Approved by: https://github.com/eellison, https://github.com/albanD commit ab8fbd26f8a5c7d5fdb8527536dbf2aa613ce722 Author: Andrey Talman Date: Fri Oct 28 19:55:31 2022 +0000 Advance nightly docker to 11.6 (#87858) Fixes following: https://github.com/pytorch/pytorch/actions/runs/3242695506/jobs/5316334351 crash in Docker builds introduced by: #82682 The PR seems to introduce some changes not compatible with cuda 11.3 which is used by our Docker builds This is a reland of original pr: https://github.com/pytorch/pytorch/pull/86941 (Created this new PR to start fresh) Which was reverted because conda install, installed wrong version of pytorch. It installed pytorch for cuda 11.3 still rather then 11.6 This should be fixed now with Release 1.13 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87858 Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/izaitsevfb commit c5cb6ec06619a2fc9874b967f11d13663c5d32c1 Author: Eddie Yan Date: Fri Oct 28 19:33:42 2022 +0000 Allow 64bit indexing for channels-last upsample2d on CUDA (#87901) CC @ngimel @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/87901 Approved by: https://github.com/ngimel commit fb64f7b804911fd74322132c209f86047825f04a Author: Taylor Robie Date: Wed Oct 26 16:56:51 2022 -0700 [Profiler][Trivial] Move ID assignment code to `data_flow.cpp` (#87670) ID assignment has become a very complex facet of the profiler. The existing code has grown organically as I've discovered various refinements and has become very difficult to understand or reason about. (With more complexity coming in https://github.com/pytorch/pytorch/pull/87133) I want to take a step back and add some structure and additional comments to the ID assignment algorithm. Before I do, however, it's time to move it out of `collection.cpp` to a dedicated data flow file. 
Differential Revision: [D40666360](https://our.internmc.facebook.com/intern/diff/D40666360/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40666360/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/87670 Approved by: https://github.com/slgong-fb commit 8d395ec6bc95e7a24311000ce65c992c6a568f34 Author: Taylor Robie Date: Wed Oct 26 16:56:50 2022 -0700 [Profiler][Trivial] Add hashing struct for pairs and tuples. (#87668) There is a fairly simple and commonly used hash_combine in c10/util; however in order to use it in a map we need to wrap it in a hashing struct. By defining template functions we also get recursive unpacking for free. (A later PR will want to hash a `tuple, tuple>`) Differential Revision: [D40666359](https://our.internmc.facebook.com/intern/diff/D40666359/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87668 Approved by: https://github.com/slgong-fb commit d13b6781d8b7353919ee06378636773f762b880e Author: PyTorch MergeBot Date: Fri Oct 28 17:55:19 2022 +0000 Revert "[fx][subgraph_rewriter] Change match_filter to be a List in replace_pattern_with_filters (#87257)" This reverts commit 58650835bb91d927623e6bff5cc4844fbcad6368. Reverted https://github.com/pytorch/pytorch/pull/87257 on behalf of https://github.com/weiwangmeta due to breaking internal builds/BC-breaking change commit fc21b9db23377569423f20a749a170375a11966d Author: Elias Ellison Date: Thu Oct 27 17:12:36 2022 -0700 Use Eager Code To Determine Conv Layout (#87305) The logic for determining the conv backend, and therefore the output striding, is very complex. It depends on build settings, input striding/contiguity, sizes, etc. Eventually we should port that logic to the meta impl for dynamic shapes but that will require a lot more work and keeping the implementations in sync. See https://github.com/pytorch/torchdynamo/issues/1701 This is a prerequisite to removing the inductor conv stride propagation and to more general fake tensor propagation for inductor. In that PR, the meta impls for cpu conv give incorrect striding which led to test failures (https://github.com/pytorch/pytorch/pull/87083). Pull Request resolved: https://github.com/pytorch/pytorch/pull/87305 Approved by: https://github.com/ezyang commit 1bc0e923bb006ee9e43996dfde49df89ea11b979 Author: Natalia Gimelshein Date: Fri Oct 28 16:09:25 2022 +0000 add special case for power of 0.5 (#87912) Workaround for https://github.com/pytorch/torchdynamo/issues/1775, and calling sqrt is better in any case, but `libdevice.pow` still for some reason doesn't work if both arguments are scalars cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @mreso, can you please check if that takes you further with diffusers cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/87912 Approved by: https://github.com/desertfire commit 35c611d30f6024fc6fc94b437372ab4ee1b3544d Author: Driss Guessous Date: Fri Oct 28 15:51:10 2022 +0000 Add mem efficient backend flag (#87946) Add in a torch.backends.cuda flag and update the context manager to pick between the three implementations of scaled_dot_product_attention.
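A sketch of how the flag could be exercised from user code (assumptions: a CUDA build where the `torch.backends.cuda.sdp_kernel` context manager with `enable_flash` / `enable_math` / `enable_mem_efficient` switches and a public `scaled_dot_product_attention` entry point are available; the names changed in later releases):
```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) on a CUDA device, half precision.
q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Force the memory-efficient kernel by disabling the other two backends.
with torch.backends.cuda.sdp_kernel(
    enable_flash=False, enable_math=False, enable_mem_efficient=True
):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # (2, 8, 128, 64)
```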
cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki Pull Request resolved: https://github.com/pytorch/pytorch/pull/87946 Approved by: https://github.com/cpuhrsch commit 89fd451934ac4065bd0064ba9d92e8b8b3827619 Author: Shen Li Date: Fri Oct 28 14:45:38 2022 +0000 Fix codeowner errors (#87954) Error message: "Unknown owner: make sure @mingzhe09088 exists and has write access to the repository." Pull Request resolved: https://github.com/pytorch/pytorch/pull/87954 Approved by: https://github.com/wangkuiyi commit 8a9aca7b8d7fba320b4f2a8c2f18a25f572c46b6 Author: albanD Date: Fri Oct 28 13:40:11 2022 +0000 Reland 2 Many symintifications (#87604) (#87980) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/87980 Approved by: https://github.com/ezyang commit ce3e0e9856e32fae61df282f0b97b0e2e1eadf9d Author: Shen Li Date: Fri Oct 28 02:04:36 2022 +0000 Add state to distributed composable API (#87838) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87838 Approved by: https://github.com/yhcharles commit b192e7e415c50cf7af5c70f35a8c20c38985d06d Author: Christian Puhrsch Date: Fri Oct 28 11:26:17 2022 +0000 Support non-contiguous NestedTensors for elementwise ops (#87888) Enables benchmarking of math path of sdp kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/87888 Approved by: https://github.com/drisspg commit f150e70ca2a5d7efdfb55e3115ccd750b39acc39 Author: leslie-fang-intel Date: Fri Oct 28 10:30:30 2022 +0000 add the function specialization for promote with ITensorListRef (#87756) Fixes [#87684](https://github.com/pytorch/pytorch/issues/87684) It's due to a new tensor list type is introduced as `ITensorListRef`. We need the function specialization for `prioritize` and `cached_cast` for this new tensor list type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87756 Approved by: https://github.com/jgong5, https://github.com/ezyang commit 166b5d3e7c5c230c455dcbcc05c84dd6bc03721b Author: PyTorch MergeBot Date: Fri Oct 28 06:11:42 2022 +0000 Revert "[EZ] Fix simple bug in torchdynamo (#87821)" This reverts commit ce7fcab9bdf61a34bc56b7cd45a882e4ad6ba175. Reverted https://github.com/pytorch/pytorch/pull/87821 on behalf of https://github.com/kit1980 due to Broke many dynamo tests https://github.com/pytorch/pytorch/actions/runs/3341984303/jobs/5534381456 commit 78b406932f0e4afd82b672f959b8cb9ce1e79f9d Author: Charlie Yan Date: Fri Oct 28 04:05:01 2022 +0000 Add me to reviewers of composable API changes (#87891) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87891 Approved by: https://github.com/mrshenli commit 1da5aeb97b73664ff0fe2f4bb48379655cede969 Author: Michael Suo Date: Thu Oct 27 15:01:21 2022 -0700 [dynamo] Error when user nests FX with dynamo (#87797) Today, this doesn't work and dynamo errors out in a very non-obvious way (see: https://gist.github.com/suo/dde04830372ab51a4a34ea760f14200a). Here, we detect the error early and exit with a nicer msg. Also add a config option to just no-op dynamo (which need to unblock internal enablement). 
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/87797 Approved by: https://github.com/yf225, https://github.com/soumith, https://github.com/jansel commit 07f7c4615bc858a8822c05aa310310446fc78836 Author: Xia, Weiwen Date: Fri Oct 28 04:58:54 2022 +0000 [MKLDNN] Replace pooling algorithm `pooling_avg` with `pooling_avg_exclude_padding` for future oneDNN upgrades (#87851) **Description** Replace pooling algorithm `pooling_avg` with `pooling_avg_exclude_padding` in implementation of mkldnn pooling. It's only a change of names, not algorithm. The former is an alias of the latter and it will be removed in future oneDNN library upgrades. This change has no effect on functionality or performance. **Validation** Covered by UT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87851 Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper commit 23b79e6f48c4350b9a2ed7680a13d22e5d8066b6 Author: rboca Date: Fri Oct 28 04:56:37 2022 +0000 Update CMakeLists.txt (#87030) Fix Caffe2_CPU_INCLUDE with Caffe2_GPU_INCLUDE. The expanding parent scope should be with the same variable name. The compilation in certain build configurations is corrected with this fix. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/87030 Approved by: https://github.com/kit1980 commit daff5d35567615bb80f19e59474d8af7af84daf2 Author: Kazuaki Ishizaki Date: Fri Oct 28 04:53:33 2022 +0000 Fix typos under caffe2 directory (#87840) This PR fixes typos in `.md` files under caffe2 directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/87840 Approved by: https://github.com/kit1980 commit e8a97a3721f86eacbbf5e1160be07cc27544b9aa Author: Sherlock Huang Date: Fri Oct 28 00:01:07 2022 +0000 FakeTensorMode and Prims.add/sub/mul/div support scalar only inputs (#87759) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87759 Approved by: https://github.com/ngimel, https://github.com/mruberry, https://github.com/eellison commit d47ffecbe4ab1f177fbebc7d8f42d8b84f29f996 Author: Michael Suo Date: Thu Oct 27 12:37:59 2022 -0700 [dynamo] relax fake tensor restriction with `assume_constant_result` (#87895) This works now because of https://github.com/pytorch/pytorch/pull/87091, so don't error out anymore. cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/87895 Approved by: https://github.com/tugsbayasgalan, https://github.com/voznesenskym commit 2e48b478e06b38a7468832d980d214441855547e Author: Jithun Nair Date: Fri Oct 28 03:50:43 2022 +0000 [ROCm] Use -rpath-link to fix libtinfo conflict (#83552) Fixes issue building PyTorch for ROCm5.3 and above on Ubuntu20.04 because libtinfo6 from conda conflicts with the one from the distro causing symbol not found errors. 
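(Illustration, referring back to the `assume_constant_result` relaxation in #87895 above: a hedged usage sketch; the decorator's exact import location is an assumption here.)

```python
import torch
import torch._dynamo as dynamo

@dynamo.assume_constant_result  # assumed entry point for the decorator
def get_scale():
    # hypothetical helper whose return value dynamo may bake in as a constant
    return 4

def fn(x):
    return x * get_scale()

opt_fn = dynamo.optimize("eager")(fn)
opt_fn(torch.ones(3))
```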
cc @jeffdaily @sunway513 @ROCmSupport Pull Request resolved: https://github.com/pytorch/pytorch/pull/83552 Approved by: https://github.com/malfet, https://github.com/pruthvistony commit 9c793b366faf49c78effb0c78d26c48f7664bc92 Author: sanchitintel Date: Fri Oct 28 03:42:19 2022 +0000 Move incorrectly placed closing curly brace of `extern "C"` block (#87853) When `__SYCL_DEVICE_ONLY__` is defined, while building PyTorch, the output of the preprocessing step would not have the closing curly brace of the `extern "C"` block, as it has been incorrectly placed. Compilers don't seem to report an error or a warning for a missing closing brace of an `extern "C"` block. If `c10/macros/Macros.h` would be included in a C++ file, and after the preprocessing stage, if the preprocessed source file would have some templated code after `extern "C" {`, then, after compilation, linking might fail with the error `templates must have c++ linkage`). eg. https://stackoverflow.com/questions/61717819/template-with-c-linkage-error-when-using-template-keyword-in-main-cpp/61717908#61717908 (its answer also has a small snippet of code to reproduce such an issue). one-liner bug fix that rectifies the placement of closing curly brace (`}`), so that the `extern "C"` block ends properly when `__SYCL_DEVICE_ONLY__` is defined. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87853 Approved by: https://github.com/jgong5, https://github.com/kit1980, https://github.com/malfet commit 13de4d2137b7417d118a84c01ea88c21393e0a5d Author: Sherlock Huang Date: Thu Oct 27 22:20:36 2022 +0000 Meta OpInfo Test for stride correctness (#87849) Failing test logs here https://gist.github.com/SherlockNoMad/a7e132f3cb4152900f8a6d7df358c59e Pull Request resolved: https://github.com/pytorch/pytorch/pull/87849 Approved by: https://github.com/eellison commit 8b4d95759c7d0e6b7d4c3a3facaaa18ffe4cbd54 Author: PyTorch MergeBot Date: Fri Oct 28 03:00:09 2022 +0000 Revert "Many symintifications (#87604)" This reverts commit 777e6a2c5100f3274cff1bcf7e47ccbe1a651927. Reverted https://github.com/pytorch/pytorch/pull/87604 on behalf of https://github.com/weiwangmeta due to breaking internal builds commit 2cb7c3f865ac8305f0af2806082b3bc8ec29a640 Author: Animesh Jain Date: Fri Oct 28 02:41:12 2022 +0000 [dynamo][benchmarks] Prepone Cold start setup (#87913) Parallel compilation warms the Threadpool when we call `torch._dynamo.optimize()`. In current benchmarks, we were setting up the TRITON_CACHE_DIR much later. Because of this parallel compilation artifacts were not used and compilation latency improvements were not visible in dashboard. This PR just prepones the setup of TRITON_CACHE_DIR. cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/87913 Approved by: https://github.com/wconstab commit 641d8e0e699a981b1272df66848ab87e118f5eca Author: PyTorch MergeBot Date: Fri Oct 28 02:20:24 2022 +0000 Revert "Enable mypy check for distributed.py, and fix type errors (#87543)" This reverts commit 2cc624cd4318414905d2475432aee13db9031cc6. 
Reverted https://github.com/pytorch/pytorch/pull/87543 on behalf of https://github.com/weiwangmeta due to breaking internal builds commit f9679184116f1d29c483c2b2a4c3a9d730be4694 Author: Andrew Gu Date: Thu Oct 27 16:59:49 2022 +0000 [AC] Return `None` from `apply_activation_checkpointing()` (#87871) `_recursive_wrap()` returns `Tuple[nn.Module, int]`, where the `nn.Module` is the in-place modified module and the `int` is the numel wrapped. In that sense, the return value is not meant to be publicly used. The `apply_activation_checkpointing()` docs already suggest that the function returns `None`, so this PR simply follows that. **Test Plan** CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/87871 Approved by: https://github.com/zhaojuanmao commit 81c4049f4d2f4e94818ae52c04c870805713c59e Author: Mike Iovine Date: Fri Oct 28 01:28:34 2022 +0000 [Static Runtime] Move PrepackWeights to internal-only graph passes (#87799) Summary: The pass introduces an `fb::` operator and thus cannot be used in OSS. The test failure was not exposed because the Static Runtime tests have been disabled in OSS for a while. The Dev Infra folks encountered this failure when re-enabling the tests. Test Plan: Existing tests Differential Revision: D40724547 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87799 Approved by: https://github.com/huydhn commit ce7fcab9bdf61a34bc56b7cd45a882e4ad6ba175 Author: Tugsbayasgalan Manlaibaatar Date: Thu Oct 27 04:04:26 2022 +0000 [EZ] Fix simple bug in torchdynamo (#87821) cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87821 Approved by: https://github.com/voznesenskym, https://github.com/jansel commit fd27246c16d8a80e7de0ccc86d014f9759611b0f Author: lezcano Date: Thu Oct 27 21:46:25 2022 +0000 Fix decomposition for std (#87181) The previous implementation was lacking a few features and incurred on a pretty large error cc @ezyang @mruberry @ngimel @Lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87181 Approved by: https://github.com/ngimel, https://github.com/peterbell10 commit f21d0b310cecbd68ae345e4b677a702892c57292 Author: lezcano Date: Thu Oct 27 21:46:25 2022 +0000 Add decomposition for diagonal_scatter (#87282) cc @ezyang @mruberry @ngimel @Lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87282 Approved by: https://github.com/mruberry commit 9225f261769170fda1136ed19238f7c74cddb2bb Author: Andrew Gu Date: Thu Oct 27 20:13:27 2022 +0000 [FSDP] Fix wrapped module changing after ctor (#87837) Recently, I retired `FlattenParamsWrapper`, which meant that FSDP registers its `FlatParameter` on the wrapped module instead of the `FlattenParamsWrapper` instance. This is only relevant for `use_orig_params=False`. If the user changes an FSDP instance's wrapped module after the FSDP constructor, then the `FlatParameter` is no longer registered on the wrapped module. This can cause issues for full state dict, which checks if the `FlatParameter` is currently registered as an early return condition for `rank0_only=True`. The solution in this PR is to re-establish the wrapped module in `_lazy_init()`, de-registering from the old wrapped module and re-registering to the new wrapped module, where the assumption is that the user should not modify the module structure upon `_lazy_init()`. 
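(Illustration for the #87837 fix described above: a minimal sketch of the de-register/re-register idea. The parameter name and helper are placeholders that mirror the mechanism, not FSDP's actual private functions.)

```python
import torch.nn as nn

def move_registration(old_wrapped: nn.Module, new_wrapped: nn.Module, name: str = "_flat_param"):
    # "_flat_param" is a placeholder name for the registered FlatParameter.
    param = old_wrapped._parameters.pop(name, None)
    if param is not None:
        # Dynamic registration goes straight through the _parameters dict,
        # mirroring how the FlatParameter was registered in the first place.
        new_wrapped._parameters[name] = param
```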
The direct access to the private attribute `_parameters` from `nn.Module` is not ideal, but we already rely on it for the dynamic `FlatParameter` registration. The tradeoff is whether we want an additional `nn.Module` wrapper (`FlattenParamsWrapper`) and use `delattr` plus a singleton list to do the dynamic registration or we want to access `_parameters`. If this becomes a problem, we can work with Core team on a solution. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87837 Approved by: https://github.com/zhaojuanmao commit 7a3afe61d2230e8620718c326223ecc9e276fde3 Author: Richard Barnes Date: Fri Oct 28 00:41:04 2022 +0000 Check all CUDA API calls for errors in caffe2/ (#81816) Test Plan: Sandcastle Differential Revision: D35194868 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81816 Approved by: https://github.com/ezyang commit 3ece9fb45df90dec72251104ec29b85cb062e6b7 Author: Richard Barnes Date: Fri Oct 28 00:40:47 2022 +0000 Check all CUDA API calls for errors in torch/ (#81560) Summary: Original commit changeset: 0bb770d2cdb2 Original Phabricator Diff: D35194935 (https://github.com/pytorch/pytorch/commit/79e5b053b690852b21d881357904bc5a4438d95b) Differential Revision: D35291874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81560 Approved by: https://github.com/ezyang commit 4e3a0ff92ed2e5873d77d38bca50647b1ad2f4a8 Author: Bin Bao Date: Thu Oct 27 16:26:42 2022 +0000 Update how inductor cpu tests are skipped on fbcode (#87867) cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/87867 Approved by: https://github.com/anijain2305 commit 6cc4ae3d2d64b10d7104c4a0cc4083a644ef8e54 Author: PyTorch MergeBot Date: Thu Oct 27 23:55:59 2022 +0000 Revert "[Inductor] Enable Inductor unspec inputs test for different dtypes (#87809)" This reverts commit 369755f8ce1b043c88efbc50ee09c0258dec5162. Reverted https://github.com/pytorch/pytorch/pull/87809 on behalf of https://github.com/kit1980 due to Broke trunk / cuda11.6-py3.10-gcc7-sm86 / test (default, 4, 4, linux.g5.4xlarge.nvidia.gpu), same error on pull. commit cda0d5a57b9126c6d244fdd5b02198f05c742615 Author: PyTorch MergeBot Date: Thu Oct 27 21:16:58 2022 +0000 Revert "[dynamo] Error when user nests FX with dynamo (#87797)" This reverts commit a485528a7e4551461d57db3deb8b40c2acea08d2. 
Reverted https://github.com/pytorch/pytorch/pull/87797 on behalf of https://github.com/kit1980 due to Broke linux-bionic-py3.7-clang9 / test (dynamo, 2, 2, linux.2xlarge), same error on pull commit 6ad3543a1b000a369d811e0af195209f62f32fbc Author: soulitzer Date: Wed Oct 26 14:34:58 2022 -0400 BE: Improve test_will_engine_execute_node unittest (#87806) Adds the test from https://github.com/pytorch/pytorch/pull/86672 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87806 Approved by: https://github.com/albanD commit 0f7df16c71215bc7bd7835fc5933ac3343b8a627 Author: foram-chandra <96388449+foram-chandra@users.noreply.github.com> Date: Thu Oct 27 21:03:42 2022 +0000 [doc] Add out-kwarg documentation to torch.where (#87870) Fixes #87862 cc: @lezcano Pull Request resolved: https://github.com/pytorch/pytorch/pull/87870 Approved by: https://github.com/lezcano commit 46b16977d97fd3b241a641c8020d0bc073a218d0 Author: Alvaro Gaona Date: Thu Oct 27 21:00:59 2022 +0000 Reimplement Kaiser window (#87330) Relates to #85366 - For reference follow #87082. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87330 Approved by: https://github.com/lezcano, https://github.com/mruberry commit 369755f8ce1b043c88efbc50ee09c0258dec5162 Author: Yanbo Liang Date: Thu Oct 27 20:58:46 2022 +0000 [Inductor] Enable Inductor unspec inputs test for different dtypes (#87809) Fixes #ISSUE_NUMBER cc @jansel @mlazos @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/87809 Approved by: https://github.com/ngimel commit 1ff52225f185e11faa421528815aaa43e79e0722 Author: Edward Z. Yang Date: Thu Oct 27 13:49:11 2022 -0700 Unify SymIntNode and SymFloatNode into SymNode (#87817) This refactor was prompted by challenges handling mixed int/float operations in C++. A previous version of this patch added overloads for each permutation of int/float and was unwieldy https://github.com/pytorch/pytorch/pull/87722/ This PR takes a different approach. The general outline of the patch is to combine the C++ types SymIntNode and SymFloatNode into a single type, SymNode. This is type erased; we no longer know statically at C++ if we have an int/float and have to test it with the is_int()/is_float() virtual methods. This has a number of knock on effects. - We no longer have C++ classes to bind to Python. Instead, we take an entirely new approach to our Python API, where we have a SymInt/SymFloat class defined entirely in Python, which hold a SymNode (which corresponds to the C++ SymNode). However, SymNode is not pybind11-bound; instead, it lives as-is in Python, and is wrapped into C++ SymNode using PythonSymNode when it goes into C++. This implies a userland rename. In principle, it is also possible for the canonical implementation of SymNode to be written in C++, and then bound to Python with pybind11 (we have this code, although it is commented out.) However, I did not implement this as we currently have no C++ implementations of SymNode. Because we do return SymInt/SymFloat from C++ bindings, the C++ binding code needs to know how to find these classes. Currently, this is done just by manually importing torch and getting the attributes. - Because SymInt/SymFloat are easy Python wrappers, __sym_dispatch__ now takes SymInt/SymFloat, rather than SymNode, bringing it in line with how __torch_dispatch__ works. 
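(Illustration of the layering described above for #87817: SymInt as a thin Python wrapper around a type-erased node. The classes below are a purely didactic sketch, not the actual torch types.)

```python
class ToySymNode:
    """Stand-in for the type-erased node: plain named methods, no magic methods."""

    def __init__(self, value):
        self._value = value

    def is_int(self):
        return isinstance(self._value, int)

    def is_float(self):
        return isinstance(self._value, float)

    def add(self, other):
        return ToySymNode(self._value + other._value)


class ToySymInt:
    """User-facing wrapper; only this layer exposes magic methods."""

    def __init__(self, node):
        self.node = node

    def __add__(self, other):
        return ToySymInt(self.node.add(other.node))

    def __int__(self):
        return int(self.node._value)
```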
Some miscellaneous improvements: - SymInt now has a constructor that takes SymNode. Note that this constructor is ambiguous if you pass in a subclass of SymNode, so an explicit downcast is necessary. This means toSymFloat/toSymInt are no more. This is a mild optimization as it means rvalue reference works automatically. - We uniformly use the caster for c10::SymInt/SymFloat, rather than going the long way via the SymIntNode/SymFloatNode. - Removed some unnecessary toSymInt/toSymFloat calls in normalize_* functions, pretty sure this doesn't do anything. - guard_int is now a free function, since to guard on an int you cannot assume the method exists. A function can handle both int and SymInt inputs. - We clean up the magic method definition code for SymInt/SymFloat/SymNode. ONLY the user classes (SymInt/SymFloat) get magic methods; SymNode gets plain methods; this is to help avoid confusion between the two types. Signed-off-by: Edward Z. Yang cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87817 Approved by: https://github.com/albanD, https://github.com/anjali411 commit 2205f56f462fb9cbb1c068acc1cf29aca27aef0a Author: Jiewen Tan Date: Thu Oct 27 20:39:30 2022 +0000 [LTC] Remove lazy::View (#87822) Summary: This is the first part to remove the whole view and aliasing infrastructure in LTC, which is deprecated in favor of functionalization. It mainly removes things that use lazy::View. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/87822 Approved by: https://github.com/JackCaoG, https://github.com/antoniojkim, https://github.com/wconstab commit 83b381d34db05d01ccde1c3da755b3dca5504ee7 Author: Animesh Jain Date: Thu Oct 27 19:49:29 2022 +0000 [dynamo] add inductor runs w/o cudagraphs (#87847) as title cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Pull Request resolved: https://github.com/pytorch/pytorch/pull/87847 Approved by: https://github.com/jansel commit d2d0be9a76bcdaf5f26eb88dd505ccf2ac6d7e40 Author: samdow Date: Thu Oct 27 17:10:04 2022 +0000 fix typo in per sample grad test (#87790) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87790 Approved by: https://github.com/zou3519 commit b8b1d7be24a29d9b20b25c0dd5273a499af07097 Author: Akshit Khurana Date: Wed Oct 26 15:44:00 2022 -0700 [dynamo] Add ao.nn to skipfiles inline allowlist (#87820) Summary: Allow torch.ao.nn module to be inlined Test Plan: Tested manually for https://github.com/pytorch/torchdynamo/issues/1737 Reviewers: Subscribers: Tasks: Tags: cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx Differential Revision: [D40768679](https://our.internmc.facebook.com/intern/diff/D40768679) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87820 Approved by: https://github.com/jansel commit a485528a7e4551461d57db3deb8b40c2acea08d2 Author: Michael Suo Date: Wed Oct 26 10:49:38 2022 -0700 [dynamo] Error when user nests FX with dynamo (#87797) Today, this doesn't work and dynamo errors out in a very non-obvious way (see: https://gist.github.com/suo/dde04830372ab51a4a34ea760f14200a). Here, we detect the error early and exit with a nicer msg. 
Also add a config option to just no-op dynamo (which need to unblock internal enablement). cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87797 Approved by: https://github.com/yf225, https://github.com/soumith, https://github.com/jansel commit f1b78224cab093112173cd34bef0938fe2cb927e Author: Natalia Gimelshein Date: Thu Oct 27 15:53:11 2022 +0000 Fix type promotion for 2 wrapped scalar args (#87845) Fixes #76801 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87845 Approved by: https://github.com/SherlockNoMad, https://github.com/mruberry commit 03d6af4db3974fcbf1ce7d3b3be46c1134c72e6e Author: Brian Hirsh Date: Wed Oct 26 14:11:22 2022 -0700 add nesting to TORCH_SHOW_DISPATCH_TRACE (#87751) Added indents to `TORCH_SHOW_DISPATCH_TRACE` so that you more easily see the call tree from the dispatcher. Definitely slower, but it's all guarded under the `DEBUG` build. Example output: I know we have the PyDispatcher now, but I still found this helpful for debugging ``` [call] op=[aten::ones], key=[BackendSelect] [redispatch] op=[aten::ones], key=[CPU] [call] op=[aten::empty.memory_format], key=[BackendSelect] [redispatch] op=[aten::empty.memory_format], key=[CPU] [call] op=[aten::fill_.Scalar], key=[CPU] [call] op=[aten::clone], key=[AutogradCPU] [redispatch] op=[aten::clone], key=[CPU] [call] op=[aten::empty_strided], key=[BackendSelect] [redispatch] op=[aten::empty_strided], key=[CPU] [call] op=[aten::copy_], key=[CPU] [call] op=[aten::view], key=[PythonTLSSnapshot] [redispatchBoxed] op=[aten::view], key=[AutogradCPU] [redispatch] op=[aten::view], key=[ADInplaceOrView] [redispatch] op=[aten::view], key=[Functionalize] [call] op=[aten::view], key=[PythonTLSSnapshot] [redispatchBoxed] op=[aten::view], key=[Meta] [call] op=[aten::view], key=[PythonTLSSnapshot] [redispatchBoxed] op=[aten::view], key=[Python] [callBoxed] op=[aten::view], key=[CPU] [call] op=[aten::clone], key=[PythonTLSSnapshot] [redispatchBoxed] op=[aten::clone], key=[AutogradCPU] [redispatch] op=[aten::clone], key=[Functionalize] [callBoxed] op=[aten::clone], key=[PythonTLSSnapshot] [redispatchBoxed] op=[aten::clone], key=[Python] [callBoxed] op=[aten::clone], key=[CPU] [call] op=[aten::empty_strided], key=[BackendSelect] [redispatch] op=[aten::empty_strided], key=[CPU] [call] op=[aten::copy_], key=[CPU] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87751 Approved by: https://github.com/ezyang, https://github.com/zou3519 commit 23ff47ccc53cda92ffe2482f22a4321f721eace0 Author: Brian Hirsh Date: Wed Oct 26 14:11:22 2022 -0700 functionalization: fix detach() (#87750) `.detach()` worked in basic cases previously, but didn't properly preserve view relationships between the base and the output. This wasn't heavily tested, because autograd doesn't normally encounter `FunctionalTensorWrapper` directly, but could become more common if we fuse functionalization and autograd into a single tracing pass. 
This will also be a bug fix for LTC (and XLA when they use functionalization) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87750 Approved by: https://github.com/ezyang commit e2bbc0a134369c56f1be437e4548a2204a83b46e Author: Nikita Shulga Date: Thu Oct 27 15:38:48 2022 +0000 [BE] Move remaining workflows off Xenial (#87834) Both BE and prerequisite for moving our CI/CD to C++17 compiler (gcc-5.4 is not fully C++17 compliant) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87834 Approved by: https://github.com/weiwangmeta, https://github.com/kit1980, https://github.com/huydhn commit 1e1b04512879d6166dc1f5adff482723e2d0da9e Author: jpvillam Date: Thu Oct 27 15:11:28 2022 +0000 [ROCM] Enable Sparse Pickle Test (#82729) Missed stream context for serialization Missing ROCm stream context on memory operations for serialization Ran the sparse pickle test Pull Request resolved: https://github.com/pytorch/pytorch/pull/82729 Approved by: https://github.com/ngimel commit aaba0bd30641c56db1dc0550b81fbc458db46276 Author: Mike Iovine Date: Thu Oct 27 12:29:51 2022 +0000 [JIT] Fix torch.jit.script for functions with many decorators (#87804) Summary: Python's function parsing from the `ast` module records the line number of the function definition, not the first decorator. So this diff fixes crashes like this: ``` IndexError: vector::_M_range_check: __n (which is 10) >= this->size() (which is 8) ``` Test Plan: New unit test Differential Revision: D40726352 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87804 Approved by: https://github.com/tugsbayasgalan, https://github.com/davidberard98 commit 1780e0ef7fe49f0b1e2723bb88d926bac231eee1 Author: kshitij12345 Date: Thu Oct 27 10:46:53 2022 +0000 [complex] conv_transpose2d (#81805) Reference: https://github.com/pytorch/pytorch/issues/71108 Fixes : #86414 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81805 Approved by: https://github.com/anjali411 commit c36db82e12a80e31a50e28aeda2801d18a952959 Author: XiaobingSuper Date: Wed Oct 26 01:46:46 2022 -0400 TorchDynamo: Add convolution unary fusion for cpu in inference mode (#87063) cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87063 Approved by: https://github.com/jgong5, https://github.com/jansel commit b16b5fb802028ab96e4e15a09d6c1d94304c4f83 Author: Taylor Robie Date: Wed Oct 26 16:56:47 2022 -0700 [Profiler] Hold weak reference to prevent TensorImpl address reuse during profiling. (#87244) A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.) Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr`, which in turn holds a storage. (Which is a `c10::intrusive_ptr`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. 
However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete. This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused. This PR uses a `c10::weak_intrusive_ptr` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse. Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87244 Approved by: https://github.com/slgong-fb, https://github.com/albanD commit 4b2390517263592fb6972e4b128777bc038ee4aa Author: Mengwei Liu Date: Thu Oct 27 06:04:22 2022 +0000 [torch] Add torch cpp cpu target for torch/csrc/api/src files (#87327) Summary: Duplicating fbcode target `fbcode//caffe2:torch-cpp-cpu` target in xplat. In D40460749 our user wants to use `torch::kNearest` enum which is defined in `torch/csrc/api/src/enum.cpp`. Adding this target to support it. Test Plan: Rely on CI Differential Revision: D40532087 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87327 Approved by: https://github.com/ezyang commit bf113e38fad30fb1eec1f94563f419518ae3178c Author: Richard Barnes Date: Thu Oct 27 05:15:16 2022 +0000 use nv_diag_suppress (#87712) Fixes: ``` /dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/cuda/UnaryFractionKernels.cu(125): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead /dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/cuda/UnaryFractionKernels.cu(125): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead /dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu(73): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead /dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu(73): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead ``` cc @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/87712 Approved by: https://github.com/soumith commit 107f92a6830f61b88a7eb55934610f491623dc9b Author: Andrew Gu Date: Thu Oct 27 00:03:15 2022 +0000 [FSDP] ufmt FSDP test (#87812) This applies `ufmt` to all of the FSDP test files in the `test/distributed/fsdp/` directory. **Test Plan** CI **Notes** For VSCode users, - Install `ufmt`: https://pypi.org/project/ufmt/ - Install VSCode `ufmt` extension: https://marketplace.visualstudio.com/items?itemName=omnilib.ufmt - Include in `settings.json`: ``` { "[python]": { "editor.defaultFormatter": "omnilib.ufmt", "editor.formatOnSave": true, }, } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87812 Approved by: https://github.com/rohan-varma commit e3cf81e0a73e7aec282f41469353d955a2fef143 Author: Andrew Gu Date: Thu Oct 27 00:03:14 2022 +0000 [FSDP] ufmt /fsdp (#87811) This applies `ufmt` to all of the FSDP files in the `torch/distributed/fsdp/` directory. 
**Test Plan** CI **Notes** For VSCode users, - Install `ufmt`: https://pypi.org/project/ufmt/ - Install VSCode `ufmt` extension: https://marketplace.visualstudio.com/items?itemName=omnilib.ufmt - Include in `settings.json`: ``` { "[python]": { "editor.defaultFormatter": "omnilib.ufmt", "editor.formatOnSave": true, }, } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87811 Approved by: https://github.com/rohan-varma, https://github.com/fegin commit 49ce3ed14cab4aca39ed42d6dbbc1759667a28fe Author: PyTorch MergeBot Date: Thu Oct 27 04:23:43 2022 +0000 [vision hash update] update the pinned vision hash (#87831) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87831 Approved by: https://github.com/pytorchbot commit 21bef8e944c90cdf98c2ead4369410db252944e1 Author: Horace He Date: Wed Oct 26 16:37:10 2022 +0000 fix sym_storage conversion and some cleanup (#87718) cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87718 Approved by: https://github.com/ezyang commit 58650835bb91d927623e6bff5cc4844fbcad6368 Author: Jerry Zhang Date: Wed Oct 26 14:43:42 2022 -0700 [fx][subgraph_rewriter] Change match_filter to be a List in replace_pattern_with_filters (#87257) Summary: att, this is experimental api so not marking it as bc-breaking. The match will be accepted only if all the filters in the list passes. Changing the filter arg to be list also allows us to pass in empty list that means no filter, which makes user code cleaner. Test Plan: python test/test_fx.py -k test_replace_pattern_with_filters Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/87257 Approved by: https://github.com/SherlockNoMad commit 195a13f48ce10bb80aeb792993cd33747e1de755 Author: Jerry Zhang Date: Wed Oct 26 14:43:42 2022 -0700 [quant][be] Remove unused function `quantize_node` (#87153) Summary: att Test Plan: python test/test_quantization.py TestQuantizeFx Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/87153 Approved by: https://github.com/andrewor14 commit 30ea8f5c207d3f136cece6c5ca503c18f47b5007 Author: Nikita Shulga Date: Thu Oct 27 01:24:01 2022 +0000 Limit ROCM option to Linux only (#87833) As it's not available on neither Windows nor MacOS cc @jeffdaily @sunway513 @jithunnair-amd @ROCmSupport Pull Request resolved: https://github.com/pytorch/pytorch/pull/87833 Approved by: https://github.com/kit1980 commit 0e3b5ea026cc45d3008ac2b1d02a27f65c4d957d Author: Jerry Zhang Date: Wed Oct 26 14:43:41 2022 -0700 [quant][fx] Add _convert_to_reference_decomposed (#87094) Summary: _convert_to_reference_decomposed is a private convert function in fx graph mode quantization flow to convert a calibrated/trained model to a reference quantized model with decomposed quantized tensor representations. 
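(Illustration for #87094 above: a hedged end-to-end sketch of the FX flow this private function slots into -- prepare, calibrate, then convert to the decomposed reference form. The exact import path of the private helper is an assumption.)

```python
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import (
    prepare_fx,
    _convert_to_reference_decomposed_fx,  # private helper; path assumed
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

example_inputs = (torch.randn(1, 4),)
model = M().eval()
prepared = prepare_fx(model, get_default_qconfig_mapping(), example_inputs)
prepared(*example_inputs)  # calibration pass
reference = _convert_to_reference_decomposed_fx(prepared)
```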
Test Plan: python test/test_quantization.py TestQuantizeFx.test__convert_to_reference_decomposed_fx Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/87094 Approved by: https://github.com/andrewor14 commit a12d3d6b49cb4c9fdc325b0952ac748f55ae72a2 Author: Digant Desai Date: Thu Oct 27 00:59:40 2022 +0000 [profiler] Standard performance event names for the profiler (#87538) Summary: The goal is to create a hardware/backend independent event abstraction on which a standard set of tooling can be developed. Test Plan: CI Reviewed By: kimishpatel Differential Revision: D40238034 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87538 Approved by: https://github.com/salilsdesai, https://github.com/kirklandsign commit 2cc624cd4318414905d2475432aee13db9031cc6 Author: Charlie Yan Date: Wed Oct 26 19:37:52 2022 +0000 Enable mypy check for distributed.py, and fix type errors (#87543) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87543 Approved by: https://github.com/fduwjj commit 5dbd80a605c5c0f12a57de464e9f89f55e6f8f97 Author: Valentin Andrei Date: Thu Oct 27 00:18:16 2022 +0000 [pytorch] Layer norm backward speed gain with warp shuffles (#87814) Summary: Improved native layer norm backward performance. Rewrote `GammaBetaBackwardCUDAKernel` to use shared memory only for the reduction step, but not for loading `mean` and `rstd`. The previous implementation used only `threadIdx.x = 0` to load `mean` and `rstd` into shared memory, and then all threads would access the values in order to do loop unrolling. This approached increased register usage and decreased occupancy, without much benefit from using shared memory (this is because the values were already cached in L1). The new implementation is simpler and register usage is smaller, thus occupancy is better. Added another implementation called `GammaBetaBackwardCUDAKernel_32x32` which is only for shapes dividing exactly to a (32 x 32) block. This permits using warp shuffles for speeding up loading `mean` and `rstd` as well as for the final reduction stage. The effective bandwidth of this implementation is equal to STREAM Triad. Observed that we can get additional benefit if we lower the threshold for calling `GammaBetaBackwardSimpleCUDAKernel` (simple col-wise reduction implementation) from `512` to `128`. Test Plan: Wrote a simple CUDA app that calls the previous implementation of `GammaBetaBackwardCUDAKernel` and the current one, using FP32 values and compares the results. The epsilon value we used for FP comparison is 0.00001 for the weight and 0.0001 for the bias. Ran the benchmark for various sizes A100 GPU and got the results below. Almost all sizes show good speedup. ``` Size (32, 32); Mismatches: dg = 0 db = 0 out of 32. reference = 0.0073 (ms); optimized = 0.0071 (ms); bw_opt = 1.14 GB/s; speedup = 2.68% Size (64, 32); Mismatches: dg = 0 db = 0 out of 32. reference = 0.0107 (ms); optimized = 0.0107 (ms); bw_opt = 1.50 GB/s; speedup = 0.22% Size (256, 128); Mismatches: dg = 0 db = 0 out of 128. reference = 0.0323 (ms); optimized = 0.0075 (ms); bw_opt = 32.89 GB/s; speedup = 330.16% Size (512, 1024); Mismatches: dg = 0 db = 0 out of 1024. reference = 0.0103 (ms); optimized = 0.0089 (ms); bw_opt = 440.54 GB/s; speedup = 15.82% Size (1024, 2048); Mismatches: dg = 0 db = 0 out of 2048. reference = 0.0197 (ms); optimized = 0.0136 (ms); bw_opt = 1151.44 GB/s; speedup = 44.91% Size (2048, 2048); Mismatches: dg = 0 db = 0 out of 2048. 
reference = 0.0416 (ms); optimized = 0.0283 (ms); bw_opt = 1105.31 GB/s; speedup = 47.01% Size (4096, 16384); Mismatches: dg = 0 db = 0 out of 16384. reference = 0.4420 (ms); optimized = 0.3915 (ms); bw_opt = 1277.58 GB/s; speedup = 12.90% Size (70000, 64); Mismatches: dg = 0 db = 0 out of 64. reference = 0.5908 (ms); optimized = 0.6850 (ms); bw_opt = 49.49 GB/s; speedup = -13.75% Size (131072, 512); Mismatches: dg = 0 db = 0 out of 512. reference = 1.1961 (ms); optimized = 0.9234 (ms); bw_opt = 542.54 GB/s; speedup = 29.53% Size (1000, 520); Mismatches: dg = 0 db = 0 out of 520. reference = 0.0132 (ms); optimized = 0.0113 (ms); bw_opt = 343.83 GB/s; speedup = 16.88% Size (4005, 4005); Mismatches: dg = 0 db = 0 out of 4005. reference = 0.1441 (ms); optimized = 0.1054 (ms); bw_opt = 1134.36 GB/s; speedup = 36.71% Size (10000, 1000); Mismatches: dg = 0 db = 0 out of 1000. reference = 0.1293 (ms); optimized = 0.1248 (ms); bw_opt = 597.71 GB/s; speedup = 3.63% Size (1024, 10000); Mismatches: dg = 0 db = 0 out of 10000. reference = 0.0738 (ms); optimized = 0.0735 (ms); bw_opt = 1039.40 GB/s; speedup = 0.45% Size (8192, 4096); Mismatches: dg = 0 db = 0 out of 4096. reference = 0.2673 (ms); optimized = 0.2223 (ms); bw_opt = 1125.01 GB/s; speedup = 20.25% Size (10000, 10000); Mismatches: dg = 0 db = 0 out of 10000. reference = 0.7331 (ms); optimized = 0.8940 (ms); bw_opt = 833.54 GB/s; speedup = -18.00% Size (3072, 10000); Mismatches: dg = 0 db = 0 out of 10000. reference = 0.2087 (ms); optimized = 0.2364 (ms); bw_opt = 968.64 GB/s; speedup = -11.71% Size (6144, 10000); Mismatches: dg = 0 db = 0 out of 10000. reference = 0.4197 (ms); optimized = 0.5118 (ms); bw_opt = 894.63 GB/s; speedup = -18.00% Size (1024, 20000); Mismatches: dg = 0 db = 0 out of 20000. reference = 0.1480 (ms); optimized = 0.1297 (ms); bw_opt = 1177.68 GB/s; speedup = 14.12% Size (1024, 20000); Mismatches: dg = 0 db = 0 out of 20000. reference = 0.1483 (ms); optimized = 0.1278 (ms); bw_opt = 1195.26 GB/s; speedup = 16.04% Size (512, 1536); Mismatches: dg = 0 db = 0 out of 1536. reference = 0.0104 (ms); optimized = 0.0091 (ms); bw_opt = 646.72 GB/s; speedup = 14.44% Size (512, 6144); Mismatches: dg = 0 db = 0 out of 6144. reference = 0.0219 (ms); optimized = 0.0156 (ms); bw_opt = 1506.30 GB/s; speedup = 40.52% Size (512, 10240); Mismatches: dg = 0 db = 0 out of 10240. reference = 0.0424 (ms); optimized = 0.0370 (ms); bw_opt = 1057.84 GB/s; speedup = 14.63% Size (1000, 1000); Mismatches: dg = 0 db = 0 out of 1000. reference = 0.0139 (ms); optimized = 0.0119 (ms); bw_opt = 627.51 GB/s; speedup = 16.83% Size (2000, 2000); Mismatches: dg = 0 db = 0 out of 2000. reference = 0.0421 (ms); optimized = 0.0412 (ms); bw_opt = 724.10 GB/s; speedup = 2.20% Size (10240, 10240); Mismatches: dg = 0 db = 0 out of 10240. reference = 0.7210 (ms); optimized = 0.6098 (ms); bw_opt = 1281.40 GB/s; speedup = 18.24% Size (384, 128); Mismatches: dg = 0 db = 0 out of 128. reference = 0.0449 (ms); optimized = 0.0089 (ms); bw_opt = 41.50 GB/s; speedup = 403.48% Size (2048, 1024); Mismatches: dg = 0 db = 0 out of 1024. reference = 0.0208 (ms); optimized = 0.0169 (ms); bw_opt = 925.70 GB/s; speedup = 23.13% Size (267, 513); Mismatches: dg = 0 db = 0 out of 513. reference = 0.0342 (ms); optimized = 0.0090 (ms); bw_opt = 114.18 GB/s; speedup = 280.64% Size (67, 123479); Mismatches: dg = 0 db = 0 out of 123479. reference = 0.0562 (ms); optimized = 0.0552 (ms); bw_opt = 1133.46 GB/s; speedup = 1.81% Size (1024, 123479); Mismatches: dg = 0 db = 0 out of 123479. 
reference = 0.8573 (ms); optimized = 0.9245 (ms); bw_opt = 1020.02 GB/s; speedup = -7.27% Size (2048, 66679); Mismatches: dg = 0 db = 0 out of 66679. reference = 0.8778 (ms); optimized = 0.8590 (ms); bw_opt = 1185.05 GB/s; speedup = 2.19% Size (200, 256); Mismatches: dg = 0 db = 0 out of 256. reference = 0.0215 (ms); optimized = 0.0066 (ms); bw_opt = 58.49 GB/s; speedup = 226.81% Size (1000, 256); Mismatches: dg = 0 db = 0 out of 256. reference = 0.0109 (ms); optimized = 0.0092 (ms); bw_opt = 208.27 GB/s; speedup = 18.65% Size (6000, 256); Mismatches: dg = 0 db = 0 out of 256. reference = 0.0394 (ms); optimized = 0.0301 (ms); bw_opt = 381.90 GB/s; speedup = 30.98% Size (6272, 256); Mismatches: dg = 0 db = 0 out of 256. reference = 0.0403 (ms); optimized = 0.0300 (ms); bw_opt = 400.48 GB/s; speedup = 34.34% Size (200, 512); Mismatches: dg = 0 db = 0 out of 512. reference = 0.0218 (ms); optimized = 0.0066 (ms); bw_opt = 116.33 GB/s; speedup = 229.96% Size (1000, 512); Mismatches: dg = 0 db = 0 out of 512. reference = 0.0110 (ms); optimized = 0.0094 (ms); bw_opt = 407.29 GB/s; speedup = 17.26% Size (6000, 512); Mismatches: dg = 0 db = 0 out of 512. reference = 0.0535 (ms); optimized = 0.0594 (ms); bw_opt = 386.05 GB/s; speedup = -9.95% Size (6272, 512); Mismatches: dg = 0 db = 0 out of 512. reference = 0.0573 (ms); optimized = 0.0387 (ms); bw_opt = 619.62 GB/s; speedup = 48.06% Size (200, 1024); Mismatches: dg = 0 db = 0 out of 1024. reference = 0.0221 (ms); optimized = 0.0069 (ms); bw_opt = 222.78 GB/s; speedup = 220.76% Size (1000, 1024); Mismatches: dg = 0 db = 0 out of 1024. reference = 0.0113 (ms); optimized = 0.0097 (ms); bw_opt = 787.79 GB/s; speedup = 16.46% Size (6000, 1024); Mismatches: dg = 0 db = 0 out of 1024. reference = 0.0723 (ms); optimized = 0.0715 (ms); bw_opt = 640.95 GB/s; speedup = 1.10% Size (6272, 1024); Mismatches: dg = 0 db = 0 out of 1024. reference = 0.0751 (ms); optimized = 0.0572 (ms); bw_opt = 837.57 GB/s; speedup = 31.30% Size (200, 1536); Mismatches: dg = 0 db = 0 out of 1536. reference = 0.0232 (ms); optimized = 0.0071 (ms); bw_opt = 323.97 GB/s; speedup = 226.51% Size (1000, 1536); Mismatches: dg = 0 db = 0 out of 1536. reference = 0.0125 (ms); optimized = 0.0114 (ms); bw_opt = 1005.84 GB/s; speedup = 9.62% Size (6000, 1536); Mismatches: dg = 0 db = 0 out of 1536. reference = 0.0807 (ms); optimized = 0.0830 (ms); bw_opt = 828.02 GB/s; speedup = -2.76% Size (6272, 1536); Mismatches: dg = 0 db = 0 out of 1536. reference = 0.0836 (ms); optimized = 0.0695 (ms); bw_opt = 1033.62 GB/s; speedup = 20.27% Size (200, 2048); Mismatches: dg = 0 db = 0 out of 2048. reference = 0.0224 (ms); optimized = 0.0075 (ms); bw_opt = 408.58 GB/s; speedup = 198.10% Size (1000, 2048); Mismatches: dg = 0 db = 0 out of 2048. reference = 0.0165 (ms); optimized = 0.0135 (ms); bw_opt = 1132.42 GB/s; speedup = 22.26% Size (6000, 2048); Mismatches: dg = 0 db = 0 out of 2048. reference = 0.0993 (ms); optimized = 0.0989 (ms); bw_opt = 926.35 GB/s; speedup = 0.41% Size (6272, 2048); Mismatches: dg = 0 db = 0 out of 2048. reference = 0.1033 (ms); optimized = 0.0826 (ms); bw_opt = 1159.55 GB/s; speedup = 25.09% Size (200, 3072); Mismatches: dg = 0 db = 0 out of 3072. reference = 0.0230 (ms); optimized = 0.0076 (ms); bw_opt = 605.09 GB/s; speedup = 202.51% Size (1000, 3072); Mismatches: dg = 0 db = 0 out of 3072. reference = 0.0207 (ms); optimized = 0.0213 (ms); bw_opt = 1076.45 GB/s; speedup = -2.69% Size (6000, 3072); Mismatches: dg = 0 db = 0 out of 3072. 
reference = 0.1198 (ms); optimized = 0.1274 (ms); bw_opt = 1078.58 GB/s; speedup = -5.95%
Size (6272, 3072); Mismatches: dg = 0 db = 0 out of 3072.
reference = 0.1293 (ms); optimized = 0.1189 (ms); bw_opt = 1207.95 GB/s; speedup = 8.76%
Average speedup = 52.88%
```
For additional numerical validation, the following script was used:
```
def run_model_on_device(fs, X, gO, device_string, numeric_type):
    ln = torch.nn.LayerNorm((fs,), device=device_string, dtype=numeric_type)
    ln.reset_parameters()
    X.grad = None
    ln.zero_grad(set_to_none=True)
    out = ln(X)
    out.backward(gO)
    return (ln.weight.grad, ln.bias.grad)

def run_correctness_test(eps_weight, eps_bias):
    dtype = torch.float
    for fs in (512, 1024, 2048, 4096, 8192, 10000, 500, 1000, 2001, 4005, 8117):
        for bs in (512, 1024, 2048, 4096, 525, 1033, 2064, 3000):
            mean_adjustment = torch.randn(fs, device="cpu", dtype=torch.float)
            X = mean_adjustment * torch.randn(
                bs, fs, device="cpu", dtype=torch.float, requires_grad=True
            )
            X = X.detach().requires_grad_()
            gO = torch.rand_like(X)
            X_gpu = X.to("cuda")
            X_gpu = X_gpu.detach().requires_grad_()
            gO_gpu = gO.to("cuda")
            gO_gpu = gO_gpu.detach().requires_grad_()
            grad_cpu_ref = run_model_on_device(fs, X, gO, "cpu", dtype)
            grad_gpu = run_model_on_device(fs, X_gpu, gO_gpu, "cuda", dtype)
            weight_grad_gpu_target = grad_gpu[0].detach().to("cpu")
            bias_grad_gpu_target = grad_gpu[1].detach().to("cpu")
            weight_delta = torch.abs(grad_cpu_ref[0] - weight_grad_gpu_target)
            weight_mismatches = (weight_delta >= eps_weight).nonzero()
            weight_mismatch_pct = len(weight_mismatches) / len(weight_delta) * 100
            bias_delta = torch.abs(grad_cpu_ref[1] - bias_grad_gpu_target)
            bias_mismatches = (bias_delta >= eps_bias).nonzero()
            bias_mismatch_pct = len(bias_mismatches) / len(bias_delta) * 100
            print(
                "Size ({} x {}) mismatch percentage: weight {:3.2f} bias {:3.2f}".format(
                    fs, bs, weight_mismatch_pct, bias_mismatch_pct
                )
            )
```
The `NVFuserTest.FusionMagicSchedulerLayerNormBackward_CUDA` test also does additional numerical validation and it passes.
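(For reference on #87814 above: the reduction these kernels compute corresponds to the eager-mode formulation below -- a sketch of the math only, not of the CUDA kernel.)

```python
import torch

def gamma_beta_backward(dY, X, mean, rstd):
    # dgamma[j] = sum_i dY[i, j] * (X[i, j] - mean[i]) * rstd[i]
    # dbeta[j]  = sum_i dY[i, j]
    x_hat = (X - mean[:, None]) * rstd[:, None]
    return (dY * x_hat).sum(dim=0), dY.sum(dim=0)
```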
Differential Revision: D40730981 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87814 Approved by: https://github.com/weiwangmeta commit 449778a939f2adc8867c5035b08be4e2d88339d8 Author: Kazuaki Ishizaki Date: Thu Oct 27 00:01:10 2022 +0000 Fix typos under .github directory (#87828) This PR fixes typos in `.md` files under .github directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/87828 Approved by: https://github.com/clee2000 commit 2c66889f90aa9ef2dbe44d0a39878591002e990b Author: wchen61 <183351030@qq.com> Date: Wed Oct 26 23:44:13 2022 +0000 Synchronize before change cuda stream (#82050) (#82056) Summary: Fixes https://github.com/pytorch/pytorch/issues/82050 Need synchronize before change cuda stream Pull Request resolved: https://github.com/pytorch/pytorch/pull/82056 Approved by: https://github.com/ngimel commit 59b9d29260ac59c608d534175dba65e372201955 Author: Nikita Karetnikov Date: Wed Oct 26 07:36:02 2022 +0200 [primTorch] Check `error_regex` in `test_python_ref_errors` (#86987) cc @ezyang @mruberry @ngimel @Lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/86987 Approved by: https://github.com/lezcano, https://github.com/mruberry commit 5ee5f5ac1b6c300cb604d33e1501a78107b9bd58 Author: Nikita Shulga Date: Wed Oct 26 23:16:29 2022 +0000 [BE] Don't build CUDA-10.2 docker images (#87819) As CUDA-10.2 should not longer be used in CI/CD Test Plan: ` grep cuda10.2 .github -R|grep -v mock` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87819 Approved by: https://github.com/kit1980, https://github.com/ZainRizvi commit 3208c2f6bd1218398a18a3df91575cdda6e65e24 Author: Driss Guessous Date: Wed Oct 26 22:42:39 2022 +0000 Add logging for nested tensor usage tracking (#87632) Add logging message so that we can track nested tensor adoption. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87632 Approved by: https://github.com/cpuhrsch commit 536474e82394e617335e97806032c39d24387730 Author: Jiewen Tan Date: Wed Oct 26 22:41:19 2022 +0000 [LTC] Remove tensor.storage_ (#87645) Summary: Since LTC now supports functionalization, we don't need to fake a storage to support is_alias_of anymore. Let's remove it. 
Test Plan: ./build/bin/test_lazy --gtest_filter=LazyOpsTest.IsAliasOf Pull Request resolved: https://github.com/pytorch/pytorch/pull/87645 Approved by: https://github.com/JackCaoG, https://github.com/bdhirsh commit 5edbc926834327d471da505aca902180d30ff991 Author: Catherine Lee Date: Wed Oct 26 22:10:10 2022 +0000 print stderr for ghstack rebase (#87795) current output tends to be empty on failure, which makes it hard to debug Pull Request resolved: https://github.com/pytorch/pytorch/pull/87795 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi commit 91c95ff7c57260647d12d4e4e4c8de82bce12fa2 Author: Will Constable Date: Wed Oct 26 04:34:41 2022 +0000 Enable graph_split_inductor test as it runs now (#87762) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87762 Approved by: https://github.com/davidberard98 commit 53c640a5283c82cdd37cd29e7975627d02d094ec Author: Nikita Shulga Date: Wed Oct 26 21:51:13 2022 +0000 [CI] Delete `nnpack` installation from conda (#87813) Not sure why it was there to begin with and I really hope none of our CI depend on the package that was last updated 5 years ago, see https://anaconda.org/killeent/nnpack Pull Request resolved: https://github.com/pytorch/pytorch/pull/87813 Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/ZainRizvi commit 1522946882fee9e4d8c20e143a58d7074cc2efd4 Author: Cameron Voisey Date: Wed Oct 26 21:34:13 2022 +0000 Simplify installation instruction in contributing file (#87460) Simplification of one of the installation instructions in CONTRIBUTING.md that I found tricky to parse at first. Also adds a link to the "Make no-op build fast" section to make it easier to navigate to. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87460 Approved by: https://github.com/ngimel commit adb76ef510eb645af71292f23f5d2d560d92e936 Author: soulitzer Date: Wed Oct 26 13:34:34 2022 -0400 Expose API for backward execution order (#87507) In this PR: - graph_task stores graph roots on construction so that we can later traverse through the graph - before the nodes are returned, they needed to be converted from raw_ptr to shared_ptr, and this should be OK because the graph is guaranteed to be alive Pull Request resolved: https://github.com/pytorch/pytorch/pull/87507 Approved by: https://github.com/albanD commit 926827b89cc3eda268df2a54be6d96a150eb506c Author: PyTorch MergeBot Date: Wed Oct 26 21:01:09 2022 +0000 Revert "Disable linux-bionic-py3_7-clang8-xla-test (#87737)" This reverts commit 21f7e7d040c646b4ce7f4a4e973da97660462bdc. Reverted https://github.com/pytorch/pytorch/pull/87737 on behalf of https://github.com/kit1980 due to Re-enable XLA tests after https://github.com/pytorch/pytorch/pull/87818 commit 71933d381b7c021dfa1818e05539a1910fe95296 Author: Zafar Date: Wed Oct 26 20:55:10 2022 +0000 [ao] Fixing tests for block pruning shapes (#87326) The current unittests were only checking the tensors whose shapes were already multiples of the block size. That caused some hidden bugs to creep in. Specifically, for the shapes that would require padding for the mask/data, the sparsifier would try to apply shape-mismatching tensors onto each other. This caused segfaults as well as silent failures. This makes minor adjustments to the code to make sure the masks and data shapes are aligned, as well as fixing the tests to catch this. 
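(Illustration for #87326 above: a hedged sketch of aligning a mask to the block size by padding, so block-wise mask/data operations never see mismatched shapes. The helper name and block size are hypothetical.)

```python
import math
import torch
import torch.nn.functional as F

def pad_mask_to_block(mask: torch.Tensor, block: int = 4) -> torch.Tensor:
    rows, cols = mask.shape
    pad_rows = math.ceil(rows / block) * block - rows
    pad_cols = math.ceil(cols / block) * block - cols
    # F.pad on a 2-D tensor takes (left, right, top, bottom)
    return F.pad(mask, (0, pad_cols, 0, pad_rows), value=0)

# e.g. a (5, 7) mask becomes (8, 8) and now tiles cleanly into 4x4 blocks
padded = pad_mask_to_block(torch.ones(5, 7), block=4)
```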
Test Plan: ```python python test/test_ao_sparsity.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87326 Approved by: https://github.com/jcaip commit 1168f427909df47c4c2afa3e9ecc3f4eef5c7af8 Author: Sergii Dymchenko Date: Wed Oct 26 20:54:25 2022 +0000 Update XLA hash (#87818) This is a re-creation of https://github.com/pytorch/pytorch/pull/87808 so we don't have to wait. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87818 Approved by: https://github.com/clee2000 commit bbcd4b2f2f2cabbef7c2bcec494795d32f830cdb Author: Bin Bao Date: Wed Oct 26 13:59:07 2022 +0000 Clean up CPU test in test_torchinductor.py for fbcode (#87783) cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87783 Approved by: https://github.com/bertmaher commit 88eff1072290177221e7a09d792f7f135b4c83ca Author: Justin Chu Date: Wed Oct 26 20:42:06 2022 +0000 [ONNX] Deprecate operators.py (#87798) Deprecate `torch.onnx.operators` because it's only for backwards compatibility Pull Request resolved: https://github.com/pytorch/pytorch/pull/87798 Approved by: https://github.com/BowenBao commit b21fe312c0f4cbc17e957010f22b2a8eaa0825e9 Author: Sherlock Huang Date: Wed Oct 26 17:38:05 2022 +0000 Fix meta for index_add and index_put (#87775) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87775 Approved by: https://github.com/ezyang, https://github.com/ngimel commit 8016fd9eb10f5a933debb323149f9c0e5634cc9b Author: Huy Do Date: Wed Oct 26 20:08:29 2022 +0000 Set check-latest to false when setup python and pip cache in CI (#87621) I missed the fine print in https://github.com/actions/setup-python/blob/main/README.md#caching-packages-dependencies when setting up the cache using setup-python GHA > Restored cache will not be used if the requirements.txt file is not updated for a long time and a newer version of the dependency is available which can lead to an increase in total build time. The latter part is important because it implies that even with the cache, pip will still try to check if a newer version exists and that part can be flaky, i.e. https://github.com/pytorch/pytorch/actions/runs/3313764038/jobs/5472180293 This undesired behavior can be turned off by setting the advance option `check-latest` to false https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#check-latest-version. Per my understanding, this should tell pip install in these workflows to use the local cached copy of the package avoiding the need to query pypi every single time. `check-latest` was added quite recently https://github.com/actions/setup-python/pull/406, so `actionlint-1.6.15` fails to recognize it. Thus, this PR also upgrades `actionlint` to the latest 1.6.21 to pass the linter check. Here is an example error from 1.6.15 from https://github.com/pytorch/pytorch/actions/runs/3315388073/jobs/5475918454: ``` >>> Lint for .github/workflows/lint.yml: Error (ACTIONLINT) [action] input "check-latest" is not defined in action "actions/setup-python@v4". 
available inputs are "architecture", "cache", "cache-dependency-path", "python-version", "python-version-file", "token" 25 | with: 26 | python-version: 3.8 27 | architecture: x64 >>> 28 | check-latest: false 29 | cache: pip 30 | cache-dependency-path: | 31 | **/.github/requirements-gha-cache.txt ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87621 Approved by: https://github.com/ZainRizvi commit 5f4329134e30b8ce86db05388ebe55f3ab3a7099 Author: PyTorch MergeBot Date: Wed Oct 26 19:40:51 2022 +0000 Revert "Set check-latest to false when setup python and pip cache in CI (#87621)" This reverts commit 4080b1db284fd531654bcb2984a7fe0ff3b310cd. Reverted https://github.com/pytorch/pytorch/pull/87621 on behalf of https://github.com/huydhn due to Somehow setup-python treats Python 3.10 as Python 3.1 in pr-label.yml. I missed this signal because this is only run at push commit 38dd4cbdf1dc982492a0cc94a54eb2f71c31d8fe Author: jpvillam Date: Wed Oct 26 19:39:21 2022 +0000 ROCm enable sparse_sampled_addmm (#86401) Enables: test_comprehensive_sparse_sampled_addmm_cuda_complex128 test_comprehensive_sparse_sampled_addmm_cuda_complex64 test_comprehensive_sparse_sampled_addmm_cuda_float32 test_comprehensive_sparse_sampled_addmm_cuda_float64 test_dispatch_meta_sparse_sampled_addmm_cuda_complex128 test_dispatch_meta_sparse_sampled_addmm_cuda_complex64 test_dispatch_meta_sparse_sampled_addmm_cuda_float32 test_dispatch_meta_sparse_sampled_addmm_cuda_float64 test_meta_sparse_sampled_addmm_cuda_complex128 test_meta_sparse_sampled_addmm_cuda_complex64 test_meta_sparse_sampled_addmm_cuda_float32 test_meta_sparse_sampled_addmm_cuda_float64 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86401 Approved by: https://github.com/ngimel commit 123b103bf101682e670c96ab505b6eb8475e8657 Author: Will Constable Date: Wed Oct 26 04:34:41 2022 +0000 Add dynamo_optimize_ddp arg to dist bench (#87768) cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87768 Approved by: https://github.com/davidberard98 commit aa66c6e01e16fe7012f0d27246b8159eb85e89aa Author: Will Constable Date: Wed Oct 26 04:34:38 2022 +0000 Fix missing weight init and clean up helper (#87760) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87760 Approved by: https://github.com/davidberard98 commit 58dc95b321631f40d2f18915f7cb6a68bdbd8607 Author: Kazuaki Ishizaki Date: Wed Oct 26 19:29:05 2022 +0000 Fix typos under aten directory (#87754) This PR fixes typos in `.md` files under aten directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/87754 Approved by: https://github.com/kit1980 commit 4080b1db284fd531654bcb2984a7fe0ff3b310cd Author: Huy Do Date: Wed Oct 26 19:23:55 2022 +0000 Set check-latest to false when setup python and pip cache in CI (#87621) I missed the fine print in https://github.com/actions/setup-python/blob/main/README.md#caching-packages-dependencies when setting up the cache using setup-python GHA > Restored cache will not be used if the requirements.txt file is not updated for a long time and a newer version of the dependency is available which can lead to an increase in total build time. The latter part is important because it implies that even with the cache, pip will still try to check if a newer version exists and that part can be flaky, i.e. 
https://github.com/pytorch/pytorch/actions/runs/3313764038/jobs/5472180293 This undesired behavior can be turned off by setting the advance option `check-latest` to false https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#check-latest-version. Per my understanding, this should tell pip install in these workflows to use the local cached copy of the package avoiding the need to query pypi every single time. `check-latest` was added quite recently https://github.com/actions/setup-python/pull/406, so `actionlint-1.6.15` fails to recognize it. Thus, this PR also upgrades `actionlint` to the latest 1.6.21 to pass the linter check. Here is an example error from 1.6.15 from https://github.com/pytorch/pytorch/actions/runs/3315388073/jobs/5475918454: ``` >>> Lint for .github/workflows/lint.yml: Error (ACTIONLINT) [action] input "check-latest" is not defined in action "actions/setup-python@v4". available inputs are "architecture", "cache", "cache-dependency-path", "python-version", "python-version-file", "token" 25 | with: 26 | python-version: 3.8 27 | architecture: x64 >>> 28 | check-latest: false 29 | cache: pip 30 | cache-dependency-path: | 31 | **/.github/requirements-gha-cache.txt ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87621 Approved by: https://github.com/ZainRizvi commit 2c1efe7472079fbeeb1ee9415db83851d8276c93 Author: Bin Bao Date: Wed Oct 26 16:13:20 2022 +0000 Enable some PyTorch core tests with inductor (#87490) Summary: 1) Graph break on torch.random.set_rng_state since it blocks running inductor core tests; 2) Add several inductor-specific skips; 3) Enable several core tests for inductor CI; cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87490 Approved by: https://github.com/eellison commit f7a04f310b76438448df758a3c9c2bf91b704a11 Author: HDCharles Date: Tue Oct 25 22:15:46 2022 -0700 [ao][ns] Replacing List[QConfigMapping] in PNP (#86922) Summary: Added QConfigMultiMapping which is essentially a List[QConfigMapping] with set methods and dedicated handling to avoid unwanted matches and improve UX. note: the from __future__ import annotations line caused weird errors when the QConfigMultiMapping class was put in _numeric_suite_fx.py so it was moved. Test Plan: python test/test_quantization.py TestFxNumericSuiteNShadows Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86922 Approved by: https://github.com/vkuzo commit 9639cb83ebd147d1a8ef7fa17855be6b69b040e6 Author: PyTorch MergeBot Date: Wed Oct 26 18:51:36 2022 +0000 Revert "[pytorch] Layer norm backward speed gain with warp shuffles (#87445)" This reverts commit b6f28334bc3276a56d79dea6cb7ed99411556348. Reverted https://github.com/pytorch/pytorch/pull/87445 on behalf of https://github.com/weiwangmeta due to breaking internal builds due to MS compiler commit 585d71513de98f02659835b08785de845bc6d348 Author: Ethan Pronovost Date: Wed Oct 26 18:50:48 2022 +0000 Add type annotations to distribution.py (#87577) As title. 
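Since #87577 only adds annotations, behavior is unchanged; the sketch below is a hypothetical call site of the now-annotated `Distribution` API, included only to show where the hints help static checkers.
```python
# A usage sketch (not from the diff): Distribution methods exercised on a small
# Normal instance; with annotations, rsample/log_prob are known to return Tensors.
import torch
from torch.distributions import Normal

d = Normal(loc=torch.zeros(3), scale=torch.ones(3))
x = d.rsample()        # Tensor, shape (3,)
logp = d.log_prob(x)   # Tensor, shape (3,)
print(x.shape, logp.shape)
```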
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87577 Approved by: https://github.com/kit1980 commit 16e35bd179f101b8e0d266550e039bbbad513892 Author: arnaudstiegler Date: Wed Oct 26 17:45:46 2022 +0000 Adding expm1 to MPS (#87147) Fixes #86744 - Implementing the new `expm1_out_mps` function in `aten/src/ATen/native/mps/operations/UnaryOps.mm` - Adding it to `aten/src/ATen/native/native_functions.yaml` - Adding it to existing `test.test_mps.TestNLLLoss.test_unary_ops` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87147 Approved by: https://github.com/kulinseth commit 493ff6ac5bf66ead6fd53af5881ad7ae1795c5e8 Author: Sergii Dymchenko Date: Wed Oct 26 17:43:35 2022 +0000 Install py for pytest-sugar (#87803) linux-focal-py3.7-clang10-onnx / test is failng, the issue is https://github.com/Teemu/pytest-sugar/issues/241 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87803 Approved by: https://github.com/seemethere, https://github.com/huydhn commit e2e428b03cdc9a0d206c72af31bca6e3c98d48b3 Author: albanD Date: Wed Oct 26 10:26:44 2022 -0400 Remove custom Ceil in favor of sympy.ceiling (#87294) [Alban]: the other changes that used to be in this PR (neg and fix for true div) are moved to other places where they already exist. Namely neg is already in master and true div will be in the next PR on the stack where all other functions are fixed at the same time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87294 Approved by: https://github.com/ezyang commit 777e6a2c5100f3274cff1bcf7e47ccbe1a651927 Author: albanD Date: Wed Oct 26 10:26:44 2022 -0400 Many symintifications (#87604) Adds expand_inplace conv conv_double_backward convolution adaptive_avg_pool2d_symint _embedding_bag_backward_symint cudnn_grid_sampler cuda 32 bit indexing nll_loss / nll_loss_2d tensor split pooling same mode cudnn_is_acceptable storage nbytes Pull Request resolved: https://github.com/pytorch/pytorch/pull/87604 Approved by: https://github.com/ezyang commit ae4fbac819992a76af87c8d800fecf3ace707b54 Author: Ivan Yashchuk Date: Wed Oct 26 17:00:02 2022 +0000 Enable nvprims.transpose fusions for nvFuser (#86967) This PR allows transposes to be fused with other operations. If a fusion group is formed only from operations that just manipulate metadata in PyTorch (transpose, view, etc.) then this group is not sent to nvFuser. On top of that if we have converted to `nvprims` but then decided to not form a fusion group we modify the graph use `prim.impl_aten` attribute instead of calling `prim(*args, **kwargs)` that has a higher overhead. cc @kevinstephano @jjsjann123 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86967 Approved by: https://github.com/jjsjann123, https://github.com/SherlockNoMad commit ac0c13f665aa14c99837779580da74f01d9b96ab Author: PyTorch MergeBot Date: Wed Oct 26 16:43:13 2022 +0000 Revert "[ROCm] Use -rpath-link to fix libtinfo conflict (#83552)" This reverts commit a10446c4d826ae5505fa129ea9800d3924b25364. 
Reverted https://github.com/pytorch/pytorch/pull/83552 on behalf of https://github.com/kit1980 due to Broke ios/macos builds https://github.com/pytorch/pytorch/actions/runs/3329991911/jobs/5507911292 commit 701b3dd77380bb0f7e696c9511b8ee765488687d Author: Rohan Varma Date: Wed Oct 26 16:20:46 2022 +0000 optim utils all_gather_into_tensor (#87769) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/87769 Approved by: https://github.com/awgu commit 642b63e1e788b26b270fc7f20865460b012bde1f Author: Richard Zou Date: Tue Oct 25 06:58:11 2022 -0700 Add test that `import torch` doesn't modify global logging state (#87629) Fixes https://github.com/pytorch/pytorch/issues/87626 Also adds the same test for `import functorch`. Users have complained at us when we do modify the global logging state, which has happened in the past. Test Plan: - tested locally; I added `logging.basicConfig` to `torch/__init__.py` and checked that the test got triggered Pull Request resolved: https://github.com/pytorch/pytorch/pull/87629 Approved by: https://github.com/albanD commit 422f946b8c6e7dba89f9277ac12f847713545856 Author: Chien-Chin Huang Date: Tue Oct 25 22:59:58 2022 +0000 [FSDP][BE] Improve the assert message of sharded load_state_dict (#87486) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87486 Approved by: https://github.com/awgu commit c2ef5c4f7ee894e12e44af6d6aa2c4972cf71025 Author: Pruthvi Madugundu Date: Wed Oct 26 15:34:38 2022 +0000 [ROCm] Move ROCm CI build to python 3.8 version (#86677) Currently it is python 3.7 want to upgrade to python 3.8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86677 Approved by: https://github.com/malfet commit 775fef51b76bd8fb323d75b9dd532446e3598d25 Author: Antoni Viros i Martin Date: Wed Oct 26 14:48:27 2022 +0000 Implement copy_, fill_, and ones_like for Nested Tensors backends (#87728) Summary: This diff implements copy_ in order to allow pinned memory transfers for nested tensors, as well as fill_ and ones_like, to test whether nested tensors can be created with other factory functions. Test Plan: Pass all CI and sandcastle jobs. Reviewed By: mikekgfb Differential Revision: D40689594 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87728 Approved by: https://github.com/cpuhrsch commit a10446c4d826ae5505fa129ea9800d3924b25364 Author: Jithun Nair Date: Wed Oct 26 14:40:29 2022 +0000 [ROCm] Use -rpath-link to fix libtinfo conflict (#83552) Fixes issue building PyTorch for ROCm5.3 and above on Ubuntu20.04 because libtinfo6 from conda conflicts with the one from the distro causing symbol not found errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83552 Approved by: https://github.com/malfet commit ed7a8ab4361e47b9d64d9680561f350565ca3a7b Author: Mike Iovine Date: Wed Oct 26 14:34:29 2022 +0000 [Static Runtime] Make canEnableStaticRuntime examine sub-blocks (#87396) Summary: Someone was running into problems where 1) Static Runtime enablement would fail 2) We would try to fall back to the JIT interpreter *after trying to create `StaticModule`* 3) The fallback fails because Static Runtime mangled the graph. We don't want to prevent Static Runtime from mutating its input due to memory concerns. The intent of `canEnableStaticRuntime` is to catch issues in the module before Static Runtime messes with it. With this diff, `StaticModule` instantiation can be avoided by querying `canEnableStaticRuntime` and the issue is fixed. 
Test Plan: New unit test Differential Revision: D40564452 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87396 Approved by: https://github.com/tenpercent commit 72f446b9bc394d4b39ce5038a4087990b5ac7852 Author: Ivan Yashchuk Date: Wed Oct 26 14:18:46 2022 +0000 Remove getitem special handling in the partitioner (#87073) This special handling of getitem unnecessary splits fusions at functions with tuple outputs. Example script: ```py import torch from torch.fx.passes.infra.partitioner import CapabilityBasedPartitioner from torch._prims.nvfuser_executor import NvfuserPrimOperatorSupport from torch.fx.experimental.proxy_tensor import make_fx def func(x): xx = torch.ops.nvprims.add(x, 1) var, mean = torch.ops.nvprims.var_mean(x, correction=0) var_cos = torch.ops.nvprims.cos(var) mean_sin = torch.ops.nvprims.sin(mean) return torch.ops.nvprims.add(var_cos, mean_sin) a = torch.randn(5, 3, 3, device="cuda") gm = make_fx(func)(a) gm.graph.print_tabular() supported_ops = NvfuserPrimOperatorSupport() partitioner = CapabilityBasedPartitioner( gm, supported_ops, allows_single_node_partition=False ) partitions = partitioner.propose_partitions() print(partitions) partitioned_graph = partitioner.fuse_partitions(partitions) partitioned_graph.graph.print_tabular() ``` Output on master: ```py opcode name target args kwargs ------------- --------- --------------------------- ---------------- ----------------- placeholder x_1 x_1 () {} call_function add nvprims.add.default (x_1, 1) {} call_function var_mean nvprims.var_mean.main (x_1, [0, 1, 2]) {'correction': 0} call_function getitem (var_mean, 0) {} call_function getitem_1 (var_mean, 1) {} call_function cos nvprims.cos.default (getitem,) {} call_function sin nvprims.sin.default (getitem_1,) {} call_function add_1 nvprims.add.default (cos, sin) {} output output output (add_1,) {} [{cos, sin, add_1}, {var_mean, add, getitem, getitem_1}] opcode name target args kwargs ------------- --------- --------------------------- ---------------------- -------- placeholder x_1 x_1 () {} call_module fused_1 fused_1 (x_1,) {} call_function getitem_2 (fused_1, 0) {} call_function getitem_3 (fused_1, 1) {} call_module fused_0 fused_0 (getitem_2, getitem_3) {} output output output (fused_0,) {} ``` Output with this PR: ``` [{var_mean, add_1, cos, sin, add, getitem_1, getitem}] opcode name target args kwargs ----------- ------- -------- ---------- -------- placeholder x_1 x_1 () {} call_module fused_0 fused_0 (x_1,) {} output output output (fused_0,) {} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87073 Approved by: https://github.com/jjsjann123, https://github.com/SherlockNoMad commit 59aacc40ca2248a18af385cd30831ee785bbb684 Author: Natalia Gimelshein Date: Wed Oct 26 06:33:43 2022 +0000 Couple fixes for argmax/argmin (#87758) Removes a wrong assert, makes min number of warps = 2 (1 for some reason generates invalid code, https://github.com/openai/triton/issues/802). 
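A minimal repro-style sketch for the argmax/argmin lowering above, not the original failing case: the shapes are made up, and it assumes a CUDA device plus the `torch._dynamo` entry point with the inductor backend.
```python
# Hypothetical shapes; requires CUDA for the Triton-generated reduction kernels.
import torch
import torch._dynamo as dynamo

@dynamo.optimize("inductor")
def reduce_idx(x):
    return x.argmax(dim=-1), x.argmin(dim=-1)

if torch.cuda.is_available():
    idx_max, idx_min = reduce_idx(torch.randn(64, 128, device="cuda"))
    print(idx_max.shape, idx_min.shape)  # torch.Size([64]) torch.Size([64])
```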
Hopefully fixes https://github.com/pytorch/torchdynamo/issues/1743, cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @mreso Pull Request resolved: https://github.com/pytorch/pytorch/pull/87758 Approved by: https://github.com/Chillee, https://github.com/soumith commit 0294787bd6286c8672f4659bd7d7ddca3c3a14c3 Author: Charlie Yan Date: Wed Oct 26 00:32:13 2022 +0000 Format distributed.py (#87667) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87667 Approved by: https://github.com/zhaojuanmao commit a24635208bce5030cb1d9fdd2f66d3b6abd9dbef Author: Yanbo Liang Date: Wed Oct 26 05:40:25 2022 +0000 [Inductor] update triton commit pin (#87732) Fixes https://github.com/pytorch/torchdynamo/issues/1746 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87732 Approved by: https://github.com/ngimel commit 02797db24f137961305c2a9a670bb3667059ba15 Author: PyTorch MergeBot Date: Wed Oct 26 05:09:39 2022 +0000 [vision hash update] update the pinned vision hash (#87744) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87744 Approved by: https://github.com/pytorchbot commit 0d13ffbbae0ae12e72ed8856ccdd822bf840344c Author: Zachary DeVito Date: Tue Oct 25 19:47:30 2022 +0000 [inductor] Fix finalization issues when using multiprocessing (#87725) If python was launched with 'spawn' it will not use the standard shutdown methods that concurrent.futures requires. So we register a shutdown with the method it does uses. Without this, shutdown hangs since the workers will not exit. cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87725 Approved by: https://github.com/wconstab commit 8a6a126182bfa21af9868d17478a099a2b18c6d3 Author: Chien-Chin Huang Date: Tue Oct 25 22:59:57 2022 +0000 [FSDP][BE] Split state_dict related hooks to a separate file to reduce development conflicts (#87421) This PR does following two things to improve the code quality. 1. Split state_dict related hooks to a separate file to reduce development conflicts. 2. Remove unused APIs. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87421 Approved by: https://github.com/rohan-varma commit 82c8365c16a63bed7a9f6937b49f5b45a83cc32b Author: Nikita Shulga Date: Wed Oct 26 03:31:54 2022 +0000 [BE] Delete `TH_DISALLOW_COPY_AND_ASSIGN` (#87743) Replace it with `AT_DISALLOW_COPY_AND_ASSIGN` and delete the header that contained this define Pull Request resolved: https://github.com/pytorch/pytorch/pull/87743 Approved by: https://github.com/atalman, https://github.com/ngimel commit 354549e0337a18f99b21aae7a48d4af1e54e0f97 Author: Nikita Shulga Date: Wed Oct 26 03:30:45 2022 +0000 [MPS] Use `bandPartWithTensor:numLowerTensor:...` (#87752) To make it uniform with the rest of usage of this op throughout MPS codebase Pull Request resolved: https://github.com/pytorch/pytorch/pull/87752 Approved by: https://github.com/kulinseth commit de65f156ed6595f0748ff03d27928ddeee3695af Author: Shen Li Date: Tue Oct 25 22:30:54 2022 +0000 Add distributed composable API contract (#87580) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87580 Approved by: https://github.com/yhcharles commit 9c2555f018c4a4b64500dce3c37b4dcdc48d0c0b Author: Huy Do Date: Wed Oct 26 02:28:36 2022 +0000 Upgrade CI binary build runner from 4x to 12xlarge (#87727) It currently takes a whopping 2h30m just to build PyTorch binary for every PR and commit. Pushing it to 12xlarge reduces the time to 1h40m https://github.com/pytorch/pytorch/actions/runs/3323869550/jobs/5494754029, not exactly a linear (and fair) trade, but good enough to reduce this long pole. I'll monitor the queue for 12xlarge after this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87727 Approved by: https://github.com/kit1980, https://github.com/malfet commit 85a79a7f506dadbc269cf2abb7536f64ab49074d Author: Justin Chu Date: Wed Oct 26 00:39:59 2022 +0000 [ONNX] Expand `_cast_` symbolic functions (#87666) The `_cast_` family of symbolic functions has been created from a template function. Even though it saved some lines, it very much obscured the intention of the code. Since the list doesn't really change and the `_cast_` family are IIRC deprecated, it is safe for us to expand the templates and make the code more readable. This PR also removes any direct calls to `_cast_` functions to maintain a consistent pattern of directly creating `Cast` nodes. 
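To make the Cast behavior concrete, here is a hedged export sketch (module, input, and file name are all made up): a forward pass containing an explicit dtype conversion, which the symbolic functions above lower to an ONNX `Cast` node.
```python
import torch

class CastModel(torch.nn.Module):
    def forward(self, x):
        # .to(torch.float16) is the kind of op exported via a Cast node
        return x.to(torch.float16) * 2

torch.onnx.export(CastModel(), torch.randn(2, 3), "cast_model.onnx", opset_version=13)
```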
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87666 Approved by: https://github.com/BowenBao commit 63397ac3f9402e05f1795f35bb381c236dadd1d4 Author: Sergii Dymchenko Date: Wed Oct 26 00:26:44 2022 +0000 Disable ossf-scorecard (#87740) Disable as it frequently fails https://github.com/pytorch/pytorch/actions/runs/3325113107/jobs/5497443452 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87740 Approved by: https://github.com/huydhn commit c600ce39ed20f7a6d6fb5a1d62ffac573b760db6 Author: Justin Chu Date: Tue Oct 25 23:07:12 2022 +0000 [ONNX] Refactor UnsupportedOperatorError arguments (#85349) Merged the first two arguments because we always use qualified names to identify symbolic functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/85349 Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao commit 57b36bf353b4776954ea02d16f6051a72d46c532 Author: Bin Bao Date: Tue Oct 25 20:21:16 2022 +0000 Bring back TIMM model inductor CI test (#87730) Summary: https://github.com/pytorch/pytorch/pull/87588 has solved the inductor compilation speed regression, so we can try to run TIMM models with fewer shards and also enable pretained model downloading which should resolve the flakyness we have seen previously. cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87730 Approved by: https://github.com/anijain2305 commit 85ffbedfb2a2bffda220e3fb73dc311dba5e7fed Author: Richard Barnes Date: Wed Oct 26 00:07:44 2022 +0000 Strip GCC5 stuff from PyTorch (#85914) [This file](https://github.com/pytorch/pytorch/pull/63208/files) indicates that we don't support anything less than GCC 7.5. Given that, let's remove this GCC 5 stuff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85914 Approved by: https://github.com/ezyang commit 21f7e7d040c646b4ce7f4a4e973da97660462bdc Author: Sergii Dymchenko Date: Wed Oct 26 00:03:24 2022 +0000 Disable linux-bionic-py3_7-clang8-xla-test (#87737) pull / linux-bionic-py3_7-clang8-xla / test fails with strange sudo npm install -g bazels3cache node: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.28' not found (required by node) https://github.com/pytorch/pytorch/actions/runs/3324545518/jobs/5496432160 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87737 Approved by: https://github.com/huydhn commit 7ab6f56ca72a5f1b8c7b0c73e3947c0af3f998c8 Author: Jerry Zhang Date: Tue Oct 25 17:39:24 2022 +0000 [quant][core] Add quantize/dequantize ops for decomposed quantized Tensor representation (#87093) Summary: Added q/dq implementation for out of core (decomposed) quantized Tensor representation, meaning that instead of storing quantization parameters (e.g. scale/zero_point) in a separate quantized Tensor object, we will store quantization parameters in the argument of operators. 
``` quantize(float32_tensor, scale, zero_point, dtype) -> int8_tensor dequantize(int8_tensor, scale, zero_point, dtype) -> float32_tensor ``` Test Plan: python test/test_quantization.py TestQuantizedTensor.test_decomposed_quantize python test/test_quantization.py TestQuantizedTensor.test_decomposed_dequantize Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/87093 Approved by: https://github.com/dzdang, https://github.com/z-a-f commit 4a168e994146161f9b3113f4dca44255f783e066 Author: Max Podkorytov Date: Tue Oct 25 23:48:16 2022 +0000 [static-runtime] run codegen (#87534) Summary: ``` buck run //caffe2/torch/fb/jit:gen_static_runtime_ops ``` Test Plan: CI Differential Revision: D40612521 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87534 Approved by: https://github.com/mikeiovine commit dd82d936e11d9ce22b477a3433ed9269ac66c385 Author: eqy Date: Tue Oct 25 23:30:30 2022 +0000 [cuDNN][cuDNN V8 API] Use suggest memory format for cuDNN V8 API (#87617) Fixes some failures we observed in `functorch` tests which seemed to stem from benchmark cache collisions on the same memory format. Changing the memory format to be dependent on both input and weight seems to resolve them. CC @crcrpar @ptrblck cc @csarofeen @ptrblck @xwang233 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87617 Approved by: https://github.com/ngimel commit 882a4f4528702a22b9528d97bad920727b8b8b72 Author: Sergii Dymchenko Date: Tue Oct 25 23:29:02 2022 +0000 Update xla.txt (#87739) As per @JackCaoG suggestion to fix the xla tests. This PR replaces https://github.com/pytorch/pytorch/pull/87737, see that for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87739 Approved by: https://github.com/weiwangmeta commit 20c08f299fa6ab839d21b42e4d1fa15b410a314b Author: Rohan Varma Date: Tue Oct 25 13:34:16 2022 -0700 [FSDP][BE] Skip asan (#87729) Per title Differential Revision: [D40690407](https://our.internmc.facebook.com/intern/diff/D40690407/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87729 Approved by: https://github.com/awgu commit bd4c4537dc477b9a4df4cb44c2042a10d31341ab Author: Minh Nguyen Date: Tue Oct 25 22:52:52 2022 +0000 aten cpu and xnnpack to be compatible with arvr mode build (#87125) Summary: When building 3d photo sdk generator package in arvr/mode/mac and arvr/mode/mac-arm modes, we got several issues with aten cpu and xnnpack libraries. The reason is that those packages are using platform-* properties (platform-deps, platform-srcs...) which are not compatible with arvr modes. This diff fixes those issues by using `select` for non-platform properties when is_arvr_mode() is true, while keeping those platform ones for non-arvr modes. 
Test Plan: ``` buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac-arm/dev buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac-arm/opt buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac/dev buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac/opt ``` and sandcastle builds Differential Revision: D40028669 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87125 Approved by: https://github.com/kimishpatel commit a605a30732fd57c900ceb7705e88403e0b591bb1 Author: William Wen Date: Tue Oct 25 22:47:54 2022 +0000 Fix CODE level usage in dynamo config.py (#87522) Fixes https://github.com/pytorch/torchdynamo/issues/1718. Tested by changing `log_level = logging.WARNING` in config.py to `log_level = logging.CODE` and running a test script that doesn't touch `log_level`. cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87522 Approved by: https://github.com/mlazos commit e150a6212b31bc3bb088a821c82943207060b6eb Author: Horace He Date: Tue Oct 25 18:49:25 2022 +0000 Added gm.print_readable to torchinductor_trace output (#87717) cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87717 Approved by: https://github.com/ngimel commit b013eb5447c937f03f52c7e3c7cc6cb7b7939a98 Author: maxren Date: Mon Oct 24 15:24:57 2022 -0700 [xnnpack][lite-int][graph-build] graph passes and op checking (#87128) Beginning of building the xnnpack graph from the torchscript IR. We first massage the torchscript graph using a few graph passes that perform things such as unused self argument removal and constant propagation. This also performs tracing for us so that the model does not have to be prepped by tracing before being lowered by us. The other check we perform is through the torchscript IR to identify any nodes that are not lowerable/supported, and throwing an error to spit out the specific nodes that are not lowerable. Differential Revision: [D39838338](https://our.internmc.facebook.com/intern/diff/D39838338/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39838338/)! 
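For the TorchScript pre-processing described in #87128 above (constant propagation and related cleanup passes before lowering), a hedged sketch using the public `torch._C` pass bindings on a toy scripted module; the module itself is hypothetical.
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        scale = 2.0 * 3.0  # constant arithmetic like this is a folding candidate
        return torch.nn.functional.relu(x) * scale

scripted = torch.jit.script(M())
torch._C._jit_pass_constant_propagation(scripted.graph)
print(scripted.graph)  # inspect the cleaned-up IR
```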
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87128 Approved by: https://github.com/salilsdesai commit 44d7ba7efb47ae8dc1713de569221d1f44c6e4b9 Author: Michael Lazos Date: Tue Oct 25 21:55:27 2022 +0000 Fix debug dir bugs and minifier output directories (#87682) Fixes https://github.com/pytorch/torchdynamo/issues/1758, https://github.com/pytorch/torchdynamo/issues/1752 - minifier_launcher.py now dumps checkpoints to \/checkpoints when run - a single debug directory is created per script invocation, asserts failing with no directory will no longer occur - torchinductor debug tracing will correctly dump to the debug directory now since no prior setup is needed, (the directory was incorrectly only initialized during dynamo tracing) cc @jansel @lezcano @fdrocha @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87682 Approved by: https://github.com/ezyang commit ff2569bc8c86f5a64d72ae9232ea59e84a73dd80 Author: Ivan Yashchuk Date: Tue Oct 25 21:53:11 2022 +0000 Intercept aten._reshape_alias for nvFuser (#87072) This would help forming larger fusion groups. If this won't end up executed by nvFuser then eager mode implementation would call into `.reshape`: https://github.com/pytorch/pytorch/blob/37e9e89afbc3554258545a026fab4cd9e1a4b85d/torch/_prims/nvfuser_prims.py#L552-L553 cc @kevinstephano @jjsjann123 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87072 Approved by: https://github.com/ngimel commit a3d495bd4ee3c55d9111668178f20d881459773a Author: Kazuaki Ishizaki Date: Tue Oct 25 21:49:59 2022 +0000 Fix typos under functorch directory (#87663) This PR fixes typos in `.md` and `.rst` files under functorch directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/87663 Approved by: https://github.com/kit1980 commit 0b162f5b494dea3b20540386f06b49840fb347e6 Author: Sherlock Huang Date: Tue Oct 25 04:46:42 2022 +0000 Fix stride for prims.where (#87563) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87563 Approved by: https://github.com/ngimel, https://github.com/mruberry commit bc194948140fc3ea83f596f51c2097c23361ce57 Author: Michael Voznesensky Date: Tue Oct 25 21:15:40 2022 +0000 [Dynamo] Symbolic shape guards (#87570) **Introduces symbolic shape guards into dynamo.** In this PR, we take the existing fake tensor infra and plumbing in dynamo and we start passing a shape_env around. This shape_env does not get plumbed down to middle layers / backend yet - it only collects expressions from frontend invocations at the moment. We then translate these expressions into guards at the point where we take other guards installed throughout dynamo - and add them to check_fn. 
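A hedged sketch of the behavior these guards protect (function and shapes are invented): a shape-dependent branch compiled through `torch._dynamo`, where the installed check_fn must guard on `x.shape[0]` so that a call with a different batch size recompiles instead of reusing the wrong graph.
```python
import torch
import torch._dynamo as dynamo

def f(x):
    if x.shape[0] > 2:   # shape-dependent control flow -> needs a shape guard
        return x * 2
    return x + 1

compiled = dynamo.optimize("eager")(f)
compiled(torch.randn(4, 3))   # traces the `> 2` branch, installs guards
compiled(torch.randn(1, 3))   # guard fails, the other branch is compiled
```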
Part 1 of https://docs.google.com/document/d/1QJ-M4zfMkD-fjHIqW089RptjLl9EgozZGCceUbvmgfY/edit# cc @jansel @lezcano @fdrocha @mlazos @soumith @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87570 Approved by: https://github.com/ezyang commit d0e12d1cc8b08ea8770b6ec941372793c4e4d4d0 Author: HDCharles Date: Tue Oct 25 09:58:57 2022 -0700 [ao] Adding FAQ to docs (#87322) Summary: migrated from: https://discuss.pytorch.org/t/quantization-frequently-asked-questions/161251 Test Plan: circle CI tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/87322 Approved by: https://github.com/z-a-f commit ece3758afc61cb43e4d2f480b46a76b3a8376984 Author: Sherlock Huang Date: Tue Oct 25 04:46:42 2022 +0000 Fix _refs for aten.zeros/ones/empty/randn (#87569) refs for aten.zeros/ones/empty/randn doesn't support .names overload. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87569 Approved by: https://github.com/ngimel commit ebe5aad466fa7d1a25903be04ab7b15bdb6dcdf2 Author: Animesh Jain Date: Tue Oct 25 19:58:23 2022 +0000 [inductor] Revert channels-last support (#87588) We witnessed slow compilation times last week. Earlier, I thought it was due to parallel compilation. But, after git bisect, I found the source of extra time to be my PR - https://github.com/pytorch/pytorch/pull/87049 For 1x1 kernel, the current striding check incorrectly declares channels-first 1x1 convs to channels last. I am not sure why it caused so much compilation time jump. Or why it did not fail? There was no change in performance speedup. cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu to identify what could be source of this compilation time increase, so that we can manually check that part of the stack. With this `res2next50` compilation time went back to 96 seconds (which was raised to 900 seconds with my earlier PR) for single thread. And parallel-compilation brings it down to ~30 seconds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87588 Approved by: https://github.com/soumith, https://github.com/jansel, https://github.com/ngimel commit 312628d29972ef48897e79a3b46a7a680ecc4759 Author: S.Cao-office Date: Tue Oct 25 19:51:42 2022 +0000 Fixed minor typos in torch.flip and torch.rot90 (#87724) Fixes #87721 @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/87724 Approved by: https://github.com/malfet commit 52ac8adc209395d2631a2d05714fc60a8f937591 Author: AllenTiTaiWang Date: Tue Oct 25 16:31:45 2022 +0000 [ONNX] Fix pad Circular Mode (#86984) In https://github.com/pytorch/pytorch/pull/73433, a ONNX test case is missed, and the result is incorrect when it is converted to ONNX. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86984 Approved by: https://github.com/BowenBao commit e532fb9a95d5d453fa2128df189cb4c89424f429 Author: Xu Zhao Date: Tue Oct 25 19:38:41 2022 +0000 Use setup_instance script to enable conda and load cuda libraries (#87296) Fixes the broken torchbench CI after the machine image update. 
RUN_TORCHBENCH: nvfuser Pull Request resolved: https://github.com/pytorch/pytorch/pull/87296 Approved by: https://github.com/davidberard98 commit 7a6808c5f6d556285607ab51adc4ae69f30ae3c9 Author: min-jean-cho Date: Tue Oct 25 19:24:35 2022 +0000 build: support DNNL_GRAPH_CPU_RUNTIME=TBB (#87512) Force set cmake `DNNL_GRAPH_CPU_RUNTIME` as `MKLDNN_CPU_RUNTIME` to overwrite [`set(DNNL_GRAPH_CPU_RUNTIME "OMP")`](https://github.com/oneapi-src/oneDNN/blob/d19d0f795c60695bd32f894c6f01771b2dfbe24d/cmake/options.cmake#L65-L67), enabling user-specified `MKLDNN_CPU_RUNTIME` values (`OMP` (default), `TBB`) for `DNNL_GRAPH_CPU_RUNTIME`. Fixes https://github.com/pytorch/pytorch/issues/87511 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87512 Approved by: https://github.com/jgong5, https://github.com/ashokei, https://github.com/malfet commit 82698b8954f9cde4c109c8ee58d3314d81adb30a Author: Shen Li Date: Tue Oct 25 15:00:39 2022 +0000 Add prepend argument to nn.Module hooks (#87370) cc @ezyang @gchanan Pull Request resolved: https://github.com/pytorch/pytorch/pull/87370 Approved by: https://github.com/soulitzer commit 82dff8ee09278bfea385c27d21d88b978ef911c9 Author: AllenTiTaiWang Date: Tue Oct 25 15:52:17 2022 +0000 [ONNX] replace AT_ASSERT with TORCH_INTERTNAL_ASSERT take 2 (#86405) Address the AT_ASSERT in torch/jit/csrc/serialization (ONNX related). Pull Request resolved: https://github.com/pytorch/pytorch/pull/86405 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit 65b4a633bbcb45111962f48573930e3d2a2bc2c2 Author: AllenTiTaiWang Date: Tue Oct 25 15:55:31 2022 +0000 [ONNX] Support quantized::conv1d_relu (#85997) According to #38248, quantized::conv1d_relu shares packing parameters with Conv2D (kspatialDim is also 2), and needs a different unpacking way. Therefore, a new `QuantizedParamsType=Conv1D` is used to differentiate the two, and has to extract 1D information from 2D packed parameters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85997 Approved by: https://github.com/BowenBao commit 15370d32b9aaf036f559ac10059b50ac6dbc5cc6 Author: Bin Bao Date: Tue Oct 25 17:34:29 2022 +0000 Disable test_inductor_timm_shard (#87710) Summary: tests are flaky. Need more time for investigation. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87710 Approved by: https://github.com/anijain2305, https://github.com/malfet commit 874625e039274b83e70163733f6f2c1689b0de4e Author: Will Constable Date: Tue Oct 25 02:35:41 2022 +0000 Graph-break on FSDP in dynamo (#87420) Why we want to graph-break FSDP - FSDP has communication ops during forward and backward which we currently can't trace into the graph but also want to ensure are overlapped with compute - dynamo has issues tracing into or capturing a call to fsdp module without a break (see below) How we graph-break on FSDP - marking FSDP.forward code as skip means the code frames will graph-break; but in this case all of torch.* is listed in skipfiles.py anyway, so this is taken care of - disallowing the FSDP module prevents dynamo trying to record a 'call_module(FSDPmodule)' node into a graph, which happens earlier than the graphbreak that would be caused by skip, and causes additional issues: dynamo deepcopies modules before call-module handling, and FSDP module isn't trivially deep-copyable cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87420 Approved by: https://github.com/aazzolini commit b6f28334bc3276a56d79dea6cb7ed99411556348 Author: Valentin Andrei Date: Tue Oct 25 17:03:23 2022 +0000 [pytorch] Layer norm backward speed gain with warp shuffles (#87445) Test Plan: ``` Times below are Forward + Backward on A100 Size FP32. Gain. FP16. Gain 256, 256 101.30 9% 103.9 6% 512, 256 110.10 -4% 102.9 10% 1024, 256 104.30 7% 102.4 6% 2048, 256 107.60 4% 109.7 0% 4096, 256 116.70 8% 109.1 0% 6144, 256 106.10 7% 112.8 2% 8192, 256 106.10 1% 109.7 2% 256, 512 102.10 3% 108.5 1% 512, 512 101.50 40% 105.9 4% 1024, 512 109.70 20% 109.2 -1% 2048, 512 107.40 24% 107.2 1% 4096, 512 108.00 6% 110.6 -3% 6144, 512 103.90 13% 105.8 7% 8192, 512 138.70 14% 105.6 7% 256, 1024 106.20 1% 102.9 6% 512, 1024 104.50 4% 104.2 3% 1024, 1024 126.90 -15% 103.9 10% 2048, 1024 127.40 -15% 102.2 6% 4096, 1024 117.70 6% 102.8 21% 6144, 1024 165.30 11% 112.2 12% 8192, 1024 211.90 11% 144.8 13% 256, 1536 102.80 11% 103.1 6% 512, 1536 103.30 9% 102.9 18% 1024, 1536 111.00 -2% 117.2 7% 2048, 1536 102.30 12% 132.1 -4% 4096, 1536 165.50 5% 112.9 18% 6144, 1536 236.60 5% 145.7 12% 8192, 1536 307.80 5% 186.1 11% 256, 2048 110.60 -1% 103.8 7% 512, 2048 105.20 3% 105.6 1% 1024, 2048 106.70 3% 114.8 3% 2048, 2048 124.90 5% 109.7 0% 4096, 2048 231.40 4% 129.9 10% 6144, 2048 332.80 4% 182.5 11% 8192, 2048 434.60 4% 235.2 11% 256, 3072 111.60 8% 110.8 1% 512, 3072 106.80 1% 104.6 10% 1024, 3072 104.90 3% 109.9 4% 2048, 3072 193.80 0% 106.2 10% 4096, 3072 364.50 0% 187.8 5% 6144, 3072 538.30 0% 267 5% 8192, 3072 718.00 -1% 346.7 6% 256, 4096 103.60 4% 110.2 -1% 512, 4096 131.40 -11% 117 -7% 1024, 4096 135.80 1% 104.8 7% 2048, 4096 268.20 1% 149.4 10% 4096, 4096 520.70 1% 268.5 9% 6144, 4096 786.30 0% 389.8 9% 8192, 4096 1043.50 0% 509 10% ``` Used the following script from ngimel: ``` import torch from torch.utils.benchmark import Compare, Timer results = [] for dtype in (torch.float, torch.half): for fs in (256, 512, 1024, 1536, 2048, 3072, 4096): for bs in (256, 512, 1024, 2048, 4096, 6144, 8192): ln = torch.nn.LayerNorm((fs,), device="cuda", dtype=dtype) X = torch.randn(bs, fs, device="cuda", dtype=dtype, requires_grad=True) gO = torch.rand_like(X) stmtfwd = "ln(X)" stmtfwdbwd = "X.grad=None; ln.zero_grad(set_to_none=True); out = ln(X); out.backward(gO)" 
tfwd = Timer( stmt=stmtfwd, label="ln", sub_label=f"{bs:5}, {fs:5}", description=f"fwd, {dtype}", globals=globals(), ) tfwdbwd = Timer( stmt=stmtfwdbwd, label="ln", sub_label=f"{bs:5}, {fs:5}", description=f"fwdbwd, {dtype}", globals=globals(), ) for t in (tfwd, tfwdbwd): results.append(t.blocked_autorange()) print(fs, end="\r") c = Compare(results) c.print() ``` Differential Revision: D40567574 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87445 Approved by: https://github.com/ngimel commit 7b5978254f0785f8a1c94b545c444985a2c19d96 Author: Tugsbayasgalan Manlaibaatar Date: Mon Oct 24 15:44:46 2022 -0700 Add named_buffers to torchdynamo nn_module (#87644) Fixes: https://github.com/pytorch/torchdynamo/issues/1738 cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87644 Approved by: https://github.com/jansel commit 8a2a4ed488277797ea6b15ee531e9374aa45acfd Author: stumpOS Date: Tue Oct 25 17:00:27 2022 +0000 consider numel args when identifying aligned args (#87394) Fixes #ISSUE_NUMBER https://github.com/pytorch/torchdynamo/issues/1527 cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu Pull Request resolved: https://github.com/pytorch/pytorch/pull/87394 Approved by: https://github.com/jansel commit 569eebb43cdc11a83dab28ef33ba969bc54d8979 Author: Horace He Date: Tue Oct 25 04:04:16 2022 +0000 Add get_guard_expr to symbolic_shapes which returns all guards in a single expression (#87665) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87665 Approved by: https://github.com/ezyang, https://github.com/voznesenskym commit eb99c1efce20aedf67e696ba5aa61fecb5651838 Author: Sherlock Huang Date: Tue Oct 25 04:46:42 2022 +0000 Prefer python meta function over c++ meta function (#87426) This is a policy update for meta registration. **We now prefer python meta implementation over C++ meta function.** This is a flip of the previous policy, where we prefer C++ meta function over python meta function if they both exist. Here's the meta registration process: 1. register_meta and register_decomposition will place the python meta/decomp functions into the `global_decomp_table`. However, they will NOT register them into dispatcher. 2. After global_decomp_table is populated, we will compile an `active_meta_table`. For a given op, we pick the most specific decomp function from `global_decomp_table` in the preference order of Meta > PostAutograd > PreAutograd. 3. We will unconditionally register all of them into python dispatcher. And register them into C++ dispatcher, unless it one of the following 3 cases - 1. the op is a CompositeImplicitAutograd, and should rely on decomposed op's meta - 2. the op is a view op, as the MetaTensor doesn't support aliased storage - 3. the op is in the blocklist (due to UT failures, and we will burn down this list op by op) Over the long run, we wish to implement all meta functions in python. With this PR, 321 op_overloads will have cpp meta overridden by python meta. There are still 400 op_overloads is using cpp meta. 
The exact list can be found here https://gist.github.com/SherlockNoMad/d20bb736178df8eebd3b054c8bb7cdc5 cc @ngimel @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87426 Approved by: https://github.com/ezyang, https://github.com/jansel commit 65601f5ef3b86231ce886f534fbc8c1c4de9f11d Author: AllenTiTaiWang Date: Mon Oct 24 21:14:18 2022 +0000 [ONNX] Add Support on 0d tensor Broadcast (#87211) I am not sure if this will break things ... Although 0d tensor is an undefined behavior in ONNX spec, I did some experiments and found that ONNX shape inference actually provides 0d as inference from 0d and 1d Op calculations, and the bug happened in Broadcast function. But still, if this breaks things really bad, I think we can put 0d tensor handling on hold, as it's not very common usage on models? Pull Request resolved: https://github.com/pytorch/pytorch/pull/87211 Approved by: https://github.com/jcwchen, https://github.com/BowenBao commit 5308886ec3b09819e95dd5dfffde597a25f3fb43 Author: PyTorch MergeBot Date: Tue Oct 25 14:45:12 2022 +0000 Revert "Intercept aten._reshape_alias for nvFuser (#87072)" This reverts commit 163a829caa82559e7f938f65c1b647a5d50663c3. Reverted https://github.com/pytorch/pytorch/pull/87072 on behalf of https://github.com/malfet due to Looks like it broke test_indexing in dynamo shard, see https://github.com/pytorch/pytorch/actions/runs/3318778609/jobs/5483248042 commit 0cba7888c5eeb66535e72bad852c3ca3dc3ac681 Author: Driss Guessous Date: Tue Oct 25 14:44:05 2022 +0000 Performance improvment to cumulative seq len (#87530) Performance improvement to calculating metadata needed for gluing in nested tensors to fused kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87530 Approved by: https://github.com/cpuhrsch commit 87163fe8df1ac64070fa9b9a6b04ba5fae0979f3 Author: Bert Maher Date: Mon Oct 24 12:57:57 2022 -0700 [inductor] Trivial smoke-test (#87598) As we're bringing up dynamo+inductor on Meta-internal infra, I keep wanting a stupidly simple program to run to see if anything at all is working. This test is that program :-p. Obviously test_torchinductor.py is more comprehensive but it's also harder to tell exactly what's going on, whereas this test fits on one screen. Differential Revision: [D40595798](https://our.internmc.facebook.com/intern/diff/D40595798/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40595798/)! cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87598 Approved by: https://github.com/anijain2305, https://github.com/brad-mengchi commit 9efca7c0850c65d915827f3325d704dcb4f11a1c Author: Jagadish Krishnamoorthy Date: Tue Oct 25 07:17:44 2022 +0000 [ROCm] [FakeTensorTest] Enable test_fallback_memory_prop (#85760) Signed-off-by: Jagadish Krishnamoorthy Pull Request resolved: https://github.com/pytorch/pytorch/pull/85760 Approved by: https://github.com/kit1980 commit e818574e78580c86064cd8ac37e5d492350e1e72 Author: Daniel Falbel Date: Tue Oct 25 07:12:28 2022 +0000 Support `signbit` in MPS. (#87214) Implements the signbit operator for MPS. 
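A tiny usage sketch for the new signbit kernel, assuming a machine where the MPS backend is available; the values are arbitrary.
```python
import torch

if torch.backends.mps.is_available():
    x = torch.tensor([-1.5, -0.0, 0.0, 2.0], device="mps")
    print(torch.signbit(x))  # tensor([ True,  True, False, False], device='mps:0')
```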
Links to #77764 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87214 Approved by: https://github.com/kulinseth, https://github.com/kit1980 commit 163a829caa82559e7f938f65c1b647a5d50663c3 Author: Ivan Yashchuk Date: Tue Oct 25 06:55:59 2022 +0000 Intercept aten._reshape_alias for nvFuser (#87072) This would help forming larger fusion groups. If this won't end up executed by nvFuser then eager mode implementation would call into `.reshape`: https://github.com/pytorch/pytorch/blob/37e9e89afbc3554258545a026fab4cd9e1a4b85d/torch/_prims/nvfuser_prims.py#L552-L553 cc @kevinstephano @jjsjann123 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87072 Approved by: https://github.com/ngimel commit 9bbdc7ab3462a1be8267bc81321fca702eccf854 Author: PyTorch MergeBot Date: Tue Oct 25 06:14:54 2022 +0000 [vision hash update] update the pinned vision hash (#87639) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87639 Approved by: https://github.com/pytorchbot commit e85230b8197ddf38268ae0732971515149d1b652 Author: Takeshi Watanabe Date: Tue Oct 25 05:49:52 2022 +0000 [JIT] Fix return types of inputs/outputs method in Graph (#86349) The C++ definition return `ArrayRef` but in python binding it returns iterator instead: https://github.com/pytorch/pytorch/blob/d04889323e2bc0b7321b76e564292565c88b9a5e/torch/csrc/jit/python/python_ir.cpp#L631 I've had hard time with mypy and there is also fixed version of stubs in pytorch-pfn-extras for my project: https://github.com/pfnet/pytorch-pfn-extras/blob/beeab3f30381fd1ed313bc09d561c567482784a1/stubs/torch/_C/__init__.pyi#L458 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86349 Approved by: https://github.com/kit1980 commit 0367c12bce8f1b98bae57d6d380c8066a70edfba Author: Bill Schnurr Date: Tue Oct 25 04:47:10 2022 +0000 Fix torch.testing.assert_close not exported from module (#87619) For pylance/pyright static typechecking "Imported symbols are considered private by default. If they use the “import A as A” (a redundant module alias), “from X import A as A” (a redundant symbol alias)" https://github.com/microsoft/pyright/blob/main/docs/typed-libraries.md#library-interface torch.testing.assert_close not exported from module https://github.com/microsoft/pylance-release/issues/3526 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/87619 Approved by: https://github.com/kit1980 commit ec15942916b3e09a0acd75664aa699d10131e6df Author: shynehr Date: Tue Oct 25 04:45:52 2022 +0000 remove unnecessary __syncthreads() in conv_depthwise2d_grad_weight_kernel (#84854) Threads within a thread block would be synchronize inside the function BlockReduceSum when intra-warp reduce finishes. It's unnessary to synchronize threads before invoking function BlockReduceSum. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84854 Approved by: https://github.com/ngimel commit 874a94ce9482a1af4bee782a831b2632cd6eda13 Author: Soof Golan <83900570+soof-golan@users.noreply.github.com> Date: Tue Oct 25 04:43:07 2022 +0000 Fix `tensor.stride()` type hint (#84177) `tensor.stride()` now hints at tuple of variable length instead of tuple with constant length of 1 Fixes #84176 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84177 Approved by: https://github.com/Chillee commit 4ef5f5dec7d8fff6c73d2908ae4ecdfb2cebce04 Author: Howard Huang Date: Mon Oct 24 12:30:45 2022 -0700 Fix use after free in tensorpipe agent (#87627) Fixes #87359, which identifies use after free for reverse device maps. This is only in the dynamic RPC feature and not effecting stable RPC code path. Unfortunately the test `TensorPipeRpcTest.test_dynamic_rpc_existing_rank_can_communicate_with_new_rank_cuda` that is failing is also running into separate issue. I've temporarily disabled some of the test code to investigate the error in asychronously. Testing plan: - tested all the dynamic RPC tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/87627 Approved by: https://github.com/rohan-varma commit fd60b818b9d5b9ff6c7e33b7c2ba15d5b2fe97cd Author: Tom Stein Date: Tue Oct 25 04:07:16 2022 +0000 [Python] refactor slices on sorted (#86995) Sometimes you want to query the small element of a set of elements and use `sorted(elements)[0]` without a second thought. However, this is not optimal, since the entire list must be sorted first `O(n log n)`. It would be better to use the `min(elements)` method provided for this purpose `O(n)`. Furthermore `sorted(elements)[::-1]` is not very efficient, because it would be better to use `sorted(elements, reverse=True)` to save the slice operation. **TLDR: using `sorted(elements)[0]` is slow and can be replaced with `min(elements)`.** I stumbled across these code snippets while playing around with CodeQL (see https://lgtm.com/query/4148064474379348546/). 
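The before/after pattern from #86995, spelled out on a throwaway list; each pair is equivalent, and the second form just avoids the unnecessary full sort or extra slice.
```python
elements = [5, 3, 8, 1]

smallest = sorted(elements)[0]               # O(n log n): sorts the whole list
smallest = min(elements)                     # O(n): single pass

descending = sorted(elements)[::-1]          # sort ascending, then copy-reverse
descending = sorted(elements, reverse=True)  # sort once, no extra slice
```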
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86995 Approved by: https://github.com/jansel commit 98f40af7e3133e042454efab668a842c4d01176e Author: Yanbo Liang Date: Tue Oct 25 03:22:27 2022 +0000 [Inductor] Truncate function expr str if it's too long at RecordLoadStore (#87248) See context at https://github.com/pytorch/torchdynamo/issues/1352#issuecomment-1283131872 Fixes https://github.com/pytorch/torchdynamo/issues/1352 cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @penguinwu Pull Request resolved: https://github.com/pytorch/pytorch/pull/87248 Approved by: https://github.com/jansel commit 07cea67d12318368a5dfb10310d77b6754df65c7 Merge: 5140b126d9 ee804a839f Author: mingfeima Date: Tue Oct 25 10:51:58 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit ee804a839f794e1fc047039c54b37080b54194b9 Author: mingfeima Date: Tue Oct 25 10:51:58 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 0fab8df0b637f70e94e3b17c529f200375a35342 Author: Kazuaki Ishizaki Date: Tue Oct 25 02:49:11 2022 +0000 Fix incorrect param names in get_testing_overrides (#87625) This PR fixes incorrect parameter names for lambda in `get_testing_overrides()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87625 Approved by: https://github.com/kit1980 commit d4aa811593428314d2af6a2dadff30aa0f0a0e97 Author: Sherlock Huang Date: Mon Oct 24 21:52:12 2022 +0000 Defer importing meta_table (#87630) This is needed to work around an internal test failure: https://www.internalfb.com/tasks/?t=135878641 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87630 Approved by: https://github.com/eellison, https://github.com/khabinov commit 5140b126d948acb944c7a530cf00ae917583756f Merge: c31e42ca1d c06dfb1e65 Author: mingfeima Date: Tue Oct 25 09:56:29 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit c06dfb1e653de107ee0ee8adc68390ed89db8acd Merge: 88824d9e20 ea30002a60 Author: mingfeima Date: Tue Oct 25 09:56:29 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit ea30002a60df2031679469fc53238b9c9aca697c Author: Huy Do Date: Tue Oct 25 01:45:23 2022 +0000 Add cached conda env files for macos (arm64, x86) (#87541) So far, we only cache macos conda dependency for build workflow. All the test dependencies are still not cached and installed by the CI. This PR introduces a new `.github/requirements` directory which I plan to explicitly include all the conda and pip build and test dependencies across all platforms. This allows pip and conda installation to be consolidated in one place (and properly cached) Those conda dependencies come from https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/macos-common.sh. Once this PR is merged, I will follow up with another one to clean up all conda installation from that file (to make sure that nothing break along the way) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87541 Approved by: https://github.com/ZainRizvi commit 63138fbec3319c712a126c2ad6b46357a08ba0f6 Author: erjia Date: Tue Oct 25 01:27:56 2022 +0000 [DataLoader2] Change serialization wrapper to iterator (#87459) This is temporary fix for internal SEV. We have run three different workflows to validate this fix would unblock internal SEV. 
And, those are a few following-up tasks: - [ ] Create reproducible test for multithreading with generator - [ ] Figure out how to make fullsynciterator is working properly with generator - [ ] Move Wrapper back to generator if needed Pull Request resolved: https://github.com/pytorch/pytorch/pull/87459 Approved by: https://github.com/NivekT commit 3f94adc1056b541851422f887149d54756ed91c1 Author: Aaron Enye Shi Date: Tue Oct 25 00:50:13 2022 +0000 [Kineto][Profiler] Rename Profiler post processing Index Key (#87477) Summary: Rather than using the full name Profiler Event Index, use a shorten name Ev Idx. In the future, we should address this by adding a lookup table of short name to long name. Test Plan: CI Reviewed By: robieta, slgong-fb Differential Revision: D40328758 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/87477 Approved by: https://github.com/chaekit commit a3c5a80a2552ab26b8b769cb94bf15edaf03b734 Author: Nikita Shulga Date: Tue Oct 25 00:18:31 2022 +0000 Fix TensorShape.cpp compilation (#87654) Build failure introduced by landrace while merging https://github.com/pytorch/pytorch/pull/75575 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87654 Approved by: https://github.com/albanD commit 28593a8339de9c9daa244125b223766c4dfd40ff Author: Masaki Kozuki Date: Tue Oct 25 00:11:50 2022 +0000 [docs] `batch_isend_irecv` and `P2POp` of torch.distributed (#86438) Reopening https://github.com/pytorch/pytorch/pull/79722 cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu Pull Request resolved: https://github.com/pytorch/pytorch/pull/86438 Approved by: https://github.com/kit1980 commit cf895bac152b17530d3f0b82104a2eb3ec9528be Author: Nikita Shulga Date: Tue Oct 25 00:00:57 2022 +0000 Fix typo in secrets name (#87655) They are case sensitive and should be all uppercase Pull Request resolved: https://github.com/pytorch/pytorch/pull/87655 Approved by: https://github.com/kit1980, https://github.com/weiwangmeta commit b085c80126d6234d3a3fc8646f38520eae05283d Author: albanD Date: Mon Oct 24 15:37:20 2022 -0400 Add /= to c10::SymInt (#87603) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87603 Approved by: https://github.com/bdhirsh commit 5ce9993dce36942d5b1e6c8f46d346014897d326 Author: albanD Date: Mon Oct 24 15:37:20 2022 -0400 Fix a PyObject leak (#87608) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87608 Approved by: https://github.com/ezyang commit 3263bd24bee43b4e2c263c0076a2136de6ead947 Author: albanD Date: Mon Oct 24 15:37:20 2022 -0400 Improve argument printing (#87601) No more "expected tuple but got tuple". We appropriately grovel in the list/tuple for the element that mismatched and report what exactly twinged the failure. invalid_arguments.cpp is a shitshow so I did something slapdash to get it not completely horrible. See https://github.com/pytorch/pytorch/issues/87514 for more context. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87601 Approved by: https://github.com/Chillee commit 72ec1b5fc14565671e3c485e93acd26552691c9f Author: Kazuaki Ishizaki Date: Mon Oct 24 23:52:44 2022 +0000 Fix typo under docs directory (#87583) This PR fixes typo in `.rst` files under docs directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/87583 Approved by: https://github.com/kit1980 commit 8ff3566aab2cd5c5fb4fba35b06e79cabeaeb052 Author: Edward Z. 
Yang Date: Mon Oct 24 19:40:19 2022 -0400 Make me codeowner of test_aotdispatch.py (#87624) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87624 Approved by: https://github.com/albanD commit 72064c456f5773c676054697e6df42db10d7c375 Author: Edward Z. Yang Date: Mon Oct 24 16:36:25 2022 -0700 Fix bernoulli functionalization. (#87573) For testing, see https://github.com/pytorch/pytorch/issues/87571 Signed-off-by: Edward Z. Yang cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87573 Approved by: https://github.com/albanD commit be925df25d7f6be2c34e62ec5b791ccd354c36d3 Author: Peter Bell Date: Mon Oct 17 18:57:07 2022 +0100 ATen/native (6/6): Use per-operator headers (#75576) Differential Revision: [D40126699](https://our.internmc.facebook.com/intern/diff/D40126699) Pull Request resolved: https://github.com/pytorch/pytorch/pull/75576 Approved by: https://github.com/malfet commit 630fcdadcf9606c1d90ea94d9993c95e0c49fc01 Author: Peter Bell Date: Mon Oct 17 18:57:07 2022 +0100 ATen/native (5/6): Use per-operator headers (#75575) Differential Revision: [D40126696](https://our.internmc.facebook.com/intern/diff/D40126696) Pull Request resolved: https://github.com/pytorch/pytorch/pull/75575 Approved by: https://github.com/malfet commit 482f6419ee17b4ab6ee32997db6a1e89220c5ca2 Author: Peter Bell Date: Mon Oct 17 18:57:07 2022 +0100 ATen/native (4/6): Use per-operator headers (#75574) Differential Revision: [D40126697](https://our.internmc.facebook.com/intern/diff/D40126697) Pull Request resolved: https://github.com/pytorch/pytorch/pull/75574 Approved by: https://github.com/malfet commit 4abd3e299dd2d212047dcd5391bc404653afb94e Author: Peter Bell Date: Mon Oct 17 18:57:06 2022 +0100 ATen/native (3/6): Use per-operator headers (#75573) Differential Revision: [D40126701](https://our.internmc.facebook.com/intern/diff/D40126701) Pull Request resolved: https://github.com/pytorch/pytorch/pull/75573 Approved by: https://github.com/malfet commit f1440e77e7606d598ea39ebcd0e75988514abea1 Author: Nikita Shulga Date: Mon Oct 24 23:05:14 2022 +0000 [CI] Fix triton wheel build (#87461) If one to use auto-install llvm mechanism, somehow one ends us with few unresovled symbols if build on manylinux image. Workaround by installing llvm from OS repos. Also, add an upload job, which is executed only on trunk Fixes https://github.com/pytorch/torchdynamo/issues/1733 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87461 Approved by: https://github.com/msaroufim commit 1655b47a384d5e6ba31420b5daee1c029a821387 Author: Huy Do Date: Mon Oct 24 22:44:42 2022 +0000 Add some common tools to docker base (#86993) I always need to install these 2 tools whenever I use Docker manually to debug build and test issues: * unzip is to extracted the zipped artifacts from PyTorch CI * gdb is to do you know what :) IMO, it makes sense to have them as part of the container image Pull Request resolved: https://github.com/pytorch/pytorch/pull/86993 Approved by: https://github.com/ZainRizvi commit 96aac51717194eb8dcd9cc711bd78cfc9cf39e92 Author: kshitij12345 Date: Mon Oct 24 22:43:11 2022 +0000 [functorch] dont compute expected output multiple times (#86202) Fixes https://github.com/pytorch/functorch/issues/1028 Description: We update `get_fallback_and_vmap_exhaustive` to compute expected output only once as described in the issue. 
NOTE: This doesn't take care of the repeated computation in `test_vmap_exhaustive` and will be followed up later. TODO: * [x] Benchmark and see how much of a difference this makes. (Comparison Table Below: [Link](https://github.com/pytorch/pytorch/pull/86202#issuecomment-1285477653)) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86202 Approved by: https://github.com/zou3519 commit bad64bdd93a9f44d8312b5eed9d6f9c4aab1f9d5 Author: Huy Do Date: Mon Oct 24 22:24:44 2022 +0000 Upgrade actions/upload-artifact to v3 (#87553) Upgrade a bunch of actions to get rid of the deprecation warnings, i.e. https://github.com/pytorch/pytorch/actions/runs/3304031186 * Upgrade actions/upload-artifact to v3 * Upgrade Windows actions/setup-python to v4 (left over) Note: Warnings coming from setup/cache will be fixed upstream by https://github.com/pytorch/test-infra/pull/941 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87553 Approved by: https://github.com/clee2000 commit c4fecff97d5b5405bcac6c6f8dc34bcb1d2cb020 Author: Animesh Jain Date: Mon Oct 24 21:53:14 2022 +0000 [inductor] Prevent aggressive fusion during inductor lowering (#87447) Fixes https://github.com/pytorch/torchdynamo/issues/1599 Inductor performs aggressive fusion of ops during the lowering of the Fx graph into IR nodes. Note that this fusion is different from the fusion that we typically discuss in the context of Inductor, which refers to the fusion of SchedulerNodes (way after lowering). This PR, instead, ensures that we don't accumulate too many ops in the IR node to begin with. In the case of the hf_t5_large backward graph, earlier we would generate a kernel with 100s of operators, causing that kernel to take ~350 seconds of compilation time. With this PR, we get it down from 350 seconds to 50 seconds. Note that this could affect performance. I doubt that it will lead to a really large dip, though. In my toy examples, even if the lowering creates multiple IR nodes, if it's a simple fusion, later fusion still creates one node. I would like (1) test_torchinductor.py, (2) test_torchinductor_info.py, and (3) at least HF models to be enabled in CI before merging this one.
@ngimel @jansel @Chillee cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu Pull Request resolved: https://github.com/pytorch/pytorch/pull/87447 Approved by: https://github.com/jansel commit e5ceab173a410aa24a0706d48b2fae307016605f Author: Michael Suo Date: Mon Oct 24 14:29:00 2022 -0700 [dynamo] fix `explain` (#87640) Another casualty of the core move Pull Request resolved: https://github.com/pytorch/pytorch/pull/87640 Approved by: https://github.com/voznesenskym commit 71fe069d985e97b5947d133f2f2bde9adea01ed7 Author: Greg Hogan Date: Mon Oct 24 21:25:36 2022 +0000 ada lovelace (arch 8.9) support (#87436) changes required to be able to compile https://github.com/pytorch/vision and https://github.com/nvidia/apex for `sm_89` architecture Pull Request resolved: https://github.com/pytorch/pytorch/pull/87436 Approved by: https://github.com/ngimel commit 4105ef9a6bf094dfbed19e35cdf5af3a7c57bb12 Author: albanD Date: Mon Oct 24 21:03:58 2022 +0000 small improvement to error message in fx interpreter (#87599) From https://github.com/pytorch/pytorch/pull/84246/files#r972537173 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87599 Approved by: https://github.com/ezyang commit 8d37e51931dff142bd2b7c2830a69972d3a05012 Author: shubhambhokare1 Date: Mon Oct 24 20:48:29 2022 +0000 [ONNX] Enable test_fill script test (#79555) For scripting mode, aten::clone requires input to be a TensorType. Hence if we encounter an IntType, FloatType or BoolType input, we set the input to the appropriate TensorType Pull Request resolved: https://github.com/pytorch/pytorch/pull/79555 Approved by: https://github.com/justinchuby, https://github.com/BowenBao, https://github.com/abock commit fbe256cb1e5d08ca3ef6140b048a87105c287dc3 Author: Catherine Lee Date: Mon Oct 24 20:21:16 2022 +0000 cpp docs push fix (#87614) currently failing with ``` To https://github.com/pytorch/cppdocs + 2825b2745bb...80ec4daa657 HEAD -> pytorchbot/temp-branch-cpp (forced update) Branch 'master' set up to track remote branch 'pytorchbot/temp-branch-cpp' from 'origin'. ++ sleep 30 ++ git push -u origin fatal: The upstream branch of your current branch does not match the name of your current branch. To push to the upstream branch on the remote, use git push origin HEAD:pytorchbot/temp-branch-cpp To push to the branch of the same name on the remote, use git push origin HEAD ``` just checked the settings, master of pytorch/cppdocs does not have easy cla as a required check, so we don't need the temp branch Pull Request resolved: https://github.com/pytorch/pytorch/pull/87614 Approved by: https://github.com/huydhn commit 2abe9c464ee6b3859573c3edae5ef38ae1da2f6c Author: Richard Zou Date: Mon Oct 24 12:46:27 2022 -0700 Add codeowners for functorch (#86213) The list is for people who want to be notified on changes to the files in there. Review is not required from the list of names; I just want to be notified to keep track of what is going on. Let me know if you want your names added too in this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86213 Approved by: https://github.com/Chillee commit 00b8c7e63b40f2596a1cde66eea806759131dcea Author: alexmsettle <37422826+alexmsettle@users.noreply.github.com> Date: Mon Oct 24 20:02:56 2022 +0000 New feature for issue #85575. (#86514) Introduced RECORD_OUTPUTS() macro that goes with RECORD_FUNCTION(). It is used to capture the output tensors from a kernel launch. The tensors automatically get passed to the profiler using record_function methods. 
This allows the profiler to track the tensors that flow into and out of each op. Fixes #85575 cc @robieta @chaekit @aaronenyeshi @ngimel @nbcsm @guotuofeng @guyang3532 @gaoteng-git @tiffzhaofb Pull Request resolved: https://github.com/pytorch/pytorch/pull/86514 Approved by: https://github.com/robieta commit 17509d1ec41c6c513a382c2a6a044ac6a6f903c5 Author: Manuel Candales Date: Mon Oct 24 19:41:53 2022 +0000 [Vulkan][TCC] Implement tests for hardtanh, hardtanh_, relu and relu_ (#87506) Summary: Implement Vulkan tests for these untested functions in Clamp.cpp: - hardtanh - hardtanh_ - relu - relu_ Test Plan: ```cd ~/fbsource buck run //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64``` Reviewed By: kirklandsign Differential Revision: D40603655 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87506 Approved by: https://github.com/salilsdesai commit 4f2d869095034301b903cd2ef807b416547c0d9c Author: atalman Date: Mon Oct 24 19:38:07 2022 +0000 Fix distributed issue by including distributed files (#87615) This fixes regression in distributed headers installation. Caused by following PR: https://github.com/pytorch/pytorch/pull/85953 which removed the inclusions Fixes #87173 Test plan from wheel build by this CI: https://github.com/pytorch/pytorch/actions/runs/3314742519 ``` [ec2-user@ip-10-0-9-132 c10d]$ pwd /home/ec2-user/actions-runner/_work/_temp/artifacts/torch/include/torch/csrc/distributed/c10d [ec2-user@ip-10-0-9-132 c10d]$ ls -las total 300 4 drwxr-xr-x 2 ec2-user ec2-user 4096 Oct 24 19:12 . 0 drwxr-xr-x 4 ec2-user ec2-user 29 Oct 24 19:12 .. 12 -rw-r--r-- 1 ec2-user ec2-user 9051 Oct 24 17:28 Backend.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 216 Oct 24 17:28 c10d.h 4 -rw-r--r-- 1 ec2-user ec2-user 3880 Oct 24 17:28 comm.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 604 Oct 24 17:28 debug.h 4 -rw-r--r-- 1 ec2-user ec2-user 1717 Oct 24 17:28 default_comm_hooks.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 1316 Oct 24 17:28 error.h 4 -rw-r--r-- 1 ec2-user ec2-user 962 Oct 24 17:28 exception.h 4 -rw-r--r-- 1 ec2-user ec2-user 1461 Oct 24 17:28 FileStore.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 771 Oct 24 17:28 GlooDeviceFactory.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 1154 Oct 24 17:28 HashStore.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 4058 Oct 24 17:28 logger.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 2059 Oct 24 17:28 logging.h 8 -rw-r--r-- 1 ec2-user ec2-user 7979 Oct 24 17:28 NCCLUtils.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 2756 Oct 24 17:28 Ops.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 1814 Oct 24 17:28 ParamCommsUtils.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 1478 Oct 24 17:28 PrefixStore.hpp 16 -rw-r--r-- 1 ec2-user ec2-user 13235 Oct 24 17:28 ProcessGroupGloo.hpp 12 -rw-r--r-- 1 ec2-user ec2-user 11298 Oct 24 17:28 ProcessGroup.hpp 12 -rw-r--r-- 1 ec2-user ec2-user 8645 Oct 24 17:28 ProcessGroupMPI.hpp 28 -rw-r--r-- 1 ec2-user ec2-user 26526 Oct 24 17:28 ProcessGroupNCCL.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 3805 Oct 24 17:28 ProcessGroupRoundRobin.hpp 12 -rw-r--r-- 1 ec2-user ec2-user 10361 Oct 24 17:28 ProcessGroupUCC.hpp 8 -rw-r--r-- 1 ec2-user ec2-user 5062 Oct 24 17:28 ProcessGroupWrapper.hpp 8 -rw-r--r-- 1 ec2-user ec2-user 4201 Oct 24 17:28 PyProcessGroup.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 1072 Oct 24 17:28 python_comm_hook.h 24 -rw-r--r-- 1 ec2-user ec2-user 23859 Oct 24 17:28 reducer.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 2330 Oct 24 17:28 reducer_timer.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 1683 Oct 24 17:28 sequence_num.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 2108 Oct 24 17:28 socket.h 4 -rw-r--r-- 1 
ec2-user ec2-user 2589 Oct 24 17:28 Store.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 3264 Oct 24 17:28 TCPStore.hpp 8 -rw-r--r-- 1 ec2-user ec2-user 6944 Oct 24 17:28 TraceUtils.h 8 -rw-r--r-- 1 ec2-user ec2-user 4539 Oct 24 17:28 Types.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 580 Oct 24 17:28 UCCForNCCL.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 2301 Oct 24 17:28 UCCTracing.hpp 8 -rw-r--r-- 1 ec2-user ec2-user 4933 Oct 24 17:28 UCCUtils.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 584 Oct 24 17:28 UnixSockUtils.hpp 24 -rw-r--r-- 1 ec2-user ec2-user 20796 Oct 24 17:28 Utils.hpp 4 -rw-r--r-- 1 ec2-user ec2-user 575 Oct 24 17:28 WinSockUtils.hpp 8 -rw-r--r-- 1 ec2-user ec2-user 4259 Oct 24 17:28 Work.hpp ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87615 Approved by: https://github.com/malfet commit e46a8971e61cd6f37a7edc38586af0828d4c33ce Author: Animesh Jain Date: Mon Oct 24 18:48:46 2022 +0000 [dynamo] Support class members in nn modules (#87531) Fixes https://github.com/pytorch/torchdynamo/issues/1740 @voznesenskym cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu Pull Request resolved: https://github.com/pytorch/pytorch/pull/87531 Approved by: https://github.com/jansel commit 272747db364795a843e740f5e7a3f17320a30855 Author: Natalia Gimelshein Date: Mon Oct 24 18:41:38 2022 +0000 attempted fix for nvrtc with lovelace (#87611) Fixes #87595 (maybe?) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87611 Approved by: https://github.com/malfet, https://github.com/atalman commit 4b4aff774fd26150cb60bd56acd26355ec7c023a Author: Andrew Gu Date: Mon Oct 24 14:48:33 2022 +0000 [FSDP] Fix `use_orig_params=True` + AC (#87413) Without this change, the post-backward hooks do not run when using reentrant activation checkpointing. **Explanation** FSDP registers the original parameters as plain `Tensor`s in the forward pass so that their ops are tracked by autograd to ensure proper gradient propagation into the `FlatParameter`s. FSDP registers the post-backward hooks in its pre-forward. For `use_orig_params=True`, FSDP replaces the plain `Tensor`s with the sharded `nn.Parameter`s in the post-forward when resharding. This differs from `use_orig_params=False`, which keeps the plain `Tensor`s registered as attributes, except their data are freed, meaning that accessing them between forward and backward errors. Before this PR, for `use_orig_params=True`, FSDP simply restores the unsharded original parameter data in the pre-backward to enable correct gradient computation. However, this does not suffice for reentrant activation checkpointing (AC), where the recomputed forward happens after FSDP's pre-backward and the ops in the recomputed forward must be tracked by autograd. My initial solution was to simply have FSDP restore the original parameters as plain `Tensor`s again in the pre-backward so that they would be tracked by autograd exactly like the normal forward. However, this seems to not suffice in general. The `FlatParameter`'s `AccumulateGrad` object may change after the original pre-forward when performing a recomputed forward. The new approach in this PR is to follow the `use_orig_params=False` way -- namely, to preserve the plain `Tensor` variables across forward and backward. I achieved this by saving the variables explicitly in the forward and restoring them in the pre-backward. I clear them in the post-backward to avoid the dangling references (though, I do not think this is strictly necessary). 
An alternative approach I considered is using forward hooks. However, this does not change the order of operations across FSDP, checkpoint, and the wrapped module, so it does not work. (As long as the order is FSDP(checkpoint(module)), then registered hooks still happen either before or after the checkpoint recomputation -- we cannot insert logic to run inside the checkpoint recomputation.) **Test Plan** I augmented the existing reentrant checkpointing unit tests to also test `use_orig_params=True`. I also verified that the pycls model does not error (even with the new approach). Pull Request resolved: https://github.com/pytorch/pytorch/pull/87413 Approved by: https://github.com/rohan-varma commit 7a4d91cac4f2aa79fb44e9048edae957e788394f Author: Will Constable Date: Sun Oct 23 14:18:48 2022 +0000 Add distributed dynamo benchmarking utils (#87419) Util for convenient local benchmarking/debugging of distributed models. Not to be confused with the 'real' distributed benchmark script we use for torchbench experiments on slurm. Tries to be simple/hackable and let you use different combinations of DDP/FSDP with models and dynamo backends. Example usage `python benchmarks/dynamo/distributed.py --toy_model --dynamo inductor --ddp` `--dynamo` flag accepts normal dynamo backends (plus 'print' which literally prints graphs to screen) `--torchbench_model ` works in place of `--toy_model` `--fsdp` is WIP cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87419 Approved by: https://github.com/jansel commit 181b615b4e95376abc2f39ab7f9d145dcfd46c50 Author: Edward Z. Yang Date: Mon Oct 24 11:47:40 2022 -0400 Fix accuracy minifier (#87606) Signed-off-by: Edward Z. Yang cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu Pull Request resolved: https://github.com/pytorch/pytorch/pull/87606 Approved by: https://github.com/anjali411, https://github.com/anijain2305, https://github.com/albanD, https://github.com/soumith, https://github.com/malfet commit 512a3a48e38accbdeb63cfbe45621adb57c903bc Author: RangiLyu Date: Mon Oct 24 16:03:11 2022 +0000 sync AveragedModel buffers when use_buffers=False (#84054) Fixes #84053 As described in the issue, the AveragedModel will deep copy the model during initialization, which means that the buffers in the averaged model cannot be updated together with the model. One solution is to make the buffers equal to the source model every time when calling `update_parameters`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84054 Approved by: https://github.com/samdow commit 1bcd63d5e160a1f8451d3d3d2910ae722564cb6e Author: Jane Xu Date: Mon Oct 24 15:09:40 2022 +0000 [BE][einsum] add small comment explaining an invariant (#87264) Tiny followup from https://github.com/pytorch/pytorch/pull/87135#discussion_r998488064 and another typo i noticed while doing the autograd lab Pull Request resolved: https://github.com/pytorch/pytorch/pull/87264 Approved by: https://github.com/soulitzer commit a06e235edae9189989d53c9ac2d790cbbbd73632 Author: Andrew Gu Date: Mon Oct 24 11:37:26 2022 +0000 [FSDP] `summon_full_params()` in computation stream (#86836) This should help with memory usage. In particular, this allows FSDP to use caching allocator blocks from the computation stream for the `summon_full_params()` all-gathers, which should help avoid over-allocating blocks to the unshard stream. 
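A minimal usage sketch of `summon_full_params()` as discussed above (assuming a process group has already been initialized and a GPU is available; this is illustrative only, not FSDP's internal code):
```
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Sketch only: assumes torch.distributed.init_process_group(...) has already
# been called in the usual multi-process setup.
fsdp_model = FSDP(torch.nn.Linear(8, 8).cuda())

with FSDP.summon_full_params(fsdp_model):
    # Parameters are unsharded inside this context; with the change above, the
    # all-gathers that materialize them run in the computation stream.
    for name, param in fsdp_model.named_parameters():
        print(name, tuple(param.shape))
```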
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86836 Approved by: https://github.com/rohan-varma commit eafc910d16a99200af089099e24468c7f8926a05 Author: andrewor14 Date: Fri Oct 21 14:09:52 2022 -0700 [Quant][docs] Add README for BackendConfig (#86523) Summary: This adds a README for `torch.ao.quantization.backend_config` that describes both the high level motivation and the specifications of the BackendConfig API. Reviewers: jerryzh168, vkuzo Subscribers: jerryzh168, vkuzo Pull Request resolved: https://github.com/pytorch/pytorch/pull/86523 Approved by: https://github.com/jerryzh168 commit 084e77366302e4d88a883a4d2cc88944e943958f Author: Andrew Gu Date: Mon Oct 24 03:31:34 2022 +0000 [FSDP][2/N] Remove `params_with_grad` (#87480) This PR removes the property `params_with_grad` from `FullyShardedDataParallel`. It was introduced when implementing `clip_grad_norm_()` but was not consistently used. Personally, I do not think it makes sense for `FullyShardedDataParallel` to expose this helper because it is not a common paradigm. This PR is technically BC-breaking. However, I checked that no one internally is using this API. cc @ezyang @gchanan Pull Request resolved: https://github.com/pytorch/pytorch/pull/87480 Approved by: https://github.com/rohan-varma commit edac0d22afb6108841b1808ed3948500976942ea Author: Andrew Gu Date: Mon Oct 24 03:31:34 2022 +0000 [FSDP][1/N] Rework `clip_grad_norm_()` and tests (#87479) This PR reworks FSDP's `clip_grad_norm_()` and its unit tests. The unit tests in `test_fsdp_core.py` still need to be revisited and will be done in follow-up work. Some details in arbitrary order: - This renames `_calc_grad_norm()` to `_get_grad_norm()`. This is to simplify our verb usage in method names. Otherwise, we may diverge to different verbs like "compute", "calculate", "get", "find" etc. I am open to discussion here. - Because we call `torch.linalg.vector_norm()` as the underlying norm calculation subroutine, which can take infinity as input for the norm type, there is no reason to have a separate conditional branch for the infinity norm. - This removes a host-device synchronization point from `clip_grad_norm_()` by using the same trick from `torch.nn.utils.clip_grad_norm_()`. This may improve throughput for workloads like metaseq, which computes gradient norms regularly. - This returns the total norm from `clip_grad_norm_()` as mentioned in the docstring. Before nothing was returned. - This rewrites the unit tests, which were slightly problematic. Much of the logic to verify gradient norms were computed correctly were exactly the same as the logic used to compute them in FSDP (i.e. `^p`, sum via all-reduce, `^(1/p)`). This defeats the purpose of unit testing. There were some other oddities like `input = torch.rand(14, 2, device=self.rank); in_data = torch.tensor(input[self.rank], device=self.rank)`, where we materialize a full `(14, 2)` shape but only ever use the first two rows (assuming world size 2). 
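To illustrate the `torch.linalg.vector_norm()` point from the `clip_grad_norm_()` rework above: the infinity norm needs no separate branch because `vector_norm` accepts `inf` as the norm order (a standalone sketch, not the FSDP code itself):
```
import torch

grads = [torch.randn(5), torch.randn(3, 2)]
norm_type = float("inf")

# Same shape of computation as a total gradient norm: per-tensor norms first,
# then a norm over those. For inf this collapses to a plain max of maxes.
per_tensor = torch.stack([torch.linalg.vector_norm(g, norm_type) for g in grads])
total_norm = torch.linalg.vector_norm(per_tensor, norm_type)

assert total_norm == max(g.abs().max() for g in grads)
```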
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87479 Approved by: https://github.com/rohan-varma commit 3528b1fc9a18bd8129c8c14bf00aa276d91c72f8 Author: Andrew Gu Date: Mon Oct 24 03:31:34 2022 +0000 [FSDP][Docs] Clarify warnings to mention collectives (#87478) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87478 Approved by: https://github.com/rohan-varma commit 573c8b6b07e7746219ae6557b3e2cc790865d1c8 Author: Andrew Gu Date: Mon Oct 24 03:36:52 2022 +0000 [FSDP] Rename streams (#86833) This time around, I decided to rename the "all_gather" stream to the "unshard" stream to emphasize that it includes both the actual all-gather op but also the corresponding memory allocations (and also now the unflattening as well). (A similar reasoning applies for the "pre-all-gather" stream becoming the "pre-unshard" stream.) This PR is definitely safe. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86833 Approved by: https://github.com/rohan-varma commit 04ad0134ae51a50a1f657c1e4b86c3c16f0e9158 Author: Andrew Gu Date: Mon Oct 24 03:39:38 2022 +0000 [FSDP] Use `reduce_scatter_tensor()` (#87240) Let us silence some more warnings 👍🏼 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87240 Approved by: https://github.com/rohan-varma commit cdb63a77d5fcdaa94e8aacd15592a3a53ac776d5 Author: PyTorch MergeBot Date: Mon Oct 24 10:43:23 2022 +0000 [xla hash update] update the pinned xla hash (#87590) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87590 Approved by: https://github.com/pytorchbot commit c31e42ca1d91404556f8bae7ba6fce69e7974e25 Merge: 115acf126a 88824d9e20 Author: mingfeima Date: Mon Oct 24 15:23:28 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 88824d9e20c75a7770f662e8d5e2b7b9bb45c1cf Author: mingfeima Date: Mon Oct 24 15:23:28 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit faf9c47abb18168448163a242b22b34e75ff42e1 Author: lezcano Date: Sun Oct 23 20:38:41 2022 +0000 Simplify a few diagonal-related functions (#87180) `diag` was unnecessarily implemented as a kernel rather than as a composite function, which made it unnecessarily difficult (explicit backward + all it entails). We also change a few uses of `diag` on 2D tensors for `diagonal()`. The latter returns a view rather than creating a new tensor. We also upgrade its meta implementation to a fully-fledged decomposition I tried implementing the backwards of `diagonal()` via `diag_scatter` (or better `diag_scatter_` to keep the perf) but functionalisation was failing and I was not sure how to fix this, so I moved on. It may be possible to simplify that one as well if @soulitzer or someone knows how to do this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87180 Approved by: https://github.com/ngimel, https://github.com/albanD, https://github.com/mruberry commit 08c2314d98d38f9a74e8dd34a65c6000c2fae3d1 Author: lezcano Date: Sun Oct 23 20:38:41 2022 +0000 [PrimTorch] Add maker for *_copy variants of view functions (#87278) Implements `diagonal_copy` as an example. This PR also fixes a number of correcness issues with `diagonal_copy`. 
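The view-versus-copy distinction these diagonal commits rely on can be seen directly in a small sketch:
```
import torch

x = torch.zeros(3, 3)

d_view = torch.diagonal(x)  # a view: shares storage with x
d_new = torch.diag(x)       # a new tensor: extracting the diagonal allocates

d_view.fill_(1.0)
print(x.diagonal())  # tensor([1., 1., 1.]) -- writing through the view changed x
print(d_new)         # tensor([0., 0., 0.]) -- the earlier copy is unaffected

# The *_copy variants (e.g. torch.diagonal_copy) keep diagonal()'s signature
# but return a copy instead of a view.
```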
cc @ezyang @mruberry @ngimel @Lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87278 Approved by: https://github.com/mruberry commit 5e4bcb049e5d57ffe5aa539fa93eae372351a45d Author: lezcano Date: Sun Oct 23 20:38:41 2022 +0000 Improve readability of the extra message errors in assertEqual (#87202) Goes from (note the `linspace.default` is very difficult to find) ``` Mismatched elements: 15 / 50 (30.0%) Greatest absolute difference: 1 at index (17,) Greatest relative difference: 1.0 at index (17,) : linspace.default args = (0, -3, 50) kwargs = {'dtype': torch.int16, 'device': device(type='cpu'), 'pin_memory': False} ``` to ``` Mismatched elements: 15 / 50 (30.0%) Greatest absolute difference: 1 at index (17,) Greatest relative difference: 1.0 at index (17,) linspace.default args = (0, -3, 50) kwargs = {'dtype': torch.int16, 'device': device(type='cpu'), 'pin_memory': False} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87202 Approved by: https://github.com/ezyang commit 115acf126a601cfe58a0a233e847d32260b34dd4 Merge: 8269bd8fb6 8cce3d7fb8 Author: mingfeima Date: Mon Oct 24 12:52:44 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 8cce3d7fb8015d9014008a2d9014fb918c4b8cad Author: mingfeima Date: Mon Oct 24 12:52:44 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 233305a852e1cd7f319b15b5137074c9eac455f6 Author: Will Constable Date: Sat Oct 22 14:50:45 2022 +0000 Improvements for DDP Optimizer (#87549) - adds support for 'first_bucket_cap' arg, to align bucketing more precisely with DDP, which may start a smaller first bucket - refactors the bucket splitting logic to be cleaner - adds pretty-print for bucket info, and a way to access bucket info from the DDPOptimizer class from a test case or benchmark - dumps debug logs to stdout cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87549 Approved by: https://github.com/soumith commit 4c8e1a98290a1d9a8b3bb7673bce845583863323 Author: eqy Date: Sun Oct 23 21:17:12 2022 +0000 Fix 64bit indexing in `vol2col` (#87527) Surfaced from #87354 CC @ngimel @ptrblck @maybeLee Pull Request resolved: https://github.com/pytorch/pytorch/pull/87527 Approved by: https://github.com/ngimel commit 2e4c89eba980030f3c711f5693b97e9c17d58a06 Author: efiks <5167930+efiks@users.noreply.github.com> Date: Sun Oct 23 19:29:25 2022 +0000 [torch] Unify batch_box_cox implementations into perfkernels folder (#86569) Summary: 1) Adding MKL/AVX2 based implementation into perfkernels. This implementation is similar to caffe2/operators/batch_box_cox_op.cc 2) Migrating batch_box_cox_op of caffe2 use this implementation Test Plan: CI Differential Revision: D40208074 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86569 Approved by: https://github.com/hyuen commit 0d2baed45e9c9902c85d62a950cac33420cb18e9 Author: Taylor Robie Date: Sat Oct 22 17:37:58 2022 -0700 [Profiler] Regularize `AccumulateGrad` name (#86909) Memory profiler will use AccumulateGrad when detecting gradients. The name difference between Windows and other platforms has already cropped up with profiler trees so it makes sense to address it at the source. 
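Where that `AccumulateGrad` name shows up can be seen from Python; a small sketch (the exact row name can differ by platform, which is what the commit above regularizes):
```
import torch
from torch.profiler import profile

p = torch.randn(16, requires_grad=True)
with profile() as prof:
    (p * 2.0).sum().backward()

# The backward portion of the trace includes the gradient-accumulation node
# for the leaf `p`, reported under a name containing "AccumulateGrad".
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))
```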
Differential Revision: [D40347550](https://our.internmc.facebook.com/intern/diff/D40347550/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86909 Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi commit 5ec03fc17affd1de7eb9fb3bab567b3de0702e9b Author: Taylor Robie Date: Sat Oct 22 17:37:57 2022 -0700 [Profiler][Trivial] Add Module cls and self bindings and type_caster macro (#86755) Just a bit of clean up. We will need `self` and `cls` for memory profiling, and the type_caster specializations were getting quite verbose. Differential Revision: [D39920728](https://our.internmc.facebook.com/intern/diff/D39920728/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86755 Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi commit b0e10292faf947fe08589392c3731bbbcf3b2a05 Author: Taylor Robie Date: Sat Oct 22 17:37:55 2022 -0700 [Profiler] Tensor IDs for Module and Optimizer variables (#86754) More sophisticated profiling will increasingly rely on python tracer to contextualize observed results. This PR adds Tensors which are observed by the python tracer to the identity assignment loop. Differential Revision: [D39852885](https://our.internmc.facebook.com/intern/diff/D39852885/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86754 Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi commit be2d647ea690cf302926a67b38d980841e403178 Author: Taylor Robie Date: Sat Oct 22 17:37:54 2022 -0700 [Profiler] Use parameter as key for optimizer state recording. (#86753) While optimizer can store state however it likes, in practice most optimizer state corresponds to a particular parameter. (This is the case for all `torch.optim` optimizers.) Thus, it turns out to be ergonomic to collect using that structure. Note that this doesn't lock us into anything; we can always collect state with non Tensor keys if the use case arises. One simplification that arises is that Module and Optimizer collection has very similar structure. So similar, in fact, that it is possible to use a common template for config. I also found that a lot of the `check_and_store` logic could be simplified and inlined by this joining of collected optimizer state. Differential Revision: [D40210703](https://our.internmc.facebook.com/intern/diff/D40210703/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86753 Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi commit fc3beef5ac11a88c7f538efcb7c60c5971393f38 Author: Horace He Date: Sun Oct 23 02:53:37 2022 +0000 Fix stupid N^2 naming behavior in FX and removed assert that slows things a lot sometimes (#87533) cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87533 Approved by: https://github.com/ezyang, https://github.com/voznesenskym commit efdd43d5193435206fbe76cecc294961d10558db Author: PyTorch MergeBot Date: Sun Oct 23 03:18:57 2022 +0000 [vision hash update] update the pinned vision hash (#87528) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87528 Approved by: https://github.com/pytorchbot commit 9bb4926de0b76210f0d1ab90f897671fe4334d7c Author: Ryan Spring Date: Sat Oct 22 17:59:25 2022 +0000 Add xlogy and xlog1py references (#77712) * Add reference implementations for `xlogy` and `xlog1py` * Replace `_wrap_scalar` helper function with `scalar_tensor` prim Pull Request resolved: https://github.com/pytorch/pytorch/pull/77712 Approved by: https://github.com/mruberry commit f3f1b447787da713a00ad4219532a6e4e9e2bcf8 Author: Sherlock Huang Date: Sat Oct 22 02:21:07 2022 +0000 Fix meta for meta_fill_ (#87493) Existing meta_fill_ doesn't correctly reflect the aliasing relationship for aten.fill. A new MetaTensor should be returned instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87493 Approved by: https://github.com/eellison, https://github.com/bdhirsh commit 2f9fc160a41d8da719d086e830d703e6af5efd6b Author: Nikita Shulga Date: Sat Oct 22 06:06:15 2022 +0000 [CI] Run all MacOS builds on MacOS-12 (#87496) Not sure why we needed macos-10.15 for libtorch Pull Request resolved: https://github.com/pytorch/pytorch/pull/87496 Approved by: https://github.com/atalman, https://github.com/seemethere commit c28cdb53ea1f3e377e478fbdfa64b8cffc3828e6 Author: Nikita Shulga Date: Sat Oct 22 06:00:59 2022 +0000 [BE] Delete BUILD_SPLIT_CUDA option (#87502) As we are linking with cuDNN and cuBLAS dynamically for all configs anyway (a statically linked cuDNN is a different library than the dynamically linked one, increases the default memory footprint, etc.), and libtorch_cuda, even if compiled for all GPU architectures, is no longer approaching the 2Gb binary size limit, BUILD_SPLIT_CUDA can go away. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87502 Approved by: https://github.com/atalman commit f047dadab94c44ed348147960b9a2a24ed505b31 Author: Bin Bao Date: Fri Oct 21 23:01:17 2022 +0000 Enable inductor CI for TIMM (#87462) cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87462 Approved by: https://github.com/anijain2305 commit 0ef0a78196cf8726b0327c9f370615c8889ed676 Author: PyTorch MergeBot Date: Sat Oct 22 04:51:33 2022 +0000 Revert "Improvements for DDP Optimizer (#87525)" This reverts commit cf693a02e0f6a022d10fd882af20efacfe7ecb76.
Reverted https://github.com/pytorch/pytorch/pull/87525 on behalf of https://github.com/ZainRizvi due to The macos error messages look like they were indeed caused by this PR commit cf693a02e0f6a022d10fd882af20efacfe7ecb76 Author: Will Constable Date: Sat Oct 22 01:03:41 2022 +0000 Improvements for DDP Optimizer (#87525) - adds support for 'first_bucket_cap' arg, to align bucketing more precisely with DDP, which may start a smaller first bucket - refactors the bucket splitting logic to be cleaner - adds pretty-print for bucket info, and a way to access bucket info from the DDPOptimizer class from a test case or benchmark - dumps debug logs to stdout cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87525 Approved by: https://github.com/davidberard98 commit 8461460d55c2474b236a5d7198067ed299631b76 Author: Michael Lazos Date: Sat Oct 22 03:43:08 2022 +0000 Unified debug directory for dynamo/inductor tools (#87438) Fixes https://github.com/pytorch/torchdynamo/issues/1705 Fixes https://github.com/pytorch/torchdynamo/issues/1383 Adds a debug directory by default called `torchdynamo_debug` in the current working directory. In the debug directory for each run of dynamo (an enter and exit of optimize) folder run_\ is created which contains any minifier/inductor/torchdynamo artifacts under respective folders. Updated the minifier, record replay, and inductor tracing to use this directory cc @jansel @lezcano @fdrocha @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87438 Approved by: https://github.com/soumith commit b18fadae88f02232ede90c8577b4509015fedcc8 Author: Will Constable Date: Fri Oct 21 23:13:39 2022 +0000 Re-enable dynamo ddp tests (#87524) - Move dynamo dist tests to another shard Pull Request resolved: https://github.com/pytorch/pytorch/pull/87524 Approved by: https://github.com/davidberard98 commit 707218f1253ffb3a000c9c5db4d96e0cf3bda4c7 Author: Jason Ansel Date: Fri Oct 21 15:14:15 2022 -0700 Reland #87025 and fix periodic tests (#87084) - Relands #87025 - disables failing tests related to https://github.com/pytorch/torchdynamo/issues/1697 - Reverts https://github.com/pytorch/pytorch/commit/d01eea6027c26bf100fc99a705669f60648964ae cc @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87084 Approved by: https://github.com/malfet, https://github.com/voznesenskym commit 5c4a2e679b6318f0094e2a0c8310ac40658c0d95 Author: Catherine Lee Date: Fri Oct 21 22:53:35 2022 +0000 fix docs push (#87498) push docs to temp branch first then push to actual branch to satisfy CLA check in branch protections Pull Request resolved: https://github.com/pytorch/pytorch/pull/87498 Approved by: https://github.com/malfet commit 838b699e1082791d5e838ca0de0d72c4b6120e14 Author: Edward Z. Yang Date: Fri Oct 21 12:57:55 2022 -0400 as_strided_scatter storage offset defaults to None not 0 (#87481) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87481 Approved by: https://github.com/bdhirsh commit c55b3325176129babc7b870e6d624deac6930183 Author: Will Constable Date: Fri Oct 21 16:21:43 2022 +0000 Delete unused static runtime experiment (#87473) cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87473 Approved by: https://github.com/anijain2305 commit dfc65f43f9f1b15b14759396547816f5605519f2 Author: Will Constable Date: Fri Oct 21 16:21:43 2022 +0000 Delete unused ts experiment (#87472) cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87472 Approved by: https://github.com/anijain2305 commit 7baf4b1969fcd63de5d6f5d8118cc61bab6b1e97 Author: Will Constable Date: Fri Oct 21 16:21:43 2022 +0000 Delete unused ltc experiments (#87471) cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87471 Approved by: https://github.com/anijain2305 commit 62d30f5a8ab6816874c7f1d43402bb7e1d1eb6ec Author: Will Constable Date: Fri Oct 21 16:21:43 2022 +0000 Remove unused cold_start experiment (#87470) - this `--cold_start` experiment didn't end up being used - there is a new `--cold_start_latency` flag that is used - this experiment was only hooked up for nvfuser anyway cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87470 Approved by: https://github.com/anijain2305 commit ee231671c0e50329dfd6c6cdb9d3e78848c5754c Author: Will Constable Date: Fri Oct 21 16:21:42 2022 +0000 Make torchbench setup a function (#87469) cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87469 Approved by: https://github.com/anijain2305 commit 169ec120efed0a5f0e050a4d9c7c762ba05fa67c Author: samdow Date: Wed Oct 19 10:36:40 2022 -0400 [Modes] refactor modes to only use a stack in cpp (#86458) Refactors the mode code to only have the C++ mode stack and not the "C++ mode" like we originally had. This also simplifies the mode logic in a number of places Pull Request resolved: https://github.com/pytorch/pytorch/pull/86458 Approved by: https://github.com/zou3519 commit 13cad7e1203a5a2416240dec87ee6e374486dcdc Author: Huy Do Date: Fri Oct 21 19:14:28 2022 +0000 [BE] Remove pip and conda installation in Linux build workflow (#87256) All the dependencies should come from the Docker container already. This only updates Linux build workflow, Linux test workflow comes later in a separate PR. 
The `opt-einsum` package that was installed as part of PyTorch wheel has already been installed in the Docker container [requirements-ci.txt](https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt#L127) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87256 Approved by: https://github.com/malfet commit 620dbc43d8e3c836ad8d934987ee2f87fefbad7a Author: Alex Date: Fri Oct 21 19:03:00 2022 +0000 Slowly introduce ops to be tested by test_numpy_ref on MPS backend (#87342) Enable a test that would have caught https://github.com/pytorch/pytorch/issues/86239 Prior to the fix for that bug, this test fails with ``` _____________________________ TestCommonMPS.test_numpy_ref_mps_where_mps_float32 _____________________________ Traceback (most recent call last): File "/Users/alex/git/pytorch/test/test_ops.py", line 197, in test_numpy_ref_mps self.compare_with_reference( File "/Users/alex/git/pytorch/torch/testing/_internal/common_utils.py", line 2366, in compare_with_reference actual = torch_fn(t_inp, *t_args, **t_kwargs) File "/Users/alex/git/pytorch/torch/testing/_internal/opinfo/core.py", line 1068, in __call__ return self.op(*args, **kwargs) File "/Users/alex/git/pytorch/torch/testing/_internal/common_methods_invocations.py", line 15167, in op=lambda self, condition, other: torch.where(condition, self, other), RuntimeError: 0'th index 3 of x tensor does not match the other tensors ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87342 Approved by: https://github.com/albanD commit 7bd04fb09f3c1c310f1303272def3d59bf547964 Author: Iris Zhang Date: Fri Oct 21 18:45:38 2022 +0000 [1/N][C10D] Add a customized ScubaLogHandler implementation for internal FB use (#86699) (#87123) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86699 This diff does the following: 1. **c10d_error_logger.py**: Add an API to create a logger with a specific logging handler based on the destination. 2. The API from above would get a logging handler based on the destination provided. - **caffe2/torch/distributed/logging_handlers.py**: For OSS, we simply use a NullHandler() for now. 3. Add associated test files for 1 and 2. Test Plan: ``` buck test @//mode/dev-nosan //caffe2/test/distributed:test_c10d_error_logger -- --print-passing-details ``` ``` File changed: fbcode//caffe2/test/distributed/test_c10d_error_logger.py File changed: fbsource//xplat/caffe2/test/distributed/TARGETS 9 additional file changes waiting for all tests to finish... ✓ Listing success: caffe2/test/distributed:test_c10d_error_logger (0.2s) Found 1 tests ✓ Pass: caffe2/test/distributed:test_c10d_error_logger - test_get_or_create_logger (caffe2.test.distributed.test_c10d_error_logger.C10dErrorLoggerTest) (0.2s) stdout: stderr: Buck UI: https://www.internalfb.com/buck2/b975f6b0-77e9-4287-8722-f95b48036181 Test Session: https://www.internalfb.com/intern/testinfra/testrun/1407375150206593 RE: reSessionID-4d7ab8ca-1051-48e9-a5a8-6edbe15d1fe4 Up: 124 B Down: 0 B Jobs completed: 5. Time elapsed: 3.5s. Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. 
0 builds failed ``` Differential Revision: D39920391 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87123 Approved by: https://github.com/fduwjj, https://github.com/H-Huang commit 100beb20999a54152557ae7875433e1558f0541a Author: Zain Rizvi Date: Fri Oct 21 18:15:38 2022 +0000 Only label checks against pull requests (#87488) When a commit is triggered via any mechanism other than a pull request, there will not be a PR to check labels for. The job will fail with the error: ``` 2022-10-21T17:50:53.2938592Z + python3 .github/scripts/check_labels.py '' 2022-10-21T17:50:53.4758863Z usage: Check PR labels [-h] pr_num 2022-10-21T17:50:53.4759337Z Check PR labels: error: argument pr_num: invalid int value: '' ``` Instead, we should limit the workflow to only run on pull requests Pull Request resolved: https://github.com/pytorch/pytorch/pull/87488 Approved by: https://github.com/huydhn commit 2a6079d58808236e52b8040e45450a6312a284a6 Author: Catherine Lee Date: Fri Oct 21 18:13:56 2022 +0000 fix for dynamo xml reporting (#87378) dynamo tests call a helper function in torch/_dynamo/test_case.py which then calls run_tests in common_utils.py so the test report path looked something like /opt/conda/lib/python3/10/site-packages/torch/_dynamo/test_case * instead of using frame, use argv[0] which should be the invoking file * got rid of sanitize functorch test name because theyve been moved into the test folder Pull Request resolved: https://github.com/pytorch/pytorch/pull/87378 Approved by: https://github.com/huydhn commit 6e1764d806bc45e7c15c79f7e5f1a0bafb76ec73 Author: Eli Uriegas Date: Fri Oct 21 11:17:39 2022 -0400 ci: Allow nvidia-smi to continue with non-0 exit (#87464) Allows nvidia-smi to return a non-0 exit status like status 14 since status 14 is a warning and doesn't affect actual execution see https://github.com/NVIDIA/gpu-operator/issues/285 Signed-off-by: Eli Uriegas Pull Request resolved: https://github.com/pytorch/pytorch/pull/87464 Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/ZainRizvi commit 9ad1659b17ff12109b7bf4e8669d1e07ed4a84e7 Author: Brian Hirsh Date: Fri Oct 21 08:29:10 2022 -0700 functionalization: make view_copy outputs always contiguous (#85747) This fixes an issue with mobile: The output of view_copy ops should always be contiguous. Later, we can consider adding optional arguments to the `view_copy()` functions to let you explicitly say what the contiguity of the output can be (e.g. channels_last) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85747 Approved by: https://github.com/ezyang commit 294bfb8e806764eaeac8f7dad10ab07ad8770110 Author: Neel Patel Date: Fri Oct 21 17:39:27 2022 +0000 Create workflow to make sure PRs have valid labels (#86829) When a dev submits a PR against the repo, we want to validate that they applied two labels to the PR corresponding the module they edited and the kind of change they're making. Extended the open source workflow CI to add a validation to ensure that the PR being checked has the required labels on it. If it doesn't, the check fails and a bot will post a message on the PR with instructions on what labels the developer needs to add (https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work). Every time a new version of PyTorch is released, we want to compile all the changes made to each module. 
However, when devs forget to tag their PR, compiling the changes to write the release notes becomes a burdensome process (only ~20% of PRs are currently labeled appropriately, which means it can take up to 40 hours to compile release notes). With this new validation, the hope is that most PRs are labeled accordingly for more timely release notes compilation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86829 Approved by: https://github.com/ZainRizvi commit fbcd4fe2d28d478330308bf50dfb4247371ca848 Author: Huy Do Date: Fri Oct 21 17:39:01 2022 +0000 Skip auto request review on forked PR (#87482) Addresses the comment in https://github.com/pytorch/pytorch/pull/87409 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87482 Approved by: https://github.com/albanD commit 5b7f027d911efd499674414c20fce5af5f8269d2 Author: Peter Bell Date: Thu Oct 20 18:06:25 2022 +0100 Remove redundant zeroing in col2im/im2col (#87375) All of the kernels already either start by zeroing the output, or are careful in their implementation to write values to every output location. So, these `zero_` calls should be redundant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87375 Approved by: https://github.com/albanD commit 4fc72b0f4e9ac3c260031224271fa9d71578113f Author: chuksmbaka Date: Fri Oct 21 17:30:18 2022 +0000 Grammatical update of the tech docs. (#87357) Fixes #ISSUE_NUMBER A more appropriate and correct word. ![grammatical correction](https://user-images.githubusercontent.com/25278471/196927273-7e4c0c9b-96a6-43d1-9b10-17b40665feed.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87357 Approved by: https://github.com/albanD commit 6efdcb07884ab9ebeb5e73c1dc043dc9869b1639 Author: William Wen Date: Fri Oct 21 17:30:14 2022 +0000 Add dynamo smoke test (#87400) https://github.com/pytorch/torchdynamo/issues/1733 Move the old smoke test over from the old dynamo repo. cc @jansel @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87400 Approved by: https://github.com/msaroufim commit db83a0578c914ddc4f229657ba9d9bbe879f92e5 Author: Zachary DeVito Date: Fri Oct 21 03:51:25 2022 +0000 [inductor] force 'fork' method for processes, cleanup (#87411) To cooperate with other multithreading methods, this forces the process pool to use 'fork' even if others have set it diferently. We require fork because otherwise `if __name__ == __main__` needs to be set which we do not control as a library. Furthermore this adds code to cleanup worker processes if the parent exits abnormally (e.g. segfault). Previously we would leave live but inactive workers around. cc @jansel @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87411 Approved by: https://github.com/soumith, https://github.com/anijain2305 commit 96691865b9696e6218e36cc5a5cf794334859275 Author: Edward Z. Yang Date: Fri Oct 21 07:29:38 2022 -0700 [dynamo] Unify raise_on_* config to suppress_errors and raise by default (#87440) I noticed that a lot of bugs are being suppressed by torchdynamo's default error suppression, and worse yet, there's no way to unsuppress them. After discussion with voz and soumith, we decided that we will unify error suppression into a single option (suppress_errors) and default suppression to False. If your model used to work and no longer works, try TORCHDYNAMO_SUPPRESS_ERRORS=1 to bring back the old suppression behavior. Signed-off-by: Edward Z. 
Yang cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87440 Approved by: https://github.com/voznesenskym, https://github.com/albanD commit 1133682c46c7a7d7ea804519125555129f3f0498 Author: Andrew Gu Date: Fri Oct 21 11:35:30 2022 +0000 [FSDP][2/N] Fix grad zero vs. `None` edge case (#87308) Some original parameters corresponding to one `FlatParameter` may have `None` gradient while others do not. In that case, the `flat_param.grad` must be non-`None`. However, FSDP should take care to expose the original parameters' gradients regardless. To achieve this, we track a `_is_grad_none` mask over the parameters' gradients. - `_is_grad_none` is initialized to `False` for all. - `_is_grad_none[i]` is set to `True` when writing zeros in place of `None` when writing back the `i`th gradient. - `_is_grad_none[i]` is set to `False` via `_reset_is_grad_none()`, which should be called in the post-backward. See the docstring for details. - `_is_grad_none[i]` must be `False` in order to set `param.grad` to be a view into `flat_param.grad`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87308 Approved by: https://github.com/zhaojuanmao commit 4ee13a5925b13c81525d331a842acc263d295b8e Author: Andrew Gu Date: Fri Oct 21 11:35:30 2022 +0000 [FSDP][1/N] Update `summon_full_params(with_grads)` `None` gradient (#87314) This PR changes `summon_full_params(with_grads=True)`'s behavior to be such that if all ranks have `flat_param.grad = None`, then the original parameters will correctly have `orig_param.grad = None`. This is achieved with a preliminary all-reduce. Note that if a particular original parameter's gradient is `None` on all of the containing ranks, but not all ranks' `flat_param.grad = None`, then that particular gradient is still going to be set to zeros. This can be handled if desired in follow-up work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87314 Approved by: https://github.com/zhaojuanmao commit 4caddac534cd58fdd19eff922212ec7884e85ebc Author: Jerry Zhang Date: Fri Oct 21 16:57:33 2022 +0000 [quant][api] Add assert for backend in get_default_qconfig related apis (#86259) (#87331) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86259 Add assertion to make sure backend is one of "fbgemm", "x86", "qnnpack" and "onednn" for get_default_qconfig, get_default_qat_qconfig, get_default_qconfig_mapping and get_default_qat_qconfig_mapping Test Plan: python test/test_quantization.py -k test_get_default_qconfig_mapping Imported from OSS Reviewed By: jcaip Differential Revision: D40236474 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87331 Approved by: https://github.com/andrewor14 commit 4cc5d6644fd647f14bafae7cb4a4348dd4327c72 Author: Andrew Gu Date: Fri Oct 21 11:30:58 2022 +0000 [FSDP][6/N] Remove FPW! (#87114) This PR simply deletes `flatten_params_wrapper.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87114 Approved by: https://github.com/zhaojuanmao commit f8dd27420ba9945589a4e1dea4f657d3ee68c46f Author: Andrew Gu Date: Fri Oct 21 11:30:58 2022 +0000 [FSDP][5/N] Update `FlatParamHandle` after FPW deprecation (#87113) This PR resolves a TODO left in `FlatParamHandle` that was conditional on deprecating `FlattenParamsWrapper`. We simply pass in the process group into the `FlatParamHandle` constructor instead of later in `shard()`. 
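A quick sketch of the backend check added in #87331 above ("fbgemm", "x86", "qnnpack", and "onednn" are the accepted strings per that commit):
```
from torch.ao.quantization import get_default_qconfig

qconfig = get_default_qconfig("fbgemm")  # any of: fbgemm, x86, qnnpack, onednn
# get_default_qconfig("tensorrt")        # would now trip the new assertion
print(qconfig)
```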
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87113 Approved by: https://github.com/zhaojuanmao commit 214d51756ab8fad49639be9c20120e6e4384778b Author: Andrew Gu Date: Fri Oct 21 11:30:57 2022 +0000 [FSDP][4/N] Rework FPW test to not use FPW (#87112) Testing coverage is pretty much preserved except that we do not test on CPU, which is not a tangible loss for FSDP anyway. I renamed a few tests slightly, and I moved some helpers to be immediately below the corresponding test method. This makes it a bit easier to read. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87112 Approved by: https://github.com/zhaojuanmao commit 277e37f945da4d3e119223cd0ca8593101f593d5 Author: Andrew Gu Date: Fri Oct 21 11:30:57 2022 +0000 [FSDP][3/N] Register `flat_param` to wrapped module (#87086) This PR registers each `FlatParameter` to the wrapped module, eliminating `FlattenParamsWrapper` usage completely from FSDP. Registering each `FlatParameter` to the wrapped module is preferred over registering to the `FullyShardedDataParallel` instance for both functional-like and non-recursive wrapping. It simplifies the `FlatParameter` naming to be a function of the number of `FlatParameter`s per wrapped module instead of the number of `FlatParameter`s per FSDP instance. For now, we assume 1 `FlatParameter` per wrapped module, so we can simply use a single name `FLAT_PARAM = _flat_param`. From an implementation perspective, we raise some methods from `FlattenParamsWrapper` directly up to `FullyShardedDataParallel`. There will need to be further refactoring for functional-like and non-recursive wrapping. For example, the property `self._has_params -> bool` may need to change to a method `self._has_params(wrapped_module) -> bool`. Such changes are out of scope for this PR and will be done in follow-ups. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87086 Approved by: https://github.com/zhaojuanmao commit 9f8ef8eaff1970ccb87ba7a0c25787588c9c39ad Author: Andrew Gu Date: Fri Oct 21 11:30:56 2022 +0000 [FSDP][2/N] Remove `_fsdp_wrapped_module.flat_param` (#86122) This removes **direct** usages of `_fsdp_wrapped_module.flat_param` with `_handles[0].flat_param`. The preferred way to access the `flat_param` will be through the handle. We may converge to only storing `self._handles` and no longer `self.params` in the future. Right now, `self.params` is always exactly `[handle.flat_param for handle in self._handles]`. cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86122 Approved by: https://github.com/zhaojuanmao commit ce0c6e828ed2338df75017fa434fcb2744502024 Author: Brian Hirsh Date: Fri Oct 21 06:21:41 2022 -0700 Reland "add an API for external backends to register custom device names (#86992)" (#87453) Re-land of https://github.com/pytorch/pytorch/pull/86992 This reverts commit a895af92506f206889610251624590798d0deabd. 
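For the re-landed custom-device-name API above, a minimal sketch (assuming the entry point is the `rename_privateuse1_backend` helper under `torch.utils`):
```
import torch

# An out-of-tree backend that dispatches through the PrivateUse1 key can choose
# the device-type string its users see instead of the default "privateuseone".
torch.utils.rename_privateuse1_backend("my_accelerator")
```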
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87453 Approved by: https://github.com/ezyang, https://github.com/albanD commit 70c46d32e25b7e8b5c0e457d78292c8eb9634d5a Author: jyx-su <108294040+jyx-su@users.noreply.github.com> Date: Fri Oct 21 16:28:29 2022 +0000 Fix input dimension issue in RNN, LSTM, GRU error message (#87442) Fixes #86576 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87442 Approved by: https://github.com/albanD commit 0c1dec375fce6fb5f75f72ea88391eeada118805 Author: PyTorch MergeBot Date: Fri Oct 21 16:03:00 2022 +0000 Revert "Back out "Revert D40198461: [pytorch][PR] Backport currently dont work with some models if:" (#87124)" This reverts commit a42fbfa0cb467b582799a5132561c82a3d33b1b7. Reverted https://github.com/pytorch/pytorch/pull/87124 on behalf of https://github.com/ZainRizvi due to This is causing periodic jobs to fail commit d73d4aa7de953a3794593ac9e6d6b3a1ce514c3c Author: Edward Z. Yang Date: Fri Oct 21 05:54:15 2022 -0700 Audit for error prone isinstance int/float and add lint (#87345) We recently fixed a bug on symbolic-shapes branch where an isinstance(x, int) test failed when passed a SymIntNode. To prevent this, I've added a lint for all the codepaths where we may pass SymInt/SymFloat directly to reject direct isinstance int/float tests, and instead use one of the aliases. The lint rule explains the options. I then go and fix all of them. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87345 Approved by: https://github.com/bdhirsh, https://github.com/albanD commit 1285542f9b54972089655f91146e277c004762a2 Author: Peter Bell Date: Fri Oct 21 13:29:31 2022 +0100 OpInfo: Add test that sample_inputs_func returns a generator (#84567) This also includes a small list exception for single element lists since none of the memory usage or performance implications of lists apply there. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84567 Approved by: https://github.com/lezcano, https://github.com/mruberry commit aa8248cc9a80fc7fc2e5981b8238271d9642eb40 Author: Masaki Kozuki Date: Fri Oct 21 15:05:36 2022 +0000 Reenable `isinstance` with `torch.distributed.ReduceOp` (#87303) tentatively marking as draft as I haven't gotten a comprehensive list of side effects... Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself Rel: https://github.com/pytorch/pytorch/issues/87191 cc @kwen2501 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87303 Approved by: https://github.com/wanchaol commit d37dc6f69874ffac21390f5e78bf79c43631eb92 Author: Antonio Kim Date: Fri Oct 21 14:28:14 2022 +0000 Make LazyGraphExecutor extensible (#87218) Add `LazyGraphExecutor` to backend interface so that its is extensible by a vendor backend. I've made some preliminary methods virtual. Not sure if we want to make all methods in `LazyGraphExecutor` virtual. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87218 Approved by: https://github.com/wconstab, https://github.com/alanwaketan commit d80a5f9a963fdfb583ca21a1dc70c1355983da39 Author: Kazuaki Ishizaki Date: Fri Oct 21 14:22:20 2022 +0000 Fix typo under torch directory (#87274) This PR fixes typo in .md files under torch directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/87274 Approved by: https://github.com/albanD commit ae62cf7c02c009ab1123a3bfa08ca9a5e4255e4a Author: Nikita Shulga Date: Fri Oct 21 14:10:05 2022 +0000 [MPS] Revamp copy_to_mps_ implementation (#86956) Tensor's view in linear storage is represented by the following parameters: `.shape`, `.stride()` and `.storage_offset()`. Only tensors that are representable as 1d-views can be copied from host to device (and vice versa) using single [`copy(from:sourceOffset:to:destinationOffset:size:)`](https://developer.apple.com/documentation/metal/mtlblitcommandencoder/1400767-copyfrombuffer?language=objc) call. Modify `copy_to_mps_` function to do the following steps: - Cast `src` tensor to dst data type if needed - Expand `src` tensor to `dst` tensor shape - Clone `src` tensor if it is not stride contiguous (i.e. can not be represented by `src.view(src.numel())`) - Create an empty tensor if `dst` is not stride-contiguous or if its strides are different then potentially cloned `src` strides - Do 1d copy for `src` to (potentiall temp) `dst` - Finally do re-striding/copy on MPS if needed Add test to cover cases where stide-contiguous permuted tensor is copied to MPS, non-stride-contiguous tensor is copied to MPS and if permuted CPU tensor is copied to differently permuted MPS tensor Fixes https://github.com/pytorch/pytorch/issues/86954 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86956 Approved by: https://github.com/kulinseth commit 435e78e5237d9fb3e433fff6ce028569db937264 Author: Michael Voznesensky Date: Fri Oct 21 07:55:23 2022 +0000 [dynamo] [easy] RM spurious `)` (#87439) Fixes #ISSUE_NUMBER cc @jansel @lezcano @fdrocha @mlazos @soumith @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87439 Approved by: https://github.com/msaroufim, https://github.com/soumith commit ab901b48178d6f927f90009d71d7784a5d5627f2 Author: Sherlock Huang Date: Fri Oct 21 00:46:34 2022 +0000 Python binding for dispatcher getAllOpNames (#87422) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87422 Approved by: https://github.com/bdhirsh commit 7caeac17183a9aee0ccce4a3470925c6fe7e5007 Author: Soumith Chintala Date: Fri Oct 21 06:36:13 2022 +0000 [inductor] Fix channels_last conv2d propagation when CuDNN is not found (#87266) Fixes https://github.com/pytorch/torchdynamo/issues/1701 cc @jansel @lezcano @fdrocha @mlazos @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87266 Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/voznesenskym commit 6b59d9b566001cd7036ac06497372eae6238cdd4 Author: Antonio Kim Date: Fri Oct 21 05:12:23 2022 +0000 Fix registration hooks (#87369) There is a bug in the implementation of the registration hooks introduced in https://github.com/pytorch/pytorch/pull/86148 whereby if the hook returns a tensor, then the short circuiting logic: ``` value = hook(self, name, value) or value ``` Raises an exception ``` RuntimeError: Boolean value of Tensor with more than one value is ambiguous ``` Fixing the logic so that it only checks to see if the value is `None` before 
overriding Fixes #85837 CC: @albanD @jbschlosser Pull Request resolved: https://github.com/pytorch/pytorch/pull/87369 Approved by: https://github.com/albanD commit 8269bd8fb656e43250719091ac302b0eee289f22 Merge: 4bc2a0dcda c79309051a Author: mingfeima Date: Fri Oct 21 11:38:00 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit c79309051a85a75e087421bd087194a94a43acc6 Merge: 5df5c3e33e 6faa6c68e8 Author: mingfeima Date: Fri Oct 21 11:38:00 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit ff43288d31ea7f3de69f4907e2a36455c742d9c9 Author: Soumith Chintala Date: Fri Oct 21 03:14:28 2022 +0000 [AOT][CUDAGraphs] torchdynamo -> torch._dynamo (#87243) Fixes lingering issues from the torchdynamo -> torch._dynamo migration Pull Request resolved: https://github.com/pytorch/pytorch/pull/87243 Approved by: https://github.com/suo, https://github.com/voznesenskym, https://github.com/jansel commit 13ab819356e5a7b7deab1c486fdf36ba0906ebda Author: Richard Zou Date: Thu Oct 20 15:40:03 2022 -0700 [functorch] fix AOTAutograd tutorial (#87415) It was raising asserts previously Pull Request resolved: https://github.com/pytorch/pytorch/pull/87415 Approved by: https://github.com/Chillee commit b1cf377cceb44cb8f567d8ccd59b1d085b13ac50 Author: Bin Bao Date: Thu Oct 20 22:37:07 2022 +0000 Enable inductor CI for huggingface (#86792) Summary: Unit tests will be enabled after fixed in trunck. TorchBench and TIMM need more setup and are coming later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86792 Approved by: https://github.com/jansel, https://github.com/huydhn commit 9ba632253a4e40749aa0589618c19dac1d0b7839 Author: Yanbo Liang Date: Fri Oct 21 01:24:00 2022 +0000 [Inductor] Convert 0d CPU tensor to scalar during triton codegen (#87329) This is a follow up to address [this](https://github.com/pytorch/torchdynamo/pull/1284#pullrequestreview-1130319129). We revised to use the codegen approach to handle 0d CPU tensor, which will not support cudagraph any more. cc @jansel @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87329 Approved by: https://github.com/ngimel commit 961ebca2255f902477e9ea7060b8f28781e3c0cd Author: Nikita Shulga Date: Fri Oct 21 01:09:50 2022 +0000 Add `weights_only` option to `torch.load` (#86812) This addresses the security issue in default Python's `unpickler` that allows arbitrary code execution while unpickling. Restrict classes allowed to be unpicked to in `None`, `int`, `bool`, `str`, `float`, `list`, `tuple`, `dict`/`OrderedDict` as well as `torch.Size`, `torch.nn.Param` as well as `torch.Tensor` and `torch.Storage` variants. Defaults `weights_only` is set to `False`, but allows global override to safe only load via `TORCH_FORCE_WEIGHTS_ONLY_LOAD` environment variable. To some extent, addresses https://github.com/pytorch/pytorch/issues/52596 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86812 Approved by: https://github.com/ezyang commit e3d73bbb07c1dd992a8a209b399c733b64bb8de8 Author: Jason Ansel Date: Thu Oct 20 17:35:49 2022 -0700 Remove jansel/voz from dynamo CODEOWNERS (#87430) Now that CC bot is working on PRs this is no longer needed. 
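As a usage note for the `weights_only` option added in #86812 above, the sketch below shows the intended safe-load path. It is illustrative only: the file name is made up, and the exact set of allowed classes is whatever the unpickler whitelist in that PR defines.

```python
import torch

sd = {"w": torch.randn(3, 3)}            # e.g. a model state_dict
torch.save(sd, "checkpoint.pt")          # hypothetical file name

# Safe path: restricts unpickling to basic containers plus Tensor/Storage-style types.
safe_sd = torch.load("checkpoint.pt", weights_only=True)

# Default path (weights_only=False) keeps the old behaviour, which can execute
# arbitrary code embedded in a malicious pickle -- exactly the risk the flag avoids.
```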
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87430 Approved by: https://github.com/voznesenskym commit bd1e95ce306956a748915a217c3ae9012469b0fa Author: Chien-Chin Huang Date: Thu Oct 20 12:30:09 2022 -0700 Improve the performance of validate_non_overlapping_shards_metadata (#85639) `validate_non_overlapping_shards_metadata()` uses a quadratic algorithm to verify overlapping. However, in some cases (only one dimension is sharded), an O(n log n) algorithm can easily be implemented. This PR changes the implementation of `validate_non_overlapping_shards_metadata()`. Differential Revision: [D39681725](https://our.internmc.facebook.com/intern/diff/D39681725/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85639 Approved by: https://github.com/wanchaol commit a42fbfa0cb467b582799a5132561c82a3d33b1b7 Author: Han Qi (qihqi) Date: Thu Oct 20 23:02:10 2022 +0000 Back out "Revert D40198461: [pytorch][PR] Backport currently dont work with some models if:" (#87124) Summary: reland after fixing windows build failure for OVR. Notable change: ```#if defined(FBCODE_CAFFE2) or defined(FB_XPLAT_BUILD) ``` changed to ```#if defined(FBCODE_CAFFE2) || defined(FB_XPLAT_BUILD) ``` Apparently `-DFB_XPLAT_BUILD` wasn't getting picked up on Windows if using `or` to connect the defines. Original commit changeset: 7a31fc4b455f Original Phabricator Diff: D40198461 Test Plan: waitforsandcastle Reviewed By: davidberard98, cccclai Differential Revision: D40290932 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87124 Approved by: https://github.com/gmagogsfm commit f38a88c4dd8ce006b9934d0d2f121fb93564479b Author: PyTorch MergeBot Date: Thu Oct 20 22:01:51 2022 +0000 Revert "[dynamo] use optimizers correctly in benchmarking (#87311)" This reverts commit 703c19008df4700b6a522b0ae5c4b6d5ffc0906f. Reverted https://github.com/pytorch/pytorch/pull/87311 on behalf of https://github.com/anijain2305 due to Bin (desertfire) is trying to get torchbench models in CI, and this PR prevents that. I will bring this back after models are in CI. commit a91abedf0d78c2582987a5a46472e84cb105d196 Author: Yanbo Liang Date: Thu Oct 20 21:59:12 2022 +0000 [Inductor] TorchInductor tracing fx_graph.py should import overrides (#87271) Running the generated script would fail if there are ops like ```philox_rand_like``` and ```philox_rand_like```.
cc @jansel @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87271 Approved by: https://github.com/jansel commit 1801b57cf6aeed7e4a859227ba2a080a16611fae Author: Catherine Lee Date: Thu Oct 20 21:50:20 2022 +0000 set ci in mps (#87325) dunno if installing xml runner like this is a good idea Pull Request resolved: https://github.com/pytorch/pytorch/pull/87325 Approved by: https://github.com/huydhn, https://github.com/malfet commit f7da9db9c174917f8f77b43c92f879cb7c29484d Author: Sherlock Huang Date: Wed Oct 19 20:13:16 2022 +0000 Unify decomp registries into global_decomposition_table (#86857) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86857 Approved by: https://github.com/ezyang commit 7e83f65ad502992a8d75c91eea2cf3de69bb0b7a Author: Svetlana Karslioglu Date: Thu Oct 20 21:02:09 2022 +0000 Add General Project Policies (#87385) Add General Project Policies to the Governance page Pull Request resolved: https://github.com/pytorch/pytorch/pull/87385 Approved by: https://github.com/orionr commit 17202b363780a06ae07e5cecceffaae6418ad6f8 Author: George Qi Date: Thu Oct 20 20:20:12 2022 +0000 [maskedtensor] fix docs formatting (#87387) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87387 Approved by: https://github.com/cpuhrsch commit bc8cf332447793bbcac7d0493e1b98acabfdb748 Author: samdow Date: Thu Oct 20 13:45:20 2022 -0400 add deprecation warning to nn stateless functional_call (#87367) Same as the release version but just for master Pull Request resolved: https://github.com/pytorch/pytorch/pull/87367 Approved by: https://github.com/albanD, https://github.com/atalman commit 9b88dcf248e717ca6c3f8c5e11f600825547a561 Author: Catherine Lee Date: Thu Oct 20 19:40:59 2022 +0000 [ci] handle libomp upgrade on github (#87382) like #86979, idk if this is a good idea but it seems to fix the problem Pull Request resolved: https://github.com/pytorch/pytorch/pull/87382 Approved by: https://github.com/seemethere commit 0826863962ef58c3b26c15c6745ba3049a05df06 Author: Richard Zou Date: Thu Oct 20 11:11:18 2022 -0700 [functorch][docs] Downgrade the warning about forward-mode AD coverage (#87383) Previously we claimed that "forward-mode AD coverage is not that good". We've since improved it so I clarified the statement in our docs and downgraded the warning to a note. Test Plan: - view docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/87383 Approved by: https://github.com/samdow commit 2fd008ed43c53a75d9a8d857546416ba2c45645d Author: Michael Voznesensky Date: Thu Oct 20 18:14:40 2022 +0000 [dynamo] Add support for invoking nn sequential (#87156) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87156 Approved by: https://github.com/jansel commit 68e946b0c37fc97e1de7320af4202464bd1880c9 Author: Horace He Date: Thu Oct 20 00:48:08 2022 +0000 Fixed tune_layout to not do anything for non-2d convolutions (#87328) cc @jansel @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87328 Approved by: https://github.com/ngimel commit b805e1abefd10efabff019e9bb5e3d7d8ba85660 Author: Richard Zou Date: Thu Oct 13 12:44:46 2022 -0700 [functorch] Fix torch.cat batching rule (#86932) The bug was discovered in https://github.com/pytorch/pytorch/pull/86842. torch.cat has an edge case where it ignores all tensors of shape [0]. So if any of the BatchedTensors have logical shape [0] but physical shape [B, 0], then we coerce them to shape [0] by slicing them. Why don't we just ignore those Tensors? 
We need to propagate requires_grad-ness somehow (e.g. if the BatchedTensor wraps a Tensor of shape [B, 0] that requires grad, then the output must require grad). Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/86932 Approved by: https://github.com/Chillee commit c16b7b41f76233ba930ce7dce6d31f1d362f7e86 Author: Taylor Robie Date: Wed Oct 19 20:53:38 2022 -0700 [Profiler][Trivial] Small style and safety fixes (#86752) I noticed a couple abbreviations in the new optimizer capture code that are worth expanding. I also made the RawTensorMetadata a bit safer. Differential Revision: [D40210702](https://our.internmc.facebook.com/intern/diff/D40210702/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86752 Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi commit 1e4a274248959a08bd60529d313355dc837e36fe Author: Zachary DeVito Date: Thu Oct 20 00:03:00 2022 +0000 [dynamo] avoid popen.communicate() (#87335) It seems like when popen.communicate() is used it waits for all the descendants of popen to close the stdin/stderr. However, if we have worker processes running in the child, and the child segfaults, those processes will stay alive until someone waitpid's the child. Since those children have open handles to the stdin/stderr pipe, communicate never returns. This change just writes the output to temp files and directly calls wait() on the child, which returns as soon as it dies. cc @jansel @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87335 Approved by: https://github.com/anijain2305, https://github.com/voznesenskym commit 75a5a46aa005e1c56f5a7935003cb480f33f9257 Author: Zain Rizvi Date: Thu Oct 20 17:16:45 2022 +0000 Retry sccache downloads (#87306) This is meant to mitigate network flakiness like the one seen on [this build](https://github.com/pytorch/pytorch/actions/runs/3283124693/jobs/5407443872), which results in s3 refusing a connection and sccache failing to download. Adding the retry at the workflow level instead of the curl level since, as per the job, it doesn't seem like the curl command was retried at all. It's possible that the specific HTTP code returned during "Connection refused" isn't one of the ones that get retried, or the retries don't show on the console and a longer delay between retries was needed. Using the job level retry with a generous retry delay solves for both possibilities.
Sample error log: ``` Run sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache sudo chmod +x /usr/local/bin/sccache echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" echo "SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${GITHUB_ENV}" shell: /bin/bash -e {0} env: AWS_ACCESS_KEY_ID: *** AWS_SECRET_ACCESS_KEY: *** BUILD_ENVIRONMENT: macos-12-py3-x86-64 DEVELOPER_DIR: /Applications/Xcode_13.3.1.app/Contents/Developer CONDA_ENV: /Users/runner/work/_temp/conda_environment_3283124693 CONDA_RUN: conda run -p /Users/runner/work/_temp/conda_environment_3283124693 --no-capture-output CONDA_BUILD: conda run -p /Users/runner/work/_temp/conda_environment_3283124693 conda-build CONDA_INSTALL: conda install -p /Users/runner/work/_temp/conda_environment_3283124693 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 curl: (7) Failed to connect to s3.amazonaws.com port 443 after 86 ms: Connection refused Error: Process completed with exit code 7. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87306 Approved by: https://github.com/seemethere commit 4b757f4633494d7bbc55973f36f14aeca96387fa Author: Rui Zhu Date: Thu Oct 20 16:01:54 2022 +0000 Assert if padding mask type is unexpected (#86353) (#87106) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86353 Fix the issue described in https://github.com/pytorch/pytorch/issues/86120 Test Plan: buck test mode/opt caffe2/test:test_transformers -- test_train_with_long_type_pad Differential Revision: D40129968 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87106 Approved by: https://github.com/malfet commit 38543d8da0ddce0734ce1ecebb7013382508e142 Author: efiks <5167930+efiks@users.noreply.github.com> Date: Thu Oct 20 15:10:44 2022 +0000 [torch] Add fmsub to vectorization primitives (#86568) Summary: Add fmsub, which is similar to fmadd Test Plan: CI Differential Revision: D40215267 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86568 Approved by: https://github.com/ajtulloch, https://github.com/malfet commit a895af92506f206889610251624590798d0deabd Author: PyTorch MergeBot Date: Thu Oct 20 14:51:08 2022 +0000 Revert "add an API for external backends to register custom device names (#86992)" This reverts commit fb6826bfd82660aa905459f894c81d97d143dd2c.
Reverted https://github.com/pytorch/pytorch/pull/86992 on behalf of https://github.com/jeanschmidt due to breaking internal builds - D40534212 - arstudio-windows-tests-landcastle-0 commit 9199f9188c6150bebd73968b1539fdd1a12d1c98 Author: albanD Date: Wed Oct 19 18:33:17 2022 -0400 Add inplace function testing to test_proxy_tensor (#87324) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87324 Approved by: https://github.com/ezyang commit 254b681dc69c1d6e36864684e40ce850cb364b64 Author: albanD Date: Wed Oct 19 18:33:17 2022 -0400 Convert torch.Size() argument to sym size in test_proxy_tensor (#87304) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87304 Approved by: https://github.com/ezyang commit 9bd6ea5d76dfb20c90eeb6ee9328ba6b66014645 Author: albanD Date: Wed Oct 19 18:01:24 2022 -0400 Add meta inplace testing (#87291) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87291 Approved by: https://github.com/ezyang commit 2e08ac8696fee6e8e8ce876934b95dda1f491357 Author: albanD Date: Wed Oct 19 18:01:24 2022 -0400 Add randint OpInfo (#87231) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87231 Approved by: https://github.com/ezyang commit 8b704eddcd4cd646e7d084869e6bf20d5a7ebf40 Author: Bert Maher Date: Thu Oct 20 14:15:47 2022 +0000 Update the pinned triton hash (#87300) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/87300 Approved by: https://github.com/jansel commit c4cf701889864dceba779569ae642cf95932538a Author: PyTorch MergeBot Date: Thu Oct 20 13:44:14 2022 +0000 Revert "[complex] conv_transpose2d (#81805)" This reverts commit 528dd05108cdac6726748c34e385b5c3136256df. Reverted https://github.com/pytorch/pytorch/pull/81805 on behalf of https://github.com/jeanschmidt due to Breaking internal builds - D40534110 - android-java-tests-0 commit 05ad7bd7433cb65d92802cb5c64fcab2c278f073 Author: PyTorch MergeBot Date: Thu Oct 20 13:17:11 2022 +0000 Revert "Advance nightly docker to 11.6 (#86941)" This reverts commit c5de535bc0b785abbacfebddf660af4cd3b2a6a1. Reverted https://github.com/pytorch/pytorch/pull/86941 on behalf of https://github.com/atalman due to Workflow is passing but installs CUDA 11.3 PyTorch rather then 11.6 commit 1b8af28fe883a58dcb1ae048ab60ad17162dcdb8 Author: Nikita Karetnikov Date: Thu Oct 20 11:02:06 2022 +0200 [primTorch] Add refs for `softmax`, `softmin`, `log_softmax` (#84956) cc @ezyang @mruberry @ngimel @Lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/84956 Approved by: https://github.com/lezcano, https://github.com/mruberry commit 703c19008df4700b6a522b0ae5c4b6d5ffc0906f Author: Animesh Jain Date: Thu Oct 20 05:46:25 2022 +0000 [dynamo] use optimizers correctly in benchmarking (#87311) We were not setting optimizers correctly * This hid the issue that we see here - https://github.com/pytorch/torchdynamo/issues/1687 * This has also revealed that we are activating profilers for every dynamo optimized model call. 
This could affect speedup cc @jansel @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87311 Approved by: https://github.com/mlazos, https://github.com/yanboliang commit 8349bf1cd1d5df7be73b194940bcf96209159f40 Author: Horace He Date: Wed Oct 19 21:55:58 2022 +0000 Added special printing to FloorDiv so it's printed out with // insead of as a name (#87263) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87263 Approved by: https://github.com/ezyang commit b90db4a78f8d760377a81a5a64d03ab4b67599de Author: erjia Date: Thu Oct 20 05:05:53 2022 +0000 [DataPipe] Fix type checking to accept both Iter and Map DataPipe (#87285) Fixes https://github.com/pytorch/data/issues/841 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87285 Approved by: https://github.com/NivekT commit d94e33f041f37fafe333e491d1c07c8c285a2f58 Author: Antoni Viros i Martin Date: Thu Oct 20 03:46:48 2022 +0000 Add support for .to() for NestedTensor backends (#87146) Summary: This commit adds support for moving NestedTensors from CPU to GPU and back. The implementation includes requires implementing empty_like(), which is based on PR#83140. Test Plan: Added a new unit test based on the unit test for the main .to() implementation. All unit tests must pass, as well as every sandcastle job. Differential Revision: D40437585 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87146 Approved by: https://github.com/drisspg commit 472bdb3aa84678b2faa4afe1cb5757f55e14ed9a Author: PyTorch MergeBot Date: Thu Oct 20 03:45:16 2022 +0000 [vision hash update] update the pinned vision hash (#87339) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87339 Approved by: https://github.com/pytorchbot commit c18eead2df44346df989088b18fe4e4a57c2d64e Author: soulitzer Date: Wed Oct 19 18:07:29 2022 -0400 Update saved variable hooks to no longer trigger on wrapped numbers (#87316) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87316 Approved by: https://github.com/ezyang, https://github.com/albanD commit 0cae309069858c1b4e92c1b5c345e28245eef1d3 Author: andrewor14 Date: Wed Oct 19 15:16:13 2022 -0700 [Quant] Add get_symmetric_qnnpack_qconfig_mapping (#87002) Summary: Today, in order to get XNNPACK quantized ops to work, the user must write some code that refers to private data structures (`_FIXED_QPARAMS_OP_TO_OBSERVER`) to create a QConfigMapping that is compatible with the symmetric constraints in the QNNPACK BackendConfig. This is because `get_default_qconfig("qnnpack")` produces a QConfig that does not satisfy these constraints, and the default QConfigMapping for QNNPACK uses this Qconfig. Instead, we simply put this code into a helper function to make it easier for the user to run XNNPACK quantized ops. In the future, once there is feature parity between the set of ops supported by QNNPACK and XNNPACK, we should revisit whether to simply change `get_default_qconfig("qnnpack")` to return an XNNPACK-compatible QConfig. 
Test Plan: python test/test_quantization.py TestQuantizeFx.test_symmetric_qnnpack_qconfig_mapping Reviewers: jerryzh168, vkuzo Subscribers: jerryzh168, vkuzo Pull Request resolved: https://github.com/pytorch/pytorch/pull/87002 Approved by: https://github.com/vkuzo commit e6bc8f415b5bd5b576123ef004021130751b3894 Author: Huy Do Date: Thu Oct 20 02:13:11 2022 +0000 [BE] Move conda cmake installation to Docker (#87309) This is parts of the effort to consolidate pip and conda installation in the CI to improve our CI reliability. This moves conda cmake installation to Docker in those use cases that require it: * Ubuntu bionic and focal On the other hand: * XLA doesn't seem to need conda cmake anymore (Build and test successfully) * Centos is not in used anywhere in the CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/87309 Approved by: https://github.com/ZainRizvi, https://github.com/malfet commit 0d2c2110f178da19aaf89259a2034c9c0653fcee Author: Zachary DeVito Date: Wed Oct 19 14:16:54 2022 -0700 [allocator] Introduce the abstract class CUDACachingAllocator (#87251) This replaces the manual function pointers, making it easier to write new drop-in allocators. Note that most allocation goes through the Allocator interface, which CUDAAllocator inherits from, and this arrangement avoids adding and additional layer of dispatch along this pathway compared to what existed before. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87251 Approved by: https://github.com/wconstab commit 888e15408e861fcaa6b2bfaa2130cb96e90ffa24 Author: Huy Do Date: Thu Oct 20 01:04:42 2022 +0000 Fix wrong lintrunner version (#87295) The syntax is invalid for pip. I missed this a while back: ``` Run pip install -r .github/requirements-gha-cache.txt ERROR: Invalid requirement: 'lintrunner=0.9.2' (from line 11 of .github/requirements-gha-cache.txt) Hint: = is not a valid operator. Did you mean == ? ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87295 Approved by: https://github.com/ZainRizvi commit bd757b364c92b778533dde51a723f5b6278517e0 Author: Horace He Date: Wed Oct 19 03:19:22 2022 +0000 Ensure that symbolic variables incorporate fresh constraints before they're used (#87254) cc @jansel @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87254 Approved by: https://github.com/jansel commit bcde75427e89df07a5744e64ca9271d1c53e8a7e Author: Sahan Paliskara Date: Wed Oct 19 13:36:52 2022 -0700 run torch::deploy test using pip install (#86507) This PR runs the unit tests for [multipy](https://github.com/pytorch/multipy) in pytorch core such that we are able to make sure changes in core do not break multipy as adding `_prims` did. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86507 Approved by: https://github.com/anirbanr-fb-r2p, https://github.com/d4l3k commit 07bd053a7ef92263db8d612f4fc7c28e06ade45c Author: Rohan Varma Date: Tue Oct 18 10:56:04 2022 -0700 [rpc] Wrap exception creation with try/catch (#87224) Sometimes, we cannot recreate the exception with only string (for example if it is a custom exception type). Ideal situation would be to carry over all details on how to recreate the remote end's exception and throw that on client, but for now, we raise a RuntimeError with the original error msg when we cannot reconstruct. 
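The try/catch around exception creation described in #87224 above amounts to the following pattern; this is a rough Python sketch with invented names, not the actual RPC internals.

```python
def rebuild_remote_exception(exc_type, msg):
    # Custom exception types may not be constructible from a single string,
    # so re-creating the remote error can itself throw.
    try:
        return exc_type(msg)
    except Exception:
        # Fall back to a RuntimeError that still carries the original message.
        return RuntimeError(f"{exc_type.__name__}: {msg}")
```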
Created from CodeHub with https://fburl.com/edit-in-codehub Differential Revision: [D40353274](https://our.internmc.facebook.com/intern/diff/D40353274/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87224 Approved by: https://github.com/fduwjj commit c97ffcff464fad4aa12a86b99a75d491071cd575 Author: Edward Z. Yang Date: Wed Oct 19 15:39:12 2022 -0400 [discussion] fix for aot autograd outputs that dont require grad (#86838) Fixes https://github.com/pytorch/functorch/issues/1052 I got here after some discussion with Alban. Today, if you aot_function() trace a program where some of its inputs have `requires_grad=True`, but some outputs are expected to have `requires_grad=False`, we will incorrectly set all outputs to have `requires_grad=True`. A simple solution is to use autograd.function's API for marking outputs as non-differentiable, based on what we witnessed when we traced the forward. This will make the `autograd.Function` that we return **wrong**, if you created it using inputs that required grad, and tried to re-use it with inputs that have different `requires_grad` field. But as long as we're hiding behind dynamo, which should guard on requires_grad, then we'll re-run `aot_function()` and get out a new compiled function that does the right thing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86838 Approved by: https://github.com/ezyang commit c9b618447d7c948003f26c3b49c28cdc193bd3f0 Author: Michael Lazos Date: Wed Oct 19 22:44:01 2022 +0000 Fix line numbers bug (#87247) Fixes https://github.com/pytorch/torchdynamo/issues/1462 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87247 Approved by: https://github.com/anijain2305, https://github.com/jansel commit c8889f4e109866610bd1981f03deee8f102b5ce6 Author: Nikita Shulga Date: Wed Oct 19 22:15:28 2022 +0000 `cuda._is_in_bad_fork`->`_C._cuda_isInBadFork` (#87317) Former is always available, while later is only available if PyTorch compiled with CUDA And if it does, then ``` $ python -c "import torch;print(torch._C._cuda_isInBadFork == torch.cuda._is_in_bad_fork)" True ``` Fixes https://github.com/pytorch/torchdynamo/issues/1709 ( at least the symptom) cc @jansel @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87317 Approved by: https://github.com/voznesenskym, https://github.com/albanD, https://github.com/soumith, https://github.com/jansel commit 56b150ac63653f982c2b4aaa61336e5f6ecd1e4c Author: Yanbo Liang Date: Wed Oct 19 22:13:07 2022 +0000 [Dynamo] Support optimizing over any Tensor with requires_grad = True (#87141) Fixes https://github.com/pytorch/torchdynamo/issues/1604 Re-submit for https://github.com/pytorch/torchdynamo/pull/1646 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87141 Approved by: https://github.com/jansel commit 12b2f70a89494b2ad374aaa16b8fdbf16da66e57 Author: albanD Date: Wed Oct 19 11:27:42 2022 -0400 Symintify pad ops (#87046) Following comments below, we need to add support for `std::negate`/`std::min`/`std::max`/`operator-` for SymInt Pull Request resolved: https://github.com/pytorch/pytorch/pull/87046 Approved by: https://github.com/ezyang commit c5de535bc0b785abbacfebddf660af4cd3b2a6a1 Author: atalman Date: Wed Oct 19 21:26:53 2022 +0000 Advance nightly docker to 11.6 (#86941) Fixes following: https://github.com/pytorch/pytorch/actions/runs/3242695506/jobs/5316334351 crash in Docker builds introduced by: #82682 The PR seems to introduce some changes not compatible with cuda 11.3 which is used by our Docker builds 
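The requires_grad fix in #86838 above leans on `autograd.Function`'s existing ability to mark individual outputs as non-differentiable. A self-contained sketch of that mechanism (not the aot_autograd code itself) looks like this:

```python
import torch

class Double(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = x * 2
        flag = torch.zeros(1)               # an output that should never require grad
        ctx.mark_non_differentiable(flag)   # keeps flag.requires_grad == False
        return y, flag

    @staticmethod
    def backward(ctx, grad_y, grad_flag):
        return grad_y * 2

y, flag = Double.apply(torch.randn(3, requires_grad=True))
assert y.requires_grad and not flag.requires_grad
```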
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86941 Approved by: https://github.com/malfet commit 6eeeb8817229e7df054db38337cd944b6e2daaad Author: Peter Bell Date: Wed Oct 19 17:00:52 2022 +0100 OpInfo: Sample input cleanup (4/n) (#86324) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86324 Approved by: https://github.com/mruberry commit c141f28b648ee3c6cb0a7286f0aa100297417e74 Author: albanD Date: Wed Oct 19 20:56:37 2022 +0000 Fix compilation warning and spurious print (#87297) Fixes compilation warning, make this warning an error and remove a random print. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87297 Approved by: https://github.com/malfet commit 4a533f12157ffb5c05c142490e4ceaa311981b38 Author: Nikita Shulga Date: Wed Oct 19 20:51:32 2022 +0000 Tweak several test serialization to store models state_dict (#87143) Namely, change: - `test_meta_serialization` - `test_serialization_2gb_file` - `test_pathlike_serialization` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87143 Approved by: https://github.com/ezyang commit cf2be34ff5d854a5afcdc4e88aa468aaeb5d47db Author: George Qi Date: Wed Oct 19 18:27:21 2022 +0000 [maskedtensor] add docs (#84887) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84887 Approved by: https://github.com/cpuhrsch commit cd2161352675c2a02d7c651374511fcef2ef83e7 Author: PyTorch MergeBot Date: Wed Oct 19 20:36:55 2022 +0000 Revert "[primTorch] Add refs for `softmax`, `softmin`, `log_softmax` (#84956)" This reverts commit c09ca93e4733fdf0183433114dda2fc30a846700. Reverted https://github.com/pytorch/pytorch/pull/84956 on behalf of https://github.com/ZainRizvi due to This is causing the MPS test test_output_match_log_softmax_with_dtype_cpu_float32 (__main__.TestConsistencyCPU) to fail commit c08c7997503fbe8472a957f712322d5fb5fa11bf Author: Chien-Chin Huang Date: Wed Oct 19 09:05:48 2022 -0700 [FSDP] Add set_state_dict_type API to setup state_dict_type without using context manager (#86243) FSDP.state_dict_type is a context manager. However, users may want to decide what state_dict is going to used during initialization. `set_state_dict_type` allows users to do so. Differential Revision: [D40083670](https://our.internmc.facebook.com/intern/diff/D40083670/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86243 Approved by: https://github.com/rohan-varma commit f3cc588d09f62471a46d555368f1627932e1812f Author: PyTorch MergeBot Date: Wed Oct 19 18:57:24 2022 +0000 Revert "Dynamo FX graph stack traceback fix (#87136)" This reverts commit 89e6078bc3d83b61e03511304ec42743b84df42e. Reverted https://github.com/pytorch/pytorch/pull/87136 on behalf of https://github.com/clee2000 due to causing a lot of tests to fail on master even though pr is green commit c09ca93e4733fdf0183433114dda2fc30a846700 Author: Nikita Karetnikov Date: Wed Oct 19 05:08:27 2022 +0200 [primTorch] Add refs for `softmax`, `softmin`, `log_softmax` (#84956) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84956 Approved by: https://github.com/lezcano, https://github.com/mruberry commit 00c91f4446d91b04f2313632a4d45addcb9e6950 Author: Zachary DeVito Date: Tue Oct 18 17:27:21 2022 -0700 [allocator] disable tests that don't work for cudaMallocAsyncAllocator (#87250) Two tests were failing locally for me and don't appear to be run in our CI. Disabling them so we can otherwise refactor the allocators. 
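To make the `set_state_dict_type` addition in #86243 above concrete, here is a minimal sketch; it assumes an initialized process group, uses a toy module, and omits the optional config arguments the API also accepts.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

model = FSDP(torch.nn.Linear(8, 8).cuda())   # assumes dist.init_process_group() already ran

# Before: the choice had to be wrapped around every state_dict() call:
#   with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT):
#       sd = model.state_dict()
# Now it can be configured once, e.g. at initialization time:
FSDP.set_state_dict_type(model, StateDictType.FULL_STATE_DICT)
sd = model.state_dict()
```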
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87250 Approved by: https://github.com/wconstab commit 15ca68526cf012dc10a1c450b30dba23643588d3 Author: Richard Zou Date: Tue Oct 18 12:53:23 2022 -0700 [functorch] Get rid of defunct functorch/setup.py (#87235) We initially left it there for BC concerns. - It has been more than a month since then, - I have migrated folks who used the previous install command (pip install ...pytorch.git@subdir=functorch) off of it so it's time to get rid of it Test Plan: - code reading Pull Request resolved: https://github.com/pytorch/pytorch/pull/87235 Approved by: https://github.com/Chillee commit ac80da2293179ac69dc346b6d15d9f7f7ba154f7 Author: Richard Zou Date: Tue Oct 18 12:51:18 2022 -0700 [functorch] add test for torch.manual_seed inside grad transform (#87233) I can see this behavior regressing really easily, so adding a test for it. Test Plan: - run test Pull Request resolved: https://github.com/pytorch/pytorch/pull/87233 Approved by: https://github.com/Chillee commit f56ce8dbad728ca59a29b7dd089f5a705a40f70d Author: Zachary DeVito Date: Tue Oct 18 13:24:52 2022 -0700 [allocator] Move getFreeMutex (#87237) It isn't used at all the allocators and this change makes that more clear. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87237 Approved by: https://github.com/wconstab commit 89e6078bc3d83b61e03511304ec42743b84df42e Author: William Wen Date: Wed Oct 19 17:15:43 2022 +0000 Dynamo FX graph stack traceback fix (#87136) Migration from https://github.com/pytorch/torchdynamo/pull/1655. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87136 Approved by: https://github.com/voznesenskym commit 40d0fa53149c6d50a199585f7498db9ec93f98ef Author: atalman Date: Wed Oct 19 17:09:37 2022 +0000 Reenable aot tests on windows for cuda 11.7 and up (#87193) Reenable aot tests on windows for cuda 11.7 and up Issue: https://github.com/pytorch/pytorch/issues/69460 seems to be mitigated in CUDA 11.7 hence re-enable this test cc @peterjc123 @mszhanyi @skyline75489 @nbcsm Pull Request resolved: https://github.com/pytorch/pytorch/pull/87193 Approved by: https://github.com/malfet commit 86a581928a4f5065a79771a7a2d87c6999c452e9 Author: Huy Do Date: Wed Oct 19 17:01:09 2022 +0000 Pin ios conda dependencies (#87229) I also pin blas to 1.0 instead of the newer 2.116 available elsewhere (https://anaconda.org/conda-forge/blas) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87229 Approved by: https://github.com/izaitsevfb, https://github.com/ZainRizvi, https://github.com/malfet commit a79e034d89d3d112fcb8d16f7a6862934a44955d Author: Nikita Shulga Date: Wed Oct 19 17:00:10 2022 +0000 [MPS] Do not dispatch empty job in `bitwise_not` (#87286) Follows the pattern from https://github.com/pytorch/pytorch/pull/85285 and returns before computing dispatching an empty metal kernel for bitwise not operation. Fixes crash when invoked with empty MPS tensor on AMD GPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/87286 Approved by: https://github.com/kulinseth commit 6775c3e19d74f01841584dbb1d71fc84fe991455 Author: Natalia Gimelshein Date: Wed Oct 19 16:55:27 2022 +0000 fix 0d cpu tensor handling when it's the first arg (#87273) Fixes https://github.com/pytorch/torchdynamo/issues/1681 When at least one of the pw args is on cuda, set device to cuda. We assume that cases of true device mismatch have been already weeded out during tracing, and what we have is 0d cpu tensor + cuda tensor interop. 
Also fix 0d tensor test that previously wasn't compiling with dynamo. cc @jansel @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87273 Approved by: https://github.com/soumith, https://github.com/voznesenskym commit fb6826bfd82660aa905459f894c81d97d143dd2c Author: Brian Hirsh Date: Tue Oct 18 16:13:27 2022 -0700 add an API for external backends to register custom device names (#86992) This API adds some improvements to external backends who are building C++ backends out of tree using the `PrivateUse1` dispatch key. The docs and linked examples go over the API in more detail, but you should be able to use it like: ``` > torch.register_privateuse1_backend("foo")` > a = torch.ones(2, device="foo") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/86992 Approved by: https://github.com/albanD commit cc64863d71dd31e74c42b190a6a2dfd5de0305e6 Author: William Wen Date: Wed Oct 19 16:39:12 2022 +0000 Clean Inductor complication cache during dynamo dashboard run (#87246) Implement improvement from https://github.com/pytorch/torchdynamo/issues/1644. Tested by running `python benchmarks/dynamo/runner.py --print_run_commands --training` and inspecting the generated `run.sh` file for the `--cold_start_latency` flag, e.g. ``` python benchmarks/dynamo/torchbench.py --performance --float32 -dcuda --output=benchmark_logs/inductor_torchbench_float32_training_cuda_performance.csv --training --inductor --no-skip --dashboard -x fambench_xlmr -x detectron2_fasterrcnn_r_50_c4 -x detectron2_fasterrcnn_r_50_dc5 -x detectron2_maskrcnn_r_101_fpn -x detectron2_maskrcnn_r_50_fpn -x detectron2_fasterrcnn_r_50_fpn -x detectron2_maskrcnn -x detectron2_fasterrcnn_r_101_dc5 -x opacus_cifar10 -x detectron2_maskrcnn_r_101_c4 -x pyhpc_turbulent_kinetic_energy -x maml -x detectron2_fasterrcnn_r_101_fpn -x pyhpc_equation_of_state -x detectron2_fasterrcnn_r_101_c4 -x pyhpc_isoneutral_mixing --cold_start_latency ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87246 Approved by: https://github.com/anijain2305, https://github.com/jansel commit b3071e2eb61fbbad9b36d7022855111efc6c37f4 Author: Brian Hirsh Date: Tue Oct 18 18:29:15 2022 -0700 functionalization: skip meta reference compute for aot autograd (#87108) The context is that historically, XLA/LTC tensors haven't had accurate stride information, and functionalization would run "reference" meta kernels for view ops on the side to properly compute strides. This is more complicated in symint tracing world - we have a `FunctionalTensorWrapper()` that wraps the underlying tensor and has its own set of sizes/strides metadata, but we never create proxy objects for the sizes/strides of the wrapper. In symint tracing world with aot autograd, we're guaranteed that our underlying strides are accurate anyway, since aot autograd uses fake tensors to perform tracing. We encountered a few bugs with symint's from the `FunctionalTensorWrapper` making their way into `__torch_dispatch__`. To side-step that area of bugs completely (and marginally improve perf), this PR disables the meta tensor tracing for non XLA/LTC use cases. 
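The "0d cpu tensor handling" fix (#87273) a few commits above targets the interop pattern sketched here; eager mode already accepts it, and the point of the fix is that the inductor lowering picks the CUDA device instead of tripping over the CPU scalar when it is the first argument.

```python
import torch

alpha = torch.tensor(0.5)                  # 0-d tensor that lives on CPU
x = torch.rand(1024, device="cuda")

# Mixing a 0-d CPU tensor with a CUDA tensor in a pointwise op is legal;
# the compiled kernel must run on CUDA even though the first arg is on CPU.
y = alpha * x
```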
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87108 Approved by: https://github.com/ezyang, https://github.com/wconstab commit 4801397b6ee2a82098b059b40294039d9d350eaa Author: Brian Hirsh Date: Tue Oct 18 18:29:15 2022 -0700 ban .sizes() and .strides() calls in derivatives.yaml (#86611) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86611 Approved by: https://github.com/wconstab, https://github.com/albanD commit 182ee8799675dee562e8589f4420184c267a8db0 Author: anjali411 Date: Wed Oct 19 12:28:02 2022 +0000 symintify nll loss fns (#86915) (#87095) This reverts commit bbd7b38d5580c44ffb4404d431e07bc2316e59d5. Reland https://github.com/pytorch/pytorch/pull/86915 with a fix for python arg parser handing for SymInt and SymIntList. This was uncovered because we are calling directly into python bindings code through test_autocast.py (`torch._C._nn.nll_loss`) without providing a value for the optional symint arg (`ignore_index`). The arg parser constructs the SymInt and SymIntList using the recorded "default_int" or "default_int_list" (schema string parsing) in case a value is not received for an optional argument. Since we weren't handling the symint case properly, the default_int just had a garbage value which was later being used to construct SymInt. Follow up issue for other unhandled parameter types: https://github.com/pytorch/pytorch/issues/87283 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87095 Approved by: https://github.com/ezyang, https://github.com/albanD commit c6187ea326e6bfb2054e271c8fed23f14ab53615 Author: leizhenyuan Date: Wed Oct 19 13:24:48 2022 +0000 add support for pin memory on xpu device (#86545) add support for pin memory on xpu device Pull Request resolved: https://github.com/pytorch/pytorch/pull/86545 Approved by: https://github.com/ezyang commit 528dd05108cdac6726748c34e385b5c3136256df Author: kshitij12345 Date: Wed Oct 19 09:12:27 2022 +0000 [complex] conv_transpose2d (#81805) Reference: https://github.com/pytorch/pytorch/issues/71108 Fixes : #86414 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81805 Approved by: https://github.com/anjali411 commit 232fbd90ff6d93362120d955befeeb297179ddad Author: XiaobingSuper Date: Sun Oct 16 22:54:57 2022 -0400 [TorchDynamo]: fused bias for cpu convolution path (#87050) For aten.convolution CPU path, the bias always can be fused, so this PR adds a device check: if inputs' device is CPU, we will fuse it for a good performance. 
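The CPU convolution bias fusion in #87050 above applies to the ordinary case sketched below; whether the bias is actually folded depends on the inductor lowering, this only shows the pattern the device check targets.

```python
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3, bias=True)
x = torch.rand(1, 3, 32, 32)   # CPU input: the bias add can always be fused into the conv
y = conv(x)
```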
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87050 Approved by: https://github.com/jgong5, https://github.com/jansel commit 5e23074f0d8538ba00645f08a48cc12bf5ae3a8e Author: Horace He Date: Wed Oct 19 02:07:13 2022 +0000 Fixed FakeTensor not calling CompositeImplicitAutograd decomps sometimes (#87252) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87252 Approved by: https://github.com/ezyang, https://github.com/bdhirsh commit b5bdc34541a407390b7f9bd3dcc97b1d7b982c7f Author: Jason Ansel Date: Wed Oct 19 06:32:42 2022 +0000 [inductor] Sympy compability fix (#87249) Test Plan: github tests Reviewed By: yf225, voznesenskym Differential Revision: D40495411 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87249 Approved by: https://github.com/ngimel, https://github.com/voznesenskym commit 6faa6c68e8b76fb68f3a2b2783685102d0e87c00 Author: Chiao Date: Wed Oct 19 05:11:29 2022 +0000 fsdp lazy_init typo (#87184) Minor typo, changed with -> without Pull Request resolved: https://github.com/pytorch/pytorch/pull/87184 Approved by: https://github.com/awgu commit 2418ddb1ecf609b6e302257bfc10c62db1dc147e Author: Horace He Date: Wed Oct 19 01:24:38 2022 +0000 Unified symbolic shape variables between Inductor and AOTDispatcher (#87161) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87161 Approved by: https://github.com/jansel commit 48df4b7a1ddcb0a60d97e24a22cf3b3e6ad9d378 Author: PyTorch MergeBot Date: Wed Oct 19 04:12:52 2022 +0000 [vision hash update] update the pinned vision hash (#87100) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87100 Approved by: https://github.com/pytorchbot commit dfe3fc028c7c0e9a40701dcd5d6c72c20e35b690 Author: Nikita Shulga Date: Wed Oct 19 03:35:16 2022 +0000 [CI] Add triton wheels build workflow (#87234) Also, add `torchtriton` and `jinja2` as extra `dynamo` dependency to PyTorch wheels, Version packages as first 10 characters of pinned repo hash and make `torch[dynamo]` wheel depend on the exact version it was build against. TODO: Automate uploading to nightly wheels storage Pull Request resolved: https://github.com/pytorch/pytorch/pull/87234 Approved by: https://github.com/msaroufim commit c413a32135b745d29e555069d7cd8f6e6527b59f Author: David Berard Date: Mon Oct 17 08:40:21 2022 -0700 Release note script: match topics with spaces or underscores (#87011) e.g. 
match "new features" in the category as "new_features" Pull Request resolved: https://github.com/pytorch/pytorch/pull/87011 Approved by: https://github.com/albanD, https://github.com/soulitzer commit c471c29fdccc3fe48a78083c638a4a88559488b4 Author: Driss Guessous Date: Wed Oct 19 02:16:29 2022 +0000 Update sdp guards for performance (#87241) Makes the contiguous check for the nt input more strict/correct as well as makes some performance improvements to the checks Pull Request resolved: https://github.com/pytorch/pytorch/pull/87241 Approved by: https://github.com/cpuhrsch commit 6d0d7afe8d5ed7a701d634729dc7be9d0ef4a4b2 Author: Nikita Shulga Date: Wed Oct 19 02:11:54 2022 +0000 [GHA][BE] Delete unused macros from `common.yml.j2` (#87253) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87253 Approved by: https://github.com/huydhn commit 31e731e5aeffcdf22b4a20f7b9f716694151fe0a Author: Michael Suo Date: Tue Oct 18 14:58:23 2022 -0700 [dynamo] fix logging (#87239) Currently, setting `torch._dynamo.config.log_level` doesn't do anything, as the module name has changed during the move. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87239 Approved by: https://github.com/jansel, https://github.com/soumith, https://github.com/mlazos commit 7ff1ca4e33df951653c116621bbade88941cb2bd Author: Tongzhou Wang Date: Wed Oct 19 00:25:02 2022 +0000 Add type annotation to get_worker_info (#87017) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87017 Approved by: https://github.com/ejguan, https://github.com/NivekT commit 4dc579838be94c343cc8542c7a80b9a9a8c15b51 Author: Yidi Wu Date: Wed Oct 19 00:12:59 2022 +0000 Allow fx.Graph.owning_module to be used as attribute. (#86822) Summary: The current behavior of owning_module setter is difficult to understand: it changes the owning_module to None if owners is not 0 but increments the owners count. If the owning_module is None, the owners count should be 0 as none of them is accessible. On the other hand, if the owners count increases, the owning_module should be a collection (e.g. a list). This diff changes owning_module to be a normal attribute. The semantic is that graph can have **at most one** owning module and can be assigned to new module. The alternative is to use a list to represent the owning_modules of a graph but it breaks backward compatibility and the exact use cases of having multiple owning_modules are not clear. Test Plan: Test with CI. 
Differential Revision: D40200624 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86822 Approved by: https://github.com/tugsbayasgalan commit 3eb742938578b752ca03d0f9962158dcb0edd343 Author: Seonglyong Gong Date: Wed Oct 19 00:00:10 2022 +0000 [Profiler][trivial] Add profiler options to trace metadata (#87102) Summary: Add profiler options (`profile_memory`, `record_shapes`, `with_stack`, `with_modules`, and `with_flops`) to trace metadata Test Plan: CI tests Differential Revision: D40373514 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87102 Approved by: https://github.com/aaronenyeshi commit f6c6048b1086f291ac9934ee1927270eba5a6519 Author: Christian Puhrsch Date: Tue Oct 18 23:11:47 2022 +0000 Use CUTLASS GEMM for NT bmm (#85894) Copy of https://github.com/pytorch/pytorch/pull/85710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85894 Approved by: https://github.com/drisspg commit 80790ecee4f04de4bf1675fec8a2593d7a2b32c0 Author: Jane Xu Date: Tue Oct 18 23:01:28 2022 +0000 [einsum] Call view instead of sum to remediate MPS regression (#87135) Fixes #87010. It turns out that squeeze is much faster than sum, and view is faster than squeeze, so we should default to that whenever possible. Benchmarking results show that, on MPS, we would be going from the following code taking **29.89ms instead of the current 1466ms, almost a 50x speedup**. ``` q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float) k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float) torch.einsum('b i d, b j d -> b i j', q, k).max().item() ``` And a regular einsum will now take **.506ms instead of 2.76ms.** ``` q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float) k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float) torch.einsum('b i d, b j d -> b i j', q, k) ``` Special thanks to @soulitzer for helping me experiment + figure out how to squash the remaining 5x regression due to squeeze being slower than view!! Pull Request resolved: https://github.com/pytorch/pytorch/pull/87135 Approved by: https://github.com/soulitzer, https://github.com/malfet, https://github.com/albanD commit c4a03e4da19c643b8321a4a0ba0863259498ca7b Author: Jane Xu Date: Tue Oct 18 22:58:44 2022 +0000 [einsum] keep the promise that we contract left to right (#87199) We promise that if path is not defined, we would go left to right. The previous code did not keep that promise as we push'd combined ops to the back of the list. For most use cases this is fine (einsum with 3 or fewer inputs), but we should do what we say. Test plan: Added a print statement to print the sizes of ops we're contracting to see if the order is fixed. Code run: ``` import torch a = torch.rand(1) b = torch.rand(2) c = torch.rand(3) d = torch.rand(4) torch.einsum('a,b,c,d->abcd', a,b,c,d) ``` BEFORE--it does a+b, then c+d, then a+b+c+d, which...is right, but it's not the order specified by the user. ``` /Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 1, 1]and b: [1, 2, 1, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.) return _VF.einsum(equation, operands) # type: ignore[attr-defined] /Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 3, 1]and b: [1, 1, 1, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.) 
return _VF.einsum(equation, operands) # type: ignore[attr-defined] /Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 1, 1]and b: [1, 1, 3, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.) return _VF.einsum(equation, operands) # type: ignore[attr-defined] ``` WITH THIS CHANGE--it actually goes left to right: a+b, a+b+c, a+b+c+d ``` /Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 1, 1]and b: [1, 2, 1, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.) return _VF.einsum(equation, operands) # type: ignore[attr-defined] /Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 1, 1]and b: [1, 1, 3, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.) return _VF.einsum(equation, operands) # type: ignore[attr-defined] /Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 3, 1]and b: [1, 1, 1, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.) return _VF.einsum(equation, operands) # type: ignore[attr-defined] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87199 Approved by: https://github.com/soulitzer commit d06d569e90f3ca3e721b679be285385e5bd3eea9 Author: Driss Guessous Date: Tue Oct 18 21:38:43 2022 +0000 Update the sdp benchmark to work with nested tensors (#87215) Update the sdp benchmark to work with nested tensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/87215 Approved by: https://github.com/cpuhrsch commit e8c4adf3c3b8e479d240c3160d85fde68808e92c Author: Christian Puhrsch Date: Tue Oct 18 21:07:57 2022 +0000 Add torch.sparse overview section (#85265) The goal of this section is to provide a general overview of how PyTorch handles sparsity for readers who are already familiar with sparse matrices and their operators. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85265 Approved by: https://github.com/jisaacso commit 31edccf6c7080ccfbdce93613ba2deadaaf3b0b0 Author: PyTorch MergeBot Date: Tue Oct 18 21:03:23 2022 +0000 Revert "Temporarily disable ios jobs (#87186)" This reverts commit d29dc2b72a6cb5fb24ff3eacd816e08bd16298dc. Reverted https://github.com/pytorch/pytorch/pull/87186 on behalf of https://github.com/huydhn due to Official conda channel is back and conda-forge has been reverted commit 223ad9bc9e7a0af5bf37587933f81da43cf84868 Author: Catherine Lee Date: Tue Oct 18 20:57:55 2022 +0000 [ci] remove circleci mac jobs (#87225) mac jobs are run on every pr after approval, so these are redundant ios jobs can stay until the end of the year because they are on periodic and not run on every pr Pull Request resolved: https://github.com/pytorch/pytorch/pull/87225 Approved by: https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/janeyx99 commit 9a786202b704b9488fdc9e5163ff6af88510d56f Author: Catherine Lee Date: Tue Oct 18 20:57:27 2022 +0000 [ci] fix log printing (#87223) idk how i missed this example https://github.com/pytorch/pytorch/actions/runs/3275717751/jobs/5391093040 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87223 Approved by: https://github.com/malfet, https://github.com/kit1980, https://github.com/janeyx99 commit afa508607827aa3397165f4d6cde0180369cc3ba Author: PyTorch MergeBot Date: Tue Oct 18 20:54:06 2022 +0000 Revert "Install blas from conda-forge (#87150)" This reverts commit f02f0e3ad1565e3da1e78efaa994e80c7577fd0c. 
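Since the torch.sparse overview added in #85265 above is aimed at readers already comfortable with sparse matrices, here is a quick COO refresher using only the public API (a generic example, not text from the new docs section):

```python
import torch

indices = torch.tensor([[0, 1, 1],
                        [2, 0, 2]])        # (row, col) coordinates of the nonzeros
values = torch.tensor([3.0, 4.0, 5.0])
s = torch.sparse_coo_tensor(indices, values, (2, 3))

print(s.to_dense())
# tensor([[0., 0., 3.],
#         [4., 0., 5.]])
```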
Reverted https://github.com/pytorch/pytorch/pull/87150 on behalf of https://github.com/huydhn due to Conda issue has been resolved upstream https://github.com/pytorch/pytorch/issues/87148 commit e7cefff05830fa1209daec4bc004e0ba1c1277b2 Author: Aaron Enye Shi Date: Tue Oct 18 20:47:09 2022 +0000 [Kineto][Profiler] Guard event metadata python thread via verbose flag (#87096) Summary: For Python Tracing enabled trace files, this field "python thread": 0 is repeated for every python_function event. This bloats the trace json size for large number of events or deep call stacks. Instead make this metadata guarded by the verbose flag. Test Plan: CI Reviewed By: robieta, slgong-fb Differential Revision: D40325815 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/87096 Approved by: https://github.com/slgong-fb, https://github.com/robieta commit c54bcea7934f59896ee8973ca814b8ea8597989e Author: Will Feng (DPER) Date: Tue Oct 18 20:26:30 2022 +0000 Improve complex_memory_overlap check for Inductor CUDA graph (#87177) Point fix for https://github.com/pytorch/torchdynamo/issues/1620 to unblock internal models. Supersedes https://github.com/pytorch/pytorch/pull/87058. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87177 Approved by: https://github.com/ezyang commit ef1844a151218046a7f7266e0015264f2b0bc7b4 Author: Nikita Shulga Date: Tue Oct 18 20:05:45 2022 +0000 [CI] Move sm86 tests from periodic to trunk (#87228) This adds Ampere GPU testing to trunk CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/87228 Approved by: https://github.com/jansel, https://github.com/huydhn commit 1dbc8ad3b74f774d8571eed95559714260f0b6de Author: Kurt Mohler Date: Tue Oct 18 20:02:42 2022 +0000 Add `Warning` class and refactor C++ warnings to use it (#84101) Also adds `TORCH_WARN_WITH` and `TORCH_WARN_DEPRECATION` macros Part of #72948 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84101 Approved by: https://github.com/albanD commit db6590925593e7af9b373680d6e6e76d1b7a359c Author: Andrew M. James Date: Tue Oct 18 19:55:18 2022 +0000 [Docs] Update mm family ops and F.linear to note limited sparse support. (#86220) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86220 Approved by: https://github.com/cpuhrsch commit a73ca6f58c1487abab013805922c47437e50eecf Author: PyTorch MergeBot Date: Tue Oct 18 19:34:02 2022 +0000 Revert "Improve readability of the extra message errors in assertEqual (#87202)" This reverts commit 56c28ee32a78eb6f32a533d8fd64278cb9063016. Reverted https://github.com/pytorch/pytorch/pull/87202 on behalf of https://github.com/malfet due to broke test_testing, see https://hud.pytorch.org/pytorch/pytorch/commit/56c28ee32a78eb6f32a533d8fd64278cb9063016 commit e4285f09b9993d4a17b755c74b68bed69f7473d0 Author: Fabio Rocha Date: Tue Oct 18 09:43:59 2022 +0000 [inductor] new way to compile f64 libdevice calls (#87189) Porting over [torchdynamo/#1633](https://github.com/pytorch/torchdynamo/pull/1633) `torch/_inductor/codegen/triton.py` now defines `libdevice_` variants of some functions. You can request dispatch to those for float64 dtypes when using `register_pointwise` by setting `use_libdevice_for_f64=True`. Other minor changes: - In triton, sigmoid now codegens tl.sigmoid - silu now comes from decomp, not lowering - Some test skips no longer necessary, removed or made xfails Switching to `tl.sigmoid` has exactly same performance. Moving `silu` to decomp does not change anything, same triton code is generated. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87189 Approved by: https://github.com/ngimel commit c56be31d2ec838f29c46d8b585b31b5e47f478e8 Author: Jiang, Yanbing Date: Tue Oct 18 19:07:58 2022 +0000 Upgrade oneDNN to v2.7 (#87061) This PR is to upgrade oneDNN to v2.7. **Performance Optimizations** - Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids). - Introduced performance optimizations for [bf16 floating point math mode](http://oneapi-src.github.io/oneDNN/group_dnnl_api_mathmode.html) on Intel Xeon Scalable processors (code name Sapphire Rapids). The bf16 math mode allows oneDNN to use bf16 arithmetic and Intel AMX instructions in computations on fp32 data. Please go to https://github.com/oneapi-src/oneDNN/releases/tag/v2.7 for more detailed changes. **Functionality** - Updated ITT API to 3.22.5 - Fixed correctness issue in fp32 convolution implementation for cases with large spatial size (https://github.com/pytorch/pytorch/issues/84488) Use TorchBench test in ICX with 40 cores Intel OpenMP & tcmalloc were preloaded ![image](https://user-images.githubusercontent.com/61222868/196121957-656faebc-9f4a-49f0-9ef0-0784416c3a47.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87061 Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper, https://github.com/weiwangmeta commit 2485498294c213daa6092cf384a85ac0890d7fa7 Author: Andrew Gu Date: Tue Oct 18 15:37:01 2022 +0000 [FSDP] Use `all_gather_into_tensor()` (#87077) Let us silence some warnings 👍🏼 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87077 Approved by: https://github.com/rohan-varma commit 56c28ee32a78eb6f32a533d8fd64278cb9063016 Author: lezcano Date: Tue Oct 18 15:05:33 2022 +0000 Improve readability of the extra message errors in assertEqual (#87202) Goes from (note the `linspace.default` is very difficult to find) ``` Mismatched elements: 15 / 50 (30.0%) Greatest absolute difference: 1 at index (17,) Greatest relative difference: 1.0 at index (17,) : linspace.default args = (0, -3, 50) kwargs = {'dtype': torch.int16, 'device': device(type='cpu'), 'pin_memory': False} ``` to ``` Mismatched elements: 15 / 50 (30.0%) Greatest absolute difference: 1 at index (17,) Greatest relative difference: 1.0 at index (17,) linspace.default args = (0, -3, 50) kwargs = {'dtype': torch.int16, 'device': device(type='cpu'), 'pin_memory': False} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87202 Approved by: https://github.com/ezyang commit 48f02312232d71f8e5cabfcc85b70f8330953057 Author: lezcano Date: Tue Oct 18 09:36:29 2022 +0000 Fix Scalar(bool) handling in toIValue (#87179) At the moment, they were casted to `int64`, which breaks quite a few casting rules for example in `ops.aten`. Quite a vintage bug, circa 2020. With this fix, the following code prints `torch.bool`, rather than `torch.int64`. ```python import torch msk = torch.tensor([False]) b = torch.tensor([False]) print(torch.ops.aten.where.ScalarSelf(msk, True, b).dtype) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87179 Approved by: https://github.com/albanD commit 4540330f97313096793f0bd7115ac84adb616a4c Author: PyTorch MergeBot Date: Tue Oct 18 18:29:15 2022 +0000 Revert "Use conda-forge in mac mps test (#87155)" This reverts commit 74138a8daa93ec4cb08e4dd31c2773ec0c751d94. 
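The FSDP change in #87077 above moves to the non-deprecated `all_gather_into_tensor` collective; in user code the call looks roughly like this (sketch only, assumes an initialized NCCL process group and one GPU per rank):

```python
import torch
import torch.distributed as dist

# assumes dist.init_process_group("nccl") has already run on every rank
world_size = dist.get_world_size()
local = torch.ones(4, device="cuda") * dist.get_rank()
gathered = torch.empty(world_size * 4, device="cuda")

dist.all_gather_into_tensor(gathered, local)   # replaces the deprecated _all_gather_base
```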
Reverted https://github.com/pytorch/pytorch/pull/87155 on behalf of https://github.com/huydhn due to Conda issue has been resolved upstream https://github.com/pytorch/pytorch/issues/87148 commit adc7ee09dce01e3e49985e76f07055af98262d03 Author: Horace He Date: Tue Oct 18 02:43:48 2022 +0000 Added upsample_nearest3d/1d lowering to inductor (#87158) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87158 Approved by: https://github.com/ngimel commit d7801a60424d1fa2823af1b19a21ad61070f5ff0 Author: Michael Voznesensky Date: Tue Oct 18 18:24:13 2022 +0000 Add voznesenskym to CODEOWNERS (#87227) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/87227 Approved by: https://github.com/jansel commit 88b76ae9ea89dda5847133f7414073f44bba4535 Author: Sherlock Huang Date: Mon Oct 17 22:53:50 2022 +0000 Store type(module) in the module stack (#87149) - As requested by quantization team, it prefer storing type(module) in the module stack. - Consequently, as module stack gets verbose, we skip printing module stack in the gm.print_readable() Pull Request resolved: https://github.com/pytorch/pytorch/pull/87149 Approved by: https://github.com/jerryzh168, https://github.com/jansel commit d01eea6027c26bf100fc99a705669f60648964ae Author: Nikita Shulga Date: Tue Oct 18 17:19:52 2022 +0000 Do not run triton tests on sm86 (#87198) As its broken right now and nobody care to fix it, see this test run for example: https://hud.pytorch.org/pytorch/pytorch/commit/d36c284d1446cb250178f8e89fff9b342ee1a5a9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87198 Approved by: https://github.com/soumith, https://github.com/albanD commit 2b03a941f7a3b2539731ef26ce2462b883e296e9 Author: Michael Voznesensky Date: Tue Oct 18 16:54:40 2022 +0000 [dynamo] graph capture for calls to arbitrary self. methods on nn module (#87040) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/87040 Approved by: https://github.com/jansel commit 09a967d6c9e464a49909df7ff1459e00ab8aac09 Author: hxu296 Date: Tue Oct 18 16:50:39 2022 +0000 Make nested TreeSpec printing nicer (#46538) (#86546) 1. Made TreeSpec into a dataclass. 2. In `__repr__`, recursively transformed TreeSpec into dictionaries and then pretty-printed it. Fixes #46538. Hi, @ezyang. this PR is for the TreeSpec `__repr__` refactor we discussed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86546 Approved by: https://github.com/ezyang commit 440f734169c1337dc84323adb1e88e11d7a72059 Author: Animesh Jain Date: Tue Oct 18 15:53:53 2022 +0000 [inductor] Minifier fixes (#87062) Fixes https://github.com/pytorch/torchdynamo/issues/1690 This fixes the error seen in the minifiers. But does not repro the original issue that prompted the above issue. Fx minifiers work at the level of Fx-graphs, and the original issue lies outside of the Fx graph and is only visible on the second iteration. Therefore, the original issue escapes the abstraction of our existing Fx-based minifiers. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87062 Approved by: https://github.com/eellison commit c30cfb07abb930ae2227692a20dbb5e4b9632db7 Author: Animesh Jain Date: Tue Oct 18 15:53:40 2022 +0000 [dynamo][dashboard] Run 2 iterations for the correctness runs (#87104) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87104 Approved by: https://github.com/soumith commit d29dc2b72a6cb5fb24ff3eacd816e08bd16298dc Author: Huy Do Date: Tue Oct 18 15:27:27 2022 +0000 Temporarily disable ios jobs (#87186) While investigating segfault issue: * https://app.circleci.com/pipelines/github/pytorch/pytorch/584349/workflows/6c68b0ce-023e-4f62-83bf-e77962daf8ad/jobs/17180595 * https://github.com/pytorch/pytorch/actions/runs/3269860268/jobs/5377851127 This might be related to the use of conda-forge in https://github.com/pytorch/pytorch/issues/87148, i.e. conda-forge pulls in different version of some dependencies and breaks thing. If that's the case, we could not revert conda-forge change yet because the checksum issue hasn't been fixed upstream yet (Test PR https://github.com/pytorch/pytorch/pull/87185) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87186 Approved by: https://github.com/ZainRizvi, https://github.com/malfet commit ecd25df3131bea694e9b34fe4a76f8ca411a8f05 Author: Christian Puhrsch Date: Tue Oct 18 15:24:18 2022 +0000 Add prototype warning to MaskedTensor constructor (#87107) When a user constructs a MaskedTensor we should signal its development status to set expecations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87107 Approved by: https://github.com/bhosmer commit 240bba7ac85b6163c7c75a168019cd0b6d1c6aa0 Author: anjali411 Date: Tue Oct 18 12:16:05 2022 +0000 add sym_int (#86916) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86916 Approved by: https://github.com/ezyang commit 157310c85ddcf1047377adecd1b905994436d613 Author: Soumith Chintala Date: Tue Oct 18 14:08:01 2022 +0000 [inductor][triton] if device is a torch.device, then make cuda_properties index it correctly (#87174) Without this, I was running into obvious `KeyError`s that were assuming that the device was an integer when running `examples/imagenet`. ```python (pytorch) soumith@bluebox:~/code/examples/imagenet$ python main.py --gpu 0 /home/soumith/dataset/imagenet /home/soumith/code/vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: warn(f"Failed to load image Python extension: {e}") /home/soumith/code/examples/imagenet/main.py:100: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism. warnings.warn('You have chosen a specific GPU. 
This will completely ' Use GPU: 0 for training => creating model 'resnet18' make_fallback(aten.unfold): a decomposition exists, we should switch to it make_fallback(aten.unfold_backward): a decomposition exists, we should switch to it Traceback (most recent call last): File "/home/soumith/code/pytorch/torch/_inductor/graph.py", line 254, in call_function return lowerings[target](*args, **kwargs) File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 202, in wrapped return decomp_fn(*args, **kwargs) File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 2994, in var_ diffs = square(sub(x, mean(x, axis, keepdim=True))) File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 202, in wrapped return decomp_fn(*args, **kwargs) File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 2983, in mean sum_result = sum_(x, axis, keepdim) File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 202, in wrapped return decomp_fn(*args, **kwargs) File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 3211, in sum_ return fn(x, axis, keepdims, dtype=dtype) File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 2953, in inner result = Reduction.create( File "/home/soumith/code/pytorch/torch/_inductor/ir.py", line 714, in create hint, split = cls.num_splits( File "/home/soumith/code/pytorch/torch/_inductor/ir.py", line 454, in num_splits num_sm = get_device_properties(device).multi_processor_count File "/home/soumith/code/pytorch/torch/_inductor/cuda_properties.py", line 43, in get_device_properties return _properties()[_device(device)] KeyError: device(type='cuda', index=0) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/87174 Approved by: https://github.com/yf225 commit dbccccb7a2f724fc57e42bd1f347212f12984a67 Author: Nikita Shulga Date: Tue Oct 18 13:53:30 2022 +0000 [BE] Get rid of deprecation warnings in workflows (take 3) (#87152) - Per [deprecation announcement](https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/) replace `echo "::set-output name="` with echo to `${GITHUB_OUTPUT}` as shown in following [example](https://docs.github.com/en/actions/using-jobs/defining-outputs-for-jobs#example-defining-outputs-for-a-job) - Update `actions/setup-python` from `v2` to `v4` to get rid of deprecated node version warning - Update `actions/checkout-python` from `v2` to `v3` (and `silent-checkout` branch as well) - Update `retry` action to https://github.com/nick-fields/retry/commit/3e91a01664abd3c5cd539100d10d33b9c5b68482 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87152 Approved by: https://github.com/kit1980, https://github.com/izaitsevfb commit 9ac2a06acf75538a35751f785d5f509d6127d6cd Author: Peter Bell Date: Mon Oct 17 20:59:19 2022 +0100 istft: require complex input (#86628) Real dtype input to `torch.istft` has been deprecated since PyTorch 1.8, so it is more than passed its due date to be removed. BC-breaking message: `torch.istft` no longer supports input in the form of real tensors with shape `(..., 2)` to mimic complex tensors. Instead, convert inputs to a complex tensor first before calling `torch.istft`. 
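For code that still carries the legacy real-valued `(..., 2)` representation, a minimal migration sketch (not taken from the PR; shapes are illustrative) is to convert it to a complex tensor with `torch.view_as_complex` before inverting:

```python
import torch

x = torch.randn(1, 4000)
spec = torch.stft(x, n_fft=400, return_complex=True)   # complex STFT

# Legacy code may still hold the old real-valued (..., 2) layout:
spec_real = torch.view_as_real(spec)

# Convert back to a complex tensor first; real-valued input is no longer accepted.
y = torch.istft(torch.view_as_complex(spec_real), n_fft=400, length=x.shape[-1])
```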
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86628 Approved by: https://github.com/mruberry commit b886cd15f5d2979e50790aa7420b6bd94fd7b89d Author: Nikita Karetnikov Date: Mon Oct 17 22:55:35 2022 +0200 [primTorch] Add a ref for NumPy-style `T` (#86850) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86850 Approved by: https://github.com/lezcano, https://github.com/mruberry commit f2ec9fbd03b131fe4f80ad77305271912a687246 Author: Nikita Vedeneev Date: Tue Oct 18 09:07:35 2022 +0000 `torch.ormqr`: backward support (#86800) Seems good to have, especially when neither `a` nor `tau` requires grads and/or they are pretty small in number. Fixes https://github.com/pytorch/pytorch/issues/86267 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86800 Approved by: https://github.com/lezcano commit 841995d53b7ea51e8dae64e0d3d4f4d888406d8b Author: Nikita Karetnikov Date: Mon Oct 17 21:43:28 2022 +0200 [primTorch] Add refs for data conversion ops (#86561) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86561 Approved by: https://github.com/lezcano, https://github.com/mruberry, https://github.com/zou3519 commit 731b4bf0f119315495e3847e065afca282778ee6 Author: PyTorch MergeBot Date: Tue Oct 18 08:14:15 2022 +0000 Revert "Check all CUDA API calls in aten/src/ATen/test for errors (#74919) (#83556)" This reverts commit a7ed398cf6bca767d93c6d81f3ecf4198e1b52e0. Reverted https://github.com/pytorch/pytorch/pull/83556 on behalf of https://github.com/huydhn due to Sorry for revert your PR, but I think it breaks cuda tests https://hud.pytorch.org/pytorch/pytorch/commit/a7ed398cf6bca767d93c6d81f3ecf4198e1b52e0. This should not have been force merged commit 8b0cc9c752477238cacfa171abf5061bc08bed28 Author: Jason Ansel Date: Tue Oct 18 06:06:31 2022 +0000 [inductor] Fix copysign issue in old msvc build (#87117) Should fix https://github.com/pytorch/pytorch/pull/87028#issuecomment-1281066036 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87117 Approved by: https://github.com/DanilBaibak commit 11915b3196d092137f12052f3fb3d723e95ce729 Author: PyTorch MergeBot Date: Tue Oct 18 05:32:45 2022 +0000 Revert "[BE] Get rid of deprecation warnings in workflows (#87152)" This reverts commit 9da032ecee8b0c7a5ce822bb4425af9208dc2fa1. Reverted https://github.com/pytorch/pytorch/pull/87152 on behalf of https://github.com/malfet due to Regresses is_pr_labelled workflow again commit d36c284d1446cb250178f8e89fff9b342ee1a5a9 Author: Zachary DeVito Date: Mon Oct 17 17:59:56 2022 +0000 [triton] allow cuda properties to be queried from workers (#87101) Fixes https://github.com/pytorch/pytorch/pull/87048 by saving the needed properties before fork. Actually attempting to get CUDA to load in the workers is probably not desired: cuda initialization takes O(seconds). Having multiple processes using the same device will slow things down. This just moves the needed properties from the main trainer process to the workers. 
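A hedged sketch of the pattern described in #87101 (function and variable names here are illustrative, not the actual torch/_inductor code): read the device properties once in the parent process and hand plain values to the workers, so the child processes never initialize CUDA:

```python
from concurrent.futures import ProcessPoolExecutor

import torch

def compile_worker(multi_processor_count: int) -> str:
    # The worker only consumes the pre-fetched value; it never touches
    # torch.cuda, so no CUDA context is created in the child process.
    return f"tuning for {multi_processor_count} SMs"

if __name__ == "__main__":
    # CUDA is initialized here, once, in the parent (requires a CUDA device).
    props = torch.cuda.get_device_properties(0)
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(pool.submit(compile_worker, props.multi_processor_count).result())
```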
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87101 Approved by: https://github.com/soumith commit 9da032ecee8b0c7a5ce822bb4425af9208dc2fa1 Author: Nikita Shulga Date: Tue Oct 18 04:34:58 2022 +0000 [BE] Get rid of deprecation warnings in workflows (#87152) - Per [deprecation announcement](https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/) replace `echo "::set-output name="` with echo to `${GITHUB_OUTPUT}` as shown in following [example](https://docs.github.com/en/actions/using-jobs/defining-outputs-for-jobs#example-defining-outputs-for-a-job) - Update `actions/setup-python` from `v2` to `v4` to get rid of deprecated node version warning - Update `actions/checkout-python` from `v2` to `v3` (and `silent-checkout` branch as well) - Update `retry` action to https://github.com/nick-fields/retry/commit/3e91a01664abd3c5cd539100d10d33b9c5b68482 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87152 Approved by: https://github.com/kit1980, https://github.com/izaitsevfb commit 66658e1da7fb3590d3d760d2b26793fb49ab28a5 Author: PyTorch MergeBot Date: Tue Oct 18 04:14:01 2022 +0000 Revert "[BE] Get rid of deprecation warnings in workflows (#87152)" This reverts commit acaf484f0a38f6a7becf342bb3492e1de09f64e1. Reverted https://github.com/pytorch/pytorch/pull/87152 on behalf of https://github.com/malfet due to Regresses is_pr_labelled workflow commit 8ca7820e4531e61b3d381d5eddf43c4969ba0c7d Author: Yanbo Liang Date: Tue Oct 18 03:46:01 2022 +0000 [Inductor] Lift the maximum depth of the Python interpreter stack to adapt large/deep models (#87130) Partly fixes https://github.com/pytorch/torchdynamo/issues/1693 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87130 Approved by: https://github.com/jansel commit acaf484f0a38f6a7becf342bb3492e1de09f64e1 Author: Nikita Shulga Date: Tue Oct 18 03:38:24 2022 +0000 [BE] Get rid of deprecation warnings in workflows (#87152) - Per [deprecation announcement](https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/) replace `echo "::set-output name="` with echo to `${GITHUB_OUTPUT}` as shown in following [example](https://docs.github.com/en/actions/using-jobs/defining-outputs-for-jobs#example-defining-outputs-for-a-job) - Update `actions/setup-python` from `v2` to `v4` to get rid of deprecated node version warning - Update `actions/checkout-python` from `v2` to `v3` (and `silent-checkout` branch as well) - Update `retry` action to https://github.com/nick-fields/retry/commit/3e91a01664abd3c5cd539100d10d33b9c5b68482 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87152 Approved by: https://github.com/kit1980, https://github.com/izaitsevfb commit 5fb687182dba781d9c95388d19f4784b98cb8b20 Author: Driss Guessous Date: Tue Oct 18 02:00:04 2022 +0000 Enable sdp_forward for NestedTensors (#86720) This PR implements a sdp_forward for NestedTensors. This impl will call into flash and mem_efficient_attention when possible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86720 Approved by: https://github.com/cpuhrsch commit 74138a8daa93ec4cb08e4dd31c2773ec0c751d94 Author: Huy Do Date: Tue Oct 18 01:14:07 2022 +0000 Use conda-forge in mac mps test (#87155) https://github.com/pytorch/pytorch/pull/87150 works, most of the jobs are ok now. However, I miss one last piece in MPS test workflow https://github.com/pytorch/pytorch/actions/runs/3269594289/jobs/5377469209. 
So this fixes the missing piece to use conda-forge Pull Request resolved: https://github.com/pytorch/pytorch/pull/87155 Approved by: https://github.com/kit1980, https://github.com/ZainRizvi commit 9d1a8edc0e609387f30848ddcae569a238052d66 Author: ssjia Date: Fri Oct 14 15:10:28 2022 -0700 [vulkan] Use 2D texture types for convolution weights and biases (#86972) Differential Revision: [D40385500](https://our.internmc.facebook.com/intern/diff/D40385500/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86972 Approved by: https://github.com/salilsdesai, https://github.com/kirklandsign commit 5b588036aa0152d83d58f1b52038137043da0768 Author: ssjia Date: Fri Oct 14 15:10:25 2022 -0700 [vulkan] Enable 2D texture types (#86971) Adds the ability to use 2D GPU textures to represent tensors. The `StorageType` enum can be used to represent other representation modes in the future, such as buffer representations, etc. Differential Revision: [D40363112](https://our.internmc.facebook.com/intern/diff/D40363112/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86971 Approved by: https://github.com/kirklandsign commit a7ed398cf6bca767d93c6d81f3ecf4198e1b52e0 Author: Richard Barnes Date: Tue Oct 18 00:35:44 2022 +0000 Check all CUDA API calls in aten/src/ATen/test for errors (#74919) (#83556) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74919 Test Plan: Sandcastle Differential Revision: D35194596 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83556 Approved by: https://github.com/malfet commit f02f0e3ad1565e3da1e78efaa994e80c7577fd0c Author: Huy Do Date: Tue Oct 18 00:11:37 2022 +0000 Install blas from conda-forge (#87150) Mitigate https://github.com/pytorch/pytorch/issues/87148 On AWS (m1, linux) * Run `conda install blas:openblas`, it should fail with `ChecksumMismatchError`: ```
download saved to: /tmp/debug/pkgs/blas-1.0-openblas.conda expected sha256: c85b5d0a336b5be0f415c71fd7fe2eca59e09f42221bfa684aafef5510ba5487 actual sha256: 5dc5483db0d9785b19e021cee418a8ee03e0ff0e5ebd0b75af4927746604e187 ``` * Run ` conda install -c conda-forge blas:openblas` works Pull Request resolved: https://github.com/pytorch/pytorch/pull/87150 Approved by: https://github.com/kit1980 commit 9db7270ee7f18a9baf8c0b9b87ace5d8c655bb53 Author: albanD Date: Mon Oct 17 22:56:49 2022 +0000 Small update to Module note (#87142) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87142 Approved by: https://github.com/cpuhrsch commit fb614b1871d83c3063907c77f81177fc01bea19f Author: Nirav Mehta Date: Mon Oct 17 22:15:47 2022 +0000 Enable UBSAN mode for test_jit (#85735) Run `test_jit` executable with UBSAN flag in order to catch errors that might cause internal breakage Pull Request resolved: https://github.com/pytorch/pytorch/pull/85735 Approved by: https://github.com/dagitses commit 18cc00d3993f2e84c83274ff1ada6430291aa3bd Author: Catherine Lee Date: Mon Oct 17 22:10:21 2022 +0000 [ci] put more logs in a folded group (#86138) fixes: request to not print the entire log file, but the last couple of lines since they are probably the most relevant all but last 300 lines of failing tests get put into a folded group example https://github.com/pytorch/pytorch/actions/runs/3177200444/jobs/5177703202 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86138 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/lezcano commit e3b84f6c9d5f5aeb6356948bdaaf419ad906226a Author: Catherine Lee Date: Mon Oct 17 22:09:56 2022 +0000 remove dynamo hash updates (#87092) remove workflow for updating dynamo hash as it got moved into this repo Pull Request resolved: https://github.com/pytorch/pytorch/pull/87092 Approved by: https://github.com/huydhn commit 4fd98dfe69287914fd29b38fbccaf7ac4d7261ee Author: David Berard Date: Mon Oct 17 10:29:41 2022 -0700 Don't only apply DDP optimizer on forward frames (#87097) Previously a check would only apply DDP optimizer on frames named "forward". But on hf_T5_large, a graph break causes some frames like: ``` ``` So instead, apply DDP optimizer on all frames. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87097 Approved by: https://github.com/wconstab commit 09d720919ec975860bea3dd42ac13f4921c7d245 Author: Justin Chu Date: Tue Oct 11 23:40:14 2022 +0000 Add venv to gitignore (#86702) `venv` is the common directory for creating virtual environments. Adding it to gitignore to support development that does not use anaconda to manage envs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86702 Approved by: https://github.com/kit1980 commit 0cb273b5d9e4a31574357df2f2290322088c7802 Author: Kevin Tse Date: Mon Oct 17 11:24:05 2022 -0400 [DataPipe] Fixing interface generation in setup.py (#87081) Based on the artifact generated on this [page](https://hud.pytorch.org/pr/87081), I downloaded [[s3] linux-focal-py3.7-clang7-asan/artifacts.zip](https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3266430083/linux-focal-py3.7-clang7-asan/artifacts.zip) (1.14 GB) and unpacked it. `torch.utils.data.datapipes.datapipe.pyi` does exist. I believe this means the file should be part of the distribution. I also did `wheel unpack ***.whl` to confirm the existence of the file. 
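An equivalent check can be scripted (a small helper sketch, not part of the PR; the script name and argument are hypothetical) by listing the wheel contents directly:

```python
import sys
import zipfile

# Usage: python check_wheel.py path/to/torch-*.whl
names = zipfile.ZipFile(sys.argv[1]).namelist()
print(any(n.endswith("torch/utils/data/datapipes/datapipe.pyi") for n in names))
```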
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87081 Approved by: https://github.com/ejguan commit f5ee2d88406b45c6730f3b34bb54979836374c40 Author: Michael Suo Date: Mon Oct 17 21:27:21 2022 +0000 [ci] fix bot comment (#87127) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87127 Approved by: https://github.com/clee2000 commit f552eee42765e7de01c7df7bd794c70fd094874d Author: Andrew Gu Date: Mon Oct 17 21:17:07 2022 +0000 [Docs] Remove outdated comment for sparse all-reduce (#87018) https://github.com/pytorch/pytorch/pull/23917 switched to using allgatherv instead of allgather for gloo sparse all-reduce. This PR removes a comment saying to use allgatherv if available since that has already been done. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87018 Approved by: https://github.com/H-Huang commit d023e8393396acd871629e91516170d72ced10e0 Author: Catherine Lee Date: Mon Oct 17 21:03:42 2022 +0000 handle libomp update on circleci (#86979) libomp got an update and now its keg only reverts https://github.com/pytorch/pytorch/pull/86940 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86979 Approved by: https://github.com/huydhn, https://github.com/malfet commit 5acf6e0e80fb3c029fe62ff665bc5279ec00a70c Author: Huy Do Date: Mon Oct 17 20:57:55 2022 +0000 Use 12xlarge for nightly cpp doc generation job (#86859) The job starts to run out of memory a lot recently https://hud.pytorch.org/failure/Process%20completed%20with%20exit%20code%20137. Probably more and more docs are added, so this ups the runner for cpp doc nightly from 4xlarge to the next tier of 12xlarge. This also choose the smaller runner of 2xlarge for python and functorch docs (may be linux.large is good enough for them?) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86859 Approved by: https://github.com/malfet commit 4814270708cb6141c1fb6202f883c084c71290b4 Author: Michael Suo Date: Mon Oct 17 20:14:43 2022 +0000 [dynamo] Introduce `get_real_value` API to TensorVariable (#87091) Right now, example_value is doing two jobs: - We use it to propagate metadata (e.g. return type, shapes, etc.) throughout the graph - We use it to satisfy queries for the actual value (e.g. torch.cond, `assume_constant_result`) This is further complicated by the fact that we have two modes, one where `example_value` is a fake tensor, and one where it is a real tensor (this is the `fake_tensor_propagation` config flag). This leads to scenarios where we don't support every combination of job + mode, e.g. if `fake_tensor_propagation=False`, `assume_constant_result` is broken. This is made worse by the fact that "fake tensor mode" is the default and is required if you want dynamic shapes to work. So, this PR introduces a `get_real_value` API that just runs the graph up to `node` in order to get a concrete value. This API is orthogonal to `example_value`, so it doesn't care about `fake_tensor_propagation`. When `fake_tensor_propagation=True`: `example_value` is a fake tensor, you must use the `get_real_value` API to get a concrete value. This will be the only configuration in the future. When `fake_tensor_propagation=False`: `example_value` and `get_real_value` will produce the same value. This is redundant but we will be removing this config soon. To support this, I introduce a cache for computed real values, to memoize the work involved if we're asking for real values a lot. 
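The idea of running the graph up to a node with real inputs, memoizing the concrete values, can be sketched with the public FX interpreter (a hedged illustration only; `CachingInterpreter` is a made-up name and this is not the dynamo implementation):

```python
import torch
import torch.fx as fx

class CachingInterpreter(fx.Interpreter):
    """Run the graph with real inputs, memoizing each node's concrete value."""
    def __init__(self, gm):
        super().__init__(gm)
        self.real_values = {}

    def run_node(self, n):
        if n not in self.real_values:
            self.real_values[n] = super().run_node(n)
        return self.real_values[n]

def f(x):
    return torch.relu(x) + 1

gm = fx.symbolic_trace(f)
interp = CachingInterpreter(gm)
interp.run(torch.randn(3))
relu_node = next(n for n in gm.graph.nodes if n.target is torch.relu)
print(interp.real_values[relu_node])  # concrete value computed at that node
```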
I attached this state to `OutputGraph` because it seems to be what historically managed `example_value` lifetimes, but idk. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87091 Approved by: https://github.com/wconstab commit e85dbcc9b075961ab082975348c5cf1d99b7da76 Author: Jan Margeta Date: Mon Oct 17 20:01:07 2022 +0000 [docs] Fix ScalarTensor __repr__ in Extending PyTorch example (#86330) This PR fixes the __repr__ of the `ScalarTensor` class in the Extending PyTorch example to correspond with the class name instead of `DiagonalTensor`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86330 Approved by: https://github.com/bdhirsh commit b8007742c287d792f2e89bbb7af5f87f6afdd2e8 Author: Michael Voznesensky Date: Mon Oct 17 19:55:39 2022 +0000 [Dynamo] More robust pyop support, module properties as args (#87020) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/87020 Approved by: https://github.com/jansel commit 1167949b2df0e9ff228aac0e1b82403c05021546 Author: Thiago Crepaldi Date: Mon Oct 17 19:45:33 2022 +0000 [ONNX] Ignore print(Tensor) during tracing (#86223) Fixes #73619 Fixes https://github.com/microsoft/onnxruntime/issues/11812 This PR adds new symbolics: `aten::_conj`, `aten::conj_physical`, `aten::resolve_conj`, and `aten::resolve_neg` While the last two are always NO-OP by definition (do not change nodes), the first raises an exception as they are not supported by ONNX yet Pull Request resolved: https://github.com/pytorch/pytorch/pull/86223 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit 31931515bc927675136b1637fb9782b9b5ff3174 Author: Ivan Yashchuk Date: Mon Oct 17 18:46:28 2022 +0000 Workarounds for cudnn_batch_norm with TorchRefsNvfuserCapabilityMode (#86796) This PR adds workarounds to support AOT Autograd's graphs containing `aten.cudnn_batch_norm` and `aten.cudnn_batch_norm_backward` with `TorchRefsNvfuserCapabilityMode`. The problem with the decomposition of `aten.cudnn_batch_norm` is that it uses a `new_empty` call that is not supported by nvFuser and we are conservative with lowering functions to nvprims by default. The problem with the decomposition of `aten.cudnn_batch_norm_backward` is described here https://github.com/pytorch/pytorch/pull/86115#issue-1394883782, but changing the decomposition directly in that PR makes many tests fail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86796 Approved by: https://github.com/mruberry commit 33343def0b4a0ec58f0557edba017748f789c8d6 Author: holimion Date: Mon Oct 17 18:27:46 2022 +0000 add XLA backend into tensor type strings (#86881) add XLA backend into tensor type strings Pull Request resolved: https://github.com/pytorch/pytorch/pull/86881 Approved by: https://github.com/bdhirsh commit 317eeb81c3e7ab21ca4359819c4a89122ce574f5 Author: PyTorch MergeBot Date: Mon Oct 17 18:26:59 2022 +0000 Revert "OpInfo: Sample input cleanup (4/n) (#86324)" This reverts commit 2a6d37d23d163a35c0b62c4319a6c2f049a27833. 
Reverted https://github.com/pytorch/pytorch/pull/86324 on behalf of https://github.com/peterbell10 due to Caused tolerance issues in periodic test commit 8f85831fdf473be541b12b843329d9b3f124c6d6 Author: JackCaoG <59073027+JackCaoG@users.noreply.github.com> Date: Mon Oct 17 18:17:01 2022 +0000 Give more clear error message when gscope is non-empty (#87005) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/87005 Approved by: https://github.com/alanwaketan, https://github.com/Krovatkin commit c01c7a5e2cd1074409f31b1338524d440db8b460 Author: Kevin Tse Date: Mon Oct 17 15:19:29 2022 +0000 [DataPipe] Fix missing functional name for FileLister (#86497) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86497 Approved by: https://github.com/ejguan commit c27a5171b80b126917ef4435e232055415fcf617 Author: Huy Do Date: Mon Oct 17 17:39:19 2022 +0000 Update action lint with missing new runners from scale-config (#87009) Using runner label like `linux.12xlarge` results in linter failure from actionlint, i.e. https://github.com/pytorch/pytorch/actions/runs/3253740221/jobs/5341281952 ``` Error (ACTIONLINT) [runner-label] label "linux.12xlarge" is unknown. available labels are "windows- latest", "windows-2022", "windows-2019", "windows-2016", "ubuntu- latest", "ubuntu-22.04", "ubuntu-20.04", "ubuntu-[18](https://github.com/pytorch/pytorch/actions/runs/3253740221/jobs/5341281952#step:7:19).04", "macos-latest", "macos-12", "macos-12.0", "macos-11", "macos-11.0", "macos-10.15", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows", "linux.[20](https://github.com/pytorch/pytorch/actions/runs/3253740221/jobs/5341281952#step:7:21)_04.4x", "linux.20_04.16x", "linux.large", "linux.2xlarge", "linux.4xlarge", "linux.4xlarge.nvidia.gpu", "linux.8xlarge.nvidia.gpu", "linux.16xlarge.nvidia.gpu", "windows.4xlarge", "windows.8xlarge.nvidia.gpu", "bm-runner", "linux.rocm.gpu", "macos-m1- 12", "macos-12-xl", "macos-12", "macos12.3-m1". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file 47 | # an OOM issue when running the job, so this upgrades the runner from 4xlarge 48 | # to the next available tier of 12xlarge. So much memory just to generate cpp 49 | # doc >>> 50 | runner: linux.12xlarge 51 | # Nightly cpp docs take about 150m to finish, and the number is stable 52 | timeout-minutes: 180 53 | - docs_type: python ``` `linux.12xlarge` is a valid runner label from https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml. This also adds `linux.24xlarge` and `linux.g5.4xlarge.nvidia.gpu`, which are also not added yet Pull Request resolved: https://github.com/pytorch/pytorch/pull/87009 Approved by: https://github.com/ZainRizvi commit 1704256b107500c1ebc2e803b55e31e11104e618 Author: Natalia Gimelshein Date: Mon Oct 17 17:08:44 2022 +0000 Enables `where` to have cpu scalar args (#87022) This is for decompositions only, no attempt made to have good performance for this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87022 Approved by: https://github.com/ezyang, https://github.com/eellison, https://github.com/mruberry commit f3969bd8b50fc20e12a3c6a69a5788786b0d904c Author: samdow Date: Mon Oct 17 09:36:22 2022 -0400 [functorch] Fix cross to match unbatched behavior (#86926) Fixes #83936 #83907 In #83936, I noticed that after I wrote cross, it's silently incorrect because I misunderstood what the fix to linalg was going to be. 
This fixes functorch to not be silently incorrect with `linalg.cross`. Since it's a silent correctness issue that I missed, I'm hoping to cherry pick it too Pull Request resolved: https://github.com/pytorch/pytorch/pull/86926 Approved by: https://github.com/zou3519 commit e271e823c7d1b231175ab4a0145d4ef2f7b7519c Author: Sherlock Huang Date: Fri Oct 14 16:11:15 2022 +0000 Avoid calling logging.basicConfig (#86959) Fixes https://github.com/pytorch/pytorch/issues/85952 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86959 Approved by: https://github.com/xwang233, https://github.com/davidberard98 commit 6351220573c8d86972f5188dc5a570686fa3f8ed Author: anjali411 Date: Mon Oct 17 12:30:34 2022 +0000 Add meta support for _adaptive_avg_pool2d_backward (#86359) (#87074) This reverts commit 3edf79dc03193c98b665d62231fe69a10dfab1fa. Reland of https://github.com/pytorch/pytorch/pull/86359 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87074 Approved by: https://github.com/ezyang commit 66715767ffcd986d18b1658ee13f63dbdf5eb898 Author: PyTorch MergeBot Date: Mon Oct 17 16:02:49 2022 +0000 Revert "[Dynamo] More robust pyop support, module properties as args (#87020)" This reverts commit 3c320a5613c26aa3568c330ae1c34a03dadf2b5c. Reverted https://github.com/pytorch/pytorch/pull/87020 on behalf of https://github.com/ZainRizvi due to This appears to have caused two periodic tests to fail commit 8617f5f48183b84fc7335a2754fc1ffa9666a0dc Author: Natalia Gimelshein Date: Mon Oct 17 15:59:05 2022 +0000 fix cudagraphify for inplace parameter change (#87060) Fixes https://github.com/pytorch/torchdynamo/issues/1687 cc @albanD, @chillee, I don't know what I'm doing. According to previous discussions, calling `detach()` on inputs can cause bugs if inputs are later inplace-resized (cc @ezyang) https://github.com/pytorch/pytorch/pull/85301/files#diff-8678402e01603e588fcf175a61de9ed578d885b1cc082e028021856190223fb7L433, but should we weed out these patterns before they are sent to cudagraphify? Pull Request resolved: https://github.com/pytorch/pytorch/pull/87060 Approved by: https://github.com/jansel, https://github.com/albanD commit 2c6167c4bb5165e5844b541275cee35687dc9783 Author: PyTorch MergeBot Date: Mon Oct 17 15:44:14 2022 +0000 Revert "[inductor] Use decomps for unfold (#87025)" This reverts commit 5099883f059a9b15592b8ba3b7bf83145163b966. Reverted https://github.com/pytorch/pytorch/pull/87025 on behalf of https://github.com/ZainRizvi due to Breaks periodic tests commit 2b558138cf0a0296b27814d129023d1b1a503f29 Author: Animesh Jain Date: Mon Oct 17 15:43:53 2022 +0000 [inductor] Set correct strides in fallback example run (#87049) Fixes #ISSUE_NUMBER Helps in resolving many issues seen in https://github.com/pytorch/torchdynamo/issues/1675 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87049 Approved by: https://github.com/jansel commit 4e5357faf5fd65e56b14bec6bdd33e915f909bde Author: Peter Bell Date: Mon Oct 17 13:25:09 2022 +0100 ATen/native (2/6): Use per-operator headers (#75572) Differential Revision: [D40126702](https://our.internmc.facebook.com/intern/diff/D40126702) Pull Request resolved: https://github.com/pytorch/pytorch/pull/75572 Approved by: https://github.com/DanilBaibak, https://github.com/malfet commit b40f4434ac3512a21dcec91467df1b179898503f Author: albanD Date: Sun Oct 16 22:16:16 2022 -0400 conv backward impl (#87047) ~~Waiting for test run to see if this backward is actually exercised. If not, I will add test before merging.~~ Test updated. 
Ready to go now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87047 Approved by: https://github.com/ezyang commit 1463013c85e2c89adaad76612637ef951ffc7e94 Author: albanD Date: Sun Oct 16 22:16:16 2022 -0400 autograd clone_obey_contract() symint support (#87044) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87044 Approved by: https://github.com/ezyang commit 86c2e44cb68a646e368a8e52915fd2a835842dc7 Author: albanD Date: Sun Oct 16 22:16:15 2022 -0400 meta funcs for avg_pool2d and avg_pool2d_backward (#87043) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87043 Approved by: https://github.com/ezyang commit c21dcffc005aeb061ac869d3ff712daf89d11ea4 Author: albanD Date: Sun Oct 16 22:16:14 2022 -0400 Very limited pow support (#87042) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87042 Approved by: https://github.com/ezyang commit 37e9e89afbc3554258545a026fab4cd9e1a4b85d Author: PyTorch MergeBot Date: Mon Oct 17 10:55:42 2022 +0000 [xla hash update] update the pinned xla hash (#87067) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87067 Approved by: https://github.com/pytorchbot commit 91b3cd0b5a7d297a82ca0f9068ea7f9ac1963ced Author: Nikita Karetnikov Date: Sun Oct 16 23:22:01 2022 +0200 [primTorch] Add a ref for `narrow_copy` (#86748) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86748 Approved by: https://github.com/mruberry commit 847ded6db325af268527e7096e31085fcb845495 Author: Ryan Spring Date: Mon Oct 17 06:20:31 2022 +0000 [primTorch] Implement NLL loss reference (#81128) Add Reference: - nll_loss Depends on: - expand https://github.com/pytorch/pytorch/pull/79820 - advance indexing Pull Request resolved: https://github.com/pytorch/pytorch/pull/81128 Approved by: https://github.com/mruberry commit 78e2289005738df1faefb6c2309495b8b8d367bb Author: Jiong Gong Date: Mon Oct 17 06:05:30 2022 +0000 [TorchInductor] enable inplace buffers by default (#87037) This PR enables the inplace_buffers configuration by default after fixing issue: https://github.com/pytorch/torchdynamo/issues/1670. UT is added to cover the fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87037 Approved by: https://github.com/jansel commit 1b43883fd61a5e3525ea213262bfcb3aedc941d3 Author: Emilio Castillo Date: Mon Oct 17 04:32:08 2022 +0000 Make `AdamW`, `NAdam` & `RAdam` differentiable (#86183) Blocked by #86096 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86183 Approved by: https://github.com/albanD commit 364a9973cab8e7458abd27e3926168978fe5428e Author: PyTorch MergeBot Date: Mon Oct 17 03:17:00 2022 +0000 [vision hash update] update the pinned vision hash (#87021) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87021 Approved by: https://github.com/pytorchbot commit 3a4c0900c737fe73f900f0d21fc21d972f9bbd2e Author: albanD Date: Mon Oct 17 02:09:40 2022 +0000 Reland 3 of Merge more symbolic meta kernels and symint changes from branch (#86795) Take 3 Contains: - symintification of split* - floor support on SymFloat - pad_backward, gather, scatter meta Pull Request resolved: https://github.com/pytorch/pytorch/pull/86795 Approved by: https://github.com/z-a-f commit 0379af681b4b20475589189251aafbb2e6bb91ca Author: Jason Ansel Date: Sun Oct 16 15:10:07 2022 -0700 [inductor] Disable parallel compile (#87048) https://github.com/pytorch/pytorch/pull/87032 seems to have an issue that breaks our benchmark script, it might have to do with the benchmark script also using subprocess. Before this PR: ``` $ ./benchmarks/dynamo/torchbench.py --performance --inductor --raise --training --float16 ... Traceback (most recent call last): File "/home/jansel/conda/envs/pytorch/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker r = call_item.fn(*call_item.args, **call_item.kwargs) File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 239, in _worker_compile kernel = TritonCodeCache.load(source_code) File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 234, in load mod = PyCodeCache.load(source_code) File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 212, in load exec(code, mod.__dict__, mod.__dict__) File "/tmp/torchinductor_jansel/ij/cij7smji4sw2a56i4yz45bjkrosd2sb2raqnxzsxxpg4kwzuo2ta.py", line 5, in from torch._inductor.triton_ops.autotune import reduction File "/home/jansel/pytorch/torch/_inductor/triton_ops/__init__.py", line 3, in if has_triton(): File "/home/jansel/pytorch/torch/_inductor/utils.py", line 38, in has_triton return triton is not None and torch.cuda.get_device_capability() >= (7, 0) File "/home/jansel/pytorch/torch/cuda/__init__.py", line 368, in get_device_capability prop = get_device_properties(device) File "/home/jansel/pytorch/torch/cuda/__init__.py", line 382, in get_device_properties _lazy_init() # will define _get_device_properties File "/home/jansel/pytorch/torch/cuda/__init__.py", line 228, in _lazy_init raise RuntimeError( RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method ``` cc @zdevito Pull Request resolved: https://github.com/pytorch/pytorch/pull/87048 Approved by: https://github.com/soumith commit 3007efda08f2fd61f9c48a810f6931a560f9ca62 Author: Peter Bell Date: Sun Oct 16 20:23:08 2022 +0100 stft: Require return_complex to be passed explicitly for real input (#86724) This behavior has been deprecated since PyTorch 1.8 but this step of the deprecation cycle was put on hold in #50102 waiting for JIT upgraders functionality which doesn't seem to have panned out. I'd say there has been more than enough of a deprecation period, so we should just continue. BC-breaking message: `torch.stft` takes an optional `return_complex` parameter that indicates whether the output should be a floating point tensor or a complex tensor. `return_complex` previously defaulted to `False` for real input tensors. This PR removes the default and makes `return_complex` a required argument for real inputs. However, complex inputs will continue to default to `return_complex=True`. 
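A short illustration of the new requirement (not from the PR itself): real input must now pass `return_complex` explicitly, while complex input keeps its default:

```python
import torch

x = torch.randn(4000)                                  # real signal
spec = torch.stft(x, n_fft=400, return_complex=True)   # flag is now mandatory here
print(spec.dtype)                                       # torch.complex64

# torch.stft(x, n_fft=400)  # raises: return_complex must be passed for real input
```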
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86724 Approved by: https://github.com/mruberry, https://github.com/albanD commit 2b7236a0e1c0d2339165103b2cd42e25debee99d Author: Zachary DeVito Date: Sun Oct 16 05:17:20 2022 +0000 [torchdynamo] Use ProcessPoolExecutor for triton compiles (#87032) This patch significantly improves the parallel compilation performance for compiling triton kernels by using ProcessPoolExecutor to create a persistent pool of compilation workers. Previously os.fork overhead and GIL contention limited the achieved parallelism. This patch replaces the worker threads with a pool of processes to do the raw compilation, and does serial work on the main thread for everything else. This other work couldn't be parallelized anyway since it is mostly in Python. In cold start situations, the time to get the worker threads started can be a significant portion of the time. This patch starts the workers earlier so they are ready to perform compilation (see code comments) when dynamo gets to that point. Just tested this on one example benchmark (tf_efficientnet_b0), but the results are significant, almost eliminating the difference between a warm and cold compilation. ``` 39.613s - warm 41.290s - cold, this patch 2m53.197s - cold, single threaded: 1m7.092s - cold, old setup n = 8 (its best config) ``` (cold compilation is done after running `rm -rf /tmp/torchinductor_$USER`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/87032 Approved by: https://github.com/soumith, https://github.com/jansel commit 945d333ae485673d7a603ca71822c9a39ca4775a Author: Jason Ansel Date: Sun Oct 16 09:20:50 2022 -0700 Migrate dynamo CI test shards to torch._dynamo (#87039) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87039 Approved by: https://github.com/voznesenskym commit 30f6f6903c7e68d2105d5b8dfe8841a788bab051 Author: Jason Ansel Date: Sun Oct 16 10:16:04 2022 -0700 [inductor] Move size asserts to C++, fix bug (#87028) Inductor internally models any `size=1` dimension as having `stride=0` to simplify indexing formulas (sympy will remove these terms from the expression). This caused a bug in our generated stride assert in detectron2_maskrcnn_r_50_fpn, where we asserted the wrong stride of a size==1 dimension. This fixes that bug, and moves size/stride assert logic to C++ which should be a small perf gain.
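The `size=1` / `stride=0` modeling mentioned in #87028 can be seen from eager PyTorch as well (a small illustration, not the inductor code): the stride recorded for a size-1 dimension never affects indexing, so any value is equivalent:

```python
import torch

x = torch.randn(4, 1, 6)
print(x.stride())                        # (6, 6, 1) for a contiguous tensor

# Re-expressing the same storage with stride 0 on the size-1 dimension
# indexes identical elements, which is why that stride is "free".
y = x.as_strided(x.size(), (6, 0, 1))
print(torch.equal(x, y))                 # True
```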
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87028 Approved by: https://github.com/anijain2305 commit d45e99acf5fed7d0ea0ffcff36231c63ea3a8db5 Author: Jason Ansel Date: Sun Oct 16 08:09:32 2022 -0700 [dynamo] Put printing graph breaks behind a config option (#87026) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87026 Approved by: https://github.com/soumith, https://github.com/voznesenskym commit 2a6d37d23d163a35c0b62c4319a6c2f049a27833 Author: Peter Bell Date: Fri Oct 14 17:06:42 2022 +0100 OpInfo: Sample input cleanup (4/n) (#86324) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86324 Approved by: https://github.com/mruberry commit 5099883f059a9b15592b8ba3b7bf83145163b966 Author: Jason Ansel Date: Sat Oct 15 21:08:48 2022 -0700 [inductor] Use decomps for unfold (#87025) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87025 Approved by: https://github.com/soumith commit 8a8cd092c8537b226f5c38ed88bc07e181b0946c Author: Edward Z. Yang Date: Sun Oct 16 06:13:18 2022 +0000 Add labeler with dynamo/inductor paths to start (#87024) The other missing ingredient is getting CC bot to work on labels on PRs Pull Request resolved: https://github.com/pytorch/pytorch/pull/87024 Approved by: https://github.com/soumith, https://github.com/jansel commit a0c2a7f2eda788a48f1d243940297f1467faf138 Author: Jason Ansel Date: Sat Oct 15 17:50:58 2022 -0700 Add triton to CI (#86988) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86988 Approved by: https://github.com/malfet, https://github.com/voznesenskym, https://github.com/soumith commit 3c320a5613c26aa3568c330ae1c34a03dadf2b5c Author: Michael Voznesensky Date: Sun Oct 16 02:15:10 2022 +0000 [Dynamo] More robust pyop support, module properties as args (#87020) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/87020 Approved by: https://github.com/jansel commit 5d6e8315630d4e62e5e015c2e4c816be04f1f94e Author: Peter Bell Date: Fri Oct 14 17:06:42 2022 +0100 OpInfo: Sample input cleanup (3/n) (#86380) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86380 Approved by: https://github.com/mruberry commit 054a2fd6c2fc361796663eed4772368c287d6c83 Author: Jason Ansel Date: Sat Oct 15 08:35:32 2022 -0700 Sync changes from `pytorch/torchdynamo` (#87013) This updates to: https://github.com/pytorch/torchdynamo/commit/6380959be21851bfda99424392cc08fda29d073d Generated with: https://github.com/pytorch/torchdynamo/blob/main/copy_to_core.sh Pull Request resolved: https://github.com/pytorch/pytorch/pull/87013 Approved by: https://github.com/voznesenskym commit 2c1bc216b8893e59e986d843dcf0e152e1938ac1 Author: Horace He Date: Sat Oct 15 04:10:47 2022 +0000 Fixed partitioner issue with getitem and made metadata a storage more consistent (#87012) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87012 Approved by: https://github.com/ngimel commit 91c7015426f6b57b87bd92a63e9c08d9fd46a020 Author: Jane Xu Date: Sat Oct 15 06:23:48 2022 +0000 [einsum] Fix opt_einsum defaults to be more reasonable (#86985) Fixes the confusing situation mentioned here https://github.com/pytorch/pytorch/issues/85224#issuecomment-1278628262 by - setting better OG defaults - changing warnings to errors now that we have better defaults Test plan: - Ran einsum tests locally + CI - Uninstalled opt-einsum and ran through setting - `enabled` to False (doesn't throw error) - `strategy` to anything that's not None (errors) - `strategy` to None (noops) - Installed opt-einsum 
and ran through setting - `enabled` to False (doesn't throw error) - `enabled` to True (doesn't throw error, no ops + defaults to 'auto') - `strategy` to random string (errors) - `strategy` to None (noops, still is 'auto') - `strategy` to 'greedy' (is set to 'greedy') Pull Request resolved: https://github.com/pytorch/pytorch/pull/86985 Approved by: https://github.com/soulitzer commit 7980ed95bd708d6e9baf64c95b1aa83df8891b59 Author: tangleintel Date: Sat Oct 15 05:33:07 2022 +0000 Support unpacking python dictionary in torch.jit.trace() (#81623) Say, if you have a model and its forward method defined as follows: **`def forward(self, key1=value1, key2=value2, key3=value3)`** And you have a dataset and each data point in the dataset is a python dict as follows: **`data = {key1:value1, key3:value3, key2:value2}`** The problem is that if you want to trace the model using the dict data by the giving dataset, you need unpack the dictionary and reorder its value manually and make up a tuple as **`data_tuple = (value1, value2, value3)`** as the **`example_inputs`** parameter of **`torch.jit.trace()`**. This marshalling process is not user friendly. Say, if you have a model and its forward method defined as follows: **`def forward(self, key1=None, key2=None, key3=None)`** -> The default value is **None** And you have a dataset and each data point in the dataset is a python dict as follows: **`data = {key1:value1, key3:value3}`** -> Only **part of** the required value by forward was given, the rest use the default value. The problem is that if you want to trace the model using the dict data by the giving dataset, it's not feasible at all. Cause neither you can pass a tuple like **`T1 = (value1, value3)`** nor **`T2 = (value1, None, value3)`**. T1 will mismatch value3 with key2 and T2 include **None** type which will be blocked by tracer's type checking. (Of course you can pass **`T3 = (value1,)`** to make the trace function finish without exception, but the traced model you get probably is not what you expect cause the different input may result in different traced result.). These problems come from the HuggingFace's PT model, especially in text-classification tasks with datasets such as [MRPC,](https://paperswithcode.com/dataset/mrpc) [MNLI](https://paperswithcode.com/dataset/multinli) etc. To address these two issues, we propose to support a new type, that is, python dict as example_inputs parameter for torch.jit.trace(). We can base on the runtime type information of the example_inputs object to determine if we fall back to the original tuple path or go into the new dictionary path. Both problem 1 and problem 2 can be solved by utilizing the "**`**`**" operator. 1. If we use dict as example_inputs to trace the model, then we have to pass a dictionary to the traced model too. (Cause probably we will change the order of debug name of the input parameter in torchscript IR, thus we can't assume the traced model's input parameters order are the same with the original model.). We need highlight this too in the document to mitigate this problem. For example: ``` example_inputs_dict = next(iter(dataloader)) jit_model = model.eval() jit_model = torch.jit.trace(jit_model, example_inputs_dict, strict=False) # Now the IR will be graph(%self : __torch__.module.___torch_mangle_n.Mymodule, %key1 : type1, %key3 : type3, %key2 : type2) jit_model = torch.jit.freeze(jit_model) jit_model(**example_inputs_dict) example_inputs_tuple = (value1, value3, value2) jit_model(*example_inputs_tuple) ``` 1. 
This PR will make some UT introduced in [39601](https://github.com/pytorch/pytorch/pull/39601) fail, which I think should be classified as unpacking a tuple containing a single dictionary element in our solution. 4. I think there is ambiguity since currently we only specify passing a tuple or a single Tensor as our example_inputs parameter in **torch.jit.trace()**'s documentation, but it seems we can still passing a dictionary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81623 Approved by: https://github.com/davidberard98 commit bdefa260b2831977b4a458d9daef2b710330c78c Author: Rohan Varma Date: Fri Oct 14 20:45:25 2022 +0000 [RFC] Separate CPU offload activation to its own wrapper (#85459) Passing in `offload_to_cpu=True` to checkpoint_wrapper is a bit confusing, because this causes the activation checkpoint args to be ignored and we do CPU offloading. This isn't ideal from API design perspective, so proposing to make `offload_wrapper` its own concept. Now, offload to CPU + checkpoint can be composed together, such as ``` apply_ac_wrapper(model, checkpoint_wrapper, check_fn=lambda mod: isinstance(mod, TransformerLayer)) model = offload_wrapper(model) ``` Will polish / add tests if this proposal sounds good. Differential Revision: [D39719854](https://our.internmc.facebook.com/intern/diff/D39719854/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85459 Approved by: https://github.com/awgu commit 100113b87747cc36a42621d0c94e8c72ddcead80 Author: Jerry Zhang Date: Thu Oct 13 17:02:33 2022 -0700 [quant][docs] Formatting fixes for fx graph mode quantization README (#86914) Summary: att Test Plan: No code changes involved Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86914 Approved by: https://github.com/vkuzo commit f6f1aefb8fc1664fec5825615e3353c68c41724b Author: PyTorch MergeBot Date: Sat Oct 15 03:25:03 2022 +0000 [vision hash update] update the pinned vision hash (#86758) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86758 Approved by: https://github.com/pytorchbot commit 46aaae98c5b95f33afd98c62d642808652594dd6 Author: XiaobingSuper Date: Fri Oct 14 05:33:06 2022 -0400 torchdynamo: add linear pointwise(binary) fusion kernel (#86583) Support binary fusion of Linear with: - add - sub - mul - div Pull Request resolved: https://github.com/pytorch/pytorch/pull/86583 Approved by: https://github.com/jgong5, https://github.com/jansel commit 5210fab64d4322438ebfd8ec9c1170d5effab0a3 Author: XiaobingSuper Date: Fri Oct 14 05:33:05 2022 -0400 torchdynamo: add convolution pointwise(binary) fusion kernel (#86582) Support binary fusion of Convolution with: - add - sub - mul - div Pull Request resolved: https://github.com/pytorch/pytorch/pull/86582 Approved by: https://github.com/jgong5, https://github.com/jansel commit 9a7a49b254086038cc16af44ae2d51bb2084ae0d Author: XiaobingSuper Date: Fri Oct 14 05:33:04 2022 -0400 torchdynamo: add convolution pointwise(unary) fusion kernel (#86581) Support unary fusion of Convolution with: - relu - sigmoid - tanh - hardswish - leaky_relu - hardtanh - gelu Pull Request resolved: https://github.com/pytorch/pytorch/pull/86581 Approved by: https://github.com/jgong5, https://github.com/jansel commit d5a7e6db38f4e77a91dd0568d2c21039c5c3032e Author: Peter Bell Date: Sat Oct 15 00:23:24 2022 +0100 ATen/native (1/6): Use per-operator headers (#75571) Differential Revision: [D40126698](https://our.internmc.facebook.com/intern/diff/D40126698) Pull Request resolved: https://github.com/pytorch/pytorch/pull/75571 Approved by: https://github.com/malfet commit 4584d06e760eeb810f4d69ce14fc927ac3d96b17 Author: edward-io Date: Sat Oct 15 00:25:23 2022 +0000 [data] add autocompletion to datapipes (#86960) In REPLs (e.g. jupyter notebook) autocomplete now works: image even with custom data pipes: image Unfortunately I wasn't able to figure out how to get autocomplete to work for non-REPLs (e.g. 
VSCode) - may need to generate fake pyi stubs, which 1) won't work for custom datapipes and 2) is a larger project to tackle :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86960 Approved by: https://github.com/NivekT commit 3924aa75b111fc5832647dd4cae87c62ed8a2863 Author: Nikita Shulga Date: Sat Oct 15 00:20:42 2022 +0000 [BE] Extend linter to detect DOS newlines (#86973) Fix DOS newlines in `onednn/decompose_silu.[cpp|h]` introduced by https://github.com/pytorch/pytorch/pull/85591 as well as one in `.github/PULL_REQUEST_TEMPLATE.md` Pull Request resolved: https://github.com/pytorch/pytorch/pull/86973 Approved by: https://github.com/huydhn, https://github.com/izaitsevfb commit b8aa1767cdca37def5d21cfa8aaf4a23e8ed3905 Author: Jerry Zhang Date: Thu Oct 13 17:02:32 2022 -0700 [quant][be] Remove unused helper functions in convert.py (#86913) Summary: att Test Plan: python test/test_quantization.py TestQuantizeFx Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86913 Approved by: https://github.com/vkuzo commit 761ca20dd8d3bfda1694aa85eac7ee11f2ff68aa Author: Jerry Zhang Date: Thu Oct 13 17:02:31 2022 -0700 [quant][be] Rename qconfig_map to node_name_to_qconfig (#86861) Summary: att, with the introduction of QConfigMapping, this name is now very confusing, so renamed it to something clearer Test Plan: python test/test_quantization.py TestQuantizeFx Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86861 Approved by: https://github.com/vkuzo commit 8f71e8de7ef33e0cc3c92d976aa0eedae92fa1aa Author: Jason Ansel Date: Fri Oct 14 11:05:28 2022 -0700 Sync changes from pytorch/torchdynamo, enable tests (#86950) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86950 Approved by: https://github.com/Chillee commit 78ef40973c1d6b97a9002be323a5f46ed83b58ee Author: Will Constable Date: Fri Oct 14 22:34:33 2022 +0000 Set -Werror=braced-scalar-init (#86911) - `vector({0})` would give you the vector(size, ...) ctor and produce an empty vector of T, along with the scalar-init warning - `vector({T(0)})` would give you the vector of a single T(0) as you might have intended, and bypasses the warning/error - the warning can easily be missed but can have serious consequences, so make it an error Pull Request resolved: https://github.com/pytorch/pytorch/pull/86911 Approved by: https://github.com/albanD commit 155b88580694a92c0f8304442d685139616e52e3 Author: maxren Date: Fri Oct 14 14:00:21 2022 -0700 [xnnpack][lite-int] preprocess (#86980) Split up original preprocess diff: This diff introduces the skeleton structure of the delegate APIs. first introducing the method compile spec error handling. For now it just outputs an empty tensor object upon execute. But just proves that delegate apis is working and a new xnnpack delegate backend has been added. Differential Revision: [D38562918](https://our.internmc.facebook.com/intern/diff/D38562918/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38562918/)! 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86980 Approved by: https://github.com/salilsdesai, https://github.com/cccclai commit 7c73b456211efe5d9f0d0a65f9a509b26d24f1aa Author: shubhambhokare1 Date: Fri Oct 14 21:58:01 2022 +0000 [onnx] Add support for autograd function inlining in ONNX_ATEN_FALLBACK mode (#85736) Solution to #85027 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85736 Approved by: https://github.com/BowenBao commit d29c8c0ffa68f11790fc2e9fd78778bb8e9bc281 Author: Catherine Lee Date: Fri Oct 14 21:44:13 2022 +0000 enable optim tests on dynamo to test flaky bot (#86976) will link the issue that disabled them if this gets approved Pull Request resolved: https://github.com/pytorch/pytorch/pull/86976 Approved by: https://github.com/albanD commit 1a7409c77199403153f1260e2281bae2f76745f6 Author: maxren Date: Fri Oct 14 10:37:42 2022 -0700 [CoreML][ios_crash] Use special throw macro when encountering CoreML API errors (#86938) Error messages from TORCH_CHECK are stripped during production builds via -DSTRIP_ERROR_MESSAGES. This diff introduces a new macro COREML_CHECK which will always preserve the error message. This macro is used when encountering errors produced by CoreML API calls so that we can heve enough context to debug. Differential Revision: [D40351013](https://our.internmc.facebook.com/intern/diff/D40351013/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86938 Approved by: https://github.com/salilsdesai commit 34c86adec49322ab6586a65b9817ef282d44d55e Author: Brian Hirsh Date: Fri Oct 14 10:01:32 2022 -0700 symintify all of derivatives.yaml (#86610) Big-bang PR to symintify **all** .sizes() calls in derivatives.yaml, which will be needed for symbolic tracing. * with the exception of `split()`, which is tougher to land because it requires internal changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86610 Approved by: https://github.com/albanD commit d7bbb61f6b0c1a120e603e6313114457c4909835 Author: Brian Hirsh Date: Fri Oct 14 09:06:22 2022 -0700 min/max support for SymInt/Floats, finish as_strided/scatter/squeeze() backward symint support (#86609) This PR shouldn't matter too much, but I figured I'd land it instead of deleting. `PySymInt.min/max` are technically broken today, and this fixes them - but it doesn't matter (yet) because nobody is calling `min()` / `max()` on symints from python (they all happen using `std::min/max` in C++, which desugar to lt / gt calls). Pull Request resolved: https://github.com/pytorch/pytorch/pull/86609 Approved by: https://github.com/albanD commit 1bb609ad47902353018948f4cd04a0aee9542e43 Author: Sean Ross-Ross Date: Wed Oct 12 14:22:47 2022 -0500 Added new test test_compare_cpu that checks if cpu and gpu results are consistent (#85011) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85011 Approved by: https://github.com/lezcano, https://github.com/mruberry commit e027740e7745bb0843d31337be3a17b805f4f712 Author: Lukas Mührke <46906556+LukasM937@users.noreply.github.com> Date: Fri Oct 14 19:59:33 2022 +0000 Chore: Add 'mps' to the docs of tensor_attributes (#86585) Since PyTorch supports 'mps' (Apple metal) devices it should be reflected in the documentation. 
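As a minimal illustration of the device string that #86585 documents, here is a sketch assuming a machine with Metal support (guarded so it falls back gracefully elsewhere):

```python
import torch

# 'mps' is an ordinary torch.device type, alongside 'cpu' and 'cuda'.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.randn(4, 4, device=device)
print(x.device)  # device(type='mps', index=0) on an Apple-silicon Mac
```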
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86585 Approved by: https://github.com/albanD commit fc3afc840784106b173c87c95b1ee96a4018bb3d Author: Ivan Yashchuk Date: Fri Oct 14 19:49:39 2022 +0000 Remove empty_like+fill from AOT Autograd graphs for nvFuser (#86908) AOT Autograd records C++ code `1 - tensor` as a sequence of empty_like, fill, and sub (see https://github.com/pytorch/pytorch/issues/86612). Both empty_like and fill are not supported yet. This PR is a workaround for enabling fusions of `silu_backward`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86908 Approved by: https://github.com/ngimel commit 56a744bf47edd1adb423593955b786a2ede8bd4f Author: Justin Chu Date: Fri Oct 14 19:44:44 2022 +0000 [ONNX] Reland: Update training state logic to support ScriptedModule (#86745) In https://github.com/pytorch/pytorch/issues/86325, it was reported that ScriptedModule do not have a training attribute and will fail export because we don't expect them as input. Also - Parameterized the test_util_funs test Thanks @borisfom for the suggestion! Fixes #86325 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86745 Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao commit 527ebedbff717074189a4c499ad5a62712442300 Author: Andrew M. James Date: Fri Oct 14 09:25:46 2022 -0500 Sparse support for ReLU (#86749) ReLU support for all sparse layouts, including backward. Fixes #85208 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86749 Approved by: https://github.com/cpuhrsch, https://github.com/nikitaved commit ef045695e0b622968d7c15f86a60ccc4f3b0a1ed Author: Sherlock Huang Date: Fri Oct 14 15:51:26 2022 +0000 Fix decomp for huber_loss_backward (#86955) Fixes https://github.com/pytorch/pytorch/issues/86846 aten.huber_loss_backward calls aten.huber_loss_backward.out in its CompositeExplicitAutograd kernel. The decomp was mistaken registered for both aten.huber_loss_backward.default and aten.huber_loss_backward.out. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86955 Approved by: https://github.com/Chillee commit 7da018b2f80c04038f797dbb76168416de8e2529 Author: Richard Zou Date: Fri Oct 14 11:39:45 2022 -0700 [functorch] fix fbcode tests (#86936) Differential Revision: [D40358418](https://our.internmc.facebook.com/intern/diff/D40358418) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86936 Approved by: https://github.com/samdow commit f17b3e9b7adaa849b2065fdcb5efb1b444f4725a Author: Peter Bell Date: Fri Oct 14 16:27:15 2022 +0100 Vectorize tensor lerp kernel (#84845) Fixes #86964 In a simple timeit benchmark I see 1.7x speedup for complex64, from 6.7 us to 3.9 us; and a 3.2x speedup for float32, from 6.2 us to 1.9 us. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84845 Approved by: https://github.com/lezcano, https://github.com/malfet commit 13cff2ee8ea1d7aea2ad201cbd77ebe2b9a29d25 Author: Nikita Shulga Date: Fri Oct 14 17:35:18 2022 +0000 [MPS] Copy from CPU always add storageOffset (#86958) Because why wouldn't it? 
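A repro-style sketch of the scenario #86958 fixes: copying a CPU view with a non-zero storage offset to the 'mps' device should preserve the viewed values (assumes an MPS-capable machine; the slice is only illustrative):

```python
import torch

base = torch.arange(10, dtype=torch.float32)
view = base[4:]                    # a view with storage_offset() == 4
assert view.storage_offset() == 4

if torch.backends.mps.is_available():
    on_mps = view.to("mps")
    # Before the fix, the CPU->MPS copy could start from the beginning of the
    # underlying buffer, ignoring the view's storage offset.
    assert torch.equal(on_mps.cpu(), view)
```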
Fixes https://github.com/pytorch/pytorch/issues/86052 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86958 Approved by: https://github.com/kulinseth commit 1ece1ab6c2c5488b8475c70681aebddbdb9579ba Author: Catherine Lee Date: Fri Oct 14 17:31:31 2022 +0000 [ci] print rerun stacktraces for pytest (#86831) example: https://github.com/pytorch/pytorch/actions/runs/3238428826/jobs/5306808276 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86831 Approved by: https://github.com/huydhn commit d393a463ff5140b9257c0650137e03db0a78de58 Author: Huy Do Date: Fri Oct 14 17:26:49 2022 +0000 Fix functorch test selection logic (#86944) I realize that `run_test.py` doesn't take into account functorch test selection logic at the moment, for example `python test/run_test.py --functorch -i functorch/test_ops --verbose` stills run all functorch tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/86944 Approved by: https://github.com/clee2000, https://github.com/malfet commit bbd7b38d5580c44ffb4404d431e07bc2316e59d5 Author: PyTorch MergeBot Date: Fri Oct 14 17:22:55 2022 +0000 Revert "symintify nll loss fns (#86915)" This reverts commit 0ece7c86d829e2515e8b7d5df13cf0279b70c0e9. Reverted https://github.com/pytorch/pytorch/pull/86915 on behalf of https://github.com/anjali411 due to test_autocast_nn_fp32 fails commit 0ece7c86d829e2515e8b7d5df13cf0279b70c0e9 Author: anjali411 Date: Fri Oct 14 14:21:10 2022 +0000 symintify nll loss fns (#86915) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86915 Approved by: https://github.com/albanD commit a86278b08c12ba8db203bee22c56958a3c245b3e Author: Chien-Chin Huang Date: Thu Oct 13 09:42:14 2022 -0700 [FSDP] Consolidate FSDP state_dict offload_to_cpu settings (#86211) Consolidate FSDP state_dict offload_to_cpu settings. All state_dict_types now have offload_to_cpu options. Differential Revision: [D40065969](https://our.internmc.facebook.com/intern/diff/D40065969/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86211 Approved by: https://github.com/rohan-varma commit c9a8d309bda59164554b38deff18ac8bf824af34 Author: Catherine Lee Date: Fri Oct 14 16:04:04 2022 +0000 add super setup to test to enable disabling in test_dims.py (#86953) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/86953 Approved by: https://github.com/huydhn commit 8eb579e362581d2ab2c440b4aad8b39fde4a9920 Author: PyTorch MergeBot Date: Fri Oct 14 14:56:59 2022 +0000 Revert "[Profiler] Move legacy profiler out of `torch/csrc/autograd` (#85512)" This reverts commit 157a3d2a7cd25779258f3e3dcef14633f1930103. Reverted https://github.com/pytorch/pytorch/pull/85512 on behalf of https://github.com/DanilBaibak due to Due to files were deleted, the internal build failed. Please re-submit via codev. commit 4460e40db4300b2b0d5dbfaedee0d82a19c444b9 Author: Nikita Karetnikov Date: Thu Oct 13 20:50:20 2022 +0200 [primTorch] Add a ref for `addcmul` (#86731) Based on: https://github.com/pytorch/pytorch/pull/79827 https://github.com/pytorch/pytorch/pull/72949 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86731 Approved by: https://github.com/lezcano, https://github.com/mruberry commit 746500d58d90bb8d7833596d2c38e0f8142859d8 Author: PyTorch MergeBot Date: Fri Oct 14 14:25:51 2022 +0000 Revert "[cuDNN] Enable cuDNN Frontend v8 API by Default (#84948)" This reverts commit 427e0a6b4ebc691f1fa98662d04d5c431a75107f. 
Reverted https://github.com/pytorch/pytorch/pull/84948 on behalf of https://github.com/malfet due to Broke SM86 sanity commit 2cfc4cb36748a250f5252f1844f570c0cb806b8f Author: Ivan Yashchuk Date: Fri Oct 14 12:15:28 2022 +0000 Add optional recomputable_ops argument for the min cut partitioner (#86686) `min_cut_rematerialization_partition` has a default set of hard-coded operations that are allowed to be recomputed in the backward pass. This PR adds customization ability to this function allowing users to control the behavior by passing `recomputable_ops` instead of relying on the default setting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86686 Approved by: https://github.com/Chillee commit fd8068478469077753f34873c50656d3a44e01e1 Author: Ivan Yashchuk Date: Fri Oct 14 12:08:02 2022 +0000 Add nvFuser support for torch.Tensor.view (#84634) This is an alternative to https://github.com/pytorch/pytorch/pull/83739. While PrimTorch has `view` as a reference, we would like to use nvFuser's implementation for `view` for now. Later we might transition to PrimTorch's `torch._refs.view`. See `test_nvprims_view` for examples of things that are now sent to nvFuser. Note that nvFuser's `view` is a copy-like operation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84634 Approved by: https://github.com/kevinstephano, https://github.com/mruberry commit b48deedb77003261fb0331048ab00e19fba901ee Author: Alvaro Gaona Date: Fri Oct 14 11:33:32 2022 +0000 Set up new module torch.signal.windows (#85599) Resolves #85366 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85599 Approved by: https://github.com/lezcano, https://github.com/mruberry commit 056cfb0464bd137b0c4848a02ad0b6f283c25320 Author: PyTorch MergeBot Date: Fri Oct 14 05:40:18 2022 +0000 Revert "[ONNX] Update training state logic to support ScriptedModule (#86745)" This reverts commit 960b98128e475b15b66119f325232039799852cd. Reverted https://github.com/pytorch/pytorch/pull/86745 on behalf of https://github.com/janeyx99 due to https://hud.pytorch.org/pytorch/pytorch/commit/960b98128e475b15b66119f325232039799852cd broke onnx tests on trunk commit 157a3d2a7cd25779258f3e3dcef14633f1930103 Author: Taylor Robie Date: Thu Oct 13 07:49:03 2022 -0700 [Profiler] Move legacy profiler out of `torch/csrc/autograd` (#85512) The legacy profiler is an eyesore in the autograd folder. At this point the implementation is almost completely decoupled from the rest of profiler, and it is in maintaince mode pending deprecation. As a result, I'm moving it to `torch/csrc/profiler/standalone`. Unfortuantely BC requires that the symbols remain in `torch::autograd::profiler`, so I've put some basic forwarding logic in `torch/csrc/autograd/profiler.h`. One strange bit is that `profiler_legacy.h` forward declares `torch::autograd::Node`, but doesn't seem to do anything with it. I think we can delete it, but I want to test to make sure. (Note: this should not land until https://github.com/pytorch/torchrec/pull/595 is landed.) Differential Revision: [D39108648](https://our.internmc.facebook.com/intern/diff/D39108648/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85512 Approved by: https://github.com/aaronenyeshi commit 35fb0077495247bcda218a136c3a70f3022de7d2 Author: Taylor Robie Date: Thu Oct 13 07:49:00 2022 -0700 [Profiler][Minor] Separate standalone profilers from the main PyTorch profiler. (#85511) There are a number of instrumentation utils which have been added to the profiler toolkit. 
They are generally small and self contained, often wrapping vendor APIs. (NVTX, ITT) They don't really interact with the much more expansive machinery of the PyTorch profiler beyond registration / unregistration, minor util sharing, and reusing the profiler base class. Just as in the case of stubs, it makes sense to group them in a dedicated subfolder. Differential Revision: [D39108649](https://our.internmc.facebook.com/intern/diff/D39108649/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39108649/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/85511 Approved by: https://github.com/albanD commit b8f14b7877cf1107f4572fbacf5aabba83aec641 Author: Taylor Robie Date: Thu Oct 13 07:48:58 2022 -0700 [Profiler][Minor] Group and consolidate stub APIs (#85510) There is a concept in profiler of a stub that wraps a profiling API. It was introduced for CUDA profiling before Kineto, and ITT has adopted it to call into VTune APIs. However for the most part we don't really interact with them when developing the PyTorch profiler. Thus it makes sense to unify the fallback registration mechanism and create a subfolder to free up real estate in the top level `torch/csrc/profiler` directory. Differential Revision: [D39108647](https://our.internmc.facebook.com/intern/diff/D39108647/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85510 Approved by: https://github.com/aaronenyeshi commit bc4ca4c2c4085e1ea2c212718d4470d057ec7c3f Author: Chien-Chin Huang Date: Thu Oct 13 10:56:26 2022 -0700 [FSDP] Fix load_sharded_state_dict FQN mismatches for shared parameters (#86524) `_sharded_pre_load_state_dict_hook()` should calls `_param_fqns()` to ensure shared parameters names are also included. Differential Revision: [D40201304](https://our.internmc.facebook.com/intern/diff/D40201304/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86524 Approved by: https://github.com/rohan-varma commit 960b98128e475b15b66119f325232039799852cd Author: Justin Chu Date: Fri Oct 14 01:31:40 2022 +0000 [ONNX] Update training state logic to support ScriptedModule (#86745) In https://github.com/pytorch/pytorch/issues/86325, it was reported that ScriptedModule do not have a training attribute and will fail export because we don't expect them as input. Also - Parameterized the test_util_funs test Thanks @borisfom for the suggestion! Fixes #86325 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86745 Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao commit f451e824f39516f503c2bdfd785d254b447b9557 Author: PyTorch MergeBot Date: Fri Oct 14 01:26:45 2022 +0000 Revert " C10D extension to enable per-thread PG (#86348)" This reverts commit 97abc21f2bda38e73de2a86da7f43c8126930681. Reverted https://github.com/pytorch/pytorch/pull/86348 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks macos tests https://hud.pytorch.org/pytorch/pytorch/commit/97abc21f2bda38e73de2a86da7f43c8126930681 commit c16c4a37abca2f1e2bb2918307e19bfa40e9500f Author: Huy Do Date: Fri Oct 14 00:47:16 2022 +0000 Remove functorch copy of conftest.py (#86927) Now that its tests have been moved to PyTorch test. 
This was a left over from https://github.com/pytorch/pytorch/pull/86623 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86927 Approved by: https://github.com/clee2000 commit b3b9786fdd1dbbadb4b75190646b5d0bb5c89771 Author: Horace He Date: Thu Oct 13 20:19:16 2022 +0000 Unified symbolic shape variables between AOTAutograd and Inductor (#86659) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86659 Approved by: https://github.com/wconstab commit c7c09722ad5ee25c5891f863e5bbd1575ad77970 Author: Jason Ansel Date: Thu Oct 13 23:18:06 2022 +0000 Move TorchDynamo into PyTorch core (#86461) Context: https://github.com/pytorch/torchdynamo/issues/1588 This PR moves [TorchDynamo](https://github.com/pytorch/torchdynamo) and TorchInductor into PyTorch core. - `torchdynamo` becomes `torch._dynamo` - `torchinductor` becomes `torch._inductor` This PR was generated by running `copy_to_core.sh` in https://github.com/pytorch/torchdynamo/pull/1538 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86461 Approved by: https://github.com/voznesenskym commit 97abc21f2bda38e73de2a86da7f43c8126930681 Author: Rodrigo Kumpera Date: Thu Oct 13 22:23:28 2022 +0000 C10D extension to enable per-thread PG (#86348) Move a bunch of globals to instance methods and replace all use to them. We move all PG related globals under World and use a singleton instance under _world. This creates an undocumented extension point to inject full control of how how c10d state behaves. One simple hack is to change _world to an implementation that uses a threadlocal and enable per-thread PGs. It almost get DDP working and the PG is missing an implementation of all_reduce. This enables notebook usage of PTD, which is a big deal for learning it: https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68 This change ensures BC by keeping the global variables around and have the default _World wrap it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348 Approved by: https://github.com/rohan-varma commit 66979fbfaa2af227a6834157fa6f532979b2d23b Author: Peter Bell Date: Thu Oct 13 17:42:11 2022 +0000 Improve complex lerp performance (#84844) The complex lerp kernel uses `std::abs(z) < 0.5` which involves computing a sqrt. Instead compare the square against 0.25 has much lower latency and so performs much better overall. In a simple timeit benchmark I see more than 10x speedup on CPU for a 4096 element complex lerp, from 84 us to 6.7 us. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84844 Approved by: https://github.com/ngimel commit ae45dab57e22e3d04516e7dd81ef8dbefd51bfe3 Author: Catherine Lee Date: Thu Oct 13 21:27:52 2022 +0000 disable failing circleci test jobs (#86940) should revert later when fixed Pull Request resolved: https://github.com/pytorch/pytorch/pull/86940 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi commit 974ad8fa6cc63b89234beb5ebff54c2d42711932 Author: sanchitintel Date: Thu Oct 13 20:36:59 2022 +0000 Add BFloat16 dtype support for oneDNN Graph JIT fuser (#85591) Intel Xeon Cooper Lake platform & beyond support the `AVX512_BF16` ISA, which is essentially native BFloat16 support. oneDNN Graph delivers high inference performance with BFloat16 on such machines. 
While oneDNN Graph can still be used with BFloat16 on older machines that lack the `avx512_bf16` ISA but support the `avx512bw`, `avx512vl` & `avx512dq` ISAs, the BF16 performance on these older machines will be significantly poorer (probably even poorer than Float32), as they lack native BF16 support. Currently, [AMP support for eager mode & JIT mode is divergent in PyTorch](https://github.com/pytorch/pytorch/issues/75956). So, for using oneDNN Graph with BFloat16, eager-mode AMP should be leveraged by turning off AMP for JIT mode, using `torch._C._jit_set_autocast_mode(False)` in python code, so as to avoid conflicts. Please use the following environment variable to view JIT logs - `PYTORCH_JIT_LOG_LEVEL=">>graph_helper:>>graph_fuser:>>kernel:>>interface"`

1. This PR does NOT change the `oneDNN` commit or the `ideep` files. While the `ideep` commit is being updated, only files pertaining to oneDNN Graph are being updated. oneDNN Graph is being upgraded to version 0.5.2 (alpha patch release 2). To put things into perspective, `ideep` is a git submodule of PyTorch. `oneDNN Graph` is a git submodule of `ideep` (`ideep/mkl-dnn`), and oneDNN is a git submodule of oneDNN Graph (`ideep/mkl-dnn/third_party/oneDNN`).
2. Unit-tests are being updated. We now use the [existing dtypes decorator](https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_device_type.py#L123-L131).
3. Suggestions made by @eellison in the [FP32 PR](https://github.com/pytorch/pytorch/pull/68111#pullrequestreview-896719477) are being incorporated/addressed:

| Action-item | Status |
| :--- | ---: |
| checkInputCompatibility follow up | Fixed |
| the mayConvertScalarInputToTensor logic we can consider | Added type promotion code |
| fix up fixConvOptionalBias | The current approach seems correct |
| Use opinfo tests | using dtypes decorator. Will use `OpInfo` in a subsequent PR, if that'd be possible. Should we create a list of ops from opDB that are supported by oneDNN Graph, and add it to `common_methods_invocations.py`? |
| inferDevice torch_check call | not necessary now, perhaps, as only CPU is supported, for now? We'd add it by the beta release of oneDNN Graph, though, so that by then, users might be able to use other fusers with oneDNN Graph (NNC/TensorExpr are already compatible with the oneDNN Graph fuser). We can still add it, if you'd insist. |
| not checking shapes of input mkldnn tensor to llga guard | Those checks should not be present because oneDNN Graph may use blocked or channels-last layout, so those strides would be different. They're only skipped if an LLGA subgraph's output is input to another LLGA subgraph, which enables LLGA to choose an optimal layout between them. |
| fix test failures with respect to unsupported inputs | We'll address them with the upcoming release of oneDNN Graph beta version |

4. More PyTorch ops are being mapped to oneDNN Graph:

```python
example_input = torch.rand(1, 3, 224, 224)
torch.jit.enable_onednn_fusion(True)
torch._C._jit_set_autocast_mode(False)
with torch.no_grad(), torch.cpu.amp.autocast():
    model = torch.jit.trace(model, (example_input))
    model = torch.jit.freeze(model)
    model(example_input)
    model(example_input)
    model(example_input)
```

**URL:** https://github.com/sanchitintel/benchmark/tree/onednn_graph_benchmark (instructions present at URL). 
**Batch-size(s):** TorchBench-default for each model
**Baseline:** PyTorch JIT OFI FP32
**Machine:** Intel(R) Xeon(R) Platinum 8371HC (Cooper Lake)
**Sockets used**: 1
**Number of cores on one socket**: 26
Intel OpenMP & tcmalloc were preloaded

| name | latency of PyTorch JIT OFI FP32 (s) | Latency of oneDNN Graph BF16 (s) | % change |
| :--- | ---: | ---: | ---: |
| test_eval[alexnet-cpu-jit] | 1.063851 | 0.509820 | -52.1% |
| test_eval[mnasnet1_0-cpu-jit] | 0.218435 | 0.107100 | -51.0% |
| test_eval[mobilenet_v2-cpu-jit] | 0.114467 | 0.058359 | -49.0% |
| test_eval[mobilenet_v3_large-cpu-jit] | 0.233873 | 0.117614 | -49.7% |
| test_eval[resnet18-cpu-jit] | 0.160584 | 0.075854 | -52.8% |
| test_eval[resnet50-cpu-jit] | 1.652846 | 0.713373 | -56.8% |
| test_eval[resnext50_32x4d-cpu-jit] | 0.471174 | 0.209431 | -55.6% |
| test_eval[shufflenet_v2_x1_0-cpu-jit] | 0.310306 | 0.167090 | -46.2% |
| test_eval[squeezenet1_1-cpu-jit] | 0.161247 | 0.045684 | -71.7% |
| test_eval[timm_efficientnet-cpu-jit] | 1.643772 | 0.800099 | -51.3% |
| test_eval[timm_regnet-cpu-jit] | 5.732272 | 2.333417 | -59.3% |
| test_eval[timm_resnest-cpu-jit] | 1.366464 | 0.715252 | -47.7% |
| test_eval[timm_vision_transformer-cpu-jit] | 0.508521 | 0.271598 | -46.6% |
| test_eval[timm_vovnet-cpu-jit] | 2.756692 | 1.125033 | -59.2% |
| test_eval[vgg16-cpu-jit] | 0.711533 | 0.312344 | -56.1% |

| name | latency of PyTorch JIT OFI FP32 (s) | Latency of oneDNN Graph BF16 (s) | % change |
| :--- | ---: | ---: | ---: |
| test_eval[alexnet-cpu-jit] | 0.062871 | 0.034198 | -45.6% |
| test_eval[mnasnet1_0-cpu-jit] | 0.022490 | 0.008172 | -63.7% |
| test_eval[mobilenet_v2-cpu-jit] | 0.012730 | 0.005866 | -53.9% |
| test_eval[mobilenet_v3_large-cpu-jit] | 0.025948 | 0.010346 | -60.1% |
| test_eval[resnet18-cpu-jit] | 0.011194 | 0.005726 | -48.9% |
| test_eval[resnet50-cpu-jit] | 0.124662 | 0.045599 | -63.4% |
| test_eval[resnext50_32x4d-cpu-jit] | 0.034737 | 0.015214 | -56.2% |
| test_eval[shufflenet_v2_x1_0-cpu-jit] | 0.028820 | 0.012517 | -56.6% |
| test_eval[squeezenet1_1-cpu-jit] | 0.012557 | 0.003876 | -69.1% |
| test_eval[timm_efficientnet-cpu-jit] | 0.203177 | 0.051879 | -74.5% |
| test_eval[timm_regnet-cpu-jit] | 0.452050 | 0.151113 | -66.6% |
| test_eval[timm_resnest-cpu-jit] | 0.117072 | 0.052848 | -54.9% |
| test_eval[timm_vision_transformer-cpu-jit] | 0.046048 | 0.023275 | -49.5% |
| test_eval[timm_vovnet-cpu-jit] | 0.213187 | 0.077482 | -63.7% |
| test_eval[vgg16-cpu-jit] | 0.044726 | 0.021998 | -50.8% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85591 Approved by: https://github.com/jgong5, https://github.com/frank-wei, https://github.com/chunyuan-w commit 14dd5db2f50ceb8fb3e7ab565e348e2eb616791a Author: Rodrigo Kumpera Date: Thu Oct 13 20:28:44 2022 +0000 [fsdp] Fix test for 2d parallel integration to trigger the load hooks. (#86272) nit: replaced empty array bool test with explicit test for its length. 
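To make the style nit for #86272 concrete, a tiny sketch (the variable name is made up):

```python
loaded_params = []  # hypothetical list populated by the load hooks under test

# Truthiness is ambiguous for array-like objects (bool() on a multi-element
# tensor or ndarray raises), so prefer an explicit length check:
if len(loaded_params) == 0:
    raise AssertionError("expected the load hooks to populate parameters")
```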
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86272 Approved by: https://github.com/awgu commit 18f58e2df1f5997c93f213c94a60eb72a63a05e4 Author: Jerry Zhang Date: Thu Oct 13 10:13:11 2022 -0700 [quant][be] Rename node_name_to_target_dtype to node_name_to_target_dtype_info (#86860) Summary: att, renaming to improve readability Test Plan: no functionality changes Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86860 Approved by: https://github.com/jcaip commit 158a071034a45ead778107beceedd6b696ff5234 Author: inisis Date: Thu Oct 13 20:12:52 2022 +0000 add _freeze for embedding op (#86769) Fixes #86663 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86769 Approved by: https://github.com/albanD commit e737f2d81c8f83e5020d3383b320e024bb908a47 Author: Nikolay Korovaiko Date: Thu Oct 13 19:35:31 2022 +0000 set the correct size of aten tensor in presence of mkldnn padding (#86767) This fixes https://github.com/pytorch/pytorch/issues/86556 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86767 Approved by: https://github.com/eellison commit 860ad04990addc6f6ba130c7d252cb23689ddceb Author: BowenBao Date: Mon Oct 10 16:47:18 2022 -0700 [ONNX] Fix FindCommonAncestor in function_extraction (#86650) One line fix to get absolute value of `diff` before looping over. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86650 Approved by: https://github.com/AllenTiTaiWang, https://github.com/abock commit af1dcef79c1a91ae03faacbdeb2f9127013f7889 Author: BowenBao Date: Wed Oct 12 14:59:02 2022 -0700 [ONNX] Fix triu/tril export with diagonal input (#86843) Investigation with @thiagocrepaldi discovered this bug with triu/tril export when `diagonal` is passed in as input. Previously assumption was made that `diagonal` is always provided a constant value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86843 Approved by: https://github.com/thiagocrepaldi, https://github.com/abock commit dbdfb8dd8b7e2ffe427e4acd045249b89236af9b Author: Ivan Yashchuk Date: Thu Oct 13 18:08:58 2022 +0000 Skip test_nvfuser_extremal_values for native_batch_norm (#86897) New tests were introduced with https://github.com/pytorch/pytorch/commit/68a6113248ac25841b524d59f9dc0f298b389ba2. This PR explicitly skips the problematic tests. Fixes https://github.com/pytorch/pytorch/issues/86176 Fixes https://github.com/pytorch/pytorch/issues/86177 Fixes https://github.com/pytorch/pytorch/issues/86178 Fixes https://github.com/pytorch/pytorch/issues/86179 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86897 Approved by: https://github.com/soulitzer commit 2ce6150d23928773c35274aec369eb0a5ecd6fa4 Author: BowenBao Date: Wed Oct 12 10:35:22 2022 -0700 [ONNX] Fix scalar_type_analysis metadata for copied constant (#86716) Fix the source of metadata for copied constant. Since the constant is being implicitly casted, it makes more sense to assign code location and etc with the user node. This issue was discovered in https://github.com/pytorch/pytorch/issues/86627. This PR also adds unit test coverage for scope information of nodes when they are altered by CSE and related passes. 
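A sketch of the one-line fix described for #86650 above, with made-up names; the point is simply that the scope-depth difference must be non-negative before it is used as a loop bound:

```python
depth_a, depth_b = 2, 5        # hypothetical scope depths

diff = abs(depth_a - depth_b)  # previously the signed difference was used,
for _ in range(diff):          # so this loop never ran for one ordering
    pass                       # walk the deeper scope up one level
```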
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86716 Approved by: https://github.com/thiagocrepaldi, https://github.com/malfet commit 4839f73f329b38819e6f69a8662d61dc36558e52 Author: Sheil Kumar Date: Thu Oct 13 17:54:28 2022 +0000 Fix incorrect tensor storage check (#86845) Fix incorrect tensor storage check This change contains an incorrect check for storage: https://github.com/pytorch/pytorch/pull/86557 **self.storage is not None** should have been: **not torch._C._has_storage(self)** These fixes were run through the DirectML test suite, and confirm the check is now working correctly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86845 Approved by: https://github.com/martinb35, https://github.com/bdhirsh commit afc996386552c13e3910164507408295ab77689a Author: Frankie Robertson Date: Thu Oct 13 17:42:28 2022 +0000 Fix path to nested_tensor in example (#86891) This appears to be a typo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86891 Approved by: https://github.com/H-Huang commit 54ee95c8ecaf19ef6464daf0e5d967c781011101 Author: Kshiteej K Date: Thu Oct 13 17:36:37 2022 +0000 [nn] module: full_backward_pre_hook (#86700) Fixes https://github.com/pytorch/pytorch/issues/42824 * [x] Test * [x] Doc Pull Request resolved: https://github.com/pytorch/pytorch/pull/86700 Approved by: https://github.com/soulitzer commit 7dcfbedce071c62d0ac40ca86c844b5cd4b4d9ef Author: mikael10j Date: Thu Oct 13 17:31:33 2022 +0000 Fix LinearLR scheduler start_factor (#86695) Fixes #86454 The `start_factor` must be comprised in ]0;1] instead of [0;1] to avoid division by 0. This PR changes the lower limit checking of the parameter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86695 Approved by: https://github.com/albanD commit 6ee94b572ac248b11a90558b7358c242ad9b56fa Author: samdow Date: Thu Oct 13 17:26:54 2022 +0000 [functorch] Add shard to run functorch tests with asan (#82164) This adds asan testing for functorch. It was running really long (>4hrs) with test ops, so we decided that those tests are probably redundant and skipped those. This brings this test's time down to ~30 min Pull Request resolved: https://github.com/pytorch/pytorch/pull/82164 Approved by: https://github.com/zou3519, https://github.com/malfet, https://github.com/huydhn commit 427e0a6b4ebc691f1fa98662d04d5c431a75107f Author: Eddie Yan Date: Thu Oct 13 17:26:36 2022 +0000 [cuDNN] Enable cuDNN Frontend v8 API by Default (#84948) Opening this PR for testing for now to check CI status. 🤞 CC @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/84948 Approved by: https://github.com/ngimel commit b0d80f4355ac75a19400c3bd278db104841ffbba Author: BowenBao Date: Mon Oct 10 17:23:55 2022 -0700 [ONNX] Clarify phrasing of skipScriptTest/skipTraceTest decorators (#86216) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86216 Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/abock commit 0ee09996086db18d6f449d1a6743dd6f33d94153 Author: BowenBao Date: Tue Oct 11 00:15:59 2022 +0000 [ONNX] Renable assert diagnostic test (#85999) Fix to properly clear 'background_context' of export diagnostic 'engine' in `clear`. 
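For the module-level backward pre-hook added in #86700 above, a usage sketch, assuming the API landed as `register_full_backward_pre_hook`, mirroring the existing `register_full_backward_hook`:

```python
import torch

def log_grad_output(module, grad_output):
    # Runs before gradients w.r.t. the module's inputs are computed;
    # returning None leaves grad_output unchanged.
    print(f"{module.__class__.__name__}: grad_output norm = {grad_output[0].norm():.4f}")

linear = torch.nn.Linear(4, 2)
handle = linear.register_full_backward_pre_hook(log_grad_output)

linear(torch.randn(3, 4)).sum().backward()
handle.remove()
```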
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85999 Approved by: https://github.com/abock commit cff333bdb55b98d6c2464db684cf0f1a0f769987 Author: Tugsbayasgalan Manlaibaatar Date: Wed Oct 12 17:24:38 2022 -0700 Enable max.unary_out (#86855) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86855 Approved by: https://github.com/jerryzh168, https://github.com/bdhirsh commit 25811663af2f7ddf6623b28807697268eb2167ab Author: Colin Taylor Date: Thu Oct 13 16:48:24 2022 +0000 [FSDP] restricts meta model check to non ignored modules in FSDP (#86766) Summary: as title Test Plan: see test plan D40287799 Differential Revision: D40287890 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86766 Approved by: https://github.com/awgu commit ab6955067875c9a84c98de0e76a53ea46502a89c Author: Mikayla Gawarecki Date: Wed Oct 12 22:31:13 2022 +0000 Add nested squeeze.dim and unsqueeze (#86813) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86813 Approved by: https://github.com/drisspg commit e531cf7b2e55a6aa0eee711b260b3bb8cd56067e Author: HDCharles Date: Wed Oct 12 20:48:36 2022 -0700 [ao] fixing public v private for fx.backend_config_utils.py (#86037) Summary: just added a missing function to __all__ Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86037 Approved by: https://github.com/jerryzh168 commit d169f950da4e175db9fe0444d55a0f49f8ec9fcc Author: PyTorch MergeBot Date: Thu Oct 13 15:28:09 2022 +0000 Revert "Use CUTLASS GEMM for NT bmm [OSS-only] (#85894)" This reverts commit ef58a132f223d5abf2bd3f8bee380aca6c29d17f. Reverted https://github.com/pytorch/pytorch/pull/85894 on behalf of https://github.com/DanilBaibak due to Break internal build commit b97ae59e29ff78829632bd4ae24edd5ecc9cf5ea Author: Will Constable Date: Thu Oct 13 15:10:46 2022 +0000 Change legacy wrap_dim to work with symint == (#86842) - previously, sizes == vector({0}) failed to hit SymInt::operator==, causing a the loop to bail out too early and make an invalid call to downstream maybe_wrap_dim helper Pull Request resolved: https://github.com/pytorch/pytorch/pull/86842 Approved by: https://github.com/Chillee, https://github.com/malfet, https://github.com/albanD commit 3d9fd060f47fa623d241f4a8c2da6ea7ab6dfb72 Author: Richard Zou Date: Wed Oct 12 13:00:40 2022 -0700 [functorch] Add more details to the functorch install page (#86823) Added some details about: - `pip uninstall functorch` being helpful if there are problems - `pip install functorch` still working for BC reasons. 
Test Plan: - wait for docs preview Pull Request resolved: https://github.com/pytorch/pytorch/pull/86823 Approved by: https://github.com/samdow commit cbc01c4344238efb40151b0968536296d0f24331 Author: Peter Bell Date: Wed Oct 12 23:25:43 2022 +0100 OpInfo: Sample input cleanup (2/n) (#86379) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86379 Approved by: https://github.com/mruberry commit 2efc56d9d7faf21f5c90ea0523a7fa8ea76e1b1b Author: Peter Bell Date: Wed Oct 12 23:25:43 2022 +0100 OpInfo: Sample input cleanup (1/n) (#86231) This rewrites various sample and error input functions to: - use the convention of `make_arg = functools.partial(make_tensor, ...)` - use the new natural syntax for `SampleInput` construction - yield instead of returning a lists, to reduce memory consumption Pull Request resolved: https://github.com/pytorch/pytorch/pull/86231 Approved by: https://github.com/mruberry commit 45274c56a4547d9e3562ee40b0c515622ff80745 Author: BowenBao Date: Mon Oct 10 17:23:55 2022 -0700 [ONNX] Partially re-enable RoiAlign and RoiPool unit tests (#86169) This PR depends on https://github.com/pytorch/vision/pull/6685 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86169 Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/abock commit e17732b234e05eccad7e7e2d7fbd6c26f9bdca87 Author: Brian Hirsh Date: Wed Oct 12 13:55:14 2022 -0700 [test] add cross-ref tests for python meta kernels (#86228) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86228 Approved by: https://github.com/albanD commit 0feccda7d74a23509e1b1edd0c5c76d5f67fa813 Author: Brian Hirsh Date: Wed Oct 12 13:55:13 2022 -0700 fix aliasing bug in pixel shuffle/unshuffle (#86608) Fixes https://github.com/pytorch/pytorch/issues/82235 cc @albanD - `at::pixel_shuffle` and `at::pixel_unshuffle` advertise as being non-aliasing, but they have a C++ decomposition that internally uses reshape(), which means that it might return an alias. I happened to notice this because a bunch of tests in `test/test_ops.py` failed when I ran locally with a `DEBUG=1` build. (P.S.: when are we finally gonna get a debug build test in CI? 😃) I fixed by adding an extra clone, which... is going to be an unnecessary perf hit in the case where the `reshape()` already properly cloned the input. My hope is that this is fine, because this only impacts the composite kernel- we already have a "fast" CPU kernel that does the right thing. Is `pixel_shuffle/unshuffle` commonly used with cuda? Maybe we should just add a fast cuda kernel for it if that's the case. Alternatively, it seems like it would be nice if `reshape()` accepted an optional argument to unconditionally return a copy. That seems like a rabbit hole that isn't worth going down for now though - I remember a discussion a while ago about making `reshape()` copy-on-write Pull Request resolved: https://github.com/pytorch/pytorch/pull/86608 Approved by: https://github.com/albanD commit 337605054359a63083edcc7dcd8d887ce32947ed Author: Brian Hirsh Date: Wed Oct 12 13:55:13 2022 -0700 fix type promotion for group_norm composite C++ kernel (#86607) python decomp for `native_group_norm` is correct in more cases than the C++ composite. Updating the tests to fail properly in this case was more annoying than just fixing the C++ decomp, so I fixed it here. When the input tensor had a dtype with less precision than float32, the C++ decomp would unconditionally set the mean/variance to float32, which was wrong. 
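To make the conventions from #86231 above concrete, a hedged sketch of a sample-input function written in that style (the op and shapes are made up; `SampleInput` and `make_tensor` are PyTorch's internal test helpers):

```python
import functools

from torch.testing import make_tensor
# Internal helper; this was its import path in 2022-era PyTorch.
from torch.testing._internal.common_methods_invocations import SampleInput

def sample_inputs_myop(op_info, device, dtype, requires_grad, **kwargs):
    make_arg = functools.partial(
        make_tensor, device=device, dtype=dtype, requires_grad=requires_grad
    )
    # Yield samples lazily instead of returning a list, using the "natural"
    # SampleInput construction.
    yield SampleInput(make_arg(4, 4))
    yield SampleInput(make_arg(4, 4), make_arg(4))  # hypothetical second operand
```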
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86607 Approved by: https://github.com/albanD commit 6907db3f9578fc8cc477c175d982c6dcac69332d Author: Brian Hirsh Date: Wed Oct 12 13:55:13 2022 -0700 fix aliasing for primtorch view meta kernels (#86285) Fixes https://github.com/pytorch/pytorch/issues/86284 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86285 Approved by: https://github.com/albanD, https://github.com/mruberry commit 77e68b16cc1d320852742274bb8a15d1aa7f4915 Author: Michael Andreas Dagitses Date: Thu Oct 13 06:14:21 2022 -0700 suggest rebasing through @pytorchbot if PR is stale (#86898) Summary: Test Plan: Testing on GitHub with `stale_pr_days` set to zero. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86898 Approved by: https://github.com/malfet commit 8fffb79771f367e767cc85c31e5e0daed9f6eb7c Author: Richard Zou Date: Wed Oct 12 11:38:56 2022 -0700 Add vmap support for slogdet; fix regression from functorch 0.2.1 (#86815) This PR adds vmap support for slogdet -- slogdet just decomposes into linalg.slogdet. This fixes a regression from functorch 0.2.1 (slogdet had a batching rule then, and doesn't anymore). We didn't catch the regression because it seems like slogdet doesn't have an OpInfo (I'm not sure if it had one before). Test Plan: - new one-off test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86815 Approved by: https://github.com/samdow commit 77d94ac5ab0c15bdfb2dfe6df6ab8ad87f67edef Author: Syed Tousif Ahmed Date: Thu Oct 13 14:03:01 2022 +0000 Sets CUDA_MODULE_LOADING to LAZY when not set by the user (#85692) This PR sets CUDA_MODULE_LOADING if it's not set by the user. By default, it sets it to "LAZY". It was tested using the following commands: ``` python -c "import torch; tensor=torch.randn(20, 16, 50, 100).cuda(); free, total = torch.cuda.cudart().cudaMemGetInfo(0); print(total-free)" ``` which shows a memory usage of: 287,047,680 bytes vs ``` CUDA_MODULE_LOADING="DEFAULT" python -c "import torch; tensor=torch.randn(20, 16, 50, 100).cuda(); free, total = torch.cuda.cudart().cudaMemGetInfo(0); print(total-free)" ``` which shows 666,632,192 bytes. C++ implementation is needed for the libtorch users (otherwise it could have been a pure python functionality). cc: @ptrblck @ngimel @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/85692 Approved by: https://github.com/malfet commit 30a8a87c80dbfd7df81927a5acd190fac2240e04 Author: Tugsbayasgalan Manlaibaatar Date: Wed Oct 12 18:26:35 2022 -0700 Fix autogen for _ctc_loss.Tensor (#86871) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86871 Approved by: https://github.com/larryliu0820 commit dc6ce1485ec576df6f8d9f9e9717628802995cf4 Author: Salil Desai Date: Tue Oct 11 22:49:15 2022 -0700 Use Variable Size Indices in Sparse Qlinear Code (#85247) Final changes to enable sparse weight packing with variable size indices pack_block_sparse.cc is deleted because all functions in it have a template added, so they are moved to pack_block_sparse.h Differential Revision: [D39025651](https://our.internmc.facebook.com/intern/diff/D39025651/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39025651/)! 
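A small usage sketch for the slogdet batching rule restored in #86815 above, using the 2022-era `functorch` namespace (`torch.func.vmap` in later releases):

```python
import torch
from functorch import vmap

batch = torch.randn(8, 3, 3)

# Batched slogdet via vmap; per the PR, slogdet decomposes into
# torch.linalg.slogdet under the hood.
sign, logabsdet = vmap(torch.linalg.slogdet)(batch)

ref_sign, ref_logabsdet = torch.linalg.slogdet(batch)  # natively batched reference
assert torch.allclose(logabsdet, ref_logabsdet)
```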
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85247 Approved by: https://github.com/digantdesai commit d3afd49c85947a178dc7e2f15f97206387d2a279 Author: Salil Desai Date: Tue Oct 11 22:49:13 2022 -0700 Enable 16bit and 8bit Row/Col Indices in Qnnpack Fully Connected Sparse Op (#85246) This diff enables using the 16bit and 8bit kernels added in the previous diff. (This change used to be in D38954842 v11 but was moved into its own diff) Differential Revision: [D39403164](https://our.internmc.facebook.com/intern/diff/D39403164/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85246 Approved by: https://github.com/kimishpatel commit 6c6e06619f30fb3a12bc05738f6b7f39425618c9 Author: Salil Desai Date: Tue Oct 11 22:49:11 2022 -0700 Add 16bit and 8bit row/col indices q8gemm sparse kernels (#85245) TLDR: see D39003528 to see the actual changes in this diff more clearly, which will make reviewing easier ___ The 32bit versions were changed to be created with a macros which are also used to create 16bit and 8bit versions This diff shows that almost all of the lines in the .s files were modified, but most changes are just adding spaces to the front and ;/ to the end so they can be contained in the macro. To generate these changes, I first wrote the macros without the spaces and ;/, and then I ran a script (see the python file in D39003528) to get the final version. To review this diff more easily, if you want to see the code changes before I ran the script, which makes it much easier to see which lines were changed, see D39003528. Each version of this diff is synched with the same number version of that diff (so if I change this diff I will mirror the changes to the same version on that diff) Differential Revision: [D39003527](https://our.internmc.facebook.com/intern/diff/D39003527/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85245 Approved by: https://github.com/kimishpatel commit 6c6a32c2233f5d0820a265574734ab5706beeeee Author: Salil Desai Date: Tue Oct 11 22:49:09 2022 -0700 Enable Running Variable Size Row/Col Indices q8gemm Sparse Kernels in QNNPACK (#85244) For aarch32 and aarch64, the 16bit and 8bit versions of the kernels are left empty. I will be adding them in a future diff (D39003527) to avoid having this diff be too cluttered. Differential Revision: [D38954842](https://our.internmc.facebook.com/intern/diff/D38954842/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85244 Approved by: https://github.com/kimishpatel commit 4c0e1dc9808bc2c68ceaceae72e41308dccf8c5d Author: Salil Desai Date: Tue Oct 11 22:49:08 2022 -0700 Update Qnnpack Fully Connected Sparse Op to Store Variable Size Indices (#85243) Only uint32_t is supported for now, but uint16_t and uint8_t support will be added in future diffs. 
Differential Revision: [D38828545](https://our.internmc.facebook.com/intern/diff/D38828545/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85243 Approved by: https://github.com/kimishpatel commit 1a87c25fe19f117fd413ec469d7c21ac6ff44a62 Author: Nikita Shulga Date: Thu Oct 13 04:25:41 2022 +0000 Add functorch shard to sm86-periodic workflow (#86820) After https://github.com/pytorch/pytorch/pull/86799 was landed there shouldn't be a need to increase tolerances Pull Request resolved: https://github.com/pytorch/pytorch/pull/86820 Approved by: https://github.com/zou3519 commit cb4867a71a5944baaf6655bd765652cf37864443 Author: Emilio Castillo Date: Thu Oct 13 04:06:13 2022 +0000 Make `ASGD` & `RProp` differentiable (#86258) Blocked by #86183 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86258 Approved by: https://github.com/albanD commit 5224906749c85f1e2f6d7ec37a02bd29bcdebef3 Author: Huy Do Date: Thu Oct 13 03:31:28 2022 +0000 Spread distributed backends among all distributed shards (#86837) So that they can be run in parallel without stepping on each other toe Pull Request resolved: https://github.com/pytorch/pytorch/pull/86837 Approved by: https://github.com/clee2000 commit 48c648d75df4a2d02ede71f34c11b7f48c80da0e Author: Peter Bell Date: Tue Oct 11 03:24:07 2022 +0100 Fix typo TORCH_ONLY_METHOD_OPERATORS -> TORCH_ASSERT_ONLY_... (#86661) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86661 Approved by: https://github.com/malfet commit 67fbd940bae60e4392fb72eb495d51f6e0261260 Author: HDCharles Date: Wed Oct 12 10:04:06 2022 -0700 [ao] fixing public v private for fx.quantization_types (#86036) Summary: this file doesn't actually exist anymore so its just a case of removing the exception for it Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86036 Approved by: https://github.com/jerryzh168 commit b00cdb5b3416d908898c30d5f070085f7765f916 Author: HDCharles Date: Wed Oct 12 10:04:05 2022 -0700 [ao] fixing public v private for quantization_patterns.py (#86034) Summary: no significant changes, just addded __all__ Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86034 Approved by: https://github.com/jerryzh168 commit 77d29bcee200f04bece4a86283acfb8e1ec830ad Author: Khushi Agrawal Date: Thu Oct 13 01:18:30 2022 +0000 [primTorch] special: ndtr, ndtri, log_ndtr, erfcx (#86077) - Adds prims and _refs for `erfcx` and `ndtri`. - Adds _refs for `ndtr`, and `log_ndtr`. 
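As a quick sanity check on what the new references in #86077 compute: `ndtr` is the standard normal CDF and `ndtri` its inverse, so they can be cross-checked against the erf identity ndtr(x) = 0.5 * (1 + erf(x / sqrt(2))):

```python
import math
import torch

x = torch.linspace(-3, 3, 7, dtype=torch.float64)
phi = 0.5 * (1 + torch.erf(x / math.sqrt(2)))
assert torch.allclose(torch.special.ndtr(x), phi)

# ndtri inverts ndtr (up to floating-point error).
p = torch.tensor([0.1, 0.5, 0.9], dtype=torch.float64)
assert torch.allclose(torch.special.ndtr(torch.special.ndtri(p)), p)
```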
cc @kshitij12345 @lezcano @mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/86077 Approved by: https://github.com/mruberry commit ea586c0579a1fce55dbba4be7c88e9e04e709cef Author: Michael Voznesensky Date: Thu Oct 13 00:54:17 2022 +0000 Fix up cond a bit to make it work w/ fake tensor (#86727) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86727 Approved by: https://github.com/zou3519 commit 2a75152537c364eafecc9046d3e82bfc934cd056 Author: Mikayla Gawarecki Date: Wed Oct 12 21:21:10 2022 +0000 [easy] Add nested tanh (#86826) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86826 Approved by: https://github.com/cpuhrsch commit b79bac0e4ddb3a0b956b8bd0b33ab88daaa64de4 Author: CaoE Date: Thu Oct 13 00:42:45 2022 +0000 Make the data types of output and input consistenst for batchnorm (#84410) The model TTS will crash due to the issue:: when input of BN is not contiguous and the data type of input is different with that of parameters, BN will raise error `RuntimeError: !needs_dynamic_casting::check(iter) INTERNAL ASSERT FAILED at "xxx/pytorch/aten/src/ATen/native/cpu/Loops.h":311, please report a bug to PyTorch`. Make the data types of output and input consistenst for batchnorm to fix the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84410 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet commit c2f29e75cd9e8edb5bb2bb4163a4e26dd8f7d9f4 Author: Catherine Lee Date: Thu Oct 13 00:42:40 2022 +0000 [flakybot] add dynamo as platform (#86701) corresponding pr in test-infra https://github.com/pytorch/test-infra/pull/874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86701 Approved by: https://github.com/huydhn commit 9470059766dcb3f1e67f9d015aec8b57239ea421 Author: Zain Rizvi Date: Thu Oct 13 00:38:45 2022 +0000 Allow viable/strict promotion even if periodic or docker-release-builds jobs are failing (#86827) Allow `viable/strict` promotion even if `periodic` or `docker-release-builds` jobs are failing **Why?** Those jobs only run occasionally and for all we know the current viable/strict commit may already include the errors that the above cron based workflows may have later detected. Blocking the viable/strict upgrade because of these scheduled jobs doesn't really offer any value, it just leads to people getting older PRs when they try to fork off of viable/strict without guaranteeing an improvement in test quality Though frankly, the current situation is worse than that. Assume the branch history looks like A -> B A is the current `viable/strict` commit B is a commit that failed some `periodic` test, so `viable/strict` wasn't upgraded to B Now lets say there's a commit C that gets merged. C neither contains a fix for the failing periodic build, nor does a scheduled periodic workflow run against C. The branch becomes A -> B -> C In the above scenario, today we will promote `viable/strict` to C since there was no failing workflow there!!! Even though it didn't actually fix what was broken with B! In short, avoiding the upgrade to B really doesn't make any sense today and we shouldn't do it. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86827 Approved by: https://github.com/janeyx99 commit 66cab5245fbae639d7bc528d22eafe97c03bb935 Author: albanD Date: Wed Oct 12 11:24:51 2022 -0400 Reland 2 min/max support for SymInt/Floats, finish as_strided/scatter/squeeze() backward symint support (#86797) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86797 Approved by: https://github.com/bdhirsh commit 894c4218dd9a656d03b21a6f560b51ac432ae529 Author: Eli Uriegas Date: Wed Oct 12 13:36:56 2022 -0700 ci: Just use regular checkout (#86824) checkout-pytorch seems to have issues and is purpose made for our PR testing and appears to conflict with what we're trying to do for binary builds. For builds like https://github.com/pytorch/pytorch/actions/runs/3207520052/jobs/5242479607 there is a confusion over where the reference is pulled and I believe it is root caused by the checkout logic in checkout-pytorch. So with that in mind I suggest we just use the upstream checkout action for this job Signed-off-by: Eli Uriegas Pull Request resolved: https://github.com/pytorch/pytorch/pull/86824 Approved by: https://github.com/atalman commit aacb9f3ac63d9a31d064c76ff3d328037355b28e Author: Emilio Castillo Date: Wed Oct 12 23:16:29 2022 +0000 Make `Adadelta`,`Adagrad` & `Adamax` differentiable (#86096) Continuing the differentiable optimizers support Pull Request resolved: https://github.com/pytorch/pytorch/pull/86096 Approved by: https://github.com/janeyx99 commit e552cf105058e6d7ea367d31ff3d3c0a31ea0bbd Author: Shawn Zhong Date: Wed Oct 12 22:31:48 2022 +0000 [DOC] Use type hints to show annotation in the docs (#79086) Fixes #44964 Use type hints in the code to show type annotations in the parameters section of the docs. For the parameters already documented in the docstring, but lack the type annotation, the type hints from the code are used: | [Before](https://pytorch.org/docs/master/generated/torch.nn.AdaptiveMaxPool1d.html) | [After](https://docs-preview.pytorch.org/79086/generated/torch.nn.AdaptiveMaxPool1d.html) | | --- | --- | | image | image | | [Before](https://pytorch.org/docs/master/generated/torch.nn.Linear.html) | [After](https://docs-preview.pytorch.org/79086/generated/torch.nn.Linear.html) | | --- | --- | | image | image | Ref: - PR https://github.com/pytorch/pytorch/pull/49294 removed type annotations from signatures in HTML docs. 
- Sphinx version was bumped to 5.0.0 in PR #70309 - Duplicated (closed) issues: #78311 and #77501 Pull Request resolved: https://github.com/pytorch/pytorch/pull/79086 Approved by: https://github.com/malfet commit a77f2a95a77cc2c4af9c1fa4144dfe97bab2f3ed Author: Mikayla Gawarecki Date: Wed Oct 12 18:20:35 2022 +0000 Improve NestedTensor documentation (#85186) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85186 Approved by: https://github.com/cpuhrsch commit be81f3d8d4c6e974fa1644ace20bc1e75e168c90 Author: Huy Do Date: Wed Oct 12 21:17:25 2022 +0000 Revert distributed test parallelization (#86756) Revert an old commit and resolve some conflicts Fixes https://github.com/pytorch/pytorch/issues/86418 Fixes https://github.com/pytorch/pytorch/issues/86419 Fixes https://github.com/pytorch/pytorch/issues/86415 Fixes https://github.com/pytorch/pytorch/issues/86420 Fixes https://github.com/pytorch/pytorch/issues/86416 Fixes https://github.com/pytorch/pytorch/issues/86392 Fixes https://github.com/pytorch/pytorch/issues/86391 Fixes https://github.com/pytorch/pytorch/issues/86397 Fixes https://github.com/pytorch/pytorch/issues/86390 Fixes https://github.com/pytorch/pytorch/issues/86398 Fixes https://github.com/pytorch/pytorch/issues/86396 Fixes https://github.com/pytorch/pytorch/issues/86395 Fixes https://github.com/pytorch/pytorch/issues/86393 Fixes https://github.com/pytorch/pytorch/issues/86394 Fixes https://github.com/pytorch/pytorch/issues/86440 Fixes https://github.com/pytorch/pytorch/issues/86442 Fixes https://github.com/pytorch/pytorch/issues/86439 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86756 Approved by: https://github.com/mrshenli commit 09a676f639b422baf947768c47116b944470e411 Author: Antonio Kim Date: Wed Oct 12 20:57:19 2022 +0000 Add hooks for register_buffer/module/parameter (#86148) As described in the issue, this PR adds hooks to be run when `register_parameter`, `register_buffer` and `register_module` are called. Fixes #85837 cc @albanD @mruberry @jbschlosser @walterddr @kshitij12345 @saketh-are Pull Request resolved: https://github.com/pytorch/pytorch/pull/86148 Approved by: https://github.com/albanD commit c08cbfccd9e2a9b2e6006773d04aafa74977684f Author: Zain Rizvi Date: Wed Oct 12 20:43:42 2022 +0000 Let retried jobs advance viable/strict (#86821) Today, even if we retry a failed workflow it succeeds on the retry, viable/strict doesn't advance forward. Success on retry is proof that the error wasn't with the current commit and that we should in fact promote viable/strict. This PR points to an updated rockset query which will only look at the success status of the most recent job in each workflow Here's the query edited: Original query: https://console.rockset.com/lambdas/details/commons.commit_jobs_batch_query/versions/15aba20837ae9d75?tab=sql Updated query: https://console.rockset.com/lambdas/details/commons.commit_jobs_batch_query/versions/8003fdfd18b64696?tab=sql Testing: Tested the old and new query against commits known to have succeeded on retry Pull Request resolved: https://github.com/pytorch/pytorch/pull/86821 Approved by: https://github.com/huydhn, https://github.com/malfet commit 3b26680222998778f48e0a1939bdafab6db53c7c Author: vfdev Date: Wed Oct 12 20:33:14 2022 +0000 Update _torch_docs / ldexp (#86721) Fixes a typo on ldexp docstring. 
https://pytorch.org/docs/master/generated/torch.ldexp.html?highlight=ldexp#torch.ldexp image https://livesphinx.herokuapp.com/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/86721 Approved by: https://github.com/samdow commit 363b108e39c714521836b9855062022c98d6dba8 Author: Jerry Zhang Date: Tue Oct 11 17:23:55 2022 -0700 [quant][fx] Fix weight_dtype and bias_dtype backend_config checks (#86719) Summary: This PR adds checks for the existence of "weight_dtype" and "bias_dtype" in the node_name_to_dtype dictionary before accessing it, the corner case is hit when we check the compatibility of qconfig and backend_config for weight and bias that appears before activation (e.g. torch.addmm) Test Plan: python test/test_quantization.py -k test_backend_config_check_for_weight_and_bias Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86719 Approved by: https://github.com/andrewor14 commit d6bfbdf50c48ebc3a909a47c416ba0a73ee6174d Author: HDCharles Date: Wed Oct 12 10:04:05 2022 -0700 [ao] fixing public v private for fx.pattern_utils.py (#86033) Summary: added __all__, one issue with QuantizeHandler is that since its defined as 'Any' it can't be set as a public module although it should be, i've set it to private here but when the circular dependency gets fixed, it will probably be removed. Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86033 Approved by: https://github.com/jerryzh168 commit bf0116d1f0c5ec58308a0af4e8f4212a78db649f Author: HDCharles Date: Wed Oct 12 10:04:04 2022 -0700 [ao] fixing public v private for fx.graph_module.py (#86032) Summary: no significant changes, just added __all__ Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86032 Approved by: https://github.com/jerryzh168 commit 25476f2e4b8bc1ddc0d6b6a7d71d7626fa5eb76e Author: HDCharles Date: Wed Oct 12 10:04:04 2022 -0700 [ao] fixing public v private for quantization_types (#86031) Summary: the main problem with this was that the different objects defined simply as 'Any' should theoretically be public but making them public either A) results in an error about the module being 'typing' rather than whatever module it should be or B) you set the module manually, thereby changing the module for the original 'Any' class. note: QuantizeHandler has a similar issue where its simply defined as 'Any' Pattern was defined in multiple places which was causing issues so i just moved it to a single place given the note at the top of quantization_types.py indicating these definitions should be moved to utils at some point anyway. Finally i changed any references to these objects to point at the correct locations. Note: i didn't see any fb internal references to NodePattern or QuantizerCls that would cause issues. 
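The pattern behind these "[ao] fixing public v private" commits is to declare each module's intended public surface with `__all__` so that `test/test_public_bindings.py` can enforce it; a generic sketch with made-up names:

```python
# Made-up module contents, only to illustrate the __all__ convention.
__all__ = [
    "get_default_patterns",
    "PatternHandler",
]

class PatternHandler:
    pass

def get_default_patterns():
    return {}

def _merge_pattern_dicts(a, b):  # leading underscore: private helper,
    return {**a, **b}            # intentionally left out of __all__
```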
Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86031 Approved by: https://github.com/jerryzh168 commit ef58a132f223d5abf2bd3f8bee380aca6c29d17f Author: Christian Puhrsch Date: Wed Oct 12 20:03:25 2022 +0000 Use CUTLASS GEMM for NT bmm [OSS-only] (#85894) OSS-only copy of https://github.com/pytorch/pytorch/pull/85710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85894 Approved by: https://github.com/drisspg commit 73c43ce2e2a074f4a1e688d4f8b2ebacd9256476 Author: Peter Bell Date: Mon Oct 10 15:58:26 2022 +0100 Display unexpected exceptions raised from test_dtypes (#86599) Currently `test_dtypes` swallows all exceptions which can make debugging failures more tricky. This changes the test to save the exceptions and print only the unexpected ones at the end e.g. ``` AssertionError: The supported dtypes for nn.functional._scaled_dot_product_attention on device type cuda are incorrect! The following dtypes did not work in backward but are listed by the OpInfo: {torch.bfloat16}. Unexpected failures raised the following errors: torch.bfloat16 - CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling [...] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/86599 Approved by: https://github.com/mruberry commit 6be9d9a630993f0a64c16d82d9605b8e4a5ad603 Author: Amadeusz Skrzypczak Date: Wed Oct 12 19:37:13 2022 +0000 Add AutocastHPU support (#84927) New dispatch key and necessary functions are added to PyTorch. Backend implementation will be added in the external library. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84927 Approved by: https://github.com/bdhirsh commit 553eaaba7c173cbd4507f2929d39d4b61c246bf6 Author: Richard Zou Date: Wed Oct 12 19:27:17 2022 +0000 Disable tf32 in functorch transform tests (#86799) This PR applies a large hammer and disables TF32 in specific functorch transform tests. TF32 isn't precise enough to test correctness. We could have applied a smaller hammer by disabling TF32 per-OpInfo, but that doesn't seem to have too much additional benefit (e.g. if a convolution batching rule is correct on fp32 then I would expect it to be correct under TF32 modulo precision issues because the actual sequence of PyTorch operators we invoke has not changed, only the backend did). Test Plan: - I tested this locally on a machine with A100 GPUs. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/86799 Approved by: https://github.com/malfet commit d56017a14f34b5130fa70c0cba010e3d2506deb0 Author: Nikita Karetnikov Date: Wed Oct 12 11:20:04 2022 +0200 [primTorch] Add ref for `triplet_margin_loss`, improve `triplet_margin_with_distance_loss` (#85614) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85614 Approved by: https://github.com/lezcano, https://github.com/mruberry commit ce56ee11fdf3843d507425bfa401e6ad5f4ee492 Author: Daniel Dale Date: Wed Oct 12 18:37:50 2022 +0000 Extend torch.cuda.is_available() to attempt an NVML-based CUDA availability assessment when explicitly requested by the user (#85951) Fixes #83973 (This is a substitute PR for https://github.com/pytorch/pytorch/pull/85024) First of all, thanks for your invaluable contributions to PyTorch everyone! 
Given how extensively `torch.cuda.is_available` is used in the PyTorch ecosystem, IMHO it's worthwhile to provide downstream libraries/frameworks/users the ability to alter the default behavior of `torch.cuda.is_available` in the context of their PyTorch usage. I'm confident there are many current and future such use cases which could benefit from leveraging a weakened, NVML-based `torch.cuda.is_available` assessment at a downstream framework's explicit direction (thanks @malfet https://github.com/pytorch/pytorch/commit/81da50a972fc402a6dd880fe392af0f0051cb6de !). Though one could always patch out the `torch.cuda.is_available` function with another implementation in a downstream library, I think this environmental variable based configuration option is more convenient and the cost to including the option is quite low. As discussed in https://github.com/pytorch/pytorch/pull/85024#issuecomment-1261542045, this PR gates new non-default NVML-based CUDA behavior with an environmental variable (PYTORCH_NVML_BASED_CUDA_CHK) that allows a user/framework to invoke non-default, NVML-based `is_available()` assessments if desired. Thanks again for your work everyone! @ngimel @malfet @awaelchli Pull Request resolved: https://github.com/pytorch/pytorch/pull/85951 Approved by: https://github.com/ngimel commit cd7c86eaa46874993affc48d31f826625762c461 Author: Ivan Yashchuk Date: Wed Oct 12 18:21:58 2022 +0000 Add prims.clone (#86705) This simple PR adds `clone` as a primitive. Current implementation of `clone` is not supported with nvFuser executor because of `empty_like` + `copy_to`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86705 Approved by: https://github.com/mruberry commit 3356d0385fd0f3f0f6ce2d8c681a40fd110c7848 Author: Howard Huang Date: Wed Oct 12 15:40:07 2022 +0000 [BE] Store helper functions C++ for python API parity (#82136) Add helper functions for `store.set()`, `store.compare_set()` to accept string arguments instead of vector and refactored some usages internally Pull Request resolved: https://github.com/pytorch/pytorch/pull/82136 Approved by: https://github.com/rohan-varma commit cc7ea93c2cf4275faaae29db9006d8f6067b1c5a Author: BowenBao Date: Mon Oct 10 17:23:54 2022 -0700 [ONNX] Support device().type() string comparison with constant (#86168) Fixes #86168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86168 Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/abock commit 58542eb25618eb6784567e7497f4764ab04d70ad Author: HDCharles Date: Tue Oct 11 17:40:37 2022 -0700 [ao] fixing public v private for backend_config.native.py (#86030) Summary: no significant changes, just added some things to __all__ Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86030 Approved by: https://github.com/jerryzh168 commit 409efebab8718a7bfc714ab3787e5a8689289697 Author: Vladimír Aubrecht Date: Wed Oct 12 15:44:28 2022 +0000 Added define to fix issue with compatibility with latest Windows SDK (#85408) Fixes #83820. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85408 Approved by: https://github.com/ezyang commit f24d174fffaf6efbc0c95ed561ab839ca496f6a7 Author: Sheil Kumar Date: Wed Oct 12 15:26:29 2022 +0000 Allow PrivateUse1 backends to not have Storage (#86557) Allow PrivateUse1 backends to not have Storage To unblock the DirectML backend, this change would be needed for 1.13 as well. 
The DirectML backend creates tensors using the open registration pattern documented here: https://pytorch.org/tutorials/advanced/extend_dispatcher.html (registration example: https://github.com/bdhirsh/pytorch_open_registration_example) However, DirectML tensors are opaque, and do not have Storage. The DirectML Tensor Impl derives from OpaqueTensorImpl, which does not have a storage. Because of this various places in the code fail that expect storage to be present. We had made various changes in-tree to accommodate this: a. def __deepcopy__(self, memo): https://github.com/pytorch/pytorch/blob/b5acba88959698d35cb548c78dd3fb151f85f28b/torch/_tensor.py#L119 or self.device.type in ["lazy", "xla", "mps", "ort", "meta", "hpu", 'dml'] b. def _reduce_ex_internal(self, proto): https://github.com/pytorch/pytorch/blob/b5acba88959698d35cb548c78dd3fb151f85f28b/torch/_tensor.py#L275 if self.device.type in ["xla", "ort", "hpu", "dml"]: c. TensorIteratorBase::build has an unsupported list for tensors without storage. https://github.com/pytorch/pytorch/blob/b5acba88959698d35cb548c78dd3fb151f85f28b/aten/src/ATen/TensorIterator.cpp#L1497 Using the PrivateUse1 backend, similar exemptions need to be made in order to relax requirements on Storage so that the DirectML backend tensors can work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86557 Approved by: https://github.com/bdhirsh, https://github.com/martinb35 commit 61a5898675d2b18bea1009305ce1b1f7042b7d64 Author: Philip Meier Date: Wed Oct 12 13:03:46 2022 +0000 use cff standard for citation information (#86200) GH picks up on our `CITATION` file in the root of the repository.
![Screenshot from 2022-10-04 11-34-54](https://user-images.githubusercontent.com/6849766/193811617-b71ef606-a043-498b-bb2d-14b6c05e79e7.png) However, [the preferred way](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files) is use a `CITATION.cff` file instead since GH supports the [citation file format (CFF) standard](https://github.com/citation-file-format/citation-file-format). With this PR, the prompt changes to ![Screenshot from 2022-10-04 13-48-21](https://user-images.githubusercontent.com/6849766/193812010-026bfad7-7c4e-4b59-a90a-1d3ad47303d0.png) with the following auto-generated bibtex entry: ```bibtex @inproceedings{Paszke_PyTorch_An_Imperative_2019, author = {Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and Bai, Junjie and Chintala, Soumith}, booktitle = {Advances in Neural Information Processing Systems 32}, pages = {8024--8035}, publisher = {Curran Associates, Inc.}, title = {{PyTorch: An Imperative Style, High-Performance Deep Learning Library}}, url = {http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf}, year = {2019} } ``` Comparing with what we currently have the only significant difference is that the editors are no longer listed although the metadata is there. This is an issue with GH's automatic conversion and might be fixed in the future. Plus, the cite key was changed from `NEURIPS2019_9015` to `Paszke_PyTorch_An_Imperative_2019`, but this has no effect on the rendered result. Do we also want to adopt the CFF standard? Pull Request resolved: https://github.com/pytorch/pytorch/pull/86200 Approved by: https://github.com/dagitses commit 493ded249ecaba1d76459901600d2dc7439a9f43 Author: Fabio Rocha Date: Wed Oct 12 09:33:06 2022 +0000 [primTorch] decomposition for bucketize (#86366) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86366 Approved by: https://github.com/mruberry commit f903f1ab343fea72177f29fc8d453febcaad8905 Author: jjsjann123 Date: Wed Oct 12 07:50:46 2022 +0000 Patching getitem in partitioner (#86713) 1. rejecting getitem operator in backends fusion query getitem is merged in a special post partition pass, backends that takes getitem shouldn't affect the logic 2. added test for failing cases Fixes #86698 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86713 Approved by: https://github.com/SherlockNoMad commit 2344135179642df5d383d2e91880600f774cbdef Author: Khushi Date: Wed Oct 12 07:00:40 2022 +0000 [primTorch] special: entr, expit (#86592) Add _refs for `entr` & `expit`. cc @mruberry @kshitij12345! 
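As a reference for what these refs mirror, a minimal sketch (mine, not part of the PR) using the existing eager ops in `torch.special`:

```python
import torch

x = torch.tensor([0.25, 0.5, 0.75])
print(torch.special.entr(x))   # elementwise -x * log(x)
print(torch.special.expit(x))  # logistic sigmoid, 1 / (1 + exp(-x))
```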
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86592 Approved by: https://github.com/mruberry commit a47f93b6c97a39bb8934fc531145d8cdac5cf8f6 Author: Sherlock Huang Date: Wed Oct 12 02:26:02 2022 +0000 Add type and shape annotation for gm.print_readable() (#86562) For ``` def f(a, b): dim0 = a.shape[0] + b.shape[0] dim1 = a.shape[1] + b.shape[1] d = a.new_empty(dim0, dim1) return d fx_g = make_fx(f, tracing_mode="symbolic")(torch.randn(5, 3), torch.randn(4, 3)) fx_g.print_readable() ``` Tracing with 'real' and 'fake' mode yields ``` class f(torch.nn.Module): def forward(self, a_1: Tensor[5, 3], b_1: Tensor[4, 3]): new_empty: Tensor[9, 6] = torch.ops.aten.new_empty.default(a_1, [9, 6], dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False); a_1 = None return new_empty ``` Tracing with 'symbolic' mode yields ``` def forward(self, a_1: Tensor[t0.size(0), t0.size(1)], b_1: Tensor[t1.size(0), t0.size(1)]): sym_size: Symint(t0.size(0)) = torch.ops.aten.sym_size(a_1, 0) sym_size_1: Symint(t1.size(0)) = torch.ops.aten.sym_size(b_1, 0) add: Symint(t0.size(0) + t1.size(0)) = sym_size + sym_size_1; sym_size = sym_size_1 = None sym_size_2: Symint(t0.size(1)) = torch.ops.aten.sym_size(a_1, 1) sym_size_3: Symint(t0.size(1)) = torch.ops.aten.sym_size(b_1, 1); b_1 = None add_1: Symint(2*t0.size(1)) = sym_size_2 + sym_size_3; sym_size_2 = sym_size_3 = None new_empty: Tensor[t0.size(0) + t1.size(0), 2*t0.size(1)] = torch.ops.aten.new_empty.default(a_1, [add, add_1], dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False); a_1 = add = add_1 = None return new_empty ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/86562 Approved by: https://github.com/Chillee commit e0d6898cbd9e7af8ecb1e911e4a8c29e79a78921 Author: PyTorch MergeBot Date: Wed Oct 12 04:12:43 2022 +0000 Revert "Backport currently dont work with some models if: (#86510)" This reverts commit 4bfb7341819b3bfcaf65ddc136f25d23983740a7. Reverted https://github.com/pytorch/pytorch/pull/86510 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit 25725fd62448165b91647304c26d676db22b6955 Author: Eddie Yan Date: Wed Oct 12 03:44:21 2022 +0000 (Re-open) Adds cudaMallocAsync as an alternative backend for the CUDA allocator (#82682) Rebased version of @mcarilli 's cudaMallocAsync #65365 for continued testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/82682 Approved by: https://github.com/ngimel commit a216f4700cbd3d126b4677bcf30f2082da0163ea Author: Nikita Shulga Date: Wed Oct 12 01:45:21 2022 +0000 Add testing on A10G GPU to periodic workflow (#85524) This enables testing on lots of modern CUDA features on sm_86 capable GPU While migrating to that platform, discovered that `functorch` tests for `nn.functional.conv.transpose3d` produce garbage on sm_80+ as well as 2 `nvfuser` tests unexpectedly pass and one unexpectedly fails. 
TODO: - Investigate unexpected success for `test_vmapvjp_linalg_householder_product_cuda_float32` and add `functorch` shard Pull Request resolved: https://github.com/pytorch/pytorch/pull/85524 Approved by: https://github.com/ngimel commit c4f0b93f8653505584bbd71162f82d4e7633da0c Author: Elias Ellison Date: Tue Oct 11 01:24:48 2022 +0000 Disable autocast in aot autograd (#86515) Fix for https://github.com/pytorch/torchdynamo/issues/1368 From comment: > When we invoke a Composite Implicit autograd operator that has an autocast rule, such as Einsum, autocast is disabled during its invocation. When we trace out the operators in an implicit op, re-applying on autocast rules on those operators might yield divergence from what was executed at runtime. This pass checks for divergence. If divergence is found, we will disable autocast. We would like to avoid disabling autocast if possible because accessing TLS is slow. Concretely, the problem found was when invoked `sum` in `einsum`: As seen by the following divergence: ``` >>> with torch.cuda.amp.autocast(enabled=True): ... print(torch.ops.aten.sum.dim_IntList(torch.rand([2, 2, 2], device="cuda", dtype=torch.half), [1, 2]).dtype) ... torch.float32 >>> print(torch.ops.aten.sum.dim_IntList(torch.rand([2, 2, 2], device="cuda", dtype=torch.half), [1, 2]).dtype) torch.float16 ``` Edit: we've decided to accept the overhead of universally disabling autocast instead Pull Request resolved: https://github.com/pytorch/pytorch/pull/86515 Approved by: https://github.com/bdhirsh, https://github.com/Chillee commit d598290baab45b52b9b78d3083ac215f4251943c Author: Christian Puhrsch Date: Wed Oct 12 01:27:57 2022 +0000 Basic SDP benchmark harness (#86729) Basic benchmark for reference and discussion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86729 Approved by: https://github.com/drisspg commit 4bfb7341819b3bfcaf65ddc136f25d23983740a7 Author: Han Qi (qihqi) Date: Wed Oct 12 00:39:25 2022 +0000 Backport currently dont work with some models if: (#86510) Backport currently dont work with some models if: * model is originally exported with interface call enabled (backport would disable it) * model is flatbuffer (flatbuffer support is soft enabled via link time registry), so we manually trigger it Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/86510 Approved by: https://github.com/cccclai commit ce48df9e938ac208cf018545517344c6a6debab2 Author: Bin Bao Date: Tue Oct 11 20:31:12 2022 +0000 Re-enable torchdynamo unit tests (#86658) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86658 Approved by: https://github.com/jansel commit 692b525b71658caf45fd8d70dd3f285b6eb6b821 Author: Nikita Shulga Date: Wed Oct 12 00:32:53 2022 +0000 [MPS] Extend unary ops to int64 (#86615) Most of them are already supported for `int64` except for: - rounding operations (`floor`, `ceil` and `round`), which are no-ops for integral types anyway - sign operation, when it can be emulated by clamping it tensor to [-1, 1] range Test new types by test MPS Fixes https://github.com/pytorch/pytorch/issues/86319 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86615 Approved by: https://github.com/DenisVieriu97, https://github.com/huydhn commit f912b5854466754b49aad5f9fc3f3470093dd192 Author: PyTorch MergeBot Date: Tue Oct 11 23:53:12 2022 +0000 Revert "Enable max.unary_out (#85926)" This reverts commit 16a0fa1204edb118800261a26281e624988eb239. 
Reverted https://github.com/pytorch/pytorch/pull/85926 on behalf of https://github.com/osalpekar due to The internal diff for this commit shows a number of pytorch quantization test failures. Here is a sample output: AssertionError: Tensor-likes are not close! Mismatched elements: 319 / 320 (99.7%). Greatest absolute difference: 0.056652069091796875 at index (0, 0, 4, 5) (up to 1e-05 allowed). Link to the diff: [D40232598](https://www.internalfb.com/diff/D40232598). Link to the Sandcastle job that is failing: https://www.internalfb.com/intern/sandcastle/job/18014399302908587/ commit 2aa981ab74df71c8d019f12032ce75910601b52c Author: PyTorch MergeBot Date: Tue Oct 11 23:39:50 2022 +0000 Revert "Reland 2 of Merge more symbolic meta kernels and symint changes from branch (#86334) (#86488)" This reverts commit 978b46d7c96627e3b3553ad70ad21cb161d05f90. Reverted https://github.com/pytorch/pytorch/pull/86488 on behalf of https://github.com/osalpekar due to Broke executorch builds internally with the following message: RuntimeError: Missing out variant for functional op: aten::split.Tensor(Tensor(a -> *) self, SymInt split_size, int dim=0) -> Tensor(a)[] . Make sure you have loaded your custom_ops_generated_lib commit 9eb4f9dd175b3d73b1c9b7c1d00dad406db60e5e Author: Nikita Shulga Date: Tue Oct 11 19:49:23 2022 +0000 Tweak test tolerances to be compatible with A10G (#86538) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86538 Approved by: https://github.com/ngimel commit 7fa601b1a738382a8730f9f011b0a5c39247af6a Author: Nikita Shulga Date: Tue Oct 11 23:27:30 2022 +0000 Skip chalf.mean in test_reductions_large_half_tensors (#86747) As `mean_reduce` is not implemented for complex half Fixes https://github.com/pytorch/pytorch/issues/86743 and unblock A10G testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/86747 Approved by: https://github.com/ngimel commit 811b8e012b3ddcb84adb2e483089758e84b6a995 Author: PyTorch MergeBot Date: Tue Oct 11 23:12:40 2022 +0000 Revert "min/max support for SymInt/Floats, finish as_strided/scatter/squeeze() backward symint support (#86643)" This reverts commit 86f914e9966e91b3d3e7c1504f5b1f00a9498d88. Reverted https://github.com/pytorch/pytorch/pull/86643 on behalf of https://github.com/osalpekar due to Need to revert this to cleanly revert https://github.com/pytorch/pytorch/pull/86488. This should be safe to re-land later commit f1fdb6efbd09dad3c308b0447682f1f14d2c325e Author: Jason Ansel Date: Tue Oct 11 23:01:21 2022 +0000 Manual changes for moving dynamo to core (#86621) This is the subset of the changes in #86461 not auto-generated by `copy_to_core.sh`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86621 Approved by: https://github.com/albanD commit 09364f4298e45142bf3a3a6a316447d24abb2fdf Author: Nikita Shulga Date: Tue Oct 11 22:39:58 2022 +0000 Compile C10 with `Wshadow` (#86666) This should prevent further regressions like https://github.com/pytorch/pytorch/pull/86646 Update `fmt` to `7.1.0` to fix variable shadowing in that library Pull Request resolved: https://github.com/pytorch/pytorch/pull/86666 Approved by: https://github.com/seemethere commit 0337f0ad473ffc298a30e603050d2df9d0073428 Author: Zain Rizvi Date: Tue Oct 11 21:56:01 2022 +0000 Add error checking to flaky test bot platform parser (#86632) If an invalid platform is specified when disabling a test with flaky test bot, the CI crashes, skipping all tests that come after it. This turns it into a console message instead. 
Not erroring out here since it'll affect random PRs. Actual error message should go into the bot that parses the original issue so that it can respond on that issue directly Pull Request resolved: https://github.com/pytorch/pytorch/pull/86632 Approved by: https://github.com/huydhn commit 42bd275233259d6a4b4d071c14355d4ec45b3ec2 Author: Partho Date: Tue Oct 11 21:41:48 2022 +0000 [doc] LR scheduler example fix (#86629) Fixes issue #86208 As suggested in the issue, updated the LR scheduler example to use a regular nn.Module like the other examples on the same page. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86629 Approved by: https://github.com/soulitzer commit 32152ce328230de27c9e3d3c1cfdc97c9ad1738a Author: jimku9 Date: Tue Oct 11 21:21:53 2022 +0000 Add original sources/references to Wishart.py in distributions (#86543) @fritzo As discussed, add original sources/references to Wishart.py in distributions and corrected typos in the error messages. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86543 Approved by: https://github.com/fritzo commit 50af1ace5e4b029cef48b851dfc9e9ebbc1bb2b5 Author: Sherlock Huang Date: Tue Oct 11 17:56:59 2022 +0000 Mark aten ops as canonical (#86215) This is the first batch of canonical aten ops. 87 in total. More to come in the future PRs. native_dropout abs add.Tensor add.Scalar arange.start_step bitwise_not bmm cat clamp constant_pad_nd convolution convolution_backward div.Tensor div.Scalar embedding_dense_backward erf exp expand fill.Scalar grid_sampler_2d native_group_norm native_group_norm_backward native_layer_norm native_layer_norm_backward log _log_softmax max.dim amax mean.dim min.dim amin mm mul.Tensor mul.Scalar native_batch_norm permute scalar_tensor reciprocal neg repeat relu gelu rsqrt sigmoid slice.Tensor slice_scatter _softmax squeeze.dim sum.dim_IntList sqrt tanh unsqueeze var.dim where.self clone sub.Tensor sub.Scalar addmm _to_copy view scatter_add bitwise_and.Tensor bitwise_or.Tensor eq.Scalar ge.Scalar le.Scalar gt.Scalar lt.Scalar index_select nonzero gather maximum minimum pow.Tensor_Scalar hardtanh leaky_relu _adaptive_avg_pool2d _adaptive_avg_pool2d_backward avg_pool2d avg_pool2d_backward max_pool2d_with_indices max_pool2d_with_indices_backward upsample_bilinear2d.vec upsample_bilinear2d_backward.vec upsample_nearest2d.vec upsample_nearest2d_backward.vec col2im Pull Request resolved: https://github.com/pytorch/pytorch/pull/86215 Approved by: https://github.com/suo, https://github.com/anjali411 commit 8db30255c36fc7a93d8d5285415d7ab96911e1df Author: Jeff Daily Date: Tue Oct 11 20:55:58 2022 +0000 [ROCm] set nvfuser default to disabled, keep CI (#86369) Bug fix. nvfuser is functional for ROCm on gfx906, but some tests are failing for other gfx targets. Disable nvfuser until all features are verified. Users may still opt-in by setting the known env var PYTORCH_JIT_ENABLE_NVFUSER=1. This PR sets this env var for the github actions workflow for ROCm since all current CI hosts are gfx906. 
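A minimal sketch of the opt-in mentioned above (my own example, assuming the variable only needs to be present in the environment before TorchScript models are run; `export PYTORCH_JIT_ENABLE_NVFUSER=1` in the shell is equivalent):

```python
import os

# Opt back into nvfuser on ROCm builds where it now defaults to off.
# Set it before importing torch so the setting is picked up (assumption).
os.environ["PYTORCH_JIT_ENABLE_NVFUSER"] = "1"

import torch
```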
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86369 Approved by: https://github.com/huydhn commit 5ffe24fca43b05216d55fa73a7d2629248315a30 Author: Stephen Jia Date: Tue Oct 11 20:16:56 2022 +0000 [vulkan][ez] fix always printing out a warning when retrieving the global context (#86697) Summary: D40151818 (https://github.com/pytorch/pytorch/commit/82ed5ca3401e965067fd03a6bac57978f884f715) replaces the `TORCH_CHECK` with a `TORCH_WARN` but since it does not check if the context is valid the message gets printed every time. This diff fixes that. Test Plan: Referring to [Pytorch Vulkan Testing Procedures](https://fb.quip.com/fZALAc9zhlcU) On Mac: 1. `vulkan_api_test` on Mac 2. model comparison binary on Mac On Android: 1. `vulkan_api_test` on Android 2. benchmark binary on Android Reviewed By: salilsdesai Differential Revision: D40266820 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86697 Approved by: https://github.com/kirklandsign commit f32aeeae00015ed484f8bfea2e24018de0dae277 Author: Han Qi (qihqi) Date: Tue Oct 11 20:07:58 2022 +0000 Set interface_call to true be default (#86668) Summary: ASR models need it Test Plan: existing unit tests Reviewed By: cccclai Differential Revision: D40251788 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86668 Approved by: https://github.com/cccclai commit 7f02f2ac0cc8e9db2137d299769b456afe27fa45 Author: Huy Do Date: Tue Oct 11 19:34:44 2022 +0000 [Experimentation] Add TSAN build and test (#85313) Some parts of the PR are adopted from the previously abandoned https://github.com/pytorch/pytorch/pull/36694. This PR is the first part to setup TSAN jobs in the CI. The data race warnings from TSAN will need to be reviewed later in a separate PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85313 Approved by: https://github.com/osalpekar commit 92562046e9d6ef32b14e17b2b06433cfe7990912 Author: 胡玮文 Date: Tue Oct 11 19:03:43 2022 +0000 Optimize __dlpack_device__ performance (#86665) This can be critical when processing a large number of tensors ```bash python -m timeit --setup 'import torch; t = torch.empty(1000, device="cuda")' 't.__dlpack_device__()' ``` based on 1.12.1: before: 100000 loops, best of 5: 2.32 usec per loop after: 500000 loops, best of 5: 844 nsec per loop Pull Request resolved: https://github.com/pytorch/pytorch/pull/86665 Approved by: https://github.com/SunDoge, https://github.com/soulitzer commit c12f829cce29eb6971094a9bbb0f8971aed86f5c Author: Jerry Zhang Date: Tue Oct 11 18:49:09 2022 +0000 [nn] Add remove_duplicate flag to named_buffers (#674) (#85903) Summary: X-link: https://github.com/pytorch/torchrec/pull/674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84984 this is to allow named_buffers to return the same buffer objects with different names multiple times, needed by internal use cases ghstack-source-id: 168589597 Test Plan: python test/test_nn.py -k test_buffers_and_named_buffers Imported from OSS Reviewed By: albanD Differential Revision: D39493161 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85903 Approved by: https://github.com/albanD commit 693250ac859fbb15ea4b4426d9cefaf97e151eb7 Author: David Date: Tue Oct 11 18:05:53 2022 +0000 Docs: fx.Node docs incorrectly state that the self argument is included in args for module calls (#86685) It seems like the [torch.fx.Node docs](https://pytorch.org/docs/stable/fx.html#torch.fx.Node) are incorrect regarding the inclusion of the self argument for module call nodes. 
While the docs state that self (the module) is included in `args`, it is in fact not, as demonstrated by this code: ```python import torch from torch import fx, nn class Net(nn.Module): def __init__(self): super().__init__() self.submod = nn.Linear(10, 10) def forward(self, x): x = x.flatten() return self.submod(x) graph_module = fx.symbolic_trace(Net()) print(graph_module.graph) # doesn't show self for the submodule call submod_node = list(graph_module.graph.nodes)[2] print(submod_node.op) # call_module print(submod_node.args) # (flatten,) => would need to have len 2 if self was included flatten_node = list(graph_module.graph.nodes)[1] print(flatten_node.op) # call_method print(flatten_node.args) # (x,) => here self is included (and docs are correct) ``` Since [torch.fx.Interpreter also uses `args` as if self was is not included](https://github.com/pytorch/pytorch/blob/2fe580859012d2d24a54e452195ccbc7f3191036/torch/fx/interpreter.py#L288), I assume the docs are incorrect. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86685 Approved by: https://github.com/soulitzer commit 160118d72a5c8e425ba30495e672d33ee1c94b50 Author: Fang Wang Date: Tue Oct 11 17:52:18 2022 +0000 Add test case for matrix multiply-add with large inputs (#85550) Summary: - Added test case for addmm, baddbmm and linear with large inputs - Testing with torch types: float32, float16, bfloat16 Test Plan: Run unit tests with: `buck2 run mode/opt //caffe2/test:linalg_re_cuda` ``` ... test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda' test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda' test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda' test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda' test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda' test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda' test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda' test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda' test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda' test_addmm_baddbmm_large_input_2_100_100_100_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda' test_addmm_baddbmm_large_input_2_100_100_100_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda' test_addmm_baddbmm_large_input_2_100_100_100_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda' test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... 
ok test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok test_addmm_baddbmm_large_input_2_100_100_100_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok test_addmm_baddbmm_large_input_2_100_100_100_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok test_addmm_baddbmm_large_input_2_100_100_100_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok ---------------------------------------------------------------------- Ran 24 tests in 63.224s OK (skipped=12) ``` Differential Revision: D39718256 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85550 Approved by: https://github.com/IvanYashchuk, https://github.com/malfet commit 212fa874ce0bb8c2d70be2bcd87188c072d1082d Author: vfdev Date: Tue Oct 11 17:52:16 2022 +0000 Fix torch histogramdd docstring (#86593) Fixed torch histogramdd docsting with missing common_args Pull Request resolved: https://github.com/pytorch/pytorch/pull/86593 Approved by: https://github.com/soulitzer commit f26292d91e1bf358d4f4902688433248c931ca68 Author: Jane Xu Date: Tue Oct 11 17:42:51 2022 +0000 [BE] Fix python docs typos up till torch.chunk (#86642) Was doing the Views lab linked https://github.com/pytorch/pytorch/wiki/Tensor-and-Operator-Basics and noticed a few typos, which led to this PR. Test plan: verified in preview Pull Request resolved: https://github.com/pytorch/pytorch/pull/86642 Approved by: https://github.com/soulitzer commit 86f914e9966e91b3d3e7c1504f5b1f00a9498d88 Author: albanD Date: Tue Oct 11 10:35:18 2022 -0400 min/max support for SymInt/Floats, finish as_strided/scatter/squeeze() backward symint support (#86643) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86643 Approved by: https://github.com/anjali411 commit 6923dc3b590e51773ee9e0a536b0863963b91232 Author: Jane Xu Date: Tue Oct 11 17:23:36 2022 +0000 Add module: decompositions as an owner to test_decomp.py (#86703) so flaky tests can be attributed to @SherlockNoMad too 😛 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86703 Approved by: https://github.com/albanD commit 109f4d445382df93ed3afc592fd64719c0b86c01 Author: Richard Zou Date: Tue Oct 11 07:28:20 2022 -0700 Move functorch tests from functorch/test/* to test/functorch/* (#86623) This is the first step described in https://github.com/pytorch/pytorch/issues/86618 . test/functorch/* is the final location for these tests. Test Plan: - Check that the functorch shards in CI are still running tests. 
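Side note on the `torch.histogramdd` docstring fix (#86593) a few commits up: a quick usage sketch (mine, not from that PR):

```python
import torch

points = torch.randn(100, 3)                    # 100 points in 3-D
hist, bin_edges = torch.histogramdd(points, bins=[4, 4, 4])
print(hist.shape)      # torch.Size([4, 4, 4])
print(len(bin_edges))  # one edge tensor per dimension -> 3
```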
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86623 Approved by: https://github.com/huydhn commit 51ea4418621e0236e7ec1ebf3606317ad1430548 Author: Ivan Yashchuk Date: Tue Oct 11 16:39:57 2022 +0000 Upcast to fp32 in test_addmm_block ref_half_bfloat16 (#86682) Fixes https://github.com/pytorch/pytorch/issues/86681 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86682 Approved by: https://github.com/nikitaved commit 3edf79dc03193c98b665d62231fe69a10dfab1fa Author: PyTorch MergeBot Date: Tue Oct 11 16:33:41 2022 +0000 Revert "Add meta support for _adaptive_avg_pool2d_backward (#86359)" This reverts commit a56a8c0fc0251bb4cd24b366a290db2e4beea747. Reverted https://github.com/pytorch/pytorch/pull/86359 on behalf of https://github.com/clee2000 due to causing unexpected success for functorch on master but PR is green (landrace?) https://github.com/pytorch/pytorch/actions/runs/3227306657/jobs/5282180524 https://hud.pytorch.org/pytorch/pytorch/commit/a56a8c0fc0251bb4cd24b366a290db2e4beea747 commit 97de281176d2476aec8cabfae9981c86f6179531 Author: Nicolas Hug Date: Thu Oct 6 11:32:29 2022 +0000 Improve interpolate() speed for channels_last CPU images and masks (#86361) This PR improves the speed of `interpolate()`: - on CPU - on images and masks (`num_channels < 4`, `channels_last=True`) - for the following modes: linear (antialias=False), nearest (int and float), and nearest-exact (int and float) - for both upsampling and downsampling The actual speed-up ranges from 1.1X to 110X, but this depends on various factors like number of threads and of course input_size/output_size. In a typical torchvision ImageNet training job (where num_threads=1 because of DataLoader multi-processing), the following speed-ups should be expected (I ran much more benchmarks than this one, see below for more details): ``` (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=1 1.0X 1.0ms vs 1.0ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 2.1X 1.0ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=1 7X 0.8ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=1 14X 0.852ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=1 9X 0.828ms vs 0.087ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 15X 0.922ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.897ms vs 0.087ms ``` An immediate follow-up to this PR would be to do the same changes for the 3D kernels. Thanks a ton @fmassa for the help! Results:
``` ---------------------------------------------------------------------------------------------------- (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=1 0.9X 0.9ms vs 1.1ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=1 1.6X 0.9ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=1 1.7X 1.0ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=1 8X 0.806ms vs 0.097ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=1 15X 0.848ms vs 0.056ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=1 10X 0.828ms vs 0.084ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=1 16X 0.914ms vs 0.057ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.900ms vs 0.086ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=2 1.6X 1.1ms vs 0.7ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=2 1.6X 0.6ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=2 1.7X 0.6ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=2 1.7X 0.5ms vs 0.3ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=2 9X 0.800ms vs 0.088ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=2 11X 0.459ms vs 0.043ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=2 7X 0.424ms vs 0.064ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=2 12X 0.503ms vs 0.043ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=2 8X 0.461ms vs 0.059ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=12 3X 1.1ms vs 0.3ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=12 1.6X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=12 1.5X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=12 5X 0.8ms vs 0.2ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=12 10X 0.445ms vs 0.047ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=12 7X 0.432ms vs 0.062ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=12 10X 0.478ms vs 0.046ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=12 7X 0.470ms vs 0.063ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=32 3X 1.1ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=32 1.8X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=32 1.4X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=32 11X 0.815ms vs 0.074ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.045ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=32 7X 0.436ms vs 0.061ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=32 10X 0.478ms vs 0.046ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.061ms ---------------------------------------------------------------------------------------------------- (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=1 0.9X 
0.9ms vs 1.1ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=1 1.5X 0.9ms vs 0.6ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=1 1.6X 1.0ms vs 0.6ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=1 8X 0.808ms vs 0.099ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=1 15X 0.848ms vs 0.058ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=1 9X 0.820ms vs 0.087ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=1 16X 0.909ms vs 0.059ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.898ms vs 0.088ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=2 1.4X 0.9ms vs 0.7ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=2 1.5X 0.5ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=2 1.5X 0.5ms vs 0.4ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=2 1.8X 0.5ms vs 0.3ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=2 9X 0.799ms vs 0.090ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=2 10X 0.459ms vs 0.045ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=2 7X 0.427ms vs 0.059ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=2 11X 0.501ms vs 0.044ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=2 8X 0.460ms vs 0.060ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=12 2.9X 1.0ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=12 1.2X 0.2ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=12 1.1X 0.2ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=12 12X 0.809ms vs 0.068ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=12 11X 0.438ms vs 0.041ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=12 8X 0.432ms vs 0.055ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=12 12X 0.480ms vs 0.041ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=12 8X 0.464ms vs 0.056ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=32 3X 1.1ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=32 1.3X 0.3ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=32 1.4X 0.3ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=32 11X 0.813ms vs 0.075ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.046ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=32 7X 0.433ms vs 0.061ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=32 10X 0.478ms vs 0.046ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.062ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=1 0.9X 4.5ms vs 5.2ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=1 1.5X 4.2ms 
vs 2.8ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=1 1.8X 4.1ms vs 2.3ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=1 1.6X 4.5ms vs 2.8ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=1 1.9X 4.4ms vs 2.3ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=1 9X 3.8ms vs 0.4ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=1 17X 4.0ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=1 11X 3.9ms vs 0.4ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=1 19X 4.4ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=1 12X 4.3ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=2 1.5X 4.5ms vs 3.1ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=2 1.4X 2.3ms vs 1.6ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=2 1.7X 2.1ms vs 1.2ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=2 1.6X 2.5ms vs 1.6ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=2 1.8X 2.2ms vs 1.2ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=2 15X 3.8ms vs 0.3ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=2 15X 2.2ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=2 7X 2.0ms vs 0.3ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=2 16X 2.4ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=2 8X 2.2ms vs 0.3ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=12 8X 5.2ms vs 0.7ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=12 1.3X 0.6ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=12 1.7X 0.4ms vs 0.2ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=12 1.4X 0.6ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=12 1.8X 0.4ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=12 36X 3.9ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=12 10X 0.526ms vs 0.051ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=12 7X 0.514ms vs 0.069ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=12 11X 0.569ms vs 0.052ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=12 8X 0.557ms vs 0.070ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=32 9X 4.5ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=32 0.5X 0.2ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=32 1.0X 0.5ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=32 44X 3.864ms vs 0.087ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=32 10X 0.527ms vs 0.053ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=32 7X 0.516ms vs 0.070ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=32 10X 0.567ms vs 0.055ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=32 8X 0.558ms vs 0.072ms ---------------------------------------------------------------------------------------------------- (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=1 1.0X 1.9ms vs 1.9ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=1 2.0X 1.8ms vs 0.9ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=1 1.7X 1.8ms vs 1.0ms (1, 3, 256, 256) -> (320, 320) nearest-exact 
float32 num_threads=1 2.1X 1.9ms vs 0.9ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=1 1.9X 1.9ms vs 1.0ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=1 9X 1.6ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=1 16X 1.7ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=1 10X 1.7ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=1 17X 1.9ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=1 11X 1.8ms vs 0.2ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=2 1.7X 1.9ms vs 1.1ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=2 2.0X 1.0ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=2 1.7X 0.9ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=2 2.3X 1.1ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=2 1.8X 1.0ms vs 0.5ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=2 8X 1.6ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=2 14X 0.931ms vs 0.067ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=2 7X 0.9ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=2 15X 1.016ms vs 0.069ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=2 9X 0.9ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=12 8X 1.9ms vs 0.3ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=12 1.7X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=12 1.9X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=12 20X 1.630ms vs 0.081ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=12 10X 0.457ms vs 0.044ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=12 7X 0.439ms vs 0.060ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=12 11X 0.485ms vs 0.045ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=12 8X 0.474ms vs 0.061ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=32 8X 1.9ms vs 0.3ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=32 2.0X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=32 1.4X 0.2ms vs 0.2ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=32 21X 1.628ms vs 0.078ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=32 9X 0.453ms vs 0.048ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=32 7X 0.445ms vs 0.063ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=32 11X 0.535ms vs 0.048ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=32 8X 0.502ms vs 0.063ms ---------------------------------------------------------------------------------------------------- (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=1 1.0X 13.8ms vs 14.0ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=1 1.8X 13.1ms vs 7.4ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=1 1.8X 11.1ms vs 6.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=1 1.9X 13.9ms vs 7.4ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=1 1.9X 
11.8ms vs 6.1ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=1 10X 10.2ms vs 1.1ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=1 19X 10.8ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=1 11X 10.4ms vs 0.9ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=1 20X 11.6ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=1 12X 11.4ms vs 0.9ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=2 1.8X 13.7ms vs 7.7ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=2 2.6X 7.3ms vs 2.8ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=2 1.8X 5.6ms vs 3.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=2 1.9X 7.9ms vs 4.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=2 1.9X 6.0ms vs 3.1ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=2 18X 10.1ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=2 19X 5.8ms vs 0.3ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=2 10X 5.3ms vs 0.5ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=2 20X 6.3ms vs 0.3ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=2 11X 5.7ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=12 8X 13.8ms vs 1.6ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=12 2.9X 1.5ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=12 1.7X 1.0ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=12 1.5X 1.5ms vs 1.0ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=12 1.8X 1.0ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=12 80X 10.1ms vs 0.1ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=12 13X 0.928ms vs 0.072ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=12 8X 0.9ms vs 0.1ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=12 13X 1.001ms vs 0.074ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=12 9X 1.0ms vs 0.1ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=32 18X 14.0ms vs 0.8ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=32 1.9X 1.0ms vs 0.6ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=32 2.9X 0.7ms vs 0.2ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=32 1.7X 0.9ms vs 0.6ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=32 1.8X 0.4ms vs 0.2ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=32 111X 10.254ms vs 0.092ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=32 14X 0.784ms vs 0.056ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=32 7X 0.551ms vs 0.075ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=32 11X 0.607ms vs 0.057ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=32 8X 0.596ms vs 0.076ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=1 1.0X 0.084ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=1 1.0X 0.077ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=1 1.0X 0.076ms vs 0.076ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=1 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=1 1.0X 0.081ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=1 1.0X 0.071ms vs 0.071ms (1, 1, 224, 224) 
-> (64, 64) nearest float32 num_threads=1 1.0X 0.074ms vs 0.074ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=1 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=1 1.0X 0.080ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=1 0.9X 0.078ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=2 1.0X 0.083ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=2 1.0X 0.076ms vs 0.077ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=2 1.0X 0.075ms vs 0.074ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=2 1.0X 0.082ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=2 1.0X 0.080ms vs 0.083ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=2 1.0X 0.070ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=2 1.0X 0.073ms vs 0.075ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=2 1.0X 0.071ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=2 1.0X 0.079ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=2 1.0X 0.077ms vs 0.079ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=12 1.0X 0.083ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=12 1.0X 0.080ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=12 1.0X 0.077ms vs 0.075ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=12 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=12 1.0X 0.083ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=12 1.0X 0.071ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=12 1.0X 0.076ms vs 0.074ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=12 1.0X 0.073ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=12 1.0X 0.080ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=12 1.0X 0.080ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=32 1.0X 0.084ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=32 1.0X 0.078ms vs 0.077ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=32 1.0X 0.076ms vs 0.076ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=32 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=32 1.0X 0.081ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=32 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=32 1.0X 0.074ms vs 0.075ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=32 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=32 1.0X 0.077ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=32 1.0X 0.076ms vs 0.079ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=1 1.0X 0.3ms vs 0.3ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=1 1.8X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=1 1.6X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=1 2.0X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=1 1.7X 0.3ms vs 0.2ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=1 6X 0.265ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=1 10X 0.280ms vs 0.028ms 
(1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=1 7X 0.273ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=1 11X 0.303ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=1 8X 0.297ms vs 0.038ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=2 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=2 1.8X 0.163ms vs 0.093ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=2 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=2 1.9X 0.180ms vs 0.096ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=2 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=2 6X 0.264ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=2 10X 0.278ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=2 7X 0.270ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=2 11X 0.298ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=2 8X 0.293ms vs 0.037ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=12 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=12 1.7X 0.158ms vs 0.095ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=12 1.7X 0.170ms vs 0.100ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=12 6X 0.269ms vs 0.043ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=12 11X 0.291ms vs 0.027ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=12 8X 0.281ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=12 11X 0.305ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=12 8X 0.306ms vs 0.038ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=32 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=32 1.6X 0.160ms vs 0.098ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=32 1.7X 0.171ms vs 0.099ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=32 6X 0.269ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=32 10X 0.282ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=32 7X 0.276ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=32 11X 0.305ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=32 8X 0.299ms vs 0.038ms ---------------------------------------------------------------------------------------------------- (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=1 1.0X 1.2ms vs 1.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=1 2.0X 1.2ms vs 0.6ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=1 1.7X 1.1ms vs 0.7ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=1 2.1X 1.2ms vs 0.6ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=1 1.9X 1.2ms vs 0.7ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=1 8X 1.1ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=1 15X 1.109ms vs 0.073ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=1 10X 1.1ms 
vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=1 16X 1.192ms vs 0.074ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=1 11X 1.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=2 1.7X 1.2ms vs 0.7ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=2 2.0X 0.6ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=2 1.7X 0.6ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=2 2.2X 0.7ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=2 1.8X 0.6ms vs 0.3ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=2 9X 1.0ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=2 11X 0.598ms vs 0.052ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=2 8X 0.556ms vs 0.072ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=2 12X 0.649ms vs 0.053ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=2 8X 0.598ms vs 0.073ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=12 5X 1.2ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=12 1.3X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=12 1.4X 0.2ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=12 9X 1.0ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=12 12X 0.572ms vs 0.048ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=12 8X 0.560ms vs 0.068ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=12 13X 0.617ms vs 0.049ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=12 9X 0.604ms vs 0.068ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=32 5X 1.2ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=32 13X 1.042ms vs 0.081ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=32 12X 0.586ms vs 0.050ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=32 8X 0.562ms vs 0.069ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=32 12X 0.621ms vs 0.051ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=32 9X 0.609ms vs 0.070ms ---------------------------------------------------------------------------------------------------- (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=1 1.0X 1.0ms vs 1.0ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 2.1X 1.0ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=1 7X 0.8ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=1 14X 0.852ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=1 9X 0.828ms vs 0.087ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 15X 0.922ms vs 0.061ms (1, 1, 
600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.897ms vs 0.087ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=2 1.6X 0.9ms vs 0.6ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=2 1.9X 0.5ms vs 0.2ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=2 2.1X 0.5ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=2 1.8X 0.5ms vs 0.3ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=2 10X 0.808ms vs 0.084ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=2 10X 0.462ms vs 0.046ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=2 7X 0.429ms vs 0.062ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=2 12X 0.504ms vs 0.044ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=2 7X 0.461ms vs 0.063ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=12 4X 1.0ms vs 0.2ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=12 1.7X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=12 1.9X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=12 12X 0.820ms vs 0.067ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=12 11X 0.438ms vs 0.041ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=12 8X 0.431ms vs 0.056ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=12 12X 0.482ms vs 0.041ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=12 8X 0.467ms vs 0.056ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=32 4X 1.0ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=32 1.7X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=32 1.8X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=32 12X 0.824ms vs 0.070ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.044ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=32 7X 0.438ms vs 0.059ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=32 11X 0.479ms vs 0.045ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.059ms ---------------------------------------------------------------------------------------------------- (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=1 1.0X 4.7ms vs 4.7ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=1 2.0X 4.4ms vs 2.2ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=1 1.8X 4.3ms vs 2.5ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=1 2.1X 4.7ms vs 2.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=1 1.9X 4.6ms vs 2.5ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=1 9X 4.0ms vs 0.4ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=1 17X 4.2ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=1 11X 4.1ms vs 0.4ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=1 19X 4.6ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=1 12X 4.5ms vs 0.4ms (1, 3, 800, 800) -> (500, 
500) linear float32 num_threads=2 1.7X 4.7ms vs 2.7ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=2 2.1X 2.4ms vs 1.1ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=2 1.8X 2.2ms vs 1.3ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=2 2.3X 2.6ms vs 1.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=2 1.9X 2.3ms vs 1.3ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=2 15X 4.0ms vs 0.3ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=2 16X 2.3ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=2 9X 2.1ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=2 17X 2.5ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=2 10X 2.3ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=12 10X 4.7ms vs 0.5ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=12 1.9X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=12 1.7X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=12 1.9X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=12 1.8X 0.4ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=12 41X 3.969ms vs 0.096ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=12 11X 0.545ms vs 0.051ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=12 8X 0.532ms vs 0.070ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=12 11X 0.590ms vs 0.052ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=12 8X 0.578ms vs 0.071ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=32 17X 4.7ms vs 0.3ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=32 1.8X 0.2ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=32 2.0X 0.3ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=32 1.9X 0.2ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=32 45X 4.028ms vs 0.090ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=32 10X 0.549ms vs 0.053ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=32 7X 0.536ms vs 0.072ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=32 11X 0.592ms vs 0.055ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=32 8X 0.581ms vs 0.074ms ```
Code:
I used this file which is adapted from https://github.com/pytorch/pytorch/blob/master/benchmarks/operator_benchmark/pt/interpolate_test.py

```py
import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""


class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', dtype=torch.float):
        input_image = torch.randint(0, 256, size=input_size, dtype=dtype, device='cpu', requires_grad=self.auto_set())
        if channels_last:
            if input_image.ndim == 4:
                input_image = input_image.contiguous(memory_format=torch.channels_last)
            elif input_image.ndim == 5:
                input_image = input_image.contiguous(memory_format=torch.channels_last_3d)
            else:
                raise ValueError(
                    f"Can not set channels_last to the input of {input_image.ndim} dims"
                )

        align_corners = None if "nearest" in mode else False

        if mode == "linear":
            mode = {
                3: 'linear',
                4: 'bilinear',
                5: 'trilinear',
            }[input_image.ndim]

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "align_corners": align_corners,
        }

        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, align_corners):
        return torch.nn.functional.interpolate(input_image, size=output_size, mode=mode, align_corners=align_corners)


def make_config():
    sizes = (
        ((224, 224), (64, 64)),
        ((224, 224), (128, 128)),
        ((600, 400), (224, 224)),
        ((320, 320), (256, 256)),
        ((800, 800), (500, 500)),
    )

    attrs = []
    for (HW1, HW2) in sizes:
        attrs.append([(1, 3, *HW1), HW2])  # 3 channels
        attrs.append([(1, 1, *HW1), HW2])  # 1 channel
        attrs.append([(1, 3, *HW2), HW1])  # 3 channels
        attrs.append([(1, 1, *HW2), HW1])  # 1 channel

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True],
            'mode': ["linear", "nearest", "nearest-exact"],
            'dtype': [torch.float, torch.uint8]
        },
        tags=["short"],
    )

    def get_mode(l):
        for d in l:
            if "mode" in d:
                return d["mode"]

    def get_dtype(l):
        for d in l:
            if "dtype" in d:
                return d["dtype"]

    config = [l for l in config if not(get_mode(l) == "linear" and get_dtype(l) == torch.uint8)]
    return config


config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)


if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```

with

```
for num_threads in 1 2 12 32; do echo "num_threads=$num_threads" && python -m pt.my_interpolate_test --iterations 1000 --omp_num_threads $num_threads ; done > $out_file
```

and this very ugly helper

```py
import re

with open("main") as f:
    main = f.readlines()
with open("new") as f:
    new = f.readlines()

out = []
for main_line, new_line in zip(main, new):
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")
        size = ','.join(split[:-3])
        mode, dtype, threads = split[-3:]
        deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"
        l = f"{deets} {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)


def key(s):
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)
    input_shape, output_shape = re.findall("\(.*?\)", s)
    input_shape = input_shape[1:-1]  # remove parenthesis
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)
    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)

    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        assert "nearest" in s
        mode = "nearest"
    mode = (mode,)

    return is_downsample + input_HW + output_HW + num_threads + input_C + mode


for i, l in enumerate(sorted(out, key=key)):
    if i % 10 == 0 and i % 40 != 0:
        print()
    if i % 40 == 0:
        print("-" * 100)
    print(l)
```
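As a quick sanity check outside the operator_benchmark harness, a minimal standalone timing of the same channels_last interpolate call could look like the sketch below (a hypothetical helper, not part of this PR; the shape and mode are taken from one of the rows above):

```py
import time
import torch
import torch.nn.functional as F

def time_interpolate(shape=(1, 3, 500, 500), size=(800, 800), mode="nearest", iters=100):
    # channels_last is the memory format whose upsampling path this PR speeds up
    x = torch.rand(shape).contiguous(memory_format=torch.channels_last)
    F.interpolate(x, size=size, mode=mode)  # warm-up call
    start = time.perf_counter()
    for _ in range(iters):
        F.interpolate(x, size=size, mode=mode)
    return (time.perf_counter() - start) / iters

print(f"{time_interpolate() * 1e3:.3f} ms per call")
```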
Closes https://github.com/pytorch/pytorch/issues/83840 When this is merged we should be able to remove some hack in vision as well https://github.com/pytorch/vision/pull/6661 (CC @vfdev-5 @datumbox ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86361 Approved by: https://github.com/vfdev-5, https://github.com/datumbox, https://github.com/fmassa commit a4ee6956ff074f82c1306d6555d900b48a4b3de0 Author: Nikita Shulga Date: Tue Oct 11 16:11:47 2022 +0000 Pin numpy version during MPS tests (#86691) numpy-1.23.1 for some reason can not be loaded on M1 Fixes https://github.com/pytorch/pytorch/issues/86688 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86691 Approved by: https://github.com/DanilBaibak, https://github.com/atalman, https://github.com/seemethere commit 352d9264822b8064b0c0792bc00492e69e569a37 Author: eqy Date: Tue Oct 11 16:03:49 2022 +0000 [CUBLAS][CUDA GRAPHS] (re-re-re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#86645) re-opening (again) in hopes of working around failed/stuck CLA check CC @ptrblck @ngimel @huydhn Pull Request resolved: https://github.com/pytorch/pytorch/pull/86645 Approved by: https://github.com/zdevito commit 937d677d9f588ba9cddcac64ecfcad7ace9e8a58 Author: Richard Zou Date: Mon Oct 10 08:08:51 2022 -0700 Add version selector back to functorch docs (#86602) I accidentally deleted it in https://github.com/pytorch/pytorch/pull/85856/ . This brings the version selector back. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86602 Approved by: https://github.com/samdow commit a56a8c0fc0251bb4cd24b366a290db2e4beea747 Author: anjali411 Date: Mon Oct 10 20:28:32 2022 +0000 Add meta support for _adaptive_avg_pool2d_backward (#86359) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86359 Approved by: https://github.com/ezyang, https://github.com/albanD commit 03d8ab4decdd9a7391ea6c026d0b095708288ca7 Author: Ivan Yashchuk Date: Tue Oct 11 13:03:20 2022 +0000 Skip forward AD tests for torch.native_batch_norm (#86206) `test_forward_mode_AD` has problems with `torch.native_batch_norm` when computing Jacobian using finite-differences. Weirdly this test unexpectedly passed on periodic CI. Let's skip this test instead of xfailing. Fixes https://github.com/pytorch/pytorch/issues/86175 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86206 Approved by: https://github.com/soulitzer commit 6ab07febcea936e75bc95d3ebdbb087b2033ba11 Author: Andrew Gu Date: Tue Oct 11 01:37:54 2022 +0000 [FSDP][Easy] Rename `_prefixed_param_names` -> `_fqns` for consistency (#86653) This renames `_prefixed_param_names` to `_fqns` to help converge on the terminology. 
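For context, "FQN" here means the fully qualified, dotted parameter name as produced by `named_parameters()`; a minimal illustration (not part of the original commit):

```py
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 2), nn.ReLU())
# Each parameter's FQN is its dotted path in the module tree, e.g. "0.weight" here;
# inside a submodule named "encoder" it would be prefixed as "encoder.0.weight".
print([name for name, _ in model.named_parameters()])  # ['0.weight', '0.bias']
```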
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86653 Approved by: https://github.com/rohan-varma commit 2fe580859012d2d24a54e452195ccbc7f3191036 Author: albanD Date: Mon Oct 10 20:19:30 2022 -0400 Symintify NLL loss, copy and squeeze (#86606) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86606 Approved by: https://github.com/anjali411 commit be8627827e0e9ee3769335641aacf0193f66e476 Author: albanD Date: Mon Oct 10 20:19:30 2022 -0400 More symintification of get/set item (#86605) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86605 Approved by: https://github.com/anjali411 commit f84144225242c476a674886af1470220b915fe51 Author: albanD Date: Mon Oct 10 18:13:59 2022 -0400 symintify autograd view chaining (#86604) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86604 Approved by: https://github.com/anjali411 commit 49c9b0a1541f596bca2671ea52fa646dc560ebb7 Author: albanD Date: Mon Oct 10 18:13:59 2022 -0400 symintify einsum (#86603) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86603 Approved by: https://github.com/anjali411 commit 3a2cfbb813e19c1648b23079f704829f9997425d Author: PyTorch MergeBot Date: Tue Oct 11 10:17:27 2022 +0000 Revert "Improve interpolate() speed for channels_last images and masks (#86361)" This reverts commit 93b2d991581db86074dd8011fdc903bd554466b1. Reverted https://github.com/pytorch/pytorch/pull/86361 on behalf of https://github.com/DanilBaibak due to Break the internal import process commit 17074389dec0ee3e2a949fb75bb51cde471d17fe Author: Jianyu Huang Date: Tue Oct 11 06:12:17 2022 +0000 index op with int32 support (#86318) Differential Revision: D40089960 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86318 Approved by: https://github.com/malfet commit 88a8a900b90fc7ff2f0e67c4e716520ac7fac75f Author: kshitij12345 Date: Tue Oct 11 05:40:12 2022 +0000 fix: half reduction with multiple sub-iterators (#85596) Fixes #74438 TODO: * [x] Add test Pull Request resolved: https://github.com/pytorch/pytorch/pull/85596 Approved by: https://github.com/ngimel commit 55479fe80ee8df9750c2a4d1022943d04c3e46d6 Author: Louis Feng Date: Tue Oct 11 04:38:26 2022 +0000 Enable capturing of comm collective parameters (#98) (#85368) Summary: X-link: https://github.com/facebookresearch/torch_ucc/pull/98 Add tensor input, output, and other metadata for PyTorch comms. Test Plan: P517138779 Reviewed By: Pavani-Panakanti Differential Revision: D38357077 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85368 Approved by: https://github.com/H-Huang commit ad2b04c39c41949d8869de743736bcaeec2dfa0d Author: PyTorch MergeBot Date: Tue Oct 11 03:28:58 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#86651) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86651 Approved by: https://github.com/pytorchbot commit bd381121b9e1e32b1a4acef1504c6e843560e24e Author: PyTorch MergeBot Date: Tue Oct 11 03:24:30 2022 +0000 [vision hash update] update the pinned vision hash (#86652) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86652 Approved by: https://github.com/pytorchbot commit deb414a43fea7fab883858daf13890d9367b68ec Author: PyTorch MergeBot Date: Tue Oct 11 02:50:47 2022 +0000 Revert "Use FindCUDAToolkit to find cuda dependencies (#82695)" This reverts commit fb9b96593c784b86b3d913ef8799ee120c203207. Reverted https://github.com/pytorch/pytorch/pull/82695 on behalf of https://github.com/malfet due to Break cublas packaging into wheel commit 577070ff961ed10224b6a8294cbebbea2da77d7f Author: Jianyu Huang Date: Tue Oct 11 02:15:51 2022 +0000 update fbgemm commit ID in PyTorch (#86577) Summary: Update after https://github.com/pytorch/FBGEMM/pull/1388 . Previous issue: D40216348 Test Plan: CI Differential Revision: D40219252 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86577 Approved by: https://github.com/malfet commit d8b971ed259b9ea37f8b4fb360b4aeea6a54a938 Author: Will Constable Date: Tue Oct 11 01:42:26 2022 +0000 Fixes for partitioner with symbolic shapes (#86425) - supports saving symint (and symfloat..) values between fw/bwd, using sketchy logic that probably needs to be improved but seems to work so far - sets a correct weight=1 for sym nodes for cost purposes - lets user functions return symints/floats (but if the same symfloat is saved for backward, that gets duplicated annoyingly) - makes partitioning decisions based on observed trace-time sizes without guarding! (this is sketchy, but it isn't clear that it will lead to bad partitioning choices either) - improves infra for tracking symint-family of types: is_sym_node() and _py_sym_types Pull Request resolved: https://github.com/pytorch/pytorch/pull/86425 Approved by: https://github.com/ezyang commit 16f65f178a6a51e9d25fa6ee73e21325c9b348cd Author: Driss Guessous Date: Tue Oct 11 01:21:37 2022 +0000 Nested tensor forward only chunk operations (#85645) Taking over this pr: https://github.com/pytorch/pytorch/pull/83736 Adding support for chunk without autograd support Pull Request resolved: https://github.com/pytorch/pytorch/pull/85645 Approved by: https://github.com/cpuhrsch commit 4fc0d5341cc58617376c33a2a2c47c2439f4e222 Author: Alan Lin Date: Tue Oct 11 01:21:16 2022 +0000 [PyTorch][Fix] Improve numerical stability of HistogramObserver (#86522) Summary: As titled, HistogramObserver may fail in a certain scenario. Specifically, we originally compute `hist_bin_width` as `(self.max_val - self.min_val) / (self.bins * upsample_rate)`. It's possible that the numerator part is close the the FP32 threshold (1.4e-45) and conducting the division will cause overflow. Bring some redundent computations to avoid such scenario. Test Plan: https://pxl.cl/2ggD4 (https://github.com/pytorch/pytorch/commit/04490e90ea59229355b2771893719fe8896e80f0) Differential Revision: D40149594 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86522 Approved by: https://github.com/jerryzh168 commit 8a47a49d5ee19590876b2afcf12b8950c72d81ba Author: Jerry Zhang Date: Mon Oct 10 13:58:09 2022 -0700 [quant] Move the order of x86 engine to avoid changing the default qengine (#86631) since the default qengine is the last element of the engine in supported_engines list, adding x86 qengine in the end of the list changes the default quantized engine as well. this PR will be a short term fix to revert the changes. 
We have an issue here to track the proper fix: https://github.com/pytorch/pytorch/issues/86404 Motivation: a meta internal team found that the inference failed in onednn prepacking with error: "could not create a primitive descriptor for a reorder primitive." in a COPPER_LAKE machine, we are working with intel to repro and fix the problem. in the mean time, we'll revert the changes of default option back to fbgemm Pull Request resolved: https://github.com/pytorch/pytorch/pull/86631 Approved by: https://github.com/vkuzo commit 224ae0da107ee426a2e19dab3eee52b6252f842f Author: Nikita Shulga Date: Mon Oct 10 23:52:28 2022 +0000 [BE] Fix variable shadowing in CUDACachingAllocator.cpp (#86646) Test Plan: CI Differential Revision: D40245365 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86646 Approved by: https://github.com/seemethere commit 2cb330ab15f6a40458272a421f8731069f3e2043 Author: jjsjann123 Date: Mon Oct 10 23:48:52 2022 +0000 Acyclic partition patch (#86511) Fixes #86159 and #86108 Refactored graph partition to check for cyclic dependency on each partition merge, instead of relying on a pre-baked dependency map. The previous implementation suffers from not updating dependency on existing partition. When a fusion happens, the updated dependency map needs to be propagated to all nodes in the graph, so each node in a partition shares an identical dependency set. Previous implementation suffers from the not identifying cyclic dependency in issue #86159. Updated implementation does a cyclic check on partitioned graph before attempting a merge of two partitions. - [x] python repro added with cyclic dependency after partition `TestFXGraphPasses.forward12` - [x] fix dependency map with updated implementation using cyclic check Pull Request resolved: https://github.com/pytorch/pytorch/pull/86511 Approved by: https://github.com/SherlockNoMad commit dd6dd03ff27a1a0e89bad83b6bcb0794116812d9 Author: jjsjann123 Date: Mon Oct 10 23:31:21 2022 +0000 Enable output allocation cache (#86100) Cherry-picked from devel branch: https://github.com/csarofeen/pytorch/pull/2010 turns on accidentally disabled output allocation cache [#2002](https://github.com/csarofeen/pytorch/issues/2002) Updated check for safety regarding allocation cache by iterating all IterDomain on outputs and enables cache re-use only when no extent value is a consumer of fusion inputs (output sizes is not dependent on scalar inputs). Pull Request resolved: https://github.com/pytorch/pytorch/pull/86100 Approved by: https://github.com/csarofeen commit 82ed5ca3401e965067fd03a6bac57978f884f715 Author: Akshit Khurana Date: Mon Oct 10 22:32:44 2022 +0000 [Vulkan] Don't crash immediately if Vulkan context could not be retrieved (#86485) Test Plan: Internal AIBench test Reviewed By: SS-JIA Differential Revision: D40151818 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86485 Approved by: https://github.com/kimishpatel commit b409d1f65b8c1e2607e250526d215d6a2ae8ef01 Author: Elias Ellison Date: Fri Oct 7 19:18:41 2022 +0000 Turn on Data Dependent Throwing (#86480) This was already enabled in TorchDynamo, but was staged to make sure things don't break. Also makes backward single threaded for tests to fix a memory leak. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86480 Approved by: https://github.com/bdhirsh commit ce7751188afb42263ebda159d6ee7a343a833cc1 Author: Andrew Gu Date: Mon Oct 10 18:05:12 2022 +0000 [DDP] Add `PackedSequence` support when `device_ids` is specified (#86614) Before this PR, if a user runs DDP with `device_ids` specified and with a `PackedSequence` input, then the execution will error with something like:
```
raise ValueError(
ValueError: batch_sizes should always be on CPU. Instances of PackedSequence should never be created manually.
They should be instantiated by functions like pack_sequence and pack_padded_sequences in nn.utils.rnn.
https://pytorch.org/docs/stable/nn.html...
```
This is because the DDP forward calls `_to_kwargs()`, which calls `_recursive_to()`, which moves the inputs to GPU. However, `_is_namedtuple(packed_sequence)` returns `True`, leading to the branch `return [type(obj)(*args) for args in zip(*map(to_map, obj))]`, which tries to construct a `PackedSequence` directly via `type(obj)(*args)`, leading to the error. Repro for `_is_namedtuple(packed_sequence)` returning `True`:
```
import random
import torch
import torch.nn.utils.rnn as rnn_utils
from torch.nn.parallel.scatter_gather import _is_namedtuple

def _ordered_sequence(tensor_type):
    seqs = [tensor_type(random.randint(1, 256)) for _ in range(32)]
    seqs = [s.random_(-128, 128) for s in seqs]
    ordered = sorted(seqs, key=len, reverse=True)
    return ordered

def _padded_sequence(tensor_type):
    ordered = _ordered_sequence(tensor_type)
    lengths = [len(i) for i in ordered]
    padded_tensor = rnn_utils.pad_sequence(ordered)
    return padded_tensor, lengths

padded, lengths = _padded_sequence(torch.Tensor)
packed = rnn_utils.pack_padded_sequence(
    padded, lengths, enforce_sorted=False)
print(type(packed), packed.data.device)
print(_is_namedtuple(packed))
```
Test Plan:
```
python test/distributed/test_c10d_nccl.py -k test_ddp_packed_sequence
```
Without the fix, the added unit test fails with the expected error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86614 Approved by: https://github.com/rohan-varma commit b7b5bd47ae3d5e82ef98e34c406e68b8dc12e448 Author: Nikita Shulga Date: Mon Oct 10 20:36:22 2022 +0000 [MPS] Implement `frac` operator (#86625) Implemented as a combination of self and trunc. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/86625 Approved by: https://github.com/kulinseth, https://github.com/albanD commit 885122b7dc3367c1feaf01410ad625acccf816dc Author: David Reiss Date: Mon Oct 10 17:41:31 2022 +0000 Move PadNd from ATen/native to ATen (#82379) Summary: This header is being included from both aten/native and torch/csrc, but some of our build configurations don't allow direct dependencies from torch/csrc to aten/native, so put the header in aten where it's always accessible. Resolves https://github.com/pytorch/pytorch/issues/81198 Test Plan: CI.
``` ./scripts/build_android.sh env ANDROID_ABI="x86_64" ANDROID_NDK=".../ndk-bundle" CMAKE_CXX_COMPILER_LAUNCHER=ccache CMAKE_C_COMPILER_LAUNCHER=ccache USE_VULKAN=0 ./scripts/build_android.sh echo '#include ' > test.cpp g++ -E -I $PWD/build_android/install/include/ -I $PWD/build_android/install/include/torch/csrc/api/include test.cpp >/dev/null ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/82379 Approved by: https://github.com/ezyang, https://github.com/malfet commit e2a4dfa468330c0587849bea4896ff5fffb33010 Author: anjali411 Date: Sun Oct 9 16:01:31 2022 +0000 Add correct __all__ for torch.distributed and torch.cuda submodules (#85702) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85702 Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/rohan-varma commit d93b1b9c4ed6a30e5982ebfa15807d0c497cb837 Author: Rohan Varma Date: Mon Oct 10 18:42:35 2022 +0000 Address feedback from previous PR (#86622) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86622 Approved by: https://github.com/albanD commit d792d75091416b74b55bd281874a21c4f960cc73 Author: Jerry Zhang Date: Mon Oct 10 05:55:54 2022 +0000 [quant][fix] Fix the call to get_executorch_backend_config (#86338) Summary: previously the call failed because there was an infinite loop in _get_share_qparams_ops_configs Test Plan: python test/test_quantization.py -k test_get_executorch_backend_config Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86338 Approved by: https://github.com/andrewor14 commit 2288a1c8065dd1d43410089719028087bd40e997 Author: Sean Ross-Ross Date: Fri Oct 7 13:46:43 2022 -0500 Added new option any_common_cpu_cuda_one to OpDTypes (#86286) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86286 Approved by: https://github.com/lezcano, https://github.com/mruberry commit 8f2dda5bf2451765cebeb111a97be40f70c95989 Author: Alex Date: Mon Oct 10 17:42:13 2022 +0000 [CI] Build MacOS M1 binaries without distributed support (#86451) Partial fix for #86448 which causes the broken code to be exercised in CI. If this demonstrates the break, I'm not sure whether there should be a fix forward of https://github.com/pytorch/pytorch/pull/85781 or a revert Pull Request resolved: https://github.com/pytorch/pytorch/pull/86451 Approved by: https://github.com/malfet commit dcc3ae98b7278d9d85be853bfcd070b2a081003f Author: Driss Guessous Date: Mon Oct 10 17:37:19 2022 +0000 [NestedTensor] Add a contiguous checks to get_buffer (#86496) Many NestedTensor ops are implemented using a connivence function named get_buffer. This returns a dense, contiguous tensor that is a view of the underlying storage of the NestedTensor. This function allows NestedTensor ops to piggy back off of the implementations for dense tensor under certain scenarios. This PR adds a TORCH_CHECK() to get buffer to insure that the calling NT is in fact contiguous. It also adds an "unsafe" version for a few ops that are designed to handle contiguity. 
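To make the checked-versus-unsafe accessor pattern described above concrete, here is a hedged, purely illustrative Python stand-in (the real check is a C++ `TORCH_CHECK` inside `get_buffer`; the class and helper names below are invented for the sketch):

```py
from dataclasses import dataclass
from typing import List

@dataclass
class ToyNestedTensor:
    buffer: List[float]          # flat storage backing all constituent tensors
    is_contiguous: bool = True   # whether that storage matches the logical layout

def get_buffer(nt: ToyNestedTensor) -> List[float]:
    # Mirrors the new check: only a contiguous NestedTensor's storage may be
    # reinterpreted as a single dense tensor.
    if not nt.is_contiguous:
        raise RuntimeError("get_buffer() expects a contiguous NestedTensor")
    return _get_buffer_unsafe(nt)

def _get_buffer_unsafe(nt: ToyNestedTensor) -> List[float]:
    # "Unsafe" variant for the few ops written to handle non-contiguous layouts themselves.
    return nt.buffer

print(get_buffer(ToyNestedTensor(buffer=[1.0, 2.0, 3.0])))
```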
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86496 Approved by: https://github.com/albanD, https://github.com/cpuhrsch commit ad449b338feaa1520d2724726bcbd613a0e15b55 Author: Howard Huang Date: Fri Oct 7 09:04:13 2022 -0700 [8/N] [Dispatchable Collectives] Update allgather with CPU / CUDA implementations  (#84423) - Updates for the allgather collective https://github.com/pytorch/pytorch/issues/86225 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84423 Approved by: https://github.com/kwen2501 commit 9eb771583ce56fba9c78a80681cac47ee07b3f49 Author: anjali411 Date: Mon Oct 10 13:16:01 2022 +0000 symintify rand and randint functions and meta suport for randint (#86358) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86358 Approved by: https://github.com/ezyang, https://github.com/albanD commit 67358ee124e6e826b63a854f7bc5b341e7734406 Author: David Date: Mon Oct 10 16:57:52 2022 +0000 MaxPool: correct pooling description (#86559) In the documentation of `nn.MaxPool2d` and `nn.MaxPool3d`, the argument description of `padding` incorrectly states that zero padding is applied. The remainder of the documentation correctly states that negative infinity padding is applied. The documentation of `padding` in `nn.MaxPool1d`, `nn.functional.max_pool1d/2d/3d` is correct. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86559 Approved by: https://github.com/albanD commit 16a0fa1204edb118800261a26281e624988eb239 Author: Tugsbayasgalan Manlaibaatar Date: Fri Oct 7 13:37:02 2022 -0700 Enable max.unary_out (#85926) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85926 Approved by: https://github.com/bdhirsh commit e18d466f35d9dd5c4fd38328e67cada1504abb8b Author: Kshiteej K Date: Mon Oct 10 16:29:52 2022 +0000 [test_nn] split lazy_modules from test_nn (#86526) Ref: #63085 NOTE: We don't need an accompanying XLA PR as these tests run only on CPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86526 Approved by: https://github.com/albanD commit 8a1fc5d2f843501ed1ce1bf90b20eaa709c8aae2 Author: Howard Huang Date: Fri Oct 7 09:04:13 2022 -0700 [7/N] [Dispatchable Collectives] Update reduce with CPU / CUDA implementations (#83916) - Updates for the reduce collective https://github.com/pytorch/pytorch/issues/86225 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83916 Approved by: https://github.com/kwen2501 commit 978b46d7c96627e3b3553ad70ad21cb161d05f90 Author: albanD Date: Mon Oct 10 08:44:51 2022 -0400 Reland 2 of Merge more symbolic meta kernels and symint changes from branch (#86334) (#86488) symintify split_with_sizes, dropout, fused_fake_obs_quant. meta for padding_2d ops add meta_bernoulli_ meta kernel for at::gather get pytorch_struct to pass: meta for scatter_add, fix backward symintify split ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/86488 Approved by: https://github.com/ezyang commit 55663b7f8174db5d71d611403078ebcec4075b1a Author: albanD Date: Mon Oct 10 08:44:51 2022 -0400 Reland 3 of Symintify getitem and add the required helper functions (#86207) (#86487) Note that this might not cover every use of the function (we know it doesn't) But this is enough to get few models passing. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86487 Approved by: https://github.com/ezyang commit 4a5fdc56ec692fe5e39b8f5d2da6be16434c5a02 Author: Brian Hirsh Date: Fri Oct 7 12:05:29 2022 -0700 fix some composite compliance ops for functionalization (#86470) Confirmed that this fixes https://github.com/pytorch/pytorch/issues/86384 cc @tugsbayasgalan Functionalization should be included in the "isSubclass" checks that we run, for composite operators that have a different path for composite compliance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86470 Approved by: https://github.com/ezyang, https://github.com/zou3519 commit 5102f0cffcd249254245fdb3eb74abcd2151f9ac Author: Andrew Gu Date: Fri Oct 7 13:17:18 2022 +0000 [FSDP][1/N] Retire `FlattenParamsWrapper` (#86117) This deprecates `FlattenParamsWrapper`'s usage for "unflattening" the original parameters. After this PR, FPW only serves to register and de-register its `FlatParameter` for the parent `FullyShardedDataParallel` instance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86117 Approved by: https://github.com/zhaojuanmao commit bf7c46facf442950a191ed4053a2b7ef6c39b35a Author: PyTorch MergeBot Date: Mon Oct 10 10:47:38 2022 +0000 [xla hash update] update the pinned xla hash (#86099) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86099 Approved by: https://github.com/pytorchbot commit 5844f00bbf3d434757b59487f5edeaaf51d292f5 Author: Andrew Gu Date: Sat Oct 8 00:15:34 2022 +0000 [FSDP] Add `low_prec` prefix to param and reduce dtype varnames (#86512) This PR renames `param_dtype` and `reduce_dtype` in `HandleConfig` to `low_prec_param_dtype` and `low_prec_reduce_dtype` to emphasize that they are meant to be of the low precision (if not `None`). (In my mind, mixed precision refers to the paradigm of using both full and low precision together during training. "Reduced" and "low precision" mean the same thing, but I prefer the term "low precision" in the code since it is shorter. A particular dtype can be a low precision dtype or a full precision dtype.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86512 Approved by: https://github.com/zhaojuanmao commit cc5de7f1ac01eece0a5bf6b94987d1ac9cacb2af Author: Andrew Gu Date: Sat Oct 8 13:54:36 2022 +0000 [FSDP] Remove `utils.py` (moved to `_utils.py`) (#86528) I messed up my git with an earlier PR, where I did not actually remove `utils.py` when moving it to `_utils.py`. This removes `utils.py`, which is now outdated and unused. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86528 Approved by: https://github.com/H-Huang commit c6b7c33885eeff9dc125f87c7134772d59d0ba21 Author: chunyuan Date: Mon Oct 10 05:47:11 2022 +0000 torchdynamo: add linear eltwise fusion kernel (#85622) Support fusion of linear with: - relu - sigmoid - tanh - hardswish - leaky_relu - hardtanh - gelu Pull Request resolved: https://github.com/pytorch/pytorch/pull/85622 Approved by: https://github.com/EikanWang, https://github.com/jansel commit ec2d22ece066bdb91b43394178f9c94a324f881f Author: PyTorch MergeBot Date: Mon Oct 10 03:26:25 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#86567) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). 
Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86567 Approved by: https://github.com/pytorchbot commit 753536b7a50df31d0529ba040f1e07cde3cca56d Author: Peter Bell Date: Sun Oct 9 13:49:24 2022 +0100 BlasKernel: Improve gemm's inner dot product when a is transposed (#80977) `gemm_transab_` accumulates the sum in the output, despite the inner loop being over a single output element. This changes it to accumulate in a register, which also avoids early truncation for bfloat16. I've also factored out a generic `sum` function that can be shared with `gemm_transa_` to handle unrolling and multiple accumulators. I have benchmarked addmm for bfloat16 with shapes (320,600) X (600,320) and for both layouts I see a significant speedup.

| layout  | Before (ms) | After (ms) |
|---------|-------------|------------|
| transa  | 71.5        | 31         |
| transab | 249         | 35         |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80977 Approved by: https://github.com/ngimel commit a45fead623e3f9a11bacbf5d49b252c3e867167d Author: Peter Bell Date: Sun Oct 9 19:40:48 2022 +0100 mkl: Use per-operator headers (#75570) Differential Revision: [D40126703](https://our.internmc.facebook.com/intern/diff/D40126703) Pull Request resolved: https://github.com/pytorch/pytorch/pull/75570 Approved by: https://github.com/malfet commit c89d286af633a802226c34ccbdd5c7c4be10dcfb Author: anjali411 Date: Sun Oct 9 13:35:57 2022 +0000 symintify unbind_backward and tensor_split (#86357) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86357 Approved by: https://github.com/albanD commit a6c0442cce252742e6e71270640908b5c1b91961 Author: anjali411 Date: Sun Oct 9 12:29:07 2022 +0000 Add __all__ to torch.{autograd, fx, cuda} submodules (#85343) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85343 Approved by: https://github.com/albanD commit 6aec0d3ddbdfaed1baaaddc20b2c25597de12291 Author: Nikita Shulga Date: Sun Oct 9 14:20:46 2022 +0000 [BE] Remove remaining cuda-11.3 builds (#86540) `linux-bionic-cuda11_3-py3_7-clang9-build` is redundant as it is covered by `linux-jammy-cuda11.6-cudnn8-py3.8-clang12` And migrate no-per-operator header build (which mimics internal behavior) from `linux-xenial-cuda11.3` to `linux-bionic-cuda11.7` Pull Request resolved: https://github.com/pytorch/pytorch/pull/86540 Approved by: https://github.com/weiwangmeta, https://github.com/atalman commit 7134b9bc7b1d25b453ec5c53b1ec70cb206228a1 Author: Peter Bell Date: Tue Oct 4 23:48:53 2022 +0100 Quantized: Use per-operator headers (#75569) Differential Revision: [D40126700](https://our.internmc.facebook.com/intern/diff/D40126700) Pull Request resolved: https://github.com/pytorch/pytorch/pull/75569 Approved by: https://github.com/malfet commit 67434c70df5df353944f6ba876d9dd06b669bacd Author: Nikita Shulga Date: Sun Oct 9 06:47:36 2022 +0000 [MPS] Fix printTensor() for MPS (#86534) MPS does not support the double type, so tensors need to be cast to CPU first before they can be cast to double.
Also, do a little bit of BE, by initializing values and marking unused range variables with C10_UNUSED Fixes https://github.com/pytorch/pytorch/issues/86410 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86534 Approved by: https://github.com/weiwangmeta commit 9998f9100bfc620bd28af272af2e16b34b8c8bcf Author: PyTorch MergeBot Date: Sun Oct 9 03:30:05 2022 +0000 [vision hash update] update the pinned vision hash (#86490) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86490 Approved by: https://github.com/pytorchbot commit 92ac84c98a19310885f3d818aba56b981940d615 Author: PyTorch MergeBot Date: Sun Oct 9 03:28:35 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#86489) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86489 Approved by: https://github.com/pytorchbot commit 492d1be5d2bba36dd6710caa411cdf332ca9e5c8 Author: Peter Bell Date: Tue Oct 4 23:48:52 2022 +0100 QuantizedCPU: Use per-operator headers (#71217) Differential Revision: [D33949895](https://our.internmc.facebook.com/intern/diff/D33949895) Pull Request resolved: https://github.com/pytorch/pytorch/pull/71217 Approved by: https://github.com/malfet commit 4bfe2a24505049fa4fe43d24c2e3a5f5d99d9f00 Author: Peter Bell Date: Tue Oct 4 23:48:52 2022 +0100 cuDNN/miopen: Use per-operator headers (#71216) Differential Revision: [D33949898](https://our.internmc.facebook.com/intern/diff/D33949898) Pull Request resolved: https://github.com/pytorch/pytorch/pull/71216 Approved by: https://github.com/malfet commit 33f0e98a492acb55cca192ea8f4bb5bf24f28a4b Author: Edward Z. Yang Date: Sat Oct 8 07:17:37 2022 +0000 Re-land*4 "SymIntify cat and narrow" (#86468) This re-lands https://github.com/pytorch/pytorch/pull/86289 but with more wrappers. Contains implicit inclusion of in internal usage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86468 Approved by: https://github.com/albanD commit 4bc2a0dcda79ab8589b469fa31919a8141361f42 Merge: 475022cd5d 5df5c3e33e Author: mingfeima Date: Sat Oct 8 15:11:29 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 5df5c3e33eff4016b365e21753168aecaead166c Merge: fd840676b0 8ea2ed0fc7 Author: mingfeima Date: Sat Oct 8 15:11:29 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 8ea2ed0fc728b964b8abfc768c37c3eb8b315dd5 Author: PyTorch MergeBot Date: Sat Oct 8 05:14:39 2022 +0000 Revert "Re-enable torchdynamo tests (#86297)" This reverts commit e61028813007518bd6be0e6482a8742b84c30da7. Reverted https://github.com/pytorch/pytorch/pull/86297 on behalf of https://github.com/malfet due to Reverting to return trunk back to green, dynamo shard2 started failing shortly after the merge commit d3f7c34cb3d7ac43115bd3ccd9cbdbf3e5654498 Author: Elias Ellison Date: Fri Oct 7 18:01:13 2022 +0000 Enable aten-aten decomps (#85921) Invokes aten-aten decomps with re-entrant FakeMode. These decomps are being used in other places, so it's good to unify the path static fake tensor takes / get additional testing etc. 
There is also an instance where we return different devices with cpu/cuda which this fixes ([batch_norm](https://github.com/pytorch/pytorch/blob/master/torch/_decomp/decompositions.py#L1374)) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85921 Approved by: https://github.com/ezyang commit af9c6bc851cfc8fba9e4c71830b783cb34d92a05 Author: Andrew Gu Date: Sat Oct 8 00:15:23 2022 +0000 [FSDP] Add `keep_low_precision_grads` support when CPU offloading (#86495) When CPU offloading, FSDP uses `_cpu_grad`, not `_saved_grad_shard`. This adds support for `keep_low_precision_grads` for that case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86495 Approved by: https://github.com/rohan-varma commit 7ec12a559cadbb82a1bd6546908897afedd453af Author: PyTorch MergeBot Date: Sat Oct 8 01:59:54 2022 +0000 Revert "Enable aten-aten decomps (#85921)" This reverts commit 62e4f51efdf98a3a91d29efa55e5665d5398b464. Reverted https://github.com/pytorch/pytorch/pull/85921 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. I think it breaks a dynamo test in trunk https://hud.pytorch.org/pytorch/pytorch/commit/62e4f51efdf98a3a91d29efa55e5665d5398b464 commit b0ceb8ea1c765963a6210d02686dbffd48e96bc8 Author: ssjia Date: Fri Oct 7 11:39:04 2022 -0700 [vulkan] Add buffer to buffer copies (#86424) Differential Revision: [D40112702](https://our.internmc.facebook.com/intern/diff/D40112702/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86424 Approved by: https://github.com/kimishpatel commit 511d81cd2abff1922ea22e9acf4a7b6fb5e84dbd Author: ssjia Date: Fri Oct 7 11:39:03 2022 -0700 [vulkan] Clean up convolution code (#86423) Differential Revision: [D39553863](https://our.internmc.facebook.com/intern/diff/D39553863/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86423 Approved by: https://github.com/kimishpatel commit b645c237bc88c441792d83f19575d0fd3284dcb4 Author: Cody Ohlsen Date: Sat Oct 8 01:25:03 2022 +0000 make g2p ~30% faster on mobile by suppressing a log (#85907) Summary: using the tool from D39559248 i was able to make g2p faster on mobile by taking a look at profiles on stella frames. It turned out that the pytorch interpreter code does some logging that ends up being a pretty big bottleneck. Differential Revision: D39901455 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85907 Approved by: https://github.com/dzdang commit bac26155e7e5bca949e986fa36ab37e042b8ad53 Author: David Berard Date: Sat Oct 8 00:38:11 2022 +0000 [JIT] Allow freezing modules that contain mutable interfaces (#86039) This PR allows freezing modules like the one below: ```python @torch.jit.interface class ModuleInterface(torch.nn.Module): def forward(self, inp: torch.Tensor) -> torch.Tensor: pass class ImplementsInterface(torch.nn.Module): def __init__(self): super(ImplementsInterface, self).__init__() self.sum = torch.zeros((2, 2)) def forward(self, inp: torch.Tensor) -> torch.Tensor: self.sum += inp.relu() # this makes the interface-implementing module mutable return self.sum class WrapperModule(torch.nn.Module): impl: ModuleInterface def __init__(self): super().__init__() self.impl = ImplementsInterface() def forward(self, x: torch.Tensor) -> torch.Tensor: return self.impl.forward(x) ``` Previously during freezing, we handle interfaces as shown below: 1. we inline interfaces in any preserved method graphs 2. during `cleanupFrozenModule`, we try to simplify the module data structure (<- this part is unrelated to freezing so far). 
During this step, if we found that a interface type was mutable, we'd error out; because of the possibility of a module that _swaps out the value of an interface-typed attribute at runtime_. Below is an example of a module that swaps out the value of an interface-typed attribute at runtime: ```python class MyBadModule(torch.nn.Module): impl: MyInterface option1: IfaceImpl1 option2: IfaceImpl2 .... def forward(self, x): if x > 0: self.impl = self.option1 else: self.impl = self.option2 .... ``` ^ this type of situation cannot be supported by freezing (or at least would be difficult to do correctly) because it greatly complicates the details of handling types and simplifying the module data structure. But we can still support the first example without _too_ much work: 1. inline the interface code as before 2. check to see if we have any setattrs on interface types; if so, error out 3. otherwise, replace the type of the interface types with the concrete type implementation 4. continue simplifying the module data structure as if we never had any interfaces. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86039 Approved by: https://github.com/eellison commit 04490e90ea59229355b2771893719fe8896e80f0 Author: Hongxia Yang Date: Sat Oct 8 00:06:05 2022 +0000 better error message fix (#86422) Summary: A user had a problem with fx-scripting and the error message can be improved. Error was shown as: RuntimeError: Keys for dictionaries used as an argument cannot contain a Node. Got key: {k} which is obvious not quite helpful. Test Plan: Test in a notebook: {F778667593} Reviewed By: xunnanxu, SherlockNoMad Differential Revision: D40157518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86422 Approved by: https://github.com/SherlockNoMad commit 3a02873183e81ed0af76ab46b01c3829b8dc1d35 Author: zaf Date: Fri Oct 7 14:05:13 2022 -0700 [quant][ao_migration] nn.intrinsic.quantized migration to ao (#86172) All quantization-related modules are being migrated to `torch.ao`. This migrates the `nn.intrinsic.quantized`. Please, see the [tracker](https://github.com/pytorch/pytorch/issues/81667) for the timeline. ``` python test/test_quantization.py -- TestAOMigrationNNIntrinsic ``` Internal: ``` buck2 test @mode/dev-nosan //caffe2/test:quantization -- TestAOMigrationNNIntrinsic ``` Differential Revision: [D39425515](https://our.internmc.facebook.com/intern/diff/D39425515/) Differential Revision: [D39425515](https://our.internmc.facebook.com/intern/diff/D39425515) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86172 Approved by: https://github.com/jerryzh168 commit 91b1bae1df1e72e17d2ab296845c214bc39422a0 Author: Zachary DeVito Date: Fri Oct 7 13:21:48 2022 -0700 Caching allocator tracing (#86241) We currently can take snapshots of the state of the allocated cuda memory, but we do not have a way to correlate these snapshots with the actions the allocator that were taken between snapshots. This PR adds a simple fixed-sized buffer that records the major actions that the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes these with the snapshot information. Capturing period snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time. We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm. 
As a component of this functionality, we also add the ability to get a callback when the allocator will throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught). This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241 Approved by: https://github.com/ngimel commit 8a3a54e012488ddb4e372f559bbeed2b41e7eb1c Author: Sherlock Huang Date: Fri Oct 7 18:07:54 2022 +0000 Fix index_select decomp (#86469) For decomposing index_select with 0-dim tensor, we cannot write `x.unsqueeze(0)[index].squeeze(0).clone()` , as tensor[index] will trigger index.item() if index is a 0-dim tensor, and .item() cannot be symbolically traced with FakeTensor. We use `torch.ops.aten.index(x.unsqueeze(0), [index]).squeeze(0).clone()` as a workaround. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86469 Approved by: https://github.com/ngimel commit a079dad7cfdcc6982ec704a924b1432ff01b3a09 Author: albanD Date: Fri Oct 7 22:47:46 2022 +0000 Skip dynamo for all optim test as they are all flaky otherwise (#86482) Fixes https://github.com/pytorch/pytorch/issues/86433 Fixes https://github.com/pytorch/pytorch/issues/86435 Fixes https://github.com/pytorch/pytorch/issues/86432 Fixes https://github.com/pytorch/pytorch/issues/86389 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86482 Approved by: https://github.com/ezyang commit ba3fde6aa08c97be2616bcc9f372781166ed7342 Author: soulitzer Date: Fri Oct 7 14:29:33 2022 -0400 Add multi-grad hooks (#86260) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86260 Approved by: https://github.com/albanD commit 97e56c176d6091a91ac2afe284fc8cb406780ddd Author: albanD Date: Fri Oct 7 21:09:37 2022 +0000 Try to fix shutdown test in edge cases (#86464) Fixes https://github.com/pytorch/pytorch/issues/85259 See the issue for debugging details. tl;dr: when a worker thread is actually used, make sure it is initialized before exiting. Yes, it is very unlikely it will take >10s to initialize but it is what seems to happen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86464 Approved by: https://github.com/soulitzer, https://github.com/ezyang commit 62e4f51efdf98a3a91d29efa55e5665d5398b464 Author: Elias Ellison Date: Fri Oct 7 18:01:13 2022 +0000 Enable aten-aten decomps (#85921) Invokes aten-aten decomps with re-entrant FakeMode. These decomps are being used in other places, so it's good to unify the path static fake tensor takes / get additional testing etc. There is also an instance where we return different devices with cpu/cuda which this fixes ([batch_norm](https://github.com/pytorch/pytorch/blob/master/torch/_decomp/decompositions.py#L1374)) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85921 Approved by: https://github.com/ezyang commit a95889ba7c1ecd8cb0f90507a6152cb035bcefd1 Author: Andrew Gu Date: Fri Oct 7 13:17:17 2022 +0000 [FSDP] Add initial `summon_full_params(with_grads=True)` (#85738) This adds `summon_full_params(with_grads=True)` for `use_orig_params=True` and `offload_to_cpu=False`. Filling in the `use_orig_params=False` case requires some already-planned refactoring, and the `offload_to_cpu=True` case needs some additional work as well. 
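A rough usage sketch for the supported case (illustrative only; it assumes a model already wrapped with `use_orig_params=True` and no CPU offloading, per the limitations above):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def dump_unsharded_grads(fsdp_model: FSDP) -> None:
    # Gather the full (unsharded) parameters together with their gradients so
    # gradient values can be inspected after a backward pass.
    with FSDP.summon_full_params(fsdp_model, with_grads=True):
        for name, param in fsdp_model.named_parameters():
            if param.grad is not None:
                print(name, param.grad.norm().item())
```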
Adding this is helpful for debugging `use_orig_params=True` to make sure gradients are being updated correctly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85738 Approved by: https://github.com/rohan-varma commit 82229d1e33dc8b34d0c6b35aaa23e51644cd4c74 Author: kshitij12345 Date: Fri Oct 7 19:24:59 2022 +0000 [optim] fix: empty grad support for SparseAdam (#86459) Fixes #82486 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86459 Approved by: https://github.com/albanD commit 66d480d314236a8cd8df4a28ed8867d48b6fa448 Author: PyTorch MergeBot Date: Fri Oct 7 18:55:01 2022 +0000 Revert "Disable mac m1 jobs (#86463)" This reverts commit ac632b437489b4c0c2714d5ad37517bb60e09750. Reverted https://github.com/pytorch/pytorch/pull/86463 on behalf of https://github.com/huydhn due to Queue is decreasing, re-enable the jobs commit ac632b437489b4c0c2714d5ad37517bb60e09750 Author: Huy Do Date: Fri Oct 7 18:28:47 2022 +0000 Disable mac m1 jobs (#86463) There is a queue and some runners are not accessible. This is to mitigate the Sev https://github.com/pytorch/pytorch/issues/86466 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86463 Approved by: https://github.com/clee2000 commit ac74976a566ff83f64017776351a9b3ce4402896 Author: HDCharles Date: Thu Oct 6 14:48:38 2022 -0700 [ao] fixing public v private for fuser_method_mappings.py (#86029) Summary: no significant changes, just added __all__ Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86029 Approved by: https://github.com/jerryzh168 commit be682befbc836a07d5d070bb569450429526a64b Author: Andrew Gu Date: Fri Oct 7 13:17:16 2022 +0000 [FSDP] Add `use_orig_params` (#84911) **Overview** This PR adds the option to use the original parameters via `use_orig_params=True` in the FSDP constructor. - This exposes the original parameters rather than the `FlatParameter`s from `named_parameters()`, which means that the optimizer runs on the original parameters. Hence, users may assign original parameters from the same `FlatParameter` to different parameter groups. - This enables decoupling the original parameter variables from their storage without changing the variables themselves, which is critical for our upcoming execution-order-based non-recursive wrapping policy. For more detailed design explanation, refer to the Quip shared internally. **Follow-Ups** See 85831 (removing link to avoid spamming the issue whenever I update this PR). `test_fsdp_use_orig_params.py` adds ~4 min 46 seconds to the TTS on the AWS cluster. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84911 Approved by: https://github.com/rohan-varma commit b43ae1c4116487ea6a195d533b5d5622075dec9d Author: Chengqi Deng Date: Fri Oct 7 17:59:26 2022 +0000 Add reference counter in FileStore (#85601) Fixes #67566. This diff added a reference counter in the FileStore object. The underlying file would be removed only if the reference counter became 0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85601 Approved by: https://github.com/H-Huang commit efccb6401c6451389c9005a43c29fd055fb89452 Author: zaf Date: Thu Oct 6 13:33:20 2022 -0700 [quant][ao_migration] nn.intrinsic.qat migration to ao (#86171) All quantization-related modules are being migrated to `torch.ao`. This migrates the `nn.intrinsic.qat`. Please, see the [tracker](https://github.com/pytorch/pytorch/issues/81667) for the timeline. 
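For downstream code the migration is essentially an import-path change; a hedged sketch (the class is just one example from this namespace, and the old path is expected to keep working as an alias during the transition):

```python
# Pre-migration location (kept for backwards compatibility during the transition):
from torch.nn.intrinsic.qat import ConvBnReLU2d as _LegacyConvBnReLU2d
# Post-migration location under torch.ao:
from torch.ao.nn.intrinsic.qat import ConvBnReLU2d
```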
``` python test/test_quantization.py TestAOMigrationNNIntrinsic ``` Differential Revision: [D39419993](https://our.internmc.facebook.com/intern/diff/D39419993/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39419993/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/86171 Approved by: https://github.com/jerryzh168 commit e61028813007518bd6be0e6482a8742b84c30da7 Author: Yanbo Liang Date: Fri Oct 7 17:16:40 2022 +0000 Re-enable torchdynamo tests (#86297) We temporarily skipped torchdynamo tests due to many failures, now we fix the problems and re-enable tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86297 Approved by: https://github.com/anijain2305 commit e8d3b7201c9f8223380e2eb66e2213ae3be08869 Author: HDCharles Date: Thu Oct 6 14:48:37 2022 -0700 [ao] fixing public v private for fuse_modules.py (#86028) Summary: no significant changes, just added __all__ Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86028 Approved by: https://github.com/jerryzh168 commit d29912cc06241fb8e2ad11629271a92780a759c2 Author: HDCharles Date: Thu Oct 6 14:48:36 2022 -0700 [ao] fixing public v private for torch/ao/quantization (#86027) Summary: no significant changes, just needed to add __all__ Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86027 Approved by: https://github.com/jerryzh168 commit 65b408074f4ecc99faf5720ea5b3570a483ec9f4 Author: PyTorch MergeBot Date: Fri Oct 7 16:29:27 2022 +0000 Revert "Relandx3 "SymIntify cat and narrow" (#86289)" This reverts commit a00f8489df5586178d7b5f83928bf8049ce32f24. Reverted https://github.com/pytorch/pytorch/pull/86289 on behalf of https://github.com/malfet due to @seemether unlanded the rest of the stack and it will fail intern import anyway commit 5b69b87d5abbb272fb48be5a5a4dc17f8399c124 Author: PyTorch MergeBot Date: Fri Oct 7 16:10:30 2022 +0000 Revert "Symintify getitem and add the required helper functions (#86207)" This reverts commit fd5085c445c3f1a4c90e55154cf26fe30f52a0ab. Reverted https://github.com/pytorch/pytorch/pull/86207 on behalf of https://github.com/seemethere due to Fails internal tests, see: https://www.internalfb.com/intern/sandcastle/job/22517998926071860/insights commit 75df4b5e3daa2a177f35bd0e43629c814238b639 Author: PyTorch MergeBot Date: Fri Oct 7 16:03:30 2022 +0000 Revert "Merge more symbolic meta kernels and symint changes from branch (#86334)" This reverts commit 08e3999fa494238f8f62346a140da36bd43864e7. Reverted https://github.com/pytorch/pytorch/pull/86334 on behalf of https://github.com/seemethere due to Trying to revert https://github.com/pytorch/pytorch/pull/86207, this PR causes merge conflicts with the initial revert so will have to revert this as well commit b3fdb02fb25508d9c61d70b594f8a7fac3b2a365 Author: Check Deng Date: Fri Oct 7 15:55:55 2022 +0000 Fix memory leak in _LRScheduler.step() (#85602) Fixes #85410 This diff removed the cyclic references in `_LRScheduler.step()`. 
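For background, the general shape of the problem and the fix (a simplified sketch of the pattern only, not the code in this diff): storing a strong reference to a bound method ties the two objects' lifetimes together, while a `weakref` breaks the resulting cycle.

```python
import weakref

class Optimizer:
    def step(self):
        ...

class Scheduler:
    def __init__(self, optimizer):
        # A strong reference such as `self._opt_step = optimizer.step` keeps the
        # optimizer alive through the scheduler; combined with a back-reference
        # from the optimizer side this forms a cycle. A weak reference does not.
        self._opt_ref = weakref.ref(optimizer)

    def step(self):
        opt = self._opt_ref()
        if opt is not None:
            opt.step()
```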
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85602 Approved by: https://github.com/albanD commit 0e639ff45c616946fb3e5e3f06b9486d88ce86ca Author: PyTorch MergeBot Date: Fri Oct 7 14:55:44 2022 +0000 Revert "Cleanup PT-D imports (#85781)" This reverts commit 9a170b24f64d7cfdd887ff122c241ac6ff85f4c6. Reverted https://github.com/pytorch/pytorch/pull/85781 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit 9b2ea41f481bac297c8e1e88c431c03127a35759 Author: Nikita Vedeneev Date: Fri Oct 7 14:50:48 2022 +0000 COO intersection primitives : fusing value selection with value intersection. (#86269) As per title. This one fuses 3 kernels into 1 with about 20-10% performance improvement. This kernel is also useful for union-like operations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86269 Approved by: https://github.com/amjames, https://github.com/cpuhrsch commit e125baf90b53a97992ef392a06d6321618b14113 Author: Richard Zou Date: Thu Oct 6 12:45:23 2022 -0700 [autocast] Clean up registrations using new macros (#86403) This PR cleans up m.impl(...) calls to use the new KERNEL / KERNEL_CPU macros. That saves us the trouble of writing out the signatures. Test Plan: - code reading - wait for tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86403 Approved by: https://github.com/ezyang commit 9b74267eb6de5076e7eb2e92bc34eef771384c1e Author: Richard Zou Date: Thu Oct 6 12:45:19 2022 -0700 [autocast] Make it easier to register rules (#86402) On the way to resolving https://github.com/pytorch/pytorch/issues/86294 Previously, there were three macros used to register autocast rules: - KERNEL - KERNEL_DIFFERENT_REDISPATCH_SIGNATURE - KERNEL_CPU This PR makes the KERNEL and KERNEL_CPU macros less redundant for users. KERNEL_DIFFERENT_REDISPATCH_SIGNATURE is weird and only used three times, so I didn't change them. Concretely, KERNEL(OP, OP_NAME, SIGNATURE, POLICY) is redundant: - op/op_name are similar, and the signature can be decltype'd. PR changes it so that instead, one uses either: - KERNEL(OP, POLICY) - KERNEL2(OP, OVERLOAD, POLICY) depending on whether the operator name has an overload. This PR also gives the same treatment to the KERNEL_CPU macro, which is used for registering autocast cpu rules: it splits KERNEL_CPU into KERNEL_CPU(OP, POLICY) AND KERNEL_CPU2(OP, OVERLOAD, POLICY). I will do some more cleanup of things that are implemented via `m.impl(...)` in a follow-up PR so that I don't get confused when I need to rebase. Test Plan: - wait for tests (how good are our autocast tests?) - code reading Pull Request resolved: https://github.com/pytorch/pytorch/pull/86402 Approved by: https://github.com/ezyang commit 55f5e0de8dbd15d3732017796bddfa10fc76d033 Author: Supraj Bachawala Date: Fri Oct 7 14:13:15 2022 +0000 remove unused arg from `impl_func_cum_ops` (#86364) Fixes #86224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86364 Approved by: https://github.com/bdhirsh commit a00f8489df5586178d7b5f83928bf8049ce32f24 Author: Edward Z. Yang Date: Wed Oct 5 11:32:48 2022 -0700 Relandx3 "SymIntify cat and narrow" (#86289) This reverts commit fc94a2115b31dfe7a0d8f28eb4f5ed532c4f0792. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86289 Approved by: https://github.com/wconstab commit cc9183eb4c05f2dbb002279698cc21c4781e9492 Author: Howard Huang Date: Fri Oct 7 12:59:09 2022 +0000 Update distributed.rst backend collective support chart (#86406) NCCL `scatter` was added by Wanchao in https://github.com/pytorch/pytorch/pull/70029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86406 Approved by: https://github.com/wanchaol commit b74ca31bf6d3f1d16849a1e893164450a917e447 Author: kshitij12345 Date: Fri Oct 7 12:12:03 2022 +0000 [fix] sum_to_size: MathBits test - don't reuse same input tensor (#86378) Fixes #85409 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86378 Approved by: https://github.com/anjali411 commit facbddb9ff494f0c0c9a06ea823bc7cd3f203352 Author: Salil Desai Date: Fri Oct 7 11:58:41 2022 +0000 Override Quantized Backend to use Fbgemm in Qlinear Packed Params Test (#86236) Summary: After D39934051, we must explicitly ```override_quantized_engine('fbgemm')``` for this test to work Test Plan: ``` buck test //caffe2/test:ao -- TestQlinearPackedParams ``` Before: ``` Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/5629499663624574 ✓ ListingSuccess: caffe2/test:ao : 72 tests discovered (32.830) ✓ Pass: caffe2/test:ao - test_qlinear_packed_params_qnnpack (ao.sparsity.test_qlinear_packed_params.TestQlinearPackedParams) (25.085) ✗ Fail: caffe2/test:ao - test_qlinear_packed_params (ao.sparsity.test_qlinear_packed_params.TestQlinearPackedParams) (26.706) Test output: > RuntimeError: Didn't find engine for operation ao::sparse::qlinear_prepack X86 ``` After: ``` Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/7599824485968786 ✓ ListingSuccess: caffe2/test:ao : 72 tests discovered (31.082) ✓ Pass: caffe2/test:ao - test_qlinear_packed_params_fbgemm (ao.sparsity.test_qlinear_packed_params.TestQlinearPackedParams) (100.409) ✓ Pass: caffe2/test:ao - test_qlinear_packed_params_qnnpack (ao.sparsity.test_qlinear_packed_params.TestQlinearPackedParams) (100.544) Summary Pass: 2 ListingSuccess: 1 ``` Differential Revision: D40078176 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86236 Approved by: https://github.com/jmdetloff, https://github.com/z-a-f commit dbea07b6aa208f4dfdc8c0876fc2469bffa74fbe Author: Seonglyong Gong Date: Fri Oct 7 09:58:50 2022 +0000 [Profiler] record gradient from nnModule (#86355) Summary: - catch .grad tensor info - update data type and `check_and_store`, etc - update unit test case Test Plan: buck run mode/opt //caffe2/test:profiler Reviewed By: chaekit Differential Revision: D39711295 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86355 Approved by: https://github.com/chaekit commit 28a0b3fb18da6ff96b6d4edb252b15e7f3e331a9 Author: lezcano Date: Thu Oct 6 22:50:05 2022 +0000 Fix col2im and im2col decompositions (#86426) I threw in some tests for good measure. 
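As background for readers unfamiliar with these ops (illustrative only, using the public `unfold`/`fold` wrappers rather than the decompositions touched here): `im2col` extracts sliding patches into columns, and `col2im` scatters them back, summing overlaps.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
cols = F.unfold(x, kernel_size=3)                    # im2col: (1, 3*3*3, 36)
y = F.fold(cols, output_size=(8, 8), kernel_size=3)  # col2im: overlaps are summed
# Dividing by the per-pixel patch multiplicity recovers the original tensor.
counts = F.fold(F.unfold(torch.ones_like(x), kernel_size=3), output_size=(8, 8), kernel_size=3)
assert torch.allclose(y / counts, x)
```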
Fixes https://github.com/pytorch/pytorch/issues/86332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86426 Approved by: https://github.com/ngimel commit 93b2d991581db86074dd8011fdc903bd554466b1 Author: Nicolas Hug Date: Thu Oct 6 11:32:29 2022 +0000 Improve interpolate() speed for channels_last images and masks (#86361) This PR improves the speed of `interpolate()`: - on images and masks (`num_channels < 4`, `channels_last=True`) - for the following modes: linear (antialias=False), nearest (int and float), and nearest-exact (int and float) - for both upsampling and downsampling The actual speed-up ranges from 1.1X to 110X, but this depends on various factors like number of threads and of course input_size/output_size. In a typical torchvision ImageNet training job (where num_threads=1 because of DataLoader multi-processing), the following speed-ups should be expected (I ran much more benchmarks than this one, see below for more details): ``` (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=1 1.0X 1.0ms vs 1.0ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 2.1X 1.0ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=1 7X 0.8ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=1 14X 0.852ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=1 9X 0.828ms vs 0.087ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 15X 0.922ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.897ms vs 0.087ms ``` An immediate follow-up to this PR would be to do the same changes for the 3D kernels. Thanks a ton @fmassa for the help! Results:
``` ---------------------------------------------------------------------------------------------------- (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=1 0.9X 0.9ms vs 1.1ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=1 1.6X 0.9ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=1 1.7X 1.0ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=1 8X 0.806ms vs 0.097ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=1 15X 0.848ms vs 0.056ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=1 10X 0.828ms vs 0.084ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=1 16X 0.914ms vs 0.057ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.900ms vs 0.086ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=2 1.6X 1.1ms vs 0.7ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=2 1.6X 0.6ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=2 1.7X 0.6ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=2 1.7X 0.5ms vs 0.3ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=2 9X 0.800ms vs 0.088ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=2 11X 0.459ms vs 0.043ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=2 7X 0.424ms vs 0.064ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=2 12X 0.503ms vs 0.043ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=2 8X 0.461ms vs 0.059ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=12 3X 1.1ms vs 0.3ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=12 1.6X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=12 1.5X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=12 5X 0.8ms vs 0.2ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=12 10X 0.445ms vs 0.047ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=12 7X 0.432ms vs 0.062ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=12 10X 0.478ms vs 0.046ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=12 7X 0.470ms vs 0.063ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=32 3X 1.1ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=32 1.8X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=32 1.4X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=32 11X 0.815ms vs 0.074ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.045ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=32 7X 0.436ms vs 0.061ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=32 10X 0.478ms vs 0.046ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.061ms ---------------------------------------------------------------------------------------------------- (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=1 0.9X 
0.9ms vs 1.1ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=1 1.5X 0.9ms vs 0.6ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=1 1.6X 1.0ms vs 0.6ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=1 8X 0.808ms vs 0.099ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=1 15X 0.848ms vs 0.058ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=1 9X 0.820ms vs 0.087ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=1 16X 0.909ms vs 0.059ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.898ms vs 0.088ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=2 1.4X 0.9ms vs 0.7ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=2 1.5X 0.5ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=2 1.5X 0.5ms vs 0.4ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=2 1.8X 0.5ms vs 0.3ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=2 9X 0.799ms vs 0.090ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=2 10X 0.459ms vs 0.045ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=2 7X 0.427ms vs 0.059ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=2 11X 0.501ms vs 0.044ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=2 8X 0.460ms vs 0.060ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=12 2.9X 1.0ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=12 1.2X 0.2ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=12 1.1X 0.2ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=12 12X 0.809ms vs 0.068ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=12 11X 0.438ms vs 0.041ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=12 8X 0.432ms vs 0.055ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=12 12X 0.480ms vs 0.041ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=12 8X 0.464ms vs 0.056ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=32 3X 1.1ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=32 1.3X 0.3ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=32 1.4X 0.3ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=32 11X 0.813ms vs 0.075ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.046ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=32 7X 0.433ms vs 0.061ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=32 10X 0.478ms vs 0.046ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.062ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=1 0.9X 4.5ms vs 5.2ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=1 1.5X 4.2ms 
vs 2.8ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=1 1.8X 4.1ms vs 2.3ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=1 1.6X 4.5ms vs 2.8ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=1 1.9X 4.4ms vs 2.3ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=1 9X 3.8ms vs 0.4ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=1 17X 4.0ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=1 11X 3.9ms vs 0.4ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=1 19X 4.4ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=1 12X 4.3ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=2 1.5X 4.5ms vs 3.1ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=2 1.4X 2.3ms vs 1.6ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=2 1.7X 2.1ms vs 1.2ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=2 1.6X 2.5ms vs 1.6ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=2 1.8X 2.2ms vs 1.2ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=2 15X 3.8ms vs 0.3ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=2 15X 2.2ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=2 7X 2.0ms vs 0.3ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=2 16X 2.4ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=2 8X 2.2ms vs 0.3ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=12 8X 5.2ms vs 0.7ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=12 1.3X 0.6ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=12 1.7X 0.4ms vs 0.2ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=12 1.4X 0.6ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=12 1.8X 0.4ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=12 36X 3.9ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=12 10X 0.526ms vs 0.051ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=12 7X 0.514ms vs 0.069ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=12 11X 0.569ms vs 0.052ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=12 8X 0.557ms vs 0.070ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=32 9X 4.5ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=32 0.5X 0.2ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=32 1.0X 0.5ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=32 44X 3.864ms vs 0.087ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=32 10X 0.527ms vs 0.053ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=32 7X 0.516ms vs 0.070ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=32 10X 0.567ms vs 0.055ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=32 8X 0.558ms vs 0.072ms ---------------------------------------------------------------------------------------------------- (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=1 1.0X 1.9ms vs 1.9ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=1 2.0X 1.8ms vs 0.9ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=1 1.7X 1.8ms vs 1.0ms (1, 3, 256, 256) -> (320, 320) nearest-exact 
float32 num_threads=1 2.1X 1.9ms vs 0.9ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=1 1.9X 1.9ms vs 1.0ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=1 9X 1.6ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=1 16X 1.7ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=1 10X 1.7ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=1 17X 1.9ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=1 11X 1.8ms vs 0.2ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=2 1.7X 1.9ms vs 1.1ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=2 2.0X 1.0ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=2 1.7X 0.9ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=2 2.3X 1.1ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=2 1.8X 1.0ms vs 0.5ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=2 8X 1.6ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=2 14X 0.931ms vs 0.067ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=2 7X 0.9ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=2 15X 1.016ms vs 0.069ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=2 9X 0.9ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=12 8X 1.9ms vs 0.3ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=12 1.7X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=12 1.9X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=12 20X 1.630ms vs 0.081ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=12 10X 0.457ms vs 0.044ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=12 7X 0.439ms vs 0.060ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=12 11X 0.485ms vs 0.045ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=12 8X 0.474ms vs 0.061ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=32 8X 1.9ms vs 0.3ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=32 2.0X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=32 1.4X 0.2ms vs 0.2ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=32 21X 1.628ms vs 0.078ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=32 9X 0.453ms vs 0.048ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=32 7X 0.445ms vs 0.063ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=32 11X 0.535ms vs 0.048ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=32 8X 0.502ms vs 0.063ms ---------------------------------------------------------------------------------------------------- (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=1 1.0X 13.8ms vs 14.0ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=1 1.8X 13.1ms vs 7.4ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=1 1.8X 11.1ms vs 6.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=1 1.9X 13.9ms vs 7.4ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=1 1.9X 
11.8ms vs 6.1ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=1 10X 10.2ms vs 1.1ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=1 19X 10.8ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=1 11X 10.4ms vs 0.9ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=1 20X 11.6ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=1 12X 11.4ms vs 0.9ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=2 1.8X 13.7ms vs 7.7ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=2 2.6X 7.3ms vs 2.8ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=2 1.8X 5.6ms vs 3.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=2 1.9X 7.9ms vs 4.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=2 1.9X 6.0ms vs 3.1ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=2 18X 10.1ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=2 19X 5.8ms vs 0.3ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=2 10X 5.3ms vs 0.5ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=2 20X 6.3ms vs 0.3ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=2 11X 5.7ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=12 8X 13.8ms vs 1.6ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=12 2.9X 1.5ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=12 1.7X 1.0ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=12 1.5X 1.5ms vs 1.0ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=12 1.8X 1.0ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=12 80X 10.1ms vs 0.1ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=12 13X 0.928ms vs 0.072ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=12 8X 0.9ms vs 0.1ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=12 13X 1.001ms vs 0.074ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=12 9X 1.0ms vs 0.1ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=32 18X 14.0ms vs 0.8ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=32 1.9X 1.0ms vs 0.6ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=32 2.9X 0.7ms vs 0.2ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=32 1.7X 0.9ms vs 0.6ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=32 1.8X 0.4ms vs 0.2ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=32 111X 10.254ms vs 0.092ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=32 14X 0.784ms vs 0.056ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=32 7X 0.551ms vs 0.075ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=32 11X 0.607ms vs 0.057ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=32 8X 0.596ms vs 0.076ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=1 1.0X 0.084ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=1 1.0X 0.077ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=1 1.0X 0.076ms vs 0.076ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=1 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=1 1.0X 0.081ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=1 1.0X 0.071ms vs 0.071ms (1, 1, 224, 224) 
-> (64, 64) nearest float32 num_threads=1 1.0X 0.074ms vs 0.074ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=1 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=1 1.0X 0.080ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=1 0.9X 0.078ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=2 1.0X 0.083ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=2 1.0X 0.076ms vs 0.077ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=2 1.0X 0.075ms vs 0.074ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=2 1.0X 0.082ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=2 1.0X 0.080ms vs 0.083ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=2 1.0X 0.070ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=2 1.0X 0.073ms vs 0.075ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=2 1.0X 0.071ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=2 1.0X 0.079ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=2 1.0X 0.077ms vs 0.079ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=12 1.0X 0.083ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=12 1.0X 0.080ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=12 1.0X 0.077ms vs 0.075ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=12 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=12 1.0X 0.083ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=12 1.0X 0.071ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=12 1.0X 0.076ms vs 0.074ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=12 1.0X 0.073ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=12 1.0X 0.080ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=12 1.0X 0.080ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=32 1.0X 0.084ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=32 1.0X 0.078ms vs 0.077ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=32 1.0X 0.076ms vs 0.076ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=32 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=32 1.0X 0.081ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=32 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=32 1.0X 0.074ms vs 0.075ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=32 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=32 1.0X 0.077ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=32 1.0X 0.076ms vs 0.079ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=1 1.0X 0.3ms vs 0.3ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=1 1.8X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=1 1.6X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=1 2.0X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=1 1.7X 0.3ms vs 0.2ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=1 6X 0.265ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=1 10X 0.280ms vs 0.028ms 
(1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=1 7X 0.273ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=1 11X 0.303ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=1 8X 0.297ms vs 0.038ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=2 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=2 1.8X 0.163ms vs 0.093ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=2 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=2 1.9X 0.180ms vs 0.096ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=2 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=2 6X 0.264ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=2 10X 0.278ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=2 7X 0.270ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=2 11X 0.298ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=2 8X 0.293ms vs 0.037ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=12 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=12 1.7X 0.158ms vs 0.095ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=12 1.7X 0.170ms vs 0.100ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=12 6X 0.269ms vs 0.043ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=12 11X 0.291ms vs 0.027ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=12 8X 0.281ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=12 11X 0.305ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=12 8X 0.306ms vs 0.038ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=32 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=32 1.6X 0.160ms vs 0.098ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=32 1.7X 0.171ms vs 0.099ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=32 6X 0.269ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=32 10X 0.282ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=32 7X 0.276ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=32 11X 0.305ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=32 8X 0.299ms vs 0.038ms ---------------------------------------------------------------------------------------------------- (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=1 1.0X 1.2ms vs 1.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=1 2.0X 1.2ms vs 0.6ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=1 1.7X 1.1ms vs 0.7ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=1 2.1X 1.2ms vs 0.6ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=1 1.9X 1.2ms vs 0.7ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=1 8X 1.1ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=1 15X 1.109ms vs 0.073ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=1 10X 1.1ms 
vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=1 16X 1.192ms vs 0.074ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=1 11X 1.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=2 1.7X 1.2ms vs 0.7ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=2 2.0X 0.6ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=2 1.7X 0.6ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=2 2.2X 0.7ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=2 1.8X 0.6ms vs 0.3ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=2 9X 1.0ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=2 11X 0.598ms vs 0.052ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=2 8X 0.556ms vs 0.072ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=2 12X 0.649ms vs 0.053ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=2 8X 0.598ms vs 0.073ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=12 5X 1.2ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=12 1.3X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=12 1.4X 0.2ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=12 9X 1.0ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=12 12X 0.572ms vs 0.048ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=12 8X 0.560ms vs 0.068ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=12 13X 0.617ms vs 0.049ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=12 9X 0.604ms vs 0.068ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=32 5X 1.2ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=32 13X 1.042ms vs 0.081ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=32 12X 0.586ms vs 0.050ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=32 8X 0.562ms vs 0.069ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=32 12X 0.621ms vs 0.051ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=32 9X 0.609ms vs 0.070ms ---------------------------------------------------------------------------------------------------- (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=1 1.0X 1.0ms vs 1.0ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 2.1X 1.0ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=1 7X 0.8ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=1 14X 0.852ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=1 9X 0.828ms vs 0.087ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 15X 0.922ms vs 0.061ms (1, 1, 
600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.897ms vs 0.087ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=2 1.6X 0.9ms vs 0.6ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=2 1.9X 0.5ms vs 0.2ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=2 2.1X 0.5ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=2 1.8X 0.5ms vs 0.3ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=2 10X 0.808ms vs 0.084ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=2 10X 0.462ms vs 0.046ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=2 7X 0.429ms vs 0.062ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=2 12X 0.504ms vs 0.044ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=2 7X 0.461ms vs 0.063ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=12 4X 1.0ms vs 0.2ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=12 1.7X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=12 1.9X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=12 12X 0.820ms vs 0.067ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=12 11X 0.438ms vs 0.041ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=12 8X 0.431ms vs 0.056ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=12 12X 0.482ms vs 0.041ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=12 8X 0.467ms vs 0.056ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=32 4X 1.0ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=32 1.7X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=32 1.8X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=32 12X 0.824ms vs 0.070ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.044ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=32 7X 0.438ms vs 0.059ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=32 11X 0.479ms vs 0.045ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.059ms ---------------------------------------------------------------------------------------------------- (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=1 1.0X 4.7ms vs 4.7ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=1 2.0X 4.4ms vs 2.2ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=1 1.8X 4.3ms vs 2.5ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=1 2.1X 4.7ms vs 2.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=1 1.9X 4.6ms vs 2.5ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=1 9X 4.0ms vs 0.4ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=1 17X 4.2ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=1 11X 4.1ms vs 0.4ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=1 19X 4.6ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=1 12X 4.5ms vs 0.4ms (1, 3, 800, 800) -> (500, 
500) linear float32 num_threads=2 1.7X 4.7ms vs 2.7ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=2 2.1X 2.4ms vs 1.1ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=2 1.8X 2.2ms vs 1.3ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=2 2.3X 2.6ms vs 1.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=2 1.9X 2.3ms vs 1.3ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=2 15X 4.0ms vs 0.3ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=2 16X 2.3ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=2 9X 2.1ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=2 17X 2.5ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=2 10X 2.3ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=12 10X 4.7ms vs 0.5ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=12 1.9X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=12 1.7X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=12 1.9X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=12 1.8X 0.4ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=12 41X 3.969ms vs 0.096ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=12 11X 0.545ms vs 0.051ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=12 8X 0.532ms vs 0.070ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=12 11X 0.590ms vs 0.052ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=12 8X 0.578ms vs 0.071ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=32 17X 4.7ms vs 0.3ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=32 1.8X 0.2ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=32 2.0X 0.3ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=32 1.9X 0.2ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=32 45X 4.028ms vs 0.090ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=32 10X 0.549ms vs 0.053ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=32 7X 0.536ms vs 0.072ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=32 11X 0.592ms vs 0.055ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=32 8X 0.581ms vs 0.074ms ```
Code:
I used this file which is adapted from https://github.com/pytorch/pytorch/blob/master/benchmarks/operator_benchmark/pt/interpolate_test.py ```py import operator_benchmark as op_bench import torch """Microbenchmarks for interpolate operator.""" class InterpolateBenchmark(op_bench.TorchBenchmarkBase): def init(self, input_size, output_size, channels_last=False, mode='linear', dtype=torch.float): input_image = torch.randint(0, 256, size=input_size, dtype=dtype, device='cpu', requires_grad=self.auto_set()) if channels_last: if input_image.ndim == 4: input_image = input_image.contiguous(memory_format=torch.channels_last) elif input_image.ndim == 5: input_image = input_image.contiguous(memory_format=torch.channels_last_3d) else: raise ValueError( f"Can not set channels_last to the input of {input_image.ndim} dims" ) align_corners = None if "nearest" in mode else False if mode == "linear": mode = { 3: 'linear', 4: 'bilinear', 5: 'trilinear', }[input_image.ndim] self.inputs = { "input_image": input_image, "output_size": output_size, "mode": mode, "align_corners": align_corners, } self.set_module_name("interpolate") def forward(self, input_image, output_size, mode, align_corners): return torch.nn.functional.interpolate(input_image, size=output_size, mode=mode, align_corners=align_corners) def make_config(): sizes = ( ((224, 224), (64, 64)), ((224, 224), (128, 128)), ((600, 400), (224, 224)), ((320, 320), (256, 256)), ((800, 800), (500, 500)), ) attrs = [] for (HW1, HW2) in sizes: attrs.append([(1, 3, *HW1), HW2]) # 3 channels attrs.append([(1, 1, *HW1), HW2]) # 1 channel attrs.append([(1, 3, *HW2), HW1]) # 3 channels attrs.append([(1, 1, *HW2), HW1]) # 1 channel config = op_bench.config_list( attr_names=["input_size", "output_size"], attrs=attrs, cross_product_configs={ 'channels_last': [True], 'mode': ["linear", "nearest", "nearest-exact"], 'dtype': [torch.float, torch.uint8] }, tags=["short"], ) def get_mode(l): for d in l: if "mode" in d: return d["mode"] def get_dtype(l): for d in l: if "dtype" in d: return d["dtype"] config = [l for l in config if not(get_mode(l) == "linear" and get_dtype(l) == torch.uint8)] return config config = make_config() op_bench.generate_pt_test(config, InterpolateBenchmark) if __name__ == "__main__": op_bench.benchmark_runner.main() ``` with ``` for num_threads in 1 2 12 32; do echo "num_threads=$num_threads" && python -m pt.my_interpolate_test --iterations 1000 --omp_num_threads $num_threads ; done > $out_file ``` and this very ugly helper ```py import re with open("main") as f: main = f.readlines() with open("new") as f: new = f.readlines() out = [] for main_line, new_line in zip(main, new): if main_line.startswith("num_threads="): num_threads = int(main_line.split("=")[-1]) if main_line.startswith("# Input"): deets = f"{main_line.strip()}, {num_threads=}" if main_line.startswith("Forward"): main_time = float(main_line.split()[-1]) new_time = float(new_line.split()[-1]) ratio = main_time / new_time fmt = ".1f" if ratio < 3 else ".0f" improv = f"{ratio:{fmt}}X" time_fmt = ",.3f" if new_time < 100 else ",.1f" deets = deets.strip().replace("# Input: ", "") deets = deets.replace(": ", "=") deets = deets.replace("input_size=", "") deets = deets.replace(", output_size=", " -> ") deets = deets.replace("dtype=torch.", "") deets = deets.replace("mode=", "") deets = deets.replace("channels_last=True, ", "") split = deets.split(",") size = ','.join(split[:-3]) mode, dtype, threads = split[-3:] deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}" l = f"{deets} {improv:<5} 
{main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms" out.append(l) def key(s): num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),) input_shape, output_shape = re.findall("\(.*?\)", s) input_shape = input_shape[1:-1] # remove parenthesis input_HW = tuple(int(x) for x in input_shape.split(",")[-2:]) input_C = (-int(input_shape.split(",")[1]),) output_HW = tuple(int(x) for x in output_shape[1:-1].split(",")) is_downsample = (output_HW[0] < input_HW[0],) if "linear" in s: mode = "linear" elif "nearest-exact" in s: mode = "nearest-exact" else: assert "nearest" in s mode = "nearest" mode = (mode,) return is_downsample + input_HW + output_HW + num_threads + input_C + mode for i, l in enumerate(sorted(out, key=key)): if i % 10 == 0 and i % 40 != 0: print() if i % 40 == 0: print("-" * 100) print(l) ```
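For reference, the configuration this PR targets looks like the following (an illustrative call only; shapes, modes, and dtypes mirror the benchmark settings above):

```python
import torch
import torch.nn.functional as F

# Channels-last image with fewer than 4 channels: the case sped up by this PR.
img = torch.rand(1, 3, 600, 400).contiguous(memory_format=torch.channels_last)
out = F.interpolate(img, size=(224, 224), mode="bilinear", align_corners=False)

# uint8 masks through nearest / nearest-exact also hit the improved path.
mask = torch.randint(0, 256, (1, 1, 600, 400), dtype=torch.uint8)
mask = mask.contiguous(memory_format=torch.channels_last)
mask_out = F.interpolate(mask, size=(224, 224), mode="nearest-exact")
```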
Closes https://github.com/pytorch/pytorch/issues/83840 When this is merged we should be able to remove some hack in vision as well https://github.com/pytorch/vision/pull/6661 (CC @vfdev-5 @datumbox ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86361 Approved by: https://github.com/vfdev-5, https://github.com/datumbox, https://github.com/fmassa commit 70c6a988d6b5cabed84c686316b6bbeb235cc05c Author: Wang, Eikan Date: Fri Oct 7 03:54:33 2022 +0000 Fix the performance issue that the for-loop before ExternalCall could not be parallelized. (#85056) Currently, NNC only parallelizes the loop statement of the graph outputs. This logic can bypass some loop statements that could be parallelized. Take the example below and suppose the output of `ExternalCall` is also the output of the NNC fusion group. The current [parallel logic](https://github.com/pytorch/pytorch/pull/85056/files#diff-9a11174c26e4b57ab73e819520122bc314467c72962f3a5b79e7400ea3c4bbe5L781-L785) only tries to parallelize the `ExternalCall` and bypasses `stmt1` and `stmt2`.
```c++
stmt1: For:
stmt2: For:
stmt3: ExternalCall
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85056 Approved by: https://github.com/frank-wei, https://github.com/bertmaher commit 2110c8944379bc3268c74e8d9f76c6fb3c896dfe Author: PyTorch MergeBot Date: Fri Oct 7 05:20:36 2022 +0000 Revert "Revert "Revert "SymIntify cat and narrow (#86191)"" (#86289)" This reverts commit e778fbf5197638d6196c5d5acf6f9588a1e83368. Reverted https://github.com/pytorch/pytorch/pull/86289 on behalf of https://github.com/seemethere due to Fails internal tests see: https://www.internalfb.com/intern/sandcastle/job/27021598552487548/ commit 6c604c9262307ffcaf1d7dd68bfa5f6b44513d06 Author: eqy Date: Fri Oct 7 05:13:37 2022 +0000 [CuDNN v8 API][Quantization] fix alignment function in quantized cuDNN V8 path (#86253) This bug was in the native cuDNN V8 API integration and was fixed a while ago, but the change was never ported here. Previously the returned alignment could be twice the actual alignment of the data if the alignment was smaller than 16. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86253 Approved by: https://github.com/dzdang commit 455b873919d928a073eb2d60e07d1c5b2de2d6c6 Author: Sherlock Huang Date: Fri Oct 7 02:26:50 2022 +0000 Introduce a match filter for SubgraphRewriter (#86430) This PR introduces an interface for a user-defined function that filters the matches in SubgraphRewriter. The function has the following signature: `callable(match: InternalMatch, original_graph: Graph, pattern_graph: Graph) -> bool`. This filter is applied after SubgraphMatcher returns the matches, and before replacement takes place (an illustrative filter is sketched below, after this block of commits). Pull Request resolved: https://github.com/pytorch/pytorch/pull/86430 Approved by: https://github.com/jerryzh168 commit b5fd845fdf90121d91e8f4cf66a2de761707d22f Author: PyTorch MergeBot Date: Fri Oct 7 04:44:19 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#86399) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash.
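A hypothetical match filter with the signature described above (illustrative only, not taken from the PR; `no_mul_filter` is an invented name):

```py
# Hypothetical SubgraphRewriter match filter: accept a match only if the matched
# subgraph contains no call to torch.mul. match.nodes_map maps pattern-graph nodes
# to the corresponding nodes in original_graph.
import torch

def no_mul_filter(match, original_graph, pattern_graph) -> bool:
    return all(
        not (n.op == "call_function" and n.target is torch.mul)
        for n in match.nodes_map.values()
    )
```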
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86399 Approved by: https://github.com/pytorchbot commit 10aead9adc20bd45b7692e97a64cb76f114c8e16 Author: Nikita Shulga Date: Fri Oct 7 04:39:28 2022 +0000 [MPS] Cache multinomial_with_replacement graph (#86437) Reuse existing RandomCachedGraph to keep RNG state as part of the graph Add `CreateCachedGraphAs` convenience wrapper Addresses https://github.com/pytorch/pytorch/pull/86342#pullrequestreview-1132197848 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86437 Approved by: https://github.com/kulinseth commit 9ceadcadb21beb8e346109d804db35f0c213d8e0 Author: Elias Ellison Date: Fri Oct 7 00:18:44 2022 +0000 Fix unfold backward decomp aliasing for 0 dim input (#86428) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86428 Approved by: https://github.com/ngimel, https://github.com/ezyang commit b14f1d7bb855834ec5f2d3996746e048ba835d69 Author: Kevin Stephano Date: Fri Oct 7 03:55:13 2022 +0000 Add Skip List for Aten Ops that are fused in nvFuser. (#86101) This Skip List (tuple) is added under the nvprims context manager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86101 Approved by: https://github.com/jjsjann123, https://github.com/mruberry commit c5a4844085ea4db27912f5174be2585aebf7079a Author: Driss Guessous Date: Fri Oct 7 03:52:46 2022 +0000 Xformer SDP forward/backward kernel (#86157) Include xformer kernel code and make header updates to successfully build. Need to update the kernel calling code and dispatch system to clean this up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86157 Approved by: https://github.com/cpuhrsch commit ca39e3679ff834d67da20abaa3b313664c935d8a Author: PyTorch MergeBot Date: Fri Oct 7 03:19:28 2022 +0000 [vision hash update] update the pinned vision hash (#86173) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86173 Approved by: https://github.com/pytorchbot commit 2fec853c8796adbf1b6b13fc095b032c5b9ef7b9 Author: Sherlock Huang Date: Thu Oct 6 23:18:04 2022 +0000 Fix SubgraphMatcher for case of no anchor found (#86421) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86421 Approved by: https://github.com/jerryzh168 commit b73f0e98d5eed44729aeb5925912b7038ce7c59a Author: Michael Voznesensky Date: Fri Oct 7 01:46:51 2022 +0000 Fix cond tests after CI was disabled for a bit (#86321) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/86321 Approved by: https://github.com/zou3519 commit ca69ddb4f7b1e1449756177889c454edb8bc091f Author: Alex Date: Fri Oct 7 01:38:57 2022 +0000 Fix broadcasting to implicit leading dimensions in `torch.where` on MPS (#86240) Fixes #86239 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86240 Approved by: https://github.com/kulinseth commit 0e30da3f2f3620caa91ada734d9cd3b91d4ee606 Author: Zafar Date: Fri Oct 7 00:58:38 2022 +0000 [refactor] Renaming ao.sparsity to ao.pruning (#84867) `Sparsity` as a term doesn't reflect the tools that are developed by the AO. The `torch/ao/sparsity` also has utilities for structured pruning, which internally we always referred to as just "pruning". To avoid any confusion, we renamed `Sparsity` to `Prune`. We will not be introducing the backwards compatibility, as so far this toolset was kept under silent development. 
This change will reflect the changes in the documentation as well. **TODO:** - [ ] Change the tutorials - [ ] Confirm no bc-breakages - [ ] Reflect the changes in the trackers and RFC docs Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/84867 Approved by: https://github.com/supriyar commit 9a170b24f64d7cfdd887ff122c241ac6ff85f4c6 Author: Dennis van der Staay Date: Fri Oct 7 00:29:32 2022 +0000 Cleanup PT-D imports (#85781) Summary: The flow logic around torch.dist imports results in large number of pyre errors (100's); would be preferable to just raise on importing as opposed to silently fail. Con: Some percentage (MacOS?) of users may have notebooks that imports PT-D, although would think small, since any attempt to call parts of the library would just fail... TODO: assuming ok, will remove the 10's-100's of unused pyre ignores no longer required. Test Plan: existing unit tests Differential Revision: D39842273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85781 Approved by: https://github.com/mrshenli commit a2419638373071c74692c9fe5996c69ef509f581 Author: Masaki Kozuki Date: Fri Oct 7 00:10:25 2022 +0000 [nll_loss] Avoid unnecessary type casts (#86086) follow-up #85395 `AT_DISPATCH_NLL_LOSS_INDEX_TYPES` should not be removed in favor of #59765 and there's a testcase https://github.com/pytorch/pytorch/blob/99ca25e6eb8299f31824bdbaf62f16f8a8db458d/test/test_nn.py#L16832 Besides the dispatcher, I wanted to sanity check `int64_t ignore_index` because `int64_t` can be inappropriate considering that `target` can be `Byte`. However, given that the default value is -100 as in https://github.com/pytorch/pytorch/blob/0a75c42f36c0e50a22c06fa65478df53d7d420c4/aten/src/ATen/native/native_functions.yaml#L9949 it's not easy to add a check while keeping the backward compatibility. Thus I decided to not add a check. cc @lezcano @t-vi Pull Request resolved: https://github.com/pytorch/pytorch/pull/86086 Approved by: https://github.com/lezcano commit 2232db7fc12301a2226d1921948917d5b23b6888 Author: Nikita Shulga Date: Fri Oct 7 00:08:42 2022 +0000 Replacement is irrelevant for 1-sample multinomial (#86342) So use fast path, both on CPU and on MPS Also, remove some spurious copy-n-paste checks from MPS codepath CUDA already has this optimization, see https://github.com/pytorch/pytorch/blob/dc9c507d24d0c833cb09105177326f1f6bbe99c4/aten/src/ATen/native/cuda/MultinomialKernel.cu#L355-L356 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86342 Approved by: https://github.com/ngimel commit 5a8b07de75acc2b03fadee4fa12384cc5e779a0f Author: Peter Bell Date: Mon Aug 29 21:21:17 2022 +0100 Declare public dependencies on libshm (#82694) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82694 Approved by: https://github.com/malfet commit 08e3999fa494238f8f62346a140da36bd43864e7 Author: Brian Hirsh Date: Thu Oct 6 13:25:05 2022 -0700 Merge more symbolic meta kernels and symint changes from branch (#86334) symintify split_with_sizes, dropout, fused_fake_obs_quant. 
meta for padding_2d ops add meta_bernoulli_ meta kernel for at::gather get pytorch_struct to pass: meta for scatter_add, fix backward symintify split ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/86334 Approved by: https://github.com/ezyang commit 3af0eafea69f200bd83c5e0c06f5af5fb4a90c28 Author: atalman Date: Thu Oct 6 23:26:58 2022 +0000 Release 1.13: Bump nightly version 1.13->1.14 (#86296) Release 1.13: Bump nightly version 1.13->1.14 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86296 Approved by: https://github.com/seemethere, https://github.com/malfet commit 5ed75ec1d7131c1aa65c94669d24dffbcb5d8769 Author: Tongzhou Wang Date: Thu Oct 6 23:11:22 2022 +0000 Fix SparseAdam consuming iterator (#86210) Fixes https://github.com/pytorch/pytorch/issues/86209 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86210 Approved by: https://github.com/cpuhrsch commit f0977c4658c6f8c10e3342cf9a0249d5d23a3505 Author: Rohan Varma Date: Thu Oct 6 19:21:55 2022 +0000 [FSDP] Doc to explain running submodules (#86343) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86343 Approved by: https://github.com/awgu commit 3db8ddcac10239dc44aeeab16ab66c82444f358d Author: Rohan Varma Date: Thu Oct 6 19:14:02 2022 +0000 [FSDP] Fix clip_grad_norm for CPU offload (#86337) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86337 Approved by: https://github.com/awgu commit adfd8f382331adbf9cbfa14039ef3b61f2f4e10c Author: Rohan Varma Date: Thu Oct 6 19:14:02 2022 +0000 [FSDP] assert to runtime error (#86336) Prefer raising an error over `assert` which should mostly to indicate a developer bug, but user can cause this error path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86336 Approved by: https://github.com/awgu commit 7a411952fbb82cec38da936a7d863da49726699f Author: Rohan Varma Date: Thu Oct 6 19:14:01 2022 +0000 CheckpointSequential support non-reentrant (#86331) Closes https://github.com/pytorch/pytorch/issues/86328 Adds `use_reentrant` argument to `checkpoint_sequential`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86331 Approved by: https://github.com/zhaojuanmao, https://github.com/albanD commit 3037f3d710184b56d949087b39438649d314bac0 Author: David Date: Thu Oct 6 22:38:50 2022 +0000 Docs: fix typo (#86273) Typo in torch.fx.Interpreter.fetch_attr docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/86273 Approved by: https://github.com/kit1980 commit 233d6f195aa404766448c33d18b8e7ca5e66de51 Author: PyTorch MergeBot Date: Thu Oct 6 22:02:02 2022 +0000 Revert "Fix memory leak in _LRScheduler.step() (#85602)" This reverts commit eb32330d6b3709dc8910eb298d8802fbca57b05c. 
Reverted https://github.com/pytorch/pytorch/pull/85602 on behalf of https://github.com/albanD due to newly added test is flaky commit bf746798841bd42c9b849716f9eeefde3271e93d Author: atalman Date: Thu Oct 6 21:55:33 2022 +0000 Fix for binary upload step, use bash shell rather than the default sh (#86382) This fixes the issue during upload:
```
Run # reference ends with an RC suffix
  if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then
    echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV"
  fi
  shell: sh -e {0}
/__w/_temp/f045f5d8-ddb.sh: 2: [[: not found
```
Test failure: https://github.com/pytorch/pytorch/actions/runs/3199561387/jobs/5225448559 Test success: https://github.com/pytorch/pytorch/actions/runs/3199573560/jobs/5225480345 The error started when we switched to continuumio/miniconda3:4.12.0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86382 Approved by: https://github.com/weiwangmeta commit facf210f9a6e98bdeb2ec343b8f16c5bb047c4ce Author: HDCharles Date: Thu Oct 6 11:21:24 2022 -0700 [ao] fixing public v private for qconfig.py (#86026) Summary: no changes, just removed the exception for this file; someone had already fixed the actual file Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86026 Approved by: https://github.com/jerryzh168 commit 7c5e07f87ba0443ada94bf849ed3cceb9a2f31a2 Author: Jay Chae Date: Thu Oct 6 21:36:15 2022 +0000 [kineto] guard global observer init against Edge profiler (#86347) Summary: It looks like Sandcastle CI didn't cover any of the concrete mobile CI (cc: kimishpatel, I'd assume we have a ton of mobile tests in GitHub?). This is failing on Oculus with a similar failure as on Mac (not sure if this is an ARM thing). Either way, on-demand tracing should not be enabled on these platforms, so disable them completely. In the future, we should have a runtime check on this for even safer guarding. Test Plan: Set up Hollywood via P536072492; crash on mutex,
likely SIOF ``` FORTIFY: pthread_mutex_lock called on a destroyed mutex (0x5d7e298b08) *** Aborted at 1665017107 (Unix time, try 'date -d 1665017107') *** *** Signal 6 (SIGABRT) (0xeca) received by PID 3786 (pthread TID 0x785bd1eed0) (linux TID 3786) (maybe from PID 3786, UID 0) (code: -1), stack trace: *** (error retrieving stack trace) ``` Redacted in the top but the test passes without the crash P536101962 Differential Revision: D40129840 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86347 Approved by: https://github.com/aaronenyeshi commit bc919ac7963be9f113b4e3c8b668905404301f8f Author: Jiaxu Zhu Date: Thu Oct 6 20:05:56 2022 +0000 [torch.ao.quantization] include torch.qint32 for static quant (#86345) Summary: include `torch.qint32` to `activation_is_statically_quantized` and `get_quant_type` so that fakequantize with `dtype=torch.qint32` won't be skipped Test Plan: updated `test_custom_module_class` Differential Revision: D40128178 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86345 Approved by: https://github.com/jerryzh168 commit 08780229df8f860dbef3fa82ffd1072b124b29c5 Author: lezcano Date: Thu Oct 6 15:55:21 2022 +0000 Two small improvements to references (#86371) As per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/86371 Approved by: https://github.com/mruberry commit 795906f207bb95245d47501645876b5d165aee3e Author: Huy Do Date: Thu Oct 6 18:53:59 2022 +0000 Add total GPU memory utilization (#86250) Although we already have per process GPU memory usage, I'm curious to see what is the number for `gpu_utilization.memory` per https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html. Also fixing a tiny typo issue that has been bugging me for a while `total_gpu_utilizaiton` Pull Request resolved: https://github.com/pytorch/pytorch/pull/86250 Approved by: https://github.com/ZainRizvi commit 1059d3b52d9984f32f87432f9f87eaf7164d7f88 Author: Zain Rizvi Date: Thu Oct 6 18:47:07 2022 +0000 Make mergebot message clearer when starting a new merge (#86311) Modifying how the merge started message appears to make it more readable. Also removing some deprecated v1 land checks messages Old: image New: image Pull Request resolved: https://github.com/pytorch/pytorch/pull/86311 Approved by: https://github.com/malfet, https://github.com/huydhn commit 6b295cd0460825ff29ab151208de137d76bf8364 Author: Pearu Peterson Date: Thu Oct 6 13:11:58 2022 +0300 Enable autograd on Linear with sparse COO weight (#86302) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86302 Approved by: https://github.com/cpuhrsch commit 8f2c2167d42d067adf4fe7e13f04de0d0b6d87aa Author: Pearu Peterson Date: Thu Oct 6 13:11:57 2022 +0300 Support autograd on sparse_mm in full. (#86301) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86301 Approved by: https://github.com/cpuhrsch commit 88b882cd1c93e8fe9b4f2bf0c542c700a8ba69a6 Author: Pearu Peterson Date: Thu Oct 6 13:11:57 2022 +0300 Support sum on a sparse COO tensor. (#86300) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86300 Approved by: https://github.com/cpuhrsch commit f104490d635747e4164e954d36954ea3a01731a5 Author: Pearu Peterson Date: Thu Oct 6 13:11:56 2022 +0300 Support autograd on Linear with sparse compressed weight. 
(#86137) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86137 Approved by: https://github.com/cpuhrsch commit fc21cc82fcdb07d604dd9ae161acc05b93097c1b Author: Pearu Peterson Date: Thu Oct 6 13:11:56 2022 +0300 Enable sparse_dim() and dense_dim() methods for Strided tensors (#86203) The reason for enabling sparse/dense_dim() for strided tensors is to have more meaningful error messages: For instance, compare ``` NotImplementedError: Could not run 'aten::sparse_dim' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::sparse_dim' is only available for these backends: [SparseCPU, SparseCUDA, SparseMeta, SparseCsrCPU, SparseCsrCUDA, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher]. ``` [master] vs ``` RuntimeError: addmm: matrices expected, got 0D tensor ``` [this PR] where the latter message gives a hint of which function is to blame for dealing with unexpected inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86203 Approved by: https://github.com/cpuhrsch commit bed1ece9c54f10580ee870ae1d73edcb9279727f Author: PyTorch MergeBot Date: Thu Oct 6 17:34:29 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#86306) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86306 Approved by: https://github.com/pytorchbot commit eb32330d6b3709dc8910eb298d8802fbca57b05c Author: Chengqi Deng Date: Thu Oct 6 17:07:34 2022 +0000 Fix memory leak in _LRScheduler.step() (#85602) Fixes #85410 This diff removed the cyclic references in `_LRScheduler.step()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85602 Approved by: https://github.com/albanD commit b8b564c90872316fe84fc781631afed5e78e069a Author: Huy Do Date: Thu Oct 6 16:47:45 2022 +0000 Ensure the minimum NVIDIA driver version to be 515.57 for CUDA 11.7 (#86344) This does 2 things: * Ensure that `nvidia-driver-latest-dkms` package is removed if it's installed. This allows the installation to go forward without the below error when using the standard installation script from S3: ``` (Answer: Abort installation) ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details. ``` * Not skipping the installation if a driver different than `515.57` exists to avoid any unexpected behavior when using a different driver version. 
This partly addresses the recent issue in https://github.com/pytorch/pytorch/issues/85778 in which `510.60.02` is there instead (not sure from where) and fails CUDA 11.7 test Pull Request resolved: https://github.com/pytorch/pytorch/pull/86344 Approved by: https://github.com/atalman, https://github.com/malfet commit 0c148a4b5f1d30daf7401b9c1131d290274e0cd3 Author: Christian Puhrsch Date: Thu Oct 6 16:28:05 2022 +0000 Remove extra bracket, update header definition (#86317) Summary: Fix compilation error Test Plan: Unit test Reviewed By: malfet, mikaylagawarecki Differential Revision: D40108369 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86317 Approved by: https://github.com/malfet commit fb9b96593c784b86b3d913ef8799ee120c203207 Author: Peter Bell Date: Mon Aug 29 21:21:16 2022 +0100 Use FindCUDAToolkit to find cuda dependencies (#82695) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82695 Approved by: https://github.com/malfet commit fa799132d82c3c48253aaf7d3ee3a8c5e007350d Author: Nikita Shulga Date: Thu Oct 6 15:38:57 2022 +0000 [MPS] Better error message for `slow_conv2d_forward` (#86303) Error `Could not run 'aten::_slow_conv2d_forward' with arguments from the 'MPS' backend.` is very misleading as usually this method is only invoked if input is on CPU but weights are on MPS device. Raise a more user friendly error in this case Add test to `test_invalid_conv2d` to check for those conditions. Fixes https://github.com/pytorch/pytorch/issues/77931 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86303 Approved by: https://github.com/kulinseth commit 4d7728890b134b712c16ace20e6660f1f840db43 Author: Edward Z. Yang Date: Wed Oct 5 21:42:02 2022 -0700 Inline asIntArrayRef (#86350) I was benchmarking and this is worth maybe 5% on at::empty, but it's basically free so we should do it. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86350 Approved by: https://github.com/albanD commit cebf08afb24dec0720935b9a9bd64ecf05b472d5 Author: andrewor14 Date: Wed Oct 5 15:30:59 2022 -0700 [Quant] Remove weight from DTypeConfig for non-weighted ops (#86335) Summary: Weight dtypes should be specified only for weighted ops like conv and linear. This commit removes weight dtypes from the DTypeConfigs used in binary ops and fixed qparams ops. Test Plan: python test/test_quantization.py TestQuantizeFx python test/test_quantization.py TestQuantizeFxOps Reviewers: jerryzh168, vkuzo Subscribers: jerryzh168, vkuzo Pull Request resolved: https://github.com/pytorch/pytorch/pull/86335 Approved by: https://github.com/vkuzo commit cdbffa7f665dd144ada92c11d223aeb8b5c3887a Author: Antoni Viros i Martin Date: Thu Oct 6 13:10:25 2022 +0000 🦊 [AI Accelerators] Consolidate native_layer_norm for nested tensor (#86295) Summary: In order to make the layer normalization implementation for nested tensors public, it needs to be generalized to accept a normalized_shape argument instead of assuming it to be the last dimension of the nested_tensor. This commit does that, as well as adding extra unit tests to ensure the implementation is correct. 
Test Plan: All unit tests designed to test different ways of using the function work: `buck test //caffe2/test:nested -- test_layer_norm` Differential Revision: D40105207 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86295 Approved by: https://github.com/drisspg commit 85c3b745f6fc94b757f30f518108ee64ffd292a5 Author: John Detloff Date: Thu Oct 6 10:08:54 2022 +0000 Conditionally build the TestApp benchmark based on lite interpreter (#86314) The TestApp benchmark was recently re-added, however it seems it only builds when pytorch is built with the lite interpreter. This diff adds a macro to compile out the benchmark when pytorch is built as full jit. This should fix our full jit simulator nightly builds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86314 Approved by: https://github.com/malfet commit 936e93058b2781d6cee2da59cccba051726dd46f Author: Sahan Paliskara Date: Wed Oct 5 21:06:01 2022 -0700 Delete torch::deploy from pytorch core (#85953) As we have migrated torch::deploy over to https://github.com/pytorch/multipy, we can now delete it from pytorch core as ongoing development will happen there. This PR was created due to syncing issues with https://github.com/pytorch/pytorch/pull/85443 which is where the review history can be found. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85953 Approved by: https://github.com/seemethere, https://github.com/malfet commit 27c3fb03864597909a7288e82e3e6699131e7509 Author: Seonglyong Gong Date: Thu Oct 6 06:32:25 2022 +0000 [Profiler] trace verbose=false by default (#86263) Summary: - Added config option to remove 'Call stack' field from trace file (#84982) - Change default value to `false` Test Plan: - `experimental_config=_ExperimentalConfig(verbose=true),` will add 'Call stack' field back in the trace file. - CI tests Differential Revision: D40092377 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86263 Approved by: https://github.com/aaronenyeshi commit a117fde86febc2b1c27e7a0e809ae22d46e33849 Author: Seonglyong Gong Date: Thu Oct 6 06:18:56 2022 +0000 [Profiler] Apply TensorMetadata for Optimizer and nnModule (#86047) Summary: - Use `TensorMetadat` struct in saving tensor info from Optimizer and nnModule. Test Plan: buck run mode/opt //caffe2/test:profiler Reviewed By: chaekit Differential Revision: D39682205 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86047 Approved by: https://github.com/chaekit, https://github.com/robieta commit fd5085c445c3f1a4c90e55154cf26fe30f52a0ab Author: albanD Date: Thu Oct 6 04:46:19 2022 +0000 Symintify getitem and add the required helper functions (#86207) Note that this might not cover every use of the function (we know it doesn't) But this is enough to get few models passing. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86207 Approved by: https://github.com/ezyang, https://github.com/Chillee, https://github.com/bdhirsh commit 0a75c42f36c0e50a22c06fa65478df53d7d420c4 Author: Edward Yang Date: Thu Oct 6 04:11:05 2022 +0000 Workaround MSVC ICE due to constexpr char* template argument (#86288) Test Plan: Lease a Windows sandcastle https://www.internalfb.com/intern/wiki/Windows_Platform_Engineering/Leasable_VM_-_User_Guide/ and run: ``` buck build arvr/mode/win/opt //xplat/caffe2:_C_impl ``` Differential Revision: D40109191 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86288 Approved by: https://github.com/albanD, https://github.com/malfet commit 45f03d69486e45b67bfcab9e60a2c24aa5f1ea8d Author: Edward Z. Yang Date: Wed Oct 5 14:44:34 2022 -0700 Add at::symint:: namespace for ease of templated functions (#86329) Our prevailing strategy for symbolic shapes in C++ is to only write the SymInt version of the code, and pay a slight performance tax from not knowing if it is symbolic or not. However, there are some fastpath functions where this tax is unacceptable, and we want to specialize for the int case. Sometimes, it is easy to template the function; but when the function involves Tensors, it is not, because the functions you may want to call are not templated, e.g., t.view vs t.view_symint This PR adds an at::symint:: namespace which contains templated functions for all functions in PyTorch which you can use in this way. To show this works, I refactored sum_to to stop incorrectly reinterpret casting and instead use a template. Instead of t.sizes(), we call at::symint::sizes(t), and so forth. The template functions are SFINAE'd using a template argument that is not otherwise used. As such, deduction is impossible. Typically, deduction is hard anyway, because many of the constructors are ambiguous (this is why we split foo and foo_symint in the first place). So you must pass a template argument to these functions. These functions are codegened into Functions.h so they are subject to per-operator headers. This matters most for methods, which likely didn't include the per-operator header, so you will have to add an include in that case. We never generate method variants for these. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86329 Approved by: https://github.com/bdhirsh, https://github.com/voznesenskym commit ea21a982f25120235d91e3be5a371a26855c112c Author: Edward Z. Yang Date: Tue Oct 4 23:08:51 2022 -0400 Reduce warning suppression by just disabling pytest warnings plugin (#86255) Fixes https://github.com/pytorch/pytorch/issues/85626 Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86255 Approved by: https://github.com/lezcano, https://github.com/albanD commit adf5919720c02dcf8c1ff32c890dd1c4e54d6fe7 Author: Edward Z. Yang Date: Mon Oct 3 13:56:53 2022 -0700 Add option to record C++ backtraces in _record_memory_history (#86145) I used this to debug https://github.com/pytorch/pytorch/issues/86136 so it is useful. The implementation is not so fast so it is not enabled by default. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86145 Approved by: https://github.com/albanD, https://github.com/zdevito commit 97d6b5bbf89172ad94143f4ce4a9b9a3a4d7b744 Author: Edward Z. Yang Date: Mon Oct 3 13:56:53 2022 -0700 Refactor _cuda_recordMemoryHistory to use pybind11 (#86139) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86139 Approved by: https://github.com/albanD commit d04889323e2bc0b7321b76e564292565c88b9a5e Author: Elias Ellison Date: Wed Oct 5 21:25:25 2022 +0000 Add Context Manager for Disabling Multithreading in Backwards, use in aot autograd (#86245) We were running into a few issues with running multithreaded backwards in aot_autograd: such as https://github.com/pytorch/pytorch/issues/86136, and `FakeTensorMode` getting into a weird state as a result of not executing functions completely sequentially. The multithreaded backwards is lost in translation when we trace out the backwards anyway, and adds a lot of additional complexity. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86245 Approved by: https://github.com/albanD, https://github.com/yf225 commit 237316aa1da372e894a3bd4ad5ad7e831e3b7636 Author: Vasiliy Kuznetsov Date: Tue Oct 4 16:59:15 2022 -0700 PNP: early FX numeric suite tool to quantize each layer N times (#80521) Summary: This PR is an early prototype of a tool to quantize each layer of a model N times, with N qconfigs each. We follow the design agreed upon in https://fburl.com/gdoc/e1gaq3ih . Current API: ``` m = M().eval() example_input = (torch.randn(2, 2),) qconfig_mappings = [ QConfigMapping().set_global(torch.quantization.default_qconfig), QConfigMapping().set_global(torch.quantization.default_dynamic_qconfig), ] backend_config = get_native_backend_config() msp = prepare_n_shadows_model( m, example_input, qconfig_mappings, backend_config) for _ in range(2): msp(*example_input) msq = convert_n_shadows_model(msp) msq(*example_input) results = extract_results_n_shadows_model(msq) print_comparisons_n_shadows_model(results) // example output subgraph_idx ref_node_name best_idx 1 2 -------------- --------------- ---------- ------- ------- subgraph_0 fc1 2 42.0834 42.6279 subgraph_1 fc2 2 43.7259 50.0593 ``` Test plan: ``` python test/test_quantization.py -k test_n_shadows ``` Differential Revision: [D37650332](https://our.internmc.facebook.com/intern/diff/D37650332) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80521 Approved by: https://github.com/jerryzh168, https://github.com/andrewor14 commit b233d83471147bf578c7ae79df2ee8bc30c10ca2 Author: Yu Guo Date: Thu Oct 6 01:08:59 2022 +0000 make torch.histc ignore NaNs on CPU (#85870) Summary: cuda torch.histc already ignores NaNs Test Plan: unittest added Differential Revision: D39911272 fix https://github.com/pytorch/pytorch/issues/85853 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85870 Approved by: https://github.com/ngimel commit ddec1eea05e8c2efc772536cf94d578950c37f5e Author: Mike Iovine Date: Thu Oct 6 01:07:40 2022 +0000 [Static Runtime] Block linalg_svdvals codegen & run codegen script (#85983) Summary: The test is causing issues: ``` terminate called after throwing an instance of 'std::runtime_error' what(): The following operation failed in the TorchScript interpreter. Traceback of TorchScript (most recent call last): graph(%A: Tensor, %driver: str?): %bias: None = prim::Constant() %ret = aten::linalg_svdvals(%A, %driver) ~~~~ <--- HERE %cloned = aten::clone(%ret, %bias) return (%cloned) RuntimeError: torch.linalg.svd: keyword argument `driver=` is only supported on CUDA inputs with cuSOLVER backend. ``` Just block the op and re-run the codegen script to remove everything and update the generated ops. 
Test Plan: Existing tests Differential Revision: D39973860 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85983 Approved by: https://github.com/xuzhao9, https://github.com/tenpercent commit bebd1622490becd09de97003bd22761e973d3edd Author: Charlie Yan Date: Thu Oct 6 00:48:54 2022 +0000 Fix doc of DDP (#86244) (#86256) [ghstack-poisoned] Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/86256 Approved by: https://github.com/rohan-varma commit 020f2b2c0b697a9bbc5422b2c4428c4f6604f11b Author: Brian Hirsh Date: Wed Oct 5 11:25:31 2022 -0700 add myself for dynamic shapes PR review (#86292) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86292 Approved by: https://github.com/albanD commit dc9c507d24d0c833cb09105177326f1f6bbe99c4 Author: Natalia Gimelshein Date: Wed Oct 5 23:59:16 2022 +0000 add nominal support for int32 indices in index/index_put ops (#86309) Currently index_select/index_add decompositions decompose to `index` or `index_put` ops. The problem with this is that `index_select` and `index_add` accept int32 indices while `index` doesn't. That leads to error in meta func for those decompositions. This PR adds non-performant support for int32 indices to `index` operations, to allow decompositions go through. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86309 Approved by: https://github.com/lezcano commit e8b0bea677b44206f663788e3a9d6a85b3779ed2 Author: Edward Z. Yang Date: Wed Oct 5 12:46:41 2022 -0700 Rename fromIntArrayRef to fromIntArrayRefSlow, audit call sites (#86235) Some of them are known non-negative, I've revised them accordingly. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86235 Approved by: https://github.com/albanD commit 168ba066e3944a1bd897fe25f29a6754e31ca186 Author: PyTorch MergeBot Date: Wed Oct 5 22:42:56 2022 +0000 Revert "Symintify getitem and add the required helper functions (#86207)" This reverts commit 17addb307ee9a4d12ad6918e90358a9a47a4f12b. Reverted https://github.com/pytorch/pytorch/pull/86207 on behalf of https://github.com/malfet due to Broke lint, by double-registering `meta_index_put`, but no CI was run during the outage commit be4e43c7d05eb67923d84162a7a4203173db3206 Author: Rohan Varma Date: Wed Oct 5 22:30:02 2022 +0000 Remove DataParallel remnants from DDP doc (#86221) As @aazzolini pointed out, the docstring is incorrect and probably vestige from DP / single process multi device mode in DDP. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/86221 Approved by: https://github.com/aazzolini commit 9e1a43122046536d5c1fedc6b1e6d912ca6afb51 Author: Sherlock Huang Date: Wed Oct 5 18:27:34 2022 +0000 Mark ctc_loss with dynamic_output_shape (#86293) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86293 Approved by: https://github.com/eellison commit 0e5a27fb8d7df8541251f1ebfc4373c1358c1bab Author: Edward Z. Yang Date: Wed Oct 5 15:10:18 2022 -0400 Fix horribly double truncation bug in Scalar (#86304) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86304 Approved by: https://github.com/albanD commit 73777d8a2bed1c8878a1858ab8241f5acf0d022b Author: HDCharles Date: Tue Oct 4 13:04:45 2022 -0700 [ao] fixing public v private for quantization_mappings.py (#86025) Summary: no significant changes, just added __all__ Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86025 Approved by: https://github.com/jerryzh168 commit 28a5cd94802c33a29c2d4435f0fb79152711819b Author: HDCharles Date: Tue Oct 4 13:04:44 2022 -0700 [ao] fixing public v private for quantize_jit.py (#86024) Summary: just needed to add __all__ Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86024 Approved by: https://github.com/jerryzh168 commit 17addb307ee9a4d12ad6918e90358a9a47a4f12b Author: albanD Date: Wed Oct 5 21:19:00 2022 +0000 Symintify getitem and add the required helper functions (#86207) Note that this might not cover every use of the function (we know it doesn't) But this is enough to get few models passing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86207 Approved by: https://github.com/ezyang commit b8895df8db23213a0db50fe833930dd1f4e4b5a5 Author: albanD Date: Wed Oct 5 21:08:40 2022 +0000 Fix black binary again for debug python (#86275) The `--no-binary` flag was not ported when moving from black only to ufmt. This adds it back. This is to work around the fact that black binary hard crashes when running with debug python and it needs to be compiled from source. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86275 Approved by: https://github.com/bdhirsh, https://github.com/malfet commit e778fbf5197638d6196c5d5acf6f9588a1e83368 Author: Edward Z. Yang Date: Wed Oct 5 11:32:48 2022 -0700 Revert "Revert "SymIntify cat and narrow (#86191)"" (#86289) This reverts commit fc94a2115b31dfe7a0d8f28eb4f5ed532c4f0792. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86289 Approved by: https://github.com/wconstab commit 089a64e99e2d2b937c72e25c0fa6e4f673b8a1a1 Author: Min Si Date: Wed Oct 5 20:02:02 2022 +0000 Install c10d headers with absolute path (#86257) https://github.com/pytorch/pytorch/pull/85780 updated all c10d headers in pytorch to use absolute path following the other distributed components. However, the headers were still copied to `${TORCH_INSTALL_INCLUDE_DIR}/torch`, thus external extentions still have to reference the c10d headers as ``, making the usage inconsistent (the only exception was c10d/exception.h, which was copied to `${TORCH_INSTALL_INCLUDE_DIR}/torch/csrc/distributed/c10d`). This patch fixes the installation step to copy all c10d headers to `${TORCH_INSTALL_INCLUDE_DIR}/torch/csrc/distributed/c10d`, thus external extensions can consistently reference c10d headers with the absolute path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86257 Approved by: https://github.com/kumpera commit b67e022833df912068406dbd1da2345e6693c7db Author: lezcano Date: Wed Oct 5 12:06:28 2022 +0000 Fix ref / decomposition index_add (#86266) The decomposition of `index_add` was using `slice(None)`, when it should use just `None`. The reference for index_add was also wrong, as `x[idx] += t` does not use atomic add, so it does not work when several `idx`s point to the same location. This PR adds extra reference inputs to help test for this. 
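To make the accumulation pitfall described above concrete, a small worked example (not from the PR): with repeated indices, `x[idx] += t` drops updates, while `index_add` accumulates them.

```py
# Worked example: duplicate indices under advanced-indexing assignment vs index_add.
import torch

x = torch.zeros(3)
idx = torch.tensor([0, 0, 2])
t = torch.ones(3)

y = x.clone()
y[idx] += t                   # no accumulation: index 0 receives only one of its two updates
z = x.index_add(0, idx, t)    # accumulates: index 0 receives both updates

print(y)  # tensor([1., 0., 1.])
print(z)  # tensor([2., 0., 1.])
```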
Fixes https://github.com/pytorch/torchdynamo/issues/1356 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86266 Approved by: https://github.com/ngimel commit 14db44ad72f0110b484c0c8aaf520e110cc91f53 Author: HDCharles Date: Tue Oct 4 13:04:43 2022 -0700 [ao] fixing public v private for quantize.py (#86023) Summary: just needed to add __all__ Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86023 Approved by: https://github.com/jerryzh168 commit c21caff8765f00ed5a2e1ed448ad5e6329c87b8d Author: HDCharles Date: Tue Oct 4 13:04:43 2022 -0700 [ao] correctly set public v private for fake_quantize.py (#86022) Summary: biggest issue was that the constructors for the fake_quantize classes use custom partials that live in the observer module and so the module for these needed to be set correctly in the constructor class method Test Plan: python test/test_public_bindings.py Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/86022 Approved by: https://github.com/jerryzh168 commit 3b1ec7511e6d616fbe2e9f8721ff9be6c55d3d42 Author: Edward Z. Yang Date: Tue Oct 4 14:56:28 2022 -0700 Optimize is_symbolic test and some refactor (#86230) Our SymInt rep can be represented more efficiently as just a greater than test, but the compiler doesn't seem to figure it out. Help it out. There is also some refactoring to simplify the code and add more debugging. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86230 Approved by: https://github.com/albanD commit 8c6d352bcfa3a8d2f7322d3577117b2d432cd002 Author: Bin Chen Date: Wed Oct 5 18:23:53 2022 +0000 Log a new "timer expired" event to Scuba in file_based_local_timer (#85861) Summary: The "kill worker process" event was logged to Scuba only when the worker process was really reaped. We want to add a new event "timer expired", no matter the worker process will be reaped or not. This will help collect data before we enable the JustKnob to kill the worker process on timeout. Test Plan: ``` buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test ``` ``` Test Session: https://www.internalfb.com/intern/testinfra/testrun/7318349508929624 RE: reSessionID-ea464c43-54e7-44f2-942b-14ea8aa98c74 Up: 10.5 KiB Down: 1.1 MiB Jobs completed: 100. Time elapsed: 3206.9s. Cache hits: 91%. Commands: 11 (cached: 10, remote: 1, local: 0) Tests finished: Pass 55. Fail 0. Fatal 0. Skip 0. 0 builds failed ``` -------- ``` buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test ``` ``` Test Session: https://www.internalfb.com/intern/testinfra/testrun/6473924579130483 RE: reSessionID-231a47b7-a43d-4c0f-9f73-64713ffcbbd3 Up: 5.7 MiB Down: 1.9 GiB Jobs completed: 182156. Time elapsed: 282.4s. Cache hits: 99%. Commands: 72112 (cached: 72107, remote: 1, local: 4) Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. 0 builds failed ``` Differential Revision: D39903376 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85861 Approved by: https://github.com/d4l3k commit fc94a2115b31dfe7a0d8f28eb4f5ed532c4f0792 Author: PyTorch MergeBot Date: Wed Oct 5 17:19:55 2022 +0000 Revert "SymIntify cat and narrow (#86191)" This reverts commit 63d8d4f6ec5c973ad7b8669cd39ee9b550e5f55b. 
Reverted https://github.com/pytorch/pytorch/pull/86191 on behalf of https://github.com/seemethere due to Fails internal tests, see [D40106464](https://www.internalfb.com/diff/D40106464) commit 3ec71fce79f4e568c48796da4b18a3e6f2c6fc29 Author: Peter Bell Date: Tue Oct 4 21:12:22 2022 +0100 Improve make_tensor performance for float and complex types (#85473) For floating types, `make_tensor` calls `rand` and then does a linear interpolation from `low` to `high`. This instead calls `uniform_(low, high)` to cut out the interpolation step. For complex types, `make_tensor` does the `rand` + interpolation step twice and calls `torch.complex(real, imag)` at the end. This instead uses `view_as_real` and `uniform_(low, high)` to fuse it all into one operation. My benchmarks show significant speedups in all cases for float32 and complex64.

| Device | dtype     | Size  | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-------------|--------------|---------|
| CPU    | float32   | 8     | 19.4        | 6.34         | 3.1     |
|        |           | 4096  | 36.8        | 21.3         | 1.7     |
|        |           | 2**24 | 167,000     | 80,500       | 2.1     |
|        | complex32 | 8     | 37.0        | 7.57         | 4.9     |
|        |           | 4096  | 73.1        | 37.6         | 1.9     |
|        |           | 2**24 | 409,000     | 161,000      | 2.5     |
| CUDA   | float32   | 8     | 40.4        | 11.7         | 3.5     |
|        |           | 4096  | 38.7        | 11.7         | 3.3     |
|        |           | 2**24 | 2,300       | 238          | 9.7     |
|        | complex32 | 8     | 78.7        | 14           | 5.6     |
|        |           | 4096  | 82.7        | 13.8         | 6.0     |
|        |           | 2**24 | 5,520       | 489          | 11.3    |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85473 Approved by: https://github.com/mruberry commit 7f607e8cb5c933fda87149e64e3a74f125d8adaf Author: PyTorch MergeBot Date: Wed Oct 5 17:02:33 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#85774) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85774 Approved by: https://github.com/pytorchbot, https://github.com/malfet commit 97d2e1df5565b7f3a5358178b8f3a2a039c7f976 Author: Nikita Shulga Date: Wed Oct 5 09:09:17 2022 -0700 [MPS] Fix GELU for `torch.half` (#86218) Also, make sure it raises catchable errors if invoked with integral types. Otherwise, it used to fail with the following fatal error when invoked for `torch.half`, and with similar signatures if invoked for integral types:
```
loc("mps_multiply"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/4883e71d-37bd-11ed-b0ef-b25c5e9b9057/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":228:0)): error: input types 'tensor<2xf16>' and 'tensor<1xf32>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
```
Modified `test_gelu_simple` to check both fwd and backward gradients for gelu commit 63d8d4f6ec5c973ad7b8669cd39ee9b550e5f55b Author: Will Constable Date: Wed Oct 5 14:46:55 2022 +0000 SymIntify cat and narrow (#86191) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86191 Approved by: https://github.com/ezyang commit 0e03dc5f1e00a9e021ec8f6e98d0c7df7af78d03 Author: Horace He Date: Wed Oct 5 11:08:05 2022 +0000 Remove softmax from recomputable ops (#86268) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86268 Approved by: https://github.com/ezyang commit c609768896ead0b4bb439a0b03e58360a5c00023 Author: lezcano Date: Wed Oct 5 10:13:17 2022 +0000 Add refs for torch.unfold and a decomposition for its backward.
(#85629) It's not clear to me what's the difference between `unfold` and `unfold_copy`, as this latter one is codegen'd I also took this chance to clean the implementation of unfold and its reference Pull Request resolved: https://github.com/pytorch/pytorch/pull/85629 Approved by: https://github.com/mruberry commit 67eb2d5952741f2024c826d008ed35b8a1cc56d9 Author: Andrew Gu Date: Tue Oct 4 20:37:51 2022 +0000 [FSDP] Dequeue one instead of flush (#86165) For the rate limiter, I initially implemented the approach of only dequeueing a single event, but there was concern about blocking the CPU _every_ iteration. The landed approach instead blocks every `_max_num_inflight_all_gathers` iterations and flushes the entire queue. However, upon further analysis, the approach of dequeueing a single event should be more performant with the same memory usage -- as the name suggests, both have `_max_num_inflight_all_gathers` concurrently inflight all-gathers. The cost of blocking the CPU thread is not important compared to the duration the CPU thread is actually blocked. This PR's approach reduces the latter quantity. **Fast Communication; Slow Computation** Screen Shot 2022-10-04 at 4 15 13 PM **Slow Communication; Fast Computation** Screen Shot 2022-10-04 at 4 34 15 PM **T5-11B** 2 nodes / 16 40 GB A100s with EFA and batch size 6: - [Old] 5.81 s / batch; 24 and 20 CUDA malloc retries on local rank 0s; 35.234 GB peak active; 38.806 GB peak reserved - [New] 5.10 s / batch; 25 and 29 CUDA malloc retries on local rank 0s; 35.234 GB peak active; 38.868 GB peak reserved 4 nodes / 32 40 GB A100s with EFA and batch size 7: - [Old] 5.21 s / batch; 0, 0, 0, 0 CUDA malloc retries on local rank 0s; 33.695 GB peak active; 38.494 GB peak reserved - [New] 4.93 s / batch; 1, 0, 0, 0 CUDA malloc retries on local rank 0s; 33.678 GB peak active; 38.792 GB peak reserved The new version changes the fragmentation in the allocator. It is possible that by blocking the CPU thread more in the old approach, the initial blocks used to serve the all-gather stream allocations are different compared to the new approach. Even though the number of CUDA malloc retries increases slightly, the net result is a speedup with the new approach. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86165 Approved by: https://github.com/zhaojuanmao commit 1c5ca724f42edcb669afa491ab67bcd3bcc9a70e Author: Thytu Date: Wed Oct 5 11:13:29 2022 +0000 PixelShuffle check that output is not null before applying kernel (#85155) (#86262) * Checks that output tensor is not null before applying kernel in `pixel_shuffle` op * Checks that output tensor is not null before applying kernel in `pixel_unshuffle` op * Add test case testing `pixel_shuffle` with shapes producing empty output * Add test case testing `pixel_unshuffle` with shapes producing empty output Fixes #85155 FYI @lezcano Pull Request resolved: https://github.com/pytorch/pytorch/pull/86262 Approved by: https://github.com/lezcano commit 9d6109c4b049df29fe6c4fc8288670f4104c3249 Author: Philip Meier Date: Wed Oct 5 10:33:26 2022 +0000 improve annotations (#86105) In `torchvision` we started to use tensor subclasses. 
With the current annotations, this minimal example throws three errors when checking with `mypy`: ```py from typing import Type, TypeVar, Any, Optional, Union import torch T = TypeVar("T", bound="TensorSubclass") class TensorSubclass(torch.Tensor): def __new__( cls: Type[T], data: Any, *, dtype: Optional[torch.dtype] = None, device: Optional[Union[torch.device, str, int]] = None, ) -> T: return torch.as_tensor(data, dtype=dtype, device=device).as_subclass(cls) ``` ``` main.py:16:16: error: Incompatible return value type (got "Tensor", expected "T") [return-value] main.py:16:58: error: Argument "device" to "as_tensor" has incompatible type "Union[device, str, int, None]"; expected "Optional[device]" [arg-type] main.py:16:78: error: Argument 1 to "as_subclass" of "_TensorBase" has incompatible type "Type[T]"; expected "Tensor" [arg-type] ``` I'll explain inline why the old annotations are wrong. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86105 Approved by: https://github.com/albanD commit 736adc08084814c13d57324f74adad091e304eb2 Author: Zachary DeVito Date: Tue Oct 4 21:50:27 2022 -0700 Memory snapshots from C++ (#86190) Sometimes the driving process want to save memory snapshots but isn't Python. Add a simple API to turn it on without python stack traces. It still saves to the same format for the vizualization and summary scripts, using the C++ Pickler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86190 Approved by: https://github.com/ezyang commit a348975e00081334ac96d855932d2753a62f1e77 Author: Jane Xu Date: Wed Oct 5 06:33:25 2022 +0000 Add opteinsum backend to give users control (#86219) This achieves the same things as https://github.com/pytorch/pytorch/pull/85908 but using backends instead of kwargs (which breaks torchscript unfortunately). This also does mean we let go of numpy compatibility BUT the wins here are that users can control what opt einsum they wanna do! The backend allows for..well you should just read the docs: ``` .. attribute:: torch.backends.opteinsum.enabled A :class:`bool` that controls whether opt_einsum is enabled (on by default). If so, torch.einsum will use opt_einsum (https://optimized-einsum.readthedocs.io/en/stable/path_finding.html) to calculate an optimal path of contraction for faster performance. .. attribute:: torch.backends.opteinsum.strategy A :class:`str` that specifies which strategies to try when `torch.backends.opteinsum.enabled` is True. By default, torch.einsum will try the "auto" strategy, but the "greedy" and "optimal" strategies are also supported. Note that the "optimal" strategy is factorial on the number of inputs as it tries all possible paths. See more details in opt_einsum's docs (https://optimized-einsum.readthedocs.io/en/stable/path_finding.html). ``` In trying (and failing) to land 85908, I discovered that jit script does NOT actually pull from python's version of einsum (because it cannot support variadic args nor kwargs). Thus I learned that jitted einsum does not subscribe to the new opt_einsum path calculation. Overall, this is fine since jit script is getting deprecated, but where is the best place to document this? 
- added tests to CI - locally tested that trying to set the strategy to something invalid will error properly - locally tested that tests will pass even if you don't have opt-einsum - locally tested that setting the strategy when opt-einsum is not there will also error properly Pull Request resolved: https://github.com/pytorch/pytorch/pull/86219 Approved by: https://github.com/soulitzer, https://github.com/malfet commit db13049b8844f3e7a15ebd902572639c19b21fc1 Author: Zachary DeVito Date: Tue Oct 4 20:03:44 2022 -0700 [allocator tracing] missing GIL acquire (#86254) Bug where the context destructor needs to hold the GIL to free the context. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86254 Approved by: https://github.com/ezyang commit d07b85393abd79d07ecfca7378b8f3c7342650a2 Author: Edward Z. Yang Date: Tue Oct 4 19:07:32 2022 -0700 SymInt fixes from symbolic-shapes branch (#86242) symintify a few inplace meta functions symintify resize_(), nbytes(), functionalization input mutations meta funcs for avg_pool2d_backward Pull Request resolved: https://github.com/pytorch/pytorch/pull/86242 Approved by: https://github.com/Chillee commit ac25c210e5452d360fcc8cf5ea96c85756e3e370 Author: David Berard Date: Tue Oct 4 23:42:20 2022 +0000 [jit][easy] remove deprecated escape sequence (#85987) Not sure why but this started throwing a lot of warnings while I was adding tests to test_freezing.py, so I'm removing the deprecated escape sequences to get rid of the warnings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85987 Approved by: https://github.com/eellison commit 2355b6256b9fcafc6e6f01301650538898ca7b8e Author: Nikita Shulga Date: Wed Oct 5 01:02:31 2022 +0000 Remove `std::cout` from `multinomial_out_mps` (#86246) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86246 Approved by: https://github.com/xuzhao9, https://github.com/seemethere commit 4f95f7ae9b664fec153ba2069f44b311238649d7 Author: Richard Zou Date: Tue Oct 4 10:48:59 2022 -0700 Remove unnecessary header (#86212) This appears to fix an internal build failure Pull Request resolved: https://github.com/pytorch/pytorch/pull/86212 Approved by: https://github.com/samdow commit 6d7235e3d391e10b24821ed97bd397fca19b8120 Author: albanD Date: Wed Oct 5 00:15:11 2022 +0000 enable cpu meta testing (#86226) Just add the relevant skips for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86226 Approved by: https://github.com/ezyang commit 1432b9978b9e3838a7940700fb54f89b63fc72e5 Author: lezcano Date: Tue Oct 4 21:14:45 2022 +0000 Add ref for cumsum (#86229) As noted in the comment, this decomposition may not be as efficient as specific implementations of it in different backends. Added here to then benchmark it. Note that this is needed by TorchInductor https://github.com/pytorch/torchdynamo/issues/883 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86229 Approved by: https://github.com/ngimel commit b317736c3990a4d42fe5be7de3944c1a2a6c2667 Author: Peter Bell Date: Tue Oct 4 21:14:00 2022 +0100 Fix default correction value in std/var decompositions (#85839) `torch.std` and `torch.var` default to the unbiased estimator, i.e. `correction=1`. This only works as is because the default on this overload is not exercised by the tests. 
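A quick numerical check of what that default means (illustrative only, not from the PR; assumes the `correction=` keyword overload available in recent releases): with the default `correction=1` the variance divides by N-1, with `correction=0` it divides by N.

```py
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
n = x.numel()
sq = (x - x.mean()).pow(2).sum()

# default (unbiased, correction=1): divides by n - 1
print(torch.var(x), sq / (n - 1))          # both ~1.6667
# correction=0 (biased): divides by n
print(torch.var(x, correction=0), sq / n)  # both 1.25
```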
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85839 Approved by: https://github.com/ezyang commit adb12438c127d1ded1c0ba027eb8621ca709e427 Author: Zafar Date: Tue Oct 4 22:44:13 2022 +0000 [AO] Cubic sparsity level scheduler (#85232) The scheduler updates the levels of sparsity based on https://arxiv.org/abs/1710.01878. The update rule is defined as: $$ \begin{aligned} s_t &= s_f + (s_i - s_f)\left( 1 - \frac{t - t_0}{n\Delta t} \right)^3 \\ \mbox{for} ~ t &\in \left\\{ t_0, t_0+\Delta t, \dots, t_0 + n\Delta t \right\\} \end{aligned} $$ There is a minor difference compared to the original paper. By providing `initially_zero` argument, one can set the level of sparsity before step $t_0$: If `False`, the sparsity level before $t_0$ is set to $s_i$, otherwise 0. ``` python test/test_ao_sparsity.py -- TestCubicScheduler ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85232 Approved by: https://github.com/junesg, https://github.com/jerryzh168 commit 248796987e45e565daa2515dffeb8c187650bf72 Author: Andrew Gu Date: Tue Oct 4 12:53:41 2022 +0000 [FSDP] Expose internal prefetch limits (#86198) This PR refactors the prefetching implementation to enable a module to prefetch more than one all-gather. - The motivation is for backward prefetching, but forward prefetching is included in the change as well. - The prefetching limit is a _limit_. In some edge cases (e.g. dynamic graph or first/last module), the limit may not be reached. - The prefetching limit is kept as internal in this PR -- it is set as local variables `backward_prefetch_limit` and `forward_prefetch_limit` in the `FullyShardedDataParallel` constructor and passed to the `_ExecOrderData()` constructor. - This PR additionally includes some clean up for forward prefetching but does not change any semantics assuming static graph. If we increase the `backward_prefetch_limit` to `2`, then a typical pattern may be that the first module in the pre-backward prefetches 2, but every next module only prefetches 1 since its first target was already prefetched by the previous. If we did not do this behavior, then with more modules, the prefetching would run further and further ahead. **`_handles_prefetched`** - This is used to avoid multiple modules prefetching the same handles keys. - `_handles_prefetched[handles_key]` is set to `True` when the prefetch for `handles_key` happens from the CPU thread (`_prefetch_handles()`). - `_handles_prefetched[handles_key]` is set to `False` when any handle in `handles_key` is resharded (`_reshard()`). - `_handles_prefetched` is cleared at the end of the backward (`_wait_for_post_backward()`). **`_needs_pre_backward_unshard`** - This is used to determine if a handles key should be backward prefetched at all. - `_needs_pre_backward_unshard[handles_key]` is set to `False` in the post-forward (`_register_pre_backward_hooks()`). - `_needs_pre_backward_unshard[handles_key]` is set to `True` in the post-forward if the forward outputs include tensors that require gradient (`_register_pre_backward_hook()`). - `_needs_pre_backward_unshard[handles_key]` is set to `False` in the pre-backward hook, after unsharding (`_pre_backward_hook()`). **`_needs_pre_forward_unshard`** - This is used to determine if a handles key should be forward prefetched at all. - `_needs_pre_forward_unshard[handles_key]` is set to `True` in the root's pre-forward (`_fsdp_root_pre_forward()`). - `_needs_pre_forward_unshard[handles_key]` is set to `False` in the pre-forward unshard (`_pre_forward_unshard()`). 
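As a standalone sketch of the cubic sparsity schedule quoted above, the update rule can be written as a small helper. The function below is hypothetical and only illustrates the formula; the PR itself implements this as a scheduler class in `torch.ao` with the `initially_zero` option described.

```python
def cubic_sparsity_level(t, t0, dt, n, s_i, s_f, initially_zero=False):
    """Sparsity s_t per the cubic rule quoted above (arXiv:1710.01878)."""
    if t < t0:
        # Before the schedule starts: either 0 or the initial sparsity level.
        return 0.0 if initially_zero else s_i
    if t >= t0 + n * dt:
        return s_f
    frac = (t - t0) / (n * dt)
    return s_f + (s_i - s_f) * (1.0 - frac) ** 3

# Ramps from s_i at t0 to s_f at t0 + n*dt.
print([round(cubic_sparsity_level(t, t0=0, dt=1, n=4, s_i=0.0, s_f=0.8), 3)
       for t in range(6)])
```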
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86198 Approved by: https://github.com/zhaojuanmao commit f20e4eab7b63a04f39e06ce1e535c45a85ae1672 Author: Jing Xu Date: Tue Oct 4 21:57:05 2022 +0000 Fix ITT unit-tests if PyTorch is compiled with `USE_ITT=OFF` (#86199) Fixes https://github.com/pytorch/pytorch/pull/84848#discussion_r986329680 @malfet @slgong-fb Pull Request resolved: https://github.com/pytorch/pytorch/pull/86199 Approved by: https://github.com/malfet commit d39e9c1e9087069fa774b0e3eb47e04750edca88 Author: Howard Huang Date: Mon Oct 3 16:45:22 2022 -0700 [6/N] [Dispatchable Collectives] Update recv with CPU / CUDA implementations (#83876) * - Updates for the recv collective https://github.com/pytorch/pytorch/issues/86225 Differential Revision: [D40044552](https://our.internmc.facebook.com/intern/diff/D40044552) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83876 Approved by: https://github.com/kwen2501 commit d447eff146118f42ef4146161a37aba7fc3ac069 Author: Jay Chae Date: Tue Oct 4 20:02:41 2022 +0000 [kineto] make ProfilerKineto the only option (#84714) Differential Revision: D39356665 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84714 Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi commit d724a9193560216234df13d2f55f71845daa72ba Author: Nirav Mehta Date: Tue Oct 4 19:43:54 2022 +0000 Adding Wunused-local-typedef build flag (#86154) In the past, we have seen PRs causing internal breakages caused by `-Wunused-local-typedef` flag which than had to be fixed. For example: [#79978](https://github.com/pytorch/pytorch/pull/79978) As part of this change, we want to catch this error in the PR Checks itself. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86154 Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/osalpekar commit 8da704cdb7f68bfa09516e7be17f004b98c48eb3 Author: Nikita Shulga Date: Tue Oct 4 19:01:48 2022 +0000 [MPS] Remove incorrect asserts from `Copy.mm` (#86184) Those asserts simply do not work for views. I.e. they are erroneously triggered for in `copy_to_mps_` when running something like `python -c "import torch;x=torch.empty(10,device='mps');y=torch.tensor([10]);print(x.shape);x[2]=y[0]"` And in `copy_from_mps_` when running the same script, but with order of devices inverted: `python -c "import torch;x=torch.empty(10);y=torch.tensor([10], device="mps");print(x.shape);x[2]=y[0]"` If this was supposed to be a boundary check, than it should have validated, that `storage_offset() + nbytes() <= storage.nbytes()`, but this check is already done by the upper layer, isn't it? Fixes https://github.com/pytorch/pytorch/issues/86153 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86184 Approved by: https://github.com/kulinseth commit 9da5646cdb378c37e222e176478eaabca585579d Author: Elias Ellison Date: Tue Oct 4 16:15:56 2022 +0000 Add device logic handling for functions which allow scalar inputs as tensors (#86149) Some functions allow scalars as tensor inputs. Add handling for them in device logic. Fix for https://github.com/pytorch/torchdynamo/issues/1445 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86149 Approved by: https://github.com/ezyang, https://github.com/bdhirsh commit d6b030856be34532e7bfeadf342c69dd9762fb13 Author: Khushi Date: Tue Oct 4 18:21:45 2022 +0000 [primTorch] special: j0, j1, spherical_j0 (#86049) Adds prims and refs for special functions (bessel_j0, bessel_j1, spherical_bessel_j0). Thanks! 
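For reference, the special functions that the prims/refs commit above targets can be exercised through their public ops (assuming a PyTorch build that ships them in `torch.special`, which is the case in releases from around this time onward):

```python
import torch

x = torch.linspace(0.1, 10.0, steps=5)

# Bessel functions of the first kind (orders 0 and 1) and the spherical
# Bessel function of order 0.
print(torch.special.bessel_j0(x))
print(torch.special.bessel_j1(x))
print(torch.special.spherical_bessel_j0(x))
```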
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86049 Approved by: https://github.com/mruberry commit 8bce2f3d22c454ed8000245d5f21c16ea9ac4b0d Author: Richard Zou Date: Mon Oct 3 13:31:11 2022 -0700 [easy] Add spaces to vmap over as_strided error message (#86150) Lack of spaces made it harder to read Pull Request resolved: https://github.com/pytorch/pytorch/pull/86150 Approved by: https://github.com/samdow commit e1859c0707a5624583f77476e1feed94e45f342a Author: Elias Ellison Date: Mon Oct 3 20:19:40 2022 +0000 delete special fake tensor new handling (#86144) Delete the special-cased handling of `new` in FakeTensor. Ever since the dispatch keys were updated to reflect the FakeTensor's device, the special cased handling was not needed. Fixes https://github.com/pytorch/torchdynamo/issues/1448 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86144 Approved by: https://github.com/ezyang commit 3f2e7d5c9a5569e4c2d4857d01697fc2bfbfe4fa Author: Howard Huang Date: Mon Oct 3 16:45:22 2022 -0700 [5/N] [Dispatchable Collectives] Update send with CPU / CUDA implementations (#83859) Differential Revision: [D40044550](https://our.internmc.facebook.com/intern/diff/D40044550) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83859 Approved by: https://github.com/kwen2501 commit a75edfa97c9d985a337fbea7b9c0f4061153aaf0 Author: Jing Xu Date: Tue Oct 4 08:20:13 2022 +0000 Move ITT testing to its own test case (#86174) Fixes https://github.com/pytorch/pytorch/pull/84848#discussion_r986329680 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86174 Approved by: https://github.com/malfet commit b95e0fcc2c40f08f43cc69edbb0168ab17facbda Author: Horace He Date: Tue Oct 4 04:25:19 2022 +0000 Forward fix land race (unexpected successes) (#86186) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86186 Approved by: https://github.com/ezyang commit 79dd621f76d8a1f9d780b0940c21665736b0b1d9 Author: Edward Z. Yang Date: Mon Oct 3 17:25:45 2022 -0700 Symbolic shapes mega merge PR (Oct 3) (#86160) - TensorGeometry supports symint - check_size supports symint - functorch batch rule improved symint - Some operator support for symint in LTC - More supported operations on SymInt and SymFloat - More symint support in backwards formulas This merge includes code contributions from bdhirsh and anjali411. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86160 Approved by: https://github.com/Chillee commit de75274883d15bfd0b70d5ebd1d3d03a6e4540a0 Author: Horace He Date: Mon Oct 3 22:40:49 2022 +0000 Symintified factory functions (#86067) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86067 Approved by: https://github.com/ezyang commit 82d9592f1baaf943b81bca13a51d655139f050aa Author: Horace He Date: Mon Oct 3 22:40:49 2022 +0000 Batch of symintifications to allow more models to pass in inference (#86104) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86104 Approved by: https://github.com/ezyang commit a4ff07f19754187d8c8aa722bab422a52152ba9c Author: Richard Zou Date: Mon Oct 3 13:31:11 2022 -0700 Stop modifying the global logger on `import functorch` (#86147) Fixes https://github.com/pytorch/pytorch/issues/85952 `logging.basicConfig` modifies the global logger which affects other programs. importing a package should generally be side-effect free so this PR gets rid of that call. 
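The side-effect-free logging pattern that the functorch fix above moves toward looks roughly like this. This is a generic library-side sketch, not functorch's actual code:

```python
import logging

# Library code: attach a module-level logger with a NullHandler and never call
# logging.basicConfig(); global configuration is left to the application.
logger = logging.getLogger(__name__)
logger.addHandler(logging.NullHandler())

def do_work() -> None:
    logger.debug("doing work")  # only visible if the application configured logging

if __name__ == "__main__":
    # The application opts in to a global config; merely importing the library
    # no longer changes logging behavior for everyone else.
    logging.basicConfig(level=logging.DEBUG)
    do_work()
```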
Test Plan: - tested locally Pull Request resolved: https://github.com/pytorch/pytorch/pull/86147 Approved by: https://github.com/ezyang commit fe190078aa78dad94297aece4d8322e5f4262558 Author: soulitzer Date: Mon Oct 3 16:18:09 2022 -0400 Require bias to be contiguous for depthwise3x3_winograd backend (#85711) Fixes https://github.com/pytorch/pytorch/issues/85694 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85711 Approved by: https://github.com/malfet, https://github.com/albanD commit bc1d884061dfb7bec0e1a442567ce9638959ad96 Author: George Qi Date: Mon Oct 3 19:04:52 2022 +0000 [maskedtensor] use masked_softmax for forward/backward instead of regular softmax (#85845) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85845 Approved by: https://github.com/cpuhrsch commit 0db9419e282c9cacaf9cd7bee3633d5d68219895 Author: Eli Uriegas Date: Mon Oct 3 11:51:14 2022 -0700 .github: Improve sanity check for generated files (#86143) Makes it so that generated files from .gitattributes do not affect the pr-sanity-check Tested using: (https://github.com/pytorch/pytorch/pull/86143) ``` ❯ BASE=d401732baadf2df666f242cd32db5df3b09dbec6 HEAD=eaf9aa24acf6a1fc68243935f4b33188a59bfdd2 bash .github/scripts/pr-sanity-check.sh INFO: Checking aginst the following stats + git diff --stat 6d06be89fe2b9ca30c3d97475dd192fc7e3f7357 eaf9aa24acf6a1fc68243935f4b33188a59bfdd2 + sed '$d' INFO: Showing non-generated files: + cat /tmp/tmp.mQeK24emtZ .github/scripts/test_trymerge.py | 2 +- .github/scripts/trymerge.py | 14 + INFO: PR SIZE is 16 ``` Signed-off-by: Eli Uriegas Pull Request resolved: https://github.com/pytorch/pytorch/pull/86143 Approved by: https://github.com/malfet, https://github.com/albanD commit 5ca0f9e1d4ee411b52785407b79e26a6dddfb391 Author: Nikita Shulga Date: Mon Oct 3 22:50:04 2022 +0000 [GHF] Make EasyCLA unskippable (#86161) And make small update to the revert test to use mocked rules rather than latest ones Pull Request resolved: https://github.com/pytorch/pytorch/pull/86161 Approved by: https://github.com/zpao, https://github.com/weiwangmeta commit f3d7ab5438ff8740b4dd0403525a3c1400786e8f Author: Edward Z. Yang Date: Mon Oct 3 16:43:22 2022 -0400 Unconditionally register Python decomps to Meta key in Python Dispatcher (#85750) This makes them available for Python Dispatcher to service them when symbolic shapes are involved. This is needed because under certain conditions, functionalization will directly call the Meta kernel for a function in order to produce a properly sized output wrapper tensor for a view operation. This direct call bypasses the normal decomposition table mechanism. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85750 Approved by: https://github.com/wconstab commit 06ddb1c07e3426d5d9c719c63f949359773e9c42 Author: Huy Do Date: Mon Oct 3 22:18:06 2022 +0000 Revert "Disable XLA test (#86123)" (#86151) And also remove torch_patches/.torch_pin to mitigate the sev https://github.com/pytorch/pytorch/issues/86093 until XLA fixes the weird logic in https://github.com/pytorch/xla/blob/master/scripts/apply_patches.sh#L17-L18. 
Ticket cut to XLA https://github.com/pytorch/xla/issues/4068 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86151 Approved by: https://github.com/kit1980 commit cda815dc232b23f433a20782acda2f5e161ed99b Author: Paul O’Shannessy Date: Mon Oct 3 22:13:54 2022 +0000 Switch to checking EasyCLA on merge (#86127) This is part of the work required to switch over to the new PyTorch Foundation CLA (#85559). Pull Request resolved: https://github.com/pytorch/pytorch/pull/86127 Approved by: https://github.com/malfet commit dfde7cf3e211b9a0456fc4a14df89b80d40f1816 Author: originates <105183376+originates@users.noreply.github.com> Date: Mon Oct 3 22:09:59 2022 +0000 ANTIALIAS updated to Resampling.LANCZOS in torch/utils/tensorboard/summary.py (#85679) **Line 492: ANTIALIAS updated to Resampling.LANCZOS** Removes the following Depreciation Warning: `DeprecationWarning: ANTIALIAS is deprecated and will be removed in Pillow 10 (2023-07-01). ` `Use Resampling.LANCZOS instead.` --- ``` try: ANTIALIAS = Image.Resampling.LANCZOS except AttributeError: ANTIALIAS = Image.ANTIALIAS image = image.resize((scaled_width, scaled_height), ANTIALIAS) ``` Now Resampling.LANCZOS will be used unless it gives an AttributeError exception in which case it will revert back to using Image.ANTIALIAS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85679 Approved by: https://github.com/albanD commit 2494c318c40d344025adf0ad4322471401dee24d Author: Mikayla Gawarecki Date: Mon Oct 3 18:13:17 2022 +0000 [easy] fix nested view call taking in more than one -1 (#86134) https://github.com/pytorch/pytorch/pull/85691 (allowing only one -1 in nested view/reshape) broke this. Was not caught by CI but internal tests are broken Pull Request resolved: https://github.com/pytorch/pytorch/pull/86134 Approved by: https://github.com/cpuhrsch commit 6a842e33c6b847cfedc68315b06b0645d51d9a28 Author: Kulin Seth Date: Mon Oct 3 21:05:30 2022 +0000 MPS: Add multinomial op (#80760) Add multinomial with replacement Pull Request resolved: https://github.com/pytorch/pytorch/pull/80760 Approved by: https://github.com/razarmehr, https://github.com/malfet commit 37013bb443c4cef95675300f371ff0263ed303ca Author: Horace He Date: Mon Oct 3 16:59:03 2022 +0000 Added _unsafe_view decomp (#86103) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86103 Approved by: https://github.com/ezyang commit 40a8cc28e78292dac55ac77fc5dc7fbda9428698 Author: Abhishek Pathak Date: Mon Oct 3 20:38:03 2022 +0000 [MPS] Cast dot inputs to int32 when needed (#86140) Fixes https://github.com/pytorch/pytorch/issues/85758 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86140 Approved by: https://github.com/kulinseth, https://github.com/malfet commit 954660a3083e5f3dcf014ae475b53fc181281be0 Author: Edward Z. Yang Date: Mon Oct 3 09:29:49 2022 -0700 Correctly error if you pass in tensors where size arguments expected (#86126) This also makes symintlist track intlist exception handling, which eellison fixed. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86126 Approved by: https://github.com/eellison commit 2aa9e0750acfabe99c59869b232102ab3cc62ae5 Author: Edward Z. Yang Date: Mon Oct 3 08:36:09 2022 -0700 Symintified all functions, not including factory functions (#86078) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86078 Approved by: https://github.com/Chillee, https://github.com/albanD commit cb87983cb8f4a26928f9852d96de63da6d4f363c Author: Edward Z. Yang Date: Mon Oct 3 08:36:09 2022 -0700 Decay integer-only (Optional)SymIntArrayRef to IntList in IValue (#86094) We have logic that says if you ask for a SymIntList from an IValue, but the IValue is actually an IntList, we will still give it to you in that case (check ivalue_to_arg in aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h). However, we also need the *inverse* version of this logic, which says that if you construct an IValue from a SymIntArrayRef, and it is actually integer only, we need to store it as an IntList, so that toIntList on the IValue will work. The way this works is a bit twisty, but our basic strategy is to disable construction of IValue from list container types that contain SymInt directly, and then directly implement variants of these constructors by hand, which iterate over the elements of the list and test if there are any SymInts or not to decide what type to construct the underlying List. These variants have to be templated, otherwise we will run afoul ambiguous overloads. I only did the overloads that actually occurred in practice; you may need to add more if you SymIntify more stuff. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86094 Approved by: https://github.com/anjali411, https://github.com/albanD commit 146db41eb95e3430f088c3045326616d9eec1874 Author: Edward Z. Yang Date: Mon Oct 3 07:22:21 2022 -0700 Preserve/strip OptionalSymIntArrayRef when finding real schema (#86114) Missed this one because I forgot you also have to update it. Thankfully the new Metal CI caught it. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86114 Approved by: https://github.com/anjali411 commit 1da74929d95e70bfd0e6e031f6b21a5b05513a63 Author: ruki Date: Mon Oct 3 20:00:53 2022 +0000 Fix compile error for vs2022 #79358 (#85958) Fixes #79358 - #79358 - https://github.com/xmake-io/xmake-repo/pull/1503#issuecomment-1263104439 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85958 Approved by: https://github.com/ngimel commit 36634d78da398787e54e4737c55e4b0a20894cb2 Author: Justin Chu Date: Mon Oct 3 17:14:22 2022 +0000 [ONNX] Remove registration in __init__ (#86130) Remove unused import in `__init__.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/86130 Approved by: https://github.com/BowenBao commit e01d616ba9f6d39726a2710d4431afe637074de4 Author: Richard Zou Date: Mon Oct 3 09:25:39 2022 -0700 Re-introduce the functorch docs build (#85838) (#86125) We deleted it when merging functorch into pytorch. This PR makes a new functorch docs build. The docs are relatively simple: - cd into `functorch/docs` and run `make html` to build the docs. - docs should get pushed to the pytorch/functorch repo's gh-pages branch. The long term plan is: - one day, the functorch APIs will just be torch.* APIs, at which point we can fold all of the functorch docs into the regular PyTorch docs - When that happens, the functorch examples and tutorials (that are on the functorch docs site) can be moved to the pytorch examples and pytorch tutorials. 
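The "decay" rule from the IValue/SymIntArrayRef commit above can be illustrated with a small Python analogue. The names here are hypothetical and the real logic lives in the C++ IValue constructors; this only shows the decision being described.

```python
from typing import List, NamedTuple, Tuple, Union

class FakeSymInt(NamedTuple):
    """Hypothetical stand-in for a symbolic integer (not torch.SymInt)."""
    name: str

def pack_sizes(values: List[Union[int, FakeSymInt]]) -> Tuple[str, list]:
    # If nothing in the list is symbolic, store it as a plain int list so that
    # "give me an IntList" accessors keep working; otherwise keep the symbolic form.
    if all(isinstance(v, int) for v in values):
        return "IntList", [int(v) for v in values]
    return "SymIntList", list(values)

print(pack_sizes([2, 3, 4]))                    # ('IntList', [2, 3, 4])
print(pack_sizes([2, FakeSymInt("s0"), 4])[0])  # 'SymIntList'
```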
Test Plan: - check docs preview - watch this PR after it goes in Differential Revision: [D40026222](https://our.internmc.facebook.com/intern/diff/D40026222) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86125 Approved by: https://github.com/atalman, https://github.com/malfet commit 75c0e3a471c19b883feca15fd4ecfabedf746691 Author: Ramin Azarmehr Date: Mon Oct 3 18:40:16 2022 +0000 [MPS] Improve memory usage and performance utilizing garbage collector and adaptive commit (#86119) - Improve memory usage and performance utilizing garbage collector and adaptive commit - Enable low watermark limit to detect memory pressure. - Enable garbage collection and adaptive commit strategies when under memory pressure. - More efficient resource management by splitting large heaps (instead of reusing oversized buffers for smaller allocation requests) - Introduce Extra Large heaps to improve performance by avoiding numerous costly allocation of smaller heaps - Fix purgeability when releasing the Metal heaps - Fix the race condition when deferring the heap's size update Fixes #79283 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86119 Approved by: https://github.com/kulinseth, https://github.com/malfet commit 8860e489949ca02926539ed11d01f307952d6017 Author: Abhishek Pathak Date: Mon Oct 3 18:12:48 2022 +0000 [MPS] Handle compatible inputs to where (#85946) Inputs with different number of dimensions but compatible shapes were being rejected e.g. x.shape = [10,1,10] y.shape = [10,10] cond.shape = [10,10,1] Pull Request resolved: https://github.com/pytorch/pytorch/pull/85946 Approved by: https://github.com/malfet commit 2f692236fe8cbaeda641ec8e837f3b6da8b4d754 Author: Nikita Shulga Date: Mon Oct 3 17:41:43 2022 +0000 [GHF] Add commit statuses to checkruns conclusions (#86129) Needed to surface CircleCI/EasyCLA checks to the `mergebot` rules Pull Request resolved: https://github.com/pytorch/pytorch/pull/86129 Approved by: https://github.com/huydhn commit cd6477617c24f3f9ce2d35a1d956d0f7d68110d9 Author: Driss Guessous Date: Mon Oct 3 17:36:36 2022 +0000 Custom sdp implementations dense (#85984) - This code creates the runtime dispatch system for choosing a performant fused SDP kernel. The only choice of fused kernel is flash_attention. It also creates python flags and a context manager that can be used to turn off and on behavior for dispatch. - This also adds support for flash_attention with dense tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85984 Approved by: https://github.com/cpuhrsch commit 8d9472d7d402983c696836630c1034d56dfb3d87 Author: vfdev Date: Mon Oct 3 17:35:44 2022 +0000 [skip-ci] Fixed bad link in build_ci_governance.rst (#85933) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85933 Approved by: https://github.com/albanD commit 85d520d448dd9fcaccd324029c3f4e4462913133 Author: Masaki Kozuki Date: Mon Oct 3 17:32:07 2022 +0000 [docs] Add `torch.channels_last_3d (#85888) As per title, updating https://pytorch.org/docs/master/tensor_attributes.html#torch-memory-format. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85888 Approved by: https://github.com/ngimel commit 2067b768fc8ffa181d6e9dc9d62e1696e9cf4ef8 Author: Chien-Chin Huang Date: Fri Sep 30 15:20:20 2022 -0700 [FSDP] Delay moving tensor to CPU until necessary for optim_state_dict() (#85761) Optimizer state_dict currently move tensors to CPU() immediately after allgather(). However, for sharded optimizer state_dict, this moving is duplicated. 
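The shapes from the MPS `where` commit above are broadcast-compatible despite having different numbers of dimensions; the snippet below shows the expected result shape (it runs on CPU too, and on an MPS device it exercises the fixed path):

```python
import torch

# Shapes taken from the commit message: different ranks but broadcastable.
cond = torch.rand(10, 10, 1) > 0.5
x = torch.randn(10, 1, 10)
y = torch.randn(10, 10)

out = torch.where(cond, x, y)
print(out.shape)  # torch.Size([10, 10, 10])
```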
We should wait until all the sharding are done. This PR may slightly reduce the performance of full optimizer state_dict as it has to allocate more memory than w/o this PR. But the benchmark shows the memory allocation is pretty light. Differential Revision: [D39855912](https://our.internmc.facebook.com/intern/diff/D39855912/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39855912/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/85761 Approved by: https://github.com/rohan-varma commit e23cede0aa8986c103c87a61c3f97a4203218a0f Author: PyTorch MergeBot Date: Mon Oct 3 17:22:31 2022 +0000 Revert "Require bias to be contiguous for depthwise3x3_winograd backend (#85711)" This reverts commit 9a126702ce5a73d3409be8bb7cd04a9fbd7d162a. Reverted https://github.com/pytorch/pytorch/pull/85711 on behalf of https://github.com/huydhn due to This breaks functorch/test_vmap with some unexpected successes https://hud.pytorch.org/pytorch/pytorch/commit/9a126702ce5a73d3409be8bb7cd04a9fbd7d162a commit c670bad72ff2af7eb75dfa3a924754c7fd5a2370 Author: Jesus Magana Date: Mon Oct 3 17:22:04 2022 +0000 Update dist.scatter() documentation (#86069) Update documentation for dist. scatter Fixes #84566 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86069 Approved by: https://github.com/rohan-varma, https://github.com/H-Huang commit 2403d0c25829f6d74b8246dcdc6fce8f4aff1106 Author: Cuiqing Li Date: Mon Oct 3 17:20:58 2022 +0000 implementation of qmul using xnnpack (#86040) Summary: implementation of qmul using xnnpack Test Plan: buck run caffe2/test:quantization -- quantization.core.test_quantized_op.TestQNNPackOps Reviewed By: digantdesai, kirklandsign Differential Revision: D39701867 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86040 Approved by: https://github.com/digantdesai commit 7941b042a73266e786d7367ad26a5bc4760b4fe1 Author: Catherine Lee Date: Mon Oct 3 16:59:39 2022 +0000 parallelize at file granularity (#85770) part two of https://github.com/pytorch/pytorch/pull/84961 tests files in parallel at the test file granularity * 2 procs at a time * number of tests ran changed by <200, possibly due to adding more tests on master between the base commit and head commit of the PR * may cause flakiness, but I haven't seen it in my small sample size of this PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/85770 Approved by: https://github.com/huydhn commit d401732baadf2df666f242cd32db5df3b09dbec6 Author: Codrin Popa Date: Mon Oct 3 16:56:22 2022 +0000 Added roundup_bypass_threshold_mb knobs to the PyTorch Caching Allocator (#85940) Summary: Added an additional roundup knob( ``roundup_bypass_threshold_mb``) to bypass rounding the requested allocation size, for allocation requests larger than the threshold value (in MB). This can help reduce the memory footprint when making large allocations that are expected to be persistent or have a large lifetime. 
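For context on the `dist.scatter()` documentation update above, a minimal usage sketch. It assumes a process group has already been initialized with `world_size` ranks; only the source rank passes `scatter_list`.

```python
import torch
import torch.distributed as dist

def scatter_example(rank: int, world_size: int) -> torch.Tensor:
    """Each rank i ends up with a tensor filled with the value i."""
    out = torch.empty(4)
    if rank == 0:
        chunks = [torch.full((4,), float(i)) for i in range(world_size)]
        dist.scatter(out, scatter_list=chunks, src=0)
    else:
        dist.scatter(out, src=0)  # non-source ranks only receive
    return out
```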
Differential Revision: D39868104 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85940 Approved by: https://github.com/zdevito commit bc993e39cc3c2c37e58a88ae3071b6f5e73ef8fc Author: Horace He Date: Mon Oct 3 07:11:53 2022 +0000 Unwrap SymInt => Proxy when being returned from the wrapped function make_fx traces (#86098) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86098 Approved by: https://github.com/ezyang commit 470f8fb9e55f2083f1053ed75e27768cdaf2747b Author: Richard Zou Date: Fri Sep 30 09:30:22 2022 -0700 Fix functorch/test/test_control_flow (#85981) The tests weren't being run in PyTorch CI. On deeper investigation, it looks like the test file doesn't work under the unittest test runner (it works under pytest though). This PR enables running these tests under unittest and also marks things that now fail as expected failure. We should fix these at some point. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85981 Approved by: https://github.com/samdow, https://github.com/voznesenskym commit a262ccea58946cd9efb5e7d4a38032b40996a237 Author: Richard Zou Date: Fri Sep 30 13:19:46 2022 -0700 Change torch.autograd.graph.disable_saved_tensors_hooks to be public API (#85994) Also addresses some comments from the review in https://github.com/pytorch/pytorch/pull/85971 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85994 Approved by: https://github.com/albanD, https://github.com/soulitzer commit 6d06be89fe2b9ca30c3d97475dd192fc7e3f7357 Author: Huy Do Date: Mon Oct 3 16:19:12 2022 +0000 Disable XLA test (#86123) This is related to https://github.com/pytorch/pytorch/issues/86093 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86123 Approved by: https://github.com/ZainRizvi commit 5fa840103b4eac16a3fc87bb26ebf701fbd1666c Author: PyTorch MergeBot Date: Mon Oct 3 16:08:18 2022 +0000 Revert "Re-introduce the functorch docs build (#85838)" This reverts commit 0449cf0c9e469f052bb9316b13260d126d6f01d4. Reverted https://github.com/pytorch/pytorch/pull/85838 on behalf of https://github.com/atalman due to Break internal build commit 9a126702ce5a73d3409be8bb7cd04a9fbd7d162a Author: soulitzer Date: Fri Sep 30 23:50:00 2022 -0400 Require bias to be contiguous for depthwise3x3_winograd backend (#85711) Fixes https://github.com/pytorch/pytorch/issues/85694 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85711 Approved by: https://github.com/malfet, https://github.com/albanD commit d253d6ec0c1c086b9d3be98b421d224ff20b734e Author: Nikita Shulga Date: Mon Oct 3 15:04:33 2022 +0000 [Metal][BE] Fix signed/unsigned compare (#86068) To enable Metal builds in OSS Guard `[self dealloc]` call in `MPSImageWrapper.mm` with `#if !__has_feature(objc_arc)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/86068 Approved by: https://github.com/ezyang commit 4a528bc16fa52ea94384861bea84cfb61d9a645c Author: Vasiliy Kuznetsov Date: Fri Sep 30 16:41:34 2022 -0700 remove vkuzo from CODEOWNERS for AO (#86038) Summary: I was added to various places in https://github.com/pytorch/pytorch/pull/79505, this is too noisy to be useful so taking myself off. Always happy to help when folks tag me manually. 
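A small sketch of the now-public `disable_saved_tensors_hooks` API mentioned above, assuming the behavior its docs describe: installing saved-tensor hooks inside the region raises with the supplied message, while ordinary autograd is unaffected.

```python
import torch

x = torch.randn(3, requires_grad=True)

with torch.autograd.graph.disable_saved_tensors_hooks("saved-tensor hooks are disallowed here"):
    y = (x * x).sum()          # normal autograd still works
    try:
        with torch.autograd.graph.saved_tensors_hooks(lambda t: t, lambda t: t):
            pass
    except RuntimeError as err:
        print("blocked:", err)  # raised with the message given above

y.backward()
print(x.grad)
```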
Test plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/86038 Approved by: https://github.com/HDCharles commit 68a6113248ac25841b524d59f9dc0f298b389ba2 Author: Ivan Yashchuk Date: Mon Oct 3 15:03:08 2022 +0000 Add nvFuser support for torch.native_batch_norm (#85562) This PR adds nvFuser's implementation for batch_norm as there's no reference yet (https://github.com/pytorch/pytorch/pull/81191) and no in-place copy support (https://github.com/pytorch/pytorch/pull/84545). Pull Request resolved: https://github.com/pytorch/pytorch/pull/85562 Approved by: https://github.com/kevinstephano, https://github.com/ngimel commit d28a882319d92bae17827101787adf838a05df0a Author: Justin Chu Date: Mon Oct 3 14:34:27 2022 +0000 [ONNX] Remove excessive deprecation messages (#86065) The deprecation messages in SymbolicContext will be emitted every time it is initialized. Since we already emit deprecation messages at registration time, the deprecation decorator can be removed in `__init__` to reduce noise. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86065 Approved by: https://github.com/BowenBao commit 6cd9c447daa083478c4272674f0e80ff4e0c6a5a Author: Edward Z. Yang Date: Sun Oct 2 21:15:03 2022 -0700 Make test_api compile on DEBUG mode with some compiler versions (#86092) The symbol seems to conflict under some compiler versions, giving an error like "relocation refers to global symbol which is defined in a discarded section". Simple enough to put it in an anonymous namespace, so why not. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86092 Approved by: https://github.com/Chillee commit 368e8e7520f95bec7a82653beccd9779522f854d Author: Edward Z. Yang Date: Sun Oct 2 21:14:59 2022 -0700 Skip, don't xfail, nondeterministic as_strided_scatter test (#86091) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86091 Approved by: https://github.com/Chillee commit 1f157099fa359a4e504cd19c7fb5019858a5d36c Author: Edward Z. Yang Date: Sun Oct 2 18:00:15 2022 -0700 Teach remove_symint to handle OptionalSymIntArrayRef (#86088) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86088 Approved by: https://github.com/Chillee, https://github.com/anjali411 commit bd32f9a833c911a7acaddcafba48806c1b94f6d0 Author: Edward Z. Yang Date: Sun Oct 2 17:12:40 2022 -0700 Correct ownership of OptionalSymIntArrayRef in backwards (#86087) Also add some cheap but cheerful sanity checks to help detect similar situations in the future. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86087 Approved by: https://github.com/albanD commit 6fd5d6397a58860ba60178a22574f9701eab061a Author: vfdev Date: Mon Oct 3 10:57:08 2022 +0000 [Docs] Updated torchvision people (#85931) cc @datumbox @pmeier Pull Request resolved: https://github.com/pytorch/pytorch/pull/85931 Approved by: https://github.com/fmassa, https://github.com/datumbox commit 5322f00151cace0aaec2701d1ab75c648b86d592 Author: John Detloff Date: Mon Oct 3 06:43:06 2022 +0000 Re-add benchmarking files to ios TestApp (#85539) Fixes #76033 The benchmarking code in the iOS TestApp was removed a while back as dead code: https://github.com/pytorch/pytorch/pull/64849 I believe this was done in error - as this leaves our TestApp empty, nothing occurs when it runs. And we still have a tutorial up demonstrating how to use the benchmarking feature of the TestApp. 
This diff restores the files that were deleted, with some minor tweaks for compatibility with changes that have happened since they were deleted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85539 Approved by: https://github.com/kimishpatel commit 2b5625a726372f7a6e1fcfd60e687e57b329a7f6 Author: Rohan Varma Date: Mon Oct 3 06:15:20 2022 +0000 Update hierarchical_model_averager.py (#85648) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/85648 Approved by: https://github.com/wayi1, https://github.com/H-Huang commit 6a1e3f2f3720fd92514b385f8177edf669082961 Author: Nikita Shulga Date: Mon Oct 3 05:51:22 2022 +0000 Update fbgemm submodule (#86054) Reland of https://github.com/pytorch/pytorch/commit/481def752cc001ff8ac7e3b723ece11aa1110c77 Fixes https://github.com/pytorch/pytorch/issues/85956 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86054 Approved by: https://github.com/xuzhao9 commit acd2f21ea130ad74bc68ed938044dfb20ff4c205 Author: Taylor Robie Date: Sun Oct 2 16:07:52 2022 -0700 [Profiler] Update python binding type annotations (#85722) The annotations for `torch._C._profiler` have gotten a bit stale. This PR simply brings them up to date. There is one small quality of life change that alters behavior: instead of returning device type and index separately we return a `torch.device` object. Differential Revision: [D39852803](https://our.internmc.facebook.com/intern/diff/D39852803/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85722 Approved by: https://github.com/chaekit commit 5ed338a55b8d320a851d9461edddb92a6d8b8b90 Author: Taylor Robie Date: Sun Oct 2 16:07:51 2022 -0700 [Profiler] Add dtype to `_TensorMetadata` (#85721) `Inputs.dtypes_` stringifies the dtypes; however this loses information which is hard to recover and useful for analysis. So this PR adds full `torch.dtype` info for Tensors. Differential Revision: [D39852802](https://our.internmc.facebook.com/intern/diff/D39852802/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85721 Approved by: https://github.com/chaekit commit ba95984588f100d781c1700218c2f7cd77cf380a Author: Taylor Robie Date: Sun Oct 2 16:07:49 2022 -0700 [Profiler] Make `name` a property. (#85720) This is just a quality of life change. `.name` is 30% fewer characters than `.name()`. I should have done this from the start. Differential Revision: [D39788873](https://our.internmc.facebook.com/intern/diff/D39788873/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85720 Approved by: https://github.com/chaekit commit dcac4dd58edefb6951a60266e53d8767dc9be002 Author: Jianyu Huang Date: Mon Oct 3 03:29:08 2022 +0000 Add int32_t range check in packed_accessor32 in PyTorch TensorBase (#86085) Summary: As ajtulloch suggested, we can make tensor.packed_accessor32<...>() raise an exception if tensor.numel() > std::numeric_limits::max(). Trade-off: run-time check overhead (one-time) when doing `packed_accessor32` accessor. Differential Revision: D39996275 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86085 Approved by: https://github.com/ngimel commit aabf3e234b532d76b05cb76d837638905e68bb77 Author: Edward Z. Yang Date: Sun Oct 2 17:12:37 2022 -0700 Allow functionalize_aten_op to work with non-SymInt signature. (#86080) This is done similarly to how we did CPU fallback template. Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86080 Approved by: https://github.com/wconstab commit 21e00d5accd3ddd8a138c2e4a805a7a38bfc8847 Author: Edward Z. Yang Date: Sun Oct 2 13:24:47 2022 -0700 Fix type of as_float_unchecked (#86075) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86075 Approved by: https://github.com/wconstab commit 8753703b6804796007b5974ad2bca6e14a7a61c1 Author: Edward Z. Yang Date: Sun Oct 2 12:50:17 2022 -0700 Fix some bugs in SymFloat IValue and toPyObject handling (#86072) - Test for symbolic cases first before non-symbolic, as symbolic ints/floats advertise as being ints/floats - Add missing case for toPyObject Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86072 Approved by: https://github.com/wconstab commit a66506b136766fb75c818283e48697166d1e7cbe Author: Edward Z. Yang Date: Sun Oct 2 16:08:03 2022 -0400 Revert "Revert "Build and run Metal tests in CI (#86062)"" (#86073) This reverts commit 195184e69cda79678590c759719b1dc1d7ef6d09. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86073 Approved by: https://github.com/malfet commit 07ce0b435b5c4197836b9f08342e566a46c55961 Author: lezcano Date: Sun Oct 2 21:59:42 2022 +0000 Remove backward for im2col and col2im (#85542) `im2col` is a linear map, and `col2im` is its adjoint. As such, the adjoint to `col2im` is `im2col` (the adjoint of the adjoint is the original function. There's no point having explicit derivatives in ATen for these functions, so this PR deletes all these. Furthermore, along the way, we fix an error for the derivative of im2col for non-batched inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85542 Approved by: https://github.com/soulitzer, https://github.com/ngimel commit 99ca25e6eb8299f31824bdbaf62f16f8a8db458d Author: Michael Fisher <86859628+MFisherBE@users.noreply.github.com> Date: Sun Oct 2 22:55:34 2022 +0000 Misspelling Correction PR common_methods_invocations.py (#86081) Noticed a misspelling while looking at Issue #85712. This fix just fixes the mispelling on line #3107. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86081 Approved by: https://github.com/ngimel commit e6dd2965af330d4aaad49de4551ee87df3007ee8 Author: Horace He Date: Sun Oct 2 17:42:36 2022 +0000 A bunch of coverage improvements (re for models in inference snext50, BERT_pytorch, mobilenet_v3_large, pytorch_CycleGAN_and_pix2pix, dcgan, resnet18, mnasnet1_0) (#86050) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86050 Approved by: https://github.com/ezyang commit b8bf60445938e988c020478ebf0c98ec19d24416 Author: Horace He Date: Sun Oct 2 16:50:09 2022 +0000 Ported linear to symints (#86021) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86021 Approved by: https://github.com/ezyang commit b9b24c31fda46d8403a28403898f129127e3f35e Author: Nikita Shulga Date: Sun Oct 2 20:13:05 2022 +0000 [MPS] Fix non-contig to contig tensor copy (#86056) This handles a rare case when MPS tensor is constructed from non-contiguous CPU tensor. Fixes https://github.com/pytorch/pytorch/issues/85967 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86056 Approved by: https://github.com/janeyx99 commit 007e12a3e956d5a4362415664a252d9090ab57ac Author: Peter Bell Date: Sun Oct 2 11:29:07 2022 +0100 OpInfo: Extend natural syntax to allow adding metadata (#85890) Splitting into a seperate PR in case of bike shedding. 
We can't use the normal fluent syntax `SampleInput(x).name("foo")` because `.name` is already how the metadata is accessed. So instead, this adds a single function where you pass keyword arguments to fill in the metadata, e.g. ``` SampleInput(x).with_metadata( name="foo", output_process_fn_grad=out_fn) ``` An alternative closer to the normal fluent style would be to adding a prefix to the property's name, e.g. ``` (SampleInput(x) .with_name("foo") .with_output_process_fn_grad(out_fn)) ``` However, I have a slight preference for the `with_metadata` style because you don't need to add extra parenthesis to break lines. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85890 Approved by: https://github.com/mruberry commit ed5f95048e1151944786ab6437fca63df0800051 Author: Peter Bell Date: Sun Oct 2 11:29:07 2022 +0100 OpInfo: Add natural syntax for SampleInput creation (#85723) Most SampleInput objects currently have no additional metadata, meaning they have a 1:1 mapping with a normal function call. This adds var arg forms of the `SampleInput` constructor such that you can just call the `SampleInput` constructor as you would call the operator. So, for example ```python SampleInput(make_arg(shape), args=(2, 3), kwargs=dict(alpha=4)) ``` becomes ```python SampleInput(make_arg(shape), 2, 3, alpha=4) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85723 Approved by: https://github.com/mruberry commit 195184e69cda79678590c759719b1dc1d7ef6d09 Author: PyTorch MergeBot Date: Sun Oct 2 19:08:30 2022 +0000 Revert "Build and run Metal tests in CI (#86062)" This reverts commit f88bf8de2cba75377baf469b3dd3f8bc415ee7d2. Reverted https://github.com/pytorch/pytorch/pull/86062 on behalf of https://github.com/huydhn due to Breaking trunk https://hud.pytorch.org/pytorch/pytorch/commit/f88bf8de2cba75377baf469b3dd3f8bc415ee7d2 commit 36380897553063c7b433f738671ff23c5ad58ced Author: Edward Z. Yang Date: Sun Oct 2 06:43:50 2022 -0700 Ported reshape to symints and added a shim for BC (#85998) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85998 Approved by: https://github.com/ezyang commit f88bf8de2cba75377baf469b3dd3f8bc415ee7d2 Author: Edward Z. Yang Date: Sun Oct 2 11:33:20 2022 -0400 Build and run Metal tests in CI (#86062) Fixes https://github.com/pytorch/pytorch/issues/84172 Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86062 Approved by: https://github.com/kimishpatel, https://github.com/malfet commit cd5ac15d5d6273ccefc6d84d79a6daf6d612ab1d Author: Edward Z. Yang Date: Sat Oct 1 22:47:08 2022 -0400 Fix internal/external desync for Metal hotfix (#86061) For some reason, the fbcode to GitHub sync landed the wrong version of the PR. This corrects the synchronization problem, and actually makes the Metal backend work. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86061 Approved by: https://github.com/malfet commit b26eafec079a18bc331f569a7e35497129feed71 Author: Kulin Seth Date: Sun Oct 2 15:27:52 2022 +0000 [MPS] Specialized memory pool for scalar values. 
(#85817) - Add buffer usage and debug verbosity flags to MPSAllocator - Add high_watermark_ration to limit the memory allocation Pull Request resolved: https://github.com/pytorch/pytorch/pull/85817 Approved by: https://github.com/razarmehr commit 481def752cc001ff8ac7e3b723ece11aa1110c77 Author: Nikita Shulga Date: Sun Oct 2 15:05:34 2022 +0000 Update fbgemm submodule (#86054) Fixes https://github.com/pytorch/pytorch/issues/85956 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86054 Approved by: https://github.com/xuzhao9 commit f183a989a21473cae84bc23e5e3cbbf8a087b8c0 Author: Elias Ellison Date: Thu Sep 29 21:20:53 2022 +0000 Fix fake tensor kernel nesting (#85920) If you e.g. printed within a decomp which would call `in_kernel_invocation_manager`, on the exit from the manager it would unilaterally remove meta from the tls / set the tensor to return its real device. We should just restore what the existing state was. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85920 Approved by: https://github.com/ezyang, https://github.com/bdhirsh, https://github.com/huydhn commit 365498f673681a09ee67b54493a664ea646b036a Author: Edward Z. Yang Date: Sat Oct 1 10:08:53 2022 -0700 Add rmod support to SymIntNode (#86053) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86053 Approved by: https://github.com/wconstab commit c857b3e73ec707a08d44bd2d01ab03e61ee44380 Author: Edward Z. Yang Date: Sat Oct 1 06:53:57 2022 -0700 More fixes for LTC symint interlock. (#86043) Now, we also avoid translating SymInt to valueT if you haven't asked for a SymInt implementation. This makes embedding_dense_backward work without changes to LTC. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86043 Approved by: https://github.com/wconstab commit 0060d871df2710a98211db3683bd48b1b648e9e0 Author: Edward Z. Yang Date: Sat Oct 1 06:53:57 2022 -0700 Add a bunch of extra functionality to SymFloat (#86046) - SymInt to SymFloat conversion - All the basic arithmetic operators on c10::SymFloat Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86046 Approved by: https://github.com/wconstab commit 833edeb020084377dcff27e8ef8b0a2af115fb27 Author: Will Constable Date: Sun Oct 2 00:00:46 2022 +0000 Register py metas to py dispatcher so they are used by functionalization (#86057) - this ensures python metas are always used during symbolic tracing/functionalization without overshadowing c++ metas during eager runtime Pull Request resolved: https://github.com/pytorch/pytorch/pull/86057 Approved by: https://github.com/ezyang commit b562987c28b37009d2d95d9506b67e3c16fab83e Author: PyTorch MergeBot Date: Sat Oct 1 19:30:21 2022 +0000 Revert "Fix fake tensor kernel nesting (#85920)" This reverts commit c2d9ea7f4b54c7d4332bc457fd76238c61f129de. 
Reverted https://github.com/pytorch/pytorch/pull/85920 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but I suspect that it causes a flaky memory leak issue in TestFakeTensorCUDA.test_fake_crossref_backward_amp_linalg_lstsq_cuda_float32 commit fe89cd6c57477dc265895f946ff89d5cae047d0f Author: Nikita Shulga Date: Sat Oct 1 17:21:31 2022 +0000 [BE] Use reusable workflows from test-infra (#86035) Instead of local copies, use workflows checked into test-infra by https://github.com/pytorch/test-infra/pull/783 Thought about deleting the actions later, but if I understand how GHA merges work, older PRs merged onto this changes should not cause any problems as it will immediately reference actions from test-infra Pull Request resolved: https://github.com/pytorch/pytorch/pull/86035 Approved by: https://github.com/kit1980 commit 92c2295ab4b5ccdedcc32227c1125a4daf9e2759 Author: Edward Z. Yang Date: Sat Oct 1 06:53:57 2022 -0700 Remove dead ts_native_functions.yaml entries (#86045) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86045 Approved by: https://github.com/Chillee commit 2f703c5956f3c861c80d5ac736ff2aeba6dfb476 Author: Edward Z. Yang Date: Sat Oct 1 06:53:56 2022 -0700 SymInt-ify TypeAndSize (#86044) Commit originally by anjali411, with bugfix from Edward. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86044 Approved by: https://github.com/Chillee commit 07800c9c815abc1b478f0292e376d7c27e94b053 Author: Edward Z. Yang Date: Sat Oct 1 06:53:56 2022 -0700 Miscellaneous fixes from symbolic-shapes branch (#86042) - Make toIValue accept SymIntNode and SymFloatNode where number (aka Scalar) is expected - Binding for symintlistOptional in python arg parser - Teach translate to convert from IntArrayRef to ArrayRef - Don't query _symint function for meta info in LTC unless LTC is code generating a symint function Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/86042 Approved by: https://github.com/Chillee commit d9273e8b6b42dec1cd5b52779075912bee854130 Author: kshitij12345 Date: Sat Oct 1 06:32:19 2022 +0000 [functorch] refactor: get_exhaustive_batched_inputs (#85965) `get_exhaustive_batched_inputs_batch_norm_is_training` and `get_exhaustive_batched_inputs` are same except for a couple of lines. We move the above functionality into `generate_vmap_inputs` (which is now only function to create batched inputs) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85965 Approved by: https://github.com/zou3519 commit a5a2f576a768f01b14d2742e8fd7a478a2ab01d3 Author: PyTorch MergeBot Date: Sat Oct 1 02:49:06 2022 +0000 [vision hash update] update the pinned vision hash (#85776) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85776 Approved by: https://github.com/pytorchbot commit 05d1128106e50075b0fd7d667680214ace34306c Author: Ke Wen Date: Sat Oct 1 00:59:39 2022 +0000 [c10d] Start deprecating *_multigpu APIs (#85961) - For most users, training is on one GPU per process, so these APIs are rarely used - They added one more API dimension - They can be expressed in a composed manner - They are not abstracted; they are specific to GPU - They caused backend APIs and implementations to have nested `std::vector`s, which are hard to read or maintain Pull Request resolved: https://github.com/pytorch/pytorch/pull/85961 Approved by: https://github.com/XilunWu, https://github.com/H-Huang commit 463283e016ffa7d8a0da35a1d28c8b8ab0db2ea7 Author: Ke Wen Date: Sat Oct 1 00:55:27 2022 +0000 [c10d] Start deprecating *_coalesced APIs (#85959) - We consider that general users need not use the `*_coalesced` APIs unless there is an extreme concern about performance. - We are investigating using a context manager named `coalescing_manager` which wraps multiple individual collectives to compose the coalescing hint, rather than giving each collective a *_coalesced variant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85959 Approved by: https://github.com/XilunWu, https://github.com/H-Huang commit bf667c63e7c76cb7bfb6ef8cb8d844d6c301937b Author: Ramin Azarmehr Date: Sat Oct 1 00:33:23 2022 +0000 Fix the error with constant_pad_nd for 4D+ padding (#85991) - We warn the user and fall back to the default implementation for 4D+ constant padding Fixes #84535 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85991 Approved by: https://github.com/kulinseth commit be29ca97169e2621acf67e87020f461da3032129 Author: Chien-Chin Huang Date: Fri Sep 30 11:15:15 2022 -0700 [FSDP] Ignore buffers that are non-persistent. (#85740) A buffer can be registered as non-persistent. A non-persistent buffer won't be in the state_dict. Differential Revision: [D39858689](https://our.internmc.facebook.com/intern/diff/D39858689/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85740 Approved by: https://github.com/awgu, https://github.com/rohan-varma commit db4c6fe54fd043bb249657be4054252ca5f78b36 Author: PyTorch MergeBot Date: Fri Sep 30 23:54:49 2022 +0000 Revert "[maskedtensor] use masked_softmax for forward/backward instead of regular softmax (#85845)" This reverts commit a4d10342e98b0abb3286a3780617afe108328ac7.
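The non-persistent buffer behavior that the FSDP change above now respects is standard `nn.Module` behavior and can be seen in isolation:

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("running_stat", torch.zeros(4))               # saved in state_dict
        self.register_buffer("scratch", torch.zeros(4), persistent=False)  # skipped by state_dict

print(list(M().state_dict().keys()))  # ['running_stat'] -- 'scratch' is ignored
```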
Reverted https://github.com/pytorch/pytorch/pull/85845 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks CUDA test_softmax_cuda (main.TestBasicsCUDA) commit 9bf9db57be42a9d2ba77e3042578ac439848aec1 Author: Horace He Date: Fri Sep 30 20:19:39 2022 +0000 Refactored recomputable ops a bit and added a bunch more ops (#85993) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85993 Approved by: https://github.com/ngimel commit e09a84a184e1687f4ddc7f3fc875eaaf5b9ec74f Author: Horace He Date: Fri Sep 30 20:18:43 2022 +0000 Removed debug output that doesn't work with faketensors (#85992) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85992 Approved by: https://github.com/ngimel commit 4b86a9359ae1cc0dd9e9b0480eee72850c7565b6 Author: Xia, Weiwen Date: Fri Sep 30 23:44:45 2022 +0000 [Quant] Make x86 backend default when querying qconfig (#85461) This PR is a follow-up of #84329 [[Quant] Add unified x86 quant backend](https://github.com/pytorch/pytorch/pull/84329) It makes the `x86` backend the default when querying `qconfig`. Users get x86's qconfig/qconfig_mappings if the backend is not specified. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85461 Approved by: https://github.com/jgong5, https://github.com/vkuzo commit fd553c46f401bdce1c74b3251495a72940729d5e Author: jjsjann123 Date: Fri Sep 30 23:19:25 2022 +0000 nvprim op support runtime checks on dtype compatibility on prims.convert_element_type (#85566) I'm seeing an issue where we lower `_to_copy` into `nvprims.convert_element_type`. In cases where we are casting to a dtype that's not supported by nvfuser, this raises a runtime error. I added a quick check in the lowering part where each op can peek at the fx.node and make a runtime decision on whether the given op should be lowered to nvprim. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85566 Approved by: https://github.com/IvanYashchuk, https://github.com/ngimel commit 01292cc9e498b74960a5e4de68dfd577f4cb14de Author: Nikita Shulga Date: Fri Sep 30 23:13:42 2022 +0000 [BE] Get rid of `std::result_of` in `c10` (#85977) As it is deprecated and to be removed in C++20 Fixes https://github.com/pytorch/pytorch/issues/85962 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85977 Approved by: https://github.com/kit1980 commit c2d9ea7f4b54c7d4332bc457fd76238c61f129de Author: Elias Ellison Date: Thu Sep 29 21:20:53 2022 +0000 Fix fake tensor kernel nesting (#85920) If you e.g. printed within a decomp which would call `in_kernel_invocation_manager`, on the exit from the manager it would unilaterally remove meta from the tls / set the tensor to return its real device. We should just restore what the existing state was.
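The shape of that fix is the usual save-and-restore pattern for nested context managers; a generic sketch (not the actual FakeTensor code):

```python
from contextlib import contextmanager

@contextmanager
def in_kernel_invocation(state: dict):
    # Save the previous flag and restore it on exit, instead of unconditionally
    # clearing it -- nested uses (e.g. a print inside a decomp) then behave.
    prev = state.get("in_kernel", False)
    state["in_kernel"] = True
    try:
        yield
    finally:
        state["in_kernel"] = prev

state = {"in_kernel": False}
with in_kernel_invocation(state):
    with in_kernel_invocation(state):
        pass
    assert state["in_kernel"] is True   # still inside the outer invocation
assert state["in_kernel"] is False
```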
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85920 Approved by: https://github.com/ezyang, https://github.com/bdhirsh commit 28061d50e6ea29a3400044f28a2c374ec8f4da17 Author: soulitzer Date: Fri Sep 30 15:40:25 2022 -0400 Lazily load decompositions for jvp (#85989) Reduces time it takes to run `python -c "import torch"` by ~10% See https://github.com/pytorch/pytorch/issues/85513 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85989 Approved by: https://github.com/albanD, https://github.com/zou3519 commit 334686bde752a8b34d02aac069cf3f910f7d8b70 Author: Ramin Azarmehr Date: Fri Sep 30 22:57:57 2022 +0000 Fix the dimension of padding to match the input's dimension (#85990) Fixes #85143 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85990 Approved by: https://github.com/malfet, https://github.com/kulinseth commit 24fc680ee4228225c01fb6699210056ca2603a3f Author: andrewor14 Date: Fri Sep 30 11:14:30 2022 -0700 [Quant] Enable XNNPACK ops in QNNPACK BackendConfig (#85863) **Summary:** This commit enforces the following constraints on the QNNPACK BackendConfig: - `quant_min_lower_bound` = -127 for qint8 weight - `quant_max_upper_bound` = 127 for qint8 weight - `scale_min_lower_bound` = 2 ** -12 for qint8 activations and weight These constraints will enable users to use this BackendConfig with faster XNNPACK quantized ops. They are also consistent with the existing settings in `default_symmetric_qnnpack_qconfig` and its per_channel and QAT variants. For more detail on why these exact values were chosen, please see the description of https://github.com/pytorch/pytorch/pull/74396. Note that there are currently no restrictions on the qscheme in DTypeConfig. This should be added in the future to further enforce the restriction that the weights must be quantized with either per_tensor_symmetric or per_channel_symmetric. Existing default QConfigs such as `get_default_qconfig("qnnpack")` and `get_default_qat_qconfig("qnnpack")` will continue to be supported, but only for the existing dtypes, e.g. quint8 activations for weighted ops like linear and conv. In the future, we should revisit whether to enable XNNPACK ops using these QConfigs as well. **Test Plan:** python test/test_quantization.py TestQuantizeFx.test_qnnpack_backend_config **Reviewers:** jerryzh168, vkuzo **Subscribers:** jerryzh168, vkuzo Pull Request resolved: https://github.com/pytorch/pytorch/pull/85863 Approved by: https://github.com/jerryzh168 commit d9421f81584145d17452864151d61aa694e601d5 Author: Fuzzkatt Date: Fri Sep 30 22:51:56 2022 +0000 added fix for WorkUCC (#84368) Added new constructor for WorkUCC to take in optional inputTensors argument for to enable record_shapes=True for profiling purposes. Tested at https://github.com/pytorch/pytorch/pull/84323 which manually merges in https://github.com/pytorch/pytorch/pull/83285. 
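The import-time saving from the "lazily load decompositions" change above comes from the standard lazy-initialization pattern; a generic sketch with hypothetical names, not the functorch code itself:

```python
_DECOMP_TABLE = None  # built on first use rather than at import time

def _build_decomposition_table():
    # Stand-in for the expensive registration work that used to run on import.
    return {"aten::example_op": lambda *args: args}

def get_decomposition_table():
    global _DECOMP_TABLE
    if _DECOMP_TABLE is None:
        _DECOMP_TABLE = _build_decomposition_table()
    return _DECOMP_TABLE

# Importing this module is cheap; the first call to get_decomposition_table()
# pays the one-time cost.
```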
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84368 Approved by: https://github.com/kingchc, https://github.com/kwen2501 commit a4cc63991ad351f1e98c4bac8955e34a0cb7b1a6 Author: Ramin Azarmehr Date: Fri Sep 30 22:40:50 2022 +0000 [MPS] Enable caching for random ops with Philox engine (#85833) Also fix a type cast issue in Bernoulli (Fixes #85611) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85833 Approved by: https://github.com/kulinseth, https://github.com/malfet commit 071f875046202b87213865dfc180abdf8368f116 Author: Digant Desai Date: Fri Sep 30 22:02:44 2022 +0000 [quant] Fix per channel weight observer (#85883) Summary: `per_channel_weight_observer_range_neg_127_to_127` now correctly uses `PerChannelMinMaxObserver` instead of `MinMaxObserver` Test Plan: Adds a new test `quantization.core.test_top_level_apis` to instantiate and run `forward()` on all `default` observers Differential Revision: D39916482 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85883 Approved by: https://github.com/salilsdesai commit 6a5550fca4144b11f89f1db4e32205e8dc295cbd Author: Kshiteej K Date: Fri Sep 30 21:45:37 2022 +0000 [test_nn] split embedding tests from test_nn (#85892) Ref https://github.com/pytorch/pytorch/issues/63085 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85892 Approved by: https://github.com/albanD commit 2037b7cb609b5621e82e5fe09bc806ce463e90b6 Author: Edward Z. Yang Date: Fri Sep 30 10:01:35 2022 -0700 Make FunctionalTensorWrapper correctly handle symbolic shapes (#85975) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85975 Approved by: https://github.com/bdhirsh, https://github.com/albanD commit 3b6588ab7451d32516115f558ea08e0dec6c6d53 Author: Edward Z. Yang Date: Fri Sep 30 10:01:35 2022 -0700 Consistent compute numel/contiguous strategy with SymInts (#85858) Previously, our handling for contiguity was inconsistent in the following ways: - is_strides_like 2d/3d and is_non_overlapping_and_dense were always computed based on sizes_and_strides_, even if you had symbolic ints - Furthermore, even if you set custom policy for strides, these quantities were not overridable by subclasses - Furthermore, we didn't even store these fields on ExtraMeta - We duplicate implementations of compute_contiguous (plain, channels last, channels last 3d) - We inconsistently called refresh_numel()/refresh_contiguous(), versus recomputing it ourselves This refactor establishes a consistent strategy for all of the boolean fields, and for numel computation. After this refactor: - All layout boolean fields are interposable via strides policy and can be overridden from Python; you will never access a garbage field - All layout boolean fields are on ExtraMeta - You can always call refresh_numel/contiguous, no matter if your Tensor is contiguous or not - The numel/layout boolean fields are always populated consistently with the sizes strides fields (either on Tensor or ExtraMeta), even if you have custom policy - There is only one implementation of the actual computation logic Signed-off-by: Edward Z. Yang Differential Revision: [D39907696](https://our.internmc.facebook.com/intern/diff/D39907696) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85858 Approved by: https://github.com/albanD commit 84a06d71936e61ceeee2abb9c9cb7bf5ee6440dd Author: Edward Z.
Yang Date: Fri Sep 30 09:55:45 2022 -0700 Enable convolution_backward with bias and symints (#85970) Originally by Krovatkin from https://github.com/pytorch/pytorch/pull/85816 Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85970 Approved by: https://github.com/albanD commit a4d10342e98b0abb3286a3780617afe108328ac7 Author: George Qi Date: Fri Sep 30 18:18:14 2022 +0000 [maskedtensor] use masked_softmax for forward/backward instead of regular softmax (#85845) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85845 Approved by: https://github.com/cpuhrsch commit 1c97084685f19435759f785d33fde7ea3a61afa7 Author: Nikita Shulga Date: Fri Sep 30 20:58:56 2022 +0000 [BE] Generate names of known device from array (#85982) Rather than hardcoding list of device names, generate it from list of known types. Performance is not important at the error codepath, as it will not be evaluated during normal codepath. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85982 Approved by: https://github.com/kit1980 commit 71eb04403ca46e19a3efcde454cedbc2f990dc12 Author: PyTorch MergeBot Date: Fri Sep 30 20:53:41 2022 +0000 Revert "[CUBLAS][CUDA GRAPHS] (re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85447)" This reverts commit b04b2fa9aa52cacbdc9aaaf477d55b0af845ce81. Reverted https://github.com/pytorch/pytorch/pull/85447 on behalf of https://github.com/seemethere due to Caused a CUDA memory leak, detected by our performance benchmark suite commit 401a358817b6657fc412b05fee6395f7e82a9226 Author: Catherine Lee Date: Fri Sep 30 20:44:12 2022 +0000 [ci] two procs for parallelization (#85985) hitting ooms on linux cuda so use 2 procs instead of 3 https://github.com/pytorch/pytorch/issues/85939 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85985 Approved by: https://github.com/huydhn commit e73e3e352312be7d4b293bed65da021a2fc81ab6 Author: Richard Zou Date: Fri Sep 30 09:30:18 2022 -0700 [functorch] test no warning on `import functorch` (#85980) Copied from https://github.com/pytorch/pytorch/blob/24adadd4dbcd90b5aba1d4a45847e4ffa83bd6cc/test/test_testing.py#L1808 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85980 Approved by: https://github.com/samdow commit eed5f0464c305974ae6a9cb8e6c685eb40c4477e Author: Richard Zou Date: Fri Sep 30 08:30:16 2022 -0700 [functorch] fix whirlwind tour ipynb (#85974) It was missing an "import torch" Pull Request resolved: https://github.com/pytorch/pytorch/pull/85974 Approved by: https://github.com/samdow commit 4c04fa9587fb534fa7c9848e06141bb862a56bb4 Author: Masaki Kozuki Date: Fri Sep 30 20:32:05 2022 +0000 Remove `optim_mt` from `test/test_optim.py` (#83549) As per title, this updates `test_optim.py` so that `foreach` optimizers are constructed using the `foreach` keyword argument of `torch.optim` optimizers. Also, this makes some cosmetic changes to remove `torch.autograd.Variable`, `.data` calls, and `torch._six`. 
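To illustrate the switch described in #83549 above: rather than importing the private `_multi_tensor` optimizer classes, tests (and users) can request the multi-tensor implementation through the keyword argument. A small sketch, assuming a build where `torch.optim` optimizers accept the `foreach` flag:

```python
import torch

model = torch.nn.Linear(4, 2)
# foreach=True selects the multi-tensor implementation, replacing the old
# torch.optim._multi_tensor entry points that the test previously constructed directly.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, foreach=True)

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```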
Related: https://github.com/pytorch/pytorch/pull/81705#discussion_r939440776 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83549 Approved by: https://github.com/ngimel commit 94da90e41f171975cc455dcf42e80918d06d978b Author: albanD Date: Fri Sep 30 20:07:05 2022 +0000 LU solve/unpack fix to prevent bad memory usage on CPU (#85922) Fixes https://github.com/pytorch/pytorch/issues/77898 Fixes https://github.com/pytorch/pytorch/issues/85026 There is a minor perf impact but: - For lu_solve, the actual compute is going to be more expensive than this O(n) check (ones pass over the other matrices is O(n^2) in any case) - For lu_unpack, the check inside the kernel should be almost free. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85922 Approved by: https://github.com/ngimel, https://github.com/nikitaved commit 7238ca4c2e865acff66170909e701cccacee928a Author: Richard Zou Date: Fri Sep 30 07:52:30 2022 -0700 Disallow saved tensor hooks in functorch transforms (#85972) Technically they may only be a problem with the grad transform. Though the branch cut is soon, this is the more conservative change, it also lets us disable checkpointing for functorch (which definitely doesn't work with all transforms) and not a lot of people use saved tensor hooks with functorch (I discovered this while testing). Test Plan: - new tests Differential Revision: [D39970934](https://our.internmc.facebook.com/intern/diff/D39970934) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85972 Approved by: https://github.com/samdow commit 7c72bc48d88d55ff687f8adfaec41b2c5d7c659f Author: Richard Zou Date: Fri Sep 30 07:52:26 2022 -0700 Add mechanism to disable the "saved tensors hooks" feature (#85971) The rationale for this is that functorch doesn't work with saved variable hooks at the moment or checkpointing and we need some way to disable it. Concretely: - there's a context manager that does the disabling - this feature is disabled on a thread-local basis - one can set an error message or use the default error message that says the feature has been disabled Since it is thread local I needed to update ATen/ThreadLocalState. To make things nicer, this PR refactors all the "saved tensors hooks" related TLS things into a single struct. Test Plan: - new test Differential Revision: [D39970936](https://our.internmc.facebook.com/intern/diff/D39970936) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85971 Approved by: https://github.com/albanD, https://github.com/soulitzer commit 69b927701a6369d90e273edca812bb9546aca67f Author: Justin Chu Date: Fri Sep 30 19:35:34 2022 +0000 [ONNX] Update user documentation (#85819) - Remove mentions of `SymbolicContext` in the doc - Comment out the PythonOp example so that it is not shown to users - Updated code blocks and wording - Changed to recommend using `pip` for installing onnx. 
Now adds a deprecation message to the docs (demo only): ![image](https://user-images.githubusercontent.com/11205048/193327649-f789b369-6b59-49e0-8bba-34a6785eb128.png) Fixes #85608 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85819 Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao commit 1fae890a07f180c64825faa214916dc0cbd6cb58 Author: samdow Date: Fri Sep 30 11:26:27 2022 -0400 fix grad silent correctness issue from view fn followed by an inplace fn (#85374) From https://github.com/pytorch/functorch/issues/1007, which was an issue where we would wrap aliases of unwrapped tensors and miss the inplace error message where we should have gotten it. Instead of keeping aliases unwrapped like I had originally wanted, this simplifies it slightly such that: (1) All tensors that were previously wrapped are still wrapped. This is occasionally important because of the 1-1 relationship between a tensor and autograd meta. By keeping the same number of wrapper tensors as before, we'll never have autograd try to write multiple autograd metas to the same tensor when it wouldn't before (2) The tensors that either were unwrapped tensors or aliases of unwrapped tensors now get a flag on them (now called `alias_of_unwrapped`). This way, they are still wrapper tensors (and don't have to potentially break autograd) but we can identify that they should be treated like an unwrapped tensor Pull Request resolved: https://github.com/pytorch/pytorch/pull/85374 Approved by: https://github.com/zou3519 commit 8d99d6127ef49db8286f6fb5dc7ae7e634c92a22 Author: Antonio Kim Date: Fri Sep 30 19:25:38 2022 +0000 Add torch_lazy_all_numbers_special_scalars flag (#85902) This is to allow even non-zero and non-one scalars to appear as constants in the graph. The assumption being that none of them will change. The flag is set to `false` by default to preserve the original behaviour. CC: @wconstab @JackCaoG @ke1337 @vaibhavc-cerebras @glebk-cerebras Pull Request resolved: https://github.com/pytorch/pytorch/pull/85902 Approved by: https://github.com/wconstab commit be327ec08f320e256d444693dde65fe55831bc46 Author: Denis Vieriu <104024078+DenisVieriu97@users.noreply.github.com> Date: Fri Sep 30 18:51:43 2022 +0000 [MPS] Fix base shape size for view ops in case of multiple slices (#85934) Fixes https://github.com/pytorch/pytorch/issues/84364, https://github.com/pytorch/pytorch/issues/85592 Fixes a bug for view ops where the base shape would be incorrectly determined. E.g. for the following tensor `torch.tensor([0.5, 0.5], device="mps")[1][None]`, we could consider the base shape of the parent tensor as 1, while the actual base shape is 2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85934 Approved by: https://github.com/kulinseth commit 8669f6d42691c2124414cc97d0061ea6a0143007 Author: Justin Chu Date: Fri Sep 30 18:33:00 2022 +0000 [ONNX] Fix layer_norm return type (#85979) When aten fallback is true, `_layer_norm_returns_normalized_input_mean_rstd` can return a single value. - Removed `_layer_norm_returns_normalized_input_mean_rstd` and have layer_norm call native_layer_norm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85979 Approved by: https://github.com/BowenBao commit 7ddf167ba5db277e02f983a6bde2bc3f5fbe1caa Author: Prashant Kumar Date: Fri Sep 30 18:30:06 2022 +0000 Move the asserts in shape functions upsample_nearest_2d op. (#85801) The assert checks are moved to the top and the function now returns out.
This is needed by the downstream torch-mlir project to correctly determine the output type. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/85801 Approved by: https://github.com/eellison commit b60ad2e5292db92e4b055abae78e692f5b8326f5 Author: George Qi Date: Thu Sep 29 23:51:05 2022 +0000 [maskedtensor] negative testing (#85938) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85938 Approved by: https://github.com/cpuhrsch commit 0a7d8b40b6f956a14b7ea02e04f596e914414c47 Author: Feisi Fu Date: Thu Sep 29 06:45:03 2022 +0000 Create a quantized in-place version CUDA ReLU function, relu_quantized_cuda_. (#85670) Summary: This and #85669 are to allow the relu function to run on a quantized tensor on cuda. That is, torch.relu(qa) for a quantized tensor qa on cuda. Test Plan: python test/test_quantization.py Previous PR that has been reverted: #85502. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85670 Approved by: https://github.com/dzdang, https://github.com/z-a-f commit eb650abc2c38f9f2f0e45f13877a4a57b8825cca Author: Pedro Nacht <15221358+pnacht@users.noreply.github.com> Date: Fri Sep 30 16:53:16 2022 +0000 Add OpenSSF Scorecard Action (#85412) Closes #85159 As per the linked issue, this PR adds the OpenSSF Scorecards GitHub Action, which automatically checks the repo's supply-chain security processes and reports results to the repo's Security dashboard. This current version of the workflow has the `id-token : write` permission. This is necessary in order to publish results to a public REST API the OpenSSF makes available for consumers to check participating projects' results. Naturally, if you'd rather not publish these results, I can modify the workflow to remove this behavior. The Action has an associated optional badge which can be added to the repo's README. However, given how PyTorch avoids badges, I have naturally not included it. (Let me know if you want it!) @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/85412 Approved by: https://github.com/malfet, https://github.com/huydhn commit 7e5105dd113dd6b4a920a3952088e3563ede1375 Author: Catherine Lee Date: Fri Sep 30 16:51:28 2022 +0000 [ci] expand log file if failed (#85927) As in the title, expand the logs if the test file failed ex https://github.com/pytorch/pytorch/actions/runs/3155045945/jobs/5133566508 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85927 Approved by: https://github.com/huydhn, https://github.com/janeyx99 commit 9ba1630bd729a35e903a8c411e3e5341de5ba165 Author: Alexander Grund Date: Fri Sep 30 16:45:41 2022 +0000 Limit world size in test_fsdp_pure_fp16 (#85957) When using more than 5 GPUs for this test, the difference between the reference output tensor and the FSDP output tensor becomes too large, likely due to the usual floating point inaccuracies, especially as FP16 is used. So set the world size (i.e. the number of GPUs) to a maximum of 5. Fixes #78975 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85957 Approved by: https://github.com/awgu commit 3a13c8493a06973f671604b17dd9ef8836eec52c Author: Rohan Varma Date: Fri Sep 30 16:28:17 2022 +0000 [1.13] Mention optim_input future BC breakage (#85963) We should remove this arg when the release after 1.13 rolls around; enhance the warning to indicate it will be gone. We can do this as FSDP is still beta and can be BC breaking until we stabilize the API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85963 Approved by: https://github.com/awgu commit d003757a841b1f8691904257128d91fd699c6c54 Author: Will Constable Date: Fri Sep 30 16:10:31 2022 +0000 Clone symint on set_sizes_and_strides (#85878) From the perspective of having valid sympy expressions for any given size/stride property, we can have tensors inherit SymInts from each other (in cases where the size expression is unchanged, which is a common case). But we also use SymInts to let us build graph traces of our programs, and we need to be able to trace from a SymInt back to the tensor that it originated from in order to trace correct graphs. This change ensures each tensor starts with fresh SymInts. - note: our policy has already been to use PySymIntNode objects to store pointers to proxy-tracer objects for use during tracing - before making this change (to clone symints), sometimes we'd attempt to store more than one proxy-tracer object on the same symint and the last-stored one would clobber all the earlier ones. This would result in tracing the wrong graph in some cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85878 Approved by: https://github.com/ezyang commit 24adadd4dbcd90b5aba1d4a45847e4ffa83bd6cc Author: PyTorch MergeBot Date: Fri Sep 30 14:32:49 2022 +0000 Revert "Disallow saved tensor hooks in functorch transforms (#85829)" This reverts commit d8277d9075396a3188490c322648605927384ba5. Reverted https://github.com/pytorch/pytorch/pull/85829 on behalf of https://github.com/atalman due to Reverting since failed build-fisp-diff-linux_platform010-opt commit 801818f9e6bb8684a1c41dc6ef3c74ad62feeb4d Author: PyTorch MergeBot Date: Fri Sep 30 14:31:09 2022 +0000 Revert "Add mechanism to disable the "saved tensors hooks" feature (#85553)" This reverts commit 5aa183d2bc7372b4deb4e4b2f31017be9f13264c. Reverted https://github.com/pytorch/pytorch/pull/85553 on behalf of https://github.com/atalman due to Reverting since failed build-fisp-diff-linux_platform010-opt commit b13b10a8fab83c9c260e16a8cfb4d99140e9352b Author: erjia Date: Fri Sep 30 13:30:18 2022 +0000 Extend collate function that can register collate functions to handle specific types (#85748) As per request from Vision team, adding `collate` function with an extra argument of `collate_fn_map` to dispatch custom collate functions for non-collection objects and specific objects. If the type of batch element is not present in`collate_fn_map`, it will go through all keys in the insertion order to check if the type is a subclass of the key. If so, it will invoke the corresponding collate functions. And, `default_collate` will utilize the `collate` function with a few by default collate function for `int`, `float`, `str` and `numpy object`. Benefit: - Domain teams can register their own `collate` function to handle their specific type of objects - Easier for users to extend from the `collate` function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85748 Approved by: https://github.com/NivekT, https://github.com/pmeier commit b00a5359f750a75e3722327144a5ce2170f6e28a Author: Ivan Yashchuk Date: Fri Sep 30 12:01:45 2022 +0000 Add a way to skip lowering to nvprims (#85811) This PR adds `skip_ops` argument to `TorchRefsNvfuserCapabilityMode` and `NvfuserPrimsMode` which is an iterable of function names to be skipped in the translation to nvprims process. 
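Stepping back to the extensible `collate` mechanism of #85748 above, here is a self-contained sketch of the type-dispatch idea it describes — look up the element type in a map, then fall back to subclass checks in insertion order. This is illustrative only, not the actual `torch.utils.data` implementation:

```python
import torch


def my_collate(batch, collate_fn_map):
    elem_type = type(batch[0])
    if elem_type in collate_fn_map:                  # exact type match first
        return collate_fn_map[elem_type](batch)
    for registered, fn in collate_fn_map.items():    # then subclass match, in insertion order
        if isinstance(batch[0], registered):
            return fn(batch)
    return batch                                     # fall through: leave the batch untouched


collate_fn_map = {
    torch.Tensor: torch.stack,
    int: lambda b: torch.tensor(b),
    str: list,
}

print(my_collate([1, 2, 3], collate_fn_map))                               # tensor([1, 2, 3])
print(my_collate([torch.zeros(2), torch.ones(2)], collate_fn_map).shape)  # torch.Size([2, 2])
```

Domain libraries would register additional entries in such a map for their own element types, which is the benefit the commit message calls out.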
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85811 Approved by: https://github.com/mruberry, https://github.com/jjsjann123 commit 787028cadb7fe83986111ffb7ddb058a68b763c0 Author: lezcano Date: Thu Sep 29 18:19:57 2022 +0000 Implement col2im decomposition and fix im2col and add a few preconditions (#85541) As per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/85541 Approved by: https://github.com/jansel commit 1f38abb5d2d3b9458b395bb31b684aeef14ca99f Author: Ke Wen Date: Fri Sep 30 09:17:49 2022 +0000 Adopt ncclRemoteError (#85887) `ncclRemoteError` was added in NCCL 2.13 to indicate a network error or a remote process exiting prematurely. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85887 Approved by: https://github.com/wanchaol commit 8f4edf1e1dc9419f0bab66a67c8f149d7b53fc25 Author: BowenBao Date: Thu Sep 29 20:18:37 2022 -0700 [ONNX] Initial version of diagnostics infrastructure. (#85107) This PR introduces a general Python diagnostics infrastructure powered by SARIF, and the exporter diagnostics module that builds on top of it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85107 Approved by: https://github.com/abock, https://github.com/justinchuby commit dab1c7c379d9597d8424aa55da888cfabe9ade3b Author: Nikita Shulga Date: Fri Sep 30 06:33:42 2022 +0000 Update trunk CUDA-10.2 to CUDA-11.7 (#85943) As CUDA-10.2 is finally disabled Pull Request resolved: https://github.com/pytorch/pytorch/pull/85943 Approved by: https://github.com/huydhn, https://github.com/atalman commit ade1c19612ae84654f76aa9e5c709de6d9654d72 Author: Ke Wen Date: Fri Sep 30 05:48:16 2022 +0000 Add reduce_scatter_tensor in place of _reduce_scatter_base (#85867) This is a twin PR similar to the one for `all_gather_into_tensor` (#85686). The philosophy for renaming `_reduce_scatter_base` instead of merging it is described in #85686. Cc @rohan-varma @H-Huang @crcrpar @ptrblck @mrshenli Pull Request resolved: https://github.com/pytorch/pytorch/pull/85867 Approved by: https://github.com/crcrpar, https://github.com/H-Huang commit 33401ee81f91d213a4c24ec0b4de266701179b48 Author: BowenBao Date: Thu Sep 29 13:42:34 2022 -0700 [ONNX] Rename 'sarif_om' to 'sarif' (#85918) 'sarif_om' was the module name in the original repository https://github.com/microsoft/sarif-python-om. But since we have moved along with various extensions, it wouldn't hurt to rename the module for clarity. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85918 Approved by: https://github.com/abock, https://github.com/thiagocrepaldi, https://github.com/justinchuby commit 6bb0a36d0ecc886bf31ae917f244f039501a779a Author: BowenBao Date: Thu Sep 29 13:42:29 2022 -0700 [ONNX] Add type annotation for SARIF attributes (#85898) Separated from #85651 to highlight the type annotation changes. It should support all type annotations needed by SARIF, except for the dictionary types described verbally like the following example. For now it is only annotated as `Any`. To enable it, we will need to extend `jschema_to_python` tool to allow passing in type hints. ```json "messageStrings": { "description": "A set of name/value pairs with arbitrary names. Each value is a multiformatMessageString object, which holds message strings in plain text and (optionally) Markdown format. 
The strings can include placeholders, which can be used to construct a message in combination with an arbitrary number of additional string arguments.", "type": "object", "additionalProperties": { "$ref": "#/definitions/multiformatMessageString" } }, ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85898 Approved by: https://github.com/justinchuby, https://github.com/abock, https://github.com/thiagocrepaldi commit e9b254a025b493df7a5f16a4f3f4641f07adf44b Author: BowenBao Date: Thu Sep 29 13:42:28 2022 -0700 [ONNX] Migrate SARIF from attr to dataclasses (#85651) Move to dataclasses since PyTorch does not depend on `attr`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85651 Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/abock, https://github.com/thiagocrepaldi commit 91667d1d218937eb85ac1db8e22b8ab94213be9f Author: BowenBao Date: Thu Sep 29 13:42:28 2022 -0700 [ONNX] Introduce SARIF (#85428) That's the parent issue tracking this and more follow-up tasks, so it will be kept open after this. This PR introduces the Python classes for the SARIF object model, along with the script for generation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85428 Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/abock, https://github.com/thiagocrepaldi commit 1ad0048b64d0e709482d387419947c9142b94b04 Author: Min Si Date: Fri Sep 30 05:13:48 2022 +0000 Refactor distributed to use absolute header path (#85780) Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "". However, relative paths cannot be gracefully handled by the Meta internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is the state of the art in most components in PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute. See D39835774 for more details about the Meta internal complication. **How to test**: commit 9e5d199 removes -I./torch/csrc/distributed in compile options. Thus use it to verify we don't miss any relative path use of torch/csrc/distributed headers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780 Approved by: https://github.com/kumpera, https://github.com/huydhn commit 0b0ce72b250f4f65ceee6909ad8743e5174c3579 Author: Taylor Robie Date: Wed Sep 28 15:42:59 2022 -0700 [Profiler] Extend ID assignment to allocations and frees (#85719) This is necessary for memory profiling because we need to know how to interpret an allocation. However, there is a slight wrinkle: we don't know if an allocation is for a Tensor's StorageImpl until we see it used in a later call. (We could record outputs, however we're not willing to incur the overhead.) So we instead treat all allocations as relevant and then filter out some later. Otherwise the change to the ID assignment algorithm is minimal.
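For a sense of what the attr-to-dataclasses move in #85651 above amounts to, here is a tiny sketch of a SARIF-like record expressed with the standard library only. The field names are illustrative, not the generated SARIF classes:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Message:
    text: str
    markdown: Optional[str] = None  # optional Markdown rendering of the same message


@dataclass
class Result:
    rule_id: str
    message: Message
    related: List[str] = field(default_factory=list)


r = Result(rule_id="POE001", message=Message(text="example diagnostic"))
print(r)  # Result(rule_id='POE001', message=Message(text='example diagnostic', ...), related=[])
```

Using `dataclasses` keeps the generated object model dependency-free, which is the stated motivation for dropping `attr`.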
Differential Revision: [D39788870](https://our.internmc.facebook.com/intern/diff/D39788870/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85719 Approved by: https://github.com/chaekit commit 95681929e4c379c504d8a7761f8104118a5a16db Author: Edward Yang Date: Fri Sep 30 03:19:09 2022 +0000 Hotfix for S298125 (#85814) Summary: Crash error is:
```
Mismatch in kernel C++ signatures
  operator: aten::cat
  no debug info
  kernel 1: FN2at6TensorEN3c108ArrayRefIS0_EExE
    dispatch key: Metal
    registered at buck-out/gen/a1f97bbb/fbobjc/Libraries/FBPyTorchCore/torch_core_ig_ops_metal/aten/src/ATen/native/metal/ops/MetalConcat.mm:205
  kernel 2: FN2at6TensorERKN3c108IListRefIS0_EExE
    dispatch key: CPU
    registered at buck-out/gen/a1f97bbb/fbobjc/Libraries/FBPyTorchCore/torch_core_ig_ops_aten/RegisterCPU.cpp:29749
Exception raised from registerKernel at xplat/caffe2/aten/src/ATen/core/dispatch/OperatorEntry.cpp:130 (most recent call first):
```
We fix it by changing the Metal kernel to take an IListRef instead of an ArrayRef. Test Plan: Build igios per https://www.internalfb.com/intern/wiki/IOS_On_Demand/iOS_On_Demand_Use_Guide/ and show it doesn't crash Differential Revision: D39888394 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85814 Approved by: https://github.com/SS-JIA commit a50d8864fc6a7821134a76927ff292575e5ecc85 Author: PyTorch MergeBot Date: Fri Sep 30 02:04:29 2022 +0000 Revert "Refactor distributed to use absolute header path (#85780)" This reverts commit 668082718aefce95ecc1b1c312ea6f127b2c662e. Reverted https://github.com/pytorch/pytorch/pull/85780 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks the build due to a missing file commit 668082718aefce95ecc1b1c312ea6f127b2c662e Author: Min Si Date: Fri Sep 30 00:27:24 2022 +0000 Refactor distributed to use absolute header path (#85780) Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "". However, relative paths cannot be gracefully handled by the Meta internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is the state of the art in most components in PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute. See D39835774 for more details about the Meta internal complication. **How to test**: commit 9e5d199 removes -I./torch/csrc/distributed in compile options. Thus use it to verify we don't miss any relative path use of torch/csrc/distributed headers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780 Approved by: https://github.com/kumpera commit 81b366a9dd519bc3d93d307bff81c279c9510e1b Author: Abhishek Pathak Date: Fri Sep 30 00:24:16 2022 +0000 [MPS] Handle scalar input for scatter and gather (#85842) Issue noticed in test consistency - "Indexing dim 0 is out of bounds of tensor" Pull Request resolved: https://github.com/pytorch/pytorch/pull/85842 Approved by: https://github.com/kulinseth commit 62a4fd7907f0f2c667f05aa9a4d1eec7190a6c83 Author: Abhishek Pathak Date: Fri Sep 30 00:19:14 2022 +0000 [MPS] Handle output shape for empty input in binary ops (#85836) Output of input shape [0,1,2] should be [0,1,2], not [0] i.e.
delay returning from empty input condition to resize/reshape the output accordingly Pull Request resolved: https://github.com/pytorch/pytorch/pull/85836 Approved by: https://github.com/DenisVieriu97, https://github.com/kulinseth commit ae93a4dc43a7103b1856b83456b247dd3395fe47 Author: Richard Barnes Date: Fri Sep 30 00:09:02 2022 +0000 Make launch check exit code depend on results (#85886) It is possible for code to land that doesn't check kernel launches for success (#85885) fixes such an issue. I think setting the return code of the linter is the correct way of handling this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85886 Approved by: https://github.com/ezyang commit dde43d083b2ece2f659d6ef45e18d4d24b55a1b2 Author: Andrey Date: Thu Sep 29 23:44:57 2022 +0000 [c10d] Reorder macros so they are defined before getting used (#85850) Summary: Move preprocessor macros all the way up, so they are defined before being used. Test Plan: existing tests Reviewed By: wanchaol Pull Request resolved: https://github.com/pytorch/pytorch/pull/85850 Approved by: https://github.com/wanchaol commit 9009393f46515c4dbc5ae5d9054e8c2df48ee5c5 Author: Justin Chu Date: Thu Sep 29 23:26:54 2022 +0000 [ONNX] Remove protocol dataclass (#85916) Remove the `_WithOp` protocol because it is not used and causes the dataclass `GraphContext` to not be able to init in some python versions. Reference to issue of dataclasses Inheriting from Protocol https://github.com/python/cpython/issues/89244 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85916 Approved by: https://github.com/BowenBao, https://github.com/abock, https://github.com/thiagocrepaldi commit 6a14fcb9223d27d4be1d19d24e535707f76a3e01 Author: Denis Vieriu Date: Thu Sep 29 23:23:00 2022 +0000 [MPS] Add support for aten::masked_select on mps (#119) (#85818) Reuse the `index.Tensor_out` implementation since it's already expanding the bool/byte indices to long tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85818 Approved by: https://github.com/kulinseth commit 85258ec17eb12b57943cb9b3157d2696f2097fbe Author: George Qi Date: Thu Sep 29 20:14:55 2022 +0000 Add mask_type=2 to masked_softmax for when mask.size() == input.size() (#85915) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85915 Approved by: https://github.com/cpuhrsch commit 6004c65af8fe9c5bbd12811dfb42f9e369b9ebce Author: Kevin Stephano Date: Thu Sep 29 23:06:15 2022 +0000 Fix rand_like nvprim meta function. (#85882) Really minor fix necessary to work with TorchDynamo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85882 Approved by: https://github.com/mruberry, https://github.com/jjsjann123 commit 103a21f4809f586b3c07aa37aa59e8b234dd2880 Author: vfdev Date: Thu Sep 29 22:43:07 2022 +0000 Update _torch_docs.py (#85924) Typo fix Pull Request resolved: https://github.com/pytorch/pytorch/pull/85924 Approved by: https://github.com/kit1980 commit c036fb3e7d50a4d239218e404f1d304669c035c3 Author: Yu Guo Date: Thu Sep 29 10:43:07 2022 -0700 assert lambda >= 0 in poisson distribution cuda kernel (#85906) fix https://github.com/pytorch/pytorch/issues/85731 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85906 Approved by: https://github.com/ngimel commit bc57306bdd6a041e64d77e8bc8fdb470e6ff0815 Author: Kazuaki Ishizaki Date: Thu Sep 29 21:41:59 2022 +0000 Fix typo under docs directory and RELEASE.md (#85896) This PR fixes typo in rst files under docs directory and `RELEASE.md`. 
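To make the reuse in #85818 above concrete: `masked_select` is equivalent to integer indexing once the boolean mask is converted to long indices, which is why the existing `index.Tensor_out` path could be shared. A small CPU-only sketch of that equivalence:

```python
import torch

x = torch.arange(6.0).reshape(2, 3)
mask = x > 2

a = torch.masked_select(x, mask)      # 1-D tensor of the selected elements
b = x[mask.nonzero(as_tuple=True)]    # same result via long (integer) indices
print(torch.equal(a, b))              # True
```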
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85896 Approved by: https://github.com/kit1980 commit 11224f34b8d1eedd9806168532c1ad5f2adb1508 Author: Yu Guo Date: Thu Sep 29 21:20:38 2022 +0000 assert weights being 1-d tensor in bincount (#85881) Summary: as title, fix https://github.com/pytorch/pytorch/issues/85777 Test Plan: unittest added Differential Revision: D39913476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85881 Approved by: https://github.com/mruberry, https://github.com/ngimel commit 6db3539e700ce7a81be356700f0803b2002bc63c Author: PyTorch MergeBot Date: Thu Sep 29 20:06:52 2022 +0000 Revert "Improve make_tensor performance for float and complex types (#85473)" This reverts commit a76995e584b880910f0724be98eb21773e8ed6e9. Reverted https://github.com/pytorch/pytorch/pull/85473 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it seems to cause a bunch of flaky tests in pull and periodic commit 50000f3cdcc4f0c4e29ec20b52fd54723092b95a Author: Richard Zou Date: Thu Sep 29 14:46:23 2022 -0400 Align functorch docs with PyTorch's (#85856) This PR: - changes the header/footer to be the same as PyTorch docs - removes the functorch logo (we don't need it anymore, functorch has been adopted into PyTorch) - adjusts the functorch docs to make it clear that the page is functorch documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85856 Approved by: https://github.com/svekars, https://github.com/samdow commit 19c7a6b54b9b29991cfdc5607361c2f30f5a6248 Author: Richard Zou Date: Thu Sep 29 14:46:23 2022 -0400 [functorch] Update notebooks for latest release (#85855) This PR: - dedups our colab notebooks with the regular functorch notebooks. The colab notebooks were versions of the regular notebooks that had install instructions. Now that functorch is easier to install, we do not need those anymore. - fixes the colab links Test Plan: - build docs locally and tested them Pull Request resolved: https://github.com/pytorch/pytorch/pull/85855 Approved by: https://github.com/samdow commit 48b3582e28141d0f7ebc27dcbca5ead1825df76f Author: Richard Zou Date: Thu Sep 29 14:46:22 2022 -0400 [functorch] Update install instructions in docs (#85854) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85854 Approved by: https://github.com/samdow commit 06868004b7cf10241a68be74d35a536572e650bc Author: John Detloff Date: Thu Sep 29 19:49:11 2022 +0000 Remove codesigning from ios circleci workflows (#85630) This PR is a follow-up to https://github.com/pytorch/pytorch/pull/85597 which removes codesigning from our github action workflows. This is an analogous change to our circleci workflows. Since we only run TestApp on the simulator, we don't need to have this codesigning logic.
(And more pressingly, these dev cert is expiring at the end of the month and we don't have a replacement) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85630 Approved by: https://github.com/atalman, https://github.com/malfet commit a9183c0f9ecc9be47fcb7abf1b23204d26821aa8 Author: Tugsbayasgalan (Tugsuu) Manlaibaatar Date: Thu Sep 29 19:16:17 2022 +0000 Fix bug in PythonFallBack (#85795) Summary: Previously PythonCallBack fails to find interpreter to dispatch to when it encounters an op with OptionalTensorList parameter, this diff fixes that Test Plan: CI Differential Revision: D39881382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85795 Approved by: https://github.com/ezyang, https://github.com/bdhirsh commit fe87ae692f813934d1a74d000fd1e3b546c27ae2 Author: Alexander Grund Date: Thu Sep 29 18:36:33 2022 +0000 Fix `check_compiler_ok_for_platform` on non-English locales (#85891) The function checks the output of e.g. `c++ -v` for "gcc version". But on another locale than English it might be "gcc-Version" which makes the check fail. This causes the function to wrongly return false on systems where `c++` is a hardlink to `g++` and the current locale returns another output format. Fix this by setting `LC_ALL=C`. I found this as `test_utils.py` was failing in `test_cpp_compiler_is_ok` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85891 Approved by: https://github.com/ezyang commit 0449cf0c9e469f052bb9316b13260d126d6f01d4 Author: Richard Zou Date: Wed Sep 28 14:53:51 2022 -0700 Re-introduce the functorch docs build (#85838) We deleted it when merging functorch into pytorch. This PR makes a new functorch docs build. The docs are relatively simple: - cd into `functorch/docs` and run `make html` to build the docs. - docs should get pushed to the pytorch/functorch repo's gh-pages branch. The long term plan is: - one day, the functorch APIs will just be torch.* APIs, at which point we can fold all of the functorch docs into the regular PyTorch docs - When that happens, the functorch examples and tutorials (that are on the functorch docs site) can be moved to the pytorch examples and pytorch tutorials. Test Plan: - check docs preview - watch this PR after it goes in Pull Request resolved: https://github.com/pytorch/pytorch/pull/85838 Approved by: https://github.com/malfet commit 941d7a31f65d4c6da3b94178be254f7ac20d482e Author: Saliya Ekanayake Date: Thu Sep 29 17:28:58 2022 +0000 Pass group ranks and options to third party distributed backends (#73164) Fixes #73163 PyTorch's [_new_process_group_helper()](https://github.com/pytorch/pytorch/blob/9f541aa3aca768e7fbfa4a9d648b554f22b261f7/torch/distributed/distributed_c10d.py#L633) does not pass group's participating ranks to the backend. This PR adds the above capability. Also, refactors some variables for better clarity. Pull Request resolved: https://github.com/pytorch/pytorch/pull/73164 Approved by: https://github.com/kumpera commit e15a48def7f1e7b58710bcc4a3d18624948c5fbc Author: nikitaved Date: Thu Sep 29 17:12:04 2022 +0000 (bsr/csr) x dense mm (#85551) As per title. This implementation is not the most optimal and could be improved albeit with native kernels (i.e. block matching need not be materialized). Compared to existing kernels it offers: - Half float support (In fact, any dtype that supports `matmul` will work). - Arbitrary block sizes. 
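A minimal sketch of the blocked-sparse-times-dense product that #85551 above enables. This assumes a build containing that kernel and the `to_sparse_bsr` conversion, so treat it as illustrative rather than authoritative:

```python
import torch

dense = torch.randn(8, 8)
bsr = dense.to_sparse_bsr(blocksize=(2, 2))   # arbitrary block size, per the PR description
rhs = torch.randn(8, 4)

out = torch.matmul(bsr, rhs)                  # (bsr) x (dense) -> dense result
print(out.shape)                              # torch.Size([8, 4])
```

The PR notes that half precision (and any other dtype that supports `matmul`) should work as well, since the implementation defers to `matmul` on the materialized blocks.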
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85551 Approved by: https://github.com/amjames, https://github.com/cpuhrsch commit ef0baba23f65062cfda6dd25fa67d02dc1a06fea Author: Masaki Kozuki Date: Thu Sep 29 17:02:04 2022 +0000 Use `int64_t` for nll_loss with cuda inputs (#85395) Related #85005 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85395 Approved by: https://github.com/t-vi, https://github.com/lezcano commit 5f26df0345bef35de9cbf585ca0c1af1cd91b9c8 Author: Masaki Kozuki Date: Thu Sep 29 16:58:59 2022 +0000 resubmit: "resubmit: [mta] APEX style Fused Adam (#81705) (#85507)" (#85739) Embarrassingly move the pow implementations around [ATen/native/cuda/PowKernel.cu#L21-L66](https://github.com/pytorch/pytorch/blob/849b08f14b2a741d0b90bb7bfce0ebb3d07d1981/aten/src/ATen/native/cuda/PowKernel.cu#L21-L66) to a new header file and let FusedAdam use them to tame MSVC, hopefully. cc @ngimel @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/85739 Approved by: https://github.com/ngimel commit 44eefb1376b5a05568f047afde4193e951293625 Author: Kimish Patel Date: Tue Sep 27 18:06:32 2022 -0700 Update debug flag for vulkan (#85715) DEBUG is too generic name and when building some other target it seems to conflict with it, so defining VULKAN_DEBUG Differential Revision: [D39449772](https://our.internmc.facebook.com/intern/diff/D39449772/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39449772/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/85715 Approved by: https://github.com/SS-JIA commit ad3bea58daa0de46e7f2a5f2b3a397f9b3aba5fb Author: Kimish Patel Date: Tue Sep 27 18:06:29 2022 -0700 Add vulkan qualifier to the kernel name (#85714) This helps with any post processing as well as distinguishing the kernel name appearing in the chrome trace Differential Revision: [D39473299](https://our.internmc.facebook.com/intern/diff/D39473299/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85714 Approved by: https://github.com/SS-JIA commit e0af12a0765738e1367b3288be413d3f3737522e Author: Kimish Patel Date: Tue Sep 27 18:06:27 2022 -0700 [Pytorch][benchmark vulkan] Fix vulkan profiling (#85713) This diff: - adds interface to enable/disable profiling - Fixes profiling bug where ticks measured by timestamp queries are not accounting for timestampPeriod. Differential Revision: [D39449769](https://our.internmc.facebook.com/intern/diff/D39449769/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39449769/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/85713 Approved by: https://github.com/SS-JIA commit e0170c7cded06d4e91a2ee8a0512b80ed4b66e6e Author: Richard Zou Date: Wed Sep 28 09:59:18 2022 -0700 Remove torch/extension.h dependency in torch/csrc/functorch/init.cpp (#85659) This file doesn't depend on APIs there. Required adding some namespacing to symbols. 
Test Plan: - build & test Pull Request resolved: https://github.com/pytorch/pytorch/pull/85659 Approved by: https://github.com/Chillee commit 8fb470e81ad29d82204235194b809b8072942c70 Author: kshitij12345 Date: Thu Sep 29 15:40:09 2022 +0000 [fix] max_pool1d: shape check (#85594) Fixes #76587 Before PR: ```python import torch max_pool = torch.nn.MaxPool1d(3) t = torch.rand([17, 0, 50], dtype=torch.float32) # note requires_grad is False max_pool(t) # Worked and returned tensor of shape [17, 0, 48]. ``` After PR ```python import torch max_pool = torch.nn.MaxPool1d(3) t = torch.rand([17, 0, 50], dtype=torch.float32) # note requires_grad is False max_pool(t) # Errors with `max_pool1d: Expected 2D or 3D (batch mode) tensor with optional 0 dim batch size for input, but got: [17, 0, 48]` ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85594 Approved by: https://github.com/mruberry commit cab6ffa0f7a12ce50d50831910922014392ee173 Author: jjsjann123 Date: Thu Sep 29 15:22:45 2022 +0000 catches failure on nvprim speculative lowering (#85580) Fixes #85517 Added a try/catch exception during tracing `get_isolated_graphmodule` inside `_is_func_unsupported_nvfuser`. Stops speculative lowering to nvprim when query errors out. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85580 Approved by: https://github.com/mruberry, https://github.com/IvanYashchuk commit a807f1987a60324075ab0690af3b5f7cf9ecf319 Author: atalman Date: Thu Sep 29 15:04:24 2022 +0000 Stop cuda-10.2 binary builds (#85873) Deprecate cuda 10.2 nightly Pull Request resolved: https://github.com/pytorch/pytorch/pull/85873 Approved by: https://github.com/malfet commit 3cdf621fe5c8f8378fda209b8a143443d33b2086 Author: Jane Xu Date: Thu Sep 29 14:28:55 2022 +0000 Add opt-einsum to CI (#85574) Depends on https://github.com/pytorch/pytorch/pull/84890. This PR adds opt_einsum to CI, enabling path optimization for the multi-input case. It also updates the installation sites to install torch with einsum, but those are mostly to make sure it would work on the user's end (as opt-einsum would have already been installed in the docker or in prior set up steps). This PR also updates the windows build_pytorch.bat script to use the same bdist_wheel and install commands as on Linux, replacing the `setup.py install` that'll become deprecated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85574 Approved by: https://github.com/huydhn, https://github.com/soulitzer commit b25a1ce22d965a852da5979e7d6af9fe91518451 Author: Eric Zhang Date: Thu Sep 29 14:17:05 2022 +0000 Release GIL when doing shared memory copies on Tensors (#85389) See discussion here for context: https://pytorch.slack.com/archives/GEEQ2K4MD/p1663672716533319?thread_ts=1662155536.133099&cid=GEEQ2K4MD, opening a PR as suggested by @albanD Currently PyTorch holds the GIL when copying Tensors into shared memory. For certain workloads it would be nice to be able to copy different tensors into shared memory in parallel, but with the GIL being held the copies can't truly run in parallel. 
Here's a short example of this:
```
import torch
import time
from multiprocessing.pool import ThreadPool

tensors = []
for i in range(64):
    for j in range(8):
        t = torch.ones(128, 480, 640).type(torch.uint8) * i
        tensors.append(t)
print("Done generating input tensors")

with ThreadPool(processes=8) as pool:
    futures = []
    before = time.time()
    for t in tensors:
        future = pool.apply_async(t.share_memory_)
        futures.append(future)
    for f in futures:
        f.get()
    elapsed = time.time() - before
    print("ELAPSED TIME", elapsed)
```
With this diff, I get:
```
~$ python repro.py
Done generating input tensors
ELAPSED TIME 3.561321258544922
~$
```
Previously, I would get:
```
~$ python repro.py
Done generating input tensors
ELAPSED TIME 16.305657386779785
~$
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85389 Approved by: https://github.com/albanD commit 6fae62b35f0e4a0d93de6966dc1d9517e9b6ddff Author: PyTorch MergeBot Date: Thu Sep 29 13:51:05 2022 +0000 Revert "C10D extension to enable per-thread PG (#84153)" This reverts commit 5cbffbbac9a59098637f821e8b6e10f609de30ff. Reverted https://github.com/pytorch/pytorch/pull/84153 on behalf of https://github.com/kumpera due to broke internal stuff commit 976e2a350273b352eac7cbf7d11dcdeacfcba34d Author: Jithun Nair Date: Thu Sep 29 13:31:41 2022 +0000 Separate magma installation for ROCm into its own file (#85567) This aligns it with the builder repo scripts structure: https://github.com/pytorch/builder/blob/main/common/install_rocm_magma.sh https://github.com/pytorch/builder/blob/main/common/install_rocm.sh Pull Request resolved: https://github.com/pytorch/pytorch/pull/85567 Approved by: https://github.com/jeffdaily, https://github.com/huydhn commit 9fb72ca4941edb37c7529b3750617dc4bb6b4fc1 Author: lezcano Date: Wed Sep 28 12:16:55 2022 +0000 Treat layout / pin_memory consistently across creation refs (#85333) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85333 Approved by: https://github.com/mruberry, https://github.com/ngimel commit a76995e584b880910f0724be98eb21773e8ed6e9 Author: Peter Bell Date: Wed Sep 28 17:38:25 2022 +0100 Improve make_tensor performance for float and complex types (#85473) For floating types, `make_tensor` calls `rand` and then does a linear interpolation from `low` to `high`. This instead calls `uniform_(low, high)` to cut out the interpolation step. For complex types, `make_tensor` does the `rand` + interpolation step twice and calls `torch.complex(real, imag)` at the end. This instead uses `view_as_real` and `uniform_(low, high)` to fuse it all into one operation. My benchmarks show significant speedups in all cases for float32 and complex64.

| Device | dtype | Size | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-------------|--------------|---------|
| CPU | float32 | 8 | 19.4 | 6.34 | 3.1 |
| | | 4096 | 36.8 | 21.3 | 1.7 |
| | | 2**24 | 167,000 | 80,500 | 2.1 |
| | complex32 | 8 | 37.0 | 7.57 | 4.9 |
| | | 4096 | 73.1 | 37.6 | 1.9 |
| | | 2**24 | 409,000 | 161,000 | 2.5 |
| CUDA | float32 | 8 | 40.4 | 11.7 | 3.5 |
| | | 4096 | 38.7 | 11.7 | 3.3 |
| | | 2**24 | 2,300 | 238 | 9.7 |
| | complex32 | 8 | 78.7 | 14 | 5.6 |
| | | 4096 | 82.7 | 13.8 | 6.0 |
| | | 2**24 | 5,520 | 489 | 11.3 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85473 Approved by: https://github.com/mruberry commit ad87365e54e7b20b49ac23ee325f1da732655808 Author: Peizhao Zhang Date: Thu Sep 29 07:58:54 2022 +0000 [qat]A more stable conv_bn fusion for qat training.
(#85744) Summary: A more stable conv_bn fusion for qat training: * Existing implementation may cause QAT training loss become NaN. This could happen when the fused conv for qat (torch/nn/intrinsic/qat/modules/conv_fused.py) is used and is independent of if fake_quant is enabled. * This is caused by the unscaling for the conv output (`conv_orig = conv / scale_factor` where `scale_factor = bn.weight / running_std`) when there is 0 in `bn.weight`. * This implementation follows the [white paper](https://arxiv.org/pdf/1806.08342.pdf) better and fixed the issue by scaling `running_std / std_Y` instead and compute the fused output accordingly (see comments in conv_fused.py for more details): * It comes at the cost of running conv twice (one to update bn statistics and one to compute fake quant for fused weights). * It does not need to use conv bias for back prop. * It uses the bn statistics computed with the current input batch, while the existing code uses the statistics without the current batch. * The implementation could be enabled by setting the flag `_enable_slow_path_for_better_numerical_stability` to True after the model is prepared for QAT. * Unit test * Added test case for zero `bn.weight`. * Added test case for conv to has bias. Test Plan: buck run mode/dev-nosan //caffe2/test:quantization -- -r quantization.eager.test_quantize_eager_qat Differential Revision: D29506778 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85744 Approved by: https://github.com/vkuzo commit 3cfc61b84659cea435411a546eca6a891584247f Author: Seonglyong Gong Date: Thu Sep 29 07:28:33 2022 +0000 [Profiler][trivial] Optimizer states (part 4 of Record Optimizer) (#85840) Summary: - add states into OptInfo and update unit testcase Test Plan: buck run mode/opt //caffe2/test:profiler Reviewed By: chaekit Differential Revision: D39406540 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85840 Approved by: https://github.com/robieta commit 475022cd5d1fa3708c8d8728c6ae6eb34c053696 Merge: 5704c73b56 fd840676b0 Author: mingfeima Date: Thu Sep 29 14:41:56 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit fd840676b063e182ce55109470e7de389c9d9bfa Merge: 6b416bf681 1c0f0b33a0 Author: mingfeima Date: Thu Sep 29 14:41:56 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 72b32f164415e9f2e86767b2373c939f8f343d1b Author: Wanchao Liang Date: Wed Sep 28 17:43:03 2022 +0000 [c10d] move ncclgetlasterror directive definition upfront (#85825) Move the directive definition of ncclGetLastError() upfront so that C++ preprocessor does not treat this as a empty string Pull Request resolved: https://github.com/pytorch/pytorch/pull/85825 Approved by: https://github.com/H-Huang, https://github.com/kwen2501 commit dc63948dc9dbe4c224a78a7c14f406893f6fd381 Author: Justin Chu Date: Thu Sep 29 04:24:04 2022 +0000 [ONNX] Update behavior for `register_custom_op_symbolic` (#85636) Update `register_custom_op_symbolic`'s behavior to _only register the symbolic function at a single version_. This is more aligned with the semantics of the API signature. As a result of this change, opset 7 and opset 8 implementations are now seen as fallback when the opset_version >= 9. Previously any ops internally registered to opset < 9 are not discoverable by an export version target >= 9. Updated the test to reflect this change. 
The implication of this change is that users will need to register a symbolic function to the exact version when they want to override an existing symbolic. They are not impacted if (1) an implementation does not exist for the op, or (2) they are already registering to the exact version for export. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85636 Approved by: https://github.com/BowenBao commit 3c9e8cd8df5f5739ed20830a0bfffd966a5c11db Author: Feisi Fu Date: Wed Sep 28 21:10:40 2022 +0000 Create a quantized non-in-place version CUDA ReLU function, (#85669) Summary: this and #85670 are to allow the relu function to run on a quantized tensor on cuda. That is torch.relu(qa) for a quantized tensor qa on cuda. Test Plan: python test/test_quantization.py Previous PR that has been reverted: #85502. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85669 Approved by: https://github.com/dzdang commit 7628603aeeeb8ed160c2479f75175bb3ea028a42 Author: Seonglyong Gong Date: Thu Sep 29 03:58:34 2022 +0000 [Profiler] bug fix: python object reference counting (#85847) Summary: Wrong reference counting of Python objects caused intermittent, corner-case-only segfaults. - before: increment once, decrement in a loop. - after: increment and decrement in different but consistent loops. Test Plan: buck run mode/opt //caffe2/test:profiler Differential Revision: D39902973 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85847 Approved by: https://github.com/robieta, https://github.com/aaronenyeshi commit edb99df2e086fd22068f877c526b9424771bef0f Author: BowenBao Date: Tue Sep 27 16:42:07 2022 -0700 [ONNX] Fix reduce node shape inference (#85765) Fix logic in `ProcessReduceNode`. Previously a scalar was assigned for the output shape of reduce nodes when the `axes` attribute was not provided, regardless of the value of the `keepdims_i` attribute. Hence it was incorrectly assuming all output axes should be folded. Since the input rank is known, this fix populates axes to be `[0, 1, ..., input_rank - 1]` if axes is not provided. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85765 Approved by: https://github.com/abock commit 7e4684009c67ae6ce337e7c7727dc605b637af35 Author: soulitzer Date: Wed Sep 28 19:21:10 2022 -0400 Improve codegen for jvp decomposition (#84894) Fixes: https://github.com/pytorch/pytorch/issues/84888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84894 Approved by: https://github.com/albanD commit f23f362c5d10c47389eb0a1a93f45788c5abef8b Author: Taylor Robie Date: Wed Sep 28 15:42:58 2022 -0700 [Profiler] Use strong typedef for Tensor ID (#85718) I want to add Tensor ID to allocations (for allocs which are `StorageImpl`s). To keep things safe and organized I need to pull the ID type into a standalone entity, which makes it an ideal time to convert to a strong typedef. Differential Revision: [D39788872](https://our.internmc.facebook.com/intern/diff/D39788872/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85718 Approved by: https://github.com/chaekit commit 282d8dfa68cc5098db050bef3991a62fbec4825e Author: Taylor Robie Date: Wed Sep 28 15:42:56 2022 -0700 [Profiler] Fix traversal utility (#85717) `eventTreeDFS` traverses in the wrong order (right to left). Moreover, we will need more complex traversal (e.g. early stopping) for the memory profiler. Thus, I made a simple general `_traverse` method and added `functools.partial` specializations for DFS and BFS.
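The traversal refactor described just above (#85717) can be pictured with a generic walker whose ordering is fixed by `functools.partial`: a left-to-right DFS pushes children onto the front of the frontier, while BFS appends them to the back. A minimal stand-alone sketch (not the profiler's actual data structures):

```python
import functools
from collections import deque


def traverse(roots, get_children, extend_fn):
    """Generic tree walk; extend_fn decides where children are queued."""
    frontier = deque(roots)
    while frontier:
        node = frontier.popleft()
        yield node
        extend_fn(frontier, get_children(node))


# DFS (left to right): children go to the FRONT, preserving their order.
traverse_dfs = functools.partial(
    traverse, extend_fn=lambda q, kids: q.extendleft(reversed(list(kids))))
# BFS: children go to the BACK.
traverse_bfs = functools.partial(
    traverse, extend_fn=lambda q, kids: q.extend(kids))

tree = {"a": ["b", "e"], "b": ["c", "d"], "c": [], "d": [], "e": []}
print(list(traverse_dfs(["a"], tree.get)))  # ['a', 'b', 'c', 'd', 'e']
print(list(traverse_bfs(["a"], tree.get)))  # ['a', 'b', 'e', 'c', 'd']
```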
Differential Revision: [D39788871](https://our.internmc.facebook.com/intern/diff/D39788871/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85717 Approved by: https://github.com/chaekit commit dfdfaec3fc599c2bb7a8ffaf1215e0284a2f4aa8 Author: Taylor Robie Date: Wed Sep 28 15:42:54 2022 -0700 [Profiler] Don't assign in AppendOnlyList::emplace_back (#85716) It turns out that we're invoking the copy assign operator in AppendOnlyList. While copy elision is expected to mostly hide any costs, it does present issues for types with deleted copy assign operators. (It also seems to produce slightly worse assembly: https://godbolt.org/z/o4Gvz1fKs) Calling new at the correct position seems to be a better way to go about this. (At least from looking at other high performance containers like SmallVector.) Differential Revision: [D39852804](https://our.internmc.facebook.com/intern/diff/D39852804/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85716 Approved by: https://github.com/chaekit commit bd65adf4e9e59ac7de1d7f7d329a5df4237dcc5f Author: soulitzer Date: Wed Sep 28 19:21:10 2022 -0400 Properly fix log_sigmoid vmapjvp and remove hack (#84892) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84892 Approved by: https://github.com/albanD, https://github.com/zou3519 commit cca909645f9c3528469381e4236a468667258a1e Author: CaoE Date: Thu Sep 29 01:16:16 2022 +0000 Add bfloat16 support for lerp on CPU (#84327) Add bfloat16 support for lerp on CPU. Single core:

op | shape | fp32 forward/ms | bf16 forward/s | fp32 backward/s | bf16 backward/s
-- | -- | -- | -- | -- | --
lerp (tensor) | [10, 128, 10, 124] | 0.005489 | 0.000613 | 0.006658 | 0.003385
  | [10, 128, 20, 124] | 0.011057 | 0.001204 | 0.016032 | 0.007869
  | [10, 128, 30, 124] | 0.016691 | 0.001954 | 0.025549 | 0.012823
lerp (scalar) | [10, 128, 10, 124] | 0.001096 | 0.000507 | 0.002024 | 0.001479
  | [10, 128, 20, 124] | 0.00247 | 0.000997 | 0.005468 | 0.002907
  | [10, 128, 30, 124] | 0.004178 | 0.001513 | 0.009775 | 0.004859

Single socket (28 cores):

op | shape | fp32 forward/s | bf16 forward/s | fp32 backward/s | bf16 backward/s
-- | -- | -- | -- | -- | --
lerp (tensor) | [10, 128, 10, 124] | 0.000236 | 3.93E-05 | 0.000494 | 0.000235
  | [10, 128, 20, 124] | 0.000525 | 7.39E-05 | 0.002485 | 0.000638
  | [10, 128, 30, 124] | 0.000801 | 0.000121 | 0.004235 | 0.001529
lerp (scalar) | [10, 128, 10, 124] | 5.90E-05 | 3.32E-05 | 0.000129 | 0.000116
  | [10, 128, 20, 124] | 0.000155 | 5.87E-05 | 0.000368 | 0.000206
  | [10, 128, 30, 124] | 0.000324 | 9.04E-05 | 0.001322 | 0.000313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84327 Approved by: https://github.com/frank-wei commit 7cdd39b39316173487c0c4cfbb60aef0cb645757 Author: Justin Chu Date: Thu Sep 29 00:52:21 2022 +0000 [ONNX] Update `unconvertible_ops` (#85595) Update `unconvertible_ops` to create a list of unconvertible ops using the updated registry. - Use fewer passes in the jit process instead to avoid errors during conversion in the ONNX fallback mode - Actually check the registry to find implemented ops - Fix type hints for `_create_jit_graph` and `_jit_pass_onnx_remove_inplace_ops_for_onnx` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85595 Approved by: https://github.com/BowenBao commit ada6e5b53a55b5acfd48c503c94296d871296bb7 Author: Edward Z.
Yang Date: Wed Sep 28 17:28:26 2022 -0400 Implement duck shaping on SymInts (#85808) Duck shaping says that when two input tensors have the same size, we assume they are symbolically related. This follows the same optimization done by inductor. This optimization is not done completely because we don't currently install guards corresponding to the duck shape relationships we created, but overall the guard propagation for dynamic shape tracing is incomplete at the moment. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85808 Approved by: https://github.com/albanD commit 3a3e2002d88ca2491170065c47cc50ce435fb92f Author: Xia, Weiwen Date: Thu Sep 29 00:44:40 2022 +0000 [Quant] Add unified x86 quant backend (#84329) Implement unified quantization backend 'X86' for x86 platforms. It combines the advantages of FBGEMM and ONEDNN. It selects kernels during weight prepacking and hide the details from end users. It will be the default backend in place of FBGEMM. For details, please refer to this RFC: [[RFC] Unified quantization backend for x86 CPU platforms](https://github.com/pytorch/pytorch/issues/83888) **Correctness** Covered by UT **Accuracy** By running torchvision models on imagenet, no accuracy difference is found between FBGEMM and the unified X86 backend: [torchvision_accuracy_comparison_fbgemm_vs_x86.xlsx](https://github.com/pytorch/pytorch/files/9598114/torchvision_accuracy_comparison_fbgemm_vs_x86.xlsx) **Performance** Depends on https://github.com/pytorch/pytorch/pull/84470 which improves performance. For early PoC results, please refer to https://github.com/pytorch/pytorch/files/9399202/unified_qengine_poc_performance_bechmark.xlsx With the two PRs combined, we collected some data on Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz Method: Run multi-instances with 4 cores per instance on whole socket. Using JeMalloc and Intel OMP. 
Models/throughput | fbgemm | x86 | improvement
-- | -- | -- | --
wide_resnet101_2 | 173.5675 | 241.815 | 39.32%
resnext101_32x8d | 174.365 | 339.8175 | 94.89%
resnet50 | 573.155 | 1174.14 | 104.86%
vgg19_bn | 260.335 | 337.92 | 29.80%
vgg19 | 257.935 | 333.265 | 29.21%
inception_v3 | 601.1175 | 1309.33 | 117.82%
densenet161 | 296.645 | 435.5625 | 46.83%
mnasnet1_0 | 1216.7 | 4057.515 | 233.49%
squeezenet1_0 | 1220.085 | 5153.3875 | 322.38%
alexnet | 2294.91 | 2624.6375 | 14.37%
fbnetc_100 | 976.2825 | 3110.1825 | 218.57%
shufflenet_v2_x0_5 | 1555.76 | 3026.125 | 94.51%
spnasnet_100 | 1059.065 | 3502.0975 | 230.68%
pytorch-unet | 192.76 | 246.77 | 28.02%
acgan | 257.32 | 333.7325 | 29.70%
cgan | 7790.6925 | 7803.1025 | 0.16%
sgan | 257.565 | 338.8875 | 31.57%
se_resnet50 | 492.3725 | 916.5175 | 86.14%
vggm | 300.2875 | 316.2075 | 5.30%

Environment:
- PyTorch version: 1.13.0a0+gitcdd625b
- Is debug build: False
- CUDA used to build PyTorch: None
- ROCM used to build PyTorch: N/A
- OS: Ubuntu 20.04.3 LTS (x86_64)
- GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
- Clang version: Could not collect
- CMake version: version 3.22.5
- Libc version: glibc-2.31
- Python version: 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0] (64-bit runtime)
- Python platform: Linux-5.11.0-27-generic-x86_64-with-glibc2.31
- Is CUDA available: False
- CUDA runtime version: No CUDA
- GPU models and configuration: No CUDA
- Nvidia driver version: No CUDA
- cuDNN version: No CUDA
- HIP runtime version: N/A
- MIOpen runtime version: N/A
- Is XNNPACK available: True

Versions of relevant libraries:
- [pip3] intel-extension-for-pytorch==1.13.0+cpu
- [pip3] numpy==1.23.3
- [pip3] pytorch-widedeep==0.3.7
- [pip3] torch==1.13.0a0+git48b423b
- [pip3] torchvision==0.14.0a0+ebb68f3
- [conda] blas 1.0 mkl
- [conda] intel-extension-for-pytorch 1.13.0+cpu pypi_0 pypi
- [conda] mkl 2021.4.0 h06a4308_640
- [conda] mkl-include 2022.1.0 pypi_0 pypi
- [conda] mkl-service 2.4.0 py39h7f8727e_0
- [conda] mkl-static 2022.1.0 pypi_0 pypi
- [conda] mkl_fft 1.3.1 py39hd3c417c_0
- [conda] mkl_random 1.2.2 py39h51133e4_0
- [conda] numpy 1.23.3 pypi_0 pypi
- [conda] numpy-base 1.22.3 py39hf524024_0
- [conda] torch 1.13.0a0+git48b423b pypi_0 pypi
- [conda] torchvision 0.14.0a0+ebb68f3 pypi_0 pypi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84329 Approved by: https://github.com/jerryzh168 commit d542aab5c1bc544f9dc0eb5632bfe4432223d890 Author: zaf Date: Wed Sep 28 14:25:10 2022 -0700 [quant][ao_migration] nn.intrinsic migration to ao (#84842) All quantization-related modules are being migrated to `torch.ao`. This migrates the `nn.intrinsic.modules`. Please, see the [tracker](https://github.com/pytorch/pytorch/issues/81667) for the timeline. Differential Revision: [D39419733](https://our.internmc.facebook.com/intern/diff/D39419733/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39419733/)!
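Relating to the unified x86 quantization backend introduced in #84329 above, a small hedged sketch of opting into it; which engines are present depends on how PyTorch was built, and 'fbgemm' remains a valid choice.

```python
import torch

# Inspect which quantized engines this build provides, e.g. ['none', 'fbgemm', 'x86', ...]
print(torch.backends.quantized.supported_engines)

# Select the unified x86 backend when it is available.
if "x86" in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = "x86"
```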
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84842 Approved by: https://github.com/jerryzh168 commit 6a2b12dd656ed8c347968bebb4c3552582454019 Author: Elias Ellison Date: Wed Sep 28 20:20:56 2022 +0000 Turn on aliasing tests for fake backwards, Fix Batch norm running mean/var decomp aliasing (#85471) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85471 Approved by: https://github.com/ezyang commit a67621a6ca19b0c1423a7b136bfd90d8f04182fb Author: Richard Zou Date: Wed Sep 28 11:31:47 2022 -0700 Update functorch README to reflect move into PyTorch (#85832) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85832 Approved by: https://github.com/Chillee commit 498591467b3c651aa929ad1d0858d6004da8b908 Author: Richard Zou Date: Wed Sep 28 11:20:11 2022 -0700 Excise functorch/version.txt (#85830) functorch no longer needs separate versioning. Also, we'll delete functorch/setup.py soon (in a couple of weeks). We've been leaving it around for BC reasons. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85830 Approved by: https://github.com/Chillee commit d8277d9075396a3188490c322648605927384ba5 Author: Richard Zou Date: Wed Sep 28 11:13:41 2022 -0700 Disallow saved tensor hooks in functorch transforms (#85829) Technically they may only be a problem with the grad transform. Though the branch cut is soon, this is the more conservative change, it also lets us disable checkpointing for functorch (which definitely doesn't work with all transforms) and not a lot of people use saved tensor hooks with functorch (I discovered this while testing). Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/85829 Approved by: https://github.com/samdow commit 5aa183d2bc7372b4deb4e4b2f31017be9f13264c Author: Richard Zou Date: Wed Sep 28 07:24:35 2022 -0700 Add mechanism to disable the "saved tensors hooks" feature (#85553) The rationale for this is that functorch doesn't work with saved variable hooks at the moment or checkpointing and we need some way to disable it. Concretely: - there's a context manager that does the disabling - this feature is disabled on a thread-local basis - one can set an error message or use the default error message that says the feature has been disabled Since it is thread local I needed to update ATen/ThreadLocalState. To make things nicer, this PR refactors all the "saved tensors hooks" related TLS things into a single struct. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/85553 Approved by: https://github.com/soulitzer commit 85d8441fbabcc9e45648dbaa2c7c964ae32b1bb7 Author: Justin Chu Date: Wed Sep 28 19:52:43 2022 +0000 [ONNX] Deprecate setter functions for global variables (#85165) `_set_opset_version` and `_set_operator_export_type` are previously deprecated. This PR decorates them with the deprecation decorator, so warnings are emitted. 
- Remove usage of `_set_opset_version` and `_set_operator_export_type` in favor of setting the globals vars directly in torch.onnx internal - Update `GLOBALS.operator_export_type`'s default to not be None to tighten types - Remove usage of `_set_onnx_shape_inference` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85165 Approved by: https://github.com/BowenBao, https://github.com/AllenTiTaiWang commit 5deeb09d4e3001adfd3d04139b4a330915069ea7 Author: Justin Chu Date: Wed Sep 28 19:52:43 2022 +0000 [ONNX] Annotate all g as GraphContext (#85491) - Use g.opset to test export opset version - Annotate all `g` as GraphContext Pull Request resolved: https://github.com/pytorch/pytorch/pull/85491 Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao commit c42a408baa65b37b459d538b440313f67c7c1cb7 Author: Justin Chu Date: Wed Sep 28 19:52:42 2022 +0000 [ONNX] Create decorator to handle symbolic context (#84776) - Create decorator to handle old style custom symbolics that require context - Deprecate `torch.onnx.SymbolicContext` in favor of `GraphContext`. Added deprecation message - Remove README reference of SymbolicContext Pull Request resolved: https://github.com/pytorch/pytorch/pull/84776 Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao commit 723193ec16897e8389a2ff7cf916a4a7e1ec564a Author: Eddie Yan Date: Wed Sep 28 22:30:42 2022 +0000 [cuDNN][cuDNN v8 API] Fix 3d convolution_add_relu in V8 (#85055) Fix for issue uncovered in #84948 CC @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/85055 Approved by: https://github.com/ngimel commit 01add6e2884347409a27efa16dcbf5b355ec4bd5 Author: Mikayla Gawarecki Date: Wed Sep 28 20:30:37 2022 +0000 Allow only one -1 in nested view/reshape (#85691) Behavior before this PR: 1. `-1` allowed for implicit batch dimension 2. multiple `-1`s allowed for pre-existing dimensions 3. for new dimensions, `-1` is not allowed it is worth noting that for the most part 3 is basically unreachable because assuming a nested tensor has at least 1 ragged dimension, you would expect at least one -1 to be in the proposed shape for the pre-existing dimensions Behavior after this PR: 1. batch dimension **must be specified** 2. **only one** `-1` allowed for pre-existing dimensions **this effectively means that we only allow reshaping/viewing of nt with ONE ragged dimension** 3. unchanged Pull Request resolved: https://github.com/pytorch/pytorch/pull/85691 Approved by: https://github.com/cpuhrsch commit 1418a663b1d833159ce4bef4fad84bb983e454c9 Author: atalman Date: Wed Sep 28 22:27:52 2022 +0000 Fix upload condition pypi-cudnn build (#85799) Fix upload condition pypi-cudnn build We excute this in sh and looks like the condition with "==" is not getting triggered. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85799 Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt, https://github.com/seemethere commit 3d2316670f5968b6238f06bed247ec0ecbff444b Author: Justin Chu Date: Wed Sep 28 19:52:42 2022 +0000 [ONNX] Create GraphContext and load `g.op` method to the class (#84728) This PR create the `GraphContext` class and relays all graph methods to _C.Graph as well as implements the `g.op` method. The GraphContext object is passed into the symbolic functions in place of _C.Graph for compatibility with existing symbolic functions. 
This way (1) we can type annotate all `g` args because the method is defined and (2) we can use additional context information in symbolic functions. (3) no more monkey patching on `_C.Graph` Also - Fix return type of `_jit_pass_fixup_onnx_controlflow_node` - Create `torchscript.py` to house torch.Graph related functions - Change `GraphContext.op` to create nodes in the Block instead of the Graph - Create `add_op_with_blocks` to handle scenarios where we need to directly manipulate sub-blocks. Update loop and if symbolic functions to use this function. Should we put all the context inside `SymbolicContext` and make it an attribute in the `GraphContext` class? This way we only define two attributes `GraphContext.graph` and `GraphContext.context`. Currently all context attributes are directly defined in the class. Keep GraphContext flatand note that it will change in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84728 Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao commit 75db0225ad4bea02b84505bf96c1abceea97c99a Author: Elias Ellison Date: Wed Sep 28 19:45:49 2022 +0000 Handle fake tensor in intlist (#85759) Previously, we were swallowing up the Fake Tensor Exception and throwing `TypeError`, which led to https://github.com/pytorch/torchdynamo/issues/1066. Now, we are propagating back the `DataDependentOutputException`. If this approach is accepted, I can go ahead and do doublelist, symintlist, afterward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85759 Approved by: https://github.com/ezyang commit 913f5784d74bb69eff12b1cf9ac8c3d222750411 Author: Brian Hirsh Date: Tue Sep 27 12:03:54 2022 -0700 move functionalize out of experimental namespace (#85742) Did a very quick sanity check - it looks like functorch docs don't get the nice preview link that pytofch-bot gives for normal pytorch docs, so I built locally and scanned `html/generated/functorch.functionalize.html` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85742 Approved by: https://github.com/zou3519 commit 796da4df4d264aea8b3879dbda3f154271e94634 Author: Animesh Jain Date: Wed Sep 28 20:52:45 2022 +0000 Return contiguous tensor from softmax decomposition (#85788) Fixes https://github.com/pytorch/torchdynamo/issues/1135 Softmax decomp's output stride does not match with aten softmax output stride. Not sure if its desirable. Opening a PR for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85788 Approved by: https://github.com/ngimel, https://github.com/ezyang commit 8bb69a007f940ca5712ffc69922fa8f94cf27bd7 Author: Catherine Lee Date: Wed Sep 28 20:44:49 2022 +0000 reenable circleci mac jobs (#85824) undo https://github.com/pytorch/pytorch/pull/84438 and see if its green now Pull Request resolved: https://github.com/pytorch/pytorch/pull/85824 Approved by: https://github.com/huydhn, https://github.com/malfet commit 879ae45230d98e50250e9f8e704d78b5973bf227 Author: atalman Date: Wed Sep 28 20:34:13 2022 +0000 Increase timeout and retry count conda upload (#85802) Increase timeout and retry count conda upload. We are keep seeing conda upload failures even with 2 min timeout. 
Hence increasing timeout to 5min and retry to 5 times Pull Request resolved: https://github.com/pytorch/pytorch/pull/85802 Approved by: https://github.com/datumbox commit afaee00feca07c565f6b080e021b3422bfc1e8d4 Author: Mikayla Gawarecki Date: Wed Sep 28 17:38:29 2022 +0000 Add python `nested_tensor` and `as_nested_tensor` constructors in `torch.nested` (#85593) Remove `torch.nested_tensor` which has erroneous behavior wrt gradients (could be either leaf or not leaf). Introduce `torch.nested.nested_tensor` and `torch.nested.as_nested_tensor` in the vein of `torch.tensor` and `torch.as_tensor`. Done in nested `__init__.py` for now but can move to pybind in future (when we want to load from numpy/nested lists ). Discussed offline with @cpuhrsch and pybind constructor (https://github.com/pytorch/pytorch/pull/85536) was more gnarly than expected, so we can move to that when we do need loading from numpy etc. Differential Revision: [D39806622](https://our.internmc.facebook.com/intern/diff/D39806622) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85593 Approved by: https://github.com/drisspg, https://github.com/cpuhrsch commit a876432aea11fd2c6e11b8397f4f89a30cd1e8ba Author: soulitzer Date: Wed Sep 28 13:00:50 2022 -0400 Expose torch._will_engine_execute_node (#84773) Addresses: https://github.com/pytorch/pytorch/issues/83617 This PR a way to query the TLS graph task's exec_info which is a map mapping the Node to a bool indicating whether it will be executed in the current backward pass (as determined by the inputs= argument for .grad of .backward). - this works with both custom Function nodes and normal codegened nodes - to be able to verify whether the pyobject passed is an actual node, we now store pointers to PyTypeObjects into a set on registration. - error out when .backward without inputs= to avoid silently returning True Alternatives: - not sure if it is possible to bind to Python from a raw pointer to Node. At least we wouldn't be able to use existing logic, and the Python object should only hold a weak reference to the Node. - other solutions to the motivating issue seem to require more extensive modification to the engine See the issue linked for an example of usage Pull Request resolved: https://github.com/pytorch/pytorch/pull/84773 Approved by: https://github.com/albanD commit 8dd45424eaf4ab39c8723efdec91a269c7eb9448 Author: Nikita Karetnikov Date: Wed Sep 28 17:23:42 2022 +0000 [primTorch] Add ref for `huber_loss` and error inputs (#85041) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85041 Approved by: https://github.com/lezcano, https://github.com/mruberry commit 0b93afb112d48bb6d89a1e183a90b403560e84e4 Author: Elias Ellison Date: Wed Sep 28 07:55:11 2022 -0700 add amp tests (#85434) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85434 Approved by: https://github.com/ngimel commit 29c78266c046c7f83e7d84fc764af47e62ae9542 Author: Peter Bell Date: Tue Sep 27 23:54:50 2022 +0100 test_decomp.py: Skip tests for embedding_backward bf16 (#84554) `embedding_backward`'s decomposition is less accurate for bf16. Currently bfloat16 is skipped in both forward and backward, but the forward decomposition matches 1-1 with the ATen implementation so this re-enables the test for the forwards decomposition. 
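Tying back to the `torch.nested` constructors added in #85593 above, a brief hedged illustration (shapes are arbitrary):

```python
import torch

a, b = torch.randn(2, 5), torch.randn(3, 5)

# Always copies the inputs, analogous to torch.tensor.
nt = torch.nested.nested_tensor([a, b])

# Analogous to torch.as_tensor: avoids copies / preserves autograd history where possible.
nt2 = torch.nested.as_nested_tensor([a, b])
print(nt.is_nested, nt2.is_nested)
```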
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84554 Approved by: https://github.com/albanD commit c2e9b9ec4a51e49a094f4ea413cba3f0567f82c2 Author: Peter Bell Date: Tue Sep 27 18:37:23 2022 +0100 TestModule: Don't assume sample inputs version counter is zero (#85734) The intention of this assert is to check the input tensor's version counter has increased, indicating it was mutated by `m_inplace`. However, the cloning step to create `input_arg_clone` restarts the version counter to zero, so this test may fail if the sample input was ever mutated during its creation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85734 Approved by: https://github.com/albanD commit 5b476e68afd0fd8e14494f3d209bd6b63f4d422f Author: Edward Z. Yang Date: Wed Sep 28 10:13:21 2022 -0700 Slightly beefed up dynamic shapes tests for storage_offset (#85806) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85806 Approved by: https://github.com/albanD commit d776693701bef4283858961a4c597edd73d1fc6d Author: Seonglyong Gong Date: Wed Sep 28 19:18:12 2022 +0000 [Profiler] Optimizer param_groups (part 3 of Record Optimizer) (#85784) Summary: - use TensorMetadata struct - check_and_store util as overloading - param_groups - clean up unit test cases Test Plan: buck run mode/opt //caffe2/test:profiler Reviewed By: chaekit Differential Revision: D39406072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85784 Approved by: https://github.com/aaronenyeshi, https://github.com/robieta commit 2f8cfb74af5123323e64b1c0fddfdd63ab0b3205 Author: Animesh Jain Date: Wed Sep 28 18:35:51 2022 +0000 Fix gelu repr (#85790) Fixes https://github.com/pytorch/torchdynamo/issues/1378 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85790 Approved by: https://github.com/ezyang commit ff71f457889a0fe7de3b921d5ae2341e0ab7dc69 Author: Andrew Gu Date: Wed Sep 28 16:19:07 2022 +0000 [FSDP] Add `FSDPExtensions` for TP support (#85039) This adds `FSDPExtensions` to enable TP + FSDP composability. To be agnostic to both `ShardedTensor` and `DistributedTensor`, the design relies on customizable hooks. Some notes: - I preferred the `_ext` prefix (short for "extension") over `_param_extension` simply because it is shorter. It should not matter much because it is purely internal facing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85039 Approved by: https://github.com/kumpera, https://github.com/fegin commit a4bd89b267e81dc2a23ed767e1efc30fcfb7c665 Author: Horace He Date: Wed Sep 28 17:24:11 2022 +0000 Revert "Revert "Symintified mmm/addmm derivative formulas (#85794)"" (#85820) This reverts commit 823dc33b00b811c28a3924a6f0a31ba6afee7272. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85820 Approved by: https://github.com/huydhn commit 39130ccf7353bb7eecd34e51657f7e39fb70a353 Author: Horace He Date: Wed Sep 28 08:58:18 2022 +0000 Registered _like metas (#85793) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85793 Approved by: https://github.com/ezyang commit fc8ba3a92d44dc113f979d201d48f51c2887ded4 Author: PyTorch MergeBot Date: Wed Sep 28 17:22:53 2022 +0000 Revert "Allow only one -1 in nested view/reshape (#85691)" This reverts commit 4c4e5f6106b69960833d7766799fd4f246aa7cd7. 
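The version-counter behaviour behind the TestModule fix in #85734 above can be seen directly; this small sketch uses the private `_version` attribute purely for illustration.

```python
import torch

x = torch.randn(3)
print(x._version)   # 0 on a fresh tensor
x.add_(1)
print(x._version)   # incremented by the in-place op
y = x.clone()
print(y._version)   # clone starts a fresh counter at 0, regardless of x's history
```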
Reverted https://github.com/pytorch/pytorch/pull/85691 on behalf of https://github.com/atalman due to Causes github first merge conflict commit b44a4a8b51774fd9bfdaa929db342cbbc28fe252 Author: PyTorch MergeBot Date: Wed Sep 28 17:18:29 2022 +0000 Revert "Registered _like metas (#85793)" This reverts commit a4e75ccf85bd580ae5cccd471cfe8aee60dc1aa2. Reverted https://github.com/pytorch/pytorch/pull/85793 on behalf of https://github.com/huydhn due to Sorry, reverting as this breaks an aot_autograd mac test on functorch. https://github.com/pytorch/pytorch/pull/85794 was reverted before but it was at the top of the stack so the revert still fail https://hud.pytorch.org/pytorch/pytorch/commit/823dc33b00b811c28a3924a6f0a31ba6afee7272 commit 4c6dc6a1a479dcb9dc3ca9b08c480fdcefd26113 Author: Nikita Shulga Date: Wed Sep 28 17:12:25 2022 +0000 [BE] Do not use VLA (#85800) [Variable Length Array](https://en.wikipedia.org/wiki/Variable-length_array) is part of C99 standard, but has never been adopted to C++ Also, warning if they are used (surprisingly those warnings can not be turned into errors. Remove code duplication in `OperationUtils.mm` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85800 Approved by: https://github.com/kulinseth, https://github.com/jeanschmidt commit 424aad7f826db3a51a5416229be8ce7a965879b4 Author: David Berard Date: Tue Sep 27 18:28:00 2022 -0700 [JIT] support freezing modules that don't have a forward method (#85779) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85779 Approved by: https://github.com/eellison commit a0b1693996f408c112b7a628860b7a754aaa4f77 Author: PyTorch MergeBot Date: Wed Sep 28 17:04:53 2022 +0000 Revert "Update `amax/amin/norm/count_nonzero` signatures with `int[*]? dim` (#83300)" This reverts commit 1c0f0b33a0e013d6ec162cf488ff7643c4ffa33e. Reverted https://github.com/pytorch/pytorch/pull/83300 on behalf of https://github.com/jeffdaily due to The commit breaks nvfuser tests commit 224b689cf19febc23fffb77beb97c0a0560f9585 Author: Edward Z. Yang Date: Wed Sep 28 07:15:28 2022 -0700 Handling for getitem with boolean in meta, and other improvements (#85807) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85807 Approved by: https://github.com/albanD commit b6885f7d4ab1f71c40d5f1e40773acbad038d355 Author: Edward Z. Yang Date: Wed Sep 28 10:20:29 2022 -0400 Don't make parameters have symbolic shapes (#85809) Parameters won't change size across iterations of the training loop, so this is a cost-free optimization that avoids us having to do symbolic math over parameters. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85809 Approved by: https://github.com/albanD commit e63d3a8aa6c52f29fa329df321cd51fef7e8a7c9 Author: Edward Z. Yang Date: Wed Sep 28 10:28:14 2022 -0400 Augment errors raised in fx.Interpreter with Node info (#85810) We have been using this extra error context in the symbolic-shapes branch and it is quite useful. Contributing it upstream here. Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85810 Approved by: https://github.com/albanD commit b04b2fa9aa52cacbdc9aaaf477d55b0af845ce81 Author: Eddie Yan Date: Wed Sep 28 16:04:58 2022 +0000 [CUBLAS][CUDA GRAPHS] (re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85447) Now includes @dagitses 's optimizations and fixes for teardown CC @ngimel @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/85447 Approved by: https://github.com/malfet commit 823dc33b00b811c28a3924a6f0a31ba6afee7272 Author: PyTorch MergeBot Date: Wed Sep 28 16:02:05 2022 +0000 Revert "Symintified mmm/addmm derivative formulas (#85794)" This reverts commit 230edd2515367fcb44cea7c40106ff9f6a712a2a. Reverted https://github.com/pytorch/pytorch/pull/85794 on behalf of https://github.com/janeyx99 due to Sorry, reverting as this breaks an aot_autograd mac test on functorch https://hud.pytorch.org/pytorch/pytorch/commit/230edd2515367fcb44cea7c40106ff9f6a712a2a commit 5709c67f1f93c47729621fe3a6ec3247286dd03b Author: Andres Lugo-Reyes Date: Wed Sep 28 15:48:24 2022 +0000 [ROCm] Retry loop implemented to avoid transient memory leak errors (#82607) Added a retry loop to memory leak checker to avoid rare case in which ROCM reports a false positive memory leak. Original issue observed as part of this ticket: https://github.com/pytorch/pytorch/issues/62533
- Applied changes and built
- python test/test_cuda.py
- Ensure all tests pass
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82607 Approved by: https://github.com/malfet commit b2311192e6c4745aac3fdd774ac9d56a36b396d4 Author: Weiyi Zheng Date: Wed Sep 28 15:26:03 2022 +0000 [NN module] speed up _load_from_state_dict (#85743) Fixes #61398 The original implementation is very slow when the state_dict.keys() is long. This PR only passes relevant keys to the child module. existing test passes: `pytest test/test_nn.py -k state_dict` I couldn't figure out a good way to write a new test for this new behavior. Had a new snippet, but it will be flaky if integrated into the main CI because it's a timing based check. But I can verify that the test took 30s to run, after this PR it only takes 0.5s.

```python
def test_load_state_dict_large(self):
    import copy
    import time
    base_module = nn.Linear(1, 1)
    model = base_module
    for level in range(4):
        model = nn.Sequential(*[copy.deepcopy(model) for _ in range(10)])
    state_dict = model.state_dict()
    self.assertEqual(len(state_dict), 20000)
    st = time.time()
    model.load_state_dict(state_dict, strict=True)
    strict_load_time = time.time() - st
    self.assertLess(strict_load_time, 10)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85743 Approved by: https://github.com/albanD commit cef8dfc8ba849be649f752a14a5f11cdbe1e17fc Author: Jane Xu Date: Wed Sep 28 14:59:22 2022 +0000 [BE] small typo+lint fixes for einsum/sumproduct_pair (#85709) Easy review!
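Returning to the fx.Interpreter change in #85810 above, a minimal example of running a traced graph through the interpreter whose per-node errors now carry Node details; the graph here is deliberately trivial.

```python
import torch
import torch.fx as fx

def f(x):
    return torch.relu(x) + 1

gm = fx.symbolic_trace(f)
# Exceptions raised while executing a node are now augmented with that node's info.
out = fx.Interpreter(gm).run(torch.randn(4))
print(out)
```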
This PR fixes some typos + lints + clarifies some instructions Pull Request resolved: https://github.com/pytorch/pytorch/pull/85709 Approved by: https://github.com/soulitzer commit 230edd2515367fcb44cea7c40106ff9f6a712a2a Author: Horace He Date: Wed Sep 28 08:58:18 2022 +0000 Symintified mmm/addmm derivative formulas (#85794) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85794 Approved by: https://github.com/ezyang commit a4e75ccf85bd580ae5cccd471cfe8aee60dc1aa2 Author: Horace He Date: Wed Sep 28 08:58:18 2022 +0000 Registered _like metas (#85793) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85793 Approved by: https://github.com/ezyang commit 35fe4abdc74d88c0d4768f3cd7aedcfd2e817a3d Author: Horace He Date: Wed Sep 28 08:58:18 2022 +0000 Added symbolic shape testing for AOTAutograd (#85789) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85789 Approved by: https://github.com/ezyang commit 0b251d985df51d16c71d9c28b11b800fd38bf4bd Author: Jeff Daily Date: Wed Sep 28 14:05:02 2022 +0000 skip test TestCompositeComplianceCUDA::test_forward_ad_nn_functional_max_unpool2d_cuda_float32 (#85767) This test was marked as expected failure, but this test is flaky for ROCm but only because ROCm sometimes gets expected success. The test was only marked expected failure due to non-determinism that was already well-known. See the nearby comments. https://github.com/pytorch/pytorch/blob/a4c94f0739158d2f7fd27f2be59b77f33027e1c7/torch/testing/_internal/common_methods_invocations.py#L11410-L11421 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85767 Approved by: https://github.com/clee2000 commit 06e0583fb0f62e27a52fb87f3dce3156cd2d0073 Author: Howard Huang Date: Tue Sep 27 14:27:17 2022 -0700 [4/N] [Dispatchable Collectives] Update all_reduce_ with CPU / CUDA implementations (#83810) * Update the all_reduce op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op. * Update test to validate that a separate device implementation is not supported. In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively. 
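A hedged, single-process sketch of the all_reduce path discussed above for #83810; in real use each rank runs this with its own rank and world_size, and the op dispatches to the CPU or CUDA implementation based on the tensor's device.

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(4)    # a CPU tensor routes to the CPU implementation
dist.all_reduce(t)   # defaults to SUM across ranks
print(t)

dist.destroy_process_group()
```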
Differential Revision: [D39506979](https://our.internmc.facebook.com/intern/diff/D39506979) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83810 Approved by: https://github.com/kwen2501 commit 0e256c255089649de7913e3707c38c6aefc59def Author: Horace He Date: Wed Sep 28 06:22:51 2022 +0000 removed compile cache and static argnums (#85783) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85783 Approved by: https://github.com/wconstab commit 614d6f19e3d30cac0d77059e738d1f25d75eb408 Author: Richard Barnes Date: Wed Sep 28 04:53:19 2022 +0000 Fix Use obj1.is(obj2) warnings (#85688) Fixes: ``` ^ /dev/shm/rbarnes/tempfs/pytorch/torch/csrc/autograd/python_variable.cpp:2603:11: warning: 'operator==' is deprecated: Use obj1.is(obj2) instead [-Wdeprecated-declarations] if (out == Py_None) { ^ /dev/shm/rbarnes/tempfs/pytorch/cmake/../third_party/pybind11/include/pybind11/detail/../pytypes.h:276:5: note: 'operator==' has been explicitly marked deprecated here PYBIND11_DEPRECATED("Use obj1.is(obj2) instead") ^ /dev/shm/rbarnes/tempfs/pytorch/cmake/../third_party/pybind11/include/pybind11/detail/common.h:136:43: note: expanded from macro 'PYBIND11_DEPRECATED' ^ /dev/shm/rbarnes/tempfs/pytorch/torch/csrc/autograd/python_variable.cpp:2627:11: warning: 'operator==' is deprecated: Use obj1.is(obj2) instead [-Wdeprecated-declarations] if (out == Py_None) { ^ /dev/shm/rbarnes/tempfs/pytorch/cmake/../third_party/pybind11/include/pybind11/detail/../pytypes.h:276:5: note: 'operator==' has been explicitly marked deprecated here PYBIND11_DEPRECATED("Use obj1.is(obj2) instead") ^ /dev/shm/rbarnes/tempfs/pytorch/cmake/../third_party/pybind11/include/pybind11/detail/common.h:136:43: note: expanded from macro 'PYBIND11_DEPRECATED' ^ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85688 Approved by: https://github.com/albanD, https://github.com/ezyang commit 793488cda262a338205314ccba90e549e4682f82 Author: Edward Z. Yang Date: Tue Sep 27 15:01:01 2022 -0700 Revert "Revert "Symintifying slice ops (#85196)"" (#85746) This reverts commit 3a171dfb0c08956d55f341039cf35e3a18269c34. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85746 Approved by: https://github.com/albanD commit 049be5ac107f50c1ed7ccfea2f1fcdbdf6be0f88 Author: Edward Z. Yang Date: Tue Sep 27 15:11:21 2022 -0700 Remove some dead code from fake tensor (#85764) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85764 Approved by: https://github.com/wconstab commit 795028a3cec2603a750bdc02ab2b93329f43e883 Author: Thomas Viehmann Date: Wed Sep 28 03:50:42 2022 +0000 Make Python reference for permute accept varargs (#85460) Fixes #85452 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85460 Approved by: https://github.com/jjsjann123, https://github.com/mruberry, https://github.com/ngimel commit ccac8d13d5988de7302551a5df460072eb918683 Author: Howard Huang Date: Tue Sep 27 14:27:17 2022 -0700 [3/N] [Dispatchable Collectives] Update broadcast_ with CPU and CUDA implementations (#83735) * Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op. * Add test to validate that a separate device implementation is not supported. In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. 
The CPU and CUDA implementations will be updated to have process group select its CPU and CUDA backends respectively. Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83735 Approved by: https://github.com/kwen2501 commit 01f946d766e3ae58b2371306587659483d5e1b39 Author: Horace He Date: Tue Sep 27 22:57:29 2022 +0000 Rename test file from test_pythonkey to test_aotdispatch (#85769) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85769 Approved by: https://github.com/ezyang commit fc2e7ebaacbca3e8b851d6d6ceef96a616709f93 Author: Horace He Date: Tue Sep 27 22:53:52 2022 +0000 Added floordiv simplification rule needed for models (#85768) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85768 Approved by: https://github.com/ezyang commit a8be59545dd2acd48da2a8f6e99d45ec348d95c4 Author: Paul Saab Date: Wed Sep 28 03:00:30 2022 +0000 [aarch64] Use the correct binary when cross building //caffe2/torch/csrc/deploy:remove_dt_needed (#85632) Test Plan: CI Differential Revision: D39807135 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85632 Approved by: https://github.com/ajtulloch commit 3276b51243fcc0fea4f780216d1f9a5886805d2b Author: Ke Wen Date: Wed Sep 28 02:56:48 2022 +0000 Add environment parse function that supports default value (#85563) We use "-2" to represent an unset environment variable. Now adding a util function to attach default value if environment variable is unset. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85563 Approved by: https://github.com/rohan-varma, https://github.com/H-Huang commit f80ef73d1c0da4938a264a1ac1c903c78ee3fc6a Author: Seonglyong Gong Date: Wed Sep 28 02:48:07 2022 +0000 [Profiler] tracking Optimizer (part 2 of Record Optimizer) (#84920) Summary: Part 2 of Record Optimizer param_groups and states (https://github.com/pytorch/pytorch/pull/84063) - hooking from optimizer step - PyOptCall Type - declare data type for collection - python binding - simple unit test case Test Plan: buck run mode/opt //caffe2/test:profiler Differential Revision: D39402667 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84920 Approved by: https://github.com/robieta commit 1c0f0b33a0e013d6ec162cf488ff7643c4ffa33e Author: Kurt Mohler Date: Wed Sep 28 01:56:37 2022 +0000 Update `amax/amin/norm/count_nonzero` signatures with `int[*]? dim` (#83300) Changes `dim` arg to use `int[*]?` type for the following functions in `native_funcitons.yaml`: * `amax` * `amin` * `norm` * `frobenius_norm` * `native_norm` * `count_nonzero` Part of #29137 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83300 Approved by: https://github.com/ngimel, https://github.com/albanD, https://github.com/kulinseth commit 80b88862239554f895a369f88e197ecb0fa53281 Author: Jing Xu Date: Wed Sep 28 01:39:58 2022 +0000 add itt unit test and docstrings (#84848) Add unit tests and docstrings corresponding to PR https://github.com/pytorch/pytorch/pull/63289 UT: 1. `test_profiler_emit_itt` in `test/test_autograd.py`. This test is merely intended to catch if emit_itt breaks on construction. 2. Test `torch.profiler.itt` functions in `test/test_itt.py` 3. Only testing that emit_itt runs when `record_shapes` option is enabled in `test/test_profiler.py`. Docstring: 1. add ITT related info into `docs/source/bottleneck.rst` 4. add `torch.profiler.itt` functions to `docs/source/profiler.rst` 5. 
add docstring to `torch.profiler.itt` functions in `torch/profiler/itt.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84848 Approved by: https://github.com/malfet commit 572dd862c4461e87731f8eabc800b4cfb52cb647 Author: PyTorch MergeBot Date: Wed Sep 28 01:36:43 2022 +0000 Revert "Update `amax/amin/norm/count_nonzero` signatures with `int[*]? dim` (#83300)" This reverts commit 8c7c7ed3221aeeefb63ef2b7a221a5d8b274cda5. Reverted https://github.com/pytorch/pytorch/pull/83300 on behalf of https://github.com/huydhn due to The commit pin breaks XLA test somehow commit 1c1f3a99dcc5c2fdcbdc0ed011167de41efc9497 Author: Chien-Chin Huang Date: Tue Sep 27 11:26:12 2022 -0700 [FSDP] Handle the state_dict on CPU cases (#85640) state_dict may not be on GPUs. We need to move it to the compute_device in order to gather the ShardedTensor. Differential Revision: [D39681730](https://our.internmc.facebook.com/intern/diff/D39681730/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85640 Approved by: https://github.com/rohan-varma, https://github.com/awgu commit ce4f187f15d846b310511809c8afa8bd925d250b Author: Denis Vieriu Date: Wed Sep 28 00:47:52 2022 +0000 [MPS] Add tensor::index_put op (#85672) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85672 Approved by: https://github.com/malfet commit 9858f043508432a8c79691b93faf21d99d5cbf99 Author: Jerry Zhang Date: Tue Sep 27 12:55:02 2022 -0700 [quant][docs] Add types for scale and zero_point tensor for torch.fake_quantize_per_channel_affine docs (#85733) Summary: Fixes: https://github.com/pytorch/pytorch/issues/85525 Test Plan: visual inspection for the docs page Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/85733 Approved by: https://github.com/z-a-f commit 7ff6a00a9a59ee53ca71dfa11697fb7822fd3c0c Author: Kulin Seth Date: Wed Sep 28 00:43:11 2022 +0000 [MPS] Handle 1D weight in linear layer (#85752) Fixes https://github.com/pytorch/pytorch/issues/84591 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85752 Approved by: https://github.com/malfet commit 4ca125a9e1dd1bed2606ce44f137d59905580db5 Author: andrewor14 Date: Tue Sep 27 13:37:00 2022 -0700 [Quant][fx] Add quant and scale ranges to BackendConfig (#85200) **Summary:** This commit adds the following constraints to BackendConfig: quant_min_lower_bound quant_max_upper_bound scale_min_lower_bound scale_max_upper_bound This is motivated by QNNPACK constraints on qint8 weight values and the min scale value. Actually enforcing these constraints in the QNNPACK BackendConfig will follow in a future commit. Today, users can also specify the above constraints through QConfigs, and these settings may not necessarily match the ones specified in the BackendConfig. 
In this case, we will handle the discrepancy as follows: (1) Require QConfig quant ranges to fall within the backend's (2) Require QConfig min scale value (eps) >= backend's (3) Require QConfig to specify quant range if the backend specified one (4) Require QConfig to specify min scale value (eps) if the backend specified one Public API changes: * Previous API, still supported after this commit: ``` dtype_config = DTypeConfig( input_dtype=torch.quint8, output_dtype=torch.quint8, weight_dtype=torch.qint8, bias_dtype=torch.float, ) ``` * New API: ``` dtype_config = DTypeConfig( input_dtype=DTypeWithConstraints( dtype=torch.quint8, quant_min_lower_bound=0, quant_max_upper_bound=127, scale_min_lower_bound=2 ** -12, ), output_dtype=DTypeWithConstraints( dtype=torch.quint8, quant_min_lower_bound=0, quant_max_upper_bound=127, scale_min_lower_bound=2 ** -12, ), weight_dtype=DTypeWithConstraints( dtype=torch.qint8, quant_min_lower_bound=-128, quant_max_upper_bound=127, scale_min_lower_bound=2 ** -12, ), bias_dtype=torch.float, ) ``` * Additionally, the following `DTypeConfig` attributes have new types with helper getters: ``` dtype_config.input_dtype dtype_config.output_dtype dtype_config.weight_dtype dtype_config.get_input_dtype() dtype_config.get_output_dtype() dtype_config.get_weight_dtype() ``` Note that scale_max is currently not used because there is no existing mechanism to enforce this on the observer. In the future, we can validate this as well if there is a use case. **Test Plan:** python test/test_quantization.py TestBackendConfig.test_dtype_with_constraints python test/test_quantization.py TestQuantizeFx.test_backend_config_scale_min python test/test_quantization.py TestQuantizeFx.test_backend_config_quantization_range **Reviewers:** jerryzh168, vkuzo **Subscribers:** jerryzh168, vkuzo Pull Request resolved: https://github.com/pytorch/pytorch/pull/85200 Approved by: https://github.com/jerryzh168 commit 24a268143da49911d7ab44afb59865dcdd0f3456 Author: Edward Z. Yang Date: Tue Sep 27 15:11:20 2022 -0700 Directly access has_symbolic_sizes_strides, avoid expensive test (#85754) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85754 Approved by: https://github.com/albanD commit 8c7c7ed3221aeeefb63ef2b7a221a5d8b274cda5 Author: Kurt Mohler Date: Tue Sep 27 23:50:01 2022 +0000 Update `amax/amin/norm/count_nonzero` signatures with `int[*]? dim` (#83300) Changes `dim` arg to use `int[*]?` type for the following functions in `native_funcitons.yaml`: * `amax` * `amin` * `norm` * `frobenius_norm` * `native_norm` * `count_nonzero` Part of #29137 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83300 Approved by: https://github.com/ngimel, https://github.com/albanD, https://github.com/kulinseth commit f1f6cb07e2d486ea1408a7b554dd6f715e400d13 Author: Xia, Weiwen Date: Tue Sep 27 23:40:02 2022 +0000 [UT] Fix random failure of test_qconv_transpose1d by skip using hypothesis (#85463) TestQuantizedConv.test_qconv_transpose1d fails randomly due to hypothesis (according to @jerryzh168). This PR fixes it by rewriting the test case without hypothesis. Use fixed parameters and `itertools.product` to generate test cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85463 Approved by: https://github.com/jerryzh168 commit ea50e7f262e826e9f0eef1623e8e8656911adef9 Author: mikey dagitses Date: Tue Sep 27 23:31:51 2022 +0000 fix ovrsource pytorch build from D39769513 (#85708) Test Plan: Tested locally, verifying with CI. 
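In the spirit of the hypothesis-free rewrite described for #85463 above, a small sketch of sweeping fixed parameter sets with `itertools.product`; the parameter names and values are illustrative only, not the actual test's.

```python
import itertools
import torch

batches, channels, lengths = (1, 2), (2, 4), (4, 8)
for n, c, length in itertools.product(batches, channels, lengths):
    x = torch.randn(n, c, length)
    # ... exercise the quantized conv_transpose1d case for this combination ...
```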
Reviewed By: h-friederich Differential Revision: D39851831 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85708 Approved by: https://github.com/zou3519 commit 7934596b700b34cac507cac4f2b9d106e36efa02 Author: James Zeng Date: Tue Sep 27 23:27:40 2022 +0000 [ucc] Remove internal tracing (#85730) Summary: Remove internal tracing since this was not upstreamed yet. Test Plan: All PyTorch test should pass. Differential Revision: D39853937 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85730 Approved by: https://github.com/kwen2501 commit f98109795f9b214286c44af4936b0032a6992df1 Author: Edward Z. Yang Date: Tue Sep 27 14:01:51 2022 -0700 Stop using restore() in ProxyTorchDispatchMode (#85756) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85756 Approved by: https://github.com/samdow commit 0a64e73d1259d8dcfbb1c22f65175e841380c878 Author: Kannav Date: Tue Sep 27 22:58:06 2022 +0000 52189: refractor unreachable Runtime Error (#85725) Fixes #52189 Since the PR #52228 had gone cold. I made the requested changes and fixed the linting error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85725 Approved by: https://github.com/lezcano commit 5bfcf1f01aaf84add54addf7f39afe986602baa9 Author: Andrew M. James Date: Mon Sep 26 17:06:46 2022 -0500 [Docs] Update sparse Maintainers (#85126) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85126 Approved by: https://github.com/cpuhrsch commit 775a22c7c664d6cfca60af287b94c0f70245696e Author: Ke Wen Date: Tue Sep 27 22:50:19 2022 +0000 Add all_gather_into_tensor in place of _all_gather_base (#85686) - This PR renames `_all_gather_base` to `all_gather_into_tensor` so that it is clearer in meaning. - The `all_gather_into_tensor` API differs from the `all_gather` API in the output it accepts -- a single, large tensor instead of a list of tensors. - This PR also adds deprecation warning to `_all_gather_base`. `_all_gather_base` was implemented in https://github.com/pytorch/pytorch/pull/33924 to avoid unnecessary flattening. There was previous effort (#82639) to merge `_all_gather_base` with the existing `all_gather` API by detecting the parameter type passed in for the output. There are, however, two "blockers" that make the merge difficult: (i) The merge leads to backward compatibility break. We would need to change the parameter name `tensor_list` in `all_gather` to a general name `output` that can cover both tensor and tensor list. (ii) Recently, the `all_gather` API has added uneven tensor support, utilizing the tensor boundaries implied by the list. We are, however, not sure to add such support to the `_all_gather_base` function, because that would require users to pass in additional tensor boundary information. In view of the above, we decided to productize `_all_gather_base` as a separate function, but with a clearer name. Added tests: - `test_all_gather_into_cat_tensor_cuda` -- output form as with `torch.cat`. For example: ``` >>> tensor_in tensor([1, 2], device='cuda:0') # Rank 0 tensor([3, 4], device='cuda:1') # Rank 1 >>> tensor_out tensor([1, 2, 3, 4], device='cuda:0') # Rank 0 tensor([1, 2, 3, 4], device='cuda:1') # Rank 1 ``` - `test_all_gather_into_stack_tensor_cuda` -- output form as with `torch.stack`. For example: ``` >>> tensor_out2 tensor([[1, 2], [3, 4]], device='cuda:0') # Rank 0 tensor([[1, 2], [3, 4]], device='cuda:1') # Rank 1 ``` The output form is determined by the shape of the output tensor passed by the user, no flag used. 
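A hedged usage sketch of `all_gather_into_tensor` as described above, shown with a single gloo rank for brevity; backend support for this collective may vary (NCCL is the primary target).

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

t_in = torch.tensor([1, 2])
# The output shape decides the layout: world_size * numel gives the torch.cat-style result.
t_out = torch.empty(dist.get_world_size() * t_in.numel(), dtype=t_in.dtype)
dist.all_gather_into_tensor(t_out, t_in)
print(t_out)

dist.destroy_process_group()
```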
Cc @rohan-varma @mrshenli @crcrpar @ptrblck @H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85686 Approved by: https://github.com/rohan-varma, https://github.com/crcrpar commit 34cee3e82ba77b537ded37478a3852de5ea96bd5 Author: Will Constable Date: Tue Sep 27 22:48:11 2022 +0000 Auto tag milad for symbolic-shapes PRs (#85751) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85751 Approved by: https://github.com/ezyang commit 16543f6878348da33ef18b7d0d16dde096213fd6 Author: Seonglyong Gong Date: Tue Sep 27 22:41:21 2022 +0000 Revisit python tracing OD flow (#85326) Summary: - add `set_withstack()`, overriding `ClientInterface`'s no-op funtion. - revert `start()` and #ifdef Differential Revision: D39647074 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85326 Approved by: https://github.com/chaekit commit fc99705f22ae6c4165cca705c79f784bb7c7831a Author: Richard Zou Date: Mon Sep 26 14:48:54 2022 -0700 Add functorch M1 shard (#85565) functorch should have a test wherever regular PyTorch gets tested. This PR adds an M1 shard to test functorch. Test Plan: - wait for CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/85565 Approved by: https://github.com/malfet, https://github.com/huydhn commit 5cbffbbac9a59098637f821e8b6e10f609de30ff Author: Rodrigo Kumpera Date: Tue Sep 27 21:42:24 2022 +0000 C10D extension to enable per-thread PG (#84153) Move a bunch of globals to instance methods and replace all use to them. We move all PG related globals under World and use a singleton instance under _world. This creates an undocumented extension point to inject full control of how how c10d state behaves. One simple hack is to change _world to an implementation that uses a threadlocal and enable per-thread PGs. It almost get DDP working and the PG is missing an implementation of all_reduce. This enables notebook usage of PTD, which is a big deal for learning it: https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84153 Approved by: https://github.com/rohan-varma commit b6bee3c491d3a9a99920fb67203d2bdb8390ac59 Author: atalman Date: Tue Sep 27 21:34:36 2022 +0000 Upload to different path for pypi cudnn (#85753) We want to upload pypi cudnn builds to a different download folder something like cu117_pypi_cudnn Pull Request resolved: https://github.com/pytorch/pytorch/pull/85753 Approved by: https://github.com/seemethere, https://github.com/kit1980 commit 6cfe555f4fe54c8df5a07af39a335d0f4d914d95 Author: Thiago Crepaldi Date: Tue Sep 27 21:26:32 2022 +0000 [ONNX] Apply Common Subexpression Elimination pass to ONNX export (#85665) Exporting graphs with Autocast may fail due to a limitation on JIT tracer. By disabling Autocast cache, tracer works, but there can be performance hit when there is reuse of weights in convolution, for example By applying CSE, such performance loss can be reverted. ps: As a comment at #84092 mentioned, disabling Autocast cache is an acceptable workaround and used throughout PyTorch code. 
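The Autocast weight-cache knob referenced above can be toggled per autocast context; a minimal sketch, shown outside of ONNX export for simplicity.

```python
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3)
x = torch.randn(1, 3, 16, 16)

# Disabling the cache avoids the tracer limitation at the cost of re-casting weights;
# the exporter's CSE pass can then fold the repeated casts back together.
with torch.autocast("cpu", dtype=torch.bfloat16, cache_enabled=False):
    y = conv(x)
print(y.dtype)
```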
Fixes #84092 ```python graph(%0 : Float(requires_grad=0, device=cpu)): %3 : Scalar = aten::ScalarImplicit(%0), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: %13 : int = prim::Constant[value=3](), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: # /home/thiagofc/dev/github/pytorch/test/onnx/test_onnx_opset.py:347:0 %14 : int = prim::Constant[value=4](), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: # /home/thiagofc/dev/github/pytorch/test/onnx/test_onnx_opset.py:347:0 %15 : int[] = prim::ListConstruct(%13, %14), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: %16 : NoneType = prim::Constant(), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: %17 : NoneType = prim::Constant(), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: %18 : Device = prim::Constant[value="cpu"](), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: # /home/thiagofc/dev/github/pytorch/test/onnx/test_onnx_opset.py:347:0 %19 : bool = prim::Constant[value=0](), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: # /home/thiagofc/dev/github/pytorch/test/onnx/test_onnx_opset.py:347:0 %20 : Float(3, 4, strides=[4, 1], requires_grad=0, device=cpu) = aten::full(%15, %3, %16, %17, %18, %19), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: # /home/thiagofc/dev/github/pytorch/test/onnx/test_onnx_opset.py:347:0 return (%20) AFTER graph(%0 : Float(requires_grad=0, device=cpu)): %3 : Scalar = aten::ScalarImplicit(%0), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: %13 : int = prim::Constant[value=3](), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: # /home/thiagofc/dev/github/pytorch/test/onnx/test_onnx_opset.py:347:0 %14 : int = prim::Constant[value=4](), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: # /home/thiagofc/dev/github/pytorch/test/onnx/test_onnx_opset.py:347:0 %15 : int[] = prim::ListConstruct(%13, %14), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: %16 : NoneType = prim::Constant(), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: %18 : Device = prim::Constant[value="cpu"](), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: # /home/thiagofc/dev/github/pytorch/test/onnx/test_onnx_opset.py:347:0 %19 : bool = prim::Constant[value=0](), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: # /home/thiagofc/dev/github/pytorch/test/onnx/test_onnx_opset.py:347:0 %20 : Float(3, 4, strides=[4, 1], requires_grad=0, device=cpu) = aten::full(%15, %3, %16, %16, %18, %19), scope: test_onnx_opset.TestONNXOpset.test_full..MyModule:: # /home/thiagofc/dev/github/pytorch/test/onnx/test_onnx_opset.py:347:0 return (%20) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85665 Approved by: https://github.com/ngimel, https://github.com/AllenTiTaiWang, https://github.com/BowenBao commit c719ec9c11f9f7f57ebcfe611e0717c24ecb3b9b Author: Kevin Tse Date: Tue Sep 27 13:49:14 2022 -0400 [DataPipe] Fix MapDataPipe spawn lambda test (#85668) The test in its original form fails and I believe it is because the expected result is incorrect, unless we expect different behaviors between `IterDataPipe` and `MapDataPipe` in multiprocessing. 
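For context on the `IterDataPipe` / `MapDataPipe` distinction mentioned above, a tiny hedged example using a module-level function (which pickles under multiprocessing spawn, unlike a lambda):

```python
from torch.utils.data.datapipes.iter import IterableWrapper
from torch.utils.data.datapipes.map import SequenceWrapper

def _double(x):
    return x * 2

iter_dp = IterableWrapper(range(4)).map(_double)        # iterator-style pipe
map_dp = SequenceWrapper(list(range(4))).map(_double)   # index-style pipe
print(list(iter_dp), map_dp[3])
```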
Differential Revision: [D39832182](https://our.internmc.facebook.com/intern/diff/D39832182) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85668 Approved by: https://github.com/ejguan commit 64a526d4af6ea69d84791bf6c4c3e2695f3828ce Author: Kevin Tse Date: Tue Sep 27 13:49:14 2022 -0400 [DataLoader] Replacing `traverse` function with `traverse_datapipes` (#85667) This PR deprecates `traverse` function and replaces it with `traverse_datapipes` instead. While use `DataLoader`, I realized that it is raising `FutureWarning` even though I am not explicitly using `traverse`. What is happening is that `DataLoader` invokes `traverse(dp, only_datapipe=True)`, and the usage of the keyword causes the `only_datapipe` warning to be raised. ``` /home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/utils/data/graph.py:102: FutureWarning: `only_datapipe` is deprecated from `traverse` function and will be removed after 1.13. warnings.warn(msg, FutureWarning) ``` A few things we'd like to do: 1. Deprecate the key word arg `only_datapipe` 2. Change the default behavior from `only_datapipe=False` to `only_datapipe=True` in the future 3. Do not raise a warning when users are using the function correctly This creates a paradox it is impossible for the users to change their code to match the future default behavior (i.e. call `traverse(dp)` without `only_datapipe`): - they cannot do so because the default behavior of `traverse` hasn't changed yet, so they must use `only_datapipe=True` - if they use `only_datapipe=True`, eventually the kwarg will go away and cause a runtime error; they also get a `FutureWarning` in the present IIUC, there doesn't seem to be a way to accomplish those 3 goals without replacing the function with a new one that has a different name; hence, this PR. Let me know if there is a better alternative. If this looks right, I will send a follow up PR in `TorchData`. Differential Revision: [D39832183](https://our.internmc.facebook.com/intern/diff/D39832183) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85667 Approved by: https://github.com/ejguan commit 8a926b31878f8deb6aee051b1438e68c43fcd31b Author: Andrew M. James Date: Tue Sep 27 12:04:59 2022 -0500 Enable CSC @ CSC addmm (#85379) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85379 Approved by: https://github.com/pearu, https://github.com/cpuhrsch commit bb5001ce3d9084279e1e83971976cd0535d21d73 Author: Andrew M. 
James Date: Tue Sep 27 12:04:58 2022 -0500 Enable dense x bsc mm/addmm (#85308) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85308 Approved by: https://github.com/pearu commit e59120ab51039fe2e7642b0cf34902f8fb236091 Author: Nikita Shulga Date: Tue Sep 27 19:43:12 2022 +0000 C++20 compatible changes (#85703) `std::hash::result_type` is deprecated since C++17 and removed in c++20, so use `c10::invoke_result_t` to define it Fixes https://github.com/pytorch/pytorch/issues/85603 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85703 Approved by: https://github.com/ezyang commit e746fff8ba2338c56cc88fef6e5d131b5590db8a Author: Abhishek Pathak Date: Tue Sep 27 19:08:22 2022 +0000 [MPS] Enable adaptive avg pool 2d with larger output size (#85726) * Handle adpative pool 2d forward and backward when ouptut size is larger than input size * Disallow larger output size if not a multiple of input size Fixes: https://github.com/pytorch/pytorch/issues/80732 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85726 Approved by: https://github.com/malfet commit c8776dca6a503eaa92a8d7b2427ebce7a19df398 Author: Srikumar Sastry Date: Tue Sep 27 18:43:39 2022 +0000 Remove extra `with` in value error exception statement (#84713) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84713 Approved by: https://github.com/ngimel commit a1066110559701ceb7acc090d3972b9ed6f4231d Author: samdow Date: Tue Sep 27 11:40:29 2022 -0400 [Modes] fix handle_torch_funcion logic (#85707) Fixes #85696. I didn't totally get what was happening in handle_torch_function and so was trying to recreate the original logic instead of follow what the C++ is doing. This fixes that Pull Request resolved: https://github.com/pytorch/pytorch/pull/85707 Approved by: https://github.com/ezyang commit f4251525dece37071d846901107deef3978c55ad Author: Omkar Salpekar Date: Tue Sep 27 18:11:18 2022 +0000 Adding Wunused-lambda-capture to Clang build flags (#85655) Add `-Wunused-lambda-capture` to clang build flags to better align internal and OSS build systems. This flag is not supported in gcc so only adding for clang builds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85655 Approved by: https://github.com/huydhn commit d51f6de9b8794aa1d5af6e2e4ea0ccf1a2d69f95 Author: Jesse Cai Date: Tue Sep 27 13:40:01 2022 +0000 [quant][core][feature] Implement index_put for quantized CUDA tensors (#85685) Summary: - Add new cuda test for quantized index_put - Add determinsitc test for CPU and CUDA quantized index_put - Add in QuantizedCUDA implementation for index_put - wrote new `index_put_kernel_quantized_cuda` - CUDA index_put determinstic implemented in `index_put_with_sort_kernel_quantized` I think quantize_val is not CUDA compatible, because of the reliance on std::numeric_limits. Might be something useful to add in the future? Test Plan: ``` python test/test_quantization.py -k test_qtensor_index_put ``` Reviewers: Subscribers: Tasks: Tags: quant Pull Request resolved: https://github.com/pytorch/pytorch/pull/85685 Approved by: https://github.com/dzdang commit 3a171dfb0c08956d55f341039cf35e3a18269c34 Author: PyTorch MergeBot Date: Tue Sep 27 18:01:27 2022 +0000 Revert "Symintifying slice ops (#85196)" This reverts commit 4c01c51266afae57c6d6952c84fff2802d9b2bb9. 
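Tying back to the MPS adaptive-average-pool change in #85726 above, a guarded sketch of the now-supported case where the output size is larger than, and an integer multiple of, the input size; it only runs on an MPS-enabled build.

```python
import torch
import torch.nn.functional as F

if torch.backends.mps.is_available():
    x = torch.randn(1, 3, 4, 4, device="mps")
    y = F.adaptive_avg_pool2d(x, output_size=(8, 8))  # larger than input, but a multiple of it
    print(y.shape)
```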
Reverted https://github.com/pytorch/pytorch/pull/85196 on behalf of https://github.com/atalman due to Break internal build Exutorch commit 9f1468ae6c8e6e3938a3f1cfb9378e11af2fd0cd Author: Peter Jung Date: Tue Sep 27 17:41:56 2022 +0000 CyclicLR memory leak fix (#85462) Hi, we noticed in our team that by using CyclicLR, there is a problem with memory clearance on GPU (probably it will be the case without the GPU as well, but that was our use case) After initializing CyclicLR, GPU memory is not cleared even after the model, optimizer and scheduler are out of scope (e.g. reference count is zero). This is because `__init__` method inside `CyclicLR` creates reference to its own methods and it will not get removed until `gc.collect()` is called manually. This is a problem if people want to test multiple models in one run of a script, after testing the first model, second one will fail on `CUDA out of memory error` because the first one is not cleared from the memory. I propose a simple fix by using `weakref`, similarly as in `_LRScheduler` base class, but if you have any comments I am happy to change it. Here is the code to reproduce the bug: ``` import torch import weakref from transformers import DetrForObjectDetection class X: def __init__(self, optimizer): self.optimizer = optimizer self.func = self.dummy def dummy(self, x): return 1. def test(): model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50') model.to('cuda') optimizer = torch.optim.Adam(model.parameters()) x = X(optimizer) test() print(f'{torch.cuda.memory_reserved()}, {torch.cuda.memory_allocated()}') # Should print (, 0), but with cyclic reference, it will print (, ). ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85462 Approved by: https://github.com/albanD commit 4c4e5f6106b69960833d7766799fd4f246aa7cd7 Author: Mikayla Gawarecki Date: Tue Sep 27 05:29:57 2022 +0000 Allow only one -1 in nested view/reshape (#85691) Behavior before this PR: 1. `-1` allowed for implicit batch dimension 2. multiple `-1`s allowed for pre-existing dimensions 3. for new dimensions, `-1` is not allowed it is worth noting that for the most part 3 is basically unreachable because assuming a nested tensor has at least 1 ragged dimension, you would expect at least one -1 to be in the proposed shape for the pre-existing dimensions Behavior after this PR: 1. batch dimension **must be specified** 2. **only one** `-1` allowed for pre-existing dimensions **this effectively means that we only allow reshaping/viewing of nt with ONE ragged dimension** 3. unchanged Pull Request resolved: https://github.com/pytorch/pytorch/pull/85691 Approved by: https://github.com/cpuhrsch commit 7167996346c5e5299559c8501821d2ab7ef770d3 Author: PyTorch MergeBot Date: Tue Sep 27 16:59:35 2022 +0000 Revert "resubmit: [mta] APEX style Fused Adam (#81705) (#85507)" This reverts commit 4615d1bcfa0915a992e7445086ba559ca7441607. Reverted https://github.com/pytorch/pytorch/pull/85507 on behalf of https://github.com/atalman due to Break internal windows builds commit f8e71ca3384370fa42e4ad386ff567c5d28c6506 Author: li-yi-dong <73142299+li-yi-dong@users.noreply.github.com> Date: Tue Sep 27 16:55:39 2022 +0000 Designate divice to generate_square_subsequent_mask (#85609) When the model is on GPU, generating the mask on defalut device(cpu) is quite time consuming. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85609 Approved by: https://github.com/albanD commit aaef5d8f2cb46e3a8cc81244c69c2140fb0bbd1b Author: Andrew M. 
James Date: Tue Sep 27 09:12:02 2022 -0500 sparse mm/addmm enable dense x csc, csc x dense and simplify layout check logic. (#85307) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85307 Approved by: https://github.com/pearu, https://github.com/cpuhrsch commit b656ba0b1105aa672c1c7be6138fcdca7ad924c8 Author: Peter Bell Date: Tue Sep 27 14:58:09 2022 +0100 Use hexfloat for threshold OpInfo tests (#85676) 0.123 isn't exactly representable as a floating point value, and so the threshold will move marginally depending on the data type where the computation is performed. This leads to a rare flake in tests comparing against a reference implementation. Instead, this chooses a threshold which is exactly representable as a bfloat16 value and thus has the same value for all data types. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85676 Approved by: https://github.com/ngimel commit fdef5078977d5aa5d3b77acd324d36212c7159a5 Author: Peter Bell Date: Tue Sep 27 14:58:09 2022 +0100 Simplify noncontiguous_like (#85518) This removes the special casing for zero-dim tensors and also uses indexing instead of manual stride manipulations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85518 Approved by: https://github.com/albanD commit 101f10d7cacbb29ad62a519a030858a35fbba6d4 Author: S. Song <41357537+shmsong@users.noreply.github.com> Date: Tue Sep 27 15:53:01 2022 +0000 Cherry pick sorting patch (#85620) Fixes https://github.com/csarofeen/pytorch/issues/1947 Cherry-picked patch for torchbench issues where fusion segmenter asserts in nvfuser: 1. test the groups comes with the same order as they are merged. 2. Fix detection of un-mappable root domains: ComputeAtRootDomainMap flags domains that should not be mapped due to reductions. Previously, checking if a domain potentially causes an invalid mapping is only done with one domain in each group of domains that are found to be mappable so far. That's not actually sufficient as the unmappable domain set is created just once with no root mapping information. The fix is to check all consumer domains of a producer tensor. A small other fix is also done to address a different problem discovered after the first fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85620 Approved by: https://github.com/csarofeen, https://github.com/davidberard98 commit 1367f2409f11aaf3d56ad81b0c0cc79120e9d124 Author: Nikita Shulga Date: Tue Sep 27 15:44:53 2022 +0000 [MPS] Fix placeholder behavior for transposed view (#85689) Looks like the expectation in that code were that `.clone` will return contiguous tensor, so explicitly specify memory format Fixes https://github.com/pytorch/pytorch/issues/85675 and https://github.com/pytorch/pytorch/issues/85224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85689 Approved by: https://github.com/kulinseth commit 15c52ffc4f9a02f7078033677d44ccd760107952 Author: soulitzer Date: Mon Sep 26 18:24:57 2022 -0400 Disallow auto_element_wise for in-place and fix some in-place gradients (#85634) Fixes https://github.com/pytorch/pytorch/issues/85535 Also fixes the backward and forward gradients of `nn.functional.threshold`. The issue was that in-place gradients weren't tested because the in-place variants were not properly registered to the OpInfo. Perhaps an alternative to this to make auto_element_wise smart enough to actually handle the in-places cases (we have 4 cases total now where we manually copy_ after doing auto_element_wise), but that requires a few more changes. 
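A minimal, hedged way to exercise the in-place gradient path described above (user-level gradcheck only; the real change lives in the derivative and OpInfo definitions, not in user code):

```python
import torch
from torch.autograd import gradcheck

# Illustrative only: a quick way to exercise the in-place threshold backward discussed
# above; the actual fix is in the derivative/OpInfo registration, not in user code.
x = torch.randn(8, dtype=torch.double, requires_grad=True)

def fn(inp):
    # clone first so the in-place op does not write into a leaf that requires grad
    return torch.nn.functional.threshold(inp.clone(), threshold=0.0, value=-1.0, inplace=True)

print(gradcheck(fn, (x,)))  # prints True once the in-place backward formula is correct
```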
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85634 Approved by: https://github.com/albanD commit 01dbbeeeb5ab7ede28e333982e98713282a0e4b8 Author: Sherlock Huang Date: Tue Sep 27 04:15:56 2022 +0000 Expose cpp_backtrace to python binding (#84896) We can now get cpp stack trace by calling torch.utils.get_cpp_backtrace() Sample output when calling from a torch_dispatch stack: ``` frame #23: torch::handle_torch_function_no_python_arg_parser(c10::ArrayRef, _object*, _object*, char const*, _object*, char const*, torch::TorchFunctionName) (0x7f69330bab90 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/utils/python_arg_parser.cpp:323) frame #24: (0x7f6932a09e79 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/autograd/python_variable.cpp:2252) frame #25: (0x7f69261aee33 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/PythonFallbackKernel.cpp:56) frame #26: (0x7f69261afef9 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/BoxedKernel_impl.h:19) frame #27: c10::BoxedKernel::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) const (0x7f6932aadced in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/BoxedKernel_impl.h:41) frame #28: (0x7f6926fae9b9 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/boxing.h:227) frame #29: at::Tensor c10::Dispatcher::redispatch(c10::TypedOperatorHandle const&, c10::DispatchKeySet, at::Tensor const&) const (0x7f6926e821f5 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/KernelFunction_impl.h:106) frame #30: at::_ops::alias::redispatch(c10::DispatchKeySet, at::Tensor const&) (0x7f6927142c31 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:438) frame #31: (0x7f692ae4f8be in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/autograd/generated/ADInplaceOrViewType_1.cpp:1361) frame #32: (0x7f692ae4f9b1 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/autograd/generated/ADInplaceOrViewType_1.cpp:1362) frame #33: (0x7f692aef77e9 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/WrapFunctionIntoFunctor.h:13) frame #34: (0x7f6926fae7d8 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/KernelFunction_impl.h:50) frame #35: at::Tensor c10::Dispatcher::redispatch(c10::TypedOperatorHandle const&, c10::DispatchKeySet, at::Tensor const&) const (0x7f6926e821c9 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/KernelFunction_impl.h:97) frame #36: at::_ops::alias::redispatch(c10::DispatchKeySet, at::Tensor const&) (0x7f6927142c31 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:438) frame #37: (0x7f6929ec654a in /fsx/users/bahuang/repos/pytorch_fsx/build/aten/src/ATen/RedispatchFunctions.h:10697) frame #38: (0x7f6929d9edae in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/autograd/generated/VariableType_1.cpp:2837) frame #39: (0x7f6929d9f043 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/autograd/generated/VariableType_1.cpp:2838) frame #40: (0x7f6929e7d2f9 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/WrapFunctionIntoFunctor.h:13) frame #41: (0x7f6929eb1344 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:478) frame #42: (0x7f6929ea7b99 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:490) frame #43: (0x7f6929e7d370 in 
/fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:563) frame #44: (0x7f6929e7d43a in /fsx/users/bahuang/repos/pytorch_fsx/c10/util/C++17.h:239) frame #45: (0x7f6929e7d48c in /fsx/users/bahuang/repos/pytorch_fsx/c10/util/C++17.h:364) frame #46: (0x7f6929e7d50a in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:554) frame #47: c10::BoxedKernel::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) const (0x7f6932aadced in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/BoxedKernel_impl.h:41) frame #48: c10::KernelFunction::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) const (0x7f6932aadd26 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/KernelFunction_impl.h:43) frame #49: c10::Dispatcher::redispatchBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) const (0x7f692603890a in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:652) frame #50: (0x7f69260387f9 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:388) frame #51: (0x7f69261af0ef in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/PythonFallbackKernel.cpp:96) frame #52: (0x7f69261aff2b in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/BoxedKernel_impl.h:25) frame #53: c10::BoxedKernel::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) const (0x7f6932aadced in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/BoxedKernel_impl.h:41) frame #54: c10::KernelFunction::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) const (0x7f6932aadd26 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/KernelFunction_impl.h:43) frame #55: c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector >*) const (0x7f6925fd6ab2 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:628) frame #56: (0x7f6925fd6690 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:376) frame #57: (0x7f692bf5b525 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:380) frame #58: (0x7f692bf59fac in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/jit/runtime/register_c10_ops.cpp:15) frame #59: (0x7f692bf5af41 in /usr/include/c++/7/bits/std_function.h:316) frame #60: std::function >&)>::operator()(std::vector >&) const (0x7f6932ab9a0f in /usr/include/c++/7/bits/std_function.h:706) frame #61: (0x7f6932aad541 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/stack.h:41) frame #62: (0x7f6932ab3102 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/jit/python/pybind_utils.h:1206 (discriminator 1)) frame #63: (0x7f6932ab3943 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/jit/python/pybind_utils.h:1272) frame #64: (0x7f6932a46120 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/jit/python/init.cpp:1767) frame #65: (0x7f6932a997be in /fsx/users/bahuang/repos/pytorch_fsx/third_party/pybind11/include/pybind11/cast.h:1441) frame #66: (0x7f6932a8a985 in /fsx/users/bahuang/repos/pytorch_fsx/third_party/pybind11/include/pybind11/cast.h:1410) frame #67: (0x7f6932a66e1e in /fsx/users/bahuang/repos/pytorch_fsx/third_party/pybind11/include/pybind11/pybind11.h:249) frame #68: (0x7f6932a66ec2 in /fsx/users/bahuang/repos/pytorch_fsx/third_party/pybind11/include/pybind11/pybind11.h:224) frame #69: (0x7f6932473111 in 
/fsx/users/bahuang/repos/pytorch_fsx/third_party/pybind11/include/pybind11/pybind11.h:929) frame #104: __libc_start_main (0x7f693485dc87 in /build/glibc-uZu3wS/glibc-2.27/csu/../csu/libc-start.c:310) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84896 Approved by: https://github.com/ezyang commit 54e03cdda9fca7fcd8b29e40812213b6ebc8c091 Author: Rodrigo Kumpera Date: Tue Sep 27 14:45:56 2022 +0000 Don't use a fixed name to avoid race conditions. (#84952) Fixes #84886 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84952 Approved by: https://github.com/rohan-varma commit 0183c1e3362c53bedea88932f08bedb78a5822d7 Author: anjali411 Date: Tue Sep 27 12:30:22 2022 +0000 Add __all__ to torch.utils submodules (#85331) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85331 Approved by: https://github.com/albanD commit f64857189d514a662dac09ca6421e1e95e87e843 Author: Andrew M. James Date: Mon Sep 26 16:59:54 2022 -0500 resize_as_sparse support all compressed layouts (#85378) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85378 Approved by: https://github.com/pearu, https://github.com/cpuhrsch commit 45be74cc63aad66f496475a9513e3cc36cace5b6 Author: Wang, Eikan Date: Mon Sep 26 07:50:32 2022 +0000 Optimize to if the datatyep of the source tensor is as same as the dest datatype (#85140) The AMP inserts `_autocast_to_reduced_precision` and `_autocast_to_full_precision` automatically. The aten implementation provides a fast path to bypass the conversion if the tensor data type has been the reduced/full precision. But NNC always does the conversion which could bring >5% E2E performance regression. This PR is to address the performance issue like aten. We will not pull `_autocast_to_reduced_precision` and `_autocast_to_full_precision` into NNC fusion group and fallback to aten to trigger its fast path if the tensor data type has been the reduced/full precision. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85140 Approved by: https://github.com/frank-wei commit 83261ff9a8bcf47ae343be40d171bfdce3bd613c Author: Wang, Eikan Date: Mon Sep 26 07:50:30 2022 +0000 Use high precision accmulate buffer for bf16 accmulation (#84402) Accumulation operation is not friendly to BFloat16 because its mantissa part is only 7bits while the operand could not impact the final result if it is very small. Take `a += b` as an example, `a` will become bigger with running the computation. And then, the variance between `a` and `b` also is being huge, the `b` would not impact `a`. Hence, the best practice is to use FP32 to do accumulation and then convert back to BF16 as long as the accumulation is finished. This PR also follows the best practice. We extend the `ReduceOp` by adding `accumulation` buffer and recording the result buffer and `Reducer`'s operand. Because we need to replace the original `ReduceOp` with a new `ReduceOp` to use `accumulation` buffer for reduction. 
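As a rough numerical illustration of the accumulation problem (plain user-level Python with made-up values, not the NNC `ReduceOp` change itself):

```python
import torch

# Summing 4096 copies of 0.01 should give ~40.96.  A running bf16 accumulator stalls
# once the total is large relative to the addend, while an fp32 buffer that is
# converted to bf16 only at the end keeps the result accurate.
vals = [0.01] * 4096

acc_bf16 = torch.tensor(0.0, dtype=torch.bfloat16)
acc_fp32 = torch.tensor(0.0, dtype=torch.float32)
for v in vals:
    acc_bf16 = acc_bf16 + torch.tensor(v, dtype=torch.bfloat16)
    acc_fp32 = acc_fp32 + v  # accumulate in fp32

print(acc_bf16.item())                     # far below 40.96: small addends get rounded away
print(acc_fp32.to(torch.bfloat16).item())  # close to 40.96, rounded to bf16 only once
```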
- Extend `ReduceOp` by adding `accumulation` buffer and recording the result buffer and `Reducer`'s operand - [PR change](https://github.com/pytorch/pytorch/pull/84402/files#diff-0f4be13525117d5c49c69bd18e92eb15dda36b5a59b7a10c7e1114f5cac10afbR225-R229) - Replace the original `ReduceOp` with a new `ReduceOp` to use `accumulation` buffer for reduction - [PR change](https://github.com/pytorch/pytorch/pull/84402/files#diff-fac6725328dc01e235944c7afc9f29c804488973c02c25ecd93d562884d959b3R26-R36) - Cast the accumulation buffer from FP32 to BF16 and write back to the result buffer - [PR change](https://github.com/pytorch/pytorch/pull/84402/files#diff-fac6725328dc01e235944c7afc9f29c804488973c02c25ecd93d562884d959b3R62-R67) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84402 Approved by: https://github.com/frank-wei commit cf5699f2fc3e322efe2bb2446949dc30f1d36b4b Author: ssjia Date: Mon Sep 19 11:52:32 2022 -0700 [vulkan] Rewrite prepacking functions using aten functions + some code cleanup (#84973) Rewrites the convolution prepacking function using aten ops, removing a large amount of redundant code. Adds detailed comments describing the transformations that are performed. Also cleans up some unneeded code. Differential Revision: [D39486489](https://our.internmc.facebook.com/intern/diff/D39486489/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84973 Approved by: https://github.com/salilsdesai commit b360d66391f03a0d5dc2c9a7aff496324b75aa2f Author: PyTorch MergeBot Date: Tue Sep 27 02:55:59 2022 +0000 Revert "Add environment parse function that supports default value (#85563)" This reverts commit 784f4ba1ce16996d497fae2fb107425b3bbeb71b. Reverted https://github.com/pytorch/pytorch/pull/85563 on behalf of https://github.com/huydhn due to Fail test_DistributedDataParallel (main.TestDistBackendWithSpawn) commit e1e056ac447290ed113926f916ab768c0c81641b Author: PyTorch MergeBot Date: Tue Sep 27 02:38:36 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#85683) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85683 Approved by: https://github.com/pytorchbot commit 8125d2e188ed9e7be4cb3f76d2ed5c6260316ff8 Author: PyTorch MergeBot Date: Tue Sep 27 02:33:55 2022 +0000 [vision hash update] update the pinned vision hash (#85684) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85684 Approved by: https://github.com/pytorchbot commit 7a5449f148dba51ef51979cf371226b158a5b73b Author: Abhishek Pathak Date: Tue Sep 27 01:54:42 2022 +0000 [MPS] Clamp op - fix shape issues (#114) (#85673) * Handle shape mismatch * Handle case where 1 occurs in input shape; fix fill_new_shapes * Move clamp ops to allowlist Pull Request resolved: https://github.com/pytorch/pytorch/pull/85673 Approved by: https://github.com/malfet commit cce6d8d6419bcaeb8e65c809b16901810194c221 Author: Richard Barnes Date: Tue Sep 27 01:38:32 2022 +0000 Fix warning in kineto_shim.h (#85653) Fixes: ``` In file included from /dev/shm/rbarnes/tempfs/pytorch/torch/csrc/profiler/kineto_shim.cpp:1: /dev/shm/rbarnes/tempfs/pytorch/torch/csrc/profiler/kineto_shim.h:111:8: warning: private field 'saved_' is not used [-Wunused-private-field] bool saved_ = false; // Kineto's save is destructive ^ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85653 Approved by: https://github.com/ezyang commit 18d8c548f4458c03ae6be60903db3ccf1c0d8d3f Author: samdow Date: Mon Sep 26 16:42:07 2022 -0400 [Modes] remove enable and rewrite mode stack (squashed) (#84774) Based on @ezyang's suggestion, mode stack now has "one true mode" which is the _only_ mode that can ever be active at the C++ level. That mode's torch dispatch is just to take the top mode in the stack, reenable itself (if we aren't at the end of the mode stack), and run the top mode's torch_{dispatch|function} This maintains that in the middle of a mode's torch dispatch, the mode itself will not be active. It changes the function the user has to call to see what the current mode is (no longer queries the C++, it's python only) but allows the user to also see the entire mode stack easily Removes `enable_torch_dispatch_mode` and `.restore()` since neither makes sense in this new setup Why do we want this? Well, a pretty common pattern that was coming up was that users had to do something like ```python def f(mode): with mode.restore(): # user needs to understand this restore thing? ... with Mode() as m: pass f(m) ``` Many users were getting error from forgetting to call `.restore` or from forgetting to add the (tbh weird) "mode instantiation" step where they use the mode as a context manager with an empty body. Really, they wanted to treat modes like context managers and just write ```python def f(mode): with mode: ... f(Mode()) ``` ** Technical Details ** With the old mode stack, we basically had a linked list so the mode itself could only be used once and had a fixed parent. In this new design, the mode stack is just a python list that we're pushing to and popping from. There's only one mode that's ever active at the C++ level and it runs the next mode in the Python list. 
The modes don't have state on them anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/84774 Approved by: https://github.com/ezyang, https://github.com/zou3519 commit a0be0ca16144fababbbb94f58b1fd88e41f29162 Author: Denis Vieriu <104024078+DenisVieriu97@users.noreply.github.com> Date: Tue Sep 27 01:01:16 2022 +0000 [MPS] Fix test consistency error 'mlir module expected element type ui8 but received si8' (#85666) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85666 Approved by: https://github.com/kulinseth commit b8d2ab3dd5869e0af3aa9490636acf887d664be7 Author: Ramin Azarmehr Date: Tue Sep 27 01:00:53 2022 +0000 [MPS] Fix memory leaks that cause the buffers not to be released and cause OOM (#85661) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85661 Approved by: https://github.com/kulinseth commit 755b39ba6668d2ce8c0e3ba4ee032ff0dfc827b7 Author: Wenguang Mao Date: Tue Sep 27 00:56:57 2022 +0000 [LRD] Allowing using dedicated iteration counter for learning rate (#85195) Summary: So that we could manipulate the iteration counter for lrarning rate separately (for learning rate decay or learning rate re-warming up etc), without affecting other techniques relying on iterations (such as EMA) Test Plan: Unit tests: ``` ✓ Pass: caffe2/caffe2/python:optimizer_test - testSparse (caffe2.caffe2.python.optimizer_test.TestAdagradWithDedicatedLRIteration) (46.475) ✓ Pass: caffe2/caffe2/python:optimizer_test - test_global_norm_based_gradient_clipping (caffe2.caffe2.python.optimizer_test.TestAdagradWithDedicatedLRIteration) (46.475) ✓ Pass: caffe2/caffe2/python:optimizer_test - test_lr_injection (caffe2.caffe2.python.optimizer_test.TestAdagradWithDedicatedLRIteration) (46.475) ✓ Pass: caffe2/caffe2/python:optimizer_test - main (46.475) Summary Pass: 5 Skip: 1 ↻ caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestAdagradWithDedicatedLRIteration) ListingSuccess: 1 ``` Reviewed By: liangming168 Differential Revision: D38747417 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85195 Approved by: https://github.com/liangming168, https://github.com/eellison commit 784f4ba1ce16996d497fae2fb107425b3bbeb71b Author: Ke Wen Date: Tue Sep 27 00:34:50 2022 +0000 Add environment parse function that supports default value (#85563) We use "-2" to represent an unset environment variable. Now adding a util function to attach default value if environment variable is unset. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85563 Approved by: https://github.com/rohan-varma, https://github.com/H-Huang commit 686555b663077b40f28bc88adc049e64035046b4 Author: George Qi Date: Mon Sep 26 21:03:05 2022 +0000 [maskedtensor] port torch/_masked into torch/masked (#85515) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85515 Approved by: https://github.com/cpuhrsch commit 90261945b71d2ac2a24bd59cbaf823a84ef3b8d2 Author: Elias Ellison Date: Mon Sep 26 20:41:18 2022 +0000 Copy over non parameter grad (#85658) Wow, ugh silly mistake. Fix for https://github.com/pytorch/torchdynamo/issues/1291 not even sure how all the tests passed before this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85658 Approved by: https://github.com/Chillee, https://github.com/anijain2305 commit 4a2d2e5e40835a7931577fc800a02574e2eb44e6 Author: Brian Hirsh Date: Mon Sep 26 11:43:12 2022 -0700 Change API type `Tensor[]` for structured kernels. 
(#73350) Partially fixes: #66328 This PR: - adds support for `ITensorList` to the dispatcher for: - computing the dispatch key - boxing and unboxing `ITensorList` - modified the codegen for structured kernels: - codegen APIs use `ITensorList` instead of `ArrayRef` **Changes summary:** - Signature changes due to the different APIs: - dispatcher API (e.g. `BatchingRegistrations.cpp`) - C++ API (e.g. `TensorShape.cpp`) - Miscelaneous functions used by codegen'd functions (e.g. `FunctionalTensorWrapper.*`) - Dispatcher changes for handling `ITensorList` correctly (e.g. `DispatchKeyExtractor.h`) - Signature changes of `at::cat` due to the need of `const` inside `TensorBody.h` - Forward declarations of `ITensorList` (e.g. `MethodOperators.h`) - Codegen changes, special casing structured kernels (e.g. `gen.py`) **Short description of structured kernels special casing:** I introduced, mainly, 5 types of changes to the codegen for generating code depending on whether the kernel is structured or not: 1. Added a `structured_type_override` flag to the `argument_type` function definition of the affected APIs (mainly the dispatcher and C++ APIs). - `api/cpp.py`, `api/dispatcher.py`, `api/native.py` 2. Added a `structured_type_override` member to the signature classes (e.g. `CppSignature`), since `FunctionSchema` doesn't really know whether the function is structured or not - `api/types.py` 3. Added a `part_of_structured_group` to `NativeFunction` class, which is just a convenient function to forward to `structured_type_override` wherever needed - `model.py` 4. Appropriately changed the rest of the codegen, whenever it used either the signature classes or the `arguments` function directly 5. Added a check for `const ITensorList&` type wherever there was a check for `TensorList` Pull Request resolved: https://github.com/pytorch/pytorch/pull/73350 Approved by: https://github.com/bdhirsh commit 1a2734e015a695bbd4ea4de93bf6aaaa3202eed8 Author: Huy Do Date: Mon Sep 26 21:39:00 2022 +0000 Fix broken periodic workflow after removing ios secret (#85664) Broken after https://github.com/pytorch/pytorch/pull/85597, i.e. https://github.com/pytorch/pytorch/actions/runs/3130970082 ``` The workflow is not valid. .github/workflows/periodic.yml (Line: 189, Col: 26): Invalid secret, IOS_CERT_KEY_2022 is not defined in the referenced workflow. .github/workflows/periodic.yml (Line: 190, Col: 24): Invalid secret, IOS_CERT_SECRET is not defined in the referenced workflow. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85664 Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/janeyx99, https://github.com/ZainRizvi, https://github.com/malfet commit e4471032dae4d68e82358e715777f2f385d7ff09 Author: Huy Do Date: Mon Sep 26 21:35:00 2022 +0000 Enforce non-virtual-dtor everywhere (#85586) This can finally be removed because NVIDIA has merged my PR on https://github.com/NVIDIA/cudnn-frontend/pull/33 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85586 Approved by: https://github.com/seemethere, https://github.com/ZainRizvi commit f325c29b05fdf350322dcb26e2cd6dff3fb06f1e Author: Mike Iovine Date: Mon Sep 26 21:30:16 2022 +0000 [fx] Make NormalizeArgs preserve node type (#85637) Summary: Make `NormalizeArgs` preserve node types when transforming the graph. This bug is preventing me from scripting a graph that goes through the fx2trt `acc_tracer`. 
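A minimal sketch of the expectation, assuming the experimental pass lives at `torch.fx.experimental.normalize` (illustrative only; the new unit test is the authoritative check):

```python
import torch
import torch.fx as fx
# Import path is an assumption about where the experimental normalization pass lives.
from torch.fx.experimental.normalize import NormalizeArgs

class M(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x)

traced = fx.symbolic_trace(M())
normalized = NormalizeArgs(traced).transform()

for before, after in zip(traced.graph.nodes, normalized.graph.nodes):
    # node type annotations should survive the args-to-kwargs rewrite
    print(before.op, before.type, "->", after.type)
```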
Test Plan: New unit test Reviewed By: ipiszy Differential Revision: D39753021 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85637 Approved by: https://github.com/Chillee commit 5547c6aa4e587578eb62d3b102520bc1eebba419 Author: Sherlock Huang Date: Mon Sep 26 17:57:01 2022 +0000 Match kwargs in SubgrpahMatcher (#85617) Pattern node and target node must have identical kwargs now... Use envvar `LOGLEVEL=INFO` to turn on the logging message for easier debugging... Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #85617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85617 Approved by: https://github.com/jerryzh168, https://github.com/davidberard98 commit e38b3424c3f0555c1c255130064dad60c5046c4f Author: Richard Zou Date: Mon Sep 26 08:22:13 2022 -0700 Clean up the functorch test skip mechanism; add a new decorator (#85564) This PR: - adds a `decorate` thing that can be added to skip/xfail lists. This lets people provide their own decorator (e.g. unittest.skipIf blah) - does some refactoring of the skip/xfail list mechanism to make it more sane Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/85564 Approved by: https://github.com/samdow commit 6a04df3ac85714b33dcb2af20d0eeea96d131d75 Author: cpuhrsch Date: Mon Sep 26 20:49:19 2022 +0000 Get flash_attn to compile for CUDA 11.6 linux nightly build (#84941) This PR only attempts to get this code to compile for all archs so that we can dispatch to it in https://github.com/pytorch/pytorch/pull/84653 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84941 Approved by: https://github.com/drisspg, https://github.com/malfet commit 15435325eb6c561f8a126b5c4e48ac57c064a6fa Author: Daniel Dale Date: Mon Sep 26 20:17:52 2022 +0000 Configure PyTorch Testing ArgumentParser Instance To Avoid Unnecessary Conflicts with System Args (#85616) Fixes #85615 Currently, internal test discovery instantiates an `ArgumentParser` and adds numerous arguments to the internal parser: https://github.com/pytorch/pytorch/blob/f0570354dda37c03c63377ada1ec889cf82ae9f6/torch/testing/_internal/common_utils.py#L491-L500 ... In this context, `argparse` will load [system args](https://github.com/python/cpython/blob/b494f5935c92951e75597bfe1c8b1f3112fec270/Lib/argparse.py#L1826-L1829) from any external scripts invoking PyTorch testing (e.g. `vscode`). The default behavior of `argparse` is to [allow abbreviations](https://github.com/python/cpython/blob/b494f5935c92951e75597bfe1c8b1f3112fec270/Lib/argparse.py#L2243-L2251) of arguments, but when an `ArgumentParser` instance has many arguments and may be invoked in the context of potentially conflicting system args, the `ArgumentParser` should reduce the potential for conflicts by being instantiated with `allow_abbrev` set to `False`. With the current default configuration, some abbreviations of the `ArgumentParser` long options conflict with system args used by `vscode` to invoke PyTorch test execution: ```bash python ~/.vscode-server/extensions/ms-python.python-2022.14.0/pythonFiles/get_output_via_markers.py \ ~/.vscode-server/extensions/ms-python.python-2022.14.0/pythonFiles/visualstudio_py_testlauncher.py \ --us=./test --up=test_cuda.py --uvInt=2 -ttest_cuda.TestCuda.test_memory_allocation \ --testFile=./test/test_cuda.py >>>PYTHON-EXEC-OUTPUT ... 
visualstudio_py_testlauncher.py: error: argument --use-pytest: ignored explicit argument './test' ``` The full relevant stack: ``` pytorch/test/jit/test_cuda.py, line 11, in \n from torch.testing._internal.jit_utils import JitTestCase\n'\ pytorch/torch/testing/_internal/jit_utils.py, line 18, in \n from torch.testing._internal.common_utils import IS_WINDOWS, \\\n' pytorch/torch/testing/_internal/common_utils.py, line 518, in \n args, remaining = parser.parse_known_args()\n' argparse.py, line 1853, in parse_known_args\n namespace, args = self._parse_known_args(args, namespace)\n' argparse.py, line 2062, in _parse_known_args\n start_index = consume_optional(start_index)\n' argparse.py, line 1983, in consume_optional\n msg = _(\'ignored explicit argument %r\')\n' ``` The `argparse` [condition](https://github.com/python/cpython/blob/b494f5935c92951e75597bfe1c8b1f3112fec270/Lib/argparse.py#L2250) that generates the error in this case: ```python print(option_string) --use-pytest print(option_prefix) --us option_string.startswith(option_prefix) True ``` It'd be nice if `vscode` didn't use two-letter options :facepalm: but PyTorch testing shouldn't depend on such good behavior by invoking wrappers IMHO. I haven't seen any current dependency on the abbreviated internal PyTorch `ArgumentParser` options so this change should only extend the usability of the (always improving!) PyTorch testing modules. This simple PR avoids these conflicting options by instantiating the `ArgumentParser` with `allow_abbrev=False` Thanks to everyone in the community for their continued contributions to this incredibly valuable framework. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85616 Approved by: https://github.com/clee2000 commit d5ce2bbed26a175e7bf69480759c2cfe73f42a75 Author: Fabio Rocha Date: Mon Sep 26 09:03:23 2022 +0000 [primTorch] decompositions for upsample_bicubic2d (#85403) FYI, this decomposition seems to be significantly slower than the lowering in torchinductor: ``` ------------------------------------- upsample_bicubic2d -------------------------------------] | lowering | Inductor | Eager 32 threads: ------------------------------------------------------------------------------------ (torch.Size([16, 4, 128, 256]),), ((512, 1024), True) | 1.8 | 3.880 | 1.4 (torch.Size([16, 4, 128, 256]),), ((512, 1024), False) | 1.9 | 3.887 | 1.4 ``` This seems related to the fact that in the lowering we can use int32s as the indices and in the decomp we can only use int64s (see https://github.com/pytorch/torchdynamo/issues/1293). Pull Request resolved: https://github.com/pytorch/pytorch/pull/85403 Approved by: https://github.com/ngimel commit 70cce9f8d1a099af8b017d7263897d3ca2fb9fe6 Author: PyTorch MergeBot Date: Mon Sep 26 20:07:13 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#85225) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85225 Approved by: https://github.com/pytorchbot, https://github.com/voznesenskym commit 89896b8778b76685c4fc40d1f8ce337d36105c02 Author: Bobby Impollonia Date: Mon Sep 26 19:13:00 2022 +0000 Fix typo in comment (#85635) This comment should talk about an object "leak", not an object "lead" Pull Request resolved: https://github.com/pytorch/pytorch/pull/85635 Approved by: https://github.com/kit1980 commit 291b080e8c54cdb19dd823728cf30fffece10f4d Author: Aaron Bockover Date: Mon Sep 26 18:47:06 2022 +0000 CODEOWNERS: [ONNX] remove @shubhambhokare1; add @abock (#85476) Add me to field notifications for the ONNX team, replacing @shubhambhokare1. cc @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/85476 Approved by: https://github.com/kit1980, https://github.com/AllenTiTaiWang commit a8ca0d4849111fd6c83e5ae90517f83bf1ceb6b3 Author: Vasiliy Kuznetsov Date: Fri Sep 23 10:59:21 2022 -0600 fix segmentation fault in QTensor.choose_qparams_optimized (#85552) Summary: Fixes segmentation fault in `QTensor.choose_qparams_optimized`, this guards against the user passing in a value of `numel` which does not make sense. Fixes https://github.com/pytorch/pytorch/issues/85212 Test plan: Probably not worth it to add a test for this, so testing manually. ``` import torch input = torch.full((64,), 1, dtype=torch.float32, requires_grad=False) numel = 1250999896764 n_bins = 0 ratio = 0 bit_width = 0 torch.choose_qparams_optimized(input, numel, n_bins, ratio, bit_width) // RuntimeError is thrown ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85552 Approved by: https://github.com/jerryzh168 commit bcc544e9d7b257c8b287151194d87565e95eb893 Author: Elias Ellison Date: Mon Sep 26 07:03:44 2022 -0700 Add FakeCrossRef tests for backwards, Fix Layer Norm Backward Decomp (#85417) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85417 Approved by: https://github.com/ezyang commit 0f561f0bd21a1f3214cc4050b3e4a1cc739981e9 Author: Bin Chen Date: Mon Sep 26 16:05:17 2022 +0000 Log Watchdog events to scuba (#85391) Summary: This diff logs some events of FileTimerServer to a scuba table. The events include "server started", "server stopped", "set timer", "clear timer" and "kill worker process". Test Plan: ``` buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test ``` ``` Test Session: https://www.internalfb.com/intern/testinfra/testrun/1407375146936031 RE: reSessionID-2224cf79-6a28-4762-ab7c-9875adb244dc 3.4 KiB▲, 0.0 B▼ Jobs completed: 57. Time elapsed: 3084.4s. Tests finished: Pass 55. Fail 0. Fatal 0. Skip 0. 0 builds failed ``` Differential Revision: D39665560 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85391 Approved by: https://github.com/d4l3k commit 60d98821c51be03acb827e0b6a81276eb3e3b1e1 Author: Richard Zou Date: Fri Sep 23 11:24:32 2022 -0700 Remove unnecessary skips in test_dispatch.py (#85557) The functorch dangling impls have been fixed, I hope CI passes Pull Request resolved: https://github.com/pytorch/pytorch/pull/85557 Approved by: https://github.com/ezyang commit b0eeffdf6f94b359270a1a6991da57994ba1d689 Author: Richard Zou Date: Fri Sep 23 10:55:51 2022 -0700 Fix printing regular tensors inside functorch transforms (#85556) Fixes https://github.com/pytorch/functorch/issues/1026 We need to disable functorch's stack-based dispatching mechanism inside the tensor printing. 
Otherwise, all operations that clean up the data of the Tensor for printing dispatch through the entire functorch stack and causes problems. Disabling stack-based dispatching and printing a functorch wrapped tensor is not a problem; we're still able to get the attributes on the wrapped tensor that we want. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/85556 Approved by: https://github.com/samdow commit 30fccd03a69dc0249037b84593a6f1116a44e297 Author: milesial Date: Mon Sep 26 15:15:57 2022 +0000 Make profiler table column widths changeable via arguments (#85203) Maximum widths for the name and shapes columns of profiler results tables are no longer hardcoded. If None is passed, it will use the maximum width of the data, without cropping. Fixes #70595 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85203 Approved by: https://github.com/ezyang commit b32020e937a03b54427761a5a5009f064c7e5bac Author: saltyJeff Date: Mon Sep 26 15:13:24 2022 +0000 make vulkan codegen windows-compatible (#85241) Using `:` to join together paths works on *nix only. This process uses cmake's `list(APPEND ...)` to make vulkan codegen work on windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85241 Approved by: https://github.com/ezyang commit ef95baf2eca65bb796a351900025b24939502e0a Author: Yukio Siraichi Date: Mon Sep 26 14:44:37 2022 +0000 Add `IListRefTag::Materialized` to `IListRefIterator` destructor. (#85467) Fixes #85404 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85467 Approved by: https://github.com/ezyang commit 9c036aa112b0a8fd9afb824d1fda058e2b66ba1d Author: Edward Z. Yang Date: Sun Sep 25 12:17:01 2022 -0700 Add SymInt to Scalar (#84958) This is by no means comprehensive, but adds initial support for SymInt as a Scalar. Things that don't work yet but need to: - for some reason `torch.add(tensor, sym_int)` got matched to the `add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor` schema - `x + sym_int` failed bc we tried to turn `x` into a sym int: ``` "__radd__", [](c10::SymIntNode a, py::object b) -> c10::SymIntNode { auto snb = toSymIntNode(a, b); return a->add(snb); }) ``` - Many more things I'm sure Pull Request resolved: https://github.com/pytorch/pytorch/pull/84958 Approved by: https://github.com/ezyang commit 33404436aaf90ec4d0a39db0f0b9f7622db3d404 Author: foram-chandra <96388449+foram-chandra@users.noreply.github.com> Date: Sun Sep 25 22:23:21 2022 +0000 [doc] Add pin_memory and layout to new_{zeros, ones, full} (#85605) Fixes #84986 Besides adding `pin_memory` and `layout`, I have also updated the signature to reflect keyword only arguments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85605 Approved by: https://github.com/lezcano, https://github.com/SherlockNoMad commit 6e50f8e39540b3aeec7b32109d2755786e7d9a2d Author: Daniel Dale Date: Sun Sep 25 21:54:05 2022 +0000 Allow External Scripts (e.g. vscode) To Discover and Execute unittest Tests (#85584) Fixes #85578 Currently, many test modules customize test loading and discovery via the [load_tests protocol](https://docs.python.org/3/library/unittest.html#load-tests-protocol). The salient custom behavior included (introduced with https://github.com/pytorch/pytorch/pull/13250) is to verify that the script discovering or executing the test is the same script in which the test is defined. I believe this unnecessarily precludes the use of external tools to discover and execute tests (e.g. 
the vscode testing extension is widely used and IMHO quite convenient). This simple PR retains the current restriction by default while offering users the option to disable the aforementioned check if desired by setting an environmental variable. For example: 1. Setup a test env: ```bash ./tools/nightly.py checkout -b some_test_branch conda activate pytorch-deps conda install -c pytorch-nightly numpy expecttest mypy pytest hypothesis astunparse ninja pyyaml cmake cffi typing_extensions future six requests dataclasses -y ``` 2. The default test collection behavior discovers 5 matching tests (only tests within `test/jit/test_cuda.py` because it doesn't alter the default `load_test` behavior: ```bash python ~/.vscode-server/extensions/ms-python.python-2022.14.0/pythonFiles/get_output_via_markers.py \ ~/.vscode-server/extensions/ms-python.python-2022.14.0/pythonFiles/testing_tools/unittest_discovery.py \ ./test test_cuda.py | grep test_cuda | wc -l 5 ``` 3. Set the new env variable (in vscode, you would put it in the .env file) ```bash export PYTORCH_DISABLE_RUNNING_SCRIPT_CHK=1 ``` 4. All of the desired tests are now discovered and can be executed successfully! ```bash python ~/.vscode-server/extensions/ms-python.python-2022.14.0/pythonFiles/get_output_via_markers.py \ ~/.vscode-server/extensions/ms-python.python-2022.14.0/pythonFiles/testing_tools/unittest_discovery.py \ ./test test_cuda.py | grep test_cuda | wc -l 175 ``` ![image](https://user-images.githubusercontent.com/7462936/192068508-a292caaf-a1d2-4115-a557-02ac5da80b60.png) A potentially relevant note, the previous behavior of the custom `load_tests` flattened all the `TestSuite`s in each test module: https://github.com/pytorch/pytorch/blob/4c01c51266afae57c6d6952c84fff2802d9b2bb9/torch/testing/_internal/common_utils.py#L3260-L3262 I haven't been able to find any code that depends upon this behavior but I think retaining the `TestSuite` structure is preferable from a user perspective and likely safe (`TestSuite`s [can be executed](https://docs.python.org/3/library/unittest.html#load-tests-protocol:~:text=test%20runner%20to-,allow%20it%20to%20be%20run,-as%20any%20other) just like `TestCase`s and this is the structure [recommended](https://docs.python.org/3/library/unittest.html#load-tests-protocol:~:text=provides%20a%20mechanism%20for%20this%3A%20the%20test%20suite) by the standard python documentation). If necessary, I can change this PR to continue flattening each test module's `TestSuite`s. Since I expect external tools using the `unittest` `discover` API will usually assume discovered `TestSuite`s to retain their structure (e.g. like [vscode](https://github.com/microsoft/vscode-python/blob/192c3eabd8a065492f237196b052145364e68cb4/pythonFiles/visualstudio_py_testlauncher.py#L336-L349)) retaining the `testsuite` flattening behavior would likely require customization of those external tools for PyTorch though. Thanks to everyone in the community for the continued contributions to this incredibly valuable framework! 
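A hypothetical sketch of that gating (names and the exact check are illustrative, not the real `common_utils.py` implementation):

```python
import os
import sys
import unittest

def load_tests(loader, tests, pattern):
    # Skip the "must be run from the defining script" restriction when the
    # environment variable is set, so external runners (e.g. vscode) can discover tests.
    if os.getenv("PYTORCH_DISABLE_RUNNING_SCRIPT_CHK", "0") != "1":
        running = os.path.basename(sys.argv[0])
        defining = os.path.basename(__file__)
        assert running == defining, (
            f"{defining} should be run directly, not via {running}; set "
            "PYTORCH_DISABLE_RUNNING_SCRIPT_CHK=1 to allow external test runners"
        )
    suite = unittest.TestSuite()
    for group in tests:
        suite.addTest(group)  # keep each discovered sub-suite intact rather than flattening
    return suite
```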
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85584 Approved by: https://github.com/huydhn commit f0570354dda37c03c63377ada1ec889cf82ae9f6 Author: Abhishek Pathak Date: Sun Sep 25 19:03:58 2022 +0000 [MPS] Fix memory error in var (#85571) * Fix memory corruption + wrong handling of negative dims * Use vector for shape Pull Request resolved: https://github.com/pytorch/pytorch/pull/85571 Approved by: https://github.com/malfet commit e29f2483a6edde29b3e175df759def8a6701c7fc Author: John Detloff Date: Sun Sep 25 18:36:29 2022 +0000 Remove codesigning from github actions ios build workflow (#85597) Codesigning isn't necessary for simulator builds, and we're not running any device tests. More importantly, our dev certificate is expiring at the end of the month, and we don't have a replacement. As a result, we need to remove our job dependencies on it. This commit removes our references to it from our github CI, but a follow up PR will be needed to remove it from CircleCI workflows. Co-authored-by: Nikita Shulga Pull Request resolved: https://github.com/pytorch/pytorch/pull/85597 Approved by: https://github.com/malfet commit 1a0e1db763aa1152ac2126c895763a7c5ccb47fc Author: Taylor Robie Date: Thu Sep 22 15:00:17 2022 -0700 [Profiler] Compute unique IDs for Tensors (#85162) This PR is largely based on https://github.com/pytorch/pytorch/pull/80266, with one major difference. #80266 assigned each unique {TensorImpl, StorageImpl} pair a unique ID, whereas this PR seeks to cluster the implicit graph formed by the pairs into disjoint groups and assign an ID to each disjoint group. Differential Revision: [D39563859](https://our.internmc.facebook.com/intern/diff/D39563859/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85162 Approved by: https://github.com/chaekit commit e1f9125e61feaef81fd60dc97d02acb536a178be Author: wakananai Date: Sun Sep 25 17:10:45 2022 +0000 [doc] add argument default values in rot90 (#85610) Add argument default values in rot90. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85610 Approved by: https://github.com/lezcano commit 0d86dfccf8607eac845cfa12b3fefd217892efa4 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Sun Sep 25 16:23:21 2022 +0000 Bump protobuf from 3.20.1 to 3.20.2 in /.circleci/docker (#85572) Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 3.20.1 to 3.20.2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85572 Approved by: https://github.com/malfet commit bd5efbb7eefb9e8bbf997742fafe5e77a5d0991f Author: foram-chandra <96388449+foram-chandra@users.noreply.github.com> Date: Sun Sep 25 10:47:59 2022 +0000 [doc] add pin_memory argument to rand (#85221) Similar to #85123 cc - @mruberry @kshitij12345 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85221 Approved by: https://github.com/mruberry commit a8add2b92f336a99d1d25648bb4512a9ab5157b5 Author: Sherlock Huang Date: Sat Sep 24 17:57:32 2022 +0000 Support matching Args for SubgraphMatcher (#85456) Subgraph matcher now handles the matching of non-Node arguments. Here are the 4 cases - pn is Node, gn is Node: this go through the regular _match_node() function - pn is Noed, gn is not a Node: this is a match if only pn is a placeholder op - pn is not Node, gn is Node: this is a no match case - pn is not a Node, gn is not a Node: this will go through the argument comparison. With this change ``` def target(x): return foo(x, 3) def pattern(x, y): return foo(x, y) ``` is a match Pull Request resolved: https://github.com/pytorch/pytorch/pull/85456 Approved by: https://github.com/jerryzh168 commit db40fbdee03920944219588464d38774ca0b3d05 Author: Howard Huang Date: Sat Sep 24 18:00:28 2022 +0000 Add deprecation warning to ProcessGroupRoundRobin (#85158) Trying to add any deprecation messages we anticipate we need before 1.13 branch cut. Add deprecation message to process group round robin. ```python import torch.distributed as dist if __name__ == "__main__": pg = dist._round_robin_process_groups( [ dist.ProcessGroupGloo(dist.TCPStore("localhost", 29500, 1, True), 0, 1) ] ) ``` gives message ``` W0916 16:19:38.367360 68031 ProcessGroupRoundRobin.cpp:12] Warning: ProcessGroupRoundRobin is deprecated and scheduled to be removed after this current release (1.13). Please file an issue on https://github.com/pytorch/pytorch/issues if there are any concerns or issues with this deprecation. (function ProcessGroupRoundRobin) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85158 Approved by: https://github.com/rohan-varma commit 41be45f0f4c6db2755be907db4f4a1665fe312e0 Author: PyTorch MergeBot Date: Sat Sep 24 12:35:21 2022 +0000 Revert "Create a quantized version ReLU function for CUDA (#85502)" This reverts commit 93a53ff4d92c883d87cc7aee35af719039b481a8. Reverted https://github.com/pytorch/pytorch/pull/85502 on behalf of https://github.com/janeyx99 due to Sorry, reverting as 10.2 builds on trunk broke due to this change, see https://hud.pytorch.org/pytorch/pytorch/commit/93a53ff4d92c883d87cc7aee35af719039b481a8 commit a531a604a093528721f970d922cd8e72ed9f0f8f Author: Wang, Eikan Date: Fri Sep 23 06:04:13 2022 +0000 Support BF16ImmPtr (#84041) - To support BF16 Immediate value by converting it to uint16. The behavior is as same as BF16 tensor - Enable BF16 test cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84041 Approved by: https://github.com/ZolotukhinM commit ffaff8896a2716cca5a29315124b1b63c475e80f Author: Fabio Rocha Date: Thu Sep 22 16:46:37 2022 +0000 Removed None arg check in test/test_decomp.py (#85402) Not sure why this check was necessary? Tests seem to run fine without it. 
There were definitely tests this was skipping before that it shouldn't, e.g., pretty much all of the tests for `torch.nn.functional.interpolate` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85402 Approved by: https://github.com/ezyang commit d3be4245bb416a676c4faf53ebfa3bf55ba32bbc Author: Wang, Eikan Date: Fri Sep 23 06:05:09 2022 +0000 Fix the issue that cat result would be incorrect for channels-last (#85076) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85076 Approved by: https://github.com/frank-wei commit 2efea21c52c0a9f0818f33d4520d699cca90cea3 Author: Sunita Nadampalli Date: Sat Sep 24 08:05:27 2022 +0000 [mkldnn_matmul] enable mkldnn matmul for aarch64 bf16 devices (#83671) (#85546) this PR enables mkldnn matmul for aarch64 bf16 devices for both bf16 as well as fp32 input. This PR is dependent on cpuinfo commit update PR: https://github.com/pytorch/pytorch/pull/83620 Issue: https://github.com/pytorch/pytorch/issues/83594 This is a reland of https://github.com/pytorch/pytorch/pull/83671 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85546 Approved by: https://github.com/kit1980 commit 93a53ff4d92c883d87cc7aee35af719039b481a8 Author: Feisi Fu Date: Sat Sep 24 05:59:13 2022 +0000 Create a quantized version ReLU function for CUDA (#85502) Summary: this is to allow the relu function to run on a quantized tensor on cuda. That is torch.relu(qa) for a quantized tensor qa on cuda. Test Plan: python test/test_quantization.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/85502 Approved by: https://github.com/dzdang commit e7e1cd945fe218fde228cedbdb1509f1750f70ea Author: Jane Xu Date: Sat Sep 24 03:47:33 2022 +0000 Add path optimize kwarg to einsum (#84890) - [x] add c++ support for an optimize path - [x] add python opt_einsum path passthrough - [x] add opt_einsum to OSS requirements, but a soft one - [x] show benchmark results here Additional things I've explored + their conclusions: - **Delaying the summing over dimensions** => added! - The idea here is to not incur kernel calls to `sum` as we try to early sum out in einsum. Thus, we collect all the dimensions that need to be summed together in one contraction + sum at the end instead of summing as we go. While this optimization didn't feel like it made things faster for the random cases we've selected (they all summed 1 dim per contraction), it is a good principle and would help more common use cases that would reduce multiple dimensions at a time (like `bxy,xyi,xyj->bij`). - **Caching contract_path based on equation and tensor sizes** => dropped :( - The benchmarks were strictly worse for all the cases, and, from scanning the use cases, I observed people do not often call einsum on the same equation/tensor order enough for caching to be justified. I do think caching can be effective in the future, but it would require further investigation. - adding opt_einsum package to OSS CI - adding it to internal CI - potentially adding a kwarg path argument to the python API -- if the path is given, we wouldn't have to spend time calculating it, but there would be some time lost validating user input. - Added more tests to CI **TL;DRs** - **torch.einsum with opt_einsum is a definite win for the production case**. - **torch.einsum with opt_einsum installed is consistently fast, but has an overhead** of needing to find the path. If the path is already found/optimal, it will be slightly slower. - The einsum overhead decreases for bigger dimensions. 
- **torch.einsum without opt_einsum installed is comparable to before this commit**, with occasional slowness potentially due to not reshaping/squeezing as we contract until the end. - For many of the randomly generated cases, the dimensions were too similar and small, so an optimal order wasn't much better than just going left to right. However, in production, dimensions are commonly quite distinct (batch size will be small, but the data will be huge). - **torch.einsum opt is comparable (slightly faster overall) compared to numpy.einsum opt for the cpu case**. This is interesting given that torch.einsum currently spends time computing the path, but numpy.einsum takes it as input. - **torch.einsum opt is significantly faster than numpy.einsum opt for the gpu case**. This is because numpy doesn't take advantage of GPUs. The following benchmarks were done on an A100 GPU and Linux CPUs. The line in the first chart separates GPU (on top) from CPU, and the line in the second graph separates CPU (on top) from GPU. Sorry it's flipped 😛 . Production example (see [colab benchmark](https://colab.research.google.com/drive/1V2s4v1dOOKwRvp5T_DC-PNUosOV9FFJx?authuser=1#scrollTo=WZoQkC8Mdt6I) for more context): (image) Randomly generated examples (the same ones as in https://github.com/pytorch/pytorch/pull/60191) (image) Open below to see old + not super relevant benchmarking results:
Benchmark results BEFORE this PR (on Linux -- I will update devices so they are consistent later): image Benchmark results with the code on this PR (on my x86 mac): For the CPU internal use case -- ![image](https://user-images.githubusercontent.com/31798555/190801376-6f591b00-cebd-4ca7-bb23-ae8f17f1634e.png) For the general use case -- It looks like numpy opt still does better in several of these random cases, but torch einsum opt is consistently faster than torch.einsum. ![image](https://user-images.githubusercontent.com/31798555/190811730-fbb6797d-af59-4f5a-92da-ba4103372014.png)
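To make the opt_einsum integration concrete, a minimal usage sketch (assuming the optional `opt_einsum` package is installed so `torch.einsum` can pick a contraction order; the multi-operand equation is illustrative, not one of the benchmark cases above):

```python
import torch

# Three-operand contraction where the contraction order matters
# (the "bxy,xyi,xyj->bij" pattern mentioned above).
b, x, y, i, j = 8, 64, 64, 32, 32
A = torch.randn(b, x, y)
B = torch.randn(x, y, i)
C = torch.randn(x, y, j)

# With opt_einsum installed, torch.einsum computes an optimized
# contraction path before dispatching to matmul/bmm kernels;
# without it, operands are contracted left to right.
out = torch.einsum("bxy,xyi,xyj->bij", A, B, C)
print(out.shape)  # torch.Size([8, 32, 32])
```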
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84890 Approved by: https://github.com/albanD, https://github.com/soulitzer commit e78e00f4d98c4376e298902db8aae7e7057e86df Author: PyTorch MergeBot Date: Sat Sep 24 02:31:59 2022 +0000 [vision hash update] update the pinned vision hash (#85581) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85581 Approved by: https://github.com/pytorchbot commit 2b6d2cad29fc1652f80199d647306b9c7c841ca9 Author: Sergii Dymchenko Date: Sat Sep 24 01:39:19 2022 +0000 Remove @saketh-are from CODEOWNERS (#85521) saketh-are no longer has write access to the repository. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85521 Approved by: https://github.com/huydhn commit 4d3acf12034132c422606d175ca535359123023c Author: Huy Do Date: Sat Sep 24 01:17:04 2022 +0000 Enable pytest-shard for functorch (#85321) This extends https://github.com/pytorch/pytorch/pull/84961 to support functorch tests with pytest-shard. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85321 Approved by: https://github.com/samdow, https://github.com/clee2000 commit 70b27e91c7160bdf016511fc67940f5b89f5a30f Author: Sourav Mandal Date: Sat Sep 24 01:02:40 2022 +0000 [pytorch] Skip linalg tests that fail on Meta infra (#85577) Summary: test_inverse_errors_large and test_linalg_solve_triangular fail for dtype=float64 when invoked on GPUs on Meta internal testing infra. Skip in Meta internal testing. Test Plan: (observe tests skipped on Meta internal infra) Reviewed By: mikekgfb Differential Revision: D39785331 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85577 Approved by: https://github.com/malfet commit a554a546b382949b4ba8518d2a594956a6ed3fbf Author: Ke Wen Date: Sat Sep 24 01:02:35 2022 +0000 Update PyTorch Distributed CODEOWNERS (#85560) Add Ke Wen to PyTorch Distributed modules Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/85560 Approved by: https://github.com/H-Huang commit 7d8ee38a5c6b9d852b97605cbfdded5183b6524a Author: Mike Iovine Date: Sat Sep 24 01:01:34 2022 +0000 [Static Runtime] Fix prim::If tuple corner case (#85446) Summary: We currently assume that a tuple output implies that the prim::If node returns multiple unpacked outputs, but this is not guaranteed to be the case. 
Add some logic to return the wrapped tuple if necessary Test Plan: New unit test Differential Revision: D39712050 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85446 Approved by: https://github.com/tenpercent commit 18685b7fe1e33d9102526b515df855fec0e2c445 Author: supriyar Date: Thu Sep 22 13:31:19 2022 -0700 Update PT maintainers list for AO (#85125) Summary: Update the list based on recommendation in https://github.com/pytorch/pytorch/blob/master/docs/source/community/build_ci_governance.rst Test Plan: Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D39745619](https://our.internmc.facebook.com/intern/diff/D39745619) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85125 Approved by: https://github.com/gchanan commit a38e43e936c725cd5dd59617b5806c31f13eab0c Author: Alex Beloi Date: Fri Sep 23 23:36:57 2022 +0000 [perf][1/5] Replace IValue::toString()->string() with IValue::toStringRef() (#85437) Summary: `IValue::toString()` creates a `new c10::intrusive_ptr` (like `std::shared_ptr`) and `->string()` immediately accesses it, creating an atomic reference increment/decrement. We can skip both of these operations by calling `IValue::toStringRef()`. Test Plan: CI Reviewed By: jaybean-dev Differential Revision: D39605242 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85437 Approved by: https://github.com/jfix71 commit ea81138bd6b688554ad307e370bebeffd264f1b7 Author: dzdang Date: Thu Sep 22 21:27:32 2022 -0400 [quant][improvement][better-engineering] Refactored get_supported_device_types into common_quantization.py (#79607) Summary: Both test_quantized_tensor.py and test_quantize_fx.py had the same get_supported_device_types function defined. This PR refactors it into the common_quantization.py file for common usage Test Plan: ``` python test/test_quantization.py ``` Differential Revision: [D37173692](https://our.internmc.facebook.com/intern/diff/D37173692) Pull Request resolved: https://github.com/pytorch/pytorch/pull/79607 Approved by: https://github.com/jerryzh168 commit 12ae3bea437e760d4fede3f1c50c2c81af3f687c Author: nikitaved Date: Fri Sep 23 23:31:17 2022 +0000 Faster mul(sparse, sparse) with broadcasting in dense dims. (#85336) This is a combo PR of https://github.com/pytorch/pytorch/pull/84929 and ~https://github.com/pytorch/pytorch/pull/83428~. Preliminary benchmarks (square matrices of shape (n, n)).
Script
```python
import torch
import math
from IPython import get_ipython
from itertools import product, repeat
import pickle
from torch.utils.benchmark import Timer, Compare

torch.manual_seed(13)

problem_dims = (
    (10000, 100),
    (100000, 1000),
    (1000000, 10000),
    (10, 100),
    (10, 1000),
    (10, 10000),
    (100, 1000),
    (100, 10000),
    (1000, 10000),
    (1000, 100000),
    (1000, 1000000),
)

name = "PR"
device = "cuda"
results = []

for n, nnz in problem_dims:
    def gen_tensor(coalesce=False):
        shape = (n, n)
        nrows, ncols = shape
        rowidx = torch.randint(low=0, high=nrows, size=(nnz,), device=device)
        colidx = torch.randint(low=0, high=ncols, size=(nnz,), device=device)
        itemidx = torch.vstack((rowidx, colidx))
        xvalues = torch.randn(nnz, device=device)
        itemidx = torch.hstack((itemidx, itemidx))
        xvalues = torch.hstack((xvalues, xvalues))
        res = torch.sparse_coo_tensor(itemidx, xvalues, size=shape)
        if coalesce:
            return res.coalesce()
        else:
            return res

    for x_coalesce, y_coalesce in product(*repeat((True, False), 2)):
        x = gen_tensor(x_coalesce)
        y = gen_tensor(y_coalesce)
        smtp = "x * y"
        timer = Timer(smtp,
                      globals=globals(),
                      label="coo.mul",
                      description=f"{name}: mul, device: {device}",
                      sub_label=f"n={n}, nnz={nnz}, coalesce=({x_coalesce, y_coalesce})",
                      num_threads=torch.get_num_threads())
        results.append(timer.blocked_autorange())

compare = Compare(results)
compare.trim_significant_figures()
compare.print()

with open(f"{name}_{device}_mul.pickle", 'wb') as f:
    pickle.dump(results, f)
```
Gather results
```python
import pickle
from torch.utils.benchmark import Timer, Compare

files = [
    "PR",
    "master"
]

device = 'cuda'
timers = []
for name in files:
    with open("{}_{}_mul.pickle".format(name, device), 'rb') as f:
        timers += pickle.load(f)

compare = Compare(timers)
compare.trim_significant_figures()
compare.print()
```
CUDA ``` [------------------------------------------------- coo.mul -------------------------------------------------] | PR: mul, device: cuda | master: mul, device: cuda 24 threads: ------------------------------------------------------------------------------------------------- n=10000, nnz=100, coalesce=((True, True)) | 95 | 91 n=10000, nnz=100, coalesce=((True, False)) | 87 | 242 n=10000, nnz=100, coalesce=((False, True)) | 87 | 226 n=10000, nnz=100, coalesce=((False, False)) | 130 | 371 n=100000, nnz=1000, coalesce=((True, True)) | 100 | 521 n=100000, nnz=1000, coalesce=((True, False)) | 90 | 649 n=100000, nnz=1000, coalesce=((False, True)) | 100 | 659 n=100000, nnz=1000, coalesce=((False, False)) | 200 | 781 n=1000000, nnz=10000, coalesce=((True, True)) | 100 | 4861 n=1000000, nnz=10000, coalesce=((True, False)) | 100 | 5012 n=1000000, nnz=10000, coalesce=((False, True)) | 98 | 5010 n=1000000, nnz=10000, coalesce=((False, False)) | 384 | 5174 n=10, nnz=100, coalesce=((True, True)) | 100 | 79 n=10, nnz=100, coalesce=((True, False)) | 100 | 221 n=10, nnz=100, coalesce=((False, True)) | 100 | 221 n=10, nnz=100, coalesce=((False, False)) | 100 | 350 n=10, nnz=1000, coalesce=((True, True)) | 100 | 100 n=10, nnz=1000, coalesce=((True, False)) | 100 | 240 n=10, nnz=1000, coalesce=((False, True)) | 100 | 254 n=10, nnz=1000, coalesce=((False, False)) | 100 | 392 n=10, nnz=10000, coalesce=((True, True)) | 100 | 110 n=10, nnz=10000, coalesce=((True, False)) | 110 | 286 n=10, nnz=10000, coalesce=((False, True)) | 110 | 286 n=10, nnz=10000, coalesce=((False, False)) | 271 | 455 n=100, nnz=1000, coalesce=((True, True)) | 110 | 851 n=100, nnz=1000, coalesce=((True, False)) | 110 | 1000 n=100, nnz=1000, coalesce=((False, True)) | 110 | 990 n=100, nnz=1000, coalesce=((False, False)) | 140 | 1124 n=100, nnz=10000, coalesce=((True, True)) | 110 | 5137 n=100, nnz=10000, coalesce=((True, False)) | 110 | 5391 n=100, nnz=10000, coalesce=((False, True)) | 100 | 5405 n=100, nnz=10000, coalesce=((False, False)) | 249 | 5539 n=1000, nnz=10000, coalesce=((True, True)) | 100 | 8598 n=1000, nnz=10000, coalesce=((True, False)) | 100 | 8800 n=1000, nnz=10000, coalesce=((False, True)) | 100 | 8782 n=1000, nnz=10000, coalesce=((False, False)) | 255 | 8956 n=1000, nnz=100000, coalesce=((True, True)) | 120 | 84500 n=1000, nnz=100000, coalesce=((True, False)) | 200 | 88560 n=1000, nnz=100000, coalesce=((False, True)) | 160 | 89000 n=1000, nnz=100000, coalesce=((False, False)) | 373 | 89000 n=1000, nnz=1000000, coalesce=((True, True)) | 312 | 606400 n=1000, nnz=1000000, coalesce=((True, False)) | 1340 | 609200 n=1000, nnz=1000000, coalesce=((False, True)) | 1340 | 609100 n=1000, nnz=1000000, coalesce=((False, False)) | 4408 | 611400 Times are in microseconds (us). ```
CPU ``` [------------------------------------------------ coo.mul ------------------------------------------------] | PR: mul, device: cpu | master: mul, device: cpu 24 threads: ----------------------------------------------------------------------------------------------- n=10000, nnz=100, coalesce=((True, True)) | 8 | 8 n=10000, nnz=100, coalesce=((True, False)) | 32 | 34 n=10000, nnz=100, coalesce=((False, True)) | 32 | 34 n=10000, nnz=100, coalesce=((False, False)) | 41 | 56 n=100000, nnz=1000, coalesce=((True, True)) | 24 | 24 n=100000, nnz=1000, coalesce=((True, False)) | 90 | 100 n=100000, nnz=1000, coalesce=((False, True)) | 87 | 100 n=100000, nnz=1000, coalesce=((False, False)) | 231 | 255 n=1000000, nnz=10000, coalesce=((True, True)) | 190 | 200 n=1000000, nnz=10000, coalesce=((True, False)) | 908 | 2023 n=1000000, nnz=10000, coalesce=((False, True)) | 800 | 2036 n=1000000, nnz=10000, coalesce=((False, False)) | 3684 | 3989 n=10, nnz=100, coalesce=((True, True)) | 8 | 7 n=10, nnz=100, coalesce=((True, False)) | 34 | 30 n=10, nnz=100, coalesce=((False, True)) | 33 | 30 n=10, nnz=100, coalesce=((False, False)) | 44 | 50 n=10, nnz=1000, coalesce=((True, True)) | 8 | 7 n=10, nnz=1000, coalesce=((True, False)) | 100 | 100 n=10, nnz=1000, coalesce=((False, True)) | 130 | 100 n=10, nnz=1000, coalesce=((False, False)) | 746 | 210 n=10, nnz=10000, coalesce=((True, True)) | 8 | 7 n=10, nnz=10000, coalesce=((True, False)) | 1000 | 1500 n=10, nnz=10000, coalesce=((False, True)) | 1000 | 1510 n=10, nnz=10000, coalesce=((False, False)) | 3063 | 2457 n=100, nnz=1000, coalesce=((True, True)) | 25 | 25 n=100, nnz=1000, coalesce=((True, False)) | 180 | 130 n=100, nnz=1000, coalesce=((False, True)) | 200 | 130 n=100, nnz=1000, coalesce=((False, False)) | 271 | 255 n=100, nnz=10000, coalesce=((True, True)) | 100 | 100 n=100, nnz=10000, coalesce=((True, False)) | 2444 | 2290 n=100, nnz=10000, coalesce=((False, True)) | 2455 | 2357 n=100, nnz=10000, coalesce=((False, False)) | 5316 | 3783 n=1000, nnz=10000, coalesce=((True, True)) | 204 | 211 n=1000, nnz=10000, coalesce=((True, False)) | 2457 | 2480 n=1000, nnz=10000, coalesce=((False, True)) | 2448 | 2539 n=1000, nnz=10000, coalesce=((False, False)) | 3665 | 4801 n=1000, nnz=100000, coalesce=((True, True)) | 2293 | 2374 n=1000, nnz=100000, coalesce=((True, False)) | 9000 | 24620 n=1000, nnz=100000, coalesce=((False, True)) | 8000 | 25080 n=1000, nnz=100000, coalesce=((False, False)) | 26500 | 47650 n=1000, nnz=1000000, coalesce=((True, True)) | 10000 | 13000 n=1000, nnz=1000000, coalesce=((True, False)) | 80000 | 362200 n=1000, nnz=1000000, coalesce=((False, True)) | 78050 | 392600 n=1000, nnz=1000000, coalesce=((False, False)) | 312100 | 766900 Times are in microseconds (us). ```
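For reference, a minimal sketch of the operation being benchmarked above (hypothetical small sizes; the PR optimizes the elementwise `x * y` path for sparse COO inputs, including uncoalesced ones):

```python
import torch

torch.manual_seed(13)
n, nnz = 1000, 100

def gen_sparse(coalesce):
    idx = torch.randint(0, n, (2, nnz))
    val = torch.randn(nnz)
    t = torch.sparse_coo_tensor(idx, val, size=(n, n))
    return t.coalesce() if coalesce else t

x = gen_sparse(coalesce=True)
y = gen_sparse(coalesce=False)  # uncoalesced input, the slow path before this PR

z = x * y  # elementwise sparse * sparse multiply
print(z.is_sparse, z._nnz())
```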
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85336 Approved by: https://github.com/cpuhrsch commit 40d3e55b7d20d03d2da5c94d8f6c12a8c64fdbfa Author: Huy Do Date: Fri Sep 23 23:29:12 2022 +0000 Temporary fix to skip NVIDIA driver installation from RHEL repo (#85569) This is a temporary fix until torchrec and FBGEMM are updated to use PyTorch NVIDIA installation script instead of using the latest driver from RHEL repo. It might take a day or so to finish updating the 2 repos, so I want to have this in place to avoid any issue with NVIDIA driver till then. The driver from RHEL repo `515.65.01` is even newer than what we are using in PyTorch CI `515.57`. So everything should just work with both of them Pull Request resolved: https://github.com/pytorch/pytorch/pull/85569 Approved by: https://github.com/clee2000 commit 4befe45084ace174e7f24a4d2db5ec372f633375 Author: Renfei Chen Date: Fri Sep 23 23:21:54 2022 +0000 [FX] Add one option to maintain the FX graph execution order after splitting_module (#85188) Summary: {F770932209} Given the original execution order and the node dependency relationship (note that the same dependency order could generate multiple execution order, which refers to “Topological Order”), after reunion, we could find the new execution order of the new GraphModule is different from the original one which is not what we want. For example, let’s assume that NewLeaf_1 is EmbeddingLookup (Calling EmbeddingLookup is awaitable, we will keep executing the following nodes rather than waiting for the result until we have to know the lookup result), NewLeaf_4 is the node where we HAVE to get the lookup result to interact with the NewLeaf_3. So NewLeaf_1 will launch a lookup kernel and all2all communication stream to distribute the result to all ranks. In the meantime, we want to keep executing NewLeaf_2 and NewLeaf_3 to avoid meaningless waiting. However, given the new execution order, we have to wait for the lookup kernel and all2all communication to be finished since the next node NewLeaf_4 needs the result, until then we can execute NewLeaf_2, etc. It cannot leverage the advantage of parallel computation and communication stream and will hurt the QPS a lot. So while constructing the GraphModule, we have to change from the topological order to the original order Test Plan: Unit test Not sure how to add tests in FX as there's no TARGETS, so I added in the TorchRec folder Differential Revision: D39567314 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85188 Approved by: https://github.com/SherlockNoMad commit 4f5f2c1a9e09ff09947da240ac3209e9dd13a5a3 Author: George Gensure Date: Fri Sep 23 23:07:01 2022 +0000 Add torch.nested to ovrsource (#85384) Summary: Prevent a build break for ovrsource dependency. Stacked changes will help to prevent further regressions in this target. Test Plan: Build //arvr/projects/codec_avatar/pylab/examples/demos/pica:pica. Without this change, it will fail on linking torch with an undefined symbol. With it, the build will proceed. Differential Revision: D39669887 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85384 Approved by: https://github.com/kit1980 commit 4c01c51266afae57c6d6952c84fff2802d9b2bb9 Author: Edward Z. Yang Date: Fri Sep 23 12:22:13 2022 -0700 Symintifying slice ops (#85196) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85196 Approved by: https://github.com/ezyang commit 604487f239b5d9312106de3e336ea227ea946993 Author: Edward Z. 
Yang Date: Fri Sep 23 12:20:12 2022 -0700 OpInfo for Slice (#85554) This is based on wconstab tests from #84680 Technically, slice is covered by the __getitem__ opinfo, but it is easier to debug/test on a more narrow internal function that only uses this functionality and not other advanced indexing stuff. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85554 Approved by: https://github.com/mruberry, https://github.com/wconstab commit bc6dc8d271d6cf4d0ae381077f59fc7bb7cf024d Author: Kshiteej K Date: Fri Sep 23 21:40:07 2022 +0000 [fix] composite compliance: cumprod, _masked.cumprod, linalg.vander (#85330) Ref: #69991 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85330 Approved by: https://github.com/zou3519 commit 2e81710366279af67f3e05fe00011a99f962e61f Author: andrewor14 Date: Fri Sep 23 06:54:18 2022 -0700 [Quant] Add initial Executorch BackendConfig (#85527) Summary: This commit adds the initial BackendConfig for backends PyTorch lowers to through the Executorch stack. This initial version is only intended to cover the following set of ops: quantized::linear_dynamic, quantized::add, quantized::batch_norm2d, quantized::conv2d.new, quantized::linear, quantized::conv2d_relu.new, aten::relu_, aten::_adaptive_avg_pool2d, aten::_reshape_alias_copy, aten::squeeze.dim, aten::permute For now, the `BackendPatternConfig` for each of these ops is the same as the ones for the corresponding ops in the FBGEMM `BackendConfig`, though this may change in the future. Reviewers: jerryzh168, vkuzo Subscribers: jerryzh168, vkuzo Pull Request resolved: https://github.com/pytorch/pytorch/pull/85527 Approved by: https://github.com/jerryzh168 commit a8074a1a0ba6ec4765e22a26ea37e7e4ac5f3f99 Author: Rohan Varma Date: Thu Sep 22 11:57:41 2022 -0700 [Checkpoint] rename apply_ac_wrapper (#85449) Per title Differential Revision: [D39714855](https://our.internmc.facebook.com/intern/diff/D39714855/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85449 Approved by: https://github.com/awgu commit cc64f64670544f41f4307a44592fc0e4699c3747 Author: Rohan Varma Date: Thu Sep 22 11:57:41 2022 -0700 [Docs] Minor fix to apply_ac doc (#85448) Per title Created from CodeHub with https://fburl.com/edit-in-codehub Differential Revision: [D39714530](https://our.internmc.facebook.com/intern/diff/D39714530/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85448 Approved by: https://github.com/awgu commit a4c94f0739158d2f7fd27f2be59b77f33027e1c7 Author: Sean Ross-Ross Date: Thu Sep 22 14:27:43 2022 -0500 Fix cuda issue with sparse.sampled_addmm (#85194) fixes https://github.com/pytorch/pytorch/issues/85169 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85194 Approved by: https://github.com/amjames, https://github.com/nikitaved commit 49e10c15981b21fc04a10a74ea506b5cbcaf7074 Author: Catherine Lee Date: Fri Sep 23 20:45:20 2022 +0000 [ci] test_ops in parallel, ci tests log to file (#85528) part one of splitting up https://github.com/pytorch/pytorch/pull/84961 into (probably 2) parts contains * logging to file * testing test_ops in parallel Pull Request resolved: https://github.com/pytorch/pytorch/pull/85528 Approved by: https://github.com/huydhn commit 0e582fbfcc8cab66c0265d3fe326e3dc505855d1 Author: jjsjann123 Date: Wed Sep 21 15:03:10 2022 -0700 [NVFuser] Upstream push 0907 (#84626) Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Codegen changes include: - codegen improvement: i. 
improved view support on pointwise and transpose scheduler ii. grouped grid welford added for better outer-norm grid persistence in normalization - misc: i. new composite ops added: variance_mean , arange, ii. fixes misaligned address for transpose scheduler iii. refactor on separation of compilation API from execution API to prepare us for async compilation iv. double type support on expression evaluator v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN Commits that's in this PR from the devel branch: ``` 89330aa23aa804340b2406ab58899d816e3dc3d2 Tensor factories must set the output shape as its input (#1939) b2fd01ea9346712c6d6f623ca6addbc4888d008e arange support (#1933) 56c00fd3922dad7dfc57351ad7d780f0f2f8e4ed Double support on all expression evaluators (#1937) 371f28223e57fe3f6b5e50a0a45177e6a5c0785c Improve trivial reduction merge support (#1931) 1d0c26790e5647920b40d419d26815bbe310b3a6 Test `rand` in a fusion with zero tensor input (#1932) 0dab160fb2177d178eef3148c6a529e0855009e9 Fix softmax bwd sizes. (#1890) ef98f360f6d3e3e1cc662ecb65202d88150f128d Fix a bug (#1936) 63132a0c56508c550084b07fb76a3df865102d00 Propagate permissive mapping information into indexing pass (#1929) b4ac2c88d78078ee4d8b21c4fc51645b5710a282 Map IterationDomains through view operations. (#1919) c0a187a7619d7cf9dc920294e15461791e8d6d4d do not use deprecated functions (#1935) 88de85e758c5e4afb7b6e746573c0d9a53b4cea7 Upstream cherry pick fixes 0811 (#1934) b247dcf7c57dc6ac3f7a799b0a6beb7770536a74 Separate kernel compilation API from kernel execution API (#1914) b34e3b93ee1a8030730c14af3995dd95665af07d Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924) 14a53e6707f43bf760494c238a46386d69830822 Nullary RNGOp (#1892) 3c3c89e638f5172cafb0761f22bacd1fd695eec3 Misc fixes/tuning for transpose scheduler (#1912) 20cf109c8b44d48f61977e35bae94368985144ac Grouped grid welford (#1921) 6cf7eb024c9e53c358cbe56597e117bad56efefd Transpose scheduler small dim sizes better support (#1910) 9341ea9a5bf42f9b14ccad0c94edbc79fc5bb552 Disabled ViewPersistentShmoo sizes that results in NAN (#1922) 057237f66deeea816bb943d802a97c1b7e4414ab Fix CUDA driver error: misaligned address for transpose scheduler (#1918) 3fb3d80339e4f794767a53eb8fdd61e64cf404a2 Add variance_mean function using Welford (#1907) 98febf6aa3b8c6fe4fdfb2864cda9e5d30089262 Remove DisableOption::UnrollWithRng (#1913) ee8ef33a5591b534cf587d347af11e48ba7a15d4 Minor fix for the debug interface of using PTX directly (#1917) 6e8f953351f9dabfd1f991d8431cecb6c2ce684d Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916) 5eefa9a72385f6a4b145680a9dcc52d7e8293763 dopt is only available since nvrtc 11.7 (#1915) 2ec8fc711eafc72451eebf0f5e2a98a38bf3f6ef Kill computeAtBetween (#1911) d0d106a1d9af118d71673173674e875be35d259d Improve view support on pointwise and transpose scheduler (#1906) e71e1ecefe67219846070590bbed54bbc7416b79 Fix name clash of RNG with shared memory (#1904) 3381793a253689abf224febc73fd3fe2a0dbc921 Fix mutator and sameAs for expanded IterDomain (#1902) ``` RUN_TORCHBENCH: nvfuser Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84626 Approved by: https://github.com/malfet commit 52a8be523ce682ce26dd793a4154b668b1f37703 Author: atalman Date: Fri Sep 23 20:28:36 2022 +0000 Adjust retry time for conda upload (#85545) Adjusting retry times for conda upload. 
Refer to this failure: https://github.com/pytorch/pytorch/actions/runs/3110932965/jobs/5043384691

```
Error: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
+ sleep 1
......
Error: ('file osx-arm64/pytorch-1.13.0.dev20220923-py3.9_0.tar.bz2 already exists or being uploaded for package pytorch version 1.13.0.dev20220923. if your previous upload failed, please wait 2 minutes before trying again', 409)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85545 Approved by: https://github.com/datumbox commit 3007093007670c7fcf7ba54b07afadcfe2241d86 Author: atalman Date: Fri Sep 23 20:21:36 2022 +0000 Add new cudnn build for linux only (#85549) Add new cudnn build for linux only. New pypi packages are available only for linux: https://pypi.org/project/nvidia-cudnn-cu11/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/85549 Approved by: https://github.com/malfet commit d83ca9ebff09225d90a5fbae3edd533ebf1cd1aa Author: Nikita Shulga Date: Fri Sep 23 10:14:37 2022 -0700 [CI] Make `cuda-arch-list` a parameter to linux-build (#85523) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85523 Approved by: https://github.com/huydhn commit 108b25db25ef180bfbfaed2347c9b99103aa68ef Author: Edward Z. Yang Date: Fri Sep 23 13:44:42 2022 -0400 Let antoniojkim snoop all symbolic shapes PRs (#85555) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85555 Approved by: https://github.com/qihqi, https://github.com/wconstab commit 4dfaca6fb14d27fb17e498fd39861b26267ab06d Author: Taylor Robie Date: Thu Sep 22 15:00:15 2022 -0700 [Profiler] Clean up Tensor representation (#85161) I want to start using `TensorMetadata` elsewhere in profiler so we have a common representation of Tensor. The main changes in this PR are: 1) Replace raw pointers with strong typedefs and create a custom type caster to handle moving them to Python. 2) Add a `device()` method to handle reassembling type and index. Differential Revision: [D39563965](https://our.internmc.facebook.com/intern/diff/D39563965/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85161 Approved by: https://github.com/chaekit commit e296a82f239b431d97204f0c2891f4c49bef8f6b Author: Taylor Robie Date: Thu Sep 22 15:00:13 2022 -0700 [Profiler] Capture storage data pointer (#84276) This is approximately a re-land of the storage half of https://github.com/pytorch/pytorch/pull/80266 I've directly represented and exposed storage impl rather than using it as a first guess for an ID. (Mostly for testing, which happened to save me as I was initially recording the wrong thing.) Differential Revision: [D39136546](https://our.internmc.facebook.com/intern/diff/D39136546/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84276 Approved by: https://github.com/slgong-fb commit 4615d1bcfa0915a992e7445086ba559ca7441607 Author: Masaki Kozuki Date: Fri Sep 23 18:56:00 2022 +0000 resubmit: [mta] APEX style Fused Adam (#81705) (#85507) This PR implements an APEX style FusedAdam in PyTorch. This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel.
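A minimal usage sketch, assuming the fused optimizer is reached through the `fused=True` flag on `torch.optim.Adam` (as in the original #81705) and that a CUDA device is available; because of `_step_supports_amp_scaling`, `GradScaler.step` can defer gradient unscaling to the fused kernel:

```python
import torch

model = torch.nn.Linear(128, 128).cuda()
# Assumption: the fused implementation is selected via fused=True.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 128, device="cuda")
with torch.cuda.amp.autocast():
    loss = model(x).pow(2).mean()

scaler.scale(loss).backward()
scaler.step(opt)   # unscaling happens inside the fused CUDA kernel
scaler.update()
```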
related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167 possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705 Approved by: https://github.com/ngimel cc @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/85507 Approved by: https://github.com/ngimel commit f1a6f32b72b7c2b73277f89bbf7e7459a400d80a Author: Erjia Guan Date: Fri Sep 23 18:52:52 2022 +0000 [DataLoader] Make distributed lazily initialized & share seed via PG (#85279) Fixes #84492 https://github.com/pytorch/data/issues/772 - Move the logic of distributed sharding from the constructor of DataLoader to the constructor of DataLoaderIterator. This would prevent the Error caused by lazy distributed process initialization - Replace distributed store by process group (`gloo`) to share the random seed because `mpi` backend doesn't provide distributed store. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85279 Approved by: https://github.com/NivekT, https://github.com/VitalyFedyunin commit e3766e9855d4fcf8d7d9b23f5f1f75ded51d8b9e Author: George Qi Date: Fri Sep 23 06:00:53 2022 +0000 [maskedtensor] move __torch_function/dispatch__ functions to a map (#85529) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85529 Approved by: https://github.com/bhosmer commit 7893748900c405ea4ba2a1eb525824a43ad8009a Author: Zain Rizvi Date: Fri Sep 23 18:23:34 2022 +0000 Add instructions on how to merge a PR (#85280) Adding basic instructions for now Pull Request resolved: https://github.com/pytorch/pytorch/pull/85280 Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/janeyx99 commit c7b17d7eb165b2311aac7ed6a9618d2136787f48 Author: Kevin Stephano Date: Fri Sep 23 18:03:35 2022 +0000 Add nvprims `rand_like` support for Dropout (#85077) NM Pull Request resolved: https://github.com/pytorch/pytorch/pull/85077 Approved by: https://github.com/IvanYashchuk, https://github.com/mruberry commit 1e4c88518c5cf9b41b8a652ae2ed1eef6ce6f000 Author: Elias Ellison Date: Thu Sep 22 21:57:25 2022 +0000 Fake tensor refactorings (#85498) The only semantic change is moving the error checking before the dynamic shapes handling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85498 Approved by: https://github.com/ezyang commit d10de31cc833f1defa2cb64fef3c27f657a3dee2 Author: PyTorch MergeBot Date: Fri Sep 23 17:21:43 2022 +0000 Revert "Add FakeCrossRef tests for backwards, Fix Layer Norm Backward Decomp (#85417)" This reverts commit 78afa0cf0ca04ce437ca4b519f07c04e73fe0d4c. Reverted https://github.com/pytorch/pytorch/pull/85417 on behalf of https://github.com/clee2000 due to broke tests on trunk https://hud.pytorch.org/pytorch/pytorch/commit/78afa0cf0ca04ce437ca4b519f07c04e73fe0d4c commit eb570ab7d0fd5df88fccf90cdadc581c722d20ef Author: PyTorch MergeBot Date: Fri Sep 23 17:19:06 2022 +0000 Revert "add amp tests (#85434)" This reverts commit c2f4bbe66918d167401ff5804c6b2d24fc6bda40. 
Reverted https://github.com/pytorch/pytorch/pull/85434 on behalf of https://github.com/clee2000 due to broke rocm and slow tests on trunk https://hud.pytorch.org/pytorch/pytorch/commit/c2f4bbe66918d167401ff5804c6b2d24fc6bda40 commit 3b195fd33e5149daac89fff5e9f9336cdafe004d Author: PyTorch MergeBot Date: Fri Sep 23 17:13:35 2022 +0000 Revert "Turn on aliasing tests for fake backwards, Fix Batch norm running mean/var decomp aliasing (#85471)" This reverts commit 1e92eb806865602be6d9c02a311108c4f88869b2. Reverted https://github.com/pytorch/pytorch/pull/85471 on behalf of https://github.com/clee2000 due to stacked prs https://github.com/pytorch/pytorch/pull/85417 and https://github.com/pytorch/pytorch/pull/85434 broke trunk, reverting this so I can revert the others commit f371b5267deb93caa4413482a5c942d9d14a8c2c Author: Richard Zou Date: Thu Sep 22 12:08:04 2022 -0700 Made max_pool2d_with_indices_backward_cuda contiguify `indices` (#85493) Currently, max_pool2d_with_indices_backward(grad_output, self, ..., indices) (on cuda) assumes that indices has the same suggested memory format as self. This is indeed always true in regular PyTorch: the max_pool2d_with_indices forward pass returns indices with the same suggested memory format as self. However, we'd like to make an argument that always contiguifying indices is good for consistency, has negligible added cost, and is more robust (for Tensor Subclass authors): - the max_pool2d_with_indices_backward implementation for CPU always contiguifies `indices`. Ditto for the max_pool3d_with_indices_backward implementation. - Calling .contiguous() has almost no cost (compared to before) because there is a fast-path that checks the cached memory_format on the TensorImpl. - functorch has trouble writing a batching rule for `max_pool2d_with_indices_backward`. Having it accept `indices` with arbitrary strides helps make it so that vmap doesn't need to special case the batching rule for the strides of `indices`. Test Plan: - Not sure if it's worth writing a separate test case - this PR fixes one of functorch's test cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85493 Approved by: https://github.com/ezyang commit ea72a0991c3422d8f314acdf8b911de42a6b4c1e Author: Erjia Guan Date: Fri Sep 23 16:21:25 2022 +0000 Add support to traverse all python collection objects (#84079) Fixes https://github.com/pytorch/data/issues/752 This PR makes the `traverse` function support more collection data structures from Python. The `getstate_hook` will be invoked after the custom `__getstate__` function. This would guarantee that the `traverse` function will work as long as the `DataPipe` is working properly with multiprocessing.
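As an illustration of what this enables, a small sketch (assuming the helper lives at `torch.utils.data.graph.traverse`, as it did around this release; the `ZipDict` pipe is a hypothetical example that stores its children inside a plain dict):

```python
from torch.utils.data import IterDataPipe
from torch.utils.data.datapipes.iter import IterableWrapper
from torch.utils.data.graph import traverse

class ZipDict(IterDataPipe):
    """Hypothetical pipe whose child pipes live in a Python dict."""
    def __init__(self, **pipes):
        self.pipes = dict(pipes)

    def __iter__(self):
        yield from zip(*self.pipes.values())

dp = ZipDict(a=IterableWrapper(range(3)), b=IterableWrapper("xyz"))
# With this change, traverse() can walk into the dict and find both
# IterableWrapper instances instead of stopping at ZipDict.
print(traverse(dp))
```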
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84079 Approved by: https://github.com/NivekT, https://github.com/VitalyFedyunin commit 1e92eb806865602be6d9c02a311108c4f88869b2 Author: Elias Ellison Date: Fri Sep 23 13:20:15 2022 +0000 Turn on aliasing tests for fake backwards, Fix Batch norm running mean/var decomp aliasing (#85471) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85471 Approved by: https://github.com/ezyang commit c2f4bbe66918d167401ff5804c6b2d24fc6bda40 Author: Elias Ellison Date: Fri Sep 23 13:13:20 2022 +0000 add amp tests (#85434) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85434 Approved by: https://github.com/ngimel commit 78afa0cf0ca04ce437ca4b519f07c04e73fe0d4c Author: Elias Ellison Date: Fri Sep 23 13:13:19 2022 +0000 Add FakeCrossRef tests for backwards, Fix Layer Norm Backward Decomp (#85417) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85417 Approved by: https://github.com/ezyang commit 2e883d4655ce4ad85b1a2af5cf9908c0032549c5 Author: PyTorch MergeBot Date: Fri Sep 23 14:09:29 2022 +0000 Revert "[mkldnn_matmul] enable mkldnn matmul for aarch64 bf16 devices (#83671)" This reverts commit f21e77d9a64b39bb9feb9946d912b7b4952430d6. Reverted https://github.com/pytorch/pytorch/pull/83671 on behalf of https://github.com/dagitses due to breaking meta internal builds where cpuinfo_has_arm_bf16 is not defined commit 034f2b4d231c2ca6fee889198b3a985e48514aa8 Author: andrewor14 Date: Thu Sep 22 16:31:56 2022 -0700 [Quant][fx] Enable FX static quantization for LSTM (#85068) **Summary:** This commit enables the custom module LSTM path for FX graph mode static quantization. This has the same flow as eager mode, which was already previously supported: ``` torch.nn.LSTM | (prepare_fx) v torch.ao.nn.quantizable.LSTM | (convert_fx) v torch.ao.nn.quantized.LSTM ``` The main reason why custom module LSTM is not supported in FX graph mode quantization today is because its inputs and outputs are nested tuples, and existing constructs such as observers, "quantize" nodes, and "dequantize" nodes do not understand how to handle complex structures. Note that the approach taken in this commit is only intended to be a short-term solution highly tailored to the input and output formats of custom module LSTM. In the future, for the longer-term solution, we should design a more general QConfig that allows users to specify complex input and output formats, and enable FX graph mode quantization to understand arbitrary nested structures and automatically infer how to transform the graph accordingly. **Context:** Today, in FX graph mode static quantization, custom modules are assumed to have quantized inputs and quantized outputs, with the exact dtypes derived from the associated QConfig (default quint8). Since custom modules are currently not handled through the reference model flow, their observer replacement logic are a little different from normal operators: ``` input -> custom_module -> output input -> obs0 -> custom_module -> obs1 -> output input -> quant -> quantized_custom_module -> dequant -> output ``` In the last step, input observers are replaced with "quantize" and output observers are replaced with "dequantize", in contrast to other non-custom-module patterns where observers are replaced with "quantize-dequantize" pairs instead. Note that, conceptually, the output observer `obs1` is really just a DeQuantStub, since no observation is actually needed. 
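Before getting to the LSTM-specific handling below, here is a minimal sketch of the prepare/convert flow this enables (assumptions: the default FBGEMM `QConfigMapping`, and that, depending on the release, the `torch.nn.LSTM` to `torch.ao.nn.quantizable.LSTM` custom-module mapping may also need to be supplied via the prepare/convert custom configs; this is not the exact test case from the PR):

```python
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(8, 8, num_layers=1)

    def forward(self, x, hidden):
        out, (h, c) = self.lstm(x, hidden)
        return out, (h, c)

m = M().eval()
example = (torch.randn(5, 2, 8), (torch.zeros(1, 2, 8), torch.zeros(1, 2, 8)))

qconfig_mapping = get_default_qconfig_mapping("fbgemm")
prepared = prepare_fx(m, qconfig_mapping, example_inputs=example)
prepared(*example)                # calibration
quantized = convert_fx(prepared)  # nn.LSTM -> torch.ao.nn.quantized.LSTM
print(quantized)
```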
**Custom module LSTM:** The reason why custom module LSTM cannot be handled in the same way is because, unlike other custom modules, its inputs and outputs are nested tuples instead of single tensors. This is how the existing custom module code would try to handle LSTMs: ``` input -> lstm -> output hidden0 -/ \-> hidden0 hidden1 -/ \-> hidden1 input -> obs0 -> lstm -> obs1 # fails hidden0 -/ # missing observer hidden1 -/ # missing observer ``` However, this fails today because 1) we assume there is only one input to the custom module, and so we never end up quantizing `hidden0` and `hidden1`, and 2) the output observer `obs1` is fed a tuple, which it does not understand how to handle. **Short-term fix:** This commit addresses the above by specifically handling the input and output structures used by custom module LSTM. For the inputs, we manually insert observers for `hidden0` and `hidden1` to ensure all input tensors are quantized. For the outputs, we split the tuple into its internal nodes, attach a DeQuantStub to each node, and recombine these DeQuantStubs according to the original structure. Finally, we must also reroute consumers of the original LSTM tuple (and its internal nodes, e.g. `lstm[0]`) to these DeQuantStubs: ``` input -> lstm -> output -> linear0 hidden0 -/ \-> hidden0 -> linear1 hidden1 -/ \-> hidden1 -> linear2 input -> obs0 -> lstm -> output -> dqstub -> linear0 -> obs3 hidden0 -> obs1 -/ \-> hidden0 -> dqstub -> linear1 -> obs4 hidden1 -> obs2 -/ \-> hidden1 -> dqstub -> linear2 -> obs5 input -> quant -> qlstm -> output -> dequant -> linear0 -> quant -> dequant hidden0 -> quant -/ \-> hidden0 -> dequant -> linear1 -> quant -> dequant hidden1 -> quant -/ \-> hidden1 -> dequant -> linear2 -> quant -> dequant input -> quant -> qlstm -> output -> quantized_linear0 -> dequant hidden0 -> quant -/ \-> hidden0 -> quantized_linear1 -> dequant hidden1 -> quant -/ \-> hidden1 -> quantized_linear2 -> dequant ``` Note that we choose to insert DeQuantStubs here instead of observers because these will ultimately be replaced by "dequantize" nodes. This matches the general custom module behavior, where output observers are replaced only with "dequantize" nodes (as opposed to the normal "quantize-dequantize" pair), since custom module outputs are assumed to already be quantized. Using DeQuantStubs instead of observers also simplifies the "dequantize" insertion logic. In the future, we should use DeQuantStubs in place of output observers for custom modules in general. **Test plan:** python test/test_quantization.py TestQuantizeFx.test_static_lstm python test/test_quantization.py TestQuantizeFx.test_static_lstm_consume_tuple **Reviewers:** jerryzh168, vkuzo **Subscribers:** jerryzh168, vkuzo Pull Request resolved: https://github.com/pytorch/pytorch/pull/85068 Approved by: https://github.com/jerryzh168 commit 71dddec6eac4e518d428d2fdc1324d421f5c8b56 Author: Ryan Spring Date: Fri Sep 23 06:52:38 2022 +0000 Cast grad_input to half when input_dtype is half in _softmax_backward_data aten decomposition (#85497) Fixes #85504 `_softmax_backward_data` and `_log_softmax_backward_data` cast `grad_input` to half when the `input_dtype` is half. When running with amp without the cast, consumer ops can trigger `RuntimeError: expected scalar type Float but found Half`. 
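A sketch of the decomposition-level fix in Python (the reference softmax backward formula with the added cast back to `input_dtype`; illustrative only, not the exact ATen decomposition):

```python
import torch

def softmax_backward_data(grad_output, output, dim, input_dtype):
    # reference softmax backward: out * (dL/dout - sum(dL/dout * out, dim))
    new_grad = grad_output * output
    grad_input = new_grad - output * new_grad.sum(dim, keepdim=True)
    # the fix: cast back to the forward input's dtype (e.g. half under amp)
    # so consumer ops don't see an unexpected float32 gradient
    if grad_input.dtype != input_dtype:
        grad_input = grad_input.to(input_dtype)
    return grad_input

out = torch.softmax(torch.randn(4, 8), dim=-1)   # float32 activations
g = torch.randn(4, 8)                            # float32 incoming grad
print(softmax_backward_data(g, out, -1, torch.half).dtype)  # torch.float16
```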
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/SoftMax.cpp#L70-L83 https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/SoftMax.cpp#L102-L113 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85497 Approved by: https://github.com/ngimel commit b4f9b68225de19f87e9c3c6e148447506763a861 Author: Nikolay Korovaiko Date: Fri Sep 23 04:55:50 2022 +0000 should_check_strides (#85416) This PR ports `should_check_strides` checks from `origin/symbolic-shapes` to `master` as the part of our dynamic shapes landing effort. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85416 Approved by: https://github.com/ezyang commit d5cabf79469b8966f99e10a8f04b3a2f222027df Author: Nikita Shulga Date: Fri Sep 23 04:48:16 2022 +0000 Make functorch compilable with Py-3.11 (#85054) By using compatibility wrappers from [python_compat.h](https://github.com/pytorch/pytorch/blob/master/torch/csrc/utils/python_compat.h) and skipping part of `getname` switch Fixes https://github.com/pytorch/pytorch/issues/85006 Please note that `import torch` right now fails by default on 3.11 with some jit issue, so I think this shouldn't be a really issue for a bit Pull Request resolved: https://github.com/pytorch/pytorch/pull/85054 Approved by: https://github.com/kit1980, https://github.com/zdevito commit 56c0c0af5b529d5fcfbef2bcba2f75ec1487be3a Author: Andrew Gu Date: Thu Sep 22 21:25:16 2022 +0000 [ShardedTensor] Add `is_floating_point` (#85483) This adds `is_floating_point()` support to `ShardedTensor`. This is needed for `ShardedTensor` + FSDP. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85483 Approved by: https://github.com/wanchaol commit c8f78d417b305536d0b1e031afff2f91f2487bd0 Author: Andrew Gu Date: Thu Sep 22 21:25:15 2022 +0000 [ShardedTensor] Add `is_meta` (#85482) This adds `is_meta` support to `ShardedTensor`. This is needed for `ShardedTensor` + FSDP. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85482 Approved by: https://github.com/wanchaol commit 05d0eb2aee495b6caf3de9b756fa0ceca8f1f67b Author: Andrew Gu Date: Thu Sep 22 21:25:15 2022 +0000 [FSDP] Make `_ran_pre_backward_hook` check more robust (#85481) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85481 Approved by: https://github.com/rohan-varma commit 8602873a122ddbe24e7df6f8246f6748abe25a60 Author: PyTorch MergeBot Date: Fri Sep 23 03:40:45 2022 +0000 [vision hash update] update the pinned vision hash (#85522) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85522 Approved by: https://github.com/pytorchbot commit cf0de77c2cfb8843b8ae67e6a6f053e6bf6bb3d9 Author: Andrew Gu Date: Thu Sep 22 17:09:51 2022 +0000 [Easy][FSDP] Simplify `assert` to `p_assert` (#85479) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85479 Approved by: https://github.com/rohan-varma commit 5704c73b56fb5fabb3ee7e51bc03a0b55081d524 Merge: f15886f9a2 6b416bf681 Author: mingfeima Date: Fri Sep 23 10:26:23 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 6b416bf6818ce814b8336113fadca8b05b052b01 Author: mingfeima Date: Fri Sep 23 10:26:23 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit f15886f9a2c1d0ddbae7baa5411cd02908bd556f Merge: e97c609edf 9ae1bd8724 Author: mingfeima Date: Fri Sep 23 10:03:34 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 9ae1bd872438b4814ea8982191a97b6cefe0d815 Author: mingfeima Date: Fri Sep 23 10:03:34 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 8bd4724f04386393d7f361f472e504e5fbb7501a Author: Wei Wang <109318740+weiwangmeta@users.noreply.github.com> Date: Fri Sep 23 01:05:15 2022 +0000 Adding a unit test that would gate PRs and prevent reverts, e.g. #83327 (#85442) PR #83327 slipped through CI, adding this unit test as part of efforts to minimize future reverts Pull Request resolved: https://github.com/pytorch/pytorch/pull/85442 Approved by: https://github.com/Balandat, https://github.com/mehtanirav commit 5f6735ea97c90913f54c00b5eb2fe782b65b257b Author: Rohan Varma Date: Thu Sep 22 12:54:19 2022 -0700 [FSDP] Address comments on previous PR (#85490) Address follow ups on https://github.com/pytorch/pytorch/pull/85223/ Differential Revision: [D39740878](https://our.internmc.facebook.com/intern/diff/D39740878/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85490 Approved by: https://github.com/awgu commit 539076e2c25675dafcef804f115a7979a44fdfdb Author: Ivan Yashchuk Date: Fri Sep 23 00:16:55 2022 +0000 Remove deprecated torch.lstsq (#70980) The time has come to remove deprecated linear algebra related functions. This PR removes `torch.lstsq`. There's a note in `tools/codegen/gen.py` about `lstsq` schema in `native_function.yaml` that I will not remove: https://github.com/pytorch/pytorch/blob/87139d8532c99ff5dbeef1b97948d71793aa7851/tools/codegen/gen.py#L734-L770 cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @Lezcano Pull Request resolved: https://github.com/pytorch/pytorch/pull/70980 Approved by: https://github.com/lezcano, https://github.com/kit1980 commit 6380016bdd6637785e08c9ccb932e83ea46b7a18 Author: Nikita Shulga Date: Fri Sep 23 00:08:23 2022 +0000 Disable decomposition registration on Python-3.11 (#85509) As it is currently broken (probably need few tweaks to AST tree parsing) See https://github.com/pytorch/pytorch/issues/85506 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85509 Approved by: https://github.com/zou3519, https://github.com/soulitzer commit f0869cc8d095c9bdbcaca147ba52857932e7a743 Author: Richard Barnes Date: Thu Sep 22 23:15:10 2022 +0000 Make CUDA exceptions unlikely and isolate C10_CUDA_CHECK body (#85256) This marks CUDA exception checks as unlikely, which might have a positive performance impact. 
It further isolates part of `C10_CUDA_CHECK` into a separate function and file so that code can be made more expressive in subsequent diffs without bloating functions using the check or creating readability issues. Test Plan: Sandcastle Differential Revision: D39619861 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85256 Approved by: https://github.com/ezyang, https://github.com/ngimel commit f0a084f3db544ec7db2f56d29ad9dcaa4619bf5a Author: PyTorch MergeBot Date: Thu Sep 22 23:00:13 2022 +0000 Revert "[Profiler] Make `LibKinetoClient::stop()` directly call `ProfilerStateBase::pop` (#83965)" This reverts commit fdd366541333330387d0b262da8357984e0d311f. Reverted https://github.com/pytorch/pytorch/pull/83965 on behalf of https://github.com/robieta due to broke internal on-demand tracing: S296407 commit 46a6a50f4ef826bc88ef0783765c24e6f0aa28c2 Author: Nikita Shulga Date: Thu Sep 22 22:13:47 2022 +0000 Skip MPS test from generic M1 testsuite (#85500) As there is a separate Run MPS shard right now. See if this reduces the number of crashes Pull Request resolved: https://github.com/pytorch/pytorch/pull/85500 Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/huydhn commit 92a942100a9a061a6274a56b31cd905fc688e622 Author: Catherine Lee Date: Thu Sep 22 22:05:55 2022 +0000 disable circleci jobs b/c they are flaky (#85508) should undo this when they're ok again Pull Request resolved: https://github.com/pytorch/pytorch/pull/85508 Approved by: https://github.com/kit1980, https://github.com/ZainRizvi commit 5e700803c27260d2aaba92c42cf2da7f43ed0d68 Author: Mikayla Gawarecki Date: Thu Sep 22 14:33:04 2022 +0000 Use fallback approach for nested matmul (#85311) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85311 Approved by: https://github.com/cpuhrsch, https://github.com/drisspg commit 63c1f2fef94a0c3a0b0ea87493ce3fb919876153 Author: Mike Iovine Date: Thu Sep 22 20:23:05 2022 +0000 [Static Runtime] Fold linear prepack ops (#85289) Summary: Split `quantized_linear_unpacked_weight_v2` into `linear_prepack` and `quantized_linear` so that the prepacking operation may be eliminated by constant folding. Test Plan: Fixes a huge regression in an internal model:

```
Before 89.6141 ms. 99.0923%. fb::quantized_linear_unpacked_weight_v2 (12 nodes)
After 0.806852 ms. 53.5365%. quantized::linear (12 nodes, out variant) (prepacking eliminated)
```

Differential Revision: D39622530 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85289 Approved by: https://github.com/davidberard98 commit e4899764b2a560b65be9018131bfea7ebdc2cd84 Author: Mike Iovine Date: Thu Sep 22 20:21:52 2022 +0000 [Static Runtime] Fix aten::index_put list conversions (#85298) Summary: Apparently static runtime's list construct return value is always a `GenericList`, so we cannot use the `toOptionalTensorList` method in the general case -- we must convert each item individually. Test Plan: New unit test Differential Revision: D39628979 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85298 Approved by: https://github.com/tenpercent commit bd854588fb927371c319d24d31b659731eddc3bc Author: Alexander Grund Date: Thu Sep 22 19:58:48 2022 +0000 Increase timeout for ProcessGroupGlooTest (#85474) We see spurious failures due to timeouts in `test_allreduce_coalesced_basics` but only when running the whole test suite with `python run_test.py --verbose -i distributed/test_c10d_gloo`. Increasing the timeout to 50s should provide enough leeway to avoid this.
Note that the default for the `_timeout` is 30 minutes. Originally reported in EasyBuild at https://github.com/easybuilders/easybuild-easyconfigs/pull/15137#issuecomment-1073809305 and patch proposed by @casparvl Pull Request resolved: https://github.com/pytorch/pytorch/pull/85474 Approved by: https://github.com/rohan-varma commit e505360eb8c21d713180d3e71add0513cb201581 Author: PyTorch MergeBot Date: Thu Sep 22 19:37:29 2022 +0000 Revert "[mta] APEX style Fused Adam (#81705)" This reverts commit 7a6c4d0c50dd0670d87bc39d53292cf8cb90ca04. Reverted https://github.com/pytorch/pytorch/pull/81705 on behalf of https://github.com/dagitses due to broke internal builds, details to come commit 848437590f41af5d3c9f9bb381106114f70fe572 Author: Richard Zou Date: Thu Sep 22 06:56:40 2022 -0700 Delete functorch's monkeypatching (#85430) By upstreaming functorch's tensor printing logic into PyTorch. There's no way of creating a custom print function for a TensorImpl subclass (as opposed to a torch_dispatch or torch_function tensor subclass, which can just override repr()) right now, so we need to directly interpose inside regular Tensor printing in PyTorch. Monkey patching is bad; users do not expect `import blah` to change something about another library. Fixes https://github.com/pytorch/functorch/issues/900 Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/85430 Approved by: https://github.com/ezyang commit 5e5c31954994274e51c09731ac71a4f824ddb620 Author: Richard Zou Date: Thu Sep 22 06:56:40 2022 -0700 Move functorch python bindings to torch/csrc (#85426) This moves functorch's python bindings to torch/csrc/functorch/init.cpp. Coming next is the torchdim move. I didn't do torchdim yet because moving functorch's python bindings unblocks some other things that I want to do first. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/85426 Approved by: https://github.com/ezyang commit bcf93181a0ca5db75bd038db0d5f7e4cee733db7 Author: Ivan Yashchuk Date: Thu Sep 22 17:40:46 2022 +0000 Remove deprecated torch.matrix_rank (#70981) The time has come to remove deprecated linear algebra related functions. This PR removes `torch.matrix_rank`. cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @Lezcano Pull Request resolved: https://github.com/pytorch/pytorch/pull/70981 Approved by: https://github.com/lezcano, https://github.com/kit1980 commit e34297690787dc89e98989b977b845cb1fa86d1e Author: atalman Date: Thu Sep 22 17:33:59 2022 +0000 Adding conda retry upload to mitigate connection reset errors (#85407) Adding conda retry upload to mitigate connection reset errors Mitigate errors like this: https://github.com/pytorch/pytorch/actions/runs/3095808905/jobs/5012840560 ``` Uploading file "pytorch-nightly/pytorch/1.13.0.dev20220921/linux-64/pytorch-1.13.0.dev20220921-py3.9_cuda11.6_cudnn8.3.2_0.tar.bz2" 0%| | 0.00/1.24G [00:00 Date: Thu Sep 22 16:30:16 2022 +0000 Exposing native _scaled_dot_product_attention to torch.nn (#85044) This exposes the _scaled_dot_product_attention function to python in the nn namespace. It is still underscored because the api for args, and kwargs is still in flux for the next few weeks and will eventually land as a prototype feature. 
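A usage sketch, with the caveat from the commit itself that argument names and the return layout were still in flux at the time (the positional `(query, key, value)` call below is an assumption about the prototype API):

```python
import torch
import torch.nn.functional as F

q = torch.randn(2, 4, 16, 8)  # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 4, 16, 8)
v = torch.randn(2, 4, 16, 8)

# Underscored prototype entry point exposed by this PR; the exact
# signature/return value was expected to change before stabilization.
result = F._scaled_dot_product_attention(q, k, v)
print(type(result))
```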
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85044 Approved by: https://github.com/cpuhrsch commit 0d04e5489855dbdc8fd83bb2a6a3f13ba7017f63 Author: Nikita Shulga Date: Thu Sep 22 16:12:51 2022 +0000 [GHF] Create "Core Reviewers" group (#85477) And add @mruberry and @lezcano to it Pull Request resolved: https://github.com/pytorch/pytorch/pull/85477 Approved by: https://github.com/albanD commit e227e3ec4897d2ce04de705423afefa028d82b34 Author: atalman Date: Thu Sep 22 16:01:38 2022 +0000 Disabling the pypi cudnn wheel from uploading temporarily (#85470) Disabling the pypi cudnn wheel from uploading Temporary change untill the cudnn wheel package is ready for release This mitigates following issue: https://github.com/pytorch/vision/issues/6628 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85470 Approved by: https://github.com/malfet commit 5043457a8ed07e06961c3b92579b856ed2bc9f6f Author: PyTorch MergeBot Date: Thu Sep 22 15:44:38 2022 +0000 Revert "Add FakeCrossRef tests for backwards, Fix Layer Norm Backward Decomp (#85417)" This reverts commit 9c77083965e1283763a83f72a3adf299281761e3. Reverted https://github.com/pytorch/pytorch/pull/85417 on behalf of https://github.com/clee2000 due to broke tests on trunk (and pull somehow) https://hud.pytorch.org/pytorch/pytorch/commit/9c77083965e1283763a83f72a3adf299281761e3 commit 9baf6770bcd67272a2cb9212c49e3bb95f0679c3 Author: Edward Z. Yang Date: Wed Sep 21 07:00:52 2022 -0700 Apply new symbolic shape strategy to make_fx symbolic mode (#85260) This results in some test wobbling, which looks legit. I also added some debug helpers for stuff that I found useful while working on this. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85260 Approved by: https://github.com/albanD commit 2f50d2f685db0cc52c52577b25c935970d96b99e Author: Justin Chu Date: Thu Sep 22 03:52:37 2022 +0000 [ONNX] Update docs on symbolic registration (#85290) - Move inline instructions on editing symbolic functions to the README - Add a line on using the symbolic function registration decorator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85290 Approved by: https://github.com/BowenBao commit 9c77083965e1283763a83f72a3adf299281761e3 Author: Elias Ellison Date: Wed Sep 21 21:24:39 2022 +0000 Add FakeCrossRef tests for backwards, Fix Layer Norm Backward Decomp (#85417) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85417 Approved by: https://github.com/ezyang commit 61b4e8a7bfb69954680013e2e34fc099db900736 Author: Edward Z. Yang Date: Wed Sep 21 21:53:23 2022 -0700 More SymFloat support (#85411) - Support storing SymFloat in IValue - Add SymFloat to JIT type system (erases to float) - Printing support for SymFloat - add/sub/mul/truediv operator support for SymFloat - Support truediv on integers, it returns a SymFloat - Support parsing SymFloat from Python object Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85411 Approved by: https://github.com/albanD commit 0c46e3ec6684ad64ce4ef54f07a886ef67bad924 Author: George Qi Date: Wed Sep 21 18:10:13 2022 +0000 [maskedtensor] add basic tests and unary/binary/reduction tests from common_method_invocations (#82841) Decided offline on the invariant that: `masked_tensor` calls `MaskedTensor()`, which is analogous to `torch.tensor` `as_masked_tensor` calls `MaskedTensor._from_values()`, which is analogous to `torch.as_tensor` Pull Request resolved: https://github.com/pytorch/pytorch/pull/82841 Approved by: https://github.com/cpuhrsch, https://github.com/bhosmer commit 2bc82163eb70341b9e644689f54a6c3e0fafdc92 Author: Eddie Yan Date: Thu Sep 22 07:34:45 2022 +0000 Reduce memory usage requirement of test_warp_softmax_64bit_indexing in test_nn.py (re-open of #85037) (#85373) CC @ngimel @xwang233 @ptrblck Adds fix for `get_tolerances`, tested locally on a dgx Volta. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85373 Approved by: https://github.com/ngimel commit e97c609edf89467e7c27f450505a581ff02c4055 Merge: 0ee1ee3159 48c34a9d00 Author: mingfeima Date: Thu Sep 22 15:08:44 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 48c34a9d00c38c5e5cc87b1481c2e8c0e818ab28 Author: mingfeima Date: Thu Sep 22 15:08:44 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 76d60778eb01b4213735be1c6e126fe2da519b8e Author: Justin Chu Date: Thu Sep 22 03:52:37 2022 +0000 [ONNX] Use decorators for symbolic function registration (#84448) This is the 4th PR in the series of #83787. It enables the use of `@onnx_symbolic` across `torch.onnx`. - **Backward breaking**: Removed some symbolic functions from `__all__` because of the use of `@onnx_symbolic` for registering the same function on multiple aten names. - Decorate all symbolic functions with `@onnx_symbolic` - Move Quantized and Prim ops out from classes to functions defined in the modules. Eliminate the need for `isfunction` checking, speeding up the registration process by 60%. - Remove the outdated unit test `test_symbolic_opset9.py` - Symbolic function registration moved from the first call to `_run_symbolic_function` to init time. 
- Registration is fast: ![image](https://user-images.githubusercontent.com/11205048/189164959-f3fca173-19bc-4682-b150-f13a586387bf.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84448 Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao commit 0ee1ee3159382ec49211d4276e760dd7e9581a5c Merge: cad2d77de3 ec4448e95a Author: mingfeima Date: Thu Sep 22 14:10:02 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit ec4448e95a1d6db57e7850898d4f3e7605b795c3 Author: mingfeima Date: Thu Sep 22 14:10:02 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit cad2d77de3b1858a93a1faab40dd00c25253de5d Merge: 75bfbc35ca 23ab87f96c Author: mingfeima Date: Thu Sep 22 13:56:40 2022 +0800 Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit 23ab87f96cb4a0963eb6521a23217d6f22d07377 Merge: 97c4f58755 08f413bd6a Author: mingfeima Date: Thu Sep 22 13:56:40 2022 +0800 Update base for Update on "port spmm_sum to pytorch and optimize it on CPU" [ghstack-poisoned] commit c7c2578f93fbfad5f769543848642a16b6071756 Author: Huy Do Date: Thu Sep 22 03:33:30 2022 +0000 Skip NVIDIA driver installation if it's already there (#85435) Address flaky failures such as https://github.com/pytorch/pytorch/actions/runs/3099236524/jobs/5018444060 in which NVIDIA driver has already been installed. The installation will be skipped if the same driver has already been installed. I also move NVIDIA driver installation before the installation of docker NVIDIA support to avoid any funny business with the latter interfering with the installation. * Run `.github/scripts/install_nvidia_utils_linux.sh` manually with an existing but different NVIDIA driver installed (515.65.01) ``` == Installing nvidia driver NVIDIA-Linux-x86_64-515.57.run == + HAS_NVIDIA_DRIVER=0 ++ command -v nvidia-smi + '[' -x /usr/bin/nvidia-smi ']' ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader + INSTALLED_DRIVER_VERSION=515.65.01 + '[' 515.65.01 '!=' 515.57 ']' + echo 'NVIDIA driver (515.65.01) has been installed, but we expect to have 515.57 instead. Continuing with NVIDIA driver installation' NVIDIA driver (515.65.01) has been installed, but we expect to have 515.57 instead. Continuing with NVIDIA driver installation + '[' 0 -eq 0 ']' + sudo yum groupinstall -y 'Development Tools' Loaded plugins: dkms-build-requires, extras_suggestions, langpacks, priorities, update-motd Maybe run: yum groups mark install (see man yum) No packages in any requested group available to install or update ++ uname -r + sudo yum install -y 'kernel-devel-uname-r == 4.14.290-217.505.amzn2.x86_64' Loaded plugins: dkms-build-requires, extras_suggestions, langpacks, priorities, update-motd Package kernel-devel-4.14.290-217.505.amzn2.x86_64 already installed and latest version Nothing to do + sudo modprobe backlight + sudo curl -fsL -o /tmp/nvidia_driver https://s3.amazonaws.com/ossci-linux/nvidia_driver/NVIDIA-Linux-x86_64-515.57.run + sudo /bin/bash /tmp/nvidia_driver -s --no-drm ... 
``` * Run `.github/scripts/install_nvidia_utils_linux.sh` manually with the same NVIDIA driver installed (515.57) ``` == Installing nvidia driver NVIDIA-Linux-x86_64-515.57.run == + HAS_NVIDIA_DRIVER=0 ++ command -v nvidia-smi + '[' -x /usr/bin/nvidia-smi ']' ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader + INSTALLED_DRIVER_VERSION=515.57 + '[' 515.57 '!=' 515.57 ']' + HAS_NVIDIA_DRIVER=1 + echo 'NVIDIA driver (515.57) has already been installed. Skipping NVIDIA driver installation' NVIDIA driver (515.57) has already been installed. Skipping NVIDIA driver installation + '[' 1 -eq 0 ']' + nvidia-smi ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85435 Approved by: https://github.com/seemethere, https://github.com/malfet commit 99ad8a304898de8bf1e20a6fc12e335e9b7c5064 Author: PyTorch MergeBot Date: Thu Sep 22 03:12:46 2022 +0000 [vision hash update] update the pinned vision hash (#85451) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85451 Approved by: https://github.com/pytorchbot commit 34296e2f4c99841d5fe1d8043299f07923106a8d Author: Sherlock Huang Date: Wed Sep 21 22:18:37 2022 +0000 SubgraphMatcher remove invalid matches (#85444) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85444 Approved by: https://github.com/rkindi commit 4523ac7aa10cc5a6a5d93c3469c353d581a818be Author: Jerry Zhang Date: Tue Sep 20 18:34:56 2022 -0700 [quant][docs][ez] Fix formatting for qconfig_mapping (#85306) Summary: att Test Plan: visual inspection of generated docs Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/85306 Approved by: https://github.com/vkuzo, https://github.com/andrewor14 commit f21e77d9a64b39bb9feb9946d912b7b4952430d6 Author: Sunita Nadampalli Date: Thu Sep 22 00:54:59 2022 +0000 [mkldnn_matmul] enable mkldnn matmul for aarch64 bf16 devices (#83671) this PR enables mkldnn matmul for aarch64 bf16 devices for both bf16 as well as fp32 input. This PR is dependent on cpuinfo commit update PR: https://github.com/pytorch/pytorch/pull/83620 Issue: https://github.com/pytorch/pytorch/issues/83594 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83671 Approved by: https://github.com/malfet commit 26a861cb27313f86fd1c7eb1348b577a0a4f0784 Author: Zafar Date: Thu Sep 22 00:50:49 2022 +0000 [quant] Check if cuda quantizing while on qnnpack engine (#85423) Although not possible in practice, while running the tests it is possible to try to quantize a CUDA tensor while quantization engine is set to QNNPACK. This would override the memory allocator from CUDA to MobileCPU, which would cause the new quantized tensors to be allocated on a CPU (instead of CUDA). Please, note that this is not a realistic scenario, as the qnnpack quantization engine is only "emulated" during the tests. When running on a real mobile CPU we don't expect a CUDA to be present. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85423 Approved by: https://github.com/jerryzh168 commit 56a41b5998a28566984fee70e1dc9604896bd180 Author: kshitij12345 Date: Thu Sep 22 00:21:11 2022 +0000 [composite compliance] ctc_loss (#84752) I have mixed feelings about adding new (private) operators. Backends writers will have to override them as well. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84752 Approved by: https://github.com/zou3519 commit 1910c5847edbf2f92debc8f73fc7d9056b9fd9a0 Author: Catherine Lee Date: Thu Sep 22 00:07:00 2022 +0000 rebase and merge - add sleep (#85420) not a fan of this solution, so if anyone has better ideas please tell me. Add a sleep between the tryrebase.py and trymerge.py scripts so that github has time to get the push and start workflows, and so that we don't get weird event orders like https://github.com/pytorch/pytorch/pull/85267 where the push from the rebase looks like it's after the merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/85420 Approved by: https://github.com/malfet, https://github.com/ZainRizvi commit caa0ab557dd10e04ca413c1508f76ec8ae5adea3 Author: PyTorch MergeBot Date: Wed Sep 21 22:55:40 2022 +0000 Revert "Use fallback approach for nested matmul (#85311)" This reverts commit 7c31f6e67213cbe773b0e2556f880f6ce2449fc3. Reverted https://github.com/pytorch/pytorch/pull/85311 on behalf of https://github.com/clee2000 due to broke lots of builds https://hud.pytorch.org/pytorch/pytorch/commit/7c31f6e67213cbe773b0e2556f880f6ce2449fc3 even though the pr was green commit 0336308be5c2d019b99ed5fb59ec1bf01f735a99 Author: Zafar Date: Wed Sep 21 22:46:25 2022 +0000 [AO] Callable norm function for sparsifier (#85236) The `WeightNormSparsifier` currently only supports L2-norm. This allows the users to specify the function that is applied to compute the norm. In addition, L1-norm is also added, as an `.abs` function. - The functions that are referred to as "norms" are not strictly such. For example, L2-norm of `x` is computed as `F.avg_pool(x * x, ...)`. Similarly, L1-norm of `x` is computed as `F.avg_pool(x.abs(), ...)`. - When passing callable functions for the norm, the above assumption must hold: `F.avg_pool(norm_fn(x), ...)` will be applied. ```python >>> # L3-norm >>> l3 = lambda T: T * T * T >>> sparsifier = WeightNormSparsifier(norm=l3) >>> >>> # L0-norm >>> l0 = lambda T: torch.logical_or(torch.zeros(T.shape), T != 0).to(T.dtype) >>> sparsifier = WeightNormSparsifier(norm=l0) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85236 Approved by: https://github.com/jcaip commit 6fb182c86b0bcf814ec4b5ece0fe1ffa8abcfbb6 Author: foram-chandra <96388449+foram-chandra@users.noreply.github.com> Date: Wed Sep 21 22:43:52 2022 +0000 [doc] document pin_memory for randn (#85219) Fixes #85123 cc: @mruberry @kshitij12345 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85219 Approved by: https://github.com/mruberry commit 7c31f6e67213cbe773b0e2556f880f6ce2449fc3 Author: Mikayla Gawarecki Date: Wed Sep 21 16:31:27 2022 +0000 Use fallback approach for nested matmul (#85311) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85311 Approved by: https://github.com/cpuhrsch, https://github.com/drisspg commit 5aa84c16dbb9640da738866f0d52f1dd0d285f77 Author: Sourav Mandal Date: Wed Sep 21 22:17:46 2022 +0000 [pytorch] cuBLAS addmm malfunction test (#85432) Summary: Re-submit for approved PR that was then reverted: https://github.com/pytorch/pytorch/pull/85084 Create unit test to detect cuBLAS breakage via large differences between CPU and GPU addmm invocations Test Plan: Sample unit test output -- [...] test_cublas_addmm_size_10000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_10000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_10000_cpu_float32 (test_linalg.TestLinalgCPU) ...
ok test_cublas_addmm_size_1000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_1000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_1000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_100_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_100_cpu_float16 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_100_cpu_float32 (test_linalg.TestLinalgCPU) ... ok [...] Reviewed By: mikekgfb Differential Revision: D39433029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85432 Approved by: https://github.com/zrphercule commit 9c127986bfa5bf8759eb88eb2c77f2de7ad001ba Author: Zain Rizvi Date: Wed Sep 21 21:34:56 2022 +0000 Fix labeling detection bug (#85429) Fixes a bug where if a PR is pre-labeled with both a `release notes:` label and a `topic:` label then our bot still pings on the PR, erroneously asking for those labels to be added &-ing sets computes the set intersection, which isn't what was desired here Pull Request resolved: https://github.com/pytorch/pytorch/pull/85429 Approved by: https://github.com/janeyx99 commit 3dce26635f1bbab4bc96801e3cfd7ce728ba78b9 Author: PyTorch MergeBot Date: Wed Sep 21 20:21:25 2022 +0000 Revert "test in parallel at file granularity (#84961)" This reverts commit 8107666c6a1c25e96762a31296cace9ed343aaf6. Reverted https://github.com/pytorch/pytorch/pull/84961 on behalf of https://github.com/clee2000 due to makes test_forward_ad_nn_functional_max_unpool2d_cuda_float32 flakily unexpectedly pass commit 0278a141fc9723e94506d36c40c995aa77fcc00b Author: nikitaved Date: Wed Sep 21 20:10:24 2022 +0000 csr <-> csc, csc <-> csc, bsr <-> bsc, bsc <-> bsc, bsr <-> bsr conversions (#85091) As per title. Required to enable a wider selection of sparse formats for `nn.functional.linear`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85091 Approved by: https://github.com/amjames, https://github.com/cpuhrsch commit a2cbbbd46ffff8c43d8708fb7ef718bb4fdfaa87 Author: Edward Z. Yang Date: Wed Sep 21 11:24:23 2022 -0400 Improve SymInt print and move to correct file (#85413) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85413 Approved by: https://github.com/wconstab commit 3a09a1e8f01ee85f0854eba9acb5e049b2c1545e Author: foram-chandra Date: Wed Sep 21 19:31:56 2022 +0000 [doc] remove out argument from squeeze (#85222) Fixes #83972 cc- @ngimel @kshitij12345 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85222 Approved by: https://github.com/ezyang commit 9a81da7ad1a57e0f6b17948872d7c0d08495ae91 Author: Peter Bell Date: Wed Sep 21 19:23:49 2022 +0000 Update NCCL to current master and remove patch step (#85367) The patch from #84245 has been upstreamed into NCCL, so the patch step is no longer required. 
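For the sparse layout-conversion commit above (#85091), a minimal sketch of the kind of round-trip those conversions enable, assuming the `to_sparse_csr`/`to_sparse_csc` Tensor methods are available in this build:

```python
import torch

# Toy matrix; the point is only the layout round-trip.
dense = torch.tensor([[0., 1.], [2., 0.]])

csr = dense.to_sparse_csr()   # compressed sparse row
csc = csr.to_sparse_csc()     # converted to compressed sparse column

print(csr.layout, csc.layout)              # torch.sparse_csr torch.sparse_csc
print(torch.equal(csc.to_dense(), dense))  # True: values survive the conversion
```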
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85367 Approved by: https://github.com/ezyang commit d5adf8151af7b1b1126ce4ae3d1bf140d0515485 Author: jiahongyu Date: Wed Sep 21 19:20:30 2022 +0000 [PolishTypo] inherentely->inherently, intentially->intentionally (#85325) Polish comment typo, `inherentely->inherently`, `intentially->intentionally` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85325 Approved by: https://github.com/ezyang commit 764cba6848e5b239d102773ec45080dc0729c9e4 Author: Thomas Viehmann Date: Wed Sep 21 18:53:34 2022 +0000 add Python ref for isreal (#85361) Dipping my toes into prims waters Pull Request resolved: https://github.com/pytorch/pytorch/pull/85361 Approved by: https://github.com/IvanYashchuk, https://github.com/mruberry commit 77f1f98479b204b2d0151d9cf3700f99915b9d50 Author: Mikayla Gawarecki Date: Tue Sep 20 15:13:49 2022 +0000 Re-introduce `torch.Tensor.to_padded_tensor` (#85293) Differential Revision: [D39629004](https://our.internmc.facebook.com/intern/diff/D39629004) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85293 Approved by: https://github.com/cpuhrsch commit 8e1ae1c19d7096e72d5e095a7c5de0acf05c5fbf Author: Edward Z. Yang Date: Wed Sep 21 13:53:02 2022 -0400 Add Krovatkin to symbolic-shapes team (#85422) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85422 Approved by: https://github.com/wconstab commit 25a5ada426a238461db39386cf39f6452b73a4b9 Author: Edward Z. Yang Date: Wed Sep 21 13:51:16 2022 -0400 Typofix (#85421) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85421 Approved by: https://github.com/wconstab, https://github.com/malfet commit 35943f30cbe99276847e3b04704a66f0318f0083 Author: Ivan Yashchuk Date: Wed Sep 21 18:12:52 2022 +0000 Reference implementation for torch.Tensor.sum_to_size (#85338) New ref: `torch._refs.sum_to_size`. View consistency validation is disabled because the ref returns a view instead of returning the input. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85338 Approved by: https://github.com/mruberry commit 85073b8ddceb3705e333adfb08d3f1ba039a0370 Author: anjali411 Date: Wed Sep 21 09:45:09 2022 +0000 Add __all__ to fx, fistributed and cuda submodules (#85080) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85080 Approved by: https://github.com/albanD commit 0217a8d049ec54d303ef39776cf28bc80954b8e1 Author: PyTorch MergeBot Date: Wed Sep 21 18:02:50 2022 +0000 Revert "[fix] composite compliance: cumprod, _masked.cumprod, linalg.vander (#85330)" This reverts commit d3dec8097b847fc46755ef06ea6ff90eebc846eb. Reverted https://github.com/pytorch/pytorch/pull/85330 on behalf of https://github.com/dagitses due to a PR this is based on got reverted, rebase and reland commit 0ac6311356d21d052d2ca070b6f81706339deafb Author: PyTorch MergeBot Date: Wed Sep 21 17:57:49 2022 +0000 Revert "[CUBLAS][CUDA GRAPHS] (re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85292)" This reverts commit 4012e623e8689b873cae94590766d990d155017c. Reverted https://github.com/pytorch/pytorch/pull/85292 on behalf of https://github.com/dagitses due to broke an internal test during shutdown. Re-submit with #85399 in stack commit 0e194b32192298f411a47bb28b6a62b194a211b0 Author: Edward Z. 
Yang Date: Wed Sep 21 17:44:40 2022 +0000 Add Auto Request Review for reviewer assignment (#85414) I want this specifically for dynamic shapes work, but you can feel free to use it for your own needs too. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85414 Approved by: https://github.com/malfet commit 2a88f1b2d86a1fdc7380db768e67b18c24d199c4 Author: lezcano Date: Wed Sep 21 08:14:48 2022 +0000 Land "Make ceil,floor,round,trunc handle integers" (#85144) PR to land https://github.com/pytorch/pytorch/pull/78480, as Rohit does not work in the PyTorch project anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/85144 Approved by: https://github.com/ngimel, https://github.com/mruberry commit 6f2b390fc1d6910876233663a27a4c89d9e486f2 Author: Xu Zhao Date: Wed Sep 21 17:17:46 2022 +0000 Fix import of instruction count benchmark (#85359) This PR fixes the instruction count benchmark 1. Fix the updated import path 2. Allows building the benchmark with less compiler options (remove all "-W" options) Test plan: ``` BENCHMARK_USE_DEV_SHM=1 python main.py --mode ci ``` Manually tested and worked on the CI machine. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85359 Approved by: https://github.com/robieta commit d9aa6dfe886597f2c6fb9d9b0582e669465fa28d Author: Elias Ellison Date: Wed Sep 21 01:47:27 2022 +0000 Add Fake Cross Ref Mode, migrate sparse to it (#85382) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85382 Approved by: https://github.com/ezyang commit 8107666c6a1c25e96762a31296cace9ed343aaf6 Author: Catherine Lee Date: Wed Sep 21 16:58:11 2022 +0000 test in parallel at file granularity (#84961) run tests in parallel at the test file granularity runs 3 files in parallel using multiprocessing pool, output goes to a file, which is then printed when the test finishes. Some tests cannot be run in parallel (usually due to lacking memory), so we run those after. Sharding is changed to attempt to mask large files with other large files/run them on the same shard. test_ops* gets a custom handler to run it because it is simply too big (2hrs on windows) and linalg_cholesky fails (I would really like a solution to this if possible, but until then we use the custom handler). reduces cuda tests by a lot, reduces total windows test time by ~1hr Ref. https://github.com/pytorch/pytorch/issues/82894 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84961 Approved by: https://github.com/huydhn commit 2fb820455cc7b6d8f67e303098ffbcf4aac791f8 Author: PyTorch MergeBot Date: Wed Sep 21 16:48:55 2022 +0000 Revert "[pytorch] cuBLAS addmm malfunction test (#85084)" This reverts commit 0297c75c141103cc780c88bfe9749c460690bf58. Reverted https://github.com/pytorch/pytorch/pull/85084 on behalf of https://github.com/clee2000 due to broke tests on trunk, https://github.com/pytorch/pytorch/actions/runs/3098347639/jobs/5017166419 commit eb94df28c748bff6a55c06e9b3440a525ea1f867 Author: atalman Date: Wed Sep 21 16:30:25 2022 +0000 Use pip install cu117 (#85097) Creates new wheel workflow specific to CUDA 11.7 that does not bundle the cudnn and cublas. Workflow: https://github.com/pytorch/pytorch/actions/runs/3094622781 New Package: manywheel-py3_10-cuda11_7-with-pypi-cudnn | 843 MB Old Package: manywheel-py3_10-cuda11_7 | 1.65 GB Testing workflow: [manywheel-py3_7-cuda11_7-with-pypi-cudnn-build / build](https://github.com/pytorch/pytorch/actions/runs/3091145546/jobs/5000867662#logs): ``` Bundling without cudnn and cublas. 
+ DEPS_LIST=("/usr/local/cuda/lib64/libcudart.so.11.0" "/usr/local/cuda/lib64/libnvToolsExt.so.1" "/usr/local/cuda/lib64/libnvrtc.so.11.2" "/usr/local/cuda/lib64/libnvrtc-builtins.so.11.7" "$LIBGOMP_PATH") + DEPS_SONAME=("libcudart.so.11.0" "libnvToolsExt.so.1" "libnvrtc.so.11.2" "libnvrtc-builtins.so.11.7" "libgomp.so.1") ..... pytorch_extra_install_requirements: nvidia-cuda-runtime-cu11, nvidia-cudnn-cu11, nvidia-cublas-cu11 ``` [manywheel-py3_7-cuda11_7-build / build](https://github.com/pytorch/pytorch/actions/runs/3091145546/jobs/5000863250#logs) ``` Bundling with cudnn and cublas. + DEPS_LIST=("/usr/local/cuda/lib64/libcudart.so.11.0" "/usr/local/cuda/lib64/libnvToolsExt.so.1" "/usr/local/cuda/lib64/libnvrtc.so.11.2" "/usr/local/cuda/lib64/libnvrtc-builtins.so.11.7" "/usr/local/cuda/lib64/libcudnn_adv_infer.so.8" "/usr/local/cuda/lib64/libcudnn_adv_train.so.8" "/usr/local/cuda/lib64/libcudnn_cnn_infer.so.8" "/usr/local/cuda/lib64/libcudnn_cnn_train.so.8" "/usr/local/cuda/lib64/libcudnn_ops_infer.so.8" "/usr/local/cuda/lib64/libcudnn_ops_train.so.8" "/usr/local/cuda/lib64/libcudnn.so.8" "/usr/local/cuda/lib64/libcublas.so.11" "/usr/local/cuda/lib64/libcublasLt.so.11" "$LIBGOMP_PATH") + DEPS_SONAME=("libcudart.so.11.0" "libnvToolsExt.so.1" "libnvrtc.so.11.2" "libnvrtc-builtins.so.11.7" "libcudnn_adv_infer.so.8" "libcudnn_adv_train.so.8" "libcudnn_cnn_infer.so.8" "libcudnn_cnn_train.so.8" "libcudnn_ops_infer.so.8" "libcudnn_ops_train.so.8" "libcudnn.so.8" "libcublas.so.11" "libcublasLt.so.11" "libgomp.so.1") ``` cc: @malfet @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/85097 Approved by: https://github.com/malfet commit 90b64e231e6fc327a96ad78bbb4306f69bbf1406 Author: Jithun Nair Date: Wed Sep 21 16:22:12 2022 +0000 Update hipification logic for all ROCm headers (#85320) ...to remove deprecation warnings. Remove component-specific include dirs from include path Pull Request resolved: https://github.com/pytorch/pytorch/pull/85320 Approved by: https://github.com/kit1980 commit 2c285f3e9b83cca3cc3f08b723cc9e46b34c1ccd Author: Jerry Zhang Date: Wed Sep 21 05:13:20 2022 +0000 [quant][docs] README for FX Graph Mode Quantization (#85070) Summary: This is a developer-oriented design doc/README for FX Graph Mode Quantization, the goal for the doc is for new developers of FX Graph Mode Quantization to get familiarized with the high level algorithm of FX Graph Mode Quantization and ramp up quickly Test Plan: no test needed Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/85070 Approved by: https://github.com/vkuzo commit 5fa104a76c092d3d2259794f686e00285d8a3e46 Author: Richard Zou Date: Wed Sep 21 15:50:44 2022 +0000 Move functorch C++ into aten/src/ATen/functorch (#85381) This PR moves functorch C++ code that does not depend on python into aten/src/ATen/functorch. The C++ code that does depend on python (the python bindings as well as torchdim) will go into torch/csrc/functorch, to come later (see https://github.com/pytorch/pytorch/pull/85263 for initial attempt). Pull Request resolved: https://github.com/pytorch/pytorch/pull/85381 Approved by: https://github.com/ezyang commit 125e9256f44d89cd20acbb6954e5f909ddc6da1e Author: Andrew Gu Date: Wed Sep 21 02:16:05 2022 +0000 [FSDP] Add back `forward_prefetch` (#85177) - This implements explicit forward prefetching following the static 1st iteration's pre-forward order when `forward_prefetch=True` in the FSDP constructor. 
- This has the same unit test coverage as the original `forward_prefetch`. - I checked via print statements that the prefetches are happening, but since I cannot get a good CPU bound workload, it is hard to tell via traces that the prefetch is working. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85177 Approved by: https://github.com/zhaojuanmao commit 977f8fce3cf642f8514eb9e79576d9784eb30b04 Author: Andrew Gu Date: Wed Sep 21 02:16:05 2022 +0000 [FSDP] Simplify backward prefetch implementation (#85176) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85176 Approved by: https://github.com/zhaojuanmao commit 563b065f5a4b4055fa6b025c2514b566d5fd9439 Author: Khushi Agrawal Date: Wed Sep 21 13:57:16 2022 +0000 [fix] rrelu, rrelu_, & RReLU when lower bound > upper bound (#84996) Fixes #83160 cc @kshitij12345 @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/84996 Approved by: https://github.com/mruberry, https://github.com/albanD commit de0f3c4200c17e156e632eef266fd6f27e482127 Author: lezcano Date: Wed Sep 21 09:12:58 2022 +0000 Change Lezcano to lezcano (#85396) I changed my handle to lezcano (no caps) as there are always issues with capital letters when automating stuff. The last issue was https://github.com/pytorch/test-infra/pull/751 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85396 Approved by: https://github.com/ezyang commit 0297c75c141103cc780c88bfe9749c460690bf58 Author: Sourav Mandal Date: Wed Sep 21 13:42:12 2022 +0000 [pytorch] cuBLAS addmm malfunction test (#85084) Summary: Create unit test to detect cuBLAS breakage via large differences between CPU and GPU addmm invocations Test Plan: Sample unit test output -- [...] test_cublas_addmm_size_10000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_10000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_10000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_1000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_1000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_1000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_100_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_100_cpu_float16 (test_linalg.TestLinalgCPU) ... ok test_cublas_addmm_size_100_cpu_float32 (test_linalg.TestLinalgCPU) ... ok [...] Reviewed By: mikekgfb Differential Revision: D39433029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85084 Approved by: https://github.com/zrphercule commit b70c254ebbfe02f2e9a9990aa95368d8ee73be37 Author: Mateusz Sypniewski Date: Wed Sep 21 13:41:52 2022 +0000 Rework printing tensor aliases in CSAN error message (#85008) Small rework of how the error message is formatted; it introduces a distinction between the arguments and the output of kernels. Verified manually on multiple examples that the message is printed as expected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85008 Approved by: https://github.com/lw commit 3eb27229dd74dd0bea434326c471f16c50e558a4 Author: Edward Z. Yang Date: Tue Sep 20 22:01:48 2022 -0700 as_strided symbolic support (#85264) Signed-off-by: Edward Z.
Yang Differential Revision: [D39662820](https://our.internmc.facebook.com/intern/diff/D39662820) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85264 Approved by: https://github.com/wconstab commit 793deeefb44ee22dea9f6da5f57bf16e4d63d32d Author: Jesse Cai Date: Wed Sep 21 00:40:11 2022 +0000 [quant][core][feature] Implement masked_fill for CUDA tensors (#85108) Summary: - Add new cuda test for masked_fill - Add in QuantizedCUDA implementation for masked_fill Test Plan: ``` python test/test_quantization.py -k test_qtensor_masked_fill_cuda ``` Reviewers: dzdang Subscribers: Tasks: Fixes https://github.com/pytorch/pytorch/issues/83110 Tags: quant Pull Request resolved: https://github.com/pytorch/pytorch/pull/85108 Approved by: https://github.com/dzdang commit 308b26fe4d2b82788862866a937c19c1b3934881 Author: Ivan Yashchuk Date: Wed Sep 21 12:45:15 2022 +0000 Add nvFuser support for transpose (#84629) `torch._refs.t`, `torch._refs.transpose`, `torch._refs.permute` are all should be working now with nvFuser executor. It would also work with graphs processed by AOT Autograd as these functions are registered to the aten->ref mapping via the "register_decomposition" decorator: https://github.com/pytorch/pytorch/blob/07d398fb269eebe314ae898287494a2bfdc7f278/torch/_refs/__init__.py#L3125-L3126 https://github.com/pytorch/pytorch/blob/07d398fb269eebe314ae898287494a2bfdc7f278/torch/_refs/__init__.py#L3143-L3144 https://github.com/pytorch/pytorch/blob/07d398fb269eebe314ae898287494a2bfdc7f278/torch/_refs/__init__.py#L2548-L2549 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84629 Approved by: https://github.com/ngimel commit 2f4a517d67fd693c4ff544e74947b01dc6063dfe Author: Horace He Date: Wed Sep 21 02:30:50 2022 +0000 Ported matmul compositeimplicitautograd impl into core (#85239) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85239 Approved by: https://github.com/ezyang, https://github.com/lezcano commit a3dc338ee1b30aa1f59f36b3ba4c98d6a30a8600 Author: PyTorch MergeBot Date: Wed Sep 21 08:34:51 2022 +0000 Revert "Exposing native _scaled_dot_product_attention to torch.nn (#85044)" This reverts commit 9fdd8a8b7f171be70ea3bd4724c38852ef292d73. Reverted https://github.com/pytorch/pytorch/pull/85044 on behalf of https://github.com/huydhn due to This breaks CUDA 10.2 in trunk. We are deprecating CUDA 10.2, but it is still here in the mean time commit 26ba2e9751dbda278603d40ed67e02d15ca834e3 Author: Jakub Pietrak Date: Wed Sep 21 09:37:40 2022 +0200 implementation for conv layers commit 09965957cd8ecc696852e73022892b3ad4475783 Author: Vasiliy Kuznetsov Date: Tue Sep 20 16:14:07 2022 -0700 quantization: align observer dtype with reference model spec (#85345) Summary: Before this PR, the `dtype` attribute of observers was not clearly defined. It originally meant `interface_dtype` in the eager mode workflow, which is how the codebase before this PR is using it. In the new reference model spec, `dtype` attribute of an observer represents the `dtype` value which needs to be passed into a `quantize` function in the reference model spec. This PR aligns the codebase to this definition of dtype. In detail: 1. change util functions to interpret `dtype` using the reference model definition 2. change `prepare` to interpret `dtype` using the reference model definition 3. change observers for dynamic quantization to interpret `dtype` using the reference model definition. 
A future PR (left out of this one to keep LOC small) will deprecate the `compute_dtype` field and instead expose `is_dynamic` on observers. Test plan: ``` python test/test_quantization.py TestQuantizeFx python test/test_quantization.py TestQuantizeFxOps ``` Differential Revision: [D39675209](https://our.internmc.facebook.com/intern/diff/D39675209) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85345 Approved by: https://github.com/z-a-f, https://github.com/jerryzh168 commit 08f413bd6a076e41f0023dadb6f9b95f2fe2ddc6 Author: PyTorch MergeBot Date: Wed Sep 21 05:04:37 2022 +0000 [vision hash update] update the pinned vision hash (#85380) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85380 Approved by: https://github.com/pytorchbot commit 75451d3c81c88eebc878fb03aa5fcb89328989d9 Author: Edward Z. Yang Date: Tue Sep 20 17:17:33 2022 -0400 Address eellison's CR comments on AOTAutograd (#85370) For some reason, this change didn't actually make it to master. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85370 Approved by: https://github.com/eellison commit 3c870dadc3536b03bdcc5377ac85ef9e44cc1e87 Author: Nikita Shulga Date: Wed Sep 21 03:53:25 2022 +0000 [BE] Mark unused range-loop vars with `C10_UNUSED` (#85383) I.e. replace: ``` for(const auto i: c10::irange(lim)) { (void)i; ... } ``` with ``` for(const auto i C10_UNUSED: c10::irange(lim)) { ... } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85383 Approved by: https://github.com/kit1980 commit 3122a96ee45507e8d33f265410222e69cc66677a Author: PyTorch MergeBot Date: Wed Sep 21 03:13:20 2022 +0000 Revert "Improve and expose cpp_backtrace to python binding (#84896)" This reverts commit 73fbca1ea6ecc08ae4455a12b68fc2ead93a088c. Reverted https://github.com/pytorch/pytorch/pull/84896 on behalf of https://github.com/kit1980 due to Broke libtorch and linux-binary-manywheel - https://hud.pytorch.org/pytorch/pytorch/commit/73fbca1ea6ecc08ae4455a12b68fc2ead93a088c commit 9fdd8a8b7f171be70ea3bd4724c38852ef292d73 Author: Driss Guessous Date: Wed Sep 21 03:09:08 2022 +0000 Exposing native _scaled_dot_product_attention to torch.nn (#85044) This exposes the _scaled_dot_product_attention function to python in the nn namespace. It is still underscored because the API for args and kwargs is still in flux for the next few weeks and will eventually land as a prototype feature.
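As a rough, self-contained sketch of the computation the exposed op performs (the private signature is still in flux, so this spells out softmax(Q K^T / sqrt(d)) V using only public ops):

```python
import math
import torch
import torch.nn.functional as F

def sdpa_reference(q, k, v):
    # softmax(Q @ K^T / sqrt(d)) @ V -- the core of scaled dot-product
    # attention that the native op implements.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(2, 4, 8, 16)  # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 4, 8, 16)
v = torch.randn(2, 4, 8, 16)
print(sdpa_reference(q, k, v).shape)  # torch.Size([2, 4, 8, 16])
```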
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85044 Approved by: https://github.com/cpuhrsch commit d7029fea5113468441cb358bced6045e6e4d4b9a Author: Michael Gschwind Date: Wed Sep 21 02:07:13 2022 +0000 Remove TS compatibility transition code (#85003) Summary: Remove TS compatibility transition code Test Plan: sandcastle Differential Revision: D39494677 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85003 Approved by: https://github.com/erichan1 commit 73fbca1ea6ecc08ae4455a12b68fc2ead93a088c Author: Sherlock Huang Date: Tue Sep 20 20:49:22 2022 +0000 Improve and expose cpp_backtrace to python binding (#84896) We can now get cpp stack trace by calling torch.utils.get_cpp_backtrace() Sample output when calling from a torch_dispatch stack: ``` frame #23: torch::handle_torch_function_no_python_arg_parser(c10::ArrayRef, _object*, _object*, char const*, _object*, char const*, torch::TorchFunctionName) (0x7f69330bab90 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/utils/python_arg_parser.cpp:323) frame #24: (0x7f6932a09e79 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/autograd/python_variable.cpp:2252) frame #25: (0x7f69261aee33 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/PythonFallbackKernel.cpp:56) frame #26: (0x7f69261afef9 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/BoxedKernel_impl.h:19) frame #27: c10::BoxedKernel::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) const (0x7f6932aadced in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/BoxedKernel_impl.h:41) frame #28: (0x7f6926fae9b9 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/boxing.h:227) frame #29: at::Tensor c10::Dispatcher::redispatch(c10::TypedOperatorHandle const&, c10::DispatchKeySet, at::Tensor const&) const (0x7f6926e821f5 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/KernelFunction_impl.h:106) frame #30: at::_ops::alias::redispatch(c10::DispatchKeySet, at::Tensor const&) (0x7f6927142c31 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:438) frame #31: (0x7f692ae4f8be in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/autograd/generated/ADInplaceOrViewType_1.cpp:1361) frame #32: (0x7f692ae4f9b1 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/autograd/generated/ADInplaceOrViewType_1.cpp:1362) frame #33: (0x7f692aef77e9 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/WrapFunctionIntoFunctor.h:13) frame #34: (0x7f6926fae7d8 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/KernelFunction_impl.h:50) frame #35: at::Tensor c10::Dispatcher::redispatch(c10::TypedOperatorHandle const&, c10::DispatchKeySet, at::Tensor const&) const (0x7f6926e821c9 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/KernelFunction_impl.h:97) frame #36: at::_ops::alias::redispatch(c10::DispatchKeySet, at::Tensor const&) (0x7f6927142c31 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:438) frame #37: (0x7f6929ec654a in /fsx/users/bahuang/repos/pytorch_fsx/build/aten/src/ATen/RedispatchFunctions.h:10697) frame #38: (0x7f6929d9edae in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/autograd/generated/VariableType_1.cpp:2837) frame #39: (0x7f6929d9f043 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/autograd/generated/VariableType_1.cpp:2838) frame #40: (0x7f6929e7d2f9 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/WrapFunctionIntoFunctor.h:13) frame #41: 
(0x7f6929eb1344 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:478) frame #42: (0x7f6929ea7b99 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:490) frame #43: (0x7f6929e7d370 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:563) frame #44: (0x7f6929e7d43a in /fsx/users/bahuang/repos/pytorch_fsx/c10/util/C++17.h:239) frame #45: (0x7f6929e7d48c in /fsx/users/bahuang/repos/pytorch_fsx/c10/util/C++17.h:364) frame #46: (0x7f6929e7d50a in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:554) frame #47: c10::BoxedKernel::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) const (0x7f6932aadced in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/BoxedKernel_impl.h:41) frame #48: c10::KernelFunction::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) const (0x7f6932aadd26 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/KernelFunction_impl.h:43) frame #49: c10::Dispatcher::redispatchBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) const (0x7f692603890a in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:652) frame #50: (0x7f69260387f9 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:388) frame #51: (0x7f69261af0ef in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/PythonFallbackKernel.cpp:96) frame #52: (0x7f69261aff2b in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/BoxedKernel_impl.h:25) frame #53: c10::BoxedKernel::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) const (0x7f6932aadced in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/BoxedKernel_impl.h:41) frame #54: c10::KernelFunction::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) const (0x7f6932aadd26 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/boxing/KernelFunction_impl.h:43) frame #55: c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector >*) const (0x7f6925fd6ab2 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:628) frame #56: (0x7f6925fd6690 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:376) frame #57: (0x7f692bf5b525 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/dispatch/Dispatcher.h:380) frame #58: (0x7f692bf59fac in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/jit/runtime/register_c10_ops.cpp:15) frame #59: (0x7f692bf5af41 in /usr/include/c++/7/bits/std_function.h:316) frame #60: std::function >&)>::operator()(std::vector >&) const (0x7f6932ab9a0f in /usr/include/c++/7/bits/std_function.h:706) frame #61: (0x7f6932aad541 in /fsx/users/bahuang/repos/pytorch_fsx/aten/src/ATen/core/stack.h:41) frame #62: (0x7f6932ab3102 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/jit/python/pybind_utils.h:1206 (discriminator 1)) frame #63: (0x7f6932ab3943 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/jit/python/pybind_utils.h:1272) frame #64: (0x7f6932a46120 in /fsx/users/bahuang/repos/pytorch_fsx/torch/csrc/jit/python/init.cpp:1767) frame #65: (0x7f6932a997be in /fsx/users/bahuang/repos/pytorch_fsx/third_party/pybind11/include/pybind11/cast.h:1441) frame #66: (0x7f6932a8a985 in 
/fsx/users/bahuang/repos/pytorch_fsx/third_party/pybind11/include/pybind11/cast.h:1410) frame #67: (0x7f6932a66e1e in /fsx/users/bahuang/repos/pytorch_fsx/third_party/pybind11/include/pybind11/pybind11.h:249) frame #68: (0x7f6932a66ec2 in /fsx/users/bahuang/repos/pytorch_fsx/third_party/pybind11/include/pybind11/pybind11.h:224) frame #69: (0x7f6932473111 in /fsx/users/bahuang/repos/pytorch_fsx/third_party/pybind11/include/pybind11/pybind11.h:929) frame #104: __libc_start_main (0x7f693485dc87 in /build/glibc-uZu3wS/glibc-2.27/csu/../csu/libc-start.c:310) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84896 Approved by: https://github.com/ezyang commit 52fd7e491b24d4cf910b98bbe06c460f7d8e5577 Author: Will Constable Date: Wed Sep 21 00:06:54 2022 +0000 Update torch.ops.aten.all ref to be symbolic-trace friendly (#85352) - previous impl compared the summed bool values of the tensor to its nelem, which in a symbolic world is a symint and can't be coerced back into a bool for the purpose of shoving into a result tensor - new impl adds one extra negation op but avoids the need to compare to the symbolic nelem Pull Request resolved: https://github.com/pytorch/pytorch/pull/85352 Approved by: https://github.com/ezyang, https://github.com/mruberry commit f6a18d3d373f0391c722f6159a1dda23da556ab4 Author: Scott Wolchok Date: Mon Sep 19 15:41:33 2022 -0700 [PyTorch] StorageImpl: cache size_bytes.is_symbolic() (#85309) We've got 6 bools' worth of extra space, so let's try caching this. Differential Revision: [D39636570](https://our.internmc.facebook.com/intern/diff/D39636570/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39636570/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/85309 Approved by: https://github.com/ezyang commit 90fa744c09fdbf8a2e7747ea82823714b0ee7e04 Author: Horace He Date: Tue Sep 20 18:07:46 2022 +0000 Fixed memory issues in linalg_lstsq (#85357) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85357 Approved by: https://github.com/ezyang, https://github.com/IvanYashchuk commit cb8e73bb71ffb64ff0ef0f6e9fbe6ef7dfcbc307 Author: Vasiliy Kuznetsov Date: Tue Sep 20 09:37:54 2022 -0700 fx quant: fix bug in custom module test (#85344) Summary: `TestQuantizeFx.test_custom_module_class` was subtly broken because the various parts of the test case were modifying the original model. This seems incorrect because `prepare_fx` and `convert_fx` are inplace. To fix this, we can `copy.deepcopy` the model before applying the test cases to it. This test case was triggered by an unrelated refactor, splitting the fix in a separate diff to keep the refator clean. Test plan: ``` python test/test_quantization.py TestQuantizeFx.test_custom_module_class ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85344 Approved by: https://github.com/dzdang, https://github.com/z-a-f, https://github.com/jerryzh168 commit 62786a09d334498c872f2c74e814ce56b27092ae Author: Matthew LeMay Date: Tue Sep 20 19:11:28 2022 +0000 Fix indentation in functorch limitations docs (#85346) Fixes a minor indentation error in the `functorch` UX limitations documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85346 Approved by: https://github.com/kit1980 commit e1f634753c2606ddf1d9e1ef611a882928ce415c Author: Edward Z. 
Yang Date: Mon Sep 19 14:42:46 2022 -0700 Setup fake tensor and symbolic shapes once at beginning of AOTAutograd (#85233) Signed-off-by: Edward Z. Yang Differential Revision: [D39662822](https://our.internmc.facebook.com/intern/diff/D39662822) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85233 Approved by: https://github.com/wconstab commit 5f623f5c4c759cada9f0dc3866b63c906178dbc1 Author: Edward Z. Yang Date: Mon Sep 19 13:29:47 2022 -0700 Correctly handle duplicate arguments to AOTAutograd (#85301) If we do not deduplicate them, our custom autograd function will double-count the gradient computed for the variable (since the same x.grad field will be embedded into the graph twice.) The alternative is to destroy aliasing relationships the inputs and trace accumulating individual gradients for each of the inputs into separate grad fields, but this prevents resizing of inputs inside the traced graph from working correctly. In principle, you could detach the inputs, allow metadata changes on them, and then reflect metadata changes to the originals as necessary (in fact, we should do this for other reasons), but AOTAutograd doesn't do this yet. Another alternative is to have dynamo guarantee not to give duplicate tensor inputs, but because AOTAutograd is public API, we are obligated to still handle it correctly here in case a direct user passes duplicate inputs. Signed-off-by: Edward Z. Yang Differential Revision: [D39662821](https://our.internmc.facebook.com/intern/diff/D39662821) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85301 Approved by: https://github.com/Chillee, https://github.com/albanD commit b9b27f7664c2da80458a21d799eb737cfd2490df Author: jjsjann123 Date: Tue Sep 20 18:52:02 2022 +0000 Added `Tensor.to` overloads to `torch._refs.to` (#84802) Fixes #84264 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84802 Approved by: https://github.com/IvanYashchuk, https://github.com/mruberry commit d3dec8097b847fc46755ef06ea6ff90eebc846eb Author: kshitij12345 Date: Tue Sep 20 18:18:39 2022 +0000 [fix] composite compliance: cumprod, _masked.cumprod, linalg.vander (#85330) Ref: #69991 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85330 Approved by: https://github.com/zou3519 commit e24e17916fff4ea9959d23471fb3b8b05f2720dd Author: Edward Z. Yang Date: Tue Sep 20 13:53:08 2022 -0400 Remove errant semicolon (#85356) Signed-off-by: Edward Z. Yang Differential Revision: [D39662630](https://our.internmc.facebook.com/intern/diff/D39662630) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85356 Approved by: https://github.com/wconstab, https://github.com/Krovatkin, https://github.com/voznesenskym commit a3afb2c2f603f100e18e8aae9e9bfee2d1c67a4a Author: Elias Ellison Date: Tue Sep 20 15:24:19 2022 +0000 Fake: fix conv_transpose2d striding (#82846) The output striding channels-last preservation logic differs between cuda and cpu. For the meta kernel, we can peek at the fake tensor device and use that to determine whether to do cpu or cuda. You could argue there's a leaking of abstraction here but this seems like a pretty minimal leak and I'm not sure there's a much cleaner way forward for device-specific striding tracing logic. 
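For the duplicate-argument AOTAutograd commit above (#85301), a small illustration of the double-counting it guards against: when the same tensor is passed for two inputs, both contributions accumulate into the single `x.grad` field, so a traced graph that keeps two separate input slots must deduplicate them to match eager behavior.

```python
import torch

def f(a, b):
    # Per element: d/da = 2, d/db = 3.
    return (2 * a + 3 * b).sum()

x = torch.ones(3, requires_grad=True)
f(x, x).backward()
# Both argument slots alias the same leaf, so the gradients accumulate:
print(x.grad)  # tensor([5., 5., 5.])
```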
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82846 Approved by: https://github.com/ezyang commit e1ed485c65d68787971a2dfb1ace1ed66f5a4d5b Author: Kulin Seth Date: Tue Sep 20 17:53:43 2022 +0000 [MPS] Handle reduction of scalars in edge-cases (#83743) The issue was found as part of fixing Test consistency issues. Test case coming up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83743 Approved by: https://github.com/razarmehr, https://github.com/malfet commit d17b144e6564f10f55af639fbf2dd82eaacdfa3e Author: lezcano Date: Tue Sep 20 12:40:28 2022 +0000 Adding multigammaln ref and fix arange (#85153) Partially based on https://github.com/pytorch/pytorch/pull/83662. I'll help land this one, as Rob does not work in the PyTorch project anymore I removed the data-dependent check for the args, as data dependencies are bad for many reasons (and it was failing when the input has NaNs). It also registers arange as a decomposition, and fixes the naming of its args. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85153 Approved by: https://github.com/mruberry, https://github.com/ngimel commit 7a6c4d0c50dd0670d87bc39d53292cf8cb90ca04 Author: Masaki Kozuki Date: Tue Sep 20 17:18:33 2022 +0000 [mta] APEX style Fused Adam (#81705) This PR implements an APEX style FusedAdam in PyTorch. This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel. related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167 possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436 cc @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705 Approved by: https://github.com/ngimel commit 00a1065286ef9b425cfe1d74d76b7ab5555eee83 Author: Vasu Agrawal Date: Tue Sep 20 17:15:59 2022 +0000 [pytorch] Inline std::forward definition (#85255) Summary: Alternative (probably better) solution to the problem laid out in D39562394. Test Plan: CI should be green. Differential Revision: D39612710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85255 Approved by: https://github.com/ezyang commit 9c1a6a522d5dddc9db2ce728dd751ad5a7fb577e Author: Sherlock Huang Date: Tue Sep 20 06:30:57 2022 +0000 Make ones and zeros's ref accepts variadic size argument (#85117) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85117 Approved by: https://github.com/ngimel, https://github.com/lezcano commit 5e8f16b8779775dc2a29d20de6827a0f25fb0df6 Author: Will Constable Date: Tue Sep 20 16:36:48 2022 +0000 Fix fake_tensor to_copy meta dispatch (#85337) Previously, no_dispatch() was causing us to hit real kernels (well, real decomps and prims) for to_copy when we were operating on FakeTensors. This change helps us hit meta kernels and seems to pass the relevant tests. I still have questions about why this line has to call .to("meta") input = new_kwargs.pop("input").to("meta") But that can wait for another PR. 
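A minimal sketch of the behavior the fix above (#85337) targets, assuming the `FakeTensorMode` entry point in `torch._subclasses`: under the mode, `.to(...)`-style copies should be served by meta kernels, propagating shape and dtype without allocating or copying real data.

```python
import torch
from torch._subclasses import FakeTensorMode  # assumed import path

with FakeTensorMode():
    x = torch.empty(4, 4)       # fake tensor: metadata only, no real storage
    y = x.to(torch.float64)     # should hit the meta kernel for _to_copy
    print(type(y).__name__, y.dtype, y.shape)
```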
Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/85337 Approved by: https://github.com/eellison commit 4012e623e8689b873cae94590766d990d155017c Author: eqy Date: Tue Sep 20 16:31:54 2022 +0000 [CUBLAS][CUDA GRAPHS] (re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85292) re-open of #83461 with fix for 10.2 build CC @ngimel @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/85292 Approved by: https://github.com/malfet commit 39f482acdf959b2b22a310381a5a3d3bcf9699b9 Author: Kevin Stephano Date: Tue Sep 20 16:10:05 2022 +0000 Add a reset() method to nvFuser FusionCache to enable proper resetting during tests. (#85319) Fixes issue Jie found in his PR: https://github.com/pytorch/pytorch/pull/84626#issuecomment-1250745334 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85319 Approved by: https://github.com/jjsjann123 commit 86d8c61c7c122ede1628f967277073231f92c744 Author: Benoit Steiner Date: Tue Sep 20 15:38:58 2022 +0000 Revert D39583438: Multisect successfully blamed D39583438 for test or build failures (#85277) Summary: This diff is reverting D39583438 D39583438 has been identified to be causing the following test or build failures: Tests affected: - https://www.internalfb.com/intern/test/281475048999851/ Here's the Multisect link: https://www.internalfb.com/intern/testinfra/multisect/1260522 Here are the tasks that are relevant to this breakage: T124797105: 18 tests started failing for employee benoitsteiner in the last 2 weeks We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it. Test Plan: NA Reviewed By: benoitsteiner Differential Revision: D39599694 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85277 Approved by: https://github.com/dagitses commit cf2f552cd8a41f4913c370c15804173a3b56a415 Author: anjali411 Date: Tue Sep 20 10:18:31 2022 +0000 Add __all__ to torch.{fx, distributed, backends} submodules (#85079) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85079 Approved by: https://github.com/rohan-varma commit a4dca9822dfabcdbd1b36a12c013764f2af87613 Author: kshitij12345 Date: Tue Sep 20 08:03:36 2022 +0000 [composite compliance] prod (#81969) Ref: #69991 Also fixes #82644 (fix similar to #81617) For CompositeCompliance, we can't use `item` to choose a special fast-path when Tensor is a Subclass. Instead we always dispatch to the slower but safer implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81969 Approved by: https://github.com/zou3519 commit 077db3de92f34cff3187b61de4b18900a927b3fd Author: Kulin Seth Date: Tue Sep 20 06:19:40 2022 +0000 [MPS] Fix conv1d backwards crash for channels last case (#85283) Fixes pytorch#84511 Use the same logic as in the forward pass for the backward pass (in case of channels last, manually set the shape to NHWC) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85283 Approved by: https://github.com/malfet, https://github.com/razarmehr commit bcdef58a55665bba959a773633aa5b1e5758d94e Author: Kulin Seth Date: Tue Sep 20 06:00:58 2022 +0000 [MPS] Fix the crash in bitwise ops on x86 platforms. 
(#85285) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/85285 Approved by: https://github.com/razarmehr, https://github.com/malfet commit 6c48a01cef029075d3787acab18d2a4e32b2cb4c Author: Xia, Weiwen Date: Tue Sep 20 05:33:26 2022 +0000 [Quant] Improve performance of ONEDNN backend (#84470) This PR improves performance of ONEDNN quantization backend by adding fast paths. For qconv, qconv_transpose and qlinear. It uses a cache to store reusable data on the first run thus reducing runtime overhead afterwards. Note: Other quantization backends not affected. **Correctness**: Covered by UT **Performance**: (Time to run each op, in microseconds) Convolution, 1 core per instance, multiple instances on whole socket shape | onednn (old) | onednn (new) | Improvement -- | -- | -- | -- mb1_ic128oc128_id2od2kd3sd1dd0pd1 _ih8oh8kh3sh1dh0ph1_iw10ow10kw3sw1dw0pw1 | 767.038 | 415.238 | 45.86% mb1_ic256oc128_id4od4kd1sd1dd0pd0 _ih16oh16kh1sh1dh0ph0_iw20ow20kw1sw1dw0pw0 | 194.979 | 167.353 | 14.17% mb1_ic32oc16_ih112oh112kh1sh1dh0ph0 _iw112ow112kw1sw1dw0pw0 | 104.024 | 78.206 | 24.82% mb1_ic3oc16_ih224oh112kh3sh2dh0ph1 _iw224ow112kw3sw2dw0pw1 | 104.178 | 81.559 | 21.71% mb30_ic256oc256_ih14oh14kh3sh1dh0ph1 _iw14ow14kw3sw1dw0pw1 | 12249.438 | 12079.125 | 1.39% mb56_ic3oc28_ih24oh22kh3sh1dh0ph0 _iw24ow22kw3sw1dw0pw0 | 438.046 | 405.21 | 7.50% mb100_ic128oc128_ih16oh16kh3sh1dh0ph1 _iw16ow16kw3sw1dw0pw1 | 13893.188 | 13797.609 | 0.69% g2mb1_ic128oc256_ih28oh28kh3sh1dh0ph1 _iw28ow28kw3sw1dw0pw1 | 499.014 | 475.333 | 4.75% g32mb1_ic1024oc1024_ih14oh14kh3sh1dh0ph1 _iw14ow14kw3sw1dw0pw1 | 294.877 | 270.568 | 8.24% g64mb1_ic1024oc2048_ih14oh7kh3sh2dh0ph1 _iw14ow7kw3sw2dw0pw1 | 122.664 | 95.503 | 22.14% g256mb1_ic256oc256_ih10oh5kh3sh2dh0ph1 _iw10ow5kw3sw2dw0pw1 | 31.343 | 13.522 | 56.86% g512mb1_ic512oc512_ih19oh10kh3sh2dh0ph1 _iw19ow10kw3sw2dw0pw1 | 54.116 | 34.651 | 35.97% g1024mb1_ic1024oc1024_ih10oh10kh3sh1dh0ph1 _iw10ow10kw3sw1dw0pw1 | 74.989 | 55.566 | 25.90% Convolution, 4 cores per instance, multiple instances on whole socket shape | onednn (old) | onednn (new) | Improvement -- | -- | -- | -- mb1_ic128oc128_id2od2kd3sd1dd0pd1 _ih8oh8kh3sh1dh0ph1_iw10ow10kw3sw1dw0pw1 | 249.978 | 160.429 | 35.82% mb1_ic256oc128_id4od4kd1sd1dd0pd0 _ih16oh16kh1sh1dh0ph0_iw20ow20kw1sw1dw0pw0 | 102.726 | 89.555 | 12.82% mb1_ic32oc16_ih112oh112kh1sh1dh0ph0 _iw112ow112kw1sw1dw0pw0 | 72.993 | 57.622 | 21.06% mb1_ic3oc16_ih224oh112kh3sh2dh0ph1 _iw224ow112kw3sw2dw0pw1 | 76.607 | 61.847 | 19.27% mb30_ic256oc256_ih14oh14kh3sh1dh0ph1 _iw14ow14kw3sw1dw0pw1 | 3109.625 | 3006.062 | 3.33% mb56_ic3oc28_ih24oh22kh3sh1dh0ph0 _iw24ow22kw3sw1dw0pw0 | 191.194 | 175.997 | 7.95% mb100_ic128oc128_ih16oh16kh3sh1dh0ph1 _iw16ow16kw3sw1dw0pw1 | 3435.625 | 3391.438 | 1.29% g2mb1_ic128oc256_ih28oh28kh3sh1dh0ph1 _iw28ow28kw3sw1dw0pw1 | 205.209 | 191.931 | 6.47% g32mb1_ic1024oc1024_ih14oh14kh3sh1dh0ph1 _iw14ow14kw3sw1dw0pw1 | 157.004 | 142.82 | 9.03% g64mb1_ic1024oc2048_ih14oh7kh3sh2dh0ph1 _iw14ow7kw3sw2dw0pw1 | 83.262 | 71.689 | 13.90% g256mb1_ic256oc256_ih10oh5kh3sh2dh0ph1 _iw10ow5kw3sw2dw0pw1 | 31.848 | 13.378 | 57.99% g512mb1_ic512oc512_ih19oh10kh3sh2dh0ph1 _iw19ow10kw3sw2dw0pw1 | 50.216 | 32.663 | 34.95% g1024mb1_ic1024oc1024_ih10oh10kh3sh1dh0ph1 _iw10ow10kw3sw1dw0pw1 | 67.359 | 49.779 | 26.10% Transposed Convolution, 1 core per instance, multiple instances on whole socket shape | onednn (old) | onednn (new) | Improvement -- | -- | -- | -- mb1_ic512oc256_ih4oh8kh4sh2dh0ph1_iw4ow8kw4sw2dw0pw1 | 537.12 | 505.142 | 5.95% 
mb1_ic256oc128_ih8oh16kh4sh2dh0ph1_iw8ow16kw4sw2dw0pw1 | 296.95 | 275.724 | 7.15% mb1_ic128oc64_ih16oh32kh4sh2dh0ph1_iw16ow32kw4sw2dw0pw1 | 266.933 | 251.175 | 5.90% mb1_ic64oc3_ih32oh64kh4sh2dh0ph1_iw32ow64kw4sw2dw0pw1 | 141.77 | 126.41 | 10.83% mb1_ic100oc512_ih1oh4kh4sh1dh0ph0_iw1ow4kw4sw1dw0pw0 | 89.511 | 66.719 | 25.46% Transposed Convolution, 4 cores per instance, multiple instances on whole socket shape | onednn (old) | onednn (new) | Improvement -- | -- | -- | -- mb1_ic512oc256_ih4oh8kh4sh2dh0ph1 _iw4ow8kw4sw2dw0pw1 | 181.594 | 163.77 | 9.82% mb1_ic256oc128_ih8oh16kh4sh2dh0ph1 _iw8ow16kw4sw2dw0pw1 | 163 | 145.104 | 10.98% mb1_ic128oc64_ih16oh32kh4sh2dh0ph1 _iw16ow32kw4sw2dw0pw1 | 163.158 | 150.71 | 7.63% mb1_ic64oc3_ih32oh64kh4sh2dh0ph1 _iw32ow64kw4sw2dw0pw1 | 109.955 | 98.603 | 10.32% mb1_ic100oc512_ih1oh4kh4sh1dh0ph0 _iw1ow4kw4sw1dw0pw0 | 69.502 | 54.523 | 21.55% Linear, 1 core per instance, multiple instances on whole socket shape | onednn (old) | onednn (new) | Improvement -- | -- | -- | -- mb1ic16oc8 | 54.415 | 35.816 | 34.18% mb1ic32oc16 | 26.764 | 16.041 | 40.07% mb1ic64oc32 | 26.735 | 16.007 | 40.13% mb1ic100oc1 | 26.712 | 16.06 | 39.88% mb1ic512oc1000 | 65.261 | 51.947 | 20.40% mb1ic1024oc1000 | 112.603 | 98.175 | 12.81% mb1ic2048oc1000 | 207.294 | 192.014 | 7.37% mb1ic4096oc4096 | 3761.094 | 3745.609 | 0.41% mb1ic9216oc4096 | 8918.672 | 8912.547 | 0.07% mb20ic2048oc91 | 52.487 | 44.623 | 14.98% mb30ic512oc37 | 29.257 | 19.642 | 32.86% mb100ic128oc256 | 39.32 | 29.81 | 24.19% mb100ic256oc512 | 74.499 | 64.322 | 13.66% mb100ic512oc1024 | 220.029 | 204.745 | 6.95% mb100ic1024oc784 | 352.311 | 336.309 | 4.54% Linear, 4 cores per instance, multiple instances on whole socket shape | onednn (old) | onednn (new) | Improvement -- | -- | -- | -- mb1ic16oc8 | 58.252 | 40.433 | 30.59% mb1ic32oc16 | 23.901 | 15.549 | 34.94% mb1ic64oc32 | 24.594 | 16.214 | 34.07% mb1ic100oc1 | 24.011 | 15.4 | 35.86% mb1ic512oc1000 | 49.781 | 41.988 | 15.65% mb1ic1024oc1000 | 70.304 | 61.88 | 11.98% mb1ic2048oc1000 | 92.259 | 85.715 | 7.09% mb1ic4096oc4096 | 794.937 | 781.137 | 1.74% mb1ic9216oc4096 | 2081.375 | 2067.75 | 0.65% mb20ic2048oc91 | 66.929 | 58.338 | 12.84% mb30ic512oc37 | 35.332 | 26.337 | 25.46% mb100ic128oc256 | 42.21 | 38.908 | 7.82% mb100ic256oc512 | 66.49 | 63.967 | 3.79% mb100ic512oc1024 | 130.828 | 122.673 | 6.23% mb100ic1024oc784 | 160.987 | 154.765 | 3.86% Environment: - PyTorch version: 1.13.0a0+gitcdd625b - Is debug build: False - CUDA used to build PyTorch: None - ROCM used to build PyTorch: N/A - OS: Ubuntu 20.04.3 LTS (x86_64) - GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 - Clang version: Could not collect - CMake version: version 3.22.5 - Libc version: glibc-2.31 - Python version: 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0] (64-bit runtime) - Python platform: Linux-5.11.0-27-generic-x86_64-with-glibc2.31 - Is CUDA available: False - CUDA runtime version: No CUDA - GPU models and configuration: No CUDA - Nvidia driver version: No CUDA - cuDNN version: No CUDA - HIP runtime version: N/A - MIOpen runtime version: N/A - Is XNNPACK available: True Versions of relevant libraries: - [pip3] intel-extension-for-pytorch==1.13.0+cpu - [pip3] numpy==1.23.3 - [pip3] pytorch-widedeep==0.3.7 - [pip3] torch==1.13.0a0+git48b423b - [pip3] torchvision==0.14.0a0+ebb68f3 - [conda] blas 1.0 mkl - [conda] intel-extension-for-pytorch 1.13.0+cpu pypi_0 pypi - [conda] mkl 2021.4.0 h06a4308_640 - [conda] mkl-include 2022.1.0 pypi_0 pypi - [conda] mkl-service 2.4.0 py39h7f8727e_0 - [conda] 
mkl-static 2022.1.0 pypi_0 pypi - [conda] mkl_fft 1.3.1 py39hd3c417c_0 - [conda] mkl_random 1.2.2 py39h51133e4_0 - [conda] numpy 1.23.3 pypi_0 pypi - [conda] numpy-base 1.22.3 py39hf524024_0 - [conda] torch 1.13.0a0+git48b423b pypi_0 pypi - [conda] torchvision 0.14.0a0+ebb68f3 pypi_0 pypi Pull Request resolved: https://github.com/pytorch/pytorch/pull/84470 Approved by: https://github.com/jerryzh168 commit 35088f283e5a93c6775e65e19d34093bdfb101e1 Author: PyTorch MergeBot Date: Tue Sep 20 03:42:43 2022 +0000 Revert "Python stack tracing OD flow (part 1) (#84362)" This reverts commit 1f4f05e59c4cd72dfff9755629f7cc23f8df7abe. Reverted https://github.com/pytorch/pytorch/pull/84362 on behalf of https://github.com/malfet due to Broke CUDA-10.2 tests, see https://hud.pytorch.org/pytorch/pytorch/commit/1f4f05e59c4cd72dfff9755629f7cc23f8df7abe commit 8c7e20976e227f0cf85ccd742878c42f2d1c927d Author: PyTorch MergeBot Date: Tue Sep 20 03:04:44 2022 +0000 [vision hash update] update the pinned vision hash (#85315) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85315 Approved by: https://github.com/pytorchbot commit c05ca0dbf286d94a0575b8a037410dca200a523d Author: Nikita Shulga Date: Tue Sep 20 01:49:04 2022 +0000 [torch.futures] Fix nullptr deref (#85304) `torch.jit.wait(None)` and `torch.futures.collect_all((None,))` should not crash. Fixes https://github.com/pytorch/pytorch/issues/85237 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85304 Approved by: https://github.com/kit1980 commit 66907e7262da6d6ef3d471e4c90a1c48a8f39a76 Author: Richard Zou Date: Mon Sep 19 14:29:41 2022 -0700 [functorch] Fix dangling impls (#85299) Our dangling impls were: - positive_ (the in-place op just never existed) - unique (something happened to this op, maybe it was renamed) Test Plan: - `import functorch; torch._C._dispatch_find_dangling_impls` - It's difficult to write a test for this because the number of dangling impls depends on if `test_dispatch` has been run already or not (test_dispatch adds a dangling impl) - Can't remove the torchdynamo skip for this yet either Pull Request resolved: https://github.com/pytorch/pytorch/pull/85299 Approved by: https://github.com/ezyang commit 53fdd60635710a7a9f1c2a3eb1115f51b1247e94 Author: PyTorch MergeBot Date: Tue Sep 20 00:13:41 2022 +0000 Revert "Reduce memory usage requirement of `test_warp_softmax_64bit_indexing` in `test_nn.py` (#85037)" This reverts commit 66a9cba221ac32658ea837e88b68b859a08378d0. Reverted https://github.com/pytorch/pytorch/pull/85037 on behalf of https://github.com/clee2000 due to broke test_warp_softmax_64bit_indexing_cuda_float32 and test_warp_softmax_64bit_indexing_cuda_float16 on rocm https://github.com/pytorch/pytorch/actions/runs/3085764744/jobs/4989643817 commit a998a8eb1027b0e162fd6789efb378edc66d37e9 Author: Natalia Gimelshein Date: Mon Sep 19 22:00:41 2022 +0000 Fix segfault for `out` with a large number of dims (#85294) Fixes #85166, #85167, #79218, #85251 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85294 Approved by: https://github.com/malfet commit d9024ea284e925a5f6e0fd35151d682a70e288cf Author: Richard Zou Date: Mon Sep 19 21:49:18 2022 +0000 Setup torch/csrc/functorch/*; move CompileCache.{h, cpp} there (#85263) The plan for functorch C++ is: - all C++-only code goes into aten/functorch. 
- any C++ code with a python dependency goes into torch/csrc/functorch. This will include the functorch Python bindings as well as all of torchdim. Alternative: - we could split it so that code goes into torch/csrc/functorch/nopython and torch/csrc/functorch/python instead of putting anything into ATen. This just feels like a matter of cosmetics. This PR also does two more things: - fix a windows lint error regarding PyLong_asLong - clang-format the code (because the linter got triggered) Test Plan: - run tests - check internal build Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/85263 Approved by: https://github.com/ezyang commit 1f4f05e59c4cd72dfff9755629f7cc23f8df7abe Author: Seonglyong Gong Date: Mon Sep 19 21:33:55 2022 +0000 Python stack tracing OD flow (part 1) (#84362) Summary: submodule update Test Plan: CI Differential Revision: D39176686 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84362 Approved by: https://github.com/robieta commit 66a9cba221ac32658ea837e88b68b859a08378d0 Author: eqy Date: Mon Sep 19 21:31:08 2022 +0000 Reduce memory usage requirement of `test_warp_softmax_64bit_indexing` in `test_nn.py` (#85037) For reference: #84944 CC @xwang233 @ngimel @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/85037 Approved by: https://github.com/ngimel, https://github.com/pmeier commit e41d758e26bd2de00e9dd50e94e878f46f9f1b88 Author: Thomas Viehmann Date: Mon Sep 19 21:20:34 2022 +0000 Handle implicit real->complex casting for backward of stack (#84993) Fixes: #75852 P.S.: Yay for the PyTorch foundation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84993 Approved by: https://github.com/soulitzer commit cd7408e9505e3a7ae00e72a69ab17389ce086475 Author: Michael Voznesensky Date: Mon Sep 19 20:48:09 2022 +0000 Add aten _assert_tensor_metadata op (#84617) Example: ``` graph(): %arg0 : [#users=3] = placeholder[target=arg0] %arg_guard_equality_check : [#users=1] = call_function[target=torch._tensor_equal](args = (%arg0, (1, 1, 2), (2, 2, 1), torch.float32), kwargs = {}) %_assert_true : [#users=0] = call_function[target=torch._assert_true](args = (%arg_guard_equality_check, Guard evaluation failed equality check for arg0), kwargs = {}) %add : [#users=1] = call_function[target=operator.add](args = (%arg0, 1), kwargs = {}) return ([arg0, arg0], (add, add)) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84617 Approved by: https://github.com/jansel commit 6ed90379a848e1ff1422fb906253e38683c25c90 Author: PyTorch MergeBot Date: Mon Sep 19 20:34:08 2022 +0000 Revert "Legalize BFloat16 in NNC. (#83988)" This reverts commit b049493ed52292c344c5b17f6db16a0242419865. Reverted https://github.com/pytorch/pytorch/pull/83988 on behalf of https://github.com/clee2000 due to broke slow tests in trunk, https://github.com/pytorch/pytorch/actions/runs/3084421000/jobs/4986706931 commit 1456cca1fc31a16a5e7a6248d28ebfa80dae8db0 Author: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com> Date: Mon Sep 19 20:21:46 2022 +0000 Fix exception handling, improve overheads and avoid constructing storage for element size (#84612) These changes were proposed by @MatthiasKohl in #84271 and #84542 that fix #84267 and #84056 respectively. The reason I am creating the pull request is CLA check (see original PRs). 
cc @ptrblck @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/84612 Approved by: https://github.com/ngimel commit cbe5469e88db19f1683efcc49f3114a82ea58e32 Author: jiahongyu Date: Mon Sep 19 19:43:14 2022 +0000 [PolishComment] Polish code comment, revelant->relevant (#85238) Polish code comment, `revelant`->`relevant` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85238 Approved by: https://github.com/kit1980 commit 8c952db13ae6c634da2f1e42c1b053d0ad40003e Author: Ivan Yashchuk Date: Mon Sep 19 19:31:16 2022 +0000 Fix segfault case for torch.ormqr (#85278) Correct behavior is to raise an error for `tau.size[-1] > input.size[-1]`. Fixes https://github.com/pytorch/pytorch/issues/85218 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85278 Approved by: https://github.com/Lezcano, https://github.com/malfet, https://github.com/ngimel commit 555bb6cdb8ca82fb298b3fe6b017c59255b08621 Author: Thytu Date: Mon Sep 19 18:49:07 2022 +0000 Check that groups is > 0 in _convolution op (#85111) (#85248) `_convolution` will raise an error if it is called with groups <= 0 Signed-off-by: Thytu Fixes #85111 Side note : If I need to do it elsewhere, let me know 🙂 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85248 Approved by: https://github.com/Lezcano, https://github.com/malfet commit 7234eb06f73f0e2c0aaa02727aee4afb5300ff1a Author: PyTorch MergeBot Date: Mon Sep 19 18:46:35 2022 +0000 Revert "Land "Make ceil,floor,round,trunc handle integers" (#85144)" This reverts commit b27eb8d377fc8ac267fdaed7f95a03d609764604. Reverted https://github.com/pytorch/pytorch/pull/85144 on behalf of https://github.com/clee2000 due to broke slow tests in trunk ex https://ossci-raw-job-status.s3.amazonaws.com/log/8433956087 commit f0b06c64c8d169f41e025a76390efd89e3cdcd99 Author: Pearu Peterson Date: Mon Sep 19 10:13:41 2022 +0300 Fix bugs in sparse compressed tensor shape and device inference (#85240) Fixes #84999 This PR - uses device option to set sparse compressed tensor instance device - enables shape and device inference tests that was disabled due to an oversight - fixes a bug in shape inference of hybrid tensors - fixes a bug in to_sparse_bsr of a cuda tensor - updates tests that catch the above bugs Pull Request resolved: https://github.com/pytorch/pytorch/pull/85240 Approved by: https://github.com/cpuhrsch commit 6a18616296ab9b467a0437bd7248523860c3babc Author: Edward Z. Yang Date: Sat Sep 17 10:52:14 2022 -0700 Support for sym_strides() in backwards formulas (#85210) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85210 Approved by: https://github.com/Chillee, https://github.com/voznesenskym commit f38f9dfbfae27f255e83791670890c4383be98da Author: Edward Z. Yang Date: Mon Sep 19 07:49:42 2022 -0700 When tracing SymInts, peephole optimize multiply by one (#85261) This shows up a lot in graphs, so it is nice to not bother recording useless info. On pytorch_BERT, this optimization doesn't seem to speed anything up, so it's mostly for cleanliness. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85261 Approved by: https://github.com/wconstab commit ebf45a07858f8c07fd5574ea981d50d653fb0c4b Author: Wu, Chunyuan Date: Mon Sep 19 17:45:20 2022 +0000 [NNC] support aten::_convolution when it is 2D conv (#84038) Currently, only `aten::conv2d` has been supported in NNC. When using `torch.jit.trace`, the node on the graph will be `aten::_convolution`. 
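For illustration, a minimal hedged sketch of the tracing behaviour described above (not part of the PR itself): tracing a plain Conv2d module and printing its graph shows which convolution node NNC has to handle.
```python
# Minimal sketch: inspect the TorchScript graph produced by torch.jit.trace for a
# 2D convolution. Per the commit message, the traced graph carries aten::_convolution
# rather than aten::conv2d, which is why NNC needs to recognize that node.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
traced = torch.jit.trace(conv, torch.randn(1, 3, 32, 32))
print(traced.graph)  # look for the convolution node kind in the printed graph
```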
This PR adds support for the `aten::_convolution` node in NNC when we can infer from its parameters that it corresponds to a 2D convolution, so that models obtained from `torch.jit.trace` are also covered. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84038 Approved by: https://github.com/huiguoo commit b049493ed52292c344c5b17f6db16a0242419865 Author: Wang, Eikan Date: Fri Sep 16 07:54:16 2022 +0000 Legalize BFloat16 in NNC. (#83988) Regarding BF16 support in NNC, we always convert the BF16 to FP32 and then compute with FP32. After the FP32 computation, we convert the FP32 result to BF16. This logic has been supported in [half_support.h](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/half_support.h). But the BF16/FP32 conversion has not been supported by LLVM. Currently, LLVM only supports BF16 in its front end but still cannot generate the assembly code. So we implement this feature in [llvm_codegen](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/llvm_codegen.cpp) like the aten implementation. This PR contains three parts: take BF16 as uint16, convert BF16 to FP32, and convert FP32 to BF16.
- Take BF16 as uint16 - [PR Change](https://github.com/pytorch/pytorch/pull/83988/files#diff-9d09ca2fce1c689ab43cd71795ab9b8b63447de950cf98ae8a18114e18d87e79R544-R546) We cannot naively map the BF16 to LLVM's BF16 as the LLVM backend still does not support this data type, as mentioned. Meanwhile, the BF16 in PyTorch is a [structure](https://github.com/pytorch/pytorch/blob/master/c10/util/BFloat16.h#L73) whose real data is a uint16, so we also bitcast the BF16 tensor to uint16.
- BF16 to FP32 conversion - [PR Change](https://github.com/pytorch/pytorch/pull/83988/files#diff-9d09ca2fce1c689ab43cd71795ab9b8b63447de950cf98ae8a18114e18d87e79R1057-R1061) We just need to shift the BF16 value left by 16 bits and then bitcast the shifted value to FP32.
- FP32 to BF16 conversion - [PR Change](https://github.com/pytorch/pytorch/pull/83988/files#diff-9d09ca2fce1c689ab43cd71795ab9b8b63447de950cf98ae8a18114e18d87e79R1066-R1084) This conversion is more involved because we implement it with round-to-nearest-even (RNE). The RNE conversion from FP32 to BF16 is as follows.
```C++
uint16_t round_to_nearest_even(float src) {
  if (std::isnan(src)) {
    // canonical BF16 NaN
    return UINT16_C(0x7FC0);
  } else {
    // reinterpret the FP32 bits as uint32 without changing them
    union {
      uint32_t U32;
      float F32;
    };
    F32 = src;
    // round to nearest even: bias by 0x7FFF plus the lowest kept bit, then truncate
    uint32_t rounding_bias = ((U32 >> 16) & 1) + UINT32_C(0x7FFF);
    return static_cast<uint16_t>((U32 + rounding_bias) >> 16);
  }
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83988 Approved by: https://github.com/ZolotukhinM, https://github.com/frank-wei commit b27eb8d377fc8ac267fdaed7f95a03d609764604 Author: lezcano Date: Mon Sep 19 14:58:38 2022 +0000 Land "Make ceil,floor,round,trunc handle integers" (#85144) PR to land https://github.com/pytorch/pytorch/pull/78480, as Rohit does not work in the PyTorch project anymore Pull Request resolved: https://github.com/pytorch/pytorch/pull/85144 Approved by: https://github.com/ngimel, https://github.com/mruberry commit cd32a86bf2b7bdb928dd02fe7954c5852be8c27a Author: Richard Zou Date: Mon Sep 19 06:53:14 2022 -0700 Stop monkeypatching Tensor.backward() on `import functorch` (#85152) Monkeypatching is bad, we should never be doing it. This PR removes functorch's monkeypatching on Tensor.backward() by adding it directly to the implementation of Tensor.backward().
As an alternative, we could have done an `import functorch` and used `functorch._C.are_transforms_active` directly in `torch/autograd/__init__.py`. The problem with that is that it runs into a bunch of circular imports. NB: https://github.com/pytorch/pytorch/issues/72179 is still on my mind. I didn't choose to do it right now because: - This PR doesn't make the situation worse than it already is (no monkeypatching is better than having the monkeypatch) - We don't have a design for #72179 yet. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/85152 Approved by: https://github.com/soulitzer commit 5ce56d9377914d3c273c0bce037b2443bfe6c21b Author: Richard Zou Date: Mon Sep 19 06:53:13 2022 -0700 Stop loading jit decomps in eager_transforms.py (#85151) They're already loaded in `torch/__init__.py` Test Plan: - functorch tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/85151 Approved by: https://github.com/samdow, https://github.com/soulitzer commit 6fd8e28a993c578cbdc1e78d71fc2ef71682b165 Author: Antonio Kim Date: Mon Sep 19 15:41:19 2022 +0000 Make addmm meta kernel consistent with mm (#84960) Change the names of the parameters in the `addmm` meta kernel to be more consistent with `mm`. Functionally, the only difference in behaviour should be that `addmm` meta kernel gets its options from the `input` tensor instead of from the `bias` parameter. Fixes #84930 CC: @ezyang @ngimel @wconstab @ke1337 @glebk-cerebras Pull Request resolved: https://github.com/pytorch/pytorch/pull/84960 Approved by: https://github.com/ezyang commit 3a51b557efa0d1959210b96009b7153dbeb2e2dc Author: Sean Ross-Ross Date: Mon Sep 19 14:28:25 2022 +0000 Added docs and opinfo for narrow_copy (#84493) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84493 Approved by: https://github.com/amjames, https://github.com/ngimel, https://github.com/mruberry commit b0c447e954a335b0df60307a2e7c720320af7231 Author: Richard Zou Date: Fri Sep 16 14:51:34 2022 -0700 [functorch] add batch rule for linalg.lu_solve (#85175) Fixes https://github.com/pytorch/functorch/issues/1022 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85175 Approved by: https://github.com/Chillee commit bb0e6e54560c55f9d4319ce2f630ad648fe8dd29 Author: mingfeima Date: Fri Aug 19 15:06:55 2022 +0800 port spmm_sum to pytorch and optimize it on CPU [ghstack-poisoned] commit d561aa944b7e777eb0575be2427e26a86df85f11 Author: Mike Ruberry Date: Mon Sep 19 10:32:39 2022 +0000 Adds normal prim, randn reference, and randn OpInfo (#85128) This PR extends prims support for random operations by adding `prims.normal` and `refs.randn`. Note that in the future we may not want to model draws from distributions as their own prims. `prims.normal` accepts a shape and the mean and standard deviation of a normal distribution as numbers. This is distinct from `torch.normal` which takes two tensors so every generated datapoint can be drawn from a normal distribution with its own mean and standard deviation. To address this @ngimel and I expect to add `prims.normal_with_tensors`. The current `prims.normal` could be implemented using `prims.normal_with_tensors`, but we expect the case of two numbers is much more common, and that executors will likely want to specialize for it, anyway. In a follow-up PR I plan to add `refs.randn_like`, `prims.normal_with_tensors` (as mentioned above), and `refs.normal`. 
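To make the number-vs-tensor distinction above concrete, here is a small hedged sketch (the helper names are illustrative, not the actual prims API):
```python
# Illustrative only: a "normal from numbers" draws every element from one N(mean, std),
# while torch.normal with tensor arguments draws each element from its own distribution.
import torch

def normal_from_numbers(shape, mean, std, dtype=torch.float32, device="cpu"):
    return torch.randn(shape, dtype=dtype, device=device) * std + mean

def randn_sketch(*shape, dtype=torch.float32, device="cpu"):
    # a randn reference can then be the mean=0, std=1 special case
    return normal_from_numbers(shape, 0.0, 1.0, dtype=dtype, device=device)

x = randn_sketch(2, 3)
y = torch.normal(torch.zeros(2, 3), torch.ones(2, 3))  # per-element tensor variant
```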
While writing this PR I noticed the following issues: - https://github.com/pytorch/pytorch/issues/85123 - https://github.com/pytorch/pytorch/issues/85121 The latter of which is prohibiting some testing. In future PRs I plan to add a prim for changing layout, add support for pinned memory, and improve support for testing tensor creation operators, likely with a TensorCreationOpInfo class. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85128 Approved by: https://github.com/ngimel commit 17aefce0aaf939a50f81f06db658f942cbc1df1f Author: PyTorch MergeBot Date: Mon Sep 19 10:03:27 2022 +0000 [xla hash update] update the pinned xla hash (#85242) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85242 Approved by: https://github.com/pytorchbot commit 7df0878b9936961cc1bde9d20c834ac4331d140a Author: Rohan Varma Date: Sat Sep 17 19:37:35 2022 +0000 [FSDP] Option to keep grads in lower prec (#85223) Reland of https://github.com/pytorch/pytorch/pull/85134, fix is to use fp16 instead of bf16 which is not supported on all platforms. Differential Revision: [D39565189](https://our.internmc.facebook.com/intern/diff/D39565189/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85223 Approved by: https://github.com/awgu commit 9024015adf01d93fd2533c71fa1e7f06831c2ac7 Author: Nikita Shulga Date: Sun Sep 18 20:38:43 2022 +0000 [BE] Small improvements to device_count (#85192) If `_parse_visible_devices` returns an empty set, no need to make nvml calls Also, reduce indent a bit in `_device_count_nvml` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85192 Approved by: https://github.com/kit1980, https://github.com/ngimel commit dadd89a8a60f80222b8f3c3bdb83440b902c737b Author: Bin Bao Date: Fri Sep 16 20:15:11 2022 +0000 Add a flag to trigger inductor testing (#85183) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85183 Approved by: https://github.com/jansel commit 1378561d03d5bb1433f6404e829b49caaaba9e00 Author: PyTorch MergeBot Date: Sun Sep 18 02:46:48 2022 +0000 [vision hash update] update the pinned vision hash (#85199) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85199 Approved by: https://github.com/pytorchbot commit b8bf11bbf4e9ae0073b14ddd1966d47543e8d2b5 Author: Edward Z. Yang Date: Sat Sep 17 09:42:04 2022 -0700 Add braces around single line conditional (#85207) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85207 Approved by: https://github.com/Chillee commit 68929f4768a0cf77fe2bc4d9f49dd67fbad9f9af Author: Edward Z. Yang Date: Sat Sep 17 09:42:03 2022 -0700 Remove improper asserts. (#85206) strides() will raise an error if it is called on a tensor with symbolic shapes, so we cannot actually assert using it. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85206 Approved by: https://github.com/Chillee commit 9d84db3b726c905beb00ff9ad3d995435c211ae6 Author: Edward Z. Yang Date: Sat Sep 17 09:42:03 2022 -0700 Templatize checkInBoundsForStorage and setStrided for SymInt (#85205) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85205 Approved by: https://github.com/Chillee commit 280e2f92831b92fa0440bdbaf2101df46570c5b9 Author: Edward Z. Yang Date: Sat Sep 17 09:42:02 2022 -0700 Fix bug in computeStorageNbytes (#85204) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85204 Approved by: https://github.com/Chillee commit 12a19a4846c924e9d1e2d37fa0a706fb8eaef9a7 Author: Horace He Date: Sat Sep 17 18:11:51 2022 +0000 Made tracing of proxy symints lazy (#85185) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85185 Approved by: https://github.com/ezyang commit 5dd9610e9d6a54cb6e7340e950606a06aa7eee96 Author: lezcano Date: Sat Sep 17 16:57:34 2022 +0000 Refs and decompositions for index_{add,copy,select,fill} (#85002) As per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/85002 Approved by: https://github.com/ngimel commit 45a9dcd4dd6b691d8e2fb867e68ef59a72f1fc75 Author: Nikita Shulga Date: Sat Sep 17 18:20:17 2022 +0000 [BE] Add explicit `__all__` to torch.cuda (#85193) This helps one avoid re-exporting torch, warnings and other system modules from `torch.cuda` Pull Request resolved: https://github.com/pytorch/pytorch/pull/85193 Approved by: https://github.com/kit1980 commit 8c9d7fabd60b7cbb84277d1db87e9a9c78fde266 Author: Edward Z. Yang Date: Sat Sep 17 06:47:59 2022 -0700 Add SymInt::guard_int (#85139) This allows you to explicitly guard on the specific integer value of a SymInt so that you can condition on it. If possible, prefer guarding on a boolean expression instead. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85139 Approved by: https://github.com/Chillee commit b0a631cd14c1072199826e97a2a6c302b9446dc9 Author: Kurt Mohler Date: Sat Sep 17 11:58:18 2022 +0000 Add nondeterministic alert for `MaxUnpool1d/2d/3d` (#84766) Part of #80827 Part of #78249 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84766 Approved by: https://github.com/Lezcano, https://github.com/mruberry, https://github.com/nikitaved commit b8418e02eb33b82ccb682767994379d840351f42 Author: Kevin Stephano Date: Sat Sep 17 10:52:54 2022 +0000 Create Cache for Fusion Reuse in NVFuser in Python Frontend for Primtorch (#85045) This PR does the following: - Replaces the `FusionOwner` with a `FusionCache` and `FusionInterface`. The `FusionCache` is a singleton that contains a cache of Fusions based on the `FusionDefinition`. It replaces the TorchScript graph caching that looked up a Fusion based on a stringified and canonicalized representation of the TorchScript graph with a prefix tree of statements in the `FusionDefinition`. The `FusionInterface` is an object that represents a Fusion in python. It can also query the cache based on id. - The ability to print out a mechanically derived definition, in python, for the user to use when debugging was added. - Replaces the python `examples` directory with true python tests under `test/test_nvfuser_frontend.py`. - Adds a set of C++ tests under the `test` directory to verify the `FusionCache`, `FusionDefinition`, and parts of the `RecordFunctor` child classes. - Adds a README file to explain how to use the Python Frontend While there are 3,000+ line edits, the bulk of the changes were repetitive line changes to the python bindings for each operation. An identical PR to #83267 to avoid tooling issues. 
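For intuition, a stand-alone hedged sketch of the prefix-tree caching idea described above (class and record names are illustrative, not the real nvfuser frontend):
```python
# Illustrative sketch: cache fusions in a prefix tree keyed by the sequence of
# definition statements, so identical definitions reuse a previously built fusion.
class _TrieNode:
    def __init__(self):
        self.children = {}   # statement (as a hashable record) -> _TrieNode
        self.fusion = None   # compiled fusion stored at the terminal node

class FusionCacheSketch:
    def __init__(self):
        self.root = _TrieNode()

    def lookup_or_create(self, statements, build_fn):
        node = self.root
        for stmt in statements:          # walk the prefix tree statement by statement
            node = node.children.setdefault(stmt, _TrieNode())
        if node.fusion is None:          # cache miss: build and remember the fusion
            node.fusion = build_fn(statements)
        return node.fusion

cache = FusionCacheSketch()
stmts = (("define_tensor", 2), ("ops.add", 0, 1), ("add_output", 2))
fusion = cache.lookup_or_create(stmts, build_fn=lambda s: f"compiled({len(s)} records)")
print(fusion)
```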
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85045 Approved by: https://github.com/davidberard98 commit d23ce29761dbc0e817fa80dcd35d9b8d30f16bbb Author: Hector Yuen Date: Sat Sep 17 09:42:42 2022 +0000 allow changing the cuda allocator settings even after the process started (#84970) Summary: - expose a python call to set the allocator settings, it uses the same format as the value for PYTORCH_CUDA_ALLOCATOR - keep the implementation contained within the cpp file to avoid increasing build times, only expose a function to call the setting - make some of the Allocator Config methods public, now it looks more like a singleton Test Plan: added the unit test Differential Revision: D39487522 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84970 Approved by: https://github.com/zdevito commit 81620c3360d4a15d266b8ad7daf556069db6dfc6 Author: PyTorch MergeBot Date: Sat Sep 17 06:53:11 2022 +0000 Revert "Faster mul(sparse, sparse) with broadcasting in dense dims. (#83428)" This reverts commit d49943bda8e495bdb358e20b6eb114c442afa6e9. Reverted https://github.com/pytorch/pytorch/pull/83428 on behalf of https://github.com/osalpekar due to Reverted because __restrict symbol not supported by certain MSVC compilers, leading to undefined symbol error at compilation time commit 98b8ef99e1ec9b5d273b9612c08404fb34a9dc63 Author: lezcano Date: Sat Sep 17 03:52:56 2022 +0000 Add refs for sinc and sgn (#85142) This PR superseded https://github.com/pytorch/pytorch/pull/80171 This does not add the ref for `special.sinc` as I was getting some errors. This should be added to https://github.com/pytorch/pytorch/pull/84957 (cc @nkaretnikov) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85142 Approved by: https://github.com/ngimel, https://github.com/mruberry commit e33b464ffc8f08d9fb93b09816708d7f32500e68 Author: PyTorch MergeBot Date: Sat Sep 17 04:26:04 2022 +0000 Revert "Refs and decompositions for index_{add,copy,select,fill} (#85002)" This reverts commit 2f0b3de443dd8d4477d70c5a56fa14496d1eebe3. Reverted https://github.com/pytorch/pytorch/pull/85002 on behalf of https://github.com/huydhn due to Broke trunk slow tests commit 1838957e6f5c9f4d32ff446ec5af085d0f19ba2f Author: Brian Hirsh Date: Fri Sep 16 13:04:09 2022 -0700 fix external codegen kernel error checking (#85029) Fixes https://github.com/pytorch/pytorch/issues/84987. I followed the repro steps from the issue (changed `empty_symint` to `empty_symint2` and confirmed that and error gets raised. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85029 Approved by: https://github.com/ezyang commit 652707abc080d73d3477c96eca34a079d1e84e70 Author: John Detloff Date: Sat Sep 17 03:24:44 2022 +0000 Don't cache model specs within PTMCoreMLCompiler (#85136) Summary: It turns out disk cache space is more limited than I realized - Instagram starts evicting cached items at 10mb. We don't actually need to cache the model specs, once the model is compiled all we need is the compiled model. With this diff, after model compilation succeeds we cleanup the model specs from disk. Test Plan: Delete instagram from device to ensure an empty cache, build, launch camera, open a MCS or Segmentation effect, confirm it loads and works correctly. Restart the app and launch again, to confirm it can load the compiled model from cache as well. 
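The cache policy described in the summary can be sketched roughly as follows (hypothetical helper and paths; the real implementation lives in the Objective-C++ PTMCoreMLCompiler):
```python
# Hypothetical sketch of the policy: keep only the compiled model on disk,
# deleting the serialized specs once compilation has succeeded.
import os

def compile_and_trim_cache(spec_path: str, compile_fn) -> str:
    compiled_path = compile_fn(spec_path)     # produces the compiled model artifact
    if os.path.exists(spec_path):
        os.remove(spec_path)                  # specs no longer needed once compiled
    return compiled_path
```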
Differential Revision: D39562009 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85136 Approved by: https://github.com/kimishpatel commit 2dbd2673b6b440efd977e84904ab30db683da921 Author: Nikolay Korovaiko Date: Fri Sep 16 12:21:31 2022 -0700 remove symintnode bits in LTC (#85171) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85171 Approved by: https://github.com/ezyang commit 02f654abca89760ea8004d50702410aceb2296f4 Author: soulitzer Date: Fri Sep 16 20:48:44 2022 -0400 Disable torch.library.Library with PYTORCH_DISABLE_LIBRARY (#85190) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85190 Approved by: https://github.com/d4l3k commit dca42ec20c85aee1487627ff1158d39292c4e411 Author: PyTorch MergeBot Date: Sat Sep 17 02:34:56 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#85198) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85198 Approved by: https://github.com/pytorchbot commit 21f2d55974d8543b70f26d3eac7ab1c4cc7a45ce Author: Taylor Robie Date: Fri Sep 16 09:51:05 2022 -0700 [Profiler][Trivial] Make `test/profiler` folder. (#84273) The first step to improving profiler test coverage is to improve the test structure and organization. This PR just pulls the tests into a dedicated folder. Differential Revision: [D39108645](https://our.internmc.facebook.com/intern/diff/D39108645/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39108645/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/84273 Approved by: https://github.com/slgong-fb commit 4a5edbf0766c258a5cbf230758ce63b9794fb953 Author: Gavin Wu Date: Sat Sep 17 02:11:27 2022 +0000 Make param 'option' const& to prevent unnecessary copy at call-site (#84747) Reviewed By: ajtulloch Differential Revision: D39208916 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84747 Approved by: https://github.com/janeyx99 commit 32fc0b958e8b3280ccd8721009b8642394df7fcf Author: Will Constable Date: Sat Sep 17 02:10:23 2022 +0000 Expose get_active_ddp_module api for torchdynamo DDP (#83333) Pairs up with torchdynamo PR https://github.com/pytorch/torchdynamo/pull/628 Exposes a new API that lets torchdynamo know when it is compiling the 'forward' of a module that is inside a DDPmodule. 
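The rough shape of such an API might look like the following sketch (simplified and hypothetical, not the actual DistributedDataParallel code):
```python
# Hypothetical sketch: a DDP-like wrapper records which module is currently running
# forward, and a compiler backend queries it via get_active_ddp_module().
_active_ddp_module = None

def get_active_ddp_module():
    # returns the module whose forward is currently executing under DDP, else None
    return _active_ddp_module

class DDPLikeWrapper:
    def __init__(self, module):
        self.module = module

    def forward(self, *args, **kwargs):
        global _active_ddp_module
        _active_ddp_module = self.module      # entering a DDP-managed forward
        try:
            return self.module(*args, **kwargs)
        finally:
            _active_ddp_module = None         # leaving the DDP-managed forward

wrapped = DDPLikeWrapper(lambda x: x * 2)
print(wrapped.forward(3), get_active_ddp_module())  # 6 None
```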
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83333 Approved by: https://github.com/mrshenli commit 0a6f32619ece9ccb3c23fc9fb07aec7f3767d8ba Author: Bartek Rymkowski Date: Sat Sep 17 02:06:43 2022 +0000 CoreML .mlmodel export support (#84784) Test Plan: This was tested manually - model was exported and XCode was used to analyze it Reviewed By: jmdetloff Differential Revision: D39048536 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84784 Approved by: https://github.com/jmdetloff commit ca419c33382057ec19fa6186889b78d7eb2f41f5 Author: Wu, Chunyuan Date: Sat Sep 17 01:44:34 2022 +0000 [NNC] add eltwise OPs: mish and elu (#80586) Enable more eltwise OPs in NNC: - mish - elu Pull Request resolved: https://github.com/pytorch/pytorch/pull/80586 Approved by: https://github.com/ZolotukhinM, https://github.com/malfet commit 377b5d6f8ba09ea799dce103b2e7e0aa4d804cc0 Author: Horace He Date: Fri Sep 16 22:59:44 2022 +0000 Added additional simplifications/caching for replacements and divisibility (#84918) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84918 Approved by: https://github.com/ezyang commit 9d1155235b82a83b49b7645db5292491ea81dacf Author: Justin Chu Date: Fri Sep 16 22:42:41 2022 +0000 [ONNX] Create decorators for symbolic function registration (#84709) This PR creates and tests the decorators proposed in #83787 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84709 Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao commit 05ff3f896053bbf152a8add7dfaa381af847b500 Author: Mateusz Sypniewski Date: Sat Sep 17 00:11:05 2022 +0000 Add symlink resolution in benchmark timer interface (#82734) The `sys.executable` string does not take into account if the file is a symlink or not. This lead to a false negative during checking if the two python interpreters were the same, while using an interpreter that was symlinked to another one. Finding the realpath fixes the problem. Tested manually. 
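A minimal sketch of the check described above (standard library only; the function name is illustrative):
```python
import os
import sys

def same_interpreter(other_executable: str) -> bool:
    # resolve symlinks on both sides so e.g. /usr/bin/python3 -> /usr/bin/python3.9
    # still compares equal to the interpreter we are running under
    return os.path.realpath(sys.executable) == os.path.realpath(other_executable)

print(same_interpreter(sys.executable))  # True even if sys.executable is a symlink
```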
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82734 Approved by: https://github.com/ngimel commit 2f0b3de443dd8d4477d70c5a56fa14496d1eebe3 Author: lezcano Date: Fri Sep 16 20:22:08 2022 +0000 Refs and decompositions for index_{add,copy,select,fill} (#85002) As per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/85002 Approved by: https://github.com/ngimel commit d6c2080eb49ccaaf43cff37b7f07a85906250b92 Author: Justin Chu Date: Fri Sep 16 21:56:41 2022 +0000 [ONNX] Update ONNX documentation to include unsupported operators (#84496) - Update ONNX documentation to include unsupported operators - Include aten, quantized and other namespaces Pull Request resolved: https://github.com/pytorch/pytorch/pull/84496 Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao, https://github.com/kit1980 commit 46843be1e6b1ec831bfe62dd0eefb04a813566ca Author: Justin Chu Date: Fri Sep 16 19:45:42 2022 +0000 [ONNX] Update error messages (#85179) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85179 Approved by: https://github.com/kit1980 commit a4c7cadca61efd3a9b585e3985dfd407e1120bd4 Author: Zain Rizvi Date: Fri Sep 16 22:48:10 2022 +0000 Retry installing lintrunner if download fails (#85189) Occasionally lintrunner fails to download due to network issues (it caused one build [to fail](https://github.com/pytorch/pytorch/actions/runs/3054209039/jobs/4925814096) this week) Let's make sure we retry the download before giving up Pull Request resolved: https://github.com/pytorch/pytorch/pull/85189 Approved by: https://github.com/huydhn commit 14b3bdc025ebf8408a5b80064b0c51aec8b69403 Author: PyTorch MergeBot Date: Fri Sep 16 22:33:06 2022 +0000 Revert "[FSDP] Option to keep grads in lower prec (#85134)" This reverts commit 607eccb13ca586f775fb09daeb728a4b4e30ebdd. Reverted https://github.com/pytorch/pytorch/pull/85134 on behalf of https://github.com/ZainRizvi due to broke trunk, failing the tests test_grads_reduced_precision (main.TestFSDPMixedPrecisionUnsharded) commit 4382da5d5e1b306f42d434e58e093d74e364bfc9 Author: Muhammed Shuaibi Date: Fri Sep 16 22:04:42 2022 +0000 Remove assertEqualIgnoreType from test_pooling (#85112) Fix TODOs related to https://github.com/pytorch/pytorch/issues/38095 in test_pooling.py. This PR correctly casts the expected outputs to satisfy the asserts. If you'd prefer feeding `exact_dtype=False` as an argument instead I can update accordingly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85112 Approved by: https://github.com/kit1980 commit cd7e6d4ad1df9fb42bc557b6c8dffaa5535bae74 Author: Justin Chu Date: Fri Sep 16 17:30:24 2022 +0000 [ONNX] New symbolic function registry (#84382) The change brings the new registry for symbolic functions in ONNX. The `SymbolicRegistry` class in `torch.onnx._internal.registration` replaces the dictionary and various functions defined in `torch.onnx.symbolic_registry`. The new registry - Has faster lookup by storing only functions in the opset version they are defined in - Is easier to manage and interact with due to its class design - Builds the foundation for the more flexible registration process detailed in #83787 Implementation changes - **Breaking**: Remove `torch.onnx.symbolic_registry` - `register_custom_op_symbolic` and `unregister_custom_op_symbolic` in utils maintain their api for compatibility - Update _onnx_supported_ops.py for doc generation to include quantized ops. 
- Update code to register python ops in `torch/csrc/jit/passes/onnx.cpp` -0.1 seconds in execution time. -34% time spent in `_run_symbolic_function`. Tested on the alexnet example in public doc. ``` └─ 1.641 export <@beartype(torch.onnx.utils.export) at 0x7f19be17f790>:1 └─ 1.641 export torch/onnx/utils.py:185 └─ 1.640 _export torch/onnx/utils.py:1331 ├─ 0.889 _model_to_graph torch/onnx/utils.py:1005 │ ├─ 0.478 _optimize_graph torch/onnx/utils.py:535 │ │ ├─ 0.214 PyCapsule._jit_pass_onnx_graph_shape_type_inference :0 │ │ │ [2 frames hidden] │ │ ├─ 0.190 _run_symbolic_function torch/onnx/utils.py:1670 │ │ │ └─ 0.145 Constant torch/onnx/symbolic_opset9.py:5782 │ │ │ └─ 0.139 _graph_op torch/onnx/_patch_torch.py:18 │ │ │ └─ 0.134 PyCapsule._jit_pass_onnx_node_shape_type_inference :0 │ │ │ [2 frames hidden] │ │ └─ 0.033 [self] ``` ![image](https://user-images.githubusercontent.com/11205048/188032302-688d881e-860d-4046-bdba-90da54233576.png) The startup process takes 0.03 seconds. Calls to `inspect` will be eliminated when we switch to using decorators for registration in #84448 ![image](https://user-images.githubusercontent.com/11205048/188208910-250f0434-475d-4872-9abc-781535519305.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84382 Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao commit 735154354b86296b4b8d99c78da764565af76018 Author: Kshiteej K Date: Fri Sep 16 21:44:23 2022 +0000 update torch.narrow doc (#85180) Fixes #84783 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85180 Approved by: https://github.com/kit1980 commit 5877cc9a9fce3ad1dc69ed6b862c9245a772631b Author: Catherine Lee Date: Fri Sep 16 21:35:33 2022 +0000 fix for rebase and merge (#85168) its --branch not -b Pull Request resolved: https://github.com/pytorch/pytorch/pull/85168 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi commit 17593f15bdc53c2ec41bc7f4a087f3a28e37b626 Author: Richard Zou Date: Fri Sep 16 12:34:40 2022 -0700 [functorch] Document DynamicLayer.{h, cpp} a bit more (#85178) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85178 Approved by: https://github.com/Chillee commit d559299ccf4b685ca1e42a7f090fbbb77e25efed Author: Salil Desai Date: Fri Sep 16 21:28:15 2022 +0000 [QNNPACK] Export cpuinfo-targets in clog CMakeLists (#84876) Summary: Fixes the following error when building qnnpack: ``` CMake Error: install(EXPORT "cpuinfo-targets" ...) includes target "cpuinfo" which requires target "clog" that is not in any export set. ``` This diff mirrors the changes to the CMakeLists of https://github.com/pytorch/cpuinfo/pull/69 Test Plan: ``` export ANDROID_NDK=/opt/android_ndk/r20 export ANDROID_NDK_HOME=${ANDROID_NDK} export ANDROID_SDK=/opt/android_sdk export ANDROID_HOME=${ANDROID_SDK} cd ~/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack ./scripts/build-android-arm64.sh ``` Succeeds Differential Revision: D39438768 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84876 Approved by: https://github.com/digantdesai commit c6c3346d5aaa548d82f3dff14ed54e687af50116 Author: Andrew Gu Date: Wed Sep 14 20:29:48 2022 +0000 [FSDP] Short-term fix to remove `optim_input` (#84201) This is a short-term quick fix to accommodate using the existing optimizer state APIs without passing `optim_input`. 
It preserves the existing `optim_input` code path but if `optim_input` is `None` while `optim` is not, then the APIs will use the new code path that relies on `self.param_groups` to get the information previously provided by `optim_input`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84201 Approved by: https://github.com/rohan-varma commit a9258eba8e87417479bde4004dfa9ffb6e60a8fa Author: Khushi Agrawal Date: Fri Sep 16 21:24:09 2022 +0000 [Testing] Port `bernoulli` and `multinomial` to ErrorInputs. (#74683) Hi, the PR aims to port `bernoulli` and `multinomial` to error inputs. Thanks! cc: @kshitij12345! :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/74683 Approved by: https://github.com/kshitij12345, https://github.com/mruberry commit a5d9d2aaa20f878d0a61bd2b682ae3e2248df07d Author: samdow Date: Fri Sep 16 15:49:11 2022 +0000 [functorch] remove argnums partial helper function, rewrite test to use slice argnum (#84951) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84951 Approved by: https://github.com/zou3519 commit 776e0fe75600b6d3a93060d91bbe0a31fc92afce Author: PyTorch MergeBot Date: Fri Sep 16 21:06:24 2022 +0000 Revert "Make ones and zeros's ref accepts variadic size argument (#85117)" This reverts commit 7e5616c9ff6347913d98627c60e39f72dce558e3. Reverted https://github.com/pytorch/pytorch/pull/85117 on behalf of https://github.com/ZainRizvi due to Failed trunk commit 490727a35f213778d2e709a3b03d899ad502c5f9 Author: Edward Z. Yang Date: Fri Sep 16 10:23:01 2022 -0700 New calling convention for Python dispatcher (#85133) Instead of calling into the Python dispatcher for EVERY dispatcher call, we now have a two step process. First, we getattr(op: OpOverload, dispatch_key) to "load" the handler for the function. This can either be a conventional function (in which case we will call it, in the same way the old Python dispatcher worked), or it can be a DispatchKey, in which case we will directly call that DispatchKey in C++, bypassing marshalling between Python and C++ entirely. OpOverload.__getattr__ is carefully written so that it will cache the handler it loads. A further optimization would be to define __slots__ on OpOverload, and to ensure that the DispatchKey strings are interned. The resulting Python dispatcher is less flexible: after the first lookup, the handler is cached and we won't recompute it. Furthermore, by default, dispatches will not go into Python, and so you won't get stack frames for the Python dispatcher by default. But we get a huge performance improvement: on the following microbenchmark we go from 2.5s to 1.9s.
```
import time
import torch
from functorch import make_fx

def f(x):
    for i in range(1000):
        x = x * x
    return x

begin = time.time()
res = make_fx(f, tracing_mode="symbolic")(torch.randn(10, 20))
print(time.time() - begin)
```
Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85133 Approved by: https://github.com/wconstab commit e5fac7f5dc4f16070193bb7d06322e0faaa94099 Author: Edward Z. Yang Date: Thu Sep 15 21:48:59 2022 -0700 Optimize torch.ops.ns.opname.overload accessor in torch dispatch (#85132) This doesn't actually seem to help all that much. Signed-off-by: Edward Z.
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85132 Approved by: https://github.com/wconstab commit 607eccb13ca586f775fb09daeb728a4b4e30ebdd Author: Rohan Varma Date: Fri Sep 16 17:54:08 2022 +0000 [FSDP] Option to keep grads in lower prec (#85134) Differential Revision: [D39565189](https://our.internmc.facebook.com/intern/diff/D39565189) Rehash of a similar PR from a month ago that got stale. Adds a config to FSDP MP so that gradients can be kept in lower precision, to support optimizers such as AnyPrecisionOptimizer which would like to keep grads in bf16. To do this, for sharded cases, we cannot simply omit the cast back to the full precision param dtype, otherwise when setting `p.grad = p._saved_grad_shard` in finalize_params, autograd will throw an error indicating that the grad dtype should match the param dtype when it is being set. As a workaround, we re-cast after setting this. Although, this means that for cases that use gradient accumulation, p._saved_grad_shard will be of the reduced dtype because it is set to p.grad in `_prep_grad_for_backward`. As a result, add a check + recast here as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85134 Approved by: https://github.com/awgu commit 7e5616c9ff6347913d98627c60e39f72dce558e3 Author: Sherlock Huang Date: Fri Sep 16 17:05:58 2022 +0000 Make ones and zeros's ref accepts variadic size argument (#85117) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85117 Approved by: https://github.com/ngimel, https://github.com/Lezcano commit 38778add8d7047bfcf29c754cd43ae9b258e4410 Author: Christian Puhrsch Date: Fri Sep 16 19:27:31 2022 +0000 flash_attention_helper mitigation: pass contiguous inputs (#85135) There appears to be a transient issue with respect to non-contiguous inputs in flash_attn and thus we're passing contiguous inputs to mitigate it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85135 Approved by: https://github.com/drisspg commit 7b3e177b8772d95e5aaa92415a632d280320c740 Author: Zain Rizvi Date: Fri Sep 16 19:07:57 2022 +0000 Increase docker build timeout (#85156) Docker builds used to take around 15 mins to run (more than the 10 min timeout) and have recently started taking even longer due to conda's slow dependency resolver. We were in this bad state where we _depended_ on the retry to complete the build. That is, the first attempt would try to build docker, timeout, then the second attempt would continue to build on top of the cache the first build had setup, etc. Increasing the timeout so that docker builds actually have enough time to complete the build within a single attempt Pull Request resolved: https://github.com/pytorch/pytorch/pull/85156 Approved by: https://github.com/huydhn commit 29eba319b4f56eae2b6a3bcc3830f8e080214aef Author: Sherlock Huang Date: Fri Sep 16 03:51:05 2022 +0000 Use alias for nop decomp (#84727) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84727 Approved by: https://github.com/Chillee commit d8eae6283db867a1abd75df0a86b1a48faef3043 Author: Feisi Fu Date: Fri Sep 16 17:49:06 2022 +0000 Rename 'torch/ao/nn/quantized._reference' to 'torch/ao/nn/quantized/reference'. 
(#84974) Currently, the path for reference modules contains _ which means it's private (https://github.com/pytorch/pytorch/tree/master/torch/ao/nn/quantized/_reference), but we would like to make it public since the reference module is now enabled by default in the fx graph mode quantization flow and it will be added to eager mode flow as well in the future. To make '_reference' public, it should satisfy the [public API rules](https://github.com/pytorch/pytorch/wiki/Public-API-definition-and-documentation). I did in the first commit (prepare '_reference' to be public): 1: add __all__ to public modules and packages; 2. made functions, that are only used in the file that the function is defined, private by adding _ at their names. Fixes #83090. (we rename the 'torch/ao/nn/quantized/_reference', because of migration #81667.) This is a dup for the #84786. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84974 Approved by: https://github.com/andrewor14, https://github.com/z-a-f commit d710c95cc01486b7f2922799dd033da9893b0e21 Author: lezcano Date: Fri Sep 16 15:27:25 2022 +0000 Implement forward AD for scatter_reduce (#85000) I left the case `reduction="prod"` for future work as it's a bit of a pain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85000 Approved by: https://github.com/soulitzer commit 6162a043640cf01695cf568edd0be047d56477ff Author: Natalia Gimelshein Date: Fri Sep 16 15:54:50 2022 +0000 fix half_to_float arg in *softmax decomp (#85120) Fixes https://github.com/pytorch/torchdynamo/issues/1239 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85120 Approved by: https://github.com/Chillee commit f37069aac7a0c4fd1c3455e4e9058bf56fb759f4 Author: Elias Ellison Date: Fri Sep 16 05:01:03 2022 +0000 Re-enable fixed dynamo tests (#84969) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84969 Approved by: https://github.com/bdhirsh, https://github.com/ezyang commit 54c9c4e73d5c2bfc9f244b62a81a99e8852890e7 Author: Elias Ellison Date: Fri Sep 16 05:01:02 2022 +0000 Flip fake tensors on in aot autograd (#84968) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84968 Approved by: https://github.com/Chillee commit 61ba125064dc452e2fef3bfa1731db46d57fc322 Author: Richard Zou Date: Thu Sep 15 11:10:16 2022 -0700 Add warning about installing functorch via setup.py (#85095) We'll probably delete the functorch/setup.py file in 2 weeks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85095 Approved by: https://github.com/samdow commit 2e1ec5d18cbf49a195a4bff2ac1cff43b159c498 Author: lezcano Date: Fri Sep 16 09:10:06 2022 +0000 Re-enables some tests for linalg.det (#85141) At last. 
Fixes https://github.com/pytorch/functorch/issues/961 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85141 Approved by: https://github.com/zou3519 commit 8b29b7953a46fbab9363294214f7689d04df0a85 Author: lezcano Date: Thu Sep 15 19:30:51 2022 +0000 Fix behaviour of index_add / atomicAdd(bool,bool) (#85100) This fixes one of the first PyTorch issues I opened, as it bit me again Fixes https://github.com/pytorch/pytorch/issues/54317 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85100 Approved by: https://github.com/ngimel commit 4bdc0af53d235b5939e6792d2f54004fc11442bd Author: Horace He Date: Fri Sep 16 02:29:13 2022 +0000 Added support for symbolic is_contiguous (#84829) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84829 Approved by: https://github.com/ezyang commit 5652ab22f6d56ed74b21f13a1ba71f09cc94ee4a Author: Andrew Gu Date: Thu Sep 15 19:20:50 2022 +0000 [FSDP] Add `_set_flattened()`; `_is_flattened()` (#85038) For both exposing the original parameters and for TP integration, we cannot only rely on `isinstance(param, FlatParameter)` to ignore already-flattened parameters in `.named_parameters()`. As a simple workaround, we can mark original parameters or `ShardedTensor`s with an attribute `_fsdp_flattened` (saved as a string variable `FSDP_FLATTENED`) to indicate that the parameter/tensor has already been flattened. This issue only arises for recursive/nested FSDP wrapping. This PR also changes `isinstance(param, FlatParameter)` checks to `type(param) is FlatParameter` because all tensor subclasses that have `_is_param == True` will return `True` for `isinstance(param, )`. This means that a `ShardedTensor` parameter will return `True` for `isinstance(st, FlatParameter)`, which is not what we want. https://github.com/pytorch/pytorch/blob/5271494ef21ae0140755a41f3b16a8bd745642b6/torch/nn/parameter.py#L8-L10 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85038 Approved by: https://github.com/rohan-varma commit 0ec19db7ac88e307135100ddcfc418ae3925844f Author: PyTorch MergeBot Date: Fri Sep 16 02:44:33 2022 +0000 [vision hash update] update the pinned vision hash (#85130) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85130 Approved by: https://github.com/pytorchbot commit b363e9874a1d33ac8e6d3c6f528025d7217bb101 Author: PyTorch MergeBot Date: Fri Sep 16 02:43:16 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#85129) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85129 Approved by: https://github.com/pytorchbot commit 647aeb831f5fbaae9c815d02b2eca256c43c9042 Author: Shisuiuzumaki Date: Fri Sep 16 01:45:20 2022 +0000 torch/jit/_trace.py in compare_outputs(original, reference, match_wha… (#84850) Fixes #83533 ``` /opt/homebrew/lib/python3.9/site-packages/torch/jit/_trace.py in _check_trace(check_inputs, func, traced_func, check_tolerance, strict, force_outplace, is_trace_module, _module_class) 525 traced_outs = run_mod_and_filter_tensor_outputs(traced_func, inputs, "trace") 526 fn_outs = run_mod_and_filter_tensor_outputs(func, inputs, "Python function") --> 527 if compare_outputs(traced_outs, fn_outs, "Python function"): 528 check_outs = run_mod_and_filter_tensor_outputs( 529 check_mod_func, inputs, "repeated trace" /opt/homebrew/lib/python3.9/site-packages/torch/jit/_trace.py in compare_outputs(original, reference, match_what) 500 else: 501 torch.testing.assert_close( --> 502 orig.double(), 503 ref.double(), 504 rtol=check_tolerance, TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead. ``` ``` if orig.is_mps or ref.is_mps: torch.testing.assert_close( orig.float(), ref.float(), rtol=check_tolerance, atol=default_tolerances(orig, ref)[1], equal_nan=True, ) else: torch.testing.assert_close( orig.double(), ref.double(), rtol=check_tolerance, atol=default_tolerances(orig, ref)[1], equal_nan=True, ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84850 Approved by: https://github.com/davidberard98 commit 54bccbb22fcc970c48a750d658fe675c80809d42 Author: Catherine Lee Date: Fri Sep 16 01:33:42 2022 +0000 [mergebot] rebase + merge (#85028) adds flag to rebase and merge with one command by just running the tryrebase.py script before running the trymerge.py script testing on https://github.com/clee2000/random-testing/pull/19 corresponding test-infra change: https://github.com/pytorch/test-infra/pull/712 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85028 Approved by: https://github.com/janeyx99, https://github.com/huydhn commit 89525cbd6930ae0be3003dc55e02edb70e395458 Author: Steven Krawczyk Date: Fri Sep 16 01:26:22 2022 +0000 Add variable_list support to ExtractVariables struct (#84583) This is required to unblock https://github.com/pytorch/xla/pull/3843, which lowers the einsum op for pytorch/xla. Because one method input parameter is a TensorList, we need to support TensorLists here so that we can support einsum gradients. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84583 Approved by: https://github.com/soulitzer commit 50733c8bbafa596359caf08bdf97c8a3628aaf6c Author: Animesh Jain Date: Fri Sep 16 01:20:54 2022 +0000 TorchDynamo Remove context manager (#85124) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/85124 Approved by: https://github.com/ezyang commit 95a2c3df31983bbe5c28b19e2910855189fae7a1 Author: Kurt Mohler Date: Fri Sep 16 01:10:12 2022 +0000 Replace `expectedAlertNondeterministic` with simpler check function (#84808) Fixes #84807 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84808 Approved by: https://github.com/mruberry commit 1275e2df1fcc8ba7651450a0e6c7ed30036de340 Author: Edward Z. Yang Date: Thu Sep 15 09:03:37 2022 -0700 Remove getattr magic method from OpOverload (#85090) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85090 Approved by: https://github.com/wconstab commit 00ce302c077cf1b26e9190da146b008dd319eed2 Author: Edward Z. Yang Date: Wed Sep 14 22:00:50 2022 -0700 Performance optimizations to proxy tensor (#85049) - Lazily allocate FX nodes for size/stride accessors on proxy tensor - Properly track derived computations on strides/numel/etc - Remove unnecessary tree_map at end of proxy tensor trace checking invariants; we will just have to be smart (it's too expensive) - Avoid tree_map in sym proxy tracing Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85049 Approved by: https://github.com/wconstab commit d49943bda8e495bdb358e20b6eb114c442afa6e9 Author: nikitaved Date: Thu Sep 15 21:36:21 2022 +0000 Faster mul(sparse, sparse) with broadcasting in dense dims. (#83428) Preliminary benchmarks (square matrices of shape (n, n)).
Script
```python
import torch
import math
from IPython import get_ipython
from itertools import product, repeat
import pickle
from torch.utils.benchmark import Timer, Compare

torch.manual_seed(13)

problem_dims = (
    (10000, 100),
    (100000, 1000),
    (1000000, 10000),
    (10, 100),
    (10, 1000),
    (10, 10000),
    (100, 1000),
    (100, 10000),
    (1000, 10000),
    (1000, 100000),
    (1000, 1000000),
)

name = "PR"
device = "cuda"
results = []

for n, nnz in problem_dims:
    def gen_tensor(coalesce=False):
        shape = (n, n)
        nrows, ncols = shape
        rowidx = torch.randint(low=0, high=nrows, size=(nnz,), device=device)
        colidx = torch.randint(low=0, high=ncols, size=(nnz,), device=device)
        itemidx = torch.vstack((rowidx, colidx))
        xvalues = torch.randn(nnz, device=device)
        itemidx = torch.hstack((itemidx, itemidx))
        xvalues = torch.hstack((xvalues, xvalues))
        res = torch.sparse_coo_tensor(itemidx, xvalues, size=shape)
        if coalesce:
            return res.coalesce()
        else:
            return res

    for x_coalesce, y_coalesce in product(*repeat((True, False), 2)):
        x = gen_tensor(x_coalesce)
        y = gen_tensor(y_coalesce)
        smtp = "x * y"
        timer = Timer(smtp,
                      globals=globals(),
                      label="coo.mul",
                      description=f"{name}: mul, device: {device}",
                      sub_label=f"n={n}, nnz={nnz}, coalesce=({x_coalesce, y_coalesce})",
                      num_threads=torch.get_num_threads())
        results.append(timer.blocked_autorange())

compare = Compare(results)
compare.trim_significant_figures()
compare.print()

with open(f"{name}_{device}_mul.pickle", 'wb') as f:
    pickle.dump(results, f)
```
Gather results
```python
import pickle
from torch.utils.benchmark import Timer, Compare

files = [
    "PR",
    "master"
]

device = 'cuda'
timers = []

for name in files:
    with open("{}_{}_mul.pickle".format(name, device), 'rb') as f:
        timers += pickle.load(f)

compare = Compare(timers)
compare.trim_significant_figures()
compare.print()
```
CUDA
```
[------------------------------------------------- coo.mul -------------------------------------------------]
                                                   |  PR: mul, device: cuda  |  master: mul, device: cuda
24 threads: -------------------------------------------------------------------------------------------------
n=10000, nnz=100, coalesce=((True, True))          |   95  |     91
n=10000, nnz=100, coalesce=((True, False))         |   87  |    242
n=10000, nnz=100, coalesce=((False, True))         |   87  |    226
n=10000, nnz=100, coalesce=((False, False))        |  130  |    371
n=100000, nnz=1000, coalesce=((True, True))        |  100  |    521
n=100000, nnz=1000, coalesce=((True, False))       |   90  |    649
n=100000, nnz=1000, coalesce=((False, True))       |  100  |    659
n=100000, nnz=1000, coalesce=((False, False))      |  200  |    781
n=1000000, nnz=10000, coalesce=((True, True))      |  100  |   4861
n=1000000, nnz=10000, coalesce=((True, False))     |  100  |   5012
n=1000000, nnz=10000, coalesce=((False, True))     |   98  |   5010
n=1000000, nnz=10000, coalesce=((False, False))    |  384  |   5174
n=10, nnz=100, coalesce=((True, True))             |  100  |     79
n=10, nnz=100, coalesce=((True, False))            |  100  |    221
n=10, nnz=100, coalesce=((False, True))            |  100  |    221
n=10, nnz=100, coalesce=((False, False))           |  100  |    350
n=10, nnz=1000, coalesce=((True, True))            |  100  |    100
n=10, nnz=1000, coalesce=((True, False))           |  100  |    240
n=10, nnz=1000, coalesce=((False, True))           |  100  |    254
n=10, nnz=1000, coalesce=((False, False))          |  100  |    392
n=10, nnz=10000, coalesce=((True, True))           |  100  |    110
n=10, nnz=10000, coalesce=((True, False))          |  110  |    286
n=10, nnz=10000, coalesce=((False, True))          |  110  |    286
n=10, nnz=10000, coalesce=((False, False))         |  271  |    455
n=100, nnz=1000, coalesce=((True, True))           |  110  |    851
n=100, nnz=1000, coalesce=((True, False))          |  110  |   1000
n=100, nnz=1000, coalesce=((False, True))          |  110  |    990
n=100, nnz=1000, coalesce=((False, False))         |  140  |   1124
n=100, nnz=10000, coalesce=((True, True))          |  110  |   5137
n=100, nnz=10000, coalesce=((True, False))         |  110  |   5391
n=100, nnz=10000, coalesce=((False, True))         |  100  |   5405
n=100, nnz=10000, coalesce=((False, False))        |  249  |   5539
n=1000, nnz=10000, coalesce=((True, True))         |  100  |   8598
n=1000, nnz=10000, coalesce=((True, False))        |  100  |   8800
n=1000, nnz=10000, coalesce=((False, True))        |  100  |   8782
n=1000, nnz=10000, coalesce=((False, False))       |  255  |   8956
n=1000, nnz=100000, coalesce=((True, True))        |  120  |  84500
n=1000, nnz=100000, coalesce=((True, False))       |  200  |  88560
n=1000, nnz=100000, coalesce=((False, True))       |  160  |  89000
n=1000, nnz=100000, coalesce=((False, False))      |  373  |  89000
n=1000, nnz=1000000, coalesce=((True, True))       |  312  | 606400
n=1000, nnz=1000000, coalesce=((True, False))      | 1340  | 609200
n=1000, nnz=1000000, coalesce=((False, True))      | 1340  | 609100
n=1000, nnz=1000000, coalesce=((False, False))     | 4408  | 611400

Times are in microseconds (us).
```
CPU
```
[------------------------------------------------ coo.mul ------------------------------------------------]
                                                   |  PR: mul, device: cpu  |  master: mul, device: cpu
24 threads: -----------------------------------------------------------------------------------------------
n=10000, nnz=100, coalesce=((True, True))          |      8  |      8
n=10000, nnz=100, coalesce=((True, False))         |     32  |     34
n=10000, nnz=100, coalesce=((False, True))         |     32  |     34
n=10000, nnz=100, coalesce=((False, False))        |     41  |     56
n=100000, nnz=1000, coalesce=((True, True))        |     24  |     24
n=100000, nnz=1000, coalesce=((True, False))       |     90  |    100
n=100000, nnz=1000, coalesce=((False, True))       |     87  |    100
n=100000, nnz=1000, coalesce=((False, False))      |    231  |    255
n=1000000, nnz=10000, coalesce=((True, True))      |    190  |    200
n=1000000, nnz=10000, coalesce=((True, False))     |    908  |   2023
n=1000000, nnz=10000, coalesce=((False, True))     |    800  |   2036
n=1000000, nnz=10000, coalesce=((False, False))    |   3684  |   3989
n=10, nnz=100, coalesce=((True, True))             |      8  |      7
n=10, nnz=100, coalesce=((True, False))            |     34  |     30
n=10, nnz=100, coalesce=((False, True))            |     33  |     30
n=10, nnz=100, coalesce=((False, False))           |     44  |     50
n=10, nnz=1000, coalesce=((True, True))            |      8  |      7
n=10, nnz=1000, coalesce=((True, False))           |    100  |    100
n=10, nnz=1000, coalesce=((False, True))           |    130  |    100
n=10, nnz=1000, coalesce=((False, False))          |    746  |    210
n=10, nnz=10000, coalesce=((True, True))           |      8  |      7
n=10, nnz=10000, coalesce=((True, False))          |   1000  |   1500
n=10, nnz=10000, coalesce=((False, True))          |   1000  |   1510
n=10, nnz=10000, coalesce=((False, False))         |   3063  |   2457
n=100, nnz=1000, coalesce=((True, True))           |     25  |     25
n=100, nnz=1000, coalesce=((True, False))          |    180  |    130
n=100, nnz=1000, coalesce=((False, True))          |    200  |    130
n=100, nnz=1000, coalesce=((False, False))         |    271  |    255
n=100, nnz=10000, coalesce=((True, True))          |    100  |    100
n=100, nnz=10000, coalesce=((True, False))         |   2444  |   2290
n=100, nnz=10000, coalesce=((False, True))         |   2455  |   2357
n=100, nnz=10000, coalesce=((False, False))        |   5316  |   3783
n=1000, nnz=10000, coalesce=((True, True))         |    204  |    211
n=1000, nnz=10000, coalesce=((True, False))        |   2457  |   2480
n=1000, nnz=10000, coalesce=((False, True))        |   2448  |   2539
n=1000, nnz=10000, coalesce=((False, False))       |   3665  |   4801
n=1000, nnz=100000, coalesce=((True, True))        |   2293  |   2374
n=1000, nnz=100000, coalesce=((True, False))       |   9000  |  24620
n=1000, nnz=100000, coalesce=((False, True))       |   8000  |  25080
n=1000, nnz=100000, coalesce=((False, False))      |  26500  |  47650
n=1000, nnz=1000000, coalesce=((True, True))       |  10000  |  13000
n=1000, nnz=1000000, coalesce=((True, False))      |  80000  | 362200
n=1000, nnz=1000000, coalesce=((False, True))      |  78050  | 392600
n=1000, nnz=1000000, coalesce=((False, False))     | 312100  | 766900

Times are in microseconds (us).
```
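For orientation, a minimal sketch of the operation being benchmarked above (not taken from the PR; the shapes and values are arbitrary):

```python
import torch

# Elementwise product of two sparse COO tensors; this is the `x * y` the Timer
# measures. The PR speeds this path up and, per its title, adds broadcasting
# support over the dense dimensions of hybrid sparse tensors.
indices = torch.tensor([[0, 1, 1],
                        [2, 0, 2]])
x = torch.sparse_coo_tensor(indices, torch.randn(3), size=(2, 3))
y = torch.sparse_coo_tensor(indices, torch.randn(3), size=(2, 3))
out = x * y          # sparse * sparse elementwise multiply
print(out.coalesce())
```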
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83428 Approved by: https://github.com/cpuhrsch commit abaf99d37fbbec58125f3e28b90b0bbff3026527 Author: Pearu Peterson Date: Wed Sep 14 23:14:42 2022 +0300 Enable unary elementwise inplace ops for all sparse compressed layouts. (#85031) Fixes #84998 Unblocks #84897 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85031 Approved by: https://github.com/cpuhrsch commit 27ec195a81522df397098f7ffd12c06773261000 Author: samdow Date: Thu Sep 15 18:10:03 2022 +0000 [functorch] fix jacfwd so all inputs get wrappers (#84915) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84915 Approved by: https://github.com/zou3519 commit 64899c5d10944a617cc65c938e84c38011456a58 Author: Nikolay Korovaiko Date: Thu Sep 15 22:58:50 2022 +0000 change the type of storage_offset to SymInt (#85102) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/85102 Approved by: https://github.com/ezyang commit 7f88934a8fb9b376b32c722ac2f05959da34c147 Author: soulitzer Date: Thu Sep 15 22:46:16 2022 +0000 [reland 2] Call jit decomp in VariableType to improve forward AD coverage (#84976) Reland of https://github.com/pytorch/pytorch/pull/84675 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84976 Approved by: https://github.com/zou3519 commit 7dcc723d35a8795c7f386cf0a439299a89432a75 Author: Rodrigo Kumpera Date: Thu Sep 15 22:32:48 2022 +0000 [c10d] Ensure collectives are called with the same dtype for all tensor params. (#84664) While passing tensors with different dtypes don't crash, they don't produce sensible results. We see data tearing instead of casting. It's not clear we want to support transparent casting so, for now, we fail when such input is presented. Fixes #84525 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/84664 Approved by: https://github.com/rohan-varma commit 5558deac592cd82c1f2ccc9fcb9c3924bcaf0266 Author: Bowen Bao Date: Thu Sep 15 22:25:48 2022 +0000 [ONNX] Add `caffe2/python/onnx/**` to merge rule (#85118) This PR extends merge rule such that any related fixes needed for caffe2 onnx tests can be merged in the same PR. Test skips need to be added to `caffe2/python/onnx/tests/onnx_backend_test.py` for new ONNX operators when ONNX submodule is updated. Unblocks #83201 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85118 Approved by: https://github.com/malfet commit b1f5644fadda2c202b8f431f6eca3e326ac92b4e Author: Larry Liu <8188269+larryliu0820@users.noreply.github.com> Date: Thu Sep 15 22:16:30 2022 +0000 [frontend] Print real type for Argument (#85103) Retry of [#84985](https://github.com/pytorch/pytorch/pull/84985). For some reason `ghstack land` doesn't work on that PR In JIT world, if we parse an argument in schema and print its type, we always get `fake_type`. For example, `MemoryFormat? memory_format` becomes `int? memory_format`. This doesn't align with the original schema string and creates discrepency between `torchgen.FunctionSchema` and `torch._C.FunctionSchema`. Here I'm letting `torch._C.Argument` print its `real_type` and hence be aligned with the original schema string. Rely on newly added unit test. 
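A rough illustration of the round-trip being fixed (a sketch assuming `torch._C.parse_schema`; not code from this PR):

```python
import torch

schema = torch._C.parse_schema(
    "aten::clone(Tensor self, *, MemoryFormat? memory_format=None) -> Tensor"
)
# Previously, printing the parsed schema rendered the optional argument with its
# fake type (`int? memory_format`); printing the real_type keeps it aligned with
# the original schema string (`MemoryFormat? memory_format`).
print(schema)
```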
Differential Revision: [D39550665](https://our.internmc.facebook.com/intern/diff/D39550665) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85103 Approved by: https://github.com/cccclai commit 52a2b612035e081626ff6f23eec87782bde6643c Author: Xu Zhao Date: Thu Sep 15 21:48:28 2022 +0000 Fix fetch function which breaks user code (#85099) The [fastNLP](https://github.com/fastnlp/fastNLP/blob/v0.6.0/fastNLP/core/batch.py#L51) model uses DataSetGetter to fetch data from the dataset. The following code breaks because of https://github.com/pytorch/pytorch/pull/84301: ``` from fastNLP.io.pipe.qa import CMRC2018BertPipe input_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), ".data", "cmrc2018-sim") data_bundle = CMRC2018BertPipe().process_from_file(paths=input_dir) data_bundle.rename_field('chars', 'words') data_bundle.get_dataset('dev') dataset = DataSetGetter(dataset, as_numpy) dataiter = torch.utils.data.DataLoader(dataset=dataset) for batch in dataiter: ``` This is because for the `DataSetGetter` class, the following condition holds: ``` ``` This PR adds an additional check to make sure `__getitems__` is only called when it is not None. This error was found by the torchbench nightly CI, original error stack trace: ``` ERROR: test_fastNLP_Bert_train_cuda (__main__.TestBenchmark) ---------------------------------------------------------------------- components._impl.workers.subprocess_rpc.ChildTraceException: Traceback (most recent call last): File "/home/circleci/project/components/_impl/workers/subprocess_rpc.py", line 470, in _run_block exec( # noqa: P204 File "", line 35, in File "", line 12, in _run_in_worker_f File "/home/circleci/project/torchbenchmark/util/model.py", line 16, in __call__ obj = type.__call__(cls, *args, **kwargs) File "/home/circleci/project/torchbenchmark/models/fastNLP_Bert/__init__.py", line 93, in __init__ self.example_inputs = self._prefetch(example_inputs) File "/home/circleci/project/torchbenchmark/models/fastNLP_Bert/__init__.py", line 133, in _prefetch for batch_x, batch_y in example_inputs: File "/home/circleci/miniconda3/lib/python3.8/site-packages/fastNLP/core/batch.py", line 266, in __iter__ for indices, batch_x, batch_y in self.dataiter: File "/home/circleci/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__ data = self._next_data() File "/home/circleci/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 719, in _next_data data = self._dataset_fetcher.fetch(index) # may raise StopIteration File "/home/circleci/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch data = self.dataset.__getitems__(possibly_batched_index) TypeError: 'NoneType' object is not callable ``` Full error log: https://app.circleci.com/pipelines/github/pytorch/benchmark/5143/workflows/0676f36d-0ab4-42bd-adb4-90e6b0df76d1/jobs/5293 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85099 Approved by: https://github.com/ejguan commit 2386cd2945498ce7261b761a8d9bd5b59d06c5a1 Author: Khushi Agrawal Date: Thu Sep 15 19:34:44 2022 +0000 [reland] [numpy] add torch.concatenate, alias of torch.cat (#85073) Previous PR: #82946 Fixes #81161 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85073 Approved by: https://github.com/mruberry commit 25ecc4889de895fa2041b556e1ea4dc057c33712 Author: Andrew Gu Date: Thu Sep 15 15:43:12 2022 +0000 [FSDP] Fix memory regression! 
(#85087) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85087 Approved by: https://github.com/zhaojuanmao commit 4306a1882622d0ba293b3704c0bc1ef5ce181edb Author: Elias Ellison Date: Thu Sep 15 17:04:32 2022 +0000 Fix tied params with Fake Tensor (#85065) Tested locally to fix `USE_FAKE_TENSOR=1 python benchmarks/huggingface.py --ci -d cuda --float32 --backend=aot_nop --training --only=RobertaForCausalLM`. When I tried to repro with a small example test could not successfully, but @anijain2305 is attempting to with minifier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85065 Approved by: https://github.com/anijain2305 commit 2e41fbc2114d89d046a3002f0614bbfe933c01b9 Author: Ubuntu Date: Wed Sep 14 16:48:26 2022 +0000 [ONNX] Enable test_custom_opsets_inverse (#85013) in [#87004](https://github.com/pytorch/pytorch/pull/80074), `aten::inverse` becomes alias of `aten::linalg.inv`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85013 Approved by: https://github.com/BowenBao commit a225f3cfce19a9baf40db4814efcc44a9161b286 Author: Pearu Peterson Date: Wed Sep 14 23:14:42 2022 +0300 torch.zero_ on a sparse compressed tensor resets nnz to 0 (#85030) Fixes #84997 and #82683 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85030 Approved by: https://github.com/cpuhrsch commit 21e656b020ab16c5748720a4eb95aea93778e0de Author: Bowen Bao Date: Thu Sep 15 18:22:41 2022 +0000 [ONNX] Add `third_party/onnx` to merge rule (#84715) We expect to bump onnx submodule version regularly to develop support for new onnx operators/functions. Adding this to merge rule reduces the burden for core maintainers for approval. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84715 Approved by: https://github.com/thiagocrepaldi, https://github.com/malfet commit 6bd7d0f85665cecc0396f1be4c1bac6b14e3f5d1 Author: Salahuddin <60926009+ShisuiUzumaki@users.noreply.github.com> Date: Thu Sep 15 18:17:10 2022 +0000 doc string fixed in torch.distributed.reduce_scatter (#84983) Fixes #84865 Previous `torch.distributed.reduce_scatter`: ``` def reduce_scatter(output, input_list, op=ReduceOp.SUM, group=None, async_op=False): """ Reduces, then scatters a list of tensors to all processes in a group. Args: output (Tensor): Output tensor. input_list (list[Tensor]): List of tensors to reduce and scatter. group (ProcessGroup, optional): The process group to work on. If None, the default process group will be used. async_op (bool, optional): Whether this op should be an async op. ``` Fixed: ``` def reduce_scatter(output, input_list, op=ReduceOp.SUM, group=None, async_op=False): """ Reduces, then scatters a list of tensors to all processes in a group. Args: output (Tensor): Output tensor. input_list (list[Tensor]): List of tensors to reduce and scatter. op (optional): One of the values from ``torch.distributed.ReduceOp`` enum. Specifies an operation used for element-wise reductions group (ProcessGroup, optional): The process group to work on. If None, the default process group will be used. async_op (bool, optional): Whether this op should be an async op. 
``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84983 Approved by: https://github.com/H-Huang commit d52452b3d1ff14a87e1d79c8d8bd67f0f074e6e6 Author: Nikita Shulga Date: Thu Sep 15 17:45:05 2022 +0000 [Functorch] Set rpath for Mac builds (#85086) For the `delocate-wheel` to be able to find dependent libs Fixes https://github.com/pytorch/pytorch/issues/85007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85086 Approved by: https://github.com/kit1980, https://github.com/zou3519 commit 4db1588ca09b4f6328c5bd98701e9c302cb39800 Author: Kshiteej K Date: Thu Sep 15 15:59:23 2022 +0000 [functorch] follow-up vmapjvpvjp (#84992) Ref 1: https://github.com/pytorch/pytorch/pull/83375#discussion_r970046113 Ref 2: https://github.com/pytorch/pytorch/pull/83375#discussion_r970047848 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84992 Approved by: https://github.com/zou3519 commit 50142827925374c23b75a840eccddf3ad6d05c1e Author: atalman Date: Thu Sep 15 15:47:36 2022 +0000 Removing cuda 11.3 nightly builds (#84866) Removing cuda 11.3 nightly builds Pull Request resolved: https://github.com/pytorch/pytorch/pull/84866 Approved by: https://github.com/weiwangmeta, https://github.com/malfet commit ebd4e90ff7f6d1e57bbf6f2e717a8addb6b28e28 Author: Seonglyong Gong Date: Thu Sep 15 06:41:33 2022 +0000 [Profiler] add config option to remove 'Call stack' field from trace file (#84982) Summary: `Call stack` field increases trace file size exponentially for Python stack tracing (need to be deprecated carefully). Added a config option to avoid this increase. Test Plan: `experimental_config=_ExperimentalConfig(no_callstack_trace=True),` will remove the field. + CI tests Differential Revision: D39489828 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84982 Approved by: https://github.com/robieta commit a22f4f535b32664bd6c1286604773f508ed8ff69 Author: zhuyuhua-v Date: Thu Sep 15 06:01:14 2022 +0000 Add xpu path for GRUCell (#83723) Add xpu path for GRUCell. We supported a new kernel named _thnn_fused_gru_cell to fuse the small ops of GRU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83723 Approved by: https://github.com/ezyang commit 17925122d091fcf5b1b14a82d4718396dd0199e8 Author: Sherlock Huang Date: Wed Sep 14 21:46:46 2022 +0000 Rewrite new_zeros, new_ones, new_full decomp with aten.full (#84946) We should **NOT** introducing non-functional op for decomps of functional op. For example ``` make_fx(functionalize(lambda x: x.new_zeros(3)), decomposition_table=decomposition_table)(x) ``` is producing ``` def forward(self, x_1): empty = torch.ops.aten.empty.memory_format([3, 4], dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False) zero_ = torch.ops.aten.zero_.default(empty); empty = None return zero_ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84946 Approved by: https://github.com/ngimel commit 65158b8876b6e65f82d7844e543afff55d1a44f9 Author: Edward Z. Yang Date: Wed Sep 14 21:05:03 2022 -0700 empty strided symint (#84830) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84830 Approved by: https://github.com/ezyang commit d05f07494a9a32c63f9218c0e703764a02033bb9 Author: Sergii Dymchenko Date: Thu Sep 15 03:08:49 2022 +0000 Use angle brackets in include for internal clangtidy (#85032) This issue was found after importing https://github.com/pytorch/pytorch/pull/70978 into fbsource. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85032 Approved by: https://github.com/huydhn commit be800cd6ea783a66d9b722116a6f483248f5c53e Author: PyTorch MergeBot Date: Thu Sep 15 02:36:55 2022 +0000 [vision hash update] update the pinned vision hash (#85061) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85061 Approved by: https://github.com/pytorchbot commit 625e44c1df211d6753609a9b391cb10f2f94367f Author: PyTorch MergeBot Date: Thu Sep 15 02:36:32 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#85060) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85060 Approved by: https://github.com/pytorchbot commit 62af1c9eedf958293aa9a60c59581410c93e264c Author: Andrew Gu Date: Wed Sep 14 23:48:14 2022 +0000 [Easy][FSDP] Change `assert` -> `p_assert` (#85052) This changes a few `assert`s to `p_assert()`s because they can run in the backward (some are in the forward, but AC can make them run in the backward). Pull Request resolved: https://github.com/pytorch/pytorch/pull/85052 Approved by: https://github.com/zhaojuanmao commit cdd625ba702fe1ef812a910256fcfc60f233dadf Author: Andrew Gu Date: Wed Sep 14 23:34:18 2022 +0000 [Easy][FSDP] Remove outdated comment (#85051) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85051 Approved by: https://github.com/zhaojuanmao commit cc62ad79c752534bd8fdd07c5bf494c9534337d2 Author: Andrew Gu Date: Wed Sep 14 22:52:07 2022 +0000 [FSDP] Fix `pin_memory()` for CPU offloading (#85048) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85048 Approved by: https://github.com/zhaojuanmao commit e7ad699be0c33f72be43249445c430d953e0747e Author: Rohan Varma Date: Wed Sep 14 22:40:12 2022 +0000 Resubmit bfloat support for im2col,col2im (#84372) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84372 Approved by: https://github.com/awgu, https://github.com/ngimel commit 8ca1839d32a56ea7ef007bf43fab38cc94b3f608 Author: Michael Voznesensky Date: Thu Sep 15 00:43:36 2022 +0000 Python Dispatcher integration with C++ dispatcher (#85050) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85050 Approved by: https://github.com/malfet commit 3a107bc9bedb6642b280c430ee5389b6ca4c2ca3 Author: Richard Zou Date: Wed Sep 14 10:33:46 2022 -0700 [functorch] fix vmapvjpvjp test for prelu (#84939) Turns out this is just a composite compliance issue. Branching on if something requires grad or not can lead to incorrect gradients if we have a BatchedTensor wrapping a tensor that requires grad. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/84939 Approved by: https://github.com/soulitzer commit 8badb00ff4239acb69a277cff7527f80707d09aa Author: Richard Zou Date: Wed Sep 14 10:33:46 2022 -0700 [functorch] fix conv_transpose with groups batching rule (#84938) The original batching rule didn't work in all cases, so this PR fixes it. Test Plan: - added new test case to conv_transpose2d's OpInfo. Surprisingly the other test cases didn't catch the bug. - Removed some xfails and skips as a result. 
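A small sketch of the kind of call the fixed batching rule serves (illustrative only; the shapes and `groups=2` are arbitrary choices, not taken from the PR):

```python
import torch
import torch.nn.functional as F
from functorch import vmap

weight = torch.randn(4, 2, 3, 3)   # conv_transpose2d weight: (C_in, C_out // groups, kH, kW)
xs = torch.randn(5, 4, 8, 8)       # five samples to vmap over, each (C_in, H, W)

# vmap routes this through the conv_transpose batching rule rather than a real batch dim.
out = vmap(lambda x: F.conv_transpose2d(x.unsqueeze(0), weight, groups=2))(xs)
print(out.shape)                   # torch.Size([5, 1, 4, 10, 10])
```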
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84938 Approved by: https://github.com/samdow commit 8cb7826889ad488307caef918a731471be370eac Author: Rohan Varma Date: Wed Sep 14 14:59:34 2022 -0700 [CheckpointWrapper] Reentrant kwarg support (#84908) A temporary patch to support keyword args when reentrant checkpoint wrapper is used. This is need to unblock some crucial workloads, the ideal fix would be checking this directly into torch.utils.checkpoint. Differential Revision: [D39453453](https://our.internmc.facebook.com/intern/diff/D39453453/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84908 Approved by: https://github.com/awgu commit 55ca6901a79c68353525b247873dfbf3bf14e959 Author: Rohan Varma Date: Wed Sep 14 14:59:33 2022 -0700 [CheckpointWrapper] Decouple CPU offload (#84907) This fixes the activation offload for checkpoint wrapper, which was previously broken. It was broken because it was tightly coupled with activation checkpoint, i.e. we did: ``` with save_on_cpu: checkpoint(module_forward()) ``` which would not offload any activation tensors to CPU, as those activations would already be not saved by autograd due to the checkpoint implementation taking priority. Now, if `offload_to_cpu` is specified, we only do `save_on_cpu` and no checkpoint, so all intermediate tensors are offloaded to CPU instead of checkpointed. These wrappers can be composed, i.e. if we have `(Linear, Linear) -> (Linear, Linear) -> (Linear, Linear)` we can do `Offload( checkpoint(Linear, Linear) -> checkpoint(Linear, Linear) -> checkpoint(Linear, Linear))` and inner tensors would be checkpointed while outers will be offloaded. Differential Revision: [D39448882](https://our.internmc.facebook.com/intern/diff/D39448882/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84907 Approved by: https://github.com/awgu commit 166ea7e6b1b7a7fc34fa6abc1cac6d9eca6fc720 Author: samdow Date: Wed Sep 14 15:08:30 2022 -0400 [functorch] fix jacrev so all inputs get wrappers (#84914) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84914 Approved by: https://github.com/zou3519 commit 1a6cf6ea8861549b70f40eace4817ee4ed84a152 Author: Nikita Shulga Date: Wed Sep 14 23:40:20 2022 +0000 [MPS] Fix int rounding div crash on M1 (#85016) Fixes https://github.com/pytorch/pytorch/issues/84995 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85016 Approved by: https://github.com/kulinseth commit 976f8bee94f35f37f0f58a8bebccf87f0a80d8a6 Author: Wanchao Liang Date: Wed Sep 14 06:02:08 2022 +0000 [c10d] add ncclGetLastError to NCCL pg (#83724) This PR add ncclGetLastError API to the nccl pg, to provide better error reporting out of nccl failures directly, instead of guessing on random reasons Differential Revision: [D39161199](https://our.internmc.facebook.com/intern/diff/D39161199) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83724 Approved by: https://github.com/kwen2501, https://github.com/H-Huang commit ccade9410f1d72e766d86fabeeb80822dd36f449 Author: Edward Z. Yang Date: Wed Sep 14 10:51:36 2022 -0700 Don't detach when making views; force caller to detach (#84893) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84893 Approved by: https://github.com/soulitzer, https://github.com/SherlockNoMad commit 2711b9fa63af23c25b0f3f1301a72291afd68655 Author: PyTorch MergeBot Date: Wed Sep 14 22:27:30 2022 +0000 Revert "[CUBLAS][CUDA GRAPHS] Explicitly set the workspace for cuBLAS handles (#83461)" This reverts commit 713d8b855223970dc98ec81bb722fba002ac1390. Reverted https://github.com/pytorch/pytorch/pull/83461 on behalf of https://github.com/malfet due to Broke CUDA-10.2 builds, see https://hud.pytorch.org/pytorch/pytorch/commit/713d8b855223970dc98ec81bb722fba002ac1390 commit a1a95d402d300070d5ffc21b8f3a0e0bfbc38323 Author: Zain Rizvi Date: Wed Sep 14 22:04:43 2022 +0000 Fix inheritance in TestDataLoaderUtil (#85018) TestDataLoaderUtils needs to run it's parent class's setUp method to actually disable flaky tests (see https://github.com/pytorch/pytorch/issues/70516#issuecomment-1247045072 for details) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85018 Approved by: https://github.com/clee2000, https://github.com/huydhn commit 713d8b855223970dc98ec81bb722fba002ac1390 Author: Eddie Yan Date: Wed Sep 14 21:56:48 2022 +0000 [CUBLAS][CUDA GRAPHS] Explicitly set the workspace for cuBLAS handles (#83461) We're seeing an issue where repeatedly capturing graphs incurs increasing memory usage as cuBLAS internally allocates a new workspace for each graph even when the same handle is being used: https://gist.github.com/tomconerlyanth/a20c04a4a46a0f6e9ce18f5280729b36 This PR works around the issue by intercepting the `CUBLAS_WORKSPACE_CONFIG` environment variable and allocating the workspace for the cuBLAS handle explicitly. CC @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/83461 Approved by: https://github.com/ngimel commit 7b64c885d5ac0ebac50041e0bbbb82ff92e92dc3 Author: Huy Do Date: Wed Sep 14 21:55:10 2022 +0000 Enable manual test config label selection on GHA macos (#84895) Following up with https://github.com/pytorch/pytorch/pull/83690 and https://github.com/pytorch/pytorch/pull/84669, functorch team has started using the new label in some of their [PRs](https://github.com/pytorch/pytorch/labels/test-config%2Ffunctorch). This is to enable manual test config using label on GHA macos. This also works with `ciflow/mps` as follows: * If only `test-config/functorch` is present, no arm64 build is performed and mps test is skipped * If only `ciflow/mps` is present, mps test is run in addition to all other tests * If both `test-config/functorch` and `ciflow/mps` is present, both functorch and mps tests are run * If none of the label is present, pull workflow is run as usual Pull Request resolved: https://github.com/pytorch/pytorch/pull/84895 Approved by: https://github.com/ZainRizvi commit fa7bf3e2dc63cc27b2b0bcc90d7a2ab387dd0c9f Author: PyTorch MergeBot Date: Wed Sep 14 21:32:11 2022 +0000 Revert "[numpy] add `torch.concatenate`, alias of torch.cat (#82946)" This reverts commit 270e5e519d98868af0166f3a179b286682cfb267. 
Reverted https://github.com/pytorch/pytorch/pull/82946 on behalf of https://github.com/malfet due to Broke M1 tests, see https://hud.pytorch.org/pytorch/pytorch/commit/270e5e519d98868af0166f3a179b286682cfb267 commit 23b7a5fc7a1cc95f31c7004173b07d0d4a83a22d Author: Huy Do Date: Wed Sep 14 21:16:56 2022 +0000 Shard distributed tests on non CUDA focal (#84891) [pull / linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge)](https://hud.pytorch.org/tts/pytorch/pytorch/master?jobName=pull%20%2F%20linux-focal-py3.7-gcc7%20%2F%20test%20(distributed%2C%201%2C%201%2C%20linux.2xlarge)) p90 TTS is about 2.2 hours, 2x the default shards. This is non-CUDA common Linux runners, so we can simply add one more shard for distributed. I missed this change in https://github.com/pytorch/pytorch/pull/84430 Having 2 shards with test time around 55m each: * https://github.com/pytorch/pytorch/actions/runs/3040900328/jobs/4897576932 * https://github.com/pytorch/pytorch/actions/runs/3040900328/jobs/4897577014 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84891 Approved by: https://github.com/clee2000 commit 3e57c9550ea272b6fc4e752e7ac643c5105405f3 Author: Edward Z. Yang Date: Wed Sep 14 10:39:17 2022 -0700 Ensure as_strided_tensorimpl is never called with MPS (#85020) See https://github.com/pytorch/pytorch/pull/84893 Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/85020 Approved by: https://github.com/soulitzer, https://github.com/kulinseth commit 5271494ef21ae0140755a41f3b16a8bd745642b6 Author: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com> Date: Wed Sep 14 19:56:12 2022 +0000 [CUDA graphs] Fixes errors in RNG seed (#84967) Fixes #84614 Prior to this PR CUDAGraph did not store the RNG seed, that is why `torch.cuda.manual_seed(new_seed)` would only reset the offset but not update the seed at all keeping whatever value was used during graph capture. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84967 Approved by: https://github.com/ngimel commit 270e5e519d98868af0166f3a179b286682cfb267 Author: Khushi Agrawal Date: Wed Sep 14 19:28:43 2022 +0000 [numpy] add `torch.concatenate`, alias of torch.cat (#82946) As per the title. Fixes: #81161 - [x] add ErrorInputs - ~[ ] dtype argument?~ - ~[ ] casting argument?~ As discussed offline with @kshitij12345, we can currently ignore `dtype` and `casting` arguments. cc: @kshitij12345! Pull Request resolved: https://github.com/pytorch/pytorch/pull/82946 Approved by: https://github.com/mruberry commit 94b67f4cd8dc1ab5f7add5f006f7f3fd988b8ecf Author: PyTorch MergeBot Date: Wed Sep 14 17:40:22 2022 +0000 Revert "Create Cache for Fusion Reuse in NVFuser in Python Frontend for Primtorch (#83267)" This reverts commit ec916bf6afcfa91305bb69d1bedbd6dafccb7c95. Reverted https://github.com/pytorch/pytorch/pull/83267 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit 4247cc98a22760932d0b53236403e478e8612c2f Author: Denis Vieriu Date: Wed Sep 14 17:24:24 2022 +0000 [MPS] Fix mps to cpu casting from a smaller dtype to a bigger dtype (#84928) Fixes #82566 , #80800 - mps->cpu casts from a smaller dtype to a bigger dtype mps->mps cast from smaller/bigger dtype to another dtype in case of scatter - For mps->cpu copies where we don't have a source/destination offset, we can save the cast result directly in the destTensor, so we can skip the additional overhead of the blit. 
- In case we can return the data without doing the blit, we need to check if it's a blocking call, in which case we'd need a synchronize(SyncType::COMMIT_AND_WAIT); call (previously this was done by the blit). Pull Request resolved: https://github.com/pytorch/pytorch/pull/84928 Approved by: https://github.com/razarmehr commit 1a81ab3ba58d23566ba6cecd0d30eafe93dc7bc8 Author: Shen Li Date: Wed Sep 14 04:48:10 2022 +0000 Test tracing consecutive comms on the same input tensor (#84980) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84980 Approved by: https://github.com/wanchaol commit 0f30059227c6e64ca0e503b6bd3f436d837cdcef Author: Shen Li Date: Wed Sep 14 04:48:10 2022 +0000 Remove eager mode support from CommTensor (#84978) We don't need eager mode support (automatic wait on read) for now. Removing that to simplify the code. We can always add this back if necessary in the future. Note that we still need the eager mode code in `__torch_dispatch__`, as `make_fx` will also run the ops in eager mode to get the output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84978 Approved by: https://github.com/wanchaol commit b6d6a78c12e5869d0c738456e28155a3a2554ece Author: Michael Melesse Date: Wed Sep 14 15:50:14 2022 +0000 [ROCM] test_batchnorm_cudnn_nhwc (#84603) This PR enables test_batchnorm_cudnn_nhwc. This is a follow-up to https://github.com/pytorch/pytorch/pull/82512 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84603 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet commit 706b99030656c573619cebaa3be9298a575fc776 Author: PyTorch MergeBot Date: Wed Sep 14 14:07:58 2022 +0000 Revert "Python Dispatcher integration with C++ dispatcher (#84826)" This reverts commit 35f6a69191ef762cf22b6cbfe94b8d9406e16674. Reverted https://github.com/pytorch/pytorch/pull/84826 on behalf of https://github.com/malfet due to Broke dynamo, see https://hud.pytorch.org/pytorch/pytorch/commit/35f6a69191ef762cf22b6cbfe94b8d9406e16674 commit 74ead619446669ff638e56127759d432cf85a7ee Author: Howard Huang Date: Tue Sep 13 12:07:22 2022 -0700 [2/N] [Dispatchable Collectives] Extract ProcessGroup::Work into a separate class and update references (#83680) - Move ProcessGroup::Work into its own class and update all the references to it / header includes. In future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. This change is to prevent a circular dependency with ProcessGroup depending on Backend and Backend depending on ProcessGroup::Work. Differential Revision: [D38839212](https://our.internmc.facebook.com/intern/diff/D38839212) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83680 Approved by: https://github.com/kwen2501 commit 54c46e4f902209dc9cca2fbeb8181acf05409cb6 Author: atalman Date: Wed Sep 14 12:06:15 2022 +0000 Upgrade to CUDNN version for cuda 11.7 (#84964) Upgrade CUDNN version to 8.5 for cuda 11.7. This is a reland of: https://github.com/pytorch/pytorch/pull/84859 Issues in the periodic build should be fixed by: https://github.com/pytorch/pytorch/pull/84943 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84964 Approved by: https://github.com/ZainRizvi commit 6750946b820a3dff6de00f1ed93c9165e2f222b7 Author: Ivan Yashchuk Date: Wed Sep 14 12:03:11 2022 +0000 Skip validate_view_consistency for nvFuser tests (#84858) nvFuser's execute function always returns a copy for now. Ref.
https://github.com/pytorch/pytorch/pull/84629#discussion_r966375582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84858 Approved by: https://github.com/mruberry, https://github.com/ngimel commit 35f6a69191ef762cf22b6cbfe94b8d9406e16674 Author: Michael Voznesensky Date: Tue Sep 13 20:32:42 2022 +0000 Python Dispatcher integration with C++ dispatcher (#84826) Signed-off-by: Edward Z. Yang From @ezyang's original PR: There are a number of situations where we have non-backend kernels (e.g., CompositeImplicitAutograd, batching rules) which we would like to port to Python, but we have no way to integrate these ports with the overall system while using preexisting C++ registrations otherwise. This PR changes that by introducing a Python dispatcher (which can have its own kernels directly in Python), which can be interpose over ordinary C++ dispatch. The ingredients: We introduce a new PythonDispatcher dispatch key, that has the same tenor as FuncTorchDynamicLayerFrontMode: it works by getting triggered before every other dispatch key in the dispatch key, and shunting to a Python implementation The Python dispatcher is a per-interpreter global object that is enabled/disabled via the guard EnablePythonDispatcher/DisablePythonDispatcher. We don't make it compositional as I have no idea what a compositional version of this feature would look like. Because it is global, we don't need to memory manage it and so I use a simpler SafePyHandle (newly added) to control access to this pointer from non-Python C++. Like __torch_dispatch__, we use PyInterpreter to get to the Python interpreter to handle the dispatch. I need to reimplement dispatch table computation logic in Python. To do this, I expose a lot more helper functions for doing computations on alias dispatch keys and similar. I also improve the pybind11 handling for DispatchKey so that you can either accept the pybind11 bound enum or a string; this simplifies our binding code. See https://github.com/pybind/pybind11/issues/483#issuecomment-1237418106 for how this works; the technique is generally useful. I need to be able to call backend fallbacks. I do this by permitting you to call at a dispatch key which doesn't have a kernel for the operator; if the kernel doesn't exist, we check the backend fallback table instead. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84826 Approved by: https://github.com/ezyang commit 44c30c5d1ce9995a000d5a55cb87b168972e2801 Author: Jerry Zhang Date: Tue Sep 13 22:20:21 2022 +0000 [quant][docs] Add example for the error message for fixed qparam ops (#84666) Summary: att, since example makes it clearer what the user needs to do Test Plan: local test for the error message Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/84666 Approved by: https://github.com/vkuzo, https://github.com/andrewor14 commit 55ca297d4e048c641d149a76f2fda7c9ce630ff6 Author: Edward Z. Yang Date: Tue Sep 13 16:46:15 2022 -0700 Remove enable_recursive_torch_dispatch (#84945) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84945 Approved by: https://github.com/soulitzer commit 922560b872415b96552e90eed521d7b91a7600b4 Author: Jane Xu Date: Wed Sep 14 02:52:54 2022 +0000 Removes unnecessary namespace of functions used only in einsum (#84955) This is cosmetic change that removes a few function declarations and derives values instead of hardcoding. 
This is step 1 in relanding a cleaner version of einsum with opt_einsum. See https://github.com/pytorch/pytorch/pull/60191 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84955 Approved by: https://github.com/soulitzer commit d26e9cd9b27b7e90c2650a1bd092b6b9682c56d5 Author: PyTorch MergeBot Date: Wed Sep 14 02:46:40 2022 +0000 [vision hash update] update the pinned vision hash (#84975) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84975 Approved by: https://github.com/pytorchbot commit b28d82cb1ddb4030f8c58a99f70e6af870829541 Author: PyTorch MergeBot Date: Wed Sep 14 02:44:34 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#84912) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84912 Approved by: https://github.com/pytorchbot commit d00cabae7bc894b951d4b2c9c24c7d95bebd86e1 Author: Kurt Mohler Date: Wed Sep 14 01:39:24 2022 +0000 Fix `expectedFailureMeta` to avoid skipping tests (#84875) Fixes #84874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84875 Approved by: https://github.com/mruberry commit 8cbbd3a25f3bbac006cc7e3d8f43829235648a5c Author: Shen Li Date: Tue Sep 13 22:59:17 2022 +0000 Avoid nested CommTensor wrapping (#84963) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84963 Approved by: https://github.com/wanchaol commit 8ca057eb7179f4dfce47515309d12303fa1c11d9 Author: PyTorch MergeBot Date: Wed Sep 14 01:09:04 2022 +0000 Revert "Don't detach when making views; force caller to detach (#84893)" This reverts commit 3bb8d6a93cc4cc4403dd2e3dfcd39b841c71a3c3. 
Reverted https://github.com/pytorch/pytorch/pull/84893 on behalf of https://github.com/malfet due to Broke MPS, see https://hud.pytorch.org/pytorch/pytorch/commit/3bb8d6a93cc4cc4403dd2e3dfcd39b841c71a3c3 commit a185dc2e631c7bc25213f0fa4c4cc41851737079 Author: Arindam Roy Date: Wed Sep 14 00:41:12 2022 +0000 [ROCm] re-enable tensorexpr and test_openmp (#81367) The following tests are being re-enabled for ROCm: - test_openmp.py - TestTensorExprPyBind tests in test_tensorexpr_pybind.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/81367 Approved by: https://github.com/jeffdaily, https://github.com/malfet commit cb9ef4668ed37460d99cc8ee3d9960fef2075902 Author: Nayef Ahmed <22487263+Nayef211@users.noreply.github.com> Date: Wed Sep 14 00:35:36 2022 +0000 Updated library level maintainers for torchtext (#84950) - Updated library level maintainers for torchtext to reflect internal changes to the team Pull Request resolved: https://github.com/pytorch/pytorch/pull/84950 Approved by: https://github.com/mthrok commit d05a11337c8aafb663ca3b29722c5219d1589fec Author: Nikita Shulga Date: Tue Sep 13 16:36:57 2022 +0000 [CMake] Add functorch target (#83464) Move functorch/functorch into `functorch` folder - Add functorch/CMakeLists.txt that adds `functorch` native python exension - Modify `setup.py` to package pytorch and functorch together into a single wheel - Modify `functorch.__version__` is not equal to that of `torch.__version__` - Add dummy `functorch/setup.py` file for the projects that still want to build it Differential Revision: [D39058811](https://our.internmc.facebook.com/intern/diff/D39058811) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83464 Approved by: https://github.com/zou3519 commit 26b59862978ccdff925c1b457eb003b334143736 Author: Masaki Kozuki Date: Wed Sep 14 00:01:06 2022 +0000 `ReflectionPad` supports `BFloat16` (#84949) Just by looking at some commits, I didn't find why BFloat16 isn't there. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84949 Approved by: https://github.com/ngimel commit fdd366541333330387d0b262da8357984e0d311f Author: Taylor Robie Date: Tue Sep 13 14:04:10 2022 -0700 [Profiler] Make `LibKinetoClient::stop()` directly call `ProfilerStateBase::pop` (#83965) It has been discussed earlier in the stack at length, but if profiler fails after it pops the profiler state but before stopping Kineto then the next profiler call will see `LibKinetoClient::stop()` try to clean up the prior run (which it still thinks is active) by calling `disableProfiler()`. (Which fails because there is not an active profiler.) This PR addresses the issue rather bluntly by simply rug pulling any active profiler from `LibKinetoClient::stop()`. I'm not particularly fond of this solution and we should refine the semantics in the future, but for now it has the desired effect of returning to a clean state. Earlier PRs in this stack cleaned up some of the lifetime management such that objects being destroyed triggers appropriate cleanup. As a result it is no longer catastrophic to simply pop the profiler state and let the destructor chain clean up. Differential Revision: [D38958237](https://our.internmc.facebook.com/intern/diff/D38958237/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83965 Approved by: https://github.com/slgong-fb commit 3bb8d6a93cc4cc4403dd2e3dfcd39b841c71a3c3 Author: Edward Z. Yang Date: Tue Sep 13 11:42:12 2022 -0700 Don't detach when making views; force caller to detach (#84893) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84893 Approved by: https://github.com/soulitzer, https://github.com/SherlockNoMad commit ec916bf6afcfa91305bb69d1bedbd6dafccb7c95 Author: Kevin Stephano Date: Tue Sep 13 23:28:39 2022 +0000 Create Cache for Fusion Reuse in NVFuser in Python Frontend for Primtorch (#83267) This PR does the following: - Replaces the `FusionOwner` with a `FusionCache` and `FusionInterface`. The `FusionCache` is a singleton that contains a cache of Fusions based on the `FusionDefinition`. It replaces the TorchScript graph caching that looked up a Fusion based on a stringified and canonicalized representation of the TorchScript graph with a prefix tree of statements in the `FusionDefinition`. The `FusionInterface` is an object that represents a Fusion in python. It can also query the cache based on id. - The ability to print out a mechanically derived definition, in python, for the user to use when debugging was added. - Replaces the python `examples` directory with true python tests under `test/test_nvfuser_frontend.py`. - Adds a set of C++ tests under the `test` directory to verify the `FusionCache`, `FusionDefinition`, and parts of the `RecordFunctor` child classes. - Adds a README file to explain how to use the Python Frontend While there are 3,000+ line edits, the bulk of the changes were repetitive line changes to the python bindings for each operation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83267 Approved by: https://github.com/jjsjann123, https://github.com/davidberard98 commit 59bb5c933b051226572a08f36a110576d9abaf29 Author: Mike Ruberry <38511765+mruberry@users.noreply.github.com> Date: Tue Sep 13 23:17:19 2022 +0000 Adds mruberry as superuser (#84869) (so PRs I approve can be merged) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84869 Approved by: https://github.com/malfet, https://github.com/seemethere commit c61e89545ec37751317697040f4d391ff2cda819 Author: XiaobingSuper Date: Tue Sep 13 04:24:14 2022 -0400 disable onednn gelu for empty input (#84926) This PR is about disabling onednn gelu for empty input, fix https://github.com/pytorch/pytorch/issues/78152. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84926 Approved by: https://github.com/Lezcano, https://github.com/zou3519 commit 25d91e0a9d5eb2e366076850e2e81bf968cc0fbb Author: atalman Date: Tue Sep 13 23:00:09 2022 +0000 Updating cudnn_frontend to 0.7.1 (#84943) Updating cudnn_frontend to 0.7.1 To enable CUDNN 8.5 integration cc @malfet @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/84943 Approved by: https://github.com/huydhn, https://github.com/malfet commit 36d79143cef8847a0d6455d65f52a5ef9f23471b Author: PyTorch MergeBot Date: Tue Sep 13 22:54:53 2022 +0000 Revert "[reland] Call jit decomposition in VariableType to increase forward AD coverage (#84151) (#84675)" This reverts commit bb4e96c9644a034e593085026b781ee78a4d6a77. Reverted https://github.com/pytorch/pytorch/pull/84675 on behalf of https://github.com/osalpekar due to causing asan xplat link-time errors like ld.lld: error: undefined symbol: torch::jit::has_jit_decomposition(c10::FunctionSchema const&) commit 38192f63cdca6b80e6eb369a2eddad7728f0492c Author: Rodrigo Kumpera Date: Tue Sep 13 21:57:46 2022 +0000 Add __all__ for a few distributed modules plus a little typing (reland) (#84872) This handles distributed_c10d, which is massive and ddp_comm_hooks. This relands #84119 with the required fixes. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84872 Approved by: https://github.com/rohan-varma commit 53c71e214233b0e97c1cb2bf02676a9f800c1e91 Author: kshitij12345 Date: Tue Sep 13 21:02:53 2022 +0000 [functorch] test - vmapjvpvjp (#83375) Adds `vmapjvpvjp` test to `functorch` Runtime of the test: ``` = 856 passed, 250 skipped, 16175 deselected, 137 xfailed, 197 warnings in 2231.84s (0:37:11) = ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83375 Approved by: https://github.com/zou3519 commit b4a881afac62989d130c8a92b4c83d16ccc7384a Author: Jithun Nair Date: Tue Sep 13 20:43:42 2022 +0000 [ROCm] Remove gfx900 from base docker build and Pytorch build scripts (#80015) CI doesn't have any MI25s anymore. Should improve docker and Pytorch build times in CI for ROCm. Will take out of Draft mode after https://github.com/pytorch/pytorch/pull/79596 is merged Pull Request resolved: https://github.com/pytorch/pytorch/pull/80015 Approved by: https://github.com/jeffdaily, https://github.com/malfet commit 0e8c5cf8477e3235a7574c9436f30bbcbcd82e89 Author: Jianyu Huang Date: Tue Sep 13 20:42:52 2022 +0000 Revert D34636039: Multisect successfully blamed D34636039 for test or build failures (#84942) Test Plan: NA Reviewed By: jianyuh Differential Revision: D39373091 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84942 Approved by: https://github.com/xuzhao9 commit 81da50a972fc402a6dd880fe392af0f0051cb6de Author: Nikita Shulga Date: Tue Sep 13 00:59:59 2022 +0000 Return device count using nvml (#84879) Fixes https://github.com/pytorch/pytorch/issues/83973 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84879 Approved by: https://github.com/ngimel commit 94f20c3514ce16f637d9863b867bac3ec6f2d9ce Author: Nikita Shulga Date: Mon Sep 12 20:22:37 2022 +0000 Memoize `torch.cuda.device_count` (#84878) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84878 Approved by: https://github.com/ngimel commit bda8a5729b2ee739c9b8dd6bf2696ba3b0bdec78 Author: drisspg Date: Tue Sep 13 20:35:58 2022 +0000 [Nested Tensor] Create differentiable nt to tensor view functions (#83371) This PR attempts to implements 2) "the safe way" of creating a view of nested tensor that returns a regular tensor. The rest of the break down is here: https://fb.quip.com/J8QCAx41af11 https://gist.github.com/drisspg/8622e9c97d374fa920ac647e1167cabc This is a short list of some edge cases. After some more work I was able to address two of the test cases in the above gist. There are few complex aspects here that I left defeated comments inline. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83371 Approved by: https://github.com/bdhirsh commit fa86874bbddfdbd2f4095e4084a3e1b2f81fde50 Author: Peter Bell Date: Tue Sep 13 17:13:37 2022 +0000 Fix intermittent link errors in NCCL build (#84245) Should fix #13362 and fix #83790 I think I've discovered the root cause of the intermittent nccl link failures. If we look at the variable name in the redefinition error: ``` _02021d91_11_sendrecv_cu_0bc7b9c8_11152 ``` this is the name of the file being compiled + some form of unique ID. As part of NCCL's build process, the same file is compiled multiple times with different macro definitions depending on which operator and dtype are being compiled, e.g. ``` nvcc -DNCCL_OP=0 -DNCCL_TYPE=0 -dc sendrecv.cu -o sendrecv_sum_i8.o ``` Since the filename parts are the same, then if the unique IDs also happen to collide then the entire identifier will collide and the link fails. 
So the fix here is to generate a unique `.cu` file for each object file. I've implemented this as a `.patch` file that gets applied from our cmake code, but if we instead fork nccl that would be cleaner. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84245 Approved by: https://github.com/janeyx99, https://github.com/malfet commit 74d0c64708c79351ca8b43992422f6b647f46a9f Author: Edward Z. Yang Date: Tue Sep 13 08:50:57 2022 -0700 Don't use reentrant dispatch for composite compliance (#84909) I believe these were added in to prevent changing behavior when https://github.com/pytorch/pytorch/pull/75827 landed, but I actually think they are unnecessary, and they are causing asserts to fire on the subsequent PR (where I assert that tensors returned by views MUST NOT already have view metadata associated with them.) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84909 Approved by: https://github.com/zou3519, https://github.com/soulitzer commit b4799736ee6cb5941c8aff1b20e1ca19372e98e2 Author: Thomas Orozco Date: Tue Sep 13 18:41:15 2022 +0000 autograd: fix non-deterministic output in codegen comments (#84695) Summary: Like it says in the title. Currently, this will return output like this: In Buck1, that's OK because Buck1's caching doesn't really care too much about However, in Buck2, this is a disaster, because caching is based exclusively on inputs and outputs and The diff here proposes making the path relative to the codegen script itself, which should carry about as much info, but avoid cache misses. Concretely, this: ``` // generated from /dev/shm/uid-34135/cfbc5712-seed-nspid4026533424_cgpid2794673-ns-4026533443/tools/autograd/templates/python_functions.h ``` Becomes, this: ``` // generated from ../tools/autograd/templates/python_functions.h ``` So, we keep the useful part, and we get caching. This matters because those headers are used in actions like: ``` fbcode//deeplearning/fbgemm/fbgemm_gpu/codegen:embedding_ops -- action (cxx_compile gen_embedding_backward_adam_split_unweighted_cuda.cu (pic)) ``` Those actions take upwards of 5 minutes to finish, so by allowing a cache hit, we are a) saving our users a lot of time and b) saving some RE capacity as well. This actually matters a lot because right now those targets are produced by `//caffe2:generate-code`, which itself doesn't get cache hits from RE because `generate_code.par` is non-deterministic (this is, unfortunately, true of PARs in general), so that rule introduces non-determinism that the codegen propagates and we get zero caching. This diff doesn't fix `//caffe2:generate-code`'s inputs being non-deterministic, but it does fix its *outputs* being non-deterministic, which means the non-determinism stops there, and we get back to cache hits. 
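A minimal sketch of the idea (a hypothetical helper, not the actual codegen code): emit the template path relative to the generating script, so the comment no longer embeds the build sandbox's absolute path.

```python
import os

def generated_from_comment(template_path: str, codegen_script: str) -> str:
    # Make the path relative to the codegen script's directory so the emitted
    # comment is identical across build sandboxes and stays cache-friendly.
    rel = os.path.relpath(template_path, start=os.path.dirname(codegen_script))
    return f"// generated from {rel}"

# e.g. an absolute sandbox path to python_functions.h becomes
# "// generated from ../tools/autograd/templates/python_functions.h"
```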
Test Plan: - CI ``` buck2 build fbcode//caffe2:generate-code buck2 build fbcode//deeplearning/fbgemm/fbgemm_gpu/codegen:embedding_ops ``` Reviewed By: ndmitchell Differential Revision: D39348565 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84695 Approved by: https://github.com/soulitzer commit 2e65f187cdc8fd461c77c501cf7ec40d76f0b34f Author: Nikita Shulga Date: Tue Sep 13 00:42:19 2022 +0000 [Functorch] Delete unused files (#83777) Differential Revision: [D39032967](https://our.internmc.facebook.com/intern/diff/D39032967) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83777 Approved by: https://github.com/zou3519 commit 33352336b443dbfce1394f6b4950e8f33eff2cef Author: Andrew Gu Date: Tue Sep 13 00:22:53 2022 +0000 [FSDP] Add rate limiter (#83917) **Overview** This PR adds a `bool` argument `limit_all_gathers` to the FSDP constructor, defaulted to `False`. - Setting `limit_all_gathers=True` limits the max number of inflight all-gathers to 2 (an empirically chosen constant), preventing a fast CPU thread from over-allocating blocks to the all-gather stream. - When experiencing a high number of CUDA malloc retries, the limiter can help reduce the number and hence lead to QPS improvement. **Exploration** I experimented with both a count-based limiter and size-based limiter (where the size is based on the inflight all-gather size in bytes). - The size-based limiter did not provide any advantage, only confusing the developer and user alike on what threshold to set. - For the count-based approach, I decided not to expose the max number of inflight all-gathers to the user since values other than 2 do not show improvements and exposing the knob may confuse users. **T5-11B** T5-11B evidences the performance gain from enabling the limiter and that a limit of 2 is a reasonable choice. This is run on an AWS cluster with 8 A100s per node and EFA. For both 2 and 4 nodes, we scale the batch size maximally before hitting OOM, which is a common practice.
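Enabling the limiter is a one-argument change at wrap time (a usage sketch; the module and process-group setup are assumptions, not part of this PR's text):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Requires an initialized process group (e.g. launched via torchrun with
# dist.init_process_group) and a CUDA device.
my_module = nn.Linear(8, 8).cuda()

# limit_all_gathers=True caps the number of inflight all-gathers (to 2) so a
# fast CPU thread cannot over-allocate blocks to the all-gather stream.
model = FSDP(my_module, limit_all_gathers=True)
```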

For 2 nodes, the limit of 2 yields 3.01x QPS improvement, and for 4 nodes, the limit of 2 yields 2.87x QPS improvement. We need more data points, but the limiter may simplify the batch size scaling workflow. Normally, a practitioner may scale until hitting OOM and back off until there are few CUDA malloc retries. However, now the practitioner may be able to scale until hitting OOM and simply turn on the limiter to reduce the number of retries instead of backing off. Differential Revision: [D39331201](https://our.internmc.facebook.com/intern/diff/D39331201) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83917 Approved by: https://github.com/zhaojuanmao commit 39676a977f7dc91c5c05cce8c93f0cb8481fc3da Author: Andrew Gu Date: Tue Sep 13 00:22:53 2022 +0000 [FSDP][Easy] Save unpadded/padded unsharded sizes as attributes (#84366) Differential Revision: [D39331199](https://our.internmc.facebook.com/intern/diff/D39331199) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84366 Approved by: https://github.com/rohan-varma commit afcc7c7f5c7cef740241ff0abdae8d4f2ad22a03 Author: Andrew Gu Date: Tue Sep 13 00:22:52 2022 +0000 [FSDP] Generalize prefetching; lower unshard/reshard to handle (#83665) - `self.sharding_strategy` - If the world size is 1, I clamp the sharding strategy to `NO_SHARD`, regardless of the passed-in sharding strategy, since the behavior is fully equivalent. This absolves the need for `p._is_sharded or self.world_size == 1` checks in the core code. Once we fully shift the paradigm to using handles, this should result in a clear net positive. However, for now, we still have some places where we interface directly with the `FlatParameter`, in which case we have some temporary hacky code. - `HandleConfig` - As a part of the new design abstraction, much logic is lowered to the `FlatParamHandle`. This requires the handle be aware of mixed precision, CPU offloading, sharding strategy, and the process group (for world size > 1). To be less error-prone, I re-defined the `dataclass`s and `enum`s for the handle. These can be removed and coalesced with the existing ones. - The drawback is that the `FlattenParamsWrapper` constructor now takes in the `HandleConfig` to forward it to the `FlatParamHandle` constructor. I tolerate this since we plan to retire the FPW. For now, the handle's process group attributes are set later when we call `handle.shard()`. - We will dive into this logic lowering later. For now, the idea is we need to pass some extra info to the handle, which must go through the FPW. - `FullyShardedDataParallel._shard_parameters()` -> `FlatParamHandle.shard()` - [Important] Generalizing attributes to remove the 1 `FullyShardedDataParallel` : 1 `FlatParameter` assumption - **Before:** `_fsdp_graph_order`, `_pre_backward_hook_full_params_prefetched`, `_forward_full_params_prefetched`, `reshard_after_forward` are with respect to 1 `FullyShardedDataParallel` - **After:** (1) We use `FlatParamHandle` in place of `FullyShardedDataParallel`. (2) The atomic unit for forward and pre-backward is a _group_ of handles involved in the same module's forward/pre-backward. This is represented as `Tuple[FlatParamHandle, ...]`. For now, this is **always a singleton tuple**, but this shift enables a module having multiple FSDP parameters (which we have use cases for). - `_reset_lazy_init()` attributes - The prefetched flags are merged into `self._handles_prefetched`, which is directly defined in the constructor. 
`reshard_after_forward` is retired since it can be fully determined by other attributes (`_is_root` and `sharding_strategy`). The first step is to read the existing `_rebuild_full_params()`. A few notable observations: - It returns `Tuple[Tensor, bool]`. The first element is the _padded unsharded flattened parameter_, and the second element is whether we can free it upon exiting `summon_full_params()`. This return value is **only used in `summon_full_params()`**. - If parameter mixed precision is enabled and the `FlatParameter` is already unsharded, then the low precision shard (`_mp_shard`) is still re-allocated on GPU. (It is freed at the end of the method.) - If CPU offloading is enabled and the `FlatParameter` is already unsharded, then there is a no-op `p.data = p.data.to(self.compute_device, non_blocking=True)`. - Inside `summon_full_params()`, `mixed_precision_cast_ran` is always `False`. Therefore, the return value for the `not p._is_sharded and mixed_precision_cast_ran` branch is unused. -`summon_full_params()` can only be called (before forward or after backward) or (between forward and backward). Given this, I cannot think of a case where we call `summon_full_params()`, the `FlatParameter` is already unsharded, but `reshard_after_forward` is `True`. The `FlatParameter` should be sharded (before forward or after backward), and the `FlatParameter` may only be unsharded (between forward and backward) if `reshard_after_forward` is `False`. - If parameter mixed precision is enabled and the sharding strategy is a sharded one, then inside `summon_full_params()`, the `FlatParameter` is unsharded in full precision. This involves allocating a new padded unsharded flattened parameter on GPU in full precision since `_full_param_padded` is in the low precision. Some comments: - Ideally, we reduce the complexity of the core code path: i.e. unshard for forward and pre-backward. If the return value is only used for `summon_full_params()`, we should consider if we can compartmentalize that logic. - The branching is complex, and some return values are never used, where this fact is not immediately obvious. We should see if we can reduce the branch complexity. Disclaimer: The difference in attribute semantics between `NO_SHARD` and the sharded strategies makes it challenging to unify the cases. This PR does not attempt to address that since it requires more design thought. However, it does attempt to reduce the complexity for the sharded strategies. Let us trace through the new logical unshard. 1. `FullyShardedDataParallel._unshard(self, handles: List[FlatParamHandle], prepare_gradient: bool)` - This iterates over the handles and calls `handle.pre_unshard()`, `handle.unshard()`, and `handle.post_unshard(prepare_gradient)` in the all-gather stream. 2. `FlatParamHandle.needs_unshard(self)` - We take an aside to look at this key subroutine. - For `NO_SHARD`, this returns `False`. - For sharded strategies, this checks if the padded unsharded flattened parameter is allocated. The padded unsharded flattened parameter is the base tensor for the unpadded unsharded flattened parameter, which is a view into the padded one. Thus, the padded one's allocation fully determines if the `FlatParameter` is unsharded. - For sharded strategies, to accommodate the parameter mixed precision + `summon_full_params()` case, we introduce `_full_prec_full_param_padded`, which is the padded unsharded flattened parameter in full precision. 
The helper `_get_padded_unsharded_flat_param()` takes care of this casing and returns the padded unsharded flattened parameter. Instead of allocating a new tensor each time, we manually manage `_full_prec_full_param_padded`'s storage just like for `_full_param_padded`. 3. `FlatParamHandle.pre_unshard(self)` - For sharded strategies, the postcondition is that the handle's `FlatParameter` points to the tensor to all-gather. This should be on the communication device and in the desired precision. The allocation and usage of the low precision shard for parameter mixed precision and the CPU -> GPU copy for CPU offloading both classify naturally in the pre-unshard. - For sharded strategies, if the `FlatParameter` does not need to be unsharded, `pre_unshard()` is a no-op. This avoids unnecessarily allocating and freeing the low precision shard. - For `NO_SHARD`, we simply preserve the existing semantics. 4. `FlatParamHandle.unshard(self)` - If the handle was resharded without freeing the padded unsharded flattened parameter (e.g. `summon_full_params()` between forward and backward when `reshard_after_forward=False`), then the `FlatParameter` points to the sharded flattened parameter. We need to switch to using the unsharded parameter. This is a design choice. Alternatively, we may not switch to using the sharded flattened parameter in `reshard()` if we do not free the padded unsharded flattened parameter. However, the postcondition that the `FlatParameter` points to the sharded flattened parameter after `reshard()` is helpful logically, so I prefer this approach. - Otherwise, this allocates the padded unsharded flattened parameter, all-gathers, and switches to using the unpadded unsharded flattened parameter. - In the future, we may add an option to `unshard()` that additionally all-gathers the gradient. 5. `FlatParamHandle.post_unshard(self, prepare_gradient: bool)` - For sharded strategies, if using parameter mixed precision, this frees the low precision shard. More generally, this should free any sharded allocations made in `pre_unshard()` since the all-gather has been launched. If using CPU offloading, the GPU copy of the local shard goes out of scope after `unshard()` and is able to be garbage collected. **We should understand if there is any performance difference between manually freeing versus deferring to garbage collection since our usage is inconsistent.** For now, I preserve the existing semantics here. - `prepare_gradient` is meant to be set to `True` for the pre-backward unshard and `False` for the forward unshard. This runs the equivalent logic of `_prep_grads_for_backward()`. - This post-unshard logic (notably the gradient preparation) now runs in the all-gather stream, which is fine because we always have the current stream wait for the all-gather stream immediately after `FullyShardedDataParallel._unshard()`. IIUC, we do not need to call `_mp_shard.record_stream(current_stream)` (where `current_stream` is the default stream) because `_mp_shard` is allocated and freed in the same (all-gather) stream. - A postcondition is that the `FlatParameter` is on the compute device. It should also have the unpadded unsharded size (though I do not have a check for this at the moment). Now that we see how the logical unshard has been reorganized for the core code path, let us dive into `summon_full_params()`. The two constraints are: 1. If using parameter mixed precision, we should unshard in full precision. 2. 
We must determine if we should free the padded unsharded flattened parameter upon exiting. The first constraint is addressed as described before in the core unshard code path, so it remains to explore the second constraint. I propose a simple rule: **We free iff we actually unshard the `FlatParameter` in `summon_full_params()`** (i.e. it was not already unsharded). We perform a case analysis: **Parameter mixed precision enabled:** * `NO_SHARD`: `flat_param.data` points to `flat_param._local_shard`, which is the full precision unsharded flattened parameter. This is **not safe to free**. * `FULL_SHARD` / `SHARD_GRAD_OP`: We force full precision and all-gather to `_full_prec_full_param_padded`. We do not support `nested summon_full_params()`, so `_full_prec_full_param_padded` must be unallocated. We unshard, and it is **safe to free**. **Parameter mixed precision disabled:** * `NO_SHARD`: This is the same as with mixed precision enabled. This is **not safe to free**. * `FULL_SHARD` / `SHARD_GRAD_OP`: We all-gather to `_full_param_padded`. It may already be unsharded. * Already unsharded: The unshard is a no-op. This is **not safe to free**. * For `FULL_SHARD`, this can happen for the root FSDP instance after `forward()` but before backward. * For `SHARD_GRAD_OP`, this can happen for all FSDP instances after `forward()` but before backward. * Needs unshard: We unshard. This is **safe to free**. Therefore, we see that it is not safe to free when using `NO_SHARD` and when using a sharded strategy but the `FlatParameter` is already unsharded. This is precisely the proposed rule. There were two notable edge cases that the existing code did not address. 1. The existing code tests if the `FlatParameter` is already unsharded by checking the allocation status of `_full_param_padded`. When using parameter mixed precision, this is the incorrect tensor to check. If `_full_param_padded` is allocated (e.g. when `reshard_after_forward=False` and calling `summon_full_params()` between forward and backward), the already-unsharded check is a false positive, and `summon_full_params()` does not correctly force full precision. https://github.com/pytorch/pytorch/issues/83068 - This PR's `needs_unshard()` check correctly routes to the appropriate padded unsharded flattened parameter depending on the calling context (i.e. if it needs to force full precision or not). 2. The existing code does not free the GPU copy of the padded unsharded flattened parameter when calling `summon_full_params(offload_to_cpu=True)`. It unshards the `FlatParameter`, moves the padded unsharded flattened parameter to CPU, and sets the `FlatParameter` data to be the appropriate unpadded view into the padded unsharded flattened parameter on CPU. However, `_full_param_padded` still points to the all-gathered padded unsharded flattened parameter on GPU, which is kept in memory. https://github.com/pytorch/pytorch/issues/83076 - This PR frees the GPU copy and reallocates it upon exiting `summon_full_params()`. This is essential for avoiding peak GPU memory usage from increasing as we recurse through the module tree. There may be some cases where we can avoid reallocation altogether, but that can be addressed in a follow-up PR. - This PR offloads the *unpadded* unsharded flattened parameter to CPU directly instead of the *padded* one. As far as I can tell, there is no need to include the padding since unflattening the original parameters does not require the padding. - The relevant code is in the context manager `FlatParamHandle.to_cpu()`. 
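The proposed rule can be summarized in a small decision helper (a sketch only; the function and argument names are hypothetical and do not match the real FSDP attributes):
```
def should_free_unsharded_flat_param(
    uses_sharded_strategy: bool,   # FULL_SHARD / SHARD_GRAD_OP vs. NO_SHARD
    was_already_unsharded: bool,   # padded unsharded flat param already allocated
) -> bool:
    # Free iff summon_full_params() actually performed the unshard.
    if not uses_sharded_strategy:
        # NO_SHARD: flat_param.data is the local (and only) copy -- never free.
        return False
    # Sharded strategies: if the FlatParameter was already unsharded (e.g.
    # between forward and backward with reshard_after_forward=False), the
    # unshard is a no-op and it is not safe to free.
    return not was_already_unsharded
```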
This PR removes the mixed precision stream usage. As is, I do not think there is any extra overlap being achieved by the stream usage. The low precision shard is allocated and copied to in the mixed precision stream ([code](https://github.com/pytorch/pytorch/blob/1f99bdfcc4a3f97d28471a531d2b69def762f6ba/torch/distributed/fsdp/fully_sharded_data_parallel.py#L1401-L1412)), and the current stream (in this case the all-gather stream) waits for the mixed precision stream ([code](https://github.com/pytorch/pytorch/blob/1f99bdfcc4a3f97d28471a531d2b69def762f6ba/torch/distributed/fsdp/fully_sharded_data_parallel.py#L1414)). However, we immediately schedule an all-gather that communicates that exact low precision shard ([code](https://github.com/pytorch/pytorch/blob/1f99bdfcc4a3f97d28471a531d2b69def762f6ba/torch/distributed/fsdp/fully_sharded_data_parallel.py#L3338)) with no other meaningful computation between. If we remove the mixed precision stream, the low precision shard is allocated and copied to in the all-gather stream (including the non-blocking CPU -> GPU copy if using CPU offloading). Under this PR's design, we may consider a "pre-unshard" stream for all logical pre-unshard data transfers if we want to overlap in the future. IIUC, the overlap opportunity exists if there are multiple `FlatParameter`s per module, and we only have the all-gather stream wait for the data transfer corresponding to the local shard it communicates, not the others. If we agree on removing the mixed-precision stream for now, I will remember to delete it from `_init_streams()`. Like with unshard, the first step is the look at the existing `_free_full_params()` and `_use_param_local_shard()`. A few notable observations: - For only `NO_SHARD`, `_free_full_params()` includes a call to `_free_mp_shard()`. - For `summon_full_params()`, there is a separate `_free_full_params_and_use_local_shard()` that duplicates the main logic of `_free_full_params()` and calls `_use_param_local_shard()`. - In `forward()`, if `reshard_after_forward=True`, we call `_free_full_params()` and then `_free_mp_shard()`. Hence, for `NO_SHARD`, the `_free_mp_shard()` is a no-op. - In the post-backward hook, we typically call `_free_full_params()` and `_free_mp_shard()`. The `_free_mp_shard()` is a no-op for `NO_SHARD` and if `reshard_after_forward=True`. Some comments: - The code certainly works, but some of the no-ops are subtle. When possible, we should make it clear when calls are no-ops or not. It is good that the existing code documents that `_free_mp_shard()` is a no-op in the post-backward hook when `reshard_after_forward=True`. However, there are still some non-obvious no-ops (around `NO_SHARD`). - We should see if we can avoid the duplicate `_free_full_params_and_use_local_shard()`. Let us trace through the logical reshard: 1. `FullyShardedDataParallel._reshard(self, handles: List[FlatParamHandle], free_unsharded_flat_params: List[bool])` - The two args should have the same length since they are to be zipped. - The goal of having `free_unsharded_flat_params` is that the caller should be explicit about whether the (padded) unsharded flattened parameter should be freed. The low precision shard is always meant to be freed (as early as possible), so there is no corresponding `List[bool]`. 2. `FlatParamHandle.reshard(self, free_unsharded_flat_param: bool)` - This frees the (padded) unsharded flattened parameter if `free_unsharded_flat_param` and switches to using the sharded flattened parameter. 
- Echoing back to forcing full precision in `summon_full_params()`, `_free_unsharded_flat_param()` frees the correct tensor by using `_get_padded_unsharded_flat_parameter()`. 3. `FlatParamHandle.post_reshard(self)` - I am not fully content with the existence of this method, but this seems to be an unavoidable consequence of `NO_SHARD`. Perhaps, this may be useful in the future for other reasons though. - Right now, this method is only meaningful for `NO_SHARD` + parameter mixed precision + outside `summon_full_params()`. `_mp_shard` is not freed in the post-unshard since it is also the low precision _unsharded_ flattened parameter, so we must delay the free until the post-reshard. Below the `FlatParamHandle.reshard()` and `post_reshard()` layer, there should not be any no-ops. One final comment I will mention is that I like the `pre_unshard()`, `unshard()`, `post_unshard()`, and `reshard()`, `post_reshard()` organization because it makes it clear what the boundaries are and their temporal relationship. Through that, we can set pre- and post-conditions. Furthermore, we can eventually convert logic to hooks that may be registered on the `FlatParamHandle` (for `pre_unshard()`, `post_unshard()`, and `post_reshard()`). This may improve the customizability of FSDP. - This PR reorganizes `forward()` in preparation for non-recursive wrapping, which uses pre-forward and post-forward hooks that expect the signature `hook(module, input)`. For FSDP, the `module` and `input` arguments are not used. - This PR creates a new method `_fsdp_root_pre_forward()` to handle the logic only the root FSDP should run. Finally, we dive into the prefetching changes. Some highlights: 1. This PR unifies the execution order validation and prefetching implementations. - Both involve the execution order and can be unified to share some boilerplate. 2. Execution order validation only runs when the distributed debug level is `INFO`. - We have yet to have one success case where we actually catch an unintended source of dynamism. The warning is also too verbose. Hence, we are gating it by the `INFO` level. 3. This PR moves prefetching to be with respect to groups of handles (as mentioned in the constructor comment). - This is essential for supporting prefetching with non-recursive wrapping. 4. This PR does not include "bubbles", i.e. modules with no handles, in the recorded execution order(s). This deviates from the existing implementation. - This makes prefetching possibly more aggressive (when there are such bubbles), but it should not have significant performance implications either way. 5. This PR changes backward prefetching to reset the post-forward order each iteration (as intended). 6. This PR changes forward prefetching to use the first iteration's pre-forward order instead of the first iteration's post-forward order. (We can discuss whether we want this in this PR or not. Otherwise, I can keep it as using the post-forward order to preserve the existing semantics.) This PR also removes the `all_gather_stream.wait_stream(current_stream)` before forward prefetching because it does not help with high GPU reserved memory. We can add that back if desired. The existing PT-D FSDP pre-backward prefetching uses the reverse post-forward order.
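To make the phase boundaries above concrete, here is a skeletal sketch of the `pre_unshard()` / `unshard()` / `post_unshard()` and `reshard()` / `post_reshard()` organization (method bodies elided; `FlatParamHandleSketch` and `_unshard_driver` are hypothetical illustration names, not the real FSDP code):
```
class FlatParamHandleSketch:
    def pre_unshard(self) -> None:
        # Postcondition: flat_param points at the tensor to all-gather,
        # on the communication device and in the desired precision
        # (low-precision shard allocation / CPU -> GPU copy happen here).
        ...

    def unshard(self) -> None:
        # Allocate the padded unsharded flat parameter, all-gather into it,
        # and switch to using the unpadded unsharded view.
        ...

    def post_unshard(self, prepare_gradient: bool) -> None:
        # Free pre-unshard allocations (e.g. the low-precision shard) now
        # that the all-gather is launched; optionally prepare gradients
        # for the pre-backward unshard.
        ...

    def reshard(self, free_unsharded_flat_param: bool) -> None:
        # Optionally free the padded unsharded flat parameter and switch
        # back to using the sharded flat parameter.
        ...

    def post_reshard(self) -> None:
        # Only meaningful for NO_SHARD + parameter mixed precision outside
        # summon_full_params(): free _mp_shard here, since it doubles as
        # the low-precision unsharded parameter.
        ...


def _unshard_driver(handles, prepare_gradient: bool) -> None:
    # FSDP-level driver: run the three unshard phases for each handle in
    # the group (in the all-gather stream in the real implementation).
    for handle in handles:
        handle.pre_unshard()
        handle.unshard()
        handle.post_unshard(prepare_gradient)
```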
Model Code
```
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 4, kernel_size=3),
            nn.BatchNorm2d(4),
            nn.ReLU(inplace=True),
        )
        self.block2 = nn.Sequential(
            nn.Conv2d(4, 4, kernel_size=3),
            nn.BatchNorm2d(4),
            nn.ReLU(inplace=False),
        )
        self.block3 = nn.Linear(12, 8)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(output_size=(1, 1)),
            nn.Flatten(),
            nn.Linear(4, 10),
        )

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        return self.head(x)


model = Model().cuda()
fsdp_kwargs = {}
model.block1[1] = FSDP(model.block1[1], **fsdp_kwargs)  # BN2d
model.block2[1] = FSDP(model.block2[1], **fsdp_kwargs)  # BN2d
model.block1 = FSDP(model.block1, **fsdp_kwargs)
model.block2 = FSDP(model.block2, **fsdp_kwargs)
model.block3 = FSDP(model.block3, **fsdp_kwargs)
model = FSDP(model, **fsdp_kwargs)
```
Execution Orders
```
Pre-backward hook for ('head.2.weight', 'head.2.bias') 140339520587136 (model)
Pre-backward hook for ('weight', 'bias') 140339461194656 (block3)
Pre-backward hook for ('0.weight', '0.bias') 140339520589776 (block2)
Pre-backward hook for ('weight', 'bias') 140339520587664 (block2 BN)
Pre-backward hook for ('weight', 'bias') 140339520586656 (block1 BN)
Pre-backward hook for ('0.weight', '0.bias') 140339520588768 (block1)

Pre-forward order:
('head.2.weight', 'head.2.bias') 140339520587136 (model)
('0.weight', '0.bias') 140339520588768 (block1)
('weight', 'bias') 140339520586656 (block1 BN)
('0.weight', '0.bias') 140339520589776 (block2)
('weight', 'bias') 140339520587664 (block2 BN)
('weight', 'bias') 140339461194656 (block3)

Reverse post-forward order:
('head.2.weight', 'head.2.bias') 140339520587136 (model)
('weight', 'bias') 140339461194656 (block3)
('0.weight', '0.bias') 140339520589776 (block2)
('weight', 'bias') 140339520587664 (block2 BN)
('0.weight', '0.bias') 140339520588768 (block1)
('weight', 'bias') 140339520586656 (block1 BN)
```
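Given a recorded order of handle groups like the ones above, the prefetching logic reduces to picking the group that follows the currently executing one. A hedged sketch (the names are illustrative, not the real FSDP helpers):
```
from typing import List, Optional, Tuple

# The atomic unit for prefetching: all handles in one module's forward
# (a singleton tuple today, but multiple FlatParameters are allowed).
HandlesKey = Tuple["FlatParamHandle", ...]


def next_handles_to_prefetch(
    order: List[HandlesKey],   # e.g. reverse post-forward order for backward
    current: HandlesKey,
) -> Optional[HandlesKey]:
    # Prefetch the all-gather for the group that comes right after the
    # group whose pre-backward (or pre-forward) hook is currently running.
    idx = order.index(current)
    return order[idx + 1] if idx + 1 < len(order) else None
```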
Differential Revision: [D39293429](https://our.internmc.facebook.com/intern/diff/D39293429) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83665 Approved by: https://github.com/zhaojuanmao commit a2acead00256cb2580afe8297dda0ad0134fe21e Author: Andrew Gu Date: Tue Sep 13 00:22:52 2022 +0000 [FSDP][Easy] Minor cleanup (#84761) This PR simply pulls out some minor changes from the next (monolithic) PR. Differential Revision: [D39392147](https://our.internmc.facebook.com/intern/diff/D39392147) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84761 Approved by: https://github.com/zhaojuanmao commit 8c2da0616c217c7732f5893b7a5e7ee80b8af4ff Author: PyTorch MergeBot Date: Tue Sep 13 16:46:24 2022 +0000 Revert "Upgrade to CUDNN version for cuda 11.7 (#84859)" This reverts commit 9064bf2c721ee1df0e3698344412043eb80e4fa7. Reverted https://github.com/pytorch/pytorch/pull/84859 on behalf of https://github.com/atalman due to Reverting broke periodic tests commit 351ac63cddf992e7dfff0af058a4d175ac37e142 Author: nikitaved Date: Tue Sep 13 07:35:51 2022 -0500 coo binary_op intersection primitives (#83427) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83427 Approved by: https://github.com/bhosmer, https://github.com/amjames, https://github.com/cpuhrsch commit 3f047b2a90fddf55487cbe42c17558beb7f29903 Author: Nikolay Korovaiko Date: Mon Sep 12 22:36:16 2022 -0700 SymInt support for computeStride (#84905) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84905 Approved by: https://github.com/ezyang commit 8b8141e971ed97aa5633b412bffde6e5bf31187c Author: Nikolay Korovaiko Date: Mon Sep 12 22:36:16 2022 -0700 SymInt support for multiply_integers (#84904) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84904 Approved by: https://github.com/ezyang commit ecee6c742f3d88bd567dc2e95a9ecccdad674854 Author: Nikolay Korovaiko Date: Mon Sep 12 16:56:39 2022 -0700 StmInt support for InferSize (#84903) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84903 Approved by: https://github.com/ezyang commit 7e900f204f8494ab52f4ad089608c8cb008a273c Author: Edward Z. Yang Date: Mon Sep 12 22:02:36 2022 -0700 Avoid throwing an exception when ScriptList doesn't match. (#84921) This prevents 'catch throw' gdb breakpoint pollution and should also improve performance. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84921 Approved by: https://github.com/Chillee commit 33bb8ae350611760139457b85842b1d7edf9aa11 Author: erjia Date: Tue Sep 13 13:38:58 2022 +0000 Set shuffle to DataPipes with set_shuffle API (#83741) This PR requires PR is landed: https://github.com/pytorch/pytorch/pull/83202 - For `apply_shuffle_setting` and `apply_shuffle_seed`, it makes sure it will apply shuffle setting to each of DataPipe that contains a method called `set_shuffle` or `set_seed`. - Change the API from `apply_shuffle_seed` to `apply_random_seed`. - Fix a bug that `apply_shuffle_seed` only accepts DataPipe that is hashable. After the PR, this function uses `id` to prevent seeding the same DataPipe multiple times per epoch. - Fix another bug from `shuffler` that `reset` with `_enable=False` would also reset `_seed`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83741 Approved by: https://github.com/NivekT commit 7a9ab5c232f54430704456d18a22f99838489817 Author: Edward Z. 
Yang Date: Mon Sep 12 22:02:31 2022 -0700 Move Python argument related functions to cpp file (#84919) No changes to contents, just moving things out of header. I only moved the stuff I suspected I'd be editing; maybe more things from this header could migrate out. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84919 Approved by: https://github.com/suo commit 99cfaf9eeea0a6f20d0b11d211db379473db748e Author: Andrew Gu Date: Tue Sep 13 00:22:52 2022 +0000 [FSDP] Subtest prefetching for `test_fsdp_grad_acc.py` (#84601) This modifies `test_fsdp_grad_acc.py` to test all 3 current sharding strategies and subtests prefetching. Differential Revision: [D39293432](https://our.internmc.facebook.com/intern/diff/D39293432) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84601 Approved by: https://github.com/zhaojuanmao commit dbd38f63f5731f8403edfdf9d5956ca872453dd3 Author: John Detloff Date: Tue Sep 13 05:09:15 2022 +0000 Include CoreML error description in exception thrown when inference fails (#84804) Summary: Catch the error and throw an exception with PTMCoreMLBackend when inference fails. This way the error description will be available in the logged crash, as opposed to crashing with a less descriptive exception. I'll be drafting follow up diffs to actually catch exceptions in the segmentation shim next. Ideally we would fail inference gracefully and not crash at all, but at least after this diff we'll have the full diagnostic info Test Plan: Force an error, and confirm its description appears in the exception thrown via the console Differential Revision: D39407865 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84804 Approved by: https://github.com/mcr229 commit e980ff8eb912576f5846f14563ba1bc2ee297bce Author: Sergii Dymchenko Date: Tue Sep 13 04:04:04 2022 +0000 Remove unused method_assignments (#84917) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84917 Approved by: https://github.com/huydhn commit d951165bd8e35e486fddd5c80f469ff644e66971 Author: vfdev-5 Date: Tue Sep 13 03:54:07 2022 +0000 [C++ API] Added missing antialiasing path in interpolation C++ api (#84599) Description: Following https://github.com/pytorch/pytorch/pull/69318#issuecomment-1238433540 adding missing bicubic path for anti-alias flag to C++ frontend. - https://github.com/pytorch/pytorch/pull/70930 - added tests in pytorch/test/cpp/api/functional.cpp Pull Request resolved: https://github.com/pytorch/pytorch/pull/84599 Approved by: https://github.com/kit1980, https://github.com/malfet commit 2fbc0fab20d4af520f69f158f8777e99ad761e1d Author: Huy Do Date: Tue Sep 13 03:06:11 2022 +0000 Setup sccache for linux test (#84916) The TTS alarm in HUD is quite noisy because of `pull / linux-focal-py3.7-gcc7 / test (backwards_compat)`. Some runs take up to 50m, i.e. [4893945786](https://github.com/pytorch/pytorch/actions/runs/3038960118/jobs/4893945786) while others take only 10m, i.e. [4893781147](https://github.com/pytorch/pytorch/actions/runs/3038943635/jobs/4893781147). Looking closer into their logs, it turns out that the longer runs have a much higher rate of cache miss. 
For example, [4893945786](https://github.com/pytorch/pytorch/actions/runs/3038960118/jobs/4893945786)
```
Compile requests 6487
Compile requests executed 6224
Cache hits 4975
Cache hits (C/C++) 4975
Cache misses 1227
Cache misses (C/C++) 1227
Cache timeouts 0
Cache read errors 0
Forced recaches 0
Cache write errors 0
Compilation failures 9
Cache errors 13
Cache errors (C/C++) 13
Non-cacheable compilations 0
Non-cacheable calls 16
Non-compilation calls 247
Unsupported compiler calls 0
Average cache write 0.096 s
Average cache read miss 11.681 s
Average cache read hit 0.040 s
Failed distributed compilations 0
Non-cacheable reasons:
multiple input files 15
unknown source language 1
Cache location S3, bucket: Bucket(name=ossci-compiler-cache-circleci-v2, base_url=http://ossci-compiler-cache-circleci-v2.s3.amazonaws.com/)
```
In https://github.com/pytorch/pytorch/pull/82103, we didn't set up `SCCACHE_S3_KEY_PREFIX` for `_linux-test`, which could explain the high rate of cache misses here. `backwards_compat` is a bit different from other tests in that it compiles PyTorch and gets the benefit from caching. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84916 Approved by: https://github.com/seemethere commit 9d5b3e4da8723bbac6879fc7cae8f27177f0c26d Author: Andrew Gu Date: Tue Sep 13 00:22:51 2022 +0000 [FSDP] Remove `forward_prefetch` (#84600) We are removing the `forward_prefetch` option. By the nature of async GPU kernel execution, launching the CPU kernel for the next layer's all-gather early does not actually improve performance. Moreover, the existing `forward_prefetch` uses the post-forward order instead of the pre-forward order, which leads to mis-targeted prefetched all-gathers. Differential Revision: [D39454217](https://our.internmc.facebook.com/intern/diff/D39454217) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84600 Approved by: https://github.com/zhaojuanmao commit 8f92140c4052ceecb104f6caf078fd8614c32be4 Author: PyTorch MergeBot Date: Tue Sep 13 02:43:15 2022 +0000 [vision hash update] update the pinned vision hash (#84913) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84913 Approved by: https://github.com/pytorchbot commit dc865bff4e9de5e02ea27d0b702a74c2bf63f02f Author: Seonglyong Gong Date: Tue Sep 13 01:48:41 2022 +0000 [Profiler] set_class util (part 1 of Record Optimizer) (#84779) Summary: Part 1 of Record Optimizer param_groups and states (https://github.com/pytorch/pytorch/pull/84063) - nnModule and Optimizer have duplicated parts - create a util function to avoid duplication Test Plan: buck run mode/opt //caffe2/test:profiler Differential Revision: D39397210 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84779 Approved by: https://github.com/robieta commit 6d222116a13d55c2aa2211938f9df686535fbd51 Author: Abhijit Deo <72816663+abhi-glitchhg@users.noreply.github.com> Date: Tue Sep 13 00:29:50 2022 +0000 [Documentation] Minor rendering issue (#84856) There is a rendering issue with the docstring of nn.GELU. Hope this fixes the [issue.](https://pytorch.org/docs/stable/generated/torch.nn.GELU.html) cc: @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/84856 Approved by: https://github.com/kit1980 commit 964fde7d7ceac67db6f0e30fc4a499d02904b09e Author: Edward Z.
Yang Date: Mon Sep 12 11:38:08 2022 -0700 Raise AttributeError for __origin__ to avoid C++ exception raise (#84880) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84880 Approved by: https://github.com/wconstab commit 260b716c65a17a6791fc70420de3553be220a3da Author: Dhruv Matani Date: Sun Sep 11 19:44:27 2022 -0700 [Mobile Tracer] Allow tracing multiple input models at once (#84833) Summary: For practical usage, folks may want to custom build PyTorch for support with multiple models. The current tracer allows tracing just one model. There are multiple way to address this limitation: 1. Provide a tool to merge multiple YAML files produced by each of these runs. Each run corresponds to a YAML file for a single model. 2. Allow the tracer to run multiple models at once. This PR implements the solution [2] above. Test Plan: Build the tracer using: `USE_NUMPY=0 USE_DISTRIBUTED=0 USE_CUDA=0 TRACING_BASED=1 python setup.py develop` Run with 1 input file: `./build/bin/model_tracer --model_input_path /tmp/path_to_model.ptl --build_yaml_path /tmp/selected_ops.yaml` Run with multiple input files: `./build/bin/model_tracer --model_input_path /tmp/path_to_model.ptl,/tmp/path_to_model.ptl --build_yaml_path /tmp/selected_ops.yaml` Both runs completed successfully. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84833 Approved by: https://github.com/JacobSzwejbka commit 5a29db142e4665577149e305db0583e55ef21683 Author: Xiao Wang <24860335+xwang233@users.noreply.github.com> Date: Mon Sep 12 22:45:38 2022 +0000 Use int64_t index type in multiplications to avoid integer overflow in max_pool2d and avg_pool2d on CUDA (#68682) Fix https://github.com/pytorch/pytorch/issues/68418 - [X] operator benchmark: https://github.com/xwang233/code-snippet/tree/master/pooling-bench-68682, 10% or worse regression are seen in some shapes - [X] end-to-end benchmark: no major regression seen in our test suites Pull Request resolved: https://github.com/pytorch/pytorch/pull/68682 Approved by: https://github.com/ngimel commit 918cd8b9bafe92da4209765c303b861eb10edf82 Author: David Eklov Date: Mon Sep 12 22:35:19 2022 +0000 [torch::deploy] Ignore return value of function declared with 'warn_unused_result' (#84862) Summary: Addresses the following build failure that we get on some of our internal build environments: caffe2/torch/csrc/deploy/environment.h:60:5: error: ignoring return value of function declared with 'warn_unused_result' attribute [-Werror,-Wunused-result] system(rmCmd.c_str()); Test Plan: buck build //caffe2/torch/... 
Differential Revision: D39364411 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84862 Approved by: https://github.com/PaliC commit 9b16bf04af4bdcc352998f77b96052f21567a2f2 Author: Nikita Shulga Date: Mon Sep 12 22:25:26 2022 +0000 Fix MPS test sanity (#84889) Follow up after https://github.com/pytorch/pytorch/pull/84834 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84889 Approved by: https://github.com/tugsbayasgalan, https://github.com/janeyx99, https://github.com/ZainRizvi commit d09e8b23bf8c8c73f406b5610eda94e9dd3c2e96 Author: Ryan Spring Date: Mon Sep 12 22:19:06 2022 +0000 [primTorch] Add repeat and unfold_copy references (#81374) Add References: - repeat - unfold - expand_as Pull Request resolved: https://github.com/pytorch/pytorch/pull/81374 Approved by: https://github.com/mruberry, https://github.com/ngimel commit d6733327829fa02295c239ad96a26bef8afa6da4 Author: Ivan Zaitsev Date: Mon Sep 12 21:47:25 2022 +0000 Forward fix for FB internal breakage (manual export of internal diff D39421802) (#84871) D39421802 is a forward-fix for D39419569 (corresponds to #84806). The forward-fix is mostly internal-facing, but has a single line that has to be exported to GH first. Manual export was required, because automatic export failed due to diff dependecies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84871 Approved by: https://github.com/mehtanirav commit 5238404f4d8fcb21adbed7bd7a4a2a836e5764c2 Author: Kento Nozawa Date: Mon Sep 12 21:38:16 2022 +0000 Increment `version_range_max` (#84815) Python 3.10 should be added as a listing in `Programming Language` on https://pypi.org/project/torch/: Screenshot 2022-09-11 at 2 48 01 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84815 Approved by: https://github.com/malfet commit c85e47b36895d44e797536d2fbb45b3edc049767 Author: Andrew Gu Date: Mon Sep 12 21:31:01 2022 +0000 [BE][PT-D] Fix race on checkpoint file (#84881) Without calling `dist.barrier()` before removing the checkpoint file, rank 0 may run ahead and delete the checkpoint file before nonzero ranks are able to load from the checkpoint. This PR adds a `dist.barrier()` to ensure all ranks can load the checkpoint before rank 0 deletes it. For example, including the added `dist.barrier()`: https://github.com/pytorch/pytorch/blob/037e8eefcf0b669430211b83d19aedf2185ed6fc/torch/testing/_internal/distributed/distributed_test.py#L5068-L5098 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84881 Approved by: https://github.com/rohan-varma commit 3baa363f715547b2e9c569434fc7d2d226afd37e Author: Nikita Shulga Date: Fri Sep 9 22:52:12 2022 +0000 [Functorch] Make minpybind less likely to crash (#84788) Error handling in `vector_args::parse` is fundamentally broken, as it calls to [`_PyArg_ParseStackAndKeywords`](https://github.com/python/cpython/blob/000593c0f97ac9b75b56064a957b84a3aaa60674/Include/modsupport.h#L106) variadic function, which are akin to `sscanf` without any arguments, which results, in case of partial parse in a random segfault/stack corruption. 
Remedy it by passing a few references to dummy pyobjects Pull Request resolved: https://github.com/pytorch/pytorch/pull/84788 Approved by: https://github.com/zou3519 commit ccb1ff22333ed1d28ee5287d75a3322b12b93f6a Author: karanprime Date: Mon Sep 12 21:15:02 2022 +0000 Updated invalid type error message to explicitly say only float types… (#83170) … allowed Fixes #82983 I added an error message consistent with existing invalid type error messages (line 330 in torch/csrc/tensor/python_tensor.cpp). Pull Request resolved: https://github.com/pytorch/pytorch/pull/83170 Approved by: https://github.com/ezyang, https://github.com/kit1980 commit cfeb53170051f4cf942ae683e1727fd48f60f18c Author: Alexander Grund Date: Mon Sep 12 20:59:17 2022 +0000 Fix failing test_model_dump due to empty file (#84744) The `torch.jit.save` call on a file object may not actually write the data to disk due to buffering. The call to `model_dump.main` on that file will then fail with an error like > zipfile.BadZipFile: File is not a zip file Inspecting the file confirms that it is either empty (usually) or incomplete (possible). Fix this by flushing the file after saving the model. Fixes #84745 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84744 Approved by: https://github.com/kit1980 commit cd3731bd1774ed3d152a5307c45cdbdc90ef8536 Author: Nikita Shulga Date: Mon Sep 12 18:28:23 2022 +0000 [BE] Refactor `_is_compiled()` function (#84877) Call it from `is_available()` and `device_count()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84877 Approved by: https://github.com/ngimel commit 31cc03cc132020244f6985aefdcd1c05babc2e17 Author: Jack Danger Date: Mon Sep 12 20:37:39 2022 +0000 fixing English typo in MPSFallback error message (#84834) Changing "current supported" to "currently supported" Pull Request resolved: https://github.com/pytorch/pytorch/pull/84834 Approved by: https://github.com/Chillee, https://github.com/kulinseth, https://github.com/kit1980 commit bb4e96c9644a034e593085026b781ee78a4d6a77 Author: soulitzer Date: Mon Sep 12 11:47:12 2022 -0400 [reland] Call jit decomposition in VariableType to increase forward AD coverage (#84151) (#84675) This reverts commit acb4a09628284201281e262aaee58e3dc6be9c2b. In addition, we also fix a memory leak in layer norm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84675 Approved by: https://github.com/zou3519 commit a2cccb2d6b0461b7f9c9922096a74240225ebc7b Author: Wenzhe Xue Date: Mon Sep 12 20:09:00 2022 +0000 add oneDNN graph fuser context API and unittest (#82491) Add oneDNN graph context manager API to be consistent with other fusers. NNC and nvFuser have two ways to use: 1) a function to enable/disable and 2) a context manager. The latter way is used extensively in libraries like Dynamo. Currently the oneDNN Graph fuser only has the former way. To promote the usage of the oneDNN graph fuser, this PR creates the context manager for it. This PR should not affect any performance. A unit test `test_context_manager` is added under `test/test_jit_llga_fuser.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/82491 Approved by: https://github.com/malfet commit c304a1206bbffef33ed7c7c20aa0a4f1e169a32c Author: Andrew Gu Date: Mon Sep 12 17:24:12 2022 +0000 [FSDP][Easy] Remove unused functions (#84598) This removes some leftover functions from the constructor refactor.
Differential Revision: [D39293430](https://our.internmc.facebook.com/intern/diff/D39293430) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84598 Approved by: https://github.com/zhaojuanmao commit 9e5563dbb1cb789d41370b8ab6120c62a385a74a Author: Edward Z. Yang Date: Mon Sep 12 09:25:09 2022 -0700 Delete SymIntArrayRef wrapper struct (#84837) Since we separated at::foo and at::foo_symint there is no benefit to trying to make initializer lists work in both cases. So we can get rid of the special different struct. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84837 Approved by: https://github.com/kit1980 commit 7e43c6f28e9b0699dd6f0a3803513feacc60eaf3 Author: titaiwang Date: Fri Sep 9 23:25:54 2022 +0000 [ONNX] replace AT_ASSERT with TORCH_INTERNAL_ASSERT (#84790) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84790 Approved by: https://github.com/kit1980 commit 034f2db1fdb253421e79bf36edca5423fd390e3a Author: PyTorch MergeBot Date: Mon Sep 12 19:04:07 2022 +0000 Revert "Delete SymIntArrayRef wrapper struct (#84837)" This reverts commit 9c78f599e40eac7fee027d86e03af06e251705b5. Reverted https://github.com/pytorch/pytorch/pull/84837 on behalf of https://github.com/ZainRizvi due to The test test_post_localSGD_optimizer_step_reload in the X linux-bionic-cuda11.6-py3.10-gcc7 workflow has started consistently failing since this PR was submitted commit c3df78f436d1906a920afabbd8c58af4ec8471d9 Author: Christian Puhrsch Date: Mon Sep 12 17:41:38 2022 +0000 TARGETs changes for flash attention and cutlass (#84781) Summary: Integrate flash attention and use it when the inputs align just right Test Plan: Unit tests and such Differential Revision: D39364603 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84781 Approved by: https://github.com/mikaylagawarecki commit 9064bf2c721ee1df0e3698344412043eb80e4fa7 Author: atalman Date: Mon Sep 12 17:09:05 2022 +0000 Upgrade to CUDNN version for cuda 11.7 (#84859) Upgrade to CUDNN version for cuda 11.7 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84859 Approved by: https://github.com/malfet commit 4f6027b78a8f2e1fc07a50f9e0096de28ede429d Author: kshitij12345 Date: Mon Sep 12 16:59:05 2022 +0000 [opinfo] narrow: add new sample for Tensor overload (#84785) `narrow` accepts `start` argument to be a Tensor. We add a sample to test this overload. NOTE: This leads to a bunch of failed tests and hence the skips and xfails Pull Request resolved: https://github.com/pytorch/pytorch/pull/84785 Approved by: https://github.com/zou3519 commit a06f2edab63adc951afe1a8e3bf9ba606b729af1 Author: Dhruv Matani Date: Sat Sep 10 20:19:53 2022 -0700 [Build] Replace message() in caffe2/CMakeLists.txt with message in cmake/Summary.cmake (#84814) Summary: In [PR 84755](https://github.com/pytorch/pytorch/pull/84755), @cccclai noticed and mentioned the presence of `message(STATUS...)` logging in caffe2/CMakeLists.txt and suggested moving it to the file cmake/Summary.cmake. This PR addresses that comment/suggestion. 
Test Plan: Ran the build as `USE_NUMPY=0 USE_DISTRIBUTED=0 USE_CUDA=0 TRACING_BASED=1 python setup.py develop` and saw the following being printed:
```
-- BUILD_MOBILE_AUTOGRAD : OFF
-- BUILD_LITE_INTERPRETER: OFF
-- INTERN_BUILD_MOBILE :
-- TRACING_BASED : 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84814 Approved by: https://github.com/cccclai commit d6b2f5c6433ef3e63e046096f4ad54e26eb17d10 Author: Jesse Cai Date: Fri Sep 9 07:29:36 2022 -0700 [Quant][fx] Remove `remove_quant_dequant_pairs` and fix tests (#84203) Summary: - `remove_quant_dequant_pairs` removes ops when a `quant` is followed by a `dequant` - It looks like the quantized implementation of `layer_norm` only supports float weights, so updated the default qconfig to avoid quantizing the weight param. - Fixes broken test, `test_norm_weight_bias`. This was the only test that broke, because the default qconfig dict we pass in quantizes the weight. I just pulled the native qconfig object and converted it to a dict. - Adds in qconfig and backend config support for layernorm Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFxModels
```
Reviewers: Subscribers: Tasks: Fixes https://github.com/pytorch/pytorch/issues/83110 Tags: quant, fx Differential Revision: [D39395141](https://our.internmc.facebook.com/intern/diff/D39395141) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84203 Approved by: https://github.com/jerryzh168 commit e217b30b0fc98df226654cef0617eb41a177531f Author: Mikayla Gawarecki Date: Mon Sep 12 04:03:49 2022 +0000 Add `torch.nested` namespace (#84102) First step towards #83775 - only `to_padded_tensor` is moved to the nested namespace for now - following the schema used for `special`, `fft`, `linalg` and other namespaces, nested functions are registered in native_functions.yaml as `nested_{function_name}` and are bound to the desired Python name in `torch/nested/__init__.py`, and the desired C++ name in `torch/csrc/api/include/torch/nested.h`. ~~**Question**: should we keep the documentation for `Tensor.to_padded_tensor` or can this be deleted since it is shared by `torch.nested.to_padded_tensor`?~~ [generated nested docs](https://docs-preview.pytorch.org/84102/nested.html?highlight=nested#module-torch.nested) Differential Revision: [D39361148](https://our.internmc.facebook.com/intern/diff/D39361148) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84102 Approved by: https://github.com/drisspg commit 9c78f599e40eac7fee027d86e03af06e251705b5 Author: Edward Z. Yang Date: Mon Sep 12 09:25:09 2022 -0700 Delete SymIntArrayRef wrapper struct (#84837) Since we separated at::foo and at::foo_symint there is no benefit to trying to make initializer lists work in both cases. So we can get rid of the special different struct. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84837 Approved by: https://github.com/kit1980 commit 8cdc0679b91d6e688e857b3e02caa2b4823d19ab Author: Jeff Daily Date: Mon Sep 12 15:20:51 2022 +0000 [ROCm][jiterator] unskip additional tests (#84371) Follow-up to #77982. Unskip additional jiterator tests for ROCm.
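As a usage note for the `torch.nested` namespace commit above, the relocated binding can be exercised roughly like this (a sketch; it assumes a build where the `torch.nested` namespace and the nested-tensor constructor are available):
```
import torch

# Two variable-length sequences packed into a nested tensor.
nt = torch.nested.nested_tensor([torch.randn(2, 5), torch.randn(4, 5)])

# Pad out to a dense (2, 4, 5) tensor, filling the gaps with 0.0.
padded = torch.nested.to_padded_tensor(nt, 0.0)
print(padded.shape)  # torch.Size([2, 4, 5])
```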
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84371 Approved by: https://github.com/ngimel, https://github.com/SherlockNoMad commit 2698f99dc7a2efe6d60fffa43beb545901a57c9b Author: Slava Kovalevskyi Date: Mon Sep 12 14:15:52 2022 +0000 fixing form link for governance (#84861) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84861 Approved by: https://github.com/malfet commit d2d145a40001d1e1f815a144160bd0b8d0f60ea0 Author: PyTorch MergeBot Date: Mon Sep 12 10:03:48 2022 +0000 [xla hash update] update the pinned xla hash (#84853) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84853 Approved by: https://github.com/pytorchbot commit 5ea2eb304ea6e9bab0c68fc57dbffbc068354ce7 Author: Horace He Date: Mon Sep 12 02:16:46 2022 +0000 Converted batch norm over to use symints (#84113) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84113 Approved by: https://github.com/wconstab, https://github.com/ezyang commit caf034a9a2326bbf70a8f042cb5b527b789b3062 Author: Edward Z. Yang Date: Sun Sep 11 06:29:09 2022 -0700 Fix bugs in how LTC decides whether or not to symint op or not (#84832) This fixes two problems: - First, shape signature didn't respect the symint property (so it would always mark the operator as symint). This was relatively easy to fix. - Second, the call to fallback goes directly through at::_ops, so it must always be SymInt-aware, even if SymInt is disabled externally. This was a bit more difficult, because the current LTC codegen is poorly factored. First, I needed to make it so individual arguments knew if they were going to be SymInt in LTC or not; second, I need to plumb enough information about the enclosing bindings so that I could use translate to do the expressions (previously, it was just assumed the signatures matched.) The LTC codegen would do well to have a complete rewrite, but this will have to do for now, I suppose. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84832 Approved by: https://github.com/wconstab commit bfc6db0a5af1acf2a4cb864c334f17bcd08fc079 Author: PyTorch MergeBot Date: Sun Sep 11 02:38:32 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#84828) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84828 Approved by: https://github.com/pytorchbot commit 5f960db0e01839f1de8735060b374ea6cbd1713a Author: Dhruv Matani Date: Sat Sep 10 09:36:31 2022 -0700 [Mobile] Add support for dtypes and custom classes in model tracer (#84795) Summary: Currently, the model tracer generates the selected features YAML file only with used operators. This change adds support for dtypes and custom classes as well. We need to add the flag `-DENABLE_RECORD_KERNEL_FUNCTION_DTYPE` when building PyTorch in Instrumentation Mode (i.e. with `TRACING_BASED=1` for server builds) to enable capturing this data. 
Test Plan: Built using `USE_NUMPY=0 USE_DISTRIBUTED=0 USE_CUDA=0 TRACING_BASED=1 python setup.py develop` Ran the model tracer to observe this generated file: https://gist.github.com/dhruvbird/50e1860b39ae065e57d58f17e0912136 Then used the generated YAML to build pytorch (minimal build) using the command
```
BUILD_PYTORCH_MOBILE_WITH_HOST_TOOLCHAIN=1 \
USE_LIGHTWEIGHT_DISPATCH=0 BUILD_LITE_INTERPRETER=1 \
SELECTED_OP_LIST=/tmp/selected_ops.yaml \
TRACING_BASED=1 \
./scripts/build_mobile.sh
```
After that I generated a binary using this command:
```
g++ /tmp/main.cpp -L build_mobile/lib/ -I build_mobile/install/include/ -ffunction-sections -fdata-sections -Wl,--gc-sections \
-lpthread -lc10 -Wl,--whole-archive -ltorch_cpu -Wl,--no-whole-archive -ltorch -lXNNPACK \
-lpytorch_qnnpack -lcpuinfo -lclog -lpthreadpool -lkineto -lfmt -ldl -lc10
```
The table below shows the size reduction in all build modes.

| Build Type | Unstripped | Stripped |
| ----------- | ----------- | ----------- |
| Standard | 49MiB | 34MiB |
| Minimal w/o dtype | 6.1MiB (12%) | 4.5MiB (18%) |
| Minimal w/ dtype | 3.7MiB (7%) | 2.7MiB (11%) |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84795 Approved by: https://github.com/cccclai commit 0455c9b036a18505d7ae19b6d8b4ef9bef869365 Author: PyTorch MergeBot Date: Sat Sep 10 18:34:33 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#84797) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84797 Approved by: https://github.com/pytorchbot, https://github.com/kit1980 commit b5e921b89e355a6aa7b80fa35556cffe9438bc15 Author: PyTorch MergeBot Date: Sat Sep 10 18:33:49 2022 +0000 [vision hash update] update the pinned vision hash (#84798) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84798 Approved by: https://github.com/pytorchbot, https://github.com/kit1980 commit 21bf9a467eb86a152e471e04837b98617098f32f Author: Khushi Agrawal Date: Sat Sep 10 13:30:43 2022 +0000 [jiterator] logical_{or, xor} : complex (#75947) Follows: #74748 cc @kshitij12345! Pull Request resolved: https://github.com/pytorch/pytorch/pull/75947 Approved by: https://github.com/ngimel, https://github.com/kshitij12345 commit 08c4f8c7a76f3f1c874aa357fc5aafdfb87ce680 Author: Xiang Gao Date: Sat Sep 10 10:56:05 2022 +0000 ProcessGroupUCC tests (#83285) - [x] Direct dependency on UCX is completely removed, UCC active set API always enabled - [x] Remove `TORCH_UCC_PROFILING_ENABLE`, always enable profiling - [x] Fixes profiling of `recv` and `all_gather` - [x] Use the NCCL TL of UCC on CUDA, as the UCP TL is not well supported on CUDA Most tests are passing, but there are a few skipped tests: - `scatter` and `gather` are not supported by the UCP TL of UCC on CPU tensors - A few flaky tests in PyTorch's CI environment - Profiler-related failures, some of them will be fixed by @Fuzzkatt in https://github.com/pytorch/pytorch/pull/84368 After this PR is merged, I will continue to work on these skipped failures.
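Relating to the ProcessGroupUCC tests above, a minimal end-to-end sketch of exercising the UCC backend (assumes a PyTorch build with UCC support and the usual rank/world-size environment variables set by the launcher):
```
import torch
import torch.distributed as dist

# Initialize a UCC-backed process group (env:// rendezvous).
dist.init_process_group(backend="ucc")
rank, world_size = dist.get_rank(), dist.get_world_size()

t = torch.full((4,), float(rank))
gathered = [torch.empty(4) for _ in range(world_size)]
dist.all_gather(gathered, t)  # one of the collectives covered by the tests

if rank == 0:
    print([g[0].item() for g in gathered])  # one entry per rank

dist.destroy_process_group()
```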
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83285 Approved by: https://github.com/vtlam, https://github.com/malfet, https://github.com/kwen2501 commit 2765243cd5e657a92142d09504dafadb058de63f Author: Mengwei Liu Date: Sat Sep 10 06:58:56 2022 +0000 [torchgen] Refactor static_dispatch to take in source signature (#84384) Summary: Context: currently `static_dispatch` assumes that given a native function `f`, we always want to map from its `DispatchSignature` to its `CppSignature`. This assumption may not hold true for some use cases, where the source bindings may not come from its `DispatchSignature`. Here I'm changing the argument `sig: DispatcherSignature` to be `sig: Union[CppSignature, DispatcherSignature]`, also removes unused `f` Test Plan: Rely on added unit test. Differential Revision: D39192969 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84384 Approved by: https://github.com/iseeyuan commit c5a8946e40d6cda42aa38dda2705ea4e9930c2cb Author: Edward Z. Yang Date: Sat Sep 10 00:00:38 2022 -0400 Revert "Revert "Redo how custom/python_custom methods on TensorImpl work (#84796)" (#84806) This reverts commit ca3b2bfbe3945c756a67a784aaa7d9891698c59b. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84806 Approved by: https://github.com/Chillee commit bccc26f365d8b795e2931797d283e32c5f47aa0f Author: Abhishek Pathak Date: Sat Sep 10 03:10:04 2022 +0000 [MPS] Handle casting for div operation (#84742) * Handle casting for div operation * Update divmode test to test for rounding mode in div cc. @lhoenig Pull Request resolved: https://github.com/pytorch/pytorch/pull/84742 Approved by: https://github.com/razarmehr commit ddc56732ae30a3290a577e9694e037e108a3fff3 Author: Nikita Shulga Date: Sat Sep 10 02:45:35 2022 +0000 [GHF][BE] Delete land checks branch when done (#84767) Also, don't create this branch if running with dry-run Pull Request resolved: https://github.com/pytorch/pytorch/pull/84767 Approved by: https://github.com/clee2000, https://github.com/huydhn commit b7d2818598b1c9f6d35f027f8d122543521c6a6a Author: Ryan Spring Date: Sat Sep 10 02:36:41 2022 +0000 Return contiguous tensor from native_layer_norm reference (#84799) Fixes https://github.com/pytorch/pytorch/issues/84618 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84799 Approved by: https://github.com/Chillee commit 5e25c2b4ccb3224366d3cd3dc790b1e23440a49f Author: Xiang Gao Date: Sat Sep 10 00:50:02 2022 +0000 Add missing spaces to error messages in PG (#84780) Just some formatting, no real changes. See https://github.com/pytorch/pytorch/pull/83285#discussion_r966469992 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84780 Approved by: https://github.com/kit1980 commit ca3b2bfbe3945c756a67a784aaa7d9891698c59b Author: Eli Uriegas Date: Sat Sep 10 00:18:13 2022 +0000 Revert "Redo how custom/python_custom methods on TensorImpl work (#84796) This reverts commit 591b75bf98b92acd4f3d0a1dc934198afeaa6fc1. 
Manual revert of https://github.com/pytorch/pytorch/pull/84641 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/84796 Approved by: https://github.com/izaitsevfb commit 96e4bd950027a2f472fafa98616c92403a890bd2 Author: Dmytro Dzhulgakov Date: Sat Sep 10 00:09:57 2022 +0000 [docs] Person of interest update: sparse, torchrec and smaller tweaks (#84772) Fixes #83363 This is not a full update yet, but fixes some obvious things: missing modules (torchrec, sparse) and brings a few people from merge_rules.json who are working on the respective modules. There are still discrepancies - e.g. Intel CPU work is split in many categories in merge_rules, but it's better to improve things incrementally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84772 Approved by: https://github.com/b0noI, https://github.com/malfet commit f598b5be1825fc0a12b5013c547fb5972b57b208 Author: Sergii Dymchenko Date: Fri Sep 9 23:00:12 2022 +0000 Remove last bit or torch.eig from functorch/test/test_ops.py (#84787) After https://github.com/pytorch/pytorch/pull/70982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84787 Approved by: https://github.com/suo, https://github.com/seemethere commit cd50512d414e352fb9088805d8d66bf6880895d1 Author: Xu Zhao Date: Fri Sep 9 22:01:20 2022 +0000 Upload the benchmark result to S3 and post the URL (#84726) Upload the benchmark result to S3 and make it accessible to the public. The URL is available at the end of the "Upload to S3" step of the workflow. For example, this PR uploads 3 files: ``` Uploaded the result file control.json to https://ossci-metrics.s3.amazonaws.com/torchbench-pr-test/pr84726/control.json Uploading file treatment.json to S3 with key: torchbench-pr-test/pr84726/treatment.json Uploaded the result file treatment.json to https://ossci-metrics.s3.amazonaws.com/torchbench-pr-test/pr84726/treatment.json Uploading file result.csv to S3 with key: torchbench-pr-test/pr84726/result.csv Uploaded the result file result.csv to https://ossci-metrics.s3.amazonaws.com/torchbench-pr-test/pr84726/result.csv ``` RUN_TORCHBENCH: nvfuser Pull Request resolved: https://github.com/pytorch/pytorch/pull/84726 Approved by: https://github.com/davidberard98 commit 01c54ad6dedf2ab0206a37f6df1af4fe41afa051 Author: Ivan Yashchuk Date: Fri Sep 9 21:31:57 2022 +0000 Remove deprecated torch.eig (#70982) The time has come to remove deprecated linear algebra related functions. This PR removes `torch.eig`. cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @Lezcano Pull Request resolved: https://github.com/pytorch/pytorch/pull/70982 Approved by: https://github.com/Lezcano, https://github.com/malfet commit c4a5255df77c7b945d134fa58fe59684f41c33a8 Author: Dhruv Matani Date: Fri Sep 9 12:16:52 2022 -0700 [Mobile Tracer] Use unified source file list for BUCK build (#84770) Currently, the source list `torch_mobile_tracer_sources` in `build_variables.bzl` is used only for OSS build. This resulted in a regression for OSS builds when `TRACING_BASED=1` was used to build the OSS model tracer binary. To prevent this from happening in the future, it makes sense to re-use this list for internal BUCK builds as well. This change does that. Differential Revision: [D39392010](https://our.internmc.facebook.com/intern/diff/D39392010/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39392010/)! 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84770 Approved by: https://github.com/cccclai commit 1dabb51a16eb6cf81475efecb1d39c4683af50fb Author: Vasiliy Kuznetsov Date: Fri Sep 9 10:35:47 2022 -0700 quant: add `extra_repr` to HistogramObserver (#84760) Summary: Adds `extra_repr` to `HistogramObserver`. This is useful when debugging PTQ models because it allows to quickly check whether a `HistogramObserver` has received data or not. Test plan: ``` >>> import torch >>> obs = torch.ao.quantization.HistogramObserver() >>> obs(torch.randn(1, 3, 224, 224)) ... >>> print(obs) // before - hard to tell if observer has seen data HistogramObserver() // after HistogramObserver(min_val=-4.778339862823486, max_val=4.311892986297607) >>> ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84760 Approved by: https://github.com/andrewor14 commit 0fc02dbba40129e4d0cb01aea2a4667bf0cc928f Author: Driss Guessous Date: Fri Sep 9 20:11:26 2022 +0000 flash_attention integration (#81434) - I added a new submodule Cutlass pointing to 2.10 release. The inclusion of flash_attention code should be gated by the flag: USE_FLASH_ATTENTION. This is defaulted to off resulting in flash to not be build anywhere. This is done on purpose since we don't have A100 machines to compile and test on. - Only looked at CMake did not attempt bazel or buck yet. - I included the mha_fwd from flash_attention that has ben refactored to use cutlass 2.10. There is currently no backwards kernel on this branch. That would be a good follow up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81434 Approved by: https://github.com/cpuhrsch commit 219ff26172d0b5abea89ea5bde7e0f7119efed59 Author: PyTorch MergeBot Date: Fri Sep 9 20:01:07 2022 +0000 Revert "Add __all__ for a few distributed modules plus a little typing (#84119)" This reverts commit 6f216805634e5859b76253432542a1c4c60ee573. Reverted https://github.com/pytorch/pytorch/pull/84119 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D39386448 commit 2614079f890193ea78099cdc7b4361d5e1ccfde1 Author: Peter Bell Date: Thu Sep 8 15:18:23 2022 +0100 OpInfo: Prevent clamp sample inputs from sharing tensors (#84696) As per the comment, re-using tensors between sample inputs is strongly discouraged. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84696 Approved by: https://github.com/ngimel commit 5c0c8f2ce344f74849afaed88df93292cb30ce0b Author: Max Ren Date: Fri Sep 9 19:32:40 2022 +0000 [coreml][bug] coreml gpu flag not set (#84725) Summary: Delegated CoreML models with cpuAndGPU flag set does not properly run models on CPU - Fix will allow us to target models on CPU Test Plan: brymkowski can you test this on your performance benchmarks? 
Reviewed By: salilsdesai Differential Revision: D39361382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84725 Approved by: https://github.com/jmdetloff commit fe47e61425dc7b44c7e90fc6c06052a83786ad57 Author: Salil Desai Date: Fri Sep 9 19:19:12 2022 +0000 [QNNPack] Update GoogleTest SHA256 Hash (#84754) Summary: Fixes hash mismatch error when building qnnpack Test Plan: ``` export ANDROID_NDK=/opt/android_ndk/r20 export ANDROID_NDK_HOME=${ANDROID_NDK} export ANDROID_SDK=/opt/android_sdk export ANDROID_HOME=${ANDROID_SDK} cd ~/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack ./scripts/build-android-arm64.sh ``` ``` [1/9] Creating directories for 'googletest' [2/9] Performing download step (download, verify and extract) for 'googletest' FAILED: googletest-prefix/src/googletest-stamp/googletest-download /data/users/salilsdesai/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/build/android/arm64-v8a/deps/googletest-download/googletest-prefix/src/googletest-stamp/googletest-download cd /home/salilsdesai/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/deps && /data/users/salilsdesai/miniconda3/envs/pytorch/bin/cmake -P /data/users/salilsdesai/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/build/android/arm64-v8a/deps/googletest-download/googletest-prefix/src/googletest-stamp/download-googletest.cmake && /data/users/salilsdesai/miniconda3/envs/pytorch/bin/cmake -P /data/users/salilsdesai/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/build/android/arm64-v8a/deps/googletest-download/googletest-prefix/src/googletest-stamp/verify-googletest.cmake && /data/users/salilsdesai/miniconda3/envs/pytorch/bin/cmake -P /data/users/salilsdesai/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/build/android/arm64-v8a/deps/googletest-download/googletest-prefix/src/googletest-stamp/extract-googletest.cmake && /data/users/salilsdesai/miniconda3/envs/pytorch/bin/cmake -E touch /data/users/salilsdesai/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/build/android/arm64-v8a/deps/googletest-download/googletest-prefix/src/googletest-stamp/googletest-download -- Downloading... dst='/data/users/salilsdesai/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/build/android/arm64-v8a/deps/googletest-download/googletest-prefix/src/release-1.10.0.zip' timeout='none' inactivity timeout='none' -- Using src='https://github.com/google/googletest/archive/release-1.10.0.zip' -- [download 1% complete] ... -- [download 100% complete] -- verifying file... file='/data/users/salilsdesai/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/build/android/arm64-v8a/deps/googletest-download/googletest-prefix/src/release-1.10.0.zip' -- SHA256 hash of /data/users/salilsdesai/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/build/android/arm64-v8a/deps/googletest-download/googletest-prefix/src/release-1.10.0.zip does not match expected value expected: 'f3ed3b58511efd272eb074a3a6d6fb79d7c2e6a0e374323d1e6bcbcc1ef141bf' actual: '94c634d499558a76fa649edb13721dce6e98fb1e7018dfaeba3cd7a083945e91' -- Hash mismatch, removing... ```` ``` [1/9] Creating directories for 'googletest' [2/9] Performing download step (download, verify and extract) for 'googletest' -- Downloading... 
dst='/data/users/salilsdesai/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/build/android/arm64-v8a/deps/googletest-download/googletest-prefix/src/release-1.10.0.zip' timeout='none' inactivity timeout='none' -- Using src='https://github.com/google/googletest/archive/release-1.10.0.zip' -- [download 1% complete] ... -- [download 100% complete] -- verifying file... file='/data/users/salilsdesai/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/build/android/arm64-v8a/deps/googletest-download/googletest-prefix/src/release-1.10.0.zip' -- Downloading... done -- extracting... src='/home/salilsdesai/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/build/android/arm64-v8a/deps/googletest-download/googletest-prefix/src/release-1.10.0.zip' dst='/home/salilsdesai/fbsource/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/deps/googletest' -- extracting... [tar xfz] -- extracting... [analysis] -- extracting... [rename] -- extracting... [clean up] -- extracting... done [3/9] No update step for 'googletest' [4/9] No patch step for 'googletest' [5/9] No configure step for 'googletest' [6/9] No build step for 'googletest' [7/9] No install step for 'googletest' [8/9] No test step for 'googletest' [9/9] Completed 'googletest' ``` Reviewed By: digantdesai Differential Revision: D39273970 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84754 Approved by: https://github.com/digantdesai commit daffff9986b2f427496b1b8a782f95f93b70f725 Author: Taylor Robie Date: Thu Sep 8 11:03:22 2022 -0700 [Profiler] Make `RecordQueue` manage the lifetime of `PythonTracer`. (#83964) `PythonTracer` holds a pointer to an owning `RecordQueue`, however that relationship is not enforced and it is possible to dangle that pointer if the ProfilerState owning the `RecordQueue` is destroyed without proper cleanup. We currently use a singleton to enforce the requirement that only one python tracer is active at a time, however a better formulation is to simply enforce that with an atomic bool and manage object lifetime through composition. In this new architecture, `RecordQueue` explicitly holds a unique_ptr to the python tracer instance. That way if `~RecordQueue` is called it will call `~PythonTracer` which can then clean up any state. Overall it is just a simpler ownership model, and less prone to unexpected failures. Differential Revision: [D38955616](https://our.internmc.facebook.com/intern/diff/D38955616/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83964 Approved by: https://github.com/slgong-fb commit 328538700a505e7fee4ba66f4b8de72c2cc44217 Author: Taylor Robie Date: Thu Sep 8 11:03:20 2022 -0700 [Profiler][Trivial] Move `PythonTracerBase` to `torch/csrc/profiler/orchestration` (#83895) The ownership model between `RecordQueue` and `PythonTracer` is brittle; if a profiler is popped without proper shutdown it can dangle a reference in `PythonTracer` which will segfault when dereferenced. The next PR will address this; to start we simply move the code into `torch/csrc/profiler/orchestration` to limit the sloc delta when making actual changes. Differential Revision: [D38933962](https://our.internmc.facebook.com/intern/diff/D38933962/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38933962/)! 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83895 Approved by: https://github.com/slgong-fb commit e8b950186159247639e6645ba50f57f2a00ac6b0 Author: Sean Ross-Ross Date: Fri Sep 9 18:54:47 2022 +0000 test: adding uniform (#84292) Adding OpInfo for uniform Pull Request resolved: https://github.com/pytorch/pytorch/pull/84292 Approved by: https://github.com/amjames, https://github.com/ngimel commit a3855cc611e8be5a76254165a7468503208c7285 Author: Kimish Patel Date: Fri Sep 9 18:54:14 2022 +0000 Make xnnpack based convs thread safe (#84602) Summary: For convolution xnnpack uses indirection buffer. This needs setup if input dimensions change. If we run the same model from multiple threads each supplying different input sized tensor, then there is a race condition where, indirection buffer might be in use by one thread, while being reset by another. This diff adds a lock to each conv object so as to serialize the execution and prevent such race conditions. When uncontended, it should not have perf impact. Test Plan: TestConvolution2dMultiThreaded Differential Revision: D39288298 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84602 Approved by: https://github.com/digantdesai commit 2c4eaddb28a9216d049adf41518199299ffac93e Author: Peter Bell Date: Thu Sep 8 15:18:18 2022 +0100 Use exclude_zero in i0e sample inputs function (#84453) As per the todo, this is now supported by `make_tensor` directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84453 Approved by: https://github.com/mruberry, https://github.com/ngimel commit 93aef3a010b5488c57ffce3e5dc14ea13d0b78d2 Author: Eli Uriegas Date: Fri Sep 9 11:29:07 2022 -0700 Use presence of _symint in kernel name to generate symint sig or not (#84579) Something people found confusing was that whether or not a native:: signature would get SymInt or not in its type was based on the dispatch key. This changes it so that SymInt or not in type is based on whether or not you have _symint in the name of the kernel or not. This means that even when we make operators support SymInt, you no longer have to go and update all the preexisting definitions; instead, you now selectively write _symint to opt individual kernels into SymInt support. I then go and update a bunch of kernels that don't have proper SymInt support to make use of this convention. There is some hacking around for view generation code. I also add support for external backends to specify 'symint' operators, for which we generate SymInt signatures instead of regular signatures. Signed-off-by: Edward Z. Yang Differential Revision: [D39310060](https://our.internmc.facebook.com/intern/diff/D39310060) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84579 Approved by: https://github.com/wconstab commit 18a31cc0448f226f4c2dd9926d24aaef86409f1c Author: Dhruv Matani Date: Fri Sep 9 08:51:12 2022 -0700 [Mobile] Fix The Build For Model Tracer (#84755) Summary: Currently, the model tracer build is broken because of 2 reasons: 1. A few source files are missing, resulting in missing link time symbols 2. The `TRACING_BASED` flag isn't passed correctly from the command line (specified as an evnironment variable) as a CMake flag Both these issues were fixed. Test Plan: Ran this command: `USE_CUDA=0 TRACING_BASED=1 python setup.py develop --cmake` and saw that the tracer binary was built at `build/bin/model_tracer` - also ran it to ensure that it can generate a YAML file. 
Differential Revision: [D39391270](https://our.internmc.facebook.com/intern/diff/D39391270) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84755 Approved by: https://github.com/cccclai commit 52224139b85bed65595e6c57eda8c265bb4e9d84 Author: PyTorch MergeBot Date: Fri Sep 9 18:21:51 2022 +0000 Revert "Convert NoopPyInterpreterVTable into a Meyer singleton (#84656)" This reverts commit 9162bc025256d638369c77c845b8a5ed66eeff5a. Reverted https://github.com/pytorch/pytorch/pull/84656 on behalf of https://github.com/ezyang due to this breaks some build configs commit ac364f8ba1a20484d118362bcc11b5730dec676a Author: Jean Schmidt Date: Fri Sep 9 17:23:28 2022 +0000 Removing .github/scale-config.yml, now this repo is using the config in test-infra (#84753) [As communicated internally](https://fb.workplace.com/groups/pytorch.dev/permalink/1189033455008466/), all repositories now rely on a single scale-config.yml that is on pytorch/test-infra. As such this file is no longer used and to avoid confusion it is better to remove it. Here is a short summary of the announcement: > [As previously announced](https://fb.workplace.com/groups/pytorch.dev/permalink/1173939633184515/), the scale-config.yml file in each repository for the pytorch/ organization is now not being used to control GHA runners. On its place, [the file with same path on test-infra](https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml) repository is controlling and enabling runners. If you feel the need for new runners, or change settings for current ones, feel free to submit a PR with required changes on the former file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84753 Approved by: https://github.com/janeyx99, https://github.com/seemethere commit 28c830ac0725c3689fb7cd9ff293fdf4b0453941 Author: Chien-Chin Huang Date: Thu Sep 8 08:45:23 2022 -0700 [FSDP] Optimizer states may be on CPU, copy them to GPU before gathering (#84708) **Background**: Optimizer states may not always on GPUs. Some examples include, 1.) CPU offload is enable, 2.) after lightning trainer fit() is called. **What Does This PR Do?** If states are not on GPUs, move them to GPUs before gathering the global states. Differential Revision: [D39332300](https://our.internmc.facebook.com/intern/diff/D39332300/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84708 Approved by: https://github.com/awgu commit 0fd8f6b93cb3d1342a10ef71d4b27356f0dfc9b1 Author: Clive Chan Date: Fri Sep 9 16:13:05 2022 +0000 Missed one CHECK_NOTNULL in #82032's find-replace (#84720) Building master fails with the following: ``` pytorch/caffe2/contrib/nccl/cuda_nccl_gpu.cc:180:51: error: 'CHECK_NOTNULL' was not declared in this scope; did you mean 'TORCH_CHECK_NOTNULL'? 180 | CUDA_ENFORCE(cudaStreamWaitEvent(CHECK_NOTNULL(ex.stream), event, 0)); ``` Seems like #82032 just missed one find-replace. cc @wconstab Not sure why this wouldn't have been caught elsewhere. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84720 Approved by: https://github.com/wconstab commit d12f3524b7a3b7267d90ae502208f0aeda881ced Author: Mateusz Sypniewski Date: Fri Sep 9 15:29:34 2022 +0000 Add user facing documentation for CSAN (#84689) This adds a user facing tutorial for the CSAN tool. The documentation preview should be available [here](https://docs-preview.pytorch.org/84689/index.html) once the GitHub job completes on this PR. 
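As a companion to the CSAN tutorial referenced above (#84689), a minimal sketch (assuming a CUDA-capable build) of the stream-ordering idiom whose absence produces the kind of race the tool reports; a variant of this snippet without synchronization appears in the CSAN example later in this log:
```
import torch

# `a` is produced on the default stream.
a = torch.rand(4, 2, device="cuda")

side = torch.cuda.Stream()
# Order the side stream after the default stream before touching `a` on it;
# omitting this wait_stream() call is exactly the data race CSAN flags.
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    torch.mul(a, 5, out=a)

# Symmetrically, wait on the side stream before the default stream reuses `a`.
torch.cuda.current_stream().wait_stream(side)
```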
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84689 Approved by: https://github.com/lw commit a8198a09559d8f94247106499443f95e1a6da06e Author: Michael Andreas Dagitses Date: Thu Sep 8 05:26:40 2022 -0700 remove c10_defs.bzl and embed its logic directly where it is used (#84595) Differential Revision: [D39287079](https://our.internmc.facebook.com/intern/diff/D39287079/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39287079/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/84595 Approved by: https://github.com/DanilBaibak commit 214a6500e3c03ecfadf12e28fcd576efebcd8cfc Author: Jerry Zhang Date: Thu Sep 8 22:20:12 2022 -0700 [quant][docs] Additonal fixes for quantize_fx docs (#84587) Summary: Some more clarifications for the arguments, including linking to object docs (QConfigMapping, BackendConfig) and adding types in the doc Test Plan: ``` cd docs make html ``` and visual inspection for the generated docs Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/84587 Approved by: https://github.com/vkuzo commit 0a89bdf9892b3021aca0bcd3df7388a20e24cfd1 Author: Richard Zou Date: Fri Sep 9 08:00:04 2022 -0700 Set up aten/src/ATen/functorch directory; move some files there (#84648) This PR: - sets up aten/src/ATen/functorch in PyTorch's build system - Moves {BatchedTensorImpl.h, and BatchedTensorImpl.cpp} there as a test. Test Plan: - functorch build and test should pass Differential Revision: [D39315051](https://our.internmc.facebook.com/intern/diff/D39315051) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84648 Approved by: https://github.com/ezyang commit 8e57ce63a11041c27d182b12cb30ac24d5c0cdcd Author: Mateusz Sypniewski Date: Fri Sep 9 04:51:33 2022 -0700 Add CSAN support for CPU synchronizations (#84428) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84428 Approved by: https://github.com/ngimel, https://github.com/lw commit 7702ca49937b34843fe2f79ea7344451a83f1157 Author: Richard Zou Date: Fri Sep 9 08:00:03 2022 -0700 [functorch] Simplify BatchedTensorImpl (#84642) Three changes: - deleted an unused constructor - simplified an implementation from the old days when a BatchedTensorImpl could have multiple bdims - added a comment about getKeysToPropagateToWrapper Differential Revision: [D39315049](https://our.internmc.facebook.com/intern/diff/D39315049) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84642 Approved by: https://github.com/samdow commit 0d46bfac5b236774f50227c99328c6ef44f2fe1a Author: Dhruv Matani Date: Thu Sep 8 13:48:19 2022 -0700 [Mobile] Use -ffunction-sections and -fdata-sections for Mobile builds (#84704) Summary: Set `-ffunction-sections` and `-fdata-sections` so that each method has its own text section. This allows the linker to remove unused section when the flag `-Wl,-gc-sections` is provided at link time. Test Plan: CI and local build using `build_mobile.sh` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84704 Approved by: https://github.com/JacobSzwejbka, https://github.com/cccclai commit 747f27a9adb484113f91560525183d8814eab41d Author: Dhruv Matani Date: Thu Sep 8 13:48:19 2022 -0700 [Mobile] Update build_mobile.sh to allow lite interpreter and tracing based builds (#84647) Summary: Currently, build_mobile.sh doesn't allow lite interpreter builds or tracing based selective builds. 
build_mobile.sh is used for host builds of PyTorch for Mobile deployment. Additionally, certain flags such as `USE_BLAS` were not being respected as they should be. This change addresses that as well. Test Plan: Build using: ``` cat /tmp/selected_ops.yaml - aten::add - aten::sub ``` ``` BUILD_PYTORCH_MOBILE_WITH_HOST_TOOLCHAIN=1 USE_LIGHTWEIGHT_DISPATCH=0 BUILD_LITE_INTERPRETER=1 SELECTED_OP_LIST=/tmp/selected_ops.yaml ./scripts/build_mobile.sh ``` ``` cat /tmp/main.cpp int main() { auto m = torch::jit::_load_for_mobile("/tmp/path_to_model.ptl"); auto res = m.forward({}); return 0; } ``` Test using: ``` g++ /tmp/main.cpp -L build_mobile/lib/ -I build_mobile/install/include/ -lpthread -lc10 -ltorch_cpu -ltorch -lXNNPACK -lpytorch_qnnpack -lcpuinfo -lclog -lpthreadpool -lgloo -lkineto -lfmt -ldl -lc10 ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/84647 Approved by: https://github.com/JacobSzwejbka, https://github.com/cccclai commit 27e5299ee3dfe8a48f835e6a8ce11ae697d01937 Author: Kevin Tse Date: Thu Sep 8 23:51:21 2022 +0000 [DataPipe] Fix mishandling of exception message when error is not iterable (#84676) We sometimes get an exception message like this: ``` This exception is thrown by __iter__ of TarArchiveLoaderIterDataPipe(datapipe=FileOpenerIterDataPipe, length=-1, mode='r:') elif msg not in e.args[0] and single_iterator_msg not in e.args[0]: TypeError: argument of type 'int' is not iterable ``` The `TypeError` raised by the mishandling of the error message obfuscates the true exception, which now will be show as: ``` FileNotFoundError: [Errno 2] No such file or directory: ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84676 Approved by: https://github.com/ejguan commit df3377fb6485f7fe4cb4a9bc928ee8eb9a0d5b10 Author: Richard Zou Date: Fri Sep 9 06:58:37 2022 -0700 [functorch] delete functorch/csrc/Constants.h (#84639) This file aliased dispatch keys. The original purpose was so that we could change the dispatch keys in pytorch core without changing functorch too much, but there's no need for the layer of indirection anymore. Also moved SINGLE_ARG to functorch/csrc/Macros.h, but that might need a new home later. Differential Revision: [D39315052](https://our.internmc.facebook.com/intern/diff/D39315052) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84639 Approved by: https://github.com/samdow commit 09bcc006e9d10ef62dcf38dd6547a2154c3b1b57 Author: jataylo Date: Fri Sep 9 14:14:59 2022 +0000 ROCm support for test_lazy_init (#84333) Added ROCm support for the test_lazy_init unit test by including a condition on TEST_WITH_ROCM to switch CUDA_VISIBLE_DEVICES with HIP_VISIBLE_DEVICES. This is needed because HIP_VISIBLE_DEVICES is set when running the single-GPU tests in CI: https://github.com/pytorch/pytorch/blob/a47bc96fb7176d43752d3e376697971d4ba47317/.jenkins/pytorch/test.sh#L38, but this test sets CUDA_VISIBLE_DEVICES, which takes lower precedence than HIP_VISIBLE_DEVICES on ROCm. **Testing Logs (to show behavior difference)** 12:40:41 Aug 30 11:40:41 CUDA_VISIBLE_DEVICES='0': 0 12:40:41 Aug 30 11:40:41 1 12:40:41 Aug 30 11:40:41 CUDA_VISIBLE_DEVICES='32': 32 12:40:41 Aug 30 11:40:41 1 12:40:41 Aug 30 11:40:41 HIP_VISIBLE_DEVICES='0': 0 12:40:41 Aug 30 11:40:41 1 12:40:41 Aug 30 11:40:41 HIP_VISIBLE_DEVICES='32': 32 12:40:41 Aug 30 11:40:41 0 **Passing UT** Aug 30 17:03:15 test_lazy_init (main.TestCuda) Aug 30 17:03:17 Validate that no CUDA calls are made during import torch call ... 
ok (2.471s) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84333 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet commit 67d6f7160ce82be80eb22e40c4ae16084cfd4ee0 Author: Mateusz Sypniewski Date: Fri Sep 9 03:20:05 2022 -0700 Add synchronize hooks (#84427) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84427 Approved by: https://github.com/ngimel, https://github.com/lw commit 591b75bf98b92acd4f3d0a1dc934198afeaa6fc1 Author: Edward Z. Yang Date: Thu Sep 8 09:26:09 2022 -0700 Redo how custom/python_custom methods on TensorImpl work (#84641) A longstanding confusion in the implementation of fake tensor and proxy tensor is what to do about torch.ops.aten.sym_sizes and related calls. In particular, when you have a tensor that (1) has symbolic shapes and (2) has a `__torch_dispatch__` call, previously, you would always get `__torch_dispatch__` calls for sizes/strides query, *even if you didn't request it* via the dispatch kwargs in `make_wrapper_subclass`. The reason for this is because we were previously mixing several concepts: "I want to dispatch to Python", "I want to call a virtual method" and "I have dynamic shapes". A single boolean variable controlled all of these things, and so it was not possible to understand inside TensorImpl what the user had actually originally requested. In this PR, we track each of these concepts individually so that we can preserve user intent. Then, we combine these into a single "policy" variable that controls whether or not we can use the fastpath or not. For the policy to trigger, we only need one of the exceptional cases to be true. Billing of changes: * Rename `set_sizes_strides_policy` to `set_custom_sizes_strides`; in general, you cannot DIRECTLY set policy; you have to indirectly set it by the public functions. * Some helpers for sizes and strides, since it's more complicated (as it is an enum, rather than just bools as is the case for device and layout). `matches_python_custom` is used to test the Python dispatch user ask. `matches_policy` does the policy test (only used in the user facing functions.) * I reorged the accessor methods so that they are more logical. This makes the diff bad, so I recommend reading the final code directly. * The default custom implementations now more reliably call their default() implementations * As bonus refactor, I devirtualized some functions that don't need to be virtual * `set_sym_sizes_and_strides` is renamed to `set_sizes_and_strides` to make it easier to use in template contexts; it optionally takes a storage offset now so you can set all three values at the same time. If you use the SymInt overload but there are no symbolic integers, we give you a normal resize. * This adds `sym_storage_offset` since we had that in the symbolic shapes branch and there's no reason not to put it in (and it reduces merge conflicts) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84641 Approved by: https://github.com/wconstab commit d802fcfcd81f86333e25135a20802aa39b275da1 Author: Ivan Yashchuk Date: Fri Sep 9 07:58:21 2022 +0000 Add config to PrimTorch's nvFuser executor (#84482) This PR adds `executor_parameters` keyword argument to `torch._prims.executor.execute`. For now there are two knobs: * `use_python_fusion_cache: bool = True` whether to use lru_cache when constructing fusion object or not. 
* `allow_single_op_fusion: bool = True` whether to allow fusions with single callable Behavior can be controlled by passing dict with custom specified values as `executor_parameters` argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84482 Approved by: https://github.com/jjsjann123, https://github.com/ngimel commit 6f72c13f9b749b28b91732a56e997b25aa692a8d Author: mingfeima Date: Thu Sep 8 14:09:57 2022 +0800 test mkldnn conv2d channels last when weight is nchw format (#77348) Pull Request resolved: https://github.com/pytorch/pytorch/pull/77348 Approved by: https://github.com/malfet commit 1840f24df737046d085691d230ca2fe86cccb0d2 Author: Chien-Chin Huang Date: Wed Sep 7 17:23:09 2022 -0700 [FSDP] Ensure that all ranks use the same order to iterate through optimizer states (#84654) **Background:** Optimizer states are of the type `Dict[int, Dict[str, torch.Tensor]]` and the order of `dict.items()` is the creation order of keys. Without checkpoint (state_dict/load_state_dict), the creation order of keys depends on the implementation of the optimizer (e.g., Adam seems to creates `exp_avg` then `exp_avg_sq`). However, when loading states from a checkpoint, since the optimizer state are lazily initialized, the order depends on the user code (reading state_dict from IO). See the following example: ``` optimizer_state_dict = USER_CODE_TO_READ_STATE_FROM_IO() optimizer.load_state_dict(optimizer_state_dict) ``` The key order of `optimizer_state_dict` depends on `USER_CODE_TO_READ_STATE_FROM_IO` and there is no guarantee that the order is the same across ranks. **What Can Go Wrong?** After the first checkpoint load, the key order of optimizer may not be the same on different ranks. When users try to save another checkpoint, user will call `_unflatten_optim_state()` to save the optimizer states. Inside `_unflatten_optim_state()`, `dict.itmes()` will be called to iterate all the local optimizer state and `all_gather()` will be used to gather the local states. Since the order may be different across ranks, the gathered states are not correct. We have seen some models get NaN loss after the second checkpoint load because of this issue. **What This PR Does?** This PR implements a `sorted_items()` to return sorted `(key, value)` pairs. We can do this because the key is either an integer or a string. Differential Revision: [D39315184](https://our.internmc.facebook.com/intern/diff/D39315184/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84654 Approved by: https://github.com/awgu commit 2211949513d6ff126f1caae4d817055ec9d3fab1 Author: Shen Li Date: Fri Sep 9 02:30:55 2022 +0000 Moving CommTensor from tests to private _spmd folder (#84719) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84719 Approved by: https://github.com/wanchaol commit b00a4b7cf1655cf9a6cf78ef9536b63898cbfb08 Author: Kunal Bhalla Date: Fri Sep 9 05:44:29 2022 +0000 [torch.fx.wrap] Use callable / function.__name__ instead of function.__code__.co_name (#84373) Ran across this issue while using torch.fx.wrap on a decorated function: it triggered a KeyError: 'wrapper_inside_decorator'. torch.fx.wrap stores function.__code__.co_name, but that isn't set correctly (and doesn't match it's name in the global namespace) for decorators; function.__name__ is set correctly. Also adjusted to checking for callable instead of checking for the existing of __code__ to allow for a broader variety of functions that can be passed in. Eg. 
using functools.cache returns a callable that won't have a __code__ attribute. I added a unit test (that incidentally fails every test in the suite before the fix commit -- because it affects the global state), and then a fix that addresses it. ``` In [1]: import functools In [2]: def decorator(f): ...: @functools.wraps(f) ...: def wrapper(*args, **kwargs): ...: return f(*args, **kwargs) ...: return wrapper ...: In [3]: @decorator ...: def some_function(x): ...: return x ...: In [4]: some_function.__name__ Out[4]: 'some_function' In [5]: some_function.__code__.co_name Out[5]: 'wrapper' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84373 Approved by: https://github.com/jamesr66a, https://github.com/SherlockNoMad commit 1c8cb0877095f64e1244c42de851627fc8c3116f Author: PyTorch MergeBot Date: Fri Sep 9 03:12:08 2022 +0000 [vision hash update] update the pinned vision hash (#84731) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84731 Approved by: https://github.com/pytorchbot commit 1c8f02d406a059e4ea1d29c7db3ac2386765e2c3 Author: PyTorch MergeBot Date: Fri Sep 9 03:09:41 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#84730) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84730 Approved by: https://github.com/pytorchbot commit e6ee8e613dfe764f61b2b29ad0a4a2c36f143eaa Author: Sherlock Huang Date: Thu Sep 8 23:25:47 2022 +0000 Return x.alias() when transpose is an nop (#84674) To fix bug in https://gist.github.com/SherlockNoMad/b8dfbc614d3e65707d1bc379a098196d ``` def f(x): return x.t() x = torch.randn(2, requires_grad=True) y = f(x) compiled_f = make_fx(f)(x) y_compiled = compiled_f(x) print(compiled_f) print("y.requires_grad", y.requires_grad) print("y_compiled.requires_grad", y_compiled.requires_grad) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84674 Approved by: https://github.com/cpuhrsch commit dbdc1cd590169576cfb78008f33b7cc795150729 Author: Justin Chu Date: Fri Sep 9 01:52:38 2022 +0000 [ONNX] Fix node attributes when namespace is aten (#84211) When `g.at` is used, the previous clean up in #83136 mistakenly removed the behavior that sets `aten=True` in `_add_attribute`. This PR brings the behavior back. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84211 Approved by: https://github.com/thiagocrepaldi, https://github.com/BowenBao commit 2fa8142cf9469fa6570507ec097d4dd5e0b13c92 Author: Justin Chu Date: Fri Sep 9 01:22:12 2022 +0000 [ONNX] Rename constants for clarity (#84645) Rename constants to make them more clear. Fix styles to upper case. Removed `onnx_stable_opsets` because it can be computed from `ONNX_MIN_OPSET` and `ONNX_MAX_OPSET`. Fixes #84643 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84645 Approved by: https://github.com/BowenBao commit bc3683de83adc9b9213cc926c677b3f8bb309722 Author: xndcn Date: Fri Sep 9 01:22:09 2022 +0000 [quant] remove mkldnn headers in OnednnUtils.h (#84195) mkldnn headers are not installed in include directory, also we don't have to include mkldnn here. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84195 Approved by: https://github.com/jerryzh168 commit 7a5d5a00207e91d5a7d1820781109a989aadc86c Author: Eric Han Date: Fri Sep 9 01:15:22 2022 +0000 Disable Transformer/MHA fast path when autocast is enabled (#84722) Differential Revision: D39362298 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84722 Approved by: https://github.com/cpuhrsch commit f23a1cf805356f0519d3dfc276958c4c518e6db5 Author: Huy Do Date: Fri Sep 9 00:15:46 2022 +0000 Fix conda cmake setup for macos x86-64 (#84682) The latest conda setup with cache causes package conflicts for macos x86-64 with cmake-3.19, for example https://github.com/pytorch/pytorch/runs/8237917073. It's near impossible to understand the cryptic package conflicts errors from conda. `cmake=3.22.1` is the same cmake version used in macos arm64, which doesn't have the issue. At the moment, the mac x86-64 build and test jobs success because they are reinstalled with stock conda from miniconda installation script every time and don't use any cache. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84682 Approved by: https://github.com/ZainRizvi commit 6f216805634e5859b76253432542a1c4c60ee573 Author: Rodrigo Kumpera Date: Thu Sep 8 23:28:28 2022 +0000 Add __all__ for a few distributed modules plus a little typing (#84119) This handles distributed_c10d, which is massive and ddp_comm_hooks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84119 Approved by: https://github.com/rohan-varma commit 9b8e0a38a6674c78a8f2729ce15393859aa2bd3d Author: rzou Date: Thu Sep 8 08:54:32 2022 -0700 [functorch] excise older custom_vjp prototype (#84638) It was based off of the Python op registration API that has been implemented in PyTorch already, so we can always bring it back, but we're probably taking a different approach here. Test Plan: - tests Differential Revision: [D39315050](https://our.internmc.facebook.com/intern/diff/D39315050) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84638 Approved by: https://github.com/samdow commit 8c91dd2677cca5983d410592684827784a6cfc07 Author: rzou Date: Thu Sep 8 08:54:30 2022 -0700 [functorch] Add some C++ documentation (#84637) Clarify the purpose of many files. Differential Revision: [D39315053](https://our.internmc.facebook.com/intern/diff/D39315053) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84637 Approved by: https://github.com/samdow commit 05b778f958a77adc33cc51db5717d6b9ab2e8b35 Author: PyTorch MergeBot Date: Thu Sep 8 20:30:23 2022 +0000 Revert "Add mkl implementation for exponential on CPU (#69967)" This reverts commit 189768ed64561e61ff05c9e42adfa40139388204. 
Reverted https://github.com/pytorch/pytorch/pull/69967 on behalf of https://github.com/izaitsevfb due to This PR caused internal breakage (internal revert D39348330; relevant task T131416326) commit 6bedb7a75e2c6712ef3a8de3283fe44adab4a659 Author: Paul Saab Date: Thu Sep 8 19:42:20 2022 +0000 [aarch64] Fix aarch64 build so that quantize_val_arm is defined (#84564) Summary: quantize_val_arm is used in the kernels when building under aarch64 Test Plan: CI Differential Revision: D39272746 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84564 Approved by: https://github.com/kimishpatel commit a6e6276c8b1715fd3685c565065b35184d103a48 Author: PyTorch MergeBot Date: Thu Sep 8 19:28:38 2022 +0000 Revert "Moving CommTensor from tests to private _spmd folder (#84655)" This reverts commit 07dad15583a1a6bb6a65594883fa094a3b109baf. Reverted https://github.com/pytorch/pytorch/pull/84655 on behalf of https://github.com/kit1980 due to Several test failures on trunk https://hud.pytorch.org/pytorch/pytorch/commit/07dad15583a1a6bb6a65594883fa094a3b109baf, PR also had failures commit c5cf8f6b28a3c5abf443d60fd3811b8a4b7fcc16 Author: xiaxiaohua1 Date: Thu Sep 8 18:22:48 2022 +0000 fix [rpc] Wrong usage of RRefContext::handleException #71458 (#83166) Fixes #71458 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83166 Approved by: https://github.com/kumpera commit 1459a909b4034fc330b3ee55e164d77ffde1bdd8 Author: Horace He Date: Thu Sep 8 00:15:52 2022 +0000 Added mv, mm, and binary_cross_entropy_with_logits decomps (#84451) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84451 Approved by: https://github.com/ngimel commit fe353e1413e2262993fb71dba7317f21bc6fc3bc Author: Huy Do Date: Thu Sep 8 17:51:15 2022 +0000 Enable manual test config label selection on Windows (#84669) After https://github.com/pytorch/pytorch/pull/83690, functorch team has started using the new label in some of their [PRs](https://github.com/pytorch/pytorch/labels/test-config%2Ffunctorch). So this enabled the same feature on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84669 Approved by: https://github.com/ZainRizvi commit 07dad15583a1a6bb6a65594883fa094a3b109baf Author: Shen Li Date: Thu Sep 8 14:31:26 2022 +0000 Moving CommTensor from tests to private _spmd folder (#84655) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84655 Approved by: https://github.com/wanchaol commit 9beb0c0b877b85880e51366270a5916f70559e42 Author: Taylor Robie Date: Wed Sep 7 15:01:58 2022 -0700 Reland: [Profiler] Unify global and thread local profiler lookup. (#83894) (#84668) This PR renames `ProfilerThreadLocalStateBase` to simply `ProfilerStateBase`, and adds `push`, `pop`, and `get` methods. `global` can be specified, or can be omitted for priority selection. In order to support this unification it was necessary to make a (mostly) non-throwing version of pop. The asserts around observer removal are intended to act as guard rails against multiple profilers trampling over each other. However on-demand wants to do exactly that because it wants to be able to preempt. A hack would be to get the current observer and then only pop if an observer is found, but that would be prone to race conditions. By removing the asserts, we can preserve the old behavior by adding `ASSERT(pop())` on the caller side while allowing more complex handling for the kineto client interface. (Later PR.) 
Differential Revision: [D39326253](https://our.internmc.facebook.com/intern/diff/D39326253/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84668 Approved by: https://github.com/slgong-fb commit bea01840335f0990c8d481de70c86f276d7c1654 Author: Taylor Robie Date: Wed Sep 7 14:58:28 2022 -0700 Reland: [Profiler][Trivial] Create orchestration folder and move observer management there. (#83893)" (#84667) Reland of https://github.com/pytorch/pytorch/pull/83893 Differential Revision: [D39282536](https://our.internmc.facebook.com/intern/diff/D39282536/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39282536/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/84667 Approved by: https://github.com/slgong-fb commit b9793a66b56ec01e4ec85dce879552dfa650d0c8 Author: Peter Bell Date: Thu Sep 8 15:18:18 2022 +0100 Fix linalg.norm sample inputs function and related failures (#84452) Due to an indentation error, the return statement happens after just 1 loop of `for test_size in test_sizes`, so only one shape was ever tested. This also revealed several cases where the provided shapes don't work, so I've disabled the generation of those sample inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84452 Approved by: https://github.com/Lezcano, https://github.com/zou3519 commit 335033f7182bf421d203d5eeaad598fa1102933f Author: Junteng Jia Date: Thu Sep 8 17:00:45 2022 +0000 asyncio increase throughput (pytorch change) (#84301) Summary: This diff adds a check in the fetcher: if the dataset to be fetched has a function "getitems", then use it to fetch a batch of elements at once, as opposed to one by one. This is beneficial for I/O-bound usage (see the sketch below). Differential Revision: D39145980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84301 Approved by: https://github.com/VitalyFedyunin commit 368742d0596f8c17bbcf118f232cc63c4c86b5b7 Author: cchheennhhaaoo Date: Thu Sep 8 13:48:16 2022 +0000 Dispatch for xpu in adaptive_avg_pooling (#84541) Motivation: - See native_functions.yaml: operators adaptive_avg_pool2d/adaptive_avg_pool3d are not recommended to be registered for backends. - When adaptive_avg_pool2d/adaptive_avg_pool3d receive an xpu tensor as input, they cannot dispatch into the xpu implementation. Solution: - Dispatch to _adaptive_avg_pool2d/_adaptive_avg_pool3d for the xpu backend in the adaptive_avg_pool2d/adaptive_avg_pool3d implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84541 Approved by: https://github.com/ezyang commit eddc2370ec33938adbd9a3136852c3ab19e51a78 Author: kshitij12345 Date: Thu Sep 8 13:35:19 2022 +0000 [functorch] vmapvjpvjp (re-enable test with skips and xfails) (#83999) Enable `vmapvjpvjp` test and add relevant skips and xfails. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83999 Approved by: https://github.com/zou3519 commit 76fc690522024d978176b74a73e0222ac4d062de Author: PyTorch MergeBot Date: Thu Sep 8 10:44:37 2022 +0000 Revert "[functorch] vmapvjpvjp (re-enable test with skips and xfails) (#83999)" This reverts commit 9addeccb6b1ec9fed0246ba18fdb70550c813a90.
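To make the fetcher change in #84301 above concrete, a hypothetical sketch of the dispatch it describes: prefer a single batched lookup when the dataset exposes one, otherwise fall back to per-index access. The class name and the "getitems" hook follow the commit text and are illustrative, not the exact identifiers in torch.utils.data:
```
class BatchFetcher:
    """Fetch a whole batch in one call when the dataset supports it."""

    def __init__(self, dataset, collate_fn):
        self.dataset = dataset
        self.collate_fn = collate_fn

    def fetch(self, batched_index):
        if hasattr(self.dataset, "getitems"):
            # One round trip for the entire batch; helps I/O-bound datasets.
            data = self.dataset.getitems(batched_index)
        else:
            data = [self.dataset[idx] for idx in batched_index]
        return self.collate_fn(data)
```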
Reverted https://github.com/pytorch/pytorch/pull/83999 on behalf of https://github.com/kshitij12345 due to Broke trunk commit 9addeccb6b1ec9fed0246ba18fdb70550c813a90 Author: kshitij12345 Date: Thu Sep 8 06:23:12 2022 +0000 [functorch] vmapvjpvjp (re-enable test with skips and xfails) (#83999) Enable `vmapvjpvjp` test and add relevant skips and xfails. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83999 Approved by: https://github.com/zou3519 commit 8bd9fe3f493073bf8f4a2e428c3048096fb36052 Author: Elias Ellison Date: Wed Sep 7 23:39:28 2022 +0000 Changes to prepare for fake tensors on in functorch by default (#84432) Fixes some errors you run into in dynamo when turning on fake tensors. I'm waiting on flipping the switch because I need to also get some fixes into dynamo + do benchmarking. I could manually turn off fake tensors in functorch in dynamo, and then turn it on here if requested, although the changes here are pretty minimal. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84432 Approved by: https://github.com/Chillee commit b288cfd328be3908ffc42b948bf0137940b01e85 Author: Aaron Enye Shi Date: Thu Sep 8 03:37:39 2022 +0000 [Profiler] Add quoted metadata API to remove empty trace cpu_op metadata (#84128) Summary: The profiler utility function, stacksToStr, is quoting all metadata values, and therefore even empty metadata fields are being pushed into the trace files. Remove this and add an argument to use quoted metadata api provided by libkineto::GenericTraceActivity. Test Plan: Before, a trace file will dump extra empty fields for Module Hierarchy and Call Stack: ``` { "ph": "X", "cat": "cpu_op", "name": "autograd::engine::evaluate_function: AddBackward0", "pid": 798015, "tid": 798264, "ts": 1661451887593736, "dur": 21, "args": { "Trace name": "PyTorch Profiler", "Trace iteration": 0, "External id": 513, "Profiler Event Index": 0, "Module Hierarchy": "", "Call stack": "", "Fwd thread id": 3, "Sequence number": 1, "ID": 139880536829952, "Parent ID": null } } ``` After, these fields will not be in the trace file anymore: ``` { "ph": "X", "cat": "cpu_op", "name": "autograd::engine::evaluate_function: AddBackward0", "pid": 1482813, "tid": 1483069, "ts": 1661468912444365, "dur": 43, "args": { "Trace name": "PyTorch Profiler", "Trace iteration": 0, "External id": 513, "Profiler Event Index": 0, "Fwd thread id": 3, "Sequence number": 1, "ID": 139852271321088, "Parent ID": null } } ``` Also, with input tracking on, it looks correct compared to previous kineto observer: ``` { "ph": "X", "cat": "cpu_op", "name": "aten::add_", "pid": 1572428, "tid": 1572776, "ts": 1661469920242309, "dur": 19, "args": { "Trace name": "PyTorch Profiler", "Trace iteration": 0, "External id": 531, "Profiler Event Index": 18, "Input Dims": [[256, 256], [256, 256], []], "Input type": ["float", "float", "Scalar"], "ID": 140023871647232, "Parent ID": 140023871646720 } } ``` Differential Revision: D39041244 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/84128 Approved by: https://github.com/robieta commit 942c0f31dfffbc5eb180cadd0fd1302d5e907f64 Author: titaiwang Date: Thu Sep 8 00:58:09 2022 +0000 [ONNX] Align Optional Type in block (#83599) Why: Previously, we use `replaceAlluseswith` after adding Optional on the node which is right before output. However, this may break the graph by also changing the nodes that needs the node (original) as input. We only need the node to be optional in output. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83599 Approved by: https://github.com/justinchuby, https://github.com/BowenBao, https://github.com/malfet commit 49ec8d32c706e3df1f777b2361b2ee673269f8b8 Author: Sergii Dymchenko Date: Thu Sep 8 03:12:50 2022 +0000 Suggest draft PRs in contribution_guide.rst (#84658) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84658 Approved by: https://github.com/huydhn commit 0945074a8e4e7d0d07b7a929873d1f0dbdca7173 Author: Sherlock Huang Date: Thu Sep 8 00:31:58 2022 +0000 Preserve stacktrace over functionalization (#84662) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84662 Approved by: https://github.com/Chillee commit cb6ba27db3e1e55e9a429fb4a576a9e8389c2b93 Author: PyTorch MergeBot Date: Thu Sep 8 02:34:33 2022 +0000 [vision hash update] update the pinned vision hash (#84679) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84679 Approved by: https://github.com/pytorchbot commit 889540d091086bb31367a602295730f64e2ff690 Author: PyTorch MergeBot Date: Thu Sep 8 02:34:16 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#84678) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84678 Approved by: https://github.com/pytorchbot commit e0229d6517385a98afeadbc6391d3592d5027c63 Author: John Detloff Date: Thu Sep 8 01:49:55 2022 +0000 Remove caffe2 mobile (#84338) We're no longer building Caffe2 mobile as part of our CI, and it adds a lot of clutter to our makefiles. Any lingering internal dependencies will use the buck build and so won't be affected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84338 Approved by: https://github.com/dreiss commit 9669e3c6ec6b6f232bed3b29bcd593434992f57d Author: Edward Z. Yang Date: Wed Sep 7 17:25:49 2022 -0400 Ignore UB on multiply (#84665) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84665 Approved by: https://github.com/Chillee commit 1a1bcc736197f1f7943d568512c3c1e44ba05fbc Author: Nikita Shulga Date: Thu Sep 8 01:09:10 2022 +0000 Actually chown artifacts (#84672) Roll back part of https://github.com/pytorch/pytorch/commit/045ebc771d5070696f839e586285ace9c06f1339 to actually chown the artifacts folder rather than the workspace. Fixes https://github.com/pytorch/pytorch/issues/84644 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84672 Approved by: https://github.com/kit1980, https://github.com/huydhn commit 93359bf9b3503135332d40cb297515efe5290ec6 Author: Edward Z. Yang Date: Wed Sep 7 17:19:08 2022 -0400 Convert ConcretePyInterpreterVTable into Meyer singleton (#84657) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84657 Approved by: https://github.com/wconstab commit 9162bc025256d638369c77c845b8a5ed66eeff5a Author: Edward Z. Yang Date: Wed Sep 7 15:43:58 2022 -0400 Convert NoopPyInterpreterVTable into a Meyer singleton (#84656) Signed-off-by: Edward Z.
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84656 Approved by: https://github.com/wconstab commit 29672b2136fc80537edf4632b2cf40f48efe0ab8 Author: samdow Date: Wed Sep 7 20:46:24 2022 +0000 [functorch] add pinv batch rule (#83761) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83761 Approved by: https://github.com/zou3519 commit 586832ce65607c6a1d1d8245b55d4ec24ddfc0e4 Author: Howard Huang Date: Wed Sep 7 08:26:16 2022 -0700 Add underlying_store property for PrefixStore (#84640) Add a property to `PrefixStore` to retrieve the underlying store it is wrapping around (see the usage sketch below). Open for suggestions on the property name. This change is based on discussion in [D39225101](https://www.internalfb.com/diff/D39225101) where we need to read properties of the store that PrefixStore is wrapping around. Differential Revision: [D39311151](https://our.internmc.facebook.com/intern/diff/D39311151) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84640 Approved by: https://github.com/xush6528 commit e68df8e4a14ce1fbedf6b20e132b11ec7b151f8a Author: Elias Ellison Date: Wed Sep 7 07:55:51 2022 -0700 Turn on functionalization by default in functorch (#84435) I talked to @SherlockNoMad about this PR and we agreed that, prior to Brian coming back, it was worth disabling this test in order to get functionalization on (and that is already the state of torchdynamo) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84435 Approved by: https://github.com/Chillee commit 6b2111619e801064065c0eaba7ca03f00feef59b Author: Catherine Lee Date: Wed Sep 7 21:44:39 2022 +0000 check rate limits of other tokens too (#83632) We keep running into API rate limit issues, but apparently they're connected to pytorchbot, so check the rate limits of our other tokens too. According to https://docs.github.com/en/rest/rate-limit this doesn't count against the rate limit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83632 Approved by: https://github.com/huydhn commit 9532c7e267b3ccf2ca500fdae1ed5298c1f0f146 Author: samdow Date: Wed Sep 7 17:50:54 2022 +0000 [functorch] add matrix_rank rule (#83760) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83760 Approved by: https://github.com/zou3519 commit e14f46f9ddf143dbe894ee40e3a698fb401523ae Author: Howard Huang Date: Wed Sep 7 07:39:21 2022 -0700 Add host and port to TCPStore pyi definition (#84636) `host` and `port` are already exposed in the `TCPStore` pybind definition; this is a small change adding them to the pyi stub. Differential Revision: [D39311153](https://our.internmc.facebook.com/intern/diff/D39311153) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84636 Approved by: https://github.com/wz337 commit c7f6deb6678f4df578584439e4ab26d185da5ef3 Author: Scott Wolchok Date: Tue Sep 6 14:42:09 2022 -0700 [PyTorch] Guard against self assignment in SymInt (#84375) Self assignment was broken; now it's not.
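A short usage sketch for the PrefixStore change (#84640) above, assuming the property kept the `underlying_store` name proposed in the PR; the host and port are placeholders:
```
from datetime import timedelta
import torch.distributed as dist

# A TCPStore acting as the base key-value store (placeholder host/port).
base = dist.TCPStore("127.0.0.1", 29500, 1, True, timedelta(seconds=30))
prefixed = dist.PrefixStore("trainer0", base)
prefixed.set("step", "1")  # key is stored under the "trainer0" prefix

# New with #84640: recover the wrapped store instead of threading it around separately.
inner = prefixed.underlying_store
inner.set("global_flag", "1")  # writes to the base store directly, no prefix applied
```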
Differential Revision: [D39189342](https://our.internmc.facebook.com/intern/diff/D39189342/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84375 Approved by: https://github.com/suo commit fc4acd4425ca0896ca1c4f0a8bd7e22a51e94731 Author: WEN Hao Date: Wed Sep 7 19:12:33 2022 +0000 Fix error in the index range math expression in the docstring of MultiMarginLoss (#84513) Fixes #84512 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84513 Approved by: https://github.com/Lezcano, https://github.com/cpuhrsch commit d892d5d6829c315ba9b5038b8796e1c96a54f9b5 Author: Eddie Yan Date: Wed Sep 7 18:30:23 2022 +0000 [CUBLAS][TF32][CUDNN] Update numerical_accuracy.rst (#79537) CC @mruberry @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/79537 Approved by: https://github.com/ngimel, https://github.com/mruberry commit acb4a09628284201281e262aaee58e3dc6be9c2b Author: PyTorch MergeBot Date: Wed Sep 7 18:02:27 2022 +0000 Revert "Call jit decomposition in VariableType to increase forward AD coverage (#84151)" This reverts commit 42d99e6f196233627a28b8e9efb26a0a166fa370. Reverted https://github.com/pytorch/pytorch/pull/84151 on behalf of https://github.com/malfet due to Regressed test_jvpvjp_nn_functional_layer_norm_cuda_float32, see https://hud.pytorch.org/pytorch/pytorch/commit/42d99e6f196233627a28b8e9efb26a0a166fa370 commit 31ef8ddb8c4467f5b8698ef1eb9bb8bab7056855 Author: Wei Wei Date: Wed Sep 7 17:21:27 2022 +0000 add option to remove passes (#84425) Summary: Add a remove_pass method in pass_manager to provide user option to remove any pass. Reviewed By: wushirong Differential Revision: D39080077 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84425 Approved by: https://github.com/yinghai commit 2feb31cb269bd640ff2858ebe8adb3fb0aec8dc0 Author: Peter Bell Date: Wed Sep 7 15:00:54 2022 +0100 Improve torch::jit::as_{module,object} performance (#84399) This caches the import of `torch.jit.ScriptModule`, `torch.ScriptObject` and `torch.jit.RecursiveScriptClass`. I measure a ~0.8 us performance uplift locally when calling a `torch.ops` function with a `ScriptObject` argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84399 Approved by: https://github.com/ezyang commit 2b2e0fddf8001c0c662bd582e1d958a74bc84ac4 Author: Mateusz Sypniewski Date: Wed Sep 7 07:23:03 2022 -0700 Add CUDA Sanitizer (#83984) Example of a simple synchronization error: ``` a = torch.rand(4, 2, device="cuda") with torch.cuda.stream(second_stream): torch.mul(a, 5, out=a) ``` Output produced by CSAN: ``` ============================ CSAN detected a possible data race on tensor with data pointer 139719969079296 Access by stream 94646435460352 during kernel: aten::mul.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) writing to argument: self, out, output With stack trace: File "/private/home/sypniewski/pytorch/torch/cuda/_sanitizer.py", line 364, in _handle_kernel_launch stack_trace = traceback.StackSummary.extract( File "/private/home/sypniewski/pytorch/torch/cuda/_sanitizer.py", line 544, in __torch_dispatch__ errors = self.event_handler._handle_kernel_launch( File "/private/home/sypniewski/pytorch/torch/utils/_python_dispatch.py", line 76, in wrapped return f(self, *args, **kwargs) File "/private/home/sypniewski/pytorch/tester.py", line 9, in torch.mul(a, 5, out=a) Previous access by stream 0 during kernel: aten::rand(int[] size, *, int? dtype=None, int? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor writing to argument: output With stack trace: File "/private/home/sypniewski/pytorch/torch/cuda/_sanitizer.py", line 364, in _handle_kernel_launch stack_trace = traceback.StackSummary.extract( File "/private/home/sypniewski/pytorch/torch/cuda/_sanitizer.py", line 544, in __torch_dispatch__ errors = self.event_handler._handle_kernel_launch( File "/private/home/sypniewski/pytorch/torch/utils/_python_dispatch.py", line 76, in wrapped return f(self, *args, **kwargs) File "/private/home/sypniewski/pytorch/tester.py", line 6, in a = torch.rand(10000, device="cuda") Tensor was allocated with stack trace: File "/private/home/sypniewski/pytorch/torch/cuda/_sanitizer.py", line 420, in _handle_memory_allocation traceback.StackSummary.extract( File "/private/home/sypniewski/pytorch/torch/utils/_cuda_trace.py", line 23, in fire_callbacks cb(*args, **kwargs) File "/private/home/sypniewski/pytorch/torch/_ops.py", line 60, in __call__ return self._op(*args, **kwargs or {}) File "/private/home/sypniewski/pytorch/torch/cuda/_sanitizer.py", line 541, in __torch_dispatch__ outputs = func(*args, **kwargs) File "/private/home/sypniewski/pytorch/torch/utils/_python_dispatch.py", line 76, in wrapped return f(self, *args, **kwargs) File "/private/home/sypniewski/pytorch/tester.py", line 6, in a = torch.rand(10000, device="cuda") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83984 Approved by: https://github.com/ezyang commit 19e27b15562b261e87e3e629cb32cb6876b9caca Author: Edward Z. Yang Date: Wed Sep 7 05:58:32 2022 -0700 Make dispatcher registrations of SymInt functions backwards compatible (#84557) Previously, when we SymInt-ify a schema, this is a BC-breaking change for all people who registered functions for that function; they must accept c10::SymInt where they previously accepted int64_t. This is not great. With this change, I accept old type registrations transparently. The idea is in several parts: - At the registration site, at compile time I have no idea whether or not if the function being registered has a SymInt schema or not. So I must defer the exact compatibility check. What I do instead is check if the function pointer registered to me has SymInt in the argument or not. If it does, I assume it is new-style and ensure it is also registered to a special sym_ slot on KernelFunction. If not, it only goes in the conventional slot. - At the dispatcher site, I know at compile time whether or not this is a SymInt function. If it is, I check for a sym_ slot on the KernelFunction, and preferentially use that. If no such slot exists, I then fall back to the regular slot... but I convert all SymInt arguments to int64_t arguments (doing assertions that no true symbolic integer was passed.) I can skip this test entirely if the function doesn't have any SymInts in it; in that case I know that only the original slot could have been registered. Fortunately, both branches of the short circuit typecheck, so I didn't have to use SFINAE or if-constexpr to make it work; just a plain if statement that I expect the compiler to optimize away. - Schema validation is now modestly more complicated. There are two parts. First, function schema validation proceeds by checking if the signature in question has any SymInt-like types in it or not. If it does, we do function schema validation against the real types; if it doesn't, we do validation against the fake types (but only for symint; MemoryFormat is always MemoryFormat). 
Second, cpp signature validation also keeps track of a "symint" cpp signature and a "non-symint" cpp signature. We only compare symint with symint, and non-symint with non-symint. I did not implement checking a conflict between a symint and non-symint cpp signature, though in principle you could try converting the SymInt types to non-SymInt types and doing the comparison that way. To show it is working, I remove a bunch of c10::asIntArrayRefSlow shims, as the dispatcher is able to insert them automatically now. I didn't update the Metal registrations (though they can get similar treatment) as OSS CI coverage is insufficient for this case. Signed-off-by: Edward Z. Yang Differential Revision: [D39280965](https://our.internmc.facebook.com/intern/diff/D39280965) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84557 Approved by: https://github.com/wconstab commit ed46b9670ebafa1c6bf7d078dcf5687109fee6ae Author: samdow Date: Tue Sep 6 16:41:00 2022 -0400 add randomness kwarg to jacfwd (#84220) From https://github.com/pytorch/functorch/issues/1010, if a user runs jacfwd with a function that uses randomness, it will fail since the default behavior for vmap is error. This lets the user specify the randomness behavior to jacfwd too since it is doing vmap(jvp(forward)). This is less likely to show up in jacrev since that only vmaps over the backwards pass Pull Request resolved: https://github.com/pytorch/pytorch/pull/84220 Approved by: https://github.com/zou3519 commit 50ae5c9141fc752c80e7fe88a123ea77ee0265f9 Author: Jianyu Huang Date: Wed Sep 7 16:14:23 2022 +0000 set workspace size to 4M (#74159) Summary: Follow D34480690 (https://github.com/pytorch/pytorch/commit/3ec1dd9989ac5441c767f975f5e0fc46847400a2) Test Plan: CI Differential Revision: D34636039 Pull Request resolved: https://github.com/pytorch/pytorch/pull/74159 Approved by: https://github.com/xuzhao9 commit 87738f2073d808f0f76d607d1593f7683a463f45 Author: Shen Li Date: Wed Sep 7 02:22:56 2022 +0000 Remove expired c10d::broadcast backward compatibility check (#84107) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84107 Approved by: https://github.com/wanchaol commit 99b7eb4dfbf8387d15b46913f1ff4e771782f499 Author: mikey dagitses Date: Wed Sep 7 15:44:20 2022 +0000 move internal only PyTorch test defs into fb/ subdirectories (#84605) Test Plan: Rely on CI. Differential Revision: D39289373 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84605 Approved by: https://github.com/DanilBaibak commit d3d163af8061e08097c3ae37079bf61535b81ff1 Author: lezcano Date: Wed Sep 7 13:12:49 2022 +0000 Add xla/ folder to gitignore (#84632) As per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/84632 Approved by: https://github.com/ezyang commit 42d99e6f196233627a28b8e9efb26a0a166fa370 Author: soulitzer Date: Tue Sep 6 21:37:03 2022 -0400 Call jit decomposition in VariableType to increase forward AD coverage (#84151) This PR: - updates forward AD codegen in core to generate code that tries calling into decompositions registered to jit when - (1) the function is not in-place or out variant - AND (2) the function is differentiable (requires_derivative=True) - AND (3) there are no forward AD formulas registered - To simplify things we always generating the if/else (as long as (1) is true), but generate 'false' when either (2) or (3) are false. 
- removes the mechanism from functorch - (follow up) some functorch tests should be updated here so they no longer have to compute the Jacobian with vjp - factors out some logic to generate the any_has_forward_grad condition - (bc-breaking) when TensorList inputs unexpectedly have forward grad, the error will no longer contain the name See https://github.com/pytorch/pytorch/pull/84151#issuecomment-1238519247 for codegen output and more discussion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84151 Approved by: https://github.com/samdow, https://github.com/albanD, https://github.com/zou3519 commit e31ad1c2d3a08a6421cd7a8adcd7b3f66727305a Author: soulitzer Date: Tue Sep 6 13:23:03 2022 -0400 [reland] Move decompositions and helpers for jvp from functorch into core (#84581) Reland of https://github.com/pytorch/pytorch/pull/84358 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84581 Approved by: https://github.com/samdow commit 3eb16509c761c41f50163d404428246ea117c7fd Author: nikitaved Date: Wed Sep 7 15:29:44 2022 +0000 optimize householder product backward to be more memory-efficient (#84627) A follow-up on discussions in https://github.com/pytorch/pytorch/pull/84180. Makes backward more memory-efficient with fewer kernel calls. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84627 Approved by: https://github.com/kshitij12345, https://github.com/zou3519 commit e96fb5d58c2accd717f0859b510ae7facb6d6aac Author: Rodrigo Kumpera Date: Wed Sep 7 14:49:45 2022 +0000 [c10d] Fix docstring of scatter_object_list (#84596) The docstring for scatter_object_list mentions it doesn't work with NCCL, but this was fixed in #79034 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84596 Approved by: https://github.com/H-Huang commit a47bc96fb7176d43752d3e376697971d4ba47317 Author: Richard Zou Date: Tue Sep 6 10:20:07 2022 -0700 [composite compliance] fix linalg.eigvals (#84137) linalg.eigvals fails in some cases with functorch and the root of the problem is that it is not composite compliant. In particular, checks that branch on whether or not a Tensor requires grad do not work with functorch. In order to support functorch with them, we have to include an additional "if the tensor is a Tensor Subclass, then assume that it MAY require grad, so we must always go through the differentiable path". This PR also changes the batching rule for linalg.eigvals to be a decomposition instead of what it was previously; the previous batching rule was masking the error in functorch's test suite. Unfortunately we don't have comprehensive tests for this on the functorch side, which is why this was not caught before. I'll look into why that is in the future; it's a bit complicated.
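As a rough Python illustration of the composite-compliance pattern described above (a hedged sketch, not the actual ATen implementation; `maybe_requires_grad` and `eigvals_sketch` are made-up names):
```python
import torch

def maybe_requires_grad(t: torch.Tensor) -> bool:
    # Branching on `t.requires_grad` alone is not composite compliant: a Tensor
    # subclass (e.g. a functorch wrapper) may hide grad-ness, so if we see a
    # subclass we assume it MAY require grad and take the differentiable path.
    return t.requires_grad or type(t) is not torch.Tensor

def eigvals_sketch(a: torch.Tensor) -> torch.Tensor:
    if maybe_requires_grad(a):
        # Differentiable path: compute eigenvectors too and keep only the values.
        values, _ = torch.linalg.eig(a)
        return values
    # Fast path for plain tensors that definitely do not need grad.
    return torch.linalg.eigvals(a)

print(eigvals_sketch(torch.randn(3, 3)))
```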
Test Plan: - wait for tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/84137 Approved by: https://github.com/Lezcano, https://github.com/IvanYashchuk, https://github.com/samdow commit 89c4654ba9e3c552d3a6e0a56da8adf656cce469 Author: Shen Li Date: Tue Sep 6 21:51:34 2022 +0000 Add scatter_ to CommTensor (#84606) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84606 Approved by: https://github.com/wanchaol commit f43c38bdc820650ad974bb1c48360b0c6931961a Author: Shen Li Date: Tue Sep 6 21:35:20 2022 +0000 Add broadcast_ to CommTensor (#84604) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84604 Approved by: https://github.com/wanchaol commit a24d7a8565f5aac8448775552557112d0239fc8f Author: Shen Li Date: Tue Sep 6 21:35:12 2022 +0000 Add reduce_scatter_ to CommTensor (#84592) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84592 Approved by: https://github.com/wanchaol commit e4519548a5a5f4026645f4a240ac026094ef1be5 Author: Shen Li Date: Tue Sep 6 21:35:12 2022 +0000 Supported nested lists in CommTensor and enable tracing allgather_ (#84585) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84585 Approved by: https://github.com/wanchaol commit 189768ed64561e61ff05c9e42adfa40139388204 Author: CaoE Date: Wed Sep 7 13:48:43 2022 +0000 Add mkl implementation for exponential on CPU (#69967) Add mkl implementation for exponential on CPU to improve the performance of exponential. data type: float32 single socket (28cores): ``` before: torch.Size([10, 128, 10, 124]) 0.065 ms torch.Size([10, 128, 20, 124]) 0.130 ms after: torch.Size([10, 128, 10, 124]) 5.9e-05 ms torch.Size([10, 128, 20, 124]) 0.000113 ms ``` single core: ``` before: torch.Size([10, 128, 10, 124]) 0.065 ms torch.Size([10, 128, 20, 124]) 0.130 ms after: torch.Size([10, 128, 10, 124]) 0.00117 ms torch.Size([10, 128, 20, 124]) 0.002347 ms ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/69967 Approved by: https://github.com/frank-wei commit 9e7af4e8d4540c6034806e84fec64d08643031bd Author: Mateusz Sypniewski Date: Wed Sep 7 13:01:51 2022 +0000 Add alias info to torch._C (#84580) This adds the `AliasInfo` class to torch._C, as defined in https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/python/init.cpp#L1943. This will fix MYPY errors for missing `Argument` attributes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84580 Approved by: https://github.com/lw commit ec3939a62f7e09807e0e7e9701c354c94aef7a66 Author: Sean Silva Date: Wed Sep 7 12:53:08 2022 +0000 Detect `__code__` a bit more reliably. (#84610) Based on Ed's patch. 
Fixes https://github.com/pytorch/pytorch/issues/84570 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84610 Approved by: https://github.com/Chillee commit 07d398fb269eebe314ae898287494a2bfdc7f278 Author: kshitij12345 Date: Wed Sep 7 09:33:37 2022 +0000 [composite compliance] linalg_householder_product (#84180) Ref: #69991 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84180 Approved by: https://github.com/zou3519 commit 045ebc771d5070696f839e586285ace9c06f1339 Author: Nikita Shulga Date: Wed Sep 7 05:52:27 2022 +0000 [BE] Use `teardown-linux`/`chown` actions for binary builds (#84449) Also embed `wait_for_ssh_to_drain.sh` into the action (to make it more reusable across repos) and delete unused teardown_linux template from `common.yml` Also, in `_binary-test-linux.yml` move artifact download step after repo checkout, to make errors during that step more parseable Pull Request resolved: https://github.com/pytorch/pytorch/pull/84449 Approved by: https://github.com/kit1980 commit 1a33e944b58a75efe6154f1d02a32b80b7661edf Author: jjsjann123 Date: Wed Sep 7 05:22:37 2022 +0000 nvfuser torchbench patch (#84411) 1. Patching nvfuser_execute to take aten nvprim fallback when no cuda tensors are provided as inputs 2. Extending support of nvfuser python API on cpu scalar tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84411 Approved by: https://github.com/ngimel, https://github.com/kevinstephano, https://github.com/IvanYashchuk commit 7c3102f3f09e0ac0bf272df2ad48dd40515eceea Author: Antonio Kim Date: Wed Sep 7 05:03:02 2022 +0000 Add ShouldSyncTensor interface (#84418) Adding an `ShouldSyncTensor` interface to allow for the case of output pruning should a vendor not support retrieving the value of a certain output. CC: @wconstab @JackCaoG @Krovatkin Pull Request resolved: https://github.com/pytorch/pytorch/pull/84418 Approved by: https://github.com/wconstab commit c4e0c927e31d59c51a6d4e09d58038becc1faf29 Author: Ke Wen Date: Wed Sep 7 04:48:02 2022 +0000 [c10d] Add a soft error handling mode (#84386) Adding new value "2" to env `NCCL_ASYNC_ERROR_HANDLING` standing for a "CleanUpOnly" error handling mode. Comparing to `NCCL_ASYNC_ERROR_HANDLING=1`, the "CleanUpOnly" mode will just abort the collectives and NCCL communicators, and will not tear down the process. User will have the chance to query the state of the process group (in a later PR) and abort the process group (in a later PR), and re-create the process group if needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84386 Approved by: https://github.com/rohan-varma commit 5b58140d1a471b144baf66cc61a45a86746f0215 Author: Kurt Mohler Date: Wed Sep 7 03:12:49 2022 +0000 Add deterministic impl of `scatter_add` CUDA for all input sizes (#79466) Fixes #50469 Pull Request resolved: https://github.com/pytorch/pytorch/pull/79466 Approved by: https://github.com/ngimel commit 039b0146f9d831ae20ce293989db07a711dae09a Author: PyTorch MergeBot Date: Wed Sep 7 02:39:36 2022 +0000 [vision hash update] update the pinned vision hash (#83900) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83900 Approved by: https://github.com/pytorchbot commit 15c5baf87802cda783824c1762bf16d848b6625f Author: Elias Ellison Date: Wed Sep 7 00:20:29 2022 +0000 Throw on data dependent ops (#83567) Previously, we would trace through the following with no error: ``` from torch.fx.experimental.proxy_tensor import make_fx import torch def f(x, y): return x[0, y:] ``` Even though the output shape is dependent on the data of `y`. Now, throw on the conversion of `y` to an integer. It would be nice to not break on constant tensors but I'll do that as the next PR (Edit: done with https://github.com/pytorch/pytorch/pull/84387). Sketching out how that would work (and keep in mind this is applicable Dynamo tracing and not just AOT Autograd) I think to do that you would need to : - hold strong refs to a set of constant tensors, and only allow them to be captured from `lift_fresh.copy` - when you run a mutable op, either remove it from the set of constant tensors or run the operator for real - limit to small constant tensors Anything else ? Pull Request resolved: https://github.com/pytorch/pytorch/pull/83567 Approved by: https://github.com/ezyang commit 0be77d54159ae9b95297e978d29eb2c92d5bafee Author: PyTorch MergeBot Date: Wed Sep 7 02:32:47 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#84613) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84613 Approved by: https://github.com/pytorchbot commit b168c4faa23b5684ede608140febec7c97a795d0 Author: Shen Li Date: Tue Sep 6 16:14:16 2022 +0000 Make CommTensor Generic to arguments and outputs structures (#84576) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84576 Approved by: https://github.com/aazzolini commit 00e0228050739dd33335cc2a8663c9759ba2f144 Author: Nikita Shulga Date: Wed Sep 7 00:53:46 2022 +0000 [BE] Delete `Check for new workflow" check (#84608) This check was introduced back in Mar, and all PRs for the last 60+ days should pass this check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84608 Approved by: https://github.com/ZainRizvi, https://github.com/kit1980 commit 06ebe2d5bc1055f226f56ed2fe26a29038a466e5 Author: Bin Chen Date: Wed Sep 7 00:17:20 2022 +0000 Add watchdog to TorchElastic agent and trainers (#84081) Summary: D38604238 (https://github.com/pytorch/pytorch/commit/3b11b80fc3f9f9a0171abb5eb2299835feba8b04) introduced a named pipe based watchdog timer. This diff uses the named pipe based watchdog timer in TorchElastic agent and training worker processes (in the StuckJobDetector class) to allow the TorchElastic agent to detect the stuck of a training process, and kill the process to create a core dump. 
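The named-pipe watchdog mechanism itself is not shown in the commit message; as a loose, hypothetical sketch of the general idea only (the FIFO path, function names, and use of SIGABRT are assumptions, not the TorchElastic or StuckJobDetector API):
```python
import os
import signal
import time

FIFO_PATH = "/tmp/elastic_watchdog_fifo"  # hypothetical path


def worker_heartbeat(period_s: float = 5.0) -> None:
    """Worker side: periodically write our PID into the pipe."""
    fd = os.open(FIFO_PATH, os.O_WRONLY)  # blocks until the agent has the pipe open
    while True:
        os.write(fd, f"{os.getpid()} ".encode())
        time.sleep(period_s)


def agent_watchdog(timeout_s: float = 60.0) -> None:
    """Agent side: abort any worker whose heartbeat goes silent for too long."""
    if not os.path.exists(FIFO_PATH):
        os.mkfifo(FIFO_PATH)
    # O_RDWR keeps the pipe readable even when no worker is connected;
    # O_NONBLOCK lets the agent poll on its own schedule.
    fd = os.open(FIFO_PATH, os.O_RDWR | os.O_NONBLOCK)
    last_seen = {}  # pid -> time of last heartbeat
    while True:
        try:
            data = os.read(fd, 4096)
        except BlockingIOError:
            data = b""
        now = time.monotonic()
        for token in data.decode().split():
            last_seen[int(token)] = now
        for pid, seen in list(last_seen.items()):
            if now - seen > timeout_s:
                os.kill(pid, signal.SIGABRT)  # stuck worker: abort it to get a core dump
                del last_seen[pid]
        time.sleep(1.0)
```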
Test Plan: ``` buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test ``` ``` RemoteExecution session id: reSessionID-0bfcacef-24d1-42bc-a1d3-f3058fc42b2f-tpx Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/7318349503394739 ✓ ListingSuccess: caffe2/test/distributed/elastic/agent/server/test:local_agent_test : 55 tests discovered (22.699) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_barrier_failed_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (47.140) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_homogeneous_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (49.198) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_happy_function_c10d (local_elastic_agent_test.LocalElasticAgentTest) (46.387) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_happy_function_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (46.094) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_bipolar_function_etcd (local_elastic_agent_test.LocalElasticAgentTest) (106.342) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_correct_rank_assignment_homogeneous_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (64.888) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_correct_rank_assignment_homogeneous_etcd (local_elastic_agent_test.LocalElasticAgentTest) (69.158) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_agent_local_watchdog_setup_enabled_etcd (local_elastic_agent_test.LocalElasticAgentTest) (46.965) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_elastic_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (79.626) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_function_with_return_value_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (46.113) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_sad_function_etcd (local_elastic_agent_test.LocalElasticAgentTest) (46.487) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_shutdown_called_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (24.358) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_torch_rpc_c10d (local_elastic_agent_test.LocalElasticAgentTest) (48.216) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_homogeneous_c10d (local_elastic_agent_test.LocalElasticAgentTest) (48.433) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_torch_rpc_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (47.029) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_simple_dist_sum_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (44.357) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_check_master_addr_port_override_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (45.176) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_check_nccl_async_error_handling_env_default_c10d (local_elastic_agent_test.LocalElasticAgentTest) (45.980) ✓ Pass: 
caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_simple_dist_sum_c10d (local_elastic_agent_test.LocalElasticAgentTest) (47.151) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_simple_dist_sum_etcd (local_elastic_agent_test.LocalElasticAgentTest) (44.614) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_correct_rank_assignment_heterogeneous_etcd (local_elastic_agent_test.LocalElasticAgentTest) (69.099) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_agent_local_watchdog_setup_enabled_c10d (local_elastic_agent_test.LocalElasticAgentTest) (45.367) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_shutdown_called_etcd (local_elastic_agent_test.LocalElasticAgentTest) (22.804) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_elastic_c10d (local_elastic_agent_test.LocalElasticAgentTest) (77.560) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_dummy_compute_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (46.050) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_heterogeneous_c10d (local_elastic_agent_test.LocalElasticAgentTest) (48.088) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_elastic_etcd (local_elastic_agent_test.LocalElasticAgentTest) (77.286) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_fault_tolerance_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (50.670) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_check_master_addr_port_override_etcd (local_elastic_agent_test.LocalElasticAgentTest) (45.631) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_heterogeneous_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (50.867) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_double_agent_fault_tolerance_etcd (local_elastic_agent_test.LocalElasticAgentTest) (51.095) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_happy_function_etcd (local_elastic_agent_test.LocalElasticAgentTest) (45.000) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_sad_function_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (45.197) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_homogeneous_etcd (local_elastic_agent_test.LocalElasticAgentTest) (46.873) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_shutdown_called_c10d (local_elastic_agent_test.LocalElasticAgentTest) (23.160) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_barrier_failed_etcd (local_elastic_agent_test.LocalElasticAgentTest) (43.632) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_torch_rpc_etcd (local_elastic_agent_test.LocalElasticAgentTest) (44.536) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_bipolar_function_c10d (local_elastic_agent_test.LocalElasticAgentTest) (89.859) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_workers_drift_fail_etcd (local_elastic_agent_test.LocalElasticAgentTest) (48.277) ✓ Pass: 
caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_check_nccl_async_error_handling_env_c10d (local_elastic_agent_test.LocalElasticAgentTest) (43.930) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_bipolar_function_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (87.677) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_workers_drift_success_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (48.965) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_workers_drift_fail_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (50.143) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_workers_drift_success_etcd (local_elastic_agent_test.LocalElasticAgentTest) (46.781) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_function_with_return_value_etcd (local_elastic_agent_test.LocalElasticAgentTest) (45.152) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_barrier_failed_c10d (local_elastic_agent_test.LocalElasticAgentTest) (44.832) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_function_with_return_value_c10d (local_elastic_agent_test.LocalElasticAgentTest) (45.281) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_correct_rank_assignment_heterogeneous_etcd_v2 (local_elastic_agent_test.LocalElasticAgentTest) (74.968) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_agent_local_watchdog_setup_disabled_c10d (local_elastic_agent_test.LocalElasticAgentTest) (46.141) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_dummy_compute_c10d (local_elastic_agent_test.LocalElasticAgentTest) (44.960) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_dummy_compute_etcd (local_elastic_agent_test.LocalElasticAgentTest) (45.292) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_agent_local_watchdog_setup_disabled_etcd (local_elastic_agent_test.LocalElasticAgentTest) (44.611) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_check_env_function_etcd (local_elastic_agent_test.LocalElasticAgentTest) (44.939) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_distributed_sum_heterogeneous_etcd (local_elastic_agent_test.LocalElasticAgentTest) (47.609) ✓ Pass: caffe2/test/distributed/elastic/agent/server/test:local_agent_test - test_run_sad_function_c10d (local_elastic_agent_test.LocalElasticAgentTest) (45.628) Summary Pass: 55 ListingSuccess: 1 Finished test run: https://www.internalfb.com/intern/testinfra/testrun/7318349503394739 ``` ----------- ``` buck test caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test ``` ``` RemoteExecution session id: reSessionID-607a0028-4095-4dfc-b657-55f0807fe621-tpx Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/8162774432794818 ✓ ListingSuccess: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test : 11 tests discovered (39.037) ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_thrift_api_called (caffe2.torch.fb.trainer.stuck_detection.tests.collect_quickstack_test.CollectQuickstackTrace) (0.655) ✓ Pass: 
caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_setup_local_watchdog (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (36.510) ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_dont_print_when_job_normal (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (36.727) ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_send_watchdog_request_on_batch_callbacks_no_server (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.060) ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_quickstack_stuck_job (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.242) ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_setup_local_watchdog_disabled (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.243) ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_print_stack_trace_when_job_stuck (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.590) ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_print_when_stuck (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.590) ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_setup_local_watchdog_no_file (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (37.589) ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_signposts_stack_trace_when_job_stuck (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (38.132) ✓ Pass: caffe2/torch/fb/trainer/stuck_detection/tests:stuck_job_detector_test - test_send_watchdog_request_on_batch_callbacks (caffe2.torch.fb.trainer.stuck_detection.tests.stuck_job_detector_test.StuckJobDetectorTest) (38.133) Summary Pass: 11 ListingSuccess: 1 Finished test run: https://www.internalfb.com/intern/testinfra/testrun/8162774432794818 ``` Differential Revision: D38930476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84081 Approved by: https://github.com/d4l3k commit d9ceda49c497bc39c2b360b038ee07e145b32f5b Author: Thibault Date: Tue Sep 6 23:32:16 2022 +0000 ONNX: fix default function value in _optimize_graph (#83996) The default value for params_dict in _optimize_graph, which is None, throw the following error: > _C._jit_pass_onnx_unpack_quantized_weights( > TypeError: _jit_pass_onnx_unpack_quantized_weights(): incompatible function arguments. The following argument types are supported: > 1. (arg0: torch::jit::Graph, arg1: Dict[str, IValue], arg2: bool) -> Dict[str, IValue] Replacing it by an empty dict fixes the issue (and makes more sense). 
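For illustration only, the shape of that fix (a hedged sketch; the real `_optimize_graph` takes several more parameters):
```python
from typing import Any, Dict, Optional

def _optimize_graph_sketch(graph: Any, params_dict: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # The pybind-bound pass expects Dict[str, IValue]; passing None hits the
    # "incompatible function arguments" error above, so normalize to {} first.
    if params_dict is None:
        params_dict = {}
    # ...the real code would now forward params_dict to
    # _C._jit_pass_onnx_unpack_quantized_weights(graph, params_dict, ...)
    return params_dict
```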
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83996 Approved by: https://github.com/BowenBao commit 16f8dc00f0331c839b04381d8ec644fbc2220313 Author: Hansong Zhang Date: Tue Sep 6 23:08:07 2022 +0000 [nnapi] remove unused field 'order_' in nnapi.h (#84067) Summary: This fixes the build Test Plan: `buck build //xplat/caffe2:nnapi_benchmarkAndroid` Reviewed By: salilsdesai Differential Revision: D38924916 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84067 Approved by: https://github.com/SS-JIA commit 166dec74b5ce3968a53d4c0f616776d0a2bf4309 Author: PyTorch MergeBot Date: Tue Sep 6 22:31:14 2022 +0000 Revert "Dispatch torch.norm to linalg.vector_norm and linalg.matrix_norm (#81761)" This reverts commit 65beff5acb0d7c0c484bd0558bcaf8ddc9c96aab. Reverted https://github.com/pytorch/pytorch/pull/81761 on behalf of https://github.com/mehtanirav due to Breakages in pytorch/glow commit 0e49bcfd416b1a83de0820f910c7c9ac38cbebaf Author: Paul Saab Date: Tue Sep 6 22:28:43 2022 +0000 [aarch64] Use cross build ld/ar/objcopy when creating libraries for cross building etc (#84558) Summary: ^^ Test Plan: CI Differential Revision: D39267050 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84558 Approved by: https://github.com/ajtulloch commit 1cad744694d7feb7c55e5f4ff4a6ae749686bfb5 Author: Mikayla Gawarecki Date: Tue Sep 6 16:59:51 2022 +0000 Enable select.int when NestedTensor requires grad (#83875) Previously indexing a nested tensor when it requires_grad would raise an error because the backward formula for `select.int` uses `self.sizes()`. This PR fixes that by temporarily registering a _nested_select_backward function which can be removed when we start using the symint approach to register kernels. For now this functionality is needed for creating a POC that nested tensor can be an API to `segment_coo` and `segment_csr` in the torch_scatter repo ``` a = torch.arange(10).reshape(2, 5).float() b = torch.arange(12).reshape(2, 6).float() nt = torch.nested_tensor([a, b], dtype=torch.float).requires_grad_(True) nt[0] ``` whereas ``` nt = torch.nested_tensor([a, b], dtype=torch.float).requires_grad_(False) nt[0] ``` would succeed Pull Request resolved: https://github.com/pytorch/pytorch/pull/83875 Approved by: https://github.com/albanD, https://github.com/drisspg commit 752c3bcb474e8024f59c9977dea67adfb256146d Author: Ivan Yashchuk Date: Tue Sep 6 22:08:11 2022 +0000 Enable nvfuser tests for refs.broadcast_to and refs.broadcast_tensors (#84337) Previously these tests were failing because they required some other op alongside prims.broadcast_in_dim to be executed. Now it works standalone. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84337 Approved by: https://github.com/mruberry, https://github.com/ngimel commit aec76e391f8c5c44e0340c7f4c67347f043e3144 Author: Catherine Lee Date: Tue Sep 6 21:32:03 2022 +0000 circleci - add master back, retry checkout for ios (#84443) add master back so its easier to determine when something started failing retry checkout for ios, based on the provided circleci checkout but with a lot of stuff removed Pull Request resolved: https://github.com/pytorch/pytorch/pull/84443 Approved by: https://github.com/janeyx99 commit 7a7b05802ac6b2cd14ffcc1af512d0c5cc46bf33 Author: Peter Bell Date: Tue Sep 6 19:19:14 2022 +0100 Add col2im_batched kernel (#84543) Closes #84407 This changes col2im on CUDA to launch a single batch-aware kernel instead of launching n single slice kernels. 
The `istft` call in the linked issue goes from 98.7 ms to 858 us on my machine, for an over 100x speedup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84543 Approved by: https://github.com/ngimel commit bab1304f59cf48901891aad73974dc123ad9614a Author: Antonio Kim Date: Tue Sep 6 20:55:34 2022 +0000 Add step closures (#84300) Ports over the step closure functionality from PyTorch/XLA to Lazy Tensor Core: References: https://github.com/pytorch/xla/blob/205ae574c0a24e092899ea8610c360f93f5d8142/torch_xla/core/xla_model.py#L852-L900 https://github.com/pytorch/xla/blob/205ae574c0a24e092899ea8610c360f93f5d8142/torch_xla/utils/closures.py#L7-L83 CC: @wconstab @JackCaoG @Krovatkin Pull Request resolved: https://github.com/pytorch/pytorch/pull/84300 Approved by: https://github.com/JackCaoG, https://github.com/wconstab commit 02da9437b0ed501c2403e133b8c81eab5802c586 Author: Edward Z. Yang Date: Tue Sep 6 09:57:24 2022 -0700 Store SymInt out of line (#84390) swolchok reported that non-tracing usage of Tensor we are wasting a lot of time on is_symbolic() tests, e.g., when destructing SymInts. This is a regression for no good reason because we don't actually ever have SymInts in those cases. This PR moves the stored SymInts on Tensor out of line, into a separate ExtraMeta struct, which is only allocated when we make a Tensor store symbolic sizes/strides. To avoid adding another word to TensorImpl, I take over the named tensor metadata field. This makes named tensor require a double indirection and use up more space, but it's OK since we're going to delete this feature anyway soon. I restore regular int64_t storage on Tensor. This entailed reverting https://github.com/pytorch/pytorch/pull/82467 ; there are no other substantive changes to SizesAndStrides so a close review is not necessary. I don't bother optimizes sizes and strides in ExtraMeta in the same way stock tensor is optimized. I add a SymDimVector alias. I make SymInt UNCHECKED constructor public as it is a useful optimization in some situations when the int is known to be positive. I thought about storing the SymInts on the Python object instead. However, because we can allocate symbolic shape tensors directly from C++, we cannot guarantee that there is a PyInterpreter for a Tensor. So we do it this way instead; it's also faster since you don't have to take out the GIL to do accesses. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84390 Approved by: https://github.com/swolchok, https://github.com/Krovatkin commit 7f90606309bda30e82f571d9720b25e85a041246 Author: Max Podkorytov Date: Tue Sep 6 20:07:56 2022 +0000 [static-runtime] update generator for the modified tests; re-run autogen script (#84437) Test Plan: CI Reviewed By: mikeiovine Differential Revision: D39183148 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84437 Approved by: https://github.com/mikeiovine commit 6363b1b3587aa64ad055ba0a905af28d8dec52d2 Author: Ivan Yashchuk Date: Tue Sep 6 19:56:17 2022 +0000 Add nvFuser support for aten.native_batch_norm_backward (#84546) Replacing `tensor.reshape(broadcast_mask)` with unsqueezes makes the implementation of `batch_norm_backward` more friendly for PrimTorch+nvFuser. 
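As a small, hedged illustration of the equivalence being exploited here (shapes and variable names are made up; this is not the actual decomposition):
```python
import torch

# Per-channel statistics of shape [C], broadcast against an NCHW input.
c = 8
stats = torch.randn(c)
broadcast_mask = [1, c, 1, 1]

via_reshape = stats.reshape(broadcast_mask)                     # reshape-based broadcasting helper
via_unsqueeze = stats.unsqueeze(0).unsqueeze(-1).unsqueeze(-1)  # prim-friendly equivalent

assert via_reshape.shape == via_unsqueeze.shape == torch.Size([1, c, 1, 1])
assert torch.equal(via_reshape, via_unsqueeze)
```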
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84546 Approved by: https://github.com/Chillee commit 7243264c61a15446bad0fcd412a1fee1bc08ec1e Author: F-G Fernandez <26927750+frgfm@users.noreply.github.com> Date: Tue Sep 6 19:24:10 2022 +0000 fix: Allowed optimizers with more than 2 betas (#84486) Hello there :wave: As discussed in #84485, this PR enables more flexibility on the optimizers that are wrapped by LR schedulers in PyTorch. Currently, it is incompatible with optimizers that have a number of betas different than 2. This PR fixes that with minimal modifications. Fixes #84485 Any feedback is welcome! Pull Request resolved: https://github.com/pytorch/pytorch/pull/84486 Approved by: https://github.com/Lezcano, https://github.com/soulitzer commit e20f2172954609e44f014146a291fd521d29180e Author: Ivan Yashchuk Date: Tue Sep 6 19:16:39 2022 +0000 Remove unnecessary decomposition_table= from test/test_prims.py (#84188) Follow-up to 83782 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84188 Approved by: https://github.com/jjsjann123, https://github.com/ngimel commit 88b1cc885cc92b9483eec95546bb48c7bccea070 Author: Fabio Rocha Date: Tue Sep 6 16:35:26 2022 +0000 Removed tri[lu]* tests, superseded by OpInfos (#84256) triu, tril, triu_indices and tril_indices had some tests in test_tensor_creation_ops.py and test_cuda.py that are redundant with the ones done by OpInfos for those ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84256 Approved by: https://github.com/Lezcano, https://github.com/ngimel commit 92a6b970baf87b9cb85112f5facb6af51c48c8c0 Author: cchheennhhaaoo Date: Tue Sep 6 18:38:27 2022 +0000 Be compatible with SYCL 2020 and SYCL1.2.1 for sycl.hpp (#83259) - In SYCL 2020, SYCL provides one standard header file: `<sycl/sycl.hpp>`, which needs to be included in every translation unit that uses the SYCL programming API. - For compatibility with SYCL 1.2.1, SYCL provides another standard header file: `<CL/sycl.hpp>`, which can be included in place of `<sycl/sycl.hpp>`. - SYCL documents this change in [doc](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:headers-and-namespaces)(4.3). - SYCL_LANGUAGE_VERSION substitutes an integer reflecting the version number and revision of the SYCL language being supported by the implementation in SYCL 2020. In SYCL 1.2.1, the macro name is CL_SYCL_LANGUAGE_VERSION. So these two macros can be used to distinguish SYCL 1.2.1 and SYCL 2020. - SYCL 2020 doc: https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf - SYCL 1.2.1 doc: https://registry.khronos.org/SYCL/specs/sycl-1.2.1.pdf Pull Request resolved: https://github.com/pytorch/pytorch/pull/83259 Approved by: https://github.com/malfet commit c4e8d6282bc730ab35fc3a42c12bfda7a99a5b1c Author: Jason Ansel Date: Tue Sep 6 18:36:24 2022 +0000 Improve getitem syntax for TensorType (#84555) Allows `TensorType[Dyn, 3, Dyn]` instead of the prior `TensorType[(Dyn, 3, Dyn)]`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84555 Approved by: https://github.com/jamesr66a commit fa99b7b8f726f8ce63f0dff076b7e9171e3dd40a Author: Sergei Vorobev Date: Tue Sep 6 18:14:08 2022 +0000 [bazel] fix integration test (#79843) Fixes broken bazel `bazel test //:integration_test`. Bazel needs a way to download the mnist dataset that's used in the integration test. This patch does it through a genrule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79843 Approved by: https://github.com/malfet commit 4f0b9f3c31bebdb46df8f78f13a0857f6c4ed43f Author: mikey dagitses Date: Tue Sep 6 18:08:42 2022 +0000 move PyTorch internal-only starlark files into fb/ subdirectories (#84548) Summary: These are not used in OSS so should not clutter them there. Test Plan: Rely on CI. Differential Revision: D39262135 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84548 Approved by: https://github.com/DanilBaibak commit c794ee5cc12192da527bbbcf5c5b9ec33c935cbe Author: Nikita Shulga Date: Tue Sep 6 17:49:29 2022 +0000 Reenable TestCppExtensionJIT on M1 (#84552) Works fine locally, let's see if it'll pass CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/84552 Approved by: https://github.com/kit1980 commit c771d73461449f89e26bc4130d1641340a03e05d Author: Richard Zou Date: Tue Sep 6 07:10:33 2022 -0700 [composite compliance] fix max_pool1d (#84127) max_pool1d has a fast path for CPU tensors that do not require grad that directly accesses the data_ptr. This PR makes the change that if the input Tensor is a Tensor Subclass, then we want to walk through the "slow path" of calling max_pool1d_with_indices. Test Plan: - wait for tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/84127 Approved by: https://github.com/kshitij12345, https://github.com/samdow, https://github.com/malfet commit 139599ba954e084ed6962dc94c99f5f2ce6ec2e7 Author: Richard Zou Date: Tue Sep 6 07:10:33 2022 -0700 Contiguify bias in slow_conv_transpose3d kernel (#84125) Users never run into this because PyTorch now comes with cudnn by default and cudnn has a better conv_transpose implementation. However we seem to test without cudnn in our CI; and also, ROCM goes down this path. The .contiguous() call does not regress anything because previously it was a runtime error. Because this kernel is the "slow conv transpose3d kernel", we don't care much for its performance. Test Plan: - wait for tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/84125 Approved by: https://github.com/ngimel commit ee228ad9499ca97c267e5597d36570e096dcf2c0 Author: PyTorch MergeBot Date: Tue Sep 6 17:12:12 2022 +0000 Revert "[BE] Use `teardown-linux`/`chown` actions for binary builds (#84449)" This reverts commit 1a16b2576f69383480e8be889531e4f574356c62. Reverted https://github.com/pytorch/pytorch/pull/84449 on behalf of https://github.com/malfet due to Revert as it broke trunk, though on next PR commit faac3dbce20a6068a3e530c11788896e81a73c64 Author: kshitij12345 Date: Tue Sep 6 16:58:42 2022 +0000 [optim] asgd : handle complex params as independent real params (#84472) Ref: #65711 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84472 Approved by: https://github.com/Lezcano, https://github.com/soulitzer commit f725009a48dcbec6c9e9378880314d30a9080c82 Author: Nikolay Korovaiko Date: Wed Aug 31 23:51:35 2022 -0700 as_strided supports SymInt; codegen supports optional SymInt (#84393) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84393 Approved by: https://github.com/ezyang commit ee57f5c6c81b1622e3d34f5f0c4f20aad108797f Author: Nikolay Korovaiko Date: Wed Aug 31 20:55:41 2022 -0700 fix skipIfTorchDynamo on classes (#84392) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84392 Approved by: https://github.com/ezyang commit 5e9c26c8e25e5d5be18ff98b7808b674b1e7a0a5 Author: George Qi Date: Tue Sep 6 06:56:38 2022 +0000 [maskedtensor] adding reductions (#82839) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82839 Approved by: https://github.com/bhosmer commit f125bd2cbb8301c12685957ace573c301e1056e2 Author: Peter Bell Date: Tue Sep 6 13:43:09 2022 +0100 Support torch.ScriptObject in torch::jit::as_object (#84398) When a torchbind class is returned from an operator, it has the class `torch.ScriptObject`, yet the `torch.ops` interface checks against `torch.jit.RecursiveScriptClass` or else falls back to a much slower path that doesn't return the original c++ object. On my machine I see a 2 us performance improvement when calling a `torch.ops` function with a `ScriptObject` argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84398 Approved by: https://github.com/ezyang commit 207a5a8fa9bfd9361038d46636c0440290c171bb Author: PyTorch MergeBot Date: Tue Sep 6 13:23:19 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#84383) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84383 Approved by: https://github.com/pytorchbot, https://github.com/ezyang commit d2b8b8f29121ed23e2b39446d09bba3a7eb96684 Author: Paul Saab Date: Tue Sep 6 03:05:52 2022 +0000 [aarch64] Unused variable (#84549) Summary: Declare another variable unused Test Plan: CI Reviewed By: andrewjcg Differential Revision: D39263305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84549 Approved by: https://github.com/jianyuh commit 26c136a135fe0215195a6e0566651baaffb01159 Author: Peter Bell Date: Mon Sep 5 15:02:02 2022 +0100 Use TensorBase in Shuffle and WeightNorm cpu kernels (#84499) These files are already only using the subset available with TensorBase, so this is a straightforward name substitution. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84499 Approved by: https://github.com/ezyang commit 6f29642b6f27f53295ead7c3f2767ef45307e710 Author: Peter Bell Date: Mon Sep 5 15:02:01 2022 +0100 Remove Tensor.h includes from spdiags cpu kernel (#84500) This file uses `Tensor::operator[]` in the middle of a `cpu_kernel`, which is not allowed because it relies on the thread-local dispatcher state. Instead, we should just do the stride calculations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84500 Approved by: https://github.com/ezyang commit 1a16b2576f69383480e8be889531e4f574356c62 Author: Nikita Shulga Date: Mon Sep 5 21:44:30 2022 +0000 [BE] Use `teardown-linux`/`chown` actions for binary builds (#84449) Also embed `wait_for_ssh_to_drain.sh` into the action (to make it more reusable across repos) and delete unused teardown_linux template from `common.yml` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84449 Approved by: https://github.com/kit1980 commit 91a5f52f51de9d6aa305d184fe07fe15d20b82c9 Author: Fabio Rocha Date: Mon Sep 5 14:27:37 2022 +0000 Decomp for nn.functional.grid_sampler_2d (#84350) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84350 Approved by: https://github.com/jansel, https://github.com/Lezcano commit acb11da556ddb2302ac14531c5ddf7016ff34a97 Author: Alexander Grund Date: Mon Sep 5 21:23:50 2022 +0000 Increase default test timeout for distributed tests (#80330) When running on clusters the startup time for the subprocesses might be much higher which leads to spurious failures. So increase this to 300s similar to torch/testing/_internal/distributed/distributed_test.py Also introduces `DISTRIBUTED_TESTS_DEFAULT_TIMEOUT` as suggested by @malfet in #55896 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80330 Approved by: https://github.com/malfet commit da99008d3775859832990b3b930ed3c1e4151637 Author: Boyoon Jang Date: Mon Sep 5 16:48:32 2022 +0000 fix typo in torch/package/_mock.py (#84508) Fixed a typo in torch/package/_mock.py Fixes #84507 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84508 Approved by: https://github.com/H-Huang commit e79d0ebfa6d09bc4728bf63ae56cae28b831dbfe Author: Hyeongjun Sim Date: Mon Sep 5 16:34:02 2022 +0000 Fix typo in core.py (#84534) This is a minor typo fix in core.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/84534 Approved by: https://github.com/H-Huang commit 1896d801913fe156f46b0b65f4b1e38f314210b3 Author: Louis Feng Date: Mon Sep 5 16:11:49 2022 +0000 [PyTorch][Profiler] Increase max number of elements to record in execution graph (#84285) Summary: Noticed some jobs are exceeding the max num of elements in an array. 100 was too conservative (observed 128 sizes in CMF model), but we also don't want have unbounded container size. Setting to a large number 4096 that probably will catch extreme cases. Test Plan: ``` buck build mode/opt-split-dwarf //hpc/models/ads:ads_10x_launcher --show-output buck-out/gen/hpc/models/ads/ads_10x_launcher.par +checkpoint=model_store +launcher=mast +data_loader=dist +mode=mast launcher.data_project=ads_model_platform launcher.fbl_entitlement=ads_global_qps checkpoint.model_type=ctr_mobile_feed_model data_loader.table_ds=["2022-08-15"] data_loader.num_batches=5000 profiling_trace=true ``` Differential Revision: D39137530 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84285 Approved by: https://github.com/robieta commit 7e05879b463cfa21b1cf3c26279bf248f835f52e Author: Vasilis Vryniotis Date: Mon Sep 5 13:15:55 2022 +0000 Fix fx test for S3D (#84526) Fixing [failing](https://github.com/pytorch/pytorch/runs/8083404365?check_suite_focus=true) tests by adjusting the input size for S3D. The reason the test is failing is because S3D requires a bigger input size than previously passed. 
As noted before, TorchVision already checks that its models are FX traceable and ensures all the tests are updated and work properly prior to adding new architectures. The tests here seem to duplicate our efforts and often break because they don't factor in details about each model. It might be worth considering running TorchVision's tests instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84526 Approved by: https://github.com/pbelevich commit 437b066e26fab4f84c55314d9a0f6299525297a1 Author: PyTorch MergeBot Date: Mon Sep 5 09:58:11 2022 +0000 [xla hash update] update the pinned xla hash (#84533) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84533 Approved by: https://github.com/pytorchbot commit edab44f6dd4d5fbe00136c70c99be12a8f67e9f7 Author: Ivan Yashchuk Date: Mon Sep 5 08:49:01 2022 +0000 Support a few corner cases for nvFuser executor (#84416) This PR adds asserts to the `nvfuser_execute` function for the cases that do not work. Fallback to eager is used in those cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84416 Approved by: https://github.com/jjsjann123, https://github.com/ngimel commit 9a6aa9053f79127721875e371addd9c3baeaaac0 Author: Edward Z. Yang Date: Fri Sep 2 22:15:00 2022 -0400 Don't convert INT64_MAX start index into zero (#84509) I... don't understand why we did it this way in the first place? Source: https://github.com/pytorch/pytorch/pull/48719/files#r962087365 Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84509 Approved by: https://github.com/ngimel commit e91c1e65b6b4b324284d891c13ce2f612129e9be Author: Paul Saab Date: Sun Sep 4 23:47:59 2022 +0000 [aarch64] Fix _mm_pause() on aarch64 (#84505) Summary: It's possible if you're using simde that _mm_pause is already defined, so instead use the asm for yield Test Plan: CI Differential Revision: D39225258 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84505 Approved by: https://github.com/ajtulloch commit 7c4c7dafbdf2c41ccd9042f1db4f9f9f01a42f00 Author: titaiwang Date: Sun Sep 4 00:01:00 2022 +0000 [ONNX] Add onnx::LayerNorm support for version 17 (#84293) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84293 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit 6d6e04d6cc9a3d7cf5d9a2eda5baafd5c3ee75c0 Author: Kshiteej K Date: Sat Sep 3 07:21:48 2022 +0000 [test_nn] move dropout tests to test/nn/test_dropout.py (#84165) Ref https://github.com/pytorch/pytorch/issues/63085 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84165 Approved by: https://github.com/albanD commit e46c1c7931da2d723a6cad4ec307ff4ed4e9cb7f Author: Paul Saab Date: Sat Sep 3 04:06:26 2022 +0000 [aarch64] Cast to signed char to fix aarch64 build (#84429) Summary: Force SHORT_BINUNICODE and PROTO to signed char to fix build on aarch64 Test Plan: CI Differential Revision: D39198776 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84429 Approved by: https://github.com/ajtulloch commit 388368b6996479f6eca484d4e60a6250b2535dec Author: Justin Chu Date: Fri Sep 2 23:19:03 2022 +0000 [ONNX] Fix type annotations and enable type checking for all apis (#84091) Enable runtime type checking for all torch.onnx public apis, symbolic functions and most helpers (minus two that do not have a checkable type:
`_.JitType` does not exist) by adding the beartype decorator. Fix type annotations to make unit tests green. Profile: export `torchvision.models.alexnet(pretrained=True)` ``` with runtime type checking: 21.314 / 10 passes without runtime type checking: 20.797 / 10 passes + 2.48% ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84091 Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi commit 2a332afbf41b68080a9436e910b93af7cd336fbc Author: Edward Z. Yang Date: Fri Sep 2 08:53:59 2022 -0700 Add SymFloat, support SymInt to SymFloat conversion (#84284) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84284 Approved by: https://github.com/albanD commit 7f5da70ef0be0d3fa60d92430548d2fff6f93ef9 Author: Yeounoh Chung Date: Fri Sep 2 23:15:17 2022 +0000 Avoid hitting the fused path in Linear for xla backend. (#84503) Fixes #84244 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84503 Approved by: https://github.com/JackCaoG, https://github.com/ezyang commit 3dfbf09afebc067f5ddea60f7db5cd2aa0b98f93 Author: lezcano Date: Fri Sep 2 12:19:03 2022 +0000 Optimise the decomposition for `adaptive_avg_pool2d` wrt. TorchInductor (#84483) This fixes some part of the implementation that did not work with TorchInductor (e.g. the indices in TorchInductor need to be `int64`s, while in PyTorch we can have `int32`s). It also brings the performance of the kernel up to numbers similar to those of the lowering (benchmarks below). Pull Request resolved: https://github.com/pytorch/pytorch/pull/84483 Approved by: https://github.com/jansel commit ab6c57217a97438c8e13952a407e42873e2259f3 Author: Masaki Kozuki Date: Fri Sep 2 21:57:45 2022 +0000 Add NCCL PreMul Sum to c10d `reduce` ops (#84243) This is based on #81272 but this conforms to the TorchScript compiler - [ ] Update https://github.com/pytorch/pytorch/blob/abaf8112e6d6bed2a5d33dcbc1d46ed20b8e80de/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp#L64-L73 to use `ReduceOp::RedOpType`. In my first try with `USE_SYSTEM_UCC=1`, this change wasn't necessary (I think) because of the `ReduceOp::RedOpType` operator. That being said, I want to make it more explicit. cc @ptrblck @kwen2501 @aazzolini cc @zasdfgbnm for visibility to the TODO above Pull Request resolved: https://github.com/pytorch/pytorch/pull/84243 Approved by: https://github.com/kwen2501 commit 0b363c5c5c1832820466b7768b353db121809018 Author: Natalia Gimelshein Date: Fri Sep 2 21:18:58 2022 +0000 don't synchronize single element any/all reductions (#84465) Fixes #84291 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84465 Approved by: https://github.com/ezyang commit 5ffda02388f1a1a3c83d8e6676ec1c7019c5ecd1 Author: Peter Bell Date: Fri Sep 2 18:49:30 2022 +0100 Fix alertCuBLASConfigNotDeterministic to respect warn_only=True (#84215) This cublas check would error even if the `warn_only=True` flag is passed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84215 Approved by: https://github.com/kurtamohler, https://github.com/albanD commit 65beff5acb0d7c0c484bd0558bcaf8ddc9c96aab Author: lezcano Date: Thu Sep 1 08:25:04 2022 +0000 Dispatch torch.norm to linalg.vector_norm and linalg.matrix_norm (#81761) `torch.norm` is very odd. Some notable issues are: - The default value of `"fro"` in `torch.norm` has an odd behaviour when `dim=None`. This is handled in the new dispatch - The treatment of the `dtype` argument in `torch.norm` was completely wrong.
This should fix it - Some `out=` variants in the previous implementation were also wrong. This should fix those. - This new dispatch should make some paths much faster. For example, `torch.norm(x)` where `x` is complex. I'll try to make the changes in these PRs as incremental as possible as this is a tricky one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81761 Approved by: https://github.com/ngimel commit 72f0f24a764e01a0af2c8c96394fa15db0b41a41 Author: Natalia Gimelshein Date: Fri Sep 2 18:08:39 2022 +0000 remove unneeded _to_copy meta (#84460) Fixes #84335 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84460 Approved by: https://github.com/Chillee commit 9b115c7bd32b4a516f253a217bc8ec47bd07c44d Author: Andrew M. James Date: Thu Sep 1 13:54:42 2022 -0500 Sparse Compressed Transpose add support for Batch dims and BSR/BSC layouts (#82122) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82122 Approved by: https://github.com/bhosmer commit 0192a34910c8873175380791b963517b18c44075 Author: Andrew M. James Date: Thu Sep 1 13:54:42 2022 -0500 Dense -> CSC support batch dimensions (#83086) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83086 Approved by: https://github.com/bhosmer, https://github.com/nikitaved commit a5a01e443ce1dd8e31ef7d0b3fd6a2359881a922 Author: Andrew M. James Date: Thu Sep 1 13:54:42 2022 -0500 Dense->BSR performance improvement (#83085) Applies the algorithm for re-batching compressed indices to avoid n-batch kernel launches. This is an optimization for `dim() >= 3` inputs and does not change behavior in any way. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83085 Approved by: https://github.com/bhosmer, https://github.com/nikitaved commit f0e5b7336410a24088069a7b620bfccc6372338a Author: Andrew M. James Date: Thu Sep 1 13:54:41 2022 -0500 Dense -> CSR support batch dimensions (#83084) Only requires changes to the dense->sparse pathway. The reverse already has support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83084 Approved by: https://github.com/bhosmer, https://github.com/nikitaved commit 2d969dc2ca9e3ccf0c87d5d45d9321228f51b865 Author: PyTorch MergeBot Date: Fri Sep 2 17:40:17 2022 +0000 Revert "Support a few corner cases for nvFuser executor (#84416)" This reverts commit 3db3845f5f20047d9a30f450d3936e4113975ae6. Reverted https://github.com/pytorch/pytorch/pull/84416 on behalf of https://github.com/malfet due to Broke both trunk and pull, see https://hud.pytorch.org/pytorch/pytorch/commit/3db3845f5f20047d9a30f450d3936e4113975ae6 commit f803fa9fc94ea7e744885926f654479e578850cf Author: Driss Guessous Date: Fri Sep 2 16:31:55 2022 +0000 [Nested Tensor] Add a NestedTensorUtils header and cpp file for organization (#84385) Trying to do some cleanup of the code structure for nested tensors. This introduces a utility header and cpp file that implements helper functions. This is the initial PR in a larger cleanup. The next would be separating out all the native functions that create nested tensors into their own file, since they do not in fact do math on nested tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84385 Approved by: https://github.com/mikaylagawarecki commit ae67099e88970b8fab140717d8251d9f5e9943b0 Author: Justin Chu Date: Fri Sep 2 15:15:30 2022 +0000 Fix type annotation in `_ConvNd` for in_channels (#84302) `_ConvNd` has an attribute `in_channels` that was mistakenly annotated as `_in_channels`.
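A quick hedged illustration (not code from the PR) of why the annotation name matters to static type checkers:
```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
channels: int = conv.in_channels  # mypy resolves this against the class annotation,
                                  # so the annotation must be spelled `in_channels`
print(channels)  # 3
```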
This fixes https://github.com/pytorch/pytorch/issues/84223 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84302 Approved by: https://github.com/albanD commit 3db3845f5f20047d9a30f450d3936e4113975ae6 Author: Ivan Yashchuk Date: Fri Sep 2 14:57:05 2022 +0000 Support a few corner cases for nvFuser executor (#84416) This PR adds asserts to the `nvfuser_execute` function for the cases that do not work. Fallback to eager is used in those cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84416 Approved by: https://github.com/jjsjann123, https://github.com/ngimel commit 0fd173b097f27b7dd190b25ae13075ba3bf25a5a Author: PyTorch MergeBot Date: Fri Sep 2 10:45:41 2022 +0000 Revert "Support a few corner cases for nvFuser executor (#84416)" This reverts commit 3ac9f6683dc8f17e030699da4df6c767f22939b6. Reverted https://github.com/pytorch/pytorch/pull/84416 on behalf of https://github.com/IvanYashchuk due to trunk CI is failing due to sneaked in print_tabular() call commit 3ac9f6683dc8f17e030699da4df6c767f22939b6 Author: Ivan Yashchuk Date: Fri Sep 2 06:42:39 2022 +0000 Support a few corner cases for nvFuser executor (#84416) This PR adds asserts to the `nvfuser_execute` function for the cases that do not work. Fallback to eager is used in those cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84416 Approved by: https://github.com/jjsjann123, https://github.com/ngimel commit cb4421b19cf7aa3a1a6cfcc7e0677f2b2ba0a9b6 Author: Huy Do Date: Fri Sep 2 05:12:55 2022 +0000 [Proof of Concept] Use labels to select the test configs to run (#83690) This is the proof-of-concept PR to support linux. Other platforms will follow in subsequent PRs. Per feedbacks from the team, I have changed the label to be `test-config/CONFIG`, for example `test-config/functorch` to make it clear that this is not `ciflow`. * The script maintains a set of valid test configs (shard names) including `default`, `functorch`, `dynamo`, etc. * If the PR has one or more labels as specified in the set, i.e. **test-config/functorch**, only these test configs will be selected. If the PR has both `test-config/functorch` and `ciflow/trunk`, both will be taken into account: **All functorch builds and tests in trunk will be run** * If the PR has none of the test-config label, all tests are run as usual. Basically, the CI workflow will be `filter (part of build) -> build -> filter -> test[filtered_matrix]`. The filter is applied twice before build and test because we want to get the latest labels from the PR right before the steps are run. This is mainly to avoid GHA static list of labels that is only populated at the time of the pull request event, for example, a new pull request will have no label, This PR has a bunch of random labels but it includes two important labels among them `test-config/functorch` and `test-config/dynamo`. The former was added before the CI started while the latter was added after (but before the test started). Only functorch and dynamo tests (multiple shards) were run. Also, I manage to find a way to hide the majority of skipped tests, so they won't clutter the signal box that much Pull Request resolved: https://github.com/pytorch/pytorch/pull/83690 Approved by: https://github.com/ZainRizvi commit 97b2dff60081e1092cfd6d1b3a80c995ff3d6148 Author: Elias Ellison Date: Thu Sep 1 23:37:55 2022 +0000 Add Initial Support For Fake Tensor Constant Tracking (#84387) Adds support for constant tensor tracking within FakeTensors. 
Copy-pasta'ing from `proxy_tensor.py` why this is useful: ``` ``` This PR only attempts to add support for the tracing scenarios where we run each operation linearly - aot autograd, torchdynamo. It does not yet handle how constant tensors should be handled as part of the persistent fx graph. Additionally, it does not yet attempt to de-duplicate or interact with ProxyMode's only constant tensor handling. Edit: plan is to rely on functionalization for fx graph Pull Request resolved: https://github.com/pytorch/pytorch/pull/84387 Approved by: https://github.com/ezyang commit 832ce5f8fad374ab1dd8bae16c28cd6004938ab3 Author: Zafar Date: Thu Sep 1 22:40:31 2022 +0000 Adding codeowners to quantization, sparsity, ns, etc. (#79505) The notifications for the AO-maintained codebase. This should not be blocking, just PR/test notifications. Pull Request resolved: https://github.com/pytorch/pytorch/pull/79505 Approved by: https://github.com/vkuzo, https://github.com/albanD commit f6ce2a442e8f88b39c11b07fb5c716f6ef4bd06d Author: Edward Z. Yang Date: Thu Sep 1 13:43:06 2022 -0700 Refactor PyInterpreter to use normal vtables (#84388) I realized that we can deal with the dead vtable problem by... introducing another indirection! The resulting code is worse (you have to do one more dereference to get to the vtable), but the reduction in boilerplate is, IMO, worth it. I did this refactor because I'm about to add a lot more methods to PyInterpreter to handle expunging SymInt from TensorImpl. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84388 Approved by: https://github.com/albanD commit 241c99232e67dfde18dd40bf821e453ab4c313b1 Author: Nikita Shulga Date: Thu Sep 1 23:56:05 2022 +0000 Fix typo (#84439) s/bionicl/bionic/ hattip to @kit1980 for reporting in https://github.com/pytorch/pytorch/pull/84314#discussion_r960099849 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84439 Approved by: https://github.com/seemethere, https://github.com/clee2000, https://github.com/atalman commit edec9698abde6207ca3a06718568807fe5c037dd Author: Sergii Dymchenko Date: Thu Sep 1 23:55:25 2022 +0000 Fix ScripModule typo (#84444) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84444 Approved by: https://github.com/malfet commit 375d6cd5b7075286f9d925341201cb2776e311a8 Author: PyTorch MergeBot Date: Thu Sep 1 23:42:48 2022 +0000 Revert "Move decompositions and helpers for jvp from functorch into core (#84358)" This reverts commit a3c60a4db464aa32b3217e45fdc9013ad6a535ae. Reverted https://github.com/pytorch/pytorch/pull/84358 on behalf of https://github.com/malfet due to Broke lint commit 6ef85dc99079b770d96e4cc87bdc5b047441e9a9 Author: updaun Date: Thu Sep 1 23:01:06 2022 +0000 Fix minor typo in rpc_test.py (#84431) This fixes a very minor typo in the `rpc_test.py` comments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84431 Approved by: https://github.com/mrshenli commit a65b88d516316695f3f930a0d39e5c25f0f38729 Author: Sergii Dymchenko Date: Thu Sep 1 22:57:50 2022 +0000 Import forgotten pack_weight_bias in rnn.py (#84315) `pack_weight_bias` is exported in `__all__`, but the actual import was lot during migration in https://github.com/pytorch/pytorch/pull/78714. 
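A self-contained sketch of the failure mode described above (hypothetical module, not the actual rnn.py):
```python
import types

# A module advertises a name in __all__, but the import that should
# provide that name was dropped during a refactor.
rnn = types.ModuleType("rnn_sketch")
rnn.__all__ = ["pack_weight_bias"]   # exported in __all__ ...
# rnn.pack_weight_bias = ...         # ... but never actually bound

try:
    rnn.pack_weight_bias
except AttributeError as err:
    print(err)  # module 'rnn_sketch' has no attribute 'pack_weight_bias'
```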
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84315 Approved by: https://github.com/seemethere, https://github.com/malfet commit 73cb6cf8ae355417e0e9b6b9614492b280f66ae7 Author: drisspg Date: Thu Sep 1 22:50:59 2022 +0000 Fixing back invariant on offsets (#84433) I changed the calculation of offsets to add an extra element for bounding above. This invariant makes sense in the contiguous case, but when ntensor[i] is sliced like in this PR: #83736, it doesn't make semantic sense anymore. So changing back. Borderline stampy Pull Request resolved: https://github.com/pytorch/pytorch/pull/84433 Approved by: https://github.com/mikaylagawarecki commit a3c60a4db464aa32b3217e45fdc9013ad6a535ae Author: soulitzer Date: Thu Sep 1 15:26:23 2022 -0400 Move decompositions and helpers for jvp from functorch into core (#84358) This refactor shouldn't change any behavior. At this point functorch still relies on the mechanism in DynamicLayerFront; we just moved some parts of it into core. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84358 Approved by: https://github.com/samdow commit eaab653376da76cd3038b7f2bed37b03e2048522 Author: Ian Graves Date: Thu Sep 1 22:38:59 2022 +0000 Read via FileAdapter when loading files in torch if not flatbuffer - Part 2 (#84296) Summary: D38998858 (https://github.com/pytorch/pytorch/commit/3fae89d4a468a02be501357eb123ce2bf7086d2f) used the wrong version of `_load_for_mobile` that kept the "load everything in memory then parse" technique. This fixes it to call the `_load_for_mobile_impl` version, which for non-flatbuffer models will stream parse. See D38998858 (https://github.com/pytorch/pytorch/commit/3fae89d4a468a02be501357eb123ce2bf7086d2f) for the expected memory optimization gains. Test Plan: CI Signals. Reviewed By: qihqi Differential Revision: D39138280 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84296 Approved by: https://github.com/qihqi commit a563a4880fe577e986b63a288bb8bf00a1fb7618 Author: Linbin Yu Date: Thu Sep 1 22:32:55 2022 +0000 [Edge] Add an option to avoid adding base ops to static op library (#84360) Summary: We use a static op library in a test for PyTorch C++ usages, but don't want to introduce all base ops, because the goal is to check if a given model can run on the exact op collection (i.e., fbios ops, fb4a ops), and these base ops are not present in real apps. So add an option to disable this feature. Test Plan: Build. Expect no change to existing targets. Differential Revision: D39164021 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84360 Approved by: https://github.com/kit1980 commit ff56f1c30d2b4ad3a018b8f0c9fee1ffcb06ca4f Author: Yu, Guangye Date: Thu Sep 1 22:22:25 2022 +0000 Define the SYCL device version assertion used in other backends, like XPU (#84106) We need a device version assertion that can be used in SYCL kernels. SYCL_KERNEL_ASSERT will be used in kernels launched on the XPU device. We add a macro SYCL_KERNEL_ASSERT via the __assert_fail declaration for Linux and the _wassert declaration for Windows, even though NDEBUG is enabled. `__assert_fail` in a SYCL kernel: `extern SYCL_EXTERNAL void __assert_fail(const char *expr, const char *file, unsigned int line, const char *func);` `_wassert` in a SYCL kernel: `extern SYCL_EXTERNAL void _wassert(const wchar_t *wexpr, const wchar_t *wfile, unsigned line);` No additional unit test because this change does not affect PyTorch's functionality. It only affects assertions in kernels on the XPU backend.
So it is difficult to add a unit test for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84106 Approved by: https://github.com/malfet commit 1463c6f3de1bb2113fd22ab9b1bddd2b3a84355d Author: Huy Do Date: Thu Sep 1 22:18:07 2022 +0000 Increase distributed shards (#84430) Per title, increase from 2 to 3 shards. With 2 shards, the test time was about 1.7 hours as shown in [HUD](https://hud.pytorch.org/tts/pytorch/pytorch/master?jobName=pull%20%2F%20linux-bionic-cuda11.6-py3.10-gcc7%20%2F%20test%20(distributed%2C%201%2C%202%2C%20linux.8xlarge.nvidia.gpu)) With 3 shards, the time drops to about 1.1 hours: * 1st shard: https://github.com/pytorch/pytorch/runs/8141516281 (1h16m) * 2nd shard: https://github.com/pytorch/pytorch/runs/8141516449 (59m) * 3rd shard: https://github.com/pytorch/pytorch/runs/8141516593 (1h3m) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84430 Approved by: https://github.com/clee2000 commit ce1b727e774c75f8e31b28ff5915851385c70dcf Author: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com> Date: Thu Sep 1 21:34:51 2022 +0000 Disable autocast cache in torch.cuda.make_graphed_callables (#84289) There are conflicts between `torch.clear_autocast_cache()` and `cudaMallocAsync` from #82682. Moreover, the use of autocast caching is not reasonable during training, which is the main target of `make_graphed_callables`. cc @eqy @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/84289 Approved by: https://github.com/ngimel commit d39490a711f6d5119444d76d1d2e337e0213beea Author: Edward Z. Yang Date: Thu Sep 1 07:08:03 2022 -0700 Add meta function for repeat (#84349) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84349 Approved by: https://github.com/Krovatkin commit 0fb1495512852fd12f77c6bfb7bf9b86013c8caa Author: Paul Saab Date: Thu Sep 1 20:26:35 2022 +0000 [aarch64] Fix ATen-cpu aarch64 builds (#84294) Summary: Fix ATen-cpu aarch64 builds and hook up cpukernel_neon Test Plan: CI Differential Revision: D39142670 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84294 Approved by: https://github.com/ajtulloch commit 5e5c610a587d671044303c4fa56af20f33eee5dd Author: Andrey Talman Date: Thu Sep 1 20:24:06 2022 +0000 Move slow-grad checks to CUDA-11.6 (#84313) Mitigates #84192 by skipping two tests. Please note: We tried to increase the tolerance for test_fn_gradgrad_linalg_det_singular_cuda_float64 but this did not help. Ref: Increase `test_fn_gradgrad_linalg_det_singular_cuda_float64` error tolerance to 1e-4 as suggested in https://github.com/pytorch/pytorch/issues/84192#issuecomment-1230644574 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84313 Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/Lezcano commit 673b35c847ee6ba67367ba27ff8597c8ae382257 Author: YifanShenSZ Date: Thu Sep 1 20:01:39 2022 +0000 Better reshape with autograd support (#82754) (#84154) The original author is @YifanShenSZ and the original PR is: #82754 Previous reshape [https://github.com/pytorch/pytorch/issues/80981](https://github.com/pytorch/pytorch/pull/80981) is ok for forward, but needs improvement for backward: we need to handle the "sometimes view, sometimes copy" behavior. This pull request fixes it by: 1. add a new alias dispatch key `CompositeImplicitAutogradNestedTensor`, which ideally would work as a nested-tensor version of `CompositeImplicitAutograd` 2.
register `reshape_nested` to `reshape` by `CompositeImplicitAutogradNestedTensor` Side changes: * add contiguous memory format support to `clone_nested` * add `view_nested` * add `reshape_as_nested` Fix issue [https://github.com/pytorch/pytorch/issues/83041](https://github.com/pytorch/pytorch/issues/83041) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82754 Test Plan: Imported from GitHub, without a `Test Plan:` line. **Static Docs Preview: executorch** |[Full Site](https://our.intern.facebook.com/intern/staticdocs/eph/D39023822/V13/executorch/)| |**Modified Pages**| Reviewed By: albanD Differential Revision: D39023822 Pulled By: drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/84154 Approved by: https://github.com/bdhirsh, https://github.com/albanD commit 9bcad063d8b5253ca5b3013735d3ad0cb3f7e3cb Author: Catherine Lee Date: Thu Sep 1 19:53:36 2022 +0000 disable ios on circleci b/c failing (#84438) reenable when fixed cause is likely: https://status.circleci.com/incidents/lbhyrt87g89r examples of failures: https://app.circleci.com/pipelines/github/pytorch/pytorch/559778/workflows/e17e6b96-649e-4e49-b9f1-c0b1ecd96e02/jobs/17073870 something related to ssh started around 12 hours ago? Pull Request resolved: https://github.com/pytorch/pytorch/pull/84438 Approved by: https://github.com/ZainRizvi commit 88802719b699ce75f1be7818293c76748311a79b Author: Andrew Gu Date: Thu Sep 1 16:59:34 2022 +0000 [FSDP][Easy] Move utils to `_utils.py` (#84212) I pulled this out into a separate PR. This just moves some utility functions to `_utils.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84212 Approved by: https://github.com/rohan-varma commit e71370064c1a475e9179ba8dc05834fefe51413b Author: Qiming Lu Date: Thu Sep 1 18:39:26 2022 +0000 Improvements to FX Minimizer (#83833) Summary: This diff improves the FX Minimizer for better error reports, and fixes a few other issues. Test Plan: CI Differential Revision: D38900309 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83833 Approved by: https://github.com/yuhc, https://github.com/Chillee commit dd82b31e552d4da255bb36266681a0400367314a Author: Angela Yi Date: Wed Aug 31 16:03:47 2022 -0700 [fx] Add metadata to fx.GraphModule (#84378) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84378 Approved by: https://github.com/SherlockNoMad commit 8b578849b4bce1e6ad012d659e1aced04fb2bdc3 Author: PyTorch MergeBot Date: Thu Sep 1 18:34:57 2022 +0000 Revert "[Profiler][Trivial] Create orchestration folder and move observer management there. (#83893)" This reverts commit 48a596ad3f2ca617cd2fafc3fa3c368f5600930a. Reverted https://github.com/pytorch/pytorch/pull/83893 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit 5a73a0291d28f7d510756d8eab4fc942a0455ba8 Author: Michael Andreas Dagitses Date: Thu Sep 1 04:05:25 2022 -0700 re-enable ATen packedtensoraccessor_test (#84397) Summary: Test Plan: Rely on CI. 
Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/84397 Approved by: https://github.com/malfet commit fd756caa3633cf4bc0bbcdd5db77683cf18e5eaf Author: BowenBao Date: Thu Sep 1 18:29:41 2022 +0000 [ONNX] Support nn.init.normal (#84149) * Updated symbolic function for `aten::normal` to support additional generator arguments emitted from https://github.com/pytorch/pytorch/blob/5563248b5882231cb99105b042cc32bddd18b912/torch/csrc/jit/passes/remove_mutation.cpp#L51 * Added symbolic function for `aten::is_pinned` and `prim::layout`. Both are unused by ONNX later on. Fixes #83647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84149 Approved by: https://github.com/AllenTiTaiWang, https://github.com/abock commit 5d39e8de572c1ae426a762b7f1b71a4bb064e85c Author: samdow Date: Tue Aug 30 11:03:30 2022 -0400 add matrix rank op info tests with non-default kwargs (#84074) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84074 Approved by: https://github.com/zou3519 commit 041edeeecb75f3c110605d7311fa46abe1c62ea9 Author: SmirnovKol <31559413+OccupyMars2025@users.noreply.github.com> Date: Thu Sep 1 17:56:50 2022 +0000 Fix several typos (#83823) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/83823 Approved by: https://github.com/ngimel, https://github.com/kit1980 commit 7a348a1d4aa2dcea4d78a4cd4f772155fce38012 Author: Rodrigo Kumpera Date: Thu Sep 1 17:54:10 2022 +0000 Fix internal breakage caused by #82134 (#84363) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84363 Approved by: https://github.com/rohan-varma, https://github.com/mehtanirav commit 7ffa10036c846a3d4148bb3deed8b77ff506a9cc Author: PyTorch MergeBot Date: Thu Sep 1 17:47:52 2022 +0000 Revert "[Profiler] Unify global and thread local profiler lookup. (#83894)" This reverts commit c06a5586f57c844fdc4a98e52f88e71f64dd54d2. Reverted https://github.com/pytorch/pytorch/pull/83894 on behalf of https://github.com/mehtanirav due to [Internal breakages](https://www.internalfb.com/intern/sandcastle/job/13510799644553996/artifact/runsandcastle?selectedLines=990-990-7-65) commit 6dc9223c8bb107fc9794d867a0ec8cdcff89382b Author: Andrew M. James Date: Wed Aug 31 15:25:08 2022 -0500 Sparse_coo: Be more agressive in setting coalesced True to avoid suprising behaviors (#82426) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82426 Approved by: https://github.com/pearu, https://github.com/bhosmer commit 2e0f5bce3917ba42ac106101b21e20d99d067928 Author: PyTorch MergeBot Date: Thu Sep 1 17:23:21 2022 +0000 Revert "Fix several typos (#83823)" This reverts commit f9609d82038897ac560b408808e9dba9f39bc922. 
Reverted https://github.com/pytorch/pytorch/pull/83823 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit bf62ece5364486385bdabc43c72b7681e213057e Author: Max Podkorytov Date: Thu Sep 1 17:21:22 2022 +0000 [static-runtime] add schema checks to most of the ops where these checks are missing (#84163) Test Plan: existing unit tests; also fix some failing ones along the way Differential Revision: D39074902 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84163 Approved by: https://github.com/mikeiovine commit d648375f13b4a4efd4cd35247098679fce5d4bcd Author: Kevin Tse Date: Thu Sep 1 15:12:37 2022 +0000 [GHF] Changing the ordering in merge rules to allow more appropriate messages to be raised first (#84359) Changing the ordering in merge rules to allow more appropriate messages to be raised first. Context: [#84279](https://github.com/pytorch/pytorch/pull/84279#issuecomment-1233130498) @janeyx99: "Approving to unblock, but modifying the merge rules to move the Core maintainers rule to last would be a good idea." Pull Request resolved: https://github.com/pytorch/pytorch/pull/84359 Approved by: https://github.com/janeyx99, https://github.com/ZainRizvi, https://github.com/malfet commit bfdfeecd151fde72b05cc96113999d4049485673 Author: soulitzer Date: Wed Aug 31 17:53:32 2022 -0400 Add per-op MPS gradient tests and update skips (#84242) Follow up: - ~Remove non-float dtypes from allow-list for gradients~ - ~Map dtypes to short-hand so there aren't so many lines, i.e. float16 should be f16.~ - ~There were a lot of linting issues that flake8 wouldn't format for me, so I reformatted with black. This makes the diff a little trickier to parse.~ Observations: - there are entries in the allow-list that weren't there before - some forward that we previously passing now fail with requires_grad=True - Because the allow list does not know about variants, a special skip was added for that in the block list Pull Request resolved: https://github.com/pytorch/pytorch/pull/84242 Approved by: https://github.com/kulinseth, https://github.com/malfet commit f1ee162193102464d92140edb84c3a99012ad0cb Author: Edward Z. Yang Date: Thu Sep 1 07:10:00 2022 -0700 Use SymInt signature to compute saved variables (#84354) This seems to have been accidentally working, but it broke when I added support for saving optional SymInt directly from input arguments. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84354 Approved by: https://github.com/Krovatkin commit 5e2c23377a0ea6410c8e6a624b1cc516af19f63b Author: Edward Z. Yang Date: Thu Sep 1 07:10:46 2022 -0700 LTC codegen appears to be hardcoded to only support tensors (#84355) Assert accordingly Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84355 Approved by: https://github.com/wconstab commit 7d9e54673881501e3a2b165fe3d703d2898350fd Author: breidct <51497916+breidct@users.noreply.github.com> Date: Thu Sep 1 16:16:45 2022 +0000 Replace assertEqualIgnoreTypes in common_nn.py (#84210) See #38095 Replaced all instances of assertEqualIgnoreTypes in common_nn.py with assertEqual Pull Request resolved: https://github.com/pytorch/pytorch/pull/84210 Approved by: https://github.com/kit1980 commit 5cfe76938735a7cae06f8fa8cd1ab3962fbe384f Author: Nikita Karetnikov Date: Thu Sep 1 16:14:31 2022 +0000 [primTorch] Add refs for `reshape_as`, `view_as`, unify tests (#84222) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84222 Approved by: https://github.com/Lezcano, https://github.com/ngimel commit 8778f337442ab7ad512d20c3a9028df59380c6f0 Author: Andrew M. James Date: Wed Aug 31 09:58:57 2022 -0500 Dense <-> bsc conversions (#80781) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80781 Approved by: https://github.com/bhosmer, https://github.com/nikitaved commit 0909639c9045e9f9435165778319fdb59728baa6 Author: Yu, Guangye Date: Thu Sep 1 11:53:32 2022 +0000 fix dispatch declaration bug about quantized op (#83649) Fixes issue #83051. _fake_quantize_learnable_per_tensor_affine_backward and _fake_quantize_learnable_per_channel_affine_backward are implemented for CPU and CUDA. Currently, these two are in the CompositeImplicitAutograd category. If this issue is not fixed, we need to provide their autograd functions when we want to register a new backend. It doesn't make sense to implement autograd functions for them since they are all backward operators implemented directly with TensorIterators. Add a dispatch keyword in aten/src/ATen/native/native_functions.yaml and explicitly dispatch the operators to CPU and CUDA, like this: ` dispatch:` ` CPU, CUDA: _fake_quantize_learnable_per_tensor_affine_backward` No additional unit test because this change does not affect PyTorch's functionality. It only affects registration on other backends, like XPU. So it is difficult to add a unit test for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83649 Approved by: https://github.com/jerryzh168 commit 70ef06cc1913e1d9c333819b222152e6abc5b870 Author: Michael Andreas Dagitses Date: Wed Aug 31 21:42:05 2022 -0700 fix and enable ATen ExclusivelyOwned_test (#84395) Summary: This depends on caffe2 so it must move to that section. Test Plan: Rely on CI. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/84395 Approved by: https://github.com/DanilBaibak commit 521d1071f881c14a8f49bdc1aff984a0e7928294 Author: Zafar Date: Thu Sep 1 11:35:01 2022 +0000 [quant] Subpackage import in nn.quantized (#84141) Some of the subpackages were not included in 'torch.nn.quantized'. That would cause some specific cases to fail. For example, `from torch.nn.quantized import dynamic` would work, but `import torch; torch.nn.quantized.dynamic` would fail. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/84141 Approved by: https://github.com/andrewor14 commit 546e5fa0c5df42ad83f336a77f5b7cb9ab40e16f Author: Michael Andreas Dagitses Date: Wed Aug 31 21:13:08 2022 -0700 register skipped ATen tests in CMake (#84345) Summary: These tests were not being built or executed as part of CI. Test Plan: Rely on CI.
Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/84345 Approved by: https://github.com/kit1980 commit 65e887c041943bf5d1ae2c515cc7a89e3b89b588 Author: Ivan Yashchuk Date: Thu Sep 1 07:18:42 2022 +0000 Remove unnecessary copy from torch._refs.to, add OpInfo for torch.Tensor.to (#84270) This PR removes unnecessary copy from `torch._refs.to`, adds OpInfo for `torch.Tensor.to`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84270 Approved by: https://github.com/ngimel commit 90d6112a948644dac77120cfcf1de9ac5566ab79 Author: Huy Do Date: Thu Sep 1 03:48:52 2022 +0000 Test distributed backends in parallel (#84034) This allows multiple backends (nccl, gloo) to be tested in parallel and speed up the process. The improvement is mainly in the 1st distributed CUDA shard where the long pole `distributed/test_distributed_spawn` test is executed: * [linux-bionic-cuda11.6-py3.10-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/runs/8007596825?check_suite_focus=true#logs) takes 1h24m. This is better than the current average expectation of 2h12m On the other hand, there is no improvement for the following two jobs: * [linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge)](https://github.com/pytorch/pytorch/runs/8007417353?check_suite_focus=true#logs) takes 1h47m * [linux-bionic-cuda11.6-py3.10-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/runs/8007596870?check_suite_focus=true#logs) takes 1h40m This is still a gain though because it allows us to add more shards for distributed test if needed. Issue https://github.com/pytorch/pytorch/issues/83694 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84034 Approved by: https://github.com/wanchaol commit 693ed8b14777d1515c18653f5f8f28a602898662 Author: Howard Huang Date: Wed Aug 31 11:42:08 2022 -0700 [1/N] [Dispatchable Collectives] Create Backend class (#83679) - Create a new Backend class which contains collectives similar to that of https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/ProcessGroup.hpp. In future PRs, the existing ProcessGroupNCCL/Gloo/UCC will be migrated to derive from this Backend class. The idea is that we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. Differential Revision: [D38839213](https://our.internmc.facebook.com/intern/diff/D38839213) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83679 Approved by: https://github.com/kwen2501 commit ece0002c4beaebaf083dc75b7bf8ceb19edf7a0b Author: titaiwang Date: Wed Aug 31 21:40:27 2022 +0000 [ONNX] Disable autocast cache in exporter (#84219) This PR provides a temporary fix on #84092 in exporter to avoid more cases falling into this bug. A long-term fix will be provided later. A simple repro with torch.onnx.export is still under investigation, as torch.jit.trace() is not the API we call inside torch.onnx.export, and it may introduce the difference. Therefore, a test case is provided here only. 
A specific test one can use:
```python
import torch
import onnxruntime
from onnxruntime.training.ortmodule import DebugOptions, LogLevel
from onnxruntime.training.ortmodule import ORTModule

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.cv1 = torch.nn.Conv2d(3, 3, 5, 2, 1)

    def forward(self, x):
        x = self.cv1(x)
        return x

x = torch.randn(10, 3, 20, 20) * 2
m = MyModule().eval()
x = x.cuda()
m = m.cuda()
debug_options = DebugOptions(log_level=LogLevel.VERBOSE, save_onnx=True, onnx_prefix="ViT-B")
m = ORTModule(m, debug_options=debug_options)
with torch.cuda.amp.autocast(dtype=torch.float16, cache_enabled=True):
    loss = m(x)
```
AND make the assertion fail in ORTModule at https://github.com/microsoft/onnxruntime/blob/17ccd6fa02877a1c8d3201344137b1ca105b681d/orttraining/orttraining/python/training/ortmodule/_io.py#L578-L581 Without the fix, the user will see the weight/bias of the Conv node become constant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84219 Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi commit 18264432f7f9b7545e7d494b1e9391883fc8ab60 Author: titaiwang Date: Wed Aug 31 21:39:08 2022 +0000 [ONNX] replace all _C._flatten to torch.jit._flatten (#83598) _C._flatten is exactly the same as torch.jit._flatten. Unifying them to reduce confusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83598 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit f701cb04fbc864f5eb9e928c16bae24f006cfd5d Author: Elias Ellison Date: Wed Aug 31 13:40:04 2022 -0700 Test Dynamo CI w Fake Tensors (#84282) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84282 Approved by: https://github.com/anijain2305 commit ef3ab31f1c57b357a23f729f8d986432185ebaa4 Author: Sherlock Huang Date: Wed Aug 31 21:22:17 2022 +0000 Decomp for aten.im2col (#84303) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84303 Approved by: https://github.com/jansel, https://github.com/ngimel commit cd96f3f6769af7b01a3b50e0d19d9fc0ea015346 Author: Edward Z. Yang Date: Wed Aug 31 14:14:44 2022 -0700 Use register_meta for everything in meta_registrations (#84297) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84297 Approved by: https://github.com/Chillee commit 305c6a6c35ace740ca000851ad908714daad4b7a Author: Chien-Chin Huang Date: Wed Aug 31 09:20:45 2022 -0700 [FSDP] Fix the FQN not found issue for load sharded_state_dict when using activation checkpoint (#84253) The current sharded_state_dict load will fail if activation checkpoint is also enabled. This PR fixes the issue.
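A rough usage sketch of the scenario fixed here; the model construction and checkpoint wrapping are placeholders, and only the `state_dict_type` context manager usage is taken from the FSDP API:
```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

# Assumes the default process group is already initialized and that the
# model's submodules are wrapped with activation checkpointing before
# FSDP wrapping; build_model() is a placeholder.
model = FSDP(build_model().cuda())

# Save and re-load a sharded state dict. Before the fix, the load path
# could fail to match FQNs that carry the checkpoint-wrapper prefix.
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    sharded_sd = model.state_dict()
    model.load_state_dict(sharded_sd)
```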
Differential Revision: [D39125431](https://our.internmc.facebook.com/intern/diff/D39125431/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84253 Approved by: https://github.com/awgu commit e8885a872c5a444711bb75aaf4b3a792fe674057 Author: Nikita Shulga Date: Wed Aug 31 23:02:42 2022 +0000 [CI] Move bazel from 11.3 to 11.6 (#84314) In process of doing so have to: - Delete `/usr/local/cuda-11.6/cuda-11.6` symlink to self, otherwise Bazel builds fail with ``` ERROR: circular symlinks detected [start of symlink cycle] /usr/local/cuda-11.6/cuda-11.6 [end of symlink cycle] ``` - Add `-DCUB_WRAPPED_NAMESPACE=at_cuda_detail"` to `COMMON_COPTS` if building with CUDA, to mimic the behaviour in https://github.com/pytorch/pytorch/blob/4b8ae047881314580826113f8a224f3fd935b203/cmake/Dependencies.cmake#L1664-L1668 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84314 Approved by: https://github.com/ngimel, https://github.com/atalman commit fddfc4488afb207971c54ad4bf58130fdc8a4dc5 Author: Zain Rizvi Date: Wed Aug 31 22:44:14 2022 +0000 Further improve mergebot messages (#84283) Reword the rejection reasons to better match the format mergebot uses to output the message, and repoints the workflow links to point to the commit page in hud instead of github **Context:** Some of the mergebot messages looked a bit weird. For example, it would claim to be offering a reason for a merge failing, but instead the message would be of a more diagnostic nature. Example of a weird message ("view failures on hud" is not a reason!): image The above message would now look like: image Pull Request resolved: https://github.com/pytorch/pytorch/pull/84283 Approved by: https://github.com/huydhn commit c585e149e2d7d6fbc460a0ff0324bdc189246578 Author: Slava Kovalevskyi Date: Wed Aug 31 21:48:39 2022 +0000 Process for maintaining Build + CI contributors list (#83869) The following issues are fixed: * process of adding new contributors to the "Build + CI" module added * folks who qualified are explicitly added Pull Request resolved: https://github.com/pytorch/pytorch/pull/83869 Approved by: https://github.com/svekars, https://github.com/seemethere, https://github.com/malfet commit 4b8ae047881314580826113f8a224f3fd935b203 Author: Nikita Shulga Date: Wed Aug 31 19:59:31 2022 +0000 [BE] Delete torch._dl extension (#84361) And lots of complexity around the availability of RTLD_GLOBAL flags in `os` module As this flag is always present since Python-3.3, see https://docs.python.org/3/library/os.html#os.RTLD_GLOBAL Fixes https://github.com/pytorch/pytorch/issues/84351 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84361 Approved by: https://github.com/kit1980 commit cfb9d0d23314fd28be118b6ca280ded55364e71c Author: Kevin Tse Date: Wed Aug 31 17:18:07 2022 +0000 [DataPipe] Fixing `map` function signature validation (#84279) As @pmeier [points out](https://github.com/pytorch/pytorch/pull/80267#discussion_r958423241), #80267 introduces a bug where an exception is thrown when a built-in function (or a function implemented in C) is used with `.map` because `inspect.signature(fn)` cannot find the function's signature. This PR skips over a function when its signature cannot be found. I believe this case is rare, and if the `fn` is truly incompatible with the usage of `input_col`/`output_col`, an exception will be raised at run time such that users will be able to examine what is wrong. 
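A small illustration of the guard described above: `inspect.signature` cannot introspect many C-implemented callables, so the validation simply has to be skipped for them (sketch only, not the actual DataPipe code):
```python
import inspect

def try_get_signature(fn):
    """Return fn's signature, or None when it cannot be inspected."""
    try:
        return inspect.signature(fn)
    except ValueError:
        # Builtins without __text_signature__ (e.g. min) and many
        # C-implemented functions raise ValueError here.
        return None

print(try_get_signature(lambda x: x + 1))  # (x)
print(try_get_signature(min))              # None -> skip validation
```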
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84279 Approved by: https://github.com/pmeier, https://github.com/janeyx99 commit 744019ece76aef07c38e64dcb53a9801c5b51d49 Author: Salil Desai Date: Wed Aug 31 19:47:57 2022 +0000 [AIBench] Pass Vulkan Profiling Data to Kineto Profiler in lite_predictor_benchmark (#84185) Summary: This lets us more easily analyze operator-level performance of models run with Vulkan Test Plan: Generated chrometrace with vulkan events recorded Reviewed By: kimishpatel Differential Revision: D38280587 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84185 Approved by: https://github.com/SS-JIA commit a0ccfe08477486e6adc536d29e7acdc53e13899b Author: Huy Do Date: Wed Aug 31 19:29:25 2022 +0000 Temporary fix to not fail concurrent viable/strict updates (#84324) Until we have a solution for https://github.com/pytorch/pytorch/issues/83986 and can use our runner for the job, we need to live with the fact that GitHub runner can have a pretty long queue throughout the day. This keeps trunk green in the meantime. This's a follow-up of https://github.com/pytorch/pytorch/pull/84249 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84324 Approved by: https://github.com/zengk95 commit 84ceebebf9d232a7f5e17012402e195afaf57129 Author: Andrew Gu Date: Wed Aug 31 15:55:02 2022 +0000 [FSDP] ufmt `flat_param.py`, `flatten_params_wrapper.py` (#83664) I think we can move FSDP code to start using ufmt (https://ufmt.omnilib.dev/en/stable/) to unify formatting across developers. ufmt is the recommended formatter for PyTorch's Python code. If we have consensus, I can ufmt all of the FSDP code in follow-ups. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83664 Approved by: https://github.com/rohan-varma commit 040263d7dc7bd1e4e620bd1889717890b1bf9b30 Author: Michael Andreas Dagitses Date: Wed Aug 31 05:28:35 2022 -0700 sort ATen tests in CMake (#84344) Summary: This will make it easier to compare and spot missing files. Test Plan: Rely on CI. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/84344 Approved by: https://github.com/malfet commit 65f98eb47dbf75335d08f7676835a5e1f1fc3574 Author: PyTorch MergeBot Date: Wed Aug 31 18:27:58 2022 +0000 Revert "Add meta function for repeat (#84349)" This reverts commit 44bc6db8f88faf1b7543e825f1282140b9efa504. Reverted https://github.com/pytorch/pytorch/pull/84349 on behalf of https://github.com/janeyx99 due to Land race with the revert causing test_fx failures https://hud.pytorch.org/pytorch/pytorch/commit/44bc6db8f88faf1b7543e825f1282140b9efa504 commit 6efadf7e7e6655b543b5a9819b6e2eac2d76f09c Author: Jeff Daily Date: Wed Aug 31 18:26:22 2022 +0000 [ROCm] guard ROCm-only files in NVFUSER_RUNTIME_FILES (#84312) Addresses comment in #82498 as a follow-up PR. https://github.com/pytorch/pytorch/pull/82498#discussion_r958745967 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84312 Approved by: https://github.com/jjsjann123 commit 762890d11ef4a11a2cb1eac2f61b2805328fad72 Author: Andrew Gu Date: Wed Aug 31 15:55:02 2022 +0000 [FSDP] Retire `self.device_id`; clean up ctor (#83663) This PR retires `self.device_id` by coalescing it with `self.compute_device` and more generally cleans up the FSDP constructor. 1. Compute the ignored parameters/modules from `ignored_modules` and the buffer names (to avoid cloning in `state_dict()`) 2. Recursively auto wrap if needed 5. Define process group attributes 6. Determine `device_id` 7. 
Materialize the wrapped module if using meta device or `torchdistX` deferred initialization 8. Move the module if needed (based on `self.device_id`) 9. Determine `compute_device` 10. Define `training_state`, gradient divide factors, FSDP feature-related attributes (`cpu_offload`, `forward_prefetch`, `backward_prefetch`, `sharding_strategy`, `mixed_precision`), `_orig_buffer_dtypes` 11. Determine the parameters to flatten 12. Sync module states if `sync_module_states` 13. Initialize the `FlattenParamsWrapper` with the parameters to flatten and the wrapped module, which constructs the `FlatParameter` 14. Shard the `FlatParameter` (in-place) 15. Define `_is_root`, shared attributes (`_streams`, `_fsdp_graph_order`), prefetching attributes (`_my_fsdp_idx_in_graph`, `_pre_backward_hook_full_params_prefetched`, `_forward_full_params_prefetched`), `reshard_after_forward` -- all of this is done in `_reset_lazy_init()` 16. Define `_require_backward_grad_sync` to configure `no_sync()` 17. Define state dict attributes (`_state_dict_type`, `_state_dict_config`) and register state dict hooks 18. Define backward pass flags (`_pre_backward_hook_has_run`, `_need_rebuild_full_params`) 19. Move `FlatParameter`s to CPU if `cpu_offload.offload_params` 20. Define `_exec_order_data` for execution order validation 21. Define communication hook attributes (`communication_hook`, `communication_hook_state`, `_hook_registered`) - `self.mixed_precision` - **Before:** `self.mixed_precision` itself could be `None`. Equivalently, `self.mixed_precision` could be `MixedPrecision(None, None, None)`. Both would disable mixed precision completely. - **After:** `self.mixed_precision` itself is never `None`. We only have `MixedPrecision(None, None, None)` (default construction of the `dataclass`) to disable mixed precision. This catches the issue that for `test_summon_full_params.py`, we were passing `MixedPrecision(None, None, None)` when we wanted to actually enable mixed precision. - `cpu_offload.offload_params=True` + `device_id` - **Before:** For nested FSDP and `device_id` specified, `FlatParameter`s already offloaded to CPU are moved back to GPU and not re-offloaded to CPU. - **After:** The nested `FlatParameter`s are re-offloaded to CPU. This is a temporary hack. The ideal solution removes the `module = module.to()` in the first place and only moves the relevant parameters. Because the `module.to()` implementation has some complexity, I did not want to remove that call in this PR. - `device_id` and `compute_device` - **Before:** `self.device_id` is either `None` or equal to `self.compute_device`. `self.device_id` is not used after the FSDP constructor. - **After:** `self.device_id` is removed and instead coalesced with `self.compute_device`. The only semantic change is that `test_module_device_mismatches_device_id()` errors earlier (but importantly, still errors). - This PR also uses a helper method `_get_orig_params()`, which is more robust and may avoid issues like https://github.com/pytorch/pytorch/issues/82891 without having to gate higher-level logic. - `_reset_lazy_init()` attributes - **Before:** Some attributes were being _defined_ in `_reset_lazy_init()` (which may not be obvious to all devs). - **After:** For this PR, we define these attributes in the constructor but leave `_reset_lazy_init()` as is. In the follow-ups, this gets further refactored. - Otherwise, I simply moved some logic into their own methods and reorganized the attribute definitions to be grouped logically. 1. 
What should the specification be for `device_id` + `ignored_modules`? 2. Investigate removing the `module = module.to()` in favor of moving per parameter. 3. Should we call `_reset_lazy_init()` in `register_comm_hook()`? Pull Request resolved: https://github.com/pytorch/pytorch/pull/83663 Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma commit 85931eaa6beab53d138f873d3505aee34e98ee89 Author: Horace He Date: Wed Aug 31 07:53:03 2022 +0000 Rename fake_result to val (#84331) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84331 Approved by: https://github.com/ezyang commit 85b889fa5f1a478e1f15183008f01c56537f10d7 Author: Nikita Karetnikov Date: Tue Aug 30 18:59:08 2022 +0200 [primTorch] Add ref for `poisson_nll_loss` (#83805) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83805 Approved by: https://github.com/Lezcano, https://github.com/ngimel commit 71ce9cd0726394a5e1dd85e1d7430776a4d05a82 Author: Nikita Karetnikov Date: Tue Aug 30 18:59:07 2022 +0200 [primTorch] Add decomp for `soft_margin_loss` (#83804) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83804 Approved by: https://github.com/Lezcano, https://github.com/ngimel commit 305af90d0f78171fef0c9d36078794b3b4acad36 Author: Nikita Karetnikov Date: Tue Aug 30 18:59:07 2022 +0200 [primTorch] Add docstring and promotion for `l1_loss` ref (#83803) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83803 Approved by: https://github.com/Lezcano, https://github.com/ngimel commit 44bc6db8f88faf1b7543e825f1282140b9efa504 Author: Edward Z. Yang Date: Wed Aug 31 07:39:53 2022 -0700 Add meta function for repeat (#84349) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84349 Approved by: https://github.com/Krovatkin commit 7834f557d7477fc9a11494a03eaa88228e40636f Author: Will Constable Date: Wed Aug 31 17:15:05 2022 +0000 Add dynamo_timed to aot autograd (#84307) Provides visibility into time spent running AotAutograd Partially fixes [torchdynamo/795](https://github.com/pytorch/torchdynamo/issues/795) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84307 Approved by: https://github.com/Chillee commit 14093b5979cb5c0b777e3920819ab8252eb6d3ea Author: PyTorch MergeBot Date: Wed Aug 31 16:32:24 2022 +0000 Revert "Use register_meta for everything in meta_registrations (#84297)" This reverts commit 8cd296f6804727899b39198d1641055b64f99056. Reverted https://github.com/pytorch/pytorch/pull/84297 on behalf of https://github.com/suo due to broke test_proxy_tensor on master commit bf67589915de07a6b8756a685de9abbd90ec2dfa Author: Sungmin Cho Date: Wed Aug 31 15:15:21 2022 +0000 Escape curly brackets in FxGraphDrawer _typename (#83604) Summary: Encountered `Error: bad label format` from dot (i.e. graphviz) when benchmarking models that have dict-like structure. The root cause was that curly brackets were not properly escaped, like this example P522499127 (unescaped curly brackets in target= string) This diff insert the fix in FxGraphDrawer, since many of these graph generation codes rely on that class. (Modified summary before exporting to GitHub PR) Test Plan: ``` CUDA_VISIBLE_DEVICES=7 buck run mode/opt -c python.package_style=inplace //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --model-name={INSERT IFR QE MODEL NAME HERE} --batch-iter 100 --batch-size 768 --num-gpu 1 --lower-presets {INSERT ITS PRESET} ``` Will not encounter dot errors after this diff. 
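For the curly-bracket problem described in the FxGraphDrawer commit above, a minimal sketch of the escaping idea (illustrative helper, not the actual `_typename` code):
```python
def escape_dot_label(name: str) -> str:
    # Graphviz "record" labels treat {, }, |, < and > as structure
    # characters, so a type name such as "Dict{str: Tensor}" must be
    # escaped before being embedded in a node label.
    for ch in "{}|<>":
        name = name.replace(ch, "\\" + ch)
    return name

print(escape_dot_label("Dict{str: Tensor}"))  # Dict\{str: Tensor\}
```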
(Modified test plan before exporting to GitHub PR) Reviewed By: yinghai Differential Revision: D38758827 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83604 Approved by: https://github.com/yinghai, https://github.com/jianyuh commit b170db855441059218ead33c88af5d7576a1bc59 Author: Michael Andreas Dagitses Date: Wed Aug 31 05:12:16 2022 -0700 build/test MaybeOwned_test in OSS and fix it (#84342) Summary: This was not listed in the compilation for the ATen tests and was only getting built in Meta internal repositories. This ended up with the following problems: * at::zeros was not available * equal() for tensors was being selected from ATen/ops/equal.h and crashing Test Plan: Verified locally. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/84342 Approved by: https://github.com/DanilBaibak commit a27a4a02fecfdd626b25794a84954731b80f29fb Author: Horace He Date: Wed Aug 31 07:01:37 2022 +0000 Refactored proxytensor to clean up separate branches (#84325) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84325 Approved by: https://github.com/ezyang commit 8843f5b9868a99c41d5259ac0346bc99f2c578a0 Author: Horace He Date: Wed Aug 31 01:17:31 2022 +0000 remove data-dependent shapes from some distributions (#84322) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84322 Approved by: https://github.com/voznesenskym commit 6a3ecda5a25025d48bbc5f0215db8c338745ef79 Author: Horace He Date: Wed Aug 31 00:29:55 2022 +0000 Started storing faketensor/symbolic shape metadata on FX nodes in make_fx (#84114) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84114 Approved by: https://github.com/SherlockNoMad commit 79e3a39f95e91af03823a8579da06c35bb519faf Author: Nikita Shulga Date: Wed Aug 31 04:34:01 2022 +0000 [BE] Remove unused `export.h` include (#84305) As flatbuffer_serializer can be compiled without it Found while debugging cause of https://github.com/pytorch/pytorch/pull/82040#issuecomment-1229503604 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84305 Approved by: https://github.com/kit1980, https://github.com/qihqi commit abaf8112e6d6bed2a5d33dcbc1d46ed20b8e80de Author: Eli Uriegas Date: Tue Aug 30 16:08:59 2022 -0700 ci: Replace setup-miniconda with test-infra version (#84236) Replaces our use of the conda-incubator version of setup-miniconda with one that's more tailored to our specific needs. Should address issues highlighted in https://github.com/pytorch/pytorch/issues/84196 Signed-off-by: Eli Uriegas Pull Request resolved: https://github.com/pytorch/pytorch/pull/84236 Approved by: https://github.com/atalman, https://github.com/janeyx99, https://github.com/malfet commit b343febe610b8c95ca07fe9a0b061f138ed7c94d Author: PyTorch MergeBot Date: Wed Aug 31 02:58:14 2022 +0000 [torchdynamo hash update] update the pinned torchdynamo hash (#84317) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned torchdynamo hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84317 Approved by: https://github.com/pytorchbot commit 8cd296f6804727899b39198d1641055b64f99056 Author: Edward Z. Yang Date: Tue Aug 30 15:55:04 2022 -0700 Use register_meta for everything in meta_registrations (#84297) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/84297 Approved by: https://github.com/Chillee commit 71d99662a0d7f8a9ad68999c9a014b71591cbb68 Author: David Berard Date: Tue Aug 30 20:06:22 2022 +0000 add nvidia-smi to run_torchbench (#83857) Seeing an OOM in #83239, this would help understand whether the issue is with the infra or with the test. RUN_TORCHBENCH: nvfuser Pull Request resolved: https://github.com/pytorch/pytorch/pull/83857 Approved by: https://github.com/xuzhao9 commit 9c452abcf18a811023530b9673ae362bf987068a Author: Elias Ellison Date: Tue Aug 30 15:25:33 2022 -0700 Use reentrant mode when invoking prims, delete global prim_fake_mode (#84090) Maybe I should be using the meta_impl instead of the prim_impl, but it's not terribly clear why, since the prim impl will be better tested and should work under the re-entrant FakeTensorMode. Fixes https://github.com/pytorch/pytorch/issues/78613 in the process Pull Request resolved: https://github.com/pytorch/pytorch/pull/84090 Approved by: https://github.com/ezyang, https://github.com/samdow commit db7784e7227ea296c9c23be731bcf5bb4ad4dff7 Author: Mike Iovine Date: Wed Aug 31 01:20:14 2022 +0000 [Static Runtime] Schema checks for index_put (#84152) Summary: `index_put` can take a list of tensors, but Static Runtime always tries to convert its argument to a list of optional tensors. This was causing crashes for some users. Add some schema checks to prevent this, and add a new overload for the new case. Also, I found a clear bug in the JIT interpreter (mutating the argument when its not supposed to), so I fixed that too. Test Plan: New unit test Differential Revision: D39072214 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84152 Approved by: https://github.com/tenpercent commit 7532d5b125fff65945cbb95d5f6cbee082e7238f Author: samdow Date: Tue Aug 30 12:24:07 2022 -0400 [Modes] remove inner constructor kwarg (#83925) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83925 Approved by: https://github.com/ezyang, https://github.com/zou3519 commit e23d159bc57c1651e47e555092c2486bb55db37a Author: Scott Wolchok Date: Mon Aug 29 09:39:36 2022 -0700 [PyTorch][caffe2] Add CAFFE2_{DECLARE,DEFINE}_KNOWN_TYPE (#83707) It looks like we aren't getting inlining for the defined `_typeMetaData` functions from CAFFE_KNOWN_TYPE and there's some cost associated with that. I added new macros that fix this problem; I will migrate to them in a follow-up after I get buy-in from reviewers. Differential Revision: [D36883685](https://our.internmc.facebook.com/intern/diff/D36883685/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36883685/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/83707 Approved by: https://github.com/ezyang commit af741e821bb3efecc22feca984519e472e933e9e Author: Catherine Lee Date: Tue Aug 30 22:45:15 2022 +0000 no ios arm builds on circleci (#84299) Get rid of ios arm builds on circleci b/c most people dont have these permissions and they make the job show up as failing/red. 
Next step is to see if we can do only builds since they might not require credentials Pull Request resolved: https://github.com/pytorch/pytorch/pull/84299 Approved by: https://github.com/janeyx99, https://github.com/malfet commit e014bd8e4ef65377374640310cbafbccbcd0f5f7 Author: Xu Zhao Date: Tue Aug 30 22:40:44 2022 +0000 Upgrade default cuda version of torchbench (#84248) Upgrade CUDA version of torchbench as we are moving away from CUDA 11.3 This PR needs to land together with https://github.com/pytorch/benchmark/pull/1141 RUN_TORCHBENCH: nvfuser TORCHBENCH_BRANCH: xz9/setup-cuda-compile Pull Request resolved: https://github.com/pytorch/pytorch/pull/84248 Approved by: https://github.com/erichan1, https://github.com/davidberard98 commit 7acdb2d5642557053df00951b51b94929302a9b7 Author: Huy Do Date: Tue Aug 30 22:19:07 2022 +0000 Don't start land checks if the PR hasn't been approved yet (#84239) Per title, don't start land checks if the PR hasn't been approved yet. This is very important to make sure that we don't start CI jobs from unknown devs, i.e. first time contributor. Also rename force to `skip_mandatory_checks` to make it clearer on what this flag does ``` python .github/scripts/test_trymerge.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84239 Approved by: https://github.com/zengk95, https://github.com/ZainRizvi commit eabe34cc40aeb79a10208df291b2a4d92302fbc2 Author: Jesse Cai Date: Tue Aug 30 09:50:03 2022 -0700 [Quant] Remove warnings from using torch.tensor(value) (#84277) Summary: I think zafar made an earlier pull for these changes [here](https://github.com/pytorch/pytorch/commit/ce0786add26c1e117b16b58e8ae12dbe776133e1), but they didn't seem to make it through the migration. Test Plan: ``` python test/test_quantization.py ``` Reviewers: Subscribers: Tasks: https://github.com/pytorch/pytorch/issues/73566 Tags: quant Differential Revision: [D39145070](https://our.internmc.facebook.com/intern/diff/D39145070) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84277 Approved by: https://github.com/z-a-f commit eda217ab672a08e555a7d09a1e4f10d2f98ee478 Author: Nikolay Korovaiko Date: Tue Aug 30 21:53:34 2022 +0000 Reland symint_numel (#84281) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/84281 Approved by: https://github.com/ezyang commit d09486ab233284e9f298e45a43977fed8f075fe4 Author: Jeff Daily Date: Tue Aug 30 21:50:39 2022 +0000 [ROCm] enable nvfuser (#82498) The nvfuser is enabled for ROCm. CI label ciflow/trunk covers the newly enabled ROCm functionality as well as any CUDA regressions caused by these changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82498 Approved by: https://github.com/jjsjann123, https://github.com/davidberard98 commit f9609d82038897ac560b408808e9dba9f39bc922 Author: SmirnovKol <31559413+OccupyMars2025@users.noreply.github.com> Date: Tue Aug 30 21:41:11 2022 +0000 Fix several typos (#83823) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/83823 Approved by: https://github.com/ngimel, https://github.com/kit1980 commit c06a5586f57c844fdc4a98e52f88e71f64dd54d2 Author: Taylor Robie Date: Tue Aug 30 09:05:15 2022 -0700 [Profiler] Unify global and thread local profiler lookup. (#83894) This PR renames `ProfilerThreadLocalStateBase` to simply `ProfilerStateBase`, and adds `push`, `pop`, and `get` methods. `global` can be specified, or can be omitted for priority selection. 
In order to support this unification it was necessary to make a (mostly) non-throwing version of pop. The asserts around observer removal are intended to act as guard rails against multiple profilers trampling over each other. However on-demand wants to do exactly that because it wants to be able to preempt. A hack would be to get the current observer and then only pop if an observer is found, but that would be prone to race conditions. By removing the asserts, we can preserve the old behavior by adding `ASSERT(pop())` on the caller side while allowing more complex handling for the kineto client interface. (Later PR.) Differential Revision: [D38931521](https://our.internmc.facebook.com/intern/diff/D38931521/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83894 Approved by: https://github.com/slgong-fb commit 48a596ad3f2ca617cd2fafc3fa3c368f5600930a Author: Taylor Robie Date: Tue Aug 30 09:05:13 2022 -0700 [Profiler][Trivial] Create orchestration folder and move observer management there. (#83893) Just a basic move. Later I'll add other subsystems. (Python, Kineto) Differential Revision: [D38925895](https://our.internmc.facebook.com/intern/diff/D38925895/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38925895/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/83893 Approved by: https://github.com/slgong-fb commit c26b53f6a4c05a280aabe525a5c5918e3db3da57 Author: Taylor Robie Date: Tue Aug 30 09:05:11 2022 -0700 [Profiler] Encapsulate callback handle management. (#83892) Right now the profiler is capible of leaking callback handles if a client does not call `at::removeCallback`. (As well as a double free if two clients handle it.) This modestly improves the situation by pulling removal into a single method and calling that removal code in the dtor unless explicitly opted out. Once we deprecate the legacy profiler we can further simplify by making the ProfilerThreadLocalStateBase own the handle outright. Differential Revision: [D38920537](https://our.internmc.facebook.com/intern/diff/D38920537/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83892 Approved by: https://github.com/slgong-fb commit ddd841b3168750fa888b9c97e21cf9a6f0934d5b Author: atalman Date: Tue Aug 30 21:23:17 2022 +0000 Removing multigpu 10.2 . Using 11.6 cuda for multigpu tests instead (#84286) Removing multigpu 10.2 . Using 11.6 cuda for multigpu tests instead Pull Request resolved: https://github.com/pytorch/pytorch/pull/84286 Approved by: https://github.com/huydhn, https://github.com/malfet commit 772721a4b7ea68a21e14eb74fedbd6c22f616905 Author: PyTorch MergeBot Date: Tue Aug 30 21:01:25 2022 +0000 Revert "Test distributed backends in parallel (#84034)" This reverts commit 3ae5be74ac7aa4feed6ec8e7c29b280b148651a7. Reverted https://github.com/pytorch/pytorch/pull/84034 on behalf of https://github.com/huydhn due to This somehow revives the flaky test https://github.com/pytorch/pytorch/issues/76428 commit 20018aa7667284a21303a52e5ac0bed5971af2bd Author: Isaac Hoffman Date: Tue Aug 30 20:36:30 2022 +0000 modify split_by_tags to retain output order (#84136) Summary: Currently `split_by_tags` determines submodule output order by iterating over `used_in_main`. Since this is a `Set`, insertion order is not retained so we run into problems with submodule output order being "randomized" & inconsistent between splits. 
By using `Dict[Node, None]` we can implement `used_in_main` as an ordered set so that output order is consistent when splitting the same model. Test Plan: CI Differential Revision: D39039268 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84136 Approved by: https://github.com/houseroad commit 90161c23cf28d5c61295d5c392b14cc3483d3a33 Author: Ivan Yashchuk Date: Tue Aug 30 20:36:11 2022 +0000 Add nvfuser support for squeeze (#84117) "_refs.squeeze" and "refs.unsqueeze" now work with nvfuser executor tests. Similarly to `_refs.reshape` we need to explicitly save the concrete shape on the trace to pass that info to nvfuser, as it gets lost in translation (https://github.com/pytorch/pytorch/pull/83739#discussion_r950352124). Pull Request resolved: https://github.com/pytorch/pytorch/pull/84117 Approved by: https://github.com/ngimel commit 174c3c6859529f30a7dfa4920a9a52e1373b02a9 Author: Driss Guessous Date: Tue Aug 30 19:22:38 2022 +0000 [Nested Tensor]Clean up offsets (#84145) - Document contiguous offset construction - Expand offsets by 1 so that storage offsets for `ntensor[i] = offsets[i+1] - offsets[i]` Another simple one. While looking into this issue https://github.com/pytorch/pytorch/issues/84082 I noticed that the kernels essentially rebuild the offsets but with the added last element. I added this and also cleaned up the code a little Pull Request resolved: https://github.com/pytorch/pytorch/pull/84145 Approved by: https://github.com/albanD commit 3ae5be74ac7aa4feed6ec8e7c29b280b148651a7 Author: Huy Do Date: Tue Aug 30 19:06:49 2022 +0000 Test distributed backends in parallel (#84034) This allows multiple backends (nccl, gloo) to be tested in parallel and speed up the process. The improvement is mainly in the 1st distributed CUDA shard where the long pole `distributed/test_distributed_spawn` test is executed: * [linux-bionic-cuda11.6-py3.10-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/runs/8007596825?check_suite_focus=true#logs) takes 1h24m. This is better than the current average expectation of 2h12m On the other hand, there is no improvement for the following two jobs: * [linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge)](https://github.com/pytorch/pytorch/runs/8007417353?check_suite_focus=true#logs) takes 1h47m * [linux-bionic-cuda11.6-py3.10-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/runs/8007596870?check_suite_focus=true#logs) takes 1h40m This is still a gain though because it allows us to add more shards for distributed test if needed. Issue https://github.com/pytorch/pytorch/issues/83694 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84034 Approved by: https://github.com/wanchaol commit 641c3952516baf444f75058f76cde59d0e1110f0 Author: titaiwang Date: Mon Aug 29 20:12:28 2022 +0000 [ONNX] refactor test_pytorch_onnx_onnxruntime_cuda.py (#84218) Fix #80037 After https://github.com/pytorch/pytorch/pull/79641, the code was outdated. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84218 Approved by: https://github.com/BowenBao commit b8ee81014481f58cc87fbda19737307435951d02 Author: Andrew Gu Date: Mon Aug 29 16:28:07 2022 +0000 [Easy][FSDP] Update `StateDictType` doc (#84200) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84200 Approved by: https://github.com/rohan-varma commit 7f58db7424121f035fa70c9504c437bbda722efe Author: Andrew Gu Date: Mon Aug 29 16:27:58 2022 +0000 [Easy][FSDP] ufmt `_optim_utils.py` (#84199) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84199 Approved by: https://github.com/rohan-varma commit 5bceaadb70a87d317f5855514f2e6c730844a015 Author: titaiwang Date: Fri Aug 26 20:02:04 2022 +0000 [ONNX] Add script/trace different flatten and move optional type tests to runtime (#83184) fix #78119 Why: As in onnx tests verification code, we used to only consider tracing output, which ignores None type, this PR enables runtime test to keep None type in torch in script mode. 1. Move Optional Type tests from no runtime to runtime, as it's supported by ONNXRUNTIME. 2. Add ignoreNone flag for output comparison of internal tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/83184 Approved by: https://github.com/justinchuby, https://github.com/BowenBao commit b106a04d766c21d15137318c506fc1ed823016b9 Author: lezcano Date: Tue Aug 30 12:11:57 2022 +0000 Fix the edge case when y = 0 in kl_div (#82714) Brought up in https://github.com/pytorch/pytorch/pull/80334#issuecomment-1193600883 We also prepare its opinfo to fix https://github.com/pytorch/pytorch/issues/80488 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82714 Approved by: https://github.com/albanD commit b182f0813519d9b153bf7acb4214c8e3f795866e Author: Eric Han Date: Tue Aug 30 18:06:25 2022 +0000 Fix issue in softmax.cu with transformer error when mask seqlen > 1024 (#83639) Fixes #83142 Adds - test to catch this issue. - fix to softmax.cu that broadcasts src_key_padding_mask to regular attention_mask shape Pull Request resolved: https://github.com/pytorch/pytorch/pull/83639 Approved by: https://github.com/ngimel commit 897907d42cc379af2c16885345bc20b5e8ca894d Author: Peter Bell Date: Mon Aug 29 20:53:39 2022 +0100 Fix split torch_function handling (#83866) `Tensor.split` calls `TensorBase.split` whose `handle_torch_function` statement passes `func` as `Tensor.split` which is usually correct, but not here because of the use of `super()`. Instead this calls `torch._VF.split` which correctly differentiates from `torch.split`. This is currently okay since we never hit `TensorBase.split` for types with `__torch_function__` however, once we allow skipping only one hop of `__torch_function__` this will expose the error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83866 Approved by: https://github.com/albanD commit 65dc5dd3f317b8eb9440d22e044816adab2ffa9e Author: Rodrigo Kumpera Date: Tue Aug 30 17:44:57 2022 +0000 [c10d] Introduce dist.get_local_rank, dist.get_global_rank and dist.get_global_ranks (#82134) Those functions enable membership introspection into a ProcessGroup. A common scenario that needs this is library code that consumes a PG but doesn't create it, which means it likely doesn't know the global ranks used to create it. 
Translating from local to global is necessary when using c10d collectives like broadcast, so if your library code adopts the convention of using local rank 0, it needs to do the following:
```python
import torch.distributed as dist

my_pg: dist.ProcessGroup = ...

def my_library_bcast(tensor):
    # broadcast from local rank 0 of my_pg, translated to its global rank
    dist.broadcast(tensor, src=dist.get_global_rank(my_pg, 0), group=my_pg)
```
This implements some of the helpers needed to implement the `clone` API from: https://github.com/pytorch/pytorch/issues/81291 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82134 Approved by: https://github.com/rohan-varma commit 56a37ea1a6e89a8aa31abc888127ccac647b92d4 Author: Shen Li Date: Tue Aug 30 01:16:42 2022 +0000 Set default value for nccl make MAX_JOBS if ProcessorCount returns 0 (#84231) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84231 Approved by: https://github.com/malfet, https://github.com/rohan-varma commit f0efc1c2d19e561ecdc6dd1e556f76fe1a91e484 Author: Andrew Gu Date: Mon Aug 29 16:27:49 2022 +0000 [Easy][FSDP] Fix sharded optim state dict doc formatting (#84198) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84198 Approved by: https://github.com/rohan-varma commit 546d68226c355fb21ed374588b219f5d7d7a66c3 Author: Marko Horatio Mekjavic <48606569+Pompey21@users.noreply.github.com> Date: Tue Aug 30 15:00:30 2022 +0000 Update README.md (#84263) Just fixed a couple of typos (i.e. upzipped -> unzipped) :) Fixes #84262 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84263 Approved by: https://github.com/Lezcano, https://github.com/albanD commit 44a975335e2d08cbbb07df9a1cebe2620f337ed9 Author: Nikolay Korovaiko Date: Mon Aug 29 21:12:34 2022 -0700 Revert "Re-land sym_numel (#82374) (#82726) (#82731) (#82855)" (#84207) This reverts commit bfebf254dd92f3ed35154597166e7e71fb04f31b. Differential Revision: [D39104562](https://our.internmc.facebook.com/intern/diff/D39104562) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84207 Approved by: https://github.com/robieta commit 60f47cb0021b0ea245aa6cc4654bf9e6d0f4ab20 Author: PyTorch MergeBot Date: Tue Aug 30 13:16:21 2022 +0000 Revert "Use self-hosted runner for viable/strict update (#84249)" This reverts commit acd6ca8cfa9537284928fb5d36834d1e5ae1e6f3. Reverted https://github.com/pytorch/pytorch/pull/84249 on behalf of https://github.com/malfet due to Broke trunk, as one can't use regular actions on self-hosted runners, see https://github.com/pytorch/pytorch/runs/8092593881?check_suite_focus=true commit acd6ca8cfa9537284928fb5d36834d1e5ae1e6f3 Author: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Date: Tue Aug 30 07:33:59 2022 +0000 Use self-hosted runner for viable/strict update (#84249) Queuing times for GH self-hosted runners have been too long to be acceptable and this job gets canceled too frequently. Let's use our own runner here instead for this important job. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84249 Approved by: https://github.com/suo commit ec714e33a3365ce405e2ae048c8adaa2751b9dba Author: Shenxiu Liu Date: Tue Aug 30 05:16:19 2022 +0000 [PT] Allowing deepcopy in unitialized parameter (#83809) Summary: UninitializedParameter overrides the `__new__` method, so the parent class's `__deepcopy__` method no longer works, which means models using LazyModule cannot be instantiated. Test Plan: locally copied lazy module.
After change: ``` shenxiu@devbig1109:fbcode (5c57dd833)$ bento console --kernel pytorch --local /data/users/shenxiu/fbsource/buck-out/v2/gen/fbcode/26f2c80c27f9e71d/bento/kernels/__bento_kernel_pytorch__/bento_kernel_pytorch#link-tree/scribeutil/lib.py:9: DeprecationWarning: The "thrift" clients in libfb.py.thrift_clients are not proper thrift clients, and often have unexpected or incorrect behaviour. They are also completely unsupported. Please use a supported client from https://fburl.com/srpy or a supported raw thrift client if you cannot use ServiceRouter. from libfb.py.thrift_clients.scribe_thrift_client import ScribeThriftClient /data/users/shenxiu/fbsource/buck-out/v2/gen/fbcode/26f2c80c27f9e71d/bento/kernels/__bento_kernel_pytorch__/bento_kernel_pytorch#link-tree/ipykernel/iostream.py:14: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses from imp import lock_held as import_lock_held Python 3.8.6 (default, Jun 10 2022, 04:32:13) Type 'copyright', 'credits' or 'license' for more information IPython 7.21.0 -- An enhanced Interactive Python. Type '?' for help. In [1]: import copy ...: import torch ...: ...: class LazyModule(torch.nn.Module): ...: def __init__(self): ...: super().__init__() ...: self.m = torch.nn.LazyLinear(10) ...: ...: def forward(self, input): ...: x = self.m(input) ...: return x ...: ...: m = LazyModule() ...: print(m.state_dict()) copy.deepcopy(m) /data/users/shenxiu/fbsource/buck-out/v2/gen/fbcode/26f2c80c27f9e71d/bento/kernels/__bento_kernel_pytorch__/bento_kernel_pytorch#link-tree/mpmath/ctx_mp_python.py:892: SyntaxWarning: "is" with a literal. Did you mean "=="? if other is 0: /data/users/shenxiu/fbsource/buck-out/v2/gen/fbcode/26f2c80c27f9e71d/bento/kernels/__bento_kernel_pytorch__/bento_kernel_pytorch#link-tree/mpmath/ctx_mp_python.py:986: SyntaxWarning: "is" with a literal. Did you mean "=="? if other is 0: /data/users/shenxiu/fbsource/buck-out/v2/gen/fbcode/26f2c80c27f9e71d/bento/kernels/__bento_kernel_pytorch__/bento_kernel_pytorch#link-tree/sympy/solvers/diophantine.py:3188: SyntaxWarning: "is" with a literal. Did you mean "=="? if feasible is 1: # it's prime and k == 2 /data/users/shenxiu/fbsource/buck-out/v2/gen/fbcode/26f2c80c27f9e71d/bento/kernels/__bento_kernel_pytorch__/bento_kernel_pytorch#link-tree/sympy/plotting/plot.py:520: SyntaxWarning: "is" with a literal. Did you mean "=="? if self.xscale is 'log': /data/users/shenxiu/fbsource/buck-out/v2/gen/fbcode/26f2c80c27f9e71d/bento/kernels/__bento_kernel_pytorch__/bento_kernel_pytorch#link-tree/sympy/plotting/plot.py:540: SyntaxWarning: "is" with a literal. Did you mean "=="? if self.xscale is 'log': /data/users/shenxiu/fbsource/buck-out/v2/gen/fbcode/26f2c80c27f9e71d/bento/kernels/__bento_kernel_pytorch__/bento_kernel_pytorch#link-tree/sympy/plotting/plot.py:553: SyntaxWarning: "is" with a literal. Did you mean "=="? if self.xscale is 'log': /data/users/shenxiu/fbsource/buck-out/v2/gen/fbcode/26f2c80c27f9e71d/bento/kernels/__bento_kernel_pytorch__/bento_kernel_pytorch#link-tree/sympy/plotting/plot.py:560: SyntaxWarning: "is" with a literal. Did you mean "=="? 
if self.xscale is 'log': OrderedDict([('m.weight', ), ('m.bias', )]) In [2]: copy.deepcopy(m) Out[2]: LazyModule( (m): LazyLinear(in_features=0, out_features=10, bias=True) ) ``` Before change, above code will give ``` TypeError: empty() received an invalid combination of arguments - got (int, dtype=NoneType, device=bool), but expected one of: * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad) * (tuple of ints size, *, torch.memory_format memory_format, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad) * (tuple of SymInts size, *, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad) ``` Cloned n2369721 locally and successful (thru console not notebook because somehow bento notebook doesn't work with buck2 well). Reviewed By: avilay Differential Revision: D38866072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83809 Approved by: https://github.com/ngimel commit 856a7d94116e175cd4b631c61b4a9a8c572b83c8 Author: Peter Bell Date: Tue Aug 30 02:12:13 2022 +0000 Vectorize conversions to BFloat16 on CPU (#80906) This adds explicit vectorization for converting float or double to bfloat16. Most conversions are sufficiently handled by the auto-vectorizer, but these conversions aren't (presumably due to branching in the scalar conversion code). Benchmark results with 512K elements on an AVX2 machine: | conversion | Before (us) | After (us) | |---------------------|-------------|------------| | float32 -> bfloat16 | 53.3 | 39.8 | | float64 -> bfloat16 | 92.1 | 78.2 | Pull Request resolved: https://github.com/pytorch/pytorch/pull/80906 Approved by: https://github.com/ngimel commit 7a14c56beecf5f21e88c029bf306cbda8e91fbed Author: Catherine Lee Date: Tue Aug 30 03:53:16 2022 +0000 only run the circleci mac/ios jobs on prs (#84227) as in title, since they were being run on nightly when they dont need to be (and they were failing), also dont run on master b/c the github actions version already exists for that Pull Request resolved: https://github.com/pytorch/pytorch/pull/84227 Approved by: https://github.com/seemethere, https://github.com/janeyx99, https://github.com/huydhn, https://github.com/malfet commit 71369051ee99f679cbb026b571e2521e3845a93e Author: Driss Guessous Date: Tue Aug 30 03:48:09 2022 +0000 [Nested Tensor] fix from_padded bug (#84217) Fixes #84082 Explained in the issue that the problem was arising from grad being not contiguous and the fast kernel not handiling this case gracefully. The other thing I can do is add a contiguous call to https://github.com/pytorch/pytorch/blob/d144594512e10ab2a9625347816c2dee1fb55667/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cpp#L45 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84217 Approved by: https://github.com/albanD commit df98c529480b2ece3809b19fc850f57d2054605a Author: Zhengxu Chen Date: Tue Aug 30 03:09:48 2022 +0000 [fx] Make get_isolated_graphmodule accept tracing mode. (#84238) Summary: make get_isolated_graphmodule be able to run with symbolic mode. 
Test Plan: eyes Reviewed By: angelayi Differential Revision: D39110454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84238 Approved by: https://github.com/angelayi commit 399b1eb84b006f7a6e2bdcda7083fd264ac204da Author: samdow Date: Mon Aug 29 11:31:36 2022 -0400 [functorch] fix multinomial (#83838) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83838 Approved by: https://github.com/zou3519 commit 34e5b0997e73916a15a9802ea4525705ba033cb1 Author: Shen Li Date: Mon Aug 29 20:46:14 2022 +0000 [reland] Make allreduce compatible with make_fx (#84221) land after #83122 This PR explores solutions for 2 issues: 1. Collective comm ops are inplace ops, and does not return a tensor. With that, `make_fx` cannot include comm ops in the traced graph. The current solution is to make comm ops return a tuple of `(output_tensors, work_handle)`, so that [`proxy_call`](https://github.com/pytorch/pytorch/blob/90821aab100a436424113e2306eac63f5e247ee5/torch/fx/experimental/proxy_tensor.py#L170-L172) can handle that. It won't change the behavior of existing c10d Python/C++ APIs, so I directly added the code to `Ops.cpp`. 2. `make_fx` does not recognize `ProcessGroup::Work` and will ignore the `wait()` call on the work when tracing graph. However, this might break correctness, as when running the traced function, it could consume a tensor before it's ready. The current solution is to create a `CommTensor` tensor subclass to explicitly call `wait()`. In this PR, I am only doing this in the test, as we will need more discussion to see if we can add this to c10d Python implementations. kudos to Chillee wanchaol Edit: `print_tabular` breaks CI. removing that from tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84221 Approved by: https://github.com/wanchaol commit a402e100be1fc88fe8499f96ed36a08cf1bb0de0 Author: Zhengxu Chen Date: Tue Aug 30 01:16:56 2022 +0000 [fx] Make wrapped_fn also work for non-mutating passes. (#84232) Summary: Before the change, wrapped_fn should only take mutating passes, but we don't actually have any way to detect whether a pass is mutating before running it. To make this an abstraction without involving any precondition depending on PassManager run, we could just relax the precondition to take any kind of passes, and conditionally return the original pass based on the pass result. Test Plan: eyes Reviewed By: qihqi, angelayi Differential Revision: D39086343 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84232 Approved by: https://github.com/angelayi commit 8aba2535e4ebe01e1461a17beedcabfd34db9d87 Author: zh Wang Date: Tue Aug 30 01:04:26 2022 +0000 Fix typo (#83802) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83802 Approved by: https://github.com/ngimel, https://github.com/kit1980 commit 7371761d9cc4a1c235f65aff07452da7de482eae Author: Antonio Kim Date: Tue Aug 30 00:31:35 2022 +0000 Add Lazy backend type string (#84228) As the title suggest, the `Lazy` case was missing the in the `backend_to_string` switch case causing ``` RuntimeError: Unimplemented backend Lazy ``` when called with a lazy backend. CC: @wconstab @Krovatkin @desertfire Pull Request resolved: https://github.com/pytorch/pytorch/pull/84228 Approved by: https://github.com/wconstab commit adc54dc2195fbfe37b2f01649b8788314382a9be Author: Huy Do Date: Mon Aug 29 23:50:24 2022 +0000 Give better error message when merge fails to find any rules (#84160) Fixes #84147 and https://github.com/pytorch/test-infra/issues/421. 
* If merge rule file is missing or fails to load for whatever reasons: ``` No rules find to match PR, please [report]{issue_link} this issue to DevX team. ``` * If the list of rules is empty: ``` Merges are not allowed into repository without a rules. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84160 Approved by: https://github.com/ZainRizvi, https://github.com/malfet commit 54d8661266665c0f66a6d819cc24e5b0053b0be9 Author: ssjia Date: Mon Aug 29 08:39:52 2022 -0700 [vulkan] Add vulkan_api_test as an instrumentation test (#83978) This diff adds a `fb_xplat_cxx_test` Android instrumentation test in internal repo that runs `vulkan_api_test.cpp`. Some small changes to `vulkan_api_test.cpp` were needed to build/run the binary successfully. Differential Revision: [D38954229](https://our.internmc.facebook.com/intern/diff/D38954229/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38954229/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/83978 Approved by: https://github.com/kirklandsign commit e7635c06ce15b1e5952b34d4e50018c1c8d545db Author: apeltop Date: Mon Aug 29 23:32:44 2022 +0000 Fix typos in docs (#80602) I hope it helps. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80602 Approved by: https://github.com/kit1980 commit 372a19d2c673a20fe50955b20ea4e3685266d630 Author: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com> Date: Mon Aug 29 22:53:40 2022 +0000 Update start_index and end_index for adaptive pooling (#84010) The PR fixes the issue #81409. To fix the issue the procedure of determining start and end indices for adaptive max pooling and average pooling is modified towards integer-only arithmetic. The testing of the new functions is straightforward: ``` int64_t start_index(int64_t a, int64_t b, int64_t c) { return (int64_t)std::floor((float)(a * c) / b); } int64_t end_index(int64_t a, int64_t b, int64_t c) { return (int64_t)std::ceil((float)((a + 1) * c) / b); } int64_t start_index_new(int64_t a, int64_t b, int64_t c) { return (a / b) * c + ((a % b) * c) / b; } int64_t end_index_new(int64_t a, int64_t b, int64_t c) { return 1 + ((a + 1) * c - 1) / b; } int main() { size_t N = 2<<24; std::cout< Date: Mon Aug 29 22:15:20 2022 +0000 Make mergebot failure messages more readable (#84214) Reformat how mergebot outputs merge and revert errors to make the failure more obvious and hide text that doesn't actually help most users debug their PRs. (The workflow job helps the DevX team to debug mergebot errors) image image image Pull Request resolved: https://github.com/pytorch/pytorch/pull/84214 Approved by: https://github.com/malfet, https://github.com/huydhn commit d62a6ca5216fa4465dad546307c406544917ffea Author: Zain Rizvi Date: Mon Aug 29 20:31:30 2022 +0000 Link to instructions on submitting an RFC (#83990) Point people to instructions on how to create an RFC Pull Request resolved: https://github.com/pytorch/pytorch/pull/83990 Approved by: https://github.com/janeyx99 commit 724b63d69452faac365131d63cbbbeb0a3c2d94a Author: Michael Suo Date: Mon Aug 29 10:49:53 2022 -0700 [ci] move XLA pin update to weekly (#84208) - Create a `weekly` workflow and move XLA pin update to that - Move the other two pin updates to the `nightly` workflow (instead of having a special workflow just for them. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84208 Approved by: https://github.com/janeyx99, https://github.com/malfet, https://github.com/seemethere commit 806878518f32c5b93acc7da576e57ab52f6f5232 Author: BowenBao Date: Mon Aug 29 10:26:43 2022 -0700 [ONNX][Reland] Export node and value with scope name (#82040) Introduce `_jit_pass_onnx_assign_node_and_value_names` to parse and assign scoped names for nodes and values in the exported onnx graph. Module layer information is obtained from `ONNXScopeName` captured in the `scope` attribute in nodes. For nodes, the processed onnx node names are stored in the attribute `onnx_name`. For values, the processed onnx output names are stored as `debugName`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82040 Approved by: https://github.com/AllenTiTaiWang, https://github.com/justinchuby, https://github.com/abock commit d144594512e10ab2a9625347816c2dee1fb55667 Author: Jesse Cai Date: Mon Aug 29 18:08:36 2022 +0000 [Quant][fx] Remove WEIGHT_INDEX_DICT and BIAS_INDEX_DICT (Part 2) (#83853) Summary: - Finishes the second part of https://github.com/pytorch/pytorch/pull/83263 - Removes WEIGHT_INDEX_DICT and BIAS_INDEX_DICT from utils.py - Moves two functions, `node_arg_is_weight` and `node_arg_is_bias`, into utils.py from prepare.py; convert.py and _equalize.py now use node_arg_is_weight instead of the dictionaries - Adds in quantization support for `F.groupnorm`. Adds missing BackendPatternConfigs for layernorm, instancenorm, and groupnorm Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```
Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 2b157e0dc4f1553be1f4813b4693db952e6fc558 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83848 Fixes #83093 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83853 Approved by: https://github.com/jerryzh168, https://github.com/andrewor14 commit ad44670fa1ce2dad7e2cdc3f90d27668e88e9548 Author: Edward Z. Yang Date: Mon Aug 29 06:08:43 2022 -0700 Back out "Revert D38984222: Don't introduce new overload for SymInt (#83628)" (#84173) Also Back out "Revert D39075159: [acc_tensor] Use SymIntArrayRef for overloaded empty.memory_format's signature" Original commit changeset: dab4a9dba4fa Original commit changeset: dcaf16c037a9 Original Phabricator Diff: D38984222 Original Phabricator Diff: D39075159 Also update Metal registrations for C++ registration changes. Also update NNPI registration to account for tightened schema checking Differential Revision: [D39084762](https://our.internmc.facebook.com/intern/diff/D39084762/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39084762/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/84173 Approved by: https://github.com/Krovatkin commit cfd18e105fe795072edafe54c1f5861967ca746a Author: Kimish Patel Date: Sat Aug 27 16:06:16 2022 -0700 [Pytorch][Ondevice quantization] Add device side API to convert model (#83807) Summary: This diff adds a device side API which will convert the model to its quantized equivalent. The input model must have been prepared AOT for quantization. API is implemented by: - Running reset observers - Running observe method - Running quantize method - And replacing the method, e.g. forward, with its quantized equivalent.
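A purely illustrative Python sketch of the four-step convert flow listed above; the stub class and the method names `reset_observers_` and `quantized_forward` are assumptions (only `observe_forward`/`quantize_forward` are named in this commit stack), so this is not the real TorchScript API:
```python
# Hedged sketch: mirrors the described steps (reset observers -> observe ->
# quantize -> swap forward) on a stand-in object, not on a scripted module.
class PreparedModuleStub:
    def reset_observers_(self):
        print("1. reset observer state")
    def observe_forward(self, x):
        print("2. run forward once to record observer statistics")
    def quantize_forward(self, x):
        print("3. compute qparams and insert quant/dequant")
    def quantized_forward(self, x):
        print("4. quantized forward")
        return x

def convert_on_device(m, example_input):
    m.reset_observers_()
    m.observe_forward(example_input)
    m.quantize_forward(example_input)
    m.forward = m.quantized_forward  # replace forward with its quantized equivalent
    return m

converted = convert_on_device(PreparedModuleStub(), example_input=None)
converted.forward(None)
```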
Test Plan: test/quantization/jit/test_ondevice_quantization.py Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D38889818](https://our.internmc.facebook.com/intern/diff/D38889818) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83807 Approved by: https://github.com/iseeyuan commit eebdcb5a2ef6a117a608b9ca5ca1eb2fd4f72fbd Author: Kimish Patel Date: Sat Aug 27 16:06:16 2022 -0700 [Pytorch][quantization][ondevice] Add a wrapper API for server side prep (#83742) for ondevice quantization Summary: THis diff just wraps existing API for ondevice quantization Test Plan: test/quantization/jit/test_ondevice_quantization.py Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D38868647](https://our.internmc.facebook.com/intern/diff/D38868647) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83742 Approved by: https://github.com/jerryzh168 commit 5c7e801c50e4478e8f96ab287a77ee4b08051f75 Author: Kimish Patel Date: Sat Aug 27 16:06:15 2022 -0700 [pytorch][on device quant] Finalize method for ondevice quant (#83571) Summary: After inserting quant dequant nodes in the graph, we need 1. Insert packed param creation and quantized op 2. Create packed_params attribute in the top module. For this we need graph that inlined except for calculate_qparams method calls. But they can be inlined too. So perhaps we need to make sure no other callmethods exist. 3. Insert SetAttr for the packed param 4. Insert GetAttr for the packed param 5. Use GetAttr output for quantized op where applicable, e.g. linear_dynamic The above is added to quantize_ method created inprevious step. Once the above steps are done clone the method into quantized_ Modify quantize_: 1. Remove all outputs from the method. 2. Run dce 3. Remove all inputs from the method except self. Modify quantized_: 1. Remove all packed_param setAttr nodes. 2. Run dce. This should result in removal of all nodes that generate packed param. Test Plan: To be written Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D38771416](https://our.internmc.facebook.com/intern/diff/D38771416) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83571 Approved by: https://github.com/jerryzh168 commit 446afb5f9f62d50d159e58e36f4596dfdfa8bcd5 Author: Kimish Patel Date: Sat Aug 27 16:06:14 2022 -0700 [On Device Quantization][pytorch]Make insert_quant_dequant support ondevice ptq (#83570) Summary: This diff adds a way to: - clone previously observed method - Add calls to observer's calculate_qparams methods - Extract the scale and zero point - Use them to insert quant dequant nodes Now for forward method we have - observe_forward - quantize_forward observe_forward is used post training to observer statistics. In the case of dynamic PTQ this requires just running that method once to update weight observer statistics. quantize_forward method will be used to use the observer statistics to calculate quantization parameters and apply that to quant dequant op. 
Subsequent diffs will replace dequant + op with their quantized op counter parts and replace quantize ops with relevant packed params class where possible Test Plan: To be written Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D38771419](https://our.internmc.facebook.com/intern/diff/D38771419) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83570 Approved by: https://github.com/jerryzh168 commit 6a5d9f1be0c7a7ebe556d427930683a438195def Author: Kimish Patel Date: Fri Aug 26 10:53:28 2022 -0700 Replace "_scalar_type" string with constant (#83569) Summary: Use this refactor to make insertQuantizationOps tempelatized in the later diff Test Plan: Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D38771418](https://our.internmc.facebook.com/intern/diff/D38771418) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83569 Approved by: https://github.com/jerryzh168 commit 9189edb3b3eb7ab9b94d514c428af284b8d978e1 Author: Kimish Patel Date: Fri Aug 26 10:53:27 2022 -0700 [Quantization][Pytorch] On device quantization support part 1 (#83568) Summary: TO support on device quantization this diff introduces observer insertion. Specifically observers are inserted by adding new method with prefix observ_. Intent is that post training, this method will be run to record statistics Test Plan: test_ondevice_quantization.py Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D38771417](https://our.internmc.facebook.com/intern/diff/D38771417) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83568 Approved by: https://github.com/jerryzh168 commit 8acc92eb009bc4df2ee4f9cbd06cd6b9cee533a6 Author: Rohan Varma Date: Mon Aug 29 17:10:25 2022 +0000 [FSDP] Print exec order only in debug mode (#83868) Since exec order warning can result in very long module name print out, gating this only to be printing in debug mode. Oftentimes such as in multiModal training, there is not a lot we can do about this warning since some modules go unused in certain iterations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83868 Approved by: https://github.com/awgu commit 352da6de6b731c04576701295b1b88e733ebaf76 Author: Angela Yi Date: Thu Aug 25 16:29:27 2022 -0700 [fx][pass] Fix type of exception (#84094) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84094 Approved by: https://github.com/SherlockNoMad commit 7088a98fba3a5031a2afc293cbf25cec09f248a5 Author: soulitzer Date: Fri Aug 26 11:21:19 2022 -0400 conv2d: require bias to have the same dtype as input and weight on cpu (#83686) Fixes https://github.com/pytorch/pytorch/issues/83505 BC-breaking message: - Previously we only required input and weight to have the same dtype on cpu (when input is non-complex). After this change, the dtype of bias is now also expected to have the same dtype. This change was necessary to improve the error message for certain combinations of inputs. This behavior now also matches that of convolution on cuda.
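A hedged illustration of the BC-breaking behavior described above: on CPU, a conv2d call whose bias dtype differs from the input/weight dtype is now expected to fail (the exception type and message below are assumptions, not taken from this commit):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)               # float32 input
w = torch.randn(4, 3, 3, 3)               # float32 weight
b = torch.randn(4, dtype=torch.float64)   # float64 bias: no longer allowed to differ

try:
    F.conv2d(x, w, b)
except RuntimeError as err:
    print("conv2d rejected mismatched bias dtype:", err)
```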
Old plan: Previously convolution (at least for slow_conv2d) did not perform type promotion, i.e. the output of `conv(int, int, float)` is an int, and that leads to the autograd assert. This PR adds type promotion handling at the `at::native::conv2d` (this is a composite) level. We also need to correct or remove many tests that assume that conv errors when input types are mixed. Pros: - Doing type promotion at this level avoids the complex path from having any special handling for mixed dtypes, and can potentially speed up mixed-dtype inputs to now dispatch to faster kernels which are only capable of handling floats. Cons: - Doing type promotion at this level has the risk of introducing extra overhead when we would've dispatched to a kernel capable of handling mixed types anyway. I don't know if any of these exist at all, though - it is possible that inputs with any non-float arguments are dispatched to the slow path. If this approach is OK, we can proceed with the other convolutions as well:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83686 Approved by: https://github.com/ngimel commit 1945d28f58732a883220563c0dcebf43f1412c72 Author: PyTorch MergeBot Date: Mon Aug 29 16:41:09 2022 +0000 Revert "[fx][pass] Fix type of exception (#84094)" This reverts commit eb2fa2e042b18ba35fa6eedb769c2efe411dbcfb. Reverted https://github.com/pytorch/pytorch/pull/84094 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit c29b7865d02239d89d4407559a85a556039cb7c6 Author: PyTorch MergeBot Date: Mon Aug 29 15:44:31 2022 +0000 Revert "[xla hash update] update the pinned xla hash (#84164)" This reverts commit fbf5a3f9f41d69248099c957571be0474659b15a. Reverted https://github.com/pytorch/pytorch/pull/84164 on behalf of https://github.com/weiwangmeta due to MESSAGE commit eff312f07be85508f049e5ddfe9ada5aa5df4fc4 Author: samdow Date: Fri Aug 26 14:49:05 2022 +0000 nit fixes in modes (#83924) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83924 Approved by: https://github.com/ezyang, https://github.com/zou3519 commit 1a53e35b9db8442432d8dfcfff430d2a569ad062 Author: Rohan Varma Date: Fri Aug 26 16:58:59 2022 +0000 Enforce explicit ProcessGroup passed into DefaultState (#84105) Would prefer to enforce that users pass in explicit PG into these state objects when using comm hooks with FSDP, so that it is clear and easy debugable over which processes communication is taking place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84105 Approved by: https://github.com/mrshenli, https://github.com/zhaojuanmao commit f66be71d77c2998122c5177c2362c3d66b7b19cc Author: Rodrigo Kumpera Date: Mon Aug 29 14:38:32 2022 +0000 [checkpoint] Adopt Planner interface across the board. (#83781) Change StorageReader and StorageWriter to follow the new SavePlanner / LoadPlanner design. Add optional planner param to load_state_dict and save_state_dict and implement the new protocol. This includes a small rework of FileSystem layer to support single file per rank and making fsync optional to match torch.save behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83781 Approved by: https://github.com/wanchaol, https://github.com/fduwjj commit fbf5a3f9f41d69248099c957571be0474659b15a Author: PyTorch MergeBot Date: Mon Aug 29 10:29:47 2022 +0000 [xla hash update] update the pinned xla hash (#84164) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84164 Approved by: https://github.com/pytorchbot commit b8e1c54f53404393e5fe421c26f7bfc951251e43 Author: Nikita Shulga Date: Mon Aug 29 09:29:28 2022 +0000 [Prim] Implement group_norm_backward (#84037) Test plan: CI, i.e. 
`python3 test_decomp.py -v -k test_comprehensive_nn_functional_group_norm` plus: ``` import torch func = torch.ops.aten.native_group_norm_backward.default decomp = torch._decomp.decomposition_table[func] for args in ( (torch.rand(1, 6, 3), torch.rand(1, 6, 3), torch.rand(1, 2), torch.rand(1, 2), torch.rand(6), 1, 6, 3, 2, [True, True, True]), (torch.rand(64, 768, 7, 7), torch.rand(64, 768, 7, 7), torch.rand(64, 1), torch.rand(64, 1), torch.rand(768), 64, 768, 49, 1, [True, True, True])): nrc=func(*args) drc=decomp(*args) for i in range(len(nrc)): print(i, torch.max(nrc[i]-drc[i])) print(all(torch.allclose(x, y) for (x, y) in zip(nrc, drc))) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84037 Approved by: https://github.com/Chillee, https://github.com/ngimel commit 2436cf8aa8a557ffa031d700f6448047b0fd58a3 Author: Driss Guessous Date: Mon Aug 29 09:12:24 2022 +0000 [Nested Tensor] detach (#84078) Add detach op for nested tensors. Nested tensors are not part of the composite explicit dispatch key set and therefore need to be added manually. The Detach test is failing only for the dtype=torch.float32, torch.float16 and device=cuda. The chain of ops that called are sum.backward() -> from_padded() -> unbind(). This populates the grad for a and b. Does this potentially indicated that cuda implementation for one of these ops, likely from_padded() is incorrect? Pull Request resolved: https://github.com/pytorch/pytorch/pull/84078 Approved by: https://github.com/albanD commit 0095571135c8e3d2017a270c7652fd8605425879 Author: Animesh Jain Date: Mon Aug 29 09:11:54 2022 +0000 [AOT Autograd] Redirect named_parameters to original mod (#84157) Helps in comparing accuracy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84157 Approved by: https://github.com/Chillee commit 3f947264533f318355b848c070a4279032cbb5d8 Author: erjia Date: Fri Aug 26 20:55:44 2022 +0000 [DataPipe] Convert MapDataPipe.shuffle to IterDataPipe (#83202) Fixes: https://github.com/pytorch/data/issues/718 This is an alternative PR against https://github.com/pytorch/pytorch/pull/82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`. Output as `MapDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) True >>> isinstance(dp, IterDataPipe) False ``` Output as `IterDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) False >>> isinstance(dp, IterDataPipe) True ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83202 Approved by: https://github.com/NivekT commit 7480e83338e69bd39905ed90a23b7c22960e1aff Author: Taylor Robie Date: Fri Aug 26 12:49:11 2022 -0700 [Profiler] Add `disabled` and `global` methods to ProfilerConfig. (#83891) `ProfilerState::Disabled` and `ProfilerState::KINETO_ONDEMAND` have special semantics. The former is somewhat intuitive, but the degree of behavior branching on the latter (and why the branching is necessary) is less clear. 
By factoring the enum checks into methods, we can both clairify intent and future proof in case we ever add other global profiling contexts. Differential Revision: [D38917980](https://our.internmc.facebook.com/intern/diff/D38917980/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83891 Approved by: https://github.com/slgong-fb commit 8e6207bcd8beff791c517977c3f83179e0f51d45 Author: PyTorch MergeBot Date: Mon Aug 29 06:36:17 2022 +0000 Revert "[ONNX] Export node and value with scope name (#82040)" This reverts commit 6a3666282d000a0f196fbdd8b182bb4fd711f189. Reverted https://github.com/pytorch/pytorch/pull/82040 on behalf of https://github.com/weiwangmeta due to Diff reverted internally commit d50aa517b532dd58daafb79160bcc8758ecd01b7 Author: PyTorch MergeBot Date: Mon Aug 29 06:34:50 2022 +0000 Revert "Add support to traverse all python collection objects (#84079)" This reverts commit e0f0c8e7b9acf6b821956acadbe79aaa0f6f0237. Reverted https://github.com/pytorch/pytorch/pull/84079 on behalf of https://github.com/weiwangmeta due to Diff reverted internally commit 0ac2986d3334f8f9b35ca2fa7a30c20022c26fa6 Author: Natalia Gimelshein Date: Mon Aug 29 04:29:09 2022 +0000 Fixes softmax indexing for large tensors (#84182) Fixes #84144 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84182 Approved by: https://github.com/janeyx99 commit 533203f5aaa9f8987f25d828e1c37e755a2ba4ea Author: Natalia Gimelshein Date: Mon Aug 29 02:25:00 2022 +0000 _to_copy decomp (#84108) Per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/84108 Approved by: https://github.com/Chillee commit 9fc02f6bc558f26a460241cbeaea915ea2b41005 Author: lezcano Date: Sun Aug 28 22:58:52 2022 +0000 Decomposition for adaptive_avg_pool2d (#84062) This was already implemented as a lowering in https://github.com/pytorch/torchdynamo/pull/962. I'm putting the idea up here ~(I haven't even run this code, so it surely has *many* issues, but I reckon the general idea should hopefully be alright).~ The tests now pass and I corrected the issues that the first implementation had. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84062 Approved by: https://github.com/jansel commit 3aae6ff1e13128412e44a69ca3da5582f17fac02 Author: Ivan Yashchuk Date: Sun Aug 28 18:45:25 2022 +0000 Add nvprims.var_mean (#83508) This PR adds nvfuser-specific primitive - `var_mean`. Interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by `TorchRefsNvfuserCapabilityMode` context manager. I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`). Layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception. Here's a simple comparison of performance with this PR and master (on 3080ti): ```py import torch from torch._prims.context import TorchRefsNvfuserCapabilityMode from torch.fx.experimental.proxy_tensor import make_fx from torch._prims.executor import execute def func(a): return torch.native_layer_norm(a, (1024,), None, None, 1e-6) a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda") with TorchRefsNvfuserCapabilityMode(): gm = make_fx(func)(a) for _ in range(10): execute(gm, a, executor="strictly_nvfuser"); ``` run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py` ```py ``` So this PR gives about 35% improvement in performance using nvfuser executor with this specific normalized shape. 
Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`). Ref. https://github.com/pytorch/pytorch/issues/80187 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508 Approved by: https://github.com/ngimel commit 261be8e5c2e3702528105005035d2e151b4f2724 Author: PyTorch MergeBot Date: Sun Aug 28 18:30:05 2022 +0000 Revert "[Profiler] Add `disabled` and `global` methods to ProfilerConfig. (#83891)" This reverts commit 69e9f905b7ddc0f453fa273746c9db5ed60bc71a. Reverted https://github.com/pytorch/pytorch/pull/83891 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit 7244a3737c8c6fd4c2e4e42fcddc14e2f56a35c1 Author: PyTorch MergeBot Date: Sun Aug 28 18:00:17 2022 +0000 Revert "[DataPipe] Convert MapDataPipe.shuffle to IterDataPipe (#83202)" This reverts commit a423c966a780a1fdac6a29c6d2be2a0709de2cd5. Reverted https://github.com/pytorch/pytorch/pull/83202 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit 33db5da4c16f048a966542f8e916afc02463f71c Author: PyTorch MergeBot Date: Sun Aug 28 17:30:50 2022 +0000 Revert "[Prim] Implement group_norm_backward (#84037)" This reverts commit bed85cce8b2e7c7430c1f3b5f7c8c765b779ec3e. Reverted https://github.com/pytorch/pytorch/pull/84037 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit df523a6eeef283a79062f828c3fedce2cc3e32f0 Author: PyTorch MergeBot Date: Sun Aug 28 16:29:08 2022 +0000 Revert "[AOT Autograd] Redirect named_parameters to original mod (#84157)" This reverts commit 43620b7e8d722d1b5c34cbda2619ccd9f92ca820. Reverted https://github.com/pytorch/pytorch/pull/84157 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit f4f54c7ce1bac0db91922c618c38a5f72cab130b Author: PyTorch MergeBot Date: Sun Aug 28 15:30:21 2022 +0000 Revert "[Nested Tensor] detach (#84078)" This reverts commit 092fe71f33fe37b8d09499708230307aea028eaf. Reverted https://github.com/pytorch/pytorch/pull/84078 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit 5cf4542f86e0907ac0ac514d64995ae90d41ac78 Author: PyTorch MergeBot Date: Sun Aug 28 14:30:18 2022 +0000 Revert "Enforce explicit ProcessGroup passed into DefaultState (#84105)" This reverts commit adc9a1e2fbd0e6d873dc2441d250b94fe9098e9e. Reverted https://github.com/pytorch/pytorch/pull/84105 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit ff23f3ac1c10f6bd5104f27aa1566b71e2ae6fa0 Author: PyTorch MergeBot Date: Sun Aug 28 13:27:49 2022 +0000 Revert "_to_copy decomp (#84108)" This reverts commit e33897cb9999f124bce126c7e43f96c0755413ef. Reverted https://github.com/pytorch/pytorch/pull/84108 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit d8cc8368abfc540725b8f944419bcf1e7e79458e Author: PyTorch MergeBot Date: Sun Aug 28 12:28:58 2022 +0000 Revert "[ONNX] Fix type annotations and enable type checking for all apis (#84091)" This reverts commit 6446da17305960088dfae501d5c7358af068fa81. Reverted https://github.com/pytorch/pytorch/pull/84091 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit b159a5230ffb497c3683e67f95095615493ef65f Author: PyTorch MergeBot Date: Sun Aug 28 11:30:27 2022 +0000 Revert "Add nvprims.var_mean (#83508)" This reverts commit 7e7694b6615fbf46abfab234615fa891c2819eb7. 
Reverted https://github.com/pytorch/pytorch/pull/83508 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit 71cd3fa2d56d24a3ef246102ebb145a06fbe88a3 Author: PyTorch MergeBot Date: Sun Aug 28 10:29:30 2022 +0000 Revert "[xla hash update] update the pinned xla hash (#84164)" This reverts commit c032b097e315177af5bc867eeee5452b7df32952. Reverted https://github.com/pytorch/pytorch/pull/84164 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally commit b078d242c481344ab083926e4010322ec68884c9 Author: jjsjann123 Date: Sun Aug 28 04:26:36 2022 +0000 Nvfuser to copy decomp to prim (#83782) Conditional decomposing aten::_to_copy to nvprim::convert_element_type to allow fusion with type casting, which is introduced during type promotion phase at torch decomposition. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83782 Approved by: https://github.com/ngimel commit c9b144ff47ff3b6f358752976d29ac61f2b9b070 Author: kuttire42 <64169153+kuttire42@users.noreply.github.com> Date: Sun Aug 28 01:25:07 2022 +0000 Replace assertEqualIgnoreTypes from common_methods_invocations.py (#84076) This addresses TODO:38095 . More details at https://github.com/pytorch/pytorch/issues/38095 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/84076 Approved by: https://github.com/kit1980 commit b8fe0edcf5a92f53d8f0254d3ad10c2770b23772 Author: PyTorch MergeBot Date: Sat Aug 27 14:14:58 2022 +0000 Revert "Make allreduce compatible with fx ProxyTensor (#84126)" This reverts commit ec5b83f76847584013a9cd4177d389a408033614. Reverted https://github.com/pytorch/pytorch/pull/84126 on behalf of https://github.com/malfet due to Likely broke multigpu periodic jobs, see https://github.com/pytorch/pytorch/runs/8044611438?check_suite_focus=true commit c032b097e315177af5bc867eeee5452b7df32952 Author: PyTorch MergeBot Date: Sat Aug 27 10:24:21 2022 +0000 [xla hash update] update the pinned xla hash (#84164) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84164 Approved by: https://github.com/pytorchbot commit 7e7694b6615fbf46abfab234615fa891c2819eb7 Author: Ivan Yashchuk Date: Sat Aug 27 09:05:20 2022 +0000 Add nvprims.var_mean (#83508) This PR adds nvfuser-specific primitive - `var_mean`. Interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by `TorchRefsNvfuserCapabilityMode` context manager. I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`). Layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception. 
Here's a simple comparison of performance with this PR and master (on 3080ti): ```py import torch from torch._prims.context import TorchRefsNvfuserCapabilityMode from torch.fx.experimental.proxy_tensor import make_fx from torch._prims.executor import execute def func(a): return torch.native_layer_norm(a, (1024,), None, None, 1e-6) a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda") with TorchRefsNvfuserCapabilityMode(): gm = make_fx(func)(a) for _ in range(10): execute(gm, a, executor="strictly_nvfuser"); ``` run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py` ```py ``` So this PR gives about 35% improvement in performance using nvfuser executor with this specific normalized shape. Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`). Ref. https://github.com/pytorch/pytorch/issues/80187 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508 Approved by: https://github.com/ngimel commit 6446da17305960088dfae501d5c7358af068fa81 Author: Justin Chu Date: Sat Aug 27 02:05:37 2022 +0000 [ONNX] Fix type annotations and enable type checking for all apis (#84091) Enable runtime type checking for all torch.onnx public apis, symbolic functions and most helpers (minus two that does not have a checkable type: `_.JitType` does not exist) by adding the beartype decorator. Fix type annotations to makes unit tests green. Profile: export `torchvision.models.alexnet(pretrained=True)` ``` with runtime type checking: 21.314 / 10 passes without runtime type checking: 20.797 / 10 passes + 2.48% ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84091 Approved by: https://github.com/BowenBao commit e33897cb9999f124bce126c7e43f96c0755413ef Author: Natalia Gimelshein Date: Sat Aug 27 03:51:03 2022 +0000 _to_copy decomp (#84108) Per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/84108 Approved by: https://github.com/Chillee commit adc9a1e2fbd0e6d873dc2441d250b94fe9098e9e Author: Rohan Varma Date: Fri Aug 26 16:58:59 2022 +0000 Enforce explicit ProcessGroup passed into DefaultState (#84105) Would prefer to enforce that users pass in explicit PG into these state objects when using comm hooks with FSDP, so that it is clear and easy debugable over which processes communication is taking place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84105 Approved by: https://github.com/mrshenli, https://github.com/zhaojuanmao commit 092fe71f33fe37b8d09499708230307aea028eaf Author: Driss Guessous Date: Sat Aug 27 03:00:53 2022 +0000 [Nested Tensor] detach (#84078) Add detach op for nested tensors. Nested tensors are not part of the composite explicit dispatch key set and therefore need to be added manually. The Detach test is failing only for the dtype=torch.float32, torch.float16 and device=cuda. The chain of ops that called are sum.backward() -> from_padded() -> unbind(). This populates the grad for a and b. Does this potentially indicated that cuda implementation for one of these ops, likely from_padded() is incorrect? Pull Request resolved: https://github.com/pytorch/pytorch/pull/84078 Approved by: https://github.com/albanD commit 43620b7e8d722d1b5c34cbda2619ccd9f92ca820 Author: Animesh Jain Date: Sat Aug 27 02:53:58 2022 +0000 [AOT Autograd] Redirect named_parameters to original mod (#84157) Helps in comparing accuracy. 
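A small, hypothetical sketch of that idea (not the actual AOT Autograd code): a wrapper module forwards `named_parameters` to the original module, so parameter names line up when comparing the two models' accuracy:
```python
import torch.nn as nn

class CompiledWrapper(nn.Module):
    """Stand-in for a compiled module that keeps the original module around."""
    def __init__(self, orig_mod):
        super().__init__()
        self._orig_mod = orig_mod

    def forward(self, *args, **kwargs):
        # A real implementation would call the compiled function here.
        return self._orig_mod(*args, **kwargs)

    def named_parameters(self, *args, **kwargs):
        # Redirect so names match the original ("weight" rather than "_orig_mod.weight").
        return self._orig_mod.named_parameters(*args, **kwargs)

orig = nn.Linear(4, 4)
wrapped = CompiledWrapper(orig)
assert dict(wrapped.named_parameters()).keys() == dict(orig.named_parameters()).keys()
```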
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84157 Approved by: https://github.com/Chillee commit c7edcd69683f6e3b08305ed0d4621e148fbfbe17 Author: PyTorch MergeBot Date: Sat Aug 27 01:23:17 2022 +0000 Revert "Don't introduce new overload for SymInt (#83628)" This reverts commit 9790d90e4b0288796ab44a6b4979db0a67580ba8. Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to Breaks internal builds, see D39076487 commit 38e5e4a85f18c716ed84d12e6c7d5155ac582b65 Author: PyTorch MergeBot Date: Sat Aug 27 01:18:43 2022 +0000 Revert "[xla hash update] update the pinned xla hash (#84043)" This reverts commit ddedc294fbb4c13170811442b590a18e950dae67. Reverted https://github.com/pytorch/pytorch/pull/84043 on behalf of https://github.com/malfet due to Depends on https://github.com/pytorch/pytorch/pull/83628 commit bed85cce8b2e7c7430c1f3b5f7c8c765b779ec3e Author: Nikita Shulga Date: Sat Aug 27 01:10:27 2022 +0000 [Prim] Implement group_norm_backward (#84037) Test plan: CI, i.e. `python3 test_decomp.py -v -k test_comprehensive_nn_functional_group_norm` plus: ``` import torch func = torch.ops.aten.native_group_norm_backward.default decomp = torch._decomp.decomposition_table[func] for args in ( (torch.rand(1, 6, 3), torch.rand(1, 6, 3), torch.rand(1, 2), torch.rand(1, 2), torch.rand(6), 1, 6, 3, 2, [True, True, True]), (torch.rand(64, 768, 7, 7), torch.rand(64, 768, 7, 7), torch.rand(64, 1), torch.rand(64, 1), torch.rand(768), 64, 768, 49, 1, [True, True, True])): nrc=func(*args) drc=decomp(*args) for i in range(len(nrc)): print(i, torch.max(nrc[i]-drc[i])) print(all(torch.allclose(x, y) for (x, y) in zip(nrc, drc))) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84037 Approved by: https://github.com/Chillee, https://github.com/ngimel commit a423c966a780a1fdac6a29c6d2be2a0709de2cd5 Author: erjia Date: Fri Aug 26 20:55:44 2022 +0000 [DataPipe] Convert MapDataPipe.shuffle to IterDataPipe (#83202) Fixes: https://github.com/pytorch/data/issues/718 This is an alternative PR against https://github.com/pytorch/pytorch/pull/82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`. Output as `MapDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) True >>> isinstance(dp, IterDataPipe) False ``` Output as `IterDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) False >>> isinstance(dp, IterDataPipe) True ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83202 Approved by: https://github.com/NivekT commit 69e9f905b7ddc0f453fa273746c9db5ed60bc71a Author: Taylor Robie Date: Fri Aug 26 12:49:11 2022 -0700 [Profiler] Add `disabled` and `global` methods to ProfilerConfig. (#83891) `ProfilerState::Disabled` and `ProfilerState::KINETO_ONDEMAND` have special semantics. The former is somewhat intuitive, but the degree of behavior branching on the latter (and why the branching is necessary) is less clear. 
By factoring the enum checks into methods, we can both clairify intent and future proof in case we ever add other global profiling contexts. Differential Revision: [D38917980](https://our.internmc.facebook.com/intern/diff/D38917980/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83891 Approved by: https://github.com/slgong-fb commit f4dc7b3a8a60bb1823f58154f1d041b489cbdf25 Author: Taylor Robie Date: Fri Aug 26 10:33:18 2022 -0700 [Profiler][Trivial] Cleanup ExperimentalConfig (#83890) I'm trying to limit how much is in headers to make it easier to read the API surface. In a similar vein, we can replace `hasOptions` with `operator bool` so it just does the right thing in the check. Differential Revision: [D38917366](https://our.internmc.facebook.com/intern/diff/D38917366/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83890 Approved by: https://github.com/slgong-fb commit eb2fa2e042b18ba35fa6eedb769c2efe411dbcfb Author: Angela Yi Date: Thu Aug 25 16:29:27 2022 -0700 [fx][pass] Fix type of exception (#84094) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84094 Approved by: https://github.com/SherlockNoMad commit aa4be48b58f5f22e15d1695b6332064c3c4d7074 Author: Jeff Daily Date: Fri Aug 26 21:48:06 2022 +0000 [Nested Tensor] do not use at::cuda::getDefaultCUDAStream() (#84134) Use at::cuda::getCurrentCUDAStream(), not getDefaultCUDAStream(). Otherwise, add/remove padding kernels won't sync with current stream, resulting in flaky unit tests in test_nestedtensor.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84134 Approved by: https://github.com/drisspg commit 82efb0e196c71a75985595fbbf294d8c816e9753 Author: Huy Do Date: Fri Aug 26 21:33:22 2022 +0000 Enable cache action for windows and other minor workflows (#84093) Following up on https://github.com/pytorch/pytorch/pull/84026, these are the rest of pip dependencies that I can find. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84093 Approved by: https://github.com/malfet commit 3fae89d4a468a02be501357eb123ce2bf7086d2f Author: Ian Graves Date: Fri Aug 26 21:04:04 2022 +0000 Read via FileAdapter when loading files in torch if not flatbuffer (#84028) Summary: This will optimize memory usage at the small cost of loading time when loading mobile models restoring the behavior before D36926217 (https://github.com/pytorch/pytorch/commit/fed12ff680813c0fab7dba7232f6b4cd8b33b8d3). Test Plan: Signals Reviewed By: qihqi Differential Revision: D38998858 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84028 Approved by: https://github.com/qihqi, https://github.com/cccclai commit e0f0c8e7b9acf6b821956acadbe79aaa0f6f0237 Author: erjia Date: Fri Aug 26 21:02:43 2022 +0000 Add support to traverse all python collection objects (#84079) Fixes https://github.com/pytorch/data/issues/752 This PR makes `traverse` function supporting more collections data structures from Python. Please let me know if anyone has a better idea about how to elegantly check if the object is a collection then we can dive into this object to see wether there is any DataPipe wrapped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84079 Approved by: https://github.com/NivekT commit 6a3666282d000a0f196fbdd8b182bb4fd711f189 Author: BowenBao Date: Fri Aug 26 10:29:44 2022 -0700 [ONNX] Export node and value with scope name (#82040) Introduce `_jit_pass_onnx_assign_node_and_value_names` to parse and assign scoped name for nodes and values in exported onnx graph. 
Module layer information is obtained from `ONNXScopeName` captured in `scope` attribute in nodes. For nodes, the processed onnx node name are stored in attribute `onnx_name`. For values, the processed onnx output name are stored as `debugName`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82040 Approved by: https://github.com/AllenTiTaiWang, https://github.com/justinchuby, https://github.com/abock commit b5c2b0b2004c3e0c4b0850bd841f13e72d88e82f Author: Catherine Lee Date: Fri Aug 26 20:56:09 2022 +0000 make job pass even if monitoring script fails (#84068) makes github slightly less confusing to look at when a test fails Pull Request resolved: https://github.com/pytorch/pytorch/pull/84068 Approved by: https://github.com/huydhn, https://github.com/malfet commit 6a5860395619700633ab148b7bdbaed331eb67d5 Author: Animesh Jain Date: Fri Aug 26 20:49:43 2022 +0000 Update Dynamo pin (#83829) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/83829 Approved by: https://github.com/ezyang commit 61b9d8fccd3361f21e1f3548c2a9538b62cc7525 Author: Taylor Robie Date: Fri Aug 26 10:33:17 2022 -0700 [Profiler][Trivial] Add null handling to `AppendOnlyList::copy` memcpy path. (#83963) It is apparently undefined behavior to do pointer arithmetic on nullptr. In the case of AppendOnlyList, `next_` will only be null if `end_` is also null and thus the `memcpy` path will only be triggered if `n == 0`. Nonetheless, it is UB to `memcpy(0, 0, 0)` The extra null check is in a `C10_LIKELY` block so the extra cost should be negligible, and indeed after dusting off the component microbenchmarks there's no observable difference. Differential Revision: [D38969443](https://our.internmc.facebook.com/intern/diff/D38969443/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83963 Approved by: https://github.com/slgong-fb commit 014a333df37ca331d4ae969d200aece76b1d4536 Author: Taylor Robie Date: Fri Aug 26 10:33:15 2022 -0700 [Profiler][Minor] Extend Python bindings (#83622) Adding some fields which are needed for memory profiling. Differential Revision: [D38528382](https://our.internmc.facebook.com/intern/diff/D38528382/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83622 Approved by: https://github.com/Gamrix commit 681c38704e1efd16079ed79b9a92ba6d0a57db29 Author: Justin Chu Date: Thu Aug 25 20:28:23 2022 +0000 [ONNX] Clean up patch functions (#83136) Changes: - Move namespace handling from `_new_node` to `_graph_op` for clarity - Always require the `aten` namespace when creating aten ops. Remove the `aten` argument supplied in `_aten_op` for clarity - Rename the `_ATTR_PATTERN` global - Improve types - Update `_add_attribute` to raise ValueErrors Pull Request resolved: https://github.com/pytorch/pytorch/pull/83136 Approved by: https://github.com/BowenBao commit ec5b83f76847584013a9cd4177d389a408033614 Author: Shen Li Date: Fri Aug 26 16:36:41 2022 +0000 Make allreduce compatible with fx ProxyTensor (#84126) land after #83122 This PR explores solutions for 2 issues: 1. Collective comm ops are inplace ops, and does not return a tensor. With that, `make_fx` cannot include comm ops in the traced graph. The current solution is to make comm ops return a tuple of `(output_tensors, work_handle)`, so that [`proxy_call`](https://github.com/pytorch/pytorch/blob/90821aab100a436424113e2306eac63f5e247ee5/torch/fx/experimental/proxy_tensor.py#L170-L172) can handle that. 
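As a rough, hedged sketch of the tuple-return idea above (this is not the actual c10d change; `fake_allreduce` and its `work` placeholder are invented for illustration), a "collective" that returns its output tensors instead of mutating in place is something `make_fx` can record:

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def fake_allreduce(t):
    # Stand-in for a comm op: return the "communicated" tensors plus a
    # placeholder for the work handle, rather than mutating and returning None.
    out = t * 2
    work = None  # placeholder for ProcessGroup::Work
    return [out], work

def fn(x):
    outs, _work = fake_allreduce(x)
    return outs[0].sum()

gm = make_fx(fn)(torch.randn(4))
print(gm.graph)  # the mul and sum show up as aten nodes in the traced graph
```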
It won't change the behavior of existing c10d Python/C++ APIs, so I directly added the code to `Ops.cpp`. 2. `make_fx` does not recognize `ProcessGroup::Work` and will ignore the `wait()` call on the work when tracing graph. However, this might break correctness, as when running the traced function, it could consume a tensor before it's ready. The current solution is to create a `CommTensor` tensor subclass to explicitly call `wait()`. In this PR, I am only doing this in the test, as we will need more discussion to see if we can add this to c10d Python implementations. kudos to @Chillee @wanchaol Pull Request resolved: https://github.com/pytorch/pytorch/pull/84126 Approved by: https://github.com/wanchaol commit f93446adc2b5b90e144d1b0a3e81269ab0c3401b Author: Shen Li Date: Fri Aug 26 13:42:37 2022 +0000 Update proxy_tensor.py to support List input/output (#83302) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83302 Approved by: https://github.com/Chillee commit 527a16016995c63dc7a7fcf74f18a75e2a96ff0e Author: Shen Li Date: Fri Aug 26 13:42:36 2022 +0000 Expose ProcessGroup::Work.wait() API to TorchScript (#83303) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83303 Approved by: https://github.com/rohan-varma commit c6348a7109796887d6497ed4c463537016003c39 Author: Adam J. Stewart Date: Fri Aug 26 18:58:25 2022 +0000 Add type hints to torch.save, torch.load (#83937) I'll probably need help with this one. I'm not sure what the full type signature for `map_location` should be. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83937 Approved by: https://github.com/malfet, https://github.com/albanD commit 582c0833d520fa802664eabeb689879b7e67dd2b Author: Catherine Lee Date: Fri Aug 26 18:48:46 2022 +0000 mac circleci workflows (#82780) Add mac and ios workflows to circleci so they can be run on pull m1 tests not included because circleci doesnt have machines Unsure how to get certain environment variables (specifically for arm64 ios builds that require env vars like `IOS_SIGN_KEY_2022` and `IOS_DEV_TEAM_ID` that are stored in the org-member context which is not accessible by everyone. doc regarding env vars https://docs.google.com/document/d/1J_3Z9sfu2vlHMF1fjdJfeTuxPXC6dgqJs7aU0KpYSBU/edit# Pull Request resolved: https://github.com/pytorch/pytorch/pull/82780 Approved by: https://github.com/malfet, https://github.com/huydhn commit e9dff858c3d9aa57d4ecca4410bfbcd996eaf8eb Author: samdow Date: Thu Aug 25 15:29:34 2022 -0400 [functorch] add lstsq batch rule (#82325) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82325 Approved by: https://github.com/zou3519 commit a08911400edb62c9caa0c94d1ce176cf8cb29765 Author: Peter Bell Date: Thu Aug 25 23:43:25 2022 +0100 Use C10_HAS_CPP_ATTRIBUTE to simplify nodiscard definition (#83976) `C10_HAS_CPP_ATTRIBUTE` only expands to `__has_cpp_attribute` when it is defined, so we avoid the extra `#if defined(__has_cpp_attribute)` checks and double-nested `#if`s Pull Request resolved: https://github.com/pytorch/pytorch/pull/83976 Approved by: https://github.com/albanD commit b429a17545be8418f8d5887ad302c9b8af031177 Author: Peter Bell Date: Thu Aug 25 23:43:25 2022 +0100 Enable -Wunused-local-typedefs (#83708) I recently had a PR reverted because it triggered an unused-local-typedefs warning, so disabling these in the CMake build is counter-productive. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83708 Approved by: https://github.com/albanD commit 65ea3d062161f3aa5c8969b62ca322b0518300ae Author: kshitij12345 Date: Fri Aug 26 15:14:37 2022 +0000 [composite compliance] cov, corrcoef (#82954) Ref: #69991 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82954 Approved by: https://github.com/zou3519 commit cddf96c4ba9425bd70782979901b78007760fef5 Author: lezcano Date: Fri Aug 26 00:03:54 2022 +0000 Fix preconditions of adaptive_avg_pooling2d (#84061) Before, if the input had dimension `4`, the channel had to be of dimension non zero. This was not what the errors advertised Pull Request resolved: https://github.com/pytorch/pytorch/pull/84061 Approved by: https://github.com/Chillee commit 9a236c7ab423a8893461b9d6f538d4aca02a086a Author: Horace He Date: Fri Aug 26 08:23:49 2022 +0000 Made some minor cleanups to decompositions (#83814) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83814 Approved by: https://github.com/ngimel commit ddedc294fbb4c13170811442b590a18e950dae67 Author: PyTorch MergeBot Date: Fri Aug 26 10:08:56 2022 +0000 [xla hash update] update the pinned xla hash (#84043) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84043 Approved by: https://github.com/pytorchbot commit d54fad5675138e9f2a0d504e6c7dee3cc099f342 Author: Sergii Dymchenko Date: Fri Aug 26 06:17:29 2022 +0000 Remove unreachable except block (#84070) This was introduced because two PRs tried to fix an issue concurently. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84070 Approved by: https://github.com/huydhn, https://github.com/janeyx99 commit f03ab28b971e8e0b11dda8bf49e85ff3be6fb97d Author: Sergii Dymchenko Date: Fri Aug 26 06:16:20 2022 +0000 Use an unused variable (#84073) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84073 Approved by: https://github.com/huydhn commit 993d8bb77e1ce79940705b1c7667dc9276f449df Author: Min Si Date: Fri Aug 26 05:45:59 2022 +0000 Use size to check same tensor sizes in reduce_scatter and allgather (#84099) Summary: Previous code uses tensor.numel() to check if all tensors have the same size in order to switch between reduce_scatter_v v.s. reduce_scatter, same applies to allgather. However, if the user input tensor is zero in the last dimension (e.g., [648632,0]), then numel() returns zero and check_same_numel is always true. This patch fixes the check to use size rather than numel, to cover the above case. Differential Revision: D39044439 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84099 Approved by: https://github.com/kwen2501 commit 089101fc82971aca874093e7504cf24b11462bcc Author: Christian Jauvin Date: Fri Aug 26 04:53:49 2022 +0000 Fix small typo in cuda.rst (#84012) This fixes a very minor typo in the CUDA semantics doc. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84012 Approved by: https://github.com/malfet commit 15b560a5c4d638c82e738f3496e2faf95fc328a5 Author: Gao, Xiang Date: Fri Aug 26 03:11:46 2022 +0000 Fix missing include for size_t (#84088) Fixes the following issue: ```C++ In file included from /home/gaoxiang/pytorch-ucc/c10/test/util/ConstexprCrc_test.cpp:1: In file included from /home/gaoxiang/pytorch-ucc/c10/util/ConstexprCrc.h:3: /home/gaoxiang/pytorch-ucc/c10/util/IdWrapper.h:42:10: error: unknown type name 'size_t'; did you mean 'std::size_t'? friend size_t hash_value(const concrete_type& v) { ^~~~~~ std::size_t /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/12.2.0/../../../../include/c++/12.2.0/x86_64-pc-linux-gnu/bits/c++config.h:298:26: note: 'std::size_t' declared here typedef __SIZE_TYPE__ size_t; ^ 1 error generated. [111/2069] Generating /home/gaoxiang/pytorch-ucc/torch/csrc/a...ch-ucc/torch/testing/_internal/generated/annotated_fn_args.py ninja: build stopped: subcommand failed. ``` This error happens with my GCC 12.2.0 + Clang 14.0.6. Full environment: ``` Collecting environment information... PyTorch version: 1.13.0a0+git14a53e6 Is debug build: True CUDA used to build PyTorch: 11.7 ROCM used to build PyTorch: N/A OS: Arch Linux (x86_64) GCC version: (GCC) 12.2.0 Clang version: 14.0.6 CMake version: version 3.24.1 Libc version: glibc-2.36 Python version: 3.10.6 (main, Aug 3 2022, 17:39:45) [GCC 12.1.1 20220730] (64-bit runtime) Python platform: Linux-5.19.3-arch1-1-x86_64-with-glibc2.36 Is CUDA available: True CUDA runtime version: 11.7.99 GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 GPU 1: NVIDIA GeForce RTX 2080 Ti Nvidia driver version: 515.65.01 cuDNN version: Probably one of the following: /usr/lib/libcudnn.so.8.4.1 /usr/lib/libcudnn_adv_infer.so.8.4.1 /usr/lib/libcudnn_adv_train.so.8.4.1 /usr/lib/libcudnn_cnn_infer.so.8.4.1 /usr/lib/libcudnn_cnn_train.so.8.4.1 /usr/lib/libcudnn_ops_infer.so.8.4.1 /usr/lib/libcudnn_ops_train.so.8.4.1 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True Versions of relevant libraries: [pip3] numpy==1.23.1 [pip3] torch==1.13.0a0+gitbcc6f6c [pip3] torch-ucc==1.0.0 [pip3] torchani==2.2 [pip3] torchvision==0.2.2.post3 [conda] Could not collect ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84088 Approved by: https://github.com/ezyang commit 9790d90e4b0288796ab44a6b4979db0a67580ba8 Author: Edward Z. Yang Date: Thu Aug 25 18:33:45 2022 -0700 Don't introduce new overload for SymInt (#83628) Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented. This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts. This is BC-breaking in the following ways: * The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. 
This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this. This is not BC-breaking in the following ways: * The user facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints. (e.g., at::empty(IntArrayRef, ...). To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed. * This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underyling type. Structure of the PR: * The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other: * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular: * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences. * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!) * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway. * Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes. * The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK. * I change how unboxing logic works slightly. 
Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it. * I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload) * I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.) * I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints. * I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628 Approved by: https://github.com/albanD, https://github.com/bdhirsh commit d2f37401b85c9bdea342c4eb0f1d1f277ae93ed0 Author: Rohan Varma Date: Thu Aug 25 18:36:30 2022 +0000 Silence namedtuple warning in dist (#84072) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84072 Approved by: https://github.com/awgu commit b35e7c5da770ec5b5080cb69eb3618f5a3203e9c Author: Rohan Varma Date: Thu Aug 25 18:36:06 2022 +0000 Fix FSDP not all outputs used in loss (#83195) There are a couple issues / assumptions within FSDP today that this PR attempts to fix: - In wait_for_post_backward, we assume that if a param required grad, its post backward was called, but this is not true, i.e. if its output did not participate in grad computation, it would not have called post backward. To fix this we simply removed those assertions. - There is a deeper issue where in `_finalize_params`, we could end up assigning a grad of the sharded shape to an unsharded parameter gradient field, which would raise a shape error. This can happen for example if a parameter's usage transitions from used --> unused. In this case, when the parameter was used, it would have had a gradient, then user could have possibly called `zero_grad()` and p.grad would not be `None`. This in `_prep_grad_for_backward`, we would assign a `_saved_grad_shard` to this gradient field which would be the sharded shape. In `_finalize_param`, our parameter would be unsharded (since post_backward was not called), but we'd try to assign, raising the shape issue. This issue is fixed by checking `_post_backward_called`. If this is False, we simply skip the assignment because there is no new gradient to update. - A final issue as mentioned above is that if post_backward is not called, we never reshard the full param. 
This is fixed by checking if we haven't resharded (basically if post_backward_called == False), and if so, performing a reshard. A few things to note: - This logic may have to be revisited when non-recursive wrapping lands as there are multiple FlatParams per FSDP unit - This logic may not work when post_backward_hook fires but p.grad is None, i.e. the short-circuiting here: https://github.com/pytorch/pytorch/blob/f534b2c627da65bbee7ccc8f7e054da0ba48eb79/torch/distributed/fsdp/fully_sharded_data_parallel.py#L2884. As a quick fix, we could just move `_post_backward_called` flag change to after this, or just perform a reshard before returning early. I am not sure how to repro a case where p.grad == None but we call the post-backward hook, https://github.com/pytorch/pytorch/issues/83197 might be a possibility, but I think it is fine to not support this yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83195 Approved by: https://github.com/awgu commit 7e5c76da47beec83ba539fdb53d52e13d492cc4f Author: Sherlock Huang Date: Thu Aug 25 20:20:52 2022 +0000 Make graph_module.print_readable() discoverable (#83960) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83960 Approved by: https://github.com/ezyang commit a4a55f5ea6403472a25b12a6e9b8f4f713e664a3 Author: Xiang Gao Date: Thu Aug 25 21:33:15 2022 +0000 New TORCH_UCC_BLOCKING_WAIT env variable (#81791) Cherry-pick of https://github.com/facebookresearch/torch_ucc/pull/95. I recommend waiting until https://github.com/pytorch/pytorch/pull/81583 is merged first, so the CI is checking if this PR compiles correctly. Marking this as a draft for now, will change to "ready for review" once https://github.com/pytorch/pytorch/pull/81583 merged. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81791 Approved by: https://github.com/kwen2501 commit 85f82f7311d33b0ed28fe4865c4ac24f96d6cbaa Author: migeedz Date: Thu Aug 25 11:23:00 2022 -0700 example program for paper intro (#83945) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83945 Approved by: https://github.com/jansel commit bf25a140f9d948915d52f294b5204196d1ca22e3 Author: Justin Chu Date: Thu Aug 25 21:24:35 2022 +0000 [ONNX] Add runtime type checking to `export` (#83673) This PR adds an internal wrapper on the [beartype](https://github.com/beartype/beartype) library to perform runtime type checking in `torch.onnx`. It uses beartype when it is found in the environment and is reduced to a no-op when beartype is not found. Setting the env var `TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=ERRORS` will turn on the feature. setting `TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=DISABLED` will disable all checks. When not set and `beartype` is installed, a warning message is emitted. Now when users call an api with invalid arguments e.g. 
```python torch.onnx.export(conv, y, path, export_params=True, training=False) ``` they get ``` Traceback (most recent call last): File "bisect_m1_error.py", line 63, in main() File "bisect_m1_error.py", line 59, in main reveal_error() File "bisect_m1_error.py", line 32, in reveal_error torch.onnx.export(conv, y, cpu_model_path, export_params=True, training=False) File "<@beartype(torch.onnx.utils.export) at 0x1281f5a60>", line 136, in export File "pytorch/venv/lib/python3.9/site-packages/beartype/_decor/_error/errormain.py", line 301, in raise_pep_call_exception raise exception_cls( # type: ignore[misc] beartype.roar.BeartypeCallHintParamViolation: @beartyped export() parameter training=False violates type hint , as False not instance of . ``` when `TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK` is not set and `beartype` is installed, a warning message is emitted. ``` >>> torch.onnx.export("foo", "bar", "f") :1: CallHintViolationWarning: Traceback (most recent call last): File "/home/justinchu/dev/pytorch/torch/onnx/_internal/_beartype.py", line 54, in _coerce_beartype_exceptions_to_warnings return beartyped(*args, **kwargs) File "<@beartype(torch.onnx.utils.export) at 0x7f1d4ab35280>", line 39, in export File "/home/justinchu/anaconda3/envs/pytorch/lib/python3.9/site-packages/beartype/_decor/_error/errormain.py", line 301, in raise_pep_call_exception raise exception_cls( # type: ignore[misc] beartype.roar.BeartypeCallHintParamViolation: @beartyped export() parameter model='foo' violates type hint typing.Union[torch.nn.modules.module.Module, torch.jit._script.ScriptModule, torch.jit.ScriptFunction], as 'foo' not , , or . Traceback (most recent call last): File "", line 1, in File "/home/justinchu/dev/pytorch/torch/onnx/_internal/_beartype.py", line 63, in _coerce_beartype_exceptions_to_warnings return func(*args, **kwargs) File "/home/justinchu/dev/pytorch/torch/onnx/utils.py", line 482, in export _export( File "/home/justinchu/dev/pytorch/torch/onnx/utils.py", line 1422, in _export with exporter_context(model, training, verbose): File "/home/justinchu/anaconda3/envs/pytorch/lib/python3.9/contextlib.py", line 119, in __enter__ return next(self.gen) File "/home/justinchu/dev/pytorch/torch/onnx/utils.py", line 177, in exporter_context with select_model_mode_for_export( File "/home/justinchu/anaconda3/envs/pytorch/lib/python3.9/contextlib.py", line 119, in __enter__ return next(self.gen) File "/home/justinchu/dev/pytorch/torch/onnx/utils.py", line 95, in select_model_mode_for_export originally_training = model.training AttributeError: 'str' object has no attribute 'training' ``` We see the error is caught right when the type mismatch happens, improving from what otherwise would become `AttributeError: 'str' object has no attribute 'training'` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83673 Approved by: https://github.com/BowenBao commit 562a021cf3c6468c1f86e34c5d46accf2aac8eab Author: Nikita Shulga Date: Thu Aug 25 21:13:09 2022 +0000 [GHF] Land validation should not change default branch (#84084) This prevents a loophole, where somebody submits a PR that modifies merge rules and request land validation, so that their PR will be validated against those rules, rather than ones currently in trunk. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84084 Approved by: https://github.com/janeyx99, https://github.com/kit1980 commit 9f4626ea1b5326843756d866d90d583fbca32616 Author: ssjia Date: Thu Aug 25 06:15:32 2022 -0700 [vulkan] use VMA at third-party (#83934) Remove the VMA checked in at `aten/src/ATen/native/vulkan/api/vk_mem_alloc.h`, and use the version checked into `fbsource/third_party` instead. Also change open source CMakeLists to look for VMA in third_party submodule directory. Note that I had to add an alternate VulkanMemoryAllocator target that uses `fb_xplat_cxx_library` instead of `oxx_static_library` to make it work with vulkan targets in `caffe2`. Before landing this diff, make sure https://github.com/pytorch/pytorch/pull/83906 is committed on open source, which adds VMA as a git submodule of pytorch. Differential Revision: [D38943217](https://our.internmc.facebook.com/intern/diff/D38943217/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38943217/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/83934 Approved by: https://github.com/manuelcandales commit ced2ca8f867b376c5b4e495f183aeba78a27c0c4 Author: Michael Voznesensky Date: Thu Aug 25 20:11:53 2022 +0000 Torch cond operator, python dispatch, pyoperator (#83154) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/83154 Approved by: https://github.com/ezyang commit 3c2a0780b8f4e717b8a754fdc3dde68193ccae6c Author: BowenBao Date: Tue Aug 23 17:53:29 2022 -0700 [ONNX] Assign ONNXScopeName during function substituion (#82039) Previously only traced IR graph stores module typename and variable name in `scope` in `node`. This change enables such `scope` info for IR graph generated by torch script. Torch script produced IR graphs emit nodes for module object and module method call. This structured graph is flattened in `function_substition` pass prior to other ONNX conversion passes. This PR extends `function_substition` pass to record the module typename and variable name info in `scope`, while inlining the graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82039 Approved by: https://github.com/justinchuby, https://github.com/abock commit 4c19981316eb4c343a935cbd8b06b4bd2b610c50 Author: erjia Date: Wed Aug 24 19:29:06 2022 +0000 [DataPipe] Reset Shuffler's iterator when NotStarted (#83535) This PR changes the behavior of `IterDataPipe` to always invoke `reset` for the state of `NotStarted`. The main reason is we normally put lazy initialization code into `reset` function. Even for the state of `NotStarted`, we should invoke `reset` to initialize those lazy variables. Otherwise, we have to manually determine if the state is `NotStarted` or `Iterating` in `__iter__` function and only manually invoke `reset` in the state of `NotStarted`. This PR also makes `Shuffler` is able to serialize with `buffer` and `rng_state`. The following part is removed: ~I am also add `_snapshot_state` into serialization state and during `__setstate__` only change the state to `Restored` if the original state is `Iterating`. Especially, for the case of deserializing/serializing `NotStarted` DataPipe (multiprocessing), we would invoke `set_seed` for `Shuffler`. 
We need the `DataPipe` remains as `NotStarted` to properly `reset`.~ I am listing all the expected behavior state transition below: - Initial state: `NotStarted` - `iter` -> Call `reset` and change the state to `Iterating` - serialize/deserialize -> Keep the state as `NotStarted` (will `reset` if `iter` is called afterwards) - Initial state: `Iterating` - `iter` -> Call `reset` and keep the state to `Iterating` - serialize/deserialize -> Change the state as `Restored` - Initial state: `Restored` - `iter` -> Only change the state to `Iterating` - serialize/deserialize -> Not allowed Pull Request resolved: https://github.com/pytorch/pytorch/pull/83535 Approved by: https://github.com/NivekT commit b82c74da07430ba4a221403d383eeb27de04f7f7 Author: Brian Hirsh Date: Thu Aug 25 08:45:33 2022 -0700 functionalization: support inplace views on inputs (#83993) A version of this PR was sitting at https://github.com/pytorch/pytorch/pull/82601 but that PR some other cleanup that relies on being able to use functorch in pytorch/pytorch CI tests, which isn't ready yet. I pulled the change out here to unblock functionalization for some models run with inductor (see https://github.com/pytorch/torchdynamo/issues/964#issuecomment-1225971788). Pull Request resolved: https://github.com/pytorch/pytorch/pull/83993 Approved by: https://github.com/ezyang commit 0c6a616af0b14b9bbe190b93655edf24bac14cfd Author: Brian Hirsh Date: Thu Aug 25 08:45:32 2022 -0700 run functorch decomps after functionalization when enabled (#83992) This is a short-to-midterm fix for https://github.com/pytorch/pytorch/issues/83923. By running functionalization before decomps, we guarantee that functionalization won't have to see any primtorch view/inplace ops like `broadcast_in_dim`. This will only really be a problem if there's a function in the decomposition table that decomposes a functional op into mutations. If that comes up later, we'll need to revisit https://github.com/pytorch/pytorch/issues/83923. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83992 Approved by: https://github.com/ezyang commit caaa723ae24a904a76d3851b1b84de3ad128735a Author: Nikita Shulga Date: Thu Aug 25 18:21:44 2022 +0000 [GHF][BE] Move merge rules to yaml (#84065) To allow comments Update `trymerge.yaml`, `revert.yaml` and `tryrebase.yaml` to use v4 setup-python action and install pyyaml Reformat json to yaml by running: ``` python -c "import yaml;print(yaml.dump(yaml.safe_load(open('.github/merge_rules.yaml')), sort_keys=False))" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84065 Approved by: https://github.com/b0noI, https://github.com/huydhn commit 86e134ddf777d6f4b82a1860102698ef13bf0c33 Author: Nikolay Korovaiko Date: Thu Aug 25 17:28:23 2022 +0000 disable c10::SymIntNode tests on mobile (#84066) This fixes c++ tests' breaks where we were passing pointers and expected `is_symbolic` to return `true` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84066 Approved by: https://github.com/albanD commit 2f04ba2c7c8920418ad77ebb1ab09d93374e6578 Author: zaf Date: Thu Aug 25 01:53:51 2022 -0700 [quant][ao_migration] `torch.nn.qat` → `torch.ao.nn.qat` (#78716) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. 
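As a minimal sketch of what this migration means for user code (assuming, as elsewhere in this migration stack, that the old import path is kept as a backwards-compatible alias):

```python
import torch.ao.nn.qat as nnqat          # new, canonical location
import torch.nn.qat as nnqat_legacy      # old location, assumed still importable
import torch.ao.quantization as tq

m = nnqat.Linear(8, 8, qconfig=tq.default_qat_qconfig)  # QAT Linear with fake-quant observers
print(type(m).__module__)  # resolves to the torch.ao.nn.qat module path
```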
The list of the `nn.quantized` files that are being migrated: - [X] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat` - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - None Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)! Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716 Approved by: https://github.com/jerryzh168 commit 29e83b6599e91ddc087540880f7c14cbe33df200 Author: zaf Date: Thu Aug 25 01:53:49 2022 -0700 [quant][ao_migration] `torch.nn.quantizable` → `torch.ao.nn.quantizable`. (#78717) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [X] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [X] [Current PR] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - `torch/ao/nn/__init__.py` → Changing the imports to lazy. Differential Revision: [D36861090](https://our.internmc.facebook.com/intern/diff/D36861090/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861090/)! 
Differential Revision: [D36861090](https://our.internmc.facebook.com/intern/diff/D36861090) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78717 Approved by: https://github.com/jerryzh168 commit b1455f9424227ec3a0ff7c6b41a73868b03c7967 Author: zaf Date: Thu Aug 25 01:53:48 2022 -0700 [quant][ao_migration] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` (#78715) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] [Current PR] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - None Differential Revision: [D36860927](https://our.internmc.facebook.com/intern/diff/D36860927/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860927/)! Differential Revision: [D36860927](https://our.internmc.facebook.com/intern/diff/D36860927) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78715 Approved by: https://github.com/jerryzh168 commit d32a762147343bbb9404ef995c979fe8f048b836 Author: zaf Date: Thu Aug 25 01:53:46 2022 -0700 [quant][ao_migration] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` (#78714) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. 
The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] [Current PR] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - [Documentation](docs/source/quantization-support.rst) @vkuzo - [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10 - [BC test](test/quantization/bc/test_backward_compatibility.py) @vkuzo - [IR emitter](torch/csrc/jit/frontend/ir_emitter.cpp) @jamesr66a - [JIT serialization](torch/csrc/jit/serialization/import_source.cpp) @IvanKobzarev @jamesr66a Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860660/)! Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78714 Approved by: https://github.com/jerryzh168 commit c92e5ac95be6a1ccae505fb391bb02329b9a7413 Author: zaf Date: Thu Aug 25 01:53:44 2022 -0700 [quant][ao_migration] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` (#78713) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] [Current PR] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. 
However, specific files need to be double checked: - Documentation @vkuzo - docs/source/conf.py - docs/source/quantization.rst - [quantize_fx](torch/ao/quantization/quantize_fx.py) @jerryzh168 - [common test routine](test/quantization/ao_migration/common.py) @HDCharles - JIT stuff @jamesr66a - torch/csrc/jit/passes/hoist_conv_packed_params.cpp - torch/csrc/jit/passes/quantization/helper.h - torch/csrc/jit/serialization/import_source.cpp Differential Revision: [D38926012](https://our.internmc.facebook.com/intern/diff/D38926012/) Differential Revision: [D38926012](https://our.internmc.facebook.com/intern/diff/D38926012) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78713 Approved by: https://github.com/jerryzh168 commit a83d7d8b654bc5169a1450f2e7a5fb2fc58f5ffe Author: Jianyu Huang Date: Thu Aug 25 16:47:02 2022 +0000 enable qlinear dynamic parallelization with fbgemm (#84033) Test Plan: CI Differential Revision: D39004891 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84033 Approved by: https://github.com/jerryzh168 commit e2f75d63d42cdf7940cc919e745daad02a961395 Author: Animesh Jain Date: Thu Aug 25 16:09:52 2022 +0000 Decomposition - batch_norm, save_mean and save_variance always float32 (#84013) AMP error shown here - https://github.com/pytorch/torchdynamo/issues/835 Test missing Pull Request resolved: https://github.com/pytorch/pytorch/pull/84013 Approved by: https://github.com/ezyang commit 56fef4e6ee4020de957c4888032e04a5721576cb Author: erjia Date: Thu Aug 25 16:05:14 2022 +0000 fix `NoneType` object has no attribute `python_exit_status` (#83985) Fixes #83791 Prevents the Error when `_utils` has been cleared by Python before `__del__` is invoked. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83985 Approved by: https://github.com/NivekT commit 00cb184512f3a636d87793f46d3f9c7fea406b25 Author: Richard Zou Date: Wed Aug 24 13:51:12 2022 -0700 [functorch] add batching rule for fill_.Tensor (#84015) I think this is what the theseus folks ran into, but will confirm with them later. Test Plan: - new manual test; the OpInfo for fill_ isn't sufficient and it is difficult to modify Pull Request resolved: https://github.com/pytorch/pytorch/pull/84015 Approved by: https://github.com/Chillee commit 31f151767b2fde8ba2b73e0fe2b9bd68f284673b Author: XiaobingSuper Date: Thu Aug 25 10:03:19 2022 +0000 add qscheme check for quantization observer (#80126) Motivation: each quantization observer only supports a limit qschemes, we need to do this check at the initiation step, rather than at the running step, such as MinMaxObserver with set qscheme with **torch.per_channel_affine**, there will have a runtime error at the running the calibration step: ``` AttributeError: 'MinMaxObserver' object has no attribute 'ch_axis' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/80126 Approved by: https://github.com/jerryzh168 commit f5a3515083cce8335913e1985b54d5e6ead95498 Author: Mario Lezcano Date: Thu Aug 25 04:24:44 2022 -0500 Make linalg.inv composite of linalg.solve (#80074) The `getri` kernel calls inside `getrs` so we can do so explicitly ourselves and save ourselves from having to maintain an extra kernel. This way we just need to optimise `lu_factor` and `lu_solve` and `inv` will be as efficient as it can be, as it'll be choosing the best backend to perform the factorisation and the best backend (not necessarily the same) to perform the solve. 
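A hedged sketch of the idea (not the actual ATen implementation): once `inv` is just `solve` against the identity, it automatically inherits whichever factorisation and solve backends `linalg.solve` picks.

```python
import torch

A = torch.randn(4, 4, dtype=torch.float64) + 4 * torch.eye(4, dtype=torch.float64)

# inv(A) == solve(A, I): one LU factorisation followed by a multi-RHS solve.
inv_via_solve = torch.linalg.solve(A, torch.eye(4, dtype=torch.float64))

print(torch.allclose(inv_via_solve, torch.linalg.inv(A)))  # True
```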
Fixes https://github.com/pytorch/pytorch/issues/77498 The benchmarks: https://github.com/pytorch/pytorch/pull/80074#issuecomment-1164309071 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80074 Approved by: https://github.com/IvanYashchuk, https://github.com/albanD, https://github.com/malfet commit e3c89d07789d632b135862b87a5cbd4cfce7f53a Author: Horace He Date: Thu Aug 25 06:59:37 2022 +0000 Disable autocast cache during aotdispatch (#84035) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84035 Approved by: https://github.com/jansel commit 63cbdc92a750a667ffdcfbdac563d02db6fd9559 Author: Nikolay Korovaiko Date: Thu Aug 25 08:28:38 2022 +0000 switching the exact check to isinstance check (#84023) Simplifying a type check if an object is a SymIntNode in `is_symint_node` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84023 Approved by: https://github.com/ezyang commit 02c3781332031981988cd0cadfd573a210210b33 Author: Huy Do Date: Thu Aug 25 07:28:50 2022 +0000 Enable cache action for lint workflow (#84026) Cache all python dependencies using [GHA cache](https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows). I'm doing this for lint workflow first and will slowly roll it out to other workflows. Before caching, pip cache is not found. Dependencies installation continues as usual: ![Screen Shot 2022-08-24 at 16 36 15](https://user-images.githubusercontent.com/475357/186543554-9d7f5978-2c2d-4362-9535-c3b17e922da1.png) After caching https://github.com/pytorch/pytorch/runs/8006214772?check_suite_focus=true. The long hash at the end of the cache key is the hash of requirements files ![Screen Shot 2022-08-24 at 16 51 51](https://user-images.githubusercontent.com/475357/186543825-055ea025-3d42-42fc-877d-baec358de0ed.png) Note that the cache is in the runners themselves. This should be a transparent process. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84026 Approved by: https://github.com/seemethere, https://github.com/suo, https://github.com/malfet commit c00f0c80c0d9e4b8dae9ff6493f963315abc777c Author: Alex Beloi Date: Thu Aug 25 07:10:30 2022 +0000 [fx] add deferred weights (xl_weight) and tracing for xl_embedding_bag (#84016) Test Plan: added unit tests Reviewed By: jfix71 Differential Revision: D36152238 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84016 Approved by: https://github.com/jfix71 commit 8b8942b11464bbe042b731bc332d65297161353a Author: Horace He Date: Thu Aug 25 01:53:33 2022 +0000 Fix dumb make_fx issue (#84011) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84011 Approved by: https://github.com/ezyang commit c03f8abb21d7848f83162a82a42ffdd219b668e3 Author: Mandar Deshpande Date: Thu Aug 25 06:23:32 2022 +0000 [fx+scripting] Adding num_iter_1 and num_iter_2 params LearningRate op (#83691) Summary: Adding num_iter_1 and num_iter_2 to learning rate op Test Plan: Exisiting unit tests Differential Revision: D38762710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83691 Approved by: https://github.com/qxy11 commit 2000eba4547f885dc937c4335bee4ba1a71b4df5 Author: Peter Bell Date: Thu Aug 25 00:57:57 2022 +0100 NCCL: Re-enable parallel builds (#83696) Since #83173 was merged I have noticed some CI being slowed down by the nccl building step. e.g. if there are no C++ changes then sccache compiles everything else very quickly and nccl becomes the limiting factor. 
This re-enables parallel builds with some safeguards to protect against oversubscription. When `make` is the parent build system, we can use `$(MAKE)` and the `make` jobserver will coordinate job allocation with the sub-process. For other build systems, this calls `make` with the `-l` flag which should prevent it launching jobs when the system load average is already too high. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83696 Approved by: https://github.com/malfet commit d5af2a70ba47f604193e137f3d05d8ffee4ed7f0 Author: PyTorch MergeBot Date: Thu Aug 25 05:09:12 2022 +0000 Revert "[TorchTidy] Adding support for unique tensor identifiers (#80266)" This reverts commit b6ba41921daf6365a762562641bfd846437c8529. Reverted https://github.com/pytorch/pytorch/pull/80266 on behalf of https://github.com/malfet due to Broke number of trunk jobs, see https://hud.pytorch.org/pytorch/pytorch/commit/b6ba41921daf6365a762562641bfd846437c8529 commit 1f61c39ac43d8cfccbe345ab42924ab739c4c1a8 Author: PyTorch MergeBot Date: Thu Aug 25 05:01:37 2022 +0000 Revert "Support NCCL Premul Sum (#81272)" This reverts commit 432c508e71111f9d5382322e0e6b1bc1c66bf0ec. Reverted https://github.com/pytorch/pytorch/pull/81272 on behalf of https://github.com/weiwangmeta due to breaking internal builds commit 460636ab9402b759c28c2bb1adb6b2ab12c0e773 Author: Andrew Gallagher Date: Thu Aug 25 04:14:09 2022 +0000 [caffe2] Remove last clang-for-cuda sources (#84021) Summary: We're no longer pursuing clang-for-cuda, so remove the last use-case. Test Plan: CI Reviewed By: pallab-zz Differential Revision: D38996710 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84021 Approved by: https://github.com/malfet commit a013597b32d7c14b76e1d18214ec77770196fe0d Author: XiaobingSuper Date: Thu Aug 25 03:58:11 2022 +0000 fix oneDNN channels_last path issue (#83653) Fix #82060(N>1 will call in OneDNN path) and #80837, those two issues are introduced by the definition of channels last is different between PyTorch FW side with ideep side, this PR will fix this gap which ideep will use the format flag given by FW side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83653 Approved by: https://github.com/mingfeima, https://github.com/malfet commit b6ba41921daf6365a762562641bfd846437c8529 Author: John Clow Date: Wed Aug 24 16:58:09 2022 -0700 [TorchTidy] Adding support for unique tensor identifiers (#80266) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80266 Approved by: https://github.com/robieta commit b21a6ff6397b74c148c12e4fc41ef12b382443e2 Author: jjsjann123 Date: Tue Aug 23 23:22:37 2022 +0000 [NVFuser] Upstream push 0811 (#83239) Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Code changes includes: - codegen improvements: 1. double support in expression evaluator - bug fixes: 1. dropout fix - rework RNG to support broadcasted dropout (Fixes #82784) 2. expand fix - Patch expand+reduction, expand+view, rework view analysis and guard - scheduler: 1. manual transpose schedule example 2. 
WIP transpose scheduler Commits that's in this PR from the devel branch: ``` b7435afcd22c917713c2f41a7237bc26e1183f14 Transpose scheduler, step 1 (#1854) 8a45dbf72034684eb8e18b1835b533e90b68f184 Add an example on how to manually schedule transpose (#1889) 83dbf56a9554b2efbd5416461d938fff477b0b27 Patch dropout fix (#1898) 69d3519a532250719b1aa8341b50e067b181b42d Expand+Reduction, Expand+View support, rework View analysis and guards (#1883) 15091c488e96343bdc49e3990acbf238a3b3da51 Rework RNG to correctly support broadcasted dropout (#1888) aafe2d048aaac596e503596a41303423619f3954 Make ExpressionEvaluator support Double (#1885) ``` RUN_TORCHBENCH: nvfuser Differential Revision: [D38657074](https://our.internmc.facebook.com/intern/diff/D38657074) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83239 Approved by: https://github.com/davidberard98 commit e90db1756585250096b0ea8e9ca31ad4fd007809 Author: atalman Date: Thu Aug 25 01:08:25 2022 +0000 Increase timeout for linux binary builds (#84008) Increase timeout for linux binary builds This mitigates conda build issue: https://github.com/pytorch/pytorch/issues/84003 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84008 Approved by: https://github.com/malfet commit 6b597595b2fb54dcc63e25169d58a2c4602306c1 Author: Weiwen Xia Date: Thu Aug 25 01:07:18 2022 +0000 [Quant] Vectorize scalar remainder in quantized kernel for normalization (#79673) This PR improves performance of quantized kernel for normalize by vectorizing scalar remainder. In the current implementation [here](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp), the computation is vectorized while the scalar remainder is handled in a `for` loop. The remainder is also vectorized to improve performance in this PR. This kernel is for contiguous (NCHW) memory layout. For channels-last memory layout, a fast path is added in this PR https://github.com/pytorch/pytorch/pull/70520 The improvement is beneficial for layer norm, group norm and instance norm as this kernel is used for them. 1. Add an argument `size` to `Vectorized::loadu()` for vec256_qint and vec512_qint. 2. Load the remainder with the new `loadu` and do computation in the similar way as for vectorized part. Run quantized group norm with group = 2. 
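The measurement harness could look roughly like the sketch below (the exact shapes, warmup and active counts are spelled out right after this); `run_quantized_group_norm` is a stand-in whose actual kernel call is elided:

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

def run_quantized_group_norm():
    # Stand-in for the benchmarked op: quantize an input of one of the shapes
    # listed below; the call into the quantized group-norm kernel is elided.
    x = torch.randn(1, 2, 8, 5)
    return torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)

with profile(activities=[ProfilerActivity.CPU],
             schedule=schedule(wait=0, warmup=20, active=200)) as prof:
    for _ in range(220):
        run_quantized_group_norm()
        prof.step()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```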
Op CPU time measured by `torch.profiler.profile` with warmup = 20, active = 200
- Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- OS: CentOS Linux 7 (Core) (x86_64)
- Python version: 3.7.10
- Use JeMalloc memory allocator
- MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto
- Using Intel OpenMP
- KMP_AFFINITY=granularity=fine,compact,1,0
- KMP_BLOCKTIME=1

**Environment**
- GCC version: (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
- AVX2 enabled, AVX512 disabled, i.e., vec256 used

**Run a single instance on a single core**

Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments
-- | -- | -- | -- | -- | -- | --
(1, 2, 8, 5) | 3.73 | 3.75 | 4.51 | 99.41% | 82.75% | Remainder size = 8
(1, 2, 8, 6) | 3.76 | 4.00 | 4.53 | 93.93% | 82.95% | Remainder size = 16
(1, 2, 8, 7) | 3.74 | 4.01 | 4.52 | 93.34% | 82.84% | Remainder size = 24
(1, 2, 8, 8) | 3.90 | 3.96 | 4.49 | 98.49% | 87.00% | No remainder
(1, 2, 8, 17) | 4.00 | 4.17 | 4.72 | 95.83% | 84.69% | Remainder size = 8
(1, 2, 8, 18) | 4.00 | 4.23 | 4.72 | 94.54% | 84.89% | Remainder size = 16
(1, 2, 8, 19) | 4.03 | 4.29 | 4.76 | 94.01% | 84.70% | Remainder size = 24
(1, 2, 8, 20) | 3.92 | 3.93 | 4.76 | 99.67% | 82.29% | No remainder
(1, 2, 8, 33) | 4.10 | 4.18 | 5.06 | 97.92% | 81.00% | Remainder size = 8
(1, 2, 8, 34) | 4.07 | 4.23 | 5.06 | 96.40% | 80.53% | Remainder size = 16
(1, 2, 8, 35) | 4.11 | 4.42 | 5.09 | 93.03% | 80.72% | Remainder size = 24
(1, 2, 8, 36) | 4.03 | 4.06 | 5.11 | 99.24% | 78.83% | No remainder

![image](https://user-images.githubusercontent.com/12522207/173979129-e393e13f-71f5-4987-95ea-ac6e0c895bd7.png)

**Run a single instance on two cores**

Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments
-- | -- | -- | -- | -- | -- | --
(1, 4, 8, 5) | 5.09 | 5.24 | 5.52 | 97.17% | 92.29% | Remainder size = 8
(1, 4, 8, 6) | 5.22 | 5.50 | 5.56 | 94.95% | 93.86% | Remainder size = 16
(1, 4, 8, 7) | 5.04 | 5.60 | 5.51 | 89.97% | 91.44% | Remainder size = 24
(1, 4, 8, 8) | 5.30 | 5.29 | 5.56 | 100.23% | 95.27% | No remainder
(1, 4, 8, 17) | 5.36 | 5.56 | 6.05 | 96.53% | 88.69% | Remainder size = 8
(1, 4, 8, 18) | 5.48 | 5.71 | 6.25 | 95.99% | 87.67% | Remainder size = 16
(1, 4, 8, 19) | 5.44 | 5.81 | 6.25 | 93.65% | 87.11% | Remainder size = 24
(1, 4, 8, 20) | 5.43 | 5.34 | 6.07 | 101.76% | 89.43% | No remainder
(1, 4, 8, 33) | 5.52 | 5.58 | 6.51 | 98.89% | 84.75% | Remainder size = 8
(1, 4, 8, 34) | 5.50 | 5.71 | 6.63 | 96.22% | 82.95% | Remainder size = 16
(1, 4, 8, 35) | 5.50 | 6.16 | 6.40 | 89.33% | 85.95% | Remainder size = 24
(1, 4, 8, 36) | 5.37 | 5.48 | 6.54 | 97.94% | 81.98% | No remainder

![image](https://user-images.githubusercontent.com/12522207/173981377-6222e278-0948-4f52-809b-28899399ca65.png)

**Environment**
- GCC version: (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
- AVX512 enabled, i.e., vec512 used

**Run a single instance on a single core**

Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments
-- | -- | -- | -- | -- | -- | --
(1, 2, 16, 5) | 3.66 | 3.94 | 4.52 | 92.79% | 80.93% | Remainder size = 16
(1, 2, 16, 6) | 3.77 | 4.28 | 4.60 | 88.15% | 81.90% | Remainder size = 32
(1, 2, 16, 7) | 3.85 | 4.41 | 4.57 | 87.36% | 84.20% | Remainder size = 48
(1, 2, 16, 8) | 3.70 | 3.76 | 4.62 | 98.62% | 80.10% | No remainder
(1, 2, 16, 17) | 3.91 | 4.06 | 4.97 | 96.43% | 78.71% | Remainder size = 16
(1, 2, 16, 18) | 3.82 | 4.34 | 5.01 | 88.19% | 76.30% | Remainder size = 32
(1, 2, 16, 19) | 3.86 | 4.56 | 5.05 | 84.63% | 76.28% | Remainder size = 48
(1, 2, 16, 20) | 3.80 | 3.87 | 5.08 | 98.14% | 74.73% | No remainder
(1, 2, 16, 33) | 3.89 | 4.23 | 5.65 | 91.94% | 68.85% | Remainder size = 16
(1, 2, 16, 34) | 3.91 | 4.46 | 5.70 | 87.68% | 68.61% | Remainder size = 32
(1, 2, 16, 35) | 4.04 | 4.68 | 5.72 | 86.44% | 70.64% | Remainder size = 48
(1, 2, 16, 36) | 4.00 | 3.99 | 5.71 | 100.28% | 69.96% | No remainder

![image](https://user-images.githubusercontent.com/12522207/173982490-4687c5bc-50e8-49aa-9fe2-7967c738dbfb.png)

**Run a single instance on two cores**

Shape | New impl (us) | Old impl (us) | Fp32 (us) | New/old | New/fp32 | Comments
-- | -- | -- | -- | -- | -- | --
(1, 4, 16, 5) | 5.43 | 5.53 | 5.92 | 98.12% | 91.60% | Remainder size = 16
(1, 4, 16, 6) | 5.35 | 5.85 | 6.05 | 91.53% | 88.54% | Remainder size = 32
(1, 4, 16, 7) | 5.31 | 6.04 | 6.18 | 87.97% | 85.93% | Remainder size = 48
(1, 4, 16, 8) | 5.30 | 5.27 | 6.30 | 100.66% | 84.16% | No remainder
(1, 4, 16, 17) | 5.47 | 5.67 | 6.48 | 96.51% | 84.45% | Remainder size = 16
(1, 4, 16, 18) | 5.53 | 5.86 | 6.59 | 94.28% | 83.78% | Remainder size = 32
(1, 4, 16, 19) | 5.48 | 6.13 | 6.57 | 89.39% | 83.38% | Remainder size = 48
(1, 4, 16, 20) | 5.35 | 5.31 | 6.95 | 100.79% | 76.91% | No remainder
(1, 4, 16, 33) | 5.62 | 5.77 | 7.31 | 97.28% | 76.80% | Remainder size = 16
(1, 4, 16, 34) | 5.56 | 5.85 | 7.06 | 95.03% | 78.71% | Remainder size = 32
(1, 4, 16, 35) | 5.67 | 6.10 | 7.09 | 93.03% | 79.98% | Remainder size = 48
(1, 4, 16, 36) | 5.50 | 5.39 | 7.20 | 102.15% | 76.42% | No remainder

![image](https://user-images.githubusercontent.com/12522207/173982748-5f003630-18a4-4c3d-a643-b8711892cc39.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79673 Approved by: https://github.com/jerryzh168 commit a7edf713608806f10e17eab90d0a5df727d9a16e Author: PyTorch MergeBot Date: Thu Aug 25 00:49:40 2022 +0000 Revert "Don't introduce new overload for SymInt (#83628)" This reverts commit 8fae7027b399e65e6071d335aa874497682c84d0. Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to breaking internal builds, see https://www.internalfb.com/diff/D38984222 commit 7a02ee55dbf46d3d85d389cf013e1d97f79c7100 Author: PyTorch MergeBot Date: Thu Aug 25 00:45:05 2022 +0000 Revert "[xla hash update] update the pinned xla hash (#83967)" This reverts commit ce7a9f92e30b93ab6efff4135be005c9afd0533a. Reverted https://github.com/pytorch/pytorch/pull/83967 on behalf of https://github.com/malfet due to Depends on the changes from https://github.com/pytorch/pytorch/pull/83628 commit 5321bf52f2791932ec5c1ea0eb3a1b585bfedba7 Author: PyTorch MergeBot Date: Thu Aug 25 00:43:00 2022 +0000 Revert "Make linalg.inv composite of linalg.solve (#80074)" This reverts commit 4737b3361479f4104efaa3bfa2ea517eaacb60fb.
Reverted https://github.com/pytorch/pytorch/pull/80074 on behalf of https://github.com/malfet due to Depends on the changes from https://github.com/pytorch/pytorch/pull/83628 commit 4a6726a84073c98a773ae846d01dc63c73302c82 Author: Catherine Lee Date: Thu Aug 25 00:34:23 2022 +0000 use condensed disabled tests file (#84017) follow up to https://github.com/pytorch/test-infra/pull/545 then we can get rid of the non condensed version Pull Request resolved: https://github.com/pytorch/pytorch/pull/84017 Approved by: https://github.com/huydhn, https://github.com/janeyx99 commit cef522a8a9eec5355f5db528142231ab1176643c Author: ProGamerGov Date: Wed Aug 24 23:41:09 2022 +0000 Add docstring type guidelines for list & tuple to `CONTRIBUTING.md` (#83634) Minor followup to: https://github.com/pytorch/pytorch/pull/83536 For Google style docstrings, `list` and `tuple` should be completely lowercase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83634 Approved by: https://github.com/ngimel commit 101709f43b501bc7dc862431fffa822a38852b3e Author: chengscott <60510scott@gmail.com> Date: Wed Aug 24 23:04:03 2022 +0000 Add comments for block_reduce.cuh (#83825) ~~Add warning for the BlockReduce result Remove redundant __syncthreads~~ Add comments for BlockReduce Pull Request resolved: https://github.com/pytorch/pytorch/pull/83825 Approved by: https://github.com/ngimel commit bf8d5e83289a35e64a0ce98ef36fd45ab2dd0d43 Author: Sherlock Huang Date: Wed Aug 24 17:34:44 2022 +0000 Pretty print stack trace with gm.print_readable() (#83706) Precondition: https://github.com/pytorch/torchdynamo/pull/899 Given following function ``` def my_relu(a): return a.relu() def func(a, b): d = torch.square(a + b) e = my_relu(d) f = d.sin() s = torch.stack([e, f]) s = s.sum() ``` Here are the possible result with various tracing frontend: dynamo, symbolic_trace, make_fx - joint graph with torchdynamo.optimize("aot_nop") Notice that it has a special stack for gradient addition node (for multiple uses of tensor) in backward Notice that "No stacktrace found for following nodes" are shown for nodes with stacktrace ``` def forward(self, primals, tangents): primals_1, primals_2, tangents_1, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec) add_tensor = torch.ops.aten.add.Tensor(primals_1, primals_2); primals_1 = primals_2 = None pow_tensor_scalar = torch.ops.aten.pow.Tensor_Scalar(add_tensor, 2) relu_default = torch.ops.aten.relu.default(pow_tensor_scalar) detach_default = torch.ops.aten.detach.default(relu_default) sin_default = torch.ops.aten.sin.default(pow_tensor_scalar) stack_default = torch.ops.aten.stack.default([relu_default, sin_default]); relu_default = sin_default = None sum_default = torch.ops.aten.sum.default(stack_default); stack_default = None is_same_size_default = torch.ops.aten.is_same_size.default(sum_default, tangents_1) expand_default = torch.ops.aten.expand.default(tangents_1, [2, 10, 10]); tangents_1 = None unbind_int = torch.ops.aten.unbind.int(expand_default); expand_default = None getitem = unbind_int[0] getitem_1 = unbind_int[1]; unbind_int = None cos_default = torch.ops.aten.cos.default(pow_tensor_scalar); pow_tensor_scalar = None mul_tensor = torch.ops.aten.mul.Tensor(getitem_1, cos_default); getitem_1 = cos_default = None detach_default_1 = torch.ops.aten.detach.default(detach_default); detach_default = None threshold_backward_default = torch.ops.aten.threshold_backward.default(getitem, detach_default_1, 0); getitem = detach_default_1 = None add_tensor_1 = 
torch.ops.aten.add.Tensor(mul_tensor, threshold_backward_default); mul_tensor = threshold_backward_default = None pow_tensor_scalar_1 = torch.ops.aten.pow.Tensor_Scalar(add_tensor, 1.0); add_tensor = None mul_scalar = torch.ops.aten.mul.Scalar(pow_tensor_scalar_1, 2.0); pow_tensor_scalar_1 = None mul_tensor_1 = torch.ops.aten.mul.Tensor(add_tensor_1, mul_scalar); add_tensor_1 = mul_scalar = None sum_sym_int = torch.ops.aten.sum.SymInt(mul_tensor_1, [0], True) view_sym_int = torch.ops.aten.view.SymInt(sum_sym_int, [10]); sum_sym_int = None return pytree.tree_unflatten([sum_default, mul_tensor_1, view_sym_int], self._out_spec) ``` - default symbolic_trace Notice that nodes without stacktrace are folded under same region ``` def forward(self, a, b): add = a + b; a = b = None square = torch.square(add); add = None relu = square.relu() sin = square.sin(); square = None stack = torch.stack([relu, sin]); relu = sin = None sum_1 = stack.sum(); stack = None return sum_1 ``` - symbolic_trace with record_stack_traces=True ``` def forward(self, a, b): add = a + b; a = b = None square = torch.square(add); add = None relu = square.relu() sin = square.sin(); square = None stack = torch.stack([relu, sin]); relu = sin = None sum_1 = stack.sum(); stack = None return sum_1 ``` - make_fx without decomposition ``` def forward(self, a_1, b_1): add_tensor = torch.ops.aten.add.Tensor(a_1, b_1); a_1 = b_1 = None pow_tensor_scalar = torch.ops.aten.pow.Tensor_Scalar(add_tensor, 2); add_tensor = None relu_default = torch.ops.aten.relu.default(pow_tensor_scalar) detach_default = torch.ops.aten.detach.default(relu_default) sin_default = torch.ops.aten.sin.default(pow_tensor_scalar); pow_tensor_scalar = None stack_default = torch.ops.aten.stack.default([relu_default, sin_default]); relu_default = sin_default = None sum_default = torch.ops.aten.sum.default(stack_default); stack_default = None return sum_default ``` - make_fx with decomposition to prims ``` def forward(self, a_1, b_1): broadcast_in_dim_default = torch.ops.prims.broadcast_in_dim.default(b_1, [10, 10], [1]); b_1 = None add_default = torch.ops.prims.add.default(a_1, broadcast_in_dim_default); a_1 = broadcast_in_dim_default = None mul_default = torch.ops.prims.mul.default(add_default, add_default); add_default = None le_default = torch.ops.prims.le.default(mul_default, 0.0) where_default = torch.ops.prims.where.default(le_default, 0.0, mul_default); le_default = None sin_default = torch.ops.prims.sin.default(mul_default); mul_default = None cat_default = torch.ops.prims.cat.default([where_default, sin_default], 0); where_default = sin_default = None split_dim_default = torch.ops.prims.split_dim.default(cat_default, 0, 2); cat_default = None convert_element_type_default = torch.ops.prims.convert_element_type.default(split_dim_default, torch.float32); split_dim_default = None sum_default = torch.ops.prims.sum.default(convert_element_type_default, [0, 1, 2]); convert_element_type_default = None return sum_default ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83706 Approved by: https://github.com/Chillee, https://github.com/ezyang commit e72256604f80e66b6a380479e4b610be19c82e71 Author: Chen, Jian Ping Date: Wed Aug 24 22:42:53 2022 +0000 Enhance add_out_dense_sparse_cpu for hybrid sparse tensor (#23057) This is to improve the performance for hybrid sparse coo tensor on CPU path. This case is appeared at the DLRM terabyte test. 
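For context, a hybrid sparse COO tensor (sparse indices plus trailing dense dimensions) added to a strided tensor looks roughly like the snippet below; this is only an illustration of the case that CPU path handles, not code from the PR:

```python
import torch

# Hybrid sparse COO tensor: 1 sparse dim + 1 dense dim, overall shape (4, 3).
indices = torch.tensor([[0, 2]])              # nnz = 2 along the sparse dimension
values = torch.tensor([[1., 2., 3.],
                       [4., 5., 6.]])         # each nnz carries a dense row of size 3
sp = torch.sparse_coo_tensor(indices, values, size=(4, 3))

dense = torch.ones(4, 3)
out = dense + sp                              # dense + sparse addition on CPU
assert torch.equal(out, dense + sp.to_dense())
```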
With this fix, according to the previous performance test data, it got a ~10x performance improvement on DLRM execution.
Without this fix, DLRM runs as:
Finished training it 100/1000 of epoch 0, 2969.25 ms/it, loss 0.220505, accuracy 0.000 %
With this fix, DLRM runs as:
Finished training it 100/1000 of epoch 0, 270.71 ms/it, loss 0.220505, accuracy 0.000 %
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23057 Approved by: https://github.com/VitalyFedyunin, https://github.com/malfet commit 3b11b80fc3f9f9a0171abb5eb2299835feba8b04 Author: Bin Chen Date: Wed Aug 24 22:16:12 2022 +0000 Named pipe based watchdog timer (#83695) Summary: This diff implements a named pipe based watchdog timer (`FileTimerClient` and `FileTimerServer`). This is similar to the existing `LocalTimerClient` and `LocalTimerServer` (https://fburl.com/code/j4b9pyya). The motivation comes from the need to handle various timeout issues: the training process occasionally gets stuck, and we need a proper watchdog to monitor the liveness of the training processes. This timer allows the TorchElastic agent (as the watchdog) to monitor the progress of the training processes that it spawned. If a timeout occurs, the TorchElastic agent can take action to kill the stuck process and create a core dump for it. `LocalTimerClient` and `LocalTimerServer` require a `multiprocessing.Queue()` to work, so they can only be used between `multiprocessing` parent and child processes. `FileTimerClient` and `FileTimerServer` do not have this limitation. Test Plan:
```
buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test
```
```
RemoteExecution session id: reSessionID-06d70a77-043c-4d9d-b0f2-94c24460740a-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666
✓ ListingSuccess: caffe2/test/distributed/elastic/timer:file_based_timer_test : 12 tests discovered (2.177)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_happy_path (file_based_local_timer_test.FileTimerTest) (2.463)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_expired_timers (file_based_local_timer_test.FileTimerServerTest) (1.889)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_send_request_release (file_based_local_timer_test.FileTimerServerTest) (1.700)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_valid_timers (file_based_local_timer_test.FileTimerServerTest) (1.873)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_call_count (file_based_local_timer_test.FileTimerServerTest) (1.715)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_watchdog_empty_queue (file_based_local_timer_test.FileTimerServerTest) (1.609)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_exception_propagation (file_based_local_timer_test.FileTimerTest) (1.633)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_multiple_clients_interaction (file_based_local_timer_test.FileTimerTest) (2.189)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_get_timer_recursive (file_based_local_timer_test.FileTimerTest) (2.295)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_no_client (file_based_local_timer_test.FileTimerTest) (1.753)
✓ Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_timer (file_based_local_timer_test.FileTimerTest) (2.151)
✓
Pass: caffe2/test/distributed/elastic/timer:file_based_timer_test - test_client_interaction (file_based_local_timer_test.FileTimerTest) (1.895) Summary Pass: 12 ListingSuccess: 1 Finished test run: https://www.internalfb.com/intern/testinfra/testrun/844425186732666 ``` Differential Revision: D38604238 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83695 Approved by: https://github.com/d4l3k commit 37d3db7579386e928f70815afa7b5a21ffb2fefd Author: Jane Xu Date: Wed Aug 24 21:43:09 2022 +0000 Deletes CCACHE_DISABLE and SCCACHE_DISABLE from nccl.cmake (#84007) Looking through the code and online, it does not look like these variables actually change anything. Regardless, this change was instituted to fix https://github.com/pytorch/pytorch/issues/13362, but we are again running into similar issues even with the workaround: see https://github.com/pytorch/pytorch/issues/83790. Thus, since 1. this change isn't preventing flakiness 2. these variables do not seem used anywhere in pytorch/pytorch nor mozilla/sccache we should remove this confusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84007 Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi commit 1eff853fdc91619ea24abf0cbd51ca992fe10c97 Author: Jane Xu Date: Wed Aug 24 21:22:14 2022 +0000 Pin conda to 4.13.0 (#83991) Recent update to conda 4.14.0 caused breakages in our docker builds: https://hud.pytorch.org/pytorch/pytorch/commit/754d7f05b6841e555cea5a4b2c505dd9e0baec1d This pins to prevent the errors: ``` Traceback (most recent call last): 2022-08-24T16:20:49.2412247Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1125, in __call__ 2022-08-24T16:20:49.2413036Z File "/opt/conda/lib/python3.9/site-packages/conda/cli/main.py", line 86, in main_subshell 2022-08-24T16:20:49.2413615Z File "/opt/conda/lib/python3.9/site-packages/conda/cli/conda_argparse.py", line 93, in do_call 2022-08-24T16:20:49.2414282Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/core.py", line 75, in wrapper 2022-08-24T16:20:49.2415036Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/core.py", line 39, in display_notices 2022-08-24T16:20:49.2415853Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/http.py", line 36, in get_notice_responses 2022-08-24T16:20:49.2416661Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/http.py", line 39, in 2022-08-24T16:20:49.2417399Z File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator 2022-08-24T16:20:49.2418145Z File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 446, in result 2022-08-24T16:20:49.2418831Z File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result 2022-08-24T16:20:49.2419543Z File "/opt/conda/lib/python3.9/concurrent/futures/thread.py", line 58, in run 2022-08-24T16:20:49.2420292Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/http.py", line 42, in 2022-08-24T16:20:49.2421070Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/cache.py", line 37, in wrapper 2022-08-24T16:20:49.2421712Z File "/opt/conda/lib/python3.9/site-packages/conda/notices/http.py", line 58, in get_channel_notice_response 2022-08-24T16:20:49.2422258Z File "/opt/conda/lib/python3.9/site-packages/requests/sessions.py", line 600, in get 2022-08-24T16:20:49.2422801Z File "/opt/conda/lib/python3.9/site-packages/requests/sessions.py", line 587, in request 2022-08-24T16:20:49.2423226Z File 
"/opt/conda/lib/python3.9/site-packages/requests/sessions.py", line 701, in send 2022-08-24T16:20:49.2423634Z File "/opt/conda/lib/python3.9/site-packages/requests/adapters.py", line 460, in send 2022-08-24T16:20:49.2424239Z File "/opt/conda/lib/python3.9/site-packages/requests/adapters.py", line 263, in cert_verify 2022-08-24T16:20:49.2424731Z OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /opt/conda/lib/python3.9/site-packages/certifi/cacert.pem 2022-08-24T16:20:49.2424967Z 2022-08-24T16:20:49.2425110Z During handling of the above exception, another exception occurred: 2022-08-24T16:20:49.2425279Z 2022-08-24T16:20:49.2425377Z Traceback (most recent call last): 2022-08-24T16:20:49.2425610Z File "/opt/conda/bin/conda", line 13, in 2022-08-24T16:20:49.2425845Z sys.exit(main()) 2022-08-24T16:20:49.2426176Z File "/opt/conda/lib/python3.9/site-packages/conda/cli/main.py", line 129, in main 2022-08-24T16:20:49.2426614Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1413, in conda_exception_handler 2022-08-24T16:20:49.2427054Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1128, in __call__ 2022-08-24T16:20:49.2427555Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1170, in handle_exception 2022-08-24T16:20:49.2427995Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1181, in handle_unexpected_exception 2022-08-24T16:20:49.2428471Z File "/opt/conda/lib/python3.9/site-packages/conda/exceptions.py", line 1251, in print_unexpected_error_report 2022-08-24T16:20:49.2428873Z ModuleNotFoundError: No module named 'conda.cli.main_info' 2022-08-24T16:20:55.5428691Z The command '/bin/sh -c bash ./install_conda.sh && rm install_conda.sh' returned a non-zero code: 1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83991 Approved by: https://github.com/malfet commit f5bfa4d0888e6cd5984092b38cb8b10609558d05 Author: Jagadish Krishnamoorthy Date: Wed Aug 24 20:49:20 2022 +0000 [ROCm] Enable test_multiprocessing tests (#82356) Signed-off-by: Jagadish Krishnamoorthy Issue fixed in ROCm 5.2 user space. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82356 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/huydhn commit d56577284afeb5c82b43365d422e63db7893f70b Author: Huy Do Date: Wed Aug 24 20:19:38 2022 +0000 Set python build-docs timeout to 30 minutes and cpp build-docs timeout to 180 minutes (#83957) Anything more means there's something wrong and we should just return. AFAIK the timeout doesn't include queuing time, only the job duration https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes ![Screen Shot 2022-08-23 at 18 31 57](https://user-images.githubusercontent.com/475357/186298046-5637384f-887c-4c6a-a946-c101b6c66741.png) This will help avoid having python build docs timeout after 6 hours. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83957 Approved by: https://github.com/ZainRizvi commit f38a32c905351aa67a2c2e22c9cb11736f81408f Author: chengscott <60510scott@gmail.com> Date: Wed Aug 24 20:18:55 2022 +0000 remove duplicate WarpReduceSum (#83757) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83757 Approved by: https://github.com/ngimel commit 67f0940cdd497005c6a78fd03e0d9f5a3dfbb2e7 Author: Richard Barnes Date: Wed Aug 24 20:12:25 2022 +0000 Check all CUDA API calls for errors in test/ (#74921) (#83954) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74921 Test Plan: Sandcastle Reviewed By: ezyang, malfet, ngimel Differential Revision: D35194966 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83954 Approved by: https://github.com/ezyang commit a741927e614a96ac94b82ffe0e9eeba231626a9f Author: chengscott <60510scott@gmail.com> Date: Wed Aug 24 20:09:47 2022 +0000 Improve Normalization.cuh (#83871) remove unused Ops replaced copy-and-paste by calling BlockReduce (+SumReduceOp +2D block indexing) and removing duplicate warpSum Pull Request resolved: https://github.com/pytorch/pytorch/pull/83871 Approved by: https://github.com/ngimel commit 7b1a056b88af5b4fca48b89fd5f74025ad5f4741 Author: Richard Barnes Date: Wed Aug 24 20:02:57 2022 +0000 Map new CUDA error handling to HIP (#75032) (#83953) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75032 Test Plan: Sandcastle Reviewed By: ezyang, malfet Differential Revision: D35253785 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83953 Approved by: https://github.com/ezyang, https://github.com/malfet commit ef782e730dff7bef078fc95d7fb5b78bafcc5284 Author: thomasw21 <24695242+thomasw21@users.noreply.github.com> Date: Wed Aug 24 20:01:57 2022 +0000 Support BF16 for fast layernorm (#83971) Fixes #83970 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83971 Approved by: https://github.com/ngimel commit a5564c4bd073bbfe4c1f41a01b2e7618500fa7ac Author: Sherlock Huang Date: Wed Aug 24 17:06:16 2022 +0000 Suppress Anomaly mode warning message (#83966) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83966 Approved by: https://github.com/albanD commit a8a36c45a64bdf51d5d4e0bd79c8779b2b918318 Author: Larry Liu <8188269+larryliu0820@users.noreply.github.com> Date: Wed Aug 24 19:50:19 2022 +0000 [frontend] Fix tensor list alias annotation (#84005) For issue https://github.com/pytorch/pytorch/issues/77920 and a retry of https://github.com/pytorch/pytorch/pull/83921 The current logic checks alias info before `[]` and after. If no alias info exists after `[]`, we overwrite the alias info before. This logic failed on argument like `Tensor(a!)[]`, dropping the alias info before `[]` on the floor. This PR adds a new alias info if it's missing after `[]`. This way we can keep the alias info before `[]`. 
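A toy re-creation of the described logic in Python (the real change is in the C++ schema parser; `before`/`after` here are hypothetical names for the alias annotation found before and after `[]`):

```python
def alias_info_old(before, after):
    # old behaviour: the annotation after "[]" (even when missing) overwrote
    # the one before it, so "Tensor(a!)[]" lost the "a!" on the floor
    return after

def alias_info_fixed(before, after):
    # fixed behaviour: when nothing follows "[]", keep the annotation that was
    # written before it instead of overwriting it with "nothing"
    return after if after is not None else before

assert alias_info_old("a!", None) is None      # alias info dropped
assert alias_info_fixed("a!", None) == "a!"    # alias info preserved
assert alias_info_fixed("a!", "b") == "b"      # explicit info after "[]" still wins
```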
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84005 Approved by: https://github.com/cccclai, https://github.com/bdhirsh commit b745e5f1157b7dd4b5814d42989af52ef8e2d68b Author: Richard Barnes Date: Wed Aug 24 18:59:05 2022 +0000 Check all CUDA API calls for errors in benchmarks/cpp/nvfuser (#74920) (#81817) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74920 Test Plan: Sandcastle Differential Revision: D35194656 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81817 Approved by: https://github.com/malfet commit f7e668b7b5994370aedcaa5d96ac33a78ac19e5d Author: Catherine Lee Date: Wed Aug 24 18:38:36 2022 +0000 add hud link to merge failure message (#83946) as in title, related to https://github.com/pytorch/test-infra/issues/568 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83946 Approved by: https://github.com/huydhn commit 3a9ae518f2c9424251eae13fb23db6ea571cfce1 Author: Nikita Shulga Date: Wed Aug 24 18:31:25 2022 +0000 Skip NCCL slimming for cxx11 libtorch builds (#83959) Fixes https://github.com/pytorch/pytorch/issues/83887 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83959 Approved by: https://github.com/atalman commit d79ccb7b4589ab65727b16cde19918dfdd11d32c Author: Digant Desai Date: Wed Aug 24 18:17:27 2022 +0000 [pthreadpool] Cap max thread count to fix TSAN issues (#83950) Summary: Cap the thread count to 64 unconditionally to solve this tsan issue which leads to harder to debug, flaky test failures. Test Plan: CI Reviewed By: kimishpatel Differential Revision: D38136212 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83950 Approved by: https://github.com/kimishpatel commit 5e01fb995c626be0a50e8d3a0ff4f7564fb37461 Author: Nikolay Korovaiko Date: Tue Aug 23 16:31:18 2022 -0700 strip SymIntNodes off in the mobile builds (#83938) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83938 Approved by: https://github.com/ezyang commit b842670aa54072f532ce75edfc3663f3509de146 Author: Nikolay Korovaiko Date: Tue Aug 23 14:04:35 2022 -0700 logical ops (#83879) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83879 Approved by: https://github.com/ezyang commit 2b805e3520f842c13e559e46398ae64206c2cf7a Author: Nikolay Korovaiko Date: Tue Aug 23 14:04:35 2022 -0700 add arithmetic ops (#83878) arithmetic ops tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/83878 Approved by: https://github.com/ezyang commit 0831813e26ebcd406b261ffb9629f933c627d4d3 Author: Nikolay Korovaiko Date: Tue Aug 23 14:04:35 2022 -0700 support more symintnode operations (#83877) remove debug code Pull Request resolved: https://github.com/pytorch/pytorch/pull/83877 Approved by: https://github.com/ezyang commit 5c49c7bbba5bbdf6ee941f84fdc13d4bd34d4014 Author: Robert Date: Wed Aug 24 17:34:28 2022 +0000 [WIP] Validating input_col for certain datapipes (#80267) Follow up from #79344. Currently WIP due to multiple test failures. 
Waiting for #80140 to land Pull Request resolved: https://github.com/pytorch/pytorch/pull/80267 Approved by: https://github.com/ejguan commit 30a5583d7566ef25ffee14ee3699f839e84fd5df Author: John Clow Date: Tue Aug 23 21:37:39 2022 -0700 [TorchTidy Fix] Don't try to collect strides for non-strided tensors (#83935) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83935 Approved by: https://github.com/robieta, https://github.com/slgong-fb commit 3f88171240d73c694ab45b3f3640137b5d695b2f Author: BowenBao Date: Wed Aug 24 00:54:17 2022 +0000 [ONNX] Remove static None graph output (#82623) Fixes #82370 * Unify the export behavior regarding static None outputs. These are dropped for both traced graph and TorchScript graph export. * `Optional` outputs are not affected. Fixes #82370 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82623 Approved by: https://github.com/AllenTiTaiWang, https://github.com/abock commit 7a8152530d490b30a56bb090e9a67397d20e16b1 Author: kshitij12345 Date: Wed Aug 24 16:17:50 2022 +0000 move pooling test from test_nn to test/nn/test_pooling (#83915) Ref #63085 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83915 Approved by: https://github.com/albanD commit 4eb02e863718abf5ff75fa4b296cd2331f938701 Author: Antonio Kim Date: Wed Aug 24 15:35:43 2022 +0000 [LTC] Add custom lazy tensor save function (#83294) We need a custom `save` function for checkpointing a lazy model, similar to what exists in PyTorch/XLA: https://github.com/pytorch/xla/blob/3eb8a9d9eb4ebb0b064461c3704650241625654e/torch_xla/core/xla_model.py#L994 The purpose of this function is to move any lazy tensors to CPU before saving the checkpoint. The way I implemented it was to create a general structure visitor, adapted from a function that we use quite often in Cerebras internal repositories. If there is a better tool already available in PyTorch that does the same things, I'm open to suggestions. CC: @wconstab @Krovatkin @JackCaoG Pull Request resolved: https://github.com/pytorch/pytorch/pull/83294 Approved by: https://github.com/wconstab commit 3e6e0a1d1093992d35b2248fd5a54feab4b01984 Author: Mario Lezcano Date: Wed Aug 24 05:53:25 2022 -0500 Support a stable double backward on linalg.det for real inputs (#80217) The complex case still fails. I do not know why. Fixes https://github.com/pytorch/pytorch/issues/62327 Fixes https://github.com/pytorch/pytorch/issues/53364 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80217 Approved by: https://github.com/nikitaved, https://github.com/albanD, https://github.com/malfet commit 4737b3361479f4104efaa3bfa2ea517eaacb60fb Author: Mario Lezcano Date: Wed Aug 24 05:53:24 2022 -0500 Make linalg.inv composite of linalg.solve (#80074) The `getri` kernel calls inside `getrs` so we can do so explicitly ourselves and save ourselves from having to maintain an extra kernel. This way we just need to optimise `lu_factor` and `lu_solve` and `inv` will be as efficient as it can be, as it'll be choosing the best backend to perform the factorisation and the best backend (not necessarily the same) to perform the solve. 
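The identity the composite relies on, checked here via the public Python API (illustrative only; the PR changes the internal C++ path, not this call):

```python
import torch

# inv(A) is exactly the X that solves A @ X = I; the composite computes it by
# LU-factorizing A once and then solving against the identity.
A = torch.randn(4, 4, dtype=torch.float64) + 4 * torch.eye(4, dtype=torch.float64)
eye4 = torch.eye(4, dtype=torch.float64)

assert torch.allclose(torch.linalg.inv(A), torch.linalg.solve(A, eye4))
```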
Fixes https://github.com/pytorch/pytorch/issues/77498 The benchmarks: https://github.com/pytorch/pytorch/pull/80074#issuecomment-1164309071 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80074 Approved by: https://github.com/IvanYashchuk, https://github.com/albanD, https://github.com/malfet commit 0bdcfcb840bedad546cc97662a2272b34f5c7d64 Author: lezcano Date: Wed Aug 24 09:07:59 2022 +0000 Strenghten preconditions of linalg.cross (#83798) This makes `linalg.cross` array API compliant (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs. Fixes https://github.com/pytorch/pytorch/issues/77629 Fixes https://github.com/pytorch/pytorch/issues/83756 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798 Approved by: https://github.com/mruberry commit 4a18d0a9729b69e02f036a0d218626fdb9ca2dda Author: Henry Tu Date: Wed Aug 24 14:33:52 2022 +0000 Fix LTC build warnings (#83955) Addresses the `Wc++98-compat-extra-semi` warning from https://github.com/llvm/torch-mlir/issues/1264 by removing the extraneous semicolon after autogen LTC native function definitions.
```
/home/runner/work/torch-mlir/torch-mlir/build/tools/torch-mlir/python/torch_mlir/csrc/base_lazy_backend/generated/LazyNativeFunctions.cpp:4241:6: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
};
 ^
```
cc: @wconstab @desertfire @ke1337 @antoniojkim Pull Request resolved: https://github.com/pytorch/pytorch/pull/83955 Approved by: https://github.com/wconstab commit ce7a9f92e30b93ab6efff4135be005c9afd0533a Author: PyTorch MergeBot Date: Wed Aug 24 10:14:44 2022 +0000 [xla hash update] update the pinned xla hash (#83967) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83967 Approved by: https://github.com/pytorchbot commit fa241fd50e226d74af06c36482fa68f0e2a4fb3c Author: Seonglyong Gong Date: Wed Aug 24 08:17:20 2022 +0000 [Profiler] record nn.Module's parameters (#83209) Summary: Record nn.Module's parameters for detailed memory profiling:
- extend 'module_' in value cache & NNModuleInfo to save parameters
- python binding and unit test case
Test Plan: buck run mode/opt //caffe2/test:profiler -- -r test_nnmodule Differential Revision: D38379717 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83209 Approved by: https://github.com/robieta commit 0ae298f8696f00b9201a6f433788e1583d914a8b Author: Souranil Sen Date: Wed Aug 24 08:00:20 2022 +0000 Test type promotion assertignoretypes (#83867) See #38095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83867 Approved by: https://github.com/kit1980, https://github.com/mruberry commit 432c508e71111f9d5382322e0e6b1bc1c66bf0ec Author: Masaki Kozuki Date: Wed Aug 24 04:53:25 2022 +0000 Support NCCL Premul Sum (#81272) This PR adds the support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum. The major changes include:
- convert enum ReduceOp to struct
- add premul sum specific paths to init.cpp and Ops.cpp.
note: - For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed The commit titled "add nccl premul" whose current hash is https://github.com/pytorch/pytorch/pull/81272/commits/cb99ad67447b5899ecf8c4c3d78deaafa1cc09b8 was authored by @mcarilli and @ptrblck. cc @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272 Approved by: https://github.com/kwen2501 commit 67aed393195970459029a2f6a825d9db726187cf Author: Lu, Chengjun Date: Wed Aug 24 04:35:43 2022 +0000 Support the XPU backend untyped storage (#83952) Simple add XPU backend in untyped torch storage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83952 Approved by: https://github.com/ezyang commit 0491e1a13a62ead5c22f4396012da5fb6e09800f Author: Edward Z. Yang Date: Tue Aug 23 15:14:05 2022 -0700 Support returning symbolic strides from t.stride() in Python (#83842) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83842 Approved by: https://github.com/albanD, https://github.com/Chillee, https://github.com/bdhirsh commit df70714e763ea6e93dcefe0072ea9982afb73b7f Author: Nikita Shulga Date: Wed Aug 24 04:28:45 2022 +0000 [BE][CUDA] Use packed_accessor64 (#83949) Not sure why we are ignoring those, but SoftMax.cu alone generates 100+ lines of warnings: ``` /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In function ‘at::Tensor at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::get_offsets(const at::Tensor&, const IntArrayRef&, int64_t)’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:261:69: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = long int; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto indices_accessor = indices.packed_accessor(); ^ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = double; bool LogSoftMax = false; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:607:924: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long 
int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = float; bool LogSoftMax = false; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:607:1677: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = double; bool LogSoftMax = true; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:623:927: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ 
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = float; bool LogSoftMax = true; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:623:1679: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = double; bool LogSoftMax = false; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:641:977: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ 
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto grad_values_accessor = grad_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = float; bool LogSoftMax = false; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:641:1775: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto grad_values_accessor = grad_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = double; bool LogSoftMax = true; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:661:980: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: 
‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto grad_values_accessor = grad_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = float; bool LogSoftMax = true; int64_t = long int]’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:661:1777: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = values_2.packed_accessor(); ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto out_values_accessor = out_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ 
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto grad_values_accessor = grad_values_2.packed_accessor(); ^~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = double; bool requireMxRows = true; at::IntArrayRef = c10::ArrayRef; int64_t = long int]’: /tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:16:557: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = float; bool requireMxRows = true; at::IntArrayRef = c10::ArrayRef; int64_t = long int]’: /tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:18:556: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = double; bool requireMxRows = false; at::IntArrayRef = c10::ArrayRef; int64_t = long int]’: /tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:20:557: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead 
[-Wdeprecated-declarations] auto values_accessor = ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = float; bool requireMxRows = false; at::IntArrayRef = c10::ArrayRef; int64_t = long int]’: /tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:21:556: required from here /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto values_accessor = ^~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83949 Approved by: https://github.com/ngimel commit 754d7f05b6841e555cea5a4b2c505dd9e0baec1d Author: Peter Bell Date: Wed Aug 24 00:54:14 2022 +0100 Remove conj kernels for real dtypes (#80374) `conj_physical_stub` is currently implemented for all dtypes despite it just being a plain copy for real dtypes. So, instead we should defer to the existing copy kernel in these cases. On my build for one CUDA architecture, I see a 2.2 MB decrease in `libtorch_cuda.so` size. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80374 Approved by: https://github.com/ngimel, https://github.com/atalman commit 2c76d05b8fde5a08cb795d5e3d2c99aad7bd352f Author: Driss Guessous Date: Wed Aug 24 02:50:45 2022 +0000 [Nested Tensor] Make offset copy and move assignment more explicit. (#83488) Currently the nested tensor construction for the offset_ parameter takes in references and in the chain of delegation uses value. This could lead to unnecessary copies. Whenever a nested tensor impl is constructed it should take ownership of all its metadata. The only non-trivially copyable metadata associated with the class is `offsets_`. The goal of this PR is to make sure that consumers of nested_tensor_impl constructors ensure that they are passing offsets as a temporary - either buy explicitly copying a reference, or by constructing the offsets vector in the scope of construction. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83488 Approved by: https://github.com/albanD, https://github.com/bdhirsh commit 7fdc2f70c659b51555dd293f1c63bfa724596a20 Author: Ishan-Rajgarhia Date: Wed Aug 24 02:45:49 2022 +0000 Task: T129772171 remove assertEqualIgnoreTypes from test/test_nn.py (#83870) See https://github.com/pytorch/pytorch/issues/38095 Replaced assertEqualIgnoreType with assertEqual Pull Request resolved: https://github.com/pytorch/pytorch/pull/83870 Approved by: https://github.com/kit1980 commit 6edcf8e18c68b4450fcb710cc13189ac078cee16 Author: Hansong Zhang Date: Wed Aug 24 02:17:52 2022 +0000 Move nnapi code from ATen common code to specific library (#83748) Summary: Currently we include nnapi code in all targets using ATen even if it's not used (actually there is no usage and being deprecated). Move it to `nnapi_backend_lib` for now. Test Plan: Sandcastle. Differential Revision: D38761095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83748 Approved by: https://github.com/salilsdesai, https://github.com/SS-JIA commit 84f0411f4f29bf185f941b98c9af52ff010172b4 Author: Catherine Lee Date: Wed Aug 24 02:06:50 2022 +0000 add merge blocking to ci: sev template (#83940) as in title, so that by default, ci: sev will block merges the line can be removed to not block merges Pull Request resolved: https://github.com/pytorch/pytorch/pull/83940 Approved by: https://github.com/huydhn, https://github.com/janeyx99, https://github.com/malfet, https://github.com/seemethere commit c47e0450f8bb5826108c45ff2bc3f77f2297b94b Author: Nan Xiao Date: Wed Aug 24 01:15:25 2022 +0000 [fbia] Keep Track of full qualified name before and after remote sharding (#83889) Summary: track qualname changes in embedding sharding & FX split, and compose target qualname in the end of FBIA transform stage, so we can use the qualname mapping in XL materialize stage Test Plan: CI/CD with DISABLE_XLEBB_MATERIALIZATION = True https://fburl.com/fblearner/a8yljbux with DISABLE_XLEBB_MATERIALIZATION = False https://fburl.com/fblearner/2nvi0dam Reviewed By: lliu315gt Differential Revision: D38772525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83889 Approved by: https://github.com/houseroad commit 58f61d50a45f21512947588ac8e79b5e52956cb5 Author: Edward Z. Yang Date: Tue Aug 23 18:05:26 2022 -0400 Add hypothesis to requirements.txt (#83740) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83740 Approved by: https://github.com/zhxchen17, https://github.com/janeyx99, https://github.com/zou3519 commit 84e45e7e907484f300cade2ce23e5272da660e4f Author: PyTorch MergeBot Date: Wed Aug 24 00:47:03 2022 +0000 Revert "Optimize transpose copy on CPU using fbgemm transpose (#83327)" This reverts commit 04d8da88a6a1abf0da2b11096c85244bf38d3b2a. Reverted https://github.com/pytorch/pytorch/pull/83327 on behalf of https://github.com/weiwangmeta due to breaking internal builds/causing out-of-bounds errors/training accuracy commit 591222f5d93630d312899f335f7f58505bf44544 Author: Sergii Dymchenko Date: Wed Aug 24 00:26:46 2022 +0000 Fix use-dict-literal lint (#83718) Fix use-dict-literal pylint suggestions by changing `dict()` to `{}`. This PR should do the change for every Python file except test/jit/test_list_dict.py, where I think the intent is to test the constructor. 
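For context, the shape of the change that lint asks for (both build the same dict; the literal simply avoids a global name lookup and a call; the config keys below are made up for illustration):

```python
def make_config_old():
    return dict(lr=0.1, momentum=0.9)    # flagged by pylint's use-dict-literal

def make_config_new():
    return {"lr": 0.1, "momentum": 0.9}  # preferred literal form

assert make_config_old() == make_config_new()
```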
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83718 Approved by: https://github.com/albanD commit fc470cf9806643efdbc1df650f9e8eafb671ba17 Author: Shirong Wu Date: Wed Aug 24 00:17:46 2022 +0000 Back out "Support regex-style matching for Any and Oneof (#82853)" (#83922) Reviewed By: hl475 Differential Revision: D38945806 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83922 Approved by: https://github.com/hl475 commit 89072177e10b3cf9c8fdf204381db17fe1fde068 Author: Angela Yi Date: Tue Aug 23 23:56:50 2022 +0000 [fx][pass infra] Adding error catching (#83933) Example: ``` ====================================================================== ERROR: test_pass_manager_error (fx.test_pass_infra.TestPassManager) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 285, in __call__ res = fn(module) File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 164, in pass_fail raise RuntimeError("bad") RuntimeError: bad The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 170, in test_pass_manager_error pm(traced_m) File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 289, in __call__ raise RuntimeError(msg) from e RuntimeError: An error occured when running the 'pass_fail' pass after the following passes: ['replace_add_with_mul_pass', 'replace_mul_with_div_pass'] ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/83933 Approved by: https://github.com/SherlockNoMad commit 7c8d265822088621a09e4526a4612b4acb6dc5d6 Author: Eli Uriegas Date: Tue Aug 23 11:32:03 2022 -0700 ci: Remove dead code related to android uploads (#83930) These uploads actually never got triggered in nightlies, so removing it altogether. Someone can re-add it in the future if they feel these are important, but I can't find an instance of this running since we migrated, so I have a hard time believing anyone will miss it. https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=android Signed-off-by: Eli Uriegas Pull Request resolved: https://github.com/pytorch/pytorch/pull/83930 Approved by: https://github.com/atalman, https://github.com/malfet commit 21bc77ca9673e067c7b8903f536d8f8901d148f8 Author: John Detloff Date: Tue Aug 23 22:50:09 2022 +0000 Remove CoreMLMemoryObserver (#83703) Summary: We added this observer to help us diagnose memory issues that have since been resolved. It should be safe to clean this up. Test Plan: Diff just removed logging, so just build IG and confirm no errors. Differential Revision: D38843701 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83703 Approved by: https://github.com/mcr229 commit 8fae7027b399e65e6071d335aa874497682c84d0 Author: Edward Z. Yang Date: Tue Aug 23 12:31:10 2022 -0700 Don't introduce new overload for SymInt (#83628) Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented. This PR takes a simpler but more risky approach: just take the original function and change its ints to SymInts.
This is BC-breaking in the following ways: * The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this. This is not BC-breaking in the following ways: * The user facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints. (e.g., at::empty(IntArrayRef, ...)). To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed. * This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type. Structure of the PR: * The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other: * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular: * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`). This is handled with cloneWithRealTypes before we check for schema differences. * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!) * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway. * Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do.
Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other cases. * The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK. * I change how unboxing logic works slightly. Previously, we interpreted the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it. * I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload) * I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.) * I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints. * I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628 Approved by: https://github.com/albanD, https://github.com/bdhirsh commit 4808bda7963712a5d77cb543f25fa72e2b7b3d91 Author: Zain Rizvi Date: Tue Aug 23 21:55:02 2022 +0000 Prefer signal from land checks over PR signals (#83715) When a dev forks their branch from a red master build, their branch can fail CI checks for reasons unrelated to their changes, but the same checks would however pass in the land validation commit (which is rebased off of viable/strict). Today, in the above scenario, the `merge -l` command fails because mergebot sees the failing checks in the PR, which is not helpful when that same check passes in land validation. This PR changes the behavior so that: 1. If both the PR and land validation ran a workflow, only look at the results from land validation. 2. If only the PR ran a specific workflow (e.g. for CLA Check or a nightly run), then continue to look at the result from the PR (which matches existing behavior) It also includes a few extra BE fixes: - Replaces the tuple we used to pass workflow check results around with a named tuple so that it's easier to tell what data is being used - Reduces the number of API calls to github by ~50% during merges. Before, we were pulling results from github every time and then filtering it down to the relevant category of checks (e.g. failed/pending/startup_failed).
Now, our filters share the check results. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83715 Approved by: https://github.com/zengk95 commit 25dd2a0422cf5fe38937c5b9441b1f6abc1c40cb Author: chenlai Date: Mon Aug 22 20:35:11 2022 -0700 Fix load_extra_only api for flatbuffers and enable flatbuffers in mobile for OSS properly (#83855) `_load_extra_only_for_mobile` API hasn't handled flatbuffers logic yet. Update the API accordingly. Also found out that the mobile build in OSS doesn't build with flatbuffers. Filed task T129996445 to track this. Differential Revision: [D38890847](https://our.internmc.facebook.com/intern/diff/D38890847/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38890847/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/83855 Approved by: https://github.com/qihqi commit bbe803cb35948df77b46a2d38372910c96693dcd Author: PyTorch MergeBot Date: Tue Aug 23 19:36:43 2022 +0000 Revert "Strenghten preconditions of linalg.cross (#83798)" This reverts commit 7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e. Reverted https://github.com/pytorch/pytorch/pull/83798 on behalf of https://github.com/janeyx99 due to Sorry, land race caused functorch issues https://hud.pytorch.org/pytorch/pytorch/commit/7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e commit a802603ef77960f016dc81aaa7b0b773a19d3d73 Author: kshitij12345 Date: Tue Aug 23 19:31:22 2022 +0000 [complex] conv_transpose1d (#79694) Reference: https://github.com/pytorch/pytorch/issues/71108 Pull Request resolved: https://github.com/pytorch/pytorch/pull/79694 Approved by: https://github.com/ngimel commit 9095030239bde50da8b4bbdbc6d5701d3fdfdcae Author: Khushi Agrawal Date: Tue Aug 23 19:23:39 2022 +0000 [fix] edge case in `MaxPool1d` and add ErrorInputs (#83553) Fixes #83224 cc @kshitij12345 @albanD! Pull Request resolved: https://github.com/pytorch/pytorch/pull/83553 Approved by: https://github.com/albanD commit 8f9ae35648e1ac24bd05a4c962ca0abc626bb6ca Author: Kaichen Liu Date: Tue Aug 23 19:19:38 2022 +0000 remove assertEqualIgnoreTypes from test/distributions/test_distributions.py (#83709) See https://github.com/pytorch/pytorch/issues/38095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83709 Approved by: https://github.com/kit1980 commit 5204b8e4f9045a6132e882776075f085e59eca16 Author: Mengwei Liu Date: Mon Aug 22 16:13:19 2022 -0700 [torchgen] Add documentation for `autogen` keyword (#83610) This is a follow up for #81437. This PR explains which operators can use `autogen` and what will be generated. Also talked about generated kernels and where to find them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83610 Approved by: https://github.com/albanD, https://github.com/bdhirsh commit 732255f0318f188f66dc527ce7a067d5e185194c Author: Stephen Jia Date: Tue Aug 23 09:48:07 2022 -0400 [vulkan] Add VMA as a third_party subrepo (#83906) The [VulkanMemoryAllocator](https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator) is a popular library for GPU memory allocation using Vulkan. The Vulkan backend has a dependency on it, but since it is only a single header file we currently include it by checking it into the repo under [aten/src/ATen/native/vulkan/api/vk_mem_alloc.h](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h). However, it is better to check it in as a third party submodule, since it allows better version tracking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83906 Approved by: https://github.com/kimishpatel commit 81843596cba1c3fadaed63e76d2094161347d9a4 Author: soulitzer Date: Tue Aug 23 11:19:03 2022 -0400 Fix view_func replay in no-grad mode (#83872) Fixes https://github.com/pytorch/pytorch/issues/83828 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83872 Approved by: https://github.com/albanD commit 7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e Author: lezcano Date: Tue Aug 23 10:59:11 2022 +0000 Strenghten preconditions of linalg.cross (#83798) This makes `linalg.cross` array API compliant (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs. Fixes https://github.com/pytorch/pytorch/issues/77629 Fixes https://github.com/pytorch/pytorch/issues/83756 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798 Approved by: https://github.com/mruberry commit 9beddde1d7fc401880ea8da0ad39833ff6a3cb93 Author: Ke Wen Date: Tue Aug 23 17:57:16 2022 +0000 Enable NCCL_DESYNC_DEBUG when TORCH_DISTRIBUTED_DEBUG=DETAIL (#83881) Automatically enable `NCCL_DESYNC_DEBUG` when `TORCH_DISTRIBUTED_DEBUG` is set to `DETAIL`. This saves the user from having to set two env variables. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83881 Approved by: https://github.com/malfet, https://github.com/rohan-varma, https://github.com/H-Huang commit cb488e6d2f21269d16da8bd60ec7dff1368baf98 Author: Ivan Yashchuk Date: Tue Aug 23 17:47:08 2022 +0000 Allow None arguments for elementwise type promotion wrapper and fix clamp with None arguments (#83586) Fixes https://github.com/pytorch/torchdynamo/issues/759 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83586 Approved by: https://github.com/ezyang, https://github.com/ngimel commit 8793cd2fd30b89cbc060e7ccce7deba30b6af52b Author: Nikita Shulga Date: Tue Aug 23 17:46:45 2022 +0000 Move ATenNVRTC.h include from `jit_utils.h` to `jit_utils.cpp` (#83886) In general, `.h` files should only include headers that are used in the header. Fixes https://github.com/pytorch/pytorch/issues/83856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83886 Approved by: https://github.com/ngimel commit 8db04c111363bde5c90885fee3debbd288605336 Author: Brian Hirsh Date: Mon Aug 22 08:39:28 2022 -0700 reinplace pass: special handling for view_scatter ops (#83846) There is already special handling in the reinplacing pass for removing `{view}_scatter` ops, but there is another case that needs special handling.
In this code: ``` def f(): a = torch.zeros(4, 4, 4) a[:, 2:] = torch.ones(4, 2, 4) return a ``` Tracing normally with `make_fx()` gives you: ``` def forward(self): zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False) ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False) slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807) slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807); slice_tensor = None copy__default = torch.ops.aten.copy_.default(slice_tensor_1, ones); slice_tensor_1 = ones = None return zeros ``` Functionalizing it gives you: ``` def forward(self): zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False) ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False) slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807) slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807); slice_tensor = None slice_tensor_2 = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807) slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, 9223372036854775807); slice_tensor_2 = ones = None slice_scatter_default_1 = torch.ops.aten.slice_scatter.default(zeros, slice_scatter_default, 0, 0, 9223372036854775807); zeros = slice_scatter_default = None return slice_scatter_default_1 ``` Notice that there are not any functional ops to directly re-inplace! What actually happened is that functionalization turned the `copy_()` into a `copy()`, but the out-of-place `copy()` operator gets optimized away because it's a no-op (when the input and output metadata are the same, `out = copy(a, b)` just returns `b`). What we actually want is to replace this line: ``` slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, ...); ``` with this: ``` new_slice = torch.ops.aten.slice.Tensor(slice_tensor_2, 1, 2, ...); _ = torch.ops.aten.copy_.default(new_slice, ones) ``` In the above, we're taking a fresh slice of the "base" tensor, and performing a `copy_()` on the slice, adding back what functionalization removed. We actually need to create a fresh "slice" node, because we're not guaranteed that one already exists in the graph (technically there should be one, but it might have been DCE'd by the time we hit re-inplacing) I also updated the docs for re-inplacing to more closely match the order of the logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83846 Approved by: https://github.com/ezyang commit 75ec7b754707b4ec19328d9e35f4d30f8045268c Author: Brian Hirsh Date: Mon Aug 22 08:39:25 2022 -0700 reinplace pass: bugfix for output node replacement (#83845) Cleaned up some of the arg replacement logic to use tree_map, so it handles FX nodes that have nested containers. 
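As a rough illustration of the utility mentioned above (a simplified sketch, not code from the PR), `torch.utils._pytree.tree_map` applies a function to every leaf of a nested container while preserving the container structure, which is what makes it convenient for `node.args` values that mix tuples, lists, and dicts:
```
from torch.utils._pytree import tree_map

args = (1, [2, 3], {"scale": 4})
# Replace every leaf; the tuple/list/dict structure is kept as-is.
replaced = tree_map(lambda x: x * 10, args)
print(replaced)  # (10, [20, 30], {'scale': 40})
```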
See the added test: when you write a function that returns a list, the `output` node in the FX graph shows up as having `node.args = tuple(immutable_list(...))`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83845 Approved by: https://github.com/ezyang commit 01434c2d206e3732219a1afaf4f33b294c60a7b3 Author: chengscott <60510scott@gmail.com> Date: Tue Aug 23 17:12:50 2022 +0000 Improve DistanceKernel.cu (#83811) Include device_sqrt, replace reduce_agg by BlockReduce, and choose implementation by impl_fptr instead of error-prone copy-and-paste. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83811 Approved by: https://github.com/ngimel commit df048414e0c8485ed22bb7b28a02ab5189d888f4 Author: samdow Date: Mon Aug 22 09:45:04 2022 -0400 [functorch] add linalg cross batch rule (#83759) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83759 Approved by: https://github.com/zou3519 commit e4af53c1a1b33ff38375a0fecc531324c5cd6b0b Author: Scott Wolchok Date: Mon Aug 22 16:02:39 2022 -0700 [PyTorch] Remove unused sstream/string includes from c10/macros/Macros.h (#83353) Nothing in the rest of the header seems to use these. Differential Revision: [D38672680](https://our.internmc.facebook.com/intern/diff/D38672680/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83353 Approved by: https://github.com/malfet commit 7e386845a41d2ae57e4bf80e65c39e930baba99a Author: Jane Xu Date: Tue Aug 23 16:38:34 2022 +0000 Update retry action to latest version (#83911) We're running into EPERM issues when trying to install nvidia tools, see failure example https://github.com/pytorch/pytorch/runs/7975726013?check_suite_focus=true. ``` WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver. /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1049 throw err; ^ Error: kill EPERM at process.kill (internal/process/per_thread.js:199:13) at killPid (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1059:17) at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1036:21 at Array.forEach () at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1034:23 at Array.forEach () at killAll (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1033:27) at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1024:13 at ChildProcess.onClose (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1080:17) at ChildProcess.emit (events.js:314:20) { errno: 'EPERM', code: 'EPERM', syscall: 'kill' } ``` The root issue probably lies elsewhere but this action is not helping/the errors seem to say it's unable to kill child processes. A more recent commit in that repo uses spawn instead of exec which might make a difference. Regardless, we should keep our actions up to date anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83911 Approved by: https://github.com/malfet commit a315a2c79bbd25dfb022e294df8df67568b14dea Author: Jeff Daily Date: Tue Aug 23 16:22:14 2022 +0000 [ROCm] restore MIOpen benchmark flag default to true (#82656) PR https://github.com/pytorch/pytorch/pull/77438 allowed MIOpen to support the benchmark flag. Previously, the benchmark flag was ignored by MIOpen such that benchmarking was always turned on. This commit restores the behavior that MIOpen benchmarking is by default turned on. CI unit tests cover this capability. Torchvision models demonstrate the performance delta. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82656 Approved by: https://github.com/ngimel commit 0270a707e5e6464785291f090715f0525a5019d0 Author: Horace He Date: Tue Aug 23 05:11:03 2022 +0000 Fix stride issue with faketensors (#83822) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83822 Approved by: https://github.com/ezyang, https://github.com/ngimel commit 7ebdb4c72f4158cca889f2696f12c5fbff3c1023 Author: Horace He Date: Tue Aug 23 05:11:03 2022 +0000 Refactored ops on size to be dispatcher ops (#83719) An example of how the graph looks now. ``` def forward(self, x_1): size = torch.ops.math.size(x_1, 0) size_1 = torch.ops.math.size(x_1, 1); x_1 = None ones = torch.ops.aten.ones.default([1], device = device(type='cpu'), pin_memory = False) expand_sym_int = torch.ops.aten.expand.SymInt(ones, [size, size_1]); ones = size = size_1 = None cos_default = torch.ops.aten.cos.default(expand_sym_int); expand_sym_int = None return (cos_default,) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83719 Approved by: https://github.com/ezyang commit 58170fb8aafc955258609693cd368cf8ecc8ff2b Author: Vasiliy Kuznetsov Date: Mon Aug 22 14:55:52 2022 -0700 Remove DBR quantization from the codebase (#83642) Summary: DBR quantization is a no-go for now because it does not align well with PyTorch 2.0 plans and we do not want to build yet another tracing system. Deleting it from the codebase for now since there are no plans to develop this in the near future. We can bring it back at a later time if necessary. Test plan: CI Differential Revision: [D38839556](https://our.internmc.facebook.com/intern/diff/D38839556) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83642 Approved by: https://github.com/andrewor14, https://github.com/jerryzh168 commit 4dfa6d28a139e8325fe9b255af86e8d1360ae7ee Author: mattip Date: Tue Aug 23 15:03:29 2022 +0000 Normalize DLPack stride to 1 where shape < 2 (#83158) Fixes #83069. Also move all the dlpack tests to a new file, `test_dlpack.py`. The fix involves always allocating a "strides" int array when converting to DLPack and deleting the strides when the capsule destructor is called. Then the strides are copied from the tensor, and `strides[i]` is set to `1` where `shape[i] < 2`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83158 Approved by: https://github.com/ezyang commit 247468baf0d9e105ced371a9eb42074147c1ee57 Author: jpvillam Date: Tue Aug 23 13:54:09 2022 +0000 [ROCm] More Sparse UTs enablement and more hipification mappings. (#78939) Enables: test_bmm_cuda_float64 test_bmm_deterministic_cuda_float64 test_csr_matvec_cuda_complex128 test_csr_matvec_cuda_complex64 test_csr_matvec_cuda_float32 test_csr_matvec_cuda_float64 To enable the above tests, some more hip mappings had to be added for the hipification process.
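To give a feel for what a hipification mapping does, here is a hypothetical, minimal sketch: the idea is a table of CUDA identifiers rewritten to their HIP counterparts. The entries and helper below are made up for illustration; the real tables in torch/utils/hipify use a richer entry format.
```
# Hypothetical, simplified sketch of identifier hipification; not the actual
# torch/utils/hipify tables or API.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cusparseHandle_t": "hipsparseHandle_t",
}

def hipify_source(source: str) -> str:
    # Rewrite each known CUDA identifier to its HIP equivalent.
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        source = source.replace(cuda_name, hip_name)
    return source

print(hipify_source("cudaMalloc(&ptr, nbytes); cusparseHandle_t handle;"))
# hipMalloc(&ptr, nbytes); hipsparseHandle_t handle;
```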
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78939 Approved by: https://github.com/pruthvistony, https://github.com/malfet commit ed949e22580f6f76a1476626badbc9d6161b7745 Author: PyTorch MergeBot Date: Tue Aug 23 10:25:52 2022 +0000 [xla hash update] update the pinned xla hash (#83899) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83899 Approved by: https://github.com/pytorchbot commit 7c20ad3dfae18c5e262c59e09c2b6079ec7fc69f Author: kshitij12345 Date: Tue Aug 23 08:39:35 2022 +0000 [optim] rprop: handle complex params as independent real params (#83858) Ref #65711 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83858 Approved by: https://github.com/albanD commit dd67d52b570629b7c157e4a11a9fae18f517a6e6 Author: Kshiteej K Date: Tue Aug 23 08:34:39 2022 +0000 [nn] split rnn_utils test from test_nn.py (#83675) Ref: https://github.com/pytorch/pytorch/issues/63085 Proposed folder structure ``` -> test -> nn -> test_conv.py -> test_pooling.py -> ..... ``` This PR: Moves test related RNN utilities to a different file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83675 Approved by: https://github.com/albanD commit a419e483b21eb572848f85c815e3f993cb13040c Author: Jerry Zhang Date: Mon Aug 22 19:55:57 2022 -0700 [quant][fx] Add support for quantized matmul (#83885) Summary: att, probably missed the op during migration to the reference flow Test Plan: python test/test_quantization.py TestQuantizeFxOps.test_qmatmul Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/83885 Approved by: https://github.com/andrewor14 commit 3dfb8dfcf31530aad457df50a5a89ef691438c11 Author: Justin Chu Date: Tue Aug 23 05:39:17 2022 +0000 [ONNX] Use `errors.SymbolicValueError` for more context (#83332) Replace runtime errors in torch.onnx with `errors.SymbolicValueError` for more context around jit values. 
- Extend `_unimplemented`, `_onnx_unsupported`, `_onnx_opset_unsupported`, `_onnx_opset_unsupported_detailed` errors to include JIT value information - Replace plain RuntimeError with `errors.SymbolicValueError` - Clean up: Use `_is_bool` to replace string comparison on jit types - Clean up: Remove the todo `Remove type ignore after #81112` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83332 Approved by: https://github.com/AllenTiTaiWang, https://github.com/thiagocrepaldi, https://github.com/BowenBao commit 04d8da88a6a1abf0da2b11096c85244bf38d3b2a Author: CaoE Date: Tue Aug 23 04:48:38 2022 +0000 Optimize transpose copy on CPU using fbgemm transpose (#83327) Optimize transpose copy on CPU using fbgemm transpose single socket (28cores): ``` before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 4.819e-05 ms; bf16: 4.846e-05 ms torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000171 ms; bf16: 0.000129 ms after: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 2.439e-05 ms; bf16: 2.152e-05 ms torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000132 ms; bf16: 3.916e-05 ms ``` single core: ``` before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.00109 ms; bf16: 0.00103 ms torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00339 ms; bf16: 0.00295 ms after: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.000566 ms; bf16: 0.000382 ms torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00282 ms; bf16: 0.000999 ms ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83327 Approved by: https://github.com/frank-wei commit b29a074882a2194d61f1cd7ccf939618d8384d08 Author: Rohan Varma Date: Mon Aug 22 19:00:50 2022 +0000 [BE] Revert distributed change in https://github.com/pytorch/pytorch/pull/68779 (#83181) https://github.com/pytorch/pytorch/issues/82641 points out a regression in how inputs / outputs are processed by DDP, blocking their HF use case. It was narrowed down to https://github.com/pytorch/pytorch/pull/68779 and reverting the distributed change there fixes the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83181 Approved by: https://github.com/kumpera commit 4e90526a4f03d1950e8db6e8722cce8e0fb4a5f5 Author: Rohan Varma Date: Mon Aug 22 19:00:49 2022 +0000 [FSDP] Remove unneeded checks (#83150) @awgu pointed out these checks aren't really doing anything, as they just make sure we're setting training state in certain ways throughout FSDP and is sort of arbitrary. So, removing them to avoid confusion. We still keep the checking around `_post_backward_called` because this is needed in `finalize_params` for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83150 Approved by: https://github.com/awgu commit 8e074f455720245c733bb0f3417a21c9fd5a73c7 Author: Catherine Lee Date: Tue Aug 23 01:50:26 2022 +0000 hash update - bug fix for branches (#83865) hash updates for xla were failing because the current pinned hash is a branch, so the git command for getting the date couldn't find the branch due to not having a local version of the branch. Fixed by checking out the branch to make sure it exists locally. 
example of failure: https://github.com/pytorch/pytorch/runs/7913835742?check_suite_focus=true Test plan: made it a pull request trigger and ran it, getting this: https://github.com/pytorch/pytorch/runs/7959221184?check_suite_focus=true Pull Request resolved: https://github.com/pytorch/pytorch/pull/83865 Approved by: https://github.com/zengk95 commit 7cfc8b78207e6d5f0c2caff435525b4da65ce68d Author: Driss Guessous Date: Tue Aug 23 01:13:14 2022 +0000 [MPS] Move mps_linear to mps dispatch key (#80068) Fixes #77394 This is related to #79920 which adds linear support for nested tensors. Codegen still throws an assert stopping this from compiling. However, I tested locally by commenting out this assert: https://github.com/pytorch/pytorch/blob/61305cd638b6fcd73a0b66b4cde7014fecb9e8ce/tools/autograd/gen_variable_type.py#L798 and the intended behavior appears to be working. I am not sure what changes need to be made to codegen to make this work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80068 Approved by: https://github.com/albanD, https://github.com/malfet, https://github.com/kulinseth commit b18f984307070a883dbfabf138cfcd48a391ec75 Author: Xiao Wang <24860335+xwang233@users.noreply.github.com> Date: Tue Aug 23 01:09:29 2022 +0000 [cmake] Change COLORIZE_OUTPUT option to USE_COLORIZE_OUTPUT (#83716) Close https://github.com/pytorch/pytorch/issues/83500 Change COLORIZE_OUTPUT option to USE_COLORIZE_OUTPUT so that it can be passed and disabled through an environment variable. Not sure why COLORIZE_OUTPUT=0 didn't work before, but USE_COLORIZE_OUTPUT=0 works after the change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83716 Approved by: https://github.com/malfet commit b7afee8a2736b1a9d04de59d66b016838772a95d Author: Zafar Date: Tue Aug 23 00:54:42 2022 +0000 Lazy deprecation import function in torch.nn (#83834) This introduces a mechanism to show a deprecation warning on import. This is achieved by overriding `__getattr__` in the modules that need the migration. See https://peps.python.org/pep-0562/ for details. Some of the code under torch.nn is migrating and will require a deprecation warning on import. Specifically, quantized modules are in the process of migration to the `torch.ao` package (see https://github.com/pytorch/pytorch/issues/81667).
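A minimal sketch of the PEP 562 pattern described above (simplified, with a made-up mapping; not the exact torch.nn implementation):
```
# Module-level __getattr__ (PEP 562): warn and redirect when a submodule has
# moved. The _MIGRATED mapping below is hypothetical, for illustration only.
import importlib
import warnings

_MIGRATED = {"quantized": "torch.ao.nn.quantized"}

def __getattr__(name):
    if name in _MIGRATED:
        new_name = _MIGRATED[name]
        warnings.warn(
            f"{__name__}.{name} has been moved to {new_name}",
            DeprecationWarning,
            stacklevel=2,
        )
        return importlib.import_module(new_name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```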
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83834 Approved by: https://github.com/albanD commit 658f958bc4bb314d9c6030eeaf3e1784792b5d15 Author: XiaobingSuper Date: Tue Aug 23 00:53:37 2022 +0000 fix upsample bf16 issue for channels last path by using high pricsion to compute index (#83847) Given the following case: ``` import torch a = torch.ones(1, 3, 320, 480).bfloat16().to(memory_format=torch.channels_last) out_bf16 = torch.nn.functional.interpolate(a, size = (640, 960), scale_factor = None, mode = 'bilinear', align_corners = False, recompute_scale_factor= None, antialias = False) out_fp32= torch.nn.functional.interpolate(a.float(), size = (640, 960), scale_factor = None, mode = 'bilinear', align_corners = False, recompute_scale_factor= None, antialias = False) print(out_bf16[0, 2, :, :]) print(out_fp32[0, 2, :, :]) ``` the boundary of bfloat16 output gets a wrong value: ``` tensor([[1.0000e+00, 1.0000e+00, 1.0000e+00, ..., 1.0000e+00, 1.0000e+00, 1.0000e+00], [1.0000e+00, 1.0000e+00, 1.0000e+00, ..., 1.0000e+00, 1.0000e+00, 1.0000e+00], [1.0000e+00, 1.0000e+00, 1.0000e+00, ..., 1.0000e+00, 1.0000e+00, 1.0000e+00], ..., [1.0000e+00, 1.0000e+00, 1.0000e+00, ..., 1.0000e+00, 1.0000e+00, 1.0000e+00], [1.0000e+00, 1.0000e+00, 1.0000e+00, ..., 1.0000e+00, 1.0000e+00, 1.0000e+00], [0.0000e+00, 0.0000e+00, 1.8367e-40, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00]], dtype=torch.bfloat16) tensor([[1., 1., 1., ..., 1., 1., 1.], [1., 1., 1., ..., 1., 1., 1.], [1., 1., 1., ..., 1., 1., 1.], ..., [1., 1., 1., ..., 1., 1., 1.], [1., 1., 1., ..., 1., 1., 1.], [1., 1., 1., ..., 1., 1., 1.]]) ``` the expected behavior is that the bfloat16 output value should also be one. The main reason is that we use low precision to compute the index, see https://github.com/pytorch/pytorch/blob/fcb124406bdf86bc2d15e999d5a3e09b86238bba/aten/src/ATen/native/UpSample.h#L448, we should use a high precison to do the computation as GPU path: https://github.com/pytorch/pytorch/blob/fcb124406bdf86bc2d15e999d5a3e09b86238bba/aten/src/ATen/native/cuda/UpSample.cuh#L123 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83847 Approved by: https://github.com/frank-wei commit 80cfafc3857c981d67507f135232bf43da4a1caa Author: Justin Chu Date: Fri Aug 19 19:04:47 2022 -0700 [ONNX] Add quantization support to more single output ops (#83008) - Implement quantization support for single output ops - quantized::sigmoid - quantized::instance_norm - aten::reshape - aten::reshape_as - aten::sum - aten::mean - aten::prod - aten::t - aten::numpy_T - aten::expand - aten::expand_as - aten::embedding - aten::embedding_bag - aten::view - aten::select - aten::eq - aten::ne - aten::gt - aten::lt - aten::le - aten::ge - quantized::layer_norm - aten::elu - aten::selu - aten::maximum - aten::minimum - aten::amax - aten::amin - aten::hardtanh - aten::hardswish - quantized::group_norm - aten::as_strided - quantized::leaky_relu - aten::transpose - Avoid modifying functions in `quantized_args` and have the wrapper closed over `scale` and `zero_point` instead (for purity) - Remove magic number and assign it to INT64_MAX - implement `_unpack_quantized_tensor` for handling quantized tensor unpacking to separate the logic from tuple unpacking and for clearer error handling Pull Request resolved: https://github.com/pytorch/pytorch/pull/83008 Approved by: https://github.com/BowenBao commit 1e4383f7563c9bb2e3c8e6989b6853d1d04f652f Author: Wonjoo Lee Date: Mon Aug 22 22:52:10 2022 +0000 Add lazy shape inference for cholesky op (#83720) 
PyTorch/XLA companion PR: https://github.com/pytorch/xla/pull/3907 --- Add lazy shape inference for cholesky op Pull Request resolved: https://github.com/pytorch/pytorch/pull/83720 Approved by: https://github.com/JackCaoG commit 36f6d91a2d6a913d22d5b80d64148b53243e3584 Author: Jane Xu Date: Mon Aug 22 22:19:41 2022 +0000 Migrate last workflows from 18.04 to 22.04 (#83861) 18.04 is getting deprecated in december--let's migrate them off now. This PR does NOT touch functorch nor third_party workflows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83861 Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/seemethere commit 09331c947cd559211bec20fd016e48f86d48e51f Author: Kshiteej K Date: Mon Aug 22 21:55:01 2022 +0000 [optim] rmsprop: handle complex params as independent real params (#83860) Ref: #65711 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83860 Approved by: https://github.com/albanD commit 62d9f1559e3fc1f807e1e51e95d2d2b03e8bf374 Author: Janosh Riebesell Date: Mon Aug 22 21:42:37 2022 +0000 Fix model type CNN->MLP in functorch ensembling notebook intro (#83603) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83603 Approved by: https://github.com/albanD, https://github.com/zou3519 commit dc557b94ec98e7651ac14975851ff8015188114a Author: Reza Sharifi Date: Mon Aug 22 21:34:42 2022 +0000 Used generator for "any" and "all" (#83844) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/83844 Approved by: https://github.com/albanD commit e10c47a7d09c4a558ba87d900033f565e6662812 Author: George Qi Date: Fri Aug 19 14:26:30 2022 +0000 [maskedtensor] adding unary and binary operations (#82837) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82837 Approved by: https://github.com/bhosmer commit daca0ee5e23402b6f11c90978c4655f31657b2ca Author: BowenBao Date: Mon Aug 22 10:14:36 2022 -0700 [ONNX] Introduce ONNXScopeName (#82038) Update `_setup_trace_module_map` to always record module/layer info in `Scope` attribute for nodes. Extend `Scope` name to not only record module typename, but also module object variable name. Both names are formatted and stored as `name` attribute in `Scope`. Introduce `ONNXScopeName` class to manage the formatting and parsing. Updated local function export code adjusting to this update. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82038 Approved by: https://github.com/AllenTiTaiWang, https://github.com/justinchuby, https://github.com/abock, https://github.com/malfet commit 91766360b18be25a5230702c1c855d08c62d0171 Author: zengk95 <34172846+zengk95@users.noreply.github.com> Date: Mon Aug 22 20:17:40 2022 +0000 [mergebot] Post PR Comment on cancel (#82744) When someone cancels a PR merge, it's not apparent that it's canceled unless the user clicks into that job. In this PR, we add a message if the pr gets canceled. The only thing is the user will not receive a comment if the PR is canceled immediately since posting the message requires that the checkout be finished. n/a Tested it on canary https://github.com/pytorch/pytorch-canary/pull/132 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82744 Approved by: https://github.com/huydhn, https://github.com/seemethere commit b136f3f310aa01a8b3c1e63dc0bfda8fd2234b06 Author: joncrall Date: Mon Aug 22 20:07:23 2022 +0000 More doctest refinements. 
(#83317) Follow-up to #82797. Now that the doctests themselves are in a better state, we should be able to enable xdoctest on the CI so they stay that way. @ezyang @vadimkantorov Pull Request resolved: https://github.com/pytorch/pytorch/pull/83317 Approved by: https://github.com/ezyang commit 9c9f42481761c7f42f47c296a9dbee60f3407b90 Author: Fabian Ricardo Latorre Gomez Date: Mon Aug 22 19:48:46 2022 +0000 modify the signature of method `__getitem__` from `ModuleList` (#83799) The type of the parameter idx can be either slice or int. The same applies to the `Sequential` class. Fixes #83797 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83799 Approved by: https://github.com/malfet, https://github.com/albanD commit 91eb1b9bb93ddd997691295114a5e34bd61793ad Author: Peter Bell Date: Mon Aug 22 14:23:55 2022 +0100 Move _masked opinfos to opinfo/definitions/_masked.py (#83763) Ref #82518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83763 Approved by: https://github.com/albanD commit 7656ef73f1ae73798ae965da6dedd260b7cb4f01 Author: Peter Bell Date: Mon Aug 22 14:23:55 2022 +0100 Move `torch.special` OpInfos into opinfo/definitions/special.py (#83762) Ref #82518 As with `linalg` this doesn't include ops with an alias in special, only the ones where `special.foo` is the actual name of the opinfo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83762 Approved by: https://github.com/albanD commit 35d4fa444b67cbcbe34a862782ddf2d92f5b1ce7 Author: George Petterson Date: Mon Aug 22 19:05:41 2022 +0000 Fix for transposed convolution shape functions (#83557) This fixes an issue with #80860 when the in channels and out channels are different. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83557 Approved by: https://github.com/Gamrix commit eff28d61c9c961e6c2724b78f57441ee2e3e40cb Author: John Clow Date: Wed Aug 17 15:07:57 2022 -0700 [JIT SSA] Allow updating shape functions without recompilation (#83629) In order to avoid extra round trips, and to avoid the confusion of having to manually pull in the latest copy of the shape_functions.py file, shape functions can now be updated without recompilation. This also fixes the cases where people pull in the wrong version of the file. This can happen in cases such as when developers run `python setup.py install` instead of `python setup.py develop` to generate their current copy of PyTorch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83629 Approved by: https://github.com/davidberard98 commit 53cda905be74e03161e6732d679be3b9cb2c65b0 Author: PyTorch MergeBot Date: Mon Aug 22 17:47:06 2022 +0000 Revert "Optimize transpose copy on CPU using fbgemm transpose (#83327)" This reverts commit f56720ea7c7ad0bcb4c5af669e28bf7de8122cb6.
Reverted https://github.com/pytorch/pytorch/pull/83327 on behalf of https://github.com/janeyx99 due to Sorry, reverting as this breaks mac functorch tests on trunk https://hud.pytorch.org/pytorch/pytorch/commit/f56720ea7c7ad0bcb4c5af669e28bf7de8122cb6 commit d1be36ceab0bda0aa348846f44bc3c9372e0eda3 Author: Ramin Azarmehr Date: Mon Aug 22 17:07:09 2022 +0000 [MPS] Fix the index error in constant_pad_nd() with single-dimension input (#83745) * Fix the index error in constant_pad_nd() with single-dimension input (#83343) - Also added a test case in test_mps for it * Move padding code into new file Pad.mm Fixes https://github.com/pytorch/pytorch/issues/83343 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83745 Approved by: https://github.com/razarmehr commit a6b75bb0990f2c949bf6de4e6ae58f019d61c0ac Author: Denis Vieriu <104024078+DenisVieriu97@users.noreply.github.com> Date: Mon Aug 22 17:05:53 2022 +0000 [MPS] Fix placeholder case for missing gather graph (#83744) Fixes https://github.com/pytorch/pytorch/issues/82543, https://github.com/pytorch/pytorch/issues/83230 The current Placeholder code relies on finding a gather graph in order to make the data contiguous; otherwise we'll try calling into tensor.contiguous() directly, which, for slice elements, won't do anything. E.g., consider the following basic case where we index a 2 element tensor: ``` tensor_list = torch.tensor([1.2, 1.0], device="mps") for scalar in tensor_list: r_mps = torch.ceil(scalar) r_cpu = torch.ceil(scalar.to("cpu")) self.assertEqual(r_mps.cpu(), r_cpu) ``` The second element 1.0 is a contiguous view tensor (similar to slicing), but it has no gather graph created behind it. In the placeholder, we won't be able to find the graph, thus relying on the fallback case where we call _tensor = src.contiguous();. For an already contiguous tensor, this won't do anything, thus we end up creating the NDArray with all the values of the tensor (1.2 and 1.0 instead of just 1.0). Doing clone instead of contiguous will actually perform a blit behind the scenes and take into consideration the storage_offset of the view when performing the copy. Similarly, the following basic case is also failing because of this issue: ``` x = torch.tensor([1.0, 0.49], device="mps") print(x) # prints 1.0 and 0.0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83744 Approved by: https://github.com/razarmehr commit b8496eb411a4b7af2fcd12a4fe4a6fb1690c8f6a Author: Andrew Or Date: Mon Aug 22 06:53:46 2022 -0700 [Quant] Separate FBGEMM/QNNPACK BackendConfigs (#83566) Summary: Previously we used a single BackendConfig (get_native_backend_config) for both the FBGEMM and QNNPACK backends. However, these two backends have subtle differences in terms of their requirements that cannot be satisfied using a single BackendConfig. Therefore, this commit is the first step towards decoupling the two backends. The real change in functionality will come in a future commit after DTypeConfig supports quant_min/quant_max and scale_min/scale_max. Existing uses of `get_native_backend_config` should not be affected.
Public facing changes: ``` from torch.ao.quantization.backend_config import ( get_fbgemm_backend_config, get_qnnpack_backend_config, ) fbgemm_backend_config = get_fbgemm_backend_config() qnnpack_backend_config = get_qnnpack_backend_config() ``` Test Plan: python test/test_quantization.py TestQuantizeFx python test/test_quantization.py TestQuantizeFxOps Reviewers: jerryzh168 Subscribers: jerryzh168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83566 Approved by: https://github.com/jerryzh168 commit 07d0c9ec75c1fbf7b14d67010f49548d6cb5574c Author: Nikolay Korovaiko Date: Mon Aug 22 16:41:48 2022 +0000 make sym sizes be computed lazily (#82233) Creating size nodes proactively for each tensor is leading to increased memory pressure, as they hold strong pointers to tensor data. See https://github.com/pytorch/pytorch/issues/80942. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82233 Approved by: https://github.com/wconstab commit f56720ea7c7ad0bcb4c5af669e28bf7de8122cb6 Author: ecao Date: Mon Aug 22 16:39:33 2022 +0000 Optimize transpose copy on CPU using fbgemm transpose (#83327) Optimize transpose copy on CPU using fbgemm transpose single socket (28cores): ``` before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 4.819e-05 ms; bf16: 4.846e-05 ms torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000171 ms; bf16: 0.000129 ms after: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 2.439e-05 ms; bf16: 2.152e-05 ms torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000132 ms; bf16: 3.916e-05 ms ``` single core: ``` before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.00109 ms; bf16: 0.00103 ms torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00339 ms; bf16: 0.00295 ms after: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.000566 ms; bf16: 0.000382 ms torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00282 ms; bf16: 0.000999 ms ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83327 Approved by: https://github.com/frank-wei commit fcb124406bdf86bc2d15e999d5a3e09b86238bba Author: Nikolay Korovaiko Date: Fri Aug 19 20:37:32 2022 -0700 release the current symintnode in the move c-tor (#83789) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83789 Approved by: https://github.com/ezyang commit b47f712b7b571831c7e5a0690bccef8d8b1a54b6 Author: Nikolay Korovaiko Date: Fri Aug 19 20:37:24 2022 -0700 Fix uninitialized member if the default c-tor is called (#83788) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83788 Approved by: https://github.com/ezyang commit 09157c76c04546f9f55b15a37dda938d6327df8a Author: Mike Iovine Date: Mon Aug 22 13:42:47 2022 +0000 [Static Runtime] Add schema checks for aten::list (#83753) Summary: The previous implementation assumed that there was only one overload and unconditionally tried to convert its input into a string. Some users were running into crashes because of this. Added handling for the list overload and schema checks. Also, I managed to uncover another bug when writing tests for this case (yikes). Returning inputs didn't work because the input cleanup process would destroy the output. Extended `CreateOwnedRefsForSpecialIValues` to fix that.
Test Plan: CI + new unit tests Differential Revision: D38870803 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83753 Approved by: https://github.com/tenpercent, https://github.com/albanD commit d46dba18f7e96811c672afe629d891ad6b8d095d Author: lezcano Date: Sun Aug 21 23:56:09 2022 +0000 Simplify reshape and fix _refs.unflatten (#83827) Unflatten was incorrectly calling into `reshape` rather than `view`. When looking at the checks performed in `reshape`, I saw that the in PrimTorch is quite divergent from that in PyTorch, to the point that it took me some time to be able to prove that they were equivalent. I refactored that part into a separate function, and I implemented the logic that we have in ATen, together with the same errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83827 Approved by: https://github.com/ngimel commit 473b733bae7009945cc5712699d346678e8a40ff Author: Ivan Yashchuk Date: Mon Aug 22 09:12:13 2022 +0000 Replace .new_zeros(()) with 0.0 in torch/_decomp/decompositions (#83734) `new_zeros` is decomposed into `prims.empty_strided`+`prims.fill`+`prims.copy_to` and none of these are supported by prims+nvFuser executor currently. Replacing it with 0.0 makes these backward decompositions nvFuser friendly. Example with `torch.ops.aten.hardsigmoid_backward.default`: ```py opcode name target args kwargs ------------- ------------------------ -------------------------------- ------------------------------------------------------------ ---------------------------------------------------------------------------------------- placeholder a_1 a_1 () {} placeholder g_1 g_1 () {} call_function gt_default nvprims.gt.default (a_1, -3.0) {} call_function lt_default nvprims.lt.default (a_1, 3.0) {} call_function bitwise_and_default nvprims.bitwise_and.default (gt_default, lt_default) {} call_function mul_default nvprims.mul.default (g_1, 0.16666666666666666) {} call_function empty_strided prims.empty_strided.default ([], []) {'dtype': torch.float32, 'device': device(type='cuda', index=0), 'requires_grad': False} call_function fill_default prims.fill.default (empty_strided, 0) {} call_function copy_to_default prims.copy_to.default (empty_strided, fill_default) {} call_function broadcast_in_dim_default nvprims.broadcast_in_dim.default (copy_to_default, [3, 2], []) {} call_function where_default nvprims.where.default (bitwise_and_default, mul_default, broadcast_in_dim_default) {} output output output (where_default,) {} opcode name target args kwargs ------------- ------------------- --------------------------- --------------------------------------- -------- placeholder a_1 a_1 () {} placeholder g_1 g_1 () {} call_function gt_default nvprims.gt.default (a_1, -3.0) {} call_function lt_default nvprims.lt.default (a_1, 3.0) {} call_function bitwise_and_default nvprims.bitwise_and.default (gt_default, lt_default) {} call_function mul_default nvprims.mul.default (g_1, 0.16666666666666666) {} call_function where_default nvprims.where.default (bitwise_and_default, mul_default, 0.0) {} output output output (where_default,) {} Pull Request resolved: https://github.com/pytorch/pytorch/pull/83734 Approved by: https://github.com/Chillee commit 6a9c02339d02fe2f701e17ae7d7f3304dab15d98 Author: PyTorch MergeBot Date: Mon Aug 22 07:32:37 2022 +0000 Revert "[quant][ao_migration] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` (#78713)" This reverts commit 432f037498e3f470f1f6d2a5cc7c6ae8eb4fc870. 
Reverted https://github.com/pytorch/pytorch/pull/78713 on behalf of https://github.com/janeyx99 due to Reverting for breaking (trunk-only) ios build commit b1a7b67529110ce6cfdb50b9ea9e3e0ccf8196bc Author: PyTorch MergeBot Date: Mon Aug 22 07:30:48 2022 +0000 Revert "[quant][ao_migration] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` (#78714)" This reverts commit e6fb97d8ae0d2a45e26c9a597426f1ded13d3aec. Reverted https://github.com/pytorch/pytorch/pull/78714 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted commit 355d343fa85a6c9ae415bddaaf5352c6ce850f1e Author: PyTorch MergeBot Date: Mon Aug 22 07:29:15 2022 +0000 Revert "[quant][ao_migration] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` (#78715)" This reverts commit a7344e52b9d746062923647ae00ca38578d272d4. Reverted https://github.com/pytorch/pytorch/pull/78715 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted commit e9dd4d5adf391ed38b0b9152493958fc3d9a1350 Author: PyTorch MergeBot Date: Mon Aug 22 07:26:43 2022 +0000 Revert "[quant][ao_migration] `torch.nn.quantizable` → `torch.ao.nn.quantizable`. (#78717)" This reverts commit e0876feb493d4378d5aedced367eeaae75339741. Reverted https://github.com/pytorch/pytorch/pull/78717 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted commit 4cbb1986fe9e1f0ae3d352686378808789aa9186 Author: PyTorch MergeBot Date: Mon Aug 22 07:23:24 2022 +0000 Revert "[quant][ao_migration] `torch.nn.qat` → `torch.ao.nn.qat` (#78716)" This reverts commit 7cd2fa1d388bf240cd33ff933dc120e74ebc2eb3. Reverted https://github.com/pytorch/pytorch/pull/78716 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted commit 3c6c39e66e99a09677739a72c339afbd79cdc12f Author: Alex Beloi Date: Mon Aug 22 06:54:18 2022 +0000 [fx] refactor fba_passes into FBAPassManagerBuilder (#83268) Summary: This diff integrate FBAPassManagerBuilder as the primary orchestrator of FBA-FX passes Reviewed By: jfix71, dborkovic Differential Revision: D38186354 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83268 Approved by: https://github.com/dborkovic commit 7cd2fa1d388bf240cd33ff933dc120e74ebc2eb3 Author: zaf Date: Sun Aug 21 19:34:58 2022 -0700 [quant][ao_migration] `torch.nn.qat` → `torch.ao.nn.qat` (#78716) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. 
The list of the `nn.quantized` files that are being migrated: - [X] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat` - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - None Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716 Approved by: https://github.com/jerryzh168 commit e0876feb493d4378d5aedced367eeaae75339741 Author: zaf Date: Sun Aug 21 19:34:56 2022 -0700 [quant][ao_migration] `torch.nn.quantizable` → `torch.ao.nn.quantizable`. (#78717) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [X] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [X] [Current PR] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - None Differential Revision: [D36861090](https://our.internmc.facebook.com/intern/diff/D36861090/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861090/)! 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78717 Approved by: https://github.com/jerryzh168 commit a7344e52b9d746062923647ae00ca38578d272d4 Author: zaf Date: Sun Aug 21 19:34:54 2022 -0700 [quant][ao_migration] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` (#78715) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] [Current PR] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - None Differential Revision: [D36860927](https://our.internmc.facebook.com/intern/diff/D36860927/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860927/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/78715 Approved by: https://github.com/jerryzh168 commit 08126c8967937be9e6be7bbb34a2c01b84aa0c1d Author: Animesh Jain Date: Mon Aug 22 05:22:30 2022 +0000 Minifier fixes (#83754) cc @Chillee Pull Request resolved: https://github.com/pytorch/pytorch/pull/83754 Approved by: https://github.com/Chillee commit e6fb97d8ae0d2a45e26c9a597426f1ded13d3aec Author: zaf Date: Sun Aug 21 19:34:53 2022 -0700 [quant][ao_migration] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` (#78714) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. 
The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] [Current PR] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - [Documentation](docs/source/quantization-support.rst) @vkuzo - [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10 - [BC test](test/quantization/bc/test_backward_compatibility.py) @vkuzo - [IR emitter](torch/csrc/jit/frontend/ir_emitter.cpp) @jamesr66a - [JIT serialization](torch/csrc/jit/serialization/import_source.cpp) @IvanKobzarev @jamesr66a Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860660/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/78714 Approved by: https://github.com/jerryzh168 commit 8948fdc525488c08c703befa704b4c4179732e3c Author: chenlai Date: Sun Aug 21 14:13:02 2022 -0700 Switch mobile targets to flatbuffers_mobile (#82829) Differential Revision: [D38412635](https://our.internmc.facebook.com/intern/diff/D38412635/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38412635/)! Differential Revision: [D38412635](https://our.internmc.facebook.com/intern/diff/D38412635) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82829 Approved by: https://github.com/qihqi commit f0eb841d209f251d6a735827d4b903962d0d31b8 Author: Emilio Castillo Date: Mon Aug 22 03:37:10 2022 +0000 Make `torch.optim.RMSprop` differentiable (#83578) Blocked by #82205 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83578 Approved by: https://github.com/albanD commit ac39d2bd6e423c215338ce72b150afad2afe924c Author: Edward Z. Yang Date: Sat Aug 20 22:52:56 2022 -0400 Make negative integer test always done for Int to SymInt (#83815) Otherwise, it would be easy to trigger arbitrary memory access by passing a sufficiently negative integer to the API. Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83815 Approved by: https://github.com/Chillee commit 4902254b9b595adf1a0346d6f79a6c7b145dbcaa Author: migeedz Date: Thu Aug 18 10:13:37 2022 -0700 fix torch._C._nn.linear bug (#83682) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83682 Approved by: https://github.com/jansel commit da6cd12173194559a1420e94ca3d0a6f6319929e Author: migeedz Date: Thu Aug 18 10:13:36 2022 -0700 gt constraint heutristic (#83334) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83334 Approved by: https://github.com/jansel commit 432f037498e3f470f1f6d2a5cc7c6ae8eb4fc870 Author: zaf Date: Sun Aug 21 14:54:38 2022 -0700 [quant][ao_migration] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` (#78713) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] [Current PR] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - Documentation @vkuzo - docs/source/conf.py - docs/source/quantization.rst - [quantize_fx](torch/ao/quantization/quantize_fx.py) @jerryzh168 - [common test routine](test/quantization/ao_migration/common.py) @HDCharles - JIT stuff @jamesr66a - torch/csrc/jit/passes/hoist_conv_packed_params.cpp - torch/csrc/jit/passes/quantization/helper.h - torch/csrc/jit/serialization/import_source.cpp Differential Revision: [D36860145](https://our.internmc.facebook.com/intern/diff/D36860145/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78713 Approved by: https://github.com/jerryzh168 commit 765fd77d9a96983e1a2adf496ac2fe66b4825f45 Author: Eli Uriegas Date: Sun Aug 21 12:33:41 2022 -0700 ci: Switch binary builds to github artifacting (#83778) Switches binary builds artifacting from s3 artifact solution to github's artifact solution. 
Signed-off-by: Eli Uriegas Pull Request resolved: https://github.com/pytorch/pytorch/pull/83778 Approved by: https://github.com/malfet commit 91e754b268c1869df5b2836f15c73e6ec1e265f1 Author: Nikita Shulga Date: Fri Aug 19 22:01:43 2022 +0000 [BE] setup.py refactors (#83635) No function changes, just move stuff around: - Move main code to `main` routine - Define torch and torchgen package data list in local vars Pull Request resolved: https://github.com/pytorch/pytorch/pull/83635 Approved by: https://github.com/kit1980 commit 5c5a5f150589b63220ffa2c8da11781d2a593a2b Author: Rui Zhu Date: Sun Aug 21 06:51:11 2022 +0000 Add HIP libs into torch depoly init list & corresponding dependency for CURE benchmark running on AMD (#83434) Summary: This diff adds needed targets for CURE benchmark on AMD, and also add hip lib to torch deploy init list Test Plan: on AMD host fbcode/, With model generated by D38509136 model.pt. cp model.pt /tmp/textray_v20220509.pt buck build mode/{dev-nosan,amd-gpu} mode/lower-locally -c fbcode.enable_gpu_sections=true -c fbcode.rocm_arch=mi100 -c fbcode.platform=platform010 //accelerators/tools/benchmark:PyTorchPredictorInferenceBenchmark buck-out/gen/accelerators/tools/benchmark/PyTorchPredictorInferenceBenchmark --replay_record_format recordio --replay_record_source /tmp/textray_20220509_prod.recordio --model_path /tmp/textray_v20220509.pt --batch_size=64 --batching_threads=1 --max_batch_wait_ms=500 --min_threads 5 --max_threads 5 --timeout_seconds 120 --check_allow_extra_field --diff_threshold 1e-3 --equal_threshold 1e-4 --thread_step 5 --use_cuda Reviewed By: mikekgfb Differential Revision: D38596119 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83434 Approved by: https://github.com/erichan1 commit 09e837634bc76c4380db7119e0e997816d584844 Author: Taylor Robie Date: Fri Aug 19 18:42:39 2022 -0700 [Profiler][Minor] Set end time on python events when profiling stops. (#83621) We don't have an end event for calls that are ongoing when profiling stops. (e.g. main) This cropped up when I was adding checks for negative durations. I also refactored `populate` to use a pop method. This not only allows me to implement this fix, but should also provide a convenient entry point for https://github.com/pytorch/pytorch/pull/82154 Differential Revision: [D38426342](https://our.internmc.facebook.com/intern/diff/D38426342/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83621 Approved by: https://github.com/slgong-fb commit 37f91d700bcb80c5f254c734a420ad89928771f4 Author: Taylor Robie Date: Fri Aug 19 18:42:37 2022 -0700 [Profiler] Break metadata generation into multiple visitors (#83033) This is a no-op change which establishes a base class to handle Result to Kineto details, and then splits the existing logging logic. (With the idea that at some point we'll probably conditionally run things to manage trace size.) Differential Revision: [D38469409](https://our.internmc.facebook.com/intern/diff/D38469409/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83033 Approved by: https://github.com/aaronenyeshi commit f295dd0735e05146106d6fc25df1449ff76d078b Author: Taylor Robie Date: Fri Aug 19 18:42:36 2022 -0700 [Profiler][Minor] Add typed visit method to Result. (#82993) Often in post processing we want to step into a specific typed context: "If X is a torch op, do Y". This PR simply adds an ergonomic way to write such cases. 
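The typed-visit change in #82993 is a C++ addition to `torch::profiler::impl::Result`, but the pattern it enables ("if X is a torch op, do Y") can be sketched in Python; everything below (`TorchOpEvent`, `AllocationEvent`, `visit`) is an illustrative assumption, not the actual API:

```python
# Rough Python analogue of the "typed visit" idea from #82993; the real change
# is a C++ method on torch::profiler::impl::Result, so these names are hypothetical.
from dataclasses import dataclass
from functools import singledispatch

@dataclass
class TorchOpEvent:
    name: str

@dataclass
class AllocationEvent:
    nbytes: int

@singledispatch
def visit(event):
    # Default case: ignore event kinds we don't care about.
    pass

@visit.register
def _(event: TorchOpEvent):
    print(f"torch op: {event.name}")      # "if X is a torch op, do Y"

@visit.register
def _(event: AllocationEvent):
    print(f"allocation: {event.nbytes} bytes")

for e in [TorchOpEvent("aten::add"), AllocationEvent(1024)]:
    visit(e)
```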
Differential Revision: [D38426341](https://our.internmc.facebook.com/intern/diff/D38426341/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82993 Approved by: https://github.com/chaekit commit 294f9d12826d63df1a736ffd4ef9dab2af2d10d0 Author: Taylor Robie Date: Fri Aug 19 18:42:34 2022 -0700 [Profiler][Minor] Organize collection.h/.cpp (#82992) Collection of Torch ops is quite complex compared to backend events / allocations / ooms. Python is also complex, however it is already factored into a standalone unit. This PR just shuffles the contents of collection.cpp to group the Torch op specific parts together, and does various cleanups to the code. Differential Revision: [D38426344](https://our.internmc.facebook.com/intern/diff/D38426344/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82992 Approved by: https://github.com/chaekit commit c9475fa927ef5557aa54e4e9a7bc2a9ab98cdcf7 Author: chenlai Date: Fri Aug 19 16:02:29 2022 -0700 Create flatbuffers_mobile (#82828) Differential Revision: [D38412636](https://our.internmc.facebook.com/intern/diff/D38412636/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38412636/)! Differential Revision: [D38412636](https://our.internmc.facebook.com/intern/diff/D38412636) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82828 Approved by: https://github.com/qihqi commit f45cd00d7ac1ce02c890bd73a96dc4be2233ad0d Author: Horace He Date: Sat Aug 20 01:46:32 2022 +0000 Added inference to context when only compiling forwards (#83783) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83783 Approved by: https://github.com/pyjhzwh, https://github.com/jansel commit 08c03c91d70a625f70487af81cd54edd5d16aa1a Author: Pallab Bhattacharya Date: Sat Aug 20 12:29:02 2022 +0000 guard include of x64 intrinsics headers (#83793) Summary: make inclusion of immintrin.h only for x64 Test Plan: CI Differential Revision: D38886597 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83793 Approved by: https://github.com/ajtulloch commit e0f2eba93d2804d22cd53ea8c09a479ae546dc7f Author: Rui Zhu Date: Sat Aug 20 10:02:08 2022 +0000 Move odd num_head in TransformerEncoder to slow_path (#83483) Summary: odd nhead is not supported for masked softmax, therefore we just move it to use old slow_path Test Plan: CI Differential Revision: D38720086 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83483 Approved by: https://github.com/erichan1 commit 5a1f6d50a9a4059315c5530703e5dc9e38229715 Author: David Berard Date: Sat Aug 20 05:25:03 2022 +0000 Skip pr-sanity-checks with skip-pr-sanity-checks label (#83751) see #83752 for demo Pull Request resolved: https://github.com/pytorch/pytorch/pull/83751 Approved by: https://github.com/albanD, https://github.com/malfet commit f0ee21fe0ad3ac5ae9b07a863697346651c7e230 Author: Huy Do Date: Sat Aug 20 06:16:54 2022 +0000 Update cpuinfo to the latest commit (#83620) This hasn't been updated for a while, so pulling the latest commit from https://github.com/pytorch/cpuinfo. 
I wonder if it breaks anything Fixes #83594 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83620 Approved by: https://github.com/malfet commit b2ddef28d70ca96f75a39f535a41a2d130f14d40 Author: Huy Do Date: Sat Aug 20 03:37:21 2022 +0000 Freeze the rest of python docs requirement (#83785) This is to avoid similar issue like #83774 `pip freeze -r requirements.txt` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83785 Approved by: https://github.com/malfet commit 9732a7d84ee72521d006c9617430c4415016daef Author: Adam J. Stewart Date: Sat Aug 20 01:26:30 2022 +0000 torch.cartesian_prod: add type hints (#81377) Noticed this function was missing type hints. There are plenty more obviously, but this is the only one I happen to be using that is missing type hints. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81377 Approved by: https://github.com/malfet commit 0e0af73ba20f0fae3a20c385cc112a3be12337ef Author: Horace He Date: Sat Aug 20 00:47:11 2022 +0000 Add support for partial decompositions in make_fx (#83770) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83770 Approved by: https://github.com/ngimel commit d5a74efc82f632444f71dc19b111cc57bd406d19 Author: Edward Z. Yang Date: Thu Aug 18 19:22:38 2022 -0700 Don't extract tensor metadata from sparse tensors (#83669) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83669 Approved by: https://github.com/Chillee, https://github.com/bdhirsh commit 329deb9757469340379efe3edb09b7dac814a4e7 Author: Edward Z. Yang Date: Thu Aug 18 19:22:37 2022 -0700 Refactor is_X_like, better invariant checking for SymInt overload (#83668) Add is_symint_like, by way of is_base_ty_like which generalizes the pattern for is_tensor_like and is_generator_like. Now that we can query if a signature contains a SymInt, we can enforce that you must name the overload with SymInt if the signature contains SymInt. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83668 Approved by: https://github.com/bdhirsh, https://github.com/larryliu0820 commit 7fe19c03e482be1d108fbfca8fb0214e133970ad Author: Brian Hirsh Date: Fri Aug 19 11:56:55 2022 -0700 fix functionalization <> fake tensor mode (#83701) The bug is that: (1) functionalization kernels internally call `at::empty_strided()` to construct meta tensors, and then call the meta tensor op (2) This happens with the Python dispatch key already added to the TLS exclude set, so we expect these meta tensors never to enter python (3) When calling detach() though, `TensorImpl::shallow_copy_and_detach()` will currently always call into python when a PythonMode is set. Instead, I updated it to check if the Python key is in the TLS exclude set first. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83701 Approved by: https://github.com/ezyang commit e9e7363854978e2f70cc65200b5b76a462e37672 Author: Brian Hirsh Date: Fri Aug 19 11:56:54 2022 -0700 reinplacing pass fixes for torchbench + huggingface (#83626) I'm testing out turning on re-inplacing + functionalization by default with the AOTAutograd + eager backend on torchbench + huggingface models. This PR contains a few bug fixes from turning re-inplacing on: (1) Handle more gracefully when FakeTensorMode is already turned on when you call reinplace (2) More robust detection for when an inplace variant of an op exists (the dumb bug was that `pow.Scalar` doesn't have an inplace variant, even though there are several overloads of `pow_`. 
None of them are eligible though (3) Avoid re-inplacing when it would require resizing the input buffer. This isn't allowed, because inplace ops aren't allowed to resize their inputs. For the last one, I gave the two main examples in more detail in the comments. Important cases are: ``` torch.add(tensor[1, 4], tensor[4, 4]) torch.ge(a, b) ``` (4) There's some logic around keeping `storage_to_nodes` up to date when we see a view op: if we re-inplace `out = a.add(...)`, and later in the program we encounter a "later_node",`out.view(..)`, and need to replace it with `a.view(...)`, then we need to update some metadata structures. I had to fix that logic: specifically, if "later_node" isn't a dispatcher op, (e.g. if it's an FX output node), I wasn't properly handling the case where the node's fake_meta info was not a tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83626 Approved by: https://github.com/ezyang commit cce32c6fa16bea7b1574bca76f0c18a78b68a04a Author: Brian Hirsh Date: Fri Aug 19 06:29:43 2022 -0700 functionalization: handle models that resize their program inputs (#83542) Context: When turning on functionalization + fake tensor mode and running `resnet50_quantized_qat` from torchbench, the model fails (and should be fixed by this PR). Before landing this PR, we need ProxyTensors (soon to be just fake tensors after https://github.com/pytorch/pytorch/pull/83330) to be resizable Pull Request resolved: https://github.com/pytorch/pytorch/pull/83542 Approved by: https://github.com/ezyang commit 0c24af498578480547822bce5c3aa43a6fa8b920 Author: Brian Hirsh Date: Fri Aug 19 06:29:43 2022 -0700 Always allow tensor metadata changes (#83590) Make it so that it is valid to set metadata after detach calls, like `x.detach().resize_(...)`. This technically lifts some restrictions around `.data`. This PR means that you can now technically call `x.data.resize_(...)`, which can now directly resize `x` instead of erroring. My understanding: Before the tensor-variable merge, when `x` and `x.data` were really different tensors, you could resize `x.data` independently of `x`, and during the merge, this error was added to avoid silent confusing behavior changes. It was agreed that this error has been around long enough (several years) that it's acceptable to drop. cc @albanD @ezyang. (Ed already had a prototype PR [here](https://github.com/pytorch/pytorch/pull/83545) - I ended up making one to try to slog through test failures). Pull Request resolved: https://github.com/pytorch/pytorch/pull/83590 Approved by: https://github.com/ezyang commit a7d8863c7af99da8c5e10e2ed942f15d35980b86 Author: ssjia Date: Fri Aug 19 13:12:22 2022 -0700 [vulkan][ez] lock cache mutex when purging for ShaderCache (#83738) Acquire mutex before clearing cache in `ShaderCache`. Differential Revision: [D38865341](https://our.internmc.facebook.com/intern/diff/D38865341/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83738 Approved by: https://github.com/kirklandsign commit 155343ef2d9e75701ddf772e942a7b76b1f80570 Author: Huy Do Date: Fri Aug 19 22:27:29 2022 +0000 Pin sphinxcontrib.katex to 0.8.6 (#83774) sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering but it doesn't seem to work and hangs around idly. The initial thought is probably something related to Docker setup. We can investigate this later. 
Here is the release change log from the [sphinxcontrib-katex](https://github.com/hagenw/sphinxcontrib-katex/commit/e27a051532dee33fbe329636b042426bf3ad6e26) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83774 Approved by: https://github.com/janeyx99, https://github.com/malfet commit 2efbdbfcc490660e875ffe0bb8fad3dad7e8920c Author: Horace He Date: Fri Aug 19 04:37:55 2022 +0000 Make some optimizations to minifier (#83641) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83641 Approved by: https://github.com/eellison commit 13f42069a8659f24005e92a52432b9b91423150b Author: Jerry Zhang Date: Fri Aug 19 04:18:09 2022 +0000 [quant][fx][refactor] Rename qconfig_utils.py to qconfig_mapping_utils.py in torch/ao/quantization/fx (#83369) Summary: att, it seems more appropriate to name it qconfig_mapping_utils, also we probably want to move the functions in torch/ao/quantization/qconfig_mapping_utils.py to torch/ao/quantization/fx/qconfig_mapping_utils.py as well Test Plan: python test/test_quantization.py TestQuantizeFx Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/83369 Approved by: https://github.com/andrewor14 commit 1f38225b56b873c944196241ea448445a61798fd Author: Nikita Karetnikov Date: Fri Aug 19 18:51:57 2022 +0000 [primTorch] Add ref for `new_empty_strided` (#82466) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82466 Approved by: https://github.com/ezyang, https://github.com/ngimel commit 307421930a0e1ac610c42c07470813ea14f53964 Author: Aashaka Shah Date: Fri Aug 19 17:59:28 2022 +0000 Enable pg_nccl to perform vector AllGather for uneven output splits (#83713) Pushing PR on behalf of @aashaka To replace: https://github.com/pytorch/pytorch/pull/82835 Summary: A vector all_gather requires each process to gather other process' inputs into an output tensor according to the ouput list provided. Internally, pg_nccl.allgather will coalesce a list of pg_nccl._broadcast_oop to implement a vector all-gather in the case when the any shape is different in the output list. Otherwise, it will perform a ncclAllGather as usual. - This change adds an out-of-place `_broadcast_oop` function to ProcessGroupNCCL. It allows broadcasting an input tensor and placing the output in a separate output tensor. Since allgather provides an out-of-place API, an allgather_v semantic implemented inside `pg_nccl.allgather` also needs to support out-of-place, for which an out-of-place broadcast is required to be added. Test Plan: Added a new test `test_all_gather_v_cuda` for all_gather_v to `distributed_nccl_spawn`. Differential Revision: D37735263 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/83713 Approved by: https://github.com/mingzhe09088 commit 1fa9a377d01ba8e1a0b65cc2d05ed8a2d53a89f2 Author: Taylor Robie Date: Thu Aug 18 17:51:32 2022 -0700 [Profiler] Start moving python bindings out of autograd (#82584) A lot of profiler code still lives in autograd for historic reasons. However as we formalize and clean up profiler internals it makes sense to pull more and more into the profiler folders/namespace. For now I'm just moving some of the core config data structures and those related to `torch::profiler::impl::Result` to keep the scope manageable. 
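For the vector all-gather (#83713) described above, here is a minimal per-rank sketch of the uneven-output usage it enables; the helper name and setup are assumptions, and an already-initialized NCCL process group is presumed:

```python
# Per-rank sketch of the "all_gather_v" semantics from #83713 (assumptions:
# process group already initialized, rank/world_size come from the launcher).
import torch
import torch.distributed as dist

def gather_uneven(rank: int, world_size: int, device: torch.device):
    # Each rank contributes `rank + 1` rows, so output shapes differ per rank.
    my_input = torch.full((rank + 1, 4), float(rank), device=device)
    outputs = [torch.empty(r + 1, 4, device=device) for r in range(world_size)]
    # With mismatched shapes in `outputs`, ProcessGroupNCCL coalesces
    # out-of-place broadcasts internally instead of a single ncclAllGather.
    dist.all_gather(outputs, my_input)
    return outputs
```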
Differential Revision: [D37961462](https://our.internmc.facebook.com/intern/diff/D37961462/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D37961462/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/82584 Approved by: https://github.com/albanD, https://github.com/Gamrix commit 7453019e7943717c79b4b3b07e01f2ae5d7bc89f Author: Daniel Recoskie Date: Thu Aug 18 07:55:17 2022 -0700 Remove duplicate_dequantize_node and remove_extra_dequantize (#83611) Summary: removed duplicate_dequantize_node and remove_extra_dequantize Test Plan: python test/test_quantization.py TestQuantizeFx python test/test_quantization.py TestQuantizeFxOps python test/test_quantization.py TestQuantizeFxModels Reviewers: jerryzh168 Subscribers: Tasks: Tags: Differential Revision: [D38841052](https://our.internmc.facebook.com/intern/diff/D38841052) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83611 Approved by: https://github.com/jerryzh168 commit da520a43f228d4b2f5fda6ec0412080504822fc7 Author: Manuel Candales Date: Fri Aug 19 16:28:59 2022 +0000 [Vulkan] Fix issues in GRU and LSTM (#83722) Summary: This diffs fixes several issues in GRU and LSTM vulkan ops: - Add create_gru_context and create_lstm_context to vulkanFoldPrePackingOps - Add filter to insertPrePackedGruOp and insertPrePackedLstmOp to avoid matching gru.data and lstm.data usages - Fixed output dimension of GRU and LSTM - Allowed batch_first to be false when batch=1 and seq=1 Test Plan: Check that optimize_for_mobile runs and correctly folds the create context ops ``` buck run :export_for_mobile ~/ferraris/ferraris.ptl ~/ferraris ``` Check that vulkan api tests are still passing ``` buck run //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 ``` Reviewed By: SS-JIA Differential Revision: D38811967 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83722 Approved by: https://github.com/SS-JIA commit 108a1fb17374974534254f5b4652bbf2b3dff0e5 Author: Ivan Yashchuk Date: Fri Aug 19 16:20:34 2022 +0000 Avoid using fx.Interpreter in nvfuser executor function (#83607) Using fx.Interpreter is a nice way of modifying the calls inside of FX graphs, but it introduces unnecessary overhead in this case. Example: ```py import torch from torch.fx.experimental.proxy_tensor import make_fx from torch._prims.context import TorchRefsNvfuserCapabilityMode from torch._prims.executor import execute a = torch.randn(3, 2, dtype=torch.float16, device="cuda") s = torch.sigmoid d = torch.digamma # digamma is not supported in nvfuser and aten eager execution is used def func(a): return s(d(s(d(s(d(s(a))))))) with TorchRefsNvfuserCapabilityMode(): gm = make_fx(func)(a) %%timeit execute(gm, a, executor="nvfuser"); torch.cuda.synchronize(); ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83607 Approved by: https://github.com/ezyang commit e0d26ee0927aa4193a444840e1d52699f65f9724 Author: ssjia Date: Thu Aug 18 14:49:04 2022 -0700 [vulkan] Throw std::runtime_error instead of using TORCH_CHECK when creating Vulkan context/runtime fails (#83627) Currently, if unable to load the global context/runtime, an error will be thrown using `TORCH_CHECK(false, ...)`. This diff changes it to throw a `std::runtime_error` directly instead. The reason for this is that `TORCH_CHECK()` will not preserve error messages `#ifdef STRIP_ERROR_MESSAGES`. 
However, it is more useful for the error reason to be always present for the purpose of diagnosing driver support issues and detecting if a model load failure is related to Vulkan. Differential Revision: [D38800348](https://our.internmc.facebook.com/intern/diff/D38800348/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83627 Approved by: https://github.com/salilsdesai commit 1407e6728cb29d33eceffc55428f82baceb628c5 Author: jjsjann123 Date: Fri Aug 19 16:05:39 2022 +0000 Nvfuser python api patch take 2 (#83684) landing #83645 again. Previously we are breaking on codegen bf16 kernel for cuda TK 10.2. Added a short-cut to disable bf tests on pre cuda 11 build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83684 Approved by: https://github.com/ngimel commit 0ec7fc13d6e02ad0f09bd115cba89fa8304d4f12 Author: Edward Z. Yang Date: Thu Aug 18 19:22:37 2022 -0700 Refactor CppSignatureGroup to collect signatures as list. (#83667) This makes it easier to add more signatures to the signature group, as relevant logic which needs to run for each signature no longer needs to be adjusted. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83667 Approved by: https://github.com/larryliu0820, https://github.com/bdhirsh commit 03e322c8d64056db748dd28ca6121180db2f9fe3 Author: Sherlock Huang Date: Fri Aug 19 15:55:44 2022 +0000 Switch fx.replace_pattern to use new SubgraphMatcher (#83717) This is a duplicate of https://github.com/pytorch/pytorch/pull/82295 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83717 Approved by: https://github.com/ezyang commit 73652dd1c40f6824aaa432fdfb9d2da82fc317aa Author: Ezgi Çiçek Date: Fri Aug 19 14:14:32 2022 +0000 Avoid unnecessary copy of pointeeSet in MemoryDAG::setWildcards (#83681) Test Plan: CI By making use of latest [copy constructor tags](https://fb.workplace.com/groups/638005567605797/permalink/731932211546465/), [this strobelight query](https://fburl.com/scuba/strobelight_services/5nizhawx) shows that this is used in several services such as `mast_hpc_job/customer_application` or `mui/mui_service_bi` Differential Revision: D38830797 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83681 Approved by: https://github.com/mikeiovine commit 93eedc51a5f8a6ba18bf4c87a26e1cc3b34cc177 Author: Richard Zou Date: Thu Aug 18 06:57:23 2022 -0700 [functorch] re-classify linalg.eigh in vmap testing (#83614) Similar to the previous PR, linalg.eigh doesn't give unique output. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/83614 Approved by: https://github.com/samdow commit 8788e92f0f3f23249161fdb91aafa4ecc7d4f131 Author: Peter Bell Date: Fri Aug 19 03:32:18 2022 +0100 Move `torch.linalg` opinfos to opinfo.definitions (2/2) (#83554) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83554 Approved by: https://github.com/albanD commit 8dbb0990bccb7b12f986f5cbc182c384041334ff Author: Peter Bell Date: Fri Aug 19 03:32:18 2022 +0100 Move `torch.linalg` opinfos to opinfo.definitions (1/2) (#83547) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83547 Approved by: https://github.com/albanD commit 4aeb98dee9756119f6a6414338e92f2b52c83346 Author: Peter Bell Date: Fri Aug 19 03:32:17 2022 +0100 Move RefInfo classes into opinfo.refs (#83563) Given that there is already a clear `op_db`, `python_ref_db` split I think it makes sense to have the `RefInfo` classes be defined in a different file. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83563 Approved by: https://github.com/albanD commit f4caeb25e94abdb40f8217a84dbe4d55a21f7d7a Author: Peter Bell Date: Fri Aug 19 03:32:17 2022 +0100 Move gradcheck_wrapper and clone_sample funcs into opinfo.core (#83560) The linalg OpInfos need these, so moving them into core to prevent circular dependencies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83560 Approved by: https://github.com/albanD commit ae68e455be3c264b2b3bcc61819dda5627f751a9 Author: Peter Bell Date: Fri Aug 19 03:32:16 2022 +0100 Enable formatting in all of testing/_internal/opinfo (#83559) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83559 Approved by: https://github.com/albanD commit b4bc0d249f782d3afc877e61189f7427f6d55968 Author: Kshiteej K Date: Fri Aug 19 11:59:31 2022 +0000 [composite compliance] batch_norm (#79990) Fixes https://github.com/pytorch/pytorch/issues/76283 Ref: #69991 Pull Request resolved: https://github.com/pytorch/pytorch/pull/79990 Approved by: https://github.com/zou3519 commit a6f777c80dd7fca5022ba102b86833089b4d3444 Author: Luca Wehrstedt Date: Fri Aug 19 11:52:18 2022 +0000 Ensure cuda_primary_ctx test is run on multigpu CI (#83252) Summary: It requires 2 GPUs, but it wasn't added to the list of tests running on multigpu jobs. Test Plan: Look at CI Differential Revision: D38616784 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83252 Approved by: https://github.com/malfet commit ca9919e3e81bb563422beb83e61b71cb8deca62c Author: PyTorch MergeBot Date: Fri Aug 19 10:26:40 2022 +0000 [vision hash update] update the pinned vision hash (#83729) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83729 Approved by: https://github.com/pytorchbot commit b8d647e1d56114514b98a6fdc8f5141784b8a016 Author: Catherine Lee Date: Fri Aug 19 06:18:40 2022 +0000 Revert "Manually shard slow-gradcheck CI job to prevent timeout #83354" (#83704) Now that https://github.com/pytorch/test-infra/pull/529 exists, we can undo the custom sharding from #83354 for slow grad check test plan: look at logs to see if it sharded + look at time to see that its evenly distributed Pull Request resolved: https://github.com/pytorch/pytorch/pull/83704 Approved by: https://github.com/huydhn commit 5bc85fccebebba3397fc42c72a68186efe80abe8 Author: Collin Schlager Date: Fri Aug 19 05:04:56 2022 +0000 Remove assertEqualIgnoreTypes from test_unary_ufuncs (#83711) Fix TODOs related to #38095 in test_unary_ufuncs.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/83711 Approved by: https://github.com/kit1980 commit 0ff929f4871a283b15672e23e23e267cab4f866b Author: Wonjoo Lee Date: Fri Aug 19 03:51:15 2022 +0000 Add lazy shape inference for take op (#82679) Add lazy shape inference for take op --- Companion PR on XLA's side: https://github.com/pytorch/xla/pull/3818 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82679 Approved by: https://github.com/JackCaoG commit 76d5699e13352930be89d61442b82230950a35cf Author: Ian Barber Date: Fri Aug 19 02:51:44 2022 +0000 Fix use-generator lint warnings in module.py (#83700) % pylint --disable=all --enable=R1729 torch/nn/modules/module.py Verified in pylint 2.14.5 -------------------------------------------------------------------- Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83700 Approved by: https://github.com/kit1980, https://github.com/albanD commit 61b2cde5270986476f47b58b984de80d02aac321 Author: PyTorch MergeBot Date: Fri Aug 19 02:27:03 2022 +0000 Revert "Enable formatting in all of testing/_internal/opinfo (#83559)" This reverts commit a7e619690936ed3b90e6f035ec078ed630e83e93. Reverted https://github.com/pytorch/pytorch/pull/83559 on behalf of https://github.com/peterbell10 due to Stack broke lint commit 107465af2cbbdc3d0ac6f89375bab0cd32eea5ff Author: PyTorch MergeBot Date: Fri Aug 19 02:23:32 2022 +0000 Revert "Move gradcheck_wrapper and clone_sample funcs into opinfo.core (#83560)" This reverts commit 5120263703b75bc3036cf2009d944e03e52eeb99. Reverted https://github.com/pytorch/pytorch/pull/83560 on behalf of https://github.com/peterbell10 due to Stack broke lint commit 0ddabe56ad7c3e955cfd3cfc8d5e2f48acdc13ac Author: PyTorch MergeBot Date: Fri Aug 19 02:21:40 2022 +0000 Revert "Move RefInfo classes into opinfo.refs (#83563)" This reverts commit 03ce36e3c139aa8aaf1e6184303dd6bf12d168f3. Reverted https://github.com/pytorch/pytorch/pull/83563 on behalf of https://github.com/peterbell10 due to Stack broke lint commit c8730d0a2fdb4b1b3b98e6e5431ca19b18eeaf52 Author: PyTorch MergeBot Date: Fri Aug 19 02:18:21 2022 +0000 Revert "Move `torch.linalg` opinfos to opinfo.definitions (1/2) (#83547)" This reverts commit bb86c31e2609304a81629351a107ebe810977606. 
Reverted https://github.com/pytorch/pytorch/pull/83547 on behalf of https://github.com/peterbell10 due to Stack broke lint commit 88e0165d085166ce13ef443991eea003ee86869e Author: vspenubarthi Date: Thu Aug 18 16:42:03 2022 -0700 [ao] Added Equalization QConfig generation to ModelReport class (#83698) Summary: This adds the capability to generate a QConfigMapping based on the suggestions of the ModelReport API for the user to use. The only dependency of this feature is that the calibration is run before the generation of the QConfigMapping and there is no dependency on the report generation other than that the observers cannot be removed before this is called. This maps module fqns to EqualizationQConfigs instead of regular QConfigs. Example Usage (after calibration): ``` quantization_mapping = mod_report.generate_qconfig_mapping() equalization_mapping = mod_report.generate_equalization_mapping() prepared_model = quantize_fx.prepare_fx(model, quantization_mapping, example_input, _equalization_config=equalization_mapping) quantized_model = quantize_fx.convert_fx(prepared_model) ``` This was tested by ensuring that the suggestions generated in the
QConfigMapping are: 1. Correct according to the set backend and data passed through 2. Able to be prepared and converted as a proper config (is a valid
config) The test for this is a part of the TestFxModelReportClass test suite. Test Plan: python test/test_quantization.py TestFxModelReportClass.test_equalization_mapping_generation Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/83698 Approved by: https://github.com/jerryzh168 commit 393137e13fb6a56c9803e24131c4cac0bb1f1e48 Author: PyTorch MergeBot Date: Fri Aug 19 02:14:42 2022 +0000 Revert "Move `torch.linalg` opinfos to opinfo.definitions (2/2) (#83554)" This reverts commit 1f2efdce1534bef50d47e7706e58a1c611b2d4a7. Reverted https://github.com/pytorch/pytorch/pull/83554 on behalf of https://github.com/peterbell10 due to Stack broke lint commit 05849eafb92def1c0071d5a7b0bb782360145cbb Author: Justin Chu Date: Thu Aug 18 16:28:16 2022 -0700 [ONNX] Create empty opset 17 symbolic file (#83287) The PR - Creates an empty symbolic file to house the new ops defined in ONNX 17 - Increments the max version to 17 and fixes the doc for version 16 - Enables tests for opset 17 - Updates the IR version in `export.cpp` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83287 Approved by: https://github.com/thiagocrepaldi, https://github.com/AllenTiTaiWang, https://github.com/BowenBao commit 1f2efdce1534bef50d47e7706e58a1c611b2d4a7 Author: Peter Bell Date: Thu Aug 18 13:04:13 2022 +0100 Move `torch.linalg` opinfos to opinfo.definitions (2/2) (#83554) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83554 Approved by: https://github.com/albanD commit bb86c31e2609304a81629351a107ebe810977606 Author: Peter Bell Date: Thu Aug 18 13:04:12 2022 +0100 Move `torch.linalg` opinfos to opinfo.definitions (1/2) (#83547) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83547 Approved by: https://github.com/albanD commit 03ce36e3c139aa8aaf1e6184303dd6bf12d168f3 Author: Peter Bell Date: Thu Aug 18 13:04:12 2022 +0100 Move RefInfo classes into opinfo.refs (#83563) Given that there is already a clear `op_db`, `python_ref_db` split I think it makes sense to have the `RefInfo` classes be defined in a different file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83563 Approved by: https://github.com/albanD commit 5120263703b75bc3036cf2009d944e03e52eeb99 Author: Peter Bell Date: Thu Aug 18 13:04:12 2022 +0100 Move gradcheck_wrapper and clone_sample funcs into opinfo.core (#83560) The linalg OpInfos need these, so moving them into core to prevent circular dependencies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83560 Approved by: https://github.com/albanD commit a7e619690936ed3b90e6f035ec078ed630e83e93 Author: Peter Bell Date: Thu Aug 18 13:04:11 2022 +0100 Enable formatting in all of testing/_internal/opinfo (#83559) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83559 Approved by: https://github.com/albanD commit 7aba6f8e7b67d08628e80f93d9146224163b1300 Author: chenlai Date: Thu Aug 18 10:37:33 2022 -0700 Rename flatbuffer_serializer to *_mobile or *_full_jit (#82827) The target named `flatbuffer_serializer` in fbcode has dependency from full jit and the one in xplat has dependency for mobile only. Rename them accordingly ``` flatbuffer_serializer in fbode -> flatbuffer_serializer_full_jit flatbuffer_serializer in xplat -> flatbuffer_serializer_mobile ``` so it's more readable. 
Differential Revision: [D38413369](https://our.internmc.facebook.com/intern/diff/D38413369/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38413369/)! Differential Revision: [D38413369](https://our.internmc.facebook.com/intern/diff/D38413369) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82827 Approved by: https://github.com/qihqi commit b02e620fa3e789645df099aedef29fb80a2068d5 Author: Scott Wolchok Date: Thu Aug 18 15:55:54 2022 -0700 [PyTorch] Bypass dispatch for narrow() calls within split_with_sizes (#83213) This can add up to a lot of dispatcher overhead if there are a lot of splits. split_with_sizes already has an autograd formula so this should Just Work? Differential Revision: [D38600576](https://our.internmc.facebook.com/intern/diff/D38600576/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83213 Approved by: https://github.com/albanD commit 784c47fbeea235823f29a6d035fe7eaea3f30680 Author: Jerry Zhang Date: Thu Aug 18 20:08:53 2022 +0000 [quant][fx][refactor] Move ObservationType to backend_config.py (#83368) Summary: Now we have a separate file to define BackendConfig related classes, we can move ObservationType to that file as well Test Plan: python test/test_quantization.py TestQuantizeFx python test/test_quantization.py TestQuantizeFxOps python test/test_quantization.py TestQuantizeFxModels Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/83368 Approved by: https://github.com/andrewor14 commit 82507ce334be9171729faecd8fc2ac9efa8c07e6 Author: Elias Ellison Date: Thu Aug 18 19:53:41 2022 +0000 Minifier fix for non tensor inputs (#83644) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83644 Approved by: https://github.com/Chillee commit 1f3ef5a2c800780e1f63daff5758f953ffe4dfc5 Author: Jeff Daily Date: Fri Aug 19 00:31:30 2022 +0000 [ROCm] unskip test_jit TestBackendsWithCompiler (#81281) Pull Request resolved: https://github.com/pytorch/pytorch/pull/81281 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet commit f094113ebf5b4e5281ab1a134220a1a985f03964 Author: Nikita Shulga Date: Thu Aug 18 23:52:43 2022 +0000 [MPS] Add native bitwise-not implementation (#83678) Follows the same pattern as bitwise binary ops Rename `BitwiseBinaryOps.mm` to `BitwiseOps.mm` Already tested in `test_mps.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83678 Approved by: https://github.com/albanD, https://github.com/kulinseth commit b14df5334d5910fc77c2658532c303fac0809236 Author: Peter Bell Date: Thu Aug 18 17:40:16 2022 +0100 CMake: List python source files as codegen dependencies (#83683) The pyi, selected_mobile_ops and nvfuser code generators were missing some dependencies outright. The autograd codegen had some effort to list out specific files that it depends on, but this has clearly fallen out of sync so it's safer to just depend on the entire folder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83683 Approved by: https://github.com/albanD commit 5e715be17ebb7f7fbfa232e951a14abf3b1a7a5a Author: vspenubarthi Date: Thu Aug 18 12:59:15 2022 -0700 [ao] Added Quantization QConfig generation to ModelReport class (#83688) Summary: This adds the capability to generate a QConfigMapping based on the suggestions of the ModelReport API for the user to use. 
The only dependency of this feature is that the callibration is run before the generation of the QConfigMapping and there is no dependency on the report generation other than that the observers cannot be removed before this is called. Example Usage (after callibration): ``` mapping = mod_report.generate_qconfig_mapping() prepared_model = quantize_fx.prepare_fx(model, mapping, example_input) quantized_model = quantize_fx.convert_fx(prepared) ``` This was tested by ensuring that the suggestions generated in the QConfigMapping are: 1. Correct according to the set backend and data passed through 2. Able to be prepared and converted as a proper config (is a valid config) The test for this is a part of the TestFxModelReportClass test suite. Test Plan: python test/test_quantization.py TestFxModelReportClass.test_qconfig_mapping_generation Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/83688 Approved by: https://github.com/jerryzh168 commit 72963bbae9b7f2a4f2e7c5fc84abdaa2f3552e73 Author: Milad Mohammadi Date: Thu Aug 18 22:53:18 2022 +0000 Update isDynamic api to align with is_symbolic API (#83415) Downstream #https://github.com/pytorch/xla/pull/3888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83415 Approved by: https://github.com/Krovatkin commit 04353f7837dccdc7c344dc0bdd82288957dcbc94 Author: Justin Chu Date: Thu Aug 18 22:51:57 2022 +0000 Check existence of the array ref when tracing resize_ (#81422) When `.resize_` takes an empty `torch.Size` or ints, tracing it would result in a `RuntimeError: _Map_base::at` (key not found in map). In https://github.com/pytorch/pytorch/blob/0d124fc6961f5b39f1a46722dab2d88f23686783/torch/csrc/jit/frontend/tracer.h#L126-L129 - This change updates `TraceType::resize_` to check the mapping first. - It also updates the warning message when tracing `resize_` to suggest using reshape or view. Repo: ```python import torch class M(torch.nn.Module): def forward(self, x, y): print(y.shape) x = x.resize_(y.shape) return x, y x = torch.tensor(1.2) y = torch.tensor(4.2) M()(x, y) torch.jit.trace(M(), (x, y)) ``` Related: https://github.com/pytorch/pytorch/issues/76486 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81422 Approved by: https://github.com/BowenBao, https://github.com/malfet commit 91521449445077c9ee977b18e2d0f19be4dd1c5b Author: Edward Z. Yang Date: Wed Aug 17 20:30:41 2022 -0700 Coverage for nondeterministic_seeded, respect it in constant prop (#83650) - nondeterministic_seeded was not applied to enough functions. I added some heuristics to codegen for identifying functions that are likely to be random and added a bunch of these tags to functions. Not sure I got all of them. - Don't constant propagate through nondeterministic functions in FX tracing. It would be better to do some testing for the tag but this would be quite an effort. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83650 Approved by: https://github.com/bdhirsh, https://github.com/eellison commit 24acc3155fae43ea9f2ab9c8e31d83f55dd7d7f1 Author: Edward Z. Yang Date: Wed Aug 17 20:30:13 2022 -0700 Be more conservative about propagating constants. (#83648) If a constant would turn into something large, don't keep it as a constant, just drop it. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83648 Approved by: https://github.com/eellison commit 02581f053bbb824b7d42b1df8655eff977865093 Author: Edward Z. 
Yang Date: Wed Aug 17 20:30:12 2022 -0700 Address CR comments for "Delete ProxyTensor wrapper subclass" (#83646) CR is on https://github.com/pytorch/pytorch/pull/83330 - Factor proxy slot getters/setters into helper functions - Use a weak map for storing proxies, so they go away when tracing is done - More documentation on SymDispatchMode Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83646 Approved by: https://github.com/Chillee commit a7baad04f6f29a97743e98d25c369b21aed18faf Author: Sherlock Huang Date: Thu Aug 18 17:54:52 2022 +0000 Preserve stack trace for backward nodes over AOTAutograd (#83558) For the following program. ``` def my_relu(a): return a.relu() def func(a, b): a = torch.nn.Linear(10, 10)(a) d = torch.square(b) d = my_relu(d) loss = d.sum() return loss with torchdynamo.optimize("aot_nop"): x = torch.rand(10, 10, requires_grad=True) y = torch.rand(10, 10, requires_grad=True) out = func(x, y) ``` It would generate the following fx graph with stack_trace populated in both forward and backward nodes. ``` def forward(self, primals, tangents): primals_1, primals_2, primals_3, primals_4, tangents_1, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec) t_default = torch.ops.aten.t.default(primals_3); primals_3 = None addmm_default = torch.ops.aten.addmm.default(primals_4, primals_1, t_default); primals_4 = primals_1 = t_default = None pow_tensor_scalar = torch.ops.aten.pow.Tensor_Scalar(primals_2, 2) relu_default = torch.ops.aten.relu.default(pow_tensor_scalar); pow_tensor_scalar = None detach_default = torch.ops.aten.detach.default(relu_default) sum_default = torch.ops.aten.sum.default(relu_default); relu_default = None is_same_size_default = torch.ops.aten.is_same_size.default(sum_default, tangents_1) expand_default = torch.ops.aten.expand.default(tangents_1, [10, 10]); tangents_1 = None detach_default_1 = torch.ops.aten.detach.default(detach_default); detach_default = None threshold_backward_default = torch.ops.aten.threshold_backward.default(expand_default, detach_default_1, 0); expand_default = detach_default_1 = None pow_tensor_scalar_1 = torch.ops.aten.pow.Tensor_Scalar(primals_2, 1.0); primals_2 = None mul_scalar = torch.ops.aten.mul.Scalar(pow_tensor_scalar_1, 2.0); pow_tensor_scalar_1 = None mul_tensor = torch.ops.aten.mul.Tensor(threshold_backward_default, mul_scalar); threshold_backward_default = mul_scalar = None return pytree.tree_unflatten([sum_default, None, mul_tensor, None, None], self._out_spec) ====== joint graph ======= primals_1 None primals_2 None primals_3 None primals_4 None tangents_1 None t_default File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 12, in func def func(a, b): File "/fsx/users/bahuang/repos/pytorch_fsx/torch/nn/modules/linear.py", line 114, in forward return F.linear(input, self.weight, self.bias) addmm_default File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 12, in func def func(a, b): File "/fsx/users/bahuang/repos/pytorch_fsx/torch/nn/modules/linear.py", line 114, in forward return F.linear(input, self.weight, self.bias) pow_tensor_scalar File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 14, in func d = torch.square(b) relu_default File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 15, in func d = my_relu(d) File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 10, in my_relu return a.relu() detach_default File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 15, in func d = my_relu(d) File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", 
line 10, in my_relu return a.relu() sum_default is_same_size_default expand_default detach_default_1 File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 15, in func d = my_relu(d) File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 10, in my_relu return a.relu() threshold_backward_default File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 15, in func d = my_relu(d) File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 10, in my_relu return a.relu() pow_tensor_scalar_1 File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 14, in func d = torch.square(b) mul_scalar File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 14, in func d = torch.square(b) mul_tensor File "/fsx/users/bahuang/repos/pytorch_fsx/test.py", line 14, in func d = torch.square(b) output None ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83558 Approved by: https://github.com/albanD commit e2e71c1f4c924c5e9e02b25eb66296a697f4b3e7 Author: samdow Date: Thu Aug 18 12:28:41 2022 -0400 [functorch] add linalg solve batch rule (#82814) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82814 Approved by: https://github.com/zou3519 commit ff533b1efa26ed0dc5e3caa332de05f53963e360 Author: Nikita Shulga Date: Thu Aug 18 21:59:15 2022 +0000 [MPS] Fix torch.full for uint8 (#83697) By creating uint32 tensor and then downcasting it to uint8 Workaround https://github.com/pytorch/pytorch/issues/83692 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83697 Approved by: https://github.com/albanD commit 88d3acd6b194cdd7e06d7b4e8e7d5aed7294adb2 Author: Mario Lezcano Date: Thu Aug 18 05:53:02 2022 -0500 Fix and improve the efficiency of the backward of xlog* functions. (#82713) That is `xlogy`, `special.xlogy`, `special.xlog1py`. Fixes https://github.com/pytorch/pytorch/issues/80770 Fixes https://github.com/pytorch/pytorch/issues/74279 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82713 Approved by: https://github.com/albanD commit 9e1560821eab91c00d03e4e257c6973c34735ba1 Author: Fabio Rocha Date: Thu Aug 18 12:31:21 2022 -0500 [primTorch] Refs for pdist, triu and related ops (#82819) This PR adds refs for the following ops: - `torch.triu` - `torch.tril` - `torch.triu_indices` - `torch.tril_indices` - `torch.nn.functional.pairwise_distance` - `torch.nn.functional.pdist` It adds OpInfos for - `torch.triu_indices` - `torch.tril_indices` Note that these were already tested in `test/test_tensor_creation_ops.py` but for the ref tests we need the OpInfos. Finally, it improves documentation for PairwiseDistance and adds a missing import to `torch/testing/_internal/opinfo/core.py`. This started with an attempt to just add the `nn.functional` refs above, but it turned out that `pdist` was easiest to implement using `triu_indices` so I added that one and the related functions as well. ~~In the end, I changed the `pdist` implementation to not use `triu_indices` but kept the other refs.~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/82819 Approved by: https://github.com/ngimel commit 38348362608a47371c65d7fd52db138b4c6a5d65 Author: albanD Date: Thu Aug 18 20:54:44 2022 +0000 [DataLoader] Move loop content into a function to ensure we don't preserve anything (#83595) Can lead to CPU memory saving as we don't hold onto the pin memory buffer as long as we used to. 
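A hedged sketch of the refactor idea behind #83595: move the loop body into its own function so per-iteration locals are dropped when the call returns. The names below are hypothetical, not the actual `torch.utils.data` internals:

```python
# Pattern sketch: per-iteration locals (the fetched item / pinned buffer) die
# with the call frame instead of lingering in the loop until overwritten.
import queue

def _process_one(in_queue, out_queue):
    try:
        item = in_queue.get(timeout=5.0)
    except queue.Empty:
        return
    # pin_memory(item) would happen here; `item` is released on return.
    out_queue.put(item)

def _pin_memory_loop(in_queue, out_queue, done_event):
    while not done_event.is_set():
        _process_one(in_queue, out_queue)
```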
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83595 Approved by: https://github.com/ejguan, https://github.com/NivekT commit 23d22724739b7a27249cde1354150a43e278ed10 Author: Howard Huang Date: Thu Aug 18 15:46:09 2022 +0000 Add remaining device types in the pybinded DeviceType enum (#83676) Small change to update pybinded definition to match https://github.com/pytorch/pytorch/blob/master/c10/core/DeviceType.h#L32-L58 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83676 Approved by: https://github.com/albanD commit 46ba9f2e52a43a35d23b1ce8f00f9d2614c24204 Author: PyTorch MergeBot Date: Thu Aug 18 20:28:20 2022 +0000 Revert "Remove conj kernels for real dtypes (#80374)" This reverts commit ad44079952b945262808af8fa841994f736c1fe2. Reverted https://github.com/pytorch/pytorch/pull/80374 on behalf of https://github.com/atalman due to Breaks internal build UnaryOpsKernel.cpp:208:5: error: unused type alias 'scalar_t' [-Werror,-Wunused-local-typedef] commit d11d3dd036b4a7098ab3b4d333fcb556b97b4860 Author: Rodrigo Kumpera Date: Thu Aug 18 19:40:15 2022 +0000 [dist.cp] Introduce LoadPlanner and SavePlanner extensibility API. (#83419) The planners come with default implementations in default_planner.py. The default planners expose their core functionality as separate functions to make it easy for other checkpoint implementations to use this functionality. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83419 Approved by: https://github.com/wanchaol commit 4a033be4482441d3f61a887502d75356a90e6a6a Author: Richard Zou Date: Thu Aug 18 06:57:22 2022 -0700 [functorch] reclassify svd as an allowed failure; add test (#83612) svd when done on a batch of inputs vs the input in a for-loop may return different results because svd isn't unique. So, instead of checking that the output of vmap and the output of a for-loop are the same, we check that matrix-multiplying the decomposed tensors results in the same tensor when doing it under vmap vs under a for-loop. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/83612 Approved by: https://github.com/samdow commit 601aca2a2dedfe7b46f2815649682924a31d50ce Author: Richard Zou Date: Thu Aug 18 06:57:22 2022 -0700 [functorch] add some vmap+jvp inplace+view tests (#83178) No problems here, just more tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83178 Approved by: https://github.com/samdow commit d84dc589c24bb28147aaafddb67dff4fac6ac9a7 Author: Richard Zou Date: Thu Aug 18 06:57:21 2022 -0700 [functorch] relax as_strided batching rule (#83597) Previously there was a constraint that the bdim is required to be at the front. As I noted in the comment in the code that I wrote years ago, this is not necessary for correctness, we were just guarding against potentially incorrect behavior and assumed most people would not vmap over dimensions other than 0. Now, the above assumption did not age very well, because we have batch rules that return a BatchedTensor where the bdim is something other than 0 (e.g. convolution batch rule). This PR deletes the check for that assumption and adds additional manual tests that the as_strided batching rule works when one vmaps over a dimension other than 0. Automatic tests don't exist because it's a bit hard to get the test_vmap_exhaustive test runner to replicate the strides of the inputs faithfully. 
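As a generic illustration (not taken from the PR) of vmapping over a dimension other than 0, which is the case the relaxed as_strided batching rule is meant to cover:

```python
# vmap over dim 1 instead of the default dim 0; the mapped function sees
# slices of shape (3,), and the batch dimension is restored at dim 0.
import torch
from functorch import vmap

x = torch.randn(3, 5)
out = vmap(lambda col: col * 2.0, in_dims=1)(x)
print(out.shape)  # torch.Size([5, 3])
```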
Test Plan: - wait for tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/83597 Approved by: https://github.com/samdow commit 69728d7dd9b6a05a503e25759ed589754741ff01 Author: Richard Zou Date: Thu Aug 18 06:57:21 2022 -0700 [functorch] annotate test_jvpvjp (#83530) Most of these are just "forward-mode Ad formula not implemented" Pull Request resolved: https://github.com/pytorch/pytorch/pull/83530 Approved by: https://github.com/samdow commit e4f74f0891da9e49c5c82df05794f7723b05cbac Author: Justin Chu Date: Tue Aug 16 16:33:37 2022 +0000 [ONNX] Update the default opset to version 14 (#83284) Update the default opset by running the `update_default_opset_version.py` script. The update is done in a regularly to ensure we are in sync with the onnx updates. All changes are produced by the script. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83284 Approved by: https://github.com/AllenTiTaiWang, https://github.com/malfet, https://github.com/BowenBao commit f204afc2bbd7bec6826b49403978ed7f93ccc9f3 Author: Olga Andreeva Date: Thu Aug 18 18:41:14 2022 +0000 Added communication hook for sharded cases (#83254) Fixes https://github.com/pytorch/pytorch/issues/79114 An implementation of a FSDP communication hook interface for a sharded strategies: - Added `reduce_scatter_hook` to default hooks. Note the difference of `reduce_scatter` from `all_reduce`, it requires 2 tensors:`input_gradient` and `output` variables and stores result in `output`, which is further used as a summed gradient shard. - Adjusted FSDP logic to return `reduce_scatter_hook` as a default communication hook for sharded strategies, `DefaultState` is the same for sharded and non-sharded strategies. - Adjusted low-precision hooks to work with both `all_reduce` and `reduce_scatter` depending on whether `output` tensor is provided or not. Test plan: Added all existing sharded strategies as an input parameters to existing tests. For`test_default_communication_hook_behaviour` double checked how a linear layer is sharded across workers. This test creates a simple net ``1 X N``, where ``N`` - is the number of workers. For sharded cases, ``N`` parameters are sharded across ``N`` workers. This test checks that after backward, each worker has a proper value in it's chunk of the gradient, or the whole gradient on every worker is equal to an expected value. Checked that low-precision tests work for sharded cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83254 Approved by: https://github.com/rohan-varma, https://github.com/awgu commit 78c8a0d75220bdd4955415b5f81509e005af4232 Author: zaf Date: Thu Aug 18 03:59:30 2022 -0700 [quant][ao_migration] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` (#78712) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. 
The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] [Current PR] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [ ] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - [Documentation](docs/source/quantization-support.rst) @vkuzo - [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10 Differential Revision: [D36792967](https://our.internmc.facebook.com/intern/diff/D36792967/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36792967/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/78712 Approved by: https://github.com/jerryzh168 commit 3e1fc85b23f9f12ff2ba5be645841bde90dba14e Author: Chien-Chin Huang Date: Wed Aug 17 22:46:54 2022 -0700 [FSDP] Implement sharded_optim_state_dict and flatten_sharded_optim_state_dict. (#77628) As title Differential Revision: [D36436496](https://our.internmc.facebook.com/intern/diff/D36436496/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/77628 Approved by: https://github.com/awgu commit cd0ab154b5662f5dae36456971db5bc574d6cbe1 Author: JackCaoG Date: Thu Aug 18 16:36:54 2022 +0000 Handle python frame is empty in GetPythonFrames (#83643) Fixes https://github.com/pytorch/xla/issues/3900 and https://github.com/pytorch/xla/issues/3795 for pytorch/xla when `XLA_IR_DEBUG=1` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83643 Approved by: https://github.com/Krovatkin commit abcf01196cd27805349aa892db847f9a61f52c0e Author: Clive Chan Date: Thu Aug 18 15:24:18 2022 +0000 Release the GIL when munmap'ing tensors - fixes #77139 (#83623) Fixes #77139, where deallocating large tensors with munmap takes a significant amount of time while holding the GIL. This causes the pin_memory thread to interfere with the main thread = performance sadness. Thanks @igozali @zhengwy888 @colesbury as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83623 Approved by: https://github.com/albanD commit f84e087d5e6f458f69274e2ace127af7d4fa8d82 Author: PyTorch MergeBot Date: Thu Aug 18 14:00:42 2022 +0000 Revert "fixing define_constant pybind signature to match std::complex scalar (#83645)" This reverts commit 278c726458c1febdde7420734477bf8b552c0243. 
Reverted https://github.com/pytorch/pytorch/pull/83645 on behalf of https://github.com/albanD due to broke master test commit 3f612b58be58fe38eb57cad7cbca545887ce7759 Author: mikey dagitses Date: Thu Aug 18 13:03:00 2022 +0000 fix quantization/core/test_docs for Buck2 (#83341) Summary: We extract the test to its own target, fixing the relative path to the quantization docs. This allows us to find the docs with a more simple implementation. Test Plan: Tested locally with buck1 and buck2. Differential Revision: D38662169 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83341 Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/ZainRizvi commit aad89bb77176a755cf7f916b4cb16bc4a021d1bb Author: Mario Lezcano Date: Thu Aug 18 05:17:32 2022 -0500 Make the derivative of masked_fill more efficient (#83515) There's no need to add all the zeros if we extract all the non-zero elements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83515 Approved by: https://github.com/albanD, https://github.com/soulitzer commit 4b3f1bdb0cb7213ae5ac4f3e3d187648c7720175 Author: PyTorch MergeBot Date: Thu Aug 18 10:35:00 2022 +0000 [vision hash update] update the pinned vision hash (#83582) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83582 Approved by: https://github.com/pytorchbot commit eb6004146aba1e371a3c169f11e76390fd74a13e Author: PyTorch MergeBot Date: Thu Aug 18 10:23:01 2022 +0000 [xla hash update] update the pinned xla hash (#83581) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83581 Approved by: https://github.com/pytorchbot commit ce7177f88a8c76351087bd06520681e60591ff50 Author: Kulin Seth Date: Thu Aug 18 06:03:16 2022 +0000 [MPS] Register index.Tensor_out (#82507) * Add more tests from test_indexing into test_mps * Cache the indexing library on the MPSDevice Pull Request resolved: https://github.com/pytorch/pytorch/pull/82507 Approved by: https://github.com/malfet commit 6dc8673b1bb7a0f22a2453049751089943cc1f3b Author: yanbing-j Date: Thu Aug 18 05:08:12 2022 +0000 Update ideep for NNC post-op (#82705) This PR is to add NNC post-op fusion support in ideep for further NNC development. 
It includes: - element wise post op fusion - conv/matmal/linear + binary post op fusion **Common configuration:** - Jemalloc and iomp enabled - BS=1 - num_warmup = 300 - num_run = 500 - Average time of 1 iteration in ms is used - time_before: no fusion - time_after: with fusion - Eltwise OPs selected: hardswish and abs - Using oneDNN v2.6 **On ICX (32 cores per socket): Conv2d FP32 (in channels Last format)**   | shape | time_(ms)_before | time_(ms)_after | Gain -- | -- | -- | -- | -- 1socket | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.112174 | 0.071106 | 36.61% 1socket | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.11269 | 0.070586 | 37.36% 1socket | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.164219 | 0.129498 | 21.14% 1socket | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.169371 | 0.1277 | 24.60%   |   |   |   |     | shape | time_(ms)_before | time_(ms)_after | Gain 1thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.994555 | 1.429813 | 28.31% 1thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.715168 | 1.459937 | 14.88% 1thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 2.997382 | 2.47915 | 17.29% 1thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 3.044476 | 2.499366 | 17.90%   |   |   |   |     | shape | time_(ms)_before | time_(ms)_after | Gain 4thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.405204 | 0.38117 | 5.93% 4thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.410145 | 0.389279 | 5.09% 4thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.67917 | 0.662792 | 2.41% 4thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.682302 | 0.671226 | 1.62% **On CPX (28 cores per socket): Conv2d BF16 (in channels Last format)**   | shape | time_(ms)_before | time_(ms)_after | Gain -- | -- | -- | -- | -- 1socket | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.119289 | 0.091015 | 23.70% 1socket | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.144116 | 0.09339 | 35.20% 1socket | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.209975 | 0.177111 | 15.65% 1socket | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.234777 | 0.179945 | 23.36%   |   |   |   |     | shape | time_(ms)_before | time_(ms)_after | Gain 1thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.296252 | 1.086423 | 16.19% 1thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.364738 | 1.131289 | 17.11% 1thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 3.99519 | 3.736147 | 6.48% 1thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 4.03415 | 3.77981 | 6.30%   |   |   |   |     | shape | time_(ms)_before | time_(ms)_after | Gain 4thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.27474 | 0.245281 | 10.72% 4thread | 
Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.28595 | 0.254748 | 10.91% 4thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.847318 | 0.791453 | 6.59% 4thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.870212 | 0.801594 | 7.89% **On CPX (28 cores per socket): Linear BF16**   | shape | time_(ms)_before | time_(ms)_after | Gain -- | -- | -- | -- | -- 1socket | Linear+abs_N=1_iC=1024_oC=4096 | 0.043199 | 0.037603 | 12.95% 1socket | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.041845 | 0.038332 | 8.40% 1socket | Linear+abs_N=1_iC=4096_oC=1024 | 0.048282 | 0.044281 | 8.29% 1socket | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.048362 | 0.044106 | 8.80% 1socket | Linear+abs_N=1_iC=2048_oC=1000 | 0.036302 | 0.0344 | 5.24% 1socket | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.035734 | 0.035593 | 0.39%   |   |   |   |     | shape | time_(ms)_before | time_(ms)_after | Gain 1thread | Linear+abs_N=1_iC=1024_oC=4096 | 0.365143 | 0.36279 | 0.64% 1thread | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.364464 | 0.363392 | 0.29% 1thread | Linear+abs_N=1_iC=4096_oC=1024 | 0.384498 | 0.379902 | 1.20% 1thread | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.382545 | 0.381252 | 0.34% 1thread | Linear+abs_N=1_iC=2048_oC=1000 | 0.213244 | 0.209999 | 1.52% 1thread | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.212003 | 0.208567 | 1.62%   |   |   |   |     | shape | time_(ms)_before | time_(ms)_after | Gain 4thread | Linear+abs_N=1_iC=1024_oC=4096 | 0.126096 | 0.12157 | 3.59% 4thread | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.126627 | 0.121662 | 3.92% 4thread | Linear+abs_N=1_iC=4096_oC=1024 | 0.132845 | 0.128921 | 2.95% 4thread | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.132642 | 0.12783 | 3.63% 4thread | Linear+abs_N=1_iC=2048_oC=1000 | 0.079582 | 0.072584 | 8.79% 4thread | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.077761 | 0.071981 | 7.43% Pull Request resolved: https://github.com/pytorch/pytorch/pull/82705 Approved by: https://github.com/frank-wei, https://github.com/eellison commit 278c726458c1febdde7420734477bf8b552c0243 Author: jjsjann123 Date: Thu Aug 18 04:52:31 2022 +0000 fixing define_constant pybind signature to match std::complex scalar (#83645) Fixes #83576 Previously complex scalar is defined as boolean and generating wrong result. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83645 Approved by: https://github.com/ezyang, https://github.com/kevinstephano commit badbdb033038d84c46550c4ddc8eab64257c5143 Author: Mengwei Liu Date: Thu Aug 18 04:47:13 2022 +0000 [torchgen] Relax the restriction on number of custom namespaces (#83580) Summary: We started to see use cases where it involves more than 1 custom namespace to live within the same yaml file. Hence relaxing the restriction that 1 yaml file can only have 1 custom namespace other than `aten`. Updated unit test as well. Differential Revision: D38775685 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83580 Approved by: https://github.com/JacobSzwejbka commit 7263450c309443a8fd3f8ab29fbc04c35692e58f Author: PyTorch MergeBot Date: Thu Aug 18 02:58:15 2022 +0000 Revert "[primTorch] Add ref for `new_empty_strided` (#82466)" This reverts commit e154f5ae3b91fd462faa2120f8940811a47096de. 
Reverted https://github.com/pytorch/pytorch/pull/82466 on behalf of https://github.com/ezyang due to broke trunk only nnc tests commit d6a30e213e2355e8ad553c02d205391c889a0254 Author: Aashaka Shah Date: Thu Aug 18 02:16:24 2022 +0000 Enable pg_nccl.reduce_scatter to perform vector ReduceScatter for uneven input splits (#82924) Summary: A vector reduce_scatter requires each process to reduce and scatter an input tensor according to the input list provided. Internally, pg_nccl.reduce_scatter will coalesce a list of pg_nccl._reduce_oop to implement a vector reduce-scatter in the case when the any input shape is different in the input list. Otherwise, it will perform a ncclReduceScatter as usual. - This change adds a `CoalescedWorkNCCL` class which encapsulates the WorkNCCL requests from coalesced operations. A `.wait()` on a CoalescedWorkNCCL request will call a wait on each of the WorkNCCL requests that are coalesced. - This change adds an out-of-place `_reduce_oop` function to ProcessGroupNCCL. It allows reducing an input tensor and placing the output in a separate output tensor. Since reduce_scatter provides an out-of-place API, a reduce_scatter_v semantic implemented inside `pg_nccl.reduce_scatter` also needs to support out-of-place, for which an out-of-place reduce is required to be added. Test Plan: Added a new test `test_reduce_scatter_v_cuda` for reduce_scatter_v to `distributed_nccl_spawn`. Differential Revision: D38478781 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82924 Approved by: https://github.com/kwen2501 commit 52be908225a2019da8ff7a2dc52e28ce2b13e69a Author: Edward Z. Yang Date: Wed Aug 17 10:20:15 2022 -0700 Delete unnecessary sum.SymInt overload (#83591) Dims argument only ever takes dimensions, which we do not need to SymInt-ify. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83591 Approved by: https://github.com/albanD commit 6679d238fd6f85397559977920b5202390f8e4f1 Author: Edward Z. Yang Date: Wed Aug 17 10:20:14 2022 -0700 SymInt'ify schemas for prims (#83528) I audited these looking for places where ints were accepted for sizes and turned them into SymInts. Dimensions and miscellaneous ints were not modified. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83528 Approved by: https://github.com/ngimel commit 817a82704ff140cec001fab942437b96d901da42 Author: Edward Z. Yang Date: Tue Aug 16 13:37:29 2022 -0700 Delete ProxyTensor wrapper subclass (#83330) I was working on https://github.com/pytorch/torchdynamo/issues/80 and my working hypothesis for what was causing the error was that proxy tensor was not advertising correct dispatch keys, causing AMP to operate differently when you traced. I could have fixed this directly by replicating fake tensor's fix for setting dispatch keys to also apply to proxy tensor, but I was like, "Why must I repeat myself." This PR is the result. It completely deletes the ProxyTensor wrapper subclass, so that when we are tracing, the tensors flowing through the program are the *original* real or fake tensors, depending on what the user requested in the top-level API. There is no more wrapping. To store the Proxy objects necessary for actually doing tracing, I store the property directly on the tensors. 
(Note: I never clean up old entries from the map at the moment, this is easily fixed by using a weak map) Benefits of doing this: * No more tip-toeing around no_dispatch() creation of new ProxyTensors; we never create new tensors (except when we call the underlying func), so you don't have to worry about accidentally tracing them. * No more syncing up metadata from in place operators. In particular https://github.com/pytorch/pytorch/issues/81526 is mooted * This fixes https://github.com/pytorch/torchdynamo/issues/519 as we no longer need to teach proxy tensor to support sparse tensor. * No more schlepping symbolic integers from the inner fake tensor to the outer proxy tensor. If you can make a fake tensor with symbolic ints, you're done, nothing else to do. To avoid having to rewrite all of the guts, when I get to the actual proxy tensor handler, I first "fetch" the stored ProxyTensor data from the weakmap via a tree_map, and then operate on the consequent data as before. A more optimized implementation is possible. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83330 Approved by: https://github.com/Chillee commit 0a48cdfb3bad14a62d9386ae0d1499bec74b63a6 Author: Horace He Date: Wed Aug 17 22:54:09 2022 +0000 re-enable aotautograd tests (#83485) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83485 Approved by: https://github.com/zou3519 commit e154f5ae3b91fd462faa2120f8940811a47096de Author: Nikita Karetnikov Date: Thu Aug 18 01:35:11 2022 +0000 [primTorch] Add ref for `new_empty_strided` (#82466) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82466 Approved by: https://github.com/ezyang commit b3c99bef0cab618cb6fedf7832004011172f9a34 Author: Yifan Shen Date: Thu Aug 18 00:49:29 2022 +0000 Support nested dropout autograd (#83338) When the initial version came out, `NestedTensor` was not included in the `CompositeImplicitAutograd` key set, so we had to register dropout_nested to dropout and make it forward-only. Now is the time to improve it! This pr removes dropout_nested; instead native_dropout_nested is implemented along with native_dropout_backward_nested. Side change: remove dropout__nested since @cpuhrsch suggested to leave out nested in-place ops for now Pull Request resolved: https://github.com/pytorch/pytorch/pull/83338 Approved by: https://github.com/jbschlosser commit 451c6296af2deb5159848f3b579d201b4903c608 Author: Jay Chae Date: Wed Aug 17 22:31:49 2022 +0000 [kineto] deprecate USE_KINETO_UPDATED (#83305) Summary: This is used to do cross repo updates but has not been cleaned up properly Test Plan: CI Reviewed By: aaronenyeshi Differential Revision: D38633379 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83305 Approved by: https://github.com/aaronenyeshi commit 79534b7f259e23aa5f819eb13173567e99183bbf Author: Nikolay Korovaiko Date: Wed Aug 17 22:14:12 2022 +0000 Adding XLA folks to reviewer/approvers (#83555) XLA folks will be doing a lot of smaller changes to the Lazy component that they can review themselves w/o either @wconstab or myself. They need approval permissions for the Lazy component. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83555 Approved by: https://github.com/wconstab, https://github.com/JackCaoG commit cf2c94e6de0e50edb3c9f89bb602055ee6d11011 Author: Michael Gschwind Date: Wed Aug 17 21:57:39 2022 +0000 NestedTensor Softmax (#83435) Summary: Simple mask compute and softmax Test Plan: unit test Differential Revision: D38711915 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83435 Approved by: https://github.com/erichan1, https://github.com/huydhn commit 71141c30232d436ad61d2931af8808e4451f8cde Author: migeedz Date: Wed Aug 17 11:29:53 2022 -0700 extend torch.ones to handle tuple inputs (#83194) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83194 Approved by: https://github.com/jansel commit 7536ac7125ea50f23be5236aed387bd09215f939 Author: migeedz Date: Wed Aug 17 11:29:52 2022 -0700 prevent graph mutation in constraint generation (#83109) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83109 Approved by: https://github.com/jansel commit ea2183f0eaa6209b8f31ea01627c1ee34654a2b7 Author: Daniel Recoskie Date: Wed Aug 17 11:17:37 2022 -0700 removed duplicate_quantize_dynamic_node (#83459) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83459 Approved by: https://github.com/jerryzh168 commit cf5330977d10a3585358fb02c049939bf1401074 Author: Nikita Shulga Date: Wed Aug 17 20:54:06 2022 +0000 [CI] Move torch-deploy to cuda-11.6 (#83572) As we are slowly deprecating 11.3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83572 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi commit af8e34cca9d40f16cbfcd773750d791e83d4a39f Author: ssjia Date: Wed Aug 17 08:52:33 2022 -0700 [vulkan] Do not populate unpacked args of PackedContexts when deserializing (#83587) Vulkan ops that use `PackedContext` objects currently maintain two lists storing the parameters of the op: 1. `unpacked_` which stores the original arguments passed in to the op 2. `packed_` which stores pre-processed arguments which are used for inference. The `unpacked_` list is only needed for serialization - during inference, where it is not expected that the model will be saved, then there is no point keeping the `unpacked_` list in memory. This diff introduces a flag `fill_unpacked`, by default set to `true`, that is passed into the `*PackedContext()` constructors. `unpacked_` is populated only if `fill_unpacked = true`. The `create_*_context()` functions will call the constructor with `fill_unpacked = true`, which ensures that `unpacked_` is populated for serialization. However, when loading a model, the `*PackedContext` objects are deserialized by calling `*PackedContext::pack()`, which will call the constructor with `fill_unpacked = false` - the original tensors will therefore be discarded after packing, saving a significant amount of CPU memory during model inference. 
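The trade-off above (keep the original arguments only if the context may be serialized again) is easy to illustrate; the Python-flavored sketch below uses made-up names — the real implementation is C++ in the Vulkan backend.

```python
import torch

class LinearPackedContext:
    # Illustrative only: "packed" is a stand-in for the GPU-friendly re-layout that
    # inference needs; "unpacked" keeps the original tensors solely for serialization.
    def __init__(self, weight, bias, fill_unpacked=True):
        self.packed = [weight.t().contiguous(), bias]
        self.unpacked = [weight, bias] if fill_unpacked else None

# Deserialization paths can skip the unpacked copies and save that CPU memory.
ctx = LinearPackedContext(torch.randn(8, 4), torch.randn(8), fill_unpacked=False)
assert ctx.unpacked is None
```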
Differential Revision: [D38761645](https://our.internmc.facebook.com/intern/diff/D38761645/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83587 Approved by: https://github.com/kimishpatel commit cf52680d406f96304420fe070b362905089d0268 Author: Nikita Karetnikov Date: Wed Aug 17 19:23:12 2022 +0200 [primTorch] Add OpInfo and ref for eye (#82323) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82323 Approved by: https://github.com/ezyang commit 1a49eea30102b9d083367f8f088f60381576a54c Author: Nikita Karetnikov Date: Wed Aug 17 14:26:09 2022 +0200 [primTorch] Add ref for diag_embed (#82322) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82322 Approved by: https://github.com/Lezcano, https://github.com/ngimel commit ea037344e81d5645ad2e8863a9b6ccdb33f60320 Author: Edward Z. Yang Date: Wed Aug 17 10:20:13 2022 -0700 Reset compile cache to fix flaky test (#83608) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83608 Approved by: https://github.com/seemethere, https://github.com/malfet commit ad44079952b945262808af8fa841994f736c1fe2 Author: Peter Bell Date: Wed Aug 17 17:38:07 2022 +0100 Remove conj kernels for real dtypes (#80374) `conj_physical_stub` is currently implemented for all dtypes despite it just being a plain copy for real dtypes. So, instead we should defer to the existing copy kernel in these cases. On my build for one CUDA architecture, I see a 2.2 MB decrease in `libtorch_cuda.so` size. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80374 Approved by: https://github.com/ngimel commit 652fb0335513026632cf14e78a095aa485fcc81d Author: John Clow Date: Tue Aug 16 10:10:18 2022 -0700 Symbolic Shape Analaysis: Add Generalized List of Tensor Shape Support (#78679) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78679 Approved by: https://github.com/davidberard98 commit b1e02ae8fc85883dc1390add7e8b2ae1cc611c4c Author: Peter Bell Date: Wed Aug 17 16:58:29 2022 +0100 Move PythonRefInfos for `torch.fft` into opinfo.definitions (#83277) Ref #82518 The moves `python_ref_db` entries for `torch.fft` into `opinfo/definitions/fft.py`. I ran into a problem with `_find_referenced_opinfo` since it's called at init time for the module, yet relies on the completed op_db list. This PR fixes the circular dependency by explicitly passing in an op_db argument which can point to only the locally defined `op_db`. An alternative solution would be to have a different folder for the `op_db` and the `python_ref_db` definitions. However that would mean losing the convenience of having closely related opinfos be in the same file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83277 Approved by: https://github.com/albanD commit 5f50289b399ef8f24025832aaf12d143279ed5c0 Author: Peter Bell Date: Wed Aug 17 16:58:28 2022 +0100 Move OpInfos for torch.fft into `opinfo.definitions` (#83276) Ref #82518 This moves the `op_db` entries into `opinfo/definitions/fft.py` and also appends them to `common_methods_invocations.op_db` so existing users are unaffected by this change. 
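A minimal sketch of the re-export arrangement described above (the module paths come from the commit message; the exact symbols are assumptions):

```python
# torch/testing/_internal/opinfo/definitions/fft.py  (sketch)
op_db = [
    # OpInfo entries for torch.fft.* live here now, e.g. SpectralFuncInfo("fft.fft", ...)
]

# torch/testing/_internal/common_methods_invocations.py  (sketch)
# The aggregate list keeps its public name, so existing
# `from torch.testing._internal.common_methods_invocations import op_db` imports still work.
from torch.testing._internal.opinfo.definitions import fft as fft_opinfos

op_db = [
    # ... OpInfos still defined directly in this file ...
]
op_db.extend(fft_opinfos.op_db)
```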
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83276 Approved by: https://github.com/albanD commit 85ef1a1cd104033143cfa9a3f19fc3ab326d737a Author: Fabio Rocha Date: Wed Aug 17 10:45:14 2022 -0500 [primTorch] added ref for nn.functional.glu (#82214) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82214 Approved by: https://github.com/ngimel commit bd0ad7a84f125435f9e0685f86b1ca2efd2bd43b Author: Mikayla Gawarecki Date: Tue Aug 16 22:08:13 2022 +0000 Add backward support for rudimentary NestedTensor.sum(dim) (#82625) Per offline discussion, this will be updated to use expand once expand semantics for nested tensor have been fleshed out. Next steps will be to add support for other features for forward sum mentioned on #82387 and likewise update the backward Pull Request resolved: https://github.com/pytorch/pytorch/pull/82625 Approved by: https://github.com/albanD commit 68d2d7866daf766c3ff1b2b450d0a2e4d50e9908 Author: Max Podkorytov Date: Wed Aug 17 18:10:36 2022 +0000 [static-runtime] change the backend for permute_copy (#83532) Summary: Testing wrappable dims Differential Revision: D38717563 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83532 Approved by: https://github.com/mikeiovine commit 30af17cea7e983b9e60354e05f7dcc2688183073 Author: John Clow Date: Tue Aug 16 10:18:47 2022 -0700 [HIP] Add extra exception handling for non-ROCM builds (#83009) I got the following error on OSX, which doesn't have HIP. As this file is supposed to compile with non-HIP builds, I added this error to the errors to ignore. ``` Traceback (most recent call last): File "test/test_profiler.py", line 31, in from torch.profiler._pattern_matcher import (Pattern, NamePattern, File "/Users/jclow/pytorch3/torch/profiler/_pattern_matcher.py", line 9, in import torch.utils.benchmark as benchmark File "/Users/jclow/pytorch3/torch/utils/benchmark/__init__.py", line 2, in from torch.utils.benchmark.utils.timer import * # noqa: F403 File "/Users/jclow/pytorch3/torch/utils/benchmark/utils/timer.py", line 8, in from torch.utils.benchmark.utils import common, cpp_jit File "/Users/jclow/pytorch3/torch/utils/benchmark/utils/cpp_jit.py", line 13, in from torch.utils import cpp_extension File "/Users/jclow/pytorch3/torch/utils/cpp_extension.py", line 19, in from .hipify import hipify_python File "/Users/jclow/pytorch3/torch/utils/hipify/hipify_python.py", line 34, in from .cuda_to_hip_mappings import CUDA_TO_HIP_MAPPINGS File "/Users/jclow/pytorch3/torch/utils/hipify/cuda_to_hip_mappings.py", line 34, in rocm_path = subprocess.check_output(["hipconfig", "--rocmpath"]).decode("utf-8") File "/Users/jclow/opt/anaconda3/envs/pytorch3/lib/python3.8/subprocess.py", line 415, in check_output return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, File "/Users/jclow/opt/anaconda3/envs/pytorch3/lib/python3.8/subprocess.py", line 493, in run with Popen(*popenargs, **kwargs) as process: File "/Users/jclow/opt/anaconda3/envs/pytorch3/lib/python3.8/subprocess.py", line 858, in __init__ self._execute_child(args, executable, preexec_fn, close_fds, File "/Users/jclow/opt/anaconda3/envs/pytorch3/lib/python3.8/subprocess.py", line 1706, in _execute_child raise child_exception_type(errno_num, err_msg, err_filename) PermissionError: [Errno 13] Permission denied: 'hipconfig' ``` Differential Revision: [D38766067](https://our.internmc.facebook.com/intern/diff/D38766067) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83009 Approved by: https://github.com/malfet commit 
244690205fdf5e27fe37c44aeea183eccd391307 Author: Chien-Chin Huang Date: Wed Aug 17 00:25:57 2022 -0700 [FSDP] Use _init_from_local_tensor to create ShardedTensor to avoid communication overhead (#82911) FSDP originally uses `_init_from_local_shards_and_global_metadata()` to create a ShardedTensor for sharded_state_dict(). We have seen some non-trivial overhead if the number of tensors is large. Using `_init_from_local_shards_and_global_metadata ` can significantly reduce the overhead. For a model with ~250 tensors in the state_dict trained with 16 GPUs, the original `sharded_state_dict` takes ~1.7 seconds and this PR reduces the overhead to ~0.6 seconds. Differential Revision: [D38452170](https://our.internmc.facebook.com/intern/diff/D38452170/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82911 Approved by: https://github.com/awgu commit 5e8b4c64aa1ecd2e56aeeb06559af97162effdd1 Author: Horace He Date: Wed Aug 17 02:45:19 2022 +0000 Delayed compilation of backwards pass to when backwards runs (#83367) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83367 Approved by: https://github.com/ngimel, https://github.com/zou3519 commit 1f7153bee80090c22490a26828bd74fb0d9fc60e Author: Digant Desai Date: Wed Aug 17 16:31:14 2022 +0000 [quant] Optionally clamp weights post quantization (#83438) Summary: Until we add quant_{min, max} args to `torch.quantize_per_{channel, tensor}`, this patch will make sure we will honor observer's restrictions on quantized values. Test Plan: Added new tests, run with - `buck run caffe2/test:quantization -- quantization.core.test_utils` Differential Revision: D38624119 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83438 Approved by: https://github.com/andrewor14 commit ab02b898116a2d8f0e6da2689298027543362ea9 Author: migeedz Date: Tue Aug 16 16:11:42 2022 -0700 expand torch.full to reason about integers (#83087) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83087 Approved by: https://github.com/jansel commit 1a38724ed3e189152f11ef576d0ff15a31a39eaa Author: migeedz Date: Tue Aug 16 16:11:42 2022 -0700 fix bug in a linear constraint (#82938) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82938 Approved by: https://github.com/jansel commit 0061e6762985511a1487c1c12a5353a6d4faf73e Author: PyTorch MergeBot Date: Wed Aug 17 16:19:38 2022 +0000 Revert "NestedTensor Softmax (#83435)" This reverts commit d7fc76a1ed33a155c8be795abe67315a4459e1a0. Reverted https://github.com/pytorch/pytorch/pull/83435 on behalf of https://github.com/huydhn due to This is suspected to break functorch tests in trunk https://hud.pytorch.org/pytorch/pytorch/commit/d7fc76a1ed33a155c8be795abe67315a4459e1a0 commit eb4e03ddf89dcacf204e3f95ae0711dfbcc1939b Author: Horace He Date: Wed Aug 17 01:26:25 2022 +0000 made some minor tweaks to minifier to reduce outputs more often (#83565) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83565 Approved by: https://github.com/voznesenskym commit 84c4b079328c2a97d78d95c47b841f8dca6036bb Author: albanD Date: Wed Aug 17 15:08:05 2022 +0000 Make sure that we can load old optimizer checkpoint (#83588) We want to make sure that we can load checkpoints that were saved with older version of the code (which doesn't contain the differentiable attribute). 
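A hedged sketch of what this guarantee means in practice (the key name comes from the commit message; the exact mechanism in the PR may differ): loading a state_dict whose param_groups predate the `differentiable` flag must fall back to a default rather than fail.

```python
import torch

params = [torch.nn.Parameter(torch.randn(2))]
opt = torch.optim.Adam(params, lr=0.1)

# Simulate a checkpoint written by an older PyTorch that had no `differentiable` option.
old_state = opt.state_dict()
for group in old_state["param_groups"]:
    group.pop("differentiable", None)

opt.load_state_dict(old_state)  # should succeed; the missing flag falls back to its default
```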
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83588 Approved by: https://github.com/mikaylagawarecki commit dcda907693e9e7661c099c3b9ed25fadaed273f8 Author: ProGamerGov Date: Wed Aug 17 14:53:02 2022 +0000 Add docstring type formatting guidelines to `CONTRIBUTING.md` (#83536) This PR builds on the following past PRs, and serves to help improve the consistency of PyTorch's docstring formatting. * `boolean` -> `bool` and `string` -> `str`: https://github.com/pytorch/pytorch/pull/82410 * Don't use plural of types: https://github.com/pytorch/pytorch/pull/82474 * Capitalize the Callable type, `callable` -> `Callable` : https://github.com/pytorch/pytorch/pull/82487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83536 Approved by: https://github.com/H-Huang, https://github.com/albanD commit 9f03444f705a52833d8e3220446ae48e285c2cf9 Author: Ivan Yashchuk Date: Wed Aug 17 14:46:04 2022 +0000 Add torch.ops.aten -> torch._refs mapping to TorchRefsMode using decomposition_table (#82657) This PR adds the possibility to convert `torch.ops.aten` calls to `torch._refs` and consequently prims under TorchRefsMode. New test, `test_aten_overload_to_prims`, in `test/test_prims.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82657 Approved by: https://github.com/jjsjann123, https://github.com/ezyang commit 7af3208412c282c4b8d216f413df5bd26287f9fd Author: Jagadish Krishnamoorthy Date: Wed Aug 17 14:42:33 2022 +0000 [ROCm] Enable test_ddp_profiling_torch_profiler (#82749) Signed-off-by: Jagadish Krishnamoorthy Pull Request resolved: https://github.com/pytorch/pytorch/pull/82749 Approved by: https://github.com/ngimel, https://github.com/rohan-varma commit c8ec4ceb9ba8746b2b8c149ebb981914d2ef0483 Author: Sergii Dymchenko Date: Wed Aug 17 13:23:11 2022 +0000 Delete checked_dense_tensor_unwrap (#83543) As TH gone long time ago. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83543 Approved by: https://github.com/ZainRizvi, https://github.com/ezyang commit 822a8e057fa4e6a6a8413d22bae2c1a5aa853134 Author: Peter Bell Date: Wed Aug 17 01:40:00 2022 +0100 Use opmath_type for CUDA logcumsumexp (#83425) This improves precision by reducing the number of narrowing conversions, as well as reducing compile times from 2m 30s to 1m 25s on my machine. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83425 Approved by: https://github.com/ngimel commit 2a096e940d33a33c4eb6df1c2ed4da607bd31a7f Author: Fabio Rocha Date: Tue Aug 16 14:23:09 2022 -0500 [primTorch] support for a few magic methods (#83524) Added support for mapping __rsub__, __rtruediv__, __rfloordiv__, __floordiv__, __pow__, and __rpow__ in TorchRefsMode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83524 Approved by: https://github.com/ngimel commit 5aab57e112d244f0cf3bbab30db640e52a0c2c44 Author: Emilio Castillo Date: Wed Aug 17 07:20:37 2022 +0000 Make Adam optimizer differentiable (#82205) Continues [80938](https://github.com/pytorch/pytorch/pull/80938) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82205 Approved by: https://github.com/albanD commit 11d4d91bdccab35928fd56a4fc5eac781f9fb71e Author: Larry Liu <8188269+larryliu0820@users.noreply.github.com> Date: Tue Aug 16 21:11:45 2022 -0700 [torchgen] Add logic in annotation parser to accept alias set (#83501) Extending the current regex in `model.py` to support annotation alias set. See issue #83214. 
Ideally we should have a full fledged lexer similar to `schema_type_parser.cpp`, since regex can be more and more difficult to read if we add more support to it. Adding this to unblock this issue for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83501 Approved by: https://github.com/SherlockNoMad commit a09c3fcb8d0e290b8d398c110ddbfd845e6c4058 Author: CaoE Date: Wed Aug 17 06:19:54 2022 +0000 Add loss operators to fp32 cast policy of AutocastCPU (#81689) Add loss operators to fp32 cast policy of AutocastCPU to improve accuracy of BFloat16 training. There will be no performance impact on fp32, only a slight impact on bf16 training. This is because conv transpose does not fully support bf16 before, and it will be replaced to _convolution in graph mode. If _convolution is in lower precision cast policy it will throw dtype related errors. conv transpose does not fully support bf16 yet, so _convolution still needs to be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81689 Approved by: https://github.com/malfet commit d3a176a156819d57ed442579b806ff027402f4dc Author: fduwjj Date: Tue Aug 16 17:32:07 2022 -0700 [PT-D][BE][TP perf 1/N] Get rid of unnecessary collectives in Embedding/EmbeddingBag and use autograd-enabled collectives (#81853) These two ops (Embedding and EmbeddingBag for ShardedTensor) especially for row-wise sharding is very inefficient and hard to fit in the concept of future design. So this PR is trying to: 1. Remove all unnecessary collective communications. Only one gather and one reduce(or reduce scatter) is needed. 2. Use auto-grad enabled collectives so that we can use these ops in real model training. 3. Some minor code cleaning 4. Treat input differently when it's replicated tensor. (Will add more for this for the next few PRs). Differential Revision: [D37965687](https://our.internmc.facebook.com/intern/diff/D37965687/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/81853 Approved by: https://github.com/wanchaol commit e09821f784bc9e9f13d361e9d2eb3fa1d7d07263 Author: Edward Z. Yang Date: Tue Aug 16 15:05:52 2022 -0400 Avoid using true division in split_dim (#83527) This makes it more amenable to tracing with dynamic shapes, where we don't support SymFloats yet. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83527 Approved by: https://github.com/ngimel commit d7fc76a1ed33a155c8be795abe67315a4459e1a0 Author: Michael Gschwind Date: Wed Aug 17 04:19:23 2022 +0000 NestedTensor Softmax (#83435) Summary: Simple mask compute and softmax Test Plan: unit test Differential Revision: D38711915 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83435 Approved by: https://github.com/erichan1 commit 343b5f86512f75f8f3bd4b90749c0459743b9e72 Author: John Clow Date: Tue Aug 16 10:18:47 2022 -0700 [TorchTidy] Adding support for accessing strides and scalars (#80072) Differential Revision: [D37571570](https://our.internmc.facebook.com/intern/diff/D37571570) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80072 Approved by: https://github.com/robieta commit 1a09b05c940b44968ccd6ba94698150512defbc7 Author: Nikita Shulga Date: Wed Aug 17 03:22:56 2022 +0000 Fix `torch.equal` on CPU (#83350) `torch.equal` should not raise an exception when comparing tensors of different types I.e. `torch.equal(torch.tensor([1, 2]), torch.tensor([1, 2], dtype=torch.float)))` should return True rather than raise an exception. 
Also, this makes it consistent with GPU behaviour Fixes https://github.com/pytorch/pytorch/issues/83314 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83350 Approved by: https://github.com/albanD commit df62ea76d1b443aad8d92b8c6fdad18fae5c6eb6 Author: migeedz Date: Tue Aug 16 16:11:41 2022 -0700 add the nessesairy constraints for the next 5 benchmarks (#82923) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82923 Approved by: https://github.com/jansel commit aac622ad55a8127e770217c6773031817be33b5f Author: Jacob Szwejbka Date: Wed Aug 17 01:45:30 2022 +0000 Optionally run fbgemm in tracer (#83531) Summary: Well this tech debt has come back to haunt me. Gonna slap more duct-tape on it for today. Test Plan: ci Differential Revision: D38753249 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83531 Approved by: https://github.com/dhruvbird commit 31d4b6f52a9f74526f4f666348029da260254ea5 Author: Kulin Seth Date: Wed Aug 17 00:26:41 2022 +0000 [MPS] Fix conv1D and conv2D with non-matching strides/paddings (#83522) * Add reference to the github issue in test_mps.py Fixes https://github.com/pytorch/pytorch/issues/83180, https://github.com/pytorch/pytorch/issues/82921, https://github.com/pytorch/pytorch/issues/82711, https://github.com/pytorch/pytorch/issues/82563 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83522 Approved by: https://github.com/albanD, https://github.com/malfet commit 0e2efaf9cca53890004718eba76dfefa74838aa3 Author: Catherine Lee Date: Wed Aug 17 00:19:39 2022 +0000 use global var for disabled and slow test dicts (#83487) as in title Additional changes: * run isort for imports * rename some variables * warning instead of print Test plan * checked logs to see that tests were still being disabled * checked pytest xmls to check that pytest still disables things Pull Request resolved: https://github.com/pytorch/pytorch/pull/83487 Approved by: https://github.com/malfet, https://github.com/huydhn commit 1ee9eb52b612f5fb4b63bbda832e44c8902edb64 Author: Brian Hirsh Date: Mon Aug 15 13:49:32 2022 -0700 fix native_layer_norm meta kernel parity w cuda (#83457) Fixes #83362 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83457 Approved by: https://github.com/eellison commit f4b7c10e14fabd3cf3998746f834bf0d0410c070 Author: Brian Hirsh Date: Mon Aug 15 13:49:32 2022 -0700 fix resnet50_quantized_qat and mobilenet_v2_quantized_qat <> functionalization (#83339) This won't actually fix the issue until we make FakeTensor always-on for AOTAutograd. I confirmed with the following benchmark (with `normalize_ir=False` and `use_functionalize=True`) in the dynamo/functorch config (run inside the `torch dynamo` repo): ``` terminal...$ python benchmarks/torchbench.py --training --devices=cuda --accuracy-aot-nop --generate-aot-autograd-stats --use-eval-mode --isolate --only=mobilenet_v2_quantized_qat cuda train mobilenet_v2_quantized_qat 0.967x p=0.00 terminal...$ python benchmarks/torchbench.py --training --devices=cuda --accuracy-aot-nop --generate-aot-autograd-stats --use-eval-mode --isolate --only=resnet50_quantized_qat cuda train resnet50_quantized_qat 0.943x p=0.00 ``` I explained a bit more in the comment: quantized models use a running-mean style op, `fused_moving_avg_obs_fake_quant`, that takes in the running min/max stored on the module and mutates them, potentially resizing them. 
That causes `AOTAutograd` to complain: AOTAutograd first takes views of the inputs (using `.detach().requires_grad_(grad)`), and plumbs them through the function to figure out what output to trace the backward with. These new inputs now have `TensorImpl::allow_tensor_metadata_change_ = false`, which causes the op to fail when it tries to resize the running counter variables. Once we're always using fake tensors, we shouldn't need to use `.detach().requires_grad_()` anymore (since we already have fresh fake tensors to trace with). Pull Request resolved: https://github.com/pytorch/pytorch/pull/83339 Approved by: https://github.com/ezyang commit 785f7f62984c2a017ae5f31173d405d658435a66 Author: PyTorch MergeBot Date: Tue Aug 16 23:30:43 2022 +0000 Revert "Use opmath_type for CUDA logcumsumexp (#83425)" This reverts commit 06a64f7eaa47ce430a3fa61016010075b59b18a7. Reverted https://github.com/pytorch/pytorch/pull/83425 on behalf of https://github.com/huydhn due to This break ROCm trunk test https://hud.pytorch.org/pytorch/pytorch/commit/06a64f7eaa47ce430a3fa61016010075b59b18a7 commit 3586af8adce03ac44c57e42de23b8a6676d78961 Author: Jerry Zhang Date: Tue Aug 16 20:23:42 2022 +0000 [quant] Remove unused quantize handler definitions (#83360) Summary: CommonQuantizeHandler This was added previously to make some of the refactor to use reference quantized model flow easier, now we have fully migrated to use reference quantized model flow, it's no longer needed, so we can remove it Also updated some comments Test Plan: python test/test_quantization.py TestQuantizeFx python test/test_quantization.py TestQuantizeFxOps python test/test_quantization.py TestQuantizeFxModels Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/83360 Approved by: https://github.com/andrewor14 commit 059321469e87801f2300ebd8c44d667e1b12bfa3 Author: ssjia Date: Tue Aug 16 10:31:54 2022 -0700 [vulkan] Use aliases when retrieving from packed/unpacked lists in OpContexts (#83526) Instead of retrieving elements of pack/unpacked lists using raw indices, this diff introduces aliases which improve code readability and guard against future errors. 
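The change above is a general readability pattern; this Python-flavored illustration uses invented names (the real code is C++): index the packed/unpacked lists through named aliases instead of bare integers.

```python
from enum import IntEnum

class Packed(IntEnum):
    # One place defines the layout of the packed argument list.
    WEIGHT = 0
    BIAS = 1
    STRIDE = 2
    PADDING = 3

packed = ["<prepacked weight>", "<bias>", (1, 1), (0, 0)]
weight = packed[Packed.WEIGHT]    # reads as intent, not as a magic number
padding = packed[Packed.PADDING]
```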
Differential Revision: [D38748293](https://our.internmc.facebook.com/intern/diff/D38748293/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83526 Approved by: https://github.com/manuelcandales commit 31fad3926a34c57e05d25a2cc22abf4028ebfc78 Author: soulitzer Date: Tue Aug 16 16:20:44 2022 -0400 Add option to run anomaly mode without nan checking (#83481) Fixes https://github.com/pytorch/pytorch/issues/83117 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83481 Approved by: https://github.com/albanD commit 1b437718a351b3bc2bb424975bdf29c82fb1c6c8 Author: Eli Uriegas Date: Tue Aug 16 13:07:16 2022 -0700 ci: Add workflow to build official docker images with multiarch (#83437) Resolves https://github.com/pytorch/pytorch/issues/80764 Signed-off-by: Eli Uriegas Pull Request resolved: https://github.com/pytorch/pytorch/pull/83437 Approved by: https://github.com/ZainRizvi, https://github.com/malfet commit 3a511e83549d86ce9a2a8de2b64be340d2a23e4e Author: samdow Date: Tue Aug 16 22:39:06 2022 +0000 [Expanded Weights] add 'same' and 'valid' padding support (#83345) Co-authored-by: Ashkan Adds "same" and "valid" padding support, as Opacus (well @ashkan-software) did https://github.com/pytorch/opacus/pull/451 Basics of it are this: - during forward pass, if there's "same" padding, we manually pad the input (NB: this will cause a small perf hit, haven't benchmarked yet) - during backward pass, the gradient wrt input needs to be cut down to the correct size if the original padding was same (conv_transpose doesn't accept string padding). Because conv_transpose will give us a gradient wrt the padded shape, we cut down the gradient to the correct size (we know how much padding we added to the left and right) - then, for the per sample gradients wrt weights, the input is already padded so neither the unfold nor group convolution have any padding Pull Request resolved: https://github.com/pytorch/pytorch/pull/83345 Approved by: https://github.com/zou3519 commit cd68f08992e0985f1032726571ebe781aa50f82a Author: Justin Chu Date: Tue Aug 16 19:58:48 2022 +0000 [ONNX] Update the script for version updates (#83283) This PR updates the `tools/onnx/update_default_opset_version.py` script to ensure files are edited correctly to prepare for the opset 17 support in torch.onnx. - (clean up) Move script to `main()` - Add an `--skip_build` option to avoid building pytorch if we want to rerun the process due to errors after compilation is done - Update to edit the correct files now that the onnx files were refactored Pull Request resolved: https://github.com/pytorch/pytorch/pull/83283 Approved by: https://github.com/thiagocrepaldi, https://github.com/AllenTiTaiWang, https://github.com/abock commit d52d2bd5a94a49332d843bb909e4db58fe7ab1b2 Author: Jeff Daily Date: Tue Aug 16 20:49:33 2022 +0000 [ROCm] MIOpen fused convolution relu (#82002) Adds MIOpen fused convolution relu for fp32 and contiguous memory format. Adds fallbacks for conv + z + bias + relu, fp16, and channels last until MIOpen adds these features. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82002 Approved by: https://github.com/ngimel, https://github.com/malfet commit 79356311f5c3d9283da118286ed5dd8de7d43fb3 Author: kshitij12345 Date: Tue Aug 16 20:31:46 2022 +0000 update merge failed msg (#83462) Message seemed a bit incorrect to read Ref: https://github.com/pytorch/pytorch/pull/82955#issuecomment-1215523319 Before PR: ``` Merge failed due to This PR is too stale; the last push date was more than 3 days ago. 
Please rebase and try again. Raised by https://github.com/pytorch/pytorch/actions/runs/2862480424 ``` After PR ``` Merge failed Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. Raised by https://github.com/pytorch/pytorch/actions/runs/2862480424 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83462 Approved by: https://github.com/janeyx99, https://github.com/ZainRizvi commit 4b597019b735124b46947e8ff0490e0311f0bdb8 Author: Driss Guessous Date: Tue Aug 16 20:22:19 2022 +0000 [Nested Tensor] Created Nested Tensor to Nested Tensor Views (#82658) This is PR is pulling out all the changes from #81838 specific to properly creating nested_tensor views. I will update this comment with a design doc once that has been made. This should enable proper creation of NestedTensor views, two nested_tensors sharing the same buffer_ but with different NestedTensor meta data. The function `create_nested_tensor_view` is a helper function for creating a new nested tensor whose storage aliases the base causing the underlying storage to be shared - and is therefore a view. This function by itself is not differentiable and therefore autograd does not track its uses. If a nested tensor function implementation uses this helper in its implementation the aten_op must meet two requirements: - The function must return a view of the input - The function must be explicit and defines its backward A bug was found when creating a base tensor out of inference mode and then creating a view in inference mode. This test has been aded to this PR in order to show the effect of the change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82658 Approved by: https://github.com/albanD commit 94ba085ce0ccd48c2f1bd2eb1956b7800b873384 Author: George Qi Date: Mon Aug 15 19:14:34 2022 +0000 [maskedtensor] first commit, core and creation (#82836) * __->__ #82836 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82836 Approved by: https://github.com/albanD, https://github.com/bhosmer commit 84146f3d0db1a39e6a4b363e15e30c6f6f159f75 Author: Peter Bell Date: Tue Aug 16 16:25:15 2022 +0100 Vectorize cpu tensor conversions (#80905) This adds vectorization to the copy kernel acting between different dtypes through the use of `at::vec::convert`. Currently `vec::convert` falls back to a scalar copy loop for most dtypes, however the compiler is still better able to auto-vectorize the loop since it doesn't involve stride calculations. In a simple timeit benchmark I see around a 2x speedup copying from int32 to various dtypes: | To dtype | Master (us) | This PR (us) | |----------|-------------|--------------| | int64 | 23.8 | 10.3 | | float32 | 16.8 | 8.18 | | float64 | 18.0 | 9.47 | Pull Request resolved: https://github.com/pytorch/pytorch/pull/80905 Approved by: https://github.com/ngimel commit 559c8b8992cff9602b35735b837c3971e9224f36 Author: Peter Bell Date: Tue Aug 16 16:25:15 2022 +0100 Fix _refs.lcm using floating point maths (#82950) `lcm` is meant to use integer maths, but the use of `true_divide` introduces a promotion to float and thus a loss of precision. This also introduces promoting low precision integers to int32 which is required for 100% consistency with the C++ implementation since the "usual arithmetic conversions" means the intermediate terms are calculated to `int` precision in C++. This only really matters when the lower precision dtype would overflow, however the test cases for lcm do involve overflows. 
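The precision problem described above is easy to reproduce with plain tensor ops (a hedged sketch, not the ref's actual code):

```python
import torch

a = torch.tensor([2**61 + 1])   # int64 value too large for float32/float64 to represent exactly
b = torch.tensor([2])
g = torch.gcd(a, b)             # gcd is 1 here

lcm_via_float = (a / g * b).to(torch.int64)  # true division promotes to float -> rounding
lcm_via_int = a // g * b                     # floor division stays in int64

print(lcm_via_float.item(), lcm_via_int.item(), torch.lcm(a, b).item())
# Only the integer route matches torch.lcm for values this large.
```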
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82950 Approved by: https://github.com/ngimel commit 9745edf971f125a870b6db75dc9536185fbc84c7 Author: Michael Melesse Date: Tue Aug 16 19:56:17 2022 +0000 [ROCM] Enable test_memory_format_nn_BatchNorm tests on ROCM (#82512) This enables some unit tests related to BatchNorm for ROCM. We make sure that we call the MIOpen library incases where it can handle it and use the default path in other cases. When MIOpen implements this specific case we will file a follow up PR enabling that code path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82512 Approved by: https://github.com/jeffdaily, https://github.com/albanD commit 06a64f7eaa47ce430a3fa61016010075b59b18a7 Author: Peter Bell Date: Tue Aug 16 15:47:15 2022 +0100 Use opmath_type for CUDA logcumsumexp (#83425) This improves precision by reducing the number of narrowing conversions, as well as reducing compile times from 2m 30s to 1m 25s on my machine. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83425 Approved by: https://github.com/ngimel commit 0faf10b0f4da0bcaaaba8834fa6984bbd7f793f9 Author: Peter Bell Date: Tue Aug 16 15:47:14 2022 +0100 Split ScanKernels.cu (#83422) On my machine `ScanKernels.cu` takes 10 minutes for just a single architecture which is by far the highest compile time of any single file. So this splits it into multiple files, the slowest being `LogcumsumexpKernel.cu` which takes 2m 30s Pull Request resolved: https://github.com/pytorch/pytorch/pull/83422 Approved by: https://github.com/ngimel commit 8473e6968487d736c75470eeae4d63b11156b622 Author: Pruthvi Madugundu Date: Tue Aug 16 19:22:31 2022 +0000 [ROCm] Fixes the kernel asserts API declaration mismatch error (#81790) This problem updates the the PR [#73040](https://github.com/pytorch/pytorch/pull/73040) The compilation error in pyTorch with ROCm is successful with these changes when `NDEBUG` is enabled. Solution: For HIP we keep `__device__ __assert_fail()` and for host side compilation we want to use the `__assert_fail()` from the glibc library. Tested the code by compiling with below steps ``` python3 tools/amd_build/build_amd.py python3 setup.py develop --cmake-only cmake -DHIP_HIPCC_FLAGS_RELEASE="-DNDEBUG" build cmake --build build ``` The UT test_fixed_cuda_assert_async is still skipped due performance overhead. 
cc @jithunnair-amd Pull Request resolved: https://github.com/pytorch/pytorch/pull/81790 Approved by: https://github.com/shintaro-iwasaki, https://github.com/jeffdaily, https://github.com/malfet commit b156f3329e75a5040fdf348c7dd6552bee5fcb40 Author: Nikita Karetnikov Date: Tue Aug 16 01:15:04 2022 +0200 [primTorch] Add ref for movedim (#83278) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83278 Approved by: https://github.com/ngimel commit 2c79b9c638e98b2fed0e29c601c3ed3e227280e6 Author: Slava Kovalevskyi Date: Tue Aug 16 18:38:06 2022 +0000 module names are made more consistent with POI page (#83219) Less intrusive update after the first attempt got reverted: https://github.com/pytorch/pytorch/pull/83127 fix for: #83363 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83219 Approved by: https://github.com/malfet commit 92a005883a18a0026161d933aa27a08d0ef68af2 Author: Mikayla Gawarecki Date: Fri Aug 12 23:14:13 2022 +0000 [easy] Fix .sizes() call in saved_variable.cpp for nested tensor (#83356) Small fix so that TestMultipleDispatch in the above PR will throw the correct error when using an inplace operation on a saved nested input Pull Request resolved: https://github.com/pytorch/pytorch/pull/83356 Approved by: https://github.com/albanD commit 7e7afcabe70712e8d6bad0bba0adcd93e69cfd6b Author: Richard Zou Date: Tue Aug 16 09:02:50 2022 -0700 [functorch] classify some more test failures (#83520) Classifies test failures for test_vmapvjp and test_vmapjvpall Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/83520 Approved by: https://github.com/samdow commit 52b8a581970830ad1b9a0c7ec66d16f2e9eae5b8 Author: Richard Zou Date: Tue Aug 16 09:02:50 2022 -0700 [functorch] audit skips and xfails for vjp tests (#83518) Went through test_vjp, test_grad, test_vjpvjp Pull Request resolved: https://github.com/pytorch/pytorch/pull/83518 Approved by: https://github.com/samdow commit 64a3fbae5e6e4fe5a5b71d065b30549ca7a03847 Author: Richard Zou Date: Tue Aug 16 08:04:27 2022 -0700 [functorch] Classify some vmap failures with comments (#83517) The silent incorrectness issues are hi-pri Test Plan: - wait for tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/83517 Approved by: https://github.com/samdow commit a3e3cbfbbe093d9046d704738c52212f7e76b11c Author: Nikita Karetnikov Date: Tue Aug 16 01:15:04 2022 +0200 [primTorch] Add ref for diagonal and more test inputs (#82321) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82321 Approved by: https://github.com/ngimel commit 4010f96121f85f452d22692fb7fa4f3fb84d76d8 Author: Nikita Karetnikov Date: Tue Aug 16 01:15:03 2022 +0200 [primTorch] Fix off by 1 in `canonicalize_dim` (#83198) Also fix an issue in the `unsqueeze` ref due to this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83198 Approved by: https://github.com/ngimel commit 6a5ca409da0d5f54997d5a74dbf36782bd42c3a3 Author: Seonglyong Gong Date: Tue Aug 16 17:42:34 2022 +0000 Revert "reverted diff: Add python stack tracing option on on-demand flow" (#82378) Summary: Changes: add an option in Config; can use 'PYTHON_STACK_TRACE=true' option (via .conf) deliver PYTHON_STACK_TRACE value to kineto_client_interface start() abstract class also changed. 
Trace after changes by running //kineto/libkineto/fb/integration_tests/trace_tester.cpp (requested by chaekit) https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1657304871%2F127.0.0.1%2Flibkineto_activities_3502962.json.gz&bucket=gpu_traces Test Plan: launch a python test case with the following command for on-demand flow: echo -e "PYTHON_STACK_TRACE=true" > /tmp/scott_kineto.conf && dyno gputrace --gputrace_duration 300ms --gpuconf /tmp/scott_kineto.conf Reviewed By: chaekit Differential Revision: D38220201 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82378 Approved by: https://github.com/chaekit commit bb94a13d0369ff6489d12c0e658b1257dabdf3d9 Author: ssjia Date: Mon Aug 15 20:05:00 2022 -0700 [vulkan][fix] Fix unsafe direct array access (#83432) This diff fixes an instance of unsafe array access of a sizes array. Differential Revision: [D38710499](https://our.internmc.facebook.com/intern/diff/D38710499/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83432 Approved by: https://github.com/kirklandsign, https://github.com/manuelcandales commit 08d38bbcfba4c09c8463acebd7bfc436c7f3a229 Author: ssjia Date: Mon Aug 15 15:10:24 2022 -0700 [vulkan] Replace *_size() functions with get_dim() (#83423) This diff replaces the `batch_size`, `channels_size`, etc. functions with a template function `get_dim` to reduce duplicate code. `batch_size()` has been replaced with `get_dim` and so on. Differential Revision: [D38706526](https://our.internmc.facebook.com/intern/diff/D38706526/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83423 Approved by: https://github.com/salilsdesai commit cd86d2551525446f2046b34143935d9db1fc5e7a Author: Nikita Karetnikov Date: Tue Aug 16 17:23:00 2022 +0000 [primTorch] Move addcdiv from decompositions -> refs (#80842) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80842 Approved by: https://github.com/Lezcano, https://github.com/ngimel commit 59fccab85775da7a0ecf33bda241f81eade3ad4b Author: Ramiro Leal-Cavazos Date: Tue Aug 16 17:13:21 2022 +0000 [Shape Fns] Fix handling of empty dim list in sum_mean_dim shape fn (#83357) The current implementation of the `sum_mean_dim` shape function takes `dim=[]` and `dim=None` to mean "no reduction". However, in the ops `torch.sum` and `torch.mean`, both `dim=[]` and `dim=None` are equivalent to "reduce along all dimensions". This commit fixes the handling of `dim` in the `sum_mean_dim` shape function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83357 Approved by: https://github.com/Gamrix commit d589aa531ffc3cb657f9f76d38abf034df474c57 Author: Michael Gschwind Date: Tue Aug 16 16:53:10 2022 +0000 TS jit 2 week compatibility window for new TEL forward() (#83467) Summary: TS jit 2 week compatibility window for new TEL forward() Test Plan: sandcastle Differential Revision: D38711177 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83467 Approved by: https://github.com/erichan1, https://github.com/jbschlosser commit cf4fb5a6313d467a1024849a4de0f253400247ff Author: Edward Z. Yang Date: Tue Aug 16 10:23:24 2022 -0400 Make test_jvpvjp_as_strided_scatter skipped due to flaky (#83516) Signed-off-by: Edward Z. 
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83516 Approved by: https://github.com/zou3519 commit f9a3d82220586e3804bbc5317658115296dc6c18 Author: albanD Date: Tue Aug 16 15:32:43 2022 +0000 Fix typo in MPS allocator (#83465) Fixes https://github.com/pytorch/pytorch/issues/81184 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83465 Approved by: https://github.com/malfet commit 4c8cfb57aa3ac58112efb693635198b07edf008f Author: Edward Z. Yang Date: Mon Aug 15 20:03:13 2022 -0700 Convert SymInt tracing to mode based tracing (#83380) We're on our way to deleting ProxyTensor entirely (see https://github.com/pytorch/pytorch/pull/83330 ), but before we can do that, we have to delete ProxySymInt first. Here's the plan. Changes in torch.fx.experimental.symbolic_shapes * The general idea is to do mode based tracing. This means we need a mode that can interpose on all SymInt operations. There are a few ways to do this, but I've done it the easy way: (1) I have a separate mode for SymInt operations specifically called SymDispatchMode, and (2) this mode operates on PySymInt (and not the basic SymInt which is user visible). I elided Int from the name because if we add SymFloats I want to use the same mode to handle those as well, and I used Dispatch rather than Function because this is the "inner" dispatch operating PySymInt and not SymInt (this is not a perfect analogy, but SymFunctionMode definitely seemed wrong as you still must go through the C++ binding.) The mode is entirely implemented in Python for ease of implementation. We could have implemented this more symmetrically to TorchFunctionMode in C++, but I leave that as later work; this API is unlikely to get used by others (unlike TorchFunctionMode). One downside to not doing the mode in C++ is that we still have to do the hop via a preexisting PySymInt to wrap; this is currently not a big deal as conversion to SymInts only really happens when there is already another SymInt floating around. SymDispatchMode is pared down from TorchDispatchMode; there is no ancestor tracking since I don't expect people to be mixing up SymDispatchModes. * I made some improvements for tracing. When I invoke the SymDispatchMode handler, I would like constants to show up as constants, so they can be directly inlined into the FX graph (rather than going through a wrapping process first, and then the wrapped SymInt being used in the operation). To do this, I directly track if a PySymInt is a constant at construction time. Only wrapped PySymInts are constants. * For convenience, PySymInts now support all magic methods that regular SymInts do. This is so that redispatch inside the SymDispatchMode can be written the idiomatic way `func(*args, **kwargs)` where func is an operator. The original names are retained for direct C++ calls. Changes in torch.fx.experimental.proxy_tensor * OK, so we got a new SymDispatchMode, so we define a ProxySymDispatchMode and activate it when we start tracing. This mode is currently unconditionally activated although technically we only need to activate it when doing symbolic tracing (it doesn't matter either way as there are no SymInts if you are not doing symbolic tracing). * We delete ProxySymInt. To do this, we must now record the proxy for the SymInt some other way. Based on discussion with Chillee, it is more intuitive to him if the proxies are still recorded on the SymInt in some way. So we store them in the `__dict__` of the PySymInt, indexed by Tracer. 
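A toy sketch of that per-tracer storage scheme, purely illustrative (the class and attribute names are made up; the real code lives in torch.fx.experimental.proxy_tensor):

```python
class FakeTracer:            # hypothetical stand-in for an FX tracer
    pass

class FakeSymInt:            # hypothetical stand-in for PySymInt
    def __init__(self, value):
        self.value = value

_PROXY_SLOT = "__sym_proxies"   # made-up attribute name

def set_proxy(sym, tracer, proxy):
    # Stash the proxy in the symint's own __dict__, keyed by the tracer, so
    # each active tracer can later find "its" proxy for the same symbolic int.
    sym.__dict__.setdefault(_PROXY_SLOT, {})[tracer] = proxy

def get_proxy(sym, tracer):
    return sym.__dict__.get(_PROXY_SLOT, {}).get(tracer)

tracer = FakeTracer()
s = FakeSymInt(3)
set_proxy(s, tracer, "proxy(s0)")
assert get_proxy(s, tracer) == "proxy(s0)"
```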
An improvement is to make this a weak map, so that we remove all of these entries when the tracer dies. In an original version of this PR, I keyed on the mode itself, but tracer is better as it is accessible from both modes (and as you will see, we will need to fetch the map from both the ProxySymDispatchMode as well as the ProxyTorchDispatchMode.) The implementation of SymDispatchMode now simply retrieves the proxies, performs the underlying operation as well as the FX graph recording, and then records the output proxy to the PySymInt. Note that FX tracing does not work with proxies and SymInts, so we manually call `call_function` to ensure that the correct operations get recorded to the graph. This means conventional FX retracing with proxies only will not work with these graphs, but there wasn't really any reason to do this (as opposed to `make_fx` retracing) anyway. Constants are detected and converted directly into Python integers. * SymInts can show up as arguments to tensor operations, so they must be accounted for in ProxyTorchDispatchMode as well. This is done by searching for SymInt arguments and converting them into proxies before the proxy call. This can be done more efficiently in a single `tree_map` but I'm lazy. The helper `unwrap_symint_proxy` conveniently implements the unwrapping in one place given a tracer; unfortunately it cannot be shared with SymDispatchMode as SymDispatchMode gets PySymInts, but ProxyTensorMode gets SymInts. Similarly, tensors that are returned from tensor operations can have SymInts in their shapes, which need fresh proxies allocated. To avoid leaking internal details of SymInt shape computation to the tensor operation graph, these SymInts are always given proxies derived from `x.size(dim)` call on their return tensor. We also need to do this for strides and numel but have not done so yet. Furthermore, we must avoid tracing internal SymInt calls while we run meta operations on the true operation; this is achieved by also disabling SymInt tracing on the inside of tensor tracing. This is analogous to how tensor tracing is disabled inside the implementation of tracing mode, but unfortunately we are unable to use the same mechanism (this would have been easier if the two modes could be combined somehow, and I am amenable to suggestions to try harder to achieve this.) * Because there are no more ProxySymInts, we no longer need to do anything to unwrap SymInt. Furthermore, we do not need to reallocate ProxySymInts on class creation. * If a bare SymInt without a Proxy is encountered, it is assumed that this must be a constant. `create_arg` handles this case. Non-constant free SymInts result in an assert error. * The initial input handling in `dispatch_trace` involves traversing all of the input tensors, traversing over their shapes, and assigning proxies for the SymInts in shapes in the same way we handle proxies for the output tensors. The preexisting testing is inadequate but will be better after I rebase past https://github.com/pytorch/pytorch/pull/82209 Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83380 Approved by: https://github.com/samdow commit a3907ca92d73b380f4f1624e39b7f0c6a06ea5b1 Author: Edward Z. Yang Date: Mon Aug 15 20:03:12 2022 -0700 Respect TorchDispatchMode for shallow_copy_and_detach (#83372) I noticed I was missing tensor creations with modes when I tried to delete proxy tensor. This was the cause. Hypothetically, all PyInterpreter calls could get this treatment. 
But I think it only matters for detach; the rest do not return Tensors and most modes will not be interested in them. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83372 Approved by: https://github.com/zou3519 commit 1665715cb0fcc7c5dd5311c36cd0ef5dac660442 Author: Brian Hirsh Date: Mon Aug 15 13:27:33 2022 -0700 add sym_strides() function, use in fake/proxy tensors (#81300) Add `TensorImpl::sym_strides`, bind it to python with `torch.ops.aten.sym_strides`, and use it in `ProxyTensor` and `FakeTensor`. Before, `ProxyTensor` was generating `ProxySymInt`'s for the sizes, but not for the strides. Internally we still represent strides with a `SymIntArrayRef` though, so I ran into some weird issues where sizes were showing up as `ProxySymInt`, but strides were `PySymInt`'s. Differential Revision: [D38594558](https://our.internmc.facebook.com/intern/diff/D38594558) Pull Request resolved: https://github.com/pytorch/pytorch/pull/81300 Approved by: https://github.com/ezyang commit 2e8e386d6f718cc6e4e5df21ec8b02ae730a6283 Author: Ivan Yashchuk Date: Tue Aug 16 13:40:40 2022 +0000 Add refs for real and imag to __all__ (#83057) `imag` and `real` were missing from the ref's `__all__` list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83057 Approved by: https://github.com/ngimel commit 3500df79831d21725ea8d3883254ea8e3f11245e Author: kshitij12345 Date: Tue Aug 16 13:30:40 2022 +0000 [composite compliance] istft (#82955) Ref #69991 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82955 Approved by: https://github.com/zou3519 commit a9ba3fe1dbf2cea45c9a7e723010c27c211f7fe3 Author: PyTorch MergeBot Date: Tue Aug 16 10:14:25 2022 +0000 [vision hash update] update the pinned vision hash (#83503) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned vision hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83503 Approved by: https://github.com/pytorchbot commit 445b55682a4b794c8de89a2ffe25eaf96d1bd149 Author: PyTorch MergeBot Date: Tue Aug 16 10:13:25 2022 +0000 [xla hash update] update the pinned xla hash (#83502) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83502 Approved by: https://github.com/pytorchbot commit f77adb71cb78eabf8967d1d8139dfd893d58c5c5 Author: Horace He Date: Tue Aug 16 06:49:34 2022 +0000 made some minor refactoring of minifier (#83439) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83439 Approved by: https://github.com/ezyang commit ff75562cffb54d7500a94a1091e06dc9b5c284fc Author: Rob Zinkov Date: Tue Aug 16 08:19:46 2022 +0000 Adding maximize to rprop (#81864) Added the maximize flag #68052 to rprop optimizer and updates the respective tests. 
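For context on the `maximize` flag, the usual convention in torch.optim is that `maximize=True` flips the sign of the gradient so the step performs ascent rather than descent; a generic sketch of that convention (not Rprop's actual sign-based update logic):

```python
import torch

def step_with_maximize(params, lr=0.1, maximize=False):
    # Generic sketch: with maximize=True the gradient is negated, turning the
    # usual descent update into ascent. Illustrative only; see torch.optim.Rprop
    # for the real per-parameter step-size logic.
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            grad = -p.grad if maximize else p.grad
            p.add_(grad, alpha=-lr)

w = torch.nn.Parameter(torch.tensor([1.0]))
(2.0 * w).sum().backward()          # d/dw = 2
step_with_maximize([w], maximize=True)
print(w)                            # moved in the +gradient direction: 1.2
```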
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81864 Approved by: https://github.com/albanD commit a8941aa99676436eb4f10595b010bb48d6dc3c6e Author: Nikita Shulga Date: Tue Aug 16 07:51:11 2022 +0000 [BE] Better test stats errors (#83484) When `BUILD_ENVIRONMENT` is not defined, print sensible error message Which is better than: ``` Could not download https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/test-times.json because: 'BUILD_ENVIRONMENT' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83484 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi commit 03f9c7922edb4089ff65f0e2fde6c7cf8e2ab640 Author: Nikita Shulga Date: Mon Aug 15 19:29:20 2022 -0700 [FuncTorch] Fix compilation with -Werror (#83463) - Fixed signed-unsigned compares - Get rid of unused variables - Typecast to `PyCFunction` via `(void*)` - `ssize_t` is not a valid type on Win32 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83463 Approved by: https://github.com/zou3519 commit a5f688ad0a9630c71695ca132ec4236a51677067 Author: Rohan Varma Date: Tue Aug 16 07:20:58 2022 +0000 Remove unused var from ProcessGroupGloo (#83286) This variable was not used since the logic was refactored into `getElapsedTime`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83286 Approved by: https://github.com/mrshenli, https://github.com/H-Huang commit 43a94daca0a141fb3d7b3cdceee3b3dc296a0aa4 Author: PyTorch MergeBot Date: Tue Aug 16 02:47:17 2022 +0000 Revert "Add a workflow to cache third party dependencies on S3 (#83306)" This reverts commit 0961dd6e9981fe6580ee3f1d2c622f526d8ab9a9. Reverted https://github.com/pytorch/pytorch/pull/83306 on behalf of https://github.com/huydhn due to The fix in https://github.com/pytorch/pytorch/pull/83489 still doesn't work commit 641d75d0ba0053816a73a6c977ac4a2d6e00e896 Author: PyTorch MergeBot Date: Tue Aug 16 02:42:25 2022 +0000 Revert "S3 third-party deps sync workflow: specify correct secrets (#83489)" This reverts commit 7ec49810cc8a44cc2bc53c115fb03656ab136751. Reverted https://github.com/pytorch/pytorch/pull/83489 on behalf of https://github.com/huydhn due to It still doesn't work https://github.com/pytorch/pytorch/runs/7849815716 commit 7ec49810cc8a44cc2bc53c115fb03656ab136751 Author: Ivan Zaitsev Date: Tue Aug 16 02:16:12 2022 +0000 S3 third-party deps sync workflow: specify correct secrets (#83489) A followup for: #83306 image The correct secrets to access OSSCI buckets have `AWS_OSSCI_S3_***` prefix. This PR makes the workflow use the correct secrets. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83489 Approved by: https://github.com/huydhn, https://github.com/malfet commit 794ae6417456bedb99749a1b50ab17a9fda2b466 Author: Rohan Varma Date: Fri Aug 12 01:24:43 2022 +0000 [FSDP] Pass kwargs to load_state_dict (#83309) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83309 Approved by: https://github.com/awgu commit 0961dd6e9981fe6580ee3f1d2c622f526d8ab9a9 Author: Ivan Zaitsev Date: Mon Aug 15 23:58:36 2022 +0000 Add a workflow to cache third party dependencies on S3 (#83306) For the context, see #75703, pytorch/builder#1096. 
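Regarding the "Better test stats errors" change a couple of entries up: the goal is to fail with a readable message when `BUILD_ENVIRONMENT` is unset instead of surfacing a bare KeyError. A generic sketch of that pattern (the helper name is made up, not the actual tools script):

```python
import os
import sys

def require_env(name: str) -> str:
    # Fail fast with an actionable message instead of a bare KeyError such as
    # "because: 'BUILD_ENVIRONMENT'".
    value = os.environ.get(name)
    if value is None:
        sys.exit(f"{name} is not set; it is needed to select the correct "
                 "test-times file. Set it to the CI build name and retry.")
    return value
```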
Note: depends on the docker image `pytorch/sync_s3_thirdparty_deps` from pytorch/builder#1096 Summary of additions: * workflow config (based on pytorch/sync_s3_thirdparty_deps GH action) * S3 mapping config (sync_s3_cache.yml) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83306 Approved by: https://github.com/huydhn commit c177a7124cee54c5dfc30c38ca56414ddd9b5dca Author: John Clow Date: Mon Aug 15 13:59:16 2022 -0700 Adding additional debug logging and documentation for shape functions (#77115) Pull Request resolved: https://github.com/pytorch/pytorch/pull/77115 Approved by: https://github.com/eellison commit 9e1daf764419fb0b57c66dedce486e067cdd6be0 Author: Horace He Date: Mon Aug 15 22:07:52 2022 +0000 skip flaky tests for now (#83482) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83482 Approved by: https://github.com/huydhn commit cb64b558eeb273d1ad8f1e25a11725fc85bb1ddc Author: Edward Z. Yang Date: Mon Aug 15 11:56:27 2022 -0400 Add spaces so example is flake8 compatible (#83420) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83420 Approved by: https://github.com/jbschlosser commit b75a214b36d74a775f3c4542f58ac8f9c9f107fd Author: Huy Do Date: Mon Aug 15 21:25:05 2022 +0000 Fix windows flaky test env var (#83466) Reland #83426 and #83436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83466 Approved by: https://github.com/atalman commit a234774096cb99cb8826fc56092f720e74634987 Author: PyTorch MergeBot Date: Mon Aug 15 21:11:25 2022 +0000 Revert "Fix flaky tests env variable length on Windows (#83426)" This reverts commit beb83d7419bc21f9ca8881de81c8421409dd8f3a. Reverted https://github.com/pytorch/pytorch/pull/83426 on behalf of https://github.com/huydhn due to This has a bug which breaks internal builds D38714900 and other OSS test. The bug has been fixed by https://github.com/pytorch/pytorch/pull/83436. But we decide that it is safer to revert both, merge them into one PR, then reland the fix commit 6266003d71e85beabef52da54ccf2ae70c11491d Author: PyTorch MergeBot Date: Mon Aug 15 21:07:45 2022 +0000 Revert "Check if IMPORT_DISABLED_TESTS is set (#83436)" This reverts commit 1187dedd336e4f6c0028e0d081b676c2f5796316. Reverted https://github.com/pytorch/pytorch/pull/83436 on behalf of https://github.com/huydhn due to The previous change breaks internal builds D38714900 and other OSS tests. The bug has been fixed by this PR. But we decide that it is safer to revert both, merge them into one PR, then reland the fix commit dffa5d309a6f55aa8e07db827750a2b99e9e6b6e Author: Catherine Lee Date: Mon Aug 15 20:03:08 2022 +0000 shard `trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (default` from 2 -> 4 (#83424) it takes a long time Pull Request resolved: https://github.com/pytorch/pytorch/pull/83424 Approved by: https://github.com/huydhn commit 43f950af201f8a39e5728a65e03cfcafec04585d Author: soulitzer Date: Mon Aug 15 12:17:17 2022 -0400 Manually shard slow-gradcheck CI job to prevent timeout (#83354) Fixes https://github.com/pytorch/pytorch/issues/83335 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83354 Approved by: https://github.com/malfet, https://github.com/albanD commit 13e2a0a04838414058a45888a081a9ac81adb311 Author: Milad Mohammadi Date: Mon Aug 15 19:48:25 2022 +0000 Add `getDynamicValue` to `dynamic_ir` (#82188) Add `getDynamicValue` to `dynamic_ir`. 
This is a precondition to support https://github.com/pytorch/xla/issues/3759 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82188 Approved by: https://github.com/Krovatkin commit ca4f3534514a2310c540986cceb4cf9c3d6a0995 Author: Milad Mohammadi Date: Mon Aug 15 19:47:14 2022 +0000 Updated the build process for PyTorch/XLA CI testing (#82497) Updated the build process for PyTorch/XLA CI testing Related issue https://github.com/pytorch/pytorch/issues/82425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82497 Approved by: https://github.com/wconstab commit 60295e3abde373a1ca7ceea518ef65c8d7c7f058 Author: Richard Zou Date: Mon Aug 15 08:42:38 2022 -0700 [functorch] Delete functorch_lagging_op_db (#83418) No need to have a lagging op db because there are no more sync issues between functorch and pytorch. If someone adds a new OpInfo, then we should explicitly check if we support it or not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83418 Approved by: https://github.com/samdow commit 759c37a4f4fb1962c37650bf24d88a7fa0918a5e Author: Nikolay Korovaiko Date: Mon Aug 15 19:12:15 2022 +0000 make sure arguments are tuples otherwise they won't be hashable (#83342) make sure arguments are tuples otherwise they won't be hashable if used in autograd.py or any other places that uses dictionaries for that matter Pull Request resolved: https://github.com/pytorch/pytorch/pull/83342 Approved by: https://github.com/bdhirsh, https://github.com/albanD commit a65825116a6e166d1d201acab9389986c827e422 Author: Horace He Date: Mon Aug 15 17:34:35 2022 +0000 clear cache in-between each test (#83431) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83431 Approved by: https://github.com/ezyang, https://github.com/malfet commit 1187dedd336e4f6c0028e0d081b676c2f5796316 Author: Huy Do Date: Mon Aug 15 18:40:20 2022 +0000 Check if IMPORT_DISABLED_TESTS is set (#83436) I just realize that some tests, i.e. MAC MPS https://github.com/pytorch/pytorch/runs/7842997537?check_suite_focus=true, doesn't have this IMPORT_DISABLED_TESTS set. Thus, it can be None Pull Request resolved: https://github.com/pytorch/pytorch/pull/83436 Approved by: https://github.com/clee2000 commit 2d8f091f6a193fc0e9d3c6e91bce8d66fc3f31de Author: Edward Z. Yang Date: Mon Aug 15 06:56:28 2022 -0700 Move TorchDispatchModeTLS to c10/core (#83370) I need to access it directly from TensorImpl to route directly TensorImpl induced operations to modes (upcoming PR). Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83370 Approved by: https://github.com/zou3519 commit beb83d7419bc21f9ca8881de81c8421409dd8f3a Author: Huy Do Date: Mon Aug 15 17:18:55 2022 +0000 Fix flaky tests env variable length on Windows (#83426) We are currently keeping all flaky tests in a single env variable and this breaks Windows CI because the upper limit of a single variable there is only 32767 chars, i.e. 
https://github.com/pytorch/pytorch/runs/7840599767 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83426 Approved by: https://github.com/janeyx99 commit 03061472768c11afe8a3a822458ac36dc1130112 Author: Horace He Date: Mon Aug 15 02:44:25 2022 +0000 Fix issue with compiling under with_grad (#83395) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83395 Approved by: https://github.com/jansel commit ff5fe9e62284cb0a3ca6976c40978c9022c4503f Author: Jeff Daily Date: Mon Aug 15 16:04:09 2022 +0000 [ROCm] enable jiterator (#77982) Enables jiterator for ROCm builds. This includes necessary porting when hiprtc and nvrtc behavior differed. This also ported ROCm versus CUDA differences w.r.t. MAX_DIMS and NUM_THREADS from the non-jiterator code paths into jiterator. CI with ciflow/trunk label to force running ROCm workflows that are currently trunk-only. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77982 Approved by: https://github.com/ngimel commit 316cb8a06a9860b7540d4032314005e7afd936aa Author: Mor Tzur Date: Mon Aug 15 15:08:55 2022 +0000 embedded_interpreter_hip (#83329) Summary: Adding embedded_interpreter_hip and deps to enable torch::deploy on AMD. Test Plan: Sandcastle Reviewed By: zrphercule Differential Revision: D38546701 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83329 Approved by: https://github.com/jfix71 commit 1bf23713658a702bd177f0e3478baa8269c3f120 Author: chengscott <60510scott@gmail.com> Date: Mon Aug 15 14:47:17 2022 +0000 Rename path on Windows from lib/x64 to lib\x64 (#83417) Use `os.path.join` to join path Pull Request resolved: https://github.com/pytorch/pytorch/pull/83417 Approved by: https://github.com/ezyang commit 50b1ecc28f9b8cf8c560f9ce087c87c4b18a41fb Author: kshitij12345 Date: Mon Aug 15 14:31:57 2022 +0000 [fix] cat : support different dtype tensor with 0-dim like before (#83391) Fixes: https://github.com/pytorch/pytorch/issues/82457 TODO: * [x] Add test (new test also passes on PyTorch version 1.11) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83391 Approved by: https://github.com/ezyang commit d4bd88b64be3d7156cf8617e33faae8eca307ce4 Author: Jesse Cai Date: Fri Aug 12 12:12:57 2022 -0700 [Quant][fx] Remove WEIGHT_INDEX_DICT and BIAS_INDEX_DICT (#83263) Summary: This change adds in input_type_to_index mappings to the backend patterns for `nn.functional.linear`, `nn.functional.conv1d`, `nn.functional.conv1d`, and `nn.functional.conv3d`. This let's us remove `WEIGHT_INDEX_DICT` and `BIAS_INDEX_DICT` from `prepare.py`. 
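Roughly, the shift described here (and continued just below) looks like the following sketch; the dict shapes and helper are illustrative, not the exact torch.ao.quantization BackendConfig API:

```python
from torch.nn import functional as F

# Before: global, per-op tables living in prepare.py (illustrative contents).
WEIGHT_INDEX_DICT = {F.linear: [1], F.conv2d: [1]}

# After: each backend pattern config carries its own argument-index mapping.
linear_pattern_config = {
    "pattern": F.linear,
    "input_type_to_index": {"weight": 1, "bias": 2},  # F.linear(x, weight, bias)
}

def is_weight_arg(pattern_config: dict, arg_index: int) -> bool:
    # Consult the pattern's own metadata instead of a global table.
    return pattern_config.get("input_type_to_index", {}).get("weight") == arg_index

assert is_weight_arg(linear_pattern_config, 1)
```

Keeping the index mapping next to each pattern means the prepare logic no longer needs hard-coded knowledge about every op.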
Instead we pass around `backend_config` and check wether an arg is weight/bias agains that config Test Plan: ``` python test/test_quantization.py TestQuantizeFx python test/test_quantization.py TestQuantizeFxOps ``` Reviewers: @andrewor14 Subscribers: Tasks: Tags: quant, fx Differential Revision: [D38705516](https://our.internmc.facebook.com/intern/diff/D38705516) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83263 Approved by: https://github.com/andrewor14 --- .bazelrc | 2 +- .circleci/README.md | 468 + .../cimodel/data/simple/ios_definitions.py | 48 +- .../cimodel/data/simple/macos_definitions.py | 18 +- .circleci/cimodel/data/simple/nightly_ios.py | 8 +- .../data/simple/util/branch_filters.py | 11 + .../cimodel/data/simple/util/versions.py | 14 +- .circleci/config.yml | 289 +- .circleci/docker/build.sh | 127 +- .circleci/docker/centos-rocm/Dockerfile | 4 + .circleci/docker/common/install_base.sh | 9 +- .circleci/docker/common/install_conda.sh | 19 +- .circleci/docker/common/install_cudnn.sh | 8 +- .circleci/docker/common/install_docs_reqs.sh | 4 +- .circleci/docker/common/install_protobuf.sh | 2 +- .circleci/docker/common/install_rocm.sh | 54 +- .circleci/docker/common/install_rocm_magma.sh | 29 + .circleci/docker/common/install_ucc.sh | 9 +- .circleci/docker/requirements-ci.txt | 28 +- .circleci/docker/ubuntu-cuda/Dockerfile | 6 + .circleci/docker/ubuntu-rocm/Dockerfile | 4 + .circleci/docker/ubuntu/Dockerfile | 1 + .circleci/generate_config_yml.py | 6 + .circleci/scripts/binary_install_miniconda.sh | 4 +- .circleci/scripts/binary_ios_build.sh | 2 +- .circleci/scripts/binary_ios_test.sh | 21 +- .circleci/scripts/binary_ios_upload.sh | 4 +- .circleci/scripts/binary_populate_env.sh | 11 +- .circleci/scripts/binary_upload.sh | 24 +- .circleci/scripts/driver_update.bat | 2 +- .../scripts/functorch_doc_push_script.sh | 47 + .circleci/scripts/python_doc_push_script.sh | 3 + .circleci/scripts/setup_ci_environment.sh | 6 +- .../scripts/setup_linux_system_environment.sh | 2 +- .circleci/scripts/vs_install.ps1 | 2 +- .circleci/scripts/vs_install_cmath.ps1 | 2 +- .circleci/scripts/windows_cudnn_install.sh | 4 +- .../job-specs/job-specs-custom.yml | 266 +- .github/ISSUE_TEMPLATE/ci-sev.md | 2 + .github/PULL_REQUEST_TEMPLATE.md | 2 +- .github/actionlint.yaml | 3 + .github/actions/build-android/action.yml | 2 +- .../actions/calculate-docker-image/action.yml | 12 +- .../download-build-artifacts/action.yml | 2 +- .../actions/filter-test-configs/action.yml | 62 + .../actions/get-workflow-job-id/action.yml | 4 +- .github/actions/pull-docker-image/action.yml | 23 - .github/actions/setup-rocm/action.yml | 7 +- .github/actions/setup-ssh/action.yml | 17 - .github/actions/setup-win/action.yml | 10 +- .github/actions/teardown-linux/action.yml | 28 - .../actions/test-pytorch-binary/action.yml | 1 - .../actions/upload-test-artifacts/action.yml | 26 +- .github/auto_request_review.yml | 29 + .github/ci_commit_pins/huggingface.txt | 1 + .github/ci_commit_pins/text.txt | 1 + .github/ci_commit_pins/timm.txt | 1 + .github/ci_commit_pins/torchbench.txt | 1 + .github/ci_commit_pins/torchdynamo.txt | 1 - .github/ci_commit_pins/triton.txt | 1 + .github/ci_commit_pins/vision.txt | 2 +- .github/ci_commit_pins/xla.txt | 2 +- .github/labeler.yml | 51 + .github/merge_rules.json | 302 - .github/merge_rules.yaml | 374 + .github/requirements-gha-cache.txt | 18 + .github/requirements/README.md | 24 + .github/requirements/conda-env-Linux-X64 | 10 + .github/requirements/conda-env-macOS-ARM64 | 20 + 
.github/requirements/conda-env-macOS-X64 | 18 + .../requirements/pip-requirements-macOS.txt | 22 + .github/scale-config.yml | 69 - .github/scripts/README.md | 4 +- .../scripts/build_publish_nightly_docker.sh | 44 - .github/scripts/build_triton_wheel.py | 51 + .github/scripts/check_labels.py | 87 + .github/scripts/comment_on_pr.py | 34 + .github/scripts/ensure_actions_will_cancel.py | 20 +- .github/scripts/fetch_latest_green_commit.py | 4 +- .github/scripts/filter_test_configs.py | 207 + .../scripts/generate_binary_build_matrix.py | 41 +- .github/scripts/generate_ci_workflows.py | 15 +- .github/scripts/generate_pytorch_version.py | 31 +- .github/scripts/gql_mocks.json | 39801 +++++++++++----- .github/scripts/install_nvidia_utils_linux.sh | 57 - .github/scripts/parse_ref.py | 24 +- .github/scripts/pr-sanity-check.sh | 60 + .github/scripts/process_commit.py | 106 - .github/scripts/run_torchbench.py | 42 +- .github/scripts/test_check_labels.py | 77 + .../scripts/test_fetch_latest_green_commit.py | 5 +- .github/scripts/test_filter_test_configs.py | 118 + .github/scripts/test_trymerge.py | 45 +- .github/scripts/trymerge.py | 373 +- .github/scripts/trymerge_explainer.py | 93 +- .github/scripts/tryrebase.py | 1 + .github/scripts/update_commit_hashes.py | 1 + .github/scripts/wait_for_ssh_to_drain.sh | 13 - .github/templates/common.yml.j2 | 320 +- .../linux_binary_build_workflow.yml.j2 | 7 +- .../macos_binary_build_workflow.yml.j2 | 27 +- .../windows_binary_build_workflow.yml.j2 | 4 +- .github/workflows/_android-build-test.yml | 14 +- .../workflows/_android-full-build-test.yml | 52 +- .github/workflows/_bazel-build-test.yml | 14 +- .github/workflows/_binary-build-linux.yml | 59 +- .github/workflows/_binary-test-linux.yml | 53 +- .github/workflows/_binary-upload.yml | 66 +- .github/workflows/_buck-build-test.yml | 27 +- .github/workflows/_docs.yml | 67 +- .github/workflows/_ios-build-test.yml | 76 +- .github/workflows/_linux-build.yml | 65 +- .github/workflows/_linux-test.yml | 81 +- .github/workflows/_mac-build.yml | 80 +- .github/workflows/_mac-test-mps.yml | 42 +- .github/workflows/_mac-test.yml | 91 +- .github/workflows/_rocm-test.yml | 38 +- .github/workflows/_run_android_tests.yml | 20 +- .github/workflows/_update-commit-hash.yml | 2 +- .github/workflows/_win-build.yml | 43 +- .github/workflows/_win-test.yml | 40 +- .github/workflows/auto_request_review.yml | 22 + .github/workflows/build-triton-wheel.yml | 149 + .github/workflows/check-labels.yml | 44 + .github/workflows/docker-builds.yml | 16 +- .github/workflows/docker-release.yml | 110 + .../generated-linux-binary-conda-nightly.yml | 480 - ...inux-binary-libtorch-cxx11-abi-nightly.yml | 692 +- ...linux-binary-libtorch-pre-cxx11-master.yml | 18 +- ...inux-binary-libtorch-pre-cxx11-nightly.yml | 692 +- ...enerated-linux-binary-manywheel-master.yml | 22 +- ...nerated-linux-binary-manywheel-nightly.yml | 1021 +- ...rated-macos-arm64-binary-conda-nightly.yml | 54 +- ...rated-macos-arm64-binary-wheel-nightly.yml | 158 +- .../generated-macos-binary-conda-nightly.yml | 72 +- ...acos-binary-libtorch-cxx11-abi-nightly.yml | 92 +- ...acos-binary-libtorch-pre-cxx11-nightly.yml | 92 +- .../generated-macos-binary-wheel-nightly.yml | 72 +- ...generated-windows-binary-conda-nightly.yml | 1246 +- ...d-windows-binary-libtorch-debug-master.yml | 4 +- ...-windows-binary-libtorch-debug-nightly.yml | 1004 +- ...windows-binary-libtorch-release-master.yml | 4 +- ...indows-binary-libtorch-release-nightly.yml | 1004 +- .../generated-windows-binary-wheel-master.yml | 
236 - ...generated-windows-binary-wheel-nightly.yml | 1246 +- .github/workflows/inductor.yml | 41 + .github/workflows/labeler.yml | 20 + .github/workflows/lint.yml | 154 +- .github/workflows/mac-mps.yml | 4 + .github/workflows/nightly.yml | 9 + .github/workflows/periodic.yml | 172 +- .github/workflows/pr-labels.yml | 32 - .github/workflows/pull.yml | 209 +- .../workflows/push_nightly_docker_ghcr.yml | 39 - .github/workflows/revert.yml | 27 +- .github/workflows/run_torchbench.yml | 38 +- .github/workflows/scorecards.yml | 55 + .github/workflows/trunk.yml | 199 +- .github/workflows/trymerge.yml | 38 +- .github/workflows/tryrebase.yml | 28 +- .github/workflows/update-commit-hashes.yml | 37 - .github/workflows/update-viablestrict.yml | 27 +- .github/workflows/update_pytorch_labels.yml | 2 +- .github/workflows/update_s3_htmls.yml | 2 +- .github/workflows/upload-test-stats.yml | 31 +- .github/workflows/weekly.yml | 19 + .gitignore | 15 +- .gitmodules | 6 + .jenkins/caffe2/bench.sh | 54 - .jenkins/caffe2/build.sh | 231 - .jenkins/caffe2/dirty.sh | 7 - .jenkins/caffe2/test.sh | 7 +- .jenkins/pytorch/build-asan.sh | 2 +- .jenkins/pytorch/build-tsan.sh | 29 + .jenkins/pytorch/build.sh | 58 +- .jenkins/pytorch/common.sh | 22 - .jenkins/pytorch/common_utils.sh | 102 +- .jenkins/pytorch/dirty.sh | 9 - .jenkins/pytorch/macos-build.sh | 6 +- .jenkins/pytorch/macos-common.sh | 46 - .jenkins/pytorch/macos-test.sh | 25 - .jenkins/pytorch/multigpu-test.sh | 13 +- .jenkins/pytorch/test.sh | 251 +- .../win-test-helpers/build_pytorch.bat | 13 +- .../install_test_functorch.bat | 9 - .../activate_miniconda3.bat | 2 +- .../installation-helpers/install_magma.bat | 2 +- .../installation-helpers/install_mkl.bat | 2 +- .../installation-helpers/install_sccache.bat | 4 +- .../win-test-helpers/setup_pytorch_env.bat | 3 +- .jenkins/pytorch/win-test.sh | 4 - .lintrunner.toml | 50 +- BUILD.bazel | 44 +- CITATION | 10 - CITATION.cff | 73 + CMakeLists.txt | 118 +- CODEOWNERS | 57 +- CONTRIBUTING.md | 126 +- Dockerfile | 33 +- MANIFEST.in | 1 + Makefile | 4 + README.md | 16 +- RELEASE.md | 8 +- WORKSPACE | 11 +- android/gradle.properties | 2 +- .../src/main/cpp/pytorch_jni_common.cpp | 2 +- .../src/main/cpp/pytorch_jni_jit.cpp | 12 +- .../src/main/cpp/pytorch_jni_lite.cpp | 12 +- aten/CMakeLists.txt | 4 +- aten/src/ATen/ATen.h | 4 + aten/src/ATen/BatchedTensorImpl.cpp | 2 +- aten/src/ATen/BatchingRegistrations.cpp | 33 +- aten/src/ATen/CMakeLists.txt | 32 +- aten/src/ATen/Context.cpp | 39 +- aten/src/ATen/Context.h | 27 + aten/src/ATen/DLConvertor.cpp | 25 +- aten/src/ATen/DeviceGuard.h | 3 +- aten/src/ATen/Dispatch.h | 18 +- aten/src/ATen/EmptyTensor.cpp | 99 +- aten/src/ATen/EmptyTensor.h | 33 +- aten/src/ATen/ExpandUtils.cpp | 10 +- aten/src/ATen/ExpandUtils.h | 36 +- aten/src/ATen/FunctionalInverses.cpp | 65 +- aten/src/ATen/FunctionalStorageImpl.cpp | 90 +- aten/src/ATen/FunctionalStorageImpl.h | 68 +- aten/src/ATen/FunctionalTensorWrapper.cpp | 167 +- aten/src/ATen/FunctionalTensorWrapper.h | 62 +- aten/src/ATen/FunctionalizeFallbackKernel.cpp | 16 +- aten/src/ATen/InferSize.h | 21 +- aten/src/ATen/NamedTensorUtils.cpp | 6 +- aten/src/ATen/NamedTensorUtils.h | 3 +- aten/src/ATen/NestedTensorImpl.cpp | 149 +- aten/src/ATen/NestedTensorImpl.h | 95 +- aten/src/ATen/NumericUtils.h | 3 +- aten/src/ATen/OpaqueTensorImpl.h | 6 +- aten/src/ATen/PadNd.h | 28 + aten/src/ATen/Parallel.h | 1 + aten/src/ATen/PythonTorchFunctionTLS.cpp | 24 +- aten/src/ATen/PythonTorchFunctionTLS.h | 18 +- aten/src/ATen/SavedTensorHooks.cpp | 56 +- 
aten/src/ATen/SavedTensorHooks.h | 34 +- aten/src/ATen/SparseCsrTensorImpl.cpp | 142 +- aten/src/ATen/SparseCsrTensorImpl.h | 11 +- aten/src/ATen/SparseCsrTensorUtils.h | 52 + aten/src/ATen/SparseTensorImpl.cpp | 18 +- aten/src/ATen/SparseTensorImpl.h | 43 +- aten/src/ATen/TensorGeometry.cpp | 11 +- aten/src/ATen/TensorGeometry.h | 72 +- aten/src/ATen/TensorIndexing.h | 57 +- aten/src/ATen/TensorIterator.cpp | 16 +- aten/src/ATen/TensorIterator.h | 11 +- aten/src/ATen/TensorMeta.h | 1 + aten/src/ATen/TensorSubclassLikeUtils.h | 32 +- aten/src/ATen/TensorUtils.cpp | 46 +- aten/src/ATen/TensorUtils.h | 14 + aten/src/ATen/ThreadLocalState.cpp | 21 +- aten/src/ATen/ThreadLocalState.h | 19 +- aten/src/ATen/Utils.h | 53 - aten/src/ATen/VmapTransforms.cpp | 3 +- aten/src/ATen/VmapTransforms.h | 3 +- aten/src/ATen/WrapDimUtils.h | 90 +- aten/src/ATen/autocast_mode.cpp | 627 +- aten/src/ATen/autocast_mode.h | 32 + aten/src/ATen/core/ATen_fwd.h | 1 + aten/src/ATen/core/Formatting.cpp | 42 +- aten/src/ATen/core/Formatting.h | 4 +- aten/src/ATen/core/IListRef.h | 14 +- aten/src/ATen/core/IListRef_inl.h | 8 +- aten/src/ATen/core/IListRef_test.cpp | 14 +- aten/src/ATen/core/List_test.cpp | 4 +- aten/src/ATen/core/NamedRegistrations.cpp | 2 - aten/src/ATen/core/PhiloxRNGEngine.h | 1 - aten/src/ATen/core/PythonFallbackKernel.cpp | 29 +- aten/src/ATen/core/PythonFallbackKernel.h | 2 +- .../core/PythonOpRegistrationTrampoline.cpp | 28 + .../core/PythonOpRegistrationTrampoline.h | 18 + aten/src/ATen/core/TensorAccessor.h | 2 +- aten/src/ATen/core/TensorBase.h | 58 + aten/src/ATen/core/TorchDispatchModeTLS.cpp | 58 - aten/src/ATen/core/TorchDispatchModeTLS.h | 25 - aten/src/ATen/core/TorchDispatchUtils.cpp | 31 + aten/src/ATen/core/TorchDispatchUtils.h | 17 + aten/src/ATen/core/Variadic.h | 9 + aten/src/ATen/core/boxing/KernelFunction.h | 59 +- .../ATen/core/boxing/KernelFunction_impl.h | 67 +- .../impl/kernel_function_legacy_test.cpp | 10 +- .../core/boxing/impl/kernel_function_test.cpp | 4 +- .../boxing/impl/kernel_lambda_legacy_test.cpp | 10 +- .../core/boxing/impl/kernel_lambda_test.cpp | 4 +- .../impl/make_boxed_from_unboxed_functor.h | 30 +- .../make_boxed_from_unboxed_functor_test.cpp | 6 +- aten/src/ATen/core/class_type.cpp | 4 +- aten/src/ATen/core/custom_class.cpp | 1 + .../ATen/core/dispatch/DispatchKeyExtractor.h | 11 +- aten/src/ATen/core/dispatch/Dispatcher.cpp | 51 +- aten/src/ATen/core/dispatch/Dispatcher.h | 87 +- aten/src/ATen/core/dispatch/OperatorEntry.cpp | 90 +- aten/src/ATen/core/dispatch/OperatorEntry.h | 19 +- aten/src/ATen/core/dynamic_type.cpp | 4 - aten/src/ATen/core/dynamic_type.h | 3 +- aten/src/ATen/core/function_schema.cpp | 31 + aten/src/ATen/core/function_schema.h | 27 +- aten/src/ATen/core/interned_strings.h | 4 + aten/src/ATen/core/ivalue.cpp | 14 + aten/src/ATen/core/ivalue.h | 102 +- aten/src/ATen/core/ivalue_inl.h | 73 +- aten/src/ATen/core/jit_type.h | 184 +- aten/src/ATen/core/jit_type_base.h | 2 + aten/src/ATen/core/library.cpp | 66 +- aten/src/ATen/core/op_registration/adaption.h | 2 +- .../core/op_registration/infer_schema.cpp | 2 +- .../ATen/core/op_registration/infer_schema.h | 8 +- .../op_registration/op_registration_test.cpp | 12 +- aten/src/ATen/core/type.cpp | 5 + aten/src/ATen/cpp_custom_type_hack.h | 8 +- aten/src/ATen/cpu/vec/vec256/vec256.h | 45 + .../src/ATen/cpu/vec/vec256/vec256_bfloat16.h | 37 + aten/src/ATen/cpu/vec/vec256/vec256_double.h | 5 + aten/src/ATen/cpu/vec/vec256/vec256_float.h | 5 + .../ATen/cpu/vec/vec256/vec256_float_neon.h | 7 + 
aten/src/ATen/cpu/vec/vec256/vec256_int.h | 587 + aten/src/ATen/cpu/vec/vec256/vec256_qint.h | 78 + .../cpu/vec/vec256/vsx/vec256_float_vsx.h | 224 +- aten/src/ATen/cpu/vec/vec512/vec512.h | 50 + .../src/ATen/cpu/vec/vec512/vec512_bfloat16.h | 45 +- aten/src/ATen/cpu/vec/vec512/vec512_double.h | 5 + aten/src/ATen/cpu/vec/vec512/vec512_float.h | 5 + aten/src/ATen/cpu/vec/vec512/vec512_int.h | 481 + aten/src/ATen/cpu/vec/vec512/vec512_qint.h | 72 + aten/src/ATen/cpu/vec/vec_base.h | 50 +- aten/src/ATen/cuda/Atomic.cuh | 24 +- aten/src/ATen/cuda/CUDABlas.cpp | 63 +- aten/src/ATen/cuda/CUDABlas.h | 18 +- aten/src/ATen/cuda/CUDAContext.h | 2 + aten/src/ATen/cuda/CUDADataType.h | 8 +- aten/src/ATen/cuda/CUDAEvent.h | 12 +- aten/src/ATen/cuda/CUDAGeneratorImpl.cpp | 5 +- aten/src/ATen/cuda/CUDAGeneratorImpl.h | 15 +- aten/src/ATen/cuda/CUDAGraph.cpp | 41 +- aten/src/ATen/cuda/CUDAGraph.h | 1 + aten/src/ATen/cuda/CUDASparse.h | 15 +- aten/src/ATen/cuda/CUDASparseDescriptors.cpp | 8 +- aten/src/ATen/cuda/CUDASparseDescriptors.h | 15 +- aten/src/ATen/cuda/CublasHandlePool.cpp | 58 + aten/src/ATen/cuda/PeerToPeerAccess.cpp | 37 +- aten/src/ATen/cuda/detail/CUDAHooks.cpp | 17 +- aten/src/ATen/cuda/detail/KernelUtils.h | 1 + .../ATen/cuda/detail/PhiloxCudaStateRaw.cuh | 8 +- aten/src/ATen/cuda/detail/UnpackRaw.cuh | 4 +- aten/src/ATen/cuda/jiterator.h | 2 +- aten/src/ATen/cuda/jiterator_impl.h | 30 +- aten/src/ATen/cuda/llvm_complex.cpp | 28 +- aten/src/ATen/cudnn/Descriptors.cpp | 2 +- aten/src/ATen/cudnn/Descriptors.h | 13 +- aten/src/ATen/cudnn/Utils.h | 2 +- aten/src/ATen/detail/FunctionTraits.h | 24 + .../src/ATen/functorch}/ADInterpreters.cpp | 72 +- .../src/ATen/functorch}/ADInterpreters.h | 16 +- .../ATen/functorch}/BatchRulesActivation.cpp | 6 +- .../ATen/functorch}/BatchRulesBinaryOps.cpp | 51 +- .../ATen/functorch}/BatchRulesConvolution.cpp | 124 +- .../functorch}/BatchRulesDecompositions.cpp | 58 +- .../src/ATen/functorch}/BatchRulesDynamic.cpp | 15 +- .../src/ATen/functorch}/BatchRulesFactory.cpp | 51 +- .../src/ATen/functorch}/BatchRulesHelper.cpp | 27 +- .../src/ATen/functorch}/BatchRulesHelper.h | 40 +- .../functorch}/BatchRulesLinearAlgebra.cpp | 238 +- .../src/ATen/functorch}/BatchRulesLoss.cpp | 20 +- .../src/ATen/functorch}/BatchRulesModules.cpp | 64 +- .../src/ATen/functorch}/BatchRulesNorm.cpp | 60 +- .../src/ATen/functorch}/BatchRulesPooling.cpp | 8 +- .../ATen/functorch}/BatchRulesRandomness.cpp | 82 +- .../ATen/functorch}/BatchRulesReduceOps.cpp | 17 +- .../ATen/functorch}/BatchRulesScatterOps.cpp | 21 +- .../ATen/functorch}/BatchRulesUnaryOps.cpp | 7 +- .../src/ATen/functorch}/BatchRulesViews.cpp | 131 +- .../src/ATen/functorch}/BatchedFallback.cpp | 19 +- .../src/ATen/functorch}/BatchedFallback.h | 33 +- .../src/ATen/functorch}/BatchedTensorImpl.cpp | 76 +- .../src/ATen/functorch}/BatchedTensorImpl.h | 24 +- .../ATen/functorch}/BatchingMetaprogramming.h | 8 + .../src/ATen/functorch}/DynamicLayer.cpp | 179 +- aten/src/ATen/functorch/DynamicLayer.h | 131 + .../functorch}/FunctionalizeInterpreter.cpp | 7 +- .../functorch}/FunctionalizeInterpreter.h | 7 +- .../src/ATen/functorch}/Interpreter.cpp | 34 +- .../src/ATen/functorch}/Interpreter.h | 21 +- .../LegacyBatchingRegistrations.cpp | 233 +- .../ATen/functorch}/LegacyVmapTransforms.cpp | 23 +- .../ATen/functorch}/LegacyVmapTransforms.h | 15 +- aten/src/ATen/functorch/Macros.h | 3 + .../src/ATen/functorch}/PlumbingHelper.cpp | 8 +- aten/src/ATen/functorch/PlumbingHelper.h | 61 + .../ATen/functorch}/PyTorchOperatorHacks.cpp | 24 
+- .../src/ATen/functorch}/TensorWrapper.cpp | 22 +- aten/src/ATen/functorch/TensorWrapper.h | 97 + .../src/ATen/functorch}/VmapInterpreter.cpp | 9 +- .../src/ATen/functorch}/VmapInterpreter.h | 7 +- .../ATen/functorch}/VmapModeRegistrations.cpp | 14 +- aten/src/ATen/jit_macros.h | 7 - aten/src/ATen/jiterator_macros.h | 4 +- aten/src/ATen/miopen/Descriptors.h | 2 +- aten/src/ATen/miopen/Utils.h | 2 +- aten/src/ATen/mkl/SparseBlas.cpp | 2 +- aten/src/ATen/mps/EmptyTensor.cpp | 1 + aten/src/ATen/mps/IndexKernels.h | 181 + aten/src/ATen/mps/MPSAllocator.h | 309 +- aten/src/ATen/mps/MPSAllocator.mm | 646 +- aten/src/ATen/mps/MPSDevice.h | 15 + aten/src/ATen/mps/MPSDevice.mm | 58 +- aten/src/ATen/mps/MPSFallback.mm | 19 +- aten/src/ATen/mps/MPSGuardImpl.h | 8 +- aten/src/ATen/mps/MPSGuardImpl.mm | 4 +- aten/src/ATen/mps/MPSStream.h | 28 +- aten/src/ATen/mps/MPSStream.mm | 106 +- aten/src/ATen/native/Activation.cpp | 72 +- aten/src/ATen/native/Activation.h | 2 + .../ATen/native/AdaptiveAveragePooling.cpp | 40 +- .../ATen/native/AdaptiveAveragePooling3d.cpp | 32 +- aten/src/ATen/native/AdaptiveMaxPooling2d.cpp | 11 +- aten/src/ATen/native/AdaptiveMaxPooling3d.cpp | 19 +- aten/src/ATen/native/AdaptivePooling.h | 5 +- aten/src/ATen/native/AffineGridGenerator.cpp | 14 +- aten/src/ATen/native/AutogradComposite.cpp | 15 +- aten/src/ATen/native/AveragePool2d.cpp | 12 +- aten/src/ATen/native/AveragePool3d.cpp | 13 +- aten/src/ATen/native/BatchLinearAlgebra.cpp | 573 +- aten/src/ATen/native/BatchLinearAlgebra.h | 7 +- .../ATen/native/BatchLinearAlgebraKernel.cpp | 127 +- aten/src/ATen/native/Batching.cpp | 1 + aten/src/ATen/native/BinaryOps.cpp | 160 +- aten/src/ATen/native/Blas.cpp | 25 +- aten/src/ATen/native/BlasKernel.cpp | 6 +- aten/src/ATen/native/Bucketization.cpp | 9 +- aten/src/ATen/native/CPUBlas.cpp | 1 + aten/src/ATen/native/CPUFallback.cpp | 10 +- aten/src/ATen/native/CPUFallback.h | 23 +- aten/src/ATen/native/ChanelShuffle.cpp | 15 +- aten/src/ATen/native/Col2Im.cpp | 52 +- aten/src/ATen/native/ComparisonUtils.cpp | 32 + aten/src/ATen/native/ComplexHelper.h | 40 +- aten/src/ATen/native/ConvUtils.h | 132 +- aten/src/ATen/native/Convolution.cpp | 1048 +- aten/src/ATen/native/ConvolutionMM2d.cpp | 17 +- aten/src/ATen/native/ConvolutionMM3d.cpp | 16 +- aten/src/ATen/native/ConvolutionMM3d.h | 2 +- aten/src/ATen/native/ConvolutionTBC.cpp | 14 +- aten/src/ATen/native/Copy.cpp | 83 +- aten/src/ATen/native/Correlation.cpp | 30 +- aten/src/ATen/native/Cross.cpp | 43 +- aten/src/ATen/native/DilatedMaxPool2d.cpp | 16 +- aten/src/ATen/native/DilatedMaxPool3d.cpp | 14 +- aten/src/ATen/native/DispatchStub.cpp | 2 + aten/src/ATen/native/DispatchStub.h | 7 +- aten/src/ATen/native/Distance.cpp | 34 +- aten/src/ATen/native/DistributionTemplates.h | 12 +- aten/src/ATen/native/Distributions.cpp | 42 +- aten/src/ATen/native/Dropout.cpp | 38 +- aten/src/ATen/native/Embedding.cpp | 69 +- aten/src/ATen/native/EmbeddingBag.cpp | 69 +- aten/src/ATen/native/EmbeddingBag.h | 3 +- aten/src/ATen/native/Fill.cpp | 19 +- aten/src/ATen/native/ForeachOpsKernels.cpp | 80 +- aten/src/ATen/native/ForeachUtils.h | 40 + aten/src/ATen/native/FractionalMaxPool2d.cpp | 14 +- aten/src/ATen/native/FractionalMaxPool3d.cpp | 16 +- aten/src/ATen/native/GatedLinearUnit.cpp | 17 +- aten/src/ATen/native/GridSampler.cpp | 30 +- aten/src/ATen/native/GridSamplerUtils.h | 2 +- aten/src/ATen/native/Histogram.cpp | 22 +- aten/src/ATen/native/Histogram.h | 2 - aten/src/ATen/native/Im2Col.cpp | 77 +- aten/src/ATen/native/IndexKernel.h | 1 + 
aten/src/ATen/native/IndexingUtils.cpp | 13 +- aten/src/ATen/native/IndexingUtils.h | 12 +- aten/src/ATen/native/Integration.cpp | 17 +- aten/src/ATen/native/Itertools.cpp | 19 +- aten/src/ATen/native/Lerp.cpp | 9 + aten/src/ATen/native/Lerp.h | 27 + aten/src/ATen/native/Linear.cpp | 429 +- aten/src/ATen/native/LinearAlgebra.cpp | 150 +- aten/src/ATen/native/LinearAlgebraUtils.h | 3 +- aten/src/ATen/native/Loss.cpp | 92 +- aten/src/ATen/native/LossCTC.cpp | 95 +- aten/src/ATen/native/LossMulti.h | 8 +- aten/src/ATen/native/LossMultiLabelMargin.cpp | 15 +- aten/src/ATen/native/LossMultiMargin.cpp | 14 +- aten/src/ATen/native/LossNLL.cpp | 86 +- aten/src/ATen/native/LossNLL2d.cpp | 28 +- .../src/ATen/native/MathBitFallThroughLists.h | 1 - aten/src/ATen/native/MathBitsFallback.h | 9 +- aten/src/ATen/native/MaxPooling.cpp | 33 +- aten/src/ATen/native/MaxUnpooling.cpp | 21 +- aten/src/ATen/native/Memory.cpp | 13 +- aten/src/ATen/native/MetaTensor.cpp | 28 +- aten/src/ATen/native/NNPACK.cpp | 17 +- .../native/NaiveConvolutionTranspose2d.cpp | 15 +- .../native/NaiveConvolutionTranspose3d.cpp | 16 +- .../ATen/native/NaiveDilatedConvolution.cpp | 15 +- aten/src/ATen/native/NamedTensor.cpp | 28 +- aten/src/ATen/native/NegateFallback.cpp | 1 + aten/src/ATen/native/NonSymbolicBC.h | 27 + aten/src/ATen/native/Normalization.cpp | 258 +- aten/src/ATen/native/Onehot.cpp | 12 +- aten/src/ATen/native/PackedSequence.cpp | 27 +- aten/src/ATen/native/PadNd.cpp | 73 +- aten/src/ATen/native/PadNd.h | 22 - aten/src/ATen/native/PixelShuffle.cpp | 36 +- aten/src/ATen/native/PointwiseOps.cpp | 15 +- aten/src/ATen/native/Pool.h | 41 +- aten/src/ATen/native/Pooling.cpp | 27 +- aten/src/ATen/native/Pow.cpp | 15 +- aten/src/ATen/native/QuantizedLinear.cpp | 26 +- aten/src/ATen/native/README.md | 36 +- aten/src/ATen/native/RNN.cpp | 70 +- aten/src/ATen/native/RNN.h | 2 +- aten/src/ATen/native/RangeFactories.cpp | 16 +- aten/src/ATen/native/ReduceAllOps.cpp | 28 +- aten/src/ATen/native/ReduceOps.cpp | 168 +- aten/src/ATen/native/ReduceOpsUtils.h | 2 +- aten/src/ATen/native/ReflectionPad.cpp | 37 +- aten/src/ATen/native/Repeat.cpp | 30 +- aten/src/ATen/native/ReplicationPadding.cpp | 19 +- aten/src/ATen/native/Resize.cpp | 11 +- aten/src/ATen/native/Resize.h | 42 +- aten/src/ATen/native/ResizeCommon.h | 5 +- aten/src/ATen/native/RowwisePrune.cpp | 11 +- aten/src/ATen/native/Scalar.cpp | 12 +- aten/src/ATen/native/SegmentReduce.cpp | 15 +- aten/src/ATen/native/SobolEngineOps.cpp | 16 +- aten/src/ATen/native/SobolEngineOpsUtils.cpp | 1 + aten/src/ATen/native/SobolEngineOpsUtils.h | 10 +- aten/src/ATen/native/SoftMax.cpp | 92 +- aten/src/ATen/native/Sorting.cpp | 38 +- aten/src/ATen/native/SpectralOps.cpp | 168 +- aten/src/ATen/native/SpmmReduce.cpp | 32 - aten/src/ATen/native/SpmmReduce.h | 12 - aten/src/ATen/native/SummaryOps.cpp | 21 +- .../ATen/native/TensorAdvancedIndexing.cpp | 232 +- aten/src/ATen/native/TensorAdvancedIndexing.h | 51 +- .../ATen/native/TensorAdvancedIndexingUtils.h | 10 +- aten/src/ATen/native/TensorCompare.cpp | 109 +- aten/src/ATen/native/TensorConversions.cpp | 968 +- aten/src/ATen/native/TensorConversions.h | 2 +- aten/src/ATen/native/TensorDimApply.h | 3 +- aten/src/ATen/native/TensorFactories.cpp | 154 +- aten/src/ATen/native/TensorFactories.h | 5 +- aten/src/ATen/native/TensorIteratorReduce.cpp | 11 +- aten/src/ATen/native/TensorProperties.cpp | 33 +- aten/src/ATen/native/TensorShape.cpp | 979 +- aten/src/ATen/native/TensorShape.h | 15 +- .../src/ATen/native/TensorTransformations.cpp | 21 +- 
aten/src/ATen/native/TestOps.cpp | 19 +- aten/src/ATen/native/TriangularOps.cpp | 30 +- aten/src/ATen/native/TriangularOpsUtils.h | 2 +- aten/src/ATen/native/TypeProperties.cpp | 26 +- aten/src/ATen/native/UnaryOps.cpp | 198 +- aten/src/ATen/native/Unfold2d.cpp | 1 + aten/src/ATen/native/Unfold3d.cpp | 4 +- aten/src/ATen/native/Unfold3d.h | 2 +- aten/src/ATen/native/UnfoldBackward.cpp | 6 + aten/src/ATen/native/UnfoldBackward.h | 78 +- aten/src/ATen/native/Unique.cpp | 21 +- aten/src/ATen/native/UpSample.cpp | 1 + aten/src/ATen/native/UpSample.h | 24 +- aten/src/ATen/native/UpSampleBicubic2d.cpp | 50 +- aten/src/ATen/native/UpSampleBilinear2d.cpp | 43 +- aten/src/ATen/native/UpSampleLinear1d.cpp | 27 +- aten/src/ATen/native/UpSampleNearest1d.cpp | 40 +- aten/src/ATen/native/UpSampleNearest2d.cpp | 41 +- aten/src/ATen/native/UpSampleNearest3d.cpp | 48 +- aten/src/ATen/native/UpSampleTrilinear3d.cpp | 28 +- aten/src/ATen/native/VariableMethodStubs.cpp | 20 +- aten/src/ATen/native/WeightNorm.cpp | 20 +- aten/src/ATen/native/ao_sparse/library.cpp | 1 + .../ao_sparse/quantized/cpu/fbgemm_utils.cpp | 4 +- .../ao_sparse/quantized/cpu/packed_params.h | 6 +- .../ao_sparse/quantized/cpu/qlinear.cpp | 10 +- .../quantized/cpu/qlinear_deserialize.cpp | 93 +- .../quantized/cpu/qlinear_dynamic.cpp | 17 +- .../quantized/cpu/qlinear_prepack.cpp | 13 +- .../quantized/cpu/qlinear_serialize.cpp | 37 +- .../quantized/cpu/qlinear_unpack.cpp | 12 +- aten/src/ATen/native/cpu/Activation.cpp | 114 +- aten/src/ATen/native/cpu/AtomicAddFloat.h | 6 +- aten/src/ATen/native/cpu/BinaryOpsKernel.cpp | 26 +- aten/src/ATen/native/cpu/BlasKernel.cpp | 67 +- .../ATen/native/cpu/ChannelShuffleKernel.cpp | 18 +- .../ATen/native/cpu/ChannelShuffleKernel.h | 10 +- aten/src/ATen/native/cpu/CopyKernel.cpp | 41 +- aten/src/ATen/native/cpu/CopyKernel.h | 12 + .../src/ATen/native/cpu/DepthwiseConvKernel.h | 3 +- .../src/ATen/native/cpu/DistanceOpsKernel.cpp | 3 +- .../cpu/FunctionOfAMatrixUtilsKernel.cpp | 3 +- aten/src/ATen/native/cpu/HistogramKernel.cpp | 8 +- aten/src/ATen/native/cpu/IndexKernel.cpp | 107 +- aten/src/ATen/native/cpu/LerpKernel.cpp | 134 +- aten/src/ATen/native/cpu/Loops.h | 9 +- .../ATen/native/cpu/PixelShuffleKernel.cpp | 30 +- aten/src/ATen/native/cpu/PixelShuffleKernel.h | 9 +- aten/src/ATen/native/cpu/README.md | 4 +- aten/src/ATen/native/cpu/Reduce.h | 3 +- aten/src/ATen/native/cpu/ReduceOpsKernel.cpp | 27 +- .../ATen/native/cpu/ScatterGatherKernel.cpp | 214 +- aten/src/ATen/native/cpu/SortingKernel.cpp | 3 +- aten/src/ATen/native/cpu/SparseFactories.cpp | 41 +- aten/src/ATen/native/cpu/SpmmReduceKernel.cpp | 601 +- aten/src/ATen/native/cpu/SpmmReduceKernel.h | 45 + .../ATen/native/cpu/TensorCompareKernel.cpp | 6 +- aten/src/ATen/native/cpu/UnaryOpsKernel.cpp | 20 +- aten/src/ATen/native/cpu/Unfold2d.cpp | 3 +- .../ATen/native/cpu/UnfoldBackwardKernel.cpp | 84 +- aten/src/ATen/native/cpu/UpSampleKernel.cpp | 100 +- .../ATen/native/cpu/UpSampleMoreKernel.cpp | 4 +- aten/src/ATen/native/cpu/WeightNormKernel.cpp | 68 +- aten/src/ATen/native/cpu/WeightNormKernel.h | 13 +- aten/src/ATen/native/cpu/radix_sort.h | 18 +- aten/src/ATen/native/cuda/Activation.cpp | 2 +- .../native/cuda/AdaptiveAveragePooling.cu | 15 +- .../native/cuda/AdaptiveAveragePooling3d.cu | 8 +- .../ATen/native/cuda/AdaptiveMaxPooling2d.cu | 8 +- .../ATen/native/cuda/AdaptiveMaxPooling3d.cu | 8 +- aten/src/ATen/native/cuda/AveragePool2d.cu | 16 +- .../native/cuda/BinaryLogicalOpsKernels.cu | 62 +- aten/src/ATen/native/cuda/Bucketization.cu | 6 - 
aten/src/ATen/native/cuda/Col2Im.cu | 89 +- aten/src/ATen/native/cuda/Copy.cu | 39 +- aten/src/ATen/native/cuda/Copy.h | 10 + aten/src/ATen/native/cuda/CumminmaxKernel.cu | 29 + aten/src/ATen/native/cuda/CumprodKernel.cu | 23 + aten/src/ATen/native/cuda/CumsumKernel.cu | 25 + aten/src/ATen/native/cuda/DepthwiseConv2d.cu | 1 - aten/src/ATen/native/cuda/DilatedMaxPool2d.cu | 26 +- aten/src/ATen/native/cuda/DistanceKernel.cu | 138 +- aten/src/ATen/native/cuda/Distributions.cu | 1 + aten/src/ATen/native/cuda/EmbeddingBag.cu | 11 +- aten/src/ATen/native/cuda/ForeachFunctors.cuh | 19 + .../ATen/native/cuda/ForeachPointwiseOp.cu | 35 + .../ATen/native/cuda/FractionalMaxPool2d.cu | 14 +- aten/src/ATen/native/cuda/FusedAdamKernel.cu | 45 + aten/src/ATen/native/cuda/GridSampler.cu | 4 +- aten/src/ATen/native/cuda/Im2Col.cu | 64 +- aten/src/ATen/native/cuda/IndexKernel.cu | 18 + aten/src/ATen/native/cuda/Indexing.cu | 198 +- aten/src/ATen/native/cuda/JitLoops.cuh | 4 - aten/src/ATen/native/cuda/KernelUtils.cuh | 48 +- aten/src/ATen/native/cuda/Lerp.cu | 21 +- aten/src/ATen/native/cuda/LinearAlgebra.cu | 4 +- .../ATen/native/cuda/LinearAlgebraStubs.cpp | 40 +- .../ATen/native/cuda/LogcumsumexpKernel.cu | 37 + aten/src/ATen/native/cuda/Loss.cu | 61 +- aten/src/ATen/native/cuda/MaxUnpooling.cu | 8 + aten/src/ATen/native/cuda/MultiMarginLoss.cu | 1 + .../src/ATen/native/cuda/MultiTensorApply.cuh | 68 + .../src/ATen/native/cuda/MultinomialKernel.cu | 4 +- aten/src/ATen/native/cuda/NLLLoss2d.cu | 23 +- .../cuda/NaiveConvolutionTranspose3d.cu | 11 +- aten/src/ATen/native/cuda/Normalization.cu | 23 +- aten/src/ATen/native/cuda/Normalization.cuh | 83 +- aten/src/ATen/native/cuda/Pow.cuh | 58 + aten/src/ATen/native/cuda/PowKernel.cu | 49 +- aten/src/ATen/native/cuda/Reduce.cuh | 35 +- aten/src/ATen/native/cuda/ReflectionPad.cu | 14 +- aten/src/ATen/native/cuda/RreluWithNoise.cu | 2 +- .../cuda/{ScanKernels.cu => ScanUtils.cuh} | 89 +- aten/src/ATen/native/cuda/Shape.cu | 2 +- aten/src/ATen/native/cuda/SoftMax.cu | 19 +- .../cuda/SparseBinaryOpIntersectionKernel.cu | 150 + aten/src/ATen/native/cuda/SummaryOps.cu | 16 +- aten/src/ATen/native/cuda/TensorFactories.cu | 12 +- aten/src/ATen/native/cuda/TriangularOps.cu | 130 +- .../ATen/native/cuda/UnaryComplexKernels.cu | 39 +- .../ATen/native/cuda/UnaryFractionKernels.cu | 2 +- .../ATen/native/cuda/UnarySpecialOpsKernel.cu | 9 +- .../ATen/native/cuda/UnfoldBackwardKernel.cu | 96 +- .../src/ATen/native/cuda/UpSampleNearest2d.cu | 22 +- .../src/ATen/native/cuda/UpSampleNearest3d.cu | 47 - aten/src/ATen/native/cuda/block_reduce.cuh | 43 +- .../native/cuda/fused_adam_amsgrad_impl.cu | 52 + .../native/cuda/fused_adam_amsgrad_impl.cuh | 24 + aten/src/ATen/native/cuda/fused_adam_impl.cu | 51 + aten/src/ATen/native/cuda/fused_adam_impl.cuh | 23 + .../src/ATen/native/cuda/fused_adam_utils.cuh | 166 + aten/src/ATen/native/cuda/im2col.cuh | 210 +- aten/src/ATen/native/cuda/jit_utils.cpp | 263 +- aten/src/ATen/native/cuda/jit_utils.h | 1 - .../src/ATen/native/cuda/layer_norm_kernel.cu | 476 +- .../native/cuda/linalg/BatchLinearAlgebra.cpp | 479 +- .../cuda/linalg/BatchLinearAlgebraLib.cpp | 156 +- .../cuda/linalg/BatchLinearAlgebraLib.h | 6 - .../ATen/native/cuda/reduction_template.cuh | 16 + aten/src/ATen/native/cuda/vol2col.cuh | 58 +- .../ATen/native/cudnn/AffineGridGenerator.cpp | 13 +- aten/src/ATen/native/cudnn/BatchNorm.cpp | 14 +- .../ATen/native/cudnn/ConvPlaceholders.cpp | 14 +- aten/src/ATen/native/cudnn/ConvShared.cpp | 24 +- 
aten/src/ATen/native/cudnn/ConvShared.h | 3 +- aten/src/ATen/native/cudnn/Conv_v7.cpp | 19 +- aten/src/ATen/native/cudnn/Conv_v8.cpp | 51 +- aten/src/ATen/native/cudnn/GridSampler.cpp | 13 +- aten/src/ATen/native/cudnn/LossCTC.cpp | 65 +- aten/src/ATen/native/cudnn/RNN.cpp | 22 +- aten/src/ATen/native/group_norm.cpp | 55 +- aten/src/ATen/native/im2col.h | 2 +- aten/src/ATen/native/im2col_shape_check.h | 10 +- aten/src/ATen/native/layer_norm.cpp | 40 +- aten/src/ATen/native/metal/MetalAten.mm | 3 +- aten/src/ATen/native/metal/MetalContext.mm | 4 +- aten/src/ATen/native/metal/MetalConvParams.h | 2 +- aten/src/ATen/native/metal/MetalTensorImpl.h | 4 + .../ATen/native/metal/mpscnn/MPSCNNConvOp.mm | 10 +- .../native/metal/mpscnn/MPSImageWrapper.mm | 3 + aten/src/ATen/native/metal/ops/MetalConcat.mm | 27 +- .../ATen/native/metal/ops/MetalConvolution.mm | 4 +- .../ATen/native/metal/ops/MetalHardshrink.mm | 3 +- .../src/ATen/native/metal/ops/MetalPadding.mm | 2 +- .../src/ATen/native/metal/ops/MetalReshape.mm | 5 +- .../ATen/native/miopen/BatchNorm_miopen.cpp | 13 +- aten/src/ATen/native/miopen/Conv_miopen.cpp | 248 +- aten/src/ATen/native/miopen/RNN_miopen.cpp | 17 +- aten/src/ATen/native/mkl/LinearAlgebra.cpp | 1 + aten/src/ATen/native/mkl/LinearAlgebra.h | 3 +- aten/src/ATen/native/mkl/SparseBlasImpl.cpp | 143 +- .../native/mkl/SparseCsrLinearAlgebra.cpp | 1 + .../ATen/native/mkl/SparseCsrLinearAlgebra.h | 3 +- aten/src/ATen/native/mkl/SpectralOps.cpp | 23 +- aten/src/ATen/native/mkldnn/BinaryOps.cpp | 10 +- aten/src/ATen/native/mkldnn/Conv.cpp | 527 +- aten/src/ATen/native/mkldnn/Copy.cpp | 8 +- aten/src/ATen/native/mkldnn/Gelu.cpp | 11 +- .../ATen/native/mkldnn/IDeepRegistration.cpp | 3 +- aten/src/ATen/native/mkldnn/Linear.cpp | 272 +- aten/src/ATen/native/mkldnn/MKLDNNCommon.h | 2 +- .../ATen/native/mkldnn/MKLDNNConversions.cpp | 100 +- aten/src/ATen/native/mkldnn/Matmul.cpp | 58 +- aten/src/ATen/native/mkldnn/Matmul.h | 2 +- .../ATen/native/mkldnn/MkldnnTensorMath.cpp | 10 +- aten/src/ATen/native/mkldnn/Normalization.cpp | 49 +- aten/src/ATen/native/mkldnn/Pooling.cpp | 22 +- aten/src/ATen/native/mkldnn/Prelu.cpp | 4 +- .../mkldnn/RegisterMkldnnOpContextClass.cpp | 30 + aten/src/ATen/native/mkldnn/Relu.cpp | 10 +- aten/src/ATen/native/mkldnn/SoftMax.cpp | 8 +- .../ATen/native/mkldnn/TensorFactories.cpp | 12 +- aten/src/ATen/native/mkldnn/TensorShape.cpp | 17 +- aten/src/ATen/native/mkldnn/UnaryOps.cpp | 9 +- aten/src/ATen/native/mkldnn/Utils.cpp | 133 + aten/src/ATen/native/mkldnn/Utils.h | 44 +- aten/src/ATen/native/mps/Copy.h | 15 +- aten/src/ATen/native/mps/MPSGraphVenturaOps.h | 17 + aten/src/ATen/native/mps/OperationUtils.h | 25 +- aten/src/ATen/native/mps/OperationUtils.mm | 109 +- aten/src/ATen/native/mps/TensorFactory.cpp | 12 +- .../ATen/native/mps/operations/Activation.mm | 318 +- .../native/mps/operations/AdaptivePooling.mm | 88 +- .../ATen/native/mps/operations/BinaryOps.mm | 62 +- .../{BitwiseBinaryOps.mm => BitwiseOps.mm} | 76 +- aten/src/ATen/native/mps/operations/Blas.mm | 27 +- .../ATen/native/mps/operations/ConstantOps.mm | 13 +- .../ATen/native/mps/operations/Convolution.mm | 60 +- aten/src/ATen/native/mps/operations/Copy.mm | 161 +- .../native/mps/operations/Distributions.mm | 903 +- aten/src/ATen/native/mps/operations/Eye.mm | 5 +- .../src/ATen/native/mps/operations/Indexing.h | 39 + .../ATen/native/mps/operations/Indexing.mm | 264 +- aten/src/ATen/native/mps/operations/Linear.mm | 18 +- .../src/ATen/native/mps/operations/LossOps.mm | 4 +- 
.../native/mps/operations/Normalization.mm | 50 +- aten/src/ATen/native/mps/operations/Pad.mm | 306 + .../native/mps/operations/PointwiseOps.mm | 8 +- .../native/mps/operations/RangeFactories.mm | 19 +- .../ATen/native/mps/operations/ReduceOps.mm | 464 +- aten/src/ATen/native/mps/operations/Repeat.mm | 9 +- aten/src/ATen/native/mps/operations/RnnOps.mm | 6 +- .../native/mps/operations/ScatterGather.mm | 12 +- aten/src/ATen/native/mps/operations/Shape.mm | 292 +- .../native/mps/operations/TensorCompare.mm | 122 +- .../native/mps/operations/TriangularOps.mm | 192 - .../ATen/native/mps/operations/UnaryOps.mm | 144 +- aten/src/ATen/native/mps/operations/View.mm | 32 +- aten/src/ATen/native/native_functions.yaml | 1428 +- .../native/nested/NestedTensorAliases.cpp | 15 + .../native/nested/NestedTensorBackward.cpp | 83 +- .../native/nested/NestedTensorBinaryOps.cpp | 247 + .../native/nested/NestedTensorBinaryOps.h | 16 + .../native/nested/NestedTensorFactories.cpp | 125 + .../native/nested/NestedTensorFactories.h | 7 + .../ATen/native/nested/NestedTensorMath.cpp | 969 +- .../src/ATen/native/nested/NestedTensorMath.h | 253 +- .../ATen/native/nested/NestedTensorMatmul.cpp | 352 + .../NestedTensorTransformerFunctions.cpp | 31 +- .../nested/NestedTensorTransformerFunctions.h | 18 +- .../native/nested/NestedTensorUnaryOps.cpp | 74 + .../ATen/native/nested/NestedTensorUtils.cpp | 112 + .../ATen/native/nested/NestedTensorUtils.h | 423 + .../nested/cuda/NestedTensorBinaryOps.cu | 120 + .../native/nested/cuda/NestedTensorMatmul.cu | 416 + .../cuda/NestedTensorTransformerFunctions.cpp | 382 +- .../cuda/NestedTensorTransformerFunctions.cu | 23 +- .../src/ATen/native/prim_native_functions.cpp | 9 +- .../ATen/native/quantized/AffineQuantizer.cpp | 47 +- .../ATen/native/quantized/AffineQuantizer.h | 3 +- .../native/quantized/AffineQuantizerBase.cpp | 27 + .../ATen/native/quantized/FakeQuantAffine.h | 3 +- .../quantized/FakeQuantPerTensorAffine.cpp | 6 +- aten/src/ATen/native/quantized/IndexKernel.h | 3 +- aten/src/ATen/native/quantized/PackedParams.h | 2 +- aten/src/ATen/native/quantized/QTensor.cpp | 6 + aten/src/ATen/native/quantized/README.md | 3 +- .../quantized/TensorAdvancedIndexing.cpp | 91 + .../ATen/native/quantized/TensorCompare.cpp | 13 + .../ATen/native/quantized/TensorFactories.cpp | 10 - .../quantized/cpu/AdaptiveAveragePooling.cpp | 16 +- .../native/quantized/cpu/AveragePool2d.cpp | 14 +- .../native/quantized/cpu/AveragePool3d.cpp | 18 +- .../ATen/native/quantized/cpu/BinaryOps.cpp | 25 +- .../src/ATen/native/quantized/cpu/BinaryOps.h | 2 +- .../native/quantized/cpu/ChannelShuffle.cpp | 18 +- .../quantized/cpu/EmbeddingPackedParams.h | 2 +- .../native/quantized/cpu/IntReprQuant.cpp | 12 +- .../native/quantized/cpu/LinearUnpackImpl.cpp | 14 +- .../cpu/MakePerTensorQuantizedTensor.cpp | 7 + .../native/quantized/cpu/Normalization.cpp | 14 +- .../ATen/native/quantized/cpu/OnednnUtils.h | 276 +- .../src/ATen/native/quantized/cpu/Pooling.cpp | 19 +- .../ATen/native/quantized/cpu/QnnpackUtils.h | 13 +- .../ATen/native/quantized/cpu/QuantUtils.h | 13 +- .../ATen/native/quantized/cpu/QuantizedOps.h | 9 +- .../ATen/native/quantized/cpu/ReduceOps.cpp | 19 +- .../src/ATen/native/quantized/cpu/Sorting.cpp | 18 +- .../native/quantized/cpu/TensorOperators.cpp | 24 +- .../ATen/native/quantized/cpu/TensorShape.cpp | 88 +- .../quantized/cpu/UpSampleBilinear2d.cpp | 15 +- .../quantized/cpu/UpSampleNearest2d.cpp | 16 +- .../quantized/cpu/UpSampleNearest3d.cpp | 38 +- .../ATen/native/quantized/cpu/XnnpackUtils.h | 2 
+- .../native/quantized/cpu/conv_serialization.h | 41 +- .../native/quantized/cpu/fbgemm_utils.cpp | 18 +- .../quantized/cpu/fused_obs_fake_quant.cpp | 19 +- .../native/quantized/cpu/init_qnnpack.cpp | 3 +- .../cpu/kernels/QuantizedOpKernels.cpp | 51 +- aten/src/ATen/native/quantized/cpu/qclamp.cpp | 19 +- aten/src/ATen/native/quantized/cpu/qconv.cpp | 134 +- .../native/quantized/cpu/qconv_dynamic.cpp | 14 +- .../native/quantized/cpu/qconv_prepack.cpp | 50 +- .../quantized/cpu/qconv_unpack_impl.cpp | 2 +- aten/src/ATen/native/quantized/cpu/qelu.cpp | 12 +- .../native/quantized/cpu/qembeddingbag.cpp | 13 +- .../ATen/native/quantized/cpu/qembeddingbag.h | 4 +- .../quantized/cpu/qembeddingbag_prepack.cpp | 21 +- .../quantized/cpu/qembeddingbag_prepack.h | 6 +- .../quantized/cpu/qembeddingbag_unpack.cpp | 13 +- aten/src/ATen/native/quantized/cpu/qgelu.cpp | 17 +- .../native/quantized/cpu/qhardsigmoid.cpp | 15 +- .../ATen/native/quantized/cpu/qhardswish.cpp | 12 +- .../src/ATen/native/quantized/cpu/qlinear.cpp | 61 +- .../native/quantized/cpu/qlinear_dynamic.cpp | 87 +- .../native/quantized/cpu/qlinear_prepack.cpp | 22 +- .../src/ATen/native/quantized/cpu/qmatmul.cpp | 4 +- aten/src/ATen/native/quantized/cpu/qmul.cpp | 153 +- .../quantized/cpu/qnnpack/CMakeLists.txt | 1 - .../cpu/qnnpack/bench/q8gemm_sparse.cc | 41 +- .../quantized/cpu/qnnpack/buckbuild.bzl | 1 - .../qnnpack/cmake/DownloadGoogleTest.cmake | 2 +- .../cpu/qnnpack/deps/clog/CMakeLists.txt | 5 +- .../deps/clog/cmake/DownloadGoogleTest.cmake | 2 +- .../cpu/qnnpack/include/pack_block_sparse.h | 291 +- .../cpu/qnnpack/include/pytorch_qnnpack.h | 12 +- .../cpu/qnnpack/src/fully-connected-sparse.c | 35 +- .../native/quantized/cpu/qnnpack/src/init.c | 24 +- .../quantized/cpu/qnnpack/src/operator-run.c | 171 +- .../cpu/qnnpack/src/pack_block_sparse.cc | 170 - .../cpu/qnnpack/src/q8gemm/4x4c2-sse2.c | 9 +- .../4x8c1x4-dq-packedA-aarch32-neon.S | 804 +- .../4x8c8x1-dq-packedA-aarch32-neon.S | 622 +- .../q8gemm_sparse/8x4c1x4-dq-packedA-sse2.c | 451 +- .../q8gemm_sparse/8x4c1x4-dq-packedA-sse2.h | 435 + .../8x8c1x4-dq-packedA-aarch64-neon.S | 948 +- .../8x8c8x1-dq-packedA-aarch64-neon.S | 806 +- .../cpu/qnnpack/src/qnnpack/common.h | 12 + .../cpu/qnnpack/src/qnnpack/operator.h | 13 +- .../cpu/qnnpack/src/qnnpack/params.h | 34 +- .../cpu/qnnpack/src/qnnpack/q8gemm_sparse.h | 80 +- .../fully-connected-sparse-operator-tester.h | 38 +- .../gemm-block-sparse-microkernel-tester.h | 29 +- .../cpu/qnnpack/test/q8gemm_sparse.cc | 1362 +- .../native/quantized/cpu/qnormalization.cpp | 9 +- aten/src/ATen/native/quantized/cpu/qrelu.cpp | 21 +- .../ATen/native/quantized/cpu/qsigmoid.cpp | 17 +- aten/src/ATen/native/quantized/cpu/qtanh.cpp | 17 +- .../ATen/native/quantized/cpu/qthreshold.cpp | 13 +- .../ATen/native/quantized/cuda/Activation.cpp | 9 + .../ATen/native/quantized/cuda/Activation.cu | 21 + .../native/quantized/cuda/AffineQuantizer.cu | 16 +- .../native/quantized/cuda/EmbeddingBag.cu | 14 +- .../native/quantized/cuda/FakeQuantizeCore.cu | 6 +- .../quantized/cuda/FusedObsFakeQuant.cu | 16 +- .../native/quantized/cuda/IntReprQuant.cu | 13 +- .../cuda/MakePerTensorQuantizedTensor.cu | 17 +- .../ATen/native/quantized/cudnn/BinaryOps.cpp | 11 +- .../native/quantized/cudnn/ConvPrepack.cpp | 2 +- aten/src/ATen/native/quantized/cudnn/utils.h | 12 +- .../ATen/native/quantized/qconv_unpack.cpp | 20 +- aten/src/ATen/native/sparse/Macros.h | 19 + .../sparse/SparseBinaryOpIntersectionCommon.h | 585 + .../SparseBinaryOpIntersectionKernel.cpp | 107 + 
.../src/ATen/native/sparse/SparseBlasImpl.cpp | 204 + aten/src/ATen/native/sparse/SparseBlasImpl.h | 14 + .../ATen/native/sparse/SparseCsrTensor.cpp | 48 +- .../native/sparse/SparseCsrTensorMath.cpp | 321 +- .../ATen/native/sparse/SparseCsrTensorMath.h | 60 + .../ATen/native/sparse/SparseFactories.cpp | 1 + aten/src/ATen/native/sparse/SparseFactories.h | 8 +- aten/src/ATen/native/sparse/SparseStubs.h | 16 + aten/src/ATen/native/sparse/SparseTensor.cpp | 50 +- .../ATen/native/sparse/SparseTensorMath.cpp | 235 +- .../src/ATen/native/sparse/SparseTensorMath.h | 1 + .../src/ATen/native/sparse/SparseUnaryOps.cpp | 52 +- .../sparse/ValidateCompressedIndicesCommon.h | 14 +- aten/src/ATen/native/sparse/cuda/SoftMax.cu | 14 +- .../ATen/native/sparse/cuda/SparseBlas.cpp | 11 +- .../native/sparse/cuda/SparseBlasImpl.cpp | 146 +- .../sparse/cuda/SparseCUDAApplyUtils.cuh | 111 - .../native/sparse/cuda/SparseCUDABlas.cpp | 10 +- .../native/sparse/cuda/SparseCUDATensor.cu | 1 + .../sparse/cuda/SparseCUDATensorMath.cu | 71 +- .../native/sparse/cuda/SparseCsrTensorMath.cu | 2 +- .../ATen/native/sparse/cuda/SparseMatMul.cu | 8 +- aten/src/ATen/native/tags.yaml | 12 +- .../ATen/native/transformers/attention.cpp | 128 +- aten/src/ATen/native/transformers/attention.h | 33 + .../native/transformers/cuda/attention.cu | 581 +- .../transformers/cuda/attention_backward.cu | 289 + .../transformers/cuda/flash_attn/epilogue.h | 149 + .../epilogue_predicated_tile_iterator.h | 493 + .../transformers/cuda/flash_attn/fmha.h | 154 + .../transformers/cuda/flash_attn/fmha_api.cpp | 248 + .../transformers/cuda/flash_attn/fmha_api.h | 25 + .../cuda/flash_attn/fmha_fprop_kernel_1xN.h | 722 + .../flash_attn/fmha_fprop_kernel_dispatch.cu | 134 + .../cuda/flash_attn/fmha_kernel.h | 71 + .../transformers/cuda/flash_attn/fmha_utils.h | 52 + .../transformers/cuda/flash_attn/gemm.h | 95 + .../transformers/cuda/flash_attn/gmem_tile.h | 272 + .../cuda/flash_attn/kernel_traits.h | 154 + .../transformers/cuda/flash_attn/mask.h | 92 + .../cuda/flash_attn/mma_core_sm75.h | 382 + .../transformers/cuda/flash_attn/philox.cuh | 146 + .../transformers/cuda/flash_attn/softmax.h | 446 + .../cuda/flash_attn/static_switch.h | 25 + .../cuda/flash_attn/summary_stats.h | 55 + .../transformers/cuda/flash_attn/utils.h | 404 + .../attention_scaling_coefs_updater.h | 479 + .../cuda/mem_eff_attention/debug_utils.h | 129 + .../mem_eff_attention/epilogue_pipelined.h | 629 + .../epilogue_rescale_output.h | 234 + .../epilogue_thread_apply_logsumexp.h | 177 + .../cuda/mem_eff_attention/find_default_mma.h | 159 + .../cuda/mem_eff_attention/gemm/custom_mma.h | 92 + .../mem_eff_attention/gemm/custom_mma_base.h | 183 + .../gemm/custom_mma_multistage.h | 769 + .../gemm/custom_mma_pipelined.h | 402 + .../mem_eff_attention/gemm_kernel_utils.h | 226 + .../epilogue_predicated_tile_iterator.h | 750 + .../iterators/make_residual_last.h | 67 + ...cated_tile_access_iterator_residual_last.h | 2116 + .../predicated_tile_iterator_residual_last.h | 2120 + .../cuda/mem_eff_attention/kernel_backward.h | 1575 + .../cuda/mem_eff_attention/kernel_forward.h | 895 + .../kernels/backward_bf16.cu | 6 + .../kernels/backward_bf16_aligned.cu | 6 + .../kernels/backward_bf16_aligned_k128.cu | 6 + .../kernels/backward_bf16_aligned_k64.cu | 6 + .../kernels/backward_bf16_k128.cu | 6 + .../kernels/backward_bf16_k64.cu | 6 + .../mem_eff_attention/kernels/backward_f16.cu | 6 + .../kernels/backward_f16_aligned.cu | 6 + .../kernels/backward_f16_aligned_k128.cu | 6 + 
.../kernels/backward_f16_aligned_k64.cu | 6 + .../kernels/backward_f16_k128.cu | 6 + .../kernels/backward_f16_k64.cu | 6 + .../mem_eff_attention/kernels/backward_f32.cu | 6 + .../kernels/backward_f32_aligned.cu | 6 + .../kernels/backward_f32_aligned_k128.cu | 6 + .../kernels/backward_f32_aligned_k64.cu | 6 + .../kernels/backward_f32_k128.cu | 6 + .../kernels/backward_f32_k64.cu | 6 + .../mem_eff_attention/kernels/forward_bf16.cu | 74 + .../kernels/forward_bf16_aligned.cu | 74 + .../mem_eff_attention/kernels/forward_f16.cu | 54 + .../kernels/forward_f16_aligned.cu | 34 + .../mem_eff_attention/kernels/forward_f32.cu | 14 + .../kernels/forward_f32_aligned.cu | 14 + .../kernels/generate_kernels.sh | 56 + .../cuda/mem_eff_attention/mma_from_smem.h | 1785 + .../mma_simt_tile_iterator_residual.h | 302 + .../ATen/native/transformers/cuda/sdp_utils.h | 316 + .../ATen/native/transformers/sdp_utils_cpp.h | 9 + .../ATen/native/transformers/transformer.cpp | 67 +- aten/src/ATen/native/ts_native_functions.yaml | 46 +- aten/src/ATen/native/utils/Factory.cpp | 1 + aten/src/ATen/native/utils/Factory.h | 2 +- aten/src/ATen/native/utils/ParamUtils.h | 21 +- aten/src/ATen/native/vol2col.h | 4 +- .../native/vulkan/VulkanOpaqueTensorImpl.h | 4 + aten/src/ATen/native/vulkan/api/Adapter.cpp | 4 +- aten/src/ATen/native/vulkan/api/Adapter.h | 2 + aten/src/ATen/native/vulkan/api/Allocator.h | 6 +- aten/src/ATen/native/vulkan/api/Command.cpp | 23 + aten/src/ATen/native/vulkan/api/Command.h | 7 + aten/src/ATen/native/vulkan/api/Common.h | 68 +- aten/src/ATen/native/vulkan/api/Context.cpp | 78 +- aten/src/ATen/native/vulkan/api/Context.h | 80 +- aten/src/ATen/native/vulkan/api/Descriptor.h | 1 + aten/src/ATen/native/vulkan/api/Pipeline.h | 5 +- aten/src/ATen/native/vulkan/api/QueryPool.cpp | 31 +- aten/src/ATen/native/vulkan/api/QueryPool.h | 6 +- aten/src/ATen/native/vulkan/api/Resource.cpp | 40 +- aten/src/ATen/native/vulkan/api/Resource.h | 30 +- aten/src/ATen/native/vulkan/api/Runtime.cpp | 20 +- aten/src/ATen/native/vulkan/api/Shader.cpp | 27 + aten/src/ATen/native/vulkan/api/Shader.h | 32 +- aten/src/ATen/native/vulkan/api/Types.h | 21 + aten/src/ATen/native/vulkan/api/Utils.h | 36 + .../src/ATen/native/vulkan/api/vk_mem_alloc.h | 19558 -------- .../ATen/native/vulkan/glsl/batchnorm.glsl | 70 +- .../native/vulkan/glsl/buffer_to_buffer.glsl | 78 + aten/src/ATen/native/vulkan/glsl/conv2d.glsl | 153 +- .../ATen/native/vulkan/glsl/conv2d_dw.glsl | 102 +- .../ATen/native/vulkan/glsl/conv2d_pw.glsl | 48 - .../native/vulkan/glsl/conv2d_pw_2x2.glsl | 100 - .../vulkan/glsl/conv2d_pw_2x2_buffered.glsl | 154 - .../native/vulkan/glsl/conv_transpose2d.glsl | 117 +- .../native/vulkan/glsl/image2d_to_nchw.glsl | 52 + .../native/vulkan/glsl/image_to_nchw.glsl | 55 +- .../vulkan/glsl/image_to_nchw_quantized.glsl | 106 +- aten/src/ATen/native/vulkan/glsl/indexing.h | 13 + .../native/vulkan/glsl/nchw_to_image.glsl | 57 +- .../native/vulkan/glsl/nchw_to_image2d.glsl | 53 + .../vulkan/glsl/nchw_to_image_quantized.glsl | 79 +- .../vulkan/glsl/quantize_per_tensor.glsl | 8 +- .../native/vulkan/glsl/quantized_add.glsl | 4 +- .../native/vulkan/glsl/quantized_conv2d.glsl | 234 +- .../vulkan/glsl/quantized_conv2d_dw.glsl | 156 +- .../vulkan/glsl/quantized_conv2d_pw_2x2.glsl | 299 +- .../native/vulkan/glsl/quantized_div.glsl | 4 +- .../native/vulkan/glsl/quantized_mul.glsl | 4 +- .../native/vulkan/glsl/quantized_sub.glsl | 4 +- .../glsl/quantized_upsample_nearest2d.glsl | 3 +- .../vulkan/glsl/templates/conv2d_pw.glslt | 154 + 
.../glsl/templates/conv2d_pw_params.yaml | 7 + .../src/ATen/native/vulkan/ops/Arithmetic.cpp | 62 +- aten/src/ATen/native/vulkan/ops/Batchnorm.cpp | 290 +- aten/src/ATen/native/vulkan/ops/Batchnorm.h | 68 + aten/src/ATen/native/vulkan/ops/Clone.cpp | 9 +- aten/src/ATen/native/vulkan/ops/Common.cpp | 51 +- aten/src/ATen/native/vulkan/ops/Common.h | 89 +- aten/src/ATen/native/vulkan/ops/Concat.cpp | 58 +- .../ATen/native/vulkan/ops/Convolution.cpp | 1635 +- aten/src/ATen/native/vulkan/ops/Convolution.h | 87 +- aten/src/ATen/native/vulkan/ops/Copy.cpp | 38 +- aten/src/ATen/native/vulkan/ops/Copy.h | 12 +- aten/src/ATen/native/vulkan/ops/Glu.cpp | 2 +- aten/src/ATen/native/vulkan/ops/Gru.cpp | 109 +- aten/src/ATen/native/vulkan/ops/Gru.h | 30 + aten/src/ATen/native/vulkan/ops/Lerp.cpp | 14 +- aten/src/ATen/native/vulkan/ops/Lstm.cpp | 128 +- aten/src/ATen/native/vulkan/ops/Lstm.h | 30 + aten/src/ATen/native/vulkan/ops/Mm.cpp | 37 +- aten/src/ATen/native/vulkan/ops/Mm.h | 24 + .../native/vulkan/ops/QuantizedFunctions.h | 2 +- aten/src/ATen/native/vulkan/ops/Register.cpp | 36 + aten/src/ATen/native/vulkan/ops/Shape.cpp | 4 +- aten/src/ATen/native/vulkan/ops/Tensor.cpp | 449 +- aten/src/ATen/native/vulkan/ops/Tensor.h | 165 +- aten/src/ATen/native/vulkan/ops/Utils.cpp | 385 +- aten/src/ATen/native/vulkan/ops/Utils.h | 7 + aten/src/ATen/native/vulkan/ops/cumsum.cpp | 3 +- aten/src/ATen/native/xnnpack/Common.h | 5 +- aten/src/ATen/native/xnnpack/Engine.h | 3 +- aten/src/ATen/native/xnnpack/Init.cpp | 1 + aten/src/ATen/native/xnnpack/OpContext.cpp | 4 + aten/src/ATen/native/xnnpack/OpContext.h | 14 + aten/src/ATen/native/xnnpack/Shim.cpp | 1 + aten/src/ATen/quantized/Quantizer.cpp | 4 + aten/src/ATen/record_function.h | 28 + .../templates/CompositeViewCopyKernels.cpp | 12 +- .../templates/RegisterFunctionalization.cpp | 11 +- aten/src/ATen/templates/TensorBody.h | 2 + aten/src/ATen/test/CMakeLists.txt | 104 +- aten/src/ATen/test/ExclusivelyOwned_test.cpp | 3 +- aten/src/ATen/test/MaybeOwned_test.cpp | 21 +- aten/src/ATen/test/extension_backend_test.cpp | 4 +- aten/src/ATen/test/math_kernel_test.cpp | 10 - aten/src/ATen/test/mps_test_print.cpp | 34 + aten/src/ATen/test/scalar_test.cpp | 14 + aten/src/ATen/test/vulkan_api_test.cpp | 568 +- aten/src/ATen/test/vulkan_perf_test.cpp | 20 +- .../ATen/test/vulkan_quantized_api_test.cpp | 170 +- aten/src/ATen/test/xnnpack_test.cpp | 91 + aten/src/README.md | 4 +- benchmarks/cpp/nvfuser/CMakeLists.txt | 1 + .../cpp/nvfuser/batch_norm_channels_first.cpp | 4 - .../batch_norm_channels_first_backward.cpp | 4 - .../cpp/nvfuser/batch_norm_channels_last.cpp | 4 - .../batch_norm_channels_last_backward.cpp | 4 - benchmarks/cpp/nvfuser/bert.cpp | 52 +- benchmarks/cpp/nvfuser/broadcast.cpp | 10 +- benchmarks/cpp/nvfuser/gelu_backward.cpp | 9 +- benchmarks/cpp/nvfuser/heuristic_lookup.cpp | 14 +- benchmarks/cpp/nvfuser/instance_norm.cpp | 6 +- benchmarks/cpp/nvfuser/layer_norm.cpp | 8 +- .../cpp/nvfuser/layer_norm_backward.cpp | 9 +- benchmarks/cpp/nvfuser/lstm_cell.cpp | 4 +- benchmarks/cpp/nvfuser/matmul.cpp | 357 + benchmarks/cpp/nvfuser/reduction.cpp | 10 +- benchmarks/cpp/nvfuser/rms_norm.cpp | 2 - benchmarks/cpp/nvfuser/rms_norm_backward.cpp | 3 - benchmarks/cpp/nvfuser/scale_bias_relu.cpp | 18 +- benchmarks/cpp/nvfuser/shape_inference.cpp | 9 +- benchmarks/cpp/nvfuser/softmax.cpp | 6 +- benchmarks/cpp/nvfuser/softmax_backward.cpp | 34 +- benchmarks/cpp/nvfuser/softmax_dropout.cpp | 4 +- benchmarks/cpp/nvfuser/timm.cpp | 11 +- benchmarks/cpp/nvfuser/utils.cpp | 25 
+- benchmarks/cpp/nvfuser/utils.h | 26 +- benchmarks/distributed/ddp/README.md | 2 +- benchmarks/distributed/ddp/benchmark.py | 2 +- benchmarks/dynamo/Makefile_dashboard | 40 + benchmarks/dynamo/README.md | 52 + .../dbr => benchmarks/dynamo}/__init__.py | 0 benchmarks/dynamo/check_csv.py | 40 + benchmarks/dynamo/common.py | 2078 + benchmarks/dynamo/dist_util.py | 148 + benchmarks/dynamo/distributed.py | 164 + benchmarks/dynamo/huggingface.py | 585 + benchmarks/dynamo/huggingface_models_list.txt | 51 + .../dynamo/microbenchmarks}/__init__.py | 0 .../microbenchmarks/bench_autotune_conv.py | 170 + .../dynamo/microbenchmarks/bench_conv.py | 144 + .../dynamo/microbenchmarks/bench_conv1x1.py | 140 + .../microbenchmarks/bench_conv_fusion.py | 298 + .../dynamo/microbenchmarks/bench_mm_fusion.py | 121 + .../microbenchmarks/benchmark_helper.py | 13 + .../dynamo/microbenchmarks/inductor_bmm.py | 61 + .../dynamo/microbenchmarks/inductor_mm.py | 134 + .../dynamo/microbenchmarks/matmul_relu.py | 100 + .../dynamo/microbenchmarks/microbench.py | 176 + benchmarks/dynamo/microbenchmarks/model.py | 26 + .../hf_train/AlbertForMaskedLM_training.txt | 115 + .../AlbertForQuestionAnswering_training.txt | 110 + .../AllenaiLongformerBase_training.txt | 186 + .../hf_train/BartForCausalLM_training.txt | 73 + .../BartForConditionalGeneration_training.txt | 89 + .../hf_train/BertForMaskedLM_training.txt | 81 + .../BertForQuestionAnswering_training.txt | 88 + .../hf_train/BigBird_training.txt | 237 + .../BlenderbotSmallForCausalLM_training.txt | 74 + ...SmallForConditionalGeneration_training.txt | 81 + .../hf_train/CamemBert_training.txt | 88 + .../hf_train/DebertaForMaskedLM_training.txt | 132 + .../DebertaForQuestionAnswering_training.txt | 133 + .../DebertaV2ForMaskedLM_training.txt | 85 + ...DebertaV2ForQuestionAnswering_training.txt | 92 + .../DistilBertForMaskedLM_training.txt | 78 + ...istilBertForQuestionAnswering_training.txt | 85 + .../hf_train/DistillGPT2_training.txt | 91 + .../hf_train/ElectraForCausalLM_training.txt | 92 + .../ElectraForQuestionAnswering_training.txt | 94 + ...GPT2ForSequenceClassification_training.txt | 106 + .../hf_train/GPTNeoForCausalLM_training.txt | 96 + ...TNeoForSequenceClassification_training.txt | 101 + .../hf_train/GoogleFnet_training.txt | 83 + .../hf_train/LayoutLMForMaskedLM_training.txt | 90 + ...utLMForSequenceClassification_training.txt | 98 + ...2M100ForConditionalGeneration_training.txt | 88 + .../hf_train/MBartForCausalLM_training.txt | 73 + ...MBartForConditionalGeneration_training.txt | 94 + .../MegatronBertForCausalLM_training.txt | 85 + ...atronBertForQuestionAnswering_training.txt | 88 + .../MobileBertForMaskedLM_training.txt | 112 + ...obileBertForQuestionAnswering_training.txt | 106 + .../hf_train/OPTForCausalLM_training.txt | 103 + .../hf_train/PLBartForCausalLM_training.txt | 73 + ...LBartForConditionalGeneration_training.txt | 94 + .../hf_train/PegasusForCausalLM_training.txt | 72 + ...gasusForConditionalGeneration_training.txt | 79 + .../hf_train/RobertaForCausalLM_training.txt | 94 + .../RobertaForQuestionAnswering_training.txt | 97 + .../Speech2Text2ForCausalLM_training.txt | 82 + .../hf_train/TrOCRForCausalLM_training.txt | 73 + .../hf_train/XGLMForCausalLM_training.txt | 88 + .../hf_train/XLNetLMHeadModel_training.txt | 105 + .../hf_train/YituTechConvBert_training.txt | 119 + .../timm_train/adv_inception_v3_training.txt | 239 + .../beit_base_patch16_224_training.txt | 100 + .../timm_train/botnet26t_256_training.txt | 244 + 
.../timm_train/cait_m36_384_training.txt | 149 + .../timm_train/coat_lite_mini_training.txt | 348 + .../timm_train/convmixer_768_32_training.txt | 45 + .../timm_train/convnext_base_training.txt | 210 + .../timm_train/crossvit_9_240_training.txt | 203 + .../timm_train/cspdarknet53_training.txt | 177 + ...it_base_distilled_patch16_224_training.txt | 87 + .../timm_train/densenet121_training.txt | 616 + .../timm_train/dla102_training.txt | 189 + .../timm_train/dm_nfnet_f0_training.txt | 296 + .../timm_train/dpn107_training.txt | 545 + .../eca_botnext26ts_256_training.txt | 288 + .../timm_train/eca_halonext26ts_training.txt | 343 + .../timm_train/ecaresnet101d_training.txt | 195 + .../timm_train/ese_vovnet19b_dw_training.txt | 182 + .../timm_train/fbnetc_100_training.txt | 189 + .../timm_train/fbnetv3_b_training.txt | 287 + .../timm_train/gernet_l_training.txt | 118 + .../timm_train/ghostnet_100_training.txt | 411 + .../gluon_inception_v3_training.txt | 239 + .../timm_train/gluon_senet154_training.txt | 187 + .../timm_train/gluon_xception65_training.txt | 155 + .../timm_train/gmixer_24_224_training.txt | 83 + .../timm_train/gmlp_s16_224_training.txt | 70 + .../timm_train/hardcorenas_a_training.txt | 260 + .../timm_train/hrnet_w18_training.txt | 247 + .../timm_train/inception_v3_training.txt | 239 + .../timm_train/jx_nest_base_training.txt | 269 + .../timm_train/lcnet_050_training.txt | 158 + .../timm_train/legacy_senet154_training.txt | 183 + .../timm_train/levit_128_training.txt | 295 + .../timm_train/mixer_b16_224_training.txt | 70 + .../timm_train/mixnet_l_training.txt | 378 + .../timm_train/mnasnet_100_training.txt | 170 + .../timm_train/mobilenetv2_100_training.txt | 172 + .../mobilenetv3_large_100_training.txt | 269 + .../timm_train/mobilevit_s_training.txt | 313 + .../timm_train/nasnetalarge_training.txt | 309 + .../timm_train/nfnet_l0_training.txt | 267 + .../timm_train/pit_b_224_training.txt | 185 + .../timm_train/pnasnet5large_training.txt | 293 + .../timm_train/poolformer_m36_training.txt | 111 + .../timm_train/regnety_002_training.txt | 181 + .../timm_train/repvgg_a2_training.txt | 90 + .../timm_train/res2net101_26w_4s_training.txt | 209 + .../timm_train/res2net50_14w_8s_training.txt | 209 + .../timm_train/res2next50_training.txt | 197 + .../timm_train/resmlp_12_224_training.txt | 75 + .../timm_train/resnest101e_training.txt | 269 + .../timm_train/resnet18_training.txt | 88 + .../timm_train/rexnet_100_training.txt | 573 + .../timm_train/sebotnet33ts_256_training.txt | 334 + .../timm_train/selecsls42b_training.txt | 167 + .../timm_train/spnasnet_100_training.txt | 182 + .../swin_base_patch4_window7_224_training.txt | 341 + .../swsl_resnext101_32x16d_training.txt | 143 + .../tf_efficientnet_b0_training.txt | 312 + .../timm_train/tf_mixnet_l_training.txt | 408 + .../timm_train/tinynet_a_training.txt | 302 + .../timm_train/tnt_s_patch16_224_training.txt | 146 + .../timm_train/twins_pcpvt_base_training.txt | 245 + .../timm_train/visformer_small_training.txt | 132 + .../vit_base_patch16_224_training.txt | 83 + .../timm_train/volo_d1_224_training.txt | 216 + .../BERT_pytorch_training.txt | 94 + .../Background_Matting_training.txt | 119 + .../LearningToPaint_training.txt | 86 + .../torchbench_train/Super_SloMo_training.txt | 255 + .../torchbench_train/alexnet_training.txt | 58 + ...ntion_is_all_you_need_pytorch_training.txt | 148 + .../torchbench_train/dcgan_training.txt | 42 + .../torchbench_train/densenet121_training.txt | 609 + .../fambench_dlrm_training.txt | 1063 + 
.../fastNLP_Bert_training.txt | 157 + .../torchbench_train/hf_Albert_training.txt | 110 + .../torchbench_train/hf_Bart_training.txt | 76 + .../torchbench_train/hf_Bert_training.txt | 76 + .../torchbench_train/hf_BigBird_training.txt | 235 + .../hf_DistilBert_training.txt | 73 + .../torchbench_train/hf_GPT2_training.txt | 88 + .../hf_Longformer_training.txt | 189 + .../maml_omniglot_training.txt | 49 + .../torchbench_train/mnasnet1_0_training.txt | 163 + .../mobilenet_v2_training.txt | 165 + .../mobilenet_v3_large_training.txt | 277 + .../nvidia_deeprecommender_training.txt | 36 + .../pytorch_CycleGAN_and_pix2pix_training.txt | 67 + .../pytorch_stargan_training.txt | 80 + .../pytorch_struct_training.txt | 63 + .../pytorch_unet_training.txt | 119 + .../torchbench_train/resnet18_training.txt | 81 + .../torchbench_train/resnet50_training.txt | 134 + .../resnext50_32x4d_training.txt | 124 + .../shufflenet_v2_x1_0_training.txt | 123 + .../speech_transformer_training.txt | 178 + .../squeezenet1_1_training.txt | 90 + .../timm_efficientdet_training.txt | 623 + .../timm_efficientnet_training.txt | 295 + .../torchbench_train/timm_nfnet_training.txt | 289 + .../torchbench_train/timm_regnet_training.txt | 178 + .../timm_resnest_training.txt | 205 + .../timm_vision_transformer_training.txt | 77 + .../torchbench_train/timm_vovnet_training.txt | 130 + .../torchbench_train/tts_angular_training.txt | 51 + .../torchbench_train/vgg16_training.txt | 72 + .../vision_maskrcnn_training.txt | 477 + .../torchbench_train/yolov3_training.txt | 261 + .../microbenchmarks/operator_inp_utils.py | 342 + .../dynamo/microbenchmarks/operatorbench.py | 242 + .../dynamo/microbenchmarks/profile_conv.py | 107 + benchmarks/dynamo/microbenchmarks/utils.py | 19 + benchmarks/dynamo/runner.py | 1345 + benchmarks/dynamo/test.py | 44 + benchmarks/dynamo/timm_models.py | 322 + benchmarks/dynamo/timm_models_list.txt | 62 + benchmarks/dynamo/torchbench.py | 365 + benchmarks/dynamo/torchbench_models_list.txt | 28 + benchmarks/dynamo/training_loss.py | 205 + benchmarks/instruction_counts/README.md | 2 +- benchmarks/instruction_counts/core/utils.py | 2 +- benchmarks/nested/nested_bmm_bench.py | 53 + benchmarks/operator_benchmark/README.md | 2 +- .../pt/ao_sparsifier_test.py | 4 +- .../operator_benchmark/pt/interpolate_test.py | 12 + .../operator_benchmark/pt/qactivation_test.py | 14 +- .../operator_benchmark/pt/qarithmetic_test.py | 2 +- .../pt/qatembedding_ops_test.py | 2 +- benchmarks/operator_benchmark/pt/qcat_test.py | 2 +- .../operator_benchmark/pt/qconv_test.py | 2 +- .../pt/qembeddingbag_test.py | 2 +- .../operator_benchmark/pt/qlinear_test.py | 4 +- .../pt/quantization_test.py | 2 +- .../static_runtime/test_generated_ops.cc | 398 +- .../static_runtime/test_static_module.cc | 19 +- .../static_runtime/test_static_runtime.cc | 84 +- benchmarks/static_runtime/test_utils.cc | 19 +- .../better_transformer_vs_mha_functional.py | 195 + benchmarks/transformer/sdp.py | 157 + benchmarks/transformer/sdp_backwards.py | 189 + binaries/CMakeLists.txt | 13 +- binaries/optimize_for_mobile.cc | 15 +- binaries/speed_benchmark_torch.cc | 4 + buckbuild.bzl | 85 +- build.bzl | 4 + build_variables.bzl | 108 +- c10/CMakeLists.txt | 5 +- c10/c10_defs.bzl | 29 - c10/core/AutogradState.cpp | 6 +- c10/core/AutogradState.h | 18 +- c10/core/Device.cpp | 13 +- c10/core/Device.h | 3 +- c10/core/DeviceType.cpp | 48 +- c10/core/DeviceType.h | 3 + c10/core/DispatchKey.cpp | 15 +- c10/core/DispatchKey.h | 7 + c10/core/DispatchKeySet.cpp | 12 +- 
c10/core/DispatchKeySet.h | 14 +- c10/core/InferenceMode.h | 3 +- c10/core/MemoryFormat.h | 36 +- c10/core/PyHandleCache.h | 75 + c10/core/QEngine.h | 4 + c10/core/SafePyObject.cpp | 5 + c10/core/SafePyObject.h | 31 +- c10/core/Scalar.cpp | 13 +- c10/core/Scalar.h | 166 +- c10/core/ScalarType.h | 2 +- c10/core/Storage.h | 4 + c10/core/StorageImpl.h | 14 +- c10/core/SymFloat.cpp | 81 + c10/core/SymFloat.h | 71 + c10/core/SymInt.cpp | 187 +- c10/core/SymInt.h | 258 +- c10/core/SymIntArrayRef.cpp | 34 +- c10/core/SymIntArrayRef.h | 215 +- c10/core/SymIntNodeImpl.cpp | 11 - c10/core/SymIntNodeImpl.h | 81 - c10/core/SymNodeImpl.cpp | 3 + c10/core/SymNodeImpl.h | 118 + c10/core/TensorImpl.cpp | 410 +- c10/core/TensorImpl.h | 924 +- c10/core/UndefinedTensorImpl.cpp | 8 +- c10/core/UndefinedTensorImpl.h | 2 + c10/core/WrapDimMinimal.cpp | 26 +- c10/core/WrapDimMinimal.h | 39 +- c10/core/impl/HermeticPyObjectTLS.cpp | 23 + c10/core/impl/HermeticPyObjectTLS.h | 61 + c10/core/impl/PyInterpreter.cpp | 202 +- c10/core/impl/PyInterpreter.h | 252 +- c10/core/impl/PythonDispatcherTLS.cpp | 32 + c10/core/impl/PythonDispatcherTLS.h | 27 + c10/core/impl/SizesAndStrides.cpp | 66 +- c10/core/impl/SizesAndStrides.h | 119 +- c10/core/impl/TorchDispatchModeTLS.cpp | 72 + c10/core/impl/TorchDispatchModeTLS.h | 27 + c10/cuda/CMakeLists.txt | 10 +- c10/cuda/CUDACachingAllocator.cpp | 1029 +- c10/cuda/CUDACachingAllocator.h | 225 +- c10/cuda/CUDAException.cpp | 35 + c10/cuda/CUDAException.h | 62 +- c10/cuda/CUDAFunctions.cpp | 4 + c10/cuda/CUDAFunctions.h | 11 + c10/cuda/CUDAMallocAsyncAllocator.cpp | 856 + c10/cuda/CUDAMiscFunctions.cpp | 5 + c10/cuda/CUDAMiscFunctions.h | 5 +- c10/cuda/CUDAStream.cpp | 4 +- c10/cuda/impl/CUDAGuardImpl.h | 10 +- c10/defs_hip.bzl | 126 - c10/macros/Macros.h | 91 +- c10/macros/build.bzl | 9 + c10/test/core/SymInt_test.cpp | 11 +- c10/test/core/impl/SizesAndStrides_test.cpp | 4 +- c10/test/util/complex_math_test_common.h | 128 + c10/test/util/intrusive_ptr_test.cpp | 5 + c10/test/util/string_view_test.cpp | 16 +- c10/util/C++17.h | 26 +- c10/util/DimVector.h | 2 + c10/util/Exception.cpp | 79 +- c10/util/Exception.h | 148 +- c10/util/FunctionRef.h | 2 +- c10/util/Half-inl.h | 6 +- c10/util/Half.h | 6 +- c10/util/IdWrapper.h | 1 + c10/util/Optional.h | 3 +- c10/util/SmallVector.cpp | 1 + c10/util/SmallVector.h | 1 + c10/util/ThreadLocalDebugInfo.cpp | 4 +- c10/util/build.bzl | 2 +- c10/util/complex_math.h | 31 + c10/util/hash.h | 8 + c10/util/intrusive_ptr.h | 3 + c10/util/irange.h | 30 +- c10/util/logging_is_not_google_glog.h | 2 +- c10/util/safe_numerics.h | 6 + c10/util/string_view.h | 8 +- c10/util/strong_type.h | 8 - c10/util/typeid.cpp | 64 +- c10/util/typeid.h | 121 +- c2_defs.bzl | 48 +- caffe2/CMakeLists.txt | 1244 +- caffe2/README.md | 2 - caffe2/contrib/aten/gen_op.py | 17 +- caffe2/contrib/nccl/cuda_nccl_gpu.cc | 2 +- caffe2/contrib/tensorrt/README.md | 2 +- .../contrib/tensorrt/tensorrt_tranformer.cc | 2 +- caffe2/core/CMakeLists.txt | 2 +- caffe2/core/context_gpu.cu | 6 +- caffe2/core/context_gpu.h | 12 +- caffe2/core/macros.h.in | 2 + caffe2/core/nomnigraph/CMakeLists.txt | 2 +- caffe2/core/tensor.cc | 2 +- caffe2/core/tensor.h | 5 + caffe2/defs.bzl | 89 - caffe2/defs_hip.bzl | 149 - .../mobile/contrib/libopencl-stub/README.md | 2 +- caffe2/mobile/contrib/nnapi/nnapi.h | 4 +- caffe2/operators/batch_box_cox_op.cc | 300 +- caffe2/operators/batch_box_cox_op.h | 60 +- .../generate_proposals_op_util_nms_gpu.cu | 42 +- ...generate_proposals_op_util_nms_gpu_test.cc | 2 +- 
.../rnn/recurrent_network_executor_gpu.cc | 3 +- caffe2/operators/scale_blobs_op.cu | 8 +- caffe2/operators/segment_reduction_op_gpu.cu | 18 +- caffe2/perfkernels/CMakeLists.txt | 2 +- caffe2/perfkernels/batch_box_cox.cc | 113 + caffe2/perfkernels/batch_box_cox.h | 35 + caffe2/perfkernels/batch_box_cox_avx2.cc | 399 + caffe2/perfkernels/common.h | 3 + caffe2/perfkernels/lstm_unit_cpu-impl.h | 22 +- caffe2/perfkernels/vectorizer.h | 28 + caffe2/proto/caffe2_pb.h | 2 +- caffe2/python/CMakeLists.txt | 1 + caffe2/python/clean_workspace_test.py | 15 + caffe2/python/onnx/ONNXOpCoverage.md | 2 +- caffe2/python/operator_test/_utils.py | 50 + .../operator_test/layer_norm_op_test.py | 30 +- .../operator_test/torch_integration_test.py | 66 +- caffe2/python/optimizer.py | 72 +- caffe2/python/optimizer_test.py | 22 +- caffe2/python/pybind_state.cc | 263 +- caffe2/python/pybind_workspace.cc | 72 + caffe2/python/pybind_workspace.h | 15 + caffe2/python/utils.py | 2 + caffe2/python/workspace_test.py | 6 - caffe2/quantization/server/README.md | 4 +- caffe2/quantization/server/dnnlowp.h | 2 + .../server/fully_connected_fake_lowp_op.h | 2 + caffe2/release-notes.md | 2 +- caffe2/serialize/inline_container.cc | 18 +- caffe2/serialize/inline_container.h | 2 +- caffe2/sgd/learning_rate_op.cc | 10 +- caffe2/utils/CMakeLists.txt | 2 +- caffe2/utils/math/elementwise.cu | 2 +- caffe2/utils/math/reduce.cu | 4 +- caffe2/utils/math_gpu.cu | 8 +- caffe2/utils/threadpool/ThreadPool.cc | 11 + cmake/Dependencies.cmake | 34 +- cmake/External/nccl.cmake | 39 +- cmake/Modules/FindMKLDNN.cmake | 2 + .../FindCUDA/select_compute_arch.cmake | 24 +- cmake/Summary.cmake | 6 +- cmake/VulkanCodegen.cmake | 12 +- cmake/public/LoadHIP.cmake | 3 - cmake/public/mkl.cmake | 7 +- cmake/public/utils.cmake | 119 +- defs.bzl | 8 - defs_gpu.bzl | 166 - defs_hip.bzl | 136 - docker.Makefile | 48 +- docs/Makefile | 7 +- docs/caffe2/.Doxyfile-c | 2 +- docs/caffe2/.Doxyfile-python | 2 +- docs/cpp/source/notes/tensor_cuda_stream.rst | 10 +- docs/requirements.txt | 16 +- docs/source/_dynamo.rst | 13 + .../_static/img/masked/tensor_comparison.jpg | Bin 0 -> 179951 bytes docs/source/amp.rst | 1 - docs/source/autograd.rst | 2 + docs/source/backends.rst | 42 + docs/source/bottleneck.rst | 6 +- docs/source/community/build_ci_governance.rst | 19 + docs/source/community/contribution_guide.rst | 16 +- docs/source/community/governance.rst | 29 +- docs/source/community/persons_of_interest.rst | 64 +- docs/source/conf.py | 54 +- docs/source/cuda._sanitizer.rst | 102 + docs/source/cuda.rst | 16 + docs/source/data.rst | 6 +- docs/source/deploy.rst | 241 +- docs/source/distributed.checkpoint.rst | 4 + docs/source/distributed.rst | 20 +- docs/source/elastic/agent.rst | 15 + docs/source/elastic/timer.rst | 11 + docs/source/fsdp.rst | 12 + docs/source/fx.rst | 6 +- docs/source/index.rst | 7 +- docs/source/jit_language_reference.rst | 2 +- docs/source/jit_language_reference_v2.rst | 4 +- docs/source/jit_unsupported.rst | 2 +- docs/source/linalg.rst | 6 + docs/source/masked.rst | 297 + docs/source/mobile_optimizer.rst | 5 +- docs/source/nested.rst | 125 +- docs/source/notes/autograd.rst | 2 +- docs/source/notes/cuda.rst | 150 +- docs/source/notes/extending.rst | 8 +- docs/source/notes/hip.rst | 11 + docs/source/notes/modules.rst | 21 +- docs/source/notes/numerical_accuracy.rst | 53 +- docs/source/onnx.rst | 344 +- docs/source/onnx_diagnostics.rst | 35 + docs/source/onnx_supported_aten_ops.rst | 34 +- docs/source/optim.rst | 6 +- docs/source/profiler.rst | 11 + 
.../quantization-accuracy-debugging.rst | 2 +- docs/source/quantization-support.rst | 168 +- docs/source/quantization.rst | 64 +- docs/source/rpc.rst | 2 +- .../onnx/build_onnx_diagnostics_rules_md.py | 37 + .../build_onnx_supported_aten_op_csv_table.py | 51 +- docs/source/signal.rst | 30 + docs/source/sparse.rst | 247 +- docs/source/special.rst | 2 + docs/source/storage.rst | 6 +- docs/source/tensor_attributes.rst | 16 +- docs/source/tensors.rst | 2 - docs/source/torch.rst | 16 +- functorch/.circleci/config.yml | 316 - .../unittest/linux/scripts/environment.yml | 17 - .../unittest/linux/scripts/install.sh | 61 - .../unittest/linux/scripts/post_process.sh | 8 - .../unittest/linux/scripts/run_test.sh | 16 - .../unittest/linux/scripts/setup_env.sh | 39 - .../unittest/windows/scripts/environment.yml | 20 - .../unittest/windows/scripts/install.sh | 46 - .../windows/scripts/install_conda.bat | 1 - .../unittest/windows/scripts/post_process.sh | 6 - .../unittest/windows/scripts/run_test.sh | 26 - .../unittest/windows/scripts/set_cuda_envs.sh | 48 - .../unittest/windows/scripts/setup_env.sh | 39 - .../windows/scripts/vc_env_helper.bat | 39 - functorch/.flake8 | 20 - functorch/.github/workflows/docs.yml | 82 - functorch/.github/workflows/lint.yml | 63 - functorch/.github/workflows/wheels.yml | 61 - functorch/.lintrunner.toml | 48 - functorch/CMakeLists.txt | 38 + functorch/CODE_OF_CONDUCT.md | 76 - functorch/CONTRIBUTING.md | 12 - functorch/LICENSE | 26 - functorch/README.md | 66 +- functorch/{functorch => }/__init__.py | 11 +- functorch/{functorch => }/_src/__init__.py | 0 functorch/_src/aot_autograd.py | 1965 + .../{functorch => }/_src/benchmark_utils.py | 0 .../{functorch => }/_src/compile_utils.py | 10 + functorch/{functorch => }/_src/compilers.py | 92 +- functorch/_src/config.py | 38 + .../{functorch => }/_src/eager_transforms.py | 159 +- functorch/_src/fx_minifier.py | 306 + .../{functorch => }/_src/make_functional.py | 2 +- .../_src/named_members_polyfill.py | 0 .../{functorch => }/_src/partitioners.py | 208 +- functorch/{functorch => }/_src/python_key.py | 5 +- .../{functorch => }/_src/pytree_hacks.py | 0 .../_src/top_operators_github_usage.py | 0 functorch/{functorch => }/_src/vmap.py | 16 +- functorch/benchmarks/operator_authoring.py | 8 +- functorch/benchmarks/pointwise_scorecard.py | 4 +- .../transformer_fusion_patterns/benchmark.py | 3 +- .../bias_gelu_dropout.py | 3 +- functorch/{functorch => }/compile/__init__.py | 7 +- functorch/{functorch => }/csrc/dim/arena.h | 0 functorch/{functorch => }/csrc/dim/dim.cpp | 100 +- functorch/{functorch => }/csrc/dim/dim.h | 0 .../{functorch => }/csrc/dim/minpybind.h | 9 +- .../csrc/dim/python_variable_simple.h | 0 functorch/csrc/init_dim_only.cpp | 22 + functorch/{functorch => }/dim/README.md | 22 +- functorch/{functorch => }/dim/__init__.py | 4 +- functorch/{functorch => }/dim/batch_tensor.py | 2 +- .../{functorch => }/dim/delayed_mul_tensor.py | 0 functorch/{functorch => }/dim/dim.py | 0 functorch/{functorch => }/dim/magic_trace.py | 0 .../{functorch => }/dim/op_properties.py | 0 functorch/{functorch => }/dim/reference.py | 0 functorch/{functorch => }/dim/tree_map.py | 0 functorch/{functorch => }/dim/wrap_type.py | 0 functorch/docs/source/_static/css/custom.css | 8 - .../docs/source/_static/images/functorch.svg | 6 - functorch/docs/source/_templates/layout.html | 339 +- functorch/docs/source/batch_norm.rst | 2 +- functorch/docs/source/conf.py | 6 +- functorch/docs/source/experimental.rst | 2 - functorch/docs/source/functorch.rst | 1 + 
functorch/docs/source/index.rst | 6 +- functorch/docs/source/install.rst | 40 +- functorch/docs/source/ux_limitations.rst | 4 +- functorch/examples/compilation/fuse_module.py | 4 +- .../examples/dp_cifar10/cifar10_transforms.py | 2 +- functorch/examples/maml_omniglot/README.md | 2 +- .../maml_omniglot/maml-omniglot-transforms.py | 2 +- .../maml_omniglot/support/omniglot_loaders.py | 2 +- .../{functorch => }/experimental/__init__.py | 5 +- functorch/experimental/_map.py | 105 + .../experimental/batch_norm_replacement.py | 0 functorch/experimental/cond.py | 157 + functorch/experimental/control_flow.py | 1 + functorch/experimental/ops.py | 1 + functorch/functorch/_src/aot_autograd.py | 808 - functorch/functorch/_src/config.py | 27 - functorch/functorch/_src/custom_function.py | 20 - functorch/functorch/_src/fx_minifier.py | 269 - functorch/functorch/_src/monkey_patching.py | 80 - functorch/functorch/csrc/CompileCache.cpp | 288 - functorch/functorch/csrc/CompileCache.h | 17 - functorch/functorch/csrc/Constants.h | 31 - functorch/functorch/csrc/CustomFunction.cpp | 291 - functorch/functorch/csrc/CustomFunction.h | 14 - functorch/functorch/csrc/DynamicLayer.h | 93 - functorch/functorch/csrc/Macros.h | 10 - functorch/functorch/csrc/PlumbingHelper.h | 39 - functorch/functorch/csrc/TensorWrapper.h | 68 - functorch/functorch/csrc/init.cpp | 419 - .../aot_autograd_optimizations.ipynb | 31 +- .../notebooks/colab/ensembling_colab.ipynb | 598 - .../colab/jacobians_hessians_colab.ipynb | 1120 - .../colab/per_sample_grads_colab.ipynb | 795 - functorch/notebooks/colab/readme.md | 5 - functorch/notebooks/ensembling.ipynb | 4 +- functorch/notebooks/jacobians_hessians.ipynb | 2 +- .../notebooks/neural_tangent_kernels.ipynb | 4 + functorch/notebooks/per_sample_grads.ipynb | 2 +- functorch/notebooks/whirlwind_tour.ipynb | 4 + functorch/op_analysis/public_api | 24 +- functorch/packaging/build_wheel.sh | 19 - functorch/packaging/pkg_helpers.bash | 414 - .../windows/internal/cuda_install.bat | 264 - .../windows/internal/driver_update.bat | 25 - .../windows/internal/vc_env_helper.bat | 43 - .../windows/internal/vc_install_helper.sh | 16 - functorch/pull_request_template.md | 5 - functorch/setup.cfg | 18 - functorch/setup.py | 149 - functorch/test/functorch_lagging_op_db.py | 635 - functorch/test/pytest.ini | 2 - functorch/test/test_compile_cache.py | 686 - functorch/test/test_minifier.py | 53 - functorch/test/test_pythonkey.py | 645 - functorch/tools/lint/black_linter.py | 228 - functorch/tools/lint/flake8_linter.py | 373 - functorch/tools/lint/pip_init.py | 75 - functorch/version.txt | 1 - ios/LibTorch-Lite.podspec | 3 +- ios/LibTorch.podspec | 3 +- ios/TestApp/AppleWWDRCAG3.cer | Bin 1109 -> 0 bytes ios/TestApp/README.md | 12 + ios/TestApp/TestApp.xcodeproj/project.pbxproj | 42 +- ios/TestApp/TestApp/Benchmark.h | 15 + ios/TestApp/TestApp/Benchmark.mm | 108 + ios/TestApp/TestApp/ViewController.mm | 40 + ios/TestApp/benchmark/config.json | 7 + ios/TestApp/benchmark/setup.rb | 15 +- ios/TestApp/fastlane/Fastfile | 16 - mypy-strict.ini | 1 + pt_ops.bzl | 6 +- pt_template_srcs.bzl | 1 + pytest.ini | 8 +- requirements.txt | 4 + scripts/buck_setup.sh | 6 +- scripts/build_android.sh | 36 +- scripts/build_ios.sh | 47 +- scripts/build_mobile.sh | 31 + scripts/onnx/test.sh | 1 + scripts/release_notes/commitlist.py | 4 + scripts/xcode_build.rb | 18 +- setup.py | 424 +- test/allowlist_for_publicAPI.json | 813 +- .../ao/sparsity/test_activation_sparsifier.py | 4 +- test/ao/sparsity/test_composability.py | 22 +- 
test/ao/sparsity/test_data_scheduler.py | 4 +- test/ao/sparsity/test_data_sparsifier.py | 4 +- test/ao/sparsity/test_kernels.py | 10 +- test/ao/sparsity/test_parametrization.py | 2 +- test/ao/sparsity/test_pruner.py | 2 +- .../ao/sparsity/test_qlinear_packed_params.py | 105 +- test/ao/sparsity/test_scheduler.py | 100 +- test/ao/sparsity/test_sparsifier.py | 10 +- test/ao/sparsity/test_sparsity_utils.py | 2 +- test/conftest.py | 23 + test/cpp/api/CMakeLists.txt | 11 +- test/cpp/api/autograd.cpp | 86 +- test/cpp/api/functional.cpp | 14 + test/cpp/api/imethod.cpp | 64 - test/cpp/api/inference_mode.cpp | 6 +- test/cpp/api/modules.cpp | 2 +- test/cpp/api/nested.cpp | 15 + test/cpp/api/nn_utils.cpp | 2 +- test/cpp/api/serialize.cpp | 41 + test/cpp/api/static.cpp | 4 + test/cpp/api/support.h | 13 +- test/cpp/c10d/CMakeLists.txt | 13 + test/cpp/c10d/FileStoreTest.cpp | 16 +- test/cpp/c10d/HashStoreTest.cpp | 4 +- test/cpp/c10d/ProcessGroupGlooAsyncTest.cpp | 17 +- test/cpp/c10d/ProcessGroupGlooTest.cpp | 26 +- test/cpp/c10d/ProcessGroupMPITest.cpp | 45 +- test/cpp/c10d/ProcessGroupNCCLErrorsTest.cpp | 4 +- test/cpp/c10d/ProcessGroupNCCLTest.cpp | 30 +- test/cpp/c10d/ProcessGroupUCCTest.cpp | 35 + test/cpp/c10d/StoreTestCommon.hpp | 2 +- test/cpp/c10d/TCPStoreTest.cpp | 4 +- test/cpp/c10d/example/allreduce.cpp | 6 +- test/cpp/jit/CMakeLists.txt | 21 +- test/cpp/jit/test_custom_class.cpp | 14 + .../jit/test_custom_class_registrations.cpp | 12 +- .../cpp/jit/test_custom_class_registrations.h | 5 + test/cpp/jit/test_flatbuffer.cpp | 12 +- test/cpp/jit/test_jit_logging_levels.cpp | 10 +- test/cpp/jit/test_misc.cpp | 138 + test/cpp/jit/test_module_api.cpp | 4 +- test/cpp/lazy/CMakeLists.txt | 1 - test/cpp/lazy/test_ir.cpp | 10 +- test/cpp/lazy/test_ir_util.cpp | 2 +- test/cpp/lazy/test_lazy_ops.cpp | 34 +- test/cpp/lazy/test_symbolic_shape.cpp | 161 - test/cpp/lite_interpreter_runtime/resources.h | 41 + .../test_mobile_profiler.cpp | 75 +- test/cpp/profiler/perf_events.cpp | 248 + test/cpp/rpc/e2e_test_base.h | 2 +- test/cpp/rpc/test_e2e_tensorpipe.cpp | 2 +- test/cpp/tensorexpr/test_cuda.cpp | 819 +- test/cpp/tensorexpr/test_kernel.cpp | 73 +- test/cpp/tensorexpr/test_loopnest.cpp | 4 +- test/cpp/tensorexpr/test_quantization.cpp | 3 +- test/cpp_extensions/cpp_c10d_extension.cpp | 26 +- test/cpp_extensions/cpp_c10d_extension.hpp | 37 +- .../open_registration_extension.cpp | 7 +- test/cpp_extensions/ort_extension.cpp | 6 - test/defs.bzl | 112 - .../_composable/test_checkpoint.py | 83 + test/distributed/_composable/test_contract.py | 122 + .../_composable/test_fully_shard.py | 267 + .../distributed/_composable/test_replicate.py | 107 + .../_shard/checkpoint/test_checkpoint.py | 413 - .../sharded_tensor/ops/test_embedding.py | 18 +- .../sharded_tensor/ops/test_embedding_bag.py | 12 +- .../sharding_spec/test_sharding_spec.py | 66 + test/distributed/_tensor/README.md | 11 + test/distributed/_tensor/__init__.py | 1 + .../distributed/_tensor/parallel}/__init__.py | 0 .../_tensor/parallel/test_2d_parallel.py | 214 + .../_tensor/parallel/test_parallelize_api.py | 219 + .../_tensor/parallel/test_tp_examples.py | 437 + .../_tensor/parallel/test_tp_style.py | 197 + .../parallel/test_view_sharding_dim_change.py | 30 + test/distributed/_tensor/test_api.py | 234 + test/distributed/_tensor/test_common_rules.py | 476 + test/distributed/_tensor/test_device_mesh.py | 518 + test/distributed/_tensor/test_dtensor.py | 359 + test/distributed/_tensor/test_dtensor_ops.py | 704 + test/distributed/_tensor/test_math_ops.py | 126 + 
test/distributed/_tensor/test_matrix_ops.py | 302 + .../distributed/_tensor/test_pointwise_ops.py | 285 + test/distributed/_tensor/test_redistribute.py | 317 + test/distributed/_tensor/test_tensor_ops.py | 365 + .../_tensor/test_tp_sharding_ops.py | 101 + test/distributed/_tensor/test_view_ops.py | 480 + .../ddp_comm_hooks/test_ddp_hooks.py | 10 +- .../distributed/checkpoint/test_checkpoint.py | 392 + .../checkpoint/test_dedup_tensors.py | 45 + .../checkpoint/test_file_system_checkpoint.py | 20 +- .../test_file_system_checkpoint_cpu.py | 2 +- test/distributed/checkpoint/test_planner.py | 269 + test/distributed/checkpoint/test_traverse.py | 176 + .../{_shard => }/checkpoint/test_utils.py | 8 +- test/distributed/defs.bzl | 39 - .../server/test/local_elastic_agent_test.py | 62 +- .../timer/file_based_local_timer_test.py | 266 + test/distributed/fsdp/defs.bzl | 22 - .../fsdp/test_checkpoint_wrapper.py | 264 +- .../fsdp/test_distributed_checkpoint.py | 27 +- .../fsdp/test_flatten_params_wrapper.py | 315 - test/distributed/fsdp/test_fsdp_apply.py | 5 +- test/distributed/fsdp/test_fsdp_checkpoint.py | 127 +- .../fsdp/test_fsdp_clip_grad_norm.py | 264 +- test/distributed/fsdp/test_fsdp_comm.py | 72 +- test/distributed/fsdp/test_fsdp_comm_hooks.py | 331 +- test/distributed/fsdp/test_fsdp_core.py | 52 +- test/distributed/fsdp/test_fsdp_exec_order.py | 46 +- .../fsdp/test_fsdp_flatten_params.py | 445 + .../fsdp/test_fsdp_freezing_weights.py | 8 +- test/distributed/fsdp/test_fsdp_grad_acc.py | 107 +- .../fsdp/test_fsdp_ignored_modules.py | 29 +- test/distributed/fsdp/test_fsdp_input.py | 7 +- test/distributed/fsdp/test_fsdp_memory.py | 8 +- test/distributed/fsdp/test_fsdp_meta.py | 68 +- test/distributed/fsdp/test_fsdp_misc.py | 184 +- .../fsdp/test_fsdp_mixed_precision.py | 241 +- .../fsdp/test_fsdp_multiple_forward.py | 8 +- .../fsdp/test_fsdp_multiple_wrapping.py | 3 +- .../distributed/fsdp/test_fsdp_optim_state.py | 844 +- test/distributed/fsdp/test_fsdp_overlap.py | 13 +- .../fsdp/test_fsdp_param_exec_order_wrap.py | 134 - test/distributed/fsdp/test_fsdp_pure_fp16.py | 6 +- .../fsdp/test_fsdp_sharded_grad_scaler.py | 69 +- test/distributed/fsdp/test_fsdp_state_dict.py | 426 +- .../fsdp/test_fsdp_summon_full_params.py | 304 +- .../fsdp/test_fsdp_tp_integration.py | 486 + test/distributed/fsdp/test_fsdp_traversal.py | 15 +- test/distributed/fsdp/test_fsdp_uneven.py | 7 +- .../fsdp/test_fsdp_use_orig_params.py | 1057 + test/distributed/fsdp/test_shard_utils.py | 26 +- test/distributed/fsdp/test_utils.py | 117 +- test/distributed/fsdp/test_wrap.py | 130 +- .../optim/test_apply_optimizer_in_backward.py | 113 + test/distributed/pipeline/sync/defs.bzl | 22 - test/distributed/test_c10d_common.py | 383 +- test/distributed/test_c10d_error_logger.py | 142 + test/distributed/test_c10d_gloo.py | 181 +- test/distributed/test_c10d_nccl.py | 257 +- test/distributed/test_c10d_spawn_ucc.py | 110 + test/distributed/test_distributed_spawn.py | 2 +- test/distributed/test_dynamo_distributed.py | 567 + test/distributed/test_multi_threaded_pg.py | 87 + test/distributed/test_store.py | 48 +- test/distributions/test_distributions.py | 23 +- .../dynamo}/__init__.py | 0 .../dynamo/mock_modules}/__init__.py | 0 test/dynamo/mock_modules/mock_module1.py | 2 + test/dynamo/mock_modules/mock_module2.py | 19 + test/dynamo/mock_modules/mock_module3.py | 7 + test/dynamo/test_aot_autograd.py | 288 + test/dynamo/test_aot_cudagraphs.py | 208 + test/dynamo/test_dynamic_shapes.py | 112 + test/dynamo/test_export.py | 1493 + 
test/dynamo/test_export_mutations.py | 134 + test/dynamo/test_functions.py | 697 + test/dynamo/test_global.py | 233 + test/dynamo/test_global_declaration.py | 4 + test/dynamo/test_minifier.py | 318 + test/dynamo/test_misc.py | 3062 ++ test/dynamo/test_model_output.py | 166 + test/dynamo/test_modules.py | 1045 + test/dynamo/test_nops.py | 72 + test/dynamo/test_optimizations.py | 206 + test/dynamo/test_optimizers.py | 167 + test/dynamo/test_python_autograd.py | 287 + test/dynamo/test_recompile_ux.py | 205 + test/dynamo/test_replay_record.py | 194 + test/dynamo/test_repros.py | 2066 + test/dynamo/test_skip_non_tensor.py | 113 + test/dynamo/test_subgraphs.py | 546 + test/dynamo/test_torchxla_integration.py | 131 + test/dynamo/test_torchxla_num_output.py | 120 + test/dynamo/test_torchxla_util.py | 26 + test/dynamo/test_unspec.py | 229 + test/dynamo/test_verify_correctness.py | 175 + ..._compat-fx_backcompat_class_members.expect | 4 +- ...t-fx_backcompat_function_signatures.expect | 8 +- .../check_forward_backward_compatibility.py | 203 +- {functorch/test => test/functorch}/attn_ft.py | 0 .../functorch}/attn_positional.py | 0 .../test => test/functorch}/common_utils.py | 212 +- .../functorch}/discover_coverage.py | 4 - .../functorch}/functorch_additional_op_db.py | 12 +- test/functorch/test_aotdispatch.py | 1989 + test/functorch/test_control_flow.py | 467 + .../test => test/functorch}/test_dims.py | 32 +- .../functorch}/test_eager_transforms.py | 745 +- .../functorch}/test_functionalize.py | 11 +- .../test_memory_efficient_fusion.py | 0 test/functorch/test_minifier.py | 116 + .../test => test/functorch}/test_ops.py | 830 +- .../test => test/functorch}/test_vmap.py | 291 +- .../functorch}/xfail_suggester.py | 4 +- test/fx/quantization.py | 2 +- test/fx/test_common_passes.py | 9 +- test/fx/test_fx_param_shape_control_flow.py | 8 +- test/fx/test_gradual_type.py | 9 +- test/fx/test_pass_infra.py | 15 + test/fx/test_subgraph_rewriter.py | 372 +- test/fx/test_z3_gradual_types.py | 84 +- .../callbacks => test/inductor}/__init__.py | 0 test/inductor/cpp/.gitignore | 13 + test/inductor/cpp/CMakeLists.txt | 47 + test/inductor/cpp/test.sh | 7 + test/inductor/cpp/test_cpp_prefix.cpp | 21 + test/inductor/opinfo_harness.py | 25 + test/inductor/test_minifier.py | 213 + test/inductor/test_perf.py | 502 + test/inductor/test_smoke.py | 30 + test/inductor/test_torchinductor.py | 5566 +++ test/inductor/test_torchinductor_opinfo.py | 591 + test/jit/test_async.py | 15 - test/jit/test_backends.py | 11 +- test/jit/test_freezing.py | 537 +- test/jit/test_hooks.py | 2 +- test/jit/test_misc.py | 19 + test/jit/test_module_interface.py | 8 +- test/jit/test_python_bindings.py | 5 + test/jit/test_symbolic_shape_analysis.py | 7 +- test/jit/test_tensor_creation_ops.py | 8 +- test/jit/test_tracer.py | 8 - test/jit/test_with.py | 2 + test/jit/xnnpack/test_xnnpack_delegate.py | 192 + test/lazy/test_debug_util.py | 44 + test/lazy/test_extract_compiled_graph.py | 2 +- test/lazy/test_meta_kernel.py | 34 + test/lazy/test_reuse_ir.py | 4 + test/lazy/test_step_closures.py | 91 + test/lazy/test_ts_opinfo.py | 34 +- test/mobile/model_test/README.md | 2 +- test/mobile/test_lite_script_module.py | 4 +- test/mobile/test_lite_script_type.py | 14 +- .../test_quantize_fx_lite_script_module.py | 16 +- test/nn/test_convolution.py | 2480 + test/nn/test_dropout.py | 283 + test/nn/test_embedding.py | 1193 + test/nn/test_init.py | 420 + test/nn/test_lazy_modules.py | 626 + test/nn/test_module_hooks.py | 1334 + test/nn/test_packed_sequence.py | 392 + 
test/nn/test_parametrization.py | 1525 + test/nn/test_pooling.py | 1450 + test/nn/test_pruning.py | 939 + .../expect/TestOperators.test_acos.expect | 2 +- .../TestOperators.test_add_broadcast.expect | 2 +- ...stOperators.test_add_left_broadcast.expect | 2 +- ...tOperators.test_add_size1_broadcast.expect | 2 +- ...tors.test_add_size1_right_broadcast.expect | 2 +- ....test_add_size1_singleton_broadcast.expect | 2 +- .../TestOperators.test_addconstant.expect | 2 +- .../expect/TestOperators.test_addmm.expect | 2 +- .../expect/TestOperators.test_argmax.expect | 2 +- .../expect/TestOperators.test_asin.expect | 2 +- .../expect/TestOperators.test_at_op.expect | 8 +- .../expect/TestOperators.test_atan.expect | 2 +- .../TestOperators.test_avg_pool2d.expect | 9 +- .../expect/TestOperators.test_baddbmm.expect | 26 +- .../expect/TestOperators.test_basic.expect | 2 +- .../TestOperators.test_batchnorm.expect | 7 +- .../TestOperators.test_batchnorm_1d.expect | 7 +- ...stOperators.test_batchnorm_noaffine.expect | 7 +- ...tOperators.test_batchnorm_onnx_irv4.expect | 7 +- ...stOperators.test_batchnorm_training.expect | 9 +- .../expect/TestOperators.test_chunk.expect | 2 +- .../expect/TestOperators.test_clip.expect | 2 +- .../expect/TestOperators.test_clip_max.expect | 2 +- .../expect/TestOperators.test_clip_min.expect | 2 +- .../expect/TestOperators.test_concat2.expect | 2 +- .../expect/TestOperators.test_conv.expect | 2 +- .../TestOperators.test_conv_onnx_irv4.expect | 2 +- .../TestOperators.test_convtranspose.expect | 2 +- .../onnx/expect/TestOperators.test_cos.expect | 2 +- .../expect/TestOperators.test_dict.expect | 2 +- .../expect/TestOperators.test_dict_str.expect | 2 +- .../onnx/expect/TestOperators.test_dim.expect | 2 +- .../expect/TestOperators.test_dropout.expect | 2 +- .../TestOperators.test_dropout_default.expect | 2 +- ...TestOperators.test_dropout_training.expect | 2 +- .../onnx/expect/TestOperators.test_elu.expect | 2 +- .../TestOperators.test_embedding_bags.expect | 2 +- .../TestOperators.test_empty_like.expect | 2 +- .../expect/TestOperators.test_equal.expect | 2 +- .../onnx/expect/TestOperators.test_erf.expect | 2 +- .../onnx/expect/TestOperators.test_exp.expect | 2 +- .../expect/TestOperators.test_expand.expect | 2 +- .../expect/TestOperators.test_flatten.expect | 7 +- .../TestOperators.test_flatten2D.expect | 2 +- .../TestOperators.test_frobenius_norm.expect | 2 +- .../expect/TestOperators.test_full.expect | 2 +- .../TestOperators.test_full_like.expect | 2 +- .../expect/TestOperators.test_gather.expect | 2 +- test/onnx/expect/TestOperators.test_ge.expect | 2 +- .../expect/TestOperators.test_gelu.expect | 2 +- test/onnx/expect/TestOperators.test_gt.expect | 2 +- .../expect/TestOperators.test_hardtanh.expect | 2 +- .../TestOperators.test_implicit_expand.expect | 2 +- .../expect/TestOperators.test_index.expect | 2 +- .../expect/TestOperators.test_isnan.expect | 2 +- .../TestOperators.test_layer_norm_aten.expect | 2 +- test/onnx/expect/TestOperators.test_le.expect | 2 +- .../expect/TestOperators.test_linear.expect | 2 +- .../TestOperators.test_log_sigmoid.expect | 2 +- .../TestOperators.test_logsoftmax.expect | 2 +- test/onnx/expect/TestOperators.test_lt.expect | 2 +- .../onnx/expect/TestOperators.test_max.expect | 2 +- .../expect/TestOperators.test_maxpool.expect | 2 +- .../TestOperators.test_maxpool_indices.expect | 2 +- .../expect/TestOperators.test_mean.expect | 2 +- .../TestOperators.test_mean_dtype.expect | 2 +- .../expect/TestOperators.test_meshgrid.expect | 32 +- 
.../onnx/expect/TestOperators.test_min.expect | 2 +- test/onnx/expect/TestOperators.test_mm.expect | 2 +- .../expect/TestOperators.test_mul_bool.expect | 2 +- .../TestOperators.test_mul_fp_bool.expect | 2 +- .../expect/TestOperators.test_narrow.expect | 14 +- test/onnx/expect/TestOperators.test_ne.expect | 2 +- .../expect/TestOperators.test_nonzero.expect | 2 +- .../expect/TestOperators.test_norm_p1.expect | 2 +- .../expect/TestOperators.test_norm_p2.expect | 2 +- .../TestOperators.test_ones_like.expect | 2 +- .../onnx/expect/TestOperators.test_pad.expect | 12 +- .../expect/TestOperators.test_params.expect | 2 +- ...TestOperators.test_params_onnx_irv4.expect | 2 +- .../expect/TestOperators.test_permute2.expect | 2 +- .../onnx/expect/TestOperators.test_pow.expect | 2 +- .../expect/TestOperators.test_prelu.expect | 2 +- .../expect/TestOperators.test_prod.expect | 2 +- .../TestOperators.test_prod_dtype.expect | 2 +- .../expect/TestOperators.test_rand.expect | 2 +- .../expect/TestOperators.test_randn.expect | 2 +- ...rs.test_reduce_sum_negative_indices.expect | 2 +- .../TestOperators.test_reduced_mean.expect | 2 +- ...stOperators.test_reduced_mean_dtype.expect | 2 +- ...Operators.test_reduced_mean_keepdim.expect | 2 +- .../TestOperators.test_reduced_prod.expect | 2 +- ...stOperators.test_reduced_prod_dtype.expect | 2 +- ...Operators.test_reduced_prod_keepdim.expect | 2 +- .../TestOperators.test_reduced_sum.expect | 2 +- ...estOperators.test_reduced_sum_dtype.expect | 2 +- ...tOperators.test_reduced_sum_keepdim.expect | 2 +- .../TestOperators.test_reducemax.expect | 2 +- .../TestOperators.test_reducemin.expect | 2 +- .../TestOperators.test_remainder.expect | 2 +- .../expect/TestOperators.test_repeat.expect | 2 +- ...tOperators.test_repeat_dim_overflow.expect | 2 +- .../expect/TestOperators.test_rrelu.expect | 2 +- .../expect/TestOperators.test_rsqrt.expect | 2 +- .../expect/TestOperators.test_rsub.expect | 2 +- .../TestOperators.test_scatter_add.expect | 2 +- .../expect/TestOperators.test_selu.expect | 2 +- .../TestOperators.test_shape_value_map.expect | 40 +- .../expect/TestOperators.test_sign.expect | 2 +- .../onnx/expect/TestOperators.test_sin.expect | 2 +- .../expect/TestOperators.test_slice.expect | 2 +- .../expect/TestOperators.test_split.expect | 2 +- ...TestOperators.test_split_with_sizes.expect | 2 +- .../expect/TestOperators.test_sqrt.expect | 2 +- .../onnx/expect/TestOperators.test_std.expect | 2 +- .../onnx/expect/TestOperators.test_sum.expect | 2 +- .../TestOperators.test_sum_dtype.expect | 2 +- .../onnx/expect/TestOperators.test_tan.expect | 2 +- .../TestOperators.test_transpose.expect | 2 +- .../expect/TestOperators.test_type_as.expect | 2 +- .../expect/TestOperators.test_unfold.expect | 2 +- .../TestOperators.test_unsqueeze.expect | 2 +- ...erators.test_upsample_nearest_scale.expect | 2 +- ..._nearest_scale_default_scale_factor.expect | 2 +- ...perators.test_upsample_nearest_size.expect | 2 +- .../expect/TestOperators.test_view.expect | 7 +- .../TestOperators.test_view_flatten.expect | 7 +- .../TestOperators.test_zeros_like.expect | 2 +- test/onnx/internal/test_beartype.py | 86 + test/onnx/internal/test_diagnostics.py | 304 + test/onnx/internal/test_registraion.py | 254 + test/onnx/onnx_test_common.py | 27 +- test/onnx/pytorch_test_common.py | 71 +- .../symbolic_opsets/test_symbolic_opset9.py | 32 - test/onnx/test_autograd_funs.py | 10 +- test/onnx/test_custom_ops.py | 14 +- test/{jit => onnx}/test_export_modes.py | 91 +- test/onnx/test_models.py | 8 +- 
test/onnx/test_models_onnxruntime.py | 5 +- test/onnx/test_onnx_opset.py | 5 +- test/onnx/test_onnxscript_no_runtime.py | 164 + test/onnx/test_onnxscript_runtime.py | 130 + test/onnx/test_operators.py | 29 +- test/onnx/test_pytorch_helper.py | 3 +- test/onnx/test_pytorch_jit_onnx.py | 5 +- test/onnx/test_pytorch_onnx_caffe2.py | 27 +- .../test_pytorch_onnx_caffe2_quantized.py | 7 +- test/onnx/test_pytorch_onnx_no_runtime.py | 465 +- test/onnx/test_pytorch_onnx_onnxruntime.py | 548 +- .../test_pytorch_onnx_onnxruntime_cuda.py | 26 +- .../onnx/test_pytorch_onnx_shape_inference.py | 190 +- test/onnx/test_utility_funs.py | 509 +- test/onnx/test_verification.py | 2 + test/onnx/verify.py | 2 +- test/profiler/profiler_utils_mock_events.json | 1 + test/profiler/test_memory_profiler.py | 1418 + test/{ => profiler}/test_profiler.py | 898 +- test/{ => profiler}/test_profiler_tree.py | 185 +- test/profiler_utils_mock_events.json | 1 - test/quantization/ao_migration/common.py | 22 +- .../ao_migration/test_ao_migration.py | 425 + .../ao_migration/test_quantization.py | 8 +- .../ao_migration/test_quantization_fx.py | 2 +- .../bc/test_backward_compatibility.py | 4 +- test/quantization/core/test_backend_config.py | 77 +- test/quantization/core/test_docs.py | 31 +- .../core/test_quantized_functional.py | 2 +- .../core/test_quantized_module.py | 115 +- test/quantization/core/test_quantized_op.py | 364 +- .../core/test_quantized_tensor.py | 174 +- test/quantization/core/test_top_level_apis.py | 93 + test/quantization/core/test_utils.py | 65 + .../quantization/core/test_workflow_module.py | 56 +- test/quantization/core/test_workflow_ops.py | 8 +- test/quantization/dbr/test_quantize_dbr.py | 1619 - test/quantization/eager/test_fuse_eager.py | 16 +- .../quantization/eager/test_model_numerics.py | 4 +- .../eager/test_numeric_suite_eager.py | 2 +- .../eager/test_quantize_eager_ptq.py | 121 +- .../eager/test_quantize_eager_qat.py | 49 +- test/quantization/fx/test_equalize_fx.py | 2 +- test/quantization/fx/test_model_report_fx.py | 100 +- test/quantization/fx/test_numeric_suite_fx.py | 406 +- test/quantization/fx/test_quantize_fx.py | 904 +- .../jit/test_ondevice_quantization.py | 529 + test/quantization/jit/test_quantize_jit.py | 7 +- test/run_test.py | 445 +- test/scripts/run_cuda_memcheck.py | 2 +- test/test_ao_sparsity.py | 1 + test/test_autocast.py | 61 + test/test_autograd.py | 966 +- test/test_binary_ufuncs.py | 63 +- test/test_comparison_utils.py | 32 + test/test_cpp_extensions_jit.py | 4 +- test/test_cuda.py | 672 +- test/test_cuda_nvml_based_avail.py | 69 + test/test_cuda_sanitizer.py | 505 + test/test_cuda_trace.py | 28 + test/test_dataloader.py | 154 +- test/test_datapipe.py | 502 +- test/test_decomp.py | 141 +- test/test_dispatch.py | 4 + test/test_dlpack.py | 193 + test/test_dynamic_shapes.py | 378 +- test/test_expanded_weights.py | 36 +- test/test_fake_tensor.py | 298 +- test/test_foreach.py | 55 +- test/test_function_schema.py | 20 + test/test_functional_optim.py | 7 +- test/test_functionalization.py | 912 +- test/test_futures.py | 9 + test/test_fx.py | 60 +- test/test_fx_backends.py | 258 - test/test_fx_experimental.py | 19 +- test/test_fx_passes.py | 236 +- test/test_fx_reinplace_pass.py | 207 +- test/test_indexing.py | 32 +- test/test_itt.py | 26 + test/test_jit.py | 161 +- test/test_jit_autocast.py | 30 +- test/test_jit_cuda_fuser.py | 238 +- test/test_jit_fuser_te.py | 52 +- test/test_jit_llga_fuser.py | 522 +- test/test_jiterator.py | 10 +- test/test_linalg.py | 523 +- test/test_masked.py | 16 
+- test/test_maskedtensor.py | 912 + test/test_matmul_cuda.py | 155 + test/test_meta.py | 621 +- test/test_mkldnn.py | 15 +- test/test_mkldnn_fusion.py | 282 +- test/test_model_dump.py | 2 + test/test_module_init.py | 89 +- test/test_modules.py | 14 +- test/test_mps.py | 2738 +- test/test_multiprocessing.py | 5 +- test/test_namedtuple_return_api.py | 7 +- test/test_native_functions.py | 42 +- test/test_native_mha.py | 94 +- test/test_nestedtensor.py | 1513 +- test/test_nn.py | 11087 +---- test/test_nnapi.py | 10 +- test/test_nvfuser_dynamo.py | 148 + test/test_nvfuser_frontend.py | 366 + test/test_ops.py | 358 +- test/test_ops_fwd_gradients.py | 76 + test/test_ops_gradients.py | 183 +- test/test_ops_jit.py | 9 +- test/test_optim.py | 3190 +- test/test_overrides.py | 212 +- test/test_prims.py | 766 +- test/test_proxy_tensor.py | 784 +- test/test_public_bindings.py | 27 +- test/test_python_dispatch.py | 429 +- test/test_pytree.py | 8 +- test/test_quantization.py | 19 +- test/test_reductions.py | 40 +- test/test_scatter_gather_ops.py | 94 +- test/test_schema_check.py | 79 +- test/test_serialization.py | 232 +- test/test_shape_ops.py | 10 + test/test_sparse.py | 223 +- test/test_sparse_csr.py | 1048 +- test/test_spectral_ops.py | 47 +- test/test_stateless.py | 31 + test/test_subclass.py | 25 +- test/test_tensor_creation_ops.py | 217 - test/test_tensorexpr.py | 352 +- test/test_testing.py | 147 +- test/test_torch.py | 859 +- test/test_transformers.py | 608 +- test/test_type_promotion.py | 91 +- test/test_unary_ufuncs.py | 29 +- test/test_utils.py | 126 +- test/test_view_ops.py | 11 +- test/test_xnnpack_integration.py | 3 +- third_party/VulkanMemoryAllocator | 1 + third_party/build_bundled.py | 9 +- third_party/cpuinfo | 2 +- third_party/cpuinfo.BUILD | 55 - third_party/cudnn_frontend | 2 +- third_party/cutlass | 1 + third_party/cutlass.BUILD | 11 + third_party/fbgemm | 2 +- third_party/fmt | 2 +- third_party/gloo | 2 +- third_party/gloo.BUILD | 3 +- third_party/ideep | 2 +- third_party/kineto | 2 +- third_party/mkl-dnn.BUILD | 5 +- third_party/nccl/nccl | 2 +- third_party/pybind11 | 2 +- third_party/xnnpack.buck.bzl | 89 +- tools/BUCK.bzl | 32 +- tools/amd_build/build_amd.py | 3 + tools/autograd/derivatives.yaml | 740 +- tools/autograd/gen_autograd_functions.py | 59 +- tools/autograd/gen_inplace_or_view_type.py | 24 +- tools/autograd/gen_python_functions.py | 104 +- tools/autograd/gen_trace_type.py | 9 +- tools/autograd/gen_variable_factories.py | 60 +- tools/autograd/gen_variable_type.py | 287 +- tools/autograd/load_derivatives.py | 94 +- tools/autograd/templates/Functions.h | 8 +- tools/autograd/templates/VariableType.cpp | 3 +- tools/autograd/templates/VariableType.h | 2 +- tools/autograd/templates/python_functions.cpp | 2 +- .../templates/python_nested_functions.cpp | 81 + .../templates/python_nn_functions.cpp | 2 +- .../templates/python_variable_methods.cpp | 58 +- tools/code_analyzer/gen_oplist.py | 4 +- tools/code_coverage/README.md | 6 +- .../package/tool/summarize_jsons.py | 2 +- tools/cpuinfo_target_definition.bzl | 12 - tools/dynamo/verify_dynamo.py | 156 + tools/gen_vulkan_glsl.py | 111 + tools/gen_vulkan_spv.py | 121 +- tools/generate_torch_version.py | 13 +- tools/jit/gen_unboxing.py | 4 +- tools/linter/adapters/newlines_linter.py | 138 +- tools/linter/adapters/pip_init.py | 6 +- tools/linter/adapters/s3_init_config.json | 8 +- .../linter/clang_tidy/generate_build_files.py | 1 - tools/miniz_target_definition.bzl | 25 - tools/onnx/gen_diagnostics.py | 244 + 
tools/onnx/gen_diagnostics.sh | 16 + tools/onnx/sarif/code-gen-hints.json | 10 + tools/onnx/sarif/gen_sarif.sh | 51 + tools/onnx/templates/rules.h.in | 21 + tools/onnx/templates/rules.py.in | 20 + tools/onnx/update_default_opset_version.py | 150 +- tools/perf_kernel_defs.bzl | 66 - tools/pyi/gen_pyi.py | 27 +- tools/setup_helpers/cmake.py | 1 + tools/setup_helpers/cmake_utils.py | 2 +- tools/sgx_aten_target_definitions.bzl | 261 - tools/sgx_caffe2_target_definitions.bzl | 255 - tools/sgx_target_definitions.bzl | 96 - tools/stats/check_disabled_tests.py | 277 + tools/stats/import_test_stats.py | 39 +- tools/stats/monitor.py | 26 +- tools/stats/print_test_stats.py | 6 +- tools/stats/upload_artifacts.py | 61 + tools/stats/upload_stats_lib.py | 31 + tools/stats/upload_test_stats.py | 57 +- tools/target_definitions.bzl | 571 - tools/test/gen_oplist_test.py | 2 +- tools/test/test_codegen.py | 122 +- tools/test/test_codegen_model.py | 53 +- tools/test/test_gen_backend_stubs.py | 43 +- tools/test/test_vulkan_codegen.py | 100 + tools/testing/test_selections.py | 73 +- tools/update_masked_docs.py | 12 +- torch/CMakeLists.txt | 117 +- torch/_C/_VariableFunctions.pyi.in | 2 +- torch/_C/__init__.pyi.in | 214 +- torch/_C/_autograd.pyi | 101 +- torch/_C/_distributed_c10d.pyi | 81 +- torch/_C/_distributed_rpc.pyi | 5 +- torch/_C/_functorch.pyi | 46 + torch/_C/_itt.pyi | 1 + torch/_C/_lazy.pyi | 10 +- torch/_C/_profiler.pyi | 218 + torch/__init__.py | 238 +- torch/_decomp/__init__.py | 126 +- torch/_decomp/decompositions.py | 1628 +- .../_decomp/decompositions_for_jvp.py | 105 +- torch/_deploy.py | 2 +- .../scheduler => _dispatch}/__init__.py | 0 torch/_dispatch/python.py | 142 + torch/_dynamo/__init__.py | 122 + torch/_dynamo/allowed_functions.py | 272 + torch/_dynamo/bytecode_analysis.py | 197 + torch/_dynamo/bytecode_transformation.py | 388 + torch/_dynamo/codegen.py | 364 + torch/_dynamo/config.py | 182 + torch/_dynamo/convert_frame.py | 499 + torch/_dynamo/debug_utils.py | 944 + torch/_dynamo/eval_frame.py | 754 + torch/_dynamo/exc.py | 72 + torch/_dynamo/guards.py | 847 + torch/_dynamo/logging.py | 88 + torch/_dynamo/mutation_guard.py | 119 + torch/_dynamo/optimizations/__init__.py | 6 + torch/_dynamo/optimizations/analysis.py | 150 + torch/_dynamo/optimizations/backends.py | 830 + torch/_dynamo/optimizations/distributed.py | 277 + torch/_dynamo/optimizations/inference.py | 197 + torch/_dynamo/optimizations/log_args.py | 74 + torch/_dynamo/optimizations/normalize.py | 441 + torch/_dynamo/optimizations/subgraph.py | 236 + .../optimizations/torchxla_integration.py | 189 + torch/_dynamo/optimizations/training.py | 547 + torch/_dynamo/output_graph.py | 629 + torch/_dynamo/profiler.py | 177 + torch/_dynamo/replay_record.py | 118 + torch/_dynamo/resume_execution.py | 304 + torch/_dynamo/side_effects.py | 336 + torch/_dynamo/skipfiles.py | 213 + torch/_dynamo/source.py | 259 + torch/_dynamo/symbolic_convert.py | 1860 + torch/_dynamo/test_case.py | 68 + torch/_dynamo/test_minifier_common.py | 131 + torch/_dynamo/testing.py | 272 + torch/_dynamo/utils.py | 1157 + torch/_dynamo/variables/__init__.py | 89 + torch/_dynamo/variables/base.py | 301 + torch/_dynamo/variables/builder.py | 809 + torch/_dynamo/variables/builtin.py | 857 + torch/_dynamo/variables/constant.py | 158 + torch/_dynamo/variables/dicts.py | 436 + torch/_dynamo/variables/functions.py | 413 + torch/_dynamo/variables/lists.py | 511 + torch/_dynamo/variables/misc.py | 705 + torch/_dynamo/variables/nn_module.py | 574 + 
torch/_dynamo/variables/tensor.py | 593 + torch/_dynamo/variables/torch.py | 751 + torch/_dynamo/variables/user_defined.py | 386 + .../sparsifier => _functorch}/__init__.py | 0 torch/_functorch/pyfunctorch.py | 142 + torch/_functorch/utils.py | 14 + torch/_inductor/__init__.py | 0 torch/_inductor/codecache.py | 612 + torch/_inductor/codegen/__init__.py | 0 torch/_inductor/codegen/autotuner.py | 274 + torch/_inductor/codegen/common.py | 635 + torch/_inductor/codegen/cpp.py | 1561 + torch/_inductor/codegen/cpp_prefix.h | 71 + torch/_inductor/codegen/triton.py | 1481 + .../_inductor/codegen/triton_conv_delta_x.j2 | 181 + .../codegen/triton_conv_delta_x_hwc.j2 | 200 + torch/_inductor/codegen/triton_mm.j2 | 80 + torch/_inductor/codegen/triton_template.py | 351 + torch/_inductor/codegen/wrapper.py | 417 + torch/_inductor/compile_fx.py | 405 + torch/_inductor/config.py | 184 + torch/_inductor/cuda_properties.py | 54 + torch/_inductor/debug.py | 331 + torch/_inductor/decomposition.py | 529 + torch/_inductor/dependencies.py | 288 + torch/_inductor/exc.py | 85 + torch/_inductor/graph.py | 448 + torch/_inductor/ir.py | 4047 ++ torch/_inductor/lowering.py | 3670 ++ torch/_inductor/metrics.py | 17 + torch/_inductor/overrides.py | 1168 + torch/_inductor/scheduler.py | 1129 + torch/_inductor/sizevars.py | 586 + torch/_inductor/triton_ops/__init__.py | 8 + torch/_inductor/triton_ops/autotune.py | 692 + torch/_inductor/triton_ops/batched_matmul.py | 274 + torch/_inductor/triton_ops/conv.py | 744 + torch/_inductor/triton_ops/conv1x1.py | 195 + torch/_inductor/triton_ops/conv_perf_model.py | 165 + torch/_inductor/triton_ops/matmul.py | 136 + torch/_inductor/triton_ops/mm_perf_model.py | 90 + torch/_inductor/triton_ops/utils.py | 31 + torch/_inductor/utils.py | 383 + torch/_inductor/virtualized.py | 140 + torch/_lazy/__init__.py | 19 + torch/_lazy/closure.py | 134 + torch/_lazy/device_context.py | 25 + torch/_lazy/extract_compiled_graph.py | 2 +- torch/_linalg_utils.py | 31 +- torch/_lobpcg.py | 2 +- torch/_meta_registrations.py | 1409 +- torch/_ops.py | 364 +- torch/_prims/__init__.py | 339 +- torch/_prims/context.py | 268 +- torch/_prims/executor.py | 15 +- torch/_prims/nvfuser_executor.py | 378 +- torch/_prims/nvfuser_prims.py | 460 +- torch/_prims_common/__init__.py | 297 +- torch/_prims_common/wrappers.py | 81 +- torch/_python_dispatcher.py | 2 +- torch/_refs/__init__.py | 2583 +- torch/_refs/_conversions.py | 106 + torch/_refs/fft.py | 14 +- torch/_refs/linalg/__init__.py | 23 +- torch/_refs/nn/functional/__init__.py | 655 +- torch/_refs/special/__init__.py | 189 +- torch/_subclasses/__init__.py | 3 + torch/_subclasses/fake_tensor.py | 804 +- torch/_subclasses/fake_utils.py | 140 + torch/_subclasses/meta_utils.py | 349 +- torch/_tensor.py | 178 +- torch/_tensor_docs.py | 172 +- torch/_tensor_str.py | 47 +- torch/_torch_docs.py | 492 +- torch/_utils.py | 123 +- torch/_weights_only_unpickler.py | 291 + torch/amp/autocast_mode.py | 24 +- torch/ao/__init__.py | 16 + torch/ao/nn/__init__.py | 20 +- torch/ao/nn/intrinsic/__init__.py | 32 + torch/ao/nn/intrinsic/modules/__init__.py | 31 + torch/ao/nn/intrinsic/modules/fused.py | 128 + torch/ao/nn/intrinsic/qat/__init__.py | 1 + torch/ao/nn/intrinsic/qat/modules/__init__.py | 31 + .../ao/nn/intrinsic/qat/modules/conv_fused.py | 828 + .../nn/intrinsic/qat/modules/linear_fused.py | 167 + .../nn/intrinsic/qat/modules/linear_relu.py | 48 + torch/ao/nn/intrinsic/quantized/__init__.py | 10 + .../intrinsic/quantized/dynamic/__init__.py | 1 + 
.../quantized/dynamic/modules/__init__.py | 6 + .../quantized/dynamic/modules/linear_relu.py | 51 + .../intrinsic/quantized/modules/__init__.py | 12 + .../nn/intrinsic/quantized/modules/bn_relu.py | 78 + .../intrinsic/quantized/modules/conv_relu.py | 166 + .../quantized/modules/linear_relu.py | 41 + torch/ao/nn/qat/__init__.py | 1 + torch/ao/nn/qat/dynamic/__init__.py | 1 + torch/ao/nn/qat/dynamic/modules/__init__.py | 3 + torch/ao/nn/qat/dynamic/modules/linear.py | 24 + torch/ao/nn/qat/modules/__init__.py | 14 + torch/ao/nn/qat/modules/conv.py | 264 + torch/ao/nn/qat/modules/embedding_ops.py | 143 + torch/ao/nn/qat/modules/linear.py | 77 + torch/ao/nn/quantizable/__init__.py | 1 + torch/ao/nn/quantizable/modules/__init__.py | 9 + torch/ao/nn/quantizable/modules/activation.py | 454 + torch/ao/nn/quantizable/modules/rnn.py | 398 + torch/ao/nn/quantized/__init__.py | 38 + torch/ao/nn/quantized/dynamic/__init__.py | 1 + .../nn/quantized/dynamic/modules/__init__.py | 19 + torch/ao/nn/quantized/dynamic/modules/conv.py | 399 + .../ao/nn/quantized/dynamic/modules/linear.py | 127 + torch/ao/nn/quantized/dynamic/modules/rnn.py | 1054 + torch/ao/nn/quantized/functional.py | 616 + torch/ao/nn/quantized/modules/__init__.py | 136 + torch/ao/nn/quantized/modules/activation.py | 278 + torch/ao/nn/quantized/modules/batchnorm.py | 101 + torch/ao/nn/quantized/modules/conv.py | 937 + torch/ao/nn/quantized/modules/dropout.py | 27 + .../ao/nn/quantized/modules/embedding_ops.py | 295 + .../quantized/modules/functional_modules.py | 233 + torch/ao/nn/quantized/modules/linear.py | 302 + .../ao/nn/quantized/modules/normalization.py | 204 + torch/ao/nn/quantized/modules/rnn.py | 47 + torch/ao/nn/quantized/modules/utils.py | 113 + torch/ao/nn/quantized/reference/__init__.py | 17 + .../quantized/reference/modules/__init__.py | 20 + .../ao/nn/quantized/reference/modules/conv.py | 318 + .../nn/quantized/reference/modules/linear.py | 57 + .../ao/nn/quantized/reference/modules/rnn.py | 479 + .../nn/quantized/reference/modules/sparse.py | 94 + .../nn/quantized/reference/modules/utils.py | 160 + .../ao/nn/sparse/quantized/dynamic/linear.py | 8 +- torch/ao/nn/sparse/quantized/linear.py | 2 +- torch/ao/ns/_numeric_suite.py | 4 +- torch/ao/ns/_numeric_suite_dbr.py | 112 - torch/ao/ns/_numeric_suite_fx.py | 226 +- torch/ao/ns/fx/mappings.py | 14 +- torch/ao/ns/fx/n_shadows_utils.py | 917 + torch/ao/ns/fx/ns_types.py | 6 + torch/ao/ns/fx/qconfig_multi_mapping.py | 242 + torch/ao/ns/fx/utils.py | 2 +- torch/ao/ns/fx/weight_utils.py | 8 +- torch/ao/{sparsity => pruning}/__init__.py | 1 + torch/ao/pruning/_experimental/__init__.py | 0 .../activation_sparsifier/README.md | 2 +- .../activation_sparsifier/__init__.py | 0 .../activation_sparsifier.py | 0 .../_experimental/data_scheduler/README.md | 0 .../_experimental/data_scheduler/__init__.py | 0 .../data_scheduler/base_data_scheduler.py | 2 +- .../_experimental/data_sparsifier/README.md | 2 +- .../_experimental/data_sparsifier/__init__.py | 0 .../data_sparsifier/base_data_sparsifier.py | 0 .../data_sparsifier/benchmarks/README.md | 2 +- .../data_sparsifier/benchmarks/dlrm_utils.py | 0 .../benchmarks/evaluate_disk_savings.py | 2 +- .../benchmarks/evaluate_forward_time.py | 0 .../benchmarks/evaluate_model_metrics.py | 0 .../benchmarks/images/accuracy.png | Bin .../benchmarks/images/disk_savings.png | Bin .../benchmarks/images/forward_time.png | Bin .../data_sparsifier/data_norm_sparsifier.py | 0 .../data_sparsifier/lightning/__init__.py | 0 .../lightning/callbacks/README.md | 0 
.../lightning/callbacks/__init__.py | 0 .../callbacks/_data_sparstity_utils.py | 2 +- .../lightning/callbacks/data_sparsity.py | 0 .../lightning/tests/test_callbacks.py | 10 +- .../data_sparsifier/quantization_utils.py | 2 +- .../_experimental/pruner/README.md | 0 .../_experimental/pruner/__init__.py | 0 .../_experimental/pruner/base_pruner.py | 4 +- .../_experimental/pruner/images/prune_1.png | Bin .../_experimental/pruner/images/prune_2.png | Bin .../_experimental/pruner/images/prune_3.png | Bin .../_experimental/pruner/images/prune_4.png | Bin .../_experimental/pruner/parametrization.py | 0 torch/ao/{sparsity => pruning}/_mappings.py | 8 +- torch/ao/pruning/scheduler/__init__.py | 0 .../scheduler/base_scheduler.py | 16 +- torch/ao/pruning/scheduler/cubic_scheduler.py | 108 + .../scheduler/lambda_scheduler.py | 0 torch/ao/pruning/sparsifier/__init__.py | 0 .../sparsifier/base_sparsifier.py | 4 +- .../sparsifier/nearly_diagonal_sparsifier.py | 0 .../{sparsity => pruning}/sparsifier/utils.py | 2 +- .../sparsifier/weight_norm_sparsifier.py | 28 +- torch/ao/quantization/__init__.py | 125 +- torch/ao/quantization/_correct_bias.py | 2 +- torch/ao/quantization/_dbr/README.md | 259 - torch/ao/quantization/_dbr/auto_trace.py | 723 - .../quantization/_dbr/auto_trace_rewriter.py | 247 - torch/ao/quantization/_dbr/function_fusion.py | 101 - torch/ao/quantization/_dbr/fusion.py | 56 - torch/ao/quantization/_dbr/mappings.py | 178 - torch/ao/quantization/_dbr/model_utils.py | 163 - .../ao/quantization/_dbr/module_swap_utils.py | 79 - .../_dbr/qconfig_mapping_utils.py | 25 - .../quantization/_dbr/quantization_state.py | 986 - .../ao/quantization/_dbr/torchscript_utils.py | 15 - torch/ao/quantization/_dbr/utils.py | 751 - torch/ao/quantization/_quantize_dbr.py | 144 - .../ao/quantization/backend_config/README.md | 146 +- .../quantization/backend_config/__init__.py | 13 +- .../_common_operator_config_utils.py | 252 +- .../backend_config/backend_config.py | 236 +- .../quantization/backend_config/executorch.py | 226 + .../ao/quantization/backend_config/fbgemm.py | 114 + .../ao/quantization/backend_config/native.py | 242 +- .../backend_config/observation_type.py | 13 - .../ao/quantization/backend_config/qnnpack.py | 161 + .../quantization/backend_config/tensorrt.py | 8 +- torch/ao/quantization/backend_config/utils.py | 31 +- torch/ao/quantization/backend_config/x86.py | 111 + torch/ao/quantization/fake_quantize.py | 50 +- torch/ao/quantization/fuse_modules.py | 10 +- .../ao/quantization/fuser_method_mappings.py | 82 +- torch/ao/quantization/fx/README.md | 380 + torch/ao/quantization/fx/_decomposed.py | 309 + torch/ao/quantization/fx/_equalize.py | 36 +- .../fx/_lower_to_native_backend.py | 80 +- .../quantization/fx/_model_report/README.md | 20 +- .../quantization/fx/_model_report/detector.py | 207 +- .../fx/_model_report/model_report.py | 166 +- .../quantization/fx/backend_config_utils.py | 50 +- .../fx/common_quantization_patterns.py | 8 - torch/ao/quantization/fx/convert.py | 731 +- torch/ao/quantization/fx/custom_config.py | 94 +- torch/ao/quantization/fx/fuse.py | 2 +- torch/ao/quantization/fx/fusion_patterns.py | 5 +- torch/ao/quantization/fx/graph_module.py | 13 +- torch/ao/quantization/fx/match_utils.py | 5 +- torch/ao/quantization/fx/pattern_utils.py | 19 +- torch/ao/quantization/fx/prepare.py | 636 +- ...nfig_utils.py => qconfig_mapping_utils.py} | 57 +- .../quantization/fx/quantization_patterns.py | 46 +- torch/ao/quantization/fx/tracer.py | 2 +- torch/ao/quantization/fx/utils.py | 587 +- 
torch/ao/quantization/observer.py | 106 +- torch/ao/quantization/qconfig.py | 69 +- torch/ao/quantization/qconfig_mapping.py | 98 +- .../ao/quantization/qconfig_mapping_utils.py | 31 +- torch/ao/quantization/quant_type.py | 3 +- .../ao/quantization/quantization_mappings.py | 47 +- torch/ao/quantization/quantization_types.py | 18 - torch/ao/quantization/quantize.py | 46 +- torch/ao/quantization/quantize_fx.py | 176 +- torch/ao/quantization/quantize_jit.py | 122 + torch/ao/quantization/utils.py | 149 +- torch/autograd/__init__.py | 42 +- torch/autograd/anomaly_mode.py | 20 +- torch/autograd/forward_ad.py | 23 + torch/autograd/function.py | 3 + torch/autograd/functional.py | 2 + torch/autograd/grad_mode.py | 43 +- torch/autograd/gradcheck.py | 4 +- torch/autograd/graph.py | 300 +- torch/autograd/profiler.py | 67 +- torch/autograd/profiler_legacy.py | 21 +- torch/autograd/profiler_util.py | 32 +- torch/autograd/variable.py | 1 + torch/backends/_coreml/preprocess.py | 12 +- torch/backends/cuda/__init__.py | 96 + torch/backends/cudnn/__init__.py | 2 + torch/backends/opt_einsum/__init__.py | 99 + torch/backends/quantized/__init__.py | 4 +- torch/backends/xeon/run_cpu.py | 18 +- torch/cpu/amp/autocast_mode.py | 2 + torch/csrc/CudaIPCTypes.cpp | 2 +- torch/csrc/DynamicTypes.cpp | 11 +- torch/csrc/Exceptions.cpp | 59 +- torch/csrc/Exceptions.h | 35 +- torch/csrc/Module.cpp | 352 +- torch/csrc/Size.cpp | 6 +- torch/csrc/Storage.cpp | 2 + torch/csrc/StorageMethods.cpp | 2 +- torch/csrc/StorageSharing.cpp | 12 +- torch/csrc/TypeInfo.cpp | 9 +- torch/csrc/api/include/torch/all.h | 1 + torch/csrc/api/include/torch/nested.h | 95 + .../api/include/torch/nn/functional/padding.h | 2 +- .../include/torch/nn/functional/upsampling.h | 16 +- torch/csrc/api/include/torch/nn/pimpl.h | 10 +- torch/csrc/api/src/nn/modules/transformer.cpp | 4 +- torch/csrc/api/src/optim/optimizer.cpp | 4 +- torch/csrc/autograd/FunctionsManual.cpp | 937 +- torch/csrc/autograd/FunctionsManual.h | 132 +- torch/csrc/autograd/TraceTypeManual.cpp | 4 +- torch/csrc/autograd/VariableTypeManual.cpp | 18 +- torch/csrc/autograd/VariableTypeUtils.h | 104 +- torch/csrc/autograd/anomaly_mode.cpp | 8 +- torch/csrc/autograd/anomaly_mode.h | 12 +- torch/csrc/autograd/autograd_meta.cpp | 10 +- .../autograd_not_implemented_fallback.cpp | 3 +- torch/csrc/autograd/custom_function.h | 8 +- torch/csrc/autograd/engine.cpp | 121 +- torch/csrc/autograd/function.h | 3 + torch/csrc/autograd/functions/tensor.cpp | 8 +- torch/csrc/autograd/functions/utils.h | 18 + torch/csrc/autograd/graph_task.h | 18 +- torch/csrc/autograd/init.cpp | 427 +- torch/csrc/autograd/input_metadata.h | 4 +- torch/csrc/autograd/jit_decomp_interface.cpp | 21 + torch/csrc/autograd/jit_decomp_interface.h | 54 + torch/csrc/autograd/profiler_kineto.cpp | 392 +- torch/csrc/autograd/profiler_kineto.h | 11 +- torch/csrc/autograd/profiler_legacy.cpp | 39 +- torch/csrc/autograd/profiler_legacy.h | 1 + torch/csrc/autograd/profiler_python.cpp | 374 +- torch/csrc/autograd/python_anomaly_mode.cpp | 2 +- torch/csrc/autograd/python_cpp_function.cpp | 21 +- torch/csrc/autograd/python_cpp_function.h | 2 + torch/csrc/autograd/python_engine.cpp | 4 +- torch/csrc/autograd/python_nested_functions.h | 11 + .../python_nested_functions_manual.cpp | 44 + .../autograd/python_saved_variable_hooks.cpp | 2 +- .../python_torch_functions_manual.cpp | 41 +- torch/csrc/autograd/python_variable.cpp | 694 +- torch/csrc/autograd/python_variable.h | 1 + .../autograd/python_variable_indexing.cpp | 34 +- 
torch/csrc/autograd/saved_variable.cpp | 21 +- .../autograd/utils/grad_layout_contract.h | 16 +- torch/csrc/autograd/utils/warnings.cpp | 11 +- torch/csrc/autograd/utils/warnings.h | 13 +- torch/csrc/autograd/variable.cpp | 77 +- torch/csrc/autograd/variable.h | 56 +- torch/csrc/cuda/CUDAPluggableAllocator.cpp | 317 + torch/csrc/cuda/CUDAPluggableAllocator.h | 135 + torch/csrc/cuda/Graph.cpp | 10 +- torch/csrc/cuda/Module.cpp | 339 +- torch/csrc/cuda/Tensor.cpp | 2 + torch/csrc/cuda/comm.cpp | 16 +- torch/csrc/cuda/memory_snapshot.cpp | 167 + torch/csrc/cuda/memory_snapshot.h | 17 + torch/csrc/cuda/nccl.cpp | 32 +- torch/csrc/cuda/nccl.h | 9 +- torch/csrc/cuda/shared/cudart.cpp | 11 +- torch/csrc/deploy/.gitignore | 1 - torch/csrc/deploy/CMakeLists.txt | 83 - torch/csrc/deploy/Exception.h | 47 - torch/csrc/deploy/README.md | 29 +- torch/csrc/deploy/benchmark.cpp | 336 - torch/csrc/deploy/deploy.cpp | 366 - torch/csrc/deploy/deploy.h | 302 - torch/csrc/deploy/elf_file.cpp | 56 - torch/csrc/deploy/elf_file.h | 66 - torch/csrc/deploy/environment.h | 69 - torch/csrc/deploy/example/benchmark.cpp | 336 - torch/csrc/deploy/example/examples.py | 268 - torch/csrc/deploy/example/fx/examples.py | 16 - .../csrc/deploy/example/fx/some_dependency.py | 4 - .../csrc/deploy/example/generate_examples.py | 96 - torch/csrc/deploy/example/gpu_wrapper.py | 66 - torch/csrc/deploy/example/simple.pt | Bin 2432 -> 0 bytes torch/csrc/deploy/example/tensorrt_example.py | 63 - .../interactive_embedded_interpreter.cpp | 37 - torch/csrc/deploy/interpreter/CMakeLists.txt | 117 - .../deploy/interpreter/CMakePythonModules.txt | 69 - torch/csrc/deploy/interpreter/Optional.hpp | 1107 - .../deploy/interpreter/builtin_registry.cpp | 284 - .../deploy/interpreter/builtin_registry.h | 130 - .../deploy/interpreter/configure_cpython.sh | 6 - .../deploy/interpreter/cpython_patch.diff | 14 - torch/csrc/deploy/interpreter/defs.bzl | 117 - .../deploy/interpreter/hide_symbols.script | 4 - .../interpreter/import_find_sharedfuncptr.cpp | 45 - .../deploy/interpreter/interpreter_impl.cpp | 413 - .../deploy/interpreter/interpreter_impl.h | 185 - .../interpreter/register_frozenpython.cpp | 82 - .../deploy/interpreter/register_numpy.cpp | 51 - .../deploy/interpreter/register_pyyaml.cpp | 6 - .../interpreter/test_builtin_registry.cpp | 58 - .../deploy/interpreter/third_party/README.md | 2 - torch/csrc/deploy/loader.cpp | 1255 - torch/csrc/deploy/loader.h | 52 - torch/csrc/deploy/mem_file.h | 67 - torch/csrc/deploy/noop_environment.h | 14 - torch/csrc/deploy/path_environment.cpp | 13 - torch/csrc/deploy/path_environment.h | 19 - torch/csrc/deploy/remove_dt_needed.cpp | 82 - torch/csrc/deploy/test_deploy.cpp | 537 - torch/csrc/deploy/test_deploy_from_python.py | 7 - torch/csrc/deploy/test_deploy_gpu.cpp | 120 - torch/csrc/deploy/test_deploy_lib.cpp | 98 - .../test_deploy_missing_interpreter.cpp | 14 - torch/csrc/deploy/test_deploy_python.py | 26 - torch/csrc/deploy/test_deploy_python_ext.cpp | 25 - torch/csrc/deploy/unity/example.py | 10 - torch/csrc/deploy/unity/main.cpp | 35 - torch/csrc/deploy/unity/tests/simple_model.py | 15 - torch/csrc/deploy/unity/tests/sum.py | 5 - torch/csrc/deploy/unity/tests/test_unity.h | 5 - .../unity/tests/test_unity_simple_model.cpp | 40 - .../deploy/unity/tests/test_unity_sum.cpp | 31 - torch/csrc/deploy/unity/unity.bzl | 46 - torch/csrc/deploy/unity/xar_environment.cpp | 158 - torch/csrc/deploy/unity/xar_environment.h | 31 - .../autograd/engine/dist_engine.cpp | 8 + torch/csrc/distributed/c10d/Backend.cpp | 17 + 
torch/csrc/distributed/c10d/Backend.hpp | 277 + torch/csrc/distributed/c10d/FileStore.cpp | 12 +- torch/csrc/distributed/c10d/FileStore.hpp | 3 +- .../distributed/c10d/GlooDeviceFactory.cpp | 2 +- torch/csrc/distributed/c10d/HashStore.cpp | 2 +- torch/csrc/distributed/c10d/HashStore.hpp | 2 +- torch/csrc/distributed/c10d/NCCLUtils.cpp | 53 +- torch/csrc/distributed/c10d/NCCLUtils.hpp | 79 +- torch/csrc/distributed/c10d/Ops.cpp | 382 +- torch/csrc/distributed/c10d/Ops.hpp | 52 +- torch/csrc/distributed/c10d/OpsImpl.cpp | 552 + .../csrc/distributed/c10d/ParamCommsUtils.cpp | 2 +- .../csrc/distributed/c10d/ParamCommsUtils.hpp | 62 +- torch/csrc/distributed/c10d/PrefixStore.cpp | 6 +- torch/csrc/distributed/c10d/PrefixStore.hpp | 6 +- torch/csrc/distributed/c10d/ProcessGroup.cpp | 177 +- torch/csrc/distributed/c10d/ProcessGroup.hpp | 208 +- .../distributed/c10d/ProcessGroupGloo.cpp | 177 +- .../distributed/c10d/ProcessGroupGloo.hpp | 44 +- .../csrc/distributed/c10d/ProcessGroupMPI.cpp | 53 +- .../csrc/distributed/c10d/ProcessGroupMPI.hpp | 46 +- .../distributed/c10d/ProcessGroupNCCL.cpp | 740 +- .../distributed/c10d/ProcessGroupNCCL.hpp | 102 +- .../c10d/ProcessGroupRoundRobin.cpp | 47 +- .../c10d/ProcessGroupRoundRobin.hpp | 32 +- .../csrc/distributed/c10d/ProcessGroupUCC.cpp | 435 +- .../csrc/distributed/c10d/ProcessGroupUCC.hpp | 138 +- .../distributed/c10d/ProcessGroupWrapper.cpp | 47 +- .../distributed/c10d/ProcessGroupWrapper.hpp | 42 +- .../csrc/distributed/c10d/PyProcessGroup.hpp | 36 +- torch/csrc/distributed/c10d/Store.cpp | 22 +- torch/csrc/distributed/c10d/Store.hpp | 9 + torch/csrc/distributed/c10d/TCPStore.cpp | 8 +- torch/csrc/distributed/c10d/TCPStore.hpp | 2 +- torch/csrc/distributed/c10d/TraceUtils.h | 4 +- torch/csrc/distributed/c10d/Types.hpp | 119 +- torch/csrc/distributed/c10d/UCCTracing.cpp | 20 +- torch/csrc/distributed/c10d/UCCTracing.hpp | 2 +- torch/csrc/distributed/c10d/UCCUtils.cpp | 80 +- torch/csrc/distributed/c10d/UCCUtils.hpp | 97 +- torch/csrc/distributed/c10d/UnixSockUtils.hpp | 2 +- torch/csrc/distributed/c10d/Utils.cpp | 2 +- torch/csrc/distributed/c10d/Utils.hpp | 31 +- torch/csrc/distributed/c10d/WinSockUtils.hpp | 2 +- torch/csrc/distributed/c10d/Work.cpp | 182 + torch/csrc/distributed/c10d/Work.hpp | 138 + torch/csrc/distributed/c10d/comm.cpp | 14 +- torch/csrc/distributed/c10d/comm.hpp | 2 +- torch/csrc/distributed/c10d/debug.cpp | 6 +- torch/csrc/distributed/c10d/debug.h | 2 +- .../distributed/c10d/default_comm_hooks.cpp | 6 +- .../distributed/c10d/default_comm_hooks.hpp | 4 +- torch/csrc/distributed/c10d/exception.cpp | 2 +- torch/csrc/distributed/c10d/init.cpp | 318 +- torch/csrc/distributed/c10d/logger.cpp | 8 +- torch/csrc/distributed/c10d/logger.hpp | 2 +- torch/csrc/distributed/c10d/logging.cpp | 4 +- .../distributed/c10d/python_comm_hook.cpp | 2 +- .../csrc/distributed/c10d/python_comm_hook.h | 4 +- .../c10d/quantization/quantization_gpu.cu | 2 +- torch/csrc/distributed/c10d/reducer.cpp | 12 +- torch/csrc/distributed/c10d/reducer.hpp | 20 +- torch/csrc/distributed/c10d/reducer_cuda.cpp | 2 +- torch/csrc/distributed/c10d/sequence_num.cpp | 2 +- torch/csrc/distributed/c10d/socket.cpp | 8 +- torch/csrc/distributed/c10d/socket.h | 2 +- torch/csrc/distributed/rpc/agent_utils.h | 2 +- torch/csrc/distributed/rpc/py_rref.cpp | 4 +- torch/csrc/distributed/rpc/rref_context.cpp | 14 +- torch/csrc/distributed/rpc/rref_context.h | 3 + .../csrc/distributed/rpc/tensorpipe_agent.cpp | 23 +- torch/csrc/distributed/rpc/tensorpipe_agent.h | 4 +- 
torch/csrc/distributed/rpc/utils.cpp | 2 +- torch/csrc/dl.c | 32 - torch/csrc/dynamo/eval_frame.c | 606 + torch/csrc/dynamo/eval_frame.h | 6 + torch/csrc/dynamo/guards.cpp | 422 + torch/csrc/dynamo/guards.h | 4 + torch/csrc/dynamo/init.cpp | 32 + torch/csrc/dynamo/init.h | 14 + torch/csrc/functorch/init.cpp | 509 + torch/csrc/functorch/init.h | 12 + torch/csrc/itt.cpp | 1 + torch/csrc/itt_wrapper.cpp | 5 + torch/csrc/itt_wrapper.h | 1 + torch/csrc/jit/OVERVIEW.md | 6 +- .../backends/coreml/objc/PTMCoreMLBackend.mm | 85 +- .../backends/coreml/objc/PTMCoreMLCompiler.h | 4 +- .../backends/coreml/objc/PTMCoreMLCompiler.mm | 143 +- .../backends/coreml/objc/PTMCoreMLExecutor.h | 2 +- .../backends/coreml/objc/PTMCoreMLExecutor.mm | 11 +- .../coreml/objc/PTMCoreMLModelWrapper.h | 9 - .../coreml/observer/PTMCoreMLObserver.h | 47 - .../coreml/observer/PTMCoreMLObserver.mm | 8 - .../xnnpack/compiler/xnn_compiler.cpp | 118 + .../backends/xnnpack/compiler/xnn_compiler.h | 27 + .../backends/xnnpack/executor/xnn_executor.h | 70 + .../backends/xnnpack/serialization/schema.fbs | 97 + .../xnnpack/serialization/serializer.cpp | 102 + .../xnnpack/serialization/serializer.h | 86 + .../backends/xnnpack/xnnpack_backend_lib.cpp | 119 + .../xnnpack/xnnpack_backend_preprocess.cpp | 132 + .../xnnpack/xnnpack_graph_builder.cpp | 324 + .../backends/xnnpack/xnnpack_graph_builder.h | 93 + torch/csrc/jit/codegen/cuda/README.md | 4 +- torch/csrc/jit/codegen/cuda/arith.cpp | 218 +- torch/csrc/jit/codegen/cuda/arith.h | 49 +- torch/csrc/jit/codegen/cuda/codegen.cpp | 288 +- torch/csrc/jit/codegen/cuda/compute_at.cpp | 21 +- torch/csrc/jit/codegen/cuda/compute_at.h | 2 +- .../csrc/jit/codegen/cuda/compute_at_map.cpp | 371 +- torch/csrc/jit/codegen/cuda/compute_at_map.h | 27 +- torch/csrc/jit/codegen/cuda/contiguity.cpp | 654 +- torch/csrc/jit/codegen/cuda/contiguity.h | 198 +- torch/csrc/jit/codegen/cuda/disjoint_set.h | 17 +- torch/csrc/jit/codegen/cuda/dispatch.cpp | 90 + torch/csrc/jit/codegen/cuda/dispatch.h | 24 + torch/csrc/jit/codegen/cuda/dynamic_type.h | 312 + .../jit/codegen/cuda/evaluator_common.cpp | 234 +- .../csrc/jit/codegen/cuda/evaluator_common.h | 102 +- torch/csrc/jit/codegen/cuda/executor.cpp | 497 +- torch/csrc/jit/codegen/cuda/executor.h | 57 +- .../jit/codegen/cuda/executor_kernel_arg.cpp | 35 + .../jit/codegen/cuda/executor_kernel_arg.h | 260 +- .../csrc/jit/codegen/cuda/executor_utils.cpp | 462 +- torch/csrc/jit/codegen/cuda/executor_utils.h | 9 +- .../csrc/jit/codegen/cuda/expr_evaluator.cpp | 114 +- torch/csrc/jit/codegen/cuda/expr_evaluator.h | 28 +- torch/csrc/jit/codegen/cuda/fusion.cpp | 43 +- torch/csrc/jit/codegen/cuda/fusion.h | 12 +- .../jit/codegen/cuda/fusion_segmenter.cpp | 60 +- .../csrc/jit/codegen/cuda/fusion_segmenter.h | 14 +- torch/csrc/jit/codegen/cuda/graph_fuser.cpp | 36 +- .../jit/codegen/cuda/grouped_reduction.cpp | 18 +- .../csrc/jit/codegen/cuda/grouped_reduction.h | 4 + torch/csrc/jit/codegen/cuda/index_compute.cpp | 400 +- torch/csrc/jit/codegen/cuda/index_compute.h | 112 +- .../codegen/cuda/index_reference_replay.cpp | 625 - .../jit/codegen/cuda/index_reference_replay.h | 132 - .../jit/codegen/cuda/inline_propagator.cpp | 385 - .../csrc/jit/codegen/cuda/inline_propagator.h | 118 - torch/csrc/jit/codegen/cuda/inlining.cpp | 306 + torch/csrc/jit/codegen/cuda/inlining.h | 100 + .../csrc/jit/codegen/cuda/instrumentation.cpp | 2 +- torch/csrc/jit/codegen/cuda/instrumentation.h | 4 +- torch/csrc/jit/codegen/cuda/interface.cpp | 185 +- torch/csrc/jit/codegen/cuda/interface.h | 8 
+ torch/csrc/jit/codegen/cuda/ir_base_nodes.cpp | 60 +- torch/csrc/jit/codegen/cuda/ir_base_nodes.h | 37 +- torch/csrc/jit/codegen/cuda/ir_builder.cpp | 4 + torch/csrc/jit/codegen/cuda/ir_cloner.cpp | 16 + torch/csrc/jit/codegen/cuda/ir_cloner.h | 4 + torch/csrc/jit/codegen/cuda/ir_graphviz.cpp | 38 + torch/csrc/jit/codegen/cuda/ir_graphviz.h | 4 + .../jit/codegen/cuda/ir_interface_nodes.h | 29 +- .../csrc/jit/codegen/cuda/ir_internal_nodes.h | 564 +- torch/csrc/jit/codegen/cuda/ir_iostream.cpp | 355 +- torch/csrc/jit/codegen/cuda/ir_iostream.h | 6 + torch/csrc/jit/codegen/cuda/ir_nodes.cpp | 968 +- torch/csrc/jit/codegen/cuda/ir_utils.cpp | 135 +- torch/csrc/jit/codegen/cuda/ir_utils.h | 11 +- torch/csrc/jit/codegen/cuda/iter_visitor.cpp | 149 +- torch/csrc/jit/codegen/cuda/iter_visitor.h | 99 +- torch/csrc/jit/codegen/cuda/kernel.cpp | 17 +- torch/csrc/jit/codegen/cuda/kernel.h | 2 +- torch/csrc/jit/codegen/cuda/kernel_cache.cpp | 417 +- torch/csrc/jit/codegen/cuda/kernel_cache.h | 90 +- .../codegen/cuda/kernel_expr_evaluator.cpp | 79 +- .../jit/codegen/cuda/kernel_expr_evaluator.h | 14 +- torch/csrc/jit/codegen/cuda/kernel_ir.cpp | 240 +- torch/csrc/jit/codegen/cuda/kernel_ir.h | 127 +- torch/csrc/jit/codegen/cuda/lower2device.cpp | 25 +- torch/csrc/jit/codegen/cuda/lower2device.h | 27 +- .../jit/codegen/cuda/lower_alias_memory.cpp | 22 +- .../jit/codegen/cuda/lower_allocation.cpp | 15 +- .../jit/codegen/cuda/lower_bank_conflict.cpp | 332 + .../jit/codegen/cuda/lower_bank_conflict.h | 46 + .../codegen/cuda/lower_divisible_split.cpp | 121 + .../jit/codegen/cuda/lower_divisible_split.h | 29 + .../csrc/jit/codegen/cuda/lower_expr_sort.cpp | 11 +- .../codegen/cuda/lower_fused_reduction.cpp | 12 +- torch/csrc/jit/codegen/cuda/lower_index.cpp | 307 +- torch/csrc/jit/codegen/cuda/lower_index.h | 22 + .../jit/codegen/cuda/lower_index_compute.cpp | 199 +- .../jit/codegen/cuda/lower_index_compute.h | 12 + .../jit/codegen/cuda/lower_insert_syncs.cpp | 4 +- .../jit/codegen/cuda/lower_instrument.cpp | 2 +- torch/csrc/jit/codegen/cuda/lower_loops.cpp | 4 +- .../cuda/lower_misaligned_vectorization.cpp | 2 +- .../csrc/jit/codegen/cuda/lower_predicate.cpp | 69 +- .../cuda/lower_predicate_elimination.cpp | 129 +- torch/csrc/jit/codegen/cuda/lower_shift.cpp | 136 +- torch/csrc/jit/codegen/cuda/lower_shift.h | 35 +- .../codegen/cuda/lower_sync_information.cpp | 50 +- .../codegen/cuda/lower_thread_predicate.cpp | 7 +- .../codegen/cuda/lower_trivial_broadcast.cpp | 6 +- .../codegen/cuda/lower_trivial_broadcast.h | 3 +- torch/csrc/jit/codegen/cuda/lower_unroll.cpp | 39 +- torch/csrc/jit/codegen/cuda/lower_unroll.h | 2 + torch/csrc/jit/codegen/cuda/lower_utils.cpp | 386 +- torch/csrc/jit/codegen/cuda/lower_utils.h | 116 +- .../jit/codegen/cuda/lower_validation.cpp | 49 +- .../jit/codegen/cuda/lower_warp_reduce.cpp | 6 +- torch/csrc/jit/codegen/cuda/manager.cpp | 12 +- torch/csrc/jit/codegen/cuda/mutator.cpp | 121 +- .../jit/codegen/cuda/non_divisible_split.cpp | 13 +- torch/csrc/jit/codegen/cuda/nvfuser.cmake | 17 +- torch/csrc/jit/codegen/cuda/ops/alias.cpp | 90 +- torch/csrc/jit/codegen/cuda/ops/composite.cpp | 2 +- .../jit/codegen/cuda/ops/normalization.cpp | 70 +- .../csrc/jit/codegen/cuda/ops/normalization.h | 11 + .../codegen/cuda/parallel_dimension_map.cpp | 2 +- .../jit/codegen/cuda/parallel_type_bitmap.cpp | 2 + torch/csrc/jit/codegen/cuda/parser.cpp | 180 +- torch/csrc/jit/codegen/cuda/partition.cpp | 2 +- .../codegen/cuda/python_frontend/README.md | 138 + .../examples/double_half_cast.py | 30 - 
.../examples/half_double_cast.py | 28 - .../examples/python_example.py | 36 - .../python_example_broadcast_in_dim.py | 94 - .../examples/python_example_fp16.py | 35 - .../cuda/python_frontend/fusion_cache.cpp | 155 + .../cuda/python_frontend/fusion_cache.h | 111 + .../python_frontend/fusion_definition.cpp | 179 +- .../cuda/python_frontend/fusion_definition.h | 114 +- .../cuda/python_frontend/fusion_interface.cpp | 65 + .../cuda/python_frontend/fusion_interface.h | 72 + .../cuda/python_frontend/fusion_owner.h | 36 - .../cuda/python_frontend/fusion_record.h | 1463 +- .../cuda/python_frontend/python_bindings.cpp | 1704 +- .../test/test_nvfuser_fusion_cache.cpp | 266 + .../test/test_nvfuser_fusion_definition.cpp | 196 + .../test/test_nvfuser_fusion_record.cpp | 136 + .../csrc/jit/codegen/cuda/reference_tensor.h | 27 - .../jit/codegen/cuda/register_interface.cpp | 1 + .../csrc/jit/codegen/cuda/root_domain_map.cpp | 208 +- torch/csrc/jit/codegen/cuda/root_domain_map.h | 17 +- .../jit/codegen/cuda/runtime/array_rocm.cu | 236 + .../codegen/cuda/runtime/bf16_support_rocm.cu | 39 + .../cuda/runtime/block_sync_default_rocm.cu | 12 + .../codegen/cuda/runtime/fused_reduction.cu | 1855 +- .../cuda/runtime/fused_welford_helper.cu | 93 + .../cuda/runtime/fused_welford_impl.cu | 623 + .../csrc/jit/codegen/cuda/runtime/helpers.cu | 25 +- torch/csrc/jit/codegen/cuda/runtime/memory.cu | 25 + .../codegen/cuda/runtime/random_numbers.cu | 172 +- torch/csrc/jit/codegen/cuda/runtime/tuple.cu | 173 + .../jit/codegen/cuda/runtime/warp_rocm.cu | 76 + .../codegen/cuda/scheduler/all_schedulers.h | 5 +- .../cuda/scheduler/compile_time_info.h | 56 +- .../jit/codegen/cuda/scheduler/heuristic.h | 3 +- .../jit/codegen/cuda/scheduler/mma_utils.cpp | 10 +- .../codegen/cuda/scheduler/normalization.cpp | 8 +- .../jit/codegen/cuda/scheduler/pointwise.cpp | 278 +- .../jit/codegen/cuda/scheduler/pointwise.h | 138 + .../cuda/scheduler/pointwise_utils.cpp | 46 +- .../codegen/cuda/scheduler/pointwise_utils.h | 20 +- .../jit/codegen/cuda/scheduler/reduction.cpp | 8 +- .../cuda/scheduler/reduction_utils.cpp | 13 +- .../jit/codegen/cuda/scheduler/registry.cpp | 487 +- .../jit/codegen/cuda/scheduler/registry.h | 30 +- .../jit/codegen/cuda/scheduler/transpose.cpp | 1140 + .../jit/codegen/cuda/scheduler/transpose.h | 115 + .../cuda/scheduler/transpose_heuristic.h | 163 + .../csrc/jit/codegen/cuda/scheduler/utils.cpp | 962 +- torch/csrc/jit/codegen/cuda/scheduler/utils.h | 178 +- .../cuda/scheduler/vectorize_helper.cpp | 286 + .../codegen/cuda/scheduler/vectorize_helper.h | 14 +- torch/csrc/jit/codegen/cuda/tensor_view.cpp | 141 +- torch/csrc/jit/codegen/cuda/test/test_gpu.cpp | 25499 ---------- .../csrc/jit/codegen/cuda/test/test_gpu1.cpp | 9985 ++++ .../csrc/jit/codegen/cuda/test/test_gpu2.cpp | 9801 ++++ .../csrc/jit/codegen/cuda/test/test_gpu3.cpp | 6538 +++ .../cuda/test/test_gpu_fused_reduction.cpp | 312 + .../jit/codegen/cuda/test/test_gpu_rng.cu | 399 + .../jit/codegen/cuda/test/test_gpu_shift.cpp | 67 + .../cuda/test/test_gpu_tensor_factories.cpp | 339 + .../codegen/cuda/test/test_gpu_transpose.cpp | 1260 + .../jit/codegen/cuda/test/test_gpu_utils.cpp | 273 + .../codegen/cuda/test/test_gpu_validator.h | 61 +- .../jit/codegen/cuda/test/test_gpu_view.cpp | 1108 +- torch/csrc/jit/codegen/cuda/test/test_utils.h | 310 +- .../jit/codegen/cuda/tools/stringify_file.py | 10 +- .../csrc/jit/codegen/cuda/transform_iter.cpp | 14 +- .../jit/codegen/cuda/transform_rfactor.cpp | 2 +- .../csrc/jit/codegen/cuda/transform_view.cpp | 999 +- 
torch/csrc/jit/codegen/cuda/transform_view.h | 62 +- torch/csrc/jit/codegen/cuda/type.cpp | 64 +- torch/csrc/jit/codegen/cuda/type.h | 17 +- .../csrc/jit/codegen/cuda/type_inference.cpp | 11 +- torch/csrc/jit/codegen/cuda/utils.cpp | 129 +- torch/csrc/jit/codegen/cuda/utils.h | 26 +- torch/csrc/jit/codegen/fuser/codegen.cpp | 2 +- .../csrc/jit/codegen/fuser/cpu/fused_kernel.h | 1 - torch/csrc/jit/codegen/fuser/cpu/temp_file.h | 4 +- .../jit/codegen/fuser/cuda/fused_kernel.cpp | 7 + torch/csrc/jit/codegen/fuser/fused_kernel.h | 4 +- .../jit/codegen/onednn/LlgaTensorImpl.cpp | 11 +- .../csrc/jit/codegen/onednn/LlgaTensorImpl.h | 9 +- torch/csrc/jit/codegen/onednn/README.md | 31 +- .../jit/codegen/onednn/decompose_silu.cpp | 65 + .../csrc/jit/codegen/onednn/decompose_silu.h | 15 + .../csrc/jit/codegen/onednn/graph_helper.cpp | 495 +- torch/csrc/jit/codegen/onednn/graph_helper.h | 15 +- torch/csrc/jit/codegen/onednn/interface.cpp | 13 +- torch/csrc/jit/codegen/onednn/kernel.cpp | 53 +- .../jit/codegen/onednn/layout_propagation.cpp | 9 + torch/csrc/jit/codegen/onednn/operator.h | 63 +- .../jit/codegen/onednn/prepare_binary.cpp | 123 +- torch/csrc/jit/docs/serialization.md | 2 +- .../jit/frontend/function_schema_parser.cpp | 6 +- torch/csrc/jit/frontend/ir_emitter.cpp | 2 +- torch/csrc/jit/frontend/schema_matching.cpp | 19 +- torch/csrc/jit/frontend/schema_matching.h | 2 + .../csrc/jit/frontend/schema_type_parser.cpp | 9 +- .../csrc/jit/frontend/script_type_parser.cpp | 2 +- torch/csrc/jit/frontend/tracer.cpp | 45 +- torch/csrc/jit/frontend/tracer.h | 18 + torch/csrc/jit/ir/alias_analysis.cpp | 15 +- torch/csrc/jit/ir/constants.cpp | 6 +- torch/csrc/jit/ir/ir.cpp | 3 + torch/csrc/jit/ir/ir.h | 26 +- torch/csrc/jit/ir/irparser.cpp | 2 +- .../mobile/compatibility/backport_manager.cpp | 2 + .../compatibility/model_compatibility.cpp | 2 +- torch/csrc/jit/mobile/flatbuffer_loader.cpp | 8 +- torch/csrc/jit/mobile/import.cpp | 54 +- torch/csrc/jit/mobile/import_data.cpp | 2 +- torch/csrc/jit/mobile/interpreter.cpp | 4 + .../jit/mobile/model_tracer/TracerRunner.cpp | 27 +- .../jit/mobile/model_tracer/TracerRunner.h | 13 + torch/csrc/jit/mobile/model_tracer/tracer.cpp | 83 +- torch/csrc/jit/mobile/module.cpp | 44 + torch/csrc/jit/mobile/module.h | 18 + torch/csrc/jit/mobile/parse_bytecode.cpp | 2 +- torch/csrc/jit/mobile/parse_operators.cpp | 4 +- torch/csrc/jit/mobile/profiler_edge.cpp | 12 +- torch/csrc/jit/mobile/profiler_edge.h | 3 +- torch/csrc/jit/mobile/promoted_prim_ops.cpp | 20 + torch/csrc/jit/mobile/promoted_prim_ops.h | 8 + torch/csrc/jit/mobile/quantization.cpp | 66 + torch/csrc/jit/mobile/quantization.h | 38 + torch/csrc/jit/operator_upgraders/README.md | 6 +- torch/csrc/jit/passes/freeze_module.cpp | 199 +- .../frozen_conv_add_relu_fusion_cuda.cpp | 16 +- .../csrc/jit/passes/frozen_ops_to_mkldnn.cpp | 14 +- .../jit/passes/hoist_conv_packed_params.cpp | 15 +- torch/csrc/jit/passes/mkldnn_rewrite.cpp | 3 - torch/csrc/jit/passes/mobile_optimizer_type.h | 13 + torch/csrc/jit/passes/normalize_ops.cpp | 2 + torch/csrc/jit/passes/onnx.cpp | 88 +- torch/csrc/jit/passes/onnx.h | 2 - torch/csrc/jit/passes/onnx/constant_fold.cpp | 6 +- .../passes/onnx/fixup_onnx_controlflow.cpp | 17 +- .../jit/passes/onnx/function_extraction.cpp | 11 +- .../jit/passes/onnx/function_substitution.cpp | 109 +- torch/csrc/jit/passes/onnx/helper.cpp | 12 +- torch/csrc/jit/passes/onnx/naming.cpp | 205 + torch/csrc/jit/passes/onnx/naming.h | 30 + .../pattern_conversion/pattern_conversion.cpp | 18 +- 
torch/csrc/jit/passes/onnx/peephole.cpp | 8 +- .../onnx/remove_inplace_ops_for_onnx.cpp | 15 + .../jit/passes/onnx/scalar_type_analysis.cpp | 7 +- .../jit/passes/onnx/shape_type_inference.cpp | 145 +- .../jit/passes/onnx/shape_type_inference.h | 7 +- .../passes/onnx/unpack_quantized_weights.cpp | 47 +- torch/csrc/jit/passes/peephole_non_tensor.cpp | 2 +- .../csrc/jit/passes/quantization/finalize.cpp | 172 + torch/csrc/jit/passes/quantization/finalize.h | 4 + torch/csrc/jit/passes/quantization/helper.cpp | 15 + torch/csrc/jit/passes/quantization/helper.h | 10 +- .../passes/quantization/insert_observers.cpp | 133 + .../passes/quantization/insert_observers.h | 22 + .../quantization/insert_quant_dequant.cpp | 349 +- .../quantization/insert_quant_dequant.h | 7 + .../quantization/quantization_patterns.h | 23 + .../quantization/register_packed_params.cpp | 149 + .../quantization/register_packed_params.h | 20 + .../jit/passes/symbolic_shape_analysis.cpp | 53 +- torch/csrc/jit/passes/tensorexpr_fuser.cpp | 52 +- torch/csrc/jit/passes/utils/memory_dag.cpp | 2 +- torch/csrc/jit/passes/vulkan_rewrite.cpp | 101 +- torch/csrc/jit/passes/vulkan_rewrite.h | 2 + torch/csrc/jit/passes/xnnpack_rewrite.cpp | 1 + torch/csrc/jit/passes/xnnpack_rewrite.h | 10 +- torch/csrc/jit/python/init.cpp | 345 +- torch/csrc/jit/python/module_python.h | 20 +- torch/csrc/jit/python/pybind_utils.cpp | 412 +- torch/csrc/jit/python/pybind_utils.h | 290 +- torch/csrc/jit/python/python_ir.cpp | 12 +- .../csrc/jit/python/python_sugared_value.cpp | 2 +- torch/csrc/jit/python/python_tracer.cpp | 65 +- torch/csrc/jit/python/python_tracer.h | 10 + torch/csrc/jit/python/script_init.cpp | 99 +- .../jit/runtime/decomposition_registry.cpp | 42 + .../csrc/jit/runtime/decomposition_registry.h | 6 + torch/csrc/jit/runtime/graph_executor.cpp | 13 +- torch/csrc/jit/runtime/interpreter.cpp | 3 +- torch/csrc/jit/runtime/register_prim_ops.cpp | 49 +- .../serialized_shape_function_registry.cpp | 44 +- torch/csrc/jit/runtime/static/README.md | 104 +- .../csrc/jit/runtime/static/generated_ops.cpp | 230 +- torch/csrc/jit/runtime/static/impl.cpp | 42 +- torch/csrc/jit/runtime/static/native_ops.cpp | 286 +- torch/csrc/jit/runtime/static/ops.cpp | 153 +- torch/csrc/jit/runtime/static/ops.h | 21 +- torch/csrc/jit/runtime/static/passes.cpp | 28 +- torch/csrc/jit/runtime/static/passes.h | 4 +- .../jit/runtime/symbolic_shape_registry.cpp | 6 + .../runtime/symbolic_shape_registry_util.cpp | 2 + .../callstack_debug_info_serialization.cpp | 4 +- torch/csrc/jit/serialization/export.cpp | 84 +- .../jit/serialization/export_bytecode.cpp | 2 +- .../csrc/jit/serialization/export_module.cpp | 7 +- .../serialization/flatbuffer_serializer.cpp | 9 +- torch/csrc/jit/serialization/import.cpp | 2 +- .../csrc/jit/serialization/import_source.cpp | 33 +- torch/csrc/jit/serialization/pickler.cpp | 21 +- torch/csrc/jit/serialization/pickler.h | 74 +- .../source_range_serialization.cpp | 2 +- torch/csrc/jit/serialization/unpickler.cpp | 96 +- torch/csrc/jit/serialization/unpickler.h | 6 +- torch/csrc/jit/tensorexpr/half_support.h | 19 +- torch/csrc/jit/tensorexpr/kernel.cpp | 118 +- torch/csrc/jit/tensorexpr/kernel.h | 3 + torch/csrc/jit/tensorexpr/llvm_codegen.cpp | 110 +- torch/csrc/jit/tensorexpr/loopnest.cpp | 12 +- torch/csrc/jit/tensorexpr/lowerings.cpp | 52 + torch/csrc/jit/tensorexpr/operators/misc.cpp | 41 +- torch/csrc/jit/tensorexpr/reduction.cpp | 31 + torch/csrc/jit/tensorexpr/reduction.h | 62 + torch/csrc/jit/tensorexpr/stmt.h | 8 +- 
torch/csrc/jit/tensorexpr/tensor.cpp | 40 +- torch/csrc/jit/tensorexpr/tensor.h | 9 + torch/csrc/jit/tensorexpr/types.cpp | 2 +- torch/csrc/lazy/backend/backend_device.cpp | 6 +- torch/csrc/lazy/backend/backend_device.h | 2 + torch/csrc/lazy/backend/backend_interface.cpp | 3 +- torch/csrc/lazy/backend/backend_interface.h | 6 +- torch/csrc/lazy/backend/lowering_context.cpp | 2 +- torch/csrc/lazy/backend/lowering_context.h | 4 +- torch/csrc/lazy/core/config.cpp | 7 +- torch/csrc/lazy/core/config.h | 1 + torch/csrc/lazy/core/debug_util.cpp | 2 +- torch/csrc/lazy/core/dynamic_ir.h | 5 +- torch/csrc/lazy/core/internal_ops/ltc_ops.h | 8 - torch/csrc/lazy/core/ir_builder.h | 160 +- torch/csrc/lazy/core/ir_dump_util.cpp | 16 +- torch/csrc/lazy/core/ir_dump_util.h | 12 +- torch/csrc/lazy/core/ir_metadata.cpp | 29 +- torch/csrc/lazy/core/ir_util.cpp | 30 +- torch/csrc/lazy/core/ir_util.h | 11 +- torch/csrc/lazy/core/lazy_graph_executor.cpp | 81 +- torch/csrc/lazy/core/lazy_graph_executor.h | 21 +- torch/csrc/lazy/core/lazy_view.cpp | 262 - torch/csrc/lazy/core/lazy_view.h | 173 - torch/csrc/lazy/core/metrics.cpp | 45 +- torch/csrc/lazy/core/metrics.h | 9 + torch/csrc/lazy/core/shape_inference.cpp | 90 +- torch/csrc/lazy/core/shape_inference.h | 9 +- torch/csrc/lazy/core/tensor.cpp | 167 +- torch/csrc/lazy/core/tensor.h | 61 +- torch/csrc/lazy/core/tensor_impl.cpp | 49 +- torch/csrc/lazy/core/tensor_impl.h | 20 +- torch/csrc/lazy/core/tensor_util.cpp | 3 + torch/csrc/lazy/python/init.cpp | 18 +- torch/csrc/lazy/python/python_util.cpp | 4 +- torch/csrc/lazy/ts_backend/dynamic_ir.cpp | 14 +- torch/csrc/lazy/ts_backend/dynamic_ir.h | 8 +- torch/csrc/lazy/ts_backend/ir_builder.h | 82 - .../csrc/lazy/ts_backend/tensor_aten_ops.cpp | 219 - torch/csrc/lazy/ts_backend/tensor_aten_ops.h | 92 +- .../csrc/lazy/ts_backend/ts_backend_impl.cpp | 11 +- .../lazy/ts_backend/ts_lowering_context.cpp | 2 +- .../lazy/ts_backend/ts_lowering_context.h | 2 +- .../lazy/ts_backend/ts_native_functions.cpp | 92 +- .../csrc/lazy/ts_backend/ts_node_lowering.cpp | 196 - torch/csrc/lazy/tutorial.md | 2 +- torch/csrc/onnx/diagnostics/diagnostics.h | 63 + torch/csrc/onnx/diagnostics/generated/rules.h | 48 + torch/csrc/onnx/init.cpp | 13 +- torch/csrc/onnx/onnx.h | 2 + torch/csrc/profiler/api.cpp | 184 - torch/csrc/profiler/api.h | 167 +- torch/csrc/profiler/collection.cpp | 571 +- torch/csrc/profiler/collection.h | 289 +- torch/csrc/profiler/containers.h | 22 +- torch/csrc/profiler/data_flow.cpp | 197 + torch/csrc/profiler/data_flow.h | 95 + torch/csrc/profiler/events.h | 30 + .../csrc/profiler/kineto_client_interface.cpp | 43 +- torch/csrc/profiler/kineto_shim.cpp | 4 +- torch/csrc/profiler/kineto_shim.h | 2 + .../csrc/profiler/orchestration/observer.cpp | 181 + torch/csrc/profiler/orchestration/observer.h | 135 + .../profiler/orchestration/python_tracer.cpp | 37 + .../profiler/orchestration/python_tracer.h | 62 + torch/csrc/profiler/perf-inl.h | 72 + torch/csrc/profiler/perf.cpp | 199 + torch/csrc/profiler/perf.h | 105 + torch/csrc/profiler/python/init.cpp | 295 + torch/csrc/profiler/python/init.h | 35 + torch/csrc/profiler/python/pybind.h | 50 + .../execution_graph_observer.cpp | 28 +- .../execution_graph_observer.h | 0 .../{ => standalone}/itt_observer.cpp | 9 +- .../profiler/{ => standalone}/itt_observer.h | 0 .../{ => standalone}/nvtx_observer.cpp | 11 +- .../profiler/{ => standalone}/nvtx_observer.h | 0 torch/csrc/profiler/stubs/base.cpp | 81 + torch/csrc/profiler/stubs/base.h | 43 + torch/csrc/profiler/{ => stubs}/cuda.cpp | 
16 +- torch/csrc/profiler/{ => stubs}/itt.cpp | 6 +- torch/csrc/profiler/util.cpp | 17 +- torch/csrc/profiler/util.h | 26 +- torch/csrc/serialization.cpp | 25 +- torch/csrc/tensor/python_tensor.cpp | 8 +- torch/csrc/utils.cpp | 42 + torch/csrc/utils/disable_torch_function.cpp | 5 +- torch/csrc/utils/disallow_copy.h | 5 - torch/csrc/utils/invalid_arguments.cpp | 33 +- torch/csrc/utils/nested.cpp | 91 + torch/csrc/utils/nested.h | 17 + torch/csrc/utils/pybind.cpp | 83 + torch/csrc/utils/pybind.h | 130 +- torch/csrc/utils/python_arg_parser.cpp | 225 +- torch/csrc/utils/python_arg_parser.h | 188 +- torch/csrc/utils/python_compat.h | 5 + torch/csrc/utils/python_dispatch.cpp | 328 +- torch/csrc/utils/python_dispatch.h | 7 +- torch/csrc/utils/python_numbers.h | 9 - torch/csrc/utils/python_symnode.cpp | 19 + torch/csrc/utils/python_symnode.h | 178 + torch/csrc/utils/python_torch_function_mode.h | 15 +- torch/csrc/utils/schema_info.cpp | 4 + torch/csrc/utils/schema_info.h | 2 + torch/csrc/utils/tensor_memoryformats.cpp | 6 +- torch/csrc/utils/tensor_memoryformats.h | 4 +- torch/csrc/utils/tensor_new.cpp | 32 +- torch/csrc/utils/tensor_types.cpp | 4 + torch/csrc/utils/torch_dispatch_mode.h | 29 +- torch/cuda/__init__.py | 183 +- torch/cuda/_dynamo_graphs.py | 21 +- torch/cuda/_memory_viz.py | 256 +- torch/cuda/_sanitizer.py | 641 + torch/cuda/amp/autocast_mode.py | 4 +- torch/cuda/amp/common.py | 1 + torch/cuda/amp/grad_scaler.py | 2 + torch/cuda/graphs.py | 13 +- torch/cuda/jiterator.py | 4 +- torch/cuda/memory.py | 124 +- torch/cuda/profiler.py | 1 + torch/deploy.h | 3 - torch/distributed/__init__.py | 18 +- torch/distributed/_composable/__init__.py | 4 + torch/distributed/_composable/_ddp.py | 1877 + .../_composable/checkpoint_activation.py | 157 + torch/distributed/_composable/contract.py | 152 + torch/distributed/_composable/fully_shard.py | 80 + torch/distributed/_composable/replicate.py | 107 + .../distributed/_shard/checkpoint/__init__.py | 25 +- .../_shard/checkpoint/filesystem.py | 145 - .../_shard/checkpoint/resharding.py | 306 - .../_shard/checkpoint/state_dict_loader.py | 174 - .../_shard/checkpoint/state_dict_saver.py | 177 - .../distributed/_shard/checkpoint/storage.py | 188 - .../_shard/sharded_tensor/_ops/tensor_ops.py | 15 +- .../distributed/_shard/sharded_tensor/api.py | 10 +- .../_shard/sharding_spec/_internals.py | 85 +- .../sharding_spec/chunk_sharding_spec.py | 8 +- .../chunk_sharding_spec_ops/_common.py | 270 +- .../chunk_sharding_spec_ops/embedding.py | 170 +- .../chunk_sharding_spec_ops/embedding_bag.py | 621 +- torch/distributed/_sharding_spec/__init__.py | 4 +- torch/distributed/_spmd/__init__.py | 0 torch/distributed/_spmd/comm_tensor.py | 241 + torch/distributed/_tensor/README.md | 3 + torch/distributed/_tensor/__init__.py | 189 + torch/distributed/_tensor/api.py | 393 + torch/distributed/_tensor/device_mesh.py | 506 + torch/distributed/_tensor/dispatch.py | 301 + torch/distributed/_tensor/ops/__init__.py | 7 + torch/distributed/_tensor/ops/common_rules.py | 376 + torch/distributed/_tensor/ops/math_ops.py | 141 + torch/distributed/_tensor/ops/matrix_ops.py | 129 + .../distributed/_tensor/ops/pointwise_ops.py | 396 + torch/distributed/_tensor/ops/tensor_ops.py | 481 + .../_tensor/ops/tp_sharding_ops.py | 55 + torch/distributed/_tensor/ops/utils.py | 81 + torch/distributed/_tensor/ops/view_ops.py | 707 + .../distributed/_tensor/parallel/__init__.py | 36 + .../_tensor/parallel/_view_with_dim_change.py | 108 + torch/distributed/_tensor/parallel/api.py | 415 + 
torch/distributed/_tensor/parallel/fsdp.py | 359 + .../parallel/multihead_attention_tp.py | 273 + torch/distributed/_tensor/parallel/style.py | 233 + torch/distributed/_tensor/parallel/utils.py | 152 + torch/distributed/_tensor/placement_types.py | 432 + torch/distributed/_tensor/redistribute.py | 236 + torch/distributed/_tensor/utils.py | 53 + .../_checkpoint/checkpoint_wrapper.py | 197 +- .../algorithms/_comm_hooks/default_hooks.py | 85 +- .../ddp_comm_hooks/ddp_zero_hook.py | 5 +- .../ddp_comm_hooks/debugging_hooks.py | 1 + .../ddp_comm_hooks/default_hooks.py | 1 + .../ddp_comm_hooks/optimizer_overlap_hooks.py | 4 +- .../ddp_comm_hooks/powerSGD_hook.py | 4 +- .../hierarchical_model_averager.py | 2 +- .../algorithms/model_averaging/utils.py | 2 + torch/distributed/benchmarks/README.md | 2 +- .../benchmarks/benchmark_ddp_rpc.py | 2 +- torch/distributed/c10d_error_logger.py | 33 + torch/distributed/checkpoint/__init__.py | 21 + .../{_shard => }/checkpoint/api.py | 10 +- torch/distributed/checkpoint/dedup_tensors.py | 58 + .../distributed/checkpoint/default_planner.py | 244 + torch/distributed/checkpoint/filesystem.py | 313 + .../{_shard => }/checkpoint/metadata.py | 55 +- torch/distributed/checkpoint/planner.py | 377 + .../distributed/checkpoint/planner_helpers.py | 221 + torch/distributed/checkpoint/resharding.py | 55 + .../checkpoint/state_dict_loader.py | 111 + .../checkpoint/state_dict_saver.py | 115 + torch/distributed/checkpoint/storage.py | 233 + torch/distributed/checkpoint/traverse.py | 170 + .../{_shard => }/checkpoint/utils.py | 113 +- torch/distributed/distributed_c10d.py | 894 +- .../elastic/agent/server/__init__.py | 3 +- torch/distributed/elastic/agent/server/api.py | 2 +- .../agent/server/local_elastic_agent.py | 99 +- .../elastic/multiprocessing/api.py | 6 +- .../multiprocessing/errors/__init__.py | 1 + .../elastic/multiprocessing/tail_log.py | 1 + .../elastic/rendezvous/etcd_rendezvous.py | 2 +- torch/distributed/elastic/timer/__init__.py | 1 + .../elastic/timer/file_based_local_timer.py | 330 + torch/distributed/fsdp/__init__.py | 2 +- torch/distributed/fsdp/_common_utils.py | 202 + torch/distributed/fsdp/_exec_order_utils.py | 384 + torch/distributed/fsdp/_fsdp_extensions.py | 115 + torch/distributed/fsdp/_init_utils.py | 763 + torch/distributed/fsdp/_limiter_utils.py | 33 + torch/distributed/fsdp/_optim_utils.py | 595 +- torch/distributed/fsdp/_runtime_utils.py | 1155 + .../fsdp/{shard_utils.py => _shard_utils.py} | 111 +- torch/distributed/fsdp/_state_dict_utils.py | 694 + torch/distributed/fsdp/_symbolic_trace.py | 15 +- .../distributed/fsdp/_unshard_param_utils.py | 254 + torch/distributed/fsdp/_utils.py | 107 +- torch/distributed/fsdp/_wrap_utils.py | 170 + torch/distributed/fsdp/api.py | 245 + torch/distributed/fsdp/flat_param.py | 1657 +- .../fsdp/flatten_params_wrapper.py | 156 - .../fsdp/fully_sharded_data_parallel.py | 4369 +- torch/distributed/fsdp/sharded_grad_scaler.py | 66 +- torch/distributed/fsdp/wrap.py | 309 +- torch/distributed/logging_handlers.py | 16 + torch/distributed/nn/api/remote_module.py | 41 +- torch/distributed/optim/__init__.py | 1 + .../optim/apply_optimizer_in_backward.py | 78 + torch/distributed/optim/functional_adam.py | 13 +- torch/distributed/optim/functional_rprop.py | 5 +- torch/distributed/optim/optimizer.py | 6 +- .../optim/zero_redundancy_optimizer.py | 20 +- .../optim/zero_redundancy_optimizer.pyi | 3 - .../pipeline/sync/_balance/profile.py | 4 +- torch/distributed/pipeline/sync/checkpoint.py | 4 +- 
torch/distributed/pipeline/sync/copy.py | 2 +- torch/distributed/pipeline/sync/dependency.py | 2 +- torch/distributed/pipeline/sync/microbatch.py | 2 +- torch/distributed/pipeline/sync/phony.py | 2 +- torch/distributed/pipeline/sync/pipe.py | 7 +- torch/distributed/pipeline/sync/pipeline.py | 2 +- torch/distributed/pipeline/sync/stream.py | 6 +- torch/distributed/pipeline/sync/utils.py | 2 + torch/distributed/pipeline/sync/worker.py | 2 +- torch/distributed/rpc/__init__.py | 5 +- torch/distributed/rpc/api.py | 18 +- torch/distributed/rpc/backend_registry.py | 3 + torch/distributed/rpc/constants.py | 4 +- torch/distributed/rpc/internal.py | 16 +- torch/distributed/rpc/options.py | 3 +- torch/distributed/utils.py | 73 +- torch/distributions/distribution.py | 50 +- torch/distributions/half_cauchy.py | 2 +- torch/distributions/half_normal.py | 2 +- torch/distributions/kl.py | 1 + torch/distributions/lkj_cholesky.py | 2 +- .../lowrank_multivariate_normal.py | 4 +- torch/distributions/mixture_same_family.py | 2 +- torch/distributions/multivariate_normal.py | 4 +- .../distributions/transformed_distribution.py | 25 +- torch/distributions/utils.py | 2 + torch/distributions/wishart.py | 11 +- torch/fft/__init__.py | 2 +- torch/functional.py | 185 +- torch/futures/__init__.py | 2 + torch/fx/OVERVIEW.md | 2 +- torch/fx/_symbolic_trace.py | 113 +- .../experimental/accelerator_partitioner.py | 2 +- torch/fx/experimental/const_fold.py | 9 +- .../experimental/graph_gradual_typechecker.py | 4 +- torch/fx/experimental/meta_tracer.py | 4 +- .../constraint_generator.py | 64 +- torch/fx/experimental/normalize.py | 1 + torch/fx/experimental/proxy_tensor.py | 787 +- torch/fx/experimental/symbolic_shapes.py | 713 +- torch/fx/experimental/unification/core.py | 2 + torch/fx/experimental/unification/dispatch.py | 2 +- torch/fx/experimental/unification/match.py | 4 +- .../unification/multipledispatch/conflict.py | 2 + .../unification/multipledispatch/core.py | 5 +- .../multipledispatch/dispatcher.py | 4 +- .../unification/multipledispatch/utils.py | 1 + .../unification/multipledispatch/variadic.py | 1 + torch/fx/experimental/unification/utils.py | 1 + torch/fx/graph.py | 128 +- torch/fx/graph_module.py | 22 +- torch/fx/immutable_collections.py | 2 + torch/fx/interpreter.py | 17 +- torch/fx/node.py | 17 +- torch/fx/operator_schemas.py | 3 + torch/fx/passes/README.md | 2 +- torch/fx/passes/backends/cudagraphs.py | 7 +- torch/fx/passes/backends/nvfuser.py | 286 - torch/fx/passes/fake_tensor_prop.py | 14 +- torch/fx/passes/graph_drawer.py | 14 +- torch/fx/passes/infra/partitioner.py | 299 +- torch/fx/passes/infra/pass_manager.py | 85 +- torch/fx/passes/net_min_base.py | 153 +- torch/fx/passes/pass_manager.py | 65 +- torch/fx/passes/reinplace.py | 316 +- torch/fx/passes/shape_prop.py | 2 +- torch/fx/passes/split_module.py | 162 +- torch/fx/passes/split_utils.py | 8 +- torch/fx/passes/splitter_base.py | 72 +- torch/fx/passes/tests/test_pass_manager.py | 22 + torch/fx/passes/utils/fuser_utils.py | 4 +- torch/fx/passes/utils/matcher_utils.py | 183 +- torch/fx/proxy.py | 21 +- torch/fx/subgraph_rewriter.py | 339 +- torch/fx/tensor_type.py | 4 +- torch/fx/traceback.py | 13 +- torch/hub.py | 16 +- torch/jit/_builtins.py | 2 +- torch/jit/_freeze.py | 7 +- torch/jit/_fuser.py | 17 +- torch/jit/_recursive.py | 3 + torch/jit/_shape_functions.py | 63 +- torch/jit/_trace.py | 175 +- torch/jit/annotations.py | 14 +- torch/jit/frontend.py | 4 +- torch/jit/quantized.py | 18 +- torch/lib/libshm/CMakeLists.txt | 30 +- torch/library.h | 27 
+- torch/library.py | 8 +- torch/linalg/__init__.py | 51 +- torch/masked/__init__.py | 37 + torch/{_masked => masked}/_docs.py | 42 +- torch/{_masked/__init__.py => masked/_ops.py} | 179 +- torch/masked/maskedtensor/__init__.py | 8 + torch/masked/maskedtensor/_ops_refs.py | 473 + torch/masked/maskedtensor/binary.py | 192 + torch/masked/maskedtensor/core.py | 335 + torch/masked/maskedtensor/creation.py | 21 + torch/masked/maskedtensor/passthrough.py | 43 + torch/masked/maskedtensor/reductions.py | 173 + torch/masked/maskedtensor/unary.py | 188 + torch/monitor/__init__.py | 1 + torch/multiprocessing/reductions.py | 21 +- torch/nested/__init__.py | 149 + torch/nn/functional.py | 204 +- torch/nn/init.py | 2 +- torch/nn/intrinsic/__init__.py | 36 +- torch/nn/intrinsic/modules/__init__.py | 15 +- torch/nn/intrinsic/modules/fused.py | 158 +- torch/nn/intrinsic/qat/modules/conv_fused.py | 764 +- .../nn/intrinsic/qat/modules/linear_fused.py | 176 +- torch/nn/intrinsic/qat/modules/linear_relu.py | 57 +- torch/nn/intrinsic/quantized/__init__.py | 9 + .../quantized/dynamic/modules/__init__.py | 1 - .../quantized/dynamic/modules/linear_relu.py | 54 +- .../nn/intrinsic/quantized/modules/bn_relu.py | 83 +- .../intrinsic/quantized/modules/conv_relu.py | 175 +- .../quantized/modules/linear_relu.py | 44 +- torch/nn/modules/_functions.py | 12 +- torch/nn/modules/activation.py | 79 +- torch/nn/modules/batchnorm.py | 15 +- torch/nn/modules/container.py | 4 +- torch/nn/modules/conv.py | 4 +- torch/nn/modules/distance.py | 13 +- torch/nn/modules/fold.py | 16 +- torch/nn/modules/loss.py | 9 +- torch/nn/modules/module.py | 472 +- torch/nn/modules/pooling.py | 6 +- torch/nn/modules/rnn.py | 3 + torch/nn/modules/sparse.py | 10 +- torch/nn/modules/transformer.py | 56 +- torch/nn/modules/upsampling.py | 88 +- torch/nn/parallel/distributed.py | 501 +- torch/nn/parallel/distributed.pyi | 21 - torch/nn/parameter.py | 12 +- torch/nn/qat/__init__.py | 17 + torch/nn/qat/dynamic/__init__.py | 6 + torch/nn/qat/dynamic/modules/linear.py | 35 +- torch/nn/qat/modules/__init__.py | 20 +- torch/nn/qat/modules/conv.py | 276 +- torch/nn/qat/modules/embedding_ops.py | 151 +- torch/nn/qat/modules/linear.py | 87 +- torch/nn/quantizable/modules/__init__.py | 6 +- torch/nn/quantizable/modules/activation.py | 464 +- torch/nn/quantizable/modules/rnn.py | 395 +- torch/nn/quantized/__init__.py | 39 + .../quantized/_reference/modules/__init__.py | 19 +- torch/nn/quantized/_reference/modules/conv.py | 335 +- .../nn/quantized/_reference/modules/linear.py | 67 +- torch/nn/quantized/_reference/modules/rnn.py | 494 +- .../nn/quantized/_reference/modules/sparse.py | 105 +- .../nn/quantized/_reference/modules/utils.py | 175 +- torch/nn/quantized/dynamic/__init__.py | 2 +- .../nn/quantized/dynamic/modules/__init__.py | 19 +- torch/nn/quantized/dynamic/modules/conv.py | 409 +- torch/nn/quantized/dynamic/modules/linear.py | 137 +- torch/nn/quantized/dynamic/modules/rnn.py | 1066 +- torch/nn/quantized/functional.py | 619 +- torch/nn/quantized/modules/__init__.py | 127 +- torch/nn/quantized/modules/activation.py | 296 +- torch/nn/quantized/modules/batchnorm.py | 115 +- torch/nn/quantized/modules/conv.py | 934 +- torch/nn/quantized/modules/dropout.py | 35 +- torch/nn/quantized/modules/embedding_ops.py | 303 +- .../quantized/modules/functional_modules.py | 240 +- torch/nn/quantized/modules/linear.py | 305 +- torch/nn/quantized/modules/normalization.py | 216 +- torch/nn/quantized/modules/rnn.py | 54 +- torch/nn/quantized/modules/utils.py | 88 +- 
torch/nn/utils/_deprecation_utils.py | 45 + .../conv_expanded_weights.py | 23 +- .../nn/utils/_expanded_weights/conv_utils.py | 56 +- torch/nn/utils/fusion.py | 8 +- torch/nn/utils/parametrizations.py | 16 +- torch/nn/utils/parametrize.py | 6 +- torch/nn/utils/stateless.py | 18 +- torch/onnx/README.md | 96 +- torch/onnx/__init__.py | 48 +- torch/onnx/_constants.py | 13 +- torch/onnx/_deprecation.py | 39 +- torch/onnx/_exporter_states.py | 25 +- torch/onnx/_globals.py | 30 +- torch/onnx/_internal/__init__.py | 0 torch/onnx/_internal/_beartype.py | 99 + torch/onnx/_internal/diagnostics/OVERVIEW.md | 83 + torch/onnx/_internal/diagnostics/__init__.py | 19 + .../onnx/_internal/diagnostics/_diagnostic.py | 153 + torch/onnx/_internal/diagnostics/_rules.py | 172 + .../_internal/diagnostics/infra/__init__.py | 27 + .../_internal/diagnostics/infra/_infra.py | 450 + .../_internal/diagnostics/infra/engine.py | 107 + .../_internal/diagnostics/infra/formatter.py | 77 + .../diagnostics/infra/sarif/__init__.py | 100 + .../diagnostics/infra/sarif/_address.py | 48 + .../diagnostics/infra/sarif/_artifact.py | 90 + .../infra/sarif/_artifact_change.py | 31 + .../infra/sarif/_artifact_content.py | 33 + .../infra/sarif/_artifact_location.py | 33 + .../diagnostics/infra/sarif/_attachment.py | 39 + .../diagnostics/infra/sarif/_code_flow.py | 31 + .../infra/sarif/_configuration_override.py | 31 + .../diagnostics/infra/sarif/_conversion.py | 35 + .../diagnostics/infra/sarif/_edge.py | 31 + .../infra/sarif/_edge_traversal.py | 31 + .../diagnostics/infra/sarif/_exception.py | 37 + .../infra/sarif/_external_properties.py | 100 + .../_external_property_file_reference.py | 33 + .../_external_property_file_references.py | 86 + .../_internal/diagnostics/infra/sarif/_fix.py | 31 + .../diagnostics/infra/sarif/_graph.py | 35 + .../infra/sarif/_graph_traversal.py | 43 + .../diagnostics/infra/sarif/_invocation.py | 117 + .../diagnostics/infra/sarif/_location.py | 50 + .../infra/sarif/_location_relationship.py | 28 + .../infra/sarif/_logical_location.py | 39 + .../diagnostics/infra/sarif/_message.py | 33 + .../sarif/_multiformat_message_string.py | 25 + .../diagnostics/infra/sarif/_node.py | 36 + .../diagnostics/infra/sarif/_notification.py | 55 + .../infra/sarif/_physical_location.py | 40 + .../diagnostics/infra/sarif/_property_bag.py | 19 + .../diagnostics/infra/sarif/_rectangle.py | 36 + .../diagnostics/infra/sarif/_region.py | 58 + .../diagnostics/infra/sarif/_replacement.py | 31 + .../infra/sarif/_reporting_configuration.py | 35 + .../infra/sarif/_reporting_descriptor.py | 71 + .../sarif/_reporting_descriptor_reference.py | 38 + .../_reporting_descriptor_relationship.py | 34 + .../diagnostics/infra/sarif/_result.py | 130 + .../infra/sarif/_result_provenance.py | 44 + .../_internal/diagnostics/infra/sarif/_run.py | 136 + .../infra/sarif/_run_automation_details.py | 33 + .../diagnostics/infra/sarif/_sarif_log.py | 39 + .../infra/sarif/_special_locations.py | 27 + .../diagnostics/infra/sarif/_stack.py | 31 + .../diagnostics/infra/sarif/_stack_frame.py | 33 + .../diagnostics/infra/sarif/_suppression.py | 38 + .../diagnostics/infra/sarif/_thread_flow.py | 40 + .../infra/sarif/_thread_flow_location.py | 69 + .../diagnostics/infra/sarif/_tool.py | 27 + .../infra/sarif/_tool_component.py | 125 + .../infra/sarif/_tool_component_reference.py | 30 + .../infra/sarif/_translation_metadata.py | 44 + .../infra/sarif/_version_control_details.py | 42 + .../diagnostics/infra/sarif/_web_request.py | 48 + 
.../diagnostics/infra/sarif/_web_response.py | 48 + .../diagnostics/infra/sarif/version.py | 5 + .../onnx/_internal/diagnostics/infra/utils.py | 35 + torch/onnx/_internal/diagnostics/rules.yaml | 84 + torch/onnx/_internal/jit_utils.py | 396 + torch/onnx/_internal/onnx_proto_utils.py | 143 + torch/onnx/_internal/registration.py | 339 + torch/onnx/_onnx_supported_ops.py | 85 +- torch/onnx/_patch_torch.py | 158 +- torch/onnx/_type_utils.py | 132 +- torch/onnx/errors.py | 52 +- torch/onnx/symbolic_caffe2.py | 143 +- torch/onnx/symbolic_helper.py | 590 +- torch/onnx/symbolic_opset10.py | 647 +- torch/onnx/symbolic_opset11.py | 698 +- torch/onnx/symbolic_opset12.py | 190 +- torch/onnx/symbolic_opset13.py | 368 +- torch/onnx/symbolic_opset14.py | 54 +- torch/onnx/symbolic_opset15.py | 53 +- torch/onnx/symbolic_opset16.py | 35 +- torch/onnx/symbolic_opset17.py | 56 + torch/onnx/symbolic_opset7.py | 20 +- torch/onnx/symbolic_opset8.py | 202 +- torch/onnx/symbolic_opset9.py | 3442 +- torch/onnx/symbolic_registry.py | 168 - torch/onnx/utils.py | 883 +- torch/onnx/verification.py | 141 +- torch/optim/_functional.py | 3 + torch/optim/adadelta.py | 47 +- torch/optim/adagrad.py | 22 +- torch/optim/adam.py | 219 +- torch/optim/adamax.py | 32 +- torch/optim/adamw.py | 35 +- torch/optim/asgd.py | 40 +- torch/optim/lr_scheduler.py | 93 +- torch/optim/lr_scheduler.pyi | 39 +- torch/optim/nadam.py | 45 +- torch/optim/optimizer.py | 18 +- torch/optim/radam.py | 39 +- torch/optim/rmsprop.py | 58 +- torch/optim/rprop.py | 68 +- torch/optim/sgd.py | 1 + torch/optim/sparse_adam.py | 4 +- torch/optim/swa_utils.py | 9 +- torch/overrides.py | 191 +- torch/package/_mock.py | 2 +- torch/package/package_exporter.py | 6 +- torch/package/package_importer.py | 11 +- torch/profiler/__init__.py | 35 +- torch/profiler/_memory_profiler.py | 807 + torch/profiler/_pattern_matcher.py | 95 +- torch/profiler/_utils.py | 27 +- torch/profiler/itt.py | 19 +- torch/profiler/profiler.py | 43 +- torch/quantization/__init__.py | 5 +- torch/quantization/fuser_method_mappings.py | 2 +- torch/quantization/fx/quantization_types.py | 2 +- torch/quantization/qconfig.py | 4 +- torch/quantization/quant_type.py | 2 +- torch/quantization/quantize_jit.py | 1 + torch/return_types.py | 10 +- torch/serialization.py | 140 +- torch/signal/__init__.py | 5 + torch/signal/windows/__init__.py | 26 + torch/signal/windows/windows.py | 761 + torch/sparse/__init__.py | 41 + torch/sparse/matmul.py | 27 + torch/special/__init__.py | 4 +- torch/storage.py | 254 +- torch/testing/__init__.py | 6 +- torch/testing/_comparison.py | 29 +- torch/testing/_creation.py | 34 +- torch/testing/_deprecated.py | 66 +- .../testing/_internal/autocast_test_lists.py | 19 + .../_internal/check_kernel_launches.py | 2 +- torch/testing/_internal/common_cuda.py | 20 +- torch/testing/_internal/common_device_type.py | 144 +- torch/testing/_internal/common_distributed.py | 178 +- torch/testing/_internal/common_dtype.py | 148 +- torch/testing/_internal/common_fsdp.py | 158 +- .../_internal/common_methods_invocations.py | 13219 +++-- torch/testing/_internal/common_modules.py | 73 +- torch/testing/_internal/common_nn.py | 84 +- .../testing/_internal/common_quantization.py | 11 +- torch/testing/_internal/common_quantized.py | 2 + torch/testing/_internal/common_utils.py | 812 +- .../testing/_internal/composite_compliance.py | 172 +- .../_internal/distributed/_tensor/__init__.py | 0 .../distributed/_tensor/common_dtensor.py | 334 + .../_tensor/dtensor_lagging_op_db.py | 661 + 
.../_tensor/gen_dtensor_lagging_op_db.py | 51 +- .../_internal/distributed/distributed_test.py | 280 +- .../distributed/multi_threaded_pg.py | 375 + .../_internal/distributed/rpc/jit/rpc_test.py | 18 +- .../_internal/distributed/rpc/rpc_test.py | 79 +- .../_internal/distributed/rpc_utils.py | 2 +- torch/testing/_internal/inductor_utils.py | 23 + torch/testing/_internal/opinfo/__init__.py | 2 + torch/testing/_internal/opinfo/core.py | 1040 +- .../_internal/opinfo/definitions/__init__.py | 25 + .../_internal/opinfo/definitions/_masked.py | 1148 + .../_internal/opinfo/definitions/fft.py | 755 + .../_internal/opinfo/definitions/linalg.py | 2232 + .../_internal/opinfo/definitions/signal.py | 827 + .../_internal/opinfo/definitions/special.py | 772 + torch/testing/_internal/opinfo/refs.py | 216 + torch/testing/_internal/opinfo/utils.py | 183 +- torch/testing/_internal/schema_check_mode.py | 6 +- torch/testing/_legacy.py | 158 - torch/types.py | 3 +- torch/utils/__init__.py | 2 + torch/utils/_cuda_trace.py | 23 + torch/utils/_mode_utils.py | 124 +- torch/utils/_python_dispatch.py | 165 +- torch/utils/_pytree.py | 94 +- torch/utils/backend_registration.py | 30 + torch/utils/benchmark/examples/fuzzer.py | 2 +- .../utils/benchmark/examples/sparse/fuzzer.py | 2 +- torch/utils/benchmark/utils/cpp_jit.py | 4 + torch/utils/benchmark/utils/timer.py | 6 +- .../utils/valgrind_wrapper/timer_interface.py | 10 +- torch/utils/bottleneck/__main__.py | 2 +- torch/utils/bundled_inputs.py | 4 +- torch/utils/checkpoint.py | 18 +- torch/utils/collect_env.py | 13 + torch/utils/cpp_backtrace.py | 11 + torch/utils/cpp_extension.py | 102 +- torch/utils/data/__init__.py | 4 - torch/utils/data/_utils/__init__.py | 14 - torch/utils/data/_utils/collate.py | 190 +- torch/utils/data/_utils/fetch.py | 17 +- torch/utils/data/_utils/pin_memory.py | 18 +- torch/utils/data/_utils/worker.py | 16 +- torch/utils/data/communication/__init__.py | 6 - torch/utils/data/communication/eventloop.py | 70 - torch/utils/data/communication/iter.py | 181 - torch/utils/data/communication/map.py | 159 - torch/utils/data/communication/messages.py | 75 - torch/utils/data/communication/protocol.py | 205 - torch/utils/data/communication/queue.py | 51 - torch/utils/data/dataloader.py | 142 +- torch/utils/data/dataloader_experimental.py | 150 - torch/utils/data/datapipes/_hook_iterator.py | 4 +- torch/utils/data/datapipes/_typing.py | 4 +- .../data/datapipes/dataframe/dataframes.py | 2 +- torch/utils/data/datapipes/datapipe.py | 32 +- torch/utils/data/datapipes/gen_pyi.py | 7 +- torch/utils/data/datapipes/iter/callable.py | 6 +- .../data/datapipes/iter/combinatorics.py | 25 +- torch/utils/data/datapipes/iter/combining.py | 66 +- torch/utils/data/datapipes/iter/filelister.py | 2 +- torch/utils/data/datapipes/iter/grouping.py | 49 +- torch/utils/data/datapipes/iter/selecting.py | 45 +- torch/utils/data/datapipes/map/__init__.py | 2 +- .../utils/data/datapipes/map/combinatorics.py | 106 +- torch/utils/data/datapipes/utils/common.py | 101 +- torch/utils/data/datapipes/utils/snapshot.py | 4 +- torch/utils/data/dataset.py | 5 +- torch/utils/data/graph.py | 76 +- torch/utils/data/graph_settings.py | 79 +- torch/utils/dlpack.py | 3 +- torch/utils/hipify/cuda_to_hip_mappings.py | 97 +- torch/utils/hipify/hipify_python.py | 3 +- torch/utils/hooks.py | 64 +- torch/utils/mobile_optimizer.py | 5 +- torch/utils/model_dump/__init__.py | 5 +- torch/utils/show_pickle.py | 1 + torch/utils/tensorboard/_pytorch_graph.py | 4 +- torch/utils/tensorboard/summary.py | 7 +- 
torch/utils/throughput_benchmark.py | 2 +- torchgen/api/autograd.py | 6 +- torchgen/api/cpp.py | 102 +- torchgen/api/dispatcher.py | 32 +- torchgen/api/lazy.py | 81 +- torchgen/api/native.py | 34 +- torchgen/api/python.py | 93 +- torchgen/api/structured.py | 9 +- torchgen/api/translate.py | 51 +- torchgen/api/types.py | 118 +- torchgen/api/ufunc.py | 7 +- torchgen/api/unboxing.py | 19 +- torchgen/context.py | 3 +- torchgen/dest/lazy_ir.py | 94 +- torchgen/dest/register_dispatch_key.py | 55 +- torchgen/gen.py | 274 +- torchgen/gen_backend_stubs.py | 57 +- torchgen/gen_functionalization_type.py | 148 +- torchgen/gen_lazy_tensor.py | 25 +- torchgen/gen_vmap_plumbing.py | 19 +- torchgen/local.py | 16 +- torchgen/model.py | 261 +- torchgen/native_function_generation.py | 12 +- .../gen_jit_shape_functions.py | 30 +- torchgen/static_runtime/config.py | 40 +- .../static_runtime/gen_static_runtime_ops.py | 3 + torchgen/static_runtime/generator.py | 159 +- torchgen/utils.py | 8 + ubsan.supp | 2 - version.txt | 2 +- 3888 files changed, 418919 insertions(+), 188556 deletions(-) create mode 100644 .circleci/README.md create mode 100644 .circleci/docker/common/install_rocm_magma.sh create mode 100755 .circleci/scripts/functorch_doc_push_script.sh create mode 100644 .github/actions/filter-test-configs/action.yml delete mode 100644 .github/actions/pull-docker-image/action.yml delete mode 100644 .github/actions/setup-ssh/action.yml delete mode 100644 .github/actions/teardown-linux/action.yml create mode 100644 .github/auto_request_review.yml create mode 100644 .github/ci_commit_pins/huggingface.txt create mode 100644 .github/ci_commit_pins/text.txt create mode 100644 .github/ci_commit_pins/timm.txt create mode 100644 .github/ci_commit_pins/torchbench.txt delete mode 100644 .github/ci_commit_pins/torchdynamo.txt create mode 100644 .github/ci_commit_pins/triton.txt create mode 100644 .github/labeler.yml delete mode 100644 .github/merge_rules.json create mode 100644 .github/merge_rules.yaml create mode 100644 .github/requirements-gha-cache.txt create mode 100644 .github/requirements/README.md create mode 100644 .github/requirements/conda-env-Linux-X64 create mode 100644 .github/requirements/conda-env-macOS-ARM64 create mode 100644 .github/requirements/conda-env-macOS-X64 create mode 100644 .github/requirements/pip-requirements-macOS.txt delete mode 100644 .github/scale-config.yml delete mode 100644 .github/scripts/build_publish_nightly_docker.sh create mode 100644 .github/scripts/build_triton_wheel.py create mode 100755 .github/scripts/check_labels.py create mode 100644 .github/scripts/comment_on_pr.py create mode 100755 .github/scripts/filter_test_configs.py delete mode 100755 .github/scripts/install_nvidia_utils_linux.sh create mode 100644 .github/scripts/pr-sanity-check.sh delete mode 100644 .github/scripts/process_commit.py create mode 100644 .github/scripts/test_check_labels.py create mode 100755 .github/scripts/test_filter_test_configs.py delete mode 100755 .github/scripts/wait_for_ssh_to_drain.sh create mode 100644 .github/workflows/auto_request_review.yml create mode 100644 .github/workflows/build-triton-wheel.yml create mode 100644 .github/workflows/check-labels.yml create mode 100644 .github/workflows/docker-release.yml delete mode 100644 .github/workflows/generated-windows-binary-wheel-master.yml create mode 100644 .github/workflows/inductor.yml create mode 100644 .github/workflows/labeler.yml delete mode 100644 .github/workflows/pr-labels.yml delete mode 100644 
.github/workflows/push_nightly_docker_ghcr.yml create mode 100644 .github/workflows/scorecards.yml delete mode 100644 .github/workflows/update-commit-hashes.yml create mode 100644 .github/workflows/weekly.yml delete mode 100755 .jenkins/caffe2/bench.sh delete mode 100755 .jenkins/caffe2/build.sh delete mode 100755 .jenkins/caffe2/dirty.sh create mode 100755 .jenkins/pytorch/build-tsan.sh delete mode 100755 .jenkins/pytorch/dirty.sh delete mode 100644 CITATION create mode 100644 CITATION.cff create mode 100644 aten/src/ATen/PadNd.h create mode 100644 aten/src/ATen/core/PythonOpRegistrationTrampoline.cpp create mode 100644 aten/src/ATen/core/PythonOpRegistrationTrampoline.h delete mode 100644 aten/src/ATen/core/TorchDispatchModeTLS.cpp delete mode 100644 aten/src/ATen/core/TorchDispatchModeTLS.h create mode 100644 aten/src/ATen/core/TorchDispatchUtils.cpp create mode 100644 aten/src/ATen/core/TorchDispatchUtils.h rename {functorch/functorch/csrc => aten/src/ATen/functorch}/ADInterpreters.cpp (70%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/ADInterpreters.h (71%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesActivation.cpp (98%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesBinaryOps.cpp (90%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesConvolution.cpp (82%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesDecompositions.cpp (82%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesDynamic.cpp (86%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesFactory.cpp (73%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesHelper.cpp (92%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesHelper.h (94%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesLinearAlgebra.cpp (59%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesLoss.cpp (94%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesModules.cpp (89%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesNorm.cpp (93%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesPooling.cpp (92%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesRandomness.cpp (87%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesReduceOps.cpp (96%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesScatterOps.cpp (98%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesUnaryOps.cpp (97%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchRulesViews.cpp (83%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchedFallback.cpp (97%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchedFallback.h (63%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchedTensorImpl.cpp (58%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchedTensorImpl.h (82%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/BatchingMetaprogramming.h (92%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/DynamicLayer.cpp (72%) create mode 100644 aten/src/ATen/functorch/DynamicLayer.h rename {functorch/functorch/csrc => aten/src/ATen/functorch}/FunctionalizeInterpreter.cpp (94%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/FunctionalizeInterpreter.h (75%) rename {functorch/functorch/csrc => 
aten/src/ATen/functorch}/Interpreter.cpp (75%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/Interpreter.h (91%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/LegacyBatchingRegistrations.cpp (82%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/LegacyVmapTransforms.cpp (88%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/LegacyVmapTransforms.h (95%) create mode 100644 aten/src/ATen/functorch/Macros.h rename {functorch/functorch/csrc => aten/src/ATen/functorch}/PlumbingHelper.cpp (91%) create mode 100644 aten/src/ATen/functorch/PlumbingHelper.h rename {functorch/functorch/csrc => aten/src/ATen/functorch}/PyTorchOperatorHacks.cpp (95%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/TensorWrapper.cpp (89%) create mode 100644 aten/src/ATen/functorch/TensorWrapper.h rename {functorch/functorch/csrc => aten/src/ATen/functorch}/VmapInterpreter.cpp (68%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/VmapInterpreter.h (76%) rename {functorch/functorch/csrc => aten/src/ATen/functorch}/VmapModeRegistrations.cpp (83%) create mode 100644 aten/src/ATen/mps/IndexKernels.h create mode 100644 aten/src/ATen/native/ComparisonUtils.cpp create mode 100644 aten/src/ATen/native/NonSymbolicBC.h delete mode 100644 aten/src/ATen/native/PadNd.h delete mode 100644 aten/src/ATen/native/SpmmReduce.cpp delete mode 100644 aten/src/ATen/native/SpmmReduce.h create mode 100644 aten/src/ATen/native/cpu/CopyKernel.h create mode 100644 aten/src/ATen/native/cpu/SpmmReduceKernel.h create mode 100644 aten/src/ATen/native/cuda/Copy.h create mode 100644 aten/src/ATen/native/cuda/CumminmaxKernel.cu create mode 100644 aten/src/ATen/native/cuda/CumprodKernel.cu create mode 100644 aten/src/ATen/native/cuda/CumsumKernel.cu create mode 100644 aten/src/ATen/native/cuda/FusedAdamKernel.cu create mode 100644 aten/src/ATen/native/cuda/LogcumsumexpKernel.cu create mode 100644 aten/src/ATen/native/cuda/Pow.cuh rename aten/src/ATen/native/cuda/{ScanKernels.cu => ScanUtils.cuh} (84%) create mode 100644 aten/src/ATen/native/cuda/SparseBinaryOpIntersectionKernel.cu create mode 100644 aten/src/ATen/native/cuda/fused_adam_amsgrad_impl.cu create mode 100644 aten/src/ATen/native/cuda/fused_adam_amsgrad_impl.cuh create mode 100644 aten/src/ATen/native/cuda/fused_adam_impl.cu create mode 100644 aten/src/ATen/native/cuda/fused_adam_impl.cuh create mode 100644 aten/src/ATen/native/cuda/fused_adam_utils.cuh create mode 100644 aten/src/ATen/native/mps/MPSGraphVenturaOps.h rename aten/src/ATen/native/mps/operations/{BitwiseBinaryOps.mm => BitwiseOps.mm} (79%) create mode 100644 aten/src/ATen/native/mps/operations/Indexing.h create mode 100644 aten/src/ATen/native/mps/operations/Pad.mm create mode 100644 aten/src/ATen/native/nested/NestedTensorAliases.cpp create mode 100644 aten/src/ATen/native/nested/NestedTensorBinaryOps.cpp create mode 100644 aten/src/ATen/native/nested/NestedTensorBinaryOps.h create mode 100644 aten/src/ATen/native/nested/NestedTensorFactories.cpp create mode 100644 aten/src/ATen/native/nested/NestedTensorFactories.h create mode 100644 aten/src/ATen/native/nested/NestedTensorMatmul.cpp create mode 100644 aten/src/ATen/native/nested/NestedTensorUnaryOps.cpp create mode 100644 aten/src/ATen/native/nested/NestedTensorUtils.cpp create mode 100644 aten/src/ATen/native/nested/NestedTensorUtils.h create mode 100644 aten/src/ATen/native/nested/cuda/NestedTensorBinaryOps.cu create mode 100644 aten/src/ATen/native/nested/cuda/NestedTensorMatmul.cu delete mode 
100644 aten/src/ATen/native/quantized/cpu/qnnpack/src/pack_block_sparse.cc create mode 100644 aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x4c1x4-dq-packedA-sse2.h create mode 100644 aten/src/ATen/native/quantized/cuda/Activation.cu create mode 100644 aten/src/ATen/native/sparse/Macros.h create mode 100644 aten/src/ATen/native/sparse/SparseBinaryOpIntersectionCommon.h create mode 100644 aten/src/ATen/native/sparse/SparseBinaryOpIntersectionKernel.cpp create mode 100644 aten/src/ATen/native/sparse/SparseStubs.h create mode 100644 aten/src/ATen/native/transformers/attention.h create mode 100644 aten/src/ATen/native/transformers/cuda/attention_backward.cu create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/epilogue.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/epilogue_predicated_tile_iterator.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/fmha.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/fmha_api.cpp create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/fmha_api.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/fmha_fprop_kernel_1xN.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/fmha_fprop_kernel_dispatch.cu create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/fmha_kernel.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/fmha_utils.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/gemm.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/gmem_tile.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/kernel_traits.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/mask.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/mma_core_sm75.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/philox.cuh create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/softmax.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/static_switch.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/summary_stats.h create mode 100644 aten/src/ATen/native/transformers/cuda/flash_attn/utils.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/attention_scaling_coefs_updater.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/debug_utils.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/epilogue_pipelined.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/epilogue_rescale_output.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/epilogue_thread_apply_logsumexp.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/find_default_mma.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/gemm/custom_mma.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/gemm/custom_mma_base.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/gemm/custom_mma_multistage.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/gemm/custom_mma_pipelined.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/gemm_kernel_utils.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/iterators/epilogue_predicated_tile_iterator.h create mode 100644 
aten/src/ATen/native/transformers/cuda/mem_eff_attention/iterators/make_residual_last.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/iterators/predicated_tile_access_iterator_residual_last.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/iterators/predicated_tile_iterator_residual_last.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernel_backward.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernel_forward.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_bf16.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_bf16_aligned.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_bf16_aligned_k128.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_bf16_aligned_k64.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_bf16_k128.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_bf16_k64.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_f16.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_f16_aligned.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_f16_aligned_k128.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_f16_aligned_k64.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_f16_k128.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_f16_k64.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_f32.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_f32_aligned.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_f32_aligned_k128.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_f32_aligned_k64.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_f32_k128.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/backward_f32_k64.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/forward_bf16.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/forward_bf16_aligned.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/forward_f16.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/forward_f16_aligned.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/forward_f32.cu create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/forward_f32_aligned.cu create mode 100755 aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/generate_kernels.sh create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/mma_from_smem.h create mode 100644 aten/src/ATen/native/transformers/cuda/mem_eff_attention/mma_simt_tile_iterator_residual.h create mode 100644 aten/src/ATen/native/transformers/cuda/sdp_utils.h create mode 100644 aten/src/ATen/native/transformers/sdp_utils_cpp.h create mode 100644 
aten/src/ATen/native/vulkan/api/Types.h delete mode 100644 aten/src/ATen/native/vulkan/api/vk_mem_alloc.h create mode 100644 aten/src/ATen/native/vulkan/glsl/buffer_to_buffer.glsl delete mode 100644 aten/src/ATen/native/vulkan/glsl/conv2d_pw.glsl delete mode 100644 aten/src/ATen/native/vulkan/glsl/conv2d_pw_2x2.glsl delete mode 100644 aten/src/ATen/native/vulkan/glsl/conv2d_pw_2x2_buffered.glsl create mode 100644 aten/src/ATen/native/vulkan/glsl/image2d_to_nchw.glsl create mode 100644 aten/src/ATen/native/vulkan/glsl/indexing.h create mode 100644 aten/src/ATen/native/vulkan/glsl/nchw_to_image2d.glsl create mode 100644 aten/src/ATen/native/vulkan/glsl/templates/conv2d_pw.glslt create mode 100644 aten/src/ATen/native/vulkan/glsl/templates/conv2d_pw_params.yaml create mode 100644 aten/src/ATen/native/vulkan/ops/Batchnorm.h create mode 100644 aten/src/ATen/test/mps_test_print.cpp create mode 100644 benchmarks/cpp/nvfuser/matmul.cpp create mode 100644 benchmarks/dynamo/Makefile_dashboard create mode 100644 benchmarks/dynamo/README.md rename {test/quantization/dbr => benchmarks/dynamo}/__init__.py (100%) create mode 100644 benchmarks/dynamo/check_csv.py create mode 100644 benchmarks/dynamo/common.py create mode 100644 benchmarks/dynamo/dist_util.py create mode 100644 benchmarks/dynamo/distributed.py create mode 100755 benchmarks/dynamo/huggingface.py create mode 100644 benchmarks/dynamo/huggingface_models_list.txt rename {torch/ao/quantization/_dbr => benchmarks/dynamo/microbenchmarks}/__init__.py (100%) create mode 100644 benchmarks/dynamo/microbenchmarks/bench_autotune_conv.py create mode 100644 benchmarks/dynamo/microbenchmarks/bench_conv.py create mode 100644 benchmarks/dynamo/microbenchmarks/bench_conv1x1.py create mode 100644 benchmarks/dynamo/microbenchmarks/bench_conv_fusion.py create mode 100644 benchmarks/dynamo/microbenchmarks/bench_mm_fusion.py create mode 100644 benchmarks/dynamo/microbenchmarks/benchmark_helper.py create mode 100644 benchmarks/dynamo/microbenchmarks/inductor_bmm.py create mode 100644 benchmarks/dynamo/microbenchmarks/inductor_mm.py create mode 100644 benchmarks/dynamo/microbenchmarks/matmul_relu.py create mode 100755 benchmarks/dynamo/microbenchmarks/microbench.py create mode 100644 benchmarks/dynamo/microbenchmarks/model.py create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/AlbertForMaskedLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/AlbertForQuestionAnswering_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/AllenaiLongformerBase_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/BartForCausalLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/BartForConditionalGeneration_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/BertForMaskedLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/BertForQuestionAnswering_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/BigBird_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/BlenderbotSmallForCausalLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/BlenderbotSmallForConditionalGeneration_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/CamemBert_training.txt 
create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/DebertaForMaskedLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/DebertaForQuestionAnswering_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/DebertaV2ForMaskedLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/DebertaV2ForQuestionAnswering_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/DistilBertForMaskedLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/DistilBertForQuestionAnswering_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/DistillGPT2_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/ElectraForCausalLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/ElectraForQuestionAnswering_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/GPT2ForSequenceClassification_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/GPTNeoForCausalLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/GPTNeoForSequenceClassification_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/GoogleFnet_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/LayoutLMForMaskedLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/LayoutLMForSequenceClassification_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/M2M100ForConditionalGeneration_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/MBartForCausalLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/MBartForConditionalGeneration_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/MegatronBertForCausalLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/MegatronBertForQuestionAnswering_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/MobileBertForMaskedLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/MobileBertForQuestionAnswering_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/OPTForCausalLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/PLBartForCausalLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/PLBartForConditionalGeneration_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/PegasusForCausalLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/PegasusForConditionalGeneration_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/RobertaForCausalLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/RobertaForQuestionAnswering_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/Speech2Text2ForCausalLM_training.txt create mode 100644 
benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/TrOCRForCausalLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/XGLMForCausalLM_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/XLNetLMHeadModel_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/hf_train/YituTechConvBert_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/adv_inception_v3_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/beit_base_patch16_224_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/botnet26t_256_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/cait_m36_384_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/coat_lite_mini_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/convmixer_768_32_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/convnext_base_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/crossvit_9_240_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/cspdarknet53_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/deit_base_distilled_patch16_224_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/densenet121_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/dla102_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/dm_nfnet_f0_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/dpn107_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/eca_botnext26ts_256_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/eca_halonext26ts_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/ecaresnet101d_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/ese_vovnet19b_dw_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/fbnetc_100_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/fbnetv3_b_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/gernet_l_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/ghostnet_100_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/gluon_inception_v3_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/gluon_senet154_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/gluon_xception65_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/gmixer_24_224_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/gmlp_s16_224_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/hardcorenas_a_training.txt create mode 100644 
benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/hrnet_w18_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/inception_v3_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/jx_nest_base_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/lcnet_050_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/legacy_senet154_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/levit_128_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/mixer_b16_224_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/mixnet_l_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/mnasnet_100_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/mobilenetv2_100_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/mobilenetv3_large_100_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/mobilevit_s_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/nasnetalarge_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/nfnet_l0_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/pit_b_224_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/pnasnet5large_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/poolformer_m36_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/regnety_002_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/repvgg_a2_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/res2net101_26w_4s_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/res2net50_14w_8s_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/res2next50_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/resmlp_12_224_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/resnest101e_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/resnet18_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/rexnet_100_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/sebotnet33ts_256_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/selecsls42b_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/spnasnet_100_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/swin_base_patch4_window7_224_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/swsl_resnext101_32x16d_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/tf_efficientnet_b0_training.txt create mode 100644 
benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/tf_mixnet_l_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/tinynet_a_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/tnt_s_patch16_224_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/twins_pcpvt_base_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/visformer_small_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/vit_base_patch16_224_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/timm_train/volo_d1_224_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/BERT_pytorch_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/Background_Matting_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/LearningToPaint_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/Super_SloMo_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/alexnet_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/attention_is_all_you_need_pytorch_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/dcgan_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/densenet121_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/fambench_dlrm_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/fastNLP_Bert_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/hf_Albert_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/hf_Bart_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/hf_Bert_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/hf_BigBird_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/hf_DistilBert_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/hf_GPT2_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/hf_Longformer_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/maml_omniglot_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/mnasnet1_0_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/mobilenet_v2_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/mobilenet_v3_large_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/nvidia_deeprecommender_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/pytorch_CycleGAN_and_pix2pix_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/pytorch_stargan_training.txt create mode 100644 
benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/pytorch_struct_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/pytorch_unet_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/resnet18_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/resnet50_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/resnext50_32x4d_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/shufflenet_v2_x1_0_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/speech_transformer_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/squeezenet1_1_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/timm_efficientdet_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/timm_efficientnet_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/timm_nfnet_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/timm_regnet_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/timm_resnest_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/timm_vision_transformer_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/timm_vovnet_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/tts_angular_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/vgg16_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/vision_maskrcnn_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_logs/torchbench_train/yolov3_training.txt create mode 100644 benchmarks/dynamo/microbenchmarks/operator_inp_utils.py create mode 100644 benchmarks/dynamo/microbenchmarks/operatorbench.py create mode 100644 benchmarks/dynamo/microbenchmarks/profile_conv.py create mode 100644 benchmarks/dynamo/microbenchmarks/utils.py create mode 100755 benchmarks/dynamo/runner.py create mode 100644 benchmarks/dynamo/test.py create mode 100755 benchmarks/dynamo/timm_models.py create mode 100644 benchmarks/dynamo/timm_models_list.txt create mode 100755 benchmarks/dynamo/torchbench.py create mode 100644 benchmarks/dynamo/torchbench_models_list.txt create mode 100644 benchmarks/dynamo/training_loss.py create mode 100644 benchmarks/nested/nested_bmm_bench.py create mode 100644 benchmarks/transformer/better_transformer_vs_mha_functional.py create mode 100644 benchmarks/transformer/sdp.py create mode 100644 benchmarks/transformer/sdp_backwards.py delete mode 100644 c10/c10_defs.bzl create mode 100644 c10/core/PyHandleCache.h create mode 100644 c10/core/SymFloat.cpp create mode 100644 c10/core/SymFloat.h delete mode 100644 c10/core/SymIntNodeImpl.cpp delete mode 100644 c10/core/SymIntNodeImpl.h create mode 100644 c10/core/SymNodeImpl.cpp create mode 100644 c10/core/SymNodeImpl.h create mode 100644 c10/core/impl/HermeticPyObjectTLS.cpp create mode 100644 c10/core/impl/HermeticPyObjectTLS.h create mode 100644 
c10/core/impl/PythonDispatcherTLS.cpp create mode 100644 c10/core/impl/PythonDispatcherTLS.h create mode 100644 c10/core/impl/TorchDispatchModeTLS.cpp create mode 100644 c10/core/impl/TorchDispatchModeTLS.h create mode 100644 c10/cuda/CUDAException.cpp create mode 100644 c10/cuda/CUDAMallocAsyncAllocator.cpp delete mode 100644 c10/defs_hip.bzl delete mode 100644 caffe2/defs.bzl delete mode 100644 caffe2/defs_hip.bzl create mode 100644 caffe2/perfkernels/batch_box_cox.cc create mode 100644 caffe2/perfkernels/batch_box_cox.h create mode 100644 caffe2/perfkernels/batch_box_cox_avx2.cc create mode 100644 caffe2/perfkernels/vectorizer.h create mode 100644 caffe2/python/clean_workspace_test.py create mode 100644 caffe2/python/operator_test/_utils.py create mode 100644 caffe2/python/pybind_workspace.cc create mode 100644 caffe2/python/pybind_workspace.h delete mode 100644 defs_gpu.bzl delete mode 100644 defs_hip.bzl create mode 100644 docs/source/_dynamo.rst create mode 100644 docs/source/_static/img/masked/tensor_comparison.jpg create mode 100644 docs/source/community/build_ci_governance.rst create mode 100644 docs/source/cuda._sanitizer.rst create mode 100644 docs/source/distributed.checkpoint.rst create mode 100644 docs/source/masked.rst create mode 100644 docs/source/onnx_diagnostics.rst create mode 100644 docs/source/scripts/onnx/build_onnx_diagnostics_rules_md.py create mode 100644 docs/source/signal.rst delete mode 100644 functorch/.circleci/config.yml delete mode 100644 functorch/.circleci/unittest/linux/scripts/environment.yml delete mode 100755 functorch/.circleci/unittest/linux/scripts/install.sh delete mode 100755 functorch/.circleci/unittest/linux/scripts/post_process.sh delete mode 100755 functorch/.circleci/unittest/linux/scripts/run_test.sh delete mode 100755 functorch/.circleci/unittest/linux/scripts/setup_env.sh delete mode 100644 functorch/.circleci/unittest/windows/scripts/environment.yml delete mode 100644 functorch/.circleci/unittest/windows/scripts/install.sh delete mode 100644 functorch/.circleci/unittest/windows/scripts/install_conda.bat delete mode 100644 functorch/.circleci/unittest/windows/scripts/post_process.sh delete mode 100644 functorch/.circleci/unittest/windows/scripts/run_test.sh delete mode 100644 functorch/.circleci/unittest/windows/scripts/set_cuda_envs.sh delete mode 100644 functorch/.circleci/unittest/windows/scripts/setup_env.sh delete mode 100644 functorch/.circleci/unittest/windows/scripts/vc_env_helper.bat delete mode 100644 functorch/.flake8 delete mode 100644 functorch/.github/workflows/docs.yml delete mode 100644 functorch/.github/workflows/lint.yml delete mode 100644 functorch/.github/workflows/wheels.yml delete mode 100644 functorch/.lintrunner.toml create mode 100644 functorch/CMakeLists.txt delete mode 100644 functorch/CODE_OF_CONDUCT.md delete mode 100644 functorch/CONTRIBUTING.md delete mode 100644 functorch/LICENSE rename functorch/{functorch => }/__init__.py (76%) rename functorch/{functorch => }/_src/__init__.py (100%) create mode 100644 functorch/_src/aot_autograd.py rename functorch/{functorch => }/_src/benchmark_utils.py (100%) rename functorch/{functorch => }/_src/compile_utils.py (91%) rename functorch/{functorch => }/_src/compilers.py (79%) create mode 100644 functorch/_src/config.py rename functorch/{functorch => }/_src/eager_transforms.py (92%) create mode 100644 functorch/_src/fx_minifier.py rename functorch/{functorch => }/_src/make_functional.py (99%) rename functorch/{functorch => }/_src/named_members_polyfill.py (100%) rename 
functorch/{functorch => }/_src/partitioners.py (61%) rename functorch/{functorch => }/_src/python_key.py (50%) rename functorch/{functorch => }/_src/pytree_hacks.py (100%) rename functorch/{functorch => }/_src/top_operators_github_usage.py (100%) rename functorch/{functorch => }/_src/vmap.py (97%) rename functorch/{functorch => }/compile/__init__.py (70%) rename functorch/{functorch => }/csrc/dim/arena.h (100%) rename functorch/{functorch => }/csrc/dim/dim.cpp (96%) rename functorch/{functorch => }/csrc/dim/dim.h (100%) rename functorch/{functorch => }/csrc/dim/minpybind.h (98%) rename functorch/{functorch => }/csrc/dim/python_variable_simple.h (100%) create mode 100644 functorch/csrc/init_dim_only.cpp rename functorch/{functorch => }/dim/README.md (94%) rename functorch/{functorch => }/dim/__init__.py (97%) rename functorch/{functorch => }/dim/batch_tensor.py (95%) rename functorch/{functorch => }/dim/delayed_mul_tensor.py (100%) rename functorch/{functorch => }/dim/dim.py (100%) rename functorch/{functorch => }/dim/magic_trace.py (100%) rename functorch/{functorch => }/dim/op_properties.py (100%) rename functorch/{functorch => }/dim/reference.py (100%) rename functorch/{functorch => }/dim/tree_map.py (100%) rename functorch/{functorch => }/dim/wrap_type.py (100%) delete mode 100644 functorch/docs/source/_static/images/functorch.svg rename functorch/{functorch => }/experimental/__init__.py (60%) create mode 100644 functorch/experimental/_map.py rename functorch/{functorch => }/experimental/batch_norm_replacement.py (100%) create mode 100644 functorch/experimental/cond.py create mode 100644 functorch/experimental/control_flow.py create mode 100644 functorch/experimental/ops.py delete mode 100644 functorch/functorch/_src/aot_autograd.py delete mode 100644 functorch/functorch/_src/config.py delete mode 100644 functorch/functorch/_src/custom_function.py delete mode 100644 functorch/functorch/_src/fx_minifier.py delete mode 100644 functorch/functorch/_src/monkey_patching.py delete mode 100644 functorch/functorch/csrc/CompileCache.cpp delete mode 100644 functorch/functorch/csrc/CompileCache.h delete mode 100644 functorch/functorch/csrc/Constants.h delete mode 100644 functorch/functorch/csrc/CustomFunction.cpp delete mode 100644 functorch/functorch/csrc/CustomFunction.h delete mode 100644 functorch/functorch/csrc/DynamicLayer.h delete mode 100644 functorch/functorch/csrc/Macros.h delete mode 100644 functorch/functorch/csrc/PlumbingHelper.h delete mode 100644 functorch/functorch/csrc/TensorWrapper.h delete mode 100644 functorch/functorch/csrc/init.cpp delete mode 100644 functorch/notebooks/colab/ensembling_colab.ipynb delete mode 100644 functorch/notebooks/colab/jacobians_hessians_colab.ipynb delete mode 100644 functorch/notebooks/colab/per_sample_grads_colab.ipynb delete mode 100644 functorch/notebooks/colab/readme.md delete mode 100644 functorch/packaging/build_wheel.sh delete mode 100644 functorch/packaging/pkg_helpers.bash delete mode 100644 functorch/packaging/windows/internal/cuda_install.bat delete mode 100644 functorch/packaging/windows/internal/driver_update.bat delete mode 100644 functorch/packaging/windows/internal/vc_env_helper.bat delete mode 100644 functorch/packaging/windows/internal/vc_install_helper.sh delete mode 100644 functorch/pull_request_template.md delete mode 100644 functorch/setup.cfg delete mode 100644 functorch/setup.py delete mode 100644 functorch/test/functorch_lagging_op_db.py delete mode 100644 functorch/test/pytest.ini delete mode 100644 
functorch/test/test_compile_cache.py delete mode 100644 functorch/test/test_minifier.py delete mode 100644 functorch/test/test_pythonkey.py delete mode 100644 functorch/tools/lint/black_linter.py delete mode 100644 functorch/tools/lint/flake8_linter.py delete mode 100644 functorch/tools/lint/pip_init.py delete mode 100644 functorch/version.txt delete mode 100644 ios/TestApp/AppleWWDRCAG3.cer create mode 100644 ios/TestApp/TestApp/Benchmark.h create mode 100644 ios/TestApp/TestApp/Benchmark.mm create mode 100644 ios/TestApp/benchmark/config.json delete mode 100644 test/cpp/api/imethod.cpp create mode 100644 test/cpp/api/nested.cpp create mode 100644 test/cpp/c10d/ProcessGroupUCCTest.cpp delete mode 100644 test/cpp/lazy/test_symbolic_shape.cpp create mode 100644 test/cpp/lite_interpreter_runtime/resources.h create mode 100644 test/cpp/profiler/perf_events.cpp delete mode 100644 test/defs.bzl create mode 100644 test/distributed/_composable/test_checkpoint.py create mode 100644 test/distributed/_composable/test_contract.py create mode 100644 test/distributed/_composable/test_fully_shard.py create mode 100644 test/distributed/_composable/test_replicate.py delete mode 100644 test/distributed/_shard/checkpoint/test_checkpoint.py create mode 100644 test/distributed/_tensor/README.md create mode 100644 test/distributed/_tensor/__init__.py rename {torch/ao/sparsity/_experimental => test/distributed/_tensor/parallel}/__init__.py (100%) create mode 100644 test/distributed/_tensor/parallel/test_2d_parallel.py create mode 100644 test/distributed/_tensor/parallel/test_parallelize_api.py create mode 100644 test/distributed/_tensor/parallel/test_tp_examples.py create mode 100644 test/distributed/_tensor/parallel/test_tp_style.py create mode 100644 test/distributed/_tensor/parallel/test_view_sharding_dim_change.py create mode 100644 test/distributed/_tensor/test_api.py create mode 100644 test/distributed/_tensor/test_common_rules.py create mode 100644 test/distributed/_tensor/test_device_mesh.py create mode 100644 test/distributed/_tensor/test_dtensor.py create mode 100644 test/distributed/_tensor/test_dtensor_ops.py create mode 100644 test/distributed/_tensor/test_math_ops.py create mode 100644 test/distributed/_tensor/test_matrix_ops.py create mode 100644 test/distributed/_tensor/test_pointwise_ops.py create mode 100644 test/distributed/_tensor/test_redistribute.py create mode 100644 test/distributed/_tensor/test_tensor_ops.py create mode 100644 test/distributed/_tensor/test_tp_sharding_ops.py create mode 100644 test/distributed/_tensor/test_view_ops.py create mode 100644 test/distributed/checkpoint/test_checkpoint.py create mode 100644 test/distributed/checkpoint/test_dedup_tensors.py rename test/distributed/{_shard => }/checkpoint/test_file_system_checkpoint.py (95%) rename test/distributed/{_shard => }/checkpoint/test_file_system_checkpoint_cpu.py (99%) create mode 100644 test/distributed/checkpoint/test_planner.py create mode 100644 test/distributed/checkpoint/test_traverse.py rename test/distributed/{_shard => }/checkpoint/test_utils.py (93%) delete mode 100644 test/distributed/defs.bzl create mode 100644 test/distributed/elastic/timer/file_based_local_timer_test.py delete mode 100644 test/distributed/fsdp/defs.bzl delete mode 100644 test/distributed/fsdp/test_flatten_params_wrapper.py create mode 100644 test/distributed/fsdp/test_fsdp_flatten_params.py delete mode 100644 test/distributed/fsdp/test_fsdp_param_exec_order_wrap.py create mode 100644 test/distributed/fsdp/test_fsdp_tp_integration.py 
create mode 100644 test/distributed/fsdp/test_fsdp_use_orig_params.py create mode 100644 test/distributed/optim/test_apply_optimizer_in_backward.py delete mode 100644 test/distributed/pipeline/sync/defs.bzl create mode 100644 test/distributed/test_c10d_error_logger.py create mode 100644 test/distributed/test_c10d_spawn_ucc.py create mode 100644 test/distributed/test_dynamo_distributed.py create mode 100644 test/distributed/test_multi_threaded_pg.py rename {torch/ao/sparsity/_experimental/activation_sparsifier => test/dynamo}/__init__.py (100%) rename {torch/ao/sparsity/_experimental/data_sparsifier/lightning => test/dynamo/mock_modules}/__init__.py (100%) create mode 100644 test/dynamo/mock_modules/mock_module1.py create mode 100644 test/dynamo/mock_modules/mock_module2.py create mode 100644 test/dynamo/mock_modules/mock_module3.py create mode 100644 test/dynamo/test_aot_autograd.py create mode 100644 test/dynamo/test_aot_cudagraphs.py create mode 100644 test/dynamo/test_dynamic_shapes.py create mode 100644 test/dynamo/test_export.py create mode 100644 test/dynamo/test_export_mutations.py create mode 100644 test/dynamo/test_functions.py create mode 100644 test/dynamo/test_global.py create mode 100644 test/dynamo/test_global_declaration.py create mode 100644 test/dynamo/test_minifier.py create mode 100644 test/dynamo/test_misc.py create mode 100644 test/dynamo/test_model_output.py create mode 100644 test/dynamo/test_modules.py create mode 100644 test/dynamo/test_nops.py create mode 100644 test/dynamo/test_optimizations.py create mode 100644 test/dynamo/test_optimizers.py create mode 100644 test/dynamo/test_python_autograd.py create mode 100644 test/dynamo/test_recompile_ux.py create mode 100644 test/dynamo/test_replay_record.py create mode 100644 test/dynamo/test_repros.py create mode 100644 test/dynamo/test_skip_non_tensor.py create mode 100644 test/dynamo/test_subgraphs.py create mode 100644 test/dynamo/test_torchxla_integration.py create mode 100644 test/dynamo/test_torchxla_num_output.py create mode 100644 test/dynamo/test_torchxla_util.py create mode 100644 test/dynamo/test_unspec.py create mode 100644 test/dynamo/test_verify_correctness.py rename {functorch/test => test/functorch}/attn_ft.py (100%) rename {functorch/test => test/functorch}/attn_positional.py (100%) rename {functorch/test => test/functorch}/common_utils.py (62%) rename {functorch/test => test/functorch}/discover_coverage.py (99%) rename {functorch/test => test/functorch}/functorch_additional_op_db.py (97%) create mode 100644 test/functorch/test_aotdispatch.py create mode 100644 test/functorch/test_control_flow.py rename {functorch/test => test/functorch}/test_dims.py (92%) rename {functorch/test => test/functorch}/test_eager_transforms.py (82%) rename {functorch/test => test/functorch}/test_functionalize.py (65%) rename {functorch/test => test/functorch}/test_memory_efficient_fusion.py (100%) create mode 100644 test/functorch/test_minifier.py rename {functorch/test => test/functorch}/test_ops.py (61%) rename {functorch/test => test/functorch}/test_vmap.py (93%) rename {functorch/test => test/functorch}/xfail_suggester.py (96%) rename {torch/ao/sparsity/_experimental/data_sparsifier/lightning/callbacks => test/inductor}/__init__.py (100%) create mode 100644 test/inductor/cpp/.gitignore create mode 100644 test/inductor/cpp/CMakeLists.txt create mode 100755 test/inductor/cpp/test.sh create mode 100644 test/inductor/cpp/test_cpp_prefix.cpp create mode 100644 test/inductor/opinfo_harness.py create mode 100644 
test/inductor/test_minifier.py create mode 100644 test/inductor/test_perf.py create mode 100644 test/inductor/test_smoke.py create mode 100644 test/inductor/test_torchinductor.py create mode 100644 test/inductor/test_torchinductor_opinfo.py create mode 100644 test/jit/xnnpack/test_xnnpack_delegate.py create mode 100644 test/lazy/test_debug_util.py create mode 100644 test/lazy/test_meta_kernel.py create mode 100644 test/lazy/test_step_closures.py create mode 100644 test/nn/test_convolution.py create mode 100644 test/nn/test_dropout.py create mode 100644 test/nn/test_embedding.py create mode 100644 test/nn/test_init.py create mode 100644 test/nn/test_lazy_modules.py create mode 100644 test/nn/test_module_hooks.py create mode 100644 test/nn/test_packed_sequence.py create mode 100644 test/nn/test_parametrization.py create mode 100644 test/nn/test_pooling.py create mode 100644 test/nn/test_pruning.py create mode 100644 test/onnx/internal/test_beartype.py create mode 100644 test/onnx/internal/test_diagnostics.py create mode 100644 test/onnx/internal/test_registraion.py delete mode 100644 test/onnx/symbolic_opsets/test_symbolic_opset9.py rename test/{jit => onnx}/test_export_modes.py (64%) create mode 100644 test/onnx/test_onnxscript_no_runtime.py create mode 100644 test/onnx/test_onnxscript_runtime.py create mode 100644 test/profiler/profiler_utils_mock_events.json create mode 100644 test/profiler/test_memory_profiler.py rename test/{ => profiler}/test_profiler.py (67%) rename test/{ => profiler}/test_profiler_tree.py (86%) delete mode 100644 test/profiler_utils_mock_events.json create mode 100644 test/quantization/core/test_top_level_apis.py delete mode 100644 test/quantization/dbr/test_quantize_dbr.py create mode 100644 test/quantization/jit/test_ondevice_quantization.py mode change 100644 => 100755 test/run_test.py create mode 100644 test/test_comparison_utils.py create mode 100644 test/test_cuda_nvml_based_avail.py create mode 100644 test/test_cuda_sanitizer.py create mode 100644 test/test_dlpack.py delete mode 100644 test/test_fx_backends.py create mode 100644 test/test_itt.py create mode 100644 test/test_maskedtensor.py create mode 100644 test/test_matmul_cuda.py create mode 100644 test/test_nvfuser_dynamo.py create mode 100644 test/test_nvfuser_frontend.py create mode 100644 test/test_ops_fwd_gradients.py create mode 160000 third_party/VulkanMemoryAllocator delete mode 100644 third_party/cpuinfo.BUILD create mode 160000 third_party/cutlass create mode 100644 third_party/cutlass.BUILD create mode 100644 tools/autograd/templates/python_nested_functions.cpp delete mode 100644 tools/cpuinfo_target_definition.bzl create mode 100644 tools/dynamo/verify_dynamo.py create mode 100644 tools/gen_vulkan_glsl.py delete mode 100644 tools/miniz_target_definition.bzl create mode 100644 tools/onnx/gen_diagnostics.py create mode 100755 tools/onnx/gen_diagnostics.sh create mode 100644 tools/onnx/sarif/code-gen-hints.json create mode 100755 tools/onnx/sarif/gen_sarif.sh create mode 100644 tools/onnx/templates/rules.h.in create mode 100644 tools/onnx/templates/rules.py.in delete mode 100644 tools/perf_kernel_defs.bzl delete mode 100644 tools/sgx_aten_target_definitions.bzl delete mode 100644 tools/sgx_caffe2_target_definitions.bzl delete mode 100644 tools/sgx_target_definitions.bzl create mode 100644 tools/stats/check_disabled_tests.py create mode 100644 tools/stats/upload_artifacts.py delete mode 100644 tools/target_definitions.bzl create mode 100644 tools/test/test_vulkan_codegen.py create mode 100644 
torch/_C/_functorch.pyi create mode 100644 torch/_C/_profiler.pyi rename functorch/functorch/_src/decompositions.py => torch/_decomp/decompositions_for_jvp.py (61%) rename torch/{ao/sparsity/scheduler => _dispatch}/__init__.py (100%) create mode 100644 torch/_dispatch/python.py create mode 100644 torch/_dynamo/__init__.py create mode 100644 torch/_dynamo/allowed_functions.py create mode 100644 torch/_dynamo/bytecode_analysis.py create mode 100644 torch/_dynamo/bytecode_transformation.py create mode 100644 torch/_dynamo/codegen.py create mode 100644 torch/_dynamo/config.py create mode 100644 torch/_dynamo/convert_frame.py create mode 100644 torch/_dynamo/debug_utils.py create mode 100644 torch/_dynamo/eval_frame.py create mode 100644 torch/_dynamo/exc.py create mode 100644 torch/_dynamo/guards.py create mode 100644 torch/_dynamo/logging.py create mode 100644 torch/_dynamo/mutation_guard.py create mode 100644 torch/_dynamo/optimizations/__init__.py create mode 100644 torch/_dynamo/optimizations/analysis.py create mode 100644 torch/_dynamo/optimizations/backends.py create mode 100644 torch/_dynamo/optimizations/distributed.py create mode 100644 torch/_dynamo/optimizations/inference.py create mode 100644 torch/_dynamo/optimizations/log_args.py create mode 100644 torch/_dynamo/optimizations/normalize.py create mode 100644 torch/_dynamo/optimizations/subgraph.py create mode 100644 torch/_dynamo/optimizations/torchxla_integration.py create mode 100644 torch/_dynamo/optimizations/training.py create mode 100644 torch/_dynamo/output_graph.py create mode 100644 torch/_dynamo/profiler.py create mode 100644 torch/_dynamo/replay_record.py create mode 100644 torch/_dynamo/resume_execution.py create mode 100644 torch/_dynamo/side_effects.py create mode 100644 torch/_dynamo/skipfiles.py create mode 100644 torch/_dynamo/source.py create mode 100644 torch/_dynamo/symbolic_convert.py create mode 100644 torch/_dynamo/test_case.py create mode 100644 torch/_dynamo/test_minifier_common.py create mode 100644 torch/_dynamo/testing.py create mode 100644 torch/_dynamo/utils.py create mode 100644 torch/_dynamo/variables/__init__.py create mode 100644 torch/_dynamo/variables/base.py create mode 100644 torch/_dynamo/variables/builder.py create mode 100644 torch/_dynamo/variables/builtin.py create mode 100644 torch/_dynamo/variables/constant.py create mode 100644 torch/_dynamo/variables/dicts.py create mode 100644 torch/_dynamo/variables/functions.py create mode 100644 torch/_dynamo/variables/lists.py create mode 100644 torch/_dynamo/variables/misc.py create mode 100644 torch/_dynamo/variables/nn_module.py create mode 100644 torch/_dynamo/variables/tensor.py create mode 100644 torch/_dynamo/variables/torch.py create mode 100644 torch/_dynamo/variables/user_defined.py rename torch/{ao/sparsity/sparsifier => _functorch}/__init__.py (100%) create mode 100644 torch/_functorch/pyfunctorch.py create mode 100644 torch/_functorch/utils.py create mode 100644 torch/_inductor/__init__.py create mode 100644 torch/_inductor/codecache.py create mode 100644 torch/_inductor/codegen/__init__.py create mode 100644 torch/_inductor/codegen/autotuner.py create mode 100644 torch/_inductor/codegen/common.py create mode 100644 torch/_inductor/codegen/cpp.py create mode 100644 torch/_inductor/codegen/cpp_prefix.h create mode 100644 torch/_inductor/codegen/triton.py create mode 100644 torch/_inductor/codegen/triton_conv_delta_x.j2 create mode 100644 torch/_inductor/codegen/triton_conv_delta_x_hwc.j2 create mode 100644 
torch/_inductor/codegen/triton_mm.j2 create mode 100644 torch/_inductor/codegen/triton_template.py create mode 100644 torch/_inductor/codegen/wrapper.py create mode 100644 torch/_inductor/compile_fx.py create mode 100644 torch/_inductor/config.py create mode 100644 torch/_inductor/cuda_properties.py create mode 100644 torch/_inductor/debug.py create mode 100644 torch/_inductor/decomposition.py create mode 100644 torch/_inductor/dependencies.py create mode 100644 torch/_inductor/exc.py create mode 100644 torch/_inductor/graph.py create mode 100644 torch/_inductor/ir.py create mode 100644 torch/_inductor/lowering.py create mode 100644 torch/_inductor/metrics.py create mode 100644 torch/_inductor/overrides.py create mode 100644 torch/_inductor/scheduler.py create mode 100644 torch/_inductor/sizevars.py create mode 100644 torch/_inductor/triton_ops/__init__.py create mode 100644 torch/_inductor/triton_ops/autotune.py create mode 100644 torch/_inductor/triton_ops/batched_matmul.py create mode 100644 torch/_inductor/triton_ops/conv.py create mode 100644 torch/_inductor/triton_ops/conv1x1.py create mode 100644 torch/_inductor/triton_ops/conv_perf_model.py create mode 100644 torch/_inductor/triton_ops/matmul.py create mode 100644 torch/_inductor/triton_ops/mm_perf_model.py create mode 100644 torch/_inductor/triton_ops/utils.py create mode 100644 torch/_inductor/utils.py create mode 100644 torch/_inductor/virtualized.py create mode 100644 torch/_lazy/closure.py create mode 100644 torch/_lazy/device_context.py create mode 100644 torch/_refs/_conversions.py create mode 100644 torch/_subclasses/fake_utils.py create mode 100644 torch/_weights_only_unpickler.py create mode 100644 torch/ao/nn/intrinsic/__init__.py create mode 100644 torch/ao/nn/intrinsic/modules/__init__.py create mode 100644 torch/ao/nn/intrinsic/modules/fused.py create mode 100644 torch/ao/nn/intrinsic/qat/__init__.py create mode 100644 torch/ao/nn/intrinsic/qat/modules/__init__.py create mode 100644 torch/ao/nn/intrinsic/qat/modules/conv_fused.py create mode 100644 torch/ao/nn/intrinsic/qat/modules/linear_fused.py create mode 100644 torch/ao/nn/intrinsic/qat/modules/linear_relu.py create mode 100644 torch/ao/nn/intrinsic/quantized/__init__.py create mode 100644 torch/ao/nn/intrinsic/quantized/dynamic/__init__.py create mode 100644 torch/ao/nn/intrinsic/quantized/dynamic/modules/__init__.py create mode 100644 torch/ao/nn/intrinsic/quantized/dynamic/modules/linear_relu.py create mode 100644 torch/ao/nn/intrinsic/quantized/modules/__init__.py create mode 100644 torch/ao/nn/intrinsic/quantized/modules/bn_relu.py create mode 100644 torch/ao/nn/intrinsic/quantized/modules/conv_relu.py create mode 100644 torch/ao/nn/intrinsic/quantized/modules/linear_relu.py create mode 100644 torch/ao/nn/qat/__init__.py create mode 100644 torch/ao/nn/qat/dynamic/__init__.py create mode 100644 torch/ao/nn/qat/dynamic/modules/__init__.py create mode 100644 torch/ao/nn/qat/dynamic/modules/linear.py create mode 100644 torch/ao/nn/qat/modules/__init__.py create mode 100644 torch/ao/nn/qat/modules/conv.py create mode 100644 torch/ao/nn/qat/modules/embedding_ops.py create mode 100644 torch/ao/nn/qat/modules/linear.py create mode 100644 torch/ao/nn/quantizable/__init__.py create mode 100644 torch/ao/nn/quantizable/modules/__init__.py create mode 100644 torch/ao/nn/quantizable/modules/activation.py create mode 100644 torch/ao/nn/quantizable/modules/rnn.py create mode 100644 torch/ao/nn/quantized/__init__.py create mode 100644 
torch/ao/nn/quantized/dynamic/__init__.py create mode 100644 torch/ao/nn/quantized/dynamic/modules/__init__.py create mode 100644 torch/ao/nn/quantized/dynamic/modules/conv.py create mode 100644 torch/ao/nn/quantized/dynamic/modules/linear.py create mode 100644 torch/ao/nn/quantized/dynamic/modules/rnn.py create mode 100644 torch/ao/nn/quantized/functional.py create mode 100644 torch/ao/nn/quantized/modules/__init__.py create mode 100644 torch/ao/nn/quantized/modules/activation.py create mode 100644 torch/ao/nn/quantized/modules/batchnorm.py create mode 100644 torch/ao/nn/quantized/modules/conv.py create mode 100644 torch/ao/nn/quantized/modules/dropout.py create mode 100644 torch/ao/nn/quantized/modules/embedding_ops.py create mode 100644 torch/ao/nn/quantized/modules/functional_modules.py create mode 100644 torch/ao/nn/quantized/modules/linear.py create mode 100644 torch/ao/nn/quantized/modules/normalization.py create mode 100644 torch/ao/nn/quantized/modules/rnn.py create mode 100644 torch/ao/nn/quantized/modules/utils.py create mode 100644 torch/ao/nn/quantized/reference/__init__.py create mode 100644 torch/ao/nn/quantized/reference/modules/__init__.py create mode 100644 torch/ao/nn/quantized/reference/modules/conv.py create mode 100644 torch/ao/nn/quantized/reference/modules/linear.py create mode 100644 torch/ao/nn/quantized/reference/modules/rnn.py create mode 100644 torch/ao/nn/quantized/reference/modules/sparse.py create mode 100644 torch/ao/nn/quantized/reference/modules/utils.py delete mode 100644 torch/ao/ns/_numeric_suite_dbr.py create mode 100644 torch/ao/ns/fx/n_shadows_utils.py create mode 100644 torch/ao/ns/fx/qconfig_multi_mapping.py rename torch/ao/{sparsity => pruning}/__init__.py (93%) create mode 100644 torch/ao/pruning/_experimental/__init__.py rename torch/ao/{sparsity => pruning}/_experimental/activation_sparsifier/README.md (98%) create mode 100644 torch/ao/pruning/_experimental/activation_sparsifier/__init__.py rename torch/ao/{sparsity => pruning}/_experimental/activation_sparsifier/activation_sparsifier.py (100%) rename torch/ao/{sparsity => pruning}/_experimental/data_scheduler/README.md (100%) rename torch/ao/{sparsity => pruning}/_experimental/data_scheduler/__init__.py (100%) rename torch/ao/{sparsity => pruning}/_experimental/data_scheduler/base_data_scheduler.py (98%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/README.md (98%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/__init__.py (100%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/base_data_sparsifier.py (100%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/benchmarks/README.md (97%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/benchmarks/dlrm_utils.py (100%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/benchmarks/evaluate_disk_savings.py (98%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/benchmarks/evaluate_forward_time.py (100%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/benchmarks/evaluate_model_metrics.py (100%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/benchmarks/images/accuracy.png (100%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/benchmarks/images/disk_savings.png (100%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/benchmarks/images/forward_time.png (100%) rename torch/ao/{sparsity => 
pruning}/_experimental/data_sparsifier/data_norm_sparsifier.py (100%) create mode 100644 torch/ao/pruning/_experimental/data_sparsifier/lightning/__init__.py rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/lightning/callbacks/README.md (100%) create mode 100644 torch/ao/pruning/_experimental/data_sparsifier/lightning/callbacks/__init__.py rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/lightning/callbacks/_data_sparstity_utils.py (93%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/lightning/callbacks/data_sparsity.py (100%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/lightning/tests/test_callbacks.py (95%) rename torch/ao/{sparsity => pruning}/_experimental/data_sparsifier/quantization_utils.py (98%) rename torch/ao/{sparsity => pruning}/_experimental/pruner/README.md (100%) rename torch/ao/{sparsity => pruning}/_experimental/pruner/__init__.py (100%) rename torch/ao/{sparsity => pruning}/_experimental/pruner/base_pruner.py (98%) rename torch/ao/{sparsity => pruning}/_experimental/pruner/images/prune_1.png (100%) rename torch/ao/{sparsity => pruning}/_experimental/pruner/images/prune_2.png (100%) rename torch/ao/{sparsity => pruning}/_experimental/pruner/images/prune_3.png (100%) rename torch/ao/{sparsity => pruning}/_experimental/pruner/images/prune_4.png (100%) rename torch/ao/{sparsity => pruning}/_experimental/pruner/parametrization.py (100%) rename torch/ao/{sparsity => pruning}/_mappings.py (72%) create mode 100644 torch/ao/pruning/scheduler/__init__.py rename torch/ao/{sparsity => pruning}/scheduler/base_scheduler.py (89%) create mode 100644 torch/ao/pruning/scheduler/cubic_scheduler.py rename torch/ao/{sparsity => pruning}/scheduler/lambda_scheduler.py (100%) create mode 100644 torch/ao/pruning/sparsifier/__init__.py rename torch/ao/{sparsity => pruning}/sparsifier/base_sparsifier.py (99%) rename torch/ao/{sparsity => pruning}/sparsifier/nearly_diagonal_sparsifier.py (100%) rename torch/ao/{sparsity => pruning}/sparsifier/utils.py (99%) rename torch/ao/{sparsity => pruning}/sparsifier/weight_norm_sparsifier.py (89%) delete mode 100644 torch/ao/quantization/_dbr/README.md delete mode 100644 torch/ao/quantization/_dbr/auto_trace.py delete mode 100644 torch/ao/quantization/_dbr/auto_trace_rewriter.py delete mode 100644 torch/ao/quantization/_dbr/function_fusion.py delete mode 100644 torch/ao/quantization/_dbr/fusion.py delete mode 100644 torch/ao/quantization/_dbr/mappings.py delete mode 100644 torch/ao/quantization/_dbr/model_utils.py delete mode 100644 torch/ao/quantization/_dbr/module_swap_utils.py delete mode 100644 torch/ao/quantization/_dbr/qconfig_mapping_utils.py delete mode 100644 torch/ao/quantization/_dbr/quantization_state.py delete mode 100644 torch/ao/quantization/_dbr/torchscript_utils.py delete mode 100644 torch/ao/quantization/_dbr/utils.py delete mode 100644 torch/ao/quantization/_quantize_dbr.py create mode 100644 torch/ao/quantization/backend_config/executorch.py create mode 100644 torch/ao/quantization/backend_config/fbgemm.py create mode 100644 torch/ao/quantization/backend_config/qnnpack.py create mode 100644 torch/ao/quantization/backend_config/x86.py create mode 100644 torch/ao/quantization/fx/README.md create mode 100644 torch/ao/quantization/fx/_decomposed.py delete mode 100644 torch/ao/quantization/fx/common_quantization_patterns.py rename torch/ao/quantization/fx/{qconfig_utils.py => qconfig_mapping_utils.py} (85%) delete mode 100644 
torch/ao/quantization/quantization_types.py create mode 100644 torch/backends/opt_einsum/__init__.py create mode 100644 torch/csrc/api/include/torch/nested.h create mode 100644 torch/csrc/autograd/jit_decomp_interface.cpp create mode 100644 torch/csrc/autograd/jit_decomp_interface.h create mode 100644 torch/csrc/autograd/python_nested_functions.h create mode 100644 torch/csrc/autograd/python_nested_functions_manual.cpp create mode 100644 torch/csrc/cuda/CUDAPluggableAllocator.cpp create mode 100644 torch/csrc/cuda/CUDAPluggableAllocator.h create mode 100644 torch/csrc/cuda/memory_snapshot.cpp create mode 100644 torch/csrc/cuda/memory_snapshot.h delete mode 100644 torch/csrc/deploy/.gitignore delete mode 100644 torch/csrc/deploy/CMakeLists.txt delete mode 100644 torch/csrc/deploy/Exception.h delete mode 100644 torch/csrc/deploy/benchmark.cpp delete mode 100644 torch/csrc/deploy/deploy.cpp delete mode 100644 torch/csrc/deploy/deploy.h delete mode 100644 torch/csrc/deploy/elf_file.cpp delete mode 100644 torch/csrc/deploy/elf_file.h delete mode 100644 torch/csrc/deploy/environment.h delete mode 100644 torch/csrc/deploy/example/benchmark.cpp delete mode 100644 torch/csrc/deploy/example/examples.py delete mode 100644 torch/csrc/deploy/example/fx/examples.py delete mode 100644 torch/csrc/deploy/example/fx/some_dependency.py delete mode 100644 torch/csrc/deploy/example/generate_examples.py delete mode 100644 torch/csrc/deploy/example/gpu_wrapper.py delete mode 100644 torch/csrc/deploy/example/simple.pt delete mode 100644 torch/csrc/deploy/example/tensorrt_example.py delete mode 100644 torch/csrc/deploy/interactive_embedded_interpreter.cpp delete mode 100644 torch/csrc/deploy/interpreter/CMakeLists.txt delete mode 100644 torch/csrc/deploy/interpreter/CMakePythonModules.txt delete mode 100644 torch/csrc/deploy/interpreter/Optional.hpp delete mode 100644 torch/csrc/deploy/interpreter/builtin_registry.cpp delete mode 100644 torch/csrc/deploy/interpreter/builtin_registry.h delete mode 100755 torch/csrc/deploy/interpreter/configure_cpython.sh delete mode 100644 torch/csrc/deploy/interpreter/cpython_patch.diff delete mode 100644 torch/csrc/deploy/interpreter/defs.bzl delete mode 100644 torch/csrc/deploy/interpreter/hide_symbols.script delete mode 100644 torch/csrc/deploy/interpreter/import_find_sharedfuncptr.cpp delete mode 100644 torch/csrc/deploy/interpreter/interpreter_impl.cpp delete mode 100644 torch/csrc/deploy/interpreter/interpreter_impl.h delete mode 100644 torch/csrc/deploy/interpreter/register_frozenpython.cpp delete mode 100644 torch/csrc/deploy/interpreter/register_numpy.cpp delete mode 100644 torch/csrc/deploy/interpreter/register_pyyaml.cpp delete mode 100644 torch/csrc/deploy/interpreter/test_builtin_registry.cpp delete mode 100644 torch/csrc/deploy/interpreter/third_party/README.md delete mode 100644 torch/csrc/deploy/loader.cpp delete mode 100644 torch/csrc/deploy/loader.h delete mode 100644 torch/csrc/deploy/mem_file.h delete mode 100644 torch/csrc/deploy/noop_environment.h delete mode 100644 torch/csrc/deploy/path_environment.cpp delete mode 100644 torch/csrc/deploy/path_environment.h delete mode 100644 torch/csrc/deploy/remove_dt_needed.cpp delete mode 100644 torch/csrc/deploy/test_deploy.cpp delete mode 100644 torch/csrc/deploy/test_deploy_from_python.py delete mode 100644 torch/csrc/deploy/test_deploy_gpu.cpp delete mode 100644 torch/csrc/deploy/test_deploy_lib.cpp delete mode 100644 torch/csrc/deploy/test_deploy_missing_interpreter.cpp delete mode 100644 
torch/csrc/deploy/test_deploy_python.py delete mode 100644 torch/csrc/deploy/test_deploy_python_ext.cpp delete mode 100644 torch/csrc/deploy/unity/example.py delete mode 100644 torch/csrc/deploy/unity/main.cpp delete mode 100644 torch/csrc/deploy/unity/tests/simple_model.py delete mode 100644 torch/csrc/deploy/unity/tests/sum.py delete mode 100644 torch/csrc/deploy/unity/tests/test_unity.h delete mode 100644 torch/csrc/deploy/unity/tests/test_unity_simple_model.cpp delete mode 100644 torch/csrc/deploy/unity/tests/test_unity_sum.cpp delete mode 100644 torch/csrc/deploy/unity/unity.bzl delete mode 100644 torch/csrc/deploy/unity/xar_environment.cpp delete mode 100644 torch/csrc/deploy/unity/xar_environment.h create mode 100644 torch/csrc/distributed/c10d/Backend.cpp create mode 100644 torch/csrc/distributed/c10d/Backend.hpp create mode 100644 torch/csrc/distributed/c10d/OpsImpl.cpp create mode 100644 torch/csrc/distributed/c10d/Work.cpp create mode 100644 torch/csrc/distributed/c10d/Work.hpp delete mode 100644 torch/csrc/dl.c create mode 100644 torch/csrc/dynamo/eval_frame.c create mode 100644 torch/csrc/dynamo/eval_frame.h create mode 100644 torch/csrc/dynamo/guards.cpp create mode 100644 torch/csrc/dynamo/guards.h create mode 100644 torch/csrc/dynamo/init.cpp create mode 100644 torch/csrc/dynamo/init.h create mode 100644 torch/csrc/functorch/init.cpp create mode 100644 torch/csrc/functorch/init.h delete mode 100644 torch/csrc/jit/backends/coreml/observer/PTMCoreMLObserver.h delete mode 100644 torch/csrc/jit/backends/coreml/observer/PTMCoreMLObserver.mm create mode 100644 torch/csrc/jit/backends/xnnpack/compiler/xnn_compiler.cpp create mode 100644 torch/csrc/jit/backends/xnnpack/compiler/xnn_compiler.h create mode 100644 torch/csrc/jit/backends/xnnpack/executor/xnn_executor.h create mode 100644 torch/csrc/jit/backends/xnnpack/serialization/schema.fbs create mode 100644 torch/csrc/jit/backends/xnnpack/serialization/serializer.cpp create mode 100644 torch/csrc/jit/backends/xnnpack/serialization/serializer.h create mode 100644 torch/csrc/jit/backends/xnnpack/xnnpack_backend_lib.cpp create mode 100644 torch/csrc/jit/backends/xnnpack/xnnpack_backend_preprocess.cpp create mode 100644 torch/csrc/jit/backends/xnnpack/xnnpack_graph_builder.cpp create mode 100644 torch/csrc/jit/backends/xnnpack/xnnpack_graph_builder.h create mode 100644 torch/csrc/jit/codegen/cuda/dynamic_type.h delete mode 100644 torch/csrc/jit/codegen/cuda/index_reference_replay.cpp delete mode 100644 torch/csrc/jit/codegen/cuda/index_reference_replay.h delete mode 100644 torch/csrc/jit/codegen/cuda/inline_propagator.cpp delete mode 100644 torch/csrc/jit/codegen/cuda/inline_propagator.h create mode 100644 torch/csrc/jit/codegen/cuda/inlining.cpp create mode 100644 torch/csrc/jit/codegen/cuda/inlining.h create mode 100644 torch/csrc/jit/codegen/cuda/lower_bank_conflict.cpp create mode 100644 torch/csrc/jit/codegen/cuda/lower_bank_conflict.h create mode 100644 torch/csrc/jit/codegen/cuda/lower_divisible_split.cpp create mode 100644 torch/csrc/jit/codegen/cuda/lower_divisible_split.h create mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/README.md delete mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/examples/double_half_cast.py delete mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/examples/half_double_cast.py delete mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/examples/python_example.py delete mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/examples/python_example_broadcast_in_dim.py delete 
mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/examples/python_example_fp16.py create mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/fusion_cache.cpp create mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/fusion_cache.h create mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/fusion_interface.cpp create mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/fusion_interface.h delete mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/fusion_owner.h create mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/test/test_nvfuser_fusion_cache.cpp create mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/test/test_nvfuser_fusion_definition.cpp create mode 100644 torch/csrc/jit/codegen/cuda/python_frontend/test/test_nvfuser_fusion_record.cpp delete mode 100644 torch/csrc/jit/codegen/cuda/reference_tensor.h create mode 100644 torch/csrc/jit/codegen/cuda/runtime/array_rocm.cu create mode 100644 torch/csrc/jit/codegen/cuda/runtime/bf16_support_rocm.cu create mode 100644 torch/csrc/jit/codegen/cuda/runtime/block_sync_default_rocm.cu create mode 100644 torch/csrc/jit/codegen/cuda/runtime/fused_welford_helper.cu create mode 100644 torch/csrc/jit/codegen/cuda/runtime/fused_welford_impl.cu create mode 100644 torch/csrc/jit/codegen/cuda/runtime/warp_rocm.cu create mode 100644 torch/csrc/jit/codegen/cuda/scheduler/transpose.cpp create mode 100644 torch/csrc/jit/codegen/cuda/scheduler/transpose.h create mode 100644 torch/csrc/jit/codegen/cuda/scheduler/transpose_heuristic.h create mode 100644 torch/csrc/jit/codegen/cuda/scheduler/vectorize_helper.cpp delete mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu.cpp create mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu1.cpp create mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu2.cpp create mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu3.cpp create mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu_rng.cu create mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu_tensor_factories.cpp create mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu_transpose.cpp create mode 100644 torch/csrc/jit/codegen/cuda/test/test_gpu_utils.cpp create mode 100644 torch/csrc/jit/codegen/onednn/decompose_silu.cpp create mode 100644 torch/csrc/jit/codegen/onednn/decompose_silu.h create mode 100644 torch/csrc/jit/mobile/quantization.cpp create mode 100644 torch/csrc/jit/mobile/quantization.h create mode 100644 torch/csrc/jit/passes/mobile_optimizer_type.h create mode 100644 torch/csrc/jit/passes/onnx/naming.cpp create mode 100644 torch/csrc/jit/passes/onnx/naming.h create mode 100644 torch/csrc/jit/passes/quantization/register_packed_params.cpp create mode 100644 torch/csrc/jit/passes/quantization/register_packed_params.h delete mode 100644 torch/csrc/lazy/core/lazy_view.cpp delete mode 100644 torch/csrc/lazy/core/lazy_view.h create mode 100644 torch/csrc/onnx/diagnostics/diagnostics.h create mode 100644 torch/csrc/onnx/diagnostics/generated/rules.h delete mode 100644 torch/csrc/profiler/api.cpp create mode 100644 torch/csrc/profiler/data_flow.cpp create mode 100644 torch/csrc/profiler/data_flow.h create mode 100644 torch/csrc/profiler/events.h create mode 100644 torch/csrc/profiler/orchestration/observer.cpp create mode 100644 torch/csrc/profiler/orchestration/observer.h create mode 100644 torch/csrc/profiler/orchestration/python_tracer.cpp create mode 100644 torch/csrc/profiler/orchestration/python_tracer.h create mode 100644 torch/csrc/profiler/perf-inl.h create mode 100644 torch/csrc/profiler/perf.cpp create 
mode 100644 torch/csrc/profiler/perf.h create mode 100644 torch/csrc/profiler/python/init.cpp create mode 100644 torch/csrc/profiler/python/init.h create mode 100644 torch/csrc/profiler/python/pybind.h rename torch/csrc/profiler/{ => standalone}/execution_graph_observer.cpp (96%) rename torch/csrc/profiler/{ => standalone}/execution_graph_observer.h (100%) rename torch/csrc/profiler/{ => standalone}/itt_observer.cpp (89%) rename torch/csrc/profiler/{ => standalone}/itt_observer.h (100%) rename torch/csrc/profiler/{ => standalone}/nvtx_observer.cpp (95%) rename torch/csrc/profiler/{ => standalone}/nvtx_observer.h (100%) create mode 100644 torch/csrc/profiler/stubs/base.cpp create mode 100644 torch/csrc/profiler/stubs/base.h rename torch/csrc/profiler/{ => stubs}/cuda.cpp (94%) rename torch/csrc/profiler/{ => stubs}/itt.cpp (96%) delete mode 100644 torch/csrc/utils/disallow_copy.h create mode 100644 torch/csrc/utils/nested.cpp create mode 100644 torch/csrc/utils/nested.h create mode 100644 torch/csrc/utils/pybind.cpp create mode 100644 torch/csrc/utils/python_symnode.cpp create mode 100644 torch/csrc/utils/python_symnode.h create mode 100644 torch/cuda/_sanitizer.py delete mode 100644 torch/deploy.h create mode 100644 torch/distributed/_composable/__init__.py create mode 100644 torch/distributed/_composable/_ddp.py create mode 100644 torch/distributed/_composable/checkpoint_activation.py create mode 100644 torch/distributed/_composable/contract.py create mode 100644 torch/distributed/_composable/fully_shard.py create mode 100644 torch/distributed/_composable/replicate.py delete mode 100644 torch/distributed/_shard/checkpoint/filesystem.py delete mode 100644 torch/distributed/_shard/checkpoint/resharding.py delete mode 100644 torch/distributed/_shard/checkpoint/state_dict_loader.py delete mode 100644 torch/distributed/_shard/checkpoint/state_dict_saver.py delete mode 100644 torch/distributed/_shard/checkpoint/storage.py create mode 100644 torch/distributed/_spmd/__init__.py create mode 100644 torch/distributed/_spmd/comm_tensor.py create mode 100644 torch/distributed/_tensor/README.md create mode 100644 torch/distributed/_tensor/__init__.py create mode 100644 torch/distributed/_tensor/api.py create mode 100644 torch/distributed/_tensor/device_mesh.py create mode 100644 torch/distributed/_tensor/dispatch.py create mode 100644 torch/distributed/_tensor/ops/__init__.py create mode 100644 torch/distributed/_tensor/ops/common_rules.py create mode 100644 torch/distributed/_tensor/ops/math_ops.py create mode 100644 torch/distributed/_tensor/ops/matrix_ops.py create mode 100644 torch/distributed/_tensor/ops/pointwise_ops.py create mode 100644 torch/distributed/_tensor/ops/tensor_ops.py create mode 100644 torch/distributed/_tensor/ops/tp_sharding_ops.py create mode 100644 torch/distributed/_tensor/ops/utils.py create mode 100644 torch/distributed/_tensor/ops/view_ops.py create mode 100644 torch/distributed/_tensor/parallel/__init__.py create mode 100644 torch/distributed/_tensor/parallel/_view_with_dim_change.py create mode 100644 torch/distributed/_tensor/parallel/api.py create mode 100644 torch/distributed/_tensor/parallel/fsdp.py create mode 100644 torch/distributed/_tensor/parallel/multihead_attention_tp.py create mode 100644 torch/distributed/_tensor/parallel/style.py create mode 100644 torch/distributed/_tensor/parallel/utils.py create mode 100644 torch/distributed/_tensor/placement_types.py create mode 100644 torch/distributed/_tensor/redistribute.py create mode 100644 
torch/distributed/_tensor/utils.py create mode 100644 torch/distributed/c10d_error_logger.py create mode 100644 torch/distributed/checkpoint/__init__.py rename torch/distributed/{_shard => }/checkpoint/api.py (90%) create mode 100644 torch/distributed/checkpoint/dedup_tensors.py create mode 100644 torch/distributed/checkpoint/default_planner.py create mode 100644 torch/distributed/checkpoint/filesystem.py rename torch/distributed/{_shard => }/checkpoint/metadata.py (75%) create mode 100644 torch/distributed/checkpoint/planner.py create mode 100644 torch/distributed/checkpoint/planner_helpers.py create mode 100644 torch/distributed/checkpoint/resharding.py create mode 100644 torch/distributed/checkpoint/state_dict_loader.py create mode 100644 torch/distributed/checkpoint/state_dict_saver.py create mode 100644 torch/distributed/checkpoint/storage.py create mode 100644 torch/distributed/checkpoint/traverse.py rename torch/distributed/{_shard => }/checkpoint/utils.py (73%) create mode 100644 torch/distributed/elastic/timer/file_based_local_timer.py create mode 100644 torch/distributed/fsdp/_common_utils.py create mode 100644 torch/distributed/fsdp/_exec_order_utils.py create mode 100644 torch/distributed/fsdp/_fsdp_extensions.py create mode 100644 torch/distributed/fsdp/_init_utils.py create mode 100644 torch/distributed/fsdp/_limiter_utils.py create mode 100644 torch/distributed/fsdp/_runtime_utils.py rename torch/distributed/fsdp/{shard_utils.py => _shard_utils.py} (64%) create mode 100644 torch/distributed/fsdp/_state_dict_utils.py create mode 100644 torch/distributed/fsdp/_unshard_param_utils.py create mode 100644 torch/distributed/fsdp/_wrap_utils.py create mode 100644 torch/distributed/fsdp/api.py delete mode 100644 torch/distributed/fsdp/flatten_params_wrapper.py create mode 100644 torch/distributed/logging_handlers.py create mode 100644 torch/distributed/optim/apply_optimizer_in_backward.py delete mode 100644 torch/fx/passes/backends/nvfuser.py create mode 100644 torch/masked/__init__.py rename torch/{_masked => masked}/_docs.py (97%) rename torch/{_masked/__init__.py => masked/_ops.py} (92%) create mode 100644 torch/masked/maskedtensor/__init__.py create mode 100644 torch/masked/maskedtensor/_ops_refs.py create mode 100644 torch/masked/maskedtensor/binary.py create mode 100644 torch/masked/maskedtensor/core.py create mode 100644 torch/masked/maskedtensor/creation.py create mode 100644 torch/masked/maskedtensor/passthrough.py create mode 100644 torch/masked/maskedtensor/reductions.py create mode 100644 torch/masked/maskedtensor/unary.py delete mode 100644 torch/nn/parallel/distributed.pyi create mode 100644 torch/nn/utils/_deprecation_utils.py create mode 100644 torch/onnx/_internal/__init__.py create mode 100644 torch/onnx/_internal/_beartype.py create mode 100644 torch/onnx/_internal/diagnostics/OVERVIEW.md create mode 100644 torch/onnx/_internal/diagnostics/__init__.py create mode 100644 torch/onnx/_internal/diagnostics/_diagnostic.py create mode 100644 torch/onnx/_internal/diagnostics/_rules.py create mode 100644 torch/onnx/_internal/diagnostics/infra/__init__.py create mode 100644 torch/onnx/_internal/diagnostics/infra/_infra.py create mode 100644 torch/onnx/_internal/diagnostics/infra/engine.py create mode 100644 torch/onnx/_internal/diagnostics/infra/formatter.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/__init__.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_address.py create mode 100644 
torch/onnx/_internal/diagnostics/infra/sarif/_artifact.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_artifact_change.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_artifact_content.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_artifact_location.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_attachment.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_code_flow.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_configuration_override.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_conversion.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_edge.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_edge_traversal.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_exception.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_external_properties.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_external_property_file_reference.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_external_property_file_references.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_fix.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_graph.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_graph_traversal.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_invocation.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_location.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_location_relationship.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_logical_location.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_message.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_multiformat_message_string.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_node.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_notification.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_physical_location.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_property_bag.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_rectangle.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_region.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_replacement.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_reporting_configuration.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_reporting_descriptor.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_reporting_descriptor_reference.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_reporting_descriptor_relationship.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_result.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_result_provenance.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_run.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_run_automation_details.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_sarif_log.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_special_locations.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_stack.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_stack_frame.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_suppression.py 
create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_thread_flow.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_thread_flow_location.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_tool.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_tool_component.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_tool_component_reference.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_translation_metadata.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_version_control_details.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_web_request.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/_web_response.py create mode 100644 torch/onnx/_internal/diagnostics/infra/sarif/version.py create mode 100644 torch/onnx/_internal/diagnostics/infra/utils.py create mode 100644 torch/onnx/_internal/diagnostics/rules.yaml create mode 100644 torch/onnx/_internal/jit_utils.py create mode 100644 torch/onnx/_internal/onnx_proto_utils.py create mode 100644 torch/onnx/_internal/registration.py create mode 100644 torch/onnx/symbolic_opset17.py delete mode 100644 torch/onnx/symbolic_registry.py create mode 100644 torch/profiler/_memory_profiler.py create mode 100644 torch/signal/__init__.py create mode 100644 torch/signal/windows/__init__.py create mode 100644 torch/signal/windows/windows.py create mode 100644 torch/sparse/matmul.py create mode 100644 torch/testing/_internal/distributed/_tensor/__init__.py create mode 100644 torch/testing/_internal/distributed/_tensor/common_dtensor.py create mode 100644 torch/testing/_internal/distributed/_tensor/dtensor_lagging_op_db.py rename functorch/codegen/gen_functorch_lagging_op_db.py => torch/testing/_internal/distributed/_tensor/gen_dtensor_lagging_op_db.py (57%) create mode 100644 torch/testing/_internal/distributed/multi_threaded_pg.py create mode 100644 torch/testing/_internal/inductor_utils.py create mode 100644 torch/testing/_internal/opinfo/definitions/__init__.py create mode 100644 torch/testing/_internal/opinfo/definitions/_masked.py create mode 100644 torch/testing/_internal/opinfo/definitions/fft.py create mode 100644 torch/testing/_internal/opinfo/definitions/linalg.py create mode 100644 torch/testing/_internal/opinfo/definitions/signal.py create mode 100644 torch/testing/_internal/opinfo/definitions/special.py create mode 100644 torch/testing/_internal/opinfo/refs.py delete mode 100644 torch/testing/_legacy.py create mode 100644 torch/utils/backend_registration.py create mode 100644 torch/utils/cpp_backtrace.py delete mode 100644 torch/utils/data/communication/__init__.py delete mode 100644 torch/utils/data/communication/eventloop.py delete mode 100644 torch/utils/data/communication/iter.py delete mode 100644 torch/utils/data/communication/map.py delete mode 100644 torch/utils/data/communication/messages.py delete mode 100644 torch/utils/data/communication/protocol.py delete mode 100644 torch/utils/data/communication/queue.py delete mode 100644 torch/utils/data/dataloader_experimental.py delete mode 100644 ubsan.supp diff --git a/.bazelrc b/.bazelrc index ce8406b58aaa..f8ff2215f2d6 100644 --- a/.bazelrc +++ b/.bazelrc @@ -1,4 +1,4 @@ -build --cxxopt=--std=c++14 +build --cxxopt=--std=c++17 build --copt=-I. # Bazel does not support including its cc_library targets as system # headers. 
We work around this for generated code diff --git a/.circleci/README.md b/.circleci/README.md new file mode 100644 index 000000000000..e2429b4d1f03 --- /dev/null +++ b/.circleci/README.md @@ -0,0 +1,468 @@ +Warning +======= + +Contents may be out of date. Our CircleCI workflows are gradually being migrated to Github actions. + +Structure of CI +=============== + +setup job: +1. Does a git checkout +2. Persists CircleCI scripts (everything in `.circleci`) into a workspace. Why? + We don't always do a Git checkout on all subjobs, but we usually + still want to be able to call scripts one way or another in a subjob. + Persisting files this way lets us have access to them without doing a + checkout. This workspace is conventionally mounted on `~/workspace` + (this is distinguished from `~/project`, which is the conventional + working directory that CircleCI will default to starting your jobs + in.) +3. Write out the commit message to `.circleci/COMMIT_MSG`. This is so + we can determine in subjobs if we should actually run the jobs or + not, even if there isn't a Git checkout. + + +CircleCI configuration generator +================================ + +One may no longer make changes to the `.circleci/config.yml` file directly. +Instead, one must edit these Python scripts or files in the `verbatim-sources/` directory. + + +Usage +---------- + +1. Make changes to these scripts. +2. Run the `regenerate.sh` script in this directory and commit the script changes and the resulting change to `config.yml`. + +You'll see a build failure on GitHub if the scripts don't agree with the checked-in version. + + +Motivation +---------- + +These scripts establish a single, authoritative source of documentation for the CircleCI configuration matrix. +The documentation, in the form of diagrams, is automatically generated and cannot drift out of sync with the YAML content. + +Furthermore, consistency is enforced within the YAML config itself, by using a single source of data to generate +multiple parts of the file. + +* Facilitates one-off culling/enabling of CI configs for testing PRs on special targets + +Also see https://github.com/pytorch/pytorch/issues/17038 + + +Future direction +---------------- + +### Declaring sparse config subsets +See comment [here](https://github.com/pytorch/pytorch/pull/17323#pullrequestreview-206945747): + +In contrast with a full recursive tree traversal of configuration dimensions, +> in the future I think we actually want to decrease our matrix somewhat and have only a few mostly-orthogonal builds that taste as many different features as possible on PRs, plus a more complete suite on every PR and maybe an almost full suite nightly/weekly (we don't have this yet). Specifying PR jobs in the future might be easier to read with an explicit list when we come to this. +---------------- +---------------- + +# How do the binaries / nightlies / releases work? + +### What is a binary? + +A binary or package (used interchangeably) is a pre-built collection of c++ libraries, header files, python bits, and other files. We build these and distribute them so that users do not need to install from source. + +A **binary configuration** is a collection of + +* release or nightly + * releases are stable, nightlies are beta and built every night +* python version + * linux: 3.7m (mu is wide unicode or something like that. 
It usually doesn't matter but you should know that it exists) + * macos: 3.7, 3.8 + * windows: 3.7, 3.8 +* cpu version + * cpu, cuda 9.0, cuda 10.0 + * The supported cuda versions occasionally change +* operating system + * Linux - these are all built on CentOS. There haven't been any problems in the past building on CentOS and using on Ubuntu + * MacOS + * Windows - these are built on Azure pipelines +* devtoolset version (gcc compiler version) + * This only matters on Linux because only Linux uses gcc. tldr is gcc made a backwards incompatible change from gcc 4.8 to gcc 5, because it had to change how it implemented std::vector and std::string + +### Where are the binaries? + +The binaries are built in CircleCI. There are nightly binaries built every night at 9pm PST (midnight EST) and release binaries corresponding to Pytorch releases, usually every few months. + +We have 3 types of binary packages + +* pip packages - nightlies are stored on s3 (pip install -f \). releases are stored in a pip repo (pip install torch) (ask Soumith about this) +* conda packages - nightlies and releases are both stored in a conda repo. Nightly packages have a '_nightly' suffix +* libtorch packages - these are zips of all the c++ libraries, header files, and sometimes dependencies. These are c++ only + * shared with dependencies (the only supported option for Windows) + * static with dependencies + * shared without dependencies + * static without dependencies + +All binaries are built in CircleCI workflows except Windows. There are checked-in workflows (committed into the .circleci/config.yml) to build the nightlies every night. Releases are built by manually pushing a PR that builds the suite of release binaries (overwrite the config.yml to build the release) + +# CircleCI structure of the binaries + +Some quick vocab: + +* A \**workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/master/.circleci/config.yml to see the workflows. +* **jobs** are a sequence of '**steps**' +* **steps** are usually just a bash script or a builtin CircleCI command. *All steps run in new environments, environment variables declared in one script DO NOT persist to following steps* +* CircleCI has a **workspace**, which is essentially a cache between steps of the *same job* in which you can store artifacts between steps. + +## How are the workflows structured? + +The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build, test, and upload) per binary configuration + +1. binary_builds + 1. every day midnight EST + 2. linux: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/linux-binary-build-defaults.yml + 3. macos: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/macos-binary-build-defaults.yml + 4. For each binary configuration, e.g. linux_conda_3.7_cpu there is a + 1. binary_linux_conda_3.7_cpu_build + 1. Builds the package. On linux jobs this uses the 'docker executor'. + 2. Persists the package to the workspace + 2. binary_linux_conda_3.7_cpu_test + 1. Loads the package from the workspace + 2. Spins up a docker image (on Linux), mapping the package and code repos into the docker + 3. Runs some smoke tests in the docker + 4. (Actually, for macos this is a step rather than a separate job) + 3. binary_linux_conda_3.7_cpu_upload + 1. Logs in to aws/conda + 2. Uploads the package +2. update_s3_htmls + 1. every day 5am EST + 2.
https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/binary_update_htmls.yml + 3. See below for what these are for and why they're needed + 4. Three jobs that each examine the current contents of aws and the conda repo and update some html files in s3 +3. binarysmoketests + 1. every day + 2. https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml + 3. For each binary configuration, e.g. linux_conda_3.7_cpu there is a + 1. smoke_linux_conda_3.7_cpu + 1. Downloads the package from the cloud, e.g. using the official pip or conda instructions + 2. Runs the smoke tests + +## How are the jobs structured? + +The jobs are in https://github.com/pytorch/pytorch/tree/master/.circleci/verbatim-sources. Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/master/.circleci/scripts . + +* Linux jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/linux-binary-build-defaults.yml + * binary_linux_build.sh + * binary_linux_test.sh + * binary_linux_upload.sh +* MacOS jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/macos-binary-build-defaults.yml + * binary_macos_build.sh + * binary_macos_test.sh + * binary_macos_upload.sh +* Update html jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/binary_update_htmls.yml + * These delegate from the pytorch/builder repo + * https://github.com/pytorch/builder/blob/master/cron/update_s3_htmls.sh + * https://github.com/pytorch/builder/blob/master/cron/upload_binary_sizes.sh +* Smoke jobs (both linux and macos): https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml + * These delegate from the pytorch/builder repo + * https://github.com/pytorch/builder/blob/master/run_tests.sh + * https://github.com/pytorch/builder/blob/master/smoke_test.sh + * https://github.com/pytorch/builder/blob/master/check_binary.sh +* Common shared code (shared across linux and macos): https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/nightly-binary-build-defaults.yml + * binary_checkout.sh - checks out pytorch/builder repo. Right now this also checks out pytorch/pytorch, but it shouldn't. pytorch/pytorch should just be shared through the workspace. This can handle being run before binary_populate_env.sh + * binary_populate_env.sh - parses BUILD_ENVIRONMENT into the separate env variables that make up a binary configuration. Also sets lots of default values, the date, the version strings, the location of folders in s3, all sorts of things. This generally has to be run before other steps. + * binary_install_miniconda.sh - Installs miniconda, cross platform. Also hacks this for the update_binary_sizes job that doesn't have the right env variables + * binary_run_in_docker.sh - Takes a bash script file (the actual test code) from a hardcoded location, spins up a docker image, and runs the script inside the docker image + +### **Why do the steps all refer to scripts?** + +CircleCI creates a final yaml file by inlining every <<* segment, so if we were to keep all the code in the config.yml itself then the config size would go over 4 MB and cause infra problems. + +### **What is binary_run_in_docker for?** + +So, CircleCI has several executor types: macos, machine, and docker are the ones we use. 
The 'machine' executor gives you two cores on some linux vm. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the dockers are faster, we try to run everything that we can in dockers. Thus + +* linux build jobs use the docker executor. Running them on the docker executor was at least 2x faster than running them on the machine executor +* linux test jobs use the machine executor in order for them to properly interface with GPUs since docker executors cannot execute with attached GPUs +* linux upload jobs use the machine executor. The upload jobs are so short that it doesn't really matter what they use +* linux smoke test jobs use the machine executor for the same reason as the linux test jobs + +binary_run_in_docker.sh is a way to share the docker start-up code between the binary test jobs and the binary smoke test jobs + +### **Why does binary_checkout also checkout pytorch? Why shouldn't it?** + +We want all the nightly binary jobs to run on the exact same git commit, so we wrote our own checkout logic to ensure that the same commit was always picked. Later circleci changed that to use a single pytorch checkout and persist it through the workspace (they did this because our config file was too big, so they wanted to take a lot of the setup code into scripts, but the scripts needed the code repo to exist to be called, so they added a prereq step called 'setup' to checkout the code and persist the needed scripts to the workspace). The changes to the binary jobs were not properly tested, so they all broke from missing pytorch code no longer existing. We hotfixed the problem by adding the pytorch checkout back to binary_checkout, so now there's two checkouts of pytorch on the binary jobs. This problem still needs to be fixed, but it takes careful tracing of which code is being called where. + +# Code structure of the binaries (circleci agnostic) + +## Overview + +The code that runs the binaries lives in two places, in the normal [github.com/pytorch/pytorch](http://github.com/pytorch/pytorch), but also in [github.com/pytorch/builder](http://github.com/pytorch/builder), which is a repo that defines how all the binaries are built. The relevant code is + + +``` +# All code needed to set-up environments for build code to run in, +# but only code that is specific to the current CI system +pytorch/pytorch +- .circleci/ # Folder that holds all circleci related stuff + - config.yml # GENERATED file that actually controls all circleci behavior + - verbatim-sources # Used to generate job/workflow sections in ^ + - scripts/ # Code needed to prepare circleci environments for binary build scripts +- setup.py # Builds pytorch. This is wrapped in pytorch/builder +- cmake files # used in normal building of pytorch +# All code needed to prepare a binary build, given an environment +# with all the right variables/packages/paths. +pytorch/builder +# Given an installed binary and a proper python env, runs some checks +# to make sure the binary was built the proper way. Checks things like +# the library dependencies, symbols present, etc. +- check_binary.sh +# Given an installed binary, runs python tests to make sure everything +# is in order. These should be de-duped. Right now they both run smoke +# tests, but are called from different places. Usually just call some +# import statements, but also has overlap with check_binary.sh above +- run_tests.sh +- smoke_test.sh +# Folders that govern how packages are built. 
See paragraphs below +- conda/ + - build_pytorch.sh # Entrypoint. Delegates to proper conda build folder + - switch_cuda_version.sh # Switches the active CUDA installation in Docker + - pytorch-nightly/ # Build-folder +- manywheel/ + - build_cpu.sh # Entrypoint for cpu builds + - build.sh # Entrypoint for CUDA builds + - build_common.sh # Actual build script that ^^ call into +- wheel/ + - build_wheel.sh # Entrypoint for wheel builds +- windows/ + - build_pytorch.bat # Entrypoint for wheel builds on Windows +``` + +Every type of package has an entrypoint build script that handles all the important logic. + +## Conda + +Linux, MacOS and Windows use the same code flow for the conda builds. + +Conda packages are built with conda-build, see https://conda.io/projects/conda-build/en/latest/resources/commands/conda-build.html + +Basically, you pass `conda build` a build folder (pytorch-nightly/ above) that contains a build script and a meta.yaml. The meta.yaml specifies what python environment to build the package in, and what dependencies the resulting package should have, and the build script gets called in the env to build the thing. +tl;dr on conda-build is + +1. Creates a brand new conda environment, based off of deps in the meta.yaml + 1. Note that environment variables do not get passed into this build env unless they are specified in the meta.yaml + 2. If the build fails this environment will stick around. You can activate it for much easier debugging. The “General Python” section below explains what exactly a python “environment” is. +2. Calls build.sh in the environment +3. Copies the finished package to a new conda env, also specified by the meta.yaml +4. Runs some simple import tests (if specified in the meta.yaml) +5. Saves the finished package as a tarball + +The build.sh we use is essentially a wrapper around `python setup.py build`, but it also manually copies in some of our dependent libraries into the resulting tarball and messes with some rpaths. + +The entrypoint file `builder/conda/build_conda.sh` is complicated because + +* It works for Linux, MacOS and Windows + * The mac builds used to create their own environments, since they all used to be on the same machine. There’s now a lot of extra logic to handle conda envs. This extra machinery could be removed +* It used to handle testing too, which adds more logic messing with python environments too. This extra machinery could be removed. + +## Manywheels (linux pip and libtorch packages) + +Manywheels are pip packages for linux distros. Note that these manywheels are not actually manylinux compliant. + +`builder/manywheel/build_cpu.sh` and `builder/manywheel/build.sh` (for CUDA builds) just set different env vars and then call into `builder/manywheel/build_common.sh` + +The entrypoint file `builder/manywheel/build_common.sh` is really really complicated because + +* This used to handle building for several different python versions at the same time. The loops have been removed, but there's still unnecessary folders and movements here and there. + * The script is never used this way anymore. This extra machinery could be removed. +* This used to handle testing the pip packages too. This is why there’s testing code at the end that messes with python installations and stuff + * The script is never used this way anymore. This extra machinery could be removed. +* This also builds libtorch packages + * This should really be separate. libtorch packages are c++ only and have no python.
They should not share infra with all the python specific stuff in this file. +* There is a lot of messing with rpaths. This is necessary, but could be made much much simpler if the above issues were fixed. + +## Wheels (MacOS pip and libtorch packages) + +The entrypoint file `builder/wheel/build_wheel.sh` is complicated because + +* The mac builds used to all run on one machine (we didn’t have autoscaling mac machines till circleci). So this script handled siloing itself by setting-up and tearing-down its build env and siloing itself into its own build directory. + * The script is never used this way anymore. This extra machinery could be removed. +* This also builds libtorch packages + * Ditto the comment above. This should definitely be separated out. + +Note that the MacOS Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda. + +## Windows Wheels (Windows pip and libtorch packages) + +The entrypoint file `builder/windows/build_pytorch.bat` is complicated because + +* This used to handle building for several different python versions at the same time. This is why there are loops everywhere + * The script is never used this way anymore. This extra machinery could be removed. +* This used to handle testing the pip packages too. This is why there’s testing code at the end that messes with python installations and stuff + * The script is never used this way anymore. This extra machinery could be removed. +* This also builds libtorch packages + * This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file. + +Note that the Windows Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda. + +## General notes + +### Note on run_tests.sh, smoke_test.sh, and check_binary.sh + +* These should all be consolidated +* These must run on all OS types: MacOS, Linux, and Windows +* These all run smoke tests at the moment. They inspect the packages some, maybe run a few import statements. They DO NOT run the python tests nor the cpp tests. The idea is that python tests on master and PR merges will catch all breakages. All these tests have to do is make sure the special binary machinery didn’t mess anything up. +* There are separate run_tests.sh and smoke_test.sh because one used to be called by the smoke jobs and one used to be called by the binary test jobs (see circleci structure section above). This is still true actually, but these could be united into a single script that runs these checks, given an installed pytorch package. + +### Note on libtorch + +Libtorch packages are built in the wheel build scripts: manywheel/build_*.sh for linux and build_wheel.sh for mac. There are several things wrong with this + +* It’s confusing. Most of those scripts deal with python specifics. +* The extra conditionals everywhere severely complicate the wheel build scripts +* The process for building libtorch is different from the official instructions (a plain call to cmake, or a call to a script) + +### Note on docker images / Dockerfiles + +All linux builds occur in docker images. The docker images are + +* pytorch/conda-cuda + * Has ALL CUDA versions installed. The script pytorch/builder/conda/switch_cuda_version.sh sets /usr/local/cuda to a symlink to e.g. 
/usr/local/cuda-10.0 to enable different CUDA builds + * Also used for cpu builds +* pytorch/manylinux-cuda90 +* pytorch/manylinux-cuda100 + * Also used for cpu builds + +The Dockerfiles are available in pytorch/builder, but there is no circleci job or script to build these docker images, and they cannot be run locally (unless you have the correct local packages/paths). Only Soumith can build them right now. + +### General Python + +* This is still a good explanation of python installations https://caffe2.ai/docs/faq.html#why-do-i-get-import-errors-in-python-when-i-try-to-use-caffe2 + +# How to manually rebuild the binaries + +tl;dr make a PR that looks like https://github.com/pytorch/pytorch/pull/21159 + +Sometimes we want to push a change to master and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force circleci to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/master/.circleci/scripts/binary_checkout.sh#L42-L45 to checkout what you want. + +## How to test changes to the binaries via .circleci + +Writing PRs that test the binaries is annoying, since the default circleci jobs that run on PRs are not the jobs that you want to run. Likely, changes to the binaries will touch something under .circleci/ and require that .circleci/config.yml be regenerated (.circleci/config.yml controls all .circleci behavior, and is generated using `.circleci/regenerate.sh` in python 3.7). But you also need to manually hardcode the binary jobs that you want to test into the .circleci/config.yml workflow, so you should actually make at least two commits, one for your changes and one to temporarily hardcode jobs. See https://github.com/pytorch/pytorch/pull/22928 as an example of how to do this. + +```sh +# Make your changes +touch .circleci/verbatim-sources/nightly-binary-build-defaults.yml +# Regenerate the yaml, has to be in python 3.7 +.circleci/regenerate.sh +# Make a commit +git add .circleci * +git commit -m "My real changes" +git push origin my_branch +# Now hardcode the jobs that you want in the .circleci/config.yml workflows section +# Also eliminate ensure-consistency and should_run_job checks +# e.g. https://github.com/pytorch/pytorch/commit/2b3344bfed8772fe86e5210cc4ee915dee42b32d +# Make a commit you won't keep +git add .circleci +git commit -m "[DO NOT LAND] testing binaries for above changes" +git push origin my_branch +# Now you need to make some changes to the first commit. +git rebase -i HEAD~2 # mark the first commit as 'edit' +# Make the changes +touch .circleci/verbatim-sources/nightly-binary-build-defaults.yml +.circleci/regenerate.sh +# Amend the commit and continue the rebase +git add .circleci +git commit --amend +git rebase --continue +# Update the PR, need to force since the commits are different now +git push origin my_branch --force +``` + +The advantage of this flow is that you can make new changes to the base commit and regenerate the .circleci without having to re-write which binary jobs you want to test on. The downside is that all updates will be force pushes.
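+
+Before pushing the "real changes" commit, it can also help to confirm locally that the regenerated config agrees with what you are committing, since GitHub will fail the build when the generators and the checked-in `config.yml` disagree (the temporary hardcoded commit will of course diverge, which is why the flow above removes the ensure-consistency check). A minimal sketch of such a check, assuming you are at the repo root with python 3.7 available; this is an illustration, not a checked-in script:
+
+```sh
+# Regenerate the CircleCI config and fail loudly if the checked-in
+# .circleci/config.yml no longer matches the generator output.
+set -e
+.circleci/regenerate.sh
+if ! git diff --exit-code .circleci/config.yml; then
+  echo "config.yml is out of date; commit the regenerated file" >&2
+  exit 1
+fi
+```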
+ +## How to build a binary locally + +### Linux + +You can build Linux binaries locally easily using docker. + +```sh +# Run the docker +# Use the correct docker image, pytorch/conda-cuda used here as an example +# +# -v path/to/foo:path/to/bar makes path/to/foo on your local machine (the +# machine that you're running the command on) accessible to the docker +# container at path/to/bar. So if you then run `touch path/to/bar/baz` +# in the docker container then you will see path/to/foo/baz on your local +# machine. You could also clone the pytorch and builder repos in the docker. +# +# If you know how, add ccache as a volume too and speed up everything +docker run \ + -v your/pytorch/repo:/pytorch \ + -v your/builder/repo:/builder \ + -v where/you/want/packages/to/appear:/final_pkgs \ + -it pytorch/conda-cuda /bin/bash +# Export whatever variables are important to you. All variables that you'd +# possibly need are in .circleci/scripts/binary_populate_env.sh +# You should probably always export at least these 3 variables +export PACKAGE_TYPE=conda +export DESIRED_PYTHON=3.7 +export DESIRED_CUDA=cpu +# Call the entrypoint +# `|& tee foo.log` just copies all stdout and stderr output to foo.log +# The builds generate lots of output so you probably need this when +# building locally. +/builder/conda/build_pytorch.sh |& tee build_output.log +``` + +**Building CUDA binaries on docker** + +You can build CUDA binaries on CPU only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though it’s gonna take a long time). + +For Facebook employees, ask about beefy machines that have docker support and use those instead of your laptop; it will be 5x as fast. + +### MacOS + +There’s no easy way to generate reproducible hermetic MacOS environments. If you have a Mac laptop then you can try emulating the .circleci environments as much as possible, but you probably have packages in /usr/local/, possibly installed by brew, that will probably interfere with the build. If you’re trying to repro an error on a Mac build in .circleci and you can’t seem to repro locally, then my best advice is actually to iterate on .circleci :/ + +But if you want to try, then I’d recommend + +```sh +# Create a new terminal +# Clear your LD_LIBRARY_PATH and trim as much out of your PATH as you +# know how to do +# Install a new miniconda +# First remove any other python or conda installation from your PATH +# Always install miniconda 3, even if building for Python <3 +new_conda="~/my_new_conda" +conda_sh="$new_conda/install_miniconda.sh" +curl -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh +chmod +x "$conda_sh" +"$conda_sh" -b -p "$MINICONDA_ROOT" +rm -f "$conda_sh" +export PATH="~/my_new_conda/bin:$PATH" +# Create a clean python env +# All MacOS builds use conda to manage the python env and dependencies +# that are built with, even the pip packages +conda create -yn binary python=2.7 +conda activate binary +# Export whatever variables are important to you. All variables that you'd +# possibly need are in .circleci/scripts/binary_populate_env.sh +# You should probably always export at least these 3 variables +export PACKAGE_TYPE=conda +export DESIRED_PYTHON=3.7 +export DESIRED_CUDA=cpu +# Call the entrypoint you want +path/to/builder/wheel/build_wheel.sh +``` + +N.B. installing a brand new miniconda is important. This has to do with how conda installations work. 
See the “General Python” section above, but tldr; is that + +1. You make the ‘conda’ command accessible by prepending `path/to/conda_root/bin` to your PATH. +2. You make a new env and activate it, which then also gets prepended to your PATH. Now you have `path/to/conda_root/envs/new_env/bin:path/to/conda_root/bin:$PATH` +3. Now say you (or some code that you ran) call python executable `foo` + 1. if you installed `foo` in `new_env`, then `path/to/conda_root/envs/new_env/bin/foo` will get called, as expected. + 2. But if you forgot to installed `foo` in `new_env` but happened to previously install it in your root conda env (called ‘base’), then unix/linux will still find `path/to/conda_root/bin/foo` . This is dangerous, since `foo` can be a different version than you want; `foo` can even be for an incompatible python version! + +Newer conda versions and proper python hygiene can prevent this, but just install a new miniconda to be safe. + +### Windows + +TODO: fill in diff --git a/.circleci/cimodel/data/simple/ios_definitions.py b/.circleci/cimodel/data/simple/ios_definitions.py index a01a2db8229f..42aac5d90127 100644 --- a/.circleci/cimodel/data/simple/ios_definitions.py +++ b/.circleci/cimodel/data/simple/ios_definitions.py @@ -1,4 +1,5 @@ from cimodel.data.simple.util.versions import MultiPartVersion +from cimodel.data.simple.util.branch_filters import gen_filter_dict_exclude import cimodel.lib.miniutils as miniutils XCODE_VERSION = MultiPartVersion([12, 5, 1]) @@ -11,7 +12,7 @@ def __init__(self, name, custom_build_name=""): def render(self): extra_parts = [self.custom_build_name] if len(self.custom_build_name) > 0 else [] - return "_".join([self.name] + extra_parts) + return "-".join([self.name] + extra_parts).replace("_", "-") def get_platform(arch_variant_name): @@ -25,30 +26,25 @@ def __init__(self, xcode_version, arch_variant, is_org_member_context=True, extr self.is_org_member_context = is_org_member_context self.extra_props = extra_props - def gen_name_parts(self, with_version_dots): - - version_parts = self.xcode_version.render_dots_or_parts(with_version_dots) - build_variant_suffix = "_".join([self.arch_variant.render(), "build"]) - + def gen_name_parts(self): + version_parts = self.xcode_version.render_dots_or_parts("-") + build_variant_suffix = self.arch_variant.render() return [ - "pytorch", "ios", ] + version_parts + [ build_variant_suffix, ] def gen_job_name(self): - return "_".join(self.gen_name_parts(False)) + return "-".join(self.gen_name_parts()) def gen_tree(self): - platform_name = get_platform(self.arch_variant.name) - props_dict = { - "build_environment": "-".join(self.gen_name_parts(True)), + "name": self.gen_job_name(), + "build_environment": self.gen_job_name(), "ios_arch": self.arch_variant.name, "ios_platform": platform_name, - "name": self.gen_job_name(), } if self.is_org_member_context: @@ -57,30 +53,28 @@ def gen_tree(self): if self.extra_props: props_dict.update(self.extra_props) + props_dict["filters"] = gen_filter_dict_exclude() + return [{"pytorch_ios_build": props_dict}] WORKFLOW_DATA = [ IOSJob(XCODE_VERSION, ArchVariant("x86_64"), is_org_member_context=False, extra_props={ "lite_interpreter": miniutils.quote(str(int(True)))}), - IOSJob(XCODE_VERSION, ArchVariant("x86_64", "full_jit"), is_org_member_context=False, extra_props={ - "lite_interpreter": miniutils.quote(str(int(False)))}), - IOSJob(XCODE_VERSION, ArchVariant("arm64"), extra_props={ - "lite_interpreter": miniutils.quote(str(int(True)))}), - IOSJob(XCODE_VERSION, ArchVariant("arm64", 
"metal"), extra_props={ - "use_metal": miniutils.quote(str(int(True))), - "lite_interpreter": miniutils.quote(str(int(True)))}), - IOSJob(XCODE_VERSION, ArchVariant("arm64", "full_jit"), extra_props={ - "lite_interpreter": miniutils.quote(str(int(False)))}), - IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom"), extra_props={ - "op_list": "mobilenetv2.yaml", - "lite_interpreter": miniutils.quote(str(int(True)))}), + # IOSJob(XCODE_VERSION, ArchVariant("arm64"), extra_props={ + # "lite_interpreter": miniutils.quote(str(int(True)))}), + # IOSJob(XCODE_VERSION, ArchVariant("arm64", "metal"), extra_props={ + # "use_metal": miniutils.quote(str(int(True))), + # "lite_interpreter": miniutils.quote(str(int(True)))}), + # IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom-ops"), extra_props={ + # "op_list": "mobilenetv2.yaml", + # "lite_interpreter": miniutils.quote(str(int(True)))}), IOSJob(XCODE_VERSION, ArchVariant("x86_64", "coreml"), is_org_member_context=False, extra_props={ "use_coreml": miniutils.quote(str(int(True))), "lite_interpreter": miniutils.quote(str(int(True)))}), - IOSJob(XCODE_VERSION, ArchVariant("arm64", "coreml"), extra_props={ - "use_coreml": miniutils.quote(str(int(True))), - "lite_interpreter": miniutils.quote(str(int(True)))}), + # IOSJob(XCODE_VERSION, ArchVariant("arm64", "coreml"), extra_props={ + # "use_coreml": miniutils.quote(str(int(True))), + # "lite_interpreter": miniutils.quote(str(int(True)))}), ] diff --git a/.circleci/cimodel/data/simple/macos_definitions.py b/.circleci/cimodel/data/simple/macos_definitions.py index 371c8b694cf3..fff146dbf6bb 100644 --- a/.circleci/cimodel/data/simple/macos_definitions.py +++ b/.circleci/cimodel/data/simple/macos_definitions.py @@ -11,10 +11,14 @@ def gen_tree(self): non_phase_parts = ["pytorch", "macos", self.os_version, "py3"] extra_name_list = [name for name, exist in self.extra_props.items() if exist] - full_job_name_list = non_phase_parts + extra_name_list + [ - 'build' if self.is_build else None, - 'test' if self.is_test else None, - ] + full_job_name_list = ( + non_phase_parts + + extra_name_list + + [ + "build" if self.is_build else None, + "test" if self.is_test else None, + ] + ) full_job_name = "_".join(list(filter(None, full_job_name_list))) @@ -41,10 +45,8 @@ def gen_tree(self): "10_13", is_build=True, is_test=True, - extra_props=tuple({ - "lite_interpreter": True - }.items()), - ) + extra_props=tuple({"lite_interpreter": True}.items()), + ), ] diff --git a/.circleci/cimodel/data/simple/nightly_ios.py b/.circleci/cimodel/data/simple/nightly_ios.py index 941a61a73b91..f75bcb4bfe21 100644 --- a/.circleci/cimodel/data/simple/nightly_ios.py +++ b/.circleci/cimodel/data/simple/nightly_ios.py @@ -15,7 +15,7 @@ def __init__(self, def get_phase_name(self): return "upload" if self.is_upload else "build" - def get_common_name_pieces(self, with_version_dots): + def get_common_name_pieces(self, sep): extra_name_suffix = [self.get_phase_name()] if self.is_upload else [] @@ -24,7 +24,7 @@ def get_common_name_pieces(self, with_version_dots): common_name_pieces = [ "ios", ] + extra_name + [ - ] + ios_definitions.XCODE_VERSION.render_dots_or_parts(with_version_dots) + [ + ] + ios_definitions.XCODE_VERSION.render_dots_or_parts(sep) + [ "nightly", self.variant, "build", @@ -33,14 +33,14 @@ def get_common_name_pieces(self, with_version_dots): return common_name_pieces def gen_job_name(self): - return "_".join(["pytorch"] + self.get_common_name_pieces(False)) + return "_".join(["pytorch"] + self.get_common_name_pieces(None)) def 
gen_tree(self): build_configs = BUILD_CONFIGS_FULL_JIT if self.is_full_jit else BUILD_CONFIGS extra_requires = [x.gen_job_name() for x in build_configs] if self.is_upload else [] props_dict = { - "build_environment": "-".join(["libtorch"] + self.get_common_name_pieces(True)), + "build_environment": "-".join(["libtorch"] + self.get_common_name_pieces(".")), "requires": extra_requires, "context": "org-member", "filters": {"branches": {"only": "nightly"}}, diff --git a/.circleci/cimodel/data/simple/util/branch_filters.py b/.circleci/cimodel/data/simple/util/branch_filters.py index ba4e00a059ef..e87d0045636d 100644 --- a/.circleci/cimodel/data/simple/util/branch_filters.py +++ b/.circleci/cimodel/data/simple/util/branch_filters.py @@ -12,6 +12,9 @@ RC_PATTERN = r"/v[0-9]+(\.[0-9]+)*-rc[0-9]+/" +MAC_IOS_EXCLUSION_LIST = ["nightly", "postnightly"] + + def gen_filter_dict( branches_list=NON_PR_BRANCH_LIST, tags_list=None @@ -26,3 +29,11 @@ def gen_filter_dict( if tags_list is not None: filter_dict["tags"] = {"only": tags_list} return filter_dict + + +def gen_filter_dict_exclude(branches_list=MAC_IOS_EXCLUSION_LIST): + return { + "branches": { + "ignore": branches_list, + }, + } diff --git a/.circleci/cimodel/data/simple/util/versions.py b/.circleci/cimodel/data/simple/util/versions.py index 53d3a837248c..518feb2e3869 100644 --- a/.circleci/cimodel/data/simple/util/versions.py +++ b/.circleci/cimodel/data/simple/util/versions.py @@ -1,3 +1,6 @@ +from typing import Optional + + class MultiPartVersion: def __init__(self, parts, prefix=""): self.parts = parts @@ -13,14 +16,11 @@ def prefixed_parts(self): else: return [self.prefix] - def render_dots(self): - return ".".join(self.prefixed_parts()) - - def render_dots_or_parts(self, with_dots): - if with_dots: - return [self.render_dots()] - else: + def render_dots_or_parts(self, sep: Optional[str] = None): + if sep is None: return self.prefixed_parts() + else: + return [sep.join(self.prefixed_parts())] class CudaVersion(MultiPartVersion): diff --git a/.circleci/config.yml b/.circleci/config.yml index 4ca08b1b7c18..0d353fb2a32e 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -570,6 +570,198 @@ jobs: paths: - miniconda3 + mac_build: + parameters: + build-environment: + type: string + description: Top-level label for what's being built/tested. + xcode-version: + type: string + default: "13.3.1" + description: What xcode version to build with. 
+ build-generates-artifacts: + type: boolean + default: true + description: if the build generates build artifacts + python-version: + type: string + default: "3.8" + macos: + xcode: << parameters.xcode-version >> + resource_class: medium + environment: + BUILD_ENVIRONMENT: << parameters.build-environment >> + AWS_REGION: us-east-1 + steps: + + - checkout + - run_brew_for_macos_build + + - run: + name: Install sccache + command: | + sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${BASH_ENV}" + echo "export SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${BASH_ENV}" + + set +x + echo "export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}" + echo "export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}" + set -x + + - run: + name: Get workflow job id + command: | + echo "export OUR_GITHUB_JOB_ID=${CIRCLE_WORKFLOW_JOB_ID}" >> "${BASH_ENV}" + + - run: + name: Build + command: | + set -x + + git submodule sync + git submodule update --init --recursive --depth 1 --jobs 0 + + export PATH="/usr/local/bin:$PATH" + export WORKSPACE_DIR="${HOME}/workspace" + mkdir -p "${WORKSPACE_DIR}" + MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-MacOSX-x86_64.sh" + if [ << parameters.python-version >> == 3.9.12 ]; then + MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh" + fi + + # If a local installation of conda doesn't exist, we download and install conda + if [ ! -d "${WORKSPACE_DIR}/miniconda3" ]; then + mkdir -p "${WORKSPACE_DIR}" + curl --retry 3 ${MINICONDA_URL} -o "${WORKSPACE_DIR}"/miniconda3.sh + bash "${WORKSPACE_DIR}"/miniconda3.sh -b -p "${WORKSPACE_DIR}"/miniconda3 + fi + export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH" + # shellcheck disable=SC1091 + source "${WORKSPACE_DIR}"/miniconda3/bin/activate + + brew link --force libomp + + echo "export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${BASH_ENV}" + .jenkins/pytorch/macos-build.sh + + - when: + condition: << parameters.build-generates-artifacts >> + steps: + - run: + name: Archive artifacts into zip + command: | + zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json + cp artifacts.zip /Users/distiller/workspace + + - persist_to_workspace: + root: /Users/distiller/workspace/ + paths: + - miniconda3 + - artifacts.zip + + - store_artifacts: + path: /Users/distiller/project/artifacts.zip + + mac_test: + parameters: + build-environment: + type: string + shard-number: + type: string + num-test-shards: + type: string + xcode-version: + type: string + test-config: + type: string + default: 'default' + + macos: + xcode: << parameters.xcode-version >> + environment: + GIT_DEFAULT_BRANCH: 'master' + BUILD_ENVIRONMENT: << parameters.build-environment >> + TEST_CONFIG: << parameters.test-config >> + SHARD_NUMBER: << parameters.shard-number >> + NUM_TEST_SHARDS: << parameters.num-test-shards >> + PYTORCH_RETRY_TEST_CASES: 1 + PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1 + steps: + - checkout + - attach_workspace: + at: ~/workspace + - run_brew_for_macos_build + - run: + name: Test + no_output_timeout: "2h" + command: | + set -x + + git submodule sync --recursive + git submodule update --init --recursive + + mv ~/workspace/artifacts.zip . 
+ unzip artifacts.zip + + export IN_CI=1 + + COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}") + + export PATH="/usr/local/bin:$PATH" + export WORKSPACE_DIR="${HOME}/workspace" + mkdir -p "${WORKSPACE_DIR}" + + export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH" + source "${WORKSPACE_DIR}"/miniconda3/bin/activate + + # sanitize the input commit message and PR body here: + + # trim all new lines from commit messages to avoid issues with batch environment + # variable copying. see https://github.com/pytorch/pytorch/pull/80043#issuecomment-1167796028 + COMMIT_MESSAGES="${COMMIT_MESSAGES//[$'\n\r']}" + + # then trim all special characters like single and double quotes to avoid unescaped inputs to + # wreak havoc internally + export COMMIT_MESSAGES="${COMMIT_MESSAGES//[\'\"]}" + + python3 -mpip install dist/*.whl + .jenkins/pytorch/macos-test.sh + - run: + name: Copy files for uploading test stats + command: | + # copy into a parent folder test-reports because we can't use CIRCLEI_BUILD_NUM in path when persisting to workspace + mkdir -p test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports + cp -r test/test-reports test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports + - store_test_results: + path: test/test-reports + - persist_to_workspace: + root: /Users/distiller/project/ + paths: + - test-reports + + upload_test_stats: + machine: # executor type + image: ubuntu-2004:202010-01 # # recommended linux image - includes Ubuntu 20.04, docker 19.03.13, docker-compose 1.27.4 + steps: + - checkout + - attach_workspace: + at: ~/workspace + - run: + name: upload + command: | + set -ex + if [ -z ${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} ]; then + echo "No credentials found, cannot upload test stats (are you on a fork?)" + exit 0 + fi + cp -r ~/workspace/test-reports/* ~/project + pip3 install requests==2.26 rockset==0.8.3 boto3==1.19.12 six==1.16.0 + export AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} + export AWS_SECRET_ACCESS_KEY=${AWS_SECRET_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} + # i dont know how to get the run attempt number for reruns so default to 1 + python3 -m tools.stats.upload_test_stats --workflow-run-id "${CIRCLE_WORKFLOW_JOB_ID}" --workflow-run-attempt 1 --head-branch << pipeline.git.branch >> --circleci pytorch_macos_10_13_py3_test: environment: BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test @@ -795,10 +987,43 @@ jobs: macos: xcode: "12.5.1" steps: - - checkout + - run: + name: checkout with retry + command: | + checkout() { + set -ex + # Workaround old docker images with incorrect $HOME + # check https://github.com/docker/docker/issues/2968 for details + if [ "${HOME}" = "/" ] + then + export HOME=$(getent passwd $(id -un) | cut -d: -f6) + fi + + mkdir -p ~/.ssh + + echo 'github.com ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ== + ' >> ~/.ssh/known_hosts + + # use git+ssh instead of https + git config --global url."ssh://git@github.com".insteadOf "https://github.com" || true + git config --global gc.auto 0 || true + + echo 'Cloning git repository' + mkdir -p '/Users/distiller/project' + cd '/Users/distiller/project' + git clone "$CIRCLE_REPOSITORY_URL" . 
+ echo 'Checking out branch' + git checkout --force -B "$CIRCLE_BRANCH" "$CIRCLE_SHA1" + git --no-pager log --no-color -n 1 --format='HEAD is now at %h %s' + } + + retry () { + $* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*) + } + retry checkout - run_brew_for_ios_build - run: - name: Run Fastlane + name: Setup Fastlane no_output_timeout: "1h" command: | set -e @@ -806,20 +1031,6 @@ jobs: cd ${PROJ_ROOT}/ios/TestApp # install fastlane sudo gem install bundler && bundle install - # install certificates - echo ${IOS_CERT_KEY_2022} >> cert.txt - base64 --decode cert.txt -o Certificates.p12 - rm cert.txt - bundle exec fastlane install_root_cert - bundle exec fastlane install_dev_cert - # install the provisioning profile - PROFILE=PyTorch_CI_2022.mobileprovision - PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles - mkdir -pv "${PROVISIONING_PROFILES}" - cd "${PROVISIONING_PROFILES}" - echo ${IOS_SIGN_KEY_2022} >> cert.txt - base64 --decode cert.txt -o ${PROFILE} - rm cert.txt - run: name: Build no_output_timeout: "1h" @@ -877,18 +1088,12 @@ jobs: command: | set -e PROJ_ROOT=/Users/distiller/project - PROFILE=PyTorch_CI_2022 # run the ruby build script if ! [ -x "$(command -v xcodebuild)" ]; then echo 'Error: xcodebuild is not installed.' exit 1 fi - echo ${IOS_DEV_TEAM_ID} - if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then - ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} -c ${PROFILE} -t ${IOS_DEV_TEAM_ID} - else - ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} - fi + ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} if ! [ "$?" -eq "0" ]; then echo 'xcodebuild failed!' 
exit 1 @@ -911,12 +1116,13 @@ jobs: cd ${PROJ_ROOT}/ios/TestApp/benchmark mkdir -p ../models if [ ${USE_COREML_DELEGATE} == 1 ]; then - pip install coremltools==5.0b5 - pip install six + pip install coremltools==5.0b5 protobuf==3.20.1 six==1.16.0 python coreml_backend.py else - python trace_model.py + cd "${PROJ_ROOT}" + python test/mobile/model_test/gen_test_model.py ios-test fi + cd "${PROJ_ROOT}/ios/TestApp/benchmark" if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then echo "Setting up the TestApp for LiteInterpreter" ruby setup.rb --lite 1 @@ -924,10 +1130,10 @@ jobs: echo "Setting up the TestApp for Full JIT" ruby setup.rb fi - cd ${PROJ_ROOT}/ios/TestApp - instruments -s -devices - if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then - if [ ${USE_COREML_DELEGATE} == 1 ]; then + cd "${PROJ_ROOT}/ios/TestApp" + # instruments -s -devices + if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then + if [ "${USE_COREML_DELEGATE}" == 1 ]; then fastlane scan --only_testing TestAppTests/TestAppTests/testCoreML else fastlane scan --only_testing TestAppTests/TestAppTests/testLiteInterpreter @@ -1241,4 +1447,27 @@ workflows: branches: only: - postnightly + - pytorch_ios_build: + build_environment: ios-12-5-1-x86-64 + filters: + branches: + ignore: + - nightly + - postnightly + ios_arch: x86_64 + ios_platform: SIMULATOR + lite_interpreter: "1" + name: ios-12-5-1-x86-64 + - pytorch_ios_build: + build_environment: ios-12-5-1-x86-64-coreml + filters: + branches: + ignore: + - nightly + - postnightly + ios_arch: x86_64 + ios_platform: SIMULATOR + lite_interpreter: "1" + name: ios-12-5-1-x86-64-coreml + use_coreml: "1" when: << pipeline.parameters.run_build >> diff --git a/.circleci/docker/build.sh b/.circleci/docker/build.sh index 6eeee5f1ebaa..ebea9eda85a6 100755 --- a/.circleci/docker/build.sh +++ b/.circleci/docker/build.sh @@ -33,7 +33,7 @@ function extract_all_from_image_name() { if [ "x${name}" = xpy ]; then vername=ANACONDA_PYTHON_VERSION fi - # skip non-conforming fields such as "pytorch", "linux" or "xenial" without version string + # skip non-conforming fields such as "pytorch", "linux" or "bionic" without version string if [ -n "${name}" ]; then extract_version_from_image_name "${name}" "${vername}" fi @@ -46,11 +46,7 @@ if [[ "$image" == *xla* ]]; then exit 0 fi -if [[ "$image" == *-xenial* ]]; then - UBUNTU_VERSION=16.04 -elif [[ "$image" == *-artful* ]]; then - UBUNTU_VERSION=17.10 -elif [[ "$image" == *-bionic* ]]; then +if [[ "$image" == *-bionic* ]]; then UBUNTU_VERSION=18.04 elif [[ "$image" == *-focal* ]]; then UBUNTU_VERSION=20.04 @@ -79,56 +75,17 @@ elif [[ "$image" == *rocm* ]]; then DOCKERFILE="${OS}-rocm/Dockerfile" fi -if [[ "$image" == *xenial* ]] || [[ "$image" == *bionic* ]]; then - CMAKE_VERSION=3.13.5 -fi +# CMake 3.18 is needed to support CUDA17 language variant +CMAKE_VERSION=3.18.5 TRAVIS_DL_URL_PREFIX="https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64" -UCX_COMMIT=v1.13.x -UCC_COMMIT=a7bda274b10f8adf5bb729f01da064f4e735fb23 +_UCX_COMMIT=31e74cac7bee0ef66bef2af72e7d86d9c282e5ab +_UCC_COMMIT=1c7a7127186e7836f73aafbd7697bbc274a77eee # It's annoying to rename jobs every time you want to rewrite a # configuration, so we hardcode everything here rather than do it # from scratch case "$image" in - pytorch-linux-xenial-py3.8) - ANACONDA_PYTHON_VERSION=3.8 - GCC_VERSION=7 - # Do not install PROTOBUF, DB, and VISION as a test - ;; - pytorch-linux-xenial-py3.7-gcc7.2) - ANACONDA_PYTHON_VERSION=3.7 - GCC_VERSION=7 - # Do not install PROTOBUF, DB, and VISION as a test - ;; - 
pytorch-linux-xenial-py3.7-gcc7) - ANACONDA_PYTHON_VERSION=3.7 - GCC_VERSION=7 - PROTOBUF=yes - DB=yes - VISION=yes - ;; - pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7) - CUDA_VERSION=10.2 - CUDNN_VERSION=7 - ANACONDA_PYTHON_VERSION=3.7 - GCC_VERSION=7 - PROTOBUF=yes - DB=yes - VISION=yes - KATEX=yes - ;; - pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7) - CUDA_VERSION=11.3.0 # Deviating from major.minor to conform to nvidia's Docker image names - CUDNN_VERSION=8 - TENSORRT_VERSION=8.0.1.6 - ANACONDA_PYTHON_VERSION=3.7 - GCC_VERSION=7 - PROTOBUF=yes - DB=yes - VISION=yes - KATEX=yes - ;; pytorch-linux-bionic-cuda11.3-cudnn8-py3-clang9) CUDA_VERSION=11.3.0 # Deviating from major.minor to conform to nvidia's Docker image names CUDNN_VERSION=8 @@ -139,6 +96,7 @@ case "$image" in DB=yes VISION=yes KATEX=yes + CONDA_CMAKE=yes ;; pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7) CUDA_VERSION=11.6.2 @@ -149,8 +107,9 @@ case "$image" in DB=yes VISION=yes KATEX=yes - UCX_COMMIT=${UCX_COMMIT} - UCC_COMMIT=${UCC_COMMIT} + UCX_COMMIT=${_UCX_COMMIT} + UCC_COMMIT=${_UCC_COMMIT} + CONDA_CMAKE=yes ;; pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7) CUDA_VERSION=11.7.0 @@ -161,22 +120,9 @@ case "$image" in DB=yes VISION=yes KATEX=yes - UCX_COMMIT=${UCX_COMMIT} - UCC_COMMIT=${UCC_COMMIT} - ;; - pytorch-linux-xenial-py3-clang5-asan) - ANACONDA_PYTHON_VERSION=3.7 - CLANG_VERSION=5.0 - PROTOBUF=yes - DB=yes - VISION=yes - ;; - pytorch-linux-xenial-py3-clang7-asan) - ANACONDA_PYTHON_VERSION=3.7 - CLANG_VERSION=7 - PROTOBUF=yes - DB=yes - VISION=yes + UCX_COMMIT=${_UCX_COMMIT} + UCC_COMMIT=${_UCC_COMMIT} + CONDA_CMAKE=yes ;; pytorch-linux-focal-py3-clang7-asan) ANACONDA_PYTHON_VERSION=3.7 @@ -184,13 +130,7 @@ case "$image" in PROTOBUF=yes DB=yes VISION=yes - ;; - pytorch-linux-xenial-py3-clang7-onnx) - ANACONDA_PYTHON_VERSION=3.7 - CLANG_VERSION=7 - PROTOBUF=yes - DB=yes - VISION=yes + CONDA_CMAKE=yes ;; pytorch-linux-focal-py3-clang10-onnx) ANACONDA_PYTHON_VERSION=3.7 @@ -198,10 +138,11 @@ case "$image" in PROTOBUF=yes DB=yes VISION=yes + CONDA_CMAKE=yes ;; - pytorch-linux-xenial-py3-clang5-android-ndk-r19c) + pytorch-linux-focal-py3-clang7-android-ndk-r19c) ANACONDA_PYTHON_VERSION=3.7 - CLANG_VERSION=5.0 + CLANG_VERSION=7 LLVMDEV=yes PROTOBUF=yes ANDROID=yes @@ -209,13 +150,6 @@ case "$image" in GRADLE_VERSION=6.8.3 NINJA_VERSION=1.9.0 ;; - pytorch-linux-xenial-py3.7-clang7) - ANACONDA_PYTHON_VERSION=3.7 - CLANG_VERSION=7 - PROTOBUF=yes - DB=yes - VISION=yes - ;; pytorch-linux-bionic-py3.7-clang9) ANACONDA_PYTHON_VERSION=3.7 CLANG_VERSION=9 @@ -224,6 +158,7 @@ case "$image" in VISION=yes VULKAN_SDK_VERSION=1.2.162.1 SWIFTSHADER=yes + CONDA_CMAKE=yes ;; pytorch-linux-bionic-py3.8-gcc9) ANACONDA_PYTHON_VERSION=3.8 @@ -231,6 +166,7 @@ case "$image" in PROTOBUF=yes DB=yes VISION=yes + CONDA_CMAKE=yes ;; pytorch-linux-bionic-cuda10.2-cudnn7-py3.7-clang9) CUDA_VERSION=10.2 @@ -240,6 +176,7 @@ case "$image" in PROTOBUF=yes DB=yes VISION=yes + CONDA_CMAKE=yes ;; pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7) CUDA_VERSION=10.2 @@ -249,31 +186,34 @@ case "$image" in PROTOBUF=yes DB=yes VISION=yes + CONDA_CMAKE=yes ;; - pytorch-linux-focal-rocm5.1-py3.7) - ANACONDA_PYTHON_VERSION=3.7 + pytorch-linux-focal-rocm5.1-py3.8) + ANACONDA_PYTHON_VERSION=3.8 GCC_VERSION=9 PROTOBUF=yes DB=yes VISION=yes ROCM_VERSION=5.1.1 + CONDA_CMAKE=yes ;; - pytorch-linux-focal-rocm5.2-py3.7) - ANACONDA_PYTHON_VERSION=3.7 + pytorch-linux-focal-rocm5.2-py3.8) + ANACONDA_PYTHON_VERSION=3.8 GCC_VERSION=9 PROTOBUF=yes DB=yes VISION=yes 
ROCM_VERSION=5.2 + CONDA_CMAKE=yes ;; pytorch-linux-focal-py3.7-gcc7) ANACONDA_PYTHON_VERSION=3.7 - CMAKE_VERSION=3.16.9 # Required for precompiled header support GCC_VERSION=7 PROTOBUF=yes DB=yes VISION=yes KATEX=yes + CONDA_CMAKE=yes ;; pytorch-linux-jammy-cuda11.6-cudnn8-py3.8-clang12) ANACONDA_PYTHON_VERSION=3.8 @@ -283,8 +223,6 @@ case "$image" in PROTOBUF=yes DB=yes VISION=yes - UCX_COMMIT=${UCX_COMMIT} - UCC_COMMIT=${UCC_COMMIT} ;; pytorch-linux-jammy-cuda11.7-cudnn8-py3.8-clang12) ANACONDA_PYTHON_VERSION=3.8 @@ -294,8 +232,6 @@ case "$image" in PROTOBUF=yes DB=yes VISION=yes - UCX_COMMIT=${UCX_COMMIT} - UCC_COMMIT=${UCC_COMMIT} ;; *) # Catch-all for builds that are not hardcoded. @@ -312,6 +248,10 @@ case "$image" in fi if [[ "$image" == *rocm* ]]; then extract_version_from_image_name rocm ROCM_VERSION + NINJA_VERSION=1.9.0 + fi + if [[ "$image" == *centos7* ]]; then + NINJA_VERSION=1.10.2 fi if [[ "$image" == *gcc* ]]; then extract_version_from_image_name gcc GCC_VERSION @@ -383,10 +323,11 @@ docker build \ --build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \ --build-arg "KATEX=${KATEX:-}" \ --build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \ - --build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx900;gfx906}" \ + --build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx906}" \ --build-arg "IMAGE_NAME=${IMAGE_NAME}" \ --build-arg "UCX_COMMIT=${UCX_COMMIT}" \ --build-arg "UCC_COMMIT=${UCC_COMMIT}" \ + --build-arg "CONDA_CMAKE=${CONDA_CMAKE}" \ -f $(dirname ${DOCKERFILE})/Dockerfile \ -t "$tmp_tag" \ "$@" \ diff --git a/.circleci/docker/centos-rocm/Dockerfile b/.circleci/docker/centos-rocm/Dockerfile index 7c7708d416fe..894f39fe471c 100644 --- a/.circleci/docker/centos-rocm/Dockerfile +++ b/.circleci/docker/centos-rocm/Dockerfile @@ -40,6 +40,7 @@ RUN bash ./install_user.sh && rm install_user.sh # Install conda and other packages (e.g., numpy, pytest) ENV PATH /opt/conda/bin:$PATH ARG ANACONDA_PYTHON_VERSION +ARG CONDA_CMAKE COPY requirements-ci.txt /opt/conda/requirements-ci.txt COPY ./common/install_conda.sh install_conda.sh RUN bash ./install_conda.sh && rm install_conda.sh @@ -71,6 +72,9 @@ ARG ROCM_VERSION COPY ./common/install_rocm.sh install_rocm.sh RUN bash ./install_rocm.sh RUN rm install_rocm.sh +COPY ./common/install_rocm_magma.sh install_rocm_magma.sh +RUN bash ./install_rocm_magma.sh +RUN rm install_rocm_magma.sh ENV PATH /opt/rocm/bin:$PATH ENV PATH /opt/rocm/hcc/bin:$PATH ENV PATH /opt/rocm/hip/bin:$PATH diff --git a/.circleci/docker/common/install_base.sh b/.circleci/docker/common/install_base.sh index 6724031c0a44..84835d6de50d 100755 --- a/.circleci/docker/common/install_base.sh +++ b/.circleci/docker/common/install_base.sh @@ -68,7 +68,10 @@ install_ubuntu() { sudo \ vim \ jq \ - libtool + libtool \ + vim \ + unzip \ + gdb # Should resolve issues related to various apt package repository cert issues # see: https://github.com/pytorch/pytorch/issues/65931 @@ -126,7 +129,9 @@ install_centos() { opencv-devel \ sudo \ wget \ - vim + vim \ + unzip \ + gdb # Cleanup yum clean all diff --git a/.circleci/docker/common/install_conda.sh b/.circleci/docker/common/install_conda.sh index 49afcb5aef42..84f9538ce124 100755 --- a/.circleci/docker/common/install_conda.sh +++ b/.circleci/docker/common/install_conda.sh @@ -55,8 +55,10 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then # Ensure we run conda in a directory that jenkins has write access to pushd /opt/conda - # Track latest conda update - as_jenkins conda update -y -n base conda + # Prevent conda from updating to 4.14.0, which causes 
docker build failures + # See https://hud.pytorch.org/pytorch/pytorch/commit/754d7f05b6841e555cea5a4b2c505dd9e0baec1d + # Uncomment the below when resolved to track the latest conda update + # as_jenkins conda update -y -n base conda # Install correct Python version as_jenkins conda install -y python="$ANACONDA_PYTHON_VERSION" @@ -73,8 +75,6 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then } # Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README - # DO NOT install cmake here as it would install a version newer than 3.13, but - # we want to pin to version 3.13. CONDA_COMMON_DEPS="astunparse pyyaml mkl=2022.0.1 mkl-include=2022.0.1 setuptools cffi future six" if [ "$ANACONDA_PYTHON_VERSION" = "3.10" ]; then # Install llvm-8 as it is required to compile llvmlite-0.30.0 from source @@ -90,15 +90,20 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then conda_install numpy=1.18.5 ${CONDA_COMMON_DEPS} typing_extensions fi + # Use conda cmake in some cases. Conda cmake will be newer than our supported + # min version (3.5 for xenial and 3.10 for bionic), so we only do it in those + # following builds that we know should use conda. Specifically, Ubuntu bionic + # and focal cannot find conda mkl with stock cmake, so we need a cmake from conda + if [ -n "${CONDA_CMAKE}" ]; then + conda_install cmake + fi + # Magma package names are concatenation of CUDA major and minor ignoring revision # I.e. magma-cuda102 package corresponds to CUDA_VERSION=10.2 and CUDA_VERSION=10.2.89 if [ -n "$CUDA_VERSION" ]; then conda_install magma-cuda$(TMP=${CUDA_VERSION/./};echo ${TMP%.*[0-9]}) -c pytorch fi - # TODO: This isn't working atm - conda_install nnpack -c killeent - # Install some other packages, including those needed for Python test reporting pip_install -r /opt/conda/requirements-ci.txt diff --git a/.circleci/docker/common/install_cudnn.sh b/.circleci/docker/common/install_cudnn.sh index 1f1c34ea200d..f68fc6946c2e 100644 --- a/.circleci/docker/common/install_cudnn.sh +++ b/.circleci/docker/common/install_cudnn.sh @@ -4,7 +4,13 @@ if [[ ${CUDNN_VERSION} == 8 ]]; then # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement mkdir tmp_cudnn && cd tmp_cudnn CUDNN_NAME="cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive" - curl -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/${CUDNN_NAME}.tar.xz + if [[ ${CUDA_VERSION:0:4} == "11.7" ]]; then + CUDNN_NAME="cudnn-linux-x86_64-8.5.0.96_cuda11-archive" + curl --retry 3 -OLs https://ossci-linux.s3.amazonaws.com/${CUDNN_NAME}.tar.xz + else + curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/${CUDNN_NAME}.tar.xz + fi + tar xf ${CUDNN_NAME}.tar.xz cp -a ${CUDNN_NAME}/include/* /usr/include/ cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/ diff --git a/.circleci/docker/common/install_docs_reqs.sh b/.circleci/docker/common/install_docs_reqs.sh index 1adc9e8009a0..e60171208ae1 100644 --- a/.circleci/docker/common/install_docs_reqs.sh +++ b/.circleci/docker/common/install_docs_reqs.sh @@ -7,10 +7,10 @@ if [ -n "$KATEX" ]; then # Ignore error if gpg-agent doesn't exist (for Ubuntu 16.04) apt-get install -y gpg-agent || : - curl -sL https://deb.nodesource.com/setup_12.x | sudo -E bash - + curl --retry 3 -sL https://deb.nodesource.com/setup_12.x | sudo -E bash - sudo apt-get install -y nodejs - curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add - + curl --retry 3 -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add - echo 
"deb https://dl.yarnpkg.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/yarn.list apt-get update diff --git a/.circleci/docker/common/install_protobuf.sh b/.circleci/docker/common/install_protobuf.sh index 9d9f6c40ba0c..4b7a7a6ac23f 100755 --- a/.circleci/docker/common/install_protobuf.sh +++ b/.circleci/docker/common/install_protobuf.sh @@ -12,7 +12,7 @@ install_protobuf_317() { # g++: error: ./../lib64/crti.o: No such file or directory ln -s /usr/lib64 "$pb_dir/lib64" - curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" + curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3 tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz # -j6 to balance memory usage and speed. # naked `-j` seems to use too much memory. diff --git a/.circleci/docker/common/install_rocm.sh b/.circleci/docker/common/install_rocm.sh index ceebd7d60671..7ad0c4f123e1 100644 --- a/.circleci/docker/common/install_rocm.sh +++ b/.circleci/docker/common/install_rocm.sh @@ -2,34 +2,6 @@ set -ex -install_magma() { - # "install" hipMAGMA into /opt/rocm/magma by copying after build - git clone https://bitbucket.org/icl/magma.git - pushd magma - # Fixes memory leaks of magma found while executing linalg UTs - git checkout 5959b8783e45f1809812ed96ae762f38ee701972 - cp make.inc-examples/make.inc.hip-gcc-mkl make.inc - echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc - echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc - echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc - export PATH="${PATH}:/opt/rocm/bin" - if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then - amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'` - else - amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs` - fi - for arch in $amdgpu_targets; do - echo "DEVCCFLAGS += --amdgpu-target=$arch" >> make.inc - done - # hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition - sed -i 's/^FOPENMP/#FOPENMP/g' make.inc - make -f make.gen.hipMAGMA -j $(nproc) - LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda - make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda - popd - mv magma /opt/rocm -} - ver() { printf "%3d%03d%03d%03d" $(echo "$1" | tr '.' 
' '); } @@ -57,7 +29,12 @@ install_ubuntu() { if [[ $(ver $ROCM_VERSION) -ge $(ver 4.5) ]]; then # Add amdgpu repository UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'` - local amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/ubuntu" + local amdgpu_baseurl + if [[ $(ver $ROCM_VERSION) -ge $(ver 5.3) ]]; then + amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu" + else + amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/ubuntu" + fi echo "deb [arch=amd64] ${amdgpu_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list fi @@ -66,6 +43,10 @@ install_ubuntu() { ROCM_REPO="xenial" fi + if [[ $(ver $ROCM_VERSION) -ge $(ver 5.3) ]]; then + ROCM_REPO="${UBUNTU_VERSION_NAME}" + fi + # Add rocm repository wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add - local rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}" @@ -89,8 +70,6 @@ install_ubuntu() { DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated ${MIOPENKERNELS} fi - install_magma - # Cleanup apt-get autoclean && apt-get clean rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* @@ -108,7 +87,16 @@ install_centos() { if [[ $(ver $ROCM_VERSION) -ge $(ver 4.5) ]]; then # Add amdgpu repository - local amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/rhel/7.9/main/x86_64" + local amdgpu_baseurl + if [[ $OS_VERSION == 9 ]]; then + amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/rhel/9.0/main/x86_64" + else + if [[ $(ver $ROCM_VERSION) -ge $(ver 5.3) ]]; then + amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/rhel/7.9/main/x86_64" + else + amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/rhel/7.9/main/x86_64" + fi + fi echo "[AMDGPU]" > /etc/yum.repos.d/amdgpu.repo echo "name=AMDGPU" >> /etc/yum.repos.d/amdgpu.repo echo "baseurl=${amdgpu_baseurl}" >> /etc/yum.repos.d/amdgpu.repo @@ -135,8 +123,6 @@ install_centos() { rocprofiler-dev \ roctracer-dev - install_magma - # Cleanup yum clean all rm -rf /var/cache/yum diff --git a/.circleci/docker/common/install_rocm_magma.sh b/.circleci/docker/common/install_rocm_magma.sh new file mode 100644 index 000000000000..c7b116b93868 --- /dev/null +++ b/.circleci/docker/common/install_rocm_magma.sh @@ -0,0 +1,29 @@ +#!/bin/bash + +set -ex + +# "install" hipMAGMA into /opt/rocm/magma by copying after build +git clone https://bitbucket.org/icl/magma.git +pushd magma +# Fixes memory leaks of magma found while executing linalg UTs +git checkout 5959b8783e45f1809812ed96ae762f38ee701972 +cp make.inc-examples/make.inc.hip-gcc-mkl make.inc +echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc +echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc +echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc +export PATH="${PATH}:/opt/rocm/bin" +if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then + amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'` +else + amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs` +fi +for arch in $amdgpu_targets; do + echo "DEVCCFLAGS += --amdgpu-target=$arch" >> make.inc +done +# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition +sed -i 's/^FOPENMP/#FOPENMP/g' make.inc +make -f make.gen.hipMAGMA -j $(nproc) +LANG=C.UTF-8 
make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda +make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda +popd +mv magma /opt/rocm diff --git a/.circleci/docker/common/install_ucc.sh b/.circleci/docker/common/install_ucc.sh index a7b90286a0fb..333e44e6f779 100755 --- a/.circleci/docker/common/install_ucc.sh +++ b/.circleci/docker/common/install_ucc.sh @@ -2,6 +2,12 @@ set -ex +if [[ -d "/usr/local/cuda/" ]]; then + with_cuda=/usr/local/cuda/ +else + with_cuda=no +fi + function install_ucx() { set -ex git clone --recursive https://github.com/openucx/ucx.git @@ -12,6 +18,7 @@ function install_ucx() { ./autogen.sh ./configure --prefix=$UCX_HOME \ --enable-mt \ + --with-cuda=$with_cuda \ --enable-profiling \ --enable-stats time make -j @@ -29,7 +36,7 @@ function install_ucc() { git submodule update --init --recursive ./autogen.sh - ./configure --prefix=$UCC_HOME --with-ucx=$UCX_HOME --with-nccl=no + ./configure --prefix=$UCC_HOME --with-ucx=$UCX_HOME --with-cuda=$with_cuda time make -j sudo make install diff --git a/.circleci/docker/requirements-ci.txt b/.circleci/docker/requirements-ci.txt index 8b18a1745808..e527d29d4989 100644 --- a/.circleci/docker/requirements-ci.txt +++ b/.circleci/docker/requirements-ci.txt @@ -124,12 +124,17 @@ numba==0.55.2 ; python_version == "3.10" #Pinned versions: 1.9.0 #test that import: +opt-einsum==3.3 +#Description: Python library to optimize tensor contraction order, used in einsum +#Pinned versions: 3.3 +#test that import: test_linalg.py + #pillow #Description: Python Imaging Library fork #Pinned versions: #test that import: -protobuf==3.20.1 +protobuf==3.20.2 #Description: Google’s data interchange format #Pinned versions: 3.20.1 #test that import: test_tensorboard.py @@ -149,8 +154,18 @@ pytest-xdist #Pinned versions: #test that import: +pytest-shard +#Description: plugin spliting up tests in pytest +#Pinned versions: +#test that import: + +pytest-flakefinder==1.1.0 +#Description: plugin for rerunning tests a fixed number of times in pytest +#Pinned versions: 1.1.0 +#test that import: + pytest-rerunfailures -#Description: plugin for rerunning tests in pytest +#Description: plugin for rerunning failure tests in pytest #Pinned versions: #test that import: @@ -164,11 +179,16 @@ pytest-rerunfailures #Pinned versions: #test that import: -#xdoctest +xdoctest==1.0.2 #Description: runs doctests in pytest -#Pinned versions: +#Pinned versions: 1.0.2 #test that import: +pygments==2.12.0 +#Description: support doctest highlighting +#Pinned versions: 2.12.0 +#test that import: the doctests + #PyYAML #Description: data serialization format #Pinned versions: diff --git a/.circleci/docker/ubuntu-cuda/Dockerfile b/.circleci/docker/ubuntu-cuda/Dockerfile index a3a623996ad0..307071c8f4fc 100644 --- a/.circleci/docker/ubuntu-cuda/Dockerfile +++ b/.circleci/docker/ubuntu-cuda/Dockerfile @@ -26,6 +26,7 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh # Install conda and other packages (e.g., numpy, pytest) ENV PATH /opt/conda/bin:$PATH ARG ANACONDA_PYTHON_VERSION +ARG CONDA_CMAKE COPY requirements-ci.txt /opt/conda/requirements-ci.txt COPY ./common/install_conda.sh install_conda.sh RUN bash ./install_conda.sh && rm install_conda.sh @@ -118,9 +119,14 @@ COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm # Install CUDNN ARG CUDNN_VERSION +ARG CUDA_VERSION COPY ./common/install_cudnn.sh install_cudnn.sh RUN if [ "${CUDNN_VERSION}" -eq 8 ]; then bash install_cudnn.sh; fi RUN rm install_cudnn.sh +# Delete /usr/local/cuda-11.X/cuda-11.X symlinks +RUN if [ -h 
/usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi +RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi + USER jenkins CMD ["bash"] diff --git a/.circleci/docker/ubuntu-rocm/Dockerfile b/.circleci/docker/ubuntu-rocm/Dockerfile index a994b2e52f23..b9c8feab06cf 100644 --- a/.circleci/docker/ubuntu-rocm/Dockerfile +++ b/.circleci/docker/ubuntu-rocm/Dockerfile @@ -28,6 +28,7 @@ RUN bash ./install_user.sh && rm install_user.sh # Install conda and other packages (e.g., numpy, pytest) ENV PATH /opt/conda/bin:$PATH ARG ANACONDA_PYTHON_VERSION +ARG CONDA_CMAKE COPY requirements-ci.txt /opt/conda/requirements-ci.txt COPY ./common/install_conda.sh install_conda.sh RUN bash ./install_conda.sh && rm install_conda.sh @@ -64,6 +65,9 @@ ARG ROCM_VERSION COPY ./common/install_rocm.sh install_rocm.sh RUN bash ./install_rocm.sh RUN rm install_rocm.sh +COPY ./common/install_rocm_magma.sh install_rocm_magma.sh +RUN bash ./install_rocm_magma.sh +RUN rm install_rocm_magma.sh ENV PATH /opt/rocm/bin:$PATH ENV PATH /opt/rocm/hcc/bin:$PATH ENV PATH /opt/rocm/hip/bin:$PATH diff --git a/.circleci/docker/ubuntu/Dockerfile b/.circleci/docker/ubuntu/Dockerfile index e86baf0d6690..5f41ed53f954 100644 --- a/.circleci/docker/ubuntu/Dockerfile +++ b/.circleci/docker/ubuntu/Dockerfile @@ -37,6 +37,7 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh # Install conda and other packages (e.g., numpy, pytest) ENV PATH /opt/conda/bin:$PATH ARG ANACONDA_PYTHON_VERSION +ARG CONDA_CMAKE COPY requirements-ci.txt /opt/conda/requirements-ci.txt COPY ./common/install_conda.sh install_conda.sh RUN bash ./install_conda.sh && rm install_conda.sh diff --git a/.circleci/generate_config_yml.py b/.circleci/generate_config_yml.py index e068dd98fd8e..9366f59c465c 100755 --- a/.circleci/generate_config_yml.py +++ b/.circleci/generate_config_yml.py @@ -14,6 +14,7 @@ import cimodel.data.simple.mobile_definitions import cimodel.data.simple.nightly_ios import cimodel.data.simple.anaconda_prune_defintions +import cimodel.data.simple.ios_definitions import cimodel.lib.miniutils as miniutils import cimodel.lib.miniyaml as miniyaml @@ -70,6 +71,7 @@ def write(self, output_filehandle): for line in filter(None, lines): output_filehandle.write(line + "\n") + def _for_all_items(items, functor) -> None: if isinstance(items, list): for item in items: @@ -78,6 +80,7 @@ def _for_all_items(items, functor) -> None: item_type, item = next(iter(items.items())) functor(item_type, item) + def filter_master_only_jobs(items): def _is_main_or_master_item(item): filters = item.get('filters', None) @@ -116,6 +119,7 @@ def _do_filtering(items): _for_all_items(items, _save_requires_if_master) return _do_filtering(items) + def generate_required_docker_images(items): required_docker_images = set() @@ -131,11 +135,13 @@ def _requires_docker_image(item_type, item): _for_all_items(items, _requires_docker_image) return required_docker_images + def gen_build_workflows_tree(): build_workflows_functions = [ cimodel.data.simple.mobile_definitions.get_workflow_jobs, cimodel.data.simple.nightly_ios.get_workflow_jobs, cimodel.data.simple.anaconda_prune_defintions.get_workflow_jobs, + cimodel.data.simple.ios_definitions.get_workflow_jobs, ] build_jobs = [f() for f in build_workflows_functions] build_jobs.extend( diff --git a/.circleci/scripts/binary_install_miniconda.sh b/.circleci/scripts/binary_install_miniconda.sh index 43eb006742ae..3541a32ac6bf 100755 --- a/.circleci/scripts/binary_install_miniconda.sh +++ 
b/.circleci/scripts/binary_install_miniconda.sh @@ -31,9 +31,9 @@ fi conda_sh="$workdir/install_miniconda.sh" if [[ "$(uname)" == Darwin ]]; then - curl --retry 3 -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh else - curl --retry 3 -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh + curl --retry 3 --retry-all-errors -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh fi chmod +x "$conda_sh" "$conda_sh" -b -p "$MINICONDA_ROOT" diff --git a/.circleci/scripts/binary_ios_build.sh b/.circleci/scripts/binary_ios_build.sh index 6c7674ed510e..4bb5ea28af73 100644 --- a/.circleci/scripts/binary_ios_build.sh +++ b/.circleci/scripts/binary_ios_build.sh @@ -8,7 +8,7 @@ PROJ_ROOT=/Users/distiller/project export TCLLIBPATH="/usr/local/lib" # Install conda -curl --retry 3 -o ~/conda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh +curl --retry 3 --retry-all-errors -o ~/conda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x ~/conda.sh /bin/bash ~/conda.sh -b -p ~/anaconda export PATH="~/anaconda/bin:${PATH}" diff --git a/.circleci/scripts/binary_ios_test.sh b/.circleci/scripts/binary_ios_test.sh index 3f052175235c..c750dbceca87 100644 --- a/.circleci/scripts/binary_ios_test.sh +++ b/.circleci/scripts/binary_ios_test.sh @@ -1,30 +1,19 @@ #!/bin/bash set -ex -o pipefail +if ! [ "$IOS_PLATFORM" == "SIMULATOR" ]; then + exit 0 +fi + echo "" echo "DIR: $(pwd)" PROJ_ROOT=/Users/distiller/project cd ${PROJ_ROOT}/ios/TestApp # install fastlane sudo gem install bundler && bundle install -# install certificates -echo "${IOS_CERT_KEY_2022}" >> cert.txt -base64 --decode cert.txt -o Certificates.p12 -rm cert.txt -bundle exec fastlane install_root_cert -bundle exec fastlane install_dev_cert -# install the provisioning profile -PROFILE=PyTorch_CI_2022.mobileprovision -PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles -mkdir -pv "${PROVISIONING_PROFILES}" -cd "${PROVISIONING_PROFILES}" -echo "${IOS_SIGN_KEY_2022}" >> cert.txt -base64 --decode cert.txt -o ${PROFILE} -rm cert.txt # run the ruby build script if ! [ -x "$(command -v xcodebuild)" ]; then echo 'Error: xcodebuild is not installed.' 
exit 1 fi -PROFILE=PyTorch_CI_2022 -ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} -c ${PROFILE} -t ${IOS_DEV_TEAM_ID} +ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} diff --git a/.circleci/scripts/binary_ios_upload.sh b/.circleci/scripts/binary_ios_upload.sh index 02037da8e07b..7949dc9170b0 100644 --- a/.circleci/scripts/binary_ios_upload.sh +++ b/.circleci/scripts/binary_ios_upload.sh @@ -33,7 +33,7 @@ fi cp ${PROJ_ROOT}/LICENSE ${ZIP_DIR}/ # zip the library export DATE="$(date -u +%Y%m%d)" -export IOS_NIGHTLY_BUILD_VERSION="1.13.0.${DATE}" +export IOS_NIGHTLY_BUILD_VERSION="1.14.0.${DATE}" if [ "${BUILD_LITE_INTERPRETER}" == "1" ]; then # libtorch_lite_ios_nightly_1.11.0.20210810.zip ZIPFILE="libtorch_lite_ios_nightly_${IOS_NIGHTLY_BUILD_VERSION}.zip" @@ -47,7 +47,7 @@ echo "${IOS_NIGHTLY_BUILD_VERSION}" > version.txt zip -r ${ZIPFILE} install src version.txt LICENSE # upload to aws # Install conda then 'conda install' awscli -curl --retry 3 -o ~/conda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh +curl --retry 3 --retry-all-errors -o ~/conda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x ~/conda.sh /bin/bash ~/conda.sh -b -p ~/anaconda export PATH="~/anaconda/bin:${PATH}" diff --git a/.circleci/scripts/binary_populate_env.sh b/.circleci/scripts/binary_populate_env.sh index 56c4f556adbb..3294c72024aa 100755 --- a/.circleci/scripts/binary_populate_env.sh +++ b/.circleci/scripts/binary_populate_env.sh @@ -59,7 +59,7 @@ PIP_UPLOAD_FOLDER='nightly/' # We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it export DATE="$(date -u +%Y%m%d)" #TODO: We should be pulling semver version from the base version.txt -BASE_BUILD_VERSION="1.13.0.dev$DATE" +BASE_BUILD_VERSION="1.14.0.dev$DATE" # Change BASE_BUILD_VERSION to git tag when on a git tag # Use 'git -C' to make doubly sure we're in the correct directory for checking # the git tag @@ -76,6 +76,11 @@ if [[ "$(uname)" == 'Darwin' ]] || [[ "$PACKAGE_TYPE" == conda ]]; then else export PYTORCH_BUILD_VERSION="${BASE_BUILD_VERSION}+$DESIRED_CUDA" fi + +if [[ -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then + export PYTORCH_BUILD_VERSION="${PYTORCH_BUILD_VERSION}-with-pypi-cudnn" +fi + export PYTORCH_BUILD_NUMBER=1 @@ -124,9 +129,9 @@ if [[ "${OSTYPE}" == "msys" ]]; then else export DESIRED_DEVTOOLSET="${DESIRED_DEVTOOLSET:-}" fi - +export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" export DATE="$DATE" -export NIGHTLIES_DATE_PREAMBLE=1.13.0.dev +export NIGHTLIES_DATE_PREAMBLE=1.14.0.dev export PYTORCH_BUILD_VERSION="$PYTORCH_BUILD_VERSION" export PYTORCH_BUILD_NUMBER="$PYTORCH_BUILD_NUMBER" export OVERRIDE_PACKAGE_VERSION="$PYTORCH_BUILD_VERSION" diff --git a/.circleci/scripts/binary_upload.sh b/.circleci/scripts/binary_upload.sh index 2c7cf95a963a..74f238bea528 100755 --- a/.circleci/scripts/binary_upload.sh +++ b/.circleci/scripts/binary_upload.sh @@ -14,6 +14,12 @@ UPLOAD_CHANNEL=${UPLOAD_CHANNEL:-nightly} UPLOAD_SUBFOLDER=${UPLOAD_SUBFOLDER:-cpu} UPLOAD_BUCKET="s3://pytorch" BACKUP_BUCKET="s3://pytorch-backup" +BUILD_NAME=${BUILD_NAME:-} + +# this is temporary change to upload pypi-cudnn builds to separate folder +if [[ ${BUILD_NAME} == *with-pypi-cudnn* ]]; then + UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_pypi_cudnn" +fi DRY_RUN=${DRY_RUN:-enabled} # 
Don't actually do work unless explicit @@ -24,6 +30,11 @@ if [[ "${DRY_RUN}" = "disabled" ]]; then AWS_S3_CP="aws s3 cp" fi +# Sleep 2 minutes between retries for conda upload +retry () { + "$@" || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@") +} + do_backup() { local backup_dir backup_dir=$1 @@ -37,13 +48,14 @@ do_backup() { conda_upload() { ( set -x + retry \ ${ANACONDA} \ - upload \ - ${PKG_DIR}/*.tar.bz2 \ - -u "pytorch-${UPLOAD_CHANNEL}" \ - --label main \ - --no-progress \ - --force + upload \ + ${PKG_DIR}/*.tar.bz2 \ + -u "pytorch-${UPLOAD_CHANNEL}" \ + --label main \ + --no-progress \ + --force ) } diff --git a/.circleci/scripts/driver_update.bat b/.circleci/scripts/driver_update.bat index 46c05475cdba..fb8774366621 100644 --- a/.circleci/scripts/driver_update.bat +++ b/.circleci/scripts/driver_update.bat @@ -1,5 +1,5 @@ set "DRIVER_DOWNLOAD_LINK=https://s3.amazonaws.com/ossci-windows/452.39-data-center-tesla-desktop-win10-64bit-international.exe" -curl --retry 3 -kL %DRIVER_DOWNLOAD_LINK% --output 452.39-data-center-tesla-desktop-win10-64bit-international.exe +curl --retry 3 --retry-all-errors -kL %DRIVER_DOWNLOAD_LINK% --output 452.39-data-center-tesla-desktop-win10-64bit-international.exe if errorlevel 1 exit /b 1 start /wait 452.39-data-center-tesla-desktop-win10-64bit-international.exe -s -noreboot diff --git a/.circleci/scripts/functorch_doc_push_script.sh b/.circleci/scripts/functorch_doc_push_script.sh new file mode 100755 index 000000000000..aed2a1c451b9 --- /dev/null +++ b/.circleci/scripts/functorch_doc_push_script.sh @@ -0,0 +1,47 @@ +#!/bin/bash +# =================== The following code **should** be executed inside Docker container =================== + +# Install dependencies +sudo apt-get -y update +sudo apt-get -y install expect-dev + +# This is where the local pytorch install in the docker image is located +pt_checkout="/var/lib/jenkins/workspace" +source "$pt_checkout/.jenkins/pytorch/common_utils.sh" +echo "functorch_doc_push_script.sh: Invoked with $*" + +set -ex + +version=${DOCS_VERSION:-nightly} +echo "version: $version" + +# Build functorch docs +pushd $pt_checkout/functorch/docs +pip -q install -r requirements.txt +make html +popd + +git clone https://github.com/pytorch/functorch -b gh-pages --depth 1 functorch_ghpages +pushd functorch_ghpages + +if [ $version == "master" ]; then + version=nightly +fi + +git rm -rf "$version" || true +mv "$pt_checkout/functorch/docs/build/html" "$version" + +git add "$version" || true +git status +git config user.email "soumith+bot@pytorch.org" +git config user.name "pytorchbot" +# If there aren't changes, don't make a commit; push is no-op +git commit -m "Generate Python docs from pytorch/pytorch@${GITHUB_SHA}" || true +git status + +if [[ "${WITH_PUSH:-}" == true ]]; then + git push -u origin gh-pages +fi + +popd +# =================== The above code **should** be executed inside Docker container =================== diff --git a/.circleci/scripts/python_doc_push_script.sh b/.circleci/scripts/python_doc_push_script.sh index f9b019ec069b..d255f77c82e8 100755 --- a/.circleci/scripts/python_doc_push_script.sh +++ b/.circleci/scripts/python_doc_push_script.sh @@ -135,6 +135,9 @@ git commit -m "Generate Python docs from pytorch/pytorch@${GITHUB_SHA}" || true git status if [[ "${WITH_PUSH:-}" == true ]]; then + # push to a temp branch first to trigger CLA check and satisfy branch protections + git push -u origin HEAD:pytorchbot/temp-branch-py -f + sleep 30 git push -u origin "${branch}" fi 
diff --git a/.circleci/scripts/setup_ci_environment.sh b/.circleci/scripts/setup_ci_environment.sh index 8ac4f5b43a9a..42a605cd4445 100755 --- a/.circleci/scripts/setup_ci_environment.sh +++ b/.circleci/scripts/setup_ci_environment.sh @@ -32,7 +32,7 @@ if ! command -v aws >/dev/null; then fi if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then - DRIVER_FN="NVIDIA-Linux-x86_64-515.57.run" + DRIVER_FN="NVIDIA-Linux-x86_64-515.76.run" wget "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN" sudo /bin/bash "$DRIVER_FN" -s --no-drm || (sudo cat /var/log/nvidia-installer.log && false) nvidia-smi @@ -40,8 +40,8 @@ if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then # Taken directly from https://github.com/NVIDIA/nvidia-docker # Add the package repositories distribution=$(. /etc/os-release;echo "$ID$VERSION_ID") - curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - - curl -s -L "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list" | sudo tee /etc/apt/sources.list.d/nvidia-docker.list + curl -s -L --retry 3 --retry-all-errors https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - + curl -s -L --retry 3 --retry-all-errors "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list" | sudo tee /etc/apt/sources.list.d/nvidia-docker.list retry sudo apt-get update -qq # Necessary to get the `--gpus` flag to function within docker diff --git a/.circleci/scripts/setup_linux_system_environment.sh b/.circleci/scripts/setup_linux_system_environment.sh index ce64076e2d64..780f7c1bd379 100755 --- a/.circleci/scripts/setup_linux_system_environment.sh +++ b/.circleci/scripts/setup_linux_system_environment.sh @@ -2,7 +2,7 @@ set -eux -o pipefail # Set up CircleCI GPG keys for apt, if needed -curl --retry 3 -s -L https://packagecloud.io/circleci/trusty/gpgkey | sudo apt-key add - +curl --retry 3 --retry-all-errors -s -L https://packagecloud.io/circleci/trusty/gpgkey | sudo apt-key add - # Stop background apt updates. Hypothetically, the kill should not # be necessary, because stop is supposed to send a kill signal to diff --git a/.circleci/scripts/vs_install.ps1 b/.circleci/scripts/vs_install.ps1 index a2e373078adb..4bbbc24bb043 100644 --- a/.circleci/scripts/vs_install.ps1 +++ b/.circleci/scripts/vs_install.ps1 @@ -29,7 +29,7 @@ if (Test-Path "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswher } echo "Downloading VS installer from S3." 
-curl.exe --retry 3 -kL $VS_DOWNLOAD_LINK --output vs_installer.exe +curl.exe --retry 3 --retry-all-errors -kL $VS_DOWNLOAD_LINK --output vs_installer.exe if ($LASTEXITCODE -ne 0) { echo "Download of the VS 2019 Version ${env:VS_VERSION} installer failed" exit 1 diff --git a/.circleci/scripts/vs_install_cmath.ps1 b/.circleci/scripts/vs_install_cmath.ps1 index c2998eba2521..62b637ec21b8 100644 --- a/.circleci/scripts/vs_install_cmath.ps1 +++ b/.circleci/scripts/vs_install_cmath.ps1 @@ -1,5 +1,5 @@ $CMATH_DOWNLOAD_LINK = "https://raw.githubusercontent.com/microsoft/STL/12c684bba78f9b032050526abdebf14f58ca26a3/stl/inc/cmath" $VC14_28_INSTALL_PATH="C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\include" -curl.exe --retry 3 -kL $CMATH_DOWNLOAD_LINK --output "$home\cmath" +curl.exe --retry 3 --retry-all-errors -kL $CMATH_DOWNLOAD_LINK --output "$home\cmath" Move-Item -Path "$home\cmath" -Destination "$VC14_28_INSTALL_PATH" -Force diff --git a/.circleci/scripts/windows_cudnn_install.sh b/.circleci/scripts/windows_cudnn_install.sh index 763bc950fc4b..bbf45a3290b3 100644 --- a/.circleci/scripts/windows_cudnn_install.sh +++ b/.circleci/scripts/windows_cudnn_install.sh @@ -18,7 +18,7 @@ case ${CUDA_VERSION} in ;; 11.7) # Use cudnn8.3 with hard-coded cuda11.5 version - cudnn_file_name="cudnn-windows-x86_64-8.3.2.44_cuda11.5-archive" + cudnn_file_name="cudnn-windows-x86_64-8.5.0.96_cuda11-archive" ;; *) echo "CUDA_VERSION: ${CUDA_VERSION} not supported yet" @@ -36,7 +36,7 @@ else tmp_dir=$(mktemp -d) ( pushd "${tmp_dir}" - curl --retry 3 -o "${cudnn_installer_name}" "$cudnn_installer_link" + curl --retry 3 --retry-all-errors -o "${cudnn_installer_name}" "$cudnn_installer_link" 7z x "${cudnn_installer_name}" -ocudnn # Use '${var:?}/*' to avoid potentially expanding to '/*' # Remove all of the directories before attempting to copy files diff --git a/.circleci/verbatim-sources/job-specs/job-specs-custom.yml b/.circleci/verbatim-sources/job-specs/job-specs-custom.yml index 180ea014db6d..7d5f7f686512 100644 --- a/.circleci/verbatim-sources/job-specs/job-specs-custom.yml +++ b/.circleci/verbatim-sources/job-specs/job-specs-custom.yml @@ -95,6 +95,198 @@ paths: - miniconda3 + mac_build: + parameters: + build-environment: + type: string + description: Top-level label for what's being built/tested. + xcode-version: + type: string + default: "13.3.1" + description: What xcode version to build with. 
+ build-generates-artifacts: + type: boolean + default: true + description: if the build generates build artifacts + python-version: + type: string + default: "3.8" + macos: + xcode: << parameters.xcode-version >> + resource_class: medium + environment: + BUILD_ENVIRONMENT: << parameters.build-environment >> + AWS_REGION: us-east-1 + steps: + + - checkout + - run_brew_for_macos_build + + - run: + name: Install sccache + command: | + sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${BASH_ENV}" + echo "export SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${BASH_ENV}" + + set +x + echo "export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}" + echo "export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}" + set -x + + - run: + name: Get workflow job id + command: | + echo "export OUR_GITHUB_JOB_ID=${CIRCLE_WORKFLOW_JOB_ID}" >> "${BASH_ENV}" + + - run: + name: Build + command: | + set -x + + git submodule sync + git submodule update --init --recursive --depth 1 --jobs 0 + + export PATH="/usr/local/bin:$PATH" + export WORKSPACE_DIR="${HOME}/workspace" + mkdir -p "${WORKSPACE_DIR}" + MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-MacOSX-x86_64.sh" + if [ << parameters.python-version >> == 3.9.12 ]; then + MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh" + fi + + # If a local installation of conda doesn't exist, we download and install conda + if [ ! -d "${WORKSPACE_DIR}/miniconda3" ]; then + mkdir -p "${WORKSPACE_DIR}" + curl --retry 3 ${MINICONDA_URL} -o "${WORKSPACE_DIR}"/miniconda3.sh + bash "${WORKSPACE_DIR}"/miniconda3.sh -b -p "${WORKSPACE_DIR}"/miniconda3 + fi + export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH" + # shellcheck disable=SC1091 + source "${WORKSPACE_DIR}"/miniconda3/bin/activate + + brew link --force libomp + + echo "export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${BASH_ENV}" + .jenkins/pytorch/macos-build.sh + + - when: + condition: << parameters.build-generates-artifacts >> + steps: + - run: + name: Archive artifacts into zip + command: | + zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json + cp artifacts.zip /Users/distiller/workspace + + - persist_to_workspace: + root: /Users/distiller/workspace/ + paths: + - miniconda3 + - artifacts.zip + + - store_artifacts: + path: /Users/distiller/project/artifacts.zip + + mac_test: + parameters: + build-environment: + type: string + shard-number: + type: string + num-test-shards: + type: string + xcode-version: + type: string + test-config: + type: string + default: 'default' + + macos: + xcode: << parameters.xcode-version >> + environment: + GIT_DEFAULT_BRANCH: 'master' + BUILD_ENVIRONMENT: << parameters.build-environment >> + TEST_CONFIG: << parameters.test-config >> + SHARD_NUMBER: << parameters.shard-number >> + NUM_TEST_SHARDS: << parameters.num-test-shards >> + PYTORCH_RETRY_TEST_CASES: 1 + PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1 + steps: + - checkout + - attach_workspace: + at: ~/workspace + - run_brew_for_macos_build + - run: + name: Test + no_output_timeout: "2h" + command: | + set -x + + git submodule sync --recursive + git submodule update --init --recursive + + mv ~/workspace/artifacts.zip . 
+ unzip artifacts.zip + + export IN_CI=1 + + COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}") + + export PATH="/usr/local/bin:$PATH" + export WORKSPACE_DIR="${HOME}/workspace" + mkdir -p "${WORKSPACE_DIR}" + + export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH" + source "${WORKSPACE_DIR}"/miniconda3/bin/activate + + # sanitize the input commit message and PR body here: + + # trim all new lines from commit messages to avoid issues with batch environment + # variable copying. see https://github.com/pytorch/pytorch/pull/80043#issuecomment-1167796028 + COMMIT_MESSAGES="${COMMIT_MESSAGES//[$'\n\r']}" + + # then trim all special characters like single and double quotes to avoid unescaped inputs to + # wreak havoc internally + export COMMIT_MESSAGES="${COMMIT_MESSAGES//[\'\"]}" + + python3 -mpip install dist/*.whl + .jenkins/pytorch/macos-test.sh + - run: + name: Copy files for uploading test stats + command: | + # copy into a parent folder test-reports because we can't use CIRCLEI_BUILD_NUM in path when persisting to workspace + mkdir -p test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports + cp -r test/test-reports test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports + - store_test_results: + path: test/test-reports + - persist_to_workspace: + root: /Users/distiller/project/ + paths: + - test-reports + + upload_test_stats: + machine: # executor type + image: ubuntu-2004:202010-01 # # recommended linux image - includes Ubuntu 20.04, docker 19.03.13, docker-compose 1.27.4 + steps: + - checkout + - attach_workspace: + at: ~/workspace + - run: + name: upload + command: | + set -ex + if [ -z ${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} ]; then + echo "No credentials found, cannot upload test stats (are you on a fork?)" + exit 0 + fi + cp -r ~/workspace/test-reports/* ~/project + pip3 install requests==2.26 rockset==0.8.3 boto3==1.19.12 six==1.16.0 + export AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} + export AWS_SECRET_ACCESS_KEY=${AWS_SECRET_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} + # i dont know how to get the run attempt number for reruns so default to 1 + python3 -m tools.stats.upload_test_stats --workflow-run-id "${CIRCLE_WORKFLOW_JOB_ID}" --workflow-run-attempt 1 --head-branch << pipeline.git.branch >> --circleci pytorch_macos_10_13_py3_test: environment: BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test @@ -320,10 +512,43 @@ macos: xcode: "12.5.1" steps: - - checkout + - run: + name: checkout with retry + command: | + checkout() { + set -ex + # Workaround old docker images with incorrect $HOME + # check https://github.com/docker/docker/issues/2968 for details + if [ "${HOME}" = "/" ] + then + export HOME=$(getent passwd $(id -un) | cut -d: -f6) + fi + + mkdir -p ~/.ssh + + echo 'github.com ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ== + ' >> ~/.ssh/known_hosts + + # use git+ssh instead of https + git config --global url."ssh://git@github.com".insteadOf "https://github.com" || true + git config --global gc.auto 0 || true + + echo 'Cloning git repository' + mkdir -p '/Users/distiller/project' + cd '/Users/distiller/project' + git clone "$CIRCLE_REPOSITORY_URL" . 
+ echo 'Checking out branch' + git checkout --force -B "$CIRCLE_BRANCH" "$CIRCLE_SHA1" + git --no-pager log --no-color -n 1 --format='HEAD is now at %h %s' + } + + retry () { + $* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*) + } + retry checkout - run_brew_for_ios_build - run: - name: Run Fastlane + name: Setup Fastlane no_output_timeout: "1h" command: | set -e @@ -331,20 +556,6 @@ cd ${PROJ_ROOT}/ios/TestApp # install fastlane sudo gem install bundler && bundle install - # install certificates - echo ${IOS_CERT_KEY_2022} >> cert.txt - base64 --decode cert.txt -o Certificates.p12 - rm cert.txt - bundle exec fastlane install_root_cert - bundle exec fastlane install_dev_cert - # install the provisioning profile - PROFILE=PyTorch_CI_2022.mobileprovision - PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles - mkdir -pv "${PROVISIONING_PROFILES}" - cd "${PROVISIONING_PROFILES}" - echo ${IOS_SIGN_KEY_2022} >> cert.txt - base64 --decode cert.txt -o ${PROFILE} - rm cert.txt - run: name: Build no_output_timeout: "1h" @@ -402,18 +613,12 @@ command: | set -e PROJ_ROOT=/Users/distiller/project - PROFILE=PyTorch_CI_2022 # run the ruby build script if ! [ -x "$(command -v xcodebuild)" ]; then echo 'Error: xcodebuild is not installed.' exit 1 fi - echo ${IOS_DEV_TEAM_ID} - if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then - ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} -c ${PROFILE} -t ${IOS_DEV_TEAM_ID} - else - ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} - fi + ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} if ! [ "$?" -eq "0" ]; then echo 'xcodebuild failed!' exit 1 @@ -436,12 +641,13 @@ cd ${PROJ_ROOT}/ios/TestApp/benchmark mkdir -p ../models if [ ${USE_COREML_DELEGATE} == 1 ]; then - pip install coremltools==5.0b5 - pip install six + pip install coremltools==5.0b5 protobuf==3.20.1 six==1.16.0 python coreml_backend.py else - python trace_model.py + cd "${PROJ_ROOT}" + python test/mobile/model_test/gen_test_model.py ios-test fi + cd "${PROJ_ROOT}/ios/TestApp/benchmark" if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then echo "Setting up the TestApp for LiteInterpreter" ruby setup.rb --lite 1 @@ -449,10 +655,10 @@ echo "Setting up the TestApp for Full JIT" ruby setup.rb fi - cd ${PROJ_ROOT}/ios/TestApp - instruments -s -devices - if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then - if [ ${USE_COREML_DELEGATE} == 1 ]; then + cd "${PROJ_ROOT}/ios/TestApp" + # instruments -s -devices + if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then + if [ "${USE_COREML_DELEGATE}" == 1 ]; then fastlane scan --only_testing TestAppTests/TestAppTests/testCoreML else fastlane scan --only_testing TestAppTests/TestAppTests/testLiteInterpreter diff --git a/.github/ISSUE_TEMPLATE/ci-sev.md b/.github/ISSUE_TEMPLATE/ci-sev.md index 8178c68d978b..2b6bbfc982c9 100644 --- a/.github/ISSUE_TEMPLATE/ci-sev.md +++ b/.github/ISSUE_TEMPLATE/ci-sev.md @@ -5,6 +5,8 @@ about: Tracking incidents for PyTorch's CI infra. > NOTE: Remember to label this issue with "`ci: sev`" +**MERGE BLOCKING** + ## Current Status *Status could be: preemptive, ongoing, mitigated, closed. Also tell people if they need to take action to fix it (i.e. rebase)*. 
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 7d428014cd79..dff11e6aae5c 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -1 +1 @@ -Fixes #ISSUE_NUMBER +Fixes #ISSUE_NUMBER diff --git a/.github/actionlint.yaml b/.github/actionlint.yaml index 4b5afb13f367..ff640de7bde5 100644 --- a/.github/actionlint.yaml +++ b/.github/actionlint.yaml @@ -5,9 +5,12 @@ self-hosted-runner: - linux.large - linux.2xlarge - linux.4xlarge + - linux.12xlarge + - linux.24xlarge - linux.4xlarge.nvidia.gpu - linux.8xlarge.nvidia.gpu - linux.16xlarge.nvidia.gpu + - linux.g5.4xlarge.nvidia.gpu - windows.4xlarge - windows.8xlarge.nvidia.gpu - bm-runner diff --git a/.github/actions/build-android/action.yml b/.github/actions/build-android/action.yml index 5233b62cef0e..6513d82f6966 100644 --- a/.github/actions/build-android/action.yml +++ b/.github/actions/build-android/action.yml @@ -73,4 +73,4 @@ runs: # Copy install binaries back mkdir -p "${GITHUB_WORKSPACE}/build_android_install_${MATRIX_ARCH}" docker cp "${container_name}:/var/lib/jenkins/workspace/build_android/install" "${GITHUB_WORKSPACE}/build_android_install_${MATRIX_ARCH}" - echo "::set-output name=container_id::${container_name}" + echo "container_id=${container_name}" >> "${GITHUB_OUTPUT}" diff --git a/.github/actions/calculate-docker-image/action.yml b/.github/actions/calculate-docker-image/action.yml index 7215bf84e987..ff090d623f8e 100644 --- a/.github/actions/calculate-docker-image/action.yml +++ b/.github/actions/calculate-docker-image/action.yml @@ -47,12 +47,12 @@ runs: if [ -n "${IS_XLA}" ]; then echo "XLA workflow uses pre-built test image at ${XLA_IMAGE_TAG}" DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "::set-output name=docker-tag::${DOCKER_TAG}" - echo "::set-output name=docker-image::${DOCKER_IMAGE_BASE}:${XLA_IMAGE_TAG}" + echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" + echo "docker-image=${DOCKER_IMAGE_BASE}:${XLA_IMAGE_TAG}" >> "${GITHUB_OUTPUT}" else DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "::set-output name=docker-tag::${DOCKER_TAG}" - echo "::set-output name=docker-image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" + echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" + echo "docker-image=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" fi - name: Check if image should be built @@ -93,10 +93,10 @@ runs: # In order to avoid a stampeding herd of jobs trying to push all at once we set it to # skip the push. 
If this is negatively affecting TTS across the board the suggestion # should be to run the docker-builds.yml workflow to generate the correct docker builds - echo ::set-output name=skip_push::true + echo "skip_push=true" >> "${GITHUB_OUTPUT}" fi fi - echo ::set-output name=rebuild::yes + echo "rebuild=yes" >> "${GITHUB_OUTPUT}" - name: Build and push docker image if: inputs.always-rebuild || steps.check.outputs.rebuild diff --git a/.github/actions/download-build-artifacts/action.yml b/.github/actions/download-build-artifacts/action.yml index 9b11d0f7fe32..a7107f2067de 100644 --- a/.github/actions/download-build-artifacts/action.yml +++ b/.github/actions/download-build-artifacts/action.yml @@ -21,7 +21,7 @@ runs: - name: Download PyTorch Build Artifacts from GHA if: inputs.use-gha - uses: actions/download-artifact@v2 + uses: actions/download-artifact@v3 with: name: ${{ inputs.name }} diff --git a/.github/actions/filter-test-configs/action.yml b/.github/actions/filter-test-configs/action.yml new file mode 100644 index 000000000000..0253577134c8 --- /dev/null +++ b/.github/actions/filter-test-configs/action.yml @@ -0,0 +1,62 @@ +name: Filter test configs matrix + +description: | + Apply filter to the test configs matrix to keep only entries specified + by the PR test-config labels. If no test-config label is set, the same + test configs matrix is returned untouched. + +inputs: + github-token: + description: GITHUB_TOKEN + required: true + test-matrix: + required: true + type: string + description: JSON description of what test configs to run. + +outputs: + test-matrix: + description: The filtered test configs matrix. + value: ${{ steps.filter.outputs.test-matrix }} + is-test-matrix-empty: + description: True if the filtered test configs matrix is empty. False otherwise. 
+ value: ${{ steps.filter.outputs.is-test-matrix-empty }} + +runs: + using: composite + steps: + - uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 + name: Setup dependencies + env: + GITHUB_TOKEN: ${{ inputs.github-token }} + with: + shell: bash + timeout_minutes: 10 + max_attempts: 5 + retry_wait_seconds: 30 + command: | + set -eux + python3 -m pip install requests==2.26.0 pyyaml==6.0 + + - name: Parse ref + shell: bash + id: parse-ref + run: .github/scripts/parse_ref.py + + - name: Select all requested test configurations + shell: bash + env: + GITHUB_TOKEN: ${{ inputs.github-token }} + id: filter + run: | + .github/scripts/filter_test_configs.py \ + --test-matrix "${{ inputs.test-matrix }}" \ + --pr-number "${{ github.event.pull_request.number }}" \ + --tag "${{ steps.parse-ref.outputs.tag }}" \ + --event-name "${{ github.event_name }}" \ + --schedule "${{ github.event.schedule }}" + + - name: Print the filtered test matrix + shell: bash + run: | + echo "${{ steps.filter.outputs.test-matrix }}" diff --git a/.github/actions/get-workflow-job-id/action.yml b/.github/actions/get-workflow-job-id/action.yml index 34863677407a..54b7bbe5e174 100644 --- a/.github/actions/get-workflow-job-id/action.yml +++ b/.github/actions/get-workflow-job-id/action.yml @@ -15,7 +15,7 @@ outputs: runs: using: composite steps: - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + - uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 id: get-job-id env: GITHUB_TOKEN: ${{ inputs.github-token }} @@ -28,4 +28,4 @@ runs: set -eux python3 -m pip install requests==2.26.0 GHA_WORKFLOW_JOB_ID=$(python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}") - echo "::set-output name=job-id::${GHA_WORKFLOW_JOB_ID}" + echo "job-id=${GHA_WORKFLOW_JOB_ID}" >> "${GITHUB_OUTPUT}" diff --git a/.github/actions/pull-docker-image/action.yml b/.github/actions/pull-docker-image/action.yml deleted file mode 100644 index 75e8baf6f2c9..000000000000 --- a/.github/actions/pull-docker-image/action.yml +++ /dev/null @@ -1,23 +0,0 @@ -name: Pull docker image - -description: pull a specific docker image - -inputs: - docker-image: - description: the image to pull - required: true - -runs: - using: composite - steps: - - name: Pull Docker image - shell: bash - env: - DOCKER_IMAGE: ${{ inputs.docker-image }} - run: | - retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } - # ignore output since only exit code is used for conditional - # only pull docker image if it's not available locally - if ! docker inspect --type=image "${DOCKER_IMAGE}" >/dev/null 2>/dev/null; then - retry docker pull "${DOCKER_IMAGE}" - fi diff --git a/.github/actions/setup-rocm/action.yml b/.github/actions/setup-rocm/action.yml index 97dfd22c76ac..d91762eb9a86 100644 --- a/.github/actions/setup-rocm/action.yml +++ b/.github/actions/setup-rocm/action.yml @@ -36,7 +36,12 @@ runs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified" exit 1 fi diff --git a/.github/actions/setup-ssh/action.yml b/.github/actions/setup-ssh/action.yml deleted file mode 100644 index c2be35a805c4..000000000000 --- a/.github/actions/setup-ssh/action.yml +++ /dev/null @@ -1,17 +0,0 @@ -name: Setup SSH - -description: Adds ssh keys for current user to machine - -inputs: - github-secret: - description: GitHub token - required: true - -runs: - using: composite - steps: - - name: "Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ inputs.github-secret }} - activate-with-label: false diff --git a/.github/actions/setup-win/action.yml b/.github/actions/setup-win/action.yml index 12f287b23089..6dc1a1b6c6fe 100644 --- a/.github/actions/setup-win/action.yml +++ b/.github/actions/setup-win/action.yml @@ -55,6 +55,12 @@ runs: .circleci/scripts/windows_cudnn_install.sh - name: Setup Python3 - uses: actions/setup-python@v2 + uses: actions/setup-python@v4 with: - python-version: "3.x" + python-version: '3.x' + check-latest: false + cache: pip + cache-dependency-path: | + **/requirements.txt + **/.circleci/docker/requirements-ci.txt + **/.github/requirements-gha-cache.txt diff --git a/.github/actions/teardown-linux/action.yml b/.github/actions/teardown-linux/action.yml deleted file mode 100644 index 9238a073a6b6..000000000000 --- a/.github/actions/teardown-linux/action.yml +++ /dev/null @@ -1,28 +0,0 @@ -name: Teardown Linux - -description: Stuff that should always run at the end of a linux job - -inputs: - skip-wait-ssh: - description: If set, don't wait for ssh to drain before tearing down - required: false - default: "" - -runs: - using: composite - steps: - - name: Hold runner for 2 hours or until ssh sessions have drained - # TODO working-directory: !{{ pytorch_directory }} - # Always hold for active ssh sessions - shell: bash - if: inputs.skip-wait-ssh == '' - run: .github/scripts/wait_for_ssh_to_drain.sh - - - name: Kill containers, clean up images - shell: bash - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af diff --git a/.github/actions/test-pytorch-binary/action.yml b/.github/actions/test-pytorch-binary/action.yml index bc2c546f57b2..be2090db533d 100644 --- a/.github/actions/test-pytorch-binary/action.yml +++ b/.github/actions/test-pytorch-binary/action.yml @@ -15,7 +15,6 @@ runs: -e BINARY_ENV_FILE \ -e BUILDER_ROOT \ -e BUILD_ENVIRONMENT \ - -e BUILD_SPLIT_CUDA \ -e DESIRED_CUDA \ -e DESIRED_DEVTOOLSET \ -e DESIRED_PYTHON \ diff --git a/.github/actions/upload-test-artifacts/action.yml b/.github/actions/upload-test-artifacts/action.yml index 35e249ea96be..9fd2342601f1 100644 --- a/.github/actions/upload-test-artifacts/action.yml +++ b/.github/actions/upload-test-artifacts/action.yml @@ -34,7 +34,7 @@ runs: run: | # Remove any previous test reports if they exist rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' + zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' -i '*.csv' - name: Zip usage log for upload if: runner.os != 'Windows' && !inputs.use-gha @@ -67,7 +67,7 @@ runs: FILE_SUFFIX: ${{ inputs.file-suffix }} run: | # -ir => recursive include all files in pattern - 7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' + 7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' -ir'!test\*.csv' - name: Zip usage log for upload if: 
runner.os == 'Windows' && !inputs.use-gha @@ -111,7 +111,7 @@ runs: # GHA upload - name: Store Test Downloaded JSONs on Github - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v3 if: inputs.use-gha with: # Add the run attempt, see [Artifact run attempt] @@ -121,11 +121,25 @@ runs: path: test/**/*.json - name: Store Test Reports on Github - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v3 if: inputs.use-gha with: # Add the run attempt, see [Artifact run attempt] name: test-reports-runattempt${{ github.run_attempt }}-${{ inputs.file-suffix }}.zip retention-days: 14 - if-no-files-found: error - path: test/**/*.xml + # Don't want to fail the workflow here because not all workflows have csv files + if-no-files-found: ignore + path: | + test/**/*.xml + test/**/*.csv + + - name: Store Usage Logs on Github + uses: actions/upload-artifact@v3 + if: inputs.use-gha + with: + # Add the run attempt, see [Artifact run attempt] + name: usage-log-runattempt${{ github.run_attempt }}-${{ inputs.file-suffix }}.zip + retention-days: 14 + if-no-files-found: ignore + path: usage_log.txt + continue-on-error: true diff --git a/.github/auto_request_review.yml b/.github/auto_request_review.yml new file mode 100644 index 000000000000..339f085d939a --- /dev/null +++ b/.github/auto_request_review.yml @@ -0,0 +1,29 @@ +# Documented at https://github.com/necojackarc/auto-request-review +reviewers: + groups: + symbolic-shapes: + - ezyang + - Chillee + - anjali411 + - albanD + - miladm + - bdhirsh + - voznesenskym + - SherlockNoMad + + per_author: + symbolic-shapes: + - symbolic-shapes + - antoniojkim + - wconstab + +files: + # none yet, TODO: migrate CODEOWNERS here + +options: + ignore_draft: true + ignored_keywords: + - DO NOT REVIEW + # Just manually setup a self-referential per_author rule if you + # want group assignment + enable_group_assignment: false diff --git a/.github/ci_commit_pins/huggingface.txt b/.github/ci_commit_pins/huggingface.txt new file mode 100644 index 000000000000..4b199567e9a7 --- /dev/null +++ b/.github/ci_commit_pins/huggingface.txt @@ -0,0 +1 @@ +ebee0a27940adfbb30444d83387b9ea0f1173f40 diff --git a/.github/ci_commit_pins/text.txt b/.github/ci_commit_pins/text.txt new file mode 100644 index 000000000000..c0e01da17fd0 --- /dev/null +++ b/.github/ci_commit_pins/text.txt @@ -0,0 +1 @@ +5b78d074bd303eb230d30567646fcf0358ee2dd4 diff --git a/.github/ci_commit_pins/timm.txt b/.github/ci_commit_pins/timm.txt new file mode 100644 index 000000000000..cdda1d14775c --- /dev/null +++ b/.github/ci_commit_pins/timm.txt @@ -0,0 +1 @@ +6635bc3f7d06c6a0d0481803b24d6ad0004b61ac diff --git a/.github/ci_commit_pins/torchbench.txt b/.github/ci_commit_pins/torchbench.txt new file mode 100644 index 000000000000..28041e71960e --- /dev/null +++ b/.github/ci_commit_pins/torchbench.txt @@ -0,0 +1 @@ +24b95f2f627bf07a61cefed653419389a7586357 diff --git a/.github/ci_commit_pins/torchdynamo.txt b/.github/ci_commit_pins/torchdynamo.txt deleted file mode 100644 index 3d570d9605ed..000000000000 --- a/.github/ci_commit_pins/torchdynamo.txt +++ /dev/null @@ -1 +0,0 @@ -f19410cd8204fa1c30ca72f81142508e128be66f diff --git a/.github/ci_commit_pins/triton.txt b/.github/ci_commit_pins/triton.txt new file mode 100644 index 000000000000..7c5e80098f7b --- /dev/null +++ b/.github/ci_commit_pins/triton.txt @@ -0,0 +1 @@ +0d7e7532279e45672555e344646f5c19c3972331 diff --git a/.github/ci_commit_pins/vision.txt b/.github/ci_commit_pins/vision.txt index 511567c66dff..6874c288beca 100644 --- 
a/.github/ci_commit_pins/vision.txt +++ b/.github/ci_commit_pins/vision.txt @@ -1 +1 @@ -a61e6ef6ff5af041661ecc70b1a7e3dacb2240b6 +72686211e2a8b78e5a5dc8c28be34eb9cfcdad4c diff --git a/.github/ci_commit_pins/xla.txt b/.github/ci_commit_pins/xla.txt index cb6944f39202..5650a48e646b 100644 --- a/.github/ci_commit_pins/xla.txt +++ b/.github/ci_commit_pins/xla.txt @@ -1 +1 @@ -3935e4445eba5af370ebc01b4daf5cec4c026900 +216d221f4d75ddfe9d0bd3ff2e8b92b39c67d381 diff --git a/.github/labeler.yml b/.github/labeler.yml new file mode 100644 index 000000000000..e86ff2192ede --- /dev/null +++ b/.github/labeler.yml @@ -0,0 +1,51 @@ +"module: dynamo": +- torch/_dynamo/** +- torch/csrc/dynamo/** +- benchmarks/dynamo/** +- test/dynamo/** + +"module: inductor": +- torch/_inductor/** +- test/inductor/** + +"ciflow/inductor": +- torch/_dynamo/** +- torch/_inductor/** +- benchmarks/dynamo/** +- torch/_subclasses/fake_tensor.py +- torch/_subclasses/fake_utils.py +- torch/_subclasses/meta_utils.py + +"module: cpu": +- aten/src/ATen/cpu/** +- aten/src/ATen/native/cpu/** +- aten/src/ATen/native/quantized/cpu/** +- aten/src/ATen/native/Convolution*.cpp +- aten/src/ATen/native/mkldnn/** +- torch/cpu/** +- torch/utils/mkldnn.py +- test/test_mkldnn.py + +"module: mkldnn": +- third_party/ideep +- caffe2/ideep/** +- caffe2/python/ideep/** +- cmake/Modules/FindMKLDNN.cmake +- third_party/mkl-dnn.BUILD +- torch/csrc/jit/codegen/onednn/** +- test/test_jit_llga_fuser.py + +"module: amp (automated mixed precision)": +- torch/amp/** +- aten/src/ATen/autocast_mode.* +- torch/csrc/jit/passes/autocast.cpp +- test/test_autocast.py + +"NNC": +- torch/csrc/jit/tensorexpr/** + +"oncall: quantization": +- torch/ao/quantization/** +- torch/quantization/** +- aten/src/ATen/quantized/** +- aten/src/ATen/native/quantized/cpu/** diff --git a/.github/merge_rules.json b/.github/merge_rules.json deleted file mode 100644 index c0b53c7f0c69..000000000000 --- a/.github/merge_rules.json +++ /dev/null @@ -1,302 +0,0 @@ -[ - { - "name": "ONNX exporter", - "patterns": [ - ".jenkins/caffe2/*", - "aten/src/ATen/core/interned_strings.h", - "docs/source/onnx.rst", - "docs/source/scripts/onnx/**", - "scripts/onnx/**", - "test/jit/test_export_modes.py", - "test/onnx/**", - "tools/onnx/**", - "torch/_C/__init__.pyi.in", - "torch/csrc/jit/passes/onnx.*", - "torch/csrc/jit/passes/onnx/**", - "torch/csrc/jit/serialization/export.*", - "torch/csrc/jit/serialization/onnx.*", - "torch/csrc/onnx/**", - "torch/onnx/**" - ], - "approved_by": ["BowenBao", "abock"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "NVFuser", - "patterns": [ - "test/test_jit_cuda_fuser.py", - "torch/csrc/jit/codegen/fuser/cuda/**", - "torch/csrc/jit/codegen/cuda/**", - "benchmarks/cpp/nvfuser/**" - ], - "approved_by": ["csarofeen", "ngimel", "jjsjann123"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "OSS CI", - "patterns": [".github/**", ".circleci/**", ".jenkins/**", "scripts/**", "tools/**"], - "approved_by": ["ezyang", "pytorch/pytorch-dev-infra"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "CI Pinned Hashes", - "patterns": [ - ".github/ci_commit_pins/vision.txt", - ".github/ci_commit_pins/torchdynamo.txt" - ], - "approved_by": ["pytorchbot", "ezyang", "pytorch/pytorch-dev-infra"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "XLA hash pin update", - "patterns": 
[".github/ci_commit_pins/xla.txt"], - "approved_by": ["pytorchbot", "ezyang", "pytorch/pytorch-dev-infra"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull / linux-bionic-py3_7-clang8-xla / build", - "pull / linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge)" - ] - }, - { - "name": "Documentation", - "patterns": ["docs/**", "torch/*docs.py"], - "approved_by": ["mruberry", "ngimel", "janeyx99", "svekars"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Mobile", - "patterns": ["ios/**", "android/**", "test/mobile/**"], - "approved_by": ["linbinyu", "kit1980", "IvanKobzarev", "dreiss"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Linear Algebra", - "patterns": [ - "aten/src/ATen/native/cuda/linalg/**", - "aten/src/ATen/LinalgBackend.h", - "aten/src/ATen/native/**LinearAlgebra*", - "docs/source/linalg.rst", - "torch/linalg/**", - "torch/_linalg_utils.py", - "torch/**python_linalg_functions.*", - "torch/**linalg.h", - "tools/autograd/templates/python_linalg_functions.cpp", - "test/test_linalg.py" - ], - "approved_by": ["nikitaved", "mruberry", "pearu", "Lezcano", "IvanYashchuk"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "FFT", - "patterns": [ - "aten/src/ATen/native/cuda/*FFT*.h", - "aten/src/ATen/native/SpectralOps.cpp", - "aten/src/ATen/native/mkl/SpectralOps.cpp", - "aten/src/ATen/native/cuda/SpectralOps.*", - "docs/source/fft.rst", - "torch/fft/**", - "torch/csrc/api/include/torch/fft.h", - "torch/**python_fft_functions.*", - "tools/autograd/templates/python_fft_functions.cpp", - "test/cpp/api/fft.cpp" - ], - "approved_by": ["mruberry", "peterbell10"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Sparse", - "patterns": [ - "benchmarks/sparse", - "c10/util/sparse_bitset.h", - "docs/source/sparse.rst", - "torch/**sparse/**", - "torch/**sparse*", - "torch/optim/sparse*", - "torch/ao/nn/sparse/**", - "torch/utils/benchmark/**sparse*", - "aten/src/ATen/native/ao_sparse/**", - "aten/src/ATen/native/sparse/**", - "aten/src/ATen/**Sparse*", - "aten/src/ATen/*Sparse*", - "torch/_masked/**", - "test/*_masked*", - "test/**sparse*" - ], - "approved_by": ["nikitaved", "cpuhrsch", "pearu", "IvanYashchuk"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "MPS", - "patterns": [ - "test/test_mps.py", - "aten/src/ATen/native/native_functions.yaml", - "aten/src/ATen/mps/**", - "aten/src/ATen/native/mps/**" - ], - "approved_by": ["kulinseth", "razarmehr"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Distributions", - "patterns": [ - "torch/distributions/**", - "test/distributions/**" - ], - "approved_by": ["fritzo", "neerajprad", "alicanb", "vishwakftw"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Distributed", - "patterns": [ - "docs/source/pipeline.rst", - "docs/source/distributed*", - "docs/source/rpc.rst", - "docs/source/rpc/**", - "docs/source/_static/img/rpc*", - "docs/source/_static/img/*distributed*", - "docs/source/elastic/**", - "benchmarks/distributed/**", - "torch/distributed/**", - "torch/nn/parallel/distributed*", - "torch/_C/_distributed*", - "torch/csrc/distributed/**", - "torch/testing/_internal/distributed/**", - "test/distributed/**", - "test/cpp/dist_autograd/**", - "test/cpp/rpc/**" - ], - 
"approved_by": ["mrshenli", "pritamdamania87", "d4l3k", "kiukchung", "pietern"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "IDEEP", - "patterns": [ - "third_party/ideep", - "caffe2/ideep/**", - "caffe2/python/ideep/**" - ], - "approved_by": ["XiaobingSuper", "yanbing-j"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "oneDNN graph", - "patterns": [ - "torch/csrc/jit/codegen/onednn/**", - "test/test_jit_llga_fuser.py" - ], - "approved_by": ["sanchitintel", "chunyuan-w"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "CPU ATen backend", - "patterns": [ - "aten/src/ATen/cpu/**", - "aten/src/ATen/native/cpu/**", - "aten/src/ATen/native/quantized/cpu/**", - "aten/src/ATen/native/Convolution*.cpp", - "aten/src/ATen/native/mkldnn/**" - ], - "approved_by": ["mingfeima", "XiaobingSuper"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "CPU frontend", - "patterns": [ - "torch/cpu/**", - "torch/utils/mkldnn.py", - "test/test_mkldnn.py" - ], - "approved_by": ["leslie-fang-intel", "CaoE"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "Autocast", - "patterns": [ - "torch/amp/**", - "aten/src/ATen/autocast_mode.*", - "torch/csrc/jit/passes/autocast.cpp", - "test/test_autocast.py" - ], - "approved_by": ["leslie-fang-intel", "CaoE"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - }, - { - "name": "superuser", - "patterns": ["*"], - "approved_by": ["pytorch/metamates"], - "mandatory_checks_name": [ - "Facebook CLA Check", - "Lint", - "pull" - ] - } -] diff --git a/.github/merge_rules.yaml b/.github/merge_rules.yaml new file mode 100644 index 000000000000..1837cce32b2f --- /dev/null +++ b/.github/merge_rules.yaml @@ -0,0 +1,374 @@ +- name: ONNX exporter + patterns: + - .jenkins/caffe2/* + - aten/src/ATen/core/interned_strings.h + - docs/source/onnx.rst + - docs/source/onnx* + - docs/source/scripts/onnx/** + - scripts/onnx/** + - test/onnx/** + - tools/onnx/** + - torch/_C/__init__.pyi.in + - torch/csrc/jit/passes/onnx.* + - torch/csrc/jit/passes/onnx/** + - torch/csrc/jit/serialization/export.* + - torch/csrc/jit/serialization/onnx.* + - torch/csrc/onnx/** + - torch/onnx/** + - third_party/onnx + - caffe2/python/onnx/** + approved_by: + - BowenBao + - abock + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: NVFuser + patterns: + - test/test_jit_cuda_fuser.py + - torch/csrc/jit/codegen/fuser/cuda/** + - torch/csrc/jit/codegen/cuda/** + - benchmarks/cpp/nvfuser/** + approved_by: + - csarofeen + - ngimel + - jjsjann123 + - kevinstephano + - ptrblck + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: OSS CI + patterns: + - .github/** + - .circleci/** + - .jenkins/** + - scripts/** + - tools/** + approved_by: + - alband + - dagitses + - pytorch/pytorch-dev-infra + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: OSS CI / pytorchbot + patterns: + - .github/ci_commit_pins/vision.txt + - .github/ci_commit_pins/torchdynamo.txt + approved_by: + - pytorchbot + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: OSS CI / pytorchbot / XLA + patterns: + - .github/ci_commit_pins/xla.txt + approved_by: + - pytorchbot + mandatory_checks_name: + - EasyCLA + - Lint + - pull / linux-bionic-py3_7-clang8-xla / build + - pull / linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge) + +- name: 
Documentation + patterns: + - docs/** + - torch/*docs.py + approved_by: + - svekars + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: Mobile + patterns: + - ios/** + - android/** + - test/mobile/** + approved_by: + - linbinyu + - IvanKobzarev + - dreiss + - raziel + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: Linear Algebra + patterns: + - aten/src/ATen/native/cuda/linalg/** + - aten/src/ATen/LinalgBackend.h + - aten/src/ATen/native/**LinearAlgebra* + - docs/source/linalg.rst + - torch/linalg/** + - torch/_linalg_utils.py + - torch/**python_linalg_functions.* + - torch/**linalg.h + - tools/autograd/templates/python_linalg_functions.cpp + - test/test_linalg.py + approved_by: + - mruberry + - lezcano + - IvanYashchuk + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: FFT + patterns: + - aten/src/ATen/native/cuda/*FFT*.h + - aten/src/ATen/native/SpectralOps.cpp + - aten/src/ATen/native/mkl/SpectralOps.cpp + - aten/src/ATen/native/cuda/SpectralOps.* + - docs/source/fft.rst + - torch/fft/** + - torch/csrc/api/include/torch/fft.h + - torch/**python_fft_functions.* + - tools/autograd/templates/python_fft_functions.cpp + - test/cpp/api/fft.cpp + approved_by: + - mruberry + - peterbell10 + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: Sparse + patterns: + - benchmarks/sparse + - c10/util/sparse_bitset.h + - docs/source/sparse.rst + - torch/**sparse/** + - torch/**sparse* + - torch/optim/sparse* + - torch/ao/nn/sparse/** + - torch/utils/benchmark/**sparse* + - aten/src/ATen/native/ao_sparse/** + - aten/src/ATen/native/sparse/** + - aten/src/ATen/**Sparse* + - aten/src/ATen/*Sparse* + - torch/_masked/** + - test/*_masked* + - test/**sparse* + approved_by: + - nikitaved + - cpuhrsch + - pearu + - IvanYashchuk + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: MPS + patterns: + - test/test_mps.py + - aten/src/ATen/native/native_functions.yaml + - aten/src/ATen/mps/** + - aten/src/ATen/native/mps/** + approved_by: + - kulinseth + - alband + - malfet + - razarmehr + mandatory_checks_name: + - EasyCLA + - Lint + - pull +- name: Distributions + patterns: + - torch/distributions/** + - test/distributions/** + approved_by: + - fritzo + - neerajprad + - alicanb + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: Distributed + patterns: + - docs/source/pipeline.rst + - docs/source/distributed* + - docs/source/rpc.rst + - docs/source/rpc/** + - docs/source/_static/img/rpc* + - docs/source/_static/img/*distributed* + - docs/source/elastic/** + - benchmarks/distributed/** + - torch/distributed/** + - torch/nn/parallel/distributed* + - torch/_C/_distributed* + - torch/csrc/distributed/** + - torch/testing/_internal/distributed/** + - test/distributed/** + - test/cpp/dist_autograd/** + - test/cpp/rpc/** + approved_by: + - mrshenli + - pritamdamania87 + - zhaojuanmao + - rohan-varma + - wanchaol + - fduwjj + - H-Huang + - d4l3k + - aazzolini + - kwen2501 + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: IDEEP + patterns: + - third_party/ideep + - caffe2/ideep/** + - caffe2/python/ideep/** + - cmake/Modules/FindMKLDNN.cmake + - third_party/mkl-dnn.BUILD + approved_by: + - XiaobingSuper + - jgong5 + - mingfeima + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: oneDNN graph + patterns: + - torch/csrc/jit/codegen/onednn/** + - test/test_jit_llga_fuser.py + approved_by: + - sanchitintel + - chunyuan-w + - jgong5 + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: CPU ATen backend 
+ patterns: + - aten/src/ATen/cpu/** + - aten/src/ATen/native/cpu/** + - aten/src/ATen/native/quantized/cpu/** + - aten/src/ATen/native/Convolution*.cpp + - aten/src/ATen/native/mkldnn/** + - test/test_mkldnn.py + approved_by: + - mingfeima + - XiaobingSuper + - jgong5 + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: CPU frontend + patterns: + - torch/cpu/** + - torch/utils/mkldnn.py + - test/test_mkldnn.py + approved_by: + - leslie-fang-intel + - jgong5 + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: Autocast + patterns: + - torch/amp/** + - aten/src/ATen/autocast_mode.* + - torch/csrc/jit/passes/autocast.cpp + - test/test_autocast.py + approved_by: + - leslie-fang-intel + - jgong5 + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: NNC + patterns: + - torch/csrc/jit/tensorexpr/** + approved_by: + - EikanWang + - jgong5 + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: Lazy Tensor + patterns: + - torch/csrc/lazy/** + - test/cpp/lazy/** + - test/lazy/** + - torchgen/api/lazy.py + - torchgen/dest/lazy_ir.py + - torchgen/dest/lazy_ts_lowering.py + - torchgen/gen_lazy_tensor.py + - aten/src/ATen/native/ts_native_functions.yaml + - .github/ci_commit_pins/xla.txt + approved_by: + - alanwaketan + - JackCaoG + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: superuser + patterns: + - '*' + approved_by: + - pytorch/metamates + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: Core Reviewers + patterns: + - '*' + approved_by: + - mruberry + - lezcano + mandatory_checks_name: + - EasyCLA + - Lint + - pull + +- name: Core Maintainers + patterns: + - '*' + approved_by: + - soumith + - gchanan + - ezyang + - dzhulgakov + mandatory_checks_name: + - EasyCLA + - Lint + - pull diff --git a/.github/requirements-gha-cache.txt b/.github/requirements-gha-cache.txt new file mode 100644 index 000000000000..6badbe2cc65c --- /dev/null +++ b/.github/requirements-gha-cache.txt @@ -0,0 +1,18 @@ +# This file is to cache other dependencies not specified elsewhere in: +# requirement.txt +# requirements-flake8.txt +# docs/requirements.txt +# docs/cpp/requirements.txt +# functorch/docs/requirements.txt +# .circleci/docker/requirements-ci.txt +boto3==1.19.12 +cffi==1.15.0 +dataclasses==0.6 +jinja2==3.0.1 +lintrunner==0.9.2 +ninja==1.10.0.post1 +pynvml==11.4.1 +pyyaml==6.0 +requests==2.26 +rich==10.9.0 +rockset==0.8.10 diff --git a/.github/requirements/README.md b/.github/requirements/README.md new file mode 100644 index 000000000000..7300eee14562 --- /dev/null +++ b/.github/requirements/README.md @@ -0,0 +1,24 @@ +### Cached requirements and consolidation of conda and pip installation + +At the moment, the installation of conda and pip dependencies happens at +different places in the CI depending at the whim of different +developers, which makes it very challenging to handle issues like +network flakiness or upstream dependency failures gracefully. So, this +center directory is created to gradually include all the conda environment +and pip requirement files that are used to setup CI jobs. Not only it +gives a clear picture of all the dependencies required by different CI +jobs, but it also allows them to be cached properly to improve CI +reliability. + +The list of support files are as follows: + +* Conda: + * conda-env-macOS-ARM64. This is used by MacOS (m1, arm64) build and + test jobs to setup the conda environment + * conda-env-macOS-X64. 
This is use by MacOS (x86-64) build and test + jobs to setup the conda environment + * conda-env-Linux-X64. This is used by Linux buck build and test jobs + to setup the conda environment +* Pip: + * pip-requirements-macOS.txt. This is used by MacOS build and test jobs to + setup the pip environment diff --git a/.github/requirements/conda-env-Linux-X64 b/.github/requirements/conda-env-Linux-X64 new file mode 100644 index 000000000000..f2b3811263e5 --- /dev/null +++ b/.github/requirements/conda-env-Linux-X64 @@ -0,0 +1,10 @@ +cffi=1.15.1 +cmake=3.22.1 +mkl=2022.1.0 +mkl-include=2022.1.0 +ninja=1.10.2 +numpy=1.23.3 +pyyaml=6.0 +requests=2.28.1 +setuptools=65.5.0 +typing_extensions=4.3.0 diff --git a/.github/requirements/conda-env-macOS-ARM64 b/.github/requirements/conda-env-macOS-ARM64 new file mode 100644 index 000000000000..a031b014365f --- /dev/null +++ b/.github/requirements/conda-env-macOS-ARM64 @@ -0,0 +1,20 @@ +numpy=1.22.3 +pyyaml=6.0 +setuptools=61.2.0 +cmake=3.22.1 +cffi=1.15.1 +typing_extensions=4.3.0 +dataclasses=0.8 +pip=22.2.2 +six=1.16.0 +pillow=9.2.0 +pkg-config=0.29.2 +wheel=0.37.1 +expecttest=0.1.3 + +# Not pinning certifi so that we can always get the latest certificates +certifi + +# Cross-compiling arm64 from x86-64 picks up 1.40.0 while testing on arm64 +# itself only has up to 1.39.0 from upstream conda. Both work though +libuv>=1.39.0,<=1.40.0 diff --git a/.github/requirements/conda-env-macOS-X64 b/.github/requirements/conda-env-macOS-X64 new file mode 100644 index 000000000000..81463d4b39d5 --- /dev/null +++ b/.github/requirements/conda-env-macOS-X64 @@ -0,0 +1,18 @@ +mkl=2021.2.0 +mkl-include=2021.2.0 +numpy=1.18.5 +pyyaml=5.3 +setuptools=46.0.0 +cmake=3.22.1 +cffi=1.15.1 +typing_extensions=4.3.0 +dataclasses=0.8 +pip=22.2.2 +six=1.16.0 +pillow=9.2.0 +libuv=1.40.0 +pkg-config=0.29.2 +wheel=0.37.1 + +# Not pinning certifi so that we can always get the latest certificates +certifi diff --git a/.github/requirements/pip-requirements-macOS.txt b/.github/requirements/pip-requirements-macOS.txt new file mode 100644 index 000000000000..dfbaea260116 --- /dev/null +++ b/.github/requirements/pip-requirements-macOS.txt @@ -0,0 +1,22 @@ +boto3==1.19.12 +hypothesis==6.56.4 +expecttest==0.1.3 +librosa>=0.6.2 +mpmath==1.2.1 +networkx==2.8.7 +# Use numba-0.49.1 or older on Intel Macs, but 0.56.0 on M1 machines, as older numba is not available +numba==0.56.0; platform_machine == "arm64" +numba<=0.49.1; platform_machine != "arm64" +opt-einsum>=3.3 +psutil==5.9.1 +pynvml==11.4.1 +pygments==2.12.0 +pytest==7.2.0 +pytest-xdist==3.0.2 +pytest-rerunfailures==10.2 +pytest-flakefinder==1.1.0 +pytest-shard==0.1.2 +scipy==1.9.0 +sympy==1.11.1 +unittest-xml-reporting<=3.2.0,>=2.0.0 +xdoctest==1.0.2 diff --git a/.github/scale-config.yml b/.github/scale-config.yml deleted file mode 100644 index 1cf99b326ba8..000000000000 --- a/.github/scale-config.yml +++ /dev/null @@ -1,69 +0,0 @@ -# scale-config.yml: -# Powers what instance types are available for GHA auto-scaled -# runners. Runners listed here will be available as self hosted -# runners, configuration is directly pulled from the main branch. 
-# -# NOTE (Apr, 5, 2021): Linux runners are currently all an amazonlinux2 -# -# NOTE (Jan 5, 2021): Linux runners are all non-ephemeral to reduce the amount of CreateInstaces calls -# to avoid RequestLimitExceeded issues -# -# TODO: Add some documentation on how the auto-scaling works -# -# NOTE: Default values, -# -# runner_types: -# runner_label: -# instance_type: m4.large -# os: linux -# max_available: 20 -# disk_size: 50 -# is_ephemeral: true - -runner_types: - # mainly used for ciflow-should-run, not made to run any serious tests - linux.large: - instance_type: c5.large - os: linux - disk_size: 10 - is_ephemeral: false - linux.2xlarge: - instance_type: c5.2xlarge - os: linux - max_available: 1000 - disk_size: 150 - is_ephemeral: false - linux.4xlarge: # for binary-builds - instance_type: c5.4xlarge - os: linux - max_available: 500 - disk_size: 150 - is_ephemeral: false - linux.8xlarge.nvidia.gpu: - instance_type: g3.8xlarge - os: linux - max_available: 200 - disk_size: 150 - is_ephemeral: false - linux.4xlarge.nvidia.gpu: - instance_type: g3.4xlarge - os: linux - max_available: 250 - disk_size: 150 - is_ephemeral: false - linux.16xlarge.nvidia.gpu: - instance_type: g3.16xlarge - os: linux - max_available: 10 - disk_size: 150 - is_ephemeral: false - windows.4xlarge: - instance_type: c5d.4xlarge - os: windows - max_available: 200 - disk_size: 256 - windows.8xlarge.nvidia.gpu: - instance_type: p3.2xlarge - os: windows - max_available: 100 - disk_size: 256 diff --git a/.github/scripts/README.md b/.github/scripts/README.md index 22099c3732ea..cc9e1617b11a 100644 --- a/.github/scripts/README.md +++ b/.github/scripts/README.md @@ -3,7 +3,7 @@ > NOTE: This README contains information for the `.github` directory but cannot be located there because it will overwrite the repo README. -This directory contains workflows and scripts to support our CI infrastructure that runs on Github Actions. +This directory contains workflows and scripts to support our CI infrastructure that runs on GitHub Actions. ## Workflows @@ -36,7 +36,7 @@ New generated binary workflows can be added in the `.github/scripts/generate_ci_ examples from that script in order to add the workflow to the stream that is relevant to what you particularly care about. -Different parameters can be used to acheive different goals, i.e. running jobs on a cron, running only on trunk, etc. +Different parameters can be used to achieve different goals, i.e. running jobs on a cron, running only on trunk, etc. 
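(Sketch only, not part of the script's documented API: the entry below simply mirrors the `BinaryBuildWorkflow` examples that appear later in this patch; the keyword arguments, helper names, and label constants are copied from those examples and may not cover every available parameter.)

    BinaryBuildWorkflow(
        os=OperatingSystem.LINUX,
        package_type="manywheel",
        build_configs=generate_binary_build_matrix.generate_wheels_matrix(
            OperatingSystem.LINUX,
            arches=["11.6"],
            python_versions=["3.7"],
        ),
        # Run only on trunk; other entries in this patch instead gate on ciflow labels via
        # ciflow_config=CIFlowConfig(labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL}).
        branches="master",
    )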
#### ciflow (trunk) diff --git a/.github/scripts/build_publish_nightly_docker.sh b/.github/scripts/build_publish_nightly_docker.sh deleted file mode 100644 index db84704aa3e4..000000000000 --- a/.github/scripts/build_publish_nightly_docker.sh +++ /dev/null @@ -1,44 +0,0 @@ -#!/usr/bin/env bash - -set -xeuo pipefail - -PYTORCH_DOCKER_TAG=$(git describe --tags --always)-devel -CUDA_VERSION=11.3.1 - -# Build PyTorch nightly docker -make -f docker.Makefile \ - DOCKER_REGISTRY=ghcr.io \ - DOCKER_ORG=pytorch \ - CUDA_VERSION=${CUDA_VERSION} \ - DOCKER_IMAGE=pytorch-nightly \ - DOCKER_TAG=${PYTORCH_DOCKER_TAG} \ - INSTALL_CHANNEL=pytorch-nightly BUILD_TYPE=official devel-image - -# Get the PYTORCH_NIGHTLY_COMMIT from the docker image -PYTORCH_NIGHTLY_COMMIT=$(docker run \ - ghcr.io/pytorch/pytorch-nightly:${PYTORCH_DOCKER_TAG} \ - python -c 'import torch; print(torch.version.git_version)' | head -c 7) - -docker tag ghcr.io/pytorch/pytorch-nightly:${PYTORCH_DOCKER_TAG} \ - ghcr.io/pytorch/pytorch-nightly:${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION} - -docker tag ghcr.io/pytorch/pytorch-nightly:${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION} \ - ghcr.io/pytorch/pytorch-nightly:latest - -if [[ ${WITH_PUSH:-} == "true" ]]; then - # Push the nightly docker to GitHub Container Registry - echo $GHCR_PAT | docker login ghcr.io -u pytorch --password-stdin - make -f docker.Makefile \ - DOCKER_REGISTRY=ghcr.io \ - DOCKER_ORG=pytorch \ - DOCKER_IMAGE=pytorch-nightly \ - DOCKER_TAG=${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION} \ - devel-push - - make -f docker.Makefile \ - DOCKER_REGISTRY=ghcr.io \ - DOCKER_ORG=pytorch \ - DOCKER_IMAGE=pytorch-nightly \ - DOCKER_TAG=latest \ - devel-push -fi diff --git a/.github/scripts/build_triton_wheel.py b/.github/scripts/build_triton_wheel.py new file mode 100644 index 000000000000..d9d2a2e98bd3 --- /dev/null +++ b/.github/scripts/build_triton_wheel.py @@ -0,0 +1,51 @@ +#!/usr/bin/env python3 +from subprocess import check_call +from pathlib import Path +from tempfile import TemporaryDirectory +import sys +import shutil +SCRIPT_DIR = Path(__file__).parent + +def read_triton_pin() -> str: + with open(SCRIPT_DIR.parent / "ci_commit_pins" / "triton.txt") as f: + return f.read().strip() + + +def check_and_replace(inp: str, src: str, dst: str) -> str: + """ Checks that `src` can be found in `input` and replaces it with `dst` """ + if src not in inp: + raise RuntimeError(f"Can't find ${src} in the input") + return inp.replace(src, dst) + + +def patch_setup_py(path: Path, *, version: str = "2.0.0", name: str = "triton") -> None: + with open(path) as f: + orig = f.read() + # Replace name + orig = check_and_replace(orig, "name=\"triton\",", f"name=\"{name}\",") + # Replace version + orig = check_and_replace(orig, "version=\"2.0.0\",", f"version=\"{version}\",") + with open(path, "w") as f: + f.write(orig) + + +def build_triton(commit_hash: str) -> Path: + with TemporaryDirectory() as tmpdir: + triton_basedir = Path(tmpdir) / "triton" + triton_pythondir = triton_basedir / "python" + check_call(["git", "clone", "https://github.com/openai/triton"], cwd=tmpdir) + check_call(["git", "checkout", commit_hash], cwd=triton_basedir) + patch_setup_py(triton_pythondir / "setup.py", name="torchtriton", version=f"2.0.0+{commit_hash[:10]}") + check_call([sys.executable, "setup.py", "bdist_wheel"], cwd=triton_pythondir) + whl_path = list((triton_pythondir / "dist").glob("*.whl"))[0] + shutil.copy(whl_path, Path.cwd()) + return Path.cwd() / whl_path.name + + +def main() -> None: + pin = read_triton_pin() 
+ build_triton(pin) + + +if __name__ == "__main__": + main() diff --git a/.github/scripts/check_labels.py b/.github/scripts/check_labels.py new file mode 100755 index 000000000000..2d4a216daf94 --- /dev/null +++ b/.github/scripts/check_labels.py @@ -0,0 +1,87 @@ +#!/usr/bin/env python3 +"""check_labels.py""" + +from typing import Any, List + +from export_pytorch_labels import get_pytorch_labels +from gitutils import ( + get_git_remote_name, + get_git_repo_dir, + GitRepo, +) +from trymerge import ( + _fetch_url, + gh_post_pr_comment, + GitHubPR, +) + + +BOT_AUTHORS = ["github-actions", "pytorchmergebot", "pytorch-bot"] + +ERR_MSG_TITLE = "This PR needs a label" +ERR_MSG = ( + f"# {ERR_MSG_TITLE}\n" + "If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`.\n\n" # noqa: E501 pylint: disable=line-too-long + "If not, please add the `topic: not user facing` label.\n\n" + "For more information, see https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work." # noqa: E501 pylint: disable=line-too-long +) + + +def get_release_notes_labels() -> List[str]: + return [label for label in get_pytorch_labels() if label.lstrip().startswith("release notes:")] + + +def delete_comment(comment_id: int) -> None: + url = f"https://api.github.com/repos/pytorch/pytorch/issues/comments/{comment_id}" + _fetch_url(url, method="DELETE") + + +def has_required_labels(pr: GitHubPR) -> bool: + pr_labels = pr.get_labels() + # Check if PR is not user facing + is_not_user_facing_pr = any(label.strip() == "topic: not user facing" for label in pr_labels) + return is_not_user_facing_pr or any(label.strip() in get_release_notes_labels() for label in pr_labels) + + +def delete_comments(pr: GitHubPR) -> None: + # Delete all previous comments + for comment in pr.get_comments(): + if comment.body_text.lstrip(" #").startswith(ERR_MSG_TITLE) and comment.author_login in BOT_AUTHORS: + delete_comment(comment.database_id) + + +def add_comment(pr: GitHubPR) -> None: + # Only make a comment if one doesn't exist already + for comment in pr.get_comments(): + if comment.body_text.lstrip(" #").startswith(ERR_MSG_TITLE) and comment.author_login in BOT_AUTHORS: + return + gh_post_pr_comment(pr.org, pr.project, pr.pr_num, ERR_MSG) + + +def parse_args() -> Any: + from argparse import ArgumentParser + parser = ArgumentParser("Check PR labels") + parser.add_argument("pr_num", type=int) + + return parser.parse_args() + + +def main() -> None: + args = parse_args() + repo = GitRepo(get_git_repo_dir(), get_git_remote_name()) + org, project = repo.gh_owner_and_name() + pr = GitHubPR(org, project, args.pr_num) + + try: + if not has_required_labels(pr): + print(ERR_MSG) + add_comment(pr) + exit(1) + else: + delete_comments(pr) + except Exception as e: + pass + + +if __name__ == "__main__": + main() diff --git a/.github/scripts/comment_on_pr.py b/.github/scripts/comment_on_pr.py new file mode 100644 index 000000000000..06b2eefe0988 --- /dev/null +++ b/.github/scripts/comment_on_pr.py @@ -0,0 +1,34 @@ +from typing import Any +from trymerge import gh_post_pr_comment +from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo +from trymerge_explainer import BOT_COMMANDS_WIKI +import os + + +def parse_args() -> Any: + from argparse import ArgumentParser + + parser = ArgumentParser("Comment on a PR") + parser.add_argument("pr_num", type=int) + parser.add_argument("action", type=str) + return parser.parse_args() + + +def main() 
-> None: + args = parse_args() + repo = GitRepo(get_git_repo_dir(), get_git_remote_name(), debug=True) + org, project = repo.gh_owner_and_name() + run_url = os.environ.get("GH_RUN_URL") + + job_link = f"[job]({run_url})" if run_url is not None else "job" + msg = ( + f"The {args.action} {job_link} was canceled. If you believe this is a mistake," + + f"then you can re trigger it through [pytorch-bot]({BOT_COMMANDS_WIKI})." + ) + + gh_post_pr_comment(org, project, args.pr_num, msg) + print(org, project, args.pr_num, msg) + + +if __name__ == "__main__": + main() diff --git a/.github/scripts/ensure_actions_will_cancel.py b/.github/scripts/ensure_actions_will_cancel.py index c479aefb9fc4..729d02f560fa 100755 --- a/.github/scripts/ensure_actions_will_cancel.py +++ b/.github/scripts/ensure_actions_will_cancel.py @@ -42,26 +42,26 @@ def should_check(filename: Path) -> bool: print("ERROR: duplicate workflow name:", name, file=sys.stderr) errors_found = True names.add(name) - - expected = { - "group": EXPECTED_GROUP, - "cancel-in-progress": True, - } - actual = data.get("concurrency", None) - if actual != expected: + actual = data.get("concurrency", {}) + if not actual.get("group", "").startswith(EXPECTED_GROUP): print( f"'concurrency' incorrect or not found in '{filename.relative_to(REPO_ROOT)}'", file=sys.stderr, ) print( - f"expected: {expected}", + f"concurrency group should start with {EXPECTED_GROUP} but found {actual.get('group', None)}", file=sys.stderr, ) + errors_found = True + if not actual.get("cancel-in-progress", False): print( - f"actual: {actual}", + f"'concurrency' incorrect or not found in '{filename.relative_to(REPO_ROOT)}'", + file=sys.stderr, + ) + print( + f"concurrency cancel-in-progress should be True but found {actual.get('cancel-in-progress', None)}", file=sys.stderr, ) - errors_found = True if errors_found: sys.exit(1) diff --git a/.github/scripts/fetch_latest_green_commit.py b/.github/scripts/fetch_latest_green_commit.py index c9bb4830ab72..447b76b2dd8b 100644 --- a/.github/scripts/fetch_latest_green_commit.py +++ b/.github/scripts/fetch_latest_green_commit.py @@ -84,8 +84,6 @@ def isGreen(commit: str, results: Dict[str, Any]) -> Tuple[bool, str]: return (False, workflowName + " checks were not successful") else: regex[required_check] = True - if workflowName in ["periodic", "docker-release-builds"] and conclusion not in ["success", "skipped"]: - return (False, workflowName + " checks were not successful") missing_workflows = [x for x in regex.keys() if not regex[x]] if len(missing_workflows) > 0: @@ -110,7 +108,7 @@ def main() -> None: ) qlambda = rs.QueryLambda.retrieve( 'commit_jobs_batch_query', - version='15aba20837ae9d75', + version='8003fdfd18b64696', workspace='commons') commits = get_latest_commits() diff --git a/.github/scripts/filter_test_configs.py b/.github/scripts/filter_test_configs.py new file mode 100755 index 000000000000..eab32401ad97 --- /dev/null +++ b/.github/scripts/filter_test_configs.py @@ -0,0 +1,207 @@ +#!/usr/bin/env python3 + +import sys +import re +import json +import os +import requests +from typing import Any, Dict, Set, List +import yaml +import warnings + +PREFIX = "test-config/" + +# Same as shard names +VALID_TEST_CONFIG_LABELS = {f"{PREFIX}{label}" for label in { + "backwards_compat", + "crossref", + "default", + "deploy", + "distributed", + "docs_tests", + "dynamo", + "force_on_cpu", + "functorch", + "inductor", + "inductor_distributed", + "inductor_huggingface", + "inductor_timm", + "inductor_torchbench", + "jit_legacy", + "multigpu", 
+ "nogpu_AVX512", + "nogpu_NO_AVX2", + "slow", + "tsan", + "xla", +}} + +# Supported modes when running periodically +SUPPORTED_PERIODICAL_MODES = { + "mem_leak_check", + "rerun_disabled_tests", +} + + +def parse_args() -> Any: + from argparse import ArgumentParser + parser = ArgumentParser("Filter all test configurations and keep only requested ones") + parser.add_argument("--test-matrix", type=str, required=True, help="the original test matrix") + parser.add_argument("--pr-number", type=str, help="the pull request number") + parser.add_argument("--tag", type=str, help="the associated tag if it exists") + parser.add_argument("--event-name", type=str, help="name of the event that triggered the job (pull, schedule, etc)") + parser.add_argument("--schedule", type=str, help="cron schedule that triggered the job") + return parser.parse_args() + + +def get_labels(pr_number: int) -> Set[str]: + """ + Dynamical get the latest list of labels from the pull request + """ + # From https://docs.github.com/en/actions/learn-github-actions/environment-variables + PYTORCH_REPO = os.environ.get("GITHUB_REPOSITORY", "pytorch/pytorch") + PYTORCH_GITHUB_API = f"https://api.github.com/repos/{PYTORCH_REPO}" + GITHUB_TOKEN = os.environ["GITHUB_TOKEN"] + + REQUEST_HEADERS = { + "Accept": "application/vnd.github.v3+json", + "Authorization": "token " + GITHUB_TOKEN, + } + + response = requests.get( + f"{PYTORCH_GITHUB_API}/issues/{pr_number}/labels", + headers=REQUEST_HEADERS, + ) + + if response.status_code != requests.codes.ok: + warnings.warn(f"Failed to get the labels for #{pr_number} (status code {response.status_code})") + return set() + + return {label.get("name") for label in response.json() if label.get("name")} + + +def filter(test_matrix: Dict[str, List[Any]], labels: Set[str]) -> Dict[str, List[Any]]: + """ + Select the list of test config to run from the test matrix. The logic works + as follows: + + If the PR has one or more labels as specified in the VALID_TEST_CONFIG_LABELS set, only + these test configs will be selected. This also works with ciflow labels, for example, + if a PR has both ciflow/trunk and test-config/functorch, only trunk functorch builds + and tests will be run + + If the PR has none of the test-config label, all tests are run as usual. 
+ """ + + filtered_test_matrix: Dict[str, List[Any]] = { + "include": [] + } + + for entry in test_matrix.get("include", []): + config_name = entry.get("config", "") + if not config_name: + continue + + label = f"{PREFIX}{config_name.strip()}" + if label in labels: + print(f"Select {config_name} because label {label} is presented in the pull request by the time the test starts") + filtered_test_matrix["include"].append(entry) + + valid_test_config_labels = labels.intersection(VALID_TEST_CONFIG_LABELS) + + if not filtered_test_matrix["include"] and not valid_test_config_labels: + # Found no valid label and the filtered test matrix is empty, return the same + # test matrix as before so that all tests can be run normally + return test_matrix + else: + # When the filter test matrix contain matches or if a valid test config label + # is found in the PR, return the filtered test matrix + return filtered_test_matrix + + +def set_periodic_modes(test_matrix: Dict[str, List[Any]]) -> Dict[str, List[Any]]: + """ + Apply all periodic modes when running under a schedule + """ + scheduled_test_matrix: Dict[str, List[Any]] = { + "include": [], + } + + for config in test_matrix.get("include", []): + for mode in SUPPORTED_PERIODICAL_MODES: + cfg = config.copy() + cfg[mode] = mode + scheduled_test_matrix["include"].append(cfg) + + return scheduled_test_matrix + + +def set_output(name: str, val: Any) -> None: + if os.getenv("GITHUB_OUTPUT"): + with open(str(os.getenv("GITHUB_OUTPUT")), "a") as env: + print(f"{name}={val}", file=env) + else: + print(f"::set-output name={name}::{val}") + + +def main() -> None: + args = parse_args() + # Load the original test matrix set by the workflow. Its format, however, + # doesn't follow the strict JSON format, so we load it using yaml here for + # its more relaxed syntax + test_matrix = yaml.safe_load(args.test_matrix) + + if test_matrix is None: + warnings.warn(f"Invalid test matrix input '{args.test_matrix}', exiting") + # We handle invalid test matrix gracefully by marking it as empty + set_output("is-test-matrix-empty", True) + sys.exit(0) + + pr_number = args.pr_number + tag = args.tag + + # If the tag matches, we can get the PR number from it, this is from ciflow + # workflow dispatcher + tag_regex = re.compile(r"^ciflow/\w+/(?P\d+)$") + + if pr_number: + # If a PR number is set, query all the labels from that PR + labels = get_labels(int(pr_number)) + # Then filter the test matrix and keep only the selected ones + filtered_test_matrix = filter(test_matrix, labels) + + elif tag: + m = tag_regex.match(tag) + + if m: + pr_number = m.group("pr_number") + + # The PR number can also come from the tag in ciflow tag event + labels = get_labels(int(pr_number)) + # Filter the test matrix and keep only the selected ones + filtered_test_matrix = filter(test_matrix, labels) + + else: + # There is a tag but it isn't ciflow, so there is nothing left to do + filtered_test_matrix = test_matrix + + else: + # No PR number, no tag, we can just return the test matrix as it is + filtered_test_matrix = test_matrix + + if args.event_name == "schedule" and args.schedule == '29 8 * * *': + # we don't want to run the mem leack check or disabled tests on normal + # periodically scheduled jobs, only the ones at this time + filtered_test_matrix = set_periodic_modes(filtered_test_matrix) + + # Set the filtered test matrix as the output + set_output("test-matrix", json.dumps(filtered_test_matrix)) + + filtered_test_matrix_len = len(filtered_test_matrix.get("include", [])) + # and also put a flag 
if the test matrix is empty, so subsequent jobs can + # quickly check it without the need to parse the JSON string + set_output("is-test-matrix-empty", filtered_test_matrix_len == 0) + + +if __name__ == "__main__": + main() diff --git a/.github/scripts/generate_binary_build_matrix.py b/.github/scripts/generate_binary_build_matrix.py index b1e3b46bda34..deb225287b3f 100644 --- a/.github/scripts/generate_binary_build_matrix.py +++ b/.github/scripts/generate_binary_build_matrix.py @@ -13,10 +13,10 @@ from typing import Dict, List, Tuple, Optional -CUDA_ARCHES = ["10.2", "11.3", "11.6", "11.7"] +CUDA_ARCHES = ["11.6", "11.7"] -ROCM_ARCHES = ["5.1.1", "5.2"] +ROCM_ARCHES = ["5.2", "5.3"] def arch_type(arch_version: str) -> str: @@ -90,11 +90,8 @@ def generate_conda_matrix(os: str) -> List[Dict[str, str]]: ret: List[Dict[str, str]] = [] arches = ["cpu"] python_versions = FULL_PYTHON_VERSIONS - if os == "linux": + if os == "linux" or os == "windows": arches += CUDA_ARCHES - elif os == "windows": - # We don't build CUDA 10.2 for window see https://github.com/pytorch/pytorch/issues/65648 - arches += list_without(CUDA_ARCHES, ["10.2"]) elif os == "macos-arm64": python_versions = list_without(python_versions, ["3.7"]) for python_version in python_versions: @@ -129,8 +126,7 @@ def generate_libtorch_matrix(os: str, abi_version: str, arches += CUDA_ARCHES arches += ROCM_ARCHES elif os == "windows": - # We don't build CUDA 10.2 for window see https://github.com/pytorch/pytorch/issues/65648 - arches += list_without(CUDA_ARCHES, ["10.2"]) + arches += CUDA_ARCHES if libtorch_variants is None: libtorch_variants = [ @@ -198,8 +194,7 @@ def generate_wheels_matrix(os: str, if os == "linux": arches += CUDA_ARCHES + ROCM_ARCHES elif os == "windows": - # We don't build CUDA 10.2 for window see https://github.com/pytorch/pytorch/issues/65648 - arches += list_without(CUDA_ARCHES, ["10.2"]) + arches += CUDA_ARCHES ret: List[Dict[str, str]] = [] for python_version in python_versions: @@ -209,6 +204,32 @@ def generate_wheels_matrix(os: str, # Skip rocm 3.11 binaries for now as the docker image are not correct if python_version == "3.11" and gpu_arch_type == "rocm": continue + + # special 11.7 wheels package without dependencies + # dependency downloaded via pip install + if arch_version == "11.7" and os == "linux": + ret.append( + { + "python_version": python_version, + "gpu_arch_type": gpu_arch_type, + "gpu_arch_version": gpu_arch_version, + "desired_cuda": translate_desired_cuda( + gpu_arch_type, gpu_arch_version + ), + "container_image": WHEEL_CONTAINER_IMAGES[arch_version], + "package_type": package_type, + "pytorch_extra_install_requirements": + "nvidia-cuda-runtime-cu11; platform_system == 'Linux' | " + "nvidia-cudnn-cu11==8.5.0.96; platform_system == 'Linux' | " + "nvidia-cublas-cu11==11.10.3.66; platform_system == 'Linux'", + "build_name": + f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-with-pypi-cudnn" + .replace( + ".", "_" + ), + } + ) + ret.append( { "python_version": python_version, diff --git a/.github/scripts/generate_ci_workflows.py b/.github/scripts/generate_ci_workflows.py index 653cfeebaab7..35680e30ee6a 100755 --- a/.github/scripts/generate_ci_workflows.py +++ b/.github/scripts/generate_ci_workflows.py @@ -134,7 +134,7 @@ class OperatingSystem: package_type="manywheel", build_configs=generate_binary_build_matrix.generate_wheels_matrix( OperatingSystem.LINUX, - arches=["10.2"], + arches=["11.6"], python_versions=["3.7"]), branches="master", ), @@ -154,7 +154,7 @@ class 
OperatingSystem: package_type="libtorch", abi_version=generate_binary_build_matrix.PRE_CXX11_ABI, build_configs=generate_binary_build_matrix.generate_libtorch_matrix( - OperatingSystem.LINUX, generate_binary_build_matrix.CXX11_ABI, + OperatingSystem.LINUX, generate_binary_build_matrix.PRE_CXX11_ABI, arches=["cpu"], libtorch_variants=["shared-with-deps"], ), @@ -207,15 +207,6 @@ class OperatingSystem: ), ] WINDOWS_BINARY_SMOKE_WORKFLOWS = [ - BinaryBuildWorkflow( - os=OperatingSystem.WINDOWS, - package_type="wheel", - build_configs=generate_binary_build_matrix.generate_wheels_matrix( - OperatingSystem.WINDOWS, - arches=["11.3"], - python_versions=["3.7"]), - branches="master", - ), BinaryBuildWorkflow( os=OperatingSystem.WINDOWS, package_type="libtorch", @@ -286,7 +277,7 @@ class OperatingSystem: BinaryBuildWorkflow( os=OperatingSystem.MACOS_ARM64, package_type="wheel", - build_configs=generate_binary_build_matrix.generate_wheels_matrix(OperatingSystem.MACOS), + build_configs=generate_binary_build_matrix.generate_wheels_matrix(OperatingSystem.MACOS_ARM64), cross_compile_arm64=True, ciflow_config=CIFlowConfig( labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL}, diff --git a/.github/scripts/generate_pytorch_version.py b/.github/scripts/generate_pytorch_version.py index 0655df137e07..02c19844cd09 100755 --- a/.github/scripts/generate_pytorch_version.py +++ b/.github/scripts/generate_pytorch_version.py @@ -23,27 +23,22 @@ def get_pytorch_root() -> Path: def get_tag() -> str: root = get_pytorch_root() - # We're on a tag - am_on_tag = ( - subprocess.run( - ['git', 'describe', '--tags', '--exact'], - cwd=root, - stdout=subprocess.DEVNULL, - stderr=subprocess.DEVNULL - ).returncode == 0 - ) - tag = "" - if am_on_tag: + try: dirty_tag = subprocess.check_output( - ['git', 'describe'], + ['git', 'describe', '--tags', '--exact'], cwd=root ).decode('ascii').strip() - # Strip leading v that we typically do when we tag branches - # ie: v1.7.1 -> 1.7.1 - tag = re.sub(LEADING_V_PATTERN, "", dirty_tag) - # Strip trailing rc pattern - # ie: 1.7.1-rc1 -> 1.7.1 - tag = re.sub(TRAILING_RC_PATTERN, "", tag) + except subprocess.CalledProcessError: + return "" + # Strip leading v that we typically do when we tag branches + # ie: v1.7.1 -> 1.7.1 + tag = re.sub(LEADING_V_PATTERN, "", dirty_tag) + # Strip trailing rc pattern + # ie: 1.7.1-rc1 -> 1.7.1 + tag = re.sub(TRAILING_RC_PATTERN, "", tag) + # Ignore ciflow tags + if tag.startswith("ciflow/"): + return "" return tag def get_base_version() -> str: diff --git a/.github/scripts/gql_mocks.json b/.github/scripts/gql_mocks.json index b146600f936a..7f6dbc05d341 100644 --- a/.github/scripts/gql_mocks.json +++ b/.github/scripts/gql_mocks.json @@ -1,20 +1,20 @@ { - "query_sha=0e2a29eda6405cea4c9de20fb80ae7924910e17272a7b251040182e7d8c390e0 name=pytorch number=73811 owner=pytorch": { + "query_sha=e29d0e1d73b9847dacfd671c80e117d82111b407f7daa8ff885d3c444eafe47f name=pytorch number=82169 owner=pytorch": { "data": { "repository": { "pullRequest": { "closed": true, "isCrossRepository": false, "author": { - "login": "seemethere" + "login": "ezyang" }, - "title": "ci: Migrate metrics credentials to managed IAM", - "body": "Stack from [ghstack](https://github.com/ezyang/ghstack):\n* __->__ #73811\n\r\nMigrates our credentials to upload metrics statistics to managed IAM\r\ncredentials in order to make it easier to know where the credentials are\r\ncoming from and to make it easier to add more permissions / less\r\npermissions later on.\r\n\r\nRelates to work done in 
[D34535827](https://www.internalfb.com/diff/D34535827)\r\n\r\nSigned-off-by: Eli Uriegas ", - "headRefName": "gh/seemethere/215/head", + "title": "Move test_dtypes so it runs later", + "body": "Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):\n* __->__ #82169\n\nThe error messages it gives are very unhelpful (because a failure\ngets translated into \"dtype was not supported\" rather than the\nactual backtrace), so I'd rather get error messages about this after\nI've tested basic functionality.\n\nSigned-off-by: Edward Z. Yang ", + "headRefName": "gh/ezyang/1279/head", "headRepository": { "nameWithOwner": "pytorch/pytorch" }, - "baseRefName": "gh/seemethere/215/base", + "baseRefName": "gh/ezyang/1279/base", "baseRepository": { "nameWithOwner": "pytorch/pytorch", "isPrivate": false, @@ -29,32 +29,44 @@ "commit": { "author": { "user": { - "login": "seemethere" + "login": "ezyang" }, - "email": "eliuriegas@fb.com", - "name": "Eli Uriegas" + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" }, - "oid": "13c44d16a876a56bca479b4cf30715d21fa16e99" + "oid": "cef34da55a59da5a32494bff218ccd4978b659d3" } }, { "commit": { "author": { "user": { - "login": "seemethere" + "login": "ezyang" }, - "email": "eliuriegas@fb.com", - "name": "Eli Uriegas" + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" }, - "oid": "9d26f4e6d8c8df275ea546180fef42548257d2d7" + "oid": "83ad7e73a07111ac1d85e931d14360cc22c01edd" + } + }, + { + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" + }, + "oid": "28140e4008289251b695385acfb48ac7a47cd49c" } } ], "pageInfo": { - "endCursor": "Mg", + "endCursor": "Mw", "hasNextPage": false }, - "totalCount": 2 + "totalCount": 3 }, "commits": { "nodes": [ @@ -62,54 +74,6 @@ "commit": { "checkSuites": { "edges": [ - { - "node": { - "app": { - "name": "Facebook GitHub Tools", - "databaseId": 12274 - }, - "workflowRun": null, - "checkRuns": { - "nodes": [ - { - "name": "Facebook CLA Check", - "conclusion": "SUCCESS", - "detailsUrl": "https://code.intern.facebook.com/cla/" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOaHA=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658275867" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcBs=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build" - } - }, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } - }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276090" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcPo=" - }, { "node": { "app": { @@ -118,20 +82,61 @@ }, "workflowRun": { "workflow": { - "name": "win-vs2019-cpu-py3" + "name": "Lint" } }, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "lintrunner", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310707890" + }, + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310708140" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310708223" + }, + { + "name": "Test collect_env (older_python_version)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310708332" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310708496" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310708710" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310708937" + }, + { + "name": "workflow-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310709169" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcGj1lc=", "hasNextPage": false } }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276092" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696649" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcPw=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRc8k=" }, { "node": { @@ -141,7 +146,7 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3-clang5-mobile-build" + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" } }, "checkRuns": { @@ -152,9 +157,9 @@ } }, "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276094" + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696651" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcP4=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRc8s=" }, { "node": { @@ -164,20 +169,26 @@ }, "workflowRun": { "workflow": { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single" + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" } }, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823982/jobs/4310707884" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcGjz0w=", "hasNextPage": false } }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276095" + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696656" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcP8=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRc9A=" }, { "node": { @@ -198,9 +209,9 @@ } }, "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276097" + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696660" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQE=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRc9Q=" }, { "node": { @@ -210,7 +221,7 @@ }, "workflowRun": { "workflow": { - "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + "name": "pull" } }, "checkRuns": { @@ -221,9 +232,9 @@ } }, "conclusion": "CANCELLED", - "url": 
"https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276098" + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696715" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQI=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdAs=" }, { "node": { @@ -233,375 +244,304 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3.7-gcc7-no-ops" + "name": "pull" } }, "checkRuns": { "nodes": [ { - "name": "build", + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815315?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObRM=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276099" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQM=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "Test tools" - } - }, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } - }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276100" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQQ=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3.7-clang7-asan" - } - }, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } - }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276101" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQU=" - } - ], - "pageInfo": { - "hasNextPage": true - } - }, - "pushedDate": "2022-03-14T23:01:55Z", - "oid": "9d26f4e6d8c8df275ea546180fef42548257d2d7" - } - } - ] - }, - "changedFiles": 3, - "files": { - "nodes": [ - { - "path": ".github/templates/common.yml.j2" - }, - { - "path": ".github/workflows/generated-macos-11-py3-x86-64.yml" - }, - { - "path": ".github/workflows/update_pytorch_labels.yml" - } - ], - "pageInfo": { - "endCursor": "Mw", - "hasNextPage": false - } - }, - "reviews": { - "nodes": [ - { - "author": { - "login": "kit1980" - }, - "state": "APPROVED" - }, - { - "author": { - "login": "janeyx99" - }, - "state": "APPROVED" - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wMy0wNFQxNjoyNDo0OC0wNjowMLkyMDIyLTAzLTA0VDE2OjI0OjQ4LTA2OjAwzjWwwqA=", - "hasPreviousPage": false - } - }, - "comments": { - "nodes": [ - { - "bodyText": "Merge failed due to Too many checksuites for commit\nRaised by https://github.com/pytorch/pytorch/actions/runs/1988337976", - "author": { - "login": "pytorchmergebot" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1068270969 - }, - { - "bodyText": "@pytorchbot force merge this", - "author": { - "login": "seemethere" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1068436128 - }, - { - "bodyText": "Merge failed due to Too many checksuites for commit\nRaised by https://github.com/pytorch/pytorch/actions/runs/1989076952", - "author": { - "login": "pytorchmergebot" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1068437098 - }, - { - "bodyText": "@pytorchbot merge this", - 
"author": { - "login": "seemethere" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1068482921 - }, - { - "bodyText": "Hey @seemethere.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", - "author": { - "login": "github-actions" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 1068484404 - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOP6yFeQ==", - "hasPreviousPage": true - } - }, - "labels": { - "edges": [ - { - "node": { - "name": "cla signed" - } - } - ] - } - } - } - } - }, - "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAVFCcQU= name=pytorch number=73811 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "commits": { - "nodes": [ - { - "commit": { - "oid": "9d26f4e6d8c8df275ea546180fef42548257d2d7", - "checkSuites": { - "edges": [ - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" - } - }, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } - }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276102" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQY=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-bionic-py3.7-clang9" - } - }, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } - }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276103" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQc=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3.7-clang7-onnx" - } - }, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } - }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276104" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQg=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3.7-gcc7" - } - }, - "checkRuns": { - "nodes": [ + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310708487" + }, { - "name": "build", + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815361?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310708713" }, { - "name": "test (default, 2, 2, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/5545915218?check_suite_focus=true" + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310708942" }, { - "name": "test (distributed, 1, 1, linux.2xlarge)", + "name": "linux-focal-py3.7-clang7-asan / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545915270?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310709174" }, { - "name": "test (default, 1, 2, linux.2xlarge)", + "name": "linux-bionic-py3_7-clang8-xla / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545915344?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqP89A=", - "hasNextPage": false - } - }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276105" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQk=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static" - } - }, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } - }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276106" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQo=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" - } - }, - "checkRuns": { - "nodes": [ + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310709340" + }, { - "name": "build-and-test", + "name": "linux-focal-py3.7-gcc7-no-ops / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815353?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObTk=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276107" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQs=" - }, - { + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310709579" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310709844" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310710003" + }, + { + "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310710175" + }, + { + "name": "win-vs2019-cuda11.6-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310710516" + }, + { + "name": "linux-focal-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310710716" + }, + { + 
"name": "win-vs2019-cpu-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310710890" + }, + { + "name": "linux-focal-py3.7-gcc7-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310711097" + }, + { + "name": "linux-focal-py3.7-clang10-onnx / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310711234" + }, + { + "name": "linux-xenial-py3-clang5-mobile-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310711429" + }, + { + "name": "linux-focal-rocm5.2-py3.7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310711603" + }, + { + "name": "linux-jammy-cuda11.6-cudnn8-py3.8-clang12 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310711765" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310711946" + }, + { + "name": "linux-xenial-cuda11_3-py3_7-gcc7-deploy / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310712129" + }, + { + "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310712276" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311194495" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311194591" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311194659" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311194749" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (dynamo, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311194858" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (dynamo, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311194934" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (functorch, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311195003" + }, + { + "name": "linux-focal-py3.7-clang10-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311220458" + }, + { + "name": "linux-focal-py3.7-clang10-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311220540" + }, + { + "name": "linux-docs / build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311222725" + }, + { + "name": "linux-docs / build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311222869" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223128" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223225" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223324" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (functorch, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223396" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223496" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223569" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223690" + }, + { + "name": "linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311224360" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311230050" + }, + { + "name": "linux-focal-py3.7-clang7-asan / test (default, 1, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311301930" + }, + { + "name": "linux-focal-py3.7-clang7-asan / test (default, 2, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311302152" + }, + { + "name": "linux-focal-py3.7-clang7-asan / test (default, 3, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311302303" + }, + { + "name": "linux-focal-py3.7-clang7-asan / test (default, 4, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311302433" + }, + { + "name": "linux-focal-py3.7-clang7-asan / test (default, 5, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311302531" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311491082" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311491172" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311491232" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311491289" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311491348" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcG0YME=", + "hasNextPage": true + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696836" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdIQ=" + }, + { "node": { "app": { - "name": "GitHub Actions", - "databaseId": 15368 + "name": "Facebook GitHub Tools", + "databaseId": 12274 }, - "workflowRun": { - "workflow": { - "name": "linux-docs" + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS", + "detailsUrl": "https://code.intern.facebook.com/cla/" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcGjyQg=", + "hasNextPage": false } }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696896" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdMA=" + }, + { + "node": { + "app": { + "name": "Netlify", + "databaseId": 13473 + }, + "workflowRun": null, "checkRuns": { "nodes": [], "pageInfo": { @@ -609,22 +549,18 @@ "hasNextPage": false } }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276110" + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546697185" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQ4=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdeE=" }, { "node": { "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "win-vs2019-cuda11.3-py3" - } + "name": "Azure Pipelines", + "databaseId": 9426 }, + "workflowRun": null, "checkRuns": { "nodes": [], "pageInfo": { @@ -632,82 +568,197 @@ "hasNextPage": false } }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276111" + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546697205" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQ8=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdfU=" }, { "node": { "app": { - "name": "GitHub Actions", - "databaseId": 15368 + "name": "Dependabot", + "databaseId": 29110 }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-cuda11.3-py3.7-gcc7" + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + 
"hasNextPage": false } }, - "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815317?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5546189850?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, linux.4xlarge.nvidia.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5546189908?check_suite_focus=true" - }, - { - "name": "test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5546189954?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqUJII=", - "hasNextPage": false - } - }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276112" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRA=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "pytorch-xla-linux-bionic-py3.7-clang8" - } - }, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } - }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276114" + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546697224" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRI=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdgg=" } ], "pageInfo": { "hasNextPage": true } + }, + "status": null, + "pushedDate": "2022-07-27T15:34:17Z", + "oid": "28140e4008289251b695385acfb48ac7a47cd49c" + } + } + ] + }, + "changedFiles": 1, + "files": { + "nodes": [ + { + "path": "test/test_ops.py" + } + ], + "pageInfo": { + "endCursor": "MQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "zou3519" + }, + "state": "APPROVED" + }, + { + "author": { + "login": "Chillee" + }, + "state": "APPROVED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNy0yNVQxNDo0NTozNS0wNzowMLkyMDIyLTA3LTI1VDE0OjQ1OjM1LTA3OjAwzj6XYmg=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ + { + "bodyText": "@pytorchbot merge -f FORCE", + "author": { + "login": "malfet" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1197107402 + }, + { + "bodyText": "You need to provide a reason for using force merge, in the format @pytorchbot merge -f '[CATEGORY] Explanation'. With [CATEGORY] being one the following:\nEMERGENCY - an emergency fix to quickly address an issue\nMINOR - a minor fix such as cleaning locally unused variables, which shouldn't break anything\nPRE_TESTED - a previous CI run tested everything and you've only added minor changes like fixing lint\nOTHER - something not covered above", + "author": { + "login": "pytorch-bot" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1197107439 + }, + { + "bodyText": "@pytorchbot merge -f \"[OTHER] normal land failed twice already\"", + "author": { + "login": "malfet" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1197108130 + }, + { + "bodyText": "@pytorchbot successfully started a merge job. 
Check the current status here", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1197119348 + }, + { + "bodyText": "Hey @ezyang.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1197120095 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOR1poyg==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ + { + "node": { + "name": "Merged" + } + }, + { + "node": { + "name": "cla signed" + } + } + ] + } + } + } + } + }, + "query_sha=4c16925415d1fcc12ac0f5f7ce73b8e6122997d2f51c4c2757c2543e6493c60d cr_cursor=Y3Vyc29yOnYyOpHPAAAAAcG0YME= cs_cursor=Y3Vyc29yOnYyOpHPAAAAAcHRdAs= name=pytorch number=82169 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "commits": { + "nodes": [ + { + "commit": { + "oid": "28140e4008289251b695385acfb48ac7a47cd49c", + "checkSuites": { + "nodes": [ + { + "checkRuns": { + "nodes": [ + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311491405" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (functorch, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311491484" + }, + { + "name": "linux-xenial-cuda11_3-py3_7-gcc7-deploy / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311491703" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311551941" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311552010" + }, + { + "name": "win-vs2019-cpu-py3 / test (functorch, 1, 1, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311552076" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcG1sTc=", + "hasNextPage": false + } + } + } + ] } } } @@ -717,7 +768,7 @@ } } }, - "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAVFCcRI= name=pytorch number=73811 owner=pytorch": { + "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAcHRdgg= name=pytorch number=82169 owner=pytorch": { "data": { "repository": { "pullRequest": { @@ -725,20 +776,16 @@ "nodes": [ { "commit": { - "oid": "9d26f4e6d8c8df275ea546180fef42548257d2d7", + "oid": "28140e4008289251b695385acfb48ac7a47cd49c", "checkSuites": { "edges": [ { "node": { "app": { - 
"name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3.7-gcc5.4" - } + "name": "Codecov", + "databaseId": 254 }, + "workflowRun": null, "checkRuns": { "nodes": [], "pageInfo": { @@ -746,22 +793,18 @@ "hasNextPage": false } }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276115" + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546697240" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRM=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdhg=" }, { "node": { "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-vulkan-bionic-py3.7-clang9" - } + "name": "PyTorch Bot", + "databaseId": 40112 }, + "workflowRun": null, "checkRuns": { "nodes": [], "pageInfo": { @@ -769,54 +812,111 @@ "hasNextPage": false } }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276117" + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546697255" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRU=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-bionic-py3.7-clang9" - } - }, - "checkRuns": { + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdic=" + } + ], + "pageInfo": { + "hasNextPage": false + } + } + } + } + ] + } + } + } + } + }, + "query_sha=e29d0e1d73b9847dacfd671c80e117d82111b407f7daa8ff885d3c444eafe47f name=pytorch number=73811 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": false, + "author": { + "login": "seemethere" + }, + "title": "ci: Migrate metrics credentials to managed IAM", + "body": "Stack from [ghstack](https://github.com/ezyang/ghstack):\n* __->__ #73811\n\r\nMigrates our credentials to upload metrics statistics to managed IAM\r\ncredentials in order to make it easier to know where the credentials are\r\ncoming from and to make it easier to add more permissions / less\r\npermissions later on.\r\n\r\nRelates to work done in [D34535827](https://www.internalfb.com/diff/D34535827)\r\n\r\nSigned-off-by: Eli Uriegas ", + "headRefName": "gh/seemethere/215/head", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "gh/seemethere/215/base", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "seemethere" + }, + "email": "eliuriegas@fb.com", + "name": "Eli Uriegas" + }, + "oid": "13c44d16a876a56bca479b4cf30715d21fa16e99" + } + }, + { + "commit": { + "author": { + "user": { + "login": "seemethere" + }, + "email": "eliuriegas@fb.com", + "name": "Eli Uriegas" + }, + "oid": "9d26f4e6d8c8df275ea546180fef42548257d2d7" + } + } + ], + "pageInfo": { + "endCursor": "Mg", + "hasNextPage": false + }, + "totalCount": 2 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { "nodes": [ { - "name": "build", - "conclusion": "SUCCESS", - 
"detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815309?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545918134?check_suite_focus=true" - }, - { - "name": "test (noarch, 1, 1, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545918256?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, linux.2xlarge)", + "name": "Facebook CLA Check", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545918319?check_suite_focus=true" + "detailsUrl": "https://code.intern.facebook.com/cla/" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqP_28=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOaHA=", "hasNextPage": false } }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276119" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658275867" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRc=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcBs=" }, { "node": { @@ -826,7 +926,7 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3.7-gcc7-no-ops" + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build" } }, "checkRuns": { @@ -837,9 +937,9 @@ } }, "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276122" + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276090" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRo=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcPo=" }, { "node": { @@ -849,36 +949,20 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3.7-clang7-onnx" + "name": "win-vs2019-cpu-py3" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815351?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545931419?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545931552?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQMyA=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276123" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276092" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRs=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcPw=" }, { "node": { @@ -888,41 +972,20 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3.7-clang7-asan" + "name": "linux-xenial-py3-clang5-mobile-build" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815311?check_suite_focus=true" - }, - { - "name": "test (default, 3, 3, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/5545947543?check_suite_focus=true" - }, - { - "name": "test (default, 1, 3, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545947625?check_suite_focus=true" - }, - { - "name": "test (default, 2, 3, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545947792?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQcpA=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276124" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276094" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRw=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcP4=" }, { "node": { @@ -932,66 +995,20 @@ }, "workflowRun": { "workflow": { - "name": "Lint" + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single" } }, "checkRuns": { - "nodes": [ - { - "name": "cmakelint", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815342?check_suite_focus=true" - }, - { - "name": "clang-format", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815564?check_suite_focus=true" - }, - { - "name": "clang-tidy", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815688?check_suite_focus=true" - }, - { - "name": "flake8-py3", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815821?check_suite_focus=true" - }, - { - "name": "quick-checks", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816003?check_suite_focus=true" - }, - { - "name": "mypy", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816076?check_suite_focus=true" - }, - { - "name": "py2-setup-validate-errormsg", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816154?check_suite_focus=true" - }, - { - "name": "shellcheck", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816266?check_suite_focus=true" - }, - { - "name": "toc", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816398?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcU4=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276126" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276095" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcR4=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcP8=" }, { "node": { @@ -1001,26 +1018,20 @@ }, "workflowRun": { "workflow": { - "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + "name": "Lint" } }, "checkRuns": { - "nodes": [ - { - "name": "run-torchbench", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815207?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObKc=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": 
"SKIPPED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276127" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276097" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcR8=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQE=" }, { "node": { @@ -1030,7 +1041,7 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-cuda11.3-py3.7-gcc7" + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" } }, "checkRuns": { @@ -1041,9 +1052,9 @@ } }, "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276129" + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276098" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSE=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQI=" }, { "node": { @@ -1053,26 +1064,199 @@ }, "workflowRun": { "workflow": { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" + "name": "linux-xenial-py3.7-gcc7-no-ops" } }, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602966/jobs/2839950629" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObRM=", "hasNextPage": false } }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276130" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276099" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSI=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQM=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Test tools" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276100" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQQ=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-asan" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276101" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQU=" } ], "pageInfo": { "hasNextPage": true } - } + }, + "status": { + "contexts": [ + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17044969?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17045014?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", + 
"state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17044975?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + } + ] + }, + "pushedDate": "2022-03-14T23:01:55Z", + "oid": "9d26f4e6d8c8df275ea546180fef42548257d2d7" + } + } + ] + }, + "changedFiles": 3, + "files": { + "nodes": [ + { + "path": ".github/templates/common.yml.j2" + }, + { + "path": ".github/workflows/generated-macos-11-py3-x86-64.yml" + }, + { + "path": ".github/workflows/update_pytorch_labels.yml" + } + ], + "pageInfo": { + "endCursor": "Mw", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "kit1980" + }, + "state": "APPROVED" + }, + { + "author": { + "login": "janeyx99" + }, + "state": "APPROVED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wMy0wNFQxNDoyNDo0OC0wODowMLkyMDIyLTAzLTA0VDE0OjI0OjQ4LTA4OjAwzjWwwqA=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ + { + "bodyText": "Merge failed due to Too many checksuites for commit\nRaised by https://github.com/pytorch/pytorch/actions/runs/1988337976", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1068270969 + }, + { + "bodyText": "@pytorchbot force merge this", + "author": { + "login": "seemethere" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1068436128 + }, + { + "bodyText": "Merge failed due to Too many checksuites for commit\nRaised by https://github.com/pytorch/pytorch/actions/runs/1989076952", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1068437098 + }, + { + "bodyText": "@pytorchbot merge this", + "author": { + "login": "seemethere" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1068482921 + }, + { + "bodyText": "Hey @seemethere.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' 
and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1068484404 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOP6yFeQ==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ + { + "node": { + "name": "cla signed" } } ] @@ -1081,7 +1265,7 @@ } } }, - "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAVFCcSI= name=pytorch number=73811 owner=pytorch": { + "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAVFCcQU= name=pytorch number=73811 owner=pytorch": { "data": { "repository": { "pullRequest": { @@ -1100,31 +1284,20 @@ }, "workflowRun": { "workflow": { - "name": "pytorch-xla-linux-bionic-py3.7-clang8" + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815348?check_suite_focus=true" - }, - { - "name": "test (xla, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545954339?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQjCM=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276131" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276102" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSM=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQY=" }, { "node": { @@ -1134,41 +1307,20 @@ }, "workflowRun": { "workflow": { - "name": "win-vs2019-cuda11.3-py3" + "name": "linux-bionic-py3.7-clang9" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815322?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, windows.8xlarge.nvidia.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5546226404?check_suite_focus=true" - }, - { - "name": "test (force_on_cpu, 1, 1, windows.4xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5546226489?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, windows.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5546226540?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqUs2w=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276132" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276103" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSQ=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQc=" }, { "node": { @@ -1178,26 +1330,20 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build" + "name": "linux-xenial-py3.7-clang7-onnx" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - 
"detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815307?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObQs=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276133" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276104" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSU=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQg=" }, { "node": { @@ -1207,26 +1353,41 @@ }, "workflowRun": { "workflow": { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single" + "name": "linux-xenial-py3.7-gcc7" } }, "checkRuns": { "nodes": [ { - "name": "build-and-test", + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602973/jobs/2839950664" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602973/jobs/2840019714" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602973/jobs/2840019747" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815362?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602973/jobs/2840019794" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObUI=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqP89A=", "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276134" + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276105" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSY=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQk=" }, { "node": { @@ -1236,26 +1397,20 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + "name": "linux-xenial-py3-clang5-mobile-custom-build-static" } }, "checkRuns": { - "nodes": [ - { - "name": "build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815337?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObSk=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276135" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276106" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSc=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQo=" }, { "node": { @@ -1265,31 +1420,26 @@ }, "workflowRun": { "workflow": { - "name": "linux-vulkan-bionic-py3.7-clang9" + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" } }, "checkRuns": { "nodes": [ { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815561?check_suite_focus=true" - }, - { - "name": "test (default, 1, 1, linux.2xlarge)", + "name": 
"build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545929390?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602977/jobs/2839950658" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQKq4=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObTk=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276136" + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276107" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSg=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQs=" }, { "node": { @@ -1303,32 +1453,16 @@ } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815356?check_suite_focus=true" - }, - { - "name": "build-docs (cpp)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545920544?check_suite_focus=true" - }, - { - "name": "build-docs (python)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545920612?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQCGQ=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276137" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276110" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSk=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQ4=" }, { "node": { @@ -1338,36 +1472,20 @@ }, "workflowRun": { "workflow": { - "name": "linux-bionic-rocm4.5-py3.7" + "name": "win-vs2019-cuda11.3-py3" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815326?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, linux.rocm.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545983951?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, linux.rocm.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545984049?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqRADE=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276140" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276111" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSw=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQ8=" }, { "node": { @@ -1377,7 +1495,7 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3-clang5-mobile-build" + "name": "linux-xenial-cuda11.3-py3.7-gcc7" } }, "checkRuns": { @@ -1385,18 +1503,33 @@ { "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815205?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602979/jobs/2839950630" + }, + { + "name": "test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + 
"detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602979/jobs/2840213785" + }, + { + "name": "test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602979/jobs/2840213832" + }, + { + "name": "test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602979/jobs/2840213866" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObKU=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqUJII=", "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276141" + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276112" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcS0=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRA=" }, { "node": { @@ -1406,36 +1539,20 @@ }, "workflowRun": { "workflow": { - "name": "win-vs2019-cpu-py3" + "name": "pytorch-xla-linux-bionic-py3.7-clang8" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815314?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5546093287?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, windows.4xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5546093438?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqSq34=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276143" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276114" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcS8=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRI=" } ], "pageInfo": { @@ -1450,7 +1567,7 @@ } } }, - "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAVFCcS8= name=pytorch number=73811 owner=pytorch": { + "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAVFCcRI= name=pytorch number=73811 owner=pytorch": { "data": { "repository": { "pullRequest": { @@ -1473,52 +1590,16 @@ } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815359?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545923802?check_suite_focus=true" - }, - { - "name": "test (distributed, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545923899?check_suite_focus=true" - }, - { - "name": "test (backwards_compat, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545924024?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545924110?check_suite_focus=true" 
- }, - { - "name": "test (jit_legacy, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545924249?check_suite_focus=true" - }, - { - "name": "test (docs_test, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545924341?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQFvU=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276145" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276115" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcTE=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRM=" }, { "node": { @@ -1528,7 +1609,7 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3.7-gcc7" + "name": "linux-vulkan-bionic-py3.7-clang9" } }, "checkRuns": { @@ -1539,9 +1620,9 @@ } }, "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276149" + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276117" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcTU=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRU=" }, { "node": { @@ -1551,20 +1632,41 @@ }, "workflowRun": { "workflow": { - "name": "linux-bionic-rocm4.5-py3.7" + "name": "linux-bionic-py3.7-clang9" } }, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602984/jobs/2839950624" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602984/jobs/2840021854" + }, + { + "name": "test (noarch, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602984/jobs/2840021946" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602984/jobs/2840021988" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqP_28=", "hasNextPage": false } }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276152" + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276119" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcTg=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRc=" }, { "node": { @@ -1574,26 +1676,20 @@ }, "workflowRun": { "workflow": { - "name": "Test tools" + "name": "linux-xenial-py3.7-gcc7-no-ops" } }, "checkRuns": { - "nodes": [ - { - "name": "test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815310?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObQ4=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276157" + "conclusion": "CANCELLED", + "url": 
"https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276122" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcT0=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRo=" }, { "node": { @@ -1603,7 +1699,7 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static" + "name": "linux-xenial-py3.7-clang7-onnx" } }, "checkRuns": { @@ -1611,18 +1707,28 @@ { "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545815320?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602988/jobs/2839950656" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602988/jobs/2840031185" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602988/jobs/2840031288" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObRg=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQMyA=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276159" + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276123" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcT8=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRs=" }, { "node": { @@ -1632,7 +1738,7 @@ }, "workflowRun": { "workflow": { - "name": "macos-10-15-py3-arm64" + "name": "linux-xenial-py3.7-clang7-asan" } }, "checkRuns": { @@ -1640,18 +1746,33 @@ { "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816079?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602989/jobs/2839950625" + }, + { + "name": "test (default, 3, 3, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602989/jobs/2840042498" + }, + { + "name": "test (default, 1, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602989/jobs/2840042534" + }, + { + "name": "test (default, 2, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602989/jobs/2840042646" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcA8=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQcpA=", "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276857" + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276124" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCc_k=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcRw=" }, { "node": { @@ -1661,26 +1782,66 @@ }, "workflowRun": { "workflow": { - "name": "ios-12-5-1-arm64-coreml" + "name": "Lint" } }, "checkRuns": { "nodes": [ { - "name": "build", + "name": "cmakelint", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816078?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602990/jobs/2839950650" + }, + { + "name": "clang-format", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/1983602990/jobs/2839950743" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602990/jobs/2839950808" + }, + { + "name": "flake8-py3", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602990/jobs/2839950884" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602990/jobs/2839950992" + }, + { + "name": "mypy", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602990/jobs/2839951037" + }, + { + "name": "py2-setup-validate-errormsg", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602990/jobs/2839951085" + }, + { + "name": "shellcheck", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602990/jobs/2839951170" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602990/jobs/2839951266" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcA4=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcU4=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276860" + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276126" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCc_w=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcR4=" }, { "node": { @@ -1690,26 +1851,26 @@ }, "workflowRun": { "workflow": { - "name": "ios-12-5-1-arm64" + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" } }, "checkRuns": { "nodes": [ { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816071?check_suite_focus=true" + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602993/jobs/2839950562" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcAc=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObKc=", "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276861" + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276127" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCc_0=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcR8=" }, { "node": { @@ -1719,36 +1880,20 @@ }, "workflowRun": { "workflow": { - "name": "macos-11-py3-x86-64" + "name": "linux-xenial-cuda11.3-py3.7-gcc7" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816073?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, macos-11)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5546066712?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, macos-11)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5546066787?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqSQ2M=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "FAILURE", - "url": 
"https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276862" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276129" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCc_4=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSE=" }, { "node": { @@ -1758,26 +1903,20 @@ }, "workflowRun": { "workflow": { - "name": "ios-12-5-1-arm64-custom-ops" + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816081?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcBE=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276864" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276130" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdAA=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSI=" } ], "pageInfo": { @@ -1792,7 +1931,7 @@ } } }, - "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAVFCdAA= name=pytorch number=73811 owner=pytorch": { + "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAVFCcSI= name=pytorch number=73811 owner=pytorch": { "data": { "repository": { "pullRequest": { @@ -1811,7 +1950,7 @@ }, "workflowRun": { "workflow": { - "name": "ios-12-5-1-x86-64-coreml" + "name": "pytorch-xla-linux-bionic-py3.7-clang8" } }, "checkRuns": { @@ -1819,18 +1958,23 @@ { "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816077?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602994/jobs/2839950655" + }, + { + "name": "test (xla, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602994/jobs/2840047401" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcA0=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQjCM=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276867" + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276131" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdAM=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSM=" }, { "node": { @@ -1840,7 +1984,7 @@ }, "workflowRun": { "workflow": { - "name": "ios-12-5-1-arm64-metal" + "name": "win-vs2019-cuda11.3-py3" } }, "checkRuns": { @@ -1848,18 +1992,33 @@ { "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816080?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602996/jobs/2839950632" + }, + { + "name": "test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602996/jobs/2840239369" + }, + { + "name": "test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "FAILURE", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/1983602996/jobs/2840239408" + }, + { + "name": "test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602996/jobs/2840239445" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcBA=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqUs2w=", "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276869" + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276132" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdAU=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSQ=" }, { "node": { @@ -1869,7 +2028,7 @@ }, "workflowRun": { "workflow": { - "name": "macos-10-15-py3-lite-interpreter-x86-64" + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build" } }, "checkRuns": { @@ -1877,18 +2036,18 @@ { "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816075?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602998/jobs/2839950621" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcAs=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObQs=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276873" + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276133" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdAk=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSU=" }, { "node": { @@ -1898,186 +2057,168 @@ }, "workflowRun": { "workflow": { - "name": "ios-12-5-1-x86-64" + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single" } }, "checkRuns": { "nodes": [ { - "name": "build", + "name": "build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5545816068?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602997/jobs/2839950665" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcAQ=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObUI=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276881" + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276134" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdBE=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSY=" }, { "node": { "app": { - "name": "Netlify", - "databaseId": 13473 + "name": "GitHub Actions", + "databaseId": 15368 }, - "workflowRun": null, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658277331" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCddM=" - }, - { - "node": { - "app": { - "name": "Azure Pipelines", - "databaseId": 9426 - }, - "workflowRun": null, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/1983603001/jobs/2839950648" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObSk=", "hasNextPage": false } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658277340" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276135" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCddw=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSc=" }, { "node": { "app": { - "name": "Dependabot", - "databaseId": 29110 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-vulkan-bionic-py3.7-clang9" + } }, - "workflowRun": null, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603002/jobs/2839950741" + }, + { + "name": "test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603002/jobs/2840029810" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQKq4=", "hasNextPage": false } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658277346" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276136" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdeI=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSg=" }, { "node": { "app": { - "name": "Codecov", - "databaseId": 254 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-docs" + } }, - "workflowRun": null, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603000/jobs/2839950661" + }, + { + "name": "build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603000/jobs/2840023513" + }, + { + "name": "build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603000/jobs/2840023552" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQCGQ=", "hasNextPage": false } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658277350" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276137" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdeY=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSk=" }, { "node": { "app": { - "name": "PyTorch Bot", - "databaseId": 40112 + "name": "GitHub Actions", + "databaseId": 15368 }, - "workflowRun": null, - "checkRuns": { - "nodes": [], + "workflowRun": { + "workflow": { + "name": "linux-bionic-rocm4.5-py3.7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603003/jobs/2839950637" + }, + { + "name": "test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603003/jobs/2840068586" + }, + { + "name": 
"test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603003/jobs/2840068671" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqRADE=", "hasNextPage": false } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658277355" + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276140" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdes=" - } - ], - "pageInfo": { - "hasNextPage": false - } - } - } - } - ] - } - } - } - } - }, - "query_sha=0e2a29eda6405cea4c9de20fb80ae7924910e17272a7b251040182e7d8c390e0 name=pytorch number=31093 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "closed": true, - "isCrossRepository": true, - "author": { - "login": "mingxiaoh" - }, - "title": "improve mkldnn convolution test coverage", - "body": "This pr will improve the test coverage of mkldnn convolution.\r\n1.test input: specific sensitive numbers\r\n2.pass criteria: output of mkldnn convolution matches output of thnn convolution\r\n3.coverage: by using coverage tool, we found out the following sensitive parameters. Overall the case will test 4352 patterns, takes 8.8s on my machine.\r\n\r\nto run the test case:\r\n\r\npython test_mkldnn_conv2d_ext.py\r\nor\r\npython run_test.py -i mkldnn_conv2d_ext\r\n\r\nIn case of failure, the pattern will be printed in the log for further debugging.\r\n\r\nactually, this PR is created to replace and improve that PR we created before(https://github.com/pytorch/pytorch/pull/25085) ", - "headRefName": "master", - "headRepository": { - "nameWithOwner": "mingxiaoh/pytorch" - }, - "baseRefName": "master", - "baseRepository": { - "nameWithOwner": "pytorch/pytorch", - "isPrivate": false, - "defaultBranchRef": { - "name": "master" - } - }, - "mergeCommit": null, - "commits_with_authors": { - "nodes": [ - { - "commit": { - "author": { - "user": { - "login": "11pikachu" - }, - "email": "junx.du@intel.com", - "name": "dujun" - }, - "oid": "29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9" - } - } - ], - "pageInfo": { - "endCursor": "MQ", - "hasNextPage": false - }, - "totalCount": 1 - }, - "commits": { - "nodes": [ - { - "commit": { - "checkSuites": { - "edges": [ + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcSw=" + }, { "node": { "app": { @@ -2086,26 +2227,26 @@ }, "workflowRun": { "workflow": { - "name": "clang-format" + "name": "linux-xenial-py3-clang5-mobile-build" } }, "checkRuns": { "nodes": [ { - "name": "clang-format", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676797?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603004/jobs/2839950560" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHOQYu8fQ==", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObKU=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1175281097" + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276141" }, - "cursor": "Y3Vyc29yOnYyOpHORg1dyQ==" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcS0=" }, { "node": { @@ -2115,2861 +2256,7081 @@ }, "workflowRun": { "workflow": { - "name": "Lint" + "name": "win-vs2019-cpu-py3" } }, "checkRuns": { "nodes": [ { - 
"name": "flake8-py3", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676800?check_suite_focus=true" - }, - { - "name": "quick-checks", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676817?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603005/jobs/2839950626" }, { - "name": "clang-tidy", + "name": "test (default, 2, 2, windows.4xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676829?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603005/jobs/2840145642" }, { - "name": "cmakelint", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676840?check_suite_focus=true" + "name": "test (default, 1, 2, windows.4xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603005/jobs/2840145755" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHOQYu8qA==", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqSq34=", "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1175281099" + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276143" }, - "cursor": "Y3Vyc29yOnYyOpHORg1dyw==" - }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcS8=" + } + ], + "pageInfo": { + "hasNextPage": true + } + } + } + } + ] + } + } + } + } + }, + "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAVFCcS8= name=pytorch number=73811 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "commits": { + "nodes": [ + { + "commit": { + "oid": "9d26f4e6d8c8df275ea546180fef42548257d2d7", + "checkSuites": { + "edges": [ { "node": { "app": { - "name": "Codecov", - "databaseId": 254 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc5.4" + } }, - "workflowRun": null, "checkRuns": { "nodes": [ { - "name": "codecov/project", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://codecov.io" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603007/jobs/2839950666" }, { - "name": "codecov/patch", + "name": "test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://codecov.io" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603007/jobs/2840025927" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603007/jobs/2840025995" + }, + { + "name": "test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603007/jobs/2840026086" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603007/jobs/2840026134" + }, + { + "name": "test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603007/jobs/2840026235" + }, + { + "name": "test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/1983603007/jobs/2840026282" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHOQZhcFQ==", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqQFvU=", "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1176100822" + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276145" }, - "cursor": "Y3Vyc29yOnYyOpHORhnf1g==" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcTE=" }, { "node": { "app": { - "name": "Codecov", - "databaseId": 254 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7" + } }, - "workflowRun": null, "checkRuns": { - "nodes": [ - { - "name": "codecov/patch", - "conclusion": "SUCCESS", - "detailsUrl": "https://codecov.io" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHOQZZsEQ==", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1176100824" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276149" }, - "cursor": "Y3Vyc29yOnYyOpHORhnf2A==" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcTU=" }, { "node": { "app": { - "name": "Facebook GitHub Tools", - "databaseId": 12274 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-rocm4.5-py3.7" + } }, - "workflowRun": null, "checkRuns": { - "nodes": [ - { - "name": "Facebook CLA Check", - "conclusion": "SUCCESS", - "detailsUrl": "https://code.facebook.com/cla/" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHOUquzJg==", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1487517306" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276152" }, - "cursor": "Y3Vyc29yOnYyOpHOWKm2eg==" - } - ], - "pageInfo": { - "hasNextPage": false + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcTg=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Test tools" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603012/jobs/2839950623" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObQ4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276157" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcT0=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603013/jobs/2839950631" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObRg=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + 
"url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276159" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcT8=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "macos-10-15-py3-arm64" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603251/jobs/2839951040" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcA8=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276857" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCc_k=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "ios-12-5-1-arm64-coreml" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603253/jobs/2839951038" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcA4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276860" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCc_w=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "ios-12-5-1-arm64" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603254/jobs/2839951030" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcAc=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276861" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCc_0=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "macos-11-py3-x86-64" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603255/jobs/2839951034" + }, + { + "name": "test (default, 1, 2, macos-11)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603255/jobs/2840127016" + }, + { + "name": "test (default, 2, 2, macos-11)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603255/jobs/2840127073" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqSQ2M=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276862" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCc_4=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "ios-12-5-1-arm64-custom-ops" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603256/jobs/2839951041" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcBE=", + "hasNextPage": false + } + }, + "conclusion": 
"SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276864" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdAA=" + } + ], + "pageInfo": { + "hasNextPage": true } - }, - "pushedDate": "2020-09-11T01:58:24Z", - "oid": "29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9" + } } } ] - }, - "changedFiles": 5, - "files": { - "nodes": [ - { - "path": "test/math_libraries/convolutions.py" - }, - { - "path": "test/math_libraries/convolutions_cases/shapes_googlenet_v3.json" - }, - { - "path": "test/math_libraries/convolutions_cases/shapes_maskrcnn_p1.json" - }, - { - "path": "test/math_libraries/convolutions_cases/shapes_mobilenet.json" - }, - { - "path": "test/math_libraries/convolutions_cases/shapes_resnet_50.json" - } - ], - "pageInfo": { - "endCursor": "NQ", - "hasNextPage": false - } - }, - "reviews": { + } + } + } + } + }, + "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAVFCdAA= name=pytorch number=73811 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "commits": { "nodes": [ { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, + "commit": { + "oid": "9d26f4e6d8c8df275ea546180fef42548257d2d7", + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "ios-12-5-1-x86-64-coreml" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603259/jobs/2839951039" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcA0=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276867" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdAM=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "ios-12-5-1-arm64-metal" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603261/jobs/2839951042" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcBA=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276869" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdAU=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "macos-10-15-py3-lite-interpreter-x86-64" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603264/jobs/2839951036" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcAs=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276873" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdAk=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "ios-12-5-1-x86-64" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": 
"SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983603269/jobs/2839951029" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOcAQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276881" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdBE=" + }, + { + "node": { + "app": { + "name": "Netlify", + "databaseId": 13473 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658277331" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCddM=" + }, + { + "node": { + "app": { + "name": "Azure Pipelines", + "databaseId": 9426 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658277340" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCddw=" + }, + { + "node": { + "app": { + "name": "Dependabot", + "databaseId": 29110 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658277346" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdeI=" + }, + { + "node": { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658277350" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdeY=" + }, + { + "node": { + "app": { + "name": "PyTorch Bot", + "databaseId": 40112 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658277355" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCdes=" + } + ], + "pageInfo": { + "hasNextPage": false + } + } + } + } + ] + } + } + } + } + }, + "query_sha=e29d0e1d73b9847dacfd671c80e117d82111b407f7daa8ff885d3c444eafe47f name=pytorch number=31093 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": true, + "author": { + "login": "mingxiaoh" + }, + "title": "improve mkldnn convolution test coverage", + "body": "This pr will improve the test coverage of mkldnn convolution.\r\n1.test input: specific sensitive numbers\r\n2.pass criteria: output of mkldnn convolution matches output of thnn convolution\r\n3.coverage: by using coverage tool, we found out the following sensitive parameters. 
Overall the case will test 4352 patterns, takes 8.8s on my machine.\r\n\r\nto run the test case:\r\n\r\npython test_mkldnn_conv2d_ext.py\r\nor\r\npython run_test.py -i mkldnn_conv2d_ext\r\n\r\nIn case of failure, the pattern will be printed in the log for further debugging.\r\n\r\nactually, this PR is created to replace and improve that PR we created before(https://github.com/pytorch/pytorch/pull/25085) ", + "headRefName": "master", + "headRepository": { + "nameWithOwner": "mingxiaoh/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "CHANGES_REQUESTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "CHANGES_REQUESTED" - }, - { - "author": { - "login": "ailzhang" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "ngimel" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "VitalyFedyunin" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "ngimel" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mingxiaoh" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "mingxiaoh" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "VitalyFedyunin" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "VitalyFedyunin" - }, - "state": "APPROVED" + "commit": { + "author": { + "user": { + "login": "11pikachu" + }, + "email": "junx.du@intel.com", + "name": "dujun" + }, + "oid": "29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9" + } } ], "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpO5MjAxOS0xMi0zMFQxMjoxOToxMS0wNjowMLkyMDE5LTEyLTMwVDEyOjE5OjExLTA2OjAwzhQZLuY=", - "hasPreviousPage": false - } + "endCursor": "MQ", + "hasNextPage": false + }, + "totalCount": 1 }, - "comments": { + "commits": { "nodes": [ { - 
"bodyText": "I cloned your repo and ran the tests:\n~/pytorch/test/math_libraries$ python convolutions.py\nFFFF\n======================================================================\nFAIL: test_conv2d_ext_cpu_float32 (__main__.TestConvExtCPU)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float16 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float32 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float64 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n----------------------------------------------------------------------\nRan 4 tests in 33.838s\n\nFAILED (failures=4)\n\nStill fails.\n\n@mruberry It is suggested by @VitalyFedyunin that, we need to display fail test to avoid invalid inputs, I guess we should set it as expected failures under the pytest test framework, right? we will change it as expected failure cases under pytest test framework. The result will looks like be low, is it ok?\n2500 passed, 136 skipped, 0 failed, 0 errors, 2 expected failures, 0 unexpected passes", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": { - "login": "mingxiaoh" - }, - "databaseId": 673816925 - }, - { - "bodyText": "Displaying tests that fail is fine, but I don't think @VitalyFedyunin meant that it was OK if the tests didn't pass. If these are expected failures then yes, you can use with self.assertRaises(RuntimeError):... when testing them. If you also want to report that the test has test cases with these properties you can print or warn, which will appear in the test output.", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 673858224 - }, - { - "bodyText": "Codecov Report\n\nMerging #31093 into master will not change coverage.\nThe diff coverage is n/a.\n\n\n@@ Coverage Diff @@\n## master #31093 +/- ##\n=======================================\n Coverage 68.00% 68.00% \n=======================================\n Files 382 382 \n Lines 49527 49527 \n=======================================\n Hits 33679 33679 \n Misses 15848 15848 \n\nContinue to review full report at Codecov.\n\nLegend - Click here to learn more\n\u0394 = absolute (impact), \u00f8 = not affected, ? = missing data\nPowered by Codecov. Last update 69f6d94...29f6aa6. Read the comment docs.", - "author": { - "login": "codecov" - }, - "authorAssociation": "NONE", - "editor": { - "login": "codecov" - }, - "databaseId": 686921371 - }, - { - "bodyText": "Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale. Feel free to remove the Stale label if you feel this was a mistake. 
If you are unable to remove the Stale label please contact a maintainer in order to do so. Stale pull requests will automatically be closed 30 days after being marked Stale", - "author": { - "login": "pytorchbot" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1095860944 - }, - { - "bodyText": "Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale. Feel free to remove the Stale label if you feel this was a mistake. If you are unable to remove the Stale label please contact a maintainer in order to do so. If you want the bot to never mark this PR stale again, add the no-stale label.Stale pull requests will automatically be closed after 30 days of inactivity.", - "author": { - "login": "github-actions" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 1152854802 - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOKCmhXQ==", + "commit": { + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "clang-format" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "clang-format", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676797?check_suite_focus=true" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHOQYu8fQ==", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1175281097" + }, + "cursor": "Y3Vyc29yOnYyOpHORg1dyQ==" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "flake8-py3", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676800?check_suite_focus=true" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676817?check_suite_focus=true" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676829?check_suite_focus=true" + }, + { + "name": "cmakelint", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676840?check_suite_focus=true" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHOQYu8qA==", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1175281099" + }, + "cursor": "Y3Vyc29yOnYyOpHORg1dyw==" + }, + { + "node": { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "codecov/project", + "conclusion": "SUCCESS", + "detailsUrl": "https://codecov.io" + }, + { + "name": "codecov/patch", + "conclusion": "SUCCESS", + "detailsUrl": "https://codecov.io" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHOQZhcFQ==", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1176100822" + }, + "cursor": "Y3Vyc29yOnYyOpHORhnf1g==" + }, + { + "node": { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "codecov/patch", + "conclusion": "SUCCESS", + "detailsUrl": "https://codecov.io" + } + ], + 
"pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHOQZZsEQ==", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1176100824" + }, + "cursor": "Y3Vyc29yOnYyOpHORhnf2A==" + }, + { + "node": { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS", + "detailsUrl": "https://code.facebook.com/cla/" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHOUquzJg==", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1487517306" + }, + "cursor": "Y3Vyc29yOnYyOpHOWKm2eg==" + } + ], + "pageInfo": { + "hasNextPage": false + } + }, + "status": { + "contexts": [ + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406538?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406947?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406544?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406931?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_windows_libtorch_3_7_cpu_debug_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406550?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_windows_libtorch_3_7_cpu_debug_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406887?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_windows_libtorch_3_7_cpu_release_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406526?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_windows_libtorch_3_7_cpu_release_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406707?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: caffe2_onnx_main_py3_6_clang7_ubuntu16_04_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406533?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: caffe2_onnx_main_py3_6_clang7_ubuntu16_04_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407256?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + 
}, + { + "context": "ci/circleci: caffe2_onnx_ort1_py3_6_clang7_ubuntu16_04_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407254?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: caffe2_onnx_ort2_py3_6_clang7_ubuntu16_04_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407255?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-cuda10.2-cudnn7-py3.6-clang9", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406556?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-cuda10.2-cudnn7-py3.8-gcc9", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406532?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-cuda11.0-cudnn8-py3.6-gcc9", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406527?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-cuda11.0-cudnn8-py3.8-gcc9", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406553?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-py3.6-clang9", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406537?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-py3.8-gcc9", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406529?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-rocm3.5.1-py3.6", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406554?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-rocm3.7-py3.6", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406545?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-cuda10-cudnn7-py3-gcc7", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406543?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406536?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406552?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7", + "state": "SUCCESS", + "targetUrl": 
"https://circleci.com/gh/pytorch/pytorch/7406535?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406540?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc7", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406528?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406541?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-asan", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406549?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3.6-clang7", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406555?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3.6-gcc4.8", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406546?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3.6-gcc5.4", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406531?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3.6-gcc7", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406534?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3.6-gcc7.2", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406523?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3.8", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406539?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-rocm3.3-py3.6", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406547?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-rocm3.5.1-py3.6", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406551?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407209?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single", + 
"state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406611?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_bazel_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406607?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_bazel_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406984?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_cpp_doc_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407013?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_doc_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407011?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_ios_11_2_1_x86_64_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406548?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_libtorch_linux_xenial_cuda11_0_cudnn8_py3_gcc7_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406563?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_libtorch_linux_xenial_cuda11_0_cudnn8_py3_gcc7_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7408680?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_backward_compatibility_check_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407014?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_bionic_py3_6_clang9_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406567?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_bionic_py3_6_clang9_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406945?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_bionic_py3_8_gcc9_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406561?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_bionic_py3_8_gcc9_coverage_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407422?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_bionic_rocm3_7_py3_6_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406562?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build", + "state": "SUCCESS", + "targetUrl": 
"https://circleci.com/gh/pytorch/pytorch/7406612?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7408107?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_ge_config_legacy_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7408111?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_ge_config_profiling_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7408101?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406613?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_6_gcc5_4_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406565?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_6_gcc5_4_ge_config_legacy_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407017?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_6_gcc5_4_ge_config_profiling_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407019?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407012?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_6_gcc5_4_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407016?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_vulkan_x86_32_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406608?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406609?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_asan_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406606?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_asan_test1", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407435?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + 
"context": "ci/circleci: pytorch_linux_xenial_py3_clang5_asan_test2", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407436?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_mobile_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406605?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_mobile_custom_build_dynamic", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406610?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_macos_10_13_py3_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406525?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_macos_10_13_py3_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407415?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_python_doc_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407018?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_vulkan_linux_bionic_py3_6_clang9_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406566?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_vulkan_linux_bionic_py3_6_clang9_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406946?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_windows_vs2019_py36_cpu_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406542?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_windows_vs2019_py36_cuda10.1_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406530?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_windows_vs2019_py36_cuda10.1_test1", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407028?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_windows_vs2019_py36_cuda10.1_test2", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407027?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_windows_vs2019_py36_cuda11.0_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406524?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_xla_linux_bionic_py3_6_clang9_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406572?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: 
pytorch_xla_linux_bionic_py3_6_clang9_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407253?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "codecov/patch", + "state": "SUCCESS", + "targetUrl": "https://codecov.io/gh/pytorch/pytorch/compare/69f6d94caa3559d4f50745c26af5df041b83fee8...29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9" + }, + { + "context": "codecov/project", + "state": "SUCCESS", + "targetUrl": "https://codecov.io/gh/pytorch/pytorch/compare/69f6d94caa3559d4f50745c26af5df041b83fee8...29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9" + }, + { + "context": "pr/caffe2-pytorch-linux-bionic-rocm3.7-py3.6-test", + "state": "SUCCESS", + "targetUrl": "https://ci.pytorch.org/jenkins/job/caffe2-builds/job/pytorch-linux-bionic-rocm3.7-py3.6-trigger-test/2319/" + }, + { + "context": "pr/pytorch-linux-bionic-rocm3.7-py3.6", + "state": "SUCCESS", + "targetUrl": "https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm3.7-py3.6-trigger/2325/" + } + ] + }, + "pushedDate": "2020-09-11T01:58:24Z", + "oid": "29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9" + } + } + ] + }, + "changedFiles": 5, + "files": { + "nodes": [ + { + "path": "test/math_libraries/convolutions.py" + }, + { + "path": "test/math_libraries/convolutions_cases/shapes_googlenet_v3.json" + }, + { + "path": "test/math_libraries/convolutions_cases/shapes_maskrcnn_p1.json" + }, + { + "path": "test/math_libraries/convolutions_cases/shapes_mobilenet.json" + }, + { + "path": "test/math_libraries/convolutions_cases/shapes_resnet_50.json" + } + ], + "pageInfo": { + "endCursor": "NQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "CHANGES_REQUESTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "CHANGES_REQUESTED" + }, + { + "author": { + "login": "ailzhang" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "ngimel" + }, + "state": "COMMENTED" + 
}, + { + "author": { + "login": "VitalyFedyunin" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "ngimel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mingxiaoh" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mingxiaoh" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "VitalyFedyunin" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "VitalyFedyunin" + }, + "state": "APPROVED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAxOS0xMi0zMFQxMDoxOToxMS0wODowMLkyMDE5LTEyLTMwVDEwOjE5OjExLTA4OjAwzhQZLuY=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ + { + "bodyText": "I cloned your repo and ran the tests:\n~/pytorch/test/math_libraries$ python convolutions.py\nFFFF\n======================================================================\nFAIL: test_conv2d_ext_cpu_float32 (__main__.TestConvExtCPU)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float16 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, 
**kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float32 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float64 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n----------------------------------------------------------------------\nRan 4 tests in 33.838s\n\nFAILED (failures=4)\n\nStill fails.\n\n@mruberry It is suggested by @VitalyFedyunin that, we need to display fail test to avoid invalid inputs, I guess we should set it as expected failures under the pytest test framework, right? we will change it as expected failure cases under pytest test framework. The result will looks like be low, is it ok?\n2500 passed, 136 skipped, 0 failed, 0 errors, 2 expected failures, 0 unexpected passes", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": { + "login": "mingxiaoh" + }, + "databaseId": 673816925 + }, + { + "bodyText": "Displaying tests that fail is fine, but I don't think @VitalyFedyunin meant that it was OK if the tests didn't pass. If these are expected failures then yes, you can use with self.assertRaises(RuntimeError):... when testing them. If you also want to report that the test has test cases with these properties you can print or warn, which will appear in the test output.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 673858224 + }, + { + "bodyText": "Codecov Report\n\nMerging #31093 into master will not change coverage.\nThe diff coverage is n/a.\n\n\n@@ Coverage Diff @@\n## master #31093 +/- ##\n=======================================\n Coverage 68.00% 68.00% \n=======================================\n Files 382 382 \n Lines 49527 49527 \n=======================================\n Hits 33679 33679 \n Misses 15848 15848 \n\nContinue to review full report at Codecov.\n\nLegend - Click here to learn more\n\u0394 = absolute (impact), \u00f8 = not affected, ? = missing data\nPowered by Codecov. Last update 69f6d94...29f6aa6. Read the comment docs.", + "author": { + "login": "codecov" + }, + "authorAssociation": "NONE", + "editor": { + "login": "codecov" + }, + "databaseId": 686921371 + }, + { + "bodyText": "Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale. Feel free to remove the Stale label if you feel this was a mistake. 
If you are unable to remove the Stale label please contact a maintainer in order to do so. Stale pull requests will automatically be closed 30 days after being marked Stale", + "author": { + "login": "pytorchbot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1095860944 + }, + { + "bodyText": "Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale. Feel free to remove the Stale label if you feel this was a mistake. If you are unable to remove the Stale label please contact a maintainer in order to do so. If you want the bot to never mark this PR stale again, add the no-stale label.Stale pull requests will automatically be closed after 30 days of inactivity.", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1152854802 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOKCmhXQ==", "hasPreviousPage": true } }, - "labels": { - "edges": [ + "labels": { + "edges": [ + { + "node": { + "name": "triaged" + } + }, + { + "node": { + "name": "open source" + } + }, + { + "node": { + "name": "cla signed" + } + }, + { + "node": { + "name": "Stale" + } + } + ] + } + } + } + } + }, + "query_sha=62ce809793481ce6ddce6e1a19d9b0761755ff0ff75decaf8a79419eaf793110 cursor=Y3Vyc29yOnYyOpHOKCmhXQ== name=pytorch number=31093 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "comments": { + "nodes": [ + { + "bodyText": "Hi, @mingfeima @soumith @Jianhui-Li\nthis will improve the test coverage of mkldnn convolution, would you please review it?\nThe current code is forward only, do we need to cover backward, if yes, we can add backward.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 564806270 + }, + { + "bodyText": "@mingxiaoh, what is the value in testing DNNL as part of Pytorch validation for the Pytorch developers? Shouldn't having these tests run in DNNL validation be enough?", + "author": { + "login": "vpirogov" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 564808528 + }, + { + "bodyText": "@vpirogov The main value is to serve as a blind test to DNNL. If DNNL adds these test to DNNL test sets, it lost the value as a blind test. The spirit of validation is to cross check.\n@gottbrath @gchanan The test was developed per the request of Pytorch team. Mingxiao made an effort to reduce the execution time to a few second but still with good coverage. Although the test today is focused on DNNL, it could be easily extended to be blind test for any conv implementation used in Pytorch.", + "author": { + "login": "Jianhui-Li" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 567826907 + }, + { + "bodyText": "@mruberry thanks for the comment. As for the chainer dependency, we import it is because we would like to use its testing function for pytest test cases combinations, other wise we need to write much more code to achieve same effect. So, can we use it?", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 574563012 + }, + { + "bodyText": "@mingxiaoh You cannot import chainer. Looking at the code you should be able to achieve the same effect without it.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 575272358 + }, + { + "bodyText": "@mruberry ok, we will change it according to your requirement. 
Thanks", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 583917522 + }, + { + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/31093\n\ud83d\udd27 \u00a0Opt-in to CIFlow to control what jobs run on your PRs\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 29f6aa6 (more details on the Dr. CI page):\n\nCommit 29f6aa6 was recently pushed. Waiting for builds...\n\nThis comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "author": { + "login": "dr-ci" + }, + "authorAssociation": "NONE", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 628466876 + }, + { + "bodyText": "@mruberry how about those cudnn UT error? we add check for it but it should be NV to fix cudnn bugs.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 629955767 + }, + { + "bodyText": "Hey @mingxiaoh! You're right, of course, that you shouldn't have to fix cuDNN bugs. Would you please:\n\nAssert that the test case fails, so we know it's failing and if someone fixes it they'll know what test to update.\nFile a new issue explaining the behavior and providing a short PyTorch program to reproduce the issue.\n\nThen we can ping NVIDIA on that issue.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 629997129 + }, + { + "bodyText": "about the suggestion 'Assert that the test case fails, so we know it's failing and if someone fixes it they'll know what test to update. ', if we only assert it and continue the following test, I guess users might always ignore them in later test. Anyway, any similar example case for reference?", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 630010734 + }, + { + "bodyText": "In this recent PR https://github.com/pytorch/pytorch/pull/38505/files, for example, you can see that the construction of bool tensors wasn't working properly, so the test author cited the relevant issue and asserted that the incorrect behavior happened, as expected. You can also see how these lines are being removed by https://github.com/pytorch/pytorch/pull/38392/files, which fixes the issue.\nAnother common pattern is to use with self.assertRaises(RuntimeError/AssertionError/etc.):.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 630014823 + }, + { + "bodyText": "@mruberry the failed UT case is not introduced by our modification, how to handle this issue?", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 631187735 + }, + { + "bodyText": "@mingxiaoh You mean the failures on ROCm? You may ignore them. 
Be sure to re-request review when you're ready.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 631191425 + }, + { + "bodyText": "@mruberry we already skipped those ROCm errors, but there are stil somel error caused by the original code, they are not introduced by our modification.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 631886529 + }, + { + "bodyText": "I understand. Let me know when you're ready for me to review.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 631908011 + }, + { + "bodyText": "@mruberry thanks, we are ready for review now.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 631909442 + }, + { + "bodyText": "@mingxiaoh Great! I'll take a look ASAP.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 631910556 + }, + { + "bodyText": "@mruberry we just pull the latest code and updated the patch according to your comment, may you please help double check it? BTW, the new failed case in preci is not introduced by our modification.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 633430458 + }, + { + "bodyText": "@ailzhang would you please check the comment below? Thanks.\nIs there a reason why this TestConv2dExt is a new class instead a test inside TestNN?\n//comment: it is actually suggested by Tongzhou Wang in another thread before.\nAlthough this test sits in generic testing framework, it's actually comparing thnn/mkldnn/cudnn results specially. I feel it's better to make it truly generic so that it compares any device result with CPU result. Alternatively you can mark this test only run when torch.backends.mkldnn.is_available()=True\n//comment: but our goal is to compare the result with that of thnn. Anyway, if you insist, we can start to compare it with cpu.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": { + "login": "mingxiaoh" + }, + "databaseId": 634432326 + }, + { + "bodyText": "Pruning reviewers. @ngimel, @VitalyFedyunin, this PR is looking pretty good from a test framework perspective. Would one of you like to review?", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 634557563 + }, + { + "bodyText": "@mruberry Thanks, would you please help review it again. BTW: failed case is not introduced by our modification.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 635256214 + }, + { + "bodyText": "@mruberry we moved our case to TestNNDeviceType class, would you please help review it again? BTW, those failed cases are not introduced by our code", + "author": { + "login": "1pikachu" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 637364148 + }, + { + "bodyText": "@mruberry we moved our case to TestNNDeviceType class, would you please help review it again? 
BTW, those failed cases are not introduced by our code\n\n@ngimel will follow-up on the test itself sometime this week or early next week.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 637444457 + }, + { + "bodyText": "@mruberry we moved our case to TestNNDeviceType class, would you please help review it again? BTW, those failed cases are not introduced by our code\n\n@ngimel will follow-up on the test itself sometime this week or early next week.\n\n@mruberry thank you", + "author": { + "login": "1pikachu" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 637479226 + }, + { + "bodyText": "Improving test coverage of math libraries is certainly a good goal and this PR is moving towards it. I have some doubts about implementation decisions made, and about running this PR as part of regular pytorch CI.\nIf the primary goal of this PR is to test correctness of the convolution implementations in the vendor library, then it does not serve this purpose. The absolute majority of the 4000+ test cases come from group 1, where different kernel sizes/strides/dilations are used to produce the output of size 1x1. This can test whether pytorch correctly passes convolution parameters to the backends (although there are cheaper ways to do that), but as actual library correctness check it is almost useless - libraries use very different kernels depending in the input/output sizes, and tests with toy sizes like this don't invoke the real bread-and-butter kernels.\nAlso, if this test suite is meant as primary a means of testing vendor libraries (which is a good goal!) it does not have a place as a part of pytorch regular CI, and should be run when the corresponding vendor libraries are updated. I'd suggest moving this test out into a separate file (maybe even outside of torch/test directory) and have it as a part of library update/qualification process rather than regular CI.\nAlso, if the primary goal is to enable easier testing of vendor libraries correctness, perhaps we should rethink the mechanism of the generation of test cases. It should be easy to add a test case with a particular set of parameters that was found to be buggy. Also, running a cross-product of cases in a multi-dimensional space (as this PR does) is rarely an efficient way of getting a signal, some forms of random sampling usually provide a way to get better correctness signal why using less resources.\nAlso, when testing libraries it is important to test both forward and backward functions, whereas this PR does forward only. I'm openminded on whether convTransposed should be tested or not - if we are testing vendor libraries, then it's not necessary, convTransposed calls the same underlying functions, if we are testing pytorch, then it makes sense to test it separately because it takes different codepaths.", + "author": { + "login": "ngimel" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 637827507 + }, + { + "bodyText": "@mruberry ngimel is quite responsible, but it seems that she is not familiar with the background of this pull-request, since this pull-request is pending for so such a long time, each time we are almost done, then reviewer changes, each reviewer has different idea, it is good, but, would it be better if you help review it or ask the same reviewer to review it considering that you are more familiar with the background/change history? 
Thanks in advance.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 637912105 + }, + { + "bodyText": "@mruberry ngimel is quite responsible, but it seems that she is not familiar with the background of this pull-request, since this pull-request is pending for so such a long time, each time we are almost done, then reviewer changes, each reviewer has different idea, it is good, but, would it be better if you help review it or ask the same reviewer to review it considering that you are more familiar with the background/change history? Thanks in advance.\n\nWe know this PR has been open for awhile and we respect that your time is valuable, but we want to make sure we're making the right change here, and I think @ngimel's comments reflect that and should not be too difficult to address. As I understand, her points are:\n\nThis is a good PR with an exciting idea. To let it run longer and test more cases maybe it should run outside the regular PyTorch CI.\nTo remedy this, let's create a test/math_libraries folder and put this test there: test/math_libaries/convolutions.py. Yes, this is different from our requests in the past, which is our mistake, but it should be an easy change.\nTo make the test more interesting it'd be good for the test cases to resemble convolutions used in practice. The current test cases seem like similar \"toy\" examples. Without time pressure we should be able to run larger, more computationally intensive convolutions.\nLet's change the test cases to include some practical convolutions, make it easy to add test cases, and think about how we might generate other interesting cases. (We should also test backwards once we have more time!)\n\nAnd I think these are good points. Maybe the PR doesn't create a new way to generate interesting convolutions to start and instead only runs a few representative convolutions, but @ngimel is positioning the work for success so that it's useful and we can continue to improve on it in the future.\nDoes that make sense?", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 637924703 + }, + { + "bodyText": "@mruberry we were required to finish the test in limited time long long before, at that time, jianhui discussed this issue with you, and you are all agreed with the current test scope and test case number and test time, so you meant you change your mind now? you are not care about the test time currently? Sorry, this issue is pending so long, we are struggling with it now and would like to finish it asap. Given this, it would be be better if you raise all the requirement at a time, considering that we have many tasks at hand, we are hoping so eagerly that we can finish this PR and use it for further test for bugs finding.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": { + "login": "mingxiaoh" + }, + "databaseId": 637960626 + }, + { + "bodyText": "@mruberry we were required to finish the test in limited time long long before, at that time, jianhui discussed this issue with you, and you are all agreed with the current test scope and test case number and test time, so you meant you change your mind now? you are not care about the test time currently? Sorry, this issue is pending so long, we are struggling with it now and would like to finish it asap. 
Given this, it would be be better if you raise all the requirement at a time, considering that we have many tasks at hand, we are hoping so eagerly that we can finish this PR and use it for further test for bugs finding.\n\nI'm sorry, I don't think I've talked to @Jianhui-Li before. It's true that the team we expressed a concern about timing if the test was to be run in the CI initially, but I think now that we understand what the test is trying to do better we're not sure the CI is the best place for it. The PR was also closed after a lengthy period of inactivity, and we assumed it had simply been abandoned.\nDo you know who @Jianhui-Li spoke with about this issue originally? Maybe I can follow-up with them for more context.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 637967153 + }, + { + "bodyText": "@mruberry it is reviewed and discussed with @soumith before. Anyway, since current reviewer is you, so, it should be decided by you. So, what we should do next?", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 637978356 + }, + { + "bodyText": "@mruberry it is reviewed and discussed with @soumith before. Anyway, since current reviewer is you, so, it should be decided by you. So, what we should do next?\n\nI think this will be easier to discuss at the regular Intel-FB meeting.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 638446723 + }, + { + "bodyText": "@mruberry it is reviewed and discussed with @soumith before. Anyway, since current reviewer is you, so, it should be decided by you. So, what we should do next?\n\nI think this will be easier to discuss at the regular Intel-FB meeting.\n\nLet me sync with Mingxiao and follow up with this. Thanks.", + "author": { + "login": "Jianhui-Li" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 638451670 + }, + { + "bodyText": "@mruberry would you please help review it again?", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 653028208 + }, + { + "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 654443242 + }, + { + "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?\n\nyes, we talked it with jianhui, and we decided to follow your ideas. Anyway, we would like to do so modification later, will contact you for review tomorrow. Thanks", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 656062287 + }, + { + "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?\n\nyes, we talked it with jianhui, and we decided to follow your ideas. Anyway, we would like to do so modification later, will contact you for review tomorrow. 
Thanks\n\n@mruberry the code is ready for review now, would you please take time for it? Thanks.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 658071151 + }, + { + "bodyText": "super nit: renaming files to .json will make it more IDE friendly.", + "author": { + "login": "VitalyFedyunin" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 658464685 + }, + { + "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?\n\nyes, we talked it with jianhui, and we decided to follow your ideas. Anyway, we would like to do so modification later, will contact you for review tomorrow. Thanks\n\n@mruberry the code is ready for review now, would you please take time for it? Thanks.\n\nCool! I took a look with @ngimel, once these issues are addressed I think we're good to go!", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 659164401 + }, + { + "bodyText": "@ngimel & @VitalyFedyunin We have changed the code according to your suggestions, would you please review it again? Thanks.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 660884305 + }, + { + "bodyText": "@ngimel & @VitalyFedyunin We have changed the code according to your suggestions, would you please review it again? Thanks.\n\nUpdated: one more question about tolerances, one code cleanup recommendation, and one task leftover from the last review.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 662678464 + }, + { + "bodyText": "Updated: one more question about tolerances, one code cleanup recommendation, and one task leftover from the last review.\n@mruberry we have finished the modification according to your comment, would you please review it again? 
Thanks.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 662930687 + }, + { + "bodyText": "The code looks good, but I tried running the test suite and hit the following failures:\n======================================================================\nFAIL: test_conv2d_ext_cuda_float16 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 241, in instantiated_test\n result = test(self, device_arg, dtype)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 542, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 411, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 102, in test_conv2d_ext\n msg=msg\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 1085, in assertEqual\n self.assertTrue(result, msg=msg)\nAssertionError: False is not true : device:cuda:0, dtype:torch.float16, group:1, batchsize:22input channel:448, output channel:384, bias:False, padding:[1, 1], dilation:[1, 1], stride:[1, 1], kernel:[3, 3]\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float32 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 241, in instantiated_test\n result = test(self, device_arg, dtype)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 542, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 411, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 102, in test_conv2d_ext\n msg=msg\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 1085, in assertEqual\n self.assertTrue(result, msg=msg)\nAssertionError: False is not true : device:cuda:0, dtype:torch.float32, group:1, batchsize:22input channel:80, output channel:192, bias:False, padding:[0, 0], dilation:[1, 1], stride:[1, 1], kernel:[3, 3]\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float64 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n 
method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 241, in instantiated_test\n result = test(self, device_arg, dtype)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 542, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 411, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 106, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\nLooking at the first invalid convolution, for example, it's:\n {\n \"case_name\":\"masknet_p1:conv33\",\n \"mb\":1,\n \"g\":1,\n \"ic\":512,\n \"ih\":64,\n \"iw\":64,\n \"oc\":12,\n \"kh\":1,\n \"kw\":1,\n \"sh\":1,\n \"sw\":1,\n \"ph\":0,\n \"pw\":0,\n \"dh\":0,\n \"dw\":0,\n \"bias\":\"False\"\n },\n\nwhich has a dh and dw of zero, causing it to be added to invalid cases here:\ndh, dw = case['dh'], case['dw']\n has_bias = case['bias']\n if dh == 0 or dw == 0:\n invalid_cases.append(case_name)", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": { + "login": "mruberry" + }, + "databaseId": 663240268 + }, + { + "bodyText": "@mruberry the failure was not detected is because we did not export the cudnn path. Yes, you are right, we need to a large atol of 1e-2 . Would you please help review it again? Thanks.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 664373079 + }, + { + "bodyText": "@mruberry the failure was not detected is because we did not export the cudnn path. Yes, you are right, we need to a large atol of 1e-2 . Would you please help review it again? Thanks.\n\nBefore I run these tests again, is an atol of 1e-2 needed for all types or just half? 
Also, how does 1e-2 compare to the values that are being compared?", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 664569507 + }, + { + "bodyText": "@mruberry 1e-2 is experimental result, details see below, random means it might be failed sometimes.\n\n\n\natol,rtol\n1e-2,1e-2\n1e-2,1e-3\n1e-3,1e-2\n1e-3,1e-3\n1e-4,1e-3\n1e-3,1e-4\n1e-4,1e-4\n1e-4,1e-5\n1e-5,1e-4\n\n\n\n\nCuda float16\npass\npass\npass\npass\npass\nfail\nFail\nFail\nfail\n\n\nCuda float32\npass\nrandom\nrandom\nrandom\nrandom\nrandom\nrandom\nrandom\nfail", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 666894774 + }, + { + "bodyText": "@mruberry would you please find time to review it again? Thanks.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 668380451 + }, + { + "bodyText": "@mruberry would you please find time to review it again? Thanks.\n\nI was just about to try and run this again locally but it looks like the files describing the convolutions are missing?", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 670306210 + }, + { + "bodyText": "@mruberry sorry but what is missing actually?", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 670322557 + }, + { + "bodyText": "@mruberry sorry but what is missing actually?\n\nThe JSON files.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 670591170 + }, + { + "bodyText": "@mruberry sorry but what is missing actually?\n\nThe JSON files.\n\n@mruberry sorry, we add them now, would you please check it again? 
Thanks.", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 673402901 + }, + { + "bodyText": "I cloned your repo and ran the tests:\n~/pytorch/test/math_libraries$ python convolutions.py\nFFFF\n======================================================================\nFAIL: test_conv2d_ext_cpu_float32 (__main__.TestConvExtCPU)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float16 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float32 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float64 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n----------------------------------------------------------------------\nRan 4 tests in 33.838s\n\nFAILED (failures=4)\n\nStill fails.", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 673760580 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOIapCfg==", + "hasPreviousPage": false + } + } + } + } + } + }, + "query_sha=2dc8bfb6750c4a2402124dc53123d266427c0b92d06add20e3221b57a0f5268f commit=6882717f73deffb692219ccd1fd6db258d8ed684 name=pytorch owner=pytorch": { + "data": { + "repository": { + "object": { + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625272" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hng=" + }, + { + "node": { + "app": { + "name": "Netlify", + "databaseId": 13473 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625297" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hpE=" + }, + { + "node": { + "app": { + "name": "Azure Pipelines", + "databaseId": 9426 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625308" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hpw=" + }, + { + "node": { + "app": { + "name": "Dependabot", + "databaseId": 29110 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625328" + }, + 
"cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hrA=" + }, + { + "node": { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625347" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hsM=" + }, + { + "node": { + "app": { + "name": "PyTorch Bot", + "databaseId": 40112 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625357" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hs0=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "workflow-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241883/jobs/4095495959" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241883/jobs/4095496003" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241883/jobs/4095496162" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241883/jobs/4095496320" + }, + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241883/jobs/4095496465" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241883/jobs/4095496523" + }, + { + "name": "Test collect_env (older_python_version)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241883/jobs/4095496558" + }, + { + "name": "lintrunner", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241883/jobs/4095496708" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbCVA2Y=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625464" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hzg=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "trunk" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095496376" + }, + { + "name": "android-emulator-build-test / build-and-test", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095496525" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095496611" + }, + { + "name": "macos-10-15-py3-arm64 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095496713" + }, + { 
+ "name": "linux-bionic-cuda10.2-py3.9-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095496857" + }, + { + "name": "ios-12-5-1-x86-64 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095497178" + }, + { + "name": "libtorch-linux-bionic-cuda11.6-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095497392" + }, + { + "name": "win-vs2019-cuda11.6-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095497580" + }, + { + "name": "libtorch-linux-xenial-cuda10.2-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095497781" + }, + { + "name": "linux-bionic-py3.7-clang9-slow / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095497886" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095497997" + }, + { + "name": "macos-10-15-py3-lite-interpreter-x86-64 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095498146" + }, + { + "name": "macos-11-py3-x86-64 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095498338" + }, + { + "name": "caffe2-linux-focal-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095498448" + }, + { + "name": "parallelnative-linux-focal-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095498648" + }, + { + "name": "parallelnative-linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095659992" + }, + { + "name": "parallelnative-linux-focal-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095660077" + }, + { + "name": "linux-bionic-py3.7-clang9-slow / test (slow, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095798458" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095840103" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095840227" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (slow, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095840377" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (nogpu_AVX512, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095840521" + }, + { + "name": 
"linux-bionic-cuda10.2-py3.9-gcc7 / test (nogpu_NO_AVX2, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095840605" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (jit_legacy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095840689" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095840741" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095840795" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095874982" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095875042" + }, + { + "name": "win-vs2019-cuda11.6-py3 / test (default, 1, 5, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095875174" + }, + { + "name": "win-vs2019-cuda11.6-py3 / test (default, 2, 5, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095875221" + }, + { + "name": "win-vs2019-cuda11.6-py3 / test (default, 3, 5, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095875266" + }, + { + "name": "win-vs2019-cuda11.6-py3 / test (default, 4, 5, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095875320" + }, + { + "name": "win-vs2019-cuda11.6-py3 / test (default, 5, 5, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095875369" + }, + { + "name": "win-vs2019-cuda11.6-py3 / test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4095875417" + }, + { + "name": "macos-12.3-py3.8-arm64-test / Run MPS tests", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4096110771" + }, + { + "name": "macos-11-py3-x86-64 / test (default, 1, 2, macos-12)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4096408234" + }, + { + "name": "macos-11-py3-x86-64 / test (default, 2, 2, macos-12)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241915/jobs/4096408307" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbCn27w=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625556" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1h5Q=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 
15368 + }, + "workflowRun": { + "workflow": { + "name": "pull" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "linux-bionic-rocm5.1-py3.7", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095496220" + }, + { + "name": "win-vs2019-cuda11.6-py3", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095496344" + }, + { + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095496466" + }, + { + "name": "linux-focal-py3.7-clang10-onnx / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095496612" + }, + { + "name": "win-vs2019-cpu-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095496726" + }, + { + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095496862" + }, + { + "name": "linux-bionic-py3_7-clang8-xla / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095497204" + }, + { + "name": "linux-xenial-cuda11_3-py3_7-gcc7-deploy / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095497405" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095497578" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095497784" + }, + { + "name": "linux-focal-py3.7-clang7-asan / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095497875" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095498008" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095498155" + }, + { + "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095498346" + }, + { + "name": "linux-jammy-cuda11.6-cudnn8-py3.8-clang12 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095498440" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095498650" + }, + { + "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095498724" + }, + { + "name": "linux-focal-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095498883" + }, + { + "name": "linux-xenial-py3-clang5-mobile-build / build", + "conclusion": 
"SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095499064" + }, + { + "name": "linux-focal-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095499218" + }, + { + "name": "linux-xenial-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095499360" + }, + { + "name": "linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095615833" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095668105" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095668215" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095668293" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095668402" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095668480" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095668571" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095776890" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095776922" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095778975" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095794308" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095794370" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095794452" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095794502" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (backwards_compat, 1, 1, linux.2xlarge)", + 
"conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095794566" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095794652" + }, + { + "name": "linux-docs / build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095794748" + }, + { + "name": "linux-docs / build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095794836" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095800591" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095800638" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095800676" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095800723" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (dynamo, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095800762" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (dynamo, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095800805" + }, + { + "name": "linux-focal-py3.7-clang10-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095813130" + }, + { + "name": "linux-focal-py3.7-clang10-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095813208" + }, + { + "name": "linux-focal-py3.7-clang7-asan / test (default, 1, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095858004" + }, + { + "name": "linux-focal-py3.7-clang7-asan / test (default, 2, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095858063" + }, + { + "name": "linux-focal-py3.7-clang7-asan / test (default, 3, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095858127" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbCcmdI=", + "hasNextPage": true + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625557" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1h5U=" + } + ], + "pageInfo": { + "hasNextPage": false + } + } + } + } + } + }, + "query_sha=23d6a47e5fd875c42231779040ec1d35d0042b502c9142cb0d33d6f65d58fead commit=6882717f73deffb692219ccd1fd6db258d8ed684 cr_cursor=Y3Vyc29yOnYyOpHPAAAAAbCcmdI= cs_cursor=Y3Vyc29yOnYyOpHPAAAAAbH1h5Q= name=pytorch 
owner=pytorch": { + "data": { + "repository": { + "object": { + "oid": "6882717f73deffb692219ccd1fd6db258d8ed684", + "checkSuites": { + "nodes": [ + { + "checkRuns": { + "nodes": [ + { + "name": "linux-focal-py3.7-clang7-asan / test (default, 4, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095858194" + }, + { + "name": "linux-focal-py3.7-clang7-asan / test (default, 5, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4095858272" + }, + { + "name": "linux-xenial-cuda11_3-py3_7-gcc7-deploy / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2638241914/jobs/4096006884" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbCfo8c=", + "hasNextPage": false + } + } + } + ] + } + } + } + } + }, + "query_sha=e29d0e1d73b9847dacfd671c80e117d82111b407f7daa8ff885d3c444eafe47f name=pytorch number=76118 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": false, + "author": { + "login": "malfet" + }, + "title": "Dummy change with lots of commits", + "body": "Draft PR with 100+ commits, to test mergebot ", + "headRefName": "malfet/pr-with-lots-of-commits", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "3067f2240afc7a29dc348000aa19eccbd9772303" + } + }, + { + "commit": { + "author": { + "user": { + "login": "andrewor14" + }, + "email": "andrewor@fb.com", + "name": "Andrew Or" + }, + "oid": "2f655b71f70c496c4e645f6cdb27d7bb7e825701" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "0c6dcaa7f58a19c42a530f4ee14bb6f0f03ca9fb" + } + }, + { + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "cad11c563d41ebcffb1683fe1f1288b8157413b3" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "jwtan@fb.com", + "name": "Jiewen Tan" + }, + "oid": "4dfd0875a68d87fccb5ad0d81692db480043b86e" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "2d37e74690582a4a26890e4c8b98f1f80e589c82" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "jwtan@fb.com", + "name": "Jiewen Tan" + }, + "oid": "d4aee60947e1a3ef23c7c42990621e0746fdd0a8" + } + }, + { + "commit": { + "author": { + "user": { + "login": "peterbell10" + }, + "email": "peterbell10@live.co.uk", + "name": "Peter Bell" + }, + "oid": "aac6204bf710beb5e50a383d426ae6222396335a" + } + }, + { + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "4b0362cab884584c24f5834b3874f5f357f56b5d" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "7536df613cbc645a9e68e6a3b0a8450753260fd1" + } + }, + { + "commit": { + "author": { + "user": null, + "email": 
"mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "20a50cb966d28d7bf82924adf781cf72a01ef90e" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "486387e8644afb46edff5aa5925b55c8119f67f0" + } + }, + { + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "acb9d78b9b732d3667b881727e6ed9f92a8c549f" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "683bb7959a5b973f8470c081ad02e8fc508e784a" + } + }, + { + "commit": { + "author": { + "user": { + "login": "qihqi" + }, + "email": "qihan@fb.com", + "name": "Han Qi" + }, + "oid": "a870cb40af65adf0b77d55f6b554d7093d284d7a" + } + }, + { + "commit": { + "author": { + "user": { + "login": "Krovatkin" + }, + "email": "korovaikon@gmail.com", + "name": "Nikolay Korovaiko" + }, + "oid": "70793b9f328ddf52cc86336104c3a064c8582ef4" + } + }, + { + "commit": { + "author": { + "user": { + "login": "suo" + }, + "email": "suo@fb.com", + "name": "Michael Suo" + }, + "oid": "f70b31f62b1c5159eef2725484b175983517c88c" + } + }, + { + "commit": { + "author": { + "user": { + "login": "dagitses" + }, + "email": "mikeyd@fb.com", + "name": "Michael Andreas Dagitses" + }, + "oid": "04d3ec1db60defe1c6904bf77e9f8dfa87dc0b63" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "46b754a55b63e3168ad5854ad412c124934b675d" + } + }, + { + "commit": { + "author": { + "user": { + "login": "robieta" + }, + "email": "taylorrobie@fb.com", + "name": "Taylor Robie" + }, + "oid": "13df69e13ee571fdd716139419a00aec47ade7d6" + } + }, + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "70642e911ec80a47cdbf4a50aac475c11aa129b6" + } + }, + { + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "59bb7c39384bf3e0b284a037adef8b3caa53c1c4" + } + }, + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "007cfb97b55d70ff63e1ed71d1a674638f847376" + } + }, + { + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "0a7b858a5af1393fa3cf2853f92eca0e1d408dde" + } + }, + { + "commit": { + "author": { + "user": { + "login": "qihqi" + }, + "email": "qihan@fb.com", + "name": "Han Qi" + }, + "oid": "7917d789f0a523715041ade5177d271082628236" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kit1980" + }, + "email": "sdym@fb.com", + "name": "Sergii Dymchenko (Meta Employee)" + }, + "oid": "91eb6017f0fb8a1b29e8cb48fac93bc9709f73b3" + } + }, + { + "commit": { + "author": { + "user": { + "login": "dagitses" + }, + "email": "mikeyd@fb.com", + "name": "Michael Andreas Dagitses" + }, + "oid": "bd04dca5fabb0c2a51ac87063a515f256ef274fa" + } + }, + { + "commit": { + "author": { + "user": { + "login": "dagitses" + }, + "email": "mikeyd@fb.com", + "name": "Michael Andreas Dagitses" + }, + "oid": "1f805a5defda7dabc49d0059edb9ccb06bc29352" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "mruberry@fb.com", + "name": "Mike Ruberry" + }, + "oid": 
"4982c0a8db8f23d15ec4bfcbca4ce939afc04954" + } + }, + { + "commit": { + "author": { + "user": { + "login": "pearu" + }, + "email": "pearu.peterson@gmail.com", + "name": "Pearu Peterson" + }, + "oid": "28502265cb5925cb7db8dcb2dd2334963092714a" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "e03fcaedb1342e6d65c7f7f20243000938ba60b2" + } + }, + { + "commit": { + "author": { + "user": { + "login": "pritamdamania" + }, + "email": "pritam.damania@fb.com", + "name": "pritam" + }, + "oid": "efb28f5a1a5d18aa96bd668ab2ab5c651be359f3" + } + }, + { + "commit": { + "author": { + "user": { + "login": "MagiaSN" + }, + "email": "magialiao@tencent.com", + "name": "magialiao" + }, + "oid": "52cc1b9994f861ebdd3908759ed1ab11cba1f8de" + } + }, + { + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "3cd99f23d1acd6a5bedf6f3b02be79d64350a5b6" + } + }, + { + "commit": { + "author": { + "user": { + "login": "awgu" + }, + "email": "andgu@fb.com", + "name": "Andrew Gu" + }, + "oid": "b00502c634a5146f4d996bd90e84d317f049e7b0" + } + }, + { + "commit": { + "author": { + "user": { + "login": "davidberard98" + }, + "email": "dberard@fb.com", + "name": "David Berard" + }, + "oid": "976eb7cee799dddfbe6a4122b249aaee1b6c8854" + } + }, + { + "commit": { + "author": { + "user": { + "login": "ngimel" + }, + "email": "ngimel@fb.com", + "name": "Natalia Gimelshein" + }, + "oid": "9608ab28744d5cae32f371490557b248c9549c66" + } + }, + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "4e119f0c39eb5ff0777f0e71561e6b633d85fb34" + } + }, + { + "commit": { + "author": { + "user": { + "login": "rohan-varma" + }, + "email": "rvarm1@fb.com", + "name": "Rohan Varma" + }, + "oid": "447580dc565f3660eddb2c996c6ed25b88338684" + } + }, + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "2bc8f43e9233008ea23053fab87b83ab36fca5e3" + } + }, + { + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "c13a8e891c3e3e714f60649ca1e3b082e090e9fe" + } + }, + { + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "fddc861b7ee473f57d3c2161e4618a2663a237e8" + } + }, + { + "commit": { + "author": { + "user": { + "login": "jiyuanzFB" + }, + "email": "jiyuanz@fb.com", + "name": "Jiyuan Zhang" + }, + "oid": "e2336dbc539d6c021720cbe43c92c9e4c8463299" + } + }, + { + "commit": { + "author": { + "user": { + "login": "bdhirsh" + }, + "email": "hirsheybar@fb.com", + "name": "Brian Hirsh" + }, + "oid": "26e2759d1ad59aac12168b74d1ca55e42ba9455c" + } + }, + { + "commit": { + "author": { + "user": { + "login": "bdhirsh" + }, + "email": "hirsheybar@fb.com", + "name": "Brian Hirsh" + }, + "oid": "ad7aa914ee3b3d1252e31514f010ba96c40aae87" + } + }, + { + "commit": { + "author": { + "user": { + "login": "bdhirsh" + }, + "email": "hirsheybar@fb.com", + "name": "Brian Hirsh" + }, + "oid": "f113c5d78065aafbe7b1c0e611945bfe9f67b3c0" + } + }, + { + "commit": { + "author": { + "user": { + "login": "bdhirsh" + }, + "email": "hirsheybar@fb.com", + "name": "Brian Hirsh" + }, + "oid": "a366fd01136292544b7862968ae92feba4b6d8fe" + } + }, + { + "commit": { + 
"author": { + "user": { + "login": "seemethere" + }, + "email": "eliuriegas@fb.com", + "name": "Eli Uriegas" + }, + "oid": "afeba0773749da5883c378a2e6ac066e1ce62ca0" + } + }, + { + "commit": { + "author": { + "user": { + "login": "bdhirsh" + }, + "email": "hirsheybar@fb.com", + "name": "Brian Hirsh" + }, + "oid": "d306c99addc543908f64666baeecacbd0749f4a7" + } + }, + { + "commit": { + "author": { + "user": { + "login": "awgu" + }, + "email": "andgu@fb.com", + "name": "Andrew Gu" + }, + "oid": "c2456ea658f41f64ea054a422edf22a9c977399f" + } + }, + { + "commit": { + "author": { + "user": { + "login": "awgu" + }, + "email": "andgu@fb.com", + "name": "Andrew Gu" + }, + "oid": "a8b0a1b681c9fe41e0d553c962a5c93e81d92503" + } + }, + { + "commit": { + "author": { + "user": { + "login": "anjali411" + }, + "email": "chourdiaanjali123@gmail.com", + "name": "anjali411" + }, + "oid": "af761d9a5d058c9188f16589bae4f307d35185be" + } + }, + { + "commit": { + "author": { + "user": { + "login": "clee2000" + }, + "email": "csl@fb.com", + "name": "Catherine Lee" + }, + "oid": "beceb417baef35b15c2716e23178fb49f7fd6f9d" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "1516554e22136db89d0aeba43a1a1a987e995d68" + } + }, + { + "commit": { + "author": { + "user": { + "login": "qihqi" + }, + "email": "qihan@fb.com", + "name": "Han Qi" + }, + "oid": "68eb1fa8374eff6cbdcf0be5e37ed6775d22e722" + } + }, + { + "commit": { + "author": { + "user": { + "login": "janeyx99" + }, + "email": "janeyx@fb.com", + "name": "Jane Xu" + }, + "oid": "3c7bcb99b5c0c879c2610f427880b03881f82f38" + } + }, + { + "commit": { + "author": { + "user": { + "login": "janeyx99" + }, + "email": "janeyx@fb.com", + "name": "Jane Xu" + }, + "oid": "38c1a2028090353e40a019c673c9ab16b39e4825" + } + }, + { + "commit": { + "author": { + "user": { + "login": "albanD" + }, + "email": "albandes@fb.com", + "name": "Alban Desmaison" + }, + "oid": "8091cbea2c95ed2c4c406b3c61547a27c6319bae" + } + }, + { + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" + }, + "oid": "d81f59121969a47c8b2213a88e02cf9be0219be9" + } + }, + { + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. 
Yang" + }, + "oid": "20d798b319cd107a767fe220f7a3027c18a1c844" + } + }, + { + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "eb35381a770b58c1cd41e935910cb4df2f3d8f14" + } + }, + { + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "e6498a657b9aa47546dcd92d1b4ffb2e1a50ebdb" + } + }, + { + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "7f821382db5ad08efe5b09a145c606852b8a9272" + } + }, + { + "commit": { + "author": { + "user": { + "login": "albanD" + }, + "email": "albandes@fb.com", + "name": "Alban Desmaison" + }, + "oid": "995c0e11a97d854ff969962bd81d7341e46ecb07" + } + }, + { + "commit": { + "author": { + "user": { + "login": "davidberard98" + }, + "email": "dberard@fb.com", + "name": "David Berard" + }, + "oid": "28d6258e62c9fc361a18689877c962c69889dc23" + } + }, + { + "commit": { + "author": { + "user": { + "login": "HarborYuan" + }, + "email": "yuanhaobo@whu.edu.cn", + "name": "Haobo Yuan" + }, + "oid": "2350fad8391367ebf81c7236a2c883644b4ff622" + } + }, + { + "commit": { + "author": { + "user": { + "login": "zou3519" + }, + "email": "zou3519@gmail.com", + "name": "Richard Zou" + }, + "oid": "3f789c9ccecdd7e2e52269453646e992a68c6b92" + } + }, + { + "commit": { + "author": { + "user": { + "login": "jeffdaily" + }, + "email": "jeff.daily@amd.com", + "name": "Jeff Daily" + }, + "oid": "20f79f610c1a3314da96d49515bbfbee9442e4f8" + } + }, + { + "commit": { + "author": { + "user": { + "login": "peterbell10" + }, + "email": "peterbell10@live.co.uk", + "name": "Peter Bell" + }, + "oid": "5823958f047f3b71a5dc8c52a20eb8ae3291bd3e" + } + }, + { + "commit": { + "author": { + "user": { + "login": "peterbell10" + }, + "email": "peterbell10@live.co.uk", + "name": "Peter Bell" + }, + "oid": "a0b15c49ecf3844daf2c0dcaef44f0214259db20" + } + }, + { + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" + }, + "oid": "4afc38c25ca2ca126ba4987a419a58a5c572223b" + } + }, + { + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. 
Yang" + }, + "oid": "b606f58d4a36683fbe0a7d02adfdde7d5cc694c2" + } + }, + { + "commit": { + "author": { + "user": { + "login": "albanD" + }, + "email": "albandes@fb.com", + "name": "Alban Desmaison" + }, + "oid": "2d61b4d630f6482a6c3cc7437091fad6d27c347e" + } + }, + { + "commit": { + "author": { + "user": { + "login": "george-qi" + }, + "email": "georgeqi94@gmail.com", + "name": "George Qi" + }, + "oid": "bc5384c47036a6cda94129f3e2f9e43c43393698" + } + }, + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "60fc3277634365b64465712b13db2acb76d6c890" + } + }, + { + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "1b8762e95bc38d1847fe99ed3230546c8b800bfd" + } + }, + { + "commit": { + "author": { + "user": { + "login": "jerryzh168" + }, + "email": "jerryzh168@gmail.com", + "name": "Jerry Zhang" + }, + "oid": "6acf60f95f59ecbc6e8ce830dea0abba7d3ec763" + } + }, + { + "commit": { + "author": { + "user": { + "login": "ysiraichi" + }, + "email": "yukio.siraichi@gmail.com", + "name": "Yukio Siraichi" + }, + "oid": "8fb0276561fdd530c5a06ea195e930e0584f8705" + } + }, + { + "commit": { + "author": { + "user": { + "login": "albanD" + }, + "email": "albandes@fb.com", + "name": "Alban Desmaison" + }, + "oid": "1da7aed95a8700406671425eac1e4bbc2c7a24b5" + } + }, + { + "commit": { + "author": { + "user": { + "login": "thiagocrepaldi" + }, + "email": "thiago.crepaldi@microsoft.com", + "name": "Thiago Crepaldi" + }, + "oid": "83208e7dee4503c1bee1df9f6632794694dffa01" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "1a46cf08dcd3d3564604c17b2c02d7e4eb45a7ff" + } + }, + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "b7f9b6689445f826c83694652fea5f7cfc7070d7" + } + }, + { + "commit": { + "author": { + "user": { + "login": "fatcat-z" + }, + "email": "jiz@microsoft.com", + "name": "Jay Zhang" + }, + "oid": "f273961c1696b156e35f8c76f7ad37934031050d" + } + }, + { + "commit": { + "author": { + "user": { + "login": "pavithranrao" + }, + "email": "pavithran@fb.com", + "name": "Pavithran Ramachandran" + }, + "oid": "eb410a51fcbc716873fd80a970eb932d4aaaea61" + } + }, + { + "commit": { + "author": { + "user": { + "login": "ngimel" + }, + "email": "ngimel@fb.com", + "name": "Natalia Gimelshein" + }, + "oid": "7dbb12cdc02332fa64264ed0df576511a5070d7e" + } + }, + { + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "43675665fa6b5154de8b25125dd03d7be35c884f" + } + }, + { + "commit": { + "author": { + "user": { + "login": "albanD" + }, + "email": "albandes@fb.com", + "name": "Alban Desmaison" + }, + "oid": "6c4d23c402c413667463770d9a2fa801f493d3c5" + } + }, + { + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "cf3778a35129a40dee14366515201b7ed2c0f346" + } + }, + { + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "9d00a051373cb81f79cb6375942cf3ec9fff2fe6" + } + }, + { + "commit": { + "author": { + "user": 
{ + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "1eae67cf404aa8dffb80b8e85180f943878d52a6" + } + }, + { + "commit": { + "author": { + "user": { + "login": "janeyx99" + }, + "email": "janeyx@fb.com", + "name": "Jane Xu" + }, + "oid": "ce0e69dcda0fe41a6e964d6ac70ce8016979c71a" + } + }, + { + "commit": { + "author": { + "user": { + "login": "swolchok" + }, + "email": "swolchok@fb.com", + "name": "Scott Wolchok" + }, + "oid": "6faba554f6e49777f24911928edb3061b6ed0e3d" + } + }, + { + "commit": { + "author": { + "user": { + "login": "IvanYashchuk" + }, + "email": "ivan.yashchuk@aalto.fi", + "name": "Ivan Yashchuk" + }, + "oid": "d1d0e03f57a359f8f95331f9a34b8bed3e7cc845" + } + }, + { + "commit": { + "author": { + "user": { + "login": "Chillee" + }, + "email": "chilli@fb.com", + "name": "Horace He" + }, + "oid": "bb46bd9233a9fc631802a902cb48a4c13c2722ca" + } + }, + { + "commit": { + "author": { + "user": { + "login": "mehtanirav" + }, + "email": "niravmehta@fb.com", + "name": "Nirav Mehta" + }, + "oid": "3b1007fe4be12e483f2620fbac67cae42e703efc" + } + }, + { + "commit": { + "author": { + "user": { + "login": "mehtanirav" + }, + "email": "niravmehta@fb.com", + "name": "Nirav Mehta" + }, + "oid": "b4b65228dd0c109f5fdf17c7d9e56f60a98e398b" + } + }, + { + "commit": { + "author": { + "user": { + "login": "albanD" + }, + "email": "albandes@fb.com", + "name": "Alban Desmaison" + }, + "oid": "d629e300705196d3ae0bac5ed983b197101fa2ee" + } + }, + { + "commit": { + "author": { + "user": { + "login": "bigfootjon" + }, + "email": "jonjanzen@fb.com", + "name": "Jon Janzen" + }, + "oid": "52754b9e515f378f8476ad44d75b0a692bad8cde" + } + }, + { + "commit": { + "author": { + "user": { + "login": "samdow" + }, + "email": "samdow@fb.com", + "name": "samdow" + }, + "oid": "128c3ad747093f4970329a82c7c4720420faeff2" + } + }, + { + "commit": { + "author": { + "user": { + "login": "arindamroy-eng" + }, + "email": "61168652+arindamroy-eng@users.noreply.github.com", + "name": "arindamroy-eng" + }, + "oid": "2a0bda7d32a5bcc9827f7254a7b77cceb16ba973" + } + } + ], + "pageInfo": { + "endCursor": "MTAw", + "hasNextPage": true + }, + "totalCount": 131 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS", + "detailsUrl": "https://code.intern.facebook.com/cla/" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAWuNRg4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693698" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRAI=" + }, + { + "node": { + "app": { + "name": "Netlify", + "databaseId": 13473 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693712" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRBA=" + }, + { + "node": { + "app": { + "name": "Azure Pipelines", + "databaseId": 9426 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": 
"https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693725" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRB0=" + }, + { + "node": { + "app": { + "name": "Dependabot", + "databaseId": 29110 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693741" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRC0=" + }, + { + "node": { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693761" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsREE=" + }, + { + "node": { + "app": { + "name": "PyTorch Bot", + "databaseId": 40112 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693774" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRE4=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192463/jobs/3232430975" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAWuNR-Y=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193694412" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRsw=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461134" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461211" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461301" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461386" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461521" + }, + { + "name": "lintrunner", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461634" + }, + { + "name": "workflow-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461717" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAWuN84s=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": 
"https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193694417" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRtE=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pull" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "linux-xenial-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232460797" + }, + { + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232460951" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461088" + }, + { + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461294" + }, + { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461410" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461543" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461628" + }, + { + "name": "linux-bionic-rocm5.0-py3.7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461719" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461789" + }, + { + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461869" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461946" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462044" + }, + { + "name": "linux-xenial-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462112" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462244" + }, + { + "name": "win-vs2019-cuda11.3-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462360" + }, + { + "name": "linux-xenial-py3-clang5-mobile-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462432" + }, + { + "name": "win-vs2019-cpu-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462521" + }, + { + "name": 
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462621" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462683" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462738" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232545510" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232545571" + }, + { + "name": "linux-docs / build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547522" + }, + { + "name": "linux-docs / build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547612" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547714" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547764" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547824" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547869" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547909" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547973" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232553452" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232553558" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232553605" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232553650" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": 
"SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232563716" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232563763" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232582650" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232582703" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232582741" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232590204" + }, + { + "name": "linux-bionic-rocm5.0-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232608872" + }, + { + "name": "linux-bionic-rocm5.0-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232608976" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232637097" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232637199" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232637259" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232639932" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232687012" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232687074" + }, + { + "name": "win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232785088" + }, + { + "name": "win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232785153" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAWuVD9M=", + "hasNextPage": true + } + }, + "conclusion": "SUCCESS", + "url": 
"https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193694439" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRuc=" + } + ], + "pageInfo": { + "hasNextPage": false + } + }, + "status": null, + "pushedDate": "2022-04-20T17:10:41Z", + "oid": "5696e8357cf38f852ef3d680381513e26f202371" + } + } + ] + }, + "changedFiles": 348, + "files": { + "nodes": [ + { + "path": ".circleci/cimodel/data/pytorch_build_data.py" + }, + { + "path": ".circleci/cimodel/data/pytorch_build_definitions.py" + }, + { + "path": ".circleci/scripts/cpp_doc_push_script.sh" + }, + { + "path": ".circleci/scripts/python_doc_push_script.sh" + }, + { + "path": ".github/actions/checkout-pytorch/action.yml" + }, + { + "path": ".github/merge_rules.json" + }, + { + "path": ".github/scripts/gitutils.py" + }, + { + "path": ".github/scripts/gql_mocks.json" + }, + { + "path": ".github/scripts/trymerge.py" + }, + { + "path": ".github/workflows/_bazel-build-test.yml" + }, + { + "path": ".github/workflows/_linux-build.yml" + }, + { + "path": ".github/workflows/_linux-test.yml" + }, + { + "path": ".github/workflows/_mac-test.yml" + }, + { + "path": ".github/workflows/_rocm-test.yml" + }, + { + "path": ".github/workflows/_win-test.yml" + }, + { + "path": ".github/workflows/buck_build_test.yml" + }, + { + "path": ".github/workflows/lint.yml" + }, + { + "path": ".github/workflows/periodic.yml" + }, + { + "path": ".github/workflows/pull.yml" + }, + { + "path": ".github/workflows/trunk.yml" + }, + { + "path": ".jenkins/pytorch/macos-test.sh" + }, + { + "path": ".jenkins/pytorch/test.sh" + }, + { + "path": ".jenkins/pytorch/win-test.sh" + }, + { + "path": ".lintrunner.toml" + }, + { + "path": "BUILD.bazel" + }, + { + "path": "CODEOWNERS" + }, + { + "path": "README.md" + }, + { + "path": "aten/src/ATen/BatchingRegistrations.cpp" + }, + { + "path": "aten/src/ATen/Dispatch.h" + }, + { + "path": "aten/src/ATen/ExpandUtils.h" + }, + { + "path": "aten/src/ATen/FunctionalInverses.cpp" + }, + { + "path": "aten/src/ATen/FunctionalStorageImpl.cpp" + }, + { + "path": "aten/src/ATen/FunctionalStorageImpl.h" + }, + { + "path": "aten/src/ATen/FunctionalTensorWrapper.cpp" + }, + { + "path": "aten/src/ATen/FunctionalTensorWrapper.h" + }, + { + "path": "aten/src/ATen/FunctionalizeFallbackKernel.cpp" + }, + { + "path": "aten/src/ATen/NestedTensorImpl.cpp" + }, + { + "path": "aten/src/ATen/OpMathType.h" + }, + { + "path": "aten/src/ATen/SparseCsrTensorUtils.h" + }, + { + "path": "aten/src/ATen/ThreadLocalState.cpp" + }, + { + "path": "aten/src/ATen/ThreadLocalState.h" + }, + { + "path": "aten/src/ATen/autocast_mode.cpp" + }, + { + "path": "aten/src/ATen/autocast_mode.h" + }, + { + "path": "aten/src/ATen/core/SymIntArrayRef.cpp" + }, + { + "path": "aten/src/ATen/core/SymIntArrayRef.h" + }, + { + "path": "aten/src/ATen/core/TensorBase.h" + }, + { + "path": "aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h" + }, + { + "path": "aten/src/ATen/core/dispatch/Dispatcher.h" + }, + { + "path": "aten/src/ATen/core/interned_strings.h" + }, + { + "path": "aten/src/ATen/core/ivalue.cpp" + }, + { + "path": "aten/src/ATen/core/ivalue.h" + }, + { + "path": "aten/src/ATen/core/ivalue_inl.h" + }, + { + "path": "aten/src/ATen/core/jit_type.h" + }, + { + "path": "aten/src/ATen/core/jit_type_base.h" + }, + { + "path": "aten/src/ATen/core/type.cpp" + }, + { + "path": "aten/src/ATen/cuda/CUDASparse.h" + }, { - "node": { - "name": "triaged" - } + "path": "aten/src/ATen/cuda/llvm_complex.cpp" }, { - "node": { - "name": "open 
source" - } + "path": "aten/src/ATen/cuda/llvm_jit_strings.h" }, { - "node": { - "name": "cla signed" - } + "path": "aten/src/ATen/native/Blas.cpp" }, { - "node": { - "name": "Stale" - } - } - ] - } - } - } - } - }, - "query_sha=62ce809793481ce6ddce6e1a19d9b0761755ff0ff75decaf8a79419eaf793110 cursor=Y3Vyc29yOnYyOpHOKCmhXQ== name=pytorch number=31093 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "comments": { - "nodes": [ + "path": "aten/src/ATen/native/Itertools.cpp" + }, { - "bodyText": "Hi, @mingfeima @soumith @Jianhui-Li\nthis will improve the test coverage of mkldnn convolution, would you please review it?\nThe current code is forward only, do we need to cover backward, if yes, we can add backward.", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 564806270 + "path": "aten/src/ATen/native/LinearAlgebra.cpp" }, { - "bodyText": "@mingxiaoh, what is the value in testing DNNL as part of Pytorch validation for the Pytorch developers? Shouldn't having these tests run in DNNL validation be enough?", - "author": { - "login": "vpirogov" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 564808528 + "path": "aten/src/ATen/native/SoftMax.cpp" }, { - "bodyText": "@vpirogov The main value is to serve as a blind test to DNNL. If DNNL adds these test to DNNL test sets, it lost the value as a blind test. The spirit of validation is to cross check.\n@gottbrath @gchanan The test was developed per the request of Pytorch team. Mingxiao made an effort to reduce the execution time to a few second but still with good coverage. Although the test today is focused on DNNL, it could be easily extended to be blind test for any conv implementation used in Pytorch.", - "author": { - "login": "Jianhui-Li" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 567826907 + "path": "aten/src/ATen/native/TensorConversions.cpp" }, { - "bodyText": "@mruberry thanks for the comment. As for the chainer dependency, we import it is because we would like to use its testing function for pytest test cases combinations, other wise we need to write much more code to achieve same effect. So, can we use it?", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 574563012 + "path": "aten/src/ATen/native/TensorShape.cpp" }, { - "bodyText": "@mingxiaoh You cannot import chainer. Looking at the code you should be able to achieve the same effect without it.", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 575272358 + "path": "aten/src/ATen/native/TensorShape.h" }, { - "bodyText": "@mruberry ok, we will change it according to your requirement. Thanks", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 583917522 + "path": "aten/src/ATen/native/Unique.cpp" }, { - "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/31093\n\ud83d\udd27 \u00a0Opt-in to CIFlow to control what jobs run on your PRs\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 29f6aa6 (more details on the Dr. CI page):\n\nCommit 29f6aa6 was recently pushed. Waiting for builds...\n\nThis comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.\nPlease report bugs/suggestions to the (internal) Dr. 
CI Users group.\nClick here to manually regenerate this comment.", - "author": { - "login": "dr-ci" - }, - "authorAssociation": "NONE", - "editor": { - "login": "facebook-github-bot" - }, - "databaseId": 628466876 + "path": "aten/src/ATen/native/cuda/BinaryMiscBackwardOpsKernels.cu" }, { - "bodyText": "@mruberry how about those cudnn UT error? we add check for it but it should be NV to fix cudnn bugs.", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 629955767 + "path": "aten/src/ATen/native/cuda/CUDAJitLoops.cuh" }, { - "bodyText": "Hey @mingxiaoh! You're right, of course, that you shouldn't have to fix cuDNN bugs. Would you please:\n\nAssert that the test case fails, so we know it's failing and if someone fixes it they'll know what test to update.\nFile a new issue explaining the behavior and providing a short PyTorch program to reproduce the issue.\n\nThen we can ping NVIDIA on that issue.", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 629997129 + "path": "aten/src/ATen/native/cuda/JitLoops.cuh" }, { - "bodyText": "about the suggestion 'Assert that the test case fails, so we know it's failing and if someone fixes it they'll know what test to update. ', if we only assert it and continue the following test, I guess users might always ignore them in later test. Anyway, any similar example case for reference?", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 630010734 + "path": "aten/src/ATen/native/cuda/Lerp.cu" }, { - "bodyText": "In this recent PR https://github.com/pytorch/pytorch/pull/38505/files, for example, you can see that the construction of bool tensors wasn't working properly, so the test author cited the relevant issue and asserted that the incorrect behavior happened, as expected. You can also see how these lines are being removed by https://github.com/pytorch/pytorch/pull/38392/files, which fixes the issue.\nAnother common pattern is to use with self.assertRaises(RuntimeError/AssertionError/etc.):.", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 630014823 + "path": "aten/src/ATen/native/cuda/PersistentSoftmax.cuh" }, { - "bodyText": "@mruberry the failed UT case is not introduced by our modification, how to handle this issue?", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 631187735 + "path": "aten/src/ATen/native/cuda/SoftMax.cu" }, { - "bodyText": "@mingxiaoh You mean the failures on ROCm? You may ignore them. Be sure to re-request review when you're ready.", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 631191425 + "path": "aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu" }, { - "bodyText": "@mruberry we already skipped those ROCm errors, but there are stil somel error caused by the original code, they are not introduced by our modification.", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 631886529 + "path": "aten/src/ATen/native/cuda/Unique.cu" }, { - "bodyText": "I understand. 
Let me know when you're ready for me to review.", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 631908011 + "path": "aten/src/ATen/native/cuda/jit_utils.cpp" }, { - "bodyText": "@mruberry thanks, we are ready for review now.", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 631909442 + "path": "aten/src/ATen/native/cuda/jit_utils.h" }, { - "bodyText": "@mingxiaoh Great! I'll take a look ASAP.", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 631910556 + "path": "aten/src/ATen/native/native_functions.yaml" }, { - "bodyText": "@mruberry we just pull the latest code and updated the patch according to your comment, may you please help double check it? BTW, the new failed case in preci is not introduced by our modification.", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 633430458 + "path": "aten/src/ATen/native/nested/NestedTensorMath.cpp" }, { - "bodyText": "@ailzhang would you please check the comment below? Thanks.\nIs there a reason why this TestConv2dExt is a new class instead a test inside TestNN?\n//comment: it is actually suggested by Tongzhou Wang in another thread before.\nAlthough this test sits in generic testing framework, it's actually comparing thnn/mkldnn/cudnn results specially. I feel it's better to make it truly generic so that it compares any device result with CPU result. Alternatively you can mark this test only run when torch.backends.mkldnn.is_available()=True\n//comment: but our goal is to compare the result with that of thnn. Anyway, if you insist, we can start to compare it with cpu.", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": { - "login": "mingxiaoh" - }, - "databaseId": 634432326 + "path": "aten/src/ATen/native/quantized/cpu/qembeddingbag.cpp" }, { - "bodyText": "Pruning reviewers. @ngimel, @VitalyFedyunin, this PR is looking pretty good from a test framework perspective. Would one of you like to review?", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 634557563 + "path": "aten/src/ATen/native/quantized/cpu/qsoftmax.cpp" }, { - "bodyText": "@mruberry Thanks, would you please help review it again. BTW: failed case is not introduced by our modification.", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 635256214 + "path": "aten/src/ATen/native/quantized/cudnn/BinaryOps.cpp" }, { - "bodyText": "@mruberry we moved our case to TestNNDeviceType class, would you please help review it again? BTW, those failed cases are not introduced by our code", - "author": { - "login": "1pikachu" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 637364148 + "path": "aten/src/ATen/native/quantized/cudnn/Linear.cpp" }, { - "bodyText": "@mruberry we moved our case to TestNNDeviceType class, would you please help review it again? BTW, those failed cases are not introduced by our code\n\n@ngimel will follow-up on the test itself sometime this week or early next week.", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 637444457 + "path": "aten/src/ATen/native/quantized/cudnn/utils.h" }, { - "bodyText": "@mruberry we moved our case to TestNNDeviceType class, would you please help review it again? 
BTW, those failed cases are not introduced by our code\n\n@ngimel will follow-up on the test itself sometime this week or early next week.\n\n@mruberry thank you", - "author": { - "login": "1pikachu" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 637479226 + "path": "aten/src/ATen/native/sparse/SparseCsrTensor.cpp" }, { - "bodyText": "Improving test coverage of math libraries is certainly a good goal and this PR is moving towards it. I have some doubts about implementation decisions made, and about running this PR as part of regular pytorch CI.\nIf the primary goal of this PR is to test correctness of the convolution implementations in the vendor library, then it does not serve this purpose. The absolute majority of the 4000+ test cases come from group 1, where different kernel sizes/strides/dilations are used to produce the output of size 1x1. This can test whether pytorch correctly passes convolution parameters to the backends (although there are cheaper ways to do that), but as actual library correctness check it is almost useless - libraries use very different kernels depending in the input/output sizes, and tests with toy sizes like this don't invoke the real bread-and-butter kernels.\nAlso, if this test suite is meant as primary a means of testing vendor libraries (which is a good goal!) it does not have a place as a part of pytorch regular CI, and should be run when the corresponding vendor libraries are updated. I'd suggest moving this test out into a separate file (maybe even outside of torch/test directory) and have it as a part of library update/qualification process rather than regular CI.\nAlso, if the primary goal is to enable easier testing of vendor libraries correctness, perhaps we should rethink the mechanism of the generation of test cases. It should be easy to add a test case with a particular set of parameters that was found to be buggy. Also, running a cross-product of cases in a multi-dimensional space (as this PR does) is rarely an efficient way of getting a signal, some forms of random sampling usually provide a way to get better correctness signal why using less resources.\nAlso, when testing libraries it is important to test both forward and backward functions, whereas this PR does forward only. I'm openminded on whether convTransposed should be tested or not - if we are testing vendor libraries, then it's not necessary, convTransposed calls the same underlying functions, if we are testing pytorch, then it makes sense to test it separately because it takes different codepaths.", - "author": { - "login": "ngimel" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 637827507 + "path": "aten/src/ATen/native/ts_native_functions.yaml" }, { - "bodyText": "@mruberry ngimel is quite responsible, but it seems that she is not familiar with the background of this pull-request, since this pull-request is pending for so such a long time, each time we are almost done, then reviewer changes, each reviewer has different idea, it is good, but, would it be better if you help review it or ask the same reviewer to review it considering that you are more familiar with the background/change history? 
Thanks in advance.", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 637912105 + "path": "aten/src/ATen/record_function.cpp" }, { - "bodyText": "@mruberry ngimel is quite responsible, but it seems that she is not familiar with the background of this pull-request, since this pull-request is pending for so such a long time, each time we are almost done, then reviewer changes, each reviewer has different idea, it is good, but, would it be better if you help review it or ask the same reviewer to review it considering that you are more familiar with the background/change history? Thanks in advance.\n\nWe know this PR has been open for awhile and we respect that your time is valuable, but we want to make sure we're making the right change here, and I think @ngimel's comments reflect that and should not be too difficult to address. As I understand, her points are:\n\nThis is a good PR with an exciting idea. To let it run longer and test more cases maybe it should run outside the regular PyTorch CI.\nTo remedy this, let's create a test/math_libraries folder and put this test there: test/math_libaries/convolutions.py. Yes, this is different from our requests in the past, which is our mistake, but it should be an easy change.\nTo make the test more interesting it'd be good for the test cases to resemble convolutions used in practice. The current test cases seem like similar \"toy\" examples. Without time pressure we should be able to run larger, more computationally intensive convolutions.\nLet's change the test cases to include some practical convolutions, make it easy to add test cases, and think about how we might generate other interesting cases. (We should also test backwards once we have more time!)\n\nAnd I think these are good points. Maybe the PR doesn't create a new way to generate interesting convolutions to start and instead only runs a few representative convolutions, but @ngimel is positioning the work for success so that it's useful and we can continue to improve on it in the future.\nDoes that make sense?", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 637924703 + "path": "aten/src/ATen/record_function.h" + }, + { + "path": "aten/src/ATen/templates/Operators.h" + }, + { + "path": "aten/src/ATen/templates/RegisterFunctionalization.cpp" }, { - "bodyText": "@mruberry we were required to finish the test in limited time long long before, at that time, jianhui discussed this issue with you, and you are all agreed with the current test scope and test case number and test time, so you meant you change your mind now? you are not care about the test time currently? Sorry, this issue is pending so long, we are struggling with it now and would like to finish it asap. Given this, it would be be better if you raise all the requirement at a time, considering that we have many tasks at hand, we are hoping so eagerly that we can finish this PR and use it for further test for bugs finding.", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": { - "login": "mingxiaoh" - }, - "databaseId": 637960626 + "path": "aten/src/ATen/test/basic.cpp" }, { - "bodyText": "@mruberry we were required to finish the test in limited time long long before, at that time, jianhui discussed this issue with you, and you are all agreed with the current test scope and test case number and test time, so you meant you change your mind now? 
you are not care about the test time currently? Sorry, this issue is pending so long, we are struggling with it now and would like to finish it asap. Given this, it would be be better if you raise all the requirement at a time, considering that we have many tasks at hand, we are hoping so eagerly that we can finish this PR and use it for further test for bugs finding.\n\nI'm sorry, I don't think I've talked to @Jianhui-Li before. It's true that the team we expressed a concern about timing if the test was to be run in the CI initially, but I think now that we understand what the test is trying to do better we're not sure the CI is the best place for it. The PR was also closed after a lengthy period of inactivity, and we assumed it had simply been abandoned.\nDo you know who @Jianhui-Li spoke with about this issue originally? Maybe I can follow-up with them for more context.", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 637967153 + "path": "aten/src/ATen/test/vmap_test.cpp" }, { - "bodyText": "@mruberry it is reviewed and discussed with @soumith before. Anyway, since current reviewer is you, so, it should be decided by you. So, what we should do next?", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 637978356 + "path": "binaries/record_function_benchmark.cc" }, { - "bodyText": "@mruberry it is reviewed and discussed with @soumith before. Anyway, since current reviewer is you, so, it should be decided by you. So, what we should do next?\n\nI think this will be easier to discuss at the regular Intel-FB meeting.", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 638446723 + "path": "c10/core/DispatchKey.cpp" }, { - "bodyText": "@mruberry it is reviewed and discussed with @soumith before. Anyway, since current reviewer is you, so, it should be decided by you. So, what we should do next?\n\nI think this will be easier to discuss at the regular Intel-FB meeting.\n\nLet me sync with Mingxiao and follow up with this. Thanks.", - "author": { - "login": "Jianhui-Li" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 638451670 + "path": "c10/core/DispatchKey.h" }, { - "bodyText": "@mruberry would you please help review it again?", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 653028208 + "path": "c10/core/DispatchKeySet.h" }, { - "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 654443242 + "path": "c10/test/core/DispatchKeySet_test.cpp" }, { - "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?\n\nyes, we talked it with jianhui, and we decided to follow your ideas. Anyway, we would like to do so modification later, will contact you for review tomorrow. 
Thanks", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 656062287 + "path": "c10/util/ArrayRef.h" }, { - "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?\n\nyes, we talked it with jianhui, and we decided to follow your ideas. Anyway, we would like to do so modification later, will contact you for review tomorrow. Thanks\n\n@mruberry the code is ready for review now, would you please take time for it? Thanks.", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 658071151 + "path": "caffe2/core/tensor.h" }, { - "bodyText": "super nit: renaming files to .json will make it more IDE friendly.", + "path": "docs/source/conf.py" + }, + { + "path": "docs/source/fx.rst" + } + ], + "pageInfo": { + "endCursor": "MTAw", + "hasNextPage": true + } + }, + "reviews": { + "nodes": [], + "pageInfo": { + "startCursor": null, + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ + { + "bodyText": "Merge failed due to Matched rule superuser, but it was not reviewed yet by any of:zou3519,abhikrish,mehtanirav,wconstab,lc0, ...", "author": { - "login": "VitalyFedyunin" + "login": "pytorchmergebot" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 658464685 + "databaseId": 1104215370 }, { - "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?\n\nyes, we talked it with jianhui, and we decided to follow your ideas. Anyway, we would like to do so modification later, will contact you for review tomorrow. Thanks\n\n@mruberry the code is ready for review now, would you please take time for it? Thanks.\n\nCool! I took a look with @ngimel, once these issues are addressed I think we're good to go!", + "bodyText": "Merge failed due to Matched rule superuser, but PR has not been reviewed yet", "author": { - "login": "mruberry" + "login": "pytorchmergebot" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 659164401 + "databaseId": 1104220908 }, { - "bodyText": "@ngimel & @VitalyFedyunin We have changed the code according to your suggestions, would you please review it again? Thanks.", + "bodyText": "@pytorchbot merge this", "author": { - "login": "mingxiaoh" + "login": "malfet" }, - "authorAssociation": "NONE", + "authorAssociation": "MEMBER", "editor": null, - "databaseId": 660884305 + "databaseId": 1104378397 }, { - "bodyText": "@ngimel & @VitalyFedyunin We have changed the code according to your suggestions, would you please review it again? 
Thanks.\n\nUpdated: one more question about tolerances, one code cleanup recommendation, and one task leftover from the last review.", + "bodyText": "Merge failed due to Matched rule superuser, but PR has not been reviewed yet\nRaised by https://github.com/pytorch/pytorch/actions/runs/2197877090", "author": { - "login": "mruberry" + "login": "pytorchmergebot" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 662678464 + "databaseId": 1104379712 }, { - "bodyText": "Updated: one more question about tolerances, one code cleanup recommendation, and one task leftover from the last review.\n@mruberry we have finished the modification according to your comment, would you please review it again? Thanks.", + "bodyText": "Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale. Feel free to remove the Stale label if you feel this was a mistake. If you are unable to remove the Stale label please contact a maintainer in order to do so. If you want the bot to never mark this PR stale again, add the no-stale label.Stale pull requests will automatically be closed after 30 days of inactivity.", "author": { - "login": "mingxiaoh" + "login": "github-actions" }, "authorAssociation": "NONE", "editor": null, - "databaseId": 662930687 - }, + "databaseId": 1160658699 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQdD9Sg==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ { - "bodyText": "The code looks good, but I tried running the test suite and hit the following failures:\n======================================================================\nFAIL: test_conv2d_ext_cuda_float16 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 241, in instantiated_test\n result = test(self, device_arg, dtype)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 542, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 411, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 102, in test_conv2d_ext\n msg=msg\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 1085, in assertEqual\n self.assertTrue(result, msg=msg)\nAssertionError: False is not true : device:cuda:0, dtype:torch.float16, group:1, batchsize:22input channel:448, output channel:384, bias:False, padding:[1, 1], dilation:[1, 1], stride:[1, 1], kernel:[3, 3]\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float32 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File 
\"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 241, in instantiated_test\n result = test(self, device_arg, dtype)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 542, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 411, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 102, in test_conv2d_ext\n msg=msg\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 1085, in assertEqual\n self.assertTrue(result, msg=msg)\nAssertionError: False is not true : device:cuda:0, dtype:torch.float32, group:1, batchsize:22input channel:80, output channel:192, bias:False, padding:[0, 0], dilation:[1, 1], stride:[1, 1], kernel:[3, 3]\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float64 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 241, in instantiated_test\n result = test(self, device_arg, dtype)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 542, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 411, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 106, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\nLooking at the first invalid convolution, for example, it's:\n {\n \"case_name\":\"masknet_p1:conv33\",\n \"mb\":1,\n \"g\":1,\n \"ic\":512,\n \"ih\":64,\n \"iw\":64,\n \"oc\":12,\n 
\"kh\":1,\n \"kw\":1,\n \"sh\":1,\n \"sw\":1,\n \"ph\":0,\n \"pw\":0,\n \"dh\":0,\n \"dw\":0,\n \"bias\":\"False\"\n },\n\nwhich has a dh and dw of zero, causing it to be added to invalid cases here:\ndh, dw = case['dh'], case['dw']\n has_bias = case['bias']\n if dh == 0 or dw == 0:\n invalid_cases.append(case_name)", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": { - "login": "mruberry" - }, - "databaseId": 663240268 + "node": { + "name": "cla signed" + } }, { - "bodyText": "@mruberry the failure was not detected is because we did not export the cudnn path. Yes, you are right, we need to a large atol of 1e-2 . Would you please help review it again? Thanks.", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 664373079 - }, + "node": { + "name": "Stale" + } + } + ] + } + } + } + } + }, + "query_sha=74bd29fe945c49fde4818e873fa62bc60b55b4ef6ae3f2bb719bab6cddbaa7ce cursor=MTAw name=pytorch number=76118 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "commits_with_authors": { + "nodes": [ { - "bodyText": "@mruberry the failure was not detected is because we did not export the cudnn path. Yes, you are right, we need to a large atol of 1e-2 . Would you please help review it again? Thanks.\n\nBefore I run these tests again, is an atol of 1e-2 needed for all types or just half? Also, how does 1e-2 compare to the values that are being compared?", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 664569507 + "commit": { + "author": { + "user": { + "login": "clee2000" + }, + "email": "csl@fb.com", + "name": "Catherine Lee" + }, + "oid": "7f560351ae04ea43e58fbfda885bcf216aa26cde" + } }, { - "bodyText": "@mruberry 1e-2 is experimental result, details see below, random means it might be failed sometimes.\n\n\n\natol,rtol\n1e-2,1e-2\n1e-2,1e-3\n1e-3,1e-2\n1e-3,1e-3\n1e-4,1e-3\n1e-3,1e-4\n1e-4,1e-4\n1e-4,1e-5\n1e-5,1e-4\n\n\n\n\nCuda float16\npass\npass\npass\npass\npass\nfail\nFail\nFail\nfail\n\n\nCuda float32\npass\nrandom\nrandom\nrandom\nrandom\nrandom\nrandom\nrandom\nfail", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 666894774 + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "e8677ed168a036bc7e590d800fe98dd15f10581b" + } }, { - "bodyText": "@mruberry would you please find time to review it again? Thanks.", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 668380451 + "commit": { + "author": { + "user": { + "login": "robieta" + }, + "email": "taylorrobie@fb.com", + "name": "Taylor Robie" + }, + "oid": "ac5611caa13642ef8dbe0db453b283b42cbd900b" + } }, { - "bodyText": "@mruberry would you please find time to review it again? 
Thanks.\n\nI was just about to try and run this again locally but it looks like the files describing the convolutions are missing?", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 670306210 + "commit": { + "author": { + "user": { + "login": "robieta" + }, + "email": "taylorrobie@fb.com", + "name": "Taylor Robie" + }, + "oid": "1184afbd3bfde0f46133aef09e55e18d3bfb3c3e" + } }, { - "bodyText": "@mruberry sorry but what is missing actually?", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 670322557 + "commit": { + "author": { + "user": { + "login": "minsii" + }, + "email": "msi@fb.com", + "name": "Min Si" + }, + "oid": "1c05604f3d049c67dc678d0295c0add470bff3dc" + } }, { - "bodyText": "@mruberry sorry but what is missing actually?\n\nThe JSON files.", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 670591170 + "commit": { + "author": { + "user": null, + "email": "eellison@devfair044.h1.fair", + "name": "Elias Ellison" + }, + "oid": "76ab5101bd36e8d73637d31bbea125240b7b27f0" + } }, { - "bodyText": "@mruberry sorry but what is missing actually?\n\nThe JSON files.\n\n@mruberry sorry, we add them now, would you please check it again? Thanks.", - "author": { - "login": "mingxiaoh" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 673402901 + "commit": { + "author": { + "user": null, + "email": "eellison@devfair044.h1.fair", + "name": "Elias Ellison" + }, + "oid": "c774050e92c3d8e52968e1eb635dd3e9491104b3" + } }, { - "bodyText": "I cloned your repo and ran the tests:\n~/pytorch/test/math_libraries$ python convolutions.py\nFFFF\n======================================================================\nFAIL: test_conv2d_ext_cpu_float32 (__main__.TestConvExtCPU)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float16 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float32 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float64 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n----------------------------------------------------------------------\nRan 4 tests in 33.838s\n\nFAILED (failures=4)\n\nStill fails.", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 673760580 - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOIapCfg==", - "hasPreviousPage": false - } - } - } - } - } - }, - "query_sha=2dc8bfb6750c4a2402124dc53123d266427c0b92d06add20e3221b57a0f5268f commit=6882717f73deffb692219ccd1fd6db258d8ed684 name=pytorch owner=pytorch": { - "data": { - "repository": { - "object": { - "checkSuites": { - "edges": [ - { - "node": { - "app": { - "name": "Facebook GitHub Tools", - "databaseId": 12274 + "commit": { + "author": { + "user": { + "login": "guoyejun" + }, + "email": "yejun.guo@intel.com", + "name": "Guo Yejun" }, - "workflowRun": null, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } + "oid": "8981595c5361f07186f4534f3be71f1d829a3046" + } + }, + { + "commit": { + "author": { + "user": { + "login": "BowenBao" + }, + "email": "bowbao@microsoft.com", + "name": "BowenBao" }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625272" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hng=" + "oid": "036f362904024ac9481248965009f312bec6656b" + } }, { - "node": { - "app": { - "name": "Netlify", - "databaseId": 13473 + "commit": { + "author": { + "user": { + "login": "janeyx99" + }, + "email": "janeyx@fb.com", + "name": "Jane Xu" }, - "workflowRun": null, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } + "oid": "457d994933f164a9fd70da5ca2733dd6c046a28b" + } + }, + { + "commit": { + "author": { + "user": { + "login": "janeyx99" + }, + "email": "janeyx@fb.com", + "name": "Jane Xu" }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625297" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hpE=" + "oid": "f49ebc77520774e71722111d554a0215a26956df" + } }, { - "node": { - "app": { - "name": "Azure Pipelines", - "databaseId": 
9426 + "commit": { + "author": { + "user": { + "login": "mikeiovine" + }, + "email": "mikeiovine@fb.com", + "name": "Mike Iovine" }, - "workflowRun": null, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } + "oid": "f069e1a4a5f98d3fe961e4fc562ede59f59b4026" + } + }, + { + "commit": { + "author": { + "user": { + "login": "salilsdesai" + }, + "email": "salilsdesai@fb.com", + "name": "Salil Desai" }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625308" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hpw=" + "oid": "30bccf58393b288412a0f5a2423a1a41ffce258e" + } }, { - "node": { - "app": { - "name": "Dependabot", - "databaseId": 29110 + "commit": { + "author": { + "user": { + "login": "angelayi" + }, + "email": "angelayi@fb.com", + "name": "Angela Yi" }, - "workflowRun": null, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } + "oid": "f4ba440fe8a632c1ee88e01f7746a8a92c8f3902" + } + }, + { + "commit": { + "author": { + "user": null, + "email": "shirong@fb.com", + "name": "Shirong Wu" }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625328" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hrA=" + "oid": "d203346c93ba96d626c6c02910888198c789ba69" + } }, { - "node": { - "app": { - "name": "Codecov", - "databaseId": 254 + "commit": { + "author": { + "user": null, + "email": "jamesreed@fb.com", + "name": "James Reed" }, - "workflowRun": null, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } + "oid": "73a4e34963e212b799a191fd031d2fa31d17e0ac" + } + }, + { + "commit": { + "author": { + "user": { + "login": "Krovatkin" + }, + "email": "korovaikon@gmail.com", + "name": "Nikolay Korovaiko" }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625347" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hsM=" + "oid": "b9d5206dfb46f09f953aba3ffb0e1e33a99032ee" + } }, { - "node": { - "app": { - "name": "PyTorch Bot", - "databaseId": 40112 + "commit": { + "author": { + "user": { + "login": "ngimel" + }, + "email": "ngimel@fb.com", + "name": "Natalia Gimelshein" }, - "workflowRun": null, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } + "oid": "12114e6937573fead54e11ae6cdebe5b31dee302" + } + }, + { + "commit": { + "author": { + "user": { + "login": "s4ayub" + }, + "email": "shababayub@fb.com", + "name": "Shabab Ayub" }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625357" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hs0=" + "oid": "f2323f76ad6f7f590285bf9c6d20c14a79542563" + } }, { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "Lint" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "workflow-checks", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257521878?check_suite_focus=true" - }, - { - "name": "quick-checks", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257521941?check_suite_focus=true" - }, - { - "name": "Test tools", - "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/7257522171?check_suite_focus=true" - }, - { - "name": "toc", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257522418?check_suite_focus=true" - }, - { - "name": "Test collect_env (with_torch)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257522648?check_suite_focus=true" - }, - { - "name": "Test collect_env (without_torch)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257522731?check_suite_focus=true" - }, - { - "name": "Test collect_env (older_python_version)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257522798?check_suite_focus=true" - }, - { - "name": "lintrunner", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257523046?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbCVA2Y=", - "hasNextPage": false - } + "commit": { + "author": { + "user": { + "login": "jaglinux" + }, + "email": "jagdish.krishna@gmail.com", + "name": "Jagadish Krishnamoorthy" }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625464" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1hzg=" + "oid": "acd4b5abe2739c09c1a02524eceda46ff93fd385" + } }, { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 + "commit": { + "author": { + "user": { + "login": "cccclai" + }, + "email": "chenlai@fb.com", + "name": "Chen Lai" }, - "workflowRun": { - "workflow": { - "name": "trunk" - } + "oid": "04179f533283132fa334a9f91a070b1712f7323d" + } + }, + { + "commit": { + "author": { + "user": { + "login": "zaxtax" + }, + "email": "rob@zinkov.com", + "name": "Rob Zinkov" }, - "checkRuns": { - "nodes": [ - { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257522494?check_suite_focus=true" - }, - { - "name": "android-emulator-build-test / build-and-test", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257522741?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-no-ops / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257522887?check_suite_focus=true" - }, - { - "name": "macos-10-15-py3-arm64 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257523057?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda10.2-py3.9-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257523301?check_suite_focus=true" - }, - { - "name": "ios-12-5-1-x86-64 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257523681?check_suite_focus=true" - }, - { - "name": "libtorch-linux-bionic-cuda11.6-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257523926?check_suite_focus=true" - }, - { - "name": "win-vs2019-cuda11.6-py3 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257524141?check_suite_focus=true" - }, - { - "name": "libtorch-linux-xenial-cuda10.2-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257524423?check_suite_focus=true" - }, - { - "name": 
"linux-bionic-py3.7-clang9-slow / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257524568?check_suite_focus=true" - }, - { - "name": "linux-bionic-rocm5.1-py3.7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257524710?check_suite_focus=true" - }, - { - "name": "macos-10-15-py3-lite-interpreter-x86-64 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257524925?check_suite_focus=true" - }, - { - "name": "macos-11-py3-x86-64 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257525196?check_suite_focus=true" - }, - { - "name": "caffe2-linux-focal-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257525344?check_suite_focus=true" - }, - { - "name": "parallelnative-linux-focal-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257525621?check_suite_focus=true" - }, - { - "name": "parallelnative-linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257748822?check_suite_focus=true" - }, - { - "name": "parallelnative-linux-focal-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257748937?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9-slow / test (slow, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257940181?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257996123?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257996266?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (slow, 1, 1, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257996436?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (nogpu_AVX512, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257996598?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (nogpu_NO_AVX2, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257996687?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (jit_legacy, 1, 1, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257996800?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257996869?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257996947?check_suite_focus=true" - }, - { - "name": "linux-bionic-rocm5.1-py3.7 / test (default, 1, 2, 
linux.rocm.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258043565?check_suite_focus=true" - }, - { - "name": "linux-bionic-rocm5.1-py3.7 / test (default, 2, 2, linux.rocm.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258043644?check_suite_focus=true" - }, - { - "name": "win-vs2019-cuda11.6-py3 / test (default, 1, 5, windows.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258043840?check_suite_focus=true" - }, - { - "name": "win-vs2019-cuda11.6-py3 / test (default, 2, 5, windows.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258043904?check_suite_focus=true" - }, - { - "name": "win-vs2019-cuda11.6-py3 / test (default, 3, 5, windows.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258043967?check_suite_focus=true" - }, - { - "name": "win-vs2019-cuda11.6-py3 / test (default, 4, 5, windows.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258044051?check_suite_focus=true" - }, - { - "name": "win-vs2019-cuda11.6-py3 / test (default, 5, 5, windows.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258044125?check_suite_focus=true" - }, - { - "name": "win-vs2019-cuda11.6-py3 / test (force_on_cpu, 1, 1, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258044194?check_suite_focus=true" - }, - { - "name": "macos-12.3-py3.8-arm64-test / Run MPS tests", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258358668?check_suite_focus=true" - }, - { - "name": "macos-11-py3-x86-64 / test (default, 1, 2, macos-12)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258757994?check_suite_focus=true" - }, - { - "name": "macos-11-py3-x86-64 / test (default, 2, 2, macos-12)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258758076?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbCn27w=", - "hasNextPage": false - } + "oid": "5097cdcd6994ad82b3cec942b70e75dbeaee8ca4" + } + }, + { + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625556" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1h5Q=" + "oid": "5015ecb5a2b86943f457d71f5a977444dd062732" + } }, { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. 
Yang" }, - "workflowRun": { - "workflow": { - "name": "pull" - } + "oid": "1c42b7789d3966cd541b08fce359b9738fee69f6" + } + }, + { + "commit": { + "author": { + "user": { + "login": "albanD" + }, + "email": "albandes@fb.com", + "name": "Alban Desmaison" }, - "checkRuns": { - "nodes": [ - { - "name": "linux-bionic-rocm5.1-py3.7", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257522250?check_suite_focus=true" - }, - { - "name": "win-vs2019-cuda11.6-py3", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257522456?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257522650?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-clang10-onnx / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257522894?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257523070?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257523312?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3_7-clang8-xla / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257523709?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11_3-py3_7-gcc7-deploy / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257523936?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257524138?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257524427?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-clang7-asan / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257524554?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257524720?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-asan / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257524938?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257525212?check_suite_focus=true" - }, - { - "name": "linux-jammy-cuda11.6-cudnn8-py3.8-clang12 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257525332?check_suite_focus=true" - }, - { - "name": "linux-vulkan-bionic-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257525623?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257525714?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - 
"detailsUrl": "https://github.com/pytorch/pytorch/runs/7257525946?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-mobile-build / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257526187?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7-no-ops / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257526402?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257526593?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257688277?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257759879?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257760015?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257760116?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257760245?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257760346?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257760456?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257909951?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257909994?check_suite_focus=true" - }, - { - "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257912956?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257934535?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257934615?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257934714?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / test (docs_test, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/7257934784?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / test (backwards_compat, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257934866?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / test (jit_legacy, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257934975?check_suite_focus=true" - }, - { - "name": "linux-docs / build-docs (cpp)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257935092?check_suite_focus=true" - }, - { - "name": "linux-docs / build-docs (python)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257935201?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257943077?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257943146?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257943200?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257943268?check_suite_focus=true" - }, + "oid": "893ac3d334fd3e85e22423a06fe986ce453fe304" + } + }, + { + "commit": { + "author": { + "user": { + "login": "emcastillo" + }, + "email": "ecastill@preferred.jp", + "name": "Emilio Castillo" + }, + "oid": "aa5d1b6b031ee2b8bb85f793a842ac1327ae4a19" + } + }, + { + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "0707a1d00f33d7098f56de339cb30436e8c2ea44" + } + }, + { + "commit": { + "author": { + "user": { + "login": "NivekT" + }, + "email": "ktse@fb.com", + "name": "Kevin Tse" + }, + "oid": "ccb082d42af99f6374183cf914cc712bac585f0f" + } + }, + { + "commit": { + "author": { + "user": { + "login": "ryandaryl" + }, + "email": "ryandarylmills@gmail.com", + "name": "ryandaryl" + }, + "oid": "4f2909cc8747808786a1871b0a6825cc4566f48c" + } + }, + { + "commit": { + "author": { + "user": { + "login": "clee2000" + }, + "email": "csl@fb.com", + "name": "Catherine Lee" + }, + "oid": "f764010648a29223d9ed4b955073d9d2fb1b2f43" + } + }, + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "5696e8357cf38f852ef3d680381513e26f202371" + } + } + ], + "pageInfo": { + "endCursor": "MTMx", + "hasNextPage": false + } + } + } + } + } + }, + "query_sha=e29d0e1d73b9847dacfd671c80e117d82111b407f7daa8ff885d3c444eafe47f name=pytorch number=76123 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": true, + "author": { + "login": "kumpera" + }, + "title": "Introduce distributed checkpoint with ShardedTensor.", + "body": "Co-authored-by: Wen Zhang \r\nCo-authored-by: Yifu Wang \r\n\r\n", + "headRefName": "st_checkpoint", + "headRepository": { + "nameWithOwner": "kumpera/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + 
"isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "kumpera" + }, + "email": "kumpera@fb.com", + "name": "Rodrigo Kumpera" + }, + "oid": "6bf248bc20a71f248064b795f38276326fe43aae" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kumpera" + }, + "email": "kumpera@fb.com", + "name": "Rodrigo Kumpera" + }, + "oid": "10f84fb90bf02d7062e565ebf2c1da6352b64db7" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kumpera" + }, + "email": "kumpera@fb.com", + "name": "Rodrigo Kumpera" + }, + "oid": "96c5299740ec791f3cf0975c03a40a7b219b6747" + } + } + ], + "pageInfo": { + "endCursor": "Mw", + "hasNextPage": false + }, + "totalCount": 3 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "edges": [ { - "name": "linux-bionic-py3.7-clang9 / test (dynamo, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257943319?check_suite_focus=true" + "node": { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS", + "detailsUrl": "https://code.intern.facebook.com/cla/" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXgS2l4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6380755666" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXxSmtI=" }, { - "name": "linux-bionic-py3.7-clang9 / test (dynamo, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257943373?check_suite_focus=true" + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063614/jobs/3379894109" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXd2r3Q=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6380755785" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXxSm0k=" }, { - "name": "linux-focal-py3.7-clang10-onnx / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257960183?check_suite_focus=true" + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894107" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894332" + }, + { + "name": "lintrunner", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894444" + }, + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894520" + }, + { + "name": "Test collect_env 
(without_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894567" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894616" + }, + { + "name": "workflow-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894672" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXd2shU=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6380755786" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXxSm0o=" }, { - "name": "linux-focal-py3.7-clang10-onnx / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7257960282?check_suite_focus=true" + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pull" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902301" + }, + { + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902363" + }, + { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902507" + }, + { + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902560" + }, + { + "name": "win-vs2019-cpu-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902579" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902603" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902637" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902685" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902740" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902761" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902794" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902874" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / build", + "conclusion": 
"SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903006" + }, + { + "name": "linux-xenial-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903111" + }, + { + "name": "linux-xenial-py3-clang5-mobile-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903193" + }, + { + "name": "linux-xenial-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903284" + }, + { + "name": "win-vs2019-cuda11.3-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903357" + }, + { + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903446" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903512" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903546" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379944655" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379944695" + }, + { + "name": "linux-docs / build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946308" + }, + { + "name": "linux-docs / build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946337" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946359" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946391" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946423" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946453" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946496" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946529" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379950041" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379950137" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379950165" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379950192" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379950646" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379951202" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379951230" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379963877" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379963928" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379963976" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 4, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379964018" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379966372" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379996173" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379996218" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379997861" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379998374" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379998397" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 
2, linux.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379998422" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379998441" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3380042106" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXd5yuY=", + "hasNextPage": true + } + }, + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6380755806" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXxSm14=" }, { - "name": "linux-focal-py3.7-clang7-asan / test (default, 1, 5, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258020141?check_suite_focus=true" + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "lintrunner", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387419477" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387419699" + }, + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387419923" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387419992" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387420129" + }, + { + "name": "workflow-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387420208" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387420309" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXgS3SE=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6390363240" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXzlNGg=" }, { - "name": "linux-focal-py3.7-clang7-asan / test (default, 2, 5, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258020221?check_suite_focus=true" + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796862/jobs/3387419465" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXgS1-o=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6390363271" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXzlNIc=" }, { - 
"name": "linux-focal-py3.7-clang7-asan / test (default, 3, 5, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258020306?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbCcmdI=", - "hasNextPage": true - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/6882717f73deffb692219ccd1fd6db258d8ed684/checks?check_suite_id=7280625557" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbH1h5U=" - } - ], - "pageInfo": { - "hasNextPage": false - } - } - } - } - } - }, - "query_sha=23d6a47e5fd875c42231779040ec1d35d0042b502c9142cb0d33d6f65d58fead commit=6882717f73deffb692219ccd1fd6db258d8ed684 cr_cursor=Y3Vyc29yOnYyOpHPAAAAAbCcmdI= cs_cursor=Y3Vyc29yOnYyOpHPAAAAAbH1h5Q= name=pytorch owner=pytorch": { - "data": { - "repository": { - "object": { - "oid": "6882717f73deffb692219ccd1fd6db258d8ed684", - "checkSuites": { - "nodes": [ - { - "checkRuns": { - "nodes": [ - { - "name": "linux-focal-py3.7-clang7-asan / test (default, 4, 5, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258020388?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-clang7-asan / test (default, 5, 5, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258020493?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11_3-py3_7-gcc7-deploy / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7258219463?check_suite_focus=true" + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pull" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "linux-bionic-rocm5.1-py3.7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387419999" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387420164" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387420316" + }, + { + "name": "linux-xenial-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387420477" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387420675" + }, + { + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387420934" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387421278" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387421672" + }, + { + "name": "linux-xenial-py3-clang5-mobile-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387421888" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / build", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387421982" + }, + { + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387422191" + }, + { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387422303" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387422476" + }, + { + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387422715" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387422963" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387423092" + }, + { + "name": "linux-xenial-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387423234" + }, + { + "name": "win-vs2019-cpu-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387423421" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387423622" + }, + { + "name": "win-vs2019-cuda11.3-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387423739" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387545789" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387546032" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387546119" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387553028" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387553144" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387553251" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387553438" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test 
(backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387553556" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387553668" + }, + { + "name": "linux-docs / build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387554002" + }, + { + "name": "linux-docs / build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387554098" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387558927" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387559016" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387559071" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387559139" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387563803" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387563894" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387580868" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387580936" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387580993" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 4, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387581053" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387592286" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387631950" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387632035" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + 
"conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387649916" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387649974" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387650084" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387650151" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387650373" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387753429" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXgaCXo=", + "hasNextPage": true + } + }, + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6390363300" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXzlNKQ=" + } + ], + "pageInfo": { + "hasNextPage": false } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbCfo8c=", - "hasNextPage": false - } + }, + "status": null, + "pushedDate": "2022-05-05T00:34:26Z", + "oid": "96c5299740ec791f3cf0975c03a40a7b219b6747" } } ] - } - } - } - } - }, - "query_sha=0e2a29eda6405cea4c9de20fb80ae7924910e17272a7b251040182e7d8c390e0 name=pytorch number=76118 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "closed": true, - "isCrossRepository": false, - "author": { - "login": "malfet" - }, - "title": "Dummy change with lots of commits", - "body": "Draft PR with 100+ commits, to test mergebot ", - "headRefName": "malfet/pr-with-lots-of-commits", - "headRepository": { - "nameWithOwner": "pytorch/pytorch" }, - "baseRefName": "master", - "baseRepository": { - "nameWithOwner": "pytorch/pytorch", - "isPrivate": false, - "defaultBranchRef": { - "name": "master" + "changedFiles": 11, + "files": { + "nodes": [ + { + "path": "test/distributed/_shard/checkpoint/test_checkpoint.py" + }, + { + "path": "test/distributed/_shard/checkpoint/test_file_system_checkpoint.py" + }, + { + "path": "test/distributed/_shard/sharded_tensor/test_sharded_tensor.py" + }, + { + "path": "torch/distributed/_shard/checkpoint/__init__.py" + }, + { + "path": "torch/distributed/_shard/checkpoint/filesystem.py" + }, + { + "path": "torch/distributed/_shard/checkpoint/metadata.py" + }, + { + "path": "torch/distributed/_shard/checkpoint/resharding.py" + }, + { + "path": "torch/distributed/_shard/checkpoint/state_dict_loader.py" + }, + { + "path": "torch/distributed/_shard/checkpoint/state_dict_saver.py" + }, + { + "path": "torch/distributed/_shard/checkpoint/storage.py" + }, + { + "path": "torch/testing/_internal/distributed/_shard/sharded_tensor/_test_st_common.py" + } + ], + "pageInfo": { + "endCursor": "MTE", + "hasNextPage": false } }, - "mergeCommit": null, - "commits_with_authors": { + "reviews": { "nodes": [ { - "commit": { - "author": { - "user": { - "login": "malfet" - }, - 
"email": "nshulga@fb.com", - "name": "Nikita Shulga" - }, - "oid": "3067f2240afc7a29dc348000aa19eccbd9772303" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "zzzwen" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "zzzwen" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "wanchaol" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "zzzwen" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "zzzwen" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "simpkins" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "andrewor14" - }, - "email": "andrewor@fb.com", - "name": "Andrew Or" - }, - "oid": "2f655b71f70c496c4e645f6cdb27d7bb7e825701" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "0c6dcaa7f58a19c42a530f4ee14bb6f0f03ca9fb" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "dzdang" - }, - "email": "dzdang@umich.edu", - "name": "dzdang" - }, - "oid": "cad11c563d41ebcffb1683fe1f1288b8157413b3" - } + "author": { + "login": "zzzwen" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": null, - "email": "jwtan@fb.com", - "name": "Jiewen Tan" - }, - "oid": "4dfd0875a68d87fccb5ad0d81692db480043b86e" - } + "author": { + "login": "zzzwen" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "2d37e74690582a4a26890e4c8b98f1f80e589c82" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": null, - "email": "jwtan@fb.com", - "name": "Jiewen Tan" - }, - "oid": "d4aee60947e1a3ef23c7c42990621e0746fdd0a8" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "peterbell10" - }, - "email": 
"peterbell10@live.co.uk", - "name": "Peter Bell" - }, - "oid": "aac6204bf710beb5e50a383d426ae6222396335a" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "dzdang" - }, - "email": "dzdang@umich.edu", - "name": "dzdang" - }, - "oid": "4b0362cab884584c24f5834b3874f5f357f56b5d" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "7536df613cbc645a9e68e6a3b0a8450753260fd1" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "20a50cb966d28d7bf82924adf781cf72a01ef90e" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "486387e8644afb46edff5aa5925b55c8119f67f0" - } + "author": { + "login": "simpkins" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "dzdang" - }, - "email": "dzdang@umich.edu", - "name": "dzdang" - }, - "oid": "acb9d78b9b732d3667b881727e6ed9f92a8c549f" - } + "author": { + "login": "simpkins" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "683bb7959a5b973f8470c081ad02e8fc508e784a" - } + "author": { + "login": "pritamdamania87" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "qihqi" - }, - "email": "qihan@fb.com", - "name": "Han Qi" - }, - "oid": "a870cb40af65adf0b77d55f6b554d7093d284d7a" - } + "author": { + "login": "pritamdamania87" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "Krovatkin" - }, - "email": "korovaikon@gmail.com", - "name": "Nikolay Korovaiko" - }, - "oid": "70793b9f328ddf52cc86336104c3a064c8582ef4" - } + "author": { + "login": "pritamdamania87" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "suo" - }, - "email": "suo@fb.com", - "name": "Michael Suo" - }, - "oid": "f70b31f62b1c5159eef2725484b175983517c88c" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "dagitses" - }, - "email": "mikeyd@fb.com", - "name": "Michael Andreas Dagitses" - }, - "oid": "04d3ec1db60defe1c6904bf77e9f8dfa87dc0b63" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "46b754a55b63e3168ad5854ad412c124934b675d" - } + "author": { + "login": "wilson100hong" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "wilson100hong" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "wilson100hong" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "robieta" - }, - "email": "taylorrobie@fb.com", - "name": "Taylor Robie" - }, - "oid": "13df69e13ee571fdd716139419a00aec47ade7d6" - } + "author": { + "login": "xunnanxu" + }, + "state": "DISMISSED" }, { - "commit": { - "author": { - "user": { - "login": "malfet" - }, - "email": "nshulga@fb.com", - "name": "Nikita Shulga" - }, - "oid": "70642e911ec80a47cdbf4a50aac475c11aa129b6" - } + "author": { + "login": "xunnanxu" + }, + "state": "COMMENTED" 
}, { - "commit": { - "author": { - "user": { - "login": "pytorchmergebot" - }, - "email": "pytorchmergebot@users.noreply.github.com", - "name": "PyTorch MergeBot" - }, - "oid": "59bb7c39384bf3e0b284a037adef8b3caa53c1c4" - } + "author": { + "login": "xunnanxu" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "malfet" - }, - "email": "nshulga@fb.com", - "name": "Nikita Shulga" - }, - "oid": "007cfb97b55d70ff63e1ed71d1a674638f847376" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "pytorchmergebot" - }, - "email": "pytorchmergebot@users.noreply.github.com", - "name": "PyTorch MergeBot" - }, - "oid": "0a7b858a5af1393fa3cf2853f92eca0e1d408dde" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "qihqi" - }, - "email": "qihan@fb.com", - "name": "Han Qi" - }, - "oid": "7917d789f0a523715041ade5177d271082628236" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "kit1980" - }, - "email": "sdym@fb.com", - "name": "Sergii Dymchenko (Meta Employee)" - }, - "oid": "91eb6017f0fb8a1b29e8cb48fac93bc9709f73b3" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "dagitses" - }, - "email": "mikeyd@fb.com", - "name": "Michael Andreas Dagitses" - }, - "oid": "bd04dca5fabb0c2a51ac87063a515f256ef274fa" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "dagitses" - }, - "email": "mikeyd@fb.com", - "name": "Michael Andreas Dagitses" - }, - "oid": "1f805a5defda7dabc49d0059edb9ccb06bc29352" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "mruberry" - }, - "email": "mruberry@fb.com", - "name": "Mike Ruberry" - }, - "oid": "4982c0a8db8f23d15ec4bfcbca4ce939afc04954" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "pearu" - }, - "email": "pearu.peterson@gmail.com", - "name": "Pearu Peterson" - }, - "oid": "28502265cb5925cb7db8dcb2dd2334963092714a" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "e03fcaedb1342e6d65c7f7f20243000938ba60b2" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "pritamdamania" - }, - "email": "pritam.damania@fb.com", - "name": "pritam" - }, - "oid": "efb28f5a1a5d18aa96bd668ab2ab5c651be359f3" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "MagiaSN" - }, - "email": "magialiao@tencent.com", - "name": "magialiao" - }, - "oid": "52cc1b9994f861ebdd3908759ed1ab11cba1f8de" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "pytorchmergebot" - }, - "email": "pytorchmergebot@users.noreply.github.com", - "name": "PyTorch MergeBot" - }, - "oid": "3cd99f23d1acd6a5bedf6f3b02be79d64350a5b6" - } + "author": { + "login": "xunnanxu" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "awgu" - }, - "email": "andgu@fb.com", - "name": "Andrew Gu" - }, - "oid": 
"b00502c634a5146f4d996bd90e84d317f049e7b0" - } + "author": { + "login": "xunnanxu" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "davidberard98" - }, - "email": "dberard@fb.com", - "name": "David Berard" - }, - "oid": "976eb7cee799dddfbe6a4122b249aaee1b6c8854" - } + "author": { + "login": "xunnanxu" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "ngimel" - }, - "email": "ngimel@fb.com", - "name": "Natalia Gimelshein" - }, - "oid": "9608ab28744d5cae32f371490557b248c9549c66" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "malfet" - }, - "email": "nshulga@fb.com", - "name": "Nikita Shulga" - }, - "oid": "4e119f0c39eb5ff0777f0e71561e6b633d85fb34" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "rohan-varma" - }, - "email": "rvarm1@fb.com", - "name": "Rohan Varma" - }, - "oid": "447580dc565f3660eddb2c996c6ed25b88338684" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "malfet" - }, - "email": "nshulga@fb.com", - "name": "Nikita Shulga" - }, - "oid": "2bc8f43e9233008ea23053fab87b83ab36fca5e3" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "dzdang" - }, - "email": "dzdang@umich.edu", - "name": "dzdang" - }, - "oid": "c13a8e891c3e3e714f60649ca1e3b082e090e9fe" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "dzdang" - }, - "email": "dzdang@umich.edu", - "name": "dzdang" - }, - "oid": "fddc861b7ee473f57d3c2161e4618a2663a237e8" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "jiyuanzFB" - }, - "email": "jiyuanz@fb.com", - "name": "Jiyuan Zhang" - }, - "oid": "e2336dbc539d6c021720cbe43c92c9e4c8463299" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "bdhirsh" - }, - "email": "hirsheybar@fb.com", - "name": "Brian Hirsh" - }, - "oid": "26e2759d1ad59aac12168b74d1ca55e42ba9455c" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "bdhirsh" - }, - "email": "hirsheybar@fb.com", - "name": "Brian Hirsh" - }, - "oid": "ad7aa914ee3b3d1252e31514f010ba96c40aae87" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "bdhirsh" - }, - "email": "hirsheybar@fb.com", - "name": "Brian Hirsh" - }, - "oid": "f113c5d78065aafbe7b1c0e611945bfe9f67b3c0" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "bdhirsh" - }, - "email": "hirsheybar@fb.com", - "name": "Brian Hirsh" - }, - "oid": "a366fd01136292544b7862968ae92feba4b6d8fe" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "seemethere" - }, - "email": "eliuriegas@fb.com", - "name": "Eli Uriegas" - }, - "oid": "afeba0773749da5883c378a2e6ac066e1ce62ca0" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "bdhirsh" - }, - "email": "hirsheybar@fb.com", - "name": 
"Brian Hirsh" - }, - "oid": "d306c99addc543908f64666baeecacbd0749f4a7" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "awgu" - }, - "email": "andgu@fb.com", - "name": "Andrew Gu" - }, - "oid": "c2456ea658f41f64ea054a422edf22a9c977399f" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "awgu" - }, - "email": "andgu@fb.com", - "name": "Andrew Gu" - }, - "oid": "a8b0a1b681c9fe41e0d553c962a5c93e81d92503" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "anjali411" - }, - "email": "chourdiaanjali123@gmail.com", - "name": "anjali411" - }, - "oid": "af761d9a5d058c9188f16589bae4f307d35185be" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "clee2000" - }, - "email": "csl@fb.com", - "name": "Catherine Lee" - }, - "oid": "beceb417baef35b15c2716e23178fb49f7fd6f9d" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "1516554e22136db89d0aeba43a1a1a987e995d68" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "qihqi" - }, - "email": "qihan@fb.com", - "name": "Han Qi" - }, - "oid": "68eb1fa8374eff6cbdcf0be5e37ed6775d22e722" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "janeyx99" - }, - "email": "janeyx@fb.com", - "name": "Jane Xu" - }, - "oid": "3c7bcb99b5c0c879c2610f427880b03881f82f38" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "janeyx99" - }, - "email": "janeyx@fb.com", - "name": "Jane Xu" - }, - "oid": "38c1a2028090353e40a019c673c9ab16b39e4825" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "albanD" - }, - "email": "albandes@fb.com", - "name": "Alban Desmaison" - }, - "oid": "8091cbea2c95ed2c4c406b3c61547a27c6319bae" - } + "author": { + "login": "pritamdamania87" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "ezyang" - }, - "email": "ezyang@fb.com", - "name": "Edward Z. Yang" - }, - "oid": "d81f59121969a47c8b2213a88e02cf9be0219be9" - } + "author": { + "login": "pritamdamania87" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "ezyang" - }, - "email": "ezyang@fb.com", - "name": "Edward Z. 
Yang" - }, - "oid": "20d798b319cd107a767fe220f7a3027c18a1c844" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "dzdang" - }, - "email": "dzdang@umich.edu", - "name": "dzdang" - }, - "oid": "eb35381a770b58c1cd41e935910cb4df2f3d8f14" - } + "author": { + "login": "pritamdamania87" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "pytorchmergebot" - }, - "email": "pytorchmergebot@users.noreply.github.com", - "name": "PyTorch MergeBot" - }, - "oid": "e6498a657b9aa47546dcd92d1b4ffb2e1a50ebdb" - } + "author": { + "login": "pritamdamania87" + }, + "state": "APPROVED" }, { - "commit": { - "author": { - "user": { - "login": "dzdang" - }, - "email": "dzdang@umich.edu", - "name": "dzdang" - }, - "oid": "7f821382db5ad08efe5b09a145c606852b8a9272" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "albanD" - }, - "email": "albandes@fb.com", - "name": "Alban Desmaison" - }, - "oid": "995c0e11a97d854ff969962bd81d7341e46ecb07" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "davidberard98" - }, - "email": "dberard@fb.com", - "name": "David Berard" - }, - "oid": "28d6258e62c9fc361a18689877c962c69889dc23" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "HarborYuan" - }, - "email": "yuanhaobo@whu.edu.cn", - "name": "Haobo Yuan" - }, - "oid": "2350fad8391367ebf81c7236a2c883644b4ff622" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "zou3519" - }, - "email": "zou3519@gmail.com", - "name": "Richard Zou" - }, - "oid": "3f789c9ccecdd7e2e52269453646e992a68c6b92" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "jeffdaily" - }, - "email": "jeff.daily@amd.com", - "name": "Jeff Daily" - }, - "oid": "20f79f610c1a3314da96d49515bbfbee9442e4f8" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "peterbell10" - }, - "email": "peterbell10@live.co.uk", - "name": "Peter Bell" - }, - "oid": "5823958f047f3b71a5dc8c52a20eb8ae3291bd3e" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "peterbell10" - }, - "email": "peterbell10@live.co.uk", - "name": "Peter Bell" - }, - "oid": "a0b15c49ecf3844daf2c0dcaef44f0214259db20" - } - }, + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNC0yNVQxMTozNTowMS0wNzowMLkyMDIyLTA0LTI1VDExOjM1OjAwLTA3OjAwzjjC2d0=", + "hasPreviousPage": true + } + }, + "comments": { + "nodes": [ { - "commit": { - "author": { - "user": { - "login": "ezyang" - }, - "email": "ezyang@fb.com", - "name": "Edward Z. Yang" - }, - "oid": "4afc38c25ca2ca126ba4987a419a58a5c572223b" - } + "bodyText": "Merge failed due to Can't fetch all PR reviews\nRaised by https://github.com/pytorch/pytorch/actions/runs/2275691136", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1118495479 }, { - "commit": { - "author": { - "user": { - "login": "ezyang" - }, - "email": "ezyang@fb.com", - "name": "Edward Z. 
Yang" - }, - "oid": "b606f58d4a36683fbe0a7d02adfdde7d5cc694c2" - } + "bodyText": "Merge failed due to Can't fetch all PR reviews\nRaised by https://github.com/pytorch/pytorch/actions/runs/2275691136", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1118511287 }, { - "commit": { - "author": { - "user": { - "login": "albanD" - }, - "email": "albandes@fb.com", - "name": "Alban Desmaison" - }, - "oid": "2d61b4d630f6482a6c3cc7437091fad6d27c347e" - } + "bodyText": "Merge failed due to Can't fetch all PR reviews\nRaised by https://github.com/pytorch/pytorch/actions/runs/2275691136", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1118662274 }, { - "commit": { - "author": { - "user": { - "login": "george-qi" - }, - "email": "georgeqi94@gmail.com", - "name": "George Qi" - }, - "oid": "bc5384c47036a6cda94129f3e2f9e43c43393698" - } + "bodyText": "Merge failed due to Can't fetch all PR reviews Raised by https://github.com/pytorch/pytorch/actions/runs/2275691136\n\n@osalpekar @malfet This is failing because there are 109 review comments on this PR but we only fetch the first 100. This could be solved with a similar concept as how we fetch more comments/check_runs.", + "author": { + "login": "janeyx99" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1118689010 }, { - "commit": { - "author": { - "user": { - "login": "malfet" - }, - "email": "nshulga@fb.com", - "name": "Nikita Shulga" - }, - "oid": "60fc3277634365b64465712b13db2acb76d6c890" - } - }, + "bodyText": "On a side note, has the test_fsdp_clip_grad_norm_norm_type_2_0_nested_fsdp_False_cpu_offload_CPUOffload failure on the distributed test first shard of this PR been addressed?", + "author": { + "login": "janeyx99" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1118693497 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQqri9w==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ { - "commit": { - "author": { - "user": { - "login": "pytorchmergebot" - }, - "email": "pytorchmergebot@users.noreply.github.com", - "name": "PyTorch MergeBot" - }, - "oid": "1b8762e95bc38d1847fe99ed3230546c8b800bfd" + "node": { + "name": "oncall: distributed" } }, { - "commit": { - "author": { - "user": { - "login": "jerryzh168" - }, - "email": "jerryzh168@gmail.com", - "name": "Jerry Zhang" - }, - "oid": "6acf60f95f59ecbc6e8ce830dea0abba7d3ec763" + "node": { + "name": "cla signed" } + } + ] + } + } + } + } + }, + "query_sha=6a8ce6412a780d5804bfe180ed1dc807269e1eae2ae50de2346d56d1283884bc cursor=Y3Vyc29yOnYyOpO5MjAyMi0wNC0yNVQxMTozNTowMS0wNzowMLkyMDIyLTA0LTI1VDExOjM1OjAwLTA3OjAwzjjC2d0= name=pytorch number=76123 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "reviews": { + "nodes": [ + { + "author": { + "login": "pritamdamania87" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "ysiraichi" - }, - "email": "yukio.siraichi@gmail.com", - "name": "Yukio Siraichi" - }, - "oid": "8fb0276561fdd530c5a06ea195e930e0584f8705" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "albanD" - }, - "email": "albandes@fb.com", - "name": "Alban Desmaison" - }, - "oid": "1da7aed95a8700406671425eac1e4bbc2c7a24b5" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "thiagocrepaldi" - }, - 
"email": "thiago.crepaldi@microsoft.com", - "name": "Thiago Crepaldi" - }, - "oid": "83208e7dee4503c1bee1df9f6632794694dffa01" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "kshitij12345" - }, - "email": "kshitijkalambarkar@gmail.com", - "name": "kshitij12345" - }, - "oid": "1a46cf08dcd3d3564604c17b2c02d7e4eb45a7ff" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "malfet" - }, - "email": "nshulga@fb.com", - "name": "Nikita Shulga" - }, - "oid": "b7f9b6689445f826c83694652fea5f7cfc7070d7" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "fatcat-z" - }, - "email": "jiz@microsoft.com", - "name": "Jay Zhang" - }, - "oid": "f273961c1696b156e35f8c76f7ad37934031050d" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "commit": { - "author": { - "user": { - "login": "pavithranrao" - }, - "email": "pavithran@fb.com", - "name": "Pavithran Ramachandran" - }, - "oid": "eb410a51fcbc716873fd80a970eb932d4aaaea61" - } + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, + { + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNC0yMlQyMDozNzo1NC0wNzowMLkyMDIyLTA0LTIyVDE2OjAyOjA5LTA3OjAwzjip7G8=", + "hasPreviousPage": false + } + } + } + } + } + }, + "query_sha=e29d0e1d73b9847dacfd671c80e117d82111b407f7daa8ff885d3c444eafe47f name=pytorch number=71759 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": true, + "author": { + "login": "coolteemf" + }, + "title": "Optimize grid sample 3d", + "body": "Fixes #71415\r\nI have implemented the changes that replicate what @to-mi did in this [PR](https://github.com/pytorch/pytorch/pull/65986#issue-1012959443) for the 3D case :\r\n\r\n> Fixes #64977\r\n> \r\n> Avoids creating a tensor for and calculating `input` gradient if it's not needed in the backward pass of `grid_sample` (2d case, native CPU & CUDA kernels). Especially the tensor creation seemed time consuming (see #64977).\r\n> \r\n> Brief description of the changes:\r\n> \r\n> * I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).\r\n> \r\n> * Changed the `native_functions.yaml` and `derivatives.yaml` so that the gradient input mask is passed to the functions.\r\n> \r\n> * Changed the CPU kernels:\r\n> (1) added `bool input_requires_grad` template parameter to the `backward` function,\r\n> (2) added if branches based on it to remove `input` gradient computations if it's not requested,\r\n> (3) feed in `TensorAccessor* gInp_slice_ptr` instead of `TensorAccessor& gInp_slice` so that I can pass a `nullptr` in case gradient for `input` is not requested. (A bit inelegant perhaps, but allows to keep one signature for `backward` function and not require breaking it to smaller pieces. 
Perhaps there's a more elegant way to achieve this?)\r\n> \r\n> * Changed CUDA kernel:\r\n> (1) added ~`bool input_requires_grad` template parameter~ `const bool input_requires_grad` argument to the `backward` function,\r\n> (2) added if branches based on it to remove `input` gradient computations if it's not requested,\r\n> (3) feed in `TensorInfo()` instead of `getTensorInfo(grad_input)` in case gradient for `input` is not requested.\r\n> \r\n> * Modified tests in `test/test_nn.py` so that they run also cases with no `input` gradient needed.\r\n> \r\n> * Have not touched the CPU fallback kernel.\r\n\r\nNote: the changes number (3) are N/A in this case.\r\n\r\n", + "headRefName": "optimize_grid_sample_3d", + "headRepository": { + "nameWithOwner": "coolteemf/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ { "commit": { "author": { - "user": { - "login": "ngimel" - }, - "email": "ngimel@fb.com", - "name": "Natalia Gimelshein" + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "7dbb12cdc02332fa64264ed0df576511a5070d7e" + "oid": "e0b0d1e695aeddceaf265da602c4704592053e9e" } }, - { - "commit": { - "author": { - "user": { - "login": "pytorchmergebot" - }, - "email": "pytorchmergebot@users.noreply.github.com", - "name": "PyTorch MergeBot" + { + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "43675665fa6b5154de8b25125dd03d7be35c884f" + "oid": "563ec73747ad53b63b36736c47c4342f962c2a09" } }, { "commit": { "author": { - "user": { - "login": "albanD" - }, - "email": "albandes@fb.com", - "name": "Alban Desmaison" + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "6c4d23c402c413667463770d9a2fa801f493d3c5" + "oid": "51abe41a132d9dd5b1c0551bdca902aacc028ff8" } }, { "commit": { "author": { - "user": { - "login": "pytorchmergebot" - }, - "email": "pytorchmergebot@users.noreply.github.com", - "name": "PyTorch MergeBot" + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "cf3778a35129a40dee14366515201b7ed2c0f346" + "oid": "be9898205992034a00e8ace8a55c2ecdcee2c2f8" } }, { "commit": { "author": { - "user": { - "login": "dzdang" - }, - "email": "dzdang@umich.edu", - "name": "dzdang" + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "9d00a051373cb81f79cb6375942cf3ec9fff2fe6" + "oid": "2929c60b64384c2deae0f7dea8bab94ad4bc9ec8" } }, { "commit": { "author": { - "user": { - "login": "pytorchmergebot" - }, - "email": "pytorchmergebot@users.noreply.github.com", - "name": "PyTorch MergeBot" + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "1eae67cf404aa8dffb80b8e85180f943878d52a6" + "oid": "9241b737e7e2b257905cc74ad9c50b737d7f9d0a" } }, { "commit": { "author": { - "user": { - "login": "janeyx99" - }, - "email": "janeyx@fb.com", - "name": "Jane Xu" + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "ce0e69dcda0fe41a6e964d6ac70ce8016979c71a" + "oid": "64d6b795d0636928a8aa2fd3da01302fb5f5f7af" } }, { "commit": { "author": { - "user": { - "login": "swolchok" - }, - "email": "swolchok@fb.com", - "name": "Scott Wolchok" + "user": null, 
+ "email": "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", + "name": "coolteemf" }, - "oid": "6faba554f6e49777f24911928edb3061b6ed0e3d" + "oid": "4503577e53760a0006f1e80ca6bfe04d2be90470" } }, { "commit": { "author": { - "user": { - "login": "IvanYashchuk" - }, - "email": "ivan.yashchuk@aalto.fi", - "name": "Ivan Yashchuk" + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "d1d0e03f57a359f8f95331f9a34b8bed3e7cc845" + "oid": "b16f4b11ffbbbf2ca2098f9702af4ef6b6fc5e1f" } }, { "commit": { "author": { - "user": { - "login": "Chillee" - }, - "email": "chilli@fb.com", - "name": "Horace He" + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "bb46bd9233a9fc631802a902cb48a4c13c2722ca" + "oid": "7ffc23368a604afdc92d2818747f730ce31a2bb5" } }, { "commit": { "author": { - "user": { - "login": "mehtanirav" - }, - "email": "niravmehta@fb.com", - "name": "Nirav Mehta" + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "3b1007fe4be12e483f2620fbac67cae42e703efc" + "oid": "b85292604b9ad6c31706b76b5a5498c4f6d94309" } }, { "commit": { "author": { - "user": { - "login": "mehtanirav" - }, - "email": "niravmehta@fb.com", - "name": "Nirav Mehta" + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "b4b65228dd0c109f5fdf17c7d9e56f60a98e398b" + "oid": "9d81d7bae8ad91aaa24b3ceab83e3138894dbc69" } }, { "commit": { "author": { - "user": { - "login": "albanD" - }, - "email": "albandes@fb.com", - "name": "Alban Desmaison" + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "d629e300705196d3ae0bac5ed983b197101fa2ee" + "oid": "e79f6a2202512b294c55bf4bfb2e0524fafd4c48" } }, { "commit": { "author": { - "user": { - "login": "bigfootjon" - }, - "email": "jonjanzen@fb.com", - "name": "Jon Janzen" + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "52754b9e515f378f8476ad44d75b0a692bad8cde" + "oid": "f683e8aec7aea76097a264eec01511e704c31154" } }, { "commit": { "author": { "user": { - "login": "samdow" + "login": "coolteemf" }, - "email": "samdow@fb.com", - "name": "samdow" + "email": "67541941+coolteemf@users.noreply.github.com", + "name": "Fran\u00e7ois Lecomte" }, - "oid": "128c3ad747093f4970329a82c7c4720420faeff2" + "oid": "b932e9e286c22aaf352375186df851ef060b295a" } }, { "commit": { "author": { - "user": { - "login": "arindamroy-eng" - }, - "email": "61168652+arindamroy-eng@users.noreply.github.com", - "name": "arindamroy-eng" + "user": null, + "email": "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", + "name": "coolteemf" }, - "oid": "2a0bda7d32a5bcc9827f7254a7b77cceb16ba973" + "oid": "346e0c547953d98eb84d23c1391a95badb9c4a22" } } ], "pageInfo": { - "endCursor": "MTAw", - "hasNextPage": true + "endCursor": "MTY", + "hasNextPage": false }, - "totalCount": 131 + "totalCount": 16 }, "commits": { "nodes": [ @@ -4993,109 +9354,53 @@ } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAWuNRg4=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGYqY=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693698" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRAI=" - }, - { - "node": { - "app": { - "name": "Netlify", - "databaseId": 13473 - }, - "workflowRun": null, - "checkRuns": { - "nodes": [], - "pageInfo": { - 
"endCursor": null, - "hasNextPage": false - } - }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693712" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRBA=" - }, - { - "node": { - "app": { - "name": "Azure Pipelines", - "databaseId": 9426 - }, - "workflowRun": null, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } - }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693725" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRB0=" - }, - { - "node": { - "app": { - "name": "Dependabot", - "databaseId": 29110 - }, - "workflowRun": null, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } - }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693741" + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801320" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRC0=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_T6g=" }, { "node": { "app": { - "name": "Codecov", - "databaseId": 254 + "name": "GitHub Actions", + "databaseId": 15368 }, - "workflowRun": null, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-onnx" } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693761" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsREE=" - }, - { - "node": { - "app": { - "name": "PyTorch Bot", - "databaseId": 40112 - }, - "workflowRun": null, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754066/jobs/2663109808" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754066/jobs/2663214802" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754066/jobs/2663214856" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwIob0=", "hasNextPage": false } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693774" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801849" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRE4=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Ubk=" }, { "node": { @@ -5105,26 +9410,26 @@ }, "workflowRun": { "workflow": { - "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + "name": "linux-xenial-py3-clang5-mobile-build" } }, "checkRuns": { "nodes": [ { - "name": "run-torchbench", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099388390?check_suite_focus=true" + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754064/jobs/2663109676" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAWuNR-Y=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ1E=", "hasNextPage": false } }, - "conclusion": 
"SKIPPED", - "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193694412" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801852" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRsw=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Ubw=" }, { "node": { @@ -5134,56 +9439,41 @@ }, "workflowRun": { "workflow": { - "name": "Lint" + "name": "linux-bionic-rocm4.5-py3.7" } }, "checkRuns": { "nodes": [ { - "name": "Test collect_env (with_torch)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099431378?check_suite_focus=true" - }, - { - "name": "Test collect_env (without_torch)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099431511?check_suite_focus=true" - }, - { - "name": "toc", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099431693?check_suite_focus=true" - }, - { - "name": "Test tools", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099431829?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754065/jobs/2663109684" }, { - "name": "quick-checks", + "name": "test (default, 2, 2, linux.rocm.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099432018?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754065/jobs/2663401083" }, { - "name": "lintrunner", + "name": "test (default, 1, 2, linux.rocm.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099432195?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754065/jobs/2663401143" }, { - "name": "workflow-checks", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099432331?check_suite_focus=true" + "name": "test (distributed, 1, 1, linux.rocm.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754065/jobs/2663401186" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAWuN84s=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwMsZY=", "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193694417" + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801853" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRtE=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Ub0=" }, { "node": { @@ -5193,654 +9483,518 @@ }, "workflowRun": { "workflow": { - "name": "pull" + "name": "win-vs2019-cuda11.3-py3" } }, "checkRuns": { "nodes": [ { - "name": "linux-xenial-py3.7-gcc7-no-ops / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099430906?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099431117?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-onnx / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099431312?check_suite_focus=true" - }, - { - "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/6099431677?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099431819?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-asan / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099432057?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099432191?check_suite_focus=true" - }, - { - "name": "linux-bionic-rocm5.0-py3.7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099432334?check_suite_focus=true" - }, - { - "name": "linux-vulkan-bionic-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099432446?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099432577?check_suite_focus=true" - }, - { - "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099432685?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099432822?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099432932?check_suite_focus=true" - }, - { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099433128?check_suite_focus=true" - }, - { - "name": "win-vs2019-cuda11.3-py3 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099433280?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-mobile-build / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099433402?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099433542?check_suite_focus=true" - }, - { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099433675?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099433758?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099433859?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099554424?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099554523?check_suite_focus=true" - }, - { - "name": 
"linux-docs / build-docs (cpp)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099557184?check_suite_focus=true" - }, - { - "name": "linux-docs / build-docs (python)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099557310?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099557449?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099557512?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099557588?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099557655?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099557717?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099557795?check_suite_focus=true" - }, - { - "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099565740?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099565906?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099565972?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 1, linux.2xlarge)", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099566036?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754068/jobs/2663109680" }, { - "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", + "name": "test (default, 1, 2, windows.8xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099580613?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754068/jobs/2663995756" }, { - "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", + "name": "test (force_on_cpu, 1, 1, windows.4xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099580676?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754068/jobs/2663995819" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099608194?check_suite_focus=true" - }, + "name": "test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/1886754068/jobs/2663995900" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwZbzg=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801855" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Ub8=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge)", + "name": "mypy", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099608322?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663109683" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge)", + "name": "shellcheck", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099608371?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663109827" }, { - "name": "pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge)", + "name": "py2-setup-validate-errormsg", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099619007?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663109962" }, { - "name": "linux-bionic-rocm5.0-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "name": "clang-format", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099645951?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663110044" }, { - "name": "linux-bionic-rocm5.0-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "name": "cmakelint", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099646089?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663110132" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "name": "toc", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099685555?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663110233" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "name": "quick-checks", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099685664?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663110320" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "name": "clang-tidy", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099685757?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663110461" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "name": "flake8-py3", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099689530?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663110575" + } + ], + "pageInfo": { + "endCursor": 
"Y3Vyc29yOnYyOpHPAAAAATwGbAQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801856" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_UcA=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-asan" + } + }, + "checkRuns": { + "nodes": [ { - "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099757872?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754070/jobs/2663109804" }, { - "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", + "name": "test (default, 3, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099757955?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754070/jobs/2663233675" }, { - "name": "win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "name": "test (default, 1, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099898234?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754070/jobs/2663233731" }, { - "name": "win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "name": "test (default, 2, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099898323?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754070/jobs/2663233805" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAWuVD9M=", - "hasNextPage": true - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193694439" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRuc=" - } - ], - "pageInfo": { - "hasNextPage": false - } - }, - "pushedDate": "2022-04-20T17:10:41Z", - "oid": "5696e8357cf38f852ef3d680381513e26f202371" - } - } - ] - }, - "changedFiles": 348, - "files": { - "nodes": [ - { - "path": ".circleci/cimodel/data/pytorch_build_data.py" - }, - { - "path": ".circleci/cimodel/data/pytorch_build_definitions.py" - }, - { - "path": ".circleci/scripts/cpp_doc_push_script.sh" - }, - { - "path": ".circleci/scripts/python_doc_push_script.sh" - }, - { - "path": ".github/actions/checkout-pytorch/action.yml" - }, - { - "path": ".github/merge_rules.json" - }, - { - "path": ".github/scripts/gitutils.py" - }, - { - "path": ".github/scripts/gql_mocks.json" - }, - { - "path": ".github/scripts/trymerge.py" - }, - { - "path": ".github/workflows/_bazel-build-test.yml" - }, - { - "path": ".github/workflows/_linux-build.yml" - }, - { - "path": ".github/workflows/_linux-test.yml" - }, - { - "path": ".github/workflows/_mac-test.yml" - }, - { - "path": ".github/workflows/_rocm-test.yml" - }, - { - "path": ".github/workflows/_win-test.yml" - }, - { - "path": ".github/workflows/buck_build_test.yml" - }, - { - "path": ".github/workflows/lint.yml" - }, - { - "path": ".github/workflows/periodic.yml" - }, - { - "path": ".github/workflows/pull.yml" - }, - { - "path": ".github/workflows/trunk.yml" - }, - { - "path": ".jenkins/pytorch/macos-test.sh" - }, - { - "path": ".jenkins/pytorch/test.sh" - }, - { - "path": 
".jenkins/pytorch/win-test.sh" - }, - { - "path": ".lintrunner.toml" - }, - { - "path": "BUILD.bazel" - }, - { - "path": "CODEOWNERS" - }, - { - "path": "README.md" - }, - { - "path": "aten/src/ATen/BatchingRegistrations.cpp" - }, - { - "path": "aten/src/ATen/Dispatch.h" - }, - { - "path": "aten/src/ATen/ExpandUtils.h" - }, - { - "path": "aten/src/ATen/FunctionalInverses.cpp" - }, - { - "path": "aten/src/ATen/FunctionalStorageImpl.cpp" - }, - { - "path": "aten/src/ATen/FunctionalStorageImpl.h" - }, - { - "path": "aten/src/ATen/FunctionalTensorWrapper.cpp" - }, - { - "path": "aten/src/ATen/FunctionalTensorWrapper.h" - }, - { - "path": "aten/src/ATen/FunctionalizeFallbackKernel.cpp" - }, - { - "path": "aten/src/ATen/NestedTensorImpl.cpp" - }, - { - "path": "aten/src/ATen/OpMathType.h" - }, - { - "path": "aten/src/ATen/SparseCsrTensorUtils.h" - }, - { - "path": "aten/src/ATen/ThreadLocalState.cpp" - }, - { - "path": "aten/src/ATen/ThreadLocalState.h" - }, - { - "path": "aten/src/ATen/autocast_mode.cpp" - }, - { - "path": "aten/src/ATen/autocast_mode.h" - }, - { - "path": "aten/src/ATen/core/SymIntArrayRef.cpp" - }, - { - "path": "aten/src/ATen/core/SymIntArrayRef.h" - }, - { - "path": "aten/src/ATen/core/TensorBase.h" - }, - { - "path": "aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h" - }, - { - "path": "aten/src/ATen/core/dispatch/Dispatcher.h" - }, - { - "path": "aten/src/ATen/core/interned_strings.h" - }, - { - "path": "aten/src/ATen/core/ivalue.cpp" - }, - { - "path": "aten/src/ATen/core/ivalue.h" - }, - { - "path": "aten/src/ATen/core/ivalue_inl.h" - }, - { - "path": "aten/src/ATen/core/jit_type.h" - }, - { - "path": "aten/src/ATen/core/jit_type_base.h" - }, - { - "path": "aten/src/ATen/core/type.cpp" - }, - { - "path": "aten/src/ATen/cuda/CUDASparse.h" - }, - { - "path": "aten/src/ATen/cuda/llvm_complex.cpp" - }, - { - "path": "aten/src/ATen/cuda/llvm_jit_strings.h" - }, - { - "path": "aten/src/ATen/native/Blas.cpp" - }, - { - "path": "aten/src/ATen/native/Itertools.cpp" - }, - { - "path": "aten/src/ATen/native/LinearAlgebra.cpp" - }, - { - "path": "aten/src/ATen/native/SoftMax.cpp" - }, - { - "path": "aten/src/ATen/native/TensorConversions.cpp" - }, - { - "path": "aten/src/ATen/native/TensorShape.cpp" - }, - { - "path": "aten/src/ATen/native/TensorShape.h" - }, - { - "path": "aten/src/ATen/native/Unique.cpp" - }, - { - "path": "aten/src/ATen/native/cuda/BinaryMiscBackwardOpsKernels.cu" - }, - { - "path": "aten/src/ATen/native/cuda/CUDAJitLoops.cuh" - }, - { - "path": "aten/src/ATen/native/cuda/JitLoops.cuh" - }, - { - "path": "aten/src/ATen/native/cuda/Lerp.cu" - }, - { - "path": "aten/src/ATen/native/cuda/PersistentSoftmax.cuh" - }, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwJC4U=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801857" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_UcE=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754076/jobs/2663109810" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ_w=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": 
"https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801862" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_UcY=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc5.4" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663109777" + }, + { + "name": "test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663201383" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663201458" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663201512" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663201580" + }, + { + "name": "test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663201672" + }, + { + "name": "test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663201839" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwIWu4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801866" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Uco=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754079/jobs/2663109681" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ1k=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801869" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Uc0=" + } + ], + "pageInfo": { + "hasNextPage": true + } + }, + "status": { + "contexts": [ + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17017798?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17017799?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17017816?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: 
pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17017800?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + } + ] + }, + "pushedDate": "2022-02-23T10:39:30Z", + "oid": "346e0c547953d98eb84d23c1391a95badb9c4a22" + } + } + ] + }, + "changedFiles": 9, + "files": { + "nodes": [ { - "path": "aten/src/ATen/native/cuda/SoftMax.cu" + "path": "aten/src/ATen/native/GridSampler.cpp" }, { - "path": "aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu" + "path": "aten/src/ATen/native/cpu/GridSamplerKernel.cpp" }, { - "path": "aten/src/ATen/native/cuda/Unique.cu" + "path": "aten/src/ATen/native/cuda/GridSampler.cpp" }, { - "path": "aten/src/ATen/native/cuda/jit_utils.cpp" + "path": "aten/src/ATen/native/cuda/GridSampler.cu" }, { - "path": "aten/src/ATen/native/cuda/jit_utils.h" + "path": "aten/src/ATen/native/cuda/GridSampler.h" }, { "path": "aten/src/ATen/native/native_functions.yaml" }, { - "path": "aten/src/ATen/native/nested/NestedTensorMath.cpp" - }, - { - "path": "aten/src/ATen/native/quantized/cpu/qembeddingbag.cpp" - }, - { - "path": "aten/src/ATen/native/quantized/cpu/qsoftmax.cpp" - }, - { - "path": "aten/src/ATen/native/quantized/cudnn/BinaryOps.cpp" + "path": "test/forward_backward_compatibility/check_forward_backward_compatibility.py" }, { - "path": "aten/src/ATen/native/quantized/cudnn/Linear.cpp" + "path": "test/test_nn.py" }, { - "path": "aten/src/ATen/native/quantized/cudnn/utils.h" - }, + "path": "tools/autograd/derivatives.yaml" + } + ], + "pageInfo": { + "endCursor": "OQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ { - "path": "aten/src/ATen/native/sparse/SparseCsrTensor.cpp" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "path": "aten/src/ATen/native/ts_native_functions.yaml" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "path": "aten/src/ATen/record_function.cpp" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "path": "aten/src/ATen/record_function.h" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "path": "aten/src/ATen/templates/Operators.h" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "path": "aten/src/ATen/templates/RegisterFunctionalization.cpp" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "path": "aten/src/ATen/test/basic.cpp" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "path": "aten/src/ATen/test/vmap_test.cpp" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "path": "binaries/record_function_benchmark.cc" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "path": "c10/core/DispatchKey.cpp" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "path": "c10/core/DispatchKey.h" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "path": "c10/core/DispatchKeySet.h" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "path": "c10/test/core/DispatchKeySet_test.cpp" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "path": "c10/util/ArrayRef.h" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "path": "caffe2/core/tensor.h" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "path": "docs/source/conf.py" + "author": { + "login": "albanD" + }, + "state": "APPROVED" }, { - "path": "docs/source/fx.rst" + "author": { + 
"login": "albanD" + }, + "state": "APPROVED" } ], "pageInfo": { - "endCursor": "MTAw", - "hasNextPage": true - } - }, - "reviews": { - "nodes": [], - "pageInfo": { - "startCursor": null, + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wMS0yNVQwODoyODoxMC0wODowMLkyMDIyLTAxLTI1VDA3OjU0OjA1LTA4OjAwzjNooqI=", "hasPreviousPage": false } }, "comments": { "nodes": [ { - "bodyText": "Merge failed due to Matched rule superuser, but it was not reviewed yet by any of:zou3519,abhikrish,mehtanirav,wconstab,lc0, ...", + "bodyText": "Merge failed due to 'NoneType' object is not subscriptable\nRaised by https://github.com/pytorch/pytorch/actions/runs/1887945630", "author": { "login": "pytorchmergebot" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1104215370 + "databaseId": 1048868910 }, { - "bodyText": "Merge failed due to Matched rule superuser, but PR has not been reviewed yet", + "bodyText": "Thanks for the update! The windows failure is not your fault, you can ignore it!\n\nThank you very much for all of your feedback and sorry for the delay !", "author": { - "login": "pytorchmergebot" + "login": "coolteemf" }, - "authorAssociation": "MEMBER", + "authorAssociation": "CONTRIBUTOR", "editor": null, - "databaseId": 1104220908 + "databaseId": 1048983572 }, { - "bodyText": "@pytorchbot merge this", + "bodyText": "@coolteemf can you please send either me or @albanD an email? (or I can send you and invite to collab on private repo)", "author": { "login": "malfet" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1104378397 + "databaseId": 1049048119 }, { - "bodyText": "Merge failed due to Matched rule superuser, but PR has not been reviewed yet\nRaised by https://github.com/pytorch/pytorch/actions/runs/2197877090", + "bodyText": "@pytorchbot merge this please", "author": { - "login": "pytorchmergebot" + "login": "albanD" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1104379712 + "databaseId": 1049131992 }, { - "bodyText": "Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale. Feel free to remove the Stale label if you feel this was a mistake. If you are unable to remove the Stale label please contact a maintainer in order to do so. If you want the bot to never mark this PR stale again, add the no-stale label.Stale pull requests will automatically be closed after 30 days of inactivity.", + "bodyText": "Hey @coolteemf.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' 
and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", "author": { "login": "github-actions" }, "authorAssociation": "NONE", "editor": null, - "databaseId": 1160658699 + "databaseId": 1049134520 } ], "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOQdD9Sg==", + "startCursor": "Y3Vyc29yOnYyOpHOPoR4Lg==", "hasPreviousPage": true } }, "labels": { "edges": [ + { + "node": { + "name": "triaged" + } + }, + { + "node": { + "name": "open source" + } + }, { "node": { "name": "cla signed" @@ -5848,7 +10002,12 @@ }, { "node": { - "name": "Stale" + "name": "release notes: nn" + } + }, + { + "node": { + "name": "topic: performance" } } ] @@ -5857,268 +10016,174 @@ } } }, - "query_sha=74bd29fe945c49fde4818e873fa62bc60b55b4ef6ae3f2bb719bab6cddbaa7ce cursor=MTAw name=pytorch number=76118 owner=pytorch": { + "query_sha=e29d0e1d73b9847dacfd671c80e117d82111b407f7daa8ff885d3c444eafe47f name=pytorch number=75095 owner=pytorch": { "data": { "repository": { "pullRequest": { + "closed": true, + "isCrossRepository": false, + "author": { + "login": "mruberry" + }, + "title": "Initial prims, references, and test architecture for them", + "body": "This PR adds an initial set of experimental primitive operations and Python references that reimplement existing PyTorch operations using them. See https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-0/577 for additional context.\r\n\r\nThe following experimental primitives are added:\r\n\r\n- Elementwise unary prims -- abs, acos, acosh, asin, atan, cos, cosh, bessel_i0e, bessel_i1e, cbrt, ceil, digamma, erf, erf_inv, erfc, exp, expm1, floor, igamma, igammac, is_finite, lgamma, log, log1p, neg, reciprocal, round, sign, sinh, sqrt, square, tan. \r\n- Elementwise binary prims -- add, atan2, bitwise_and, bitwise_not, bitwise_or, bitwise_xor, div, eq, ge, gt, le, lt, max, min, mul, ne, nextafter, pow, rsqrt, shift_left, shift_right_arithmetic\r\n- View prims -- brodcast_in_dim, collapse_view, split_dim, squeeze\r\n- Shape prims -- collapse, concatenate, reshape\r\n- Conditional prims -- select\r\n- Data conversion & movement prims -- convert_element_type, device_put\r\n- Inplace prims -- copy_to, resize\r\n\r\nThese primitives do not add any new functionality to PyTorch, but are intended to be the semantic building blocks for reference operators. We have tried to make them consistent with the operations in [jax.lax](https://jax.readthedocs.io/en/latest/jax.lax.html) where possible (because PyTorch prefers being consistent with other frameworks), although there are key differences between these prims and operations in jax.lax. Most notably is that these prims model view semantics and inplace operations.\r\n\r\nIn addition to these primitives the following elementwise binary Python references are added:\r\n\r\n- Elementwise binary Python references -- add, atan2, bitwise_and, bitwise_left_shift, bitwise_or, bitwise_right_shift, bitwise_xor, eq, float_power, ge, gt, le, lt, maximum, minimum, mul, ne, nextafter, pow, sub, true_divide\r\n- Conditional Python references - where\r\n- Data conversion & movement references - copy_to\r\n\r\nA Python reference implements the same behavior as its corresponding PyTorch operator (excepting slight numerical differences, bug fixes, and in some cases additional features). \r\n\r\nThe start of an OpInfo-based test architecture for these references is also included in this PR. A new list, `python_ref_db`, is added to `common_methods_invocations.py`. 
This list introduces the new `ElementwiseBinaryPythonRefInfo`, which inherits input arguments from the original operators' OpInfo, allows them to be overridden, and then constructs the OpInfo for the Python reference using the (potentially modified) arguments. OpInfo-based tests can opt-into testing references by including this new list in the Sequence passed to the `@ops` decorator. \r\n\r\ncc @ngimel @csarofeen @kevinstephano @Lezcano ", + "headRefName": "prims_and_references", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, "commits_with_authors": { "nodes": [ - { - "commit": { - "author": { - "user": { - "login": "clee2000" - }, - "email": "csl@fb.com", - "name": "Catherine Lee" - }, - "oid": "7f560351ae04ea43e58fbfda885bcf216aa26cde" - } - }, - { - "commit": { - "author": { - "user": { - "login": "pytorchmergebot" - }, - "email": "pytorchmergebot@users.noreply.github.com", - "name": "PyTorch MergeBot" - }, - "oid": "e8677ed168a036bc7e590d800fe98dd15f10581b" - } - }, - { - "commit": { - "author": { - "user": { - "login": "robieta" - }, - "email": "taylorrobie@fb.com", - "name": "Taylor Robie" - }, - "oid": "ac5611caa13642ef8dbe0db453b283b42cbd900b" - } - }, - { - "commit": { - "author": { - "user": { - "login": "robieta" - }, - "email": "taylorrobie@fb.com", - "name": "Taylor Robie" - }, - "oid": "1184afbd3bfde0f46133aef09e55e18d3bfb3c3e" - } - }, - { - "commit": { - "author": { - "user": { - "login": "minsii" - }, - "email": "msi@fb.com", - "name": "Min Si" - }, - "oid": "1c05604f3d049c67dc678d0295c0add470bff3dc" - } - }, { "commit": { "author": { "user": null, - "email": "eellison@devfair044.h1.fair", - "name": "Elias Ellison" + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "76ab5101bd36e8d73637d31bbea125240b7b27f0" + "oid": "a790467c650be92775103cde5e866c90b56f5376" } }, { "commit": { "author": { "user": null, - "email": "eellison@devfair044.h1.fair", - "name": "Elias Ellison" - }, - "oid": "c774050e92c3d8e52968e1eb635dd3e9491104b3" - } - }, - { - "commit": { - "author": { - "user": { - "login": "guoyejun" - }, - "email": "yejun.guo@intel.com", - "name": "Guo Yejun" - }, - "oid": "8981595c5361f07186f4534f3be71f1d829a3046" - } - }, - { - "commit": { - "author": { - "user": { - "login": "BowenBao" - }, - "email": "bowbao@microsoft.com", - "name": "BowenBao" - }, - "oid": "036f362904024ac9481248965009f312bec6656b" - } - }, - { - "commit": { - "author": { - "user": { - "login": "janeyx99" - }, - "email": "janeyx@fb.com", - "name": "Jane Xu" + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "457d994933f164a9fd70da5ca2733dd6c046a28b" + "oid": "bd6fcf50692e208ebecdc2eaa517a2bfcdcd35cf" } }, { "commit": { "author": { - "user": { - "login": "janeyx99" - }, - "email": "janeyx@fb.com", - "name": "Jane Xu" + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "f49ebc77520774e71722111d554a0215a26956df" + "oid": "4a119c8f21529fe1375e7e8789b91f41a3df80c5" } }, { "commit": { "author": { - "user": { - "login": "mikeiovine" - }, - "email": "mikeiovine@fb.com", - "name": "Mike Iovine" + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "f069e1a4a5f98d3fe961e4fc562ede59f59b4026" + "oid": "ea6750dc34d66be759fdfe84b09fb0e23ee59c79" } }, { "commit": { "author": { - 
"user": { - "login": "salilsdesai" - }, - "email": "salilsdesai@fb.com", - "name": "Salil Desai" + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "30bccf58393b288412a0f5a2423a1a41ffce258e" + "oid": "2eef8a55fe0227e1921b51bf1f56f9d0a29b49ac" } }, { "commit": { "author": { - "user": { - "login": "angelayi" - }, - "email": "angelayi@fb.com", - "name": "Angela Yi" + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "f4ba440fe8a632c1ee88e01f7746a8a92c8f3902" + "oid": "b886ed6c20dd1785fd31ed6fa6a8c5b6d0d0b16c" } }, { "commit": { "author": { "user": null, - "email": "shirong@fb.com", - "name": "Shirong Wu" + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "d203346c93ba96d626c6c02910888198c789ba69" + "oid": "9ad9b63d09aa4f7a8549bcf1d88ea4ff0674299c" } }, { "commit": { "author": { - "user": { - "login": "jamesr66a" - }, - "email": "jamesreed@fb.com", - "name": "James Reed" + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "73a4e34963e212b799a191fd031d2fa31d17e0ac" + "oid": "63fdd580118477416ae160e0670ae722ea248090" } }, { "commit": { "author": { - "user": { - "login": "Krovatkin" - }, - "email": "korovaikon@gmail.com", - "name": "Nikolay Korovaiko" + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "b9d5206dfb46f09f953aba3ffb0e1e33a99032ee" + "oid": "0ccf7dc292af1d40d0a094eb2b2fb0c7ab4ccc70" } }, { "commit": { "author": { - "user": { - "login": "ngimel" - }, - "email": "ngimel@fb.com", - "name": "Natalia Gimelshein" + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "12114e6937573fead54e11ae6cdebe5b31dee302" + "oid": "e8a8a4d1fbe35f20eb88e1a43cf5a653883638e5" } }, { "commit": { "author": { - "user": { - "login": "s4ayub" - }, - "email": "shababayub@fb.com", - "name": "Shabab Ayub" + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "f2323f76ad6f7f590285bf9c6d20c14a79542563" + "oid": "186634dfdd25645c05b58a212f9e8d77c4125fc0" } }, { "commit": { "author": { - "user": { - "login": "jaglinux" - }, - "email": "jagdish.krishna@gmail.com", - "name": "Jagadish Krishnamoorthy" + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "acd4b5abe2739c09c1a02524eceda46ff93fd385" + "oid": "f5b4741312b5c42a79f6c8a1d3930b79db38ed8f" } }, { "commit": { "author": { "user": { - "login": "cccclai" + "login": "ezyang" }, - "email": "chenlai@fb.com", - "name": "Chen Lai" + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" }, - "oid": "04179f533283132fa334a9f91a070b1712f7323d" + "oid": "23d50391bb0fd12111fd3171591c4235ffb2fc1a" } }, { "commit": { "author": { "user": { - "login": "zaxtax" + "login": "ezyang" }, - "email": "rob@zinkov.com", - "name": "Rob Zinkov" + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" }, - "oid": "5097cdcd6994ad82b3cec942b70e75dbeaee8ca4" + "oid": "bac9d45422d58f513b60b4b854441cfdc253d4c5" } }, { @@ -6130,7 +10195,7 @@ "email": "ezyang@fb.com", "name": "Edward Z. Yang" }, - "oid": "5015ecb5a2b86943f457d71f5a977444dd062732" + "oid": "13240ae0b4a0332c3167b65ac026a3172da90cb7" } }, { @@ -6142,171 +10207,125 @@ "email": "ezyang@fb.com", "name": "Edward Z. 
Yang" }, - "oid": "1c42b7789d3966cd541b08fce359b9738fee69f6" + "oid": "1ee34468cb1db3dc6cbae204669f4fec20e2a466" } }, { "commit": { "author": { "user": { - "login": "albanD" + "login": "ezyang" }, - "email": "albandes@fb.com", - "name": "Alban Desmaison" + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" }, - "oid": "893ac3d334fd3e85e22423a06fe986ce453fe304" + "oid": "561d132bc686d00e8911f7feb3da5901b2bdc574" } }, { "commit": { "author": { "user": { - "login": "emcastillo" + "login": "ngimel" }, - "email": "ecastill@preferred.jp", - "name": "Emilio Castillo" + "email": "ngimel@fb.com", + "name": "Natalia Gimelshein" }, - "oid": "aa5d1b6b031ee2b8bb85f793a842ac1327ae4a19" + "oid": "ac42bedc84b7c96256376ad09917263bb020b2c3" } }, { "commit": { "author": { "user": { - "login": "dzdang" + "login": "ngimel" }, - "email": "dzdang@umich.edu", - "name": "dzdang" + "email": "ngimel@fb.com", + "name": "Natalia Gimelshein" }, - "oid": "0707a1d00f33d7098f56de339cb30436e8c2ea44" + "oid": "7f7d5ba40a0b5e10526d90b018b30b54673d12d8" } }, { "commit": { "author": { - "user": { - "login": "NivekT" - }, - "email": "ktse@fb.com", - "name": "Kevin Tse" + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "ccb082d42af99f6374183cf914cc712bac585f0f" + "oid": "37a6b4a8b1adb712d5777c7c3479866c27fb3c4e" } }, { "commit": { "author": { "user": { - "login": "ryandaryl" + "login": "ngimel" }, - "email": "ryandarylmills@gmail.com", - "name": "ryandaryl" + "email": "ngimel@fb.com", + "name": "Natalia Gimelshein" }, - "oid": "4f2909cc8747808786a1871b0a6825cc4566f48c" + "oid": "65b613868c44e519c1777af79b9fd3498c5a7e58" } }, { "commit": { "author": { "user": { - "login": "clee2000" + "login": "ngimel" }, - "email": "csl@fb.com", - "name": "Catherine Lee" + "email": "ngimel@fb.com", + "name": "Natalia Gimelshein" }, - "oid": "f764010648a29223d9ed4b955073d9d2fb1b2f43" + "oid": "442c405e9da0d66744ef03e379224c41eedf5b57" } }, { "commit": { "author": { - "user": { - "login": "malfet" - }, - "email": "nshulga@fb.com", - "name": "Nikita Shulga" + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "5696e8357cf38f852ef3d680381513e26f202371" + "oid": "031ac49ae9c192989385986b6707fa781e3229e0" } - } - ], - "pageInfo": { - "endCursor": "MTMx", - "hasNextPage": false - } - } - } - } - } - }, - "query_sha=0e2a29eda6405cea4c9de20fb80ae7924910e17272a7b251040182e7d8c390e0 name=pytorch number=76123 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "closed": true, - "isCrossRepository": true, - "author": { - "login": "kumpera" - }, - "title": "Introduce distributed checkpoint with ShardedTensor.", - "body": "Co-authored-by: Wen Zhang \r\nCo-authored-by: Yifu Wang \r\n\r\n", - "headRefName": "st_checkpoint", - "headRepository": { - "nameWithOwner": "kumpera/pytorch" - }, - "baseRefName": "master", - "baseRepository": { - "nameWithOwner": "pytorch/pytorch", - "isPrivate": false, - "defaultBranchRef": { - "name": "master" - } - }, - "mergeCommit": null, - "commits_with_authors": { - "nodes": [ + }, { "commit": { "author": { - "user": { - "login": "kumpera" - }, - "email": "kumpera@fb.com", - "name": "Rodrigo Kumpera" + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "6bf248bc20a71f248064b795f38276326fe43aae" + "oid": "9a6c3b00039c0c985c1c9cb59490012d1c0b38ba" } }, { "commit": { "author": { - "user": { - "login": "kumpera" - }, - "email": "kumpera@fb.com", - "name": "Rodrigo Kumpera" + "user": null, + 
"email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "10f84fb90bf02d7062e565ebf2c1da6352b64db7" + "oid": "d5c30e408af1889b90012d2e09f6ec3cda333bcb" } }, { "commit": { "author": { - "user": { - "login": "kumpera" - }, - "email": "kumpera@fb.com", - "name": "Rodrigo Kumpera" + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" }, - "oid": "96c5299740ec791f3cf0975c03a40a7b219b6747" + "oid": "db355d55655bb252a699cd532441bb98e52b98d5" } } ], "pageInfo": { - "endCursor": "Mw", + "endCursor": "MjY", "hasNextPage": false }, - "totalCount": 3 + "totalCount": 26 }, "commits": { "nodes": [ @@ -6327,379 +10346,146 @@ "name": "Facebook CLA Check", "conclusion": "SUCCESS", "detailsUrl": "https://code.intern.facebook.com/cla/" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXgS2l4=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6380755666" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXxSmtI=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" - } - }, - "checkRuns": { - "nodes": [ + }, { - "name": "run-torchbench", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299234164?check_suite_focus=true" + "name": "Meta Internal-Only Changes Check", + "conclusion": "SUCCESS", + "detailsUrl": "https://opensource.facebook.com/" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXd2r3Q=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAW6ux14=", "hasNextPage": false } }, - "conclusion": "SKIPPED", - "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6380755785" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454954" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXxSm0k=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC2o=" }, { "node": { "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "Lint" - } + "name": "Netlify", + "databaseId": 13473 }, - "checkRuns": { - "nodes": [ - { - "name": "quick-checks", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299234165?check_suite_focus=true" - }, - { - "name": "toc", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299234428?check_suite_focus=true" - }, - { - "name": "lintrunner", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299234555?check_suite_focus=true" - }, - { - "name": "Test collect_env (with_torch)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299234642?check_suite_focus=true" - }, - { - "name": "Test collect_env (without_torch)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299234701?check_suite_focus=true" - }, - { - "name": "Test tools", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299234761?check_suite_focus=true" - }, - { - "name": "workflow-checks", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299234837?check_suite_focus=true" - } - ], + "workflowRun": null, + "checkRuns": { + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXd2shU=", + 
"endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6380755786" + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454956" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXxSm0o=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC2w=" }, { "node": { "app": { - "name": "GitHub Actions", - "databaseId": 15368 + "name": "Azure Pipelines", + "databaseId": 9426 }, - "workflowRun": { - "workflow": { - "name": "pull" + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false } }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454965" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC3U=" + }, + { + "node": { + "app": { + "name": "Dependabot", + "databaseId": 29110 + }, + "workflowRun": null, "checkRuns": { - "nodes": [ - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299245858?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299245958?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299246168?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299246250?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299246281?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-onnx / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299246329?check_suite_focus=true" - }, - { - "name": "linux-vulkan-bionic-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299246373?check_suite_focus=true" - }, - { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299246442?check_suite_focus=true" - }, - { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299246517?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-asan / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299246547?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299246591?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299246687?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / build", - "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/6299246843?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc7-no-ops / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299246972?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-mobile-build / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299247064?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299247163?check_suite_focus=true" - }, - { - "name": "win-vs2019-cuda11.3-py3 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299247261?check_suite_focus=true" - }, - { - "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299247380?check_suite_focus=true" - }, - { - "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299247471?check_suite_focus=true" - }, - { - "name": "linux-bionic-rocm5.1-py3.7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299247519?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299305596?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299305656?check_suite_focus=true" - }, - { - "name": "linux-docs / build-docs (cpp)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299307925?check_suite_focus=true" - }, - { - "name": "linux-docs / build-docs (python)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299307961?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299308001?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299308035?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299308082?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299308120?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299308169?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299308217?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299312986?check_suite_focus=true" - }, - { - 
"name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299313146?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299313195?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299313235?check_suite_focus=true" - }, - { - "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299313977?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299314888?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299314937?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 4, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299332358?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 4, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299332420?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 4, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299332476?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 4, 4, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299332526?check_suite_focus=true" - }, - { - "name": "pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299335580?check_suite_focus=true" - }, - { - "name": "linux-bionic-rocm5.1-py3.7 / test (default, 1, 2, linux.rocm.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299375031?check_suite_focus=true" - }, - { - "name": "linux-bionic-rocm5.1-py3.7 / test (default, 2, 2, linux.rocm.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299375079?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299377190?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299378010?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299378053?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", - "conclusion": "FAILURE", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/6299378105?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299378136?check_suite_focus=true" - }, + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454970" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC3o=" + }, + { + "node": { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454974" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC34=" + }, + { + "node": { + "app": { + "name": "PyTorch Bot", + "databaseId": 40112 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454977" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC4E=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ { - "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6299437798?check_suite_focus=true" + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622865/jobs/3270915028" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXd5yuY=", - "hasNextPage": true + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAW6e-c8=", + "hasNextPage": false } }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6380755806" + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241455322" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXxSm14=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFDNo=" }, { "node": { @@ -6714,80 +10500,51 @@ }, "checkRuns": { "nodes": [ + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915027" + }, { "name": "lintrunner", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309468155?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915071" }, { - "name": "quick-checks", + "name": "Test tools", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309468457?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915141" }, { "name": "Test collect_env (with_torch)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309468841?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915194" }, { "name": "Test collect_env (without_torch)", "conclusion": 
"SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309468942?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915229" }, { - "name": "Test tools", + "name": "toc", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309469180?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915283" }, { "name": "workflow-checks", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309469314?check_suite_focus=true" - }, - { - "name": "toc", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309469473?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915321" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXgS3SE=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAW6e-zM=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6390363240" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXzlNGg=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "run-torchbench", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309468138?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXgS1-o=", - "hasNextPage": false - } - }, - "conclusion": "SKIPPED", - "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6390363271" + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241455334" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXzlNIc=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFDOY=" }, { "node": { @@ -6803,1262 +10560,2489 @@ "checkRuns": { "nodes": [ { - "name": "linux-bionic-rocm5.1-py3.7 / build", + "name": "linux-vulkan-bionic-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309468956?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927344" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "name": "linux-bionic-rocm5.0-py3.7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309469237?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927442" }, { - "name": "linux-xenial-py3.7-clang7-asan / build", + "name": "linux-xenial-py3.7-clang7-onnx / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309469475?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927507" }, { - "name": "linux-xenial-py3.7-gcc7-no-ops / build", + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309469750?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927567" }, { "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/6309470049?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309470368?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927674" }, { - "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "name": "win-vs2019-cuda11.3-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309470787?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927727" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "name": "linux-bionic-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309471290?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927802" }, { - "name": "linux-xenial-py3-clang5-mobile-build / build", + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309471585?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927853" }, { - "name": "linux-xenial-py3.7-clang7-onnx / build", + "name": "linux-xenial-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309471734?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927948" }, { - "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", + "name": "linux-xenial-py3-clang5-mobile-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309472014?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927996" }, { - "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "name": "linux-xenial-py3.7-clang7-asan / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309472172?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928061" }, { "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309472411?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928116" }, { - "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309472715?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928198" }, { "name": "linux-xenial-py3.7-gcc5.4 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309473041?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928256" }, { "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309473226?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - 
"detailsUrl": "https://github.com/pytorch/pytorch/runs/6309473414?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928291" }, { "name": "win-vs2019-cpu-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309473700?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928317" }, { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309473992?check_suite_focus=true" - }, - { - "name": "win-vs2019-cuda11.3-py3 / build", + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309474162?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928338" }, { - "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "name": "linux-xenial-py3.7-gcc7-no-ops / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309647069?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928367" }, { - "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309647413?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928410" }, { - "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309647538?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928445" }, { "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309657055?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991071" }, { "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309657196?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991125" }, { "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309657332?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991162" }, { "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309657575?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991195" }, { "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309657726?check_suite_focus=true" + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991233" 
}, { "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309657858?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991261" }, { "name": "linux-docs / build-docs (cpp)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309658314?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991305" }, { "name": "linux-docs / build-docs (python)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309658433?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991349" }, { "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309665388?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270996024" }, { "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309665513?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270996068" }, { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309665597?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270996092" }, { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270996505" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270998987" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309665697?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270999027" }, { "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309672367?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271006886" }, { "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309672499?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271006941" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 4, linux.2xlarge)", + "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309696458?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271018097" }, { - "name": 
"linux-xenial-py3.7-clang7-asan / test (default, 2, 4, linux.2xlarge)", + "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309696554?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271018135" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 4, linux.2xlarge)", + "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309696638?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271018162" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 4, 4, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309696725?check_suite_focus=true" + "name": "pytorch-xla-linux-bionic-py3.7-clang8", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271021143" }, { - "name": "pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309712838?check_suite_focus=true" + "name": "linux-bionic-rocm5.0-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271034041" }, { - "name": "linux-bionic-rocm5.1-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "name": "linux-bionic-rocm5.0-py3.7 / test (default, 2, 2, linux.rocm.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309767601?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271034072" }, { - "name": "linux-bionic-rocm5.1-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309767717?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271048218" }, { "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309792321?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271049553" }, { "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309792407?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271049587" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309792546?check_suite_focus=true" + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271049616" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", "conclusion": "SUCCESS", - 
"detailsUrl": "https://github.com/pytorch/pytorch/runs/6309792639?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271068293" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309792972?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271068336" }, { - "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "name": "win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271149276" + }, + { + "name": "win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6309939578?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271149321" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXgaCXo=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAW6jVK8=", "hasNextPage": true } }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6390363300" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241455360" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXzlNKQ=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFDQA=" } ], "pageInfo": { "hasNextPage": false } }, - "pushedDate": "2022-05-05T00:34:26Z", - "oid": "96c5299740ec791f3cf0975c03a40a7b219b6747" + "status": null, + "pushedDate": "2022-04-25T02:30:31Z", + "oid": "db355d55655bb252a699cd532441bb98e52b98d5" } } ] }, - "changedFiles": 11, + "changedFiles": 5, "files": { "nodes": [ { - "path": "test/distributed/_shard/checkpoint/test_checkpoint.py" + "path": "test/test_ops.py" + }, + { + "path": "torch/_prims/__init__.py" + }, + { + "path": "torch/_prims/utils.py" + }, + { + "path": "torch/_refs/__init__.py" + }, + { + "path": "torch/testing/_internal/common_methods_invocations.py" + } + ], + "pageInfo": { + "endCursor": "NQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/distributed/_shard/checkpoint/test_file_system_checkpoint.py" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/distributed/_shard/sharded_tensor/test_sharded_tensor.py" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/_shard/checkpoint/__init__.py" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/_shard/checkpoint/filesystem.py" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/_shard/checkpoint/metadata.py" + "author": { 
+ "login": "lezcano" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/_shard/checkpoint/resharding.py" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/_shard/checkpoint/state_dict_loader.py" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/_shard/checkpoint/state_dict_saver.py" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/_shard/checkpoint/storage.py" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "path": "torch/testing/_internal/distributed/_shard/sharded_tensor/_test_st_common.py" - } - ], - "pageInfo": { - "endCursor": "MTE", - "hasNextPage": false - } - }, - "reviews": { - "nodes": [ + "author": { + "login": "ngimel" + }, + "state": "COMMENTED" + }, { "author": { - "login": "kumpera" + "login": "ngimel" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "lezcano" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "zou3519" }, "state": "COMMENTED" }, { "author": { - "login": "zzzwen" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "zzzwen" + "login": "peterbell10" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "lezcano" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "lezcano" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ngimel" }, "state": "COMMENTED" }, { "author": { - "login": "wanchaol" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "zzzwen" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "zzzwen" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "simpkins" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "zzzwen" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "zzzwen" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" 
}, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "simpkins" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "simpkins" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "pritamdamania87" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "pritamdamania87" + "login": "ngimel" }, "state": "COMMENTED" }, { "author": { - "login": "pritamdamania87" + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "wilson100hong" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "wilson100hong" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "wilson100hong" + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "xunnanxu" + "login": "mruberry" }, - "state": "DISMISSED" + "state": "COMMENTED" }, { "author": { - "login": "xunnanxu" + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" }, "state": "COMMENTED" }, { "author": { - "login": "xunnanxu" + "login": "ngimel" + }, + "state": "APPROVED" + }, + { + "author": { + "login": "ezyang" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "mruberry" }, "state": "COMMENTED" - }, + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNC0wNlQxMjo1NjoyNC0wNzowMLkyMDIyLTA0LTA2VDA4OjQwOjM4LTA3OjAwzjenO6Y=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ { + "bodyText": "Ref implementations by themselves can handle any shapes (and broadcast ops by themselves don't bake in any shapes). 
The question is can we decide if a particular trace is applicable for a different input, but that depends on the tracing technology and what we are caching on, so out of scope for initial PR.", "author": { - "login": "kumpera" + "login": "ngimel" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1105643418 }, { + "bodyText": "@pytorchbot merge this please", "author": { - "login": "kumpera" + "login": "mruberry" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1108072887 }, { + "bodyText": "Merge failed due to 'mruberry'\nRaised by https://github.com/pytorch/pytorch/actions/runs/2218044244", "author": { - "login": "kumpera" + "login": "pytorchmergebot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1108073536 }, { + "bodyText": "@mruberry has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", "author": { - "login": "kumpera" + "login": "facebook-github-bot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1108075965 }, { + "bodyText": "Hey @mruberry.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", "author": { - "login": "kumpera" + "login": "github-actions" }, - "state": "COMMENTED" + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1108351107 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQebHmg==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ + { + "node": { + "name": "cla signed" + } + }, + { + "node": { + "name": "topic: not user facing" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "node": { + "name": "module: primTorch" + } + } + ] + } + } + } + } + }, + "query_sha=e29d0e1d73b9847dacfd671c80e117d82111b407f7daa8ff885d3c444eafe47f name=pytorch number=77700 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": false, + "author": { + "login": "kit1980" + }, + "title": "Move pull linux-docs job to Ubuntu 20.04", + "body": "", + "headRefName": "sdym/pull-xenial-focal-linux-docs", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "kit1980" + }, + "email": "sdym@fb.com", + "name": "Sergii Dymchenko" + }, + "oid": "81261599614423baa17df72300b8e109677b6799" + } + } + ], + "pageInfo": { + "endCursor": "MQ", + "hasNextPage": false + }, + "totalCount": 1 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS", 
+ "detailsUrl": "https://code.facebook.com/cla/" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNmNqE=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147714" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuMI=" + }, + { + "node": { + "app": { + "name": "Netlify", + "databaseId": 13473 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147726" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuM4=" + }, + { + "node": { + "app": { + "name": "Azure Pipelines", + "databaseId": 9426 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147733" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuNU=" + }, + { + "node": { + "app": { + "name": "Dependabot", + "databaseId": 29110 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147746" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuOI=" + }, + { + "node": { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147762" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuPI=" + }, + { + "node": { + "app": { + "name": "PyTorch Bot", + "databaseId": 40112 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147780" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuQQ=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "lintrunner", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528127876" + }, + { + "name": "workflow-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528128023" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528128196" + }, + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528128519" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528128575" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528128663" + }, + { + "name": "Test tools", + "conclusion": 
"SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528128857" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNdYVY=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567148336" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuzA=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867843/jobs/3528127882" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNdXEg=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567148344" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuzg=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "docker-builds" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "docker-build (pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528127883" + }, + { + "name": "docker-build (pytorch-linux-bionic-cuda11.3-cudnn8-py3-clang9)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528127945" + }, + { + "name": "docker-build (pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128001" + }, + { + "name": "docker-build (pytorch-linux-bionic-py3.7-clang9)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128067" + }, + { + "name": "docker-build (pytorch-linux-bionic-rocm5.0-py3.7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128124" + }, + { + "name": "docker-build (pytorch-linux-bionic-rocm5.1-py3.7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128191" + }, + { + "name": "docker-build (pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128259" + }, + { + "name": "docker-build (pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128321" + }, + { + "name": "docker-build (pytorch-linux-xenial-py3-clang5-android-ndk-r19c)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128365" + }, + { + "name": "docker-build (pytorch-linux-xenial-py3-clang5-asan)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128446" + }, + { + "name": "docker-build (pytorch-linux-xenial-py3-clang7-asan)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128507" + }, + { + "name": "docker-build (pytorch-linux-xenial-py3-clang7-onnx)", + "conclusion": "SUCCESS", 
+ "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128563" + }, + { + "name": "docker-build (pytorch-linux-xenial-py3.7-gcc5.4)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128639" + }, + { + "name": "docker-build (pytorch-linux-xenial-py3.7-gcc7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128687" + }, + { + "name": "docker-build (pytorch-linux-focal-py3.7-gcc7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128741" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNdYLI=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567148352" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduu0A=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pull" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528150762" + }, + { + "name": "linux-focal-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528150903" + }, + { + "name": "linux-xenial-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528151086" + }, + { + "name": "linux-xenial-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528151258" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528151511" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528151776" + }, + { + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528151896" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152014" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152139" + }, + { + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152216" + }, + { + "name": "win-vs2019-cuda11.3-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152378" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152516" + }, + { + "name": "linux-xenial-py3-clang5-mobile-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152599" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / 
build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152723" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152802" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152913" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152969" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528153005" + }, + { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528153062" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528153125" + }, + { + "name": "win-vs2019-cpu-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528153207" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528242483" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528242528" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528245875" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528245914" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528245964" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528246008" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528248520" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528255086" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528255128" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + 
"detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528274064" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528274097" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528274133" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 4, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528274173" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 5, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528274209" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528277014" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528308958" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528309747" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528309810" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528309837" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528309864" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528309895" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528309925" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528310044" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528310101" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528384337" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528384379" + 
}, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528384408" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528384441" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528384471" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNi1Nc=", + "hasNextPage": true + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567148369" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduu1E=" + } + ], + "pageInfo": { + "hasNextPage": false + } + }, + "status": null, + "pushedDate": "2022-05-19T00:02:11Z", + "oid": "81261599614423baa17df72300b8e109677b6799" + } + } + ] + }, + "changedFiles": 3, + "files": { + "nodes": [ + { + "path": ".circleci/docker/build.sh" }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "path": ".circleci/docker/common/install_katex.sh" }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" - }, + "path": ".github/workflows/pull.yml" + } + ], + "pageInfo": { + "endCursor": "Mw", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ { "author": { - "login": "kumpera" + "login": "suo" }, "state": "COMMENTED" }, { "author": { - "login": "kumpera" + "login": "kit1980" }, "state": "COMMENTED" }, { "author": { - "login": "xunnanxu" + "login": "janeyx99" }, - "state": "COMMENTED" - }, + "state": "APPROVED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNS0xOFQxMjo0MTowNS0wNzowMLkyMDIyLTA1LTE4VDEyOjQxOjA0LTA3OjAwzjpD7es=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ { + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/77700\n\ud83d\udcc4 \u00a0Preview Python docs built from this PR\n\ud83d\udcc4 \u00a0Preview C++ docs built from this PR\n\u2753Need help or want to give feedback on the CI? Visit our office hours\n\n\u2705 No Failures (0 Pending)\nAs of commit 8126159 (more details on the Dr. CI page):\nExpand to see more\n\n\ud83d\udc9a \ud83d\udc9a Looks good so far! There are no failures yet. \ud83d\udc9a \ud83d\udc9a\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. 
CI Users group.\nClick here to manually regenerate this comment.", "author": { - "login": "xunnanxu" + "login": "facebook-github-bot" }, - "state": "COMMENTED" - }, - { - "author": { - "login": "xunnanxu" + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" }, - "state": "COMMENTED" + "databaseId": 1129400934 }, { + "bodyText": "@pytorchbot merge", "author": { - "login": "kumpera" + "login": "kit1980" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1131884232 }, { + "bodyText": "Merge failed due to Refusing to merge as mandatory check(s) linux-docs / build-docs (cpp), linux-docs / build-docs (python) are pending/not yet run for rule OSS CI\nRaised by https://github.com/pytorch/pytorch/actions/runs/2353067846", "author": { - "login": "kumpera" + "login": "pytorchmergebot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1131886153 }, { + "bodyText": "@pytorchbot merge -f", "author": { - "login": "kumpera" + "login": "kit1980" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1131945610 }, { + "bodyText": "Hey @kit1980.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", "author": { - "login": "kumpera" + "login": "github-actions" }, - "state": "COMMENTED" - }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1131947473 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQ1FKZg==", + "hasPreviousPage": false + } + }, + "labels": { + "edges": [ { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "node": { + "name": "Merged" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" - }, + "node": { + "name": "cla signed" + } + } + ] + } + } + } + } + }, + "query_sha=4c16925415d1fcc12ac0f5f7ce73b8e6122997d2f51c4c2757c2543e6493c60d cr_cursor=Y3Vyc29yOnYyOpHPAAAAAYNi1Nc= cs_cursor=Y3Vyc29yOnYyOpHPAAAAAYduu0A= name=pytorch number=77700 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "commits": { + "nodes": [ { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" - }, + "commit": { + "oid": "81261599614423baa17df72300b8e109677b6799", + "checkSuites": { + "nodes": [ + { + "checkRuns": { + "nodes": [ + { + "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528384494" + }, + { + "name": "linux-docs / build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528477548" + }, + { + "name": "linux-docs / build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528477578" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528728152" + 
}, + { + "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528728187" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNqJcE=", + "hasNextPage": false + } + } + } + ] + } + } + } + ] + } + } + } + } + }, + "query_sha=e29d0e1d73b9847dacfd671c80e117d82111b407f7daa8ff885d3c444eafe47f name=pytorch number=68111 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": true, + "author": { + "login": "chunyuan-w" + }, + "title": "Add JIT graph fuser for oneDNN Graph API (Preview4)", + "body": "## Description\r\nPreview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).\r\n\r\nOn the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:\r\n\r\n- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used\r\n- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.\r\n\r\n### User API:\r\nThe optimization pass is disabled by default. Users could enable it by:\r\n```\r\ntorch.jit.enable_onednn_fusion(True)\r\n```\r\n\r\n### Performance:\r\n[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:\r\n- SkyLake 8180 (1 socket of 28 cores):\r\n\r\n ![image](https://user-images.githubusercontent.com/65992142/151162305-05e44425-a24e-4d5e-94e1-743b40b87a8c.png)\r\n\r\n- SkyLake 8180 (single thread):\r\n\r\n ![image](https://user-images.githubusercontent.com/65992142/151162528-69f90b79-d08d-46b8-8775-d80a6ccbce8a.png)\r\n \\* By mapping hardswish to oneDNN Graph, it\u2019s 8% faster than PyTorch JIT (NNC + OFI)\r\n \\** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops\r\n\r\n\r\n### Directory structure of the integration code\r\nFuser-related code are placed under:\r\n```\r\ntorch/csrc/jit/codegen/onednn/\r\n```\r\n\r\nOptimization pass registration is done in:\r\n```\r\ntorch/csrc/jit/passes/onednn_graph_fuser.h\r\n```\r\n\r\nCMake for the integration code is:\r\n```\r\ncaffe2/CMakeLists.txt\r\n```\r\n\r\n## Limitations\r\n\r\n- In this PR, we have only supported the optimization on Linux platform. 
The support on Windows and MacOS will be enabled as the next step.\r\n- We have only optimized the inference use case.", + "headRefName": "chunyuan/llga_preview2", + "headRepository": { + "nameWithOwner": "chunyuan-w/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "0096fcc49f277fd8e006fcb42e0cb28a1422ec98" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "7bcc4de26a5472f1d252735dd425b46794b0844f" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "3a2a588bfe6bbf9bf74d88d441cd22affda207da" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "ca7df12fbfaa3ddbabeca39b76300d17f4a33f2f" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "81d44f35b8bc043c38837d0694e5bc072203b832" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "14fd5d1bfc2c58a71379f778871e3fca0a8e79b2" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "954dc23663125897f4b199eb2a8607dc5fca3274" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "9f77a0b476accc678b6f0569e4ff33fa6bbe97fc" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "fbf3b23bc1288697e1aec539a7c4ee3dc0bcb84c" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "f8b8e78f786586c3cdf3966fd83ffa124d3eda70" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "6fffa2f7453ee7e0f8d8e2f73ea8a65230539589" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "849385404e6f3cd1cf7cef19f931ecf4fa28afdb" + } }, { - "author": { - 
"login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "adbae7b77f8c0dbc59fccf15207d97ba86cfade2" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "6dcf2a4981aff24fa16fc7461ae4ec29690f956f" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "54f3e05ad524cffd0911ee93be3c50f589b51f58" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "edbfc640ea79a0af85757d9e73796dcc90231519" + } }, { - "author": { - "login": "pritamdamania87" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "67654db7cba562809d1b4a44cdda58af5cc9daaf" + } }, { - "author": { - "login": "pritamdamania87" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "9c9d99b930b11af9ff03f52d45bf49c652df758d" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "ffb25119cd9ce815cc4d9d14a2317fcbbfa9ea86" + } }, { - "author": { - "login": "pritamdamania87" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "ab9eee84512ca1bdfbc81e25c6eb67b29d0f302a" + } }, { - "author": { - "login": "pritamdamania87" - }, - "state": "APPROVED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "62a4642cf3330524990a69ac29e002c97812320a" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "ca9b1223be4af2c8b4929303d498eafd71793128" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "6f4a23d24514a02954d2ec792830085f612223c9" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "b2a9a9c0926b02d0b2e87722ed61450f224a61d0" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "e88b492be733f24b6aa395829c76add67d0901e7" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": 
"sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "c44336d7a914952bfb78e012e08d9a6d6dde5937" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "5157930f7b3921d41a586260582b574c915f6ca1" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "04cb8353813f6bbd0d913a994923cc7e1e291406" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNC0yNVQxMzozNTowMS0wNTowMLkyMDIyLTA0LTI1VDEzOjM1OjAwLTA1OjAwzjjC2d0=", - "hasPreviousPage": true - } - }, - "comments": { - "nodes": [ - { - "bodyText": "Merge failed due to Can't fetch all PR reviews\nRaised by https://github.com/pytorch/pytorch/actions/runs/2275691136", - "author": { - "login": "pytorchmergebot" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1118495479 + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "62991eaad0e638bb0bced327e03f932f66f68732" + } }, { - "bodyText": "Merge failed due to Can't fetch all PR reviews\nRaised by https://github.com/pytorch/pytorch/actions/runs/2275691136", - "author": { - "login": "pytorchmergebot" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1118511287 + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "7496bf1588050191595d833d23b8972b2f22655e" + } }, { - "bodyText": "Merge failed due to Can't fetch all PR reviews\nRaised by https://github.com/pytorch/pytorch/actions/runs/2275691136", - "author": { - "login": "pytorchmergebot" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1118662274 + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "d9d35f23cca0cd29c78a845731b24826152dcf1c" + } }, { - "bodyText": "Merge failed due to Can't fetch all PR reviews Raised by https://github.com/pytorch/pytorch/actions/runs/2275691136\n\n@osalpekar @malfet This is failing because there are 109 review comments on this PR but we only fetch the first 100. 
This could be solved with a similar concept as how we fetch more comments/check_runs.", - "author": { - "login": "janeyx99" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1118689010 + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "f74ec134f18a65a7c72455bdf44f72e3ebb27105" + } }, { - "bodyText": "On a side note, has the test_fsdp_clip_grad_norm_norm_type_2_0_nested_fsdp_False_cpu_offload_CPUOffload failure on the distributed test first shard of this PR been addressed?", - "author": { - "login": "janeyx99" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1118693497 - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOQqri9w==", - "hasPreviousPage": true - } - }, - "labels": { - "edges": [ + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "eb32cc65a975361160948bfc3d6a577991ea262e" + } + }, { - "node": { - "name": "oncall: distributed" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "c7665f8d695b680c54db0bad2b7b7df46d886b50" } }, { - "node": { - "name": "cla signed" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "e6321ad8f59ea01130568c202d186448bb9cb9d0" } - } - ] - } - } - } - } - }, - "query_sha=6a8ce6412a780d5804bfe180ed1dc807269e1eae2ae50de2346d56d1283884bc cursor=Y3Vyc29yOnYyOpO5MjAyMi0wNC0yNVQxMzozNTowMS0wNTowMLkyMDIyLTA0LTI1VDEzOjM1OjAwLTA1OjAwzjjC2d0= name=pytorch number=76123 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "reviews": { - "nodes": [ + }, { - "author": { - "login": "pritamdamania87" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "a72cd0d02693f45e5354a70654581ad514581ec7" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "b3cd3028b4ed31805e82f7eaf02217ab74ca59b9" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "49a592d9788d08e6cd0593882f867e129057c1cc" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "0575766b2144b13f6a38227c4e2b8d22ec8db80f" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "b5c9b10ff87d622350e8ca64fae3a476eb70d5aa" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "66bc652a30ccc329adb929870a4ac726bb98b38c" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": 
"sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "72b9ca9c8e2dac98cbb7199b3dfac7c7305b80c5" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "a7892ed7373207d96406c8b5734a089643c5cdbd" + } }, { - "author": { - "login": "kumpera" - }, - "state": "COMMENTED" - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNC0yMlQyMjozNzo1NC0wNTowMLkyMDIyLTA0LTIyVDE4OjAyOjA5LTA1OjAwzjip7G8=", - "hasPreviousPage": false - } - } - } - } - } - }, - "query_sha=0e2a29eda6405cea4c9de20fb80ae7924910e17272a7b251040182e7d8c390e0 name=pytorch number=71759 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "closed": true, - "isCrossRepository": true, - "author": { - "login": "coolteemf" - }, - "title": "Optimize grid sample 3d", - "body": "Fixes #71415\r\nI have implemented the changes that replicate what @to-mi did in this [PR](https://github.com/pytorch/pytorch/pull/65986#issue-1012959443) for the 3D case :\r\n\r\n> Fixes #64977\r\n> \r\n> Avoids creating a tensor for and calculating `input` gradient if it's not needed in the backward pass of `grid_sample` (2d case, native CPU & CUDA kernels). Especially the tensor creation seemed time consuming (see #64977).\r\n> \r\n> Brief description of the changes:\r\n> \r\n> * I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).\r\n> \r\n> * Changed the `native_functions.yaml` and `derivatives.yaml` so that the gradient input mask is passed to the functions.\r\n> \r\n> * Changed the CPU kernels:\r\n> (1) added `bool input_requires_grad` template parameter to the `backward` function,\r\n> (2) added if branches based on it to remove `input` gradient computations if it's not requested,\r\n> (3) feed in `TensorAccessor* gInp_slice_ptr` instead of `TensorAccessor& gInp_slice` so that I can pass a `nullptr` in case gradient for `input` is not requested. (A bit inelegant perhaps, but allows to keep one signature for `backward` function and not require breaking it to smaller pieces. 
Perhaps there's a more elegant way to achieve this?)\r\n> \r\n> * Changed CUDA kernel:\r\n> (1) added ~`bool input_requires_grad` template parameter~ `const bool input_requires_grad` argument to the `backward` function,\r\n> (2) added if branches based on it to remove `input` gradient computations if it's not requested,\r\n> (3) feed in `TensorInfo()` instead of `getTensorInfo(grad_input)` in case gradient for `input` is not requested.\r\n> \r\n> * Modified tests in `test/test_nn.py` so that they run also cases with no `input` gradient needed.\r\n> \r\n> * Have not touched the CPU fallback kernel.\r\n\r\nNote: the changes number (3) are N/A in this case.\r\n\r\n", - "headRefName": "optimize_grid_sample_3d", - "headRepository": { - "nameWithOwner": "coolteemf/pytorch" - }, - "baseRefName": "master", - "baseRepository": { - "nameWithOwner": "pytorch/pytorch", - "isPrivate": false, - "defaultBranchRef": { - "name": "master" - } - }, - "mergeCommit": null, - "commits_with_authors": { - "nodes": [ + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "d54cb084e1daad8a08c3f8de0ad3f7afb5b05ac1" + } + }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" }, - "oid": "e0b0d1e695aeddceaf265da602c4704592053e9e" + "oid": "aef71d692a8a159e0ca56be363e2cc1225ce7647" } }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" }, - "oid": "563ec73747ad53b63b36736c47c4342f962c2a09" + "oid": "bf618e205ec31cff962dcc8ab478e0a699a9572d" } }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" }, - "oid": "51abe41a132d9dd5b1c0551bdca902aacc028ff8" + "oid": "e4a331f1088448f7d7d86256ce71e0e71da006b0" } }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" }, - "oid": "be9898205992034a00e8ace8a55c2ecdcee2c2f8" + "oid": "0b743523d1430fec759d5fefbb687f17c89335a5" } }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" }, - "oid": "2929c60b64384c2deae0f7dea8bab94ad4bc9ec8" + "oid": "e80a351a62d98b810ec8985c4b25257af1d6c5bb" } }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" }, - "oid": "9241b737e7e2b257905cc74ad9c50b737d7f9d0a" + "oid": "c189eca154b6691919d0e21489d1c322c7435c0b" } }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" }, - "oid": "64d6b795d0636928a8aa2fd3da01302fb5f5f7af" + "oid": "e080a067c75d7b888a8a362682a2d5ba70e0c3a8" 
} }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" }, - "oid": "4503577e53760a0006f1e80ca6bfe04d2be90470" + "oid": "028561fbf8f3ed90e074e6e0e3a4ca4dd7ffa2a8" } }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" }, - "oid": "b16f4b11ffbbbf2ca2098f9702af4ef6b6fc5e1f" + "oid": "d550cf14037badd4caa2f52202e2f20bc4db8432" } }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" }, - "oid": "7ffc23368a604afdc92d2818747f730ce31a2bb5" + "oid": "574159ebadd1dec24daaf883879ffeca8d9e71b7" } }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" }, - "oid": "b85292604b9ad6c31706b76b5a5498c4f6d94309" + "oid": "9eb3ee98ea756067ed1c8f52f309f6d3e211a904" } }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" }, - "oid": "9d81d7bae8ad91aaa24b3ceab83e3138894dbc69" + "oid": "29929f48be03dcdd1bbfade572de7feafa825547" } }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" }, - "oid": "e79f6a2202512b294c55bf4bfb2e0524fafd4c48" + "oid": "8a7358ca8da547b40ea1a99ddc57ebed19959684" } }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" }, - "oid": "f683e8aec7aea76097a264eec01511e704c31154" + "oid": "6606637d2c5525b43e294a8b366a85052e1be0c6" } }, { "commit": { "author": { "user": { - "login": "coolteemf" + "login": "sanchitintel" }, - "email": "67541941+coolteemf@users.noreply.github.com", - "name": "Fran\u00e7ois Lecomte" + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" }, - "oid": "b932e9e286c22aaf352375186df851ef060b295a" + "oid": "5ecfd1f28b87045deb8bc8ffe33b3d8b906f3264" } }, { "commit": { "author": { - "user": null, - "email": "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", - "name": "coolteemf" + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" }, - "oid": "346e0c547953d98eb84d23c1391a95badb9c4a22" + "oid": "be2d4345c65442c4cfbe8afdfb2ae0893945da42" + } + }, + { + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "b5b89d3644a43e2dbda841cafb71b32edbe07c8a" + } + }, + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nikita.shulga@gmail.com", + "name": "Nikita Shulga" + }, + "oid": "73881411e2bfb3aaa2e89926a82390b4c587ad75" } } ], "pageInfo": { - "endCursor": "MTY", + "endCursor": "NjI", "hasNextPage": false }, - "totalCount": 16 + 
"totalCount": 62 }, "commits": { "nodes": [ @@ -8076,88 +13060,25 @@ "checkRuns": { "nodes": [ { - "name": "Facebook CLA Check", - "conclusion": "SUCCESS", - "detailsUrl": "https://code.intern.facebook.com/cla/" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGYqY=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801320" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_T6g=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3.7-clang7-onnx" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302020089?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302165846?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302165949?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwIob0=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801849" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Ubk=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3-clang5-mobile-build" - } - }, - "checkRuns": { - "nodes": [ + "name": "Facebook CLA Check", + "conclusion": "SUCCESS", + "detailsUrl": "https://code.intern.facebook.com/cla/" + }, { - "name": "build", + "name": "Meta Internal-Only Changes Check", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302019921?check_suite_focus=true" + "detailsUrl": "https://opensource.facebook.com/" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ1E=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_NXnc=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801852" + "url": "https://github.com/pytorch/pytorch/commit/73881411e2bfb3aaa2e89926a82390b4c587ad75/checks?check_suite_id=5743625010" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Ubw=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVZYwzI=" }, { "node": { @@ -8167,41 +13088,81 @@ }, "workflowRun": { "workflow": { - "name": "linux-bionic-rocm4.5-py3.7" + "name": "Lint" } }, "checkRuns": { "nodes": [ { - "name": "build", + "name": "clang-format", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302019934?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903895825" }, { - "name": "test (default, 2, 2, linux.rocm.gpu)", + "name": "py2-setup-validate-errormsg", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302431993?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903895911" }, { - "name": "test (default, 1, 2, linux.rocm.gpu)", + "name": "quick-checks", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302432078?check_suite_focus=true" + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903895963" }, { - "name": "test (distributed, 1, 1, linux.rocm.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302432150?check_suite_focus=true" + "name": "shellcheck", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896134" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896253" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896371" + }, + { + "name": "cmakelint", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896525" + }, + { + "name": "flake8-py3", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896658" + }, + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896771" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896795" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896838" + }, + { + "name": "mypy", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896897" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwMsZY=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_NZqw=", "hasNextPage": false } }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801853" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/73881411e2bfb3aaa2e89926a82390b4c587ad75/checks?check_suite_id=5743625458" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Ub0=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVZYxPI=" }, { "node": { @@ -8211,41 +13172,26 @@ }, "workflowRun": { "workflow": { - "name": "win-vs2019-cuda11.3-py3" + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" } }, "checkRuns": { "nodes": [ { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302019928?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, windows.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5303266925?check_suite_focus=true" - }, - { - "name": "test (force_on_cpu, 1, 1, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5303267017?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, windows.8xlarge.nvidia.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5303267128?check_suite_focus=true" + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440031/jobs/2903895828" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwZbzg=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_NYIw=", "hasNextPage": false } }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801855" + "conclusion": "SKIPPED", + 
"url": "https://github.com/pytorch/pytorch/commit/73881411e2bfb3aaa2e89926a82390b4c587ad75/checks?check_suite_id=5743625463" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Ub8=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVZYxPc=" }, { "node": { @@ -8255,485 +13201,1024 @@ }, "workflowRun": { "workflow": { - "name": "Lint" + "name": "pull" } }, "checkRuns": { "nodes": [ { - "name": "mypy", + "name": "pytorch-xla-linux-bionic-py3.7-clang8", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896014" + }, + { + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302019930?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896165" }, { - "name": "shellcheck", + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896394" + }, + { + "name": "linux-bionic-rocm4.5-py3.7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302020111?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896572" }, { - "name": "py2-setup-validate-errormsg", + "name": "linux-xenial-py3.7-clang7-asan / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302020318?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896666" }, { - "name": "clang-format", + "name": "linux-xenial-py3.7-clang7-onnx / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302020421?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896778" }, { - "name": "cmakelint", + "name": "linux-bionic-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302020539?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896837" }, { - "name": "toc", + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302020668?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896896" }, { - "name": "quick-checks", + "name": "linux-xenial-py3.7-gcc5.4 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302020780?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896936" }, { - "name": "clang-tidy", + "name": "linux-xenial-py3-clang5-mobile-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302020970?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897025" }, { - "name": "flake8-py3", + "name": "linux-xenial-py3.7-gcc7-no-ops / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302021124?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGbAQ=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801856" - }, - "cursor": 
"Y3Vyc29yOnYyOpHPAAAAAUK_UcA=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3.7-clang7-asan" - } - }, - "checkRuns": { - "nodes": [ + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897161" + }, { - "name": "build", + "name": "linux-xenial-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302020084?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897213" }, { - "name": "test (default, 3, 3, linux.2xlarge)", + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302192846?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897280" }, { - "name": "test (default, 1, 3, linux.2xlarge)", + "name": "win-vs2019-cpu-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302192926?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897368" }, { - "name": "test (default, 2, 3, linux.2xlarge)", + "name": "win-vs2019-cuda11.3-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302193029?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwJC4U=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801857" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_UcE=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" - } - }, - "checkRuns": { - "nodes": [ + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897431" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897476" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897578" + }, + { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897630" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897699" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897733" + }, + { + "name": "linux-docs / build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904327787" + }, + { + "name": "linux-docs / build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904327838" + }, + { + "name": 
"linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904327956" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904327997" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904328035" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904328093" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904328131" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904328177" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904333962" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904334006" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904430419" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904430459" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (noarch, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904430508" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904430573" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904443663" + }, { - "name": "build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302020092?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ_w=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801862" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_UcY=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3.7-gcc5.4" - } - }, - "checkRuns": { - "nodes": [ + "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904443723" + }, { - 
"name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302020048?check_suite_focus=true" + "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904443787" }, { - "name": "test (backwards_compat, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302147216?check_suite_focus=true" + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904454239" }, { - "name": "test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302147336?check_suite_focus=true" + "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904454303" }, { - "name": "test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302147409?check_suite_focus=true" + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904554602" }, { - "name": "test (distributed, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302147493?check_suite_focus=true" + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904554698" }, { - "name": "test (jit_legacy, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302147622?check_suite_focus=true" + "name": "win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904588855" }, { - "name": "test (docs_test, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302147822?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwIWu4=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801866" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Uco=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" - } - }, - "checkRuns": { - "nodes": [ + "name": "win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904588886" + }, { - "name": "build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5302019929?check_suite_focus=true" + "name": "win-vs2019-cuda11.3-py3 / test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904588924" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, 
linux.4xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904655702" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904656104" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904656150" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904656192" + }, + { + "name": "linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904706520" + }, + { + "name": "linux-bionic-rocm4.5-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904706565" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ1k=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_fN1g=", "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801869" + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/73881411e2bfb3aaa2e89926a82390b4c587ad75/checks?check_suite_id=5743625483" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Uc0=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVZYxQs=" } ], "pageInfo": { - "hasNextPage": true + "hasNextPage": false } }, - "pushedDate": "2022-02-23T10:39:30Z", - "oid": "346e0c547953d98eb84d23c1391a95badb9c4a22" + "status": { + "contexts": [ + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17048428?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17048429?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17048431?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17048430?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + } + ] + }, + "pushedDate": "2022-03-21T19:58:52Z", + "oid": "73881411e2bfb3aaa2e89926a82390b4c587ad75" } } ] }, - "changedFiles": 9, - "files": { + "changedFiles": 37, + "files": { + "nodes": [ + { + "path": "aten/src/ATen/core/interned_strings.h" + }, + { + "path": "caffe2/CMakeLists.txt" + }, + { + "path": "cmake/Dependencies.cmake" + }, + { + "path": "cmake/Modules/FindMKLDNN.cmake" + }, + { + "path": "cmake/public/mkldnn.cmake" + }, + { + "path": "docs/source/jit.rst" + }, + { + "path": 
"test/test_jit_llga_fuser.py" + }, + { + "path": "torch/_C/__init__.pyi.in" + }, + { + "path": "torch/csrc/jit/codegen/onednn/LlgaTensorImpl.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/LlgaTensorImpl.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/README.md" + }, + { + "path": "torch/csrc/jit/codegen/onednn/defer_size_check.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/defer_size_check.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/graph_fuser.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/graph_fuser.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/graph_helper.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/graph_helper.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/graph_rewriter.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/guard_shape.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/guard_shape.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/interface.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/interface.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/kernel.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/kernel.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/layout_propagation.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/layout_propagation.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/operator.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/prepare_binary.cpp" + }, + { + "path": "torch/csrc/jit/codegen/onednn/prepare_binary.h" + }, + { + "path": "torch/csrc/jit/codegen/onednn/register_interface.cpp" + }, + { + "path": "torch/csrc/jit/ir/alias_analysis.cpp" + }, + { + "path": "torch/csrc/jit/ir/ir.cpp" + }, + { + "path": "torch/csrc/jit/passes/inline_autodiff_subgraphs.cpp" + }, + { + "path": "torch/csrc/jit/passes/onednn_graph_fuser.h" + }, + { + "path": "torch/csrc/jit/python/init.cpp" + }, + { + "path": "torch/csrc/jit/runtime/operator.cpp" + }, + { + "path": "torch/jit/__init__.py" + } + ], + "pageInfo": { + "endCursor": "Mzc", + "hasNextPage": false + } + }, + "reviews": { "nodes": [ { - "path": "aten/src/ATen/native/GridSampler.cpp" + "author": { + "login": "pinzhenx" + }, + "state": "COMMENTED" }, { - "path": "aten/src/ATen/native/cpu/GridSamplerKernel.cpp" + "author": { + "login": "pinzhenx" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "pinzhenx" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "chunyuan-w" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "eellison" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": 
"sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "wukong1992" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "eellison" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "eellison" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "eellison" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "eellison" + }, + "state": "APPROVED" + }, + { + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "eellison" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "malfet" + }, + "state": "COMMENTED" }, { - "path": "aten/src/ATen/native/cuda/GridSampler.cpp" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "aten/src/ATen/native/cuda/GridSampler.cu" + "author": { + "login": "malfet" + }, + "state": "COMMENTED" }, { - "path": "aten/src/ATen/native/cuda/GridSampler.h" + "author": { + "login": "malfet" + }, + "state": "COMMENTED" }, { - "path": "aten/src/ATen/native/native_functions.yaml" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "test/forward_backward_compatibility/check_forward_backward_compatibility.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "test/test_nn.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "tools/autograd/derivatives.yaml" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" } ], "pageInfo": { - "endCursor": "OQ", - "hasNextPage": false + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMS0xMi0xMFQwOToyNDoxOS0wODowMLkyMDIxLTEyLTEwVDA5OjI0OjE5LTA4OjAwzjFryLE=", + "hasPreviousPage": false } }, - "reviews": { + "comments": { "nodes": [ { + "bodyText": "Looks like this broke master https://hud.pytorch.org/pytorch/pytorch/commit/7dd08230117f4fa8bb82b3524e90fb00340198c7. I am reverting.", "author": { - "login": "albanD" + "login": "suo" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1074498483 }, { + "bodyText": "@pytorchbot revert this", "author": { - "login": "coolteemf" + "login": "suo" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1074498550 }, { + "bodyText": "Looks like this broke master https://hud.pytorch.org/pytorch/pytorch/commit/7dd08230117f4fa8bb82b3524e90fb00340198c7. 
I am reverting.\n\nOops! Will fix it ASAP.", "author": { - "login": "albanD" + "login": "sanchitintel" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1074499668 }, { + "bodyText": "This pull request has been reverted by e5bf879. To re-land this change, please open another pull request, assignthe same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).", "author": { - "login": "coolteemf" + "login": "facebook-github-bot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1074508608 }, { + "bodyText": "This pull request has been reverted by e5bf879. To re-land this change, please open another pull request, assignthe same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).", "author": { - "login": "albanD" + "login": "facebook-github-bot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1082508130 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQAuLsw==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ + { + "node": { + "name": "oncall: jit" + } }, { - "author": { - "login": "coolteemf" - }, - "state": "COMMENTED" + "node": { + "name": "triaged" + } + }, + { + "node": { + "name": "open source" + } + }, + { + "node": { + "name": "cla signed" + } + }, + { + "node": { + "name": "Reverted" + } }, { + "node": { + "name": "intel priority" + } + } + ] + } + } + } + } + }, + "query_sha=62ce809793481ce6ddce6e1a19d9b0761755ff0ff75decaf8a79419eaf793110 cursor=Y3Vyc29yOnYyOpHOQAuLsw== name=pytorch number=68111 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "comments": { + "nodes": [ + { + "bodyText": "CI Flow Status\n\u269b\ufe0f CI Flow\nRuleset - Version: v1\nRuleset - File: https://github.com/chunyuan-w/pytorch/blob/7496bf1588050191595d833d23b8972b2f22655e/.github/generated-ciflow-ruleset.json\nPR ciflow labels: ciflow/default\n\n\n\nWorkflows\nLabels (bold enabled)\nStatus\n\n\n\n\nTriggered Workflows\n\n\n\n\nlinux-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk\n\u2705 triggered\n\n\nlinux-docs\nciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-vulkan-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-bazel-test\nciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-build\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-custom-build-static\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-asan\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-onnx\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 
triggered\n\n\nlinux-xenial-py3.7-gcc7\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7-no-ops\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nwin-vs2019-cpu-py3\nciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nwin-vs2019-cuda11.3-py3\nciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nSkipped Workflows\n\n\n\n\ncaffe2-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\ndocker-builds\nciflow/all, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-custom-ops\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-full-jit\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-metal\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64-full-jit\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda10.2-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-binary-conda\nciflow/binaries, ciflow/binaries/conda\n\ud83d\udeab skipped\n\n\nlinux-binary-libtorch-cxx11-abi\nciflow/binaries, ciflow/binaries/libtorch\n\ud83d\udeab skipped\n\n\nlinux-binary-libtorch-pre-cxx11\nciflow/binaries, ciflow/binaries/libtorch\n\ud83d\udeab skipped\n\n\nlinux-binary-manywheel\nciflow/binaries, ciflow/binaries/wheel\n\ud83d\udeab skipped\n\n\nlinux-bionic-cuda10.2-py3.9-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-docs-push\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-no-ops\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-arm64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-lite-interpreter-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-11-py3-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nparallelnative-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nperiodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, 
ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda11.1-py3.7-gcc7-debug\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-win-vs2019-cuda11.1-py3\nciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win\n\ud83d\udeab skipped\n\n\nperiodic-win-vs2019-cuda11.5-py3\nciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win\n\ud83d\udeab skipped\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-build\nciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\n\n\nYou can add a comment to the PR and tag @pytorchbot with the following commands:\n\n# ciflow rerun, \"ciflow/default\" will always be added automatically\n@pytorchbot ciflow rerun\n\n# ciflow rerun with additional labels \"-l \", which is equivalent to adding these labels manually and trigger the rerun\n@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow\n\nFor more information, please take a look at the CI Flow Wiki.", "author": { - "login": "coolteemf" + "login": "pytorch-probot" }, - "state": "COMMENTED" + "authorAssociation": "NONE", + "editor": { + "login": "pytorch-probot" + }, + "databaseId": 964902865 }, { + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/68111\nNeed help or want to give feedback on the CI? Visit our office hours\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 7388141 (more details on the Dr. 
CI page):\n\n\n29/29 failures introduced in this PR\n\n\n\ud83d\udd75\ufe0f 29 new failures recognized by patterns\nThe following CI failures do not appear to be due to upstream breakages:\n pull / linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge) (1/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:31:38.6978776Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:31:38.3001628Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:31:38.5169168Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:31:38.5362923Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:31:38.5413452Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:31:38.5458747Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:31:38.5484014Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:31:38.5497924Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:31:38.5656491Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:31:38.5678893Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:31:38.6888479Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0f6488c20adb4dca4\n2022-03-21T21:31:38.6978776Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:31:38.6992648Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:31:38.7003010Z ##[error]Process completed with exit code 2.\n2022-03-21T21:31:38.7044027Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:31:38.7044261Z with:\n2022-03-21T21:31:38.7044413Z env:\n2022-03-21T21:31:38.7044565Z IN_CI: 1\n2022-03-21T21:31:38.7044709Z IS_GHA: 1\n2022-03-21T21:31:38.7044885Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:31:38.7045067Z ##[endgroup]\n2022-03-21T21:31:38.7060958Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge) (2/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:35:19.2635222Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:35:18.9028722Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:35:19.1132721Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:35:19.1310590Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:35:19.1360251Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:35:19.1386865Z Requirement already satisfied: 
botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:35:19.1429182Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:35:19.1441925Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:35:19.1468280Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:35:19.1617667Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:35:19.2545368Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-098be2985e0392130\n2022-03-21T21:35:19.2635222Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:35:19.2648463Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:35:19.2658727Z ##[error]Process completed with exit code 2.\n2022-03-21T21:35:19.2706355Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:35:19.2706591Z with:\n2022-03-21T21:35:19.2706748Z env:\n2022-03-21T21:35:19.2706908Z IN_CI: 1\n2022-03-21T21:35:19.2707061Z IS_GHA: 1\n2022-03-21T21:35:19.2707246Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:35:19.2707438Z ##[endgroup]\n2022-03-21T21:35:19.2724554Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / win-vs2019-cuda11.3-py3 / test (force_on_cpu, 1, 1, windows.4xlarge) (3/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T23:11:57.5531419Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T23:11:52.7662022Z Downloading botocore-1.22.12-py3-none-any.whl (8.1 MB)\n2022-03-21T23:11:53.1213298Z ---------------------------------------- 8.1/8.1 MB 23.6 MB/s eta 0:00:00\n2022-03-21T23:11:53.1644665Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T23:11:53.2218699Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T23:11:53.2389674Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T23:11:53.2787295Z -------------------------------------- 247.7/247.7 KB 7.4 MB/s eta 0:00:00\n2022-03-21T23:11:53.3761842Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T23:11:53.5457622Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T23:11:57.4175080Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T23:11:57.5296815Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0105d4db093574f40\n2022-03-21T23:11:57.5531419Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T23:11:57.5564814Z + 
GHA_WORKFLOW_JOB_ID=\n2022-03-21T23:11:57.5587712Z ##[error]Process completed with exit code 2.\n2022-03-21T23:11:57.5790311Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T23:11:57.5790832Z with:\n2022-03-21T23:11:57.5791104Z env:\n2022-03-21T23:11:57.5791358Z IN_CI: 1\n2022-03-21T23:11:57.5791620Z IS_GHA: 1\n2022-03-21T23:11:57.5791939Z GIT_DEFAULT_BRANCH: master\n2022-03-21T23:11:57.5792425Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T23:11:57.5792884Z ##[endgroup]\n\n\n pull / linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu) (4/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T02:17:12.6257577Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T02:17:11.9280556Z Using cached https://files.pythonhosted.org/packages/7b/9c/f51775ebe7df5a7aa4e7c79ed671bde94e154bd968aca8d65bb24aba0c8c/s3transfer-0.5.2-py3-none-any.whl\n2022-03-22T02:17:11.9335199Z Collecting urllib3<1.27,>=1.25.4 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T02:17:11.9682045Z Using cached https://files.pythonhosted.org/packages/ec/03/062e6444ce4baf1eac17a6a0ebfe36bb1ad05e1df0e20b110de59c278498/urllib3-1.26.9-py2.py3-none-any.whl\n2022-03-22T02:17:11.9850357Z Collecting python-dateutil<3.0.0,>=2.1 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T02:17:12.0403171Z Using cached https://files.pythonhosted.org/packages/36/7a/87837f39d0296e723bb9b62bbb257d0355c7f6128853c78955f57342a56d/python_dateutil-2.8.2-py2.py3-none-any.whl\n2022-03-22T02:17:12.0468875Z Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T02:17:12.0590000Z Using cached https://files.pythonhosted.org/packages/d9/5a/e7c31adbe875f2abbb91bd84cf2dc52d792b5a01506781dbcf25c91daf11/six-1.16.0-py2.py3-none-any.whl\n2022-03-22T02:17:12.0607093Z Installing collected packages: jmespath, urllib3, six, python-dateutil, botocore, s3transfer, boto3\n2022-03-22T02:17:12.5273459Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2 six-1.16.0 urllib3-1.26.9\n2022-03-22T02:17:12.6032812Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 worker-rocm-amd-114\n2022-03-22T02:17:12.6257577Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T02:17:12.6259543Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T02:17:12.6291924Z ##[error]Process completed with exit code 2.\n2022-03-22T02:17:12.6387977Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-22T02:17:12.6388298Z with:\n2022-03-22T02:17:12.6388521Z wait-ssh: false\n2022-03-22T02:17:12.6388727Z env:\n2022-03-22T02:17:12.6388932Z IN_CI: 1\n2022-03-22T02:17:12.6389143Z IS_GHA: 1\n2022-03-22T02:17:12.6389368Z GIT_DEFAULT_BRANCH: master\n2022-03-22T02:17:12.6389669Z DOCKER_HOST: unix:///run/user/1121/docker.sock\n\n\n pull / linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge) (5/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:19:24.4890693Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:19:24.0962005Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:19:24.3152253Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:19:24.3341183Z Requirement already satisfied: boto3==1.19.12 
in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:19:24.3391374Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:19:24.3436392Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:19:24.3448982Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:19:24.3474092Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:19:24.3502003Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:19:24.3655072Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:19:24.4799309Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0bc9250521f338cae\n2022-03-21T22:19:24.4890693Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:19:24.4903625Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:19:24.4913841Z ##[error]Process completed with exit code 2.\n2022-03-21T22:19:24.4957338Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:19:24.4957575Z with:\n2022-03-21T22:19:24.4957735Z env:\n2022-03-21T22:19:24.4957900Z IN_CI: 1\n2022-03-21T22:19:24.4958055Z IS_GHA: 1\n2022-03-21T22:19:24.4958246Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:19:24.4958437Z ##[endgroup]\n2022-03-21T22:19:24.4989649Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-bionic-rocm4.5-py3.7 / test (default, 2, 2, linux.rocm.gpu) (6/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T01:05:07.6983899Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T01:05:06.8364546Z Using cached https://files.pythonhosted.org/packages/7b/9c/f51775ebe7df5a7aa4e7c79ed671bde94e154bd968aca8d65bb24aba0c8c/s3transfer-0.5.2-py3-none-any.whl\n2022-03-22T01:05:06.8431763Z Collecting urllib3<1.27,>=1.25.4 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T01:05:06.8949391Z Using cached https://files.pythonhosted.org/packages/ec/03/062e6444ce4baf1eac17a6a0ebfe36bb1ad05e1df0e20b110de59c278498/urllib3-1.26.9-py2.py3-none-any.whl\n2022-03-22T01:05:06.9180079Z Collecting python-dateutil<3.0.0,>=2.1 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T01:05:06.9803351Z Using cached https://files.pythonhosted.org/packages/36/7a/87837f39d0296e723bb9b62bbb257d0355c7f6128853c78955f57342a56d/python_dateutil-2.8.2-py2.py3-none-any.whl\n2022-03-22T01:05:06.9882133Z Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T01:05:07.0067062Z Using cached https://files.pythonhosted.org/packages/d9/5a/e7c31adbe875f2abbb91bd84cf2dc52d792b5a01506781dbcf25c91daf11/six-1.16.0-py2.py3-none-any.whl\n2022-03-22T01:05:07.0088676Z Installing collected packages: urllib3, jmespath, six, python-dateutil, botocore, s3transfer, boto3\n2022-03-22T01:05:07.5819667Z 
Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2 six-1.16.0 urllib3-1.26.9\n2022-03-22T01:05:07.6774717Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 worker-rocm-amd-60\n2022-03-22T01:05:07.6983899Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T01:05:07.6988652Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T01:05:07.7023073Z ##[error]Process completed with exit code 2.\n2022-03-22T01:05:07.7102087Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-22T01:05:07.7102389Z with:\n2022-03-22T01:05:07.7102603Z wait-ssh: false\n2022-03-22T01:05:07.7102820Z env:\n2022-03-22T01:05:07.7103015Z IN_CI: 1\n2022-03-22T01:05:07.7103224Z IS_GHA: 1\n2022-03-22T01:05:07.7103458Z GIT_DEFAULT_BRANCH: master\n2022-03-22T01:05:07.7103737Z DOCKER_HOST: unix:///run/user/1502/docker.sock\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge) (7/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:51:39.3637996Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T20:51:39.2041249Z Attempting uninstall: s3transfer\n2022-03-21T20:51:39.2043010Z Found existing installation: s3transfer 0.3.7\n2022-03-21T20:51:39.2083799Z Uninstalling s3transfer-0.3.7:\n2022-03-21T20:51:39.2089675Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T20:51:39.2480546Z Attempting uninstall: boto3\n2022-03-21T20:51:39.2482953Z Found existing installation: boto3 1.16.34\n2022-03-21T20:51:39.2584292Z Uninstalling boto3-1.16.34:\n2022-03-21T20:51:39.2599474Z Successfully uninstalled boto3-1.16.34\n2022-03-21T20:51:39.3130921Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T20:51:39.3550598Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-03ef7efc3078e3da5\n2022-03-21T20:51:39.3637996Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T20:51:39.3650651Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T20:51:39.3660484Z ##[error]Process completed with exit code 2.\n2022-03-21T20:51:39.3696465Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T20:51:39.3696693Z with:\n2022-03-21T20:51:39.3696850Z env:\n2022-03-21T20:51:39.3697012Z IN_CI: 1\n2022-03-21T20:51:39.3697161Z IS_GHA: 1\n2022-03-21T20:51:39.3697342Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:51:39.3697528Z ##[endgroup]\n2022-03-21T20:51:39.3730420Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge) (8/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:03:36.3916860Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:03:36.0096309Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:03:36.2278560Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:03:36.2461618Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:03:36.2513260Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:03:36.2541524Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in 
/home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:03:36.2554899Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:03:36.2598277Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:03:36.2758299Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:03:36.2780690Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:03:36.3825021Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0a4a552890e6ef7d3\n2022-03-21T21:03:36.3916860Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:03:36.3930343Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:03:36.3941263Z ##[error]Process completed with exit code 2.\n2022-03-21T21:03:36.3979258Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:03:36.3979496Z with:\n2022-03-21T21:03:36.3979654Z env:\n2022-03-21T21:03:36.3979814Z IN_CI: 1\n2022-03-21T21:03:36.3979968Z IS_GHA: 1\n2022-03-21T21:03:36.3980157Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:03:36.3980360Z ##[endgroup]\n2022-03-21T21:03:36.3996257Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu) (9/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T00:41:15.5325784Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T00:41:10.3015614Z Downloading s3transfer-0.5.2-py3-none-any.whl (79 kB)\n2022-03-22T00:41:10.3625659Z ---------------------------------------- 79.5/79.5 KB 1.1 MB/s eta 0:00:00\n2022-03-22T00:41:10.4120236Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-22T00:41:10.4170155Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-22T00:41:10.4722115Z -------------------------------------- 247.7/247.7 KB 5.2 MB/s eta 0:00:00\n2022-03-22T00:41:10.4843512Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-22T00:41:10.6596108Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-22T00:41:10.8733354Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-22T00:41:15.3745408Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-22T00:41:15.4987162Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-09cacc848abc3dd32\n2022-03-22T00:41:15.5325784Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T00:41:15.5373630Z + 
GHA_WORKFLOW_JOB_ID=\n2022-03-22T00:41:15.5404353Z ##[error]Process completed with exit code 2.\n2022-03-22T00:41:15.5790508Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-22T00:41:15.5791192Z with:\n2022-03-22T00:41:15.5791530Z env:\n2022-03-22T00:41:15.5791849Z IN_CI: 1\n2022-03-22T00:41:15.5792186Z IS_GHA: 1\n2022-03-22T00:41:15.5792599Z GIT_DEFAULT_BRANCH: master\n2022-03-22T00:41:15.5793237Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-22T00:41:15.5793831Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge) (10/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:50:32.9799307Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T20:50:32.8167560Z Attempting uninstall: s3transfer\n2022-03-21T20:50:32.8169351Z Found existing installation: s3transfer 0.3.7\n2022-03-21T20:50:32.8213295Z Uninstalling s3transfer-0.3.7:\n2022-03-21T20:50:32.8219209Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T20:50:32.8602320Z Attempting uninstall: boto3\n2022-03-21T20:50:32.8603289Z Found existing installation: boto3 1.16.34\n2022-03-21T20:50:32.8704535Z Uninstalling boto3-1.16.34:\n2022-03-21T20:50:32.8719403Z Successfully uninstalled boto3-1.16.34\n2022-03-21T20:50:32.9244278Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T20:50:32.9710449Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0c568461a276d4a71\n2022-03-21T20:50:32.9799307Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T20:50:32.9812238Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T20:50:32.9823052Z ##[error]Process completed with exit code 2.\n2022-03-21T20:50:32.9859290Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T20:50:32.9859527Z with:\n2022-03-21T20:50:32.9859664Z env:\n2022-03-21T20:50:32.9859817Z IN_CI: 1\n2022-03-21T20:50:32.9859977Z IS_GHA: 1\n2022-03-21T20:50:32.9860144Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:50:32.9860327Z ##[endgroup]\n2022-03-21T20:50:32.9893642Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge) (11/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:05:00.7163042Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in\n\n2022-03-21T21:05:00.6660824Z #10 0x55fc8a3ea801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037\n2022-03-21T21:05:00.6661768Z #11 0x55fc8a3f57a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961\n2022-03-21T21:05:00.6662455Z #12 0x55fc8a3f580b in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455\n2022-03-21T21:05:00.6663570Z #13 0x55fc8a3f5908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420\n2022-03-21T21:05:00.6663952Z #14 0x55fc8a3f5908 in pymain_run_python /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907\n2022-03-21T21:05:00.6664431Z #15 0x55fc8a3f5908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460\n2022-03-21T21:05:00.6665304Z #16 0x55fc8a3f5ccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495\n2022-03-21T21:05:00.7162113Z #17 
0x7f940d00f83f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291\n2022-03-21T21:05:00.7162534Z #18 0x55fc8a39a554 in _start (/opt/conda/bin/python3.7+0x1d7554)\n2022-03-21T21:05:00.7162711Z \n2022-03-21T21:05:00.7163042Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in \n2022-03-21T21:05:00.7334595Z + retcode=1\n2022-03-21T21:05:00.7334954Z + set -e\n2022-03-21T21:05:00.7335215Z + return 1\n2022-03-21T21:05:00.7338688Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]\n2022-03-21T21:05:00.7339232Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X ]]\n2022-03-21T21:05:00.7340113Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]\n2022-03-21T21:05:00.7340612Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\2 ]]\n2022-03-21T21:05:00.7341187Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]\n2022-03-21T21:05:00.7341668Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\5\\1\\2 ]]\n2022-03-21T21:05:00.7344466Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]\n\n\n pull / linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge) (12/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:06:03.4437430Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:06:03.0752199Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:06:03.2853252Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:06:03.3032326Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:06:03.3081589Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:06:03.3093911Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:06:03.3120244Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:06:03.3162406Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:06:03.3188431Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:06:03.3337181Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:06:03.4348072Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0ee48c8811fafc444\n2022-03-21T22:06:03.4437430Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:06:03.4450920Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:06:03.4461263Z ##[error]Process completed with exit code 2.\n2022-03-21T22:06:03.4502346Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:06:03.4502576Z with:\n2022-03-21T22:06:03.4502730Z env:\n2022-03-21T22:06:03.4502888Z IN_CI: 1\n2022-03-21T22:06:03.4503038Z IS_GHA: 1\n2022-03-21T22:06:03.4503302Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:06:03.4503492Z 
##[endgroup]\n2022-03-21T22:06:03.4519156Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (13/29)\nStep: \"Test\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:50:13.2205634Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T20:50:12.8679322Z + python3 -m pip install boto3==1.19.12\n2022-03-21T20:50:13.0744228Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T20:50:13.0916284Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T20:50:13.0964264Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T20:50:13.1005656Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T20:50:13.1017299Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T20:50:13.1041042Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T20:50:13.1189450Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T20:50:13.1208751Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T20:50:13.2119445Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0d02da60fd18c22f5\n2022-03-21T20:50:13.2205634Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T20:50:13.2217939Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T20:50:13.2220259Z ##[error]Process completed with exit code 2.\n2022-03-21T20:50:13.2248664Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T20:50:13.2249012Z with:\n2022-03-21T20:50:13.2249260Z env:\n2022-03-21T20:50:13.2249500Z IN_CI: 1\n2022-03-21T20:50:13.2249738Z IS_GHA: 1\n2022-03-21T20:50:13.2250025Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:50:13.2250329Z ##[endgroup]\n2022-03-21T20:50:13.2272735Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu) (14/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T23:47:38.0451999Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T23:47:37.5554508Z + python3 -m pip install boto3==1.19.12\n2022-03-21T23:47:37.8411473Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T23:47:37.8631484Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T23:47:37.8699561Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T23:47:37.8737037Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in 
/home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T23:47:37.8754443Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T23:47:37.8814393Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T23:47:37.8849540Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T23:47:37.9059579Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T23:47:38.0336298Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0b44f47f4292089a2\n2022-03-21T23:47:38.0451999Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T23:47:38.0469471Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T23:47:38.0484106Z ##[error]Process completed with exit code 2.\n2022-03-21T23:47:38.0532678Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T23:47:38.0533007Z with:\n2022-03-21T23:47:38.0533223Z env:\n2022-03-21T23:47:38.0533440Z IN_CI: 1\n2022-03-21T23:47:38.0533649Z IS_GHA: 1\n2022-03-21T23:47:38.0533902Z GIT_DEFAULT_BRANCH: master\n2022-03-21T23:47:38.0534170Z GPU_FLAG: --gpus all\n2022-03-21T23:47:38.0534401Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge) (15/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:04:59.3115800Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in\n\n2022-03-21T21:04:59.2595213Z #10 0x55a7f39a4801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037\n2022-03-21T21:04:59.2595707Z #11 0x55a7f39af7a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961\n2022-03-21T21:04:59.2597203Z #12 0x55a7f39af80b in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455\n2022-03-21T21:04:59.2598205Z #13 0x55a7f39af908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420\n2022-03-21T21:04:59.2598697Z #14 0x55a7f39af908 in pymain_run_python /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907\n2022-03-21T21:04:59.2599178Z #15 0x55a7f39af908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460\n2022-03-21T21:04:59.2599747Z #16 0x55a7f39afccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495\n2022-03-21T21:04:59.3114751Z #17 0x7f3b3822383f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291\n2022-03-21T21:04:59.3115277Z #18 0x55a7f3954554 in _start (/opt/conda/bin/python3.7+0x1d7554)\n2022-03-21T21:04:59.3115468Z \n2022-03-21T21:04:59.3115800Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in \n2022-03-21T21:04:59.3292385Z + retcode=1\n2022-03-21T21:04:59.3292781Z + set -e\n2022-03-21T21:04:59.3293062Z + return 1\n2022-03-21T21:04:59.3295462Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]\n2022-03-21T21:04:59.3295802Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X 
]]\n2022-03-21T21:04:59.3296394Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]\n2022-03-21T21:04:59.3296700Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\2 ]]\n2022-03-21T21:04:59.3297055Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]\n2022-03-21T21:04:59.3297416Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\5\\1\\2 ]]\n2022-03-21T21:04:59.3299623Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]\n\n\n pull / win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge) (16/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:14:31.7846086Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:14:25.5525714Z Collecting jmespath<1.0.0,>=0.7.1\n2022-03-21T22:14:25.5568155Z Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)\n2022-03-21T22:14:25.5952617Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T22:14:25.6169392Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T22:14:25.6629996Z -------------------------------------- 247.7/247.7 KB 5.1 MB/s eta 0:00:00\n2022-03-21T22:14:25.6710247Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:14:25.8284354Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:14:25.9816751Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T22:14:31.6672236Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T22:14:31.7630473Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0ed0915ecee5d2424\n2022-03-21T22:14:31.7846086Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:14:31.7876742Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:14:31.7897140Z ##[error]Process completed with exit code 2.\n2022-03-21T22:14:31.8195621Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T22:14:31.8196110Z with:\n2022-03-21T22:14:31.8196356Z env:\n2022-03-21T22:14:31.8196614Z IN_CI: 1\n2022-03-21T22:14:31.8196876Z IS_GHA: 1\n2022-03-21T22:14:31.8197169Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:14:31.8197652Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T22:14:31.8198093Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge) (17/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:19:15.8845728Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:19:15.5116060Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:19:15.7231476Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:19:15.7409711Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:19:15.7458478Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) 
(0.10.0)\n2022-03-21T21:19:15.7470508Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:19:15.7496799Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:19:15.7538362Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:19:15.7566161Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:19:15.7711630Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:19:15.8753543Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0e2b3b4ddb246ff2a\n2022-03-21T21:19:15.8845728Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:19:15.8859814Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:19:15.8870165Z ##[error]Process completed with exit code 2.\n2022-03-21T21:19:15.8917039Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:19:15.8917279Z with:\n2022-03-21T21:19:15.8917433Z env:\n2022-03-21T21:19:15.8917586Z IN_CI: 1\n2022-03-21T21:19:15.8917734Z IS_GHA: 1\n2022-03-21T21:19:15.8917917Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:19:15.8918102Z ##[endgroup]\n2022-03-21T21:19:15.8934572Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu) (18/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T23:19:48.5900162Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T23:19:48.0742254Z + python3 -m pip install boto3==1.19.12\n2022-03-21T23:19:48.3742563Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T23:19:48.3976536Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T23:19:48.4048700Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T23:19:48.4065374Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T23:19:48.4128076Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T23:19:48.4164273Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T23:19:48.4202610Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T23:19:48.4416723Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T23:19:48.5773033Z ++ python3 
.github/scripts/get_workflow_job_id.py 2018440039 i-07ab7a3c4a5402af2\n2022-03-21T23:19:48.5900162Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T23:19:48.5919822Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T23:19:48.5936087Z ##[error]Process completed with exit code 2.\n2022-03-21T23:19:48.6007930Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T23:19:48.6008268Z with:\n2022-03-21T23:19:48.6008483Z env:\n2022-03-21T23:19:48.6008701Z IN_CI: 1\n2022-03-21T23:19:48.6008920Z IS_GHA: 1\n2022-03-21T23:19:48.6009170Z GIT_DEFAULT_BRANCH: master\n2022-03-21T23:19:48.6009440Z GPU_FLAG: --gpus all\n2022-03-21T23:19:48.6009671Z ##[endgroup]\n\n\n pull / win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu) (19/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:54:04.2844259Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:53:59.0889659Z Downloading botocore-1.22.12-py3-none-any.whl (8.1 MB)\n2022-03-21T22:53:59.6881416Z ---------------------------------------- 8.1/8.1 MB 14.0 MB/s eta 0:00:00\n2022-03-21T22:53:59.7427779Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:53:59.7691882Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T22:53:59.7779847Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T22:53:59.8281663Z -------------------------------------- 247.7/247.7 KB 5.1 MB/s eta 0:00:00\n2022-03-21T22:54:00.0185115Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:54:00.2359770Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T22:54:04.1208891Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T22:54:04.2505862Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-03b4fbe63be8ef4b0\n2022-03-21T22:54:04.2844259Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:54:04.2891082Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:54:04.2919900Z ##[error]Process completed with exit code 2.\n2022-03-21T22:54:04.3377901Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T22:54:04.3378575Z with:\n2022-03-21T22:54:04.3378930Z env:\n2022-03-21T22:54:04.3379275Z IN_CI: 1\n2022-03-21T22:54:04.3379600Z IS_GHA: 1\n2022-03-21T22:54:04.3380023Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:54:04.3380691Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T22:54:04.3381278Z ##[endgroup]\n\n\n pull / linux-bionic-py3.7-clang9 / test (noarch, 1, 1, linux.2xlarge) (20/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:09:34.0074610Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:09:33.6365531Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:09:33.8475619Z Defaulting to user installation because normal 
site-packages is not writeable\n2022-03-21T22:09:33.8655152Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:09:33.8704395Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:09:33.8716774Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:09:33.8760145Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:09:33.8785000Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:09:33.8811316Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:09:33.8960134Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:09:33.9984866Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0d325eb9fd156146f\n2022-03-21T22:09:34.0074610Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:09:34.0087465Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:09:34.0101743Z ##[error]Process completed with exit code 2.\n2022-03-21T22:09:34.0154014Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:09:34.0154246Z with:\n2022-03-21T22:09:34.0154412Z env:\n2022-03-21T22:09:34.0154574Z IN_CI: 1\n2022-03-21T22:09:34.0154728Z IS_GHA: 1\n2022-03-21T22:09:34.0154917Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:09:34.0155112Z ##[endgroup]\n2022-03-21T22:09:34.0191047Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge) (21/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:03:17.8502655Z [E request_callbac...yUniqueId(created_on=0, local_id=0) to be created.\n\n2022-03-21T21:03:14.4669960Z INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpxgdsmeer\n2022-03-21T21:03:14.4671407Z INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpxgdsmeer/_remote_module_non_sriptable.py\n2022-03-21T21:03:14.4973023Z INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmp1i2hfmpc\n2022-03-21T21:03:14.4973800Z INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmp1i2hfmpc/_remote_module_non_sriptable.py\n2022-03-21T21:03:14.5532339Z INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpgx4da7b0\n2022-03-21T21:03:14.5533064Z INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpgx4da7b0/_remote_module_non_sriptable.py\n2022-03-21T21:03:14.7050673Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 0\n2022-03-21T21:03:14.7097127Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 3\n2022-03-21T21:03:14.7398339Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 2\n2022-03-21T21:03:14.7922283Z 
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 1\n2022-03-21T21:03:17.8502655Z [E request_callback_no_python.cpp:559] Received error while processing request type 261: false INTERNAL ASSERT FAILED at \"/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp\":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.\n2022-03-21T21:03:17.8503603Z Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):\n2022-03-21T21:03:17.8504385Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0x69 (0x7f180df19e19 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)\n2022-03-21T21:03:17.8505131Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string, std::allocator > const&) + 0xd2 (0x7f180df160e2 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)\n2022-03-21T21:03:17.8505927Z frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string, std::allocator > const&) + 0x4e (0x7f180df17a7e in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)\n2022-03-21T21:03:17.8506674Z frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x4b4 (0x7f18118b7b64 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n2022-03-21T21:03:17.8507642Z frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr >) const + 0x70 (0x7f18118a7bf0 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n2022-03-21T21:03:17.8508613Z frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector >) const + 0xc8 (0x7f1819736208 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)\n2022-03-21T21:03:17.8509749Z frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector >) const + 0x194 (0x7f18118ac914 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n2022-03-21T21:03:17.8510708Z frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector >) const + 0x65 (0x7f1819735865 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)\n2022-03-21T21:03:17.8511369Z frame #8: + 0x375249a (0x7f18118a949a in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test (22/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:01:07.7015580Z \ufffd[36;1m echo \"ERR...t available for the merge-base of your branch\"\ufffd[0m\n\n2022-03-21T20:01:07.7012399Z \ufffd[36;1mfi\ufffd[0m\n2022-03-21T20:01:07.7012634Z \ufffd[36;1m# Covers the case where a previous tag doesn't exist for the tree\ufffd[0m\n2022-03-21T20:01:07.7012992Z \ufffd[36;1m# this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. 
nightly\ufffd[0m\n2022-03-21T20:01:07.7013373Z \ufffd[36;1mif ! git rev-parse \"$MERGE_BASE:.circleci/docker\"; then\ufffd[0m\n2022-03-21T20:01:07.7013784Z \ufffd[36;1m echo \"Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit\"\ufffd[0m\n2022-03-21T20:01:07.7014149Z \ufffd[36;1m exit 1\ufffd[0m\n2022-03-21T20:01:07.7014325Z \ufffd[36;1mfi\ufffd[0m\n2022-03-21T20:01:07.7014573Z \ufffd[36;1mPREVIOUS_DOCKER_TAG=$(git rev-parse \"$MERGE_BASE:.circleci/docker\")\ufffd[0m\n2022-03-21T20:01:07.7014907Z \ufffd[36;1m# If no image exists but the hash is the same as the previous hash then we should error out here\ufffd[0m\n2022-03-21T20:01:07.7015231Z \ufffd[36;1mif [[ \"${PREVIOUS_DOCKER_TAG}\" = \"${DOCKER_TAG}\" ]]; then\ufffd[0m\n2022-03-21T20:01:07.7015580Z \ufffd[36;1m echo \"ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch\"\ufffd[0m\n2022-03-21T20:01:07.7015931Z \ufffd[36;1m echo \" contact the PyTorch team to restore the original images\"\ufffd[0m\n2022-03-21T20:01:07.7016225Z \ufffd[36;1m exit 1\ufffd[0m\n2022-03-21T20:01:07.7016400Z \ufffd[36;1mfi\ufffd[0m\n2022-03-21T20:01:07.7016608Z \ufffd[36;1mecho ::set-output name=rebuild::yes\ufffd[0m\n2022-03-21T20:01:07.7027605Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}\n2022-03-21T20:01:07.7027837Z env:\n2022-03-21T20:01:07.7028006Z IN_CI: 1\n2022-03-21T20:01:07.7028159Z IS_GHA: 1\n2022-03-21T20:01:07.7028346Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:01:07.7028589Z BASE_REVISION: 6643522db9ff595f564b8081de58b3a33c546178\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu) (23/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T00:49:54.2949572Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T00:49:53.8049151Z + python3 -m pip install boto3==1.19.12\n2022-03-22T00:49:54.0981629Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-22T00:49:54.1207562Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-22T00:49:54.1277146Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-22T00:49:54.1315027Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-22T00:49:54.1331813Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-22T00:49:54.1391622Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-22T00:49:54.1609217Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-22T00:49:54.1637417Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-22T00:49:54.2830197Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0f7c32fe13be12fea\n2022-03-22T00:49:54.2949572Z python3: can't open file 
'.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T00:49:54.2966933Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T00:49:54.2982588Z ##[error]Process completed with exit code 2.\n2022-03-22T00:49:54.3031464Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-22T00:49:54.3031794Z with:\n2022-03-22T00:49:54.3032012Z env:\n2022-03-22T00:49:54.3032227Z IN_CI: 1\n2022-03-22T00:49:54.3032434Z IS_GHA: 1\n2022-03-22T00:49:54.3032681Z GIT_DEFAULT_BRANCH: master\n2022-03-22T00:49:54.3033084Z GPU_FLAG: --gpus all\n2022-03-22T00:49:54.3033312Z ##[endgroup]\n\n\n pull / win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge) (24/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:56:12.5872636Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:56:07.3365589Z Downloading botocore-1.22.12-py3-none-any.whl (8.1 MB)\n2022-03-21T21:56:07.7926584Z ---------------------------------------- 8.1/8.1 MB 17.3 MB/s eta 0:00:00\n2022-03-21T21:56:07.9319362Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T21:56:07.9366132Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T21:56:08.0077590Z -------------------------------------- 247.7/247.7 KB 3.0 MB/s eta 0:00:00\n2022-03-21T21:56:08.0164070Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:56:08.1775537Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:56:08.3393469Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T21:56:12.4576766Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T21:56:12.5641959Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0afad69838118af0e\n2022-03-21T21:56:12.5872636Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:56:12.5905611Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:56:12.5927729Z ##[error]Process completed with exit code 2.\n2022-03-21T21:56:12.6239531Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T21:56:12.6240039Z with:\n2022-03-21T21:56:12.6240299Z env:\n2022-03-21T21:56:12.6240557Z IN_CI: 1\n2022-03-21T21:56:12.6240805Z IS_GHA: 1\n2022-03-21T21:56:12.6241118Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:56:12.6241613Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T21:56:12.6242052Z ##[endgroup]\n\n\n pull / linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge) (25/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:46:39.5474616Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:46:39.1884210Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:46:39.3928976Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:46:39.4105069Z Requirement already satisfied: boto3==1.19.12 in 
/home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:46:39.4152571Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:46:39.4194931Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:46:39.4218947Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:46:39.4230812Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:46:39.4380089Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:46:39.4399461Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:46:39.5387703Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0888bed1149cca415\n2022-03-21T21:46:39.5474616Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:46:39.5487145Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:46:39.5497480Z ##[error]Process completed with exit code 2.\n2022-03-21T21:46:39.5541319Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:46:39.5541544Z with:\n2022-03-21T21:46:39.5541698Z env:\n2022-03-21T21:46:39.5541851Z IN_CI: 1\n2022-03-21T21:46:39.5541997Z IS_GHA: 1\n2022-03-21T21:46:39.5542176Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:46:39.5542361Z ##[endgroup]\n2022-03-21T21:46:39.5557878Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge) (26/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:34:57.0623859Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:34:56.9039884Z Attempting uninstall: s3transfer\n2022-03-21T21:34:56.9041446Z Found existing installation: s3transfer 0.3.7\n2022-03-21T21:34:56.9090783Z Uninstalling s3transfer-0.3.7:\n2022-03-21T21:34:56.9095968Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T21:34:56.9453014Z Attempting uninstall: boto3\n2022-03-21T21:34:56.9454356Z Found existing installation: boto3 1.16.34\n2022-03-21T21:34:56.9564320Z Uninstalling boto3-1.16.34:\n2022-03-21T21:34:56.9578035Z Successfully uninstalled boto3-1.16.34\n2022-03-21T21:34:57.0091363Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T21:34:57.0536230Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-034a3afd5d80b91fd\n2022-03-21T21:34:57.0623859Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:34:57.0637167Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:34:57.0647396Z ##[error]Process completed with exit code 2.\n2022-03-21T21:34:57.0688237Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:34:57.0688481Z with:\n2022-03-21T21:34:57.0688631Z env:\n2022-03-21T21:34:57.0688769Z IN_CI: 1\n2022-03-21T21:34:57.0688930Z IS_GHA: 1\n2022-03-21T21:34:57.0689109Z 
GIT_DEFAULT_BRANCH: master\n2022-03-21T21:34:57.0689462Z ##[endgroup]\n2022-03-21T21:34:57.0704768Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge) (27/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:05:00.7896545Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in\n\n2022-03-21T21:05:00.7395504Z #10 0x5597fd5a9801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037\n2022-03-21T21:05:00.7396330Z #11 0x5597fd5b47a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961\n2022-03-21T21:05:00.7396688Z #12 0x5597fd5b480b in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455\n2022-03-21T21:05:00.7398664Z #13 0x5597fd5b4908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420\n2022-03-21T21:05:00.7399177Z #14 0x5597fd5b4908 in pymain_run_python /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907\n2022-03-21T21:05:00.7399663Z #15 0x5597fd5b4908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460\n2022-03-21T21:05:00.7399986Z #16 0x5597fd5b4ccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495\n2022-03-21T21:05:00.7895241Z #17 0x7f0a5905983f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291\n2022-03-21T21:05:00.7895772Z #18 0x5597fd559554 in _start (/opt/conda/bin/python3.7+0x1d7554)\n2022-03-21T21:05:00.7896033Z \n2022-03-21T21:05:00.7896545Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in \n2022-03-21T21:05:00.8063448Z + retcode=1\n2022-03-21T21:05:00.8063787Z + set -e\n2022-03-21T21:05:00.8064058Z + return 1\n2022-03-21T21:05:00.8067638Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]\n2022-03-21T21:05:00.8068127Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X ]]\n2022-03-21T21:05:00.8069018Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]\n2022-03-21T21:05:00.8069500Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\2 ]]\n2022-03-21T21:05:00.8070105Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]\n2022-03-21T21:05:00.8070580Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\5\\1\\2 ]]\n2022-03-21T21:05:00.8072640Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu) (28/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:48:17.3384813Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:48:16.8599645Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:48:17.1464241Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:48:17.1685222Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:48:17.1754164Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:48:17.1771662Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:48:17.1808722Z 
Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:48:17.1868636Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:48:17.1903889Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:48:17.2113746Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:48:17.3267404Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-01fe178c405417375\n2022-03-21T22:48:17.3384813Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:48:17.3402286Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:48:17.3418376Z ##[error]Process completed with exit code 2.\n2022-03-21T22:48:17.3470528Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:48:17.3470874Z with:\n2022-03-21T22:48:17.3471096Z env:\n2022-03-21T22:48:17.3471327Z IN_CI: 1\n2022-03-21T22:48:17.3471538Z IS_GHA: 1\n2022-03-21T22:48:17.3471802Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:48:17.3472083Z GPU_FLAG: --gpus all\n2022-03-21T22:48:17.3472322Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge) (29/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:16:38.9646300Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:16:38.7995969Z Attempting uninstall: s3transfer\n2022-03-21T21:16:38.7998039Z Found existing installation: s3transfer 0.3.7\n2022-03-21T21:16:38.8066994Z Uninstalling s3transfer-0.3.7:\n2022-03-21T21:16:38.8072844Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T21:16:38.8449275Z Attempting uninstall: boto3\n2022-03-21T21:16:38.8451430Z Found existing installation: boto3 1.16.34\n2022-03-21T21:16:38.8559828Z Uninstalling boto3-1.16.34:\n2022-03-21T21:16:38.8574290Z Successfully uninstalled boto3-1.16.34\n2022-03-21T21:16:38.9100438Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T21:16:38.9558098Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0d779c59d277d32ee\n2022-03-21T21:16:38.9646300Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:16:38.9658894Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:16:38.9673240Z ##[error]Process completed with exit code 2.\n2022-03-21T21:16:38.9720106Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:16:38.9720333Z with:\n2022-03-21T21:16:38.9720485Z env:\n2022-03-21T21:16:38.9720645Z IN_CI: 1\n2022-03-21T21:16:38.9720793Z IS_GHA: 1\n2022-03-21T21:16:38.9720970Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:16:38.9721151Z ##[endgroup]\n2022-03-21T21:16:38.9736762Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. 
CI Users group.\nClick here to manually regenerate this comment.", "author": { - "login": "albanD" + "login": "facebook-github-bot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 964902894 }, { + "bodyText": "@vitaly-fedyunin @gottbrath FYI that this is the oneDNN Graph API integration. It depends on the #63748.", "author": { - "login": "coolteemf" + "login": "Jianhui-Li" }, - "state": "COMMENTED" + "authorAssociation": "NONE", + "editor": null, + "databaseId": 970451860 }, { + "bodyText": "CI failures are currently being caused by some issues in the CI infra, and are also occurring with other PRs.", "author": { - "login": "albanD" + "login": "sanchitintel" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 990641309 }, { + "bodyText": "CI failures are unrelated.", "author": { - "login": "albanD" + "login": "sanchitintel" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 991281407 }, { + "bodyText": "The CI failure is unrelated.", "author": { - "login": "coolteemf" + "login": "sanchitintel" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 995389295 }, { + "bodyText": "Hi, thank you for the PR!\nDo you mind running a larger amount of torchbench and reporting numbers ? You can look at Jason's post here for what models are supported in script. Initially just the vision models would be useful. @Krovatkin also did some benchmarking of a traced Bert model and found on average a ~16% speedup with this PR.", "author": { - "login": "albanD" + "login": "eellison" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1015689390 }, { + "bodyText": "Thanks a lot for reviewing, @eellison & @Krovatkin!\nWe just wanted to let you know that we're working on the benchmarking & will get back to you in a day, or two.\nUPDATE (Jan 21): While running some TorchBench models, we discovered some composability issues, and are working to ensure that oneDNN Graph would complement PyTorch's existing fusion capabilities, not hinder them.\nUPDATE (Jan 24): We've resolved the issues & will update this PR later today. Thanks!", "author": { - "login": "coolteemf" + "login": "sanchitintel" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": { + "login": "sanchitintel" + }, + "databaseId": 1016996190 }, { + "bodyText": "Hello @eellison,\nWe used this TorchBench branch for comparison. compare_llga.sh can be run for comparison.\nFor benchmarking mobilenet_v3_large with hardswish support in oneDNN Graph, this oneDNN Graph branch can be used in third_party/ideep/mkl-dnn. It delivers a speedup over PyTorch JIT (NNC + OFI) because 21 additional reorders are prevented (the major factor here), and fusion with conv also helps further.\nThe next release of oneDNN Graph would have hardswish support.\nWe're also exploring adding a hardsigmoid op in oneDNN Graph.\nThank you!", "author": { - "login": "albanD" + "login": "sanchitintel" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": { + "login": "sanchitintel" + }, + "databaseId": 1022709513 }, { + "bodyText": "Please note that this PR should be merged after #71546, as #71546 changes the third_party/ideep commit (this PR also uses that ideep commit, but it'd probably be better to merge #71546 first, so that oneDNN v2.5.2 upgrade would be in a separate PR). 
Thank you!", "author": { - "login": "albanD" + "login": "sanchitintel" }, - "state": "APPROVED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1026330085 }, { + "bodyText": "@sanchitintel mind rebasing and i'll land ?", "author": { - "login": "albanD" + "login": "eellison" }, - "state": "APPROVED" - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wMS0yNVQxMDoyODoxMC0wNjowMLkyMDIyLTAxLTI1VDA5OjU0OjA1LTA2OjAwzjNooqI=", - "hasPreviousPage": false - } - }, - "comments": { - "nodes": [ + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1055813984 + }, { - "bodyText": "Merge failed due to 'NoneType' object is not subscriptable\nRaised by https://github.com/pytorch/pytorch/actions/runs/1887945630", + "bodyText": "@eellison has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", "author": { - "login": "pytorchmergebot" + "login": "facebook-github-bot" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1048868910 + "databaseId": 1057203495 }, { - "bodyText": "Thanks for the update! The windows failure is not your fault, you can ignore it!\n\nThank you very much for all of your feedback and sorry for the delay !", + "bodyText": "Thanks a lot for taking a look, @eellison! To fix this error, we would enable Bazel build for oneDNN Graph.", "author": { - "login": "coolteemf" + "login": "sanchitintel" }, - "authorAssociation": "CONTRIBUTOR", + "authorAssociation": "COLLABORATOR", + "editor": { + "login": "sanchitintel" + }, + "databaseId": 1061230087 + }, + { + "bodyText": "@eellison has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1048983572 + "databaseId": 1063276600 }, { - "bodyText": "@coolteemf can you please send either me or @albanD an email? (or I can send you and invite to collab on private repo)", + "bodyText": "@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", "author": { - "login": "malfet" + "login": "facebook-github-bot" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1049048119 + "databaseId": 1074355779 }, { - "bodyText": "@pytorchbot merge this please", + "bodyText": "And graph_rewriter.cpp is full of DOS newlines...", "author": { - "login": "albanD" + "login": "malfet" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1049131992 + "databaseId": 1074407452 }, { - "bodyText": "Hey @coolteemf.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "bodyText": "Hey @chunyuan-w.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' 
label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", "author": { "login": "github-actions" }, "authorAssociation": "NONE", "editor": null, - "databaseId": 1049134520 + "databaseId": 1074471758 + }, + { + "bodyText": "Thanks a ton for your help, @malfet & @eellison! :)\nWe'll incorporate your suggestions in subsequent PR(s).", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "COLLABORATOR", + "editor": { + "login": "sanchitintel" + }, + "databaseId": 1074492365 } ], "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOPoR4Lg==", - "hasPreviousPage": true + "startCursor": "Y3Vyc29yOnYyOpHOOYM_0Q==", + "hasPreviousPage": false } - }, - "labels": { - "edges": [ - { - "node": { - "name": "triaged" - } - }, - { - "node": { - "name": "open source" - } - }, - { - "node": { - "name": "cla signed" - } - }, - { - "node": { - "name": "release notes: nn" - } - }, - { - "node": { - "name": "topic: performance" - } - } - ] } } } } }, - "query_sha=0e2a29eda6405cea4c9de20fb80ae7924910e17272a7b251040182e7d8c390e0 name=pytorch number=75095 owner=pytorch": { + "query_sha=e29d0e1d73b9847dacfd671c80e117d82111b407f7daa8ff885d3c444eafe47f name=pytorch number=73969 owner=pytorch": { "data": { "repository": { "pullRequest": { "closed": true, - "isCrossRepository": false, + "isCrossRepository": true, "author": { - "login": "mruberry" + "login": "malfet" }, - "title": "Initial prims, references, and test architecture for them", - "body": "This PR adds an initial set of experimental primitive operations and Python references that reimplement existing PyTorch operations using them. See https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-0/577 for additional context.\r\n\r\nThe following experimental primitives are added:\r\n\r\n- Elementwise unary prims -- abs, acos, acosh, asin, atan, cos, cosh, bessel_i0e, bessel_i1e, cbrt, ceil, digamma, erf, erf_inv, erfc, exp, expm1, floor, igamma, igammac, is_finite, lgamma, log, log1p, neg, reciprocal, round, sign, sinh, sqrt, square, tan. \r\n- Elementwise binary prims -- add, atan2, bitwise_and, bitwise_not, bitwise_or, bitwise_xor, div, eq, ge, gt, le, lt, max, min, mul, ne, nextafter, pow, rsqrt, shift_left, shift_right_arithmetic\r\n- View prims -- brodcast_in_dim, collapse_view, split_dim, squeeze\r\n- Shape prims -- collapse, concatenate, reshape\r\n- Conditional prims -- select\r\n- Data conversion & movement prims -- convert_element_type, device_put\r\n- Inplace prims -- copy_to, resize\r\n\r\nThese primitives do not add any new functionality to PyTorch, but are intended to be the semantic building blocks for reference operators. We have tried to make them consistent with the operations in [jax.lax](https://jax.readthedocs.io/en/latest/jax.lax.html) where possible (because PyTorch prefers being consistent with other frameworks), although there are key differences between these prims and operations in jax.lax. 
Most notably is that these prims model view semantics and inplace operations.\r\n\r\nIn addition to these primitives the following elementwise binary Python references are added:\r\n\r\n- Elementwise binary Python references -- add, atan2, bitwise_and, bitwise_left_shift, bitwise_or, bitwise_right_shift, bitwise_xor, eq, float_power, ge, gt, le, lt, maximum, minimum, mul, ne, nextafter, pow, sub, true_divide\r\n- Conditional Python references - where\r\n- Data conversion & movement references - copy_to\r\n\r\nA Python reference implements the same behavior as its corresponding PyTorch operator (excepting slight numerical differences, bug fixes, and in some cases additional features). \r\n\r\nThe start of an OpInfo-based test architecture for these references is also included in this PR. A new list, `python_ref_db`, is added to `common_methods_invocations.py`. This list introduces the new `ElementwiseBinaryPythonRefInfo`, which inherits input arguments from the original operators' OpInfo, allows them to be overridden, and then constructs the OpInfo for the Python reference using the (potentially modified) arguments. OpInfo-based tests can opt-into testing references by including this new list in the Sequence passed to the `@ops` decorator. \r\n\r\ncc @ngimel @csarofeen @kevinstephano @Lezcano ", - "headRefName": "prims_and_references", + "title": "Dummy change", + "body": "Test Plan: None at all\n\nDifferential Revision: D34753911\n\n", + "headRefName": "export-D34753911", "headRepository": { - "nameWithOwner": "pytorch/pytorch" + "nameWithOwner": "malfet/pytorch" }, "baseRefName": "master", "baseRepository": { @@ -8744,299 +14229,781 @@ } }, "mergeCommit": null, - "commits_with_authors": { - "nodes": [ - { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "a790467c650be92775103cde5e866c90b56f5376" - } - }, - { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "bd6fcf50692e208ebecdc2eaa517a2bfcdcd35cf" - } - }, - { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "4a119c8f21529fe1375e7e8789b91f41a3df80c5" - } - }, - { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "ea6750dc34d66be759fdfe84b09fb0e23ee59c79" - } - }, - { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "2eef8a55fe0227e1921b51bf1f56f9d0a29b49ac" - } - }, - { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "b886ed6c20dd1785fd31ed6fa6a8c5b6d0d0b16c" - } - }, - { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "9ad9b63d09aa4f7a8549bcf1d88ea4ff0674299c" - } - }, - { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "63fdd580118477416ae160e0670ae722ea248090" - } - }, - { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "0ccf7dc292af1d40d0a094eb2b2fb0c7ab4ccc70" - } - }, - { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "e8a8a4d1fbe35f20eb88e1a43cf5a653883638e5" - } - }, - { - 
"commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "186634dfdd25645c05b58a212f9e8d77c4125fc0" - } - }, - { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "f5b4741312b5c42a79f6c8a1d3930b79db38ed8f" - } - }, - { - "commit": { - "author": { - "user": { - "login": "ezyang" - }, - "email": "ezyang@fb.com", - "name": "Edward Z. Yang" - }, - "oid": "23d50391bb0fd12111fd3171591c4235ffb2fc1a" - } - }, - { - "commit": { - "author": { - "user": { - "login": "ezyang" - }, - "email": "ezyang@fb.com", - "name": "Edward Z. Yang" - }, - "oid": "bac9d45422d58f513b60b4b854441cfdc253d4c5" - } - }, - { - "commit": { - "author": { - "user": { - "login": "ezyang" - }, - "email": "ezyang@fb.com", - "name": "Edward Z. Yang" - }, - "oid": "13240ae0b4a0332c3167b65ac026a3172da90cb7" - } - }, - { - "commit": { - "author": { - "user": { - "login": "ezyang" - }, - "email": "ezyang@fb.com", - "name": "Edward Z. Yang" - }, - "oid": "1ee34468cb1db3dc6cbae204669f4fec20e2a466" - } - }, - { - "commit": { - "author": { - "user": { - "login": "ezyang" - }, - "email": "ezyang@fb.com", - "name": "Edward Z. Yang" - }, - "oid": "561d132bc686d00e8911f7feb3da5901b2bdc574" - } - }, - { - "commit": { - "author": { - "user": { - "login": "ngimel" - }, - "email": "ngimel@fb.com", - "name": "Natalia Gimelshein" - }, - "oid": "ac42bedc84b7c96256376ad09917263bb020b2c3" - } - }, + "commits_with_authors": { + "nodes": [ { "commit": { "author": { "user": { - "login": "ngimel" + "login": "malfet" }, - "email": "ngimel@fb.com", - "name": "Natalia Gimelshein" - }, - "oid": "7f7d5ba40a0b5e10526d90b018b30b54673d12d8" - } - }, - { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" + "email": "nshulga@fb.com", + "name": "Nikita Shulga" }, - "oid": "37a6b4a8b1adb712d5777c7c3479866c27fb3c4e" + "oid": "4746da707a9912356f5179625da89616b228dc21" } - }, + } + ], + "pageInfo": { + "endCursor": "MQ", + "hasNextPage": false + }, + "totalCount": 1 + }, + "commits": { + "nodes": [ { "commit": { - "author": { - "user": { - "login": "ngimel" - }, - "email": "ngimel@fb.com", - "name": "Natalia Gimelshein" + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-vulkan-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280134/jobs/2794078044" + }, + { + "name": "test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280134/jobs/2794189060" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRQMQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592963" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-QM=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280135/jobs/2794078023" + } + ], + "pageInfo": { + "endCursor": 
"Y3Vyc29yOnYyOpHPAAAAAUbO2aM=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592965" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-QU=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-rocm4.5-py3.7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280132/jobs/2794078060" + }, + { + "name": "test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280132/jobs/2794292071" + }, + { + "name": "test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280132/jobs/2794292205" + }, + { + "name": "test (distributed, 1, 1, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280132/jobs/2794292306" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbTiXw=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592966" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-QY=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cuda11.3-py3" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280139/jobs/2794078053" + }, + { + "name": "test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280139/jobs/2794536907" + }, + { + "name": "test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280139/jobs/2794536998" + }, + { + "name": "test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280139/jobs/2794537089" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbY_vU=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592967" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Qc=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280136/jobs/2794078031" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2ao=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592969" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Qk=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-docs" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + 
"conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280138/jobs/2794078055" + }, + { + "name": "build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280138/jobs/2794183768" + }, + { + "name": "build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280138/jobs/2794183828" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRIt0=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592970" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Qo=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280140/jobs/2794078017" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280140/jobs/2794181109" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280140/jobs/2794181305" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280140/jobs/2794181488" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRFm4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592971" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Qs=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-build" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280143/jobs/2794078025" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2aw=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592974" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Q4=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "shellcheck", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078028" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078196" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078407" + }, + { + "name": "clang-format", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078610" + }, + { + "name": "cmakelint", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078760" + }, + { + "name": "toc", 
+ "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078898" + }, + { + "name": "py2-setup-validate-errormsg", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078999" + }, + { + "name": "flake8-py3", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794079087" + }, + { + "name": "mypy", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794079199" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO4Es=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592975" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Q8=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280146/jobs/2794078040" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2b0=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592976" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-RA=" + } + ], + "pageInfo": { + "hasNextPage": true + } }, - "oid": "65b613868c44e519c1777af79b9fd3498c5a7e58" - } - }, - { - "commit": { - "author": { - "user": { - "login": "ngimel" - }, - "email": "ngimel@fb.com", - "name": "Natalia Gimelshein" + "status": { + "contexts": [ + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17040614?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17040643?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17040615?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + } + ] }, - "oid": "442c405e9da0d66744ef03e379224c41eedf5b57" + "pushedDate": "2022-03-09T15:57:16Z", + "oid": "4746da707a9912356f5179625da89616b228dc21" } - }, + } + ] + }, + "changedFiles": 1, + "files": { + "nodes": [ { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "031ac49ae9c192989385986b6707fa781e3229e0" - } - }, + "path": "tools/build_variables.bzl" + } + ], + "pageInfo": { + "endCursor": "MQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [], + "pageInfo": { + "startCursor": null, + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "9a6c3b00039c0c985c1c9cb59490012d1c0b38ba" - } + "bodyText": "CI Flow 
Status\n\u269b\ufe0f CI Flow\nRuleset - Version: v1\nRuleset - File: https://github.com/malfet/pytorch/blob/4746da707a9912356f5179625da89616b228dc21/.github/generated-ciflow-ruleset.json\nPR ciflow labels: ciflow/default\nAdd ciflow labels to this PR to trigger more builds:\n\n\n\nWorkflows\nLabels (bold enabled)\nStatus\n\n\n\n\nTriggered Workflows\n\n\n\n\nlinux-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nlinux-binary-libtorch-cxx11-abi\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nlinux-binary-libtorch-pre-cxx11\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nlinux-binary-manywheel\nciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nlinux-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk\n\u2705 triggered\n\n\nlinux-bionic-rocm4.5-py3.7\nciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk\n\u2705 triggered\n\n\nlinux-docs\nciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-vulkan-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-bazel-test\nciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-build\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-custom-build-static\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-asan\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-onnx\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build\nciflow/all, ciflow/cpu, ciflow/default, ciflow/libtorch, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7-no-ops\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nmacos-arm64-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nmacos-arm64-binary-wheel\nciflow/binaries, ciflow/binaries_wheel, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-libtorch-cxx11-abi\nciflow/binaries, ciflow/binaries_libtorch, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-libtorch-pre-cxx11\nciflow/binaries, ciflow/binaries_libtorch, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-wheel\nciflow/binaries, ciflow/binaries_wheel, ciflow/default\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 
triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nwin-vs2019-cpu-py3\nciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nwin-vs2019-cuda11.3-py3\nciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nwindows-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nwindows-binary-libtorch-debug\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nwindows-binary-libtorch-release\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nwindows-binary-wheel\nciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nSkipped Workflows\n\n\n\n\ncaffe2-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\ndocker-builds\nciflow/all, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-custom-ops\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-metal\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda10.2-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-bionic-cuda10.2-py3.9-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-docs-push\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-no-ops\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-arm64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-lite-interpreter-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-11-py3-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nparallelnative-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nperiodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda11.3-py3.7-gcc7-debug\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-win-vs2019-cuda11.5-py3\nciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win\n\ud83d\udeab 
skipped\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-build\nciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\npytorch-xla-linux-bionic-py3.7-clang8\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla\n\ud83d\udeab skipped", + "author": { + "login": "pytorch-bot" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1063079053 }, { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "d5c30e408af1889b90012d2e09f6ec3cda333bcb" - } + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/73969\n\ud83d\udcc4 \u00a0Preview docs built from this PR\n\ud83d\udcc4 \u00a0Preview C++ docs built from this PR\n\ud83d\udd27 \u00a0Opt-in to CIFlow to control what jobs run on your PRs\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 4746da7 (more details on the Dr. CI page):\n\n\ud83d\udc9a \ud83d\udc9a Looks good so far! There are no failures yet. \ud83d\udc9a \ud83d\udc9a\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 1063079113 }, { - "commit": { - "author": { - "user": null, - "email": "mruberry@devfair044.h1.fair", - "name": "Mike Ruberry" - }, - "oid": "db355d55655bb252a699cd532441bb98e52b98d5" + "bodyText": "This pull request was exported from Phabricator. Differential Revision: D34753911", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1063079731 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOP11MjQ==", + "hasPreviousPage": false + } + }, + "labels": { + "edges": [ + { + "node": { + "name": "fb-exported" + } + }, + { + "node": { + "name": "cla signed" } } - ], - "pageInfo": { - "endCursor": "MjY", - "hasNextPage": false - }, - "totalCount": 26 - }, + ] + } + } + } + } + }, + "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAU2F-RA= name=pytorch number=73969 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { "commits": { "nodes": [ { "commit": { + "oid": "4746da707a9912356f5179625da89616b228dc21", "checkSuites": { "edges": [ + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280141/jobs/2794078056" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2c8=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592977" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-RE=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Test tools" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280142/jobs/2794078033" + } + ], + 
"pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2as=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592978" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-RI=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280144/jobs/2794078046" + }, + { + "name": "test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280144/jobs/2794338293" + }, + { + "name": "test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280144/jobs/2794338408" + }, + { + "name": "test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280144/jobs/2794338568" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbUkMA=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592980" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-RQ=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7-no-ops" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280148/jobs/2794078065" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2d8=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592981" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-RU=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cpu-py3" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280149/jobs/2794078067" + }, + { + "name": "test (default, 2, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280149/jobs/2794407041" + }, + { + "name": "test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280149/jobs/2794407168" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbWDX8=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592982" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-RY=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/1958280150/jobs/2794078029" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2aQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592983" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Rc=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-asan" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280151/jobs/2794078062" + }, + { + "name": "test (default, 3, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280151/jobs/2794225603" + }, + { + "name": "test (default, 1, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280151/jobs/2794225793" + }, + { + "name": "test (default, 2, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280151/jobs/2794226005" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbSD-k=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592985" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Rk=" + }, { "node": { "app": { @@ -9053,115 +15020,313 @@ }, { "name": "Meta Internal-Only Changes Check", - "conclusion": "SUCCESS", + "conclusion": "NEUTRAL", "detailsUrl": "https://opensource.facebook.com/" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAW6ux14=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO574=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454954" + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592986" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC2o=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Ro=" }, { "node": { "app": { - "name": "Netlify", - "databaseId": 13473 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-xla-linux-bionic-py3.7-clang8" + } }, - "workflowRun": null, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280152/jobs/2794078032" + }, + { + "name": "test (xla, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280152/jobs/2794227475" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbSGAM=", "hasNextPage": false } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454956" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592987" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC2w=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Rs=" }, { "node": { "app": { - "name": "Azure Pipelines", - "databaseId": 9426 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + 
"name": "linux-xenial-py3.7-gcc5.4" + } }, - "workflowRun": null, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280160/jobs/2794078054" + }, + { + "name": "test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280160/jobs/2794203297" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280160/jobs/2794203553" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280160/jobs/2794203717" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280160/jobs/2794203878" + }, + { + "name": "test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280160/jobs/2794203982" + }, + { + "name": "test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280160/jobs/2794204149" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRlJs=", "hasNextPage": false } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454965" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592997" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC3U=" - }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-SU=" + } + ], + "pageInfo": { + "hasNextPage": true + } + } + } + } + ] + } + } + } + } + }, + "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAU2F-SU= name=pytorch number=73969 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "commits": { + "nodes": [ + { + "commit": { + "oid": "4746da707a9912356f5179625da89616b228dc21", + "checkSuites": { + "edges": [ { "node": { "app": { - "name": "Dependabot", - "databaseId": 29110 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-py3.7-clang9" + } }, - "workflowRun": null, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280162/jobs/2794078019" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280162/jobs/2794187280" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280162/jobs/2794187423" + }, + { + "name": "test (noarch, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280162/jobs/2794187582" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRN_c=", "hasNextPage": false } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454970" + "conclusion": "SUCCESS", + "url": 
"https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595593001" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC3o=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Sk=" }, { "node": { "app": { - "name": "Codecov", - "databaseId": 254 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-onnx" + } }, - "workflowRun": null, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280164/jobs/2794078039" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280164/jobs/2794213425" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280164/jobs/2794213615" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRySo=", "hasNextPage": false } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454974" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595593014" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC34=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-TY=" }, { "node": { "app": { - "name": "PyTorch Bot", - "databaseId": 40112 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static" + } }, - "workflowRun": null, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280168/jobs/2794078064" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2d0=", "hasNextPage": false } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454977" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595593026" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC4E=" - }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-UI=" + } + ], + "pageInfo": { + "hasNextPage": false + } + } + } + } + ] + } + } + } + } + }, + "query_sha=e29d0e1d73b9847dacfd671c80e117d82111b407f7daa8ff885d3c444eafe47f name=pytorch number=73099 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": false, + "author": { + "login": "BowenBao" + }, + "title": "[ONNX] Make graph name spec-compliant (#71961)", + "body": "Stack from [ghstack](https://github.com/ezyang/ghstack):\n* #73104\n* #73103\n* #73102\n* #73101\n* #73100\n* __->__ #73099\n\n[According to the ONNX spec](https://github.com/onnx/onnx/blob/main/docs/IR.md#names-within-a-graph),\nall names must adhere to C90 identifier syntax rules, which means no\ndashes.\n\nFixes: #30952", + "headRefName": "gh/BowenBao/138/head", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "gh/BowenBao/138/base", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + 
"author": { + "user": { + "login": "BowenBao" + }, + "email": "bowbao@microsoft.com", + "name": "BowenBao" + }, + "oid": "3038b939eb2069653305c419326a0f47d2598e39" + } + } + ], + "pageInfo": { + "endCursor": "MQ", + "hasNextPage": false + }, + "totalCount": 1 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "edges": [ { "node": { "app": { @@ -9178,18 +15343,18 @@ { "name": "run-torchbench", "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150879695?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041786/jobs/2626264278" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAW6e-c8=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNn9o=", "hasNextPage": false } }, "conclusion": "SKIPPED", - "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241455322" + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189561" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFDNo=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS7k=" }, { "node": { @@ -9199,56 +15364,41 @@ }, "workflowRun": { "workflow": { - "name": "Lint" + "name": "linux-xenial-cuda11.3-py3.7-gcc7" } }, "checkRuns": { "nodes": [ { - "name": "quick-checks", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150879696?check_suite_focus=true" - }, - { - "name": "lintrunner", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150879758?check_suite_focus=true" - }, - { - "name": "Test tools", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150879835?check_suite_focus=true" - }, - { - "name": "Test collect_env (with_torch)", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150879901?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041785/jobs/2626264385" }, { - "name": "Test collect_env (without_torch)", + "name": "test (default, 1, 2, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150879942?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041785/jobs/2626417658" }, { - "name": "toc", + "name": "test (default, 2, 2, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150880005?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041785/jobs/2626417743" }, { - "name": "workflow-checks", + "name": "test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150880051?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041785/jobs/2626417885" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAW6e-zM=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkRE_E=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241455334" + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189562" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFDOY=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS7o=" }, { "node": { @@ -9258,913 +15408,933 @@ }, "workflowRun": { "workflow": { - "name": "pull" 
+ "name": "linux-xenial-py3.7-gcc7-no-ops" } }, "checkRuns": { "nodes": [ { - "name": "linux-vulkan-bionic-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150895177?check_suite_focus=true" - }, - { - "name": "linux-bionic-rocm5.0-py3.7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150895295?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-onnx / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150895365?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150895428?check_suite_focus=true" - }, - { - "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150895554?check_suite_focus=true" - }, - { - "name": "win-vs2019-cuda11.3-py3 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150895614?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150895698?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150895758?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150895866?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-mobile-build / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150895923?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-asan / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150895991?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150896053?check_suite_focus=true" - }, - { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150896146?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150896213?check_suite_focus=true" - }, - { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150896256?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150896288?check_suite_focus=true" - }, - { - "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150896313?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc7-no-ops / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150896352?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", - 
"conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150896403?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150896443?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150970691?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150970749?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150970796?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150970831?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150970876?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150970911?check_suite_focus=true" - }, - { - "name": "linux-docs / build-docs (cpp)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150970959?check_suite_focus=true" - }, - { - "name": "linux-docs / build-docs (python)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150971013?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150976613?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150976667?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150976694?check_suite_focus=true" - }, - { - "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150977190?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150980317?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041789/jobs/2626264416" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoJE=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189563" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS7s=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-build" + } + }, + "checkRuns": { + "nodes": 
[ { - "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150980363?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041787/jobs/2626264407" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoIY=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189564" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS7w=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + } + }, + "checkRuns": { + "nodes": [ { - "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", + "name": "build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150989669?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041788/jobs/2626264422" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoJs=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189566" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS74=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [ { - "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6150989736?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041790/jobs/2626264414" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge)", + "name": "test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151003389?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041790/jobs/2626349405" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge)", + "name": "test (noarch, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151003429?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041790/jobs/2626349522" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge)", + "name": "test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151003460?check_suite_focus=true" - }, - { - "name": "pytorch-xla-linux-bionic-py3.7-clang8", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151007051?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041790/jobs/2626349618" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPiwA=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189567" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS78=" + }, + { + "node": { + "app": { + "name": 
"GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-vulkan-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [ { - "name": "linux-bionic-rocm5.0-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151023043?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041793/jobs/2626264431" }, { - "name": "linux-bionic-rocm5.0-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "name": "test (default, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151023077?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041793/jobs/2626359364" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPxgQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189568" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS8A=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static" + } + }, + "checkRuns": { + "nodes": [ { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151040240?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041792/jobs/2626264427" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoKA=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189570" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS8I=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cpu-py3" + } + }, + "checkRuns": { + "nodes": [ { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151041874?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041791/jobs/2626264386" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "name": "test (default, 1, 2, windows.4xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151041915?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041791/jobs/2626722677" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "name": "test (default, 2, 2, windows.4xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151041959?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041791/jobs/2626722710" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkX070=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189571" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS8M=" + }, + { + 
"node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7" + } + }, + "checkRuns": { + "nodes": [ { - "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151065166?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041803/jobs/2626264401" }, { - "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", + "name": "test (distributed, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151065218?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041803/jobs/2626349045" }, { - "name": "win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "name": "test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151165045?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041803/jobs/2626349141" }, { - "name": "win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "name": "test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6151165103?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041803/jobs/2626349272" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAW6jVK8=", - "hasNextPage": true + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPiQA=", + "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241455360" + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189572" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFDQA=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS8Q=" } ], "pageInfo": { - "hasNextPage": false + "hasNextPage": true } }, - "pushedDate": "2022-04-25T02:30:31Z", - "oid": "db355d55655bb252a699cd532441bb98e52b98d5" + "status": { + "contexts": [ + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17010288?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17010289?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17010488?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17010326?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + } + ] + }, + "pushedDate": "2022-02-18T18:46:28Z", + "oid": "3038b939eb2069653305c419326a0f47d2598e39" } } ] }, - "changedFiles": 5, + "changedFiles": 162, "files": { "nodes": [ { - "path": "test/test_ops.py" 
+ "path": "test/onnx/expect/TestOperators.test_acos.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_left_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_size1_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_size1_right_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_size1_singleton_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_addconstant.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_addmm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_arange_dynamic.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_argmax.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_asin.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_at_op.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_atan.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_aten_embedding_1.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_aten_embedding_2.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_avg_pool2d.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_baddbmm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_basic.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm_1d.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm_noaffine.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm_onnx_irv4.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm_training.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_bitshift.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_c2_op.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_chunk.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_clip.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_clip_max.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_clip_min.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_concat2.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_conv.expect" }, { - "path": "torch/_prims/__init__.py" + "path": "test/onnx/expect/TestOperators.test_conv_onnx_irv4.expect" }, { - "path": "torch/_prims/utils.py" + "path": "test/onnx/expect/TestOperators.test_conv_onnx_irv4_opset8.expect" }, { - "path": "torch/_refs/__init__.py" + "path": "test/onnx/expect/TestOperators.test_convtranspose.expect" }, { - "path": "torch/testing/_internal/common_methods_invocations.py" - } - ], - "pageInfo": { - "endCursor": "NQ", - "hasNextPage": false - } - }, - "reviews": { - "nodes": [ + "path": "test/onnx/expect/TestOperators.test_cos.expect" + }, { - "author": { - "login": "Lezcano" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_cumsum.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_det.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_dict.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_dict_str.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": 
"test/onnx/expect/TestOperators.test_dim.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_dropout.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_dropout_default.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_dropout_opset12.expect" }, { - "author": { - "login": "Lezcano" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_dropout_training.expect" }, { - "author": { - "login": "Lezcano" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_dropout_training_opset12.expect" }, { - "author": { - "login": "Lezcano" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_add.expect" }, { - "author": { - "login": "Lezcano" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_add_inputs_same_symbolic_shape.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_matmul.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_reduce_mean.expect" }, { - "author": { - "login": "Lezcano" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_unchange.expect" }, { - "author": { - "login": "Lezcano" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_elu.expect" }, { - "author": { - "login": "ngimel" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_embedding_bags.expect" }, { - "author": { - "login": "ngimel" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_empty_like.expect" }, { - "author": { - "login": "Lezcano" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_empty_like_opset7.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_equal.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_erf.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_exp.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_expand.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_flatten.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_flatten2D.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_fmod.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_frobenius_norm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_full.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_full_like.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_gather.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_gather_opset11.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_ge.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_gelu.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_gt.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_hardtanh.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_implicit_expand.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_index.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_isnan.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_layer_norm_aten.expect" + }, + { + "path": 
"test/onnx/expect/TestOperators.test_le.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_linear.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_log_sigmoid.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_logsoftmax.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_lstm_none_sequence_lens.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_lt.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_master_opset.expect" }, + { + "path": "test/onnx/expect/TestOperators.test_max.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_maxpool.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_maxpool_dilations.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_maxpool_indices.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_mean.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_mean_dtype.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_meshgrid.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_min.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_mm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_narrow.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_ne.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_nonzero.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_norm_p1.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_norm_p2.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_ones_like.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_pad.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_params.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_params_onnx_irv4.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_permute2.expect" + } + ], + "pageInfo": { + "endCursor": "MTAw", + "hasNextPage": true + } + }, + "reviews": { + "nodes": [ { "author": { - "login": "zou3519" + "login": "garymm" }, - "state": "COMMENTED" - }, + "state": "APPROVED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wMi0xOFQxNzoxODo0NC0wODowMLkyMDIyLTAyLTE4VDE3OjE4OjQ0LTA4OjAwzjTr0H0=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ { + "bodyText": "This PR cannot be merged by bot due to changing > 100 files. @malfet \n \n \n pytorch/.github/scripts/trymerge.py\n \n \n Line 63\n in\n 932adf2\n \n \n \n \n\n \n \n files(last: 100) { \n \n \n \n\n Can this be relaxed? If not please import.", "author": { - "login": "mruberry" + "login": "BowenBao" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1048084569 }, { + "bodyText": "This PR cannot be merged by bot due to changing > 100 files. @malfet\nCan this be relaxed? If not please import.\n\nWow, you've hit a really interesting problem. 100 is a limitation enforced by GitHub, see https://docs.github.com/en/graphql/overview/resource-limitations, but I can implement a pagination. Do you mind keeping it like that for a bit, want to land a fix soonish.", "author": { - "login": "peterbell10" + "login": "malfet" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1048088691 }, { + "bodyText": "@malfet Thank you for info. 
Sure, I have separated the rest of stack from this one, we'll wait for the fix to try again.", "author": { - "login": "mruberry" + "login": "BowenBao" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1048090640 }, { + "bodyText": "@pytorchbot merge this", "author": { - "login": "mruberry" + "login": "BowenBao" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1050293881 }, { + "bodyText": "Hey @BowenBao.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", "author": { - "login": "mruberry" + "login": "github-actions" }, - "state": "COMMENTED" - }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1050295451 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOPniAWQ==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ { - "author": { - "login": "Lezcano" - }, - "state": "COMMENTED" + "node": { + "name": "oncall: jit" + } }, { - "author": { - "login": "Lezcano" - }, - "state": "COMMENTED" + "node": { + "name": "open source" + } }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "node": { + "name": "cla signed" + } }, { - "author": { - "login": "ngimel" - }, - "state": "COMMENTED" + "node": { + "name": "release notes: onnx" + } }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - }, + "node": { + "name": "topic: bug fixes" + } + } + ] + } + } + } + } + }, + "query_sha=0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98 cursor=MTAw name=pytorch number=73099 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "files": { + "nodes": [ { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_pixel_shuffle.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_pow.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_prelu.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_prod.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_prod_dtype.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_rand.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_randn.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_reduce_sum_negative_indices.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_reduced_mean.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_reduced_mean_dtype.expect" }, { - "author": { - "login": 
"ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_reduced_mean_keepdim.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_reduced_prod.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_reduced_prod_dtype.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_reduced_prod_keepdim.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_reduced_sum.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_reduced_sum_dtype.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_reduced_sum_keepdim.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_reducemax.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_reducemin.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_remainder.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_repeat.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_repeat_dim_overflow.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_round.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_rrelu.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_rsqrt.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_rsub.expect" }, { - "author": { - "login": "ngimel" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_scatter_add.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_scatter_add_opset11.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_selu.expect" }, - { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + { + "path": "test/onnx/expect/TestOperators.test_shape_value_map.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_sign.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_sin.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_slice.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_slice_dynamic.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_3d.expect" }, { - "author": { - "login": "mruberry" - }, - "state": 
"COMMENTED" + "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_3d_none.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_4d.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_ignore_index.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_weights.expect" }, { - "author": { - "login": "Lezcano" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_split.expect" }, { - "author": { - "login": "Lezcano" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_split_with_sizes.expect" }, { - "author": { - "login": "Lezcano" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_sqrt.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_std.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_sum.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_sum_dtype.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_tan.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_topk.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_topk_smallest_unsorted.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_transpose.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_type_as.expect" }, { - "author": { - "login": "ngimel" - }, - "state": "APPROVED" + "path": "test/onnx/expect/TestOperators.test_unfold.expect" }, { - "author": { - "login": "ezyang" - }, - "state": "COMMENTED" + "path": "test/onnx/expect/TestOperators.test_unique.expect" }, { - "author": { - "login": "mruberry" - }, - "state": "COMMENTED" - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNC0wNlQxNDo1NjoyNC0wNTowMLkyMDIyLTA0LTA2VDEwOjQwOjM4LTA1OjAwzjenO6Y=", - "hasPreviousPage": false - } - }, - "comments": { - "nodes": [ + "path": "test/onnx/expect/TestOperators.test_unsqueeze.expect" + }, { - "bodyText": "Ref implementations by themselves can handle any shapes (and broadcast ops by themselves don't bake in any shapes). 
The question is can we decide if a particular trace is applicable for a different input, but that depends on the tracing technology and what we are caching on, so out of scope for initial PR.", - "author": { - "login": "ngimel" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1105643418 + "path": "test/onnx/expect/TestOperators.test_upsample_nearest_scale.expect" }, { - "bodyText": "@pytorchbot merge this please", - "author": { - "login": "mruberry" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1108072887 + "path": "test/onnx/expect/TestOperators.test_upsample_nearest_scale_default_scale_factor.expect" }, { - "bodyText": "Merge failed due to 'mruberry'\nRaised by https://github.com/pytorch/pytorch/actions/runs/2218044244", - "author": { - "login": "pytorchmergebot" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1108073536 + "path": "test/onnx/expect/TestOperators.test_upsample_nearest_size.expect" }, { - "bodyText": "@mruberry has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", - "author": { - "login": "facebook-github-bot" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1108075965 + "path": "test/onnx/expect/TestOperators.test_view.expect" }, { - "bodyText": "Hey @mruberry.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' 
and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", - "author": { - "login": "github-actions" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 1108351107 - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOQebHmg==", - "hasPreviousPage": true - } - }, - "labels": { - "edges": [ + "path": "test/onnx/expect/TestOperators.test_view_flatten.expect" + }, { - "node": { - "name": "cla signed" - } + "path": "test/onnx/expect/TestOperators.test_zeros_like.expect" }, { - "node": { - "name": "topic: not user facing" - } + "path": "torch/csrc/jit/serialization/export.cpp" }, { - "node": { - "name": "module: primTorch" - } + "path": "torch/csrc/jit/serialization/export.h" } - ] + ], + "pageInfo": { + "endCursor": "MTYy", + "hasNextPage": false + } } } } } }, - "query_sha=0e2a29eda6405cea4c9de20fb80ae7924910e17272a7b251040182e7d8c390e0 name=pytorch number=77700 owner=pytorch": { + "query_sha=e29d0e1d73b9847dacfd671c80e117d82111b407f7daa8ff885d3c444eafe47f name=pytorch number=74649 owner=pytorch": { "data": { "repository": { "pullRequest": { "closed": true, "isCrossRepository": false, "author": { - "login": "kit1980" + "login": "malfet" }, - "title": "Move pull linux-docs job to Ubuntu 20.04", - "body": "", - "headRefName": "sdym/pull-xenial-focal-linux-docs", + "title": "This should fail flake8", + "body": "Test issue for GHF mandatory checks", + "headRefName": "malfet-patch-8", "headRepository": { "nameWithOwner": "pytorch/pytorch" }, @@ -10183,20 +16353,32 @@ "commit": { "author": { "user": { - "login": "kit1980" + "login": "malfet" }, - "email": "sdym@fb.com", - "name": "Sergii Dymchenko" + "email": "nshulga@fb.com", + "name": "Nikita Shulga" }, - "oid": "81261599614423baa17df72300b8e109677b6799" + "oid": "57c86ff1c5ab948888fd329986c9d55796680e33" + } + }, + { + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4" } } ], "pageInfo": { - "endCursor": "MQ", + "endCursor": "Mg", "hasNextPage": false }, - "totalCount": 1 + "totalCount": 2 }, "commits": { "nodes": [ @@ -10216,18 +16398,18 @@ { "name": "Facebook CLA Check", "conclusion": "SUCCESS", - "detailsUrl": "https://code.facebook.com/cla/" + "detailsUrl": "https://code.intern.facebook.com/cla/" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNmNqE=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHsK3w=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147714" + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018129" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuMI=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1E=" }, { "node": { @@ -10244,9 +16426,9 @@ } }, "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147726" + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018131" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuM4=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1M=" }, { "node": { @@ -10263,9 +16445,9 @@ } }, "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147733" + "url": 
"https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018132" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuNU=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1Q=" }, { "node": { @@ -10282,9 +16464,9 @@ } }, "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147746" + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018134" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuOI=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1Y=" }, { "node": { @@ -10301,9 +16483,9 @@ } }, "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147762" + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018139" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuPI=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1s=" }, { "node": { @@ -10320,9 +16502,9 @@ } }, "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147780" + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018142" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuQQ=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj14=" }, { "node": { @@ -10338,50 +16520,75 @@ "checkRuns": { "nodes": [ { - "name": "lintrunner", + "name": "clang-format", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901060?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925132" }, { - "name": "workflow-checks", + "name": "clang-tidy", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901248?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925189" }, { - "name": "quick-checks", + "name": "cmakelint", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925230" + }, + { + "name": "flake8-py3", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925307" + }, + { + "name": "mypy", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901458?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925365" }, { "name": "Test collect_env (with_torch)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901863?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925427" }, { "name": "Test collect_env (without_torch)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901951?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925449" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925537" + }, + { + "name": "py2-setup-validate-errormsg", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925644" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925688" }, { "name": "toc", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498902083?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925809" }, { - "name": "Test tools", + "name": "shellcheck", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498902358?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925945" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNdYVY=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHsMiY=", "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567148336" + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018384" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuzA=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlkFA=" }, { "node": { @@ -10399,18 +16606,18 @@ { "name": "run-torchbench", "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901064?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576288/jobs/2928925134" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNdXEg=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHsLW0=", "hasNextPage": false } }, "conclusion": "SKIPPED", - "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567148344" + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018395" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuzg=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlkFs=" }, { "node": { @@ -10420,1331 +16627,2622 @@ }, "workflowRun": { "workflow": { - "name": "docker-builds" + "name": "pull" } }, "checkRuns": { "nodes": [ { - "name": "docker-build (pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901070?check_suite_focus=true" - }, - { - "name": "docker-build (pytorch-linux-bionic-cuda11.3-cudnn8-py3-clang9)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901146?check_suite_focus=true" - }, - { - "name": "docker-build (pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901221?check_suite_focus=true" + "name": "pytorch-xla-linux-bionic-py3.7-clang8", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928935743" }, { - "name": "docker-build (pytorch-linux-bionic-py3.7-clang9)", + "name": "linux-vulkan-bionic-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901302?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928935775" }, { - "name": "docker-build (pytorch-linux-bionic-rocm5.0-py3.7)", + "name": "linux-bionic-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901366?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928935850" }, { - "name": "docker-build (pytorch-linux-bionic-rocm5.1-py3.7)", + "name": 
"linux-bionic-rocm4.5-py3.7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901454?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928935994" }, { - "name": "docker-build (pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7)", + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901538?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936064" }, { - "name": "docker-build (pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7)", + "name": "linux-xenial-py3.7-gcc5.4 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901617?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936179" }, { - "name": "docker-build (pytorch-linux-xenial-py3-clang5-android-ndk-r19c)", + "name": "linux-xenial-py3-clang5-mobile-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901670?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936265" }, { - "name": "docker-build (pytorch-linux-xenial-py3-clang5-asan)", + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901773?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936309" }, { - "name": "docker-build (pytorch-linux-xenial-py3-clang7-asan)", + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901846?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936353" }, { - "name": "docker-build (pytorch-linux-xenial-py3-clang7-onnx)", + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498901939?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936395" }, { - "name": "docker-build (pytorch-linux-xenial-py3.7-gcc5.4)", + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498902041?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936426" }, { - "name": "docker-build (pytorch-linux-xenial-py3.7-gcc7)", + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498902117?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936483" }, { - "name": "docker-build (pytorch-linux-focal-py3.7-gcc7)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498902194?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNdYLI=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": 
"https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567148352" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduu0A=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "pull" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "linux-bionic-py3.7-clang9 / build", + "name": "win-vs2019-cuda11.3-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498932877?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936516" }, { - "name": "linux-focal-py3.7-gcc7 / build", + "name": "win-vs2019-cpu-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498933082?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936558" }, { "name": "linux-xenial-py3.7-gcc7-no-ops / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498933297?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936633" }, { "name": "linux-xenial-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498933508?check_suite_focus=true" - }, - { - "name": "linux-vulkan-bionic-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498933805?check_suite_focus=true" - }, - { - "name": "linux-bionic-rocm5.1-py3.7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498934115?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498934258?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-gcc5.4 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498934411?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-onnx / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498934576?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936705" }, { "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498934681?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936736" }, { - "name": "win-vs2019-cuda11.3-py3 / build", + "name": "linux-xenial-py3.7-clang7-onnx / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498934902?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936756" }, { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498935080?check_suite_focus=true" + "name": "pytorch-xla-linux-bionic-py3.7-clang8", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936796" }, { - "name": "linux-xenial-py3-clang5-mobile-build / build", + "name": "linux-xenial-py3.7-clang7-asan / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498935207?check_suite_focus=true" 
+ "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936823" }, { - "name": "linux-xenial-py3.7-clang7-asan / build", + "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498935381?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928990551" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498935482?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928990588" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "name": "linux-docs / build-docs (cpp)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498935669?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928992832" }, { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "name": "linux-docs / build-docs (python)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498935747?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928992868" }, { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498935802?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928992932" }, { - "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498935884?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928992965" }, { - "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", + "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498935972?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928993011" }, { - "name": "win-vs2019-cpu-py3 / build", + "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6498936102?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928993042" }, { - "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499060931?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928993086" }, { - "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/6499060996?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928993128" }, { "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499065639?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928995802" }, { "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499065699?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499065764?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928995853" }, { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", + "name": "linux-bionic-py3.7-clang9 / test (noarch, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499065815?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928995889" }, { "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499069355?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928997626" }, { "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499078217?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928999058" }, { "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499078276?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 5, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499104194?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928999075" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 5, linux.2xlarge)", + "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499104243?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929012407" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 5, linux.2xlarge)", + "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499104298?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929012438" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 4, 5, linux.2xlarge)", + "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499104357?check_suite_focus=true" + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929012469" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 5, 5, linux.2xlarge)", + "name": "linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499104403?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929034328" }, { - "name": "pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge)", + "name": "linux-bionic-rocm4.5-py3.7 / test (default, 2, 2, linux.rocm.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499108043?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929034340" }, { "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499152001?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499153180?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499153280?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499153315?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499153355?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499153395?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929040801" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499153439?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929045939" }, { - "name": "linux-bionic-rocm5.1-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499153610?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929046016" }, { - "name": "linux-bionic-rocm5.1-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499153676?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929046063" }, { - "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", + "name": 
"win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499259414?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929082254" }, { - "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", + "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499259466?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929082275" }, { - "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", + "name": "win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499259509?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929157614" }, { - "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", + "name": "win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499259568?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929157635" }, { - "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", + "name": "win-vs2019-cuda11.3-py3 / test (force_on_cpu, 1, 1, windows.4xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499259607?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929157656" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNi1Nc=", - "hasNextPage": true + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHxIT4=", + "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567148369" + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018405" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduu1E=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlkGU=" } ], "pageInfo": { "hasNextPage": false } }, - "pushedDate": "2022-05-19T00:02:11Z", - "oid": "81261599614423baa17df72300b8e109677b6799" + "status": null, + "pushedDate": "2022-03-24T00:42:33Z", + "oid": "6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4" + } + } + ] + }, + "changedFiles": 1, + "files": { + "nodes": [ + { + "path": "torch/nn/cpp.py" + } + ], + "pageInfo": { + "endCursor": "MQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "seemethere" + }, + "state": "APPROVED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wMy0yM1QxNTo1MDo0NS0wNzowMLkyMDIyLTAzLTIzVDE1OjUwOjQ1LTA3OjAwzjbPEDg=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ + { + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/74649\n\u21a9\ufe0f \u00a0[fb-only] Re-run with SSH instructions\nNeed help or want to give feedback on the CI? Visit our office hours\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 6c3c3de (more details on the Dr. 
CI page):\n\n\n1/1 failures introduced in this PR\n\n\n1 failure not recognized by patterns:\n\n\n\nJob\nStep\nAction\n\n\n\n\n Lint / flake8-py3\nFail if there were any warnings\n\ud83d\udd01 rerun\n\n\n\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 1076891218 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQDAOUg==", + "hasPreviousPage": false + } + }, + "labels": { + "edges": [ + { + "node": { + "name": "cla signed" } } - ] - }, - "changedFiles": 3, + ] + } + } + } + } + }, + "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=None name=metamates org=pytorch": { + "data": { + "organization": { + "team": { + "members": { + "nodes": [ + { + "login": "dreiss" + }, + { + "login": "kumpera" + }, + { + "login": "ezyang" + }, + { + "login": "stephenroller" + }, + { + "login": "swolchok" + }, + { + "login": "hyuen" + }, + { + "login": "orionr" + }, + { + "login": "dhruvbird" + }, + { + "login": "likethesky" + }, + { + "login": "lw" + }, + { + "login": "raziel" + }, + { + "login": "simpkins" + }, + { + "login": "ebyrne" + }, + { + "login": "Babar" + }, + { + "login": "kostmo" + }, + { + "login": "0x00b1" + }, + { + "login": "bhosmer" + }, + { + "login": "digantdesai" + }, + { + "login": "zdevito" + }, + { + "login": "bugra" + }, + { + "login": "kunalb" + }, + { + "login": "caraya10" + }, + { + "login": "kit1980" + }, + { + "login": "shoumikhin" + }, + { + "login": "huydhn" + }, + { + "login": "teytaud" + }, + { + "login": "xuzhao9" + }, + { + "login": "jansel" + }, + { + "login": "abhinavarora" + }, + { + "login": "b0noI" + }, + { + "login": "djthorne" + }, + { + "login": "nairbv" + }, + { + "login": "Mortimerp9" + }, + { + "login": "dadkins20" + }, + { + "login": "colesbury" + }, + { + "login": "laurencer" + }, + { + "login": "nickgg" + }, + { + "login": "yzhao30" + }, + { + "login": "rmaz" + }, + { + "login": "bearzx" + }, + { + "login": "mattjgalloway" + }, + { + "login": "chenyang78" + }, + { + "login": "yns88" + }, + { + "login": "lc0" + }, + { + "login": "wenleix" + }, + { + "login": "jingsh" + }, + { + "login": "mthrok" + }, + { + "login": "drdarshan" + }, + { + "login": "tvalentius" + }, + { + "login": "d4l3k" + }, + { + "login": "jamiemccrindle" + }, + { + "login": "kazhang" + }, + { + "login": "simonhollis" + }, + { + "login": "ajyu" + }, + { + "login": "govardhan" + }, + { + "login": "yinghai" + }, + { + "login": "zyan0" + }, + { + "login": "ajtulloch" + }, + { + "login": "smeenai" + }, + { + "login": "vtlam" + }, + { + "login": "pbelevich" + }, + { + "login": "VitalyFedyunin" + }, + { + "login": "dbish" + }, + { + "login": "khabinov" + }, + { + "login": "NicolasHug" + }, + { + "login": "jfix71" + }, + { + "login": "atuljangra" + }, + { + "login": "idning" + }, + { + "login": "soumith" + }, + { + "login": "nimin98" + }, + { + "login": "chaekit" + }, + { + "login": "radkris-git" + }, + { + "login": "xunnanxu" + }, + { + "login": "javier-m" + }, + { + "login": "jmdetloff" + }, + { + "login": "mostafaelhoushi" + }, + { + "login": "brianjo" + }, + { + "login": "wangkuiyi" + }, + { + "login": "suo" + }, + { + "login": "vkuzo" + }, + { + "login": "seemethere" + }, + { + "login": "cpuhrsch" + }, + { + "login": "qihqi" + }, + { + "login": "jackm321" + 
}, + { + "login": "linbinyu" + }, + { + "login": "neerajprad" + }, + { + "login": "rsemenov" + }, + { + "login": "ziky90" + }, + { + "login": "gmagogsfm" + }, + { + "login": "zzzwen" + }, + { + "login": "ikriv" + }, + { + "login": "deeptigp" + }, + { + "login": "andrewor14" + }, + { + "login": "jianyuh" + }, + { + "login": "cykustcc" + }, + { + "login": "highker" + }, + { + "login": "beauby" + }, + { + "login": "jeffreyksmithjr" + }, + { + "login": "suphoff" + }, + { + "login": "smessmer" + } + ], + "pageInfo": { + "hasNextPage": true, + "endCursor": "Y3Vyc29yOnYyOpHOACQ5JQ==" + } + } + } + } + } + }, + "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=Y3Vyc29yOnYyOpHOACQ5JQ== name=metamates org=pytorch": { + "data": { + "organization": { + "team": { + "members": { + "nodes": [ + { + "login": "ananthsub" + }, + { + "login": "firstprayer" + }, + { + "login": "malfet" + }, + { + "login": "fegin" + }, + { + "login": "hanton" + }, + { + "login": "zanqi" + }, + { + "login": "supriyar" + }, + { + "login": "kausv" + }, + { + "login": "dagitses" + }, + { + "login": "bilgeacun" + }, + { + "login": "caogao" + }, + { + "login": "miguelmartin75" + }, + { + "login": "penguinwu" + }, + { + "login": "shz117" + }, + { + "login": "ajliu" + }, + { + "login": "msaroufim" + }, + { + "login": "davides" + }, + { + "login": "alannnna" + }, + { + "login": "hlin09" + }, + { + "login": "hudeven" + }, + { + "login": "terrychenism" + }, + { + "login": "xiaomengy" + }, + { + "login": "jisaacso" + }, + { + "login": "fkhan1337" + }, + { + "login": "xing-liu" + }, + { + "login": "alanadakotashine" + }, + { + "login": "desertfire" + }, + { + "login": "YosuaMichael" + }, + { + "login": "banitag1" + }, + { + "login": "gchanan" + }, + { + "login": "dbort" + }, + { + "login": "bilalsal" + }, + { + "login": "DanilBaibak" + }, + { + "login": "serhaty" + }, + { + "login": "yf225" + }, + { + "login": "mlazos" + }, + { + "login": "yifuwang" + }, + { + "login": "z-a-f" + }, + { + "login": "tenpercent" + }, + { + "login": "bertmaher" + }, + { + "login": "chauhang" + }, + { + "login": "ZainRizvi" + }, + { + "login": "jiayisuse" + }, + { + "login": "bochko" + }, + { + "login": "jeanschmidt" + }, + { + "login": "bradleyhd" + }, + { + "login": "mullachv" + }, + { + "login": "voznesenskym" + }, + { + "login": "bwasti" + }, + { + "login": "NivekT" + }, + { + "login": "zhxchen17" + }, + { + "login": "jerryzh168" + }, + { + "login": "MohammadMahdiJavanmard" + }, + { + "login": "wconstab" + }, + { + "login": "Hangjun" + }, + { + "login": "davidberard98" + }, + { + "login": "Krovatkin" + }, + { + "login": "CamiWilliams" + }, + { + "login": "datumbox" + }, + { + "login": "aartibasant" + }, + { + "login": "xta0" + }, + { + "login": "zou3519" + }, + { + "login": "xman1979" + }, + { + "login": "suraj813" + }, + { + "login": "gqchen" + }, + { + "login": "george-qi" + }, + { + "login": "abhikrish" + }, + { + "login": "zhangguanheng66" + }, + { + "login": "mikeiovine" + }, + { + "login": "Chillee" + }, + { + "login": "albanD" + }, + { + "login": "bigfootjon" + }, + { + "login": "robotal" + }, + { + "login": "MarcioPorto" + }, + { + "login": "srsuryadev" + }, + { + "login": "IvanKobzarev" + }, + { + "login": "eprivezentsev" + }, + { + "login": "kwen2501" + }, + { + "login": "linux-jedi" + }, + { + "login": "chandlerzuo" + }, + { + "login": "otsneh" + }, + { + "login": "husthyc" + }, + { + "login": "briancoutinho" + }, + { + "login": "fduwjj" + }, + { + "login": "frank-wei" + }, + { + "login": "prabhat00155" + }, + { + 
"login": "QuentinDuval" + }, + { + "login": "atalman" + }, + { + "login": "xush6528" + }, + { + "login": "dracifer" + }, + { + "login": "SS-JIA" + }, + { + "login": "helunwencser" + }, + { + "login": "xw285cornell" + }, + { + "login": "hhbyyh" + }, + { + "login": "rohan-varma" + }, + { + "login": "jcaip" + }, + { + "login": "teng-li" + }, + { + "login": "larryliu0820" + }, + { + "login": "lyoka" + }, + { + "login": "SungMinCho" + } + ], + "pageInfo": { + "hasNextPage": true, + "endCursor": "Y3Vyc29yOnYyOpHOAH1fDg==" + } + } + } + } + } + }, + "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=Y3Vyc29yOnYyOpHOAH1fDg== name=metamates org=pytorch": { + "data": { + "organization": { + "team": { + "members": { + "nodes": [ + { + "login": "cbalioglu" + }, + { + "login": "hl475" + }, + { + "login": "hwangjeff" + }, + { + "login": "Jack-Khuu" + }, + { + "login": "mehtanirav" + }, + { + "login": "nateanl" + }, + { + "login": "fuqianz" + }, + { + "login": "boyuantan" + }, + { + "login": "muntaqim" + }, + { + "login": "ymao1993" + }, + { + "login": "fmassa" + }, + { + "login": "esantorella" + }, + { + "login": "HamidShojanazeri" + }, + { + "login": "jubinchheda" + }, + { + "login": "mehdimashayekhi" + }, + { + "login": "rkindi" + }, + { + "login": "wanchaol" + }, + { + "login": "zephirefaith" + }, + { + "login": "kapilsh" + }, + { + "login": "plahera" + }, + { + "login": "SherlockNoMad" + }, + { + "login": "pritamdamania87" + }, + { + "login": "iseeyuan" + }, + { + "login": "protonu" + }, + { + "login": "terhuhf" + }, + { + "login": "aruntonic" + }, + { + "login": "gcatron" + }, + { + "login": "yingrliu" + }, + { + "login": "alexanderguzhva" + }, + { + "login": "angelayi" + }, + { + "login": "zhaoalex" + }, + { + "login": "vivekmig" + }, + { + "login": "sangongs" + }, + { + "login": "jspisak" + }, + { + "login": "akshaypandian" + }, + { + "login": "drej82" + }, + { + "login": "tktrungna" + }, + { + "login": "eellison" + }, + { + "login": "NarineK" + }, + { + "login": "andrewconnors" + }, + { + "login": "wenwei202" + }, + { + "login": "jg2912" + }, + { + "login": "robieta" + }, + { + "login": "mreso" + }, + { + "login": "soulitzer" + }, + { + "login": "PaliC" + }, + { + "login": "anijain2305" + }, + { + "login": "pvtuan10" + }, + { + "login": "huangyi1979" + }, + { + "login": "osalpekar" + }, + { + "login": "xiaohui-zhang" + }, + { + "login": "jerry39213gh" + }, + { + "login": "jarodhou" + }, + { + "login": "hlu1" + }, + { + "login": "huiguoo" + }, + { + "login": "H-Huang" + }, + { + "login": "vtsyvina" + }, + { + "login": "Nitrokitty" + }, + { + "login": "satgera" + }, + { + "login": "ngimel" + }, + { + "login": "markkm" + }, + { + "login": "EscapeZero" + }, + { + "login": "bdhirsh" + }, + { + "login": "cccclai" + }, + { + "login": "carolineechen" + }, + { + "login": "tugsbayasgalan" + }, + { + "login": "agunapal" + }, + { + "login": "frankseide" + }, + { + "login": "YazhiGao" + }, + { + "login": "mrshenli" + }, + { + "login": "bashnick" + }, + { + "login": "lena-kashtelyan" + }, + { + "login": "brad-mengchi" + }, + { + "login": "kimishpatel" + }, + { + "login": "aaronenyeshi" + }, + { + "login": "shajrawi" + }, + { + "login": "samdow" + }, + { + "login": "great-way" + }, + { + "login": "ashkan-software" + }, + { + "login": "bankawas" + }, + { + "login": "jbitton" + }, + { + "login": "jdsgomes" + }, + { + "login": "zhangxy988" + }, + { + "login": "samlurye" + }, + { + "login": "anjali411" + }, + { + "login": "joecummings" + }, + { + "login": "842974287" + }, + { + 
"login": "JacobSzwejbka" + }, + { + "login": "nishantpdce" + }, + { + "login": "srinivas212" + }, + { + "login": "shreyanb98" + }, + { + "login": "dzdang" + }, + { + "login": "naveedgol" + }, + { + "login": "Nayef211" + }, + { + "login": "zrphercule" + }, + { + "login": "HengruiX" + }, + { + "login": "langong347" + }, + { + "login": "ebsmothers" + }, + { + "login": "anshuljain1" + }, + { + "login": "salilsdesai" + } + ], + "pageInfo": { + "hasNextPage": true, + "endCursor": "Y3Vyc29yOnYyOpHOAYM3gA==" + } + } + } + } + } + }, + "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=Y3Vyc29yOnYyOpHOAYM3gA== name=metamates org=pytorch": { + "data": { + "organization": { + "team": { + "members": { + "nodes": [ + { + "login": "vmoens" + }, + { + "login": "yoavnavon" + }, + { + "login": "printfoo" + }, + { + "login": "xinyang0" + }, + { + "login": "abhinav19792" + }, + { + "login": "fbbradheintz" + }, + { + "login": "kauterry" + }, + { + "login": "anirbanraywork" + }, + { + "login": "houseroad" + }, + { + "login": "erichan1" + }, + { + "login": "hsrussell" + }, + { + "login": "ilia-cher" + }, + { + "login": "ajitmaths" + }, + { + "login": "awgu" + }, + { + "login": "wz337" + }, + { + "login": "qxy11" + }, + { + "login": "janeyx99" + }, + { + "login": "msedwar" + }, + { + "login": "glaringlee" + }, + { + "login": "anj-s" + }, + { + "login": "drisspg" + }, + { + "login": "kmh4321" + }, + { + "login": "RdoubleA" + }, + { + "login": "jramseyer" + }, + { + "login": "jianingfu" + }, + { + "login": "zengk95" + }, + { + "login": "gtarjun" + }, + { + "login": "mikaylagawarecki" + }, + { + "login": "xianxl" + }, + { + "login": "mingzhe09088" + }, + { + "login": "aazzolini" + }, + { + "login": "nataliakliushkina" + }, + { + "login": "Xirider" + }, + { + "login": "HDCharles" + }, + { + "login": "mcr229" + }, + { + "login": "manuelcandales" + }, + { + "login": "guangy10" + }, + { + "login": "mengwa41" + }, + { + "login": "YulunW" + }, + { + "login": "hx89" + }, + { + "login": "hanhsienhuang" + }, + { + "login": "clee2000" + }, + { + "login": "lhuang04" + }, + { + "login": "sidneyfletcher" + }, + { + "login": "gottbrath" + }, + { + "login": "lessw2020" + }, + { + "login": "taivu1998" + }, + { + "login": "danrecoskie" + }, + { + "login": "zhaojuanmao" + }, + { + "login": "johncalab" + }, + { + "login": "dhthompson" + }, + { + "login": "superwizard2019" + }, + { + "login": "shunting314" + }, + { + "login": "edward-io" + }, + { + "login": "xcheng16" + }, + { + "login": "adamomainz" + }, + { + "login": "sluks" + }, + { + "login": "SebastianAment" + }, + { + "login": "poojahp" + }, + { + "login": "ansley" + }, + { + "login": "cheetah2216" + }, + { + "login": "pinaki-mukerji" + }, + { + "login": "hongxiayang" + }, + { + "login": "kyulee-com" + }, + { + "login": "sstsai-adl" + }, + { + "login": "dahsh" + }, + { + "login": "szewaiyuen7" + }, + { + "login": "byterover" + }, + { + "login": "wmao533" + }, + { + "login": "ejguan" + }, + { + "login": "nimaelyasi" + }, + { + "login": "qxu-fb" + }, + { + "login": "sshawnwu" + }, + { + "login": "njuvekar" + }, + { + "login": "iramazanli" + }, + { + "login": "jnkwok1" + }, + { + "login": "kurman" + }, + { + "login": "jbschlosser" + }, + { + "login": "haichuan-fb" + }, + { + "login": "janghyuncho" + }, + { + "login": "wwang84" + }, + { + "login": "JustinPinero" + }, + { + "login": "gcramer23" + }, + { + "login": "yuguo68" + }, + { + "login": "c-odrin" + }, + { + "login": "chowarfb" + }, + { + "login": "priyaramani" + }, + { + "login": "asalioufb" + 
}, + { + "login": "four4fish" + }, + { + "login": "kkosik20" + }, + { + "login": "KZFB" + }, + { + "login": "henryliu-bluehills" + }, + { + "login": "muchulee8" + }, + { + "login": "bchen2020" + }, + { + "login": "anirbanr-fb-r2p" + }, + { + "login": "kirklandsign" + }, + { + "login": "izaitsevfb" + }, + { + "login": "ashramac" + }, + { + "login": "weiwangmeta" + }, + { + "login": "andysamfb" + } + ], + "pageInfo": { + "hasNextPage": false, + "endCursor": "Y3Vyc29yOnYyOpHOBp303g==" + } + } + } + } + } + }, + "query_sha=0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98 cursor=MTAw name=pytorch number=76118 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "files": { + "nodes": [ + { + "path": "docs/source/quantization.rst" + }, + { + "path": "docs/source/scripts/build_quantization_configs.py" + }, + { + "path": "test/allowlist_for_publicAPI.json" + }, + { + "path": "test/cpp/jit/source_range_test.cpp" + }, + { + "path": "test/cpp/jit/test_backend.cpp" + }, + { + "path": "test/cpp/jit/test_flatbuffer.cpp" + }, + { + "path": "test/cpp/jit/test_misc.cpp" + }, + { + "path": "test/cpp/jit/test_utils.h" + }, + { + "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_float_v2.ptl.ff" + }, + { + "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_inplace_float_v2.ptl.ff" + }, + { + "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_inplace_int_v2.ptl.ff" + }, + { + "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_int_v2.ptl.ff" + }, + { + "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_reciprocal_float_v2.ptl.ff" + }, + { + "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_reciprocal_int_v2.ptl.ff" + }, + { + "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_scalar_v2.ptl.ff" + }, + { + "path": "test/cpp/jit/upgrader_models/test_versioned_div_tensor_inplace_v2.ptl.ff" + }, + { + "path": "test/cpp/jit/upgrader_models/test_versioned_div_tensor_out_v2.ptl.ff" + }, + { + "path": "test/cpp/jit/upgrader_models/test_versioned_div_tensor_v2.ptl.ff" + }, + { + "path": "test/cpp/profiler/record_function.cpp" + }, + { + "path": "test/distributed/_shard/sharded_tensor/test_sharded_tensor.py" + }, + { + "path": "test/distributed/_shard/test_replicated_tensor.py" + }, + { + "path": "test/distributed/fsdp/test_fsdp_comm.py" + }, + { + "path": "test/distributed/fsdp/test_fsdp_optim_state.py" + }, + { + "path": "test/distributed/optim/test_zero_redundancy_optimizer.py" + }, + { + "path": "test/jit/test_export_modes.py" + }, + { + "path": "test/jit/test_if_hoisting.py" + }, + { + "path": "test/jit/test_tracer.py" + }, + { + "path": "test/jit/test_upgraders.py" + }, + { + "path": "test/mobile/test_lite_script_type.py" + }, + { + "path": "test/onnx/expect/TestOperators.test_layer_norm_aten.expect" + }, + { + "path": "test/onnx/test_operators.py" + }, + { + "path": "test/onnx/test_pytorch_onnx_onnxruntime.py" + }, + { + "path": "test/quantization/ao_migration/test_quantization_fx.py" + }, + { + "path": "test/quantization/core/test_quantized_op.py" + }, + { + "path": "test/quantization/core/test_quantized_tensor.py" + }, + { + "path": "test/quantization/fx/test_numeric_suite_fx.py" + }, + { + "path": "test/quantization/fx/test_quantize_fx.py" + }, + { + "path": "test/test_autograd.py" + }, + { + "path": "test/test_binary_ufuncs.py" + }, + { + "path": "test/test_expanded_weights.py" + }, + { + "path": "test/test_functionalization.py" + }, + { + "path": "test/test_fx_experimental.py" + }, + { + 
"path": "test/test_jit.py" + }, + { + "path": "test/test_jit_cuda_fuser.py" + }, + { + "path": "test/test_linalg.py" + }, + { + "path": "test/test_nestedtensor.py" + }, + { + "path": "test/test_nn.py" + }, + { + "path": "test/test_ops.py" + }, + { + "path": "test/test_ops_gradients.py" + }, + { + "path": "test/test_ops_jit.py" + }, + { + "path": "test/test_optim.py" + }, + { + "path": "test/test_overrides.py" + }, + { + "path": "test/test_profiler.py" + }, + { + "path": "test/test_public_bindings.py" + }, + { + "path": "test/test_pytree.py" + }, + { + "path": "test/test_reductions.py" + }, + { + "path": "test/test_sort_and_select.py" + }, + { + "path": "test/test_sparse.py" + }, + { + "path": "test/test_sparse_csr.py" + }, + { + "path": "test/test_spectral_ops.py" + }, + { + "path": "test/test_tensor_creation_ops.py" + }, + { + "path": "test/test_tensorboard.py" + }, + { + "path": "test/test_testing.py" + }, + { + "path": "test/test_torch.py" + }, + { + "path": "test/test_unary_ufuncs.py" + }, + { + "path": "third_party/BUCK.github" + }, + { + "path": "third_party/fbgemm" + }, + { + "path": "tools/autograd/derivatives.yaml" + }, + { + "path": "tools/autograd/gen_inplace_or_view_type.py" + }, + { + "path": "tools/autograd/load_derivatives.py" + }, + { + "path": "tools/build_variables.bzl" + }, + { + "path": "tools/codegen/api/autograd.py" + }, + { + "path": "tools/codegen/api/cpp.py" + }, + { + "path": "tools/codegen/api/dispatcher.py" + }, + { + "path": "tools/codegen/api/functionalization.py" + }, + { + "path": "tools/codegen/api/lazy.py" + }, + { + "path": "tools/codegen/api/meta.py" + }, + { + "path": "tools/codegen/api/native.py" + }, + { + "path": "tools/codegen/api/python.py" + }, + { + "path": "tools/codegen/api/structured.py" + }, + { + "path": "tools/codegen/api/translate.py" + }, + { + "path": "tools/codegen/api/types.py" + }, + { + "path": "tools/codegen/api/ufunc.py" + }, + { + "path": "tools/codegen/api/unboxing.py" + }, + { + "path": "tools/codegen/code_template.py" + }, + { + "path": "tools/codegen/context.py" + }, + { + "path": "tools/codegen/decompositions/gen_jit_decompositions.py" + }, + { + "path": "tools/codegen/dest/__init__.py" + }, + { + "path": "tools/codegen/dest/lazy_ir.py" + }, + { + "path": "tools/codegen/dest/lazy_ts_lowering.py" + }, + { + "path": "tools/codegen/dest/native_functions.py" + }, + { + "path": "tools/codegen/dest/register_dispatch_key.py" + }, + { + "path": "tools/codegen/dest/ufunc.py" + }, + { + "path": "tools/codegen/gen.py" + }, + { + "path": "tools/codegen/gen_backend_stubs.py" + }, + { + "path": "tools/codegen/gen_functionalization_type.py" + }, + { + "path": "tools/codegen/gen_lazy_tensor.py" + }, + { + "path": "tools/codegen/local.py" + }, + { + "path": "tools/codegen/model.py" + }, + { + "path": "tools/codegen/operator_versions/gen_mobile_upgraders.py" + } + ], + "pageInfo": { + "endCursor": "MjAw", + "hasNextPage": true + } + } + } + } + } + }, + "query_sha=0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98 cursor=MjAw name=pytorch number=76118 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { "files": { "nodes": [ { - "path": ".circleci/docker/build.sh" + "path": "tools/codegen/selective_build/operator.py" + }, + { + "path": "tools/codegen/selective_build/selector.py" + }, + { + "path": "tools/codegen/shape_functions/gen_jit_shape_functions.py" + }, + { + "path": "tools/codegen/static_runtime/config.py" + }, + { + "path": "tools/codegen/static_runtime/gen_static_runtime_ops.py" + }, + { + "path": 
"tools/codegen/static_runtime/gen_structured.py" + }, + { + "path": "tools/codegen/utils.py" + }, + { + "path": "tools/linter/adapters/circleci_linter.py" + }, + { + "path": "tools/linter/adapters/clangformat_linter.py" + }, + { + "path": "tools/linter/adapters/grep_linter.py" + }, + { + "path": "tools/linter/adapters/nativefunctions_linter.py" + }, + { + "path": "tools/setup_helpers/BUILD.bazel" + }, + { + "path": "tools/setup_helpers/generate_code.py" + }, + { + "path": "torch/_C/__init__.pyi.in" + }, + { + "path": "torch/amp/autocast_mode.py" + }, + { + "path": "torch/ao/ns/fx/pattern_utils.py" + }, + { + "path": "torch/ao/quantization/backend_config/README.md" + }, + { + "path": "torch/ao/quantization/backend_config/__init__.py" + }, + { + "path": "torch/ao/quantization/backend_config/native.py" + }, + { + "path": "torch/ao/quantization/backend_config/observation_type.py" + }, + { + "path": "torch/ao/quantization/backend_config/tensorrt.py" + }, + { + "path": "torch/ao/quantization/backend_config/utils.py" + }, + { + "path": "torch/ao/quantization/fx/__init__.py" + }, + { + "path": "torch/ao/quantization/fx/backend_config/fuse_handler.py" + }, + { + "path": "torch/ao/quantization/fx/backend_config/quantize_handler.py" + }, + { + "path": "torch/ao/quantization/fx/backend_config_utils.py" + }, + { + "path": "torch/ao/quantization/fx/convert.py" + }, + { + "path": "torch/ao/quantization/fx/fuse.py" + }, + { + "path": "torch/ao/quantization/fx/fusion_patterns.py" + }, + { + "path": "torch/ao/quantization/fx/match_utils.py" + }, + { + "path": "torch/ao/quantization/fx/pattern_utils.py" + }, + { + "path": "torch/ao/quantization/fx/prepare.py" + }, + { + "path": "torch/ao/quantization/fx/quantization_patterns.py" + }, + { + "path": "torch/ao/quantization/qconfig.py" + }, + { + "path": "torch/ao/quantization/quantization_types.py" + }, + { + "path": "torch/ao/quantization/quantize_fx.py" + }, + { + "path": "torch/autograd/__init__.py" + }, + { + "path": "torch/csrc/Module.cpp" + }, + { + "path": "torch/csrc/autograd/FunctionsManual.cpp" + }, + { + "path": "torch/csrc/autograd/FunctionsManual.h" + }, + { + "path": "torch/csrc/autograd/engine.cpp" + }, + { + "path": "torch/csrc/autograd/function.h" + }, + { + "path": "torch/csrc/autograd/functions/accumulate_grad.h" + }, + { + "path": "torch/csrc/autograd/init.cpp" + }, + { + "path": "torch/csrc/autograd/python_torch_functions_manual.cpp" + }, + { + "path": "torch/csrc/autograd/python_variable.cpp" + }, + { + "path": "torch/csrc/autograd/record_function_ops.h" + }, + { + "path": "torch/csrc/autograd/utils/grad_layout_contract.h" + }, + { + "path": "torch/csrc/deploy/CMakeLists.txt" + }, + { + "path": "torch/csrc/distributed/c10d/logger.cpp" + }, + { + "path": "torch/csrc/jit/codegen/cuda/graph_fuser.cpp" + }, + { + "path": "torch/csrc/jit/codegen/cuda/parser.cpp" + }, + { + "path": "torch/csrc/jit/frontend/function_schema_parser.cpp" + }, + { + "path": "torch/csrc/jit/frontend/lexer.h" + }, + { + "path": "torch/csrc/jit/frontend/parser.cpp" + }, + { + "path": "torch/csrc/jit/frontend/parser.h" + }, + { + "path": "torch/csrc/jit/frontend/script_type_parser.cpp" + }, + { + "path": "torch/csrc/jit/frontend/source_range.cpp" + }, + { + "path": "torch/csrc/jit/frontend/source_range.h" + }, + { + "path": "torch/csrc/jit/frontend/source_ref.h" + }, + { + "path": "torch/csrc/jit/frontend/tracer.cpp" + }, + { + "path": "torch/csrc/jit/frontend/tracer.h" + }, + { + "path": "torch/csrc/jit/mobile/debug_info.cpp" + }, + { + "path": 
"torch/csrc/jit/mobile/debug_info.h" + }, + { + "path": "torch/csrc/jit/mobile/flatbuffer_loader.cpp" + }, + { + "path": "torch/csrc/jit/mobile/module.h" + }, + { + "path": "torch/csrc/jit/passes/common_expression_hoisting.cpp" + }, + { + "path": "torch/csrc/jit/passes/common_expression_hoisting.h" + }, + { + "path": "torch/csrc/jit/passes/frozen_graph_optimizations.cpp" + }, + { + "path": "torch/csrc/jit/passes/onnx/pattern_conversion/common.cpp" }, { - "path": ".circleci/docker/common/install_katex.sh" + "path": "torch/csrc/jit/passes/onnx/scalar_type_analysis.cpp" }, { - "path": ".github/workflows/pull.yml" - } - ], - "pageInfo": { - "endCursor": "Mw", - "hasNextPage": false - } - }, - "reviews": { - "nodes": [ + "path": "torch/csrc/jit/python/init.cpp" + }, { - "author": { - "login": "suo" - }, - "state": "COMMENTED" + "path": "torch/csrc/jit/python/python_tree_views.cpp" }, { - "author": { - "login": "kit1980" - }, - "state": "COMMENTED" + "path": "torch/csrc/jit/python/script_init.cpp" }, { - "author": { - "login": "janeyx99" - }, - "state": "APPROVED" - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNS0xOFQxNDo0MTowNS0wNTowMLkyMDIyLTA1LTE4VDE0OjQxOjA0LTA1OjAwzjpD7es=", - "hasPreviousPage": false - } - }, - "comments": { - "nodes": [ + "path": "torch/csrc/jit/runtime/graph_executor.cpp" + }, { - "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/77700\n\ud83d\udcc4 \u00a0Preview Python docs built from this PR\n\ud83d\udcc4 \u00a0Preview C++ docs built from this PR\n\u2753Need help or want to give feedback on the CI? Visit our office hours\n\n\u2705 No Failures (0 Pending)\nAs of commit 8126159 (more details on the Dr. CI page):\nExpand to see more\n\n\ud83d\udc9a \ud83d\udc9a Looks good so far! There are no failures yet. \ud83d\udc9a \ud83d\udc9a\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", - "author": { - "login": "facebook-github-bot" - }, - "authorAssociation": "MEMBER", - "editor": { - "login": "facebook-github-bot" - }, - "databaseId": 1129400934 + "path": "torch/csrc/jit/runtime/interpreter.cpp" }, { - "bodyText": "@pytorchbot merge", - "author": { - "login": "kit1980" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1131884232 + "path": "torch/csrc/jit/runtime/profiling_graph_executor_impl.cpp" }, { - "bodyText": "Merge failed due to Refusing to merge as mandatory check(s) linux-docs / build-docs (cpp), linux-docs / build-docs (python) are pending/not yet run for rule OSS CI\nRaised by https://github.com/pytorch/pytorch/actions/runs/2353067846", - "author": { - "login": "pytorchmergebot" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1131886153 + "path": "torch/csrc/jit/runtime/script_profile.cpp" }, { - "bodyText": "@pytorchbot merge -f", - "author": { - "login": "kit1980" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1131945610 + "path": "torch/csrc/jit/runtime/serialized_shape_function_registry.cpp" }, { - "bodyText": "Hey @kit1980.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' 
label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", - "author": { - "login": "github-actions" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 1131947473 - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOQ1FKZg==", - "hasPreviousPage": false - } - }, - "labels": { - "edges": [ + "path": "torch/csrc/jit/runtime/serialized_shape_function_registry.h" + }, { - "node": { - "name": "Merged" - } + "path": "torch/csrc/jit/runtime/shape_function_registry.h" }, { - "node": { - "name": "cla signed" - } - } - ] - } - } - } - } - }, - "query_sha=4c16925415d1fcc12ac0f5f7ce73b8e6122997d2f51c4c2757c2543e6493c60d cr_cursor=Y3Vyc29yOnYyOpHPAAAAAYNi1Nc= cs_cursor=Y3Vyc29yOnYyOpHPAAAAAYduu0A= name=pytorch number=77700 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "commits": { - "nodes": [ + "path": "torch/csrc/jit/runtime/shape_functions.h" + }, { - "commit": { - "oid": "81261599614423baa17df72300b8e109677b6799", - "checkSuites": { - "nodes": [ - { - "checkRuns": { - "nodes": [ - { - "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499259645?check_suite_focus=true" - }, - { - "name": "linux-docs / build-docs (cpp)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499394792?check_suite_focus=true" - }, - { - "name": "linux-docs / build-docs (python)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499394839?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499739021?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6499739073?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNqJcE=", - "hasNextPage": false - } - } - } - ] - } - } + "path": "torch/csrc/jit/runtime/shape_functions_1.h" + }, + { + "path": "torch/csrc/jit/runtime/static/impl.cpp" + }, + { + "path": "torch/csrc/jit/runtime/static/passes.cpp" + }, + { + "path": "torch/csrc/jit/runtime/symbolic_shape_registry.cpp" + }, + { + "path": "torch/csrc/jit/runtime/symbolic_shape_registry.h" + }, + { + "path": "torch/csrc/jit/serialization/export_module.cpp" + }, + { + "path": "torch/csrc/jit/serialization/flatbuffer_serializer.cpp" + }, + { + "path": "torch/csrc/jit/serialization/import.cpp" + }, + { + "path": "torch/csrc/jit/serialization/import_export_helpers.cpp" + }, + { + "path": "torch/csrc/jit/serialization/import_export_helpers.h" + }, + { + "path": "torch/csrc/jit/serialization/import_source.cpp" + }, + { + "path": "torch/csrc/jit/serialization/import_source.h" + }, + { + "path": "torch/csrc/jit/serialization/source_range_serialization.cpp" + }, + { + "path": "torch/csrc/jit/serialization/source_range_serialization.h" + }, + { + "path": "torch/csrc/jit/testing/file_check.cpp" + }, + { + "path": "torch/csrc/lazy/core/dynamic_ir.cpp" + }, + { + "path": "torch/csrc/lazy/core/dynamic_ir.h" + }, + { + "path": "torch/csrc/lazy/ts_backend/ts_eager_fallback.cpp" 
} - ] + ], + "pageInfo": { + "endCursor": "MzAw", + "hasNextPage": true + } } } } } }, - "query_sha=0e2a29eda6405cea4c9de20fb80ae7924910e17272a7b251040182e7d8c390e0 name=pytorch number=68111 owner=pytorch": { + "query_sha=0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98 cursor=MzAw name=pytorch number=76118 owner=pytorch": { "data": { "repository": { "pullRequest": { - "closed": true, - "isCrossRepository": true, - "author": { - "login": "chunyuan-w" - }, - "title": "Add JIT graph fuser for oneDNN Graph API (Preview4)", - "body": "## Description\r\nPreview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).\r\n\r\nOn the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:\r\n\r\n- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used\r\n- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.\r\n\r\n### User API:\r\nThe optimization pass is disabled by default. Users could enable it by:\r\n```\r\ntorch.jit.enable_onednn_fusion(True)\r\n```\r\n\r\n### Performance:\r\n[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:\r\n- SkyLake 8180 (1 socket of 28 cores):\r\n\r\n ![image](https://user-images.githubusercontent.com/65992142/151162305-05e44425-a24e-4d5e-94e1-743b40b87a8c.png)\r\n\r\n- SkyLake 8180 (single thread):\r\n\r\n ![image](https://user-images.githubusercontent.com/65992142/151162528-69f90b79-d08d-46b8-8775-d80a6ccbce8a.png)\r\n \\* By mapping hardswish to oneDNN Graph, it\u2019s 8% faster than PyTorch JIT (NNC + OFI)\r\n \\** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops\r\n\r\n\r\n### Directory structure of the integration code\r\nFuser-related code are placed under:\r\n```\r\ntorch/csrc/jit/codegen/onednn/\r\n```\r\n\r\nOptimization pass registration is done in:\r\n```\r\ntorch/csrc/jit/passes/onednn_graph_fuser.h\r\n```\r\n\r\nCMake for the integration code is:\r\n```\r\ncaffe2/CMakeLists.txt\r\n```\r\n\r\n## Limitations\r\n\r\n- In this PR, we have only supported the optimization on Linux platform. 
The support on Windows and MacOS will be enabled as the next step.\r\n- We have only optimized the inference use case.", - "headRefName": "chunyuan/llga_preview2", - "headRepository": { - "nameWithOwner": "chunyuan-w/pytorch" - }, - "baseRefName": "master", - "baseRepository": { - "nameWithOwner": "pytorch/pytorch", - "isPrivate": false, - "defaultBranchRef": { - "name": "master" - } - }, - "mergeCommit": null, - "commits_with_authors": { - "nodes": [ - { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": "0096fcc49f277fd8e006fcb42e0cb28a1422ec98" - } - }, - { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": "7bcc4de26a5472f1d252735dd425b46794b0844f" - } + "files": { + "nodes": [ + { + "path": "torch/csrc/lazy/ts_backend/ts_native_functions.cpp" }, { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": "3a2a588bfe6bbf9bf74d88d441cd22affda207da" - } + "path": "torch/csrc/utils/python_arg_parser.cpp" }, { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": "ca7df12fbfaa3ddbabeca39b76300d17f4a33f2f" - } + "path": "torch/csrc/utils/python_arg_parser.h" }, { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": "81d44f35b8bc043c38837d0694e5bc072203b832" - } + "path": "torch/csrc/utils/tensor_list.cpp" }, { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": "14fd5d1bfc2c58a71379f778871e3fca0a8e79b2" - } + "path": "torch/csrc/utils/tensor_new.cpp" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "954dc23663125897f4b199eb2a8607dc5fca3274" - } + "path": "torch/csrc/utils/tensor_new.h" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "9f77a0b476accc678b6f0569e4ff33fa6bbe97fc" - } + "path": "torch/distributed/_shard/__init__.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchitintel" - }, - "oid": "fbf3b23bc1288697e1aec539a7c4ee3dc0bcb84c" - } + "path": "torch/distributed/_shard/api.py" }, { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": "f8b8e78f786586c3cdf3966fd83ffa124d3eda70" - } + "path": "torch/distributed/_shard/replicated_tensor.py" }, { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": "6fffa2f7453ee7e0f8d8e2f73ea8a65230539589" - } + "path": "torch/distributed/_shard/sharded_tensor/__init__.py" }, { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": "849385404e6f3cd1cf7cef19f931ecf4fa28afdb" - } + "path": "torch/distributed/_shard/sharded_tensor/api.py" }, { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": 
"adbae7b77f8c0dbc59fccf15207d97ba86cfade2" - } + "path": "torch/distributed/_shard/sharded_tensor/utils.py" }, { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": "6dcf2a4981aff24fa16fc7461ae4ec29690f956f" - } + "path": "torch/distributed/algorithms/ddp_comm_hooks/debugging_hooks.py" }, { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": "54f3e05ad524cffd0911ee93be3c50f589b51f58" - } + "path": "torch/distributed/algorithms/model_averaging/utils.py" }, { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": "edbfc640ea79a0af85757d9e73796dcc90231519" - } + "path": "torch/distributed/fsdp/_optim_utils.py" }, { - "commit": { - "author": { - "user": { - "login": "chunyuan-w" - }, - "email": "chunyuan.wu@intel.com", - "name": "chunyuan" - }, - "oid": "67654db7cba562809d1b4a44cdda58af5cc9daaf" - } + "path": "torch/distributed/fsdp/fully_sharded_data_parallel.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "9c9d99b930b11af9ff03f52d45bf49c652df758d" - } + "path": "torch/distributed/nn/__init__.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "ffb25119cd9ce815cc4d9d14a2317fcbbfa9ea86" - } + "path": "torch/distributed/nn/functional.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "ab9eee84512ca1bdfbc81e25c6eb67b29d0f302a" - } + "path": "torch/distributed/optim/functional_adagrad.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "62a4642cf3330524990a69ac29e002c97812320a" - } + "path": "torch/fx/experimental/meta_tracer.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "ca9b1223be4af2c8b4929303d498eafd71793128" - } + "path": "torch/fx/graph.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "6f4a23d24514a02954d2ec792830085f612223c9" - } + "path": "torch/jit/_shape_functions.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchitintel" - }, - "oid": "b2a9a9c0926b02d0b2e87722ed61450f224a61d0" - } + "path": "torch/nn/parallel/_replicated_tensor_ddp_interop.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "e88b492be733f24b6aa395829c76add67d0901e7" - } + "path": "torch/nn/parallel/_replicated_tensor_ddp_utils.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "c44336d7a914952bfb78e012e08d9a6d6dde5937" - } + "path": "torch/nn/parallel/distributed.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": 
"5157930f7b3921d41a586260582b574c915f6ca1" - } + "path": "torch/nn/utils/_expanded_weights/__init__.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "04cb8353813f6bbd0d913a994923cc7e1e291406" - } + "path": "torch/nn/utils/_expanded_weights/instance_norm_expanded_weights.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchitintel" - }, - "oid": "62991eaad0e638bb0bced327e03f932f66f68732" - } + "path": "torch/onnx/symbolic_opset11.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchitintel" - }, - "oid": "7496bf1588050191595d833d23b8972b2f22655e" - } + "path": "torch/onnx/symbolic_opset12.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchitintel" - }, - "oid": "d9d35f23cca0cd29c78a845731b24826152dcf1c" - } + "path": "torch/onnx/symbolic_opset9.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "f74ec134f18a65a7c72455bdf44f72e3ebb27105" - } + "path": "torch/optim/adagrad.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "eb32cc65a975361160948bfc3d6a577991ea262e" - } + "path": "torch/optim/lr_scheduler.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "c7665f8d695b680c54db0bad2b7b7df46d886b50" - } + "path": "torch/overrides.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "e6321ad8f59ea01130568c202d186448bb9cb9d0" - } + "path": "torch/quantization/fx/pattern_utils.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "a72cd0d02693f45e5354a70654581ad514581ec7" - } + "path": "torch/quantization/fx/quantization_patterns.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "b3cd3028b4ed31805e82f7eaf02217ab74ca59b9" - } + "path": "torch/quantization/fx/quantization_types.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "49a592d9788d08e6cd0593882f867e129057c1cc" - } + "path": "torch/return_types.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "0575766b2144b13f6a38227c4e2b8d22ec8db80f" - } + "path": "torch/testing/_internal/common_device_type.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "b5c9b10ff87d622350e8ca64fae3a476eb70d5aa" - } + "path": "torch/testing/_internal/common_distributed.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "66bc652a30ccc329adb929870a4ac726bb98b38c" - } + "path": 
"torch/testing/_internal/common_fx2trt.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "72b9ca9c8e2dac98cbb7199b3dfac7c7305b80c5" - } + "path": "torch/testing/_internal/common_methods_invocations.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "a7892ed7373207d96406c8b5734a089643c5cdbd" - } + "path": "torch/testing/_internal/common_utils.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchitintel" - }, - "oid": "d54cb084e1daad8a08c3f8de0ad3f7afb5b05ac1" - } + "path": "torch/testing/_internal/composite_compliance.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchitintel" - }, - "oid": "aef71d692a8a159e0ca56be363e2cc1225ce7647" - } + "path": "torch/testing/_internal/distributed/distributed_test.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "bf618e205ec31cff962dcc8ab478e0a699a9572d" - } + "path": "torch/testing/_internal/jit_metaprogramming_utils.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "e4a331f1088448f7d7d86256ce71e0e71da006b0" - } + "path": "torch/utils/cpp_extension.py" }, { - "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "0b743523d1430fec759d5fefbb687f17c89335a5" - } + "path": "torch/utils/data/datapipes/_typing.py" }, + { + "path": "torch/utils/model_dump/__init__.py" + } + ], + "pageInfo": { + "endCursor": "MzQ4", + "hasNextPage": false + } + } + } + } + } + }, + "query_sha=4c16925415d1fcc12ac0f5f7ce73b8e6122997d2f51c4c2757c2543e6493c60d cr_cursor=Y3Vyc29yOnYyOpHPAAAAAWuVD9M= cs_cursor=Y3Vyc29yOnYyOpHPAAAAAXEsRtE= name=pytorch number=76118 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "commits": { + "nodes": [ { "commit": { - "author": { - "user": { - "login": "sanchitintel" - }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" - }, - "oid": "e80a351a62d98b810ec8985c4b25257af1d6c5bb" + "oid": "5696e8357cf38f852ef3d680381513e26f202371", + "checkSuites": { + "nodes": [ + { + "checkRuns": { + "nodes": [ + { + "name": "win-vs2019-cuda11.3-py3 / test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232785220" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAWuVECw=", + "hasNextPage": false + } + } + } + ] + } } - }, + } + ] + } + } + } + } + }, + "query_sha=e29d0e1d73b9847dacfd671c80e117d82111b407f7daa8ff885d3c444eafe47f name=pytorch number=79694 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": true, + "author": { + "login": "kshitij12345" + }, + "title": "[complex] conv_transpose1d", + "body": "Reference: https://github.com/pytorch/pytorch/issues/71108", + "headRefName": "develop/complex/conv_transpose1d", + "headRepository": { + "nameWithOwner": "kshitij12345/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + 
"defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ { "commit": { "author": { "user": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" }, - "oid": "c189eca154b6691919d0e21489d1c322c7435c0b" + "oid": "d1ea948e65ac6d31ad056287ab65d38ecc68b30d" } }, { "commit": { "author": { "user": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "email": "sanchit.jain@intel.com", - "name": "sanchitintel" + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" }, - "oid": "e080a067c75d7b888a8a362682a2d5ba70e0c3a8" + "oid": "b4ba1db9a3a71bd8c03158dcd1b68711360633d8" } }, { "commit": { "author": { "user": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "email": "sanchit.jain@intel.com", - "name": "sanchitintel" + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" }, - "oid": "028561fbf8f3ed90e074e6e0e3a4ca4dd7ffa2a8" + "oid": "655a4220beae163bfe578f0318a130df01ec05d6" } }, { "commit": { "author": { "user": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" + "email": "kshitijkalambarkar@gmail.com", + "name": "Kshiteej K" }, - "oid": "d550cf14037badd4caa2f52202e2f20bc4db8432" + "oid": "8181716be7a8005eb13ad5c3f2e1279ed1c60aff" } }, { "commit": { "author": { "user": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" }, - "oid": "574159ebadd1dec24daaf883879ffeca8d9e71b7" + "oid": "9e5ca3663e7471786eeebebfdf84aea5d761712f" } }, { "commit": { "author": { "user": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" }, - "oid": "9eb3ee98ea756067ed1c8f52f309f6d3e211a904" + "oid": "9c110f39bcdc4e56386b6f9c4e2c082c8940ade6" } }, { "commit": { "author": { "user": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" }, - "oid": "29929f48be03dcdd1bbfade572de7feafa825547" + "oid": "49315e79d0eee8008e2a74575c6fc0f6a9531ee4" } }, { "commit": { "author": { "user": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" }, - "oid": "8a7358ca8da547b40ea1a99ddc57ebed19959684" + "oid": "728752480760226270c374a0acc08e28b9b133f3" } }, { "commit": { "author": { "user": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" }, - "oid": "6606637d2c5525b43e294a8b366a85052e1be0c6" + "oid": "ffe43399d6f60ef7844523a5f465c11d9a67062f" } }, { "commit": { "author": { "user": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "email": "sanchit.jain@intel.com", - "name": "sanchit.jain" + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" }, - "oid": "5ecfd1f28b87045deb8bc8ffe33b3d8b906f3264" + "oid": "9672a2198472567bae4ac6f55d004f7e1fa8a9fa" } }, { "commit": { "author": { "user": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "email": 
"sanchit.jain@intel.com", - "name": "sanchit.jain" + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" }, - "oid": "be2d4345c65442c4cfbe8afdfb2ae0893945da42" + "oid": "48a0ebf32b895286f036b36c871f671dc867e400" } }, { "commit": { "author": { "user": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "email": "sanchit.jain@intel.com", - "name": "sanchitintel" + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" }, - "oid": "b5b89d3644a43e2dbda841cafb71b32edbe07c8a" + "oid": "52fbe80d5c8a94e03d816c0bd21fd82019dcd5ac" } }, { "commit": { "author": { "user": { - "login": "malfet" + "login": "kshitij12345" }, - "email": "nikita.shulga@gmail.com", - "name": "Nikita Shulga" + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" }, - "oid": "73881411e2bfb3aaa2e89926a82390b4c587ad75" + "oid": "2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce" } } ], "pageInfo": { - "endCursor": "NjI", + "endCursor": "MTM", "hasNextPage": false }, - "totalCount": 62 + "totalCount": 13 }, "commits": { "nodes": [ @@ -11764,7 +19262,7 @@ { "name": "Facebook CLA Check", "conclusion": "SUCCESS", - "detailsUrl": "https://code.intern.facebook.com/cla/" + "detailsUrl": "https://code.facebook.com/cla/" }, { "name": "Meta Internal-Only Changes Check", @@ -11773,14 +19271,14 @@ } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_NXnc=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdtq8Hc=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/73881411e2bfb3aaa2e89926a82390b4c587ad75/checks?check_suite_id=5743625010" + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7929899098" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVZYwzI=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdioqFo=" }, { "node": { @@ -11790,81 +19288,26 @@ }, "workflowRun": { "workflow": { - "name": "Lint" + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" } }, "checkRuns": { "nodes": [ { - "name": "clang-format", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633826958?check_suite_focus=true" - }, - { - "name": "py2-setup-validate-errormsg", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633827084?check_suite_focus=true" - }, - { - "name": "quick-checks", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633827160?check_suite_focus=true" - }, - { - "name": "shellcheck", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633827410?check_suite_focus=true" - }, - { - "name": "toc", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633827566?check_suite_focus=true" - }, - { - "name": "clang-tidy", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633827701?check_suite_focus=true" - }, - { - "name": "cmakelint", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633827899?check_suite_focus=true" - }, - { - "name": "flake8-py3", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633828081?check_suite_focus=true" - }, - { - "name": "Test collect_env (with_torch)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633828249?check_suite_focus=true" - }, - { - "name": "Test collect_env (without_torch)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633828312?check_suite_focus=true" 
- }, - { - "name": "Test tools", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633828407?check_suite_focus=true" - }, - { - "name": "mypy", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633828524?check_suite_focus=true" + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393316/jobs/4628529923" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_NZqw=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdqTEwk=", "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/73881411e2bfb3aaa2e89926a82390b4c587ad75/checks?check_suite_id=5743625458" + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7929899387" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVZYxPI=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdioqXs=" }, { "node": { @@ -11874,26 +19317,66 @@ }, "workflowRun": { "workflow": { - "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + "name": "Lint" } }, "checkRuns": { "nodes": [ { - "name": "run-torchbench", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633826956?check_suite_focus=true" + "name": "lintrunner", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628529910" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628530162" + }, + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628530698" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628530867" + }, + { + "name": "Test collect_env (older_python_version)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628530989" + }, + { + "name": "pr-sanity-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628531151" + }, + { + "name": "workflow-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628531475" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628531753" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628531853" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_NYIw=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdqTHFY=", "hasNextPage": false } }, - "conclusion": "SKIPPED", - "url": "https://github.com/pytorch/pytorch/commit/73881411e2bfb3aaa2e89926a82390b4c587ad75/checks?check_suite_id=5743625463" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7929899388" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVZYxPc=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdioqXw=" }, { "node": { @@ -11909,782 +19392,1391 @@ "checkRuns": { "nodes": [ { - "name": "pytorch-xla-linux-bionic-py3.7-clang8", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633827223?check_suite_focus=true" + "name": 
"linux-focal-py3.7-clang7-asan / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628531149" }, { - "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", + "name": "linux-bionic-cuda11.6-py3.10-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633827451?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628531473" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633827729?check_suite_focus=true" + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628531754" }, { - "name": "linux-bionic-rocm4.5-py3.7 / build", + "name": "linux-jammy-cuda11.6-cudnn8-py3.8-clang12 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633827956?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628531857" }, { - "name": "linux-xenial-py3.7-clang7-asan / build", + "name": "linux-focal-py3.7-gcc7-pch / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633828089?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628532179" }, { - "name": "linux-xenial-py3.7-clang7-onnx / build", + "name": "linux-focal-py3.7-clang10-onnx / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633828258?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628532543" }, { - "name": "linux-bionic-py3.7-clang9 / build", + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633828406?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628532694" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "name": "linux-focal-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633828523?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628532918" }, { - "name": "linux-xenial-py3.7-gcc5.4 / build", + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628533033" + }, + { + "name": "linux-focal-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628533181" + }, + { + "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628533420" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628533630" + }, + { + "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628533825" + }, + { + "name": 
"linux-xenial-py3-clang5-mobile-custom-build-static / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633828594?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628533959" }, { "name": "linux-xenial-py3-clang5-mobile-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633828765?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628534129" }, { - "name": "linux-xenial-py3.7-gcc7-no-ops / build", + "name": "linux-bionic-py3_7-clang8-xla / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633828992?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628534256" }, { - "name": "linux-xenial-py3.7-gcc7 / build", + "name": "linux-focal-rocm5.2-py3.7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633829085?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628534388" }, { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "name": "linux-focal-py3.7-gcc7-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628534571" + }, + { + "name": "linux-bionic-cuda11_6-py3_10-gcc7-deploy / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628534714" + }, + { + "name": "win-vs2019-cuda11.6-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633829195?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628534989" }, { "name": "win-vs2019-cpu-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633829321?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628535311" }, { - "name": "win-vs2019-cuda11.3-py3 / build", + "name": "linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633829420?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639115" }, { - "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "name": "linux-focal-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633829488?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639198" }, { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "name": "linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633829666?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639265" }, { - "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "name": "linux-focal-py3.7-gcc7 / test (functorch, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633829746?check_suite_focus=true" + 
"detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639339" }, { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "name": "linux-focal-py3.7-gcc7 / test (docs_test, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633829845?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639395" }, { - "name": "pytorch-xla-linux-bionic-py3.7-clang8", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5633829904?check_suite_focus=true" + "name": "linux-focal-py3.7-gcc7 / test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639450" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639509" }, { "name": "linux-docs / build-docs (cpp)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634453168?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639572" }, { "name": "linux-docs / build-docs (python)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634453232?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639635" }, { - "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634453388?check_suite_focus=true" + "name": "linux-focal-py3.7-clang10-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647047" }, { - "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634453444?check_suite_focus=true" + "name": "linux-focal-py3.7-clang10-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647119" }, { - "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634453499?check_suite_focus=true" + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647215" }, { - "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634453573?check_suite_focus=true" + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647277" }, { - "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634453624?check_suite_focus=true" + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647348" }, { - "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634453683?check_suite_focus=true" + "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647432" }, { - "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634462211?check_suite_focus=true" + "name": "linux-bionic-py3.7-clang9 / test (dynamo, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647522" }, { - "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634462270?check_suite_focus=true" + "name": "linux-bionic-py3.7-clang9 / test (dynamo, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647641" }, { - "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634602176?check_suite_focus=true" + "name": "linux-bionic-py3.7-clang9 / test (functorch, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647762" }, { - "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634602239?check_suite_focus=true" + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628653797" }, { - "name": "linux-bionic-py3.7-clang9 / test (noarch, 1, 1, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634602319?check_suite_focus=true" + "name": "linux-focal-py3.7-clang7-asan / test (default, 1, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628679376" }, { - "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634602425?check_suite_focus=true" + "name": "linux-focal-py3.7-clang7-asan / test (default, 2, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628679431" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634622529?check_suite_focus=true" + "name": "linux-focal-py3.7-clang7-asan / test (default, 3, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628679469" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634622639?check_suite_focus=true" + "name": "linux-focal-py3.7-clang7-asan / test (default, 
4, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628679519" }, { - "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634622730?check_suite_focus=true" + "name": "linux-focal-py3.7-clang7-asan / test (default, 5, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628679594" }, { - "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634637718?check_suite_focus=true" + "name": "linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628681226" }, { - "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634637817?check_suite_focus=true" + "name": "linux-bionic-cuda11_6-py3_10-gcc7-deploy / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628854932" }, { - "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634775159?check_suite_focus=true" + "name": "linux-bionic-cuda11.6-py3.10-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628856434" }, { - "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634775273?check_suite_focus=true" + "name": "linux-bionic-cuda11.6-py3.10-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628856501" }, { - "name": "win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634823038?check_suite_focus=true" + "name": "linux-bionic-cuda11.6-py3.10-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628856575" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdqZ2fA=", + "hasNextPage": true + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7929899419" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdioqZs=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "windows-binary-libtorch-debug" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "libtorch-cpu-shared-with-deps-debug-build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351637/jobs/4634503587" }, { - "name": "win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634823099?check_suite_focus=true" + "name": 
"libtorch-cpu-shared-with-deps-debug-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351637/jobs/4635312938" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsbsmM=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953056" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUSuA=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "windows-binary-wheel" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "wheel-py3_7-cuda11_3-build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351640/jobs/4634503571" }, { - "name": "win-vs2019-cuda11.3-py3 / test (force_on_cpu, 1, 1, windows.4xlarge)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634823171?check_suite_focus=true" + "name": "wheel-py3_7-cuda11_3-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351640/jobs/4636146265" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsskcw=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953059" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUSuM=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "windows-binary-libtorch-release" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "libtorch-cpu-shared-with-deps-release-build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351643/jobs/4634503570" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634920855?check_suite_focus=true" + "name": "libtorch-cpu-shared-with-deps-release-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351643/jobs/4635003925" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsVbD8=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953061" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUSuU=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-binary-libtorch-cxx11-abi" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "libtorch-cpu-shared-with-deps-cxx11-abi-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351698/jobs/4634504079" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634921428?check_suite_focus=true" - }, + "name": "libtorch-cpu-shared-with-deps-cxx11-abi-test / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351698/jobs/4635072931" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsW5Aw=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": 
"https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953185" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUS2E=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-binary-libtorch-pre-cxx11" + } + }, + "checkRuns": { + "nodes": [ { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634921484?check_suite_focus=true" + "name": "libtorch-cpu-shared-with-deps-cxx11-abi-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351700/jobs/4634503897" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634921543?check_suite_focus=true" - }, + "name": "libtorch-cpu-shared-with-deps-cxx11-abi-test / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351700/jobs/4635077148" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsW-jo=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953186" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUS2I=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-binary-manywheel" + } + }, + "checkRuns": { + "nodes": [ { - "name": "linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634995986?check_suite_focus=true" + "name": "manywheel-py3_7-cuda10_2-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351699/jobs/4634503896" }, { - "name": "linux-bionic-rocm4.5-py3.7 / test (default, 2, 2, linux.rocm.gpu)", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5634996056?check_suite_focus=true" + "name": "manywheel-py3_7-cuda10_2-test / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351699/jobs/4635934290" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_fN1g=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsoMEA=", "hasNextPage": false } }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/73881411e2bfb3aaa2e89926a82390b4c587ad75/checks?check_suite_id=5743625483" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953187" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVZYxQs=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUS2M=" } ], "pageInfo": { - "hasNextPage": false + "hasNextPage": true } }, - "pushedDate": "2022-03-21T19:58:52Z", - "oid": "73881411e2bfb3aaa2e89926a82390b4c587ad75" + "status": null, + "pushedDate": "2022-08-22T22:04:19Z", + "oid": "2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce" } } ] }, - "changedFiles": 37, - "files": { - "nodes": [ - { - "path": "aten/src/ATen/core/interned_strings.h" - }, - { - "path": "caffe2/CMakeLists.txt" - }, - { - "path": "cmake/Dependencies.cmake" - }, - { - "path": "cmake/Modules/FindMKLDNN.cmake" - }, - { - "path": "cmake/public/mkldnn.cmake" - }, - { - 
"path": "docs/source/jit.rst" - }, - { - "path": "test/test_jit_llga_fuser.py" - }, - { - "path": "torch/_C/__init__.pyi.in" - }, - { - "path": "torch/csrc/jit/codegen/onednn/LlgaTensorImpl.cpp" - }, - { - "path": "torch/csrc/jit/codegen/onednn/LlgaTensorImpl.h" - }, - { - "path": "torch/csrc/jit/codegen/onednn/README.md" - }, - { - "path": "torch/csrc/jit/codegen/onednn/defer_size_check.cpp" - }, - { - "path": "torch/csrc/jit/codegen/onednn/defer_size_check.h" - }, - { - "path": "torch/csrc/jit/codegen/onednn/graph_fuser.cpp" - }, - { - "path": "torch/csrc/jit/codegen/onednn/graph_fuser.h" - }, - { - "path": "torch/csrc/jit/codegen/onednn/graph_helper.cpp" - }, - { - "path": "torch/csrc/jit/codegen/onednn/graph_helper.h" - }, - { - "path": "torch/csrc/jit/codegen/onednn/graph_rewriter.cpp" - }, - { - "path": "torch/csrc/jit/codegen/onednn/guard_shape.cpp" - }, - { - "path": "torch/csrc/jit/codegen/onednn/guard_shape.h" - }, - { - "path": "torch/csrc/jit/codegen/onednn/interface.cpp" - }, - { - "path": "torch/csrc/jit/codegen/onednn/interface.h" - }, - { - "path": "torch/csrc/jit/codegen/onednn/kernel.cpp" - }, - { - "path": "torch/csrc/jit/codegen/onednn/kernel.h" - }, - { - "path": "torch/csrc/jit/codegen/onednn/layout_propagation.cpp" - }, - { - "path": "torch/csrc/jit/codegen/onednn/layout_propagation.h" - }, - { - "path": "torch/csrc/jit/codegen/onednn/operator.h" - }, - { - "path": "torch/csrc/jit/codegen/onednn/prepare_binary.cpp" - }, - { - "path": "torch/csrc/jit/codegen/onednn/prepare_binary.h" - }, - { - "path": "torch/csrc/jit/codegen/onednn/register_interface.cpp" - }, - { - "path": "torch/csrc/jit/ir/alias_analysis.cpp" - }, - { - "path": "torch/csrc/jit/ir/ir.cpp" - }, - { - "path": "torch/csrc/jit/passes/inline_autodiff_subgraphs.cpp" - }, - { - "path": "torch/csrc/jit/passes/onednn_graph_fuser.h" - }, - { - "path": "torch/csrc/jit/python/init.cpp" - }, - { - "path": "torch/csrc/jit/runtime/operator.cpp" - }, - { - "path": "torch/jit/__init__.py" - } - ], - "pageInfo": { - "endCursor": "Mzc", - "hasNextPage": false - } - }, - "reviews": { - "nodes": [ - { - "author": { - "login": "pinzhenx" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "pinzhenx" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "sanchitintel" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "sanchitintel" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "pinzhenx" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "sanchitintel" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "chunyuan-w" - }, - "state": "COMMENTED" - }, - { - "author": { - "login": "eellison" - }, - "state": "COMMENTED" - }, + "changedFiles": 3, + "files": { + "nodes": [ { - "author": { - "login": "sanchitintel" - }, - "state": "COMMENTED" + "path": "aten/src/ATen/native/Convolution.cpp" }, { - "author": { - "login": "sanchitintel" - }, - "state": "COMMENTED" + "path": "torch/testing/_internal/common_methods_invocations.py" }, { - "author": { - "login": "sanchitintel" - }, - "state": "COMMENTED" - }, + "path": "torch/testing/_internal/common_modules.py" + } + ], + "pageInfo": { + "endCursor": "Mw", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ { "author": { - "login": "sanchitintel" + "login": "ngimel" }, - "state": "COMMENTED" - }, + "state": "APPROVED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNy0xOVQxMDowNzo1NC0wNzowMLkyMDIyLTA3LTE5VDEwOjA3OjU0LTA3OjAwzj43QcY=", + "hasPreviousPage": false + } + }, + "comments": { + 
"nodes": [ { + "bodyText": "@pytorchbot merge -g\nAll is green internally!", "author": { - "login": "sanchitintel" + "login": "albanD" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1224702749 }, { + "bodyText": "@pytorchbot successfully started a merge job. Check the current status here.\nThe merge job was triggered with the green (-g) flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.\nPlease reach out to the PyTorch DevX Team with feedback or questions!", "author": { - "login": "sanchitintel" + "login": "pytorchmergebot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1224705564 }, { + "bodyText": "Thanks for looking into it \ud83d\ude42 @albanD @jeanschmidt", "author": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1224712351 }, { + "bodyText": "Hey @kshitij12345.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", "author": { - "login": "sanchitintel" + "login": "github-actions" }, - "state": "COMMENTED" + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1224956051 }, { + "bodyText": "Yeah, discussed with my manager and I got the required permissions to do so. Sorry for not responding promptly yesterday. 
But I am available from now on to provide assistance :)", "author": { - "login": "sanchitintel" + "login": "jeanschmidt" }, - "state": "COMMENTED" - }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1225462612 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOSP97HQ==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ { - "author": { - "login": "sanchitintel" - }, - "state": "COMMENTED" + "node": { + "name": "open source" + } }, { - "author": { - "login": "sanchitintel" - }, - "state": "COMMENTED" + "node": { + "name": "Merged" + } }, { - "author": { - "login": "sanchitintel" - }, - "state": "COMMENTED" + "node": { + "name": "cla signed" + } }, { - "author": { - "login": "sanchitintel" - }, - "state": "COMMENTED" + "node": { + "name": "Reverted" + } }, { - "author": { - "login": "sanchitintel" - }, - "state": "COMMENTED" + "node": { + "name": "ciflow/trunk" + } }, { - "author": { - "login": "sanchitintel" - }, - "state": "COMMENTED" - }, + "node": { + "name": "ciflow/periodic" + } + } + ] + } + } + } + } + }, + "query_sha=62ce809793481ce6ddce6e1a19d9b0761755ff0ff75decaf8a79419eaf793110 cursor=Y3Vyc29yOnYyOpHOSP97HQ== name=pytorch number=79694 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "comments": { + "nodes": [ { + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/79694\n\ud83d\udcc4 \u00a0Preview Python docs built from this PR\n\ud83d\udcc4 \u00a0Preview C++ docs built from this PR\n\u2753Need help or want to give feedback on the CI? Visit our office hours\n\n\u2705 No Failures (0 Pending)\nAs of commit 2fd08f1 (more details on the Dr. CI page):\nExpand to see more\n\n\ud83d\udc9a \ud83d\udc9a Looks good so far! There are no failures yet. \ud83d\udc9a \ud83d\udc9a\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. 
CI Users group.\nClick here to manually regenerate this comment.", "author": { - "login": "sanchitintel" + "login": "facebook-github-bot" }, - "state": "COMMENTED" - }, - { - "author": { - "login": "sanchitintel" + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" }, - "state": "COMMENTED" + "databaseId": 1157454523 }, { + "bodyText": "Unable to reproduce jit failure locally (will skip the test)\nCI Failure : https://github.com/pytorch/pytorch/runs/6926187074?check_suite_focus=true#step:9:20230\npytest test/test_ops_jit.py -k test_variant_consistency_jit_nn_functional_conv_transpose1d_cpu_complex64 -v\n=============================================================== test session starts ===============================================================\nplatform linux -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/kshiteej/.conda/envs/pytorch-cuda-dev/bin/python\ncachedir: .pytest_cache\nhypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/kshiteej/Pytorch/pytorch_complex_convolution.py/.hypothesis/examples')\nrootdir: /home/kshiteej/Pytorch/pytorch_complex_convolution.py, configfile: pytest.ini\nplugins: hypothesis-6.23.2, repeat-0.9.1\ncollected 1976 items / 1975 deselected / 1 selected \n\ntest/test_ops_jit.py::TestJitCPU::test_variant_consistency_jit_nn_functional_conv_transpose1d_cpu_complex64 PASSED [100%]\n\n================================================================ warnings summary =================================================================\n../../.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/testing/_internal/common_cuda.py:9\n /home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/testing/_internal/common_cuda.py:9: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives\n from distutils.version import LooseVersion\n\n../../.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/backends/cudnn/__init__.py:91\n /home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/backends/cudnn/__init__.py:91: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.\n warnings.warn(\n\n-- Docs: https://docs.pytest.org/en/stable/warnings.html\n================================================= 1 passed, 1975 deselected, 2 warnings in 4.90s =================================================", "author": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "state": "COMMENTED" - }, - { - "author": { - "login": "sanchitintel" + "authorAssociation": "COLLABORATOR", + "editor": { + "login": "kshitij12345" }, - "state": "COMMENTED" + "databaseId": 1186949486 }, { + "bodyText": "@pytorchbot merge", "author": { - "login": "wukong1992" + "login": "ngimel" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189347786 }, { + "bodyText": "@pytorchbot successfully started a merge job. Check the current status here", "author": { - "login": "eellison" + "login": "pytorchmergebot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189350009 }, { + "bodyText": "Hey @kshitij12345.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' 
label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", "author": { - "login": "eellison" + "login": "github-actions" }, - "state": "COMMENTED" + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1189350932 }, { + "bodyText": "@pytorchbot revert -m \"broke slow test https://github.com/pytorch/pytorch/runs/7414560957?check_suite_focus=true#step:9:31516\" -c \"nosignal\"", "author": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1189459845 }, { + "bodyText": "@pytorchbot successfully started a revert job. Check the current status here", "author": { - "login": "sanchitintel" + "login": "pytorchmergebot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189460926 }, { + "bodyText": "Will not revert as @kshitij12345 is not a MEMBER, but COLLABORATOR", "author": { - "login": "eellison" + "login": "pytorchmergebot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189460942 }, { + "bodyText": "@pytorchbot revert -m \"broke slow test https://github.com/pytorch/pytorch/runs/7414560957?check_suite_focus=true#step:9:31516\" -c \"nosignal\"", "author": { - "login": "sanchitintel" + "login": "anjali411" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189529734 }, { + "bodyText": "@pytorchbot successfully started a revert job. Check the current status here", "author": { - "login": "sanchitintel" + "login": "pytorchmergebot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189530756 }, { + "bodyText": "@kshitij12345 your PR has been successfully reverted.", "author": { - "login": "sanchitintel" + "login": "pytorchmergebot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189530831 }, { + "bodyText": "@pytorchbot merge -g", "author": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1190070141 }, { + "bodyText": "@pytorchbot successfully started a merge job. Check the current status here", "author": { - "login": "sanchitintel" + "login": "pytorchmergebot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1190071424 }, { + "bodyText": "Hey @kshitij12345.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' 
and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", "author": { - "login": "eellison" + "login": "github-actions" }, - "state": "APPROVED" + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1190258272 }, { + "bodyText": "commit is breaking internal builds/tests https://pastebin.com/HX4RUusH (pytorch/functorch/test:test_eager_transforms)", "author": { - "login": "sanchitintel" + "login": "jeanschmidt" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1191327616 }, { + "bodyText": "@pytorchbot revert -m \"breaking internal builds\" -c \"ghfirst\"", "author": { - "login": "eellison" + "login": "jeanschmidt" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1191328013 }, { + "bodyText": "@pytorchbot revert -m \"breaking internal builds\" -c \"ghfirst\"", "author": { - "login": "malfet" + "login": "jeanschmidt" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1191329792 }, { + "bodyText": "@pytorchbot successfully started a revert job. Check the current status here", "author": { - "login": "sanchitintel" + "login": "pytorchmergebot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1191330586 }, { + "bodyText": "@kshitij12345 your PR has been successfully reverted.", "author": { - "login": "malfet" + "login": "pytorchmergebot" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1191330690 }, { + "bodyText": "@jeanschmidt which test is it failing on? I tried running the test_eager_transforms in functorch but couldn't reproduce it.", "author": { - "login": "malfet" + "login": "kshitij12345" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1193667568 }, { + "bodyText": "@jbschlosser have added a ref as discussed offline. Can you please take a look? And if it looks good, can you import the PR to check if it is breaking anything internally.\nThanks", "author": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1204329491 }, { + "bodyText": "@jbschlosser @jeanschmidt @albanD anything we can do to unblock this on our side?", "author": { - "login": "sanchitintel" + "login": "lezcano" }, - "state": "COMMENTED" + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1221266218 }, { + "bodyText": "Functorch tests should be running here now so can you rebase on top of master please?", "author": { - "login": "sanchitintel" + "login": "albanD" }, - "state": "COMMENTED" + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1223129944 }, { + "bodyText": "@albanD have rebased on latest master.", "author": { - "login": "sanchitintel" + "login": "kshitij12345" }, - "state": "COMMENTED" - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpO5MjAyMS0xMi0xMFQxMToyNDoxOS0wNjowMLkyMDIxLTEyLTEwVDExOjI0OjE5LTA2OjAwzjFryLE=", - "hasPreviousPage": false - } - }, - "comments": { - "nodes": [ + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1223758571 + }, { - "bodyText": "Looks like this broke master https://hud.pytorch.org/pytorch/pytorch/commit/7dd08230117f4fa8bb82b3524e90fb00340198c7. 
I am reverting.", + "bodyText": "I triggered all the tests not to have any issues with slow tests again", "author": { - "login": "suo" + "login": "lezcano" }, - "authorAssociation": "MEMBER", + "authorAssociation": "COLLABORATOR", "editor": null, - "databaseId": 1074498483 + "databaseId": 1223796413 }, { - "bodyText": "@pytorchbot revert this", + "bodyText": "Thanks @lezcano! However, last time it was reverted for internal failures. So it would be great if someone can import and verify that.\ncc: @albanD @jeanschmidt", "author": { - "login": "suo" + "login": "kshitij12345" }, - "authorAssociation": "MEMBER", + "authorAssociation": "COLLABORATOR", "editor": null, - "databaseId": 1074498550 + "databaseId": 1223863075 }, { - "bodyText": "Looks like this broke master https://hud.pytorch.org/pytorch/pytorch/commit/7dd08230117f4fa8bb82b3524e90fb00340198c7. I am reverting.\n\nOops! Will fix it ASAP.", + "bodyText": "@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", "author": { - "login": "sanchitintel" + "login": "facebook-github-bot" }, - "authorAssociation": "CONTRIBUTOR", + "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1074499668 + "databaseId": 1224175731 }, { - "bodyText": "This pull request has been reverted by e5bf879. To re-land this change, please open another pull request, assignthe same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).", + "bodyText": "I am not the right person to provide assistence, as currently I am not based in a Tier 1 location, so my permissions to access are so restricted that I am not able to import this commit, run the tests and provide meaningful responses.", "author": { - "login": "facebook-github-bot" + "login": "jeanschmidt" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1074508608 + "databaseId": 1224272324 }, { - "bodyText": "This pull request has been reverted by e5bf879. To re-land this change, please open another pull request, assignthe same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).", + "bodyText": "@jeanschmidt has imported this pull request. 
If you are a Meta employee, you can view this diff on Phabricator.", "author": { "login": "facebook-github-bot" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1082508130 + "databaseId": 1224351135 } ], "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOQAuLsw==", - "hasPreviousPage": true + "startCursor": "Y3Vyc29yOnYyOpHORP1auw==", + "hasPreviousPage": false } - }, - "labels": { - "edges": [ - { - "node": { - "name": "oncall: jit" - } - }, - { - "node": { - "name": "triaged" - } - }, - { - "node": { - "name": "open source" - } - }, - { - "node": { - "name": "cla signed" - } - }, + } + } + } + } + }, + "query_sha=4c16925415d1fcc12ac0f5f7ce73b8e6122997d2f51c4c2757c2543e6493c60d cr_cursor=Y3Vyc29yOnYyOpHPAAAAAdqZ2fA= cs_cursor=Y3Vyc29yOnYyOpHPAAAAAdioqXw= name=pytorch number=79694 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "commits": { + "nodes": [ { - "node": { - "name": "Reverted" + "commit": { + "oid": "2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce", + "checkSuites": { + "nodes": [ + { + "checkRuns": { + "nodes": [ + { + "name": "linux-bionic-cuda11.6-py3.10-gcc7 / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628856668" + }, + { + "name": "linux-bionic-cuda11.6-py3.10-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628856772" + }, + { + "name": "linux-bionic-cuda11.6-py3.10-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628856812" + }, + { + "name": "linux-bionic-cuda11.6-py3.10-gcc7 / test (functorch, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628856867" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628858900" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628858948" + }, + { + "name": "win-vs2019-cpu-py3 / test (functorch, 1, 1, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628859006" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdqZ5lE=", + "hasNextPage": false + } + } + } + ] + } } - }, + } + ] + } + } + } + } + }, + "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAdkUS2M= name=pytorch number=79694 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "commits": { + "nodes": [ { - "node": { - "name": "intel priority" + "commit": { + "oid": "2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce", + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "trunk" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "macos-12-py3-x86-64 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634504326" + }, + { + "name": "macos-12-py3-arm64 / build", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634504522" + }, + { + "name": "parallelnative-linux-focal-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634504655" + }, + { + "name": "caffe2-linux-focal-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634504882" + }, + { + "name": "android-emulator-build-test / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634505033" + }, + { + "name": "ios-12-5-1-x86-64 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634505167" + }, + { + "name": "linux-bionic-py3.7-clang9-slow / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634505347" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634505499" + }, + { + "name": "libtorch-linux-bionic-cuda11.6-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634505639" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634505767" + }, + { + "name": "win-vs2019-cuda11.6-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634506032" + }, + { + "name": "macos-12-py3-x86-64-lite-interpreter / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634506202" + }, + { + "name": "linux-focal-rocm5.2-py3.7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634506357" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634506535" + }, + { + "name": "linux-bionic-py3.7-clang9-slow / test (slow, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634664404" + }, + { + "name": "parallelnative-linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634669945" + }, + { + "name": "parallelnative-linux-focal-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634670046" + }, + { + "name": "macos-12-py3-x86-64 / test (default, 1, 2, macos-12)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634734165" + }, + { + "name": "macos-12-py3-x86-64 / test (default, 2, 2, macos-12)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634734293" + }, + { + "name": "macos-12-py3-x86-64 / test (functorch, 1, 1, macos-12)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634734388" + }, + { + "name": 
"linux-focal-rocm5.2-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634772323" + }, + { + "name": "linux-focal-rocm5.2-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634772410" + }, + { + "name": "macos-12-py3-arm64 / test (default, 1, 2, macos-m1-12)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634812657" + }, + { + "name": "macos-12-py3-arm64 / test (default, 2, 2, macos-m1-12)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634812746" + }, + { + "name": "macos-12-py3-arm64-mps / Run MPS tests", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634812878" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634868761" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634868884" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634869012" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634869132" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (functorch, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634869240" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (slow, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634869348" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (slow, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634869457" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (nogpu_AVX512, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634869537" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (nogpu_NO_AVX2, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634869649" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (jit_legacy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634869743" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634869861" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": 
"SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4634869984" + }, + { + "name": "win-vs2019-cuda11.6-py3 / test (default, 1, 5, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4635049837" + }, + { + "name": "win-vs2019-cuda11.6-py3 / test (default, 2, 5, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4635049935" + }, + { + "name": "win-vs2019-cuda11.6-py3 / test (default, 3, 5, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4635050025" + }, + { + "name": "win-vs2019-cuda11.6-py3 / test (default, 4, 5, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4635050129" + }, + { + "name": "win-vs2019-cuda11.6-py3 / test (default, 5, 5, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4635050234" + }, + { + "name": "win-vs2019-cuda11.6-py3 / test (functorch, 1, 1, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4635050323" + }, + { + "name": "win-vs2019-cuda11.6-py3 / test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351701/jobs/4635050460" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsWbDg=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953192" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUS2g=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "periodic" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "ios-12-5-1-arm64-metal / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634504650" + }, + { + "name": "linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634504883" + }, + { + "name": "ios-12-5-1-arm64 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634505024" + }, + { + "name": "buck-build-test / buck-build-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634505165" + }, + { + "name": "ios-12-5-1-arm64-coreml / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634505316" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7-debug / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634505521" + }, + { + "name": "libtorch-linux-bionic-cuda11.7-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634505667" + }, + { + "name": "linux-bionic-cuda11.7-py3.7-gcc7-debug / build", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634505786" + }, + { + "name": "linux-focal-rocm5.2-py3.7-slow / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634506031" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634506209" + }, + { + "name": "linux-focal-rocm5.2-py3.7-distributed / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634506353" + }, + { + "name": "win-vs2019-cuda11.7-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634506550" + }, + { + "name": "ios-12-5-1-x86-64-coreml / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634506968" + }, + { + "name": "ios-12-5-1-arm64-custom-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634507176" + }, + { + "name": "linux-focal-rocm5.2-py3.7-distributed / test (distributed, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634799214" + }, + { + "name": "linux-focal-rocm5.2-py3.7-distributed / test (distributed, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634799342" + }, + { + "name": "linux-focal-rocm5.2-py3.7-slow / test (slow, 1, 1, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634800216" + }, + { + "name": "linux-bionic-cuda10.2-py3.9-gcc7 / test (multigpu, 1, 1, linux.16xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634896194" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7-debug / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634955955" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7-debug / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634956066" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7-debug / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634956160" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7-debug / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634956251" + }, + { + "name": "linux-bionic-cuda11.7-py3.7-gcc7-debug / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634987167" + }, + { + "name": "linux-bionic-cuda11.7-py3.7-gcc7-debug / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634987289" + }, + { + "name": "linux-bionic-cuda11.7-py3.7-gcc7-debug / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", 
+ "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634987406" + }, + { + "name": "linux-bionic-cuda11.7-py3.7-gcc7-debug / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4634987543" + }, + { + "name": "win-vs2019-cuda11.7-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4635020787" + }, + { + "name": "win-vs2019-cuda11.7-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4635020896" + }, + { + "name": "win-vs2019-cuda11.7-py3 / test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4635021008" + }, + { + "name": "linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4635184380" + }, + { + "name": "linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351759/jobs/4635184472" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsZHek=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953337" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUS_k=" + } + ], + "pageInfo": { + "hasNextPage": false + } + } } } ] @@ -12693,212 +20785,93 @@ } } }, - "query_sha=62ce809793481ce6ddce6e1a19d9b0761755ff0ff75decaf8a79419eaf793110 cursor=Y3Vyc29yOnYyOpHOQAuLsw== name=pytorch number=68111 owner=pytorch": { + "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=None name=pytorch-dev-infra org=pytorch": { "data": { - "repository": { - "pullRequest": { - "comments": { + "organization": { + "team": { + "members": { "nodes": [ { - "bodyText": "CI Flow Status\n\u269b\ufe0f CI Flow\nRuleset - Version: v1\nRuleset - File: https://github.com/chunyuan-w/pytorch/blob/7496bf1588050191595d833d23b8972b2f22655e/.github/generated-ciflow-ruleset.json\nPR ciflow labels: ciflow/default\n\n\n\nWorkflows\nLabels (bold enabled)\nStatus\n\n\n\n\nTriggered Workflows\n\n\n\n\nlinux-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk\n\u2705 triggered\n\n\nlinux-docs\nciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-vulkan-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-bazel-test\nciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-build\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-custom-build-static\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 
triggered\n\n\nlinux-xenial-py3.7-clang7-asan\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-onnx\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7-no-ops\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nwin-vs2019-cpu-py3\nciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nwin-vs2019-cuda11.3-py3\nciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nSkipped Workflows\n\n\n\n\ncaffe2-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\ndocker-builds\nciflow/all, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-custom-ops\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-full-jit\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-metal\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64-full-jit\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda10.2-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-binary-conda\nciflow/binaries, ciflow/binaries/conda\n\ud83d\udeab skipped\n\n\nlinux-binary-libtorch-cxx11-abi\nciflow/binaries, ciflow/binaries/libtorch\n\ud83d\udeab skipped\n\n\nlinux-binary-libtorch-pre-cxx11\nciflow/binaries, ciflow/binaries/libtorch\n\ud83d\udeab skipped\n\n\nlinux-binary-manywheel\nciflow/binaries, ciflow/binaries/wheel\n\ud83d\udeab skipped\n\n\nlinux-bionic-cuda10.2-py3.9-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-docs-push\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-no-ops\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-arm64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-lite-interpreter-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-11-py3-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab 
skipped\n\n\nparallelnative-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nperiodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda11.1-py3.7-gcc7-debug\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-win-vs2019-cuda11.1-py3\nciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win\n\ud83d\udeab skipped\n\n\nperiodic-win-vs2019-cuda11.5-py3\nciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win\n\ud83d\udeab skipped\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-build\nciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\n\n\nYou can add a comment to the PR and tag @pytorchbot with the following commands:\n\n# ciflow rerun, \"ciflow/default\" will always be added automatically\n@pytorchbot ciflow rerun\n\n# ciflow rerun with additional labels \"-l \", which is equivalent to adding these labels manually and trigger the rerun\n@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow\n\nFor more information, please take a look at the CI Flow Wiki.", - "author": { - "login": "pytorch-probot" - }, - "authorAssociation": "NONE", - "editor": { - "login": "pytorch-probot" - }, - "databaseId": 964902865 - }, - { - "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/68111\nNeed help or want to give feedback on the CI? Visit our office hours\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 7388141 (more details on the Dr. 
CI page):\n\n\n29/29 failures introduced in this PR\n\n\n\ud83d\udd75\ufe0f 29 new failures recognized by patterns\nThe following CI failures do not appear to be due to upstream breakages:\n pull / linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge) (1/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:31:38.6978776Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:31:38.3001628Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:31:38.5169168Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:31:38.5362923Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:31:38.5413452Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:31:38.5458747Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:31:38.5484014Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:31:38.5497924Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:31:38.5656491Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:31:38.5678893Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:31:38.6888479Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0f6488c20adb4dca4\n2022-03-21T21:31:38.6978776Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:31:38.6992648Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:31:38.7003010Z ##[error]Process completed with exit code 2.\n2022-03-21T21:31:38.7044027Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:31:38.7044261Z with:\n2022-03-21T21:31:38.7044413Z env:\n2022-03-21T21:31:38.7044565Z IN_CI: 1\n2022-03-21T21:31:38.7044709Z IS_GHA: 1\n2022-03-21T21:31:38.7044885Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:31:38.7045067Z ##[endgroup]\n2022-03-21T21:31:38.7060958Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge) (2/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:35:19.2635222Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:35:18.9028722Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:35:19.1132721Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:35:19.1310590Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:35:19.1360251Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:35:19.1386865Z Requirement already satisfied: 
botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:35:19.1429182Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:35:19.1441925Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:35:19.1468280Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:35:19.1617667Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:35:19.2545368Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-098be2985e0392130\n2022-03-21T21:35:19.2635222Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:35:19.2648463Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:35:19.2658727Z ##[error]Process completed with exit code 2.\n2022-03-21T21:35:19.2706355Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:35:19.2706591Z with:\n2022-03-21T21:35:19.2706748Z env:\n2022-03-21T21:35:19.2706908Z IN_CI: 1\n2022-03-21T21:35:19.2707061Z IS_GHA: 1\n2022-03-21T21:35:19.2707246Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:35:19.2707438Z ##[endgroup]\n2022-03-21T21:35:19.2724554Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / win-vs2019-cuda11.3-py3 / test (force_on_cpu, 1, 1, windows.4xlarge) (3/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T23:11:57.5531419Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T23:11:52.7662022Z Downloading botocore-1.22.12-py3-none-any.whl (8.1 MB)\n2022-03-21T23:11:53.1213298Z ---------------------------------------- 8.1/8.1 MB 23.6 MB/s eta 0:00:00\n2022-03-21T23:11:53.1644665Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T23:11:53.2218699Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T23:11:53.2389674Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T23:11:53.2787295Z -------------------------------------- 247.7/247.7 KB 7.4 MB/s eta 0:00:00\n2022-03-21T23:11:53.3761842Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T23:11:53.5457622Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T23:11:57.4175080Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T23:11:57.5296815Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0105d4db093574f40\n2022-03-21T23:11:57.5531419Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T23:11:57.5564814Z + 
GHA_WORKFLOW_JOB_ID=\n2022-03-21T23:11:57.5587712Z ##[error]Process completed with exit code 2.\n2022-03-21T23:11:57.5790311Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T23:11:57.5790832Z with:\n2022-03-21T23:11:57.5791104Z env:\n2022-03-21T23:11:57.5791358Z IN_CI: 1\n2022-03-21T23:11:57.5791620Z IS_GHA: 1\n2022-03-21T23:11:57.5791939Z GIT_DEFAULT_BRANCH: master\n2022-03-21T23:11:57.5792425Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T23:11:57.5792884Z ##[endgroup]\n\n\n pull / linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu) (4/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T02:17:12.6257577Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T02:17:11.9280556Z Using cached https://files.pythonhosted.org/packages/7b/9c/f51775ebe7df5a7aa4e7c79ed671bde94e154bd968aca8d65bb24aba0c8c/s3transfer-0.5.2-py3-none-any.whl\n2022-03-22T02:17:11.9335199Z Collecting urllib3<1.27,>=1.25.4 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T02:17:11.9682045Z Using cached https://files.pythonhosted.org/packages/ec/03/062e6444ce4baf1eac17a6a0ebfe36bb1ad05e1df0e20b110de59c278498/urllib3-1.26.9-py2.py3-none-any.whl\n2022-03-22T02:17:11.9850357Z Collecting python-dateutil<3.0.0,>=2.1 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T02:17:12.0403171Z Using cached https://files.pythonhosted.org/packages/36/7a/87837f39d0296e723bb9b62bbb257d0355c7f6128853c78955f57342a56d/python_dateutil-2.8.2-py2.py3-none-any.whl\n2022-03-22T02:17:12.0468875Z Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T02:17:12.0590000Z Using cached https://files.pythonhosted.org/packages/d9/5a/e7c31adbe875f2abbb91bd84cf2dc52d792b5a01506781dbcf25c91daf11/six-1.16.0-py2.py3-none-any.whl\n2022-03-22T02:17:12.0607093Z Installing collected packages: jmespath, urllib3, six, python-dateutil, botocore, s3transfer, boto3\n2022-03-22T02:17:12.5273459Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2 six-1.16.0 urllib3-1.26.9\n2022-03-22T02:17:12.6032812Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 worker-rocm-amd-114\n2022-03-22T02:17:12.6257577Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T02:17:12.6259543Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T02:17:12.6291924Z ##[error]Process completed with exit code 2.\n2022-03-22T02:17:12.6387977Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-22T02:17:12.6388298Z with:\n2022-03-22T02:17:12.6388521Z wait-ssh: false\n2022-03-22T02:17:12.6388727Z env:\n2022-03-22T02:17:12.6388932Z IN_CI: 1\n2022-03-22T02:17:12.6389143Z IS_GHA: 1\n2022-03-22T02:17:12.6389368Z GIT_DEFAULT_BRANCH: master\n2022-03-22T02:17:12.6389669Z DOCKER_HOST: unix:///run/user/1121/docker.sock\n\n\n pull / linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge) (5/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:19:24.4890693Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:19:24.0962005Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:19:24.3152253Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:19:24.3341183Z Requirement already satisfied: boto3==1.19.12 
in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:19:24.3391374Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:19:24.3436392Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:19:24.3448982Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:19:24.3474092Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:19:24.3502003Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:19:24.3655072Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:19:24.4799309Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0bc9250521f338cae\n2022-03-21T22:19:24.4890693Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:19:24.4903625Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:19:24.4913841Z ##[error]Process completed with exit code 2.\n2022-03-21T22:19:24.4957338Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:19:24.4957575Z with:\n2022-03-21T22:19:24.4957735Z env:\n2022-03-21T22:19:24.4957900Z IN_CI: 1\n2022-03-21T22:19:24.4958055Z IS_GHA: 1\n2022-03-21T22:19:24.4958246Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:19:24.4958437Z ##[endgroup]\n2022-03-21T22:19:24.4989649Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-bionic-rocm4.5-py3.7 / test (default, 2, 2, linux.rocm.gpu) (6/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T01:05:07.6983899Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T01:05:06.8364546Z Using cached https://files.pythonhosted.org/packages/7b/9c/f51775ebe7df5a7aa4e7c79ed671bde94e154bd968aca8d65bb24aba0c8c/s3transfer-0.5.2-py3-none-any.whl\n2022-03-22T01:05:06.8431763Z Collecting urllib3<1.27,>=1.25.4 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T01:05:06.8949391Z Using cached https://files.pythonhosted.org/packages/ec/03/062e6444ce4baf1eac17a6a0ebfe36bb1ad05e1df0e20b110de59c278498/urllib3-1.26.9-py2.py3-none-any.whl\n2022-03-22T01:05:06.9180079Z Collecting python-dateutil<3.0.0,>=2.1 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T01:05:06.9803351Z Using cached https://files.pythonhosted.org/packages/36/7a/87837f39d0296e723bb9b62bbb257d0355c7f6128853c78955f57342a56d/python_dateutil-2.8.2-py2.py3-none-any.whl\n2022-03-22T01:05:06.9882133Z Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T01:05:07.0067062Z Using cached https://files.pythonhosted.org/packages/d9/5a/e7c31adbe875f2abbb91bd84cf2dc52d792b5a01506781dbcf25c91daf11/six-1.16.0-py2.py3-none-any.whl\n2022-03-22T01:05:07.0088676Z Installing collected packages: urllib3, jmespath, six, python-dateutil, botocore, s3transfer, boto3\n2022-03-22T01:05:07.5819667Z 
Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2 six-1.16.0 urllib3-1.26.9\n2022-03-22T01:05:07.6774717Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 worker-rocm-amd-60\n2022-03-22T01:05:07.6983899Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T01:05:07.6988652Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T01:05:07.7023073Z ##[error]Process completed with exit code 2.\n2022-03-22T01:05:07.7102087Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-22T01:05:07.7102389Z with:\n2022-03-22T01:05:07.7102603Z wait-ssh: false\n2022-03-22T01:05:07.7102820Z env:\n2022-03-22T01:05:07.7103015Z IN_CI: 1\n2022-03-22T01:05:07.7103224Z IS_GHA: 1\n2022-03-22T01:05:07.7103458Z GIT_DEFAULT_BRANCH: master\n2022-03-22T01:05:07.7103737Z DOCKER_HOST: unix:///run/user/1502/docker.sock\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge) (7/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:51:39.3637996Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T20:51:39.2041249Z Attempting uninstall: s3transfer\n2022-03-21T20:51:39.2043010Z Found existing installation: s3transfer 0.3.7\n2022-03-21T20:51:39.2083799Z Uninstalling s3transfer-0.3.7:\n2022-03-21T20:51:39.2089675Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T20:51:39.2480546Z Attempting uninstall: boto3\n2022-03-21T20:51:39.2482953Z Found existing installation: boto3 1.16.34\n2022-03-21T20:51:39.2584292Z Uninstalling boto3-1.16.34:\n2022-03-21T20:51:39.2599474Z Successfully uninstalled boto3-1.16.34\n2022-03-21T20:51:39.3130921Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T20:51:39.3550598Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-03ef7efc3078e3da5\n2022-03-21T20:51:39.3637996Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T20:51:39.3650651Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T20:51:39.3660484Z ##[error]Process completed with exit code 2.\n2022-03-21T20:51:39.3696465Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T20:51:39.3696693Z with:\n2022-03-21T20:51:39.3696850Z env:\n2022-03-21T20:51:39.3697012Z IN_CI: 1\n2022-03-21T20:51:39.3697161Z IS_GHA: 1\n2022-03-21T20:51:39.3697342Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:51:39.3697528Z ##[endgroup]\n2022-03-21T20:51:39.3730420Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge) (8/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:03:36.3916860Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:03:36.0096309Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:03:36.2278560Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:03:36.2461618Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:03:36.2513260Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:03:36.2541524Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in 
/home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:03:36.2554899Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:03:36.2598277Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:03:36.2758299Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:03:36.2780690Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:03:36.3825021Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0a4a552890e6ef7d3\n2022-03-21T21:03:36.3916860Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:03:36.3930343Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:03:36.3941263Z ##[error]Process completed with exit code 2.\n2022-03-21T21:03:36.3979258Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:03:36.3979496Z with:\n2022-03-21T21:03:36.3979654Z env:\n2022-03-21T21:03:36.3979814Z IN_CI: 1\n2022-03-21T21:03:36.3979968Z IS_GHA: 1\n2022-03-21T21:03:36.3980157Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:03:36.3980360Z ##[endgroup]\n2022-03-21T21:03:36.3996257Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu) (9/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T00:41:15.5325784Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T00:41:10.3015614Z Downloading s3transfer-0.5.2-py3-none-any.whl (79 kB)\n2022-03-22T00:41:10.3625659Z ---------------------------------------- 79.5/79.5 KB 1.1 MB/s eta 0:00:00\n2022-03-22T00:41:10.4120236Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-22T00:41:10.4170155Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-22T00:41:10.4722115Z -------------------------------------- 247.7/247.7 KB 5.2 MB/s eta 0:00:00\n2022-03-22T00:41:10.4843512Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-22T00:41:10.6596108Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-22T00:41:10.8733354Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-22T00:41:15.3745408Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-22T00:41:15.4987162Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-09cacc848abc3dd32\n2022-03-22T00:41:15.5325784Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T00:41:15.5373630Z + 
GHA_WORKFLOW_JOB_ID=\n2022-03-22T00:41:15.5404353Z ##[error]Process completed with exit code 2.\n2022-03-22T00:41:15.5790508Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-22T00:41:15.5791192Z with:\n2022-03-22T00:41:15.5791530Z env:\n2022-03-22T00:41:15.5791849Z IN_CI: 1\n2022-03-22T00:41:15.5792186Z IS_GHA: 1\n2022-03-22T00:41:15.5792599Z GIT_DEFAULT_BRANCH: master\n2022-03-22T00:41:15.5793237Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-22T00:41:15.5793831Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge) (10/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:50:32.9799307Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T20:50:32.8167560Z Attempting uninstall: s3transfer\n2022-03-21T20:50:32.8169351Z Found existing installation: s3transfer 0.3.7\n2022-03-21T20:50:32.8213295Z Uninstalling s3transfer-0.3.7:\n2022-03-21T20:50:32.8219209Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T20:50:32.8602320Z Attempting uninstall: boto3\n2022-03-21T20:50:32.8603289Z Found existing installation: boto3 1.16.34\n2022-03-21T20:50:32.8704535Z Uninstalling boto3-1.16.34:\n2022-03-21T20:50:32.8719403Z Successfully uninstalled boto3-1.16.34\n2022-03-21T20:50:32.9244278Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T20:50:32.9710449Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0c568461a276d4a71\n2022-03-21T20:50:32.9799307Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T20:50:32.9812238Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T20:50:32.9823052Z ##[error]Process completed with exit code 2.\n2022-03-21T20:50:32.9859290Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T20:50:32.9859527Z with:\n2022-03-21T20:50:32.9859664Z env:\n2022-03-21T20:50:32.9859817Z IN_CI: 1\n2022-03-21T20:50:32.9859977Z IS_GHA: 1\n2022-03-21T20:50:32.9860144Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:50:32.9860327Z ##[endgroup]\n2022-03-21T20:50:32.9893642Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge) (11/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:05:00.7163042Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in\n\n2022-03-21T21:05:00.6660824Z #10 0x55fc8a3ea801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037\n2022-03-21T21:05:00.6661768Z #11 0x55fc8a3f57a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961\n2022-03-21T21:05:00.6662455Z #12 0x55fc8a3f580b in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455\n2022-03-21T21:05:00.6663570Z #13 0x55fc8a3f5908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420\n2022-03-21T21:05:00.6663952Z #14 0x55fc8a3f5908 in pymain_run_python /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907\n2022-03-21T21:05:00.6664431Z #15 0x55fc8a3f5908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460\n2022-03-21T21:05:00.6665304Z #16 0x55fc8a3f5ccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495\n2022-03-21T21:05:00.7162113Z #17 
0x7f940d00f83f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291\n2022-03-21T21:05:00.7162534Z #18 0x55fc8a39a554 in _start (/opt/conda/bin/python3.7+0x1d7554)\n2022-03-21T21:05:00.7162711Z \n2022-03-21T21:05:00.7163042Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in \n2022-03-21T21:05:00.7334595Z + retcode=1\n2022-03-21T21:05:00.7334954Z + set -e\n2022-03-21T21:05:00.7335215Z + return 1\n2022-03-21T21:05:00.7338688Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]\n2022-03-21T21:05:00.7339232Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X ]]\n2022-03-21T21:05:00.7340113Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]\n2022-03-21T21:05:00.7340612Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\2 ]]\n2022-03-21T21:05:00.7341187Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]\n2022-03-21T21:05:00.7341668Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\5\\1\\2 ]]\n2022-03-21T21:05:00.7344466Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]\n\n\n pull / linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge) (12/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:06:03.4437430Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:06:03.0752199Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:06:03.2853252Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:06:03.3032326Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:06:03.3081589Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:06:03.3093911Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:06:03.3120244Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:06:03.3162406Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:06:03.3188431Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:06:03.3337181Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:06:03.4348072Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0ee48c8811fafc444\n2022-03-21T22:06:03.4437430Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:06:03.4450920Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:06:03.4461263Z ##[error]Process completed with exit code 2.\n2022-03-21T22:06:03.4502346Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:06:03.4502576Z with:\n2022-03-21T22:06:03.4502730Z env:\n2022-03-21T22:06:03.4502888Z IN_CI: 1\n2022-03-21T22:06:03.4503038Z IS_GHA: 1\n2022-03-21T22:06:03.4503302Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:06:03.4503492Z 
##[endgroup]\n2022-03-21T22:06:03.4519156Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (13/29)\nStep: \"Test\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:50:13.2205634Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T20:50:12.8679322Z + python3 -m pip install boto3==1.19.12\n2022-03-21T20:50:13.0744228Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T20:50:13.0916284Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T20:50:13.0964264Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T20:50:13.1005656Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T20:50:13.1017299Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T20:50:13.1041042Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T20:50:13.1189450Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T20:50:13.1208751Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T20:50:13.2119445Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0d02da60fd18c22f5\n2022-03-21T20:50:13.2205634Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T20:50:13.2217939Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T20:50:13.2220259Z ##[error]Process completed with exit code 2.\n2022-03-21T20:50:13.2248664Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T20:50:13.2249012Z with:\n2022-03-21T20:50:13.2249260Z env:\n2022-03-21T20:50:13.2249500Z IN_CI: 1\n2022-03-21T20:50:13.2249738Z IS_GHA: 1\n2022-03-21T20:50:13.2250025Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:50:13.2250329Z ##[endgroup]\n2022-03-21T20:50:13.2272735Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu) (14/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T23:47:38.0451999Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T23:47:37.5554508Z + python3 -m pip install boto3==1.19.12\n2022-03-21T23:47:37.8411473Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T23:47:37.8631484Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T23:47:37.8699561Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T23:47:37.8737037Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in 
/home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T23:47:37.8754443Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T23:47:37.8814393Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T23:47:37.8849540Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T23:47:37.9059579Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T23:47:38.0336298Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0b44f47f4292089a2\n2022-03-21T23:47:38.0451999Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T23:47:38.0469471Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T23:47:38.0484106Z ##[error]Process completed with exit code 2.\n2022-03-21T23:47:38.0532678Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T23:47:38.0533007Z with:\n2022-03-21T23:47:38.0533223Z env:\n2022-03-21T23:47:38.0533440Z IN_CI: 1\n2022-03-21T23:47:38.0533649Z IS_GHA: 1\n2022-03-21T23:47:38.0533902Z GIT_DEFAULT_BRANCH: master\n2022-03-21T23:47:38.0534170Z GPU_FLAG: --gpus all\n2022-03-21T23:47:38.0534401Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge) (15/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:04:59.3115800Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in\n\n2022-03-21T21:04:59.2595213Z #10 0x55a7f39a4801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037\n2022-03-21T21:04:59.2595707Z #11 0x55a7f39af7a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961\n2022-03-21T21:04:59.2597203Z #12 0x55a7f39af80b in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455\n2022-03-21T21:04:59.2598205Z #13 0x55a7f39af908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420\n2022-03-21T21:04:59.2598697Z #14 0x55a7f39af908 in pymain_run_python /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907\n2022-03-21T21:04:59.2599178Z #15 0x55a7f39af908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460\n2022-03-21T21:04:59.2599747Z #16 0x55a7f39afccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495\n2022-03-21T21:04:59.3114751Z #17 0x7f3b3822383f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291\n2022-03-21T21:04:59.3115277Z #18 0x55a7f3954554 in _start (/opt/conda/bin/python3.7+0x1d7554)\n2022-03-21T21:04:59.3115468Z \n2022-03-21T21:04:59.3115800Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in \n2022-03-21T21:04:59.3292385Z + retcode=1\n2022-03-21T21:04:59.3292781Z + set -e\n2022-03-21T21:04:59.3293062Z + return 1\n2022-03-21T21:04:59.3295462Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]\n2022-03-21T21:04:59.3295802Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X 
]]\n2022-03-21T21:04:59.3296394Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]\n2022-03-21T21:04:59.3296700Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\2 ]]\n2022-03-21T21:04:59.3297055Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]\n2022-03-21T21:04:59.3297416Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\5\\1\\2 ]]\n2022-03-21T21:04:59.3299623Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]\n\n\n pull / win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge) (16/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:14:31.7846086Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:14:25.5525714Z Collecting jmespath<1.0.0,>=0.7.1\n2022-03-21T22:14:25.5568155Z Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)\n2022-03-21T22:14:25.5952617Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T22:14:25.6169392Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T22:14:25.6629996Z -------------------------------------- 247.7/247.7 KB 5.1 MB/s eta 0:00:00\n2022-03-21T22:14:25.6710247Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:14:25.8284354Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:14:25.9816751Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T22:14:31.6672236Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T22:14:31.7630473Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0ed0915ecee5d2424\n2022-03-21T22:14:31.7846086Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:14:31.7876742Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:14:31.7897140Z ##[error]Process completed with exit code 2.\n2022-03-21T22:14:31.8195621Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T22:14:31.8196110Z with:\n2022-03-21T22:14:31.8196356Z env:\n2022-03-21T22:14:31.8196614Z IN_CI: 1\n2022-03-21T22:14:31.8196876Z IS_GHA: 1\n2022-03-21T22:14:31.8197169Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:14:31.8197652Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T22:14:31.8198093Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge) (17/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:19:15.8845728Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:19:15.5116060Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:19:15.7231476Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:19:15.7409711Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:19:15.7458478Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) 
(0.10.0)\n2022-03-21T21:19:15.7470508Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:19:15.7496799Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:19:15.7538362Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:19:15.7566161Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:19:15.7711630Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:19:15.8753543Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0e2b3b4ddb246ff2a\n2022-03-21T21:19:15.8845728Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:19:15.8859814Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:19:15.8870165Z ##[error]Process completed with exit code 2.\n2022-03-21T21:19:15.8917039Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:19:15.8917279Z with:\n2022-03-21T21:19:15.8917433Z env:\n2022-03-21T21:19:15.8917586Z IN_CI: 1\n2022-03-21T21:19:15.8917734Z IS_GHA: 1\n2022-03-21T21:19:15.8917917Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:19:15.8918102Z ##[endgroup]\n2022-03-21T21:19:15.8934572Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu) (18/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T23:19:48.5900162Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T23:19:48.0742254Z + python3 -m pip install boto3==1.19.12\n2022-03-21T23:19:48.3742563Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T23:19:48.3976536Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T23:19:48.4048700Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T23:19:48.4065374Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T23:19:48.4128076Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T23:19:48.4164273Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T23:19:48.4202610Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T23:19:48.4416723Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T23:19:48.5773033Z ++ python3 
.github/scripts/get_workflow_job_id.py 2018440039 i-07ab7a3c4a5402af2\n2022-03-21T23:19:48.5900162Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T23:19:48.5919822Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T23:19:48.5936087Z ##[error]Process completed with exit code 2.\n2022-03-21T23:19:48.6007930Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T23:19:48.6008268Z with:\n2022-03-21T23:19:48.6008483Z env:\n2022-03-21T23:19:48.6008701Z IN_CI: 1\n2022-03-21T23:19:48.6008920Z IS_GHA: 1\n2022-03-21T23:19:48.6009170Z GIT_DEFAULT_BRANCH: master\n2022-03-21T23:19:48.6009440Z GPU_FLAG: --gpus all\n2022-03-21T23:19:48.6009671Z ##[endgroup]\n\n\n pull / win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu) (19/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:54:04.2844259Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:53:59.0889659Z Downloading botocore-1.22.12-py3-none-any.whl (8.1 MB)\n2022-03-21T22:53:59.6881416Z ---------------------------------------- 8.1/8.1 MB 14.0 MB/s eta 0:00:00\n2022-03-21T22:53:59.7427779Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:53:59.7691882Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T22:53:59.7779847Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T22:53:59.8281663Z -------------------------------------- 247.7/247.7 KB 5.1 MB/s eta 0:00:00\n2022-03-21T22:54:00.0185115Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:54:00.2359770Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T22:54:04.1208891Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T22:54:04.2505862Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-03b4fbe63be8ef4b0\n2022-03-21T22:54:04.2844259Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:54:04.2891082Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:54:04.2919900Z ##[error]Process completed with exit code 2.\n2022-03-21T22:54:04.3377901Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T22:54:04.3378575Z with:\n2022-03-21T22:54:04.3378930Z env:\n2022-03-21T22:54:04.3379275Z IN_CI: 1\n2022-03-21T22:54:04.3379600Z IS_GHA: 1\n2022-03-21T22:54:04.3380023Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:54:04.3380691Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T22:54:04.3381278Z ##[endgroup]\n\n\n pull / linux-bionic-py3.7-clang9 / test (noarch, 1, 1, linux.2xlarge) (20/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:09:34.0074610Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:09:33.6365531Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:09:33.8475619Z Defaulting to user installation because normal 
site-packages is not writeable\n2022-03-21T22:09:33.8655152Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:09:33.8704395Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:09:33.8716774Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:09:33.8760145Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:09:33.8785000Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:09:33.8811316Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:09:33.8960134Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:09:33.9984866Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0d325eb9fd156146f\n2022-03-21T22:09:34.0074610Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:09:34.0087465Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:09:34.0101743Z ##[error]Process completed with exit code 2.\n2022-03-21T22:09:34.0154014Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:09:34.0154246Z with:\n2022-03-21T22:09:34.0154412Z env:\n2022-03-21T22:09:34.0154574Z IN_CI: 1\n2022-03-21T22:09:34.0154728Z IS_GHA: 1\n2022-03-21T22:09:34.0154917Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:09:34.0155112Z ##[endgroup]\n2022-03-21T22:09:34.0191047Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge) (21/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:03:17.8502655Z [E request_callbac...yUniqueId(created_on=0, local_id=0) to be created.\n\n2022-03-21T21:03:14.4669960Z INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpxgdsmeer\n2022-03-21T21:03:14.4671407Z INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpxgdsmeer/_remote_module_non_sriptable.py\n2022-03-21T21:03:14.4973023Z INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmp1i2hfmpc\n2022-03-21T21:03:14.4973800Z INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmp1i2hfmpc/_remote_module_non_sriptable.py\n2022-03-21T21:03:14.5532339Z INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpgx4da7b0\n2022-03-21T21:03:14.5533064Z INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpgx4da7b0/_remote_module_non_sriptable.py\n2022-03-21T21:03:14.7050673Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 0\n2022-03-21T21:03:14.7097127Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 3\n2022-03-21T21:03:14.7398339Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 2\n2022-03-21T21:03:14.7922283Z 
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 1\n2022-03-21T21:03:17.8502655Z [E request_callback_no_python.cpp:559] Received error while processing request type 261: false INTERNAL ASSERT FAILED at \"/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp\":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.\n2022-03-21T21:03:17.8503603Z Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):\n2022-03-21T21:03:17.8504385Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0x69 (0x7f180df19e19 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)\n2022-03-21T21:03:17.8505131Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string, std::allocator > const&) + 0xd2 (0x7f180df160e2 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)\n2022-03-21T21:03:17.8505927Z frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string, std::allocator > const&) + 0x4e (0x7f180df17a7e in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)\n2022-03-21T21:03:17.8506674Z frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x4b4 (0x7f18118b7b64 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n2022-03-21T21:03:17.8507642Z frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr >) const + 0x70 (0x7f18118a7bf0 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n2022-03-21T21:03:17.8508613Z frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector >) const + 0xc8 (0x7f1819736208 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)\n2022-03-21T21:03:17.8509749Z frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector >) const + 0x194 (0x7f18118ac914 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n2022-03-21T21:03:17.8510708Z frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector >) const + 0x65 (0x7f1819735865 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)\n2022-03-21T21:03:17.8511369Z frame #8: + 0x375249a (0x7f18118a949a in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test (22/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:01:07.7015580Z \ufffd[36;1m echo \"ERR...t available for the merge-base of your branch\"\ufffd[0m\n\n2022-03-21T20:01:07.7012399Z \ufffd[36;1mfi\ufffd[0m\n2022-03-21T20:01:07.7012634Z \ufffd[36;1m# Covers the case where a previous tag doesn't exist for the tree\ufffd[0m\n2022-03-21T20:01:07.7012992Z \ufffd[36;1m# this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. 
nightly\ufffd[0m\n2022-03-21T20:01:07.7013373Z \ufffd[36;1mif ! git rev-parse \"$MERGE_BASE:.circleci/docker\"; then\ufffd[0m\n2022-03-21T20:01:07.7013784Z \ufffd[36;1m echo \"Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit\"\ufffd[0m\n2022-03-21T20:01:07.7014149Z \ufffd[36;1m exit 1\ufffd[0m\n2022-03-21T20:01:07.7014325Z \ufffd[36;1mfi\ufffd[0m\n2022-03-21T20:01:07.7014573Z \ufffd[36;1mPREVIOUS_DOCKER_TAG=$(git rev-parse \"$MERGE_BASE:.circleci/docker\")\ufffd[0m\n2022-03-21T20:01:07.7014907Z \ufffd[36;1m# If no image exists but the hash is the same as the previous hash then we should error out here\ufffd[0m\n2022-03-21T20:01:07.7015231Z \ufffd[36;1mif [[ \"${PREVIOUS_DOCKER_TAG}\" = \"${DOCKER_TAG}\" ]]; then\ufffd[0m\n2022-03-21T20:01:07.7015580Z \ufffd[36;1m echo \"ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch\"\ufffd[0m\n2022-03-21T20:01:07.7015931Z \ufffd[36;1m echo \" contact the PyTorch team to restore the original images\"\ufffd[0m\n2022-03-21T20:01:07.7016225Z \ufffd[36;1m exit 1\ufffd[0m\n2022-03-21T20:01:07.7016400Z \ufffd[36;1mfi\ufffd[0m\n2022-03-21T20:01:07.7016608Z \ufffd[36;1mecho ::set-output name=rebuild::yes\ufffd[0m\n2022-03-21T20:01:07.7027605Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}\n2022-03-21T20:01:07.7027837Z env:\n2022-03-21T20:01:07.7028006Z IN_CI: 1\n2022-03-21T20:01:07.7028159Z IS_GHA: 1\n2022-03-21T20:01:07.7028346Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:01:07.7028589Z BASE_REVISION: 6643522db9ff595f564b8081de58b3a33c546178\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu) (23/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T00:49:54.2949572Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T00:49:53.8049151Z + python3 -m pip install boto3==1.19.12\n2022-03-22T00:49:54.0981629Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-22T00:49:54.1207562Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-22T00:49:54.1277146Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-22T00:49:54.1315027Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-22T00:49:54.1331813Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-22T00:49:54.1391622Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-22T00:49:54.1609217Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-22T00:49:54.1637417Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-22T00:49:54.2830197Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0f7c32fe13be12fea\n2022-03-22T00:49:54.2949572Z python3: can't open file 
'.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T00:49:54.2966933Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T00:49:54.2982588Z ##[error]Process completed with exit code 2.\n2022-03-22T00:49:54.3031464Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-22T00:49:54.3031794Z with:\n2022-03-22T00:49:54.3032012Z env:\n2022-03-22T00:49:54.3032227Z IN_CI: 1\n2022-03-22T00:49:54.3032434Z IS_GHA: 1\n2022-03-22T00:49:54.3032681Z GIT_DEFAULT_BRANCH: master\n2022-03-22T00:49:54.3033084Z GPU_FLAG: --gpus all\n2022-03-22T00:49:54.3033312Z ##[endgroup]\n\n\n pull / win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge) (24/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:56:12.5872636Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:56:07.3365589Z Downloading botocore-1.22.12-py3-none-any.whl (8.1 MB)\n2022-03-21T21:56:07.7926584Z ---------------------------------------- 8.1/8.1 MB 17.3 MB/s eta 0:00:00\n2022-03-21T21:56:07.9319362Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T21:56:07.9366132Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T21:56:08.0077590Z -------------------------------------- 247.7/247.7 KB 3.0 MB/s eta 0:00:00\n2022-03-21T21:56:08.0164070Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:56:08.1775537Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:56:08.3393469Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T21:56:12.4576766Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T21:56:12.5641959Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0afad69838118af0e\n2022-03-21T21:56:12.5872636Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:56:12.5905611Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:56:12.5927729Z ##[error]Process completed with exit code 2.\n2022-03-21T21:56:12.6239531Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T21:56:12.6240039Z with:\n2022-03-21T21:56:12.6240299Z env:\n2022-03-21T21:56:12.6240557Z IN_CI: 1\n2022-03-21T21:56:12.6240805Z IS_GHA: 1\n2022-03-21T21:56:12.6241118Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:56:12.6241613Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T21:56:12.6242052Z ##[endgroup]\n\n\n pull / linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge) (25/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:46:39.5474616Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:46:39.1884210Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:46:39.3928976Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:46:39.4105069Z Requirement already satisfied: boto3==1.19.12 in 
/home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:46:39.4152571Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:46:39.4194931Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:46:39.4218947Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:46:39.4230812Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:46:39.4380089Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:46:39.4399461Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:46:39.5387703Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0888bed1149cca415\n2022-03-21T21:46:39.5474616Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:46:39.5487145Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:46:39.5497480Z ##[error]Process completed with exit code 2.\n2022-03-21T21:46:39.5541319Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:46:39.5541544Z with:\n2022-03-21T21:46:39.5541698Z env:\n2022-03-21T21:46:39.5541851Z IN_CI: 1\n2022-03-21T21:46:39.5541997Z IS_GHA: 1\n2022-03-21T21:46:39.5542176Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:46:39.5542361Z ##[endgroup]\n2022-03-21T21:46:39.5557878Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge) (26/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:34:57.0623859Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:34:56.9039884Z Attempting uninstall: s3transfer\n2022-03-21T21:34:56.9041446Z Found existing installation: s3transfer 0.3.7\n2022-03-21T21:34:56.9090783Z Uninstalling s3transfer-0.3.7:\n2022-03-21T21:34:56.9095968Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T21:34:56.9453014Z Attempting uninstall: boto3\n2022-03-21T21:34:56.9454356Z Found existing installation: boto3 1.16.34\n2022-03-21T21:34:56.9564320Z Uninstalling boto3-1.16.34:\n2022-03-21T21:34:56.9578035Z Successfully uninstalled boto3-1.16.34\n2022-03-21T21:34:57.0091363Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T21:34:57.0536230Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-034a3afd5d80b91fd\n2022-03-21T21:34:57.0623859Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:34:57.0637167Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:34:57.0647396Z ##[error]Process completed with exit code 2.\n2022-03-21T21:34:57.0688237Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:34:57.0688481Z with:\n2022-03-21T21:34:57.0688631Z env:\n2022-03-21T21:34:57.0688769Z IN_CI: 1\n2022-03-21T21:34:57.0688930Z IS_GHA: 1\n2022-03-21T21:34:57.0689109Z 
GIT_DEFAULT_BRANCH: master\n2022-03-21T21:34:57.0689462Z ##[endgroup]\n2022-03-21T21:34:57.0704768Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge) (27/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:05:00.7896545Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in\n\n2022-03-21T21:05:00.7395504Z #10 0x5597fd5a9801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037\n2022-03-21T21:05:00.7396330Z #11 0x5597fd5b47a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961\n2022-03-21T21:05:00.7396688Z #12 0x5597fd5b480b in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455\n2022-03-21T21:05:00.7398664Z #13 0x5597fd5b4908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420\n2022-03-21T21:05:00.7399177Z #14 0x5597fd5b4908 in pymain_run_python /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907\n2022-03-21T21:05:00.7399663Z #15 0x5597fd5b4908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460\n2022-03-21T21:05:00.7399986Z #16 0x5597fd5b4ccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495\n2022-03-21T21:05:00.7895241Z #17 0x7f0a5905983f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291\n2022-03-21T21:05:00.7895772Z #18 0x5597fd559554 in _start (/opt/conda/bin/python3.7+0x1d7554)\n2022-03-21T21:05:00.7896033Z \n2022-03-21T21:05:00.7896545Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in \n2022-03-21T21:05:00.8063448Z + retcode=1\n2022-03-21T21:05:00.8063787Z + set -e\n2022-03-21T21:05:00.8064058Z + return 1\n2022-03-21T21:05:00.8067638Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]\n2022-03-21T21:05:00.8068127Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X ]]\n2022-03-21T21:05:00.8069018Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]\n2022-03-21T21:05:00.8069500Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\2 ]]\n2022-03-21T21:05:00.8070105Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]\n2022-03-21T21:05:00.8070580Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\5\\1\\2 ]]\n2022-03-21T21:05:00.8072640Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu) (28/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:48:17.3384813Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:48:16.8599645Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:48:17.1464241Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:48:17.1685222Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:48:17.1754164Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:48:17.1771662Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:48:17.1808722Z 
Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:48:17.1868636Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:48:17.1903889Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:48:17.2113746Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:48:17.3267404Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-01fe178c405417375\n2022-03-21T22:48:17.3384813Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:48:17.3402286Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:48:17.3418376Z ##[error]Process completed with exit code 2.\n2022-03-21T22:48:17.3470528Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:48:17.3470874Z with:\n2022-03-21T22:48:17.3471096Z env:\n2022-03-21T22:48:17.3471327Z IN_CI: 1\n2022-03-21T22:48:17.3471538Z IS_GHA: 1\n2022-03-21T22:48:17.3471802Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:48:17.3472083Z GPU_FLAG: --gpus all\n2022-03-21T22:48:17.3472322Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge) (29/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:16:38.9646300Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:16:38.7995969Z Attempting uninstall: s3transfer\n2022-03-21T21:16:38.7998039Z Found existing installation: s3transfer 0.3.7\n2022-03-21T21:16:38.8066994Z Uninstalling s3transfer-0.3.7:\n2022-03-21T21:16:38.8072844Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T21:16:38.8449275Z Attempting uninstall: boto3\n2022-03-21T21:16:38.8451430Z Found existing installation: boto3 1.16.34\n2022-03-21T21:16:38.8559828Z Uninstalling boto3-1.16.34:\n2022-03-21T21:16:38.8574290Z Successfully uninstalled boto3-1.16.34\n2022-03-21T21:16:38.9100438Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T21:16:38.9558098Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0d779c59d277d32ee\n2022-03-21T21:16:38.9646300Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:16:38.9658894Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:16:38.9673240Z ##[error]Process completed with exit code 2.\n2022-03-21T21:16:38.9720106Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:16:38.9720333Z with:\n2022-03-21T21:16:38.9720485Z env:\n2022-03-21T21:16:38.9720645Z IN_CI: 1\n2022-03-21T21:16:38.9720793Z IS_GHA: 1\n2022-03-21T21:16:38.9720970Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:16:38.9721151Z ##[endgroup]\n2022-03-21T21:16:38.9736762Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. 
CI Users group.\nClick here to manually regenerate this comment.", - "author": { - "login": "facebook-github-bot" - }, - "authorAssociation": "MEMBER", - "editor": { - "login": "facebook-github-bot" - }, - "databaseId": 964902894 - }, - { - "bodyText": "@vitaly-fedyunin @gottbrath FYI that this is the oneDNN Graph API integration. It depends on the #63748.", - "author": { - "login": "Jianhui-Li" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 970451860 - }, - { - "bodyText": "CI failures are currently being caused by some issues in the CI infra, and are also occurring with other PRs.", - "author": { - "login": "sanchitintel" - }, - "authorAssociation": "CONTRIBUTOR", - "editor": null, - "databaseId": 990641309 + "login": "kit1980" }, { - "bodyText": "CI failures are unrelated.", - "author": { - "login": "sanchitintel" - }, - "authorAssociation": "CONTRIBUTOR", - "editor": null, - "databaseId": 991281407 + "login": "huydhn" }, { - "bodyText": "The CI failure is unrelated.", - "author": { - "login": "sanchitintel" - }, - "authorAssociation": "CONTRIBUTOR", - "editor": null, - "databaseId": 995389295 + "login": "b0noI" }, { - "bodyText": "Hi, thank you for the PR!\nDo you mind running a larger amount of torchbench and reporting numbers ? You can look at Jason's post here for what models are supported in script. Initially just the vision models would be useful. @Krovatkin also did some benchmarking of a traced Bert model and found on average a ~16% speedup with this PR.", - "author": { - "login": "eellison" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1015689390 + "login": "seemethere" }, { - "bodyText": "Thanks a lot for reviewing, @eellison & @Krovatkin!\nWe just wanted to let you know that we're working on the benchmarking & will get back to you in a day, or two.\nUPDATE (Jan 21): While running some TorchBench models, we discovered some composability issues, and are working to ensure that oneDNN Graph would complement PyTorch's existing fusion capabilities, not hinder them.\nUPDATE (Jan 24): We've resolved the issues & will update this PR later today. Thanks!", - "author": { - "login": "sanchitintel" - }, - "authorAssociation": "CONTRIBUTOR", - "editor": { - "login": "sanchitintel" - }, - "databaseId": 1016996190 + "login": "malfet" }, { - "bodyText": "Hello @eellison,\nWe used this TorchBench branch for comparison. compare_llga.sh can be run for comparison.\nFor benchmarking mobilenet_v3_large with hardswish support in oneDNN Graph, this oneDNN Graph branch can be used in third_party/ideep/mkl-dnn. It delivers a speedup over PyTorch JIT (NNC + OFI) because 21 additional reorders are prevented (the major factor here), and fusion with conv also helps further.\nThe next release of oneDNN Graph would have hardswish support.\nWe're also exploring adding a hardsigmoid op in oneDNN Graph.\nThank you!", - "author": { - "login": "sanchitintel" - }, - "authorAssociation": "CONTRIBUTOR", - "editor": { - "login": "sanchitintel" - }, - "databaseId": 1022709513 + "login": "DanilBaibak" }, { - "bodyText": "Please note that this PR should be merged after #71546, as #71546 changes the third_party/ideep commit (this PR also uses that ideep commit, but it'd probably be better to merge #71546 first, so that oneDNN v2.5.2 upgrade would be in a separate PR). 
Thank you!", - "author": { - "login": "sanchitintel" - }, - "authorAssociation": "CONTRIBUTOR", - "editor": null, - "databaseId": 1026330085 + "login": "ZainRizvi" }, { - "bodyText": "@sanchitintel mind rebasing and i'll land ?", - "author": { - "login": "eellison" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1055813984 + "login": "jeanschmidt" }, { - "bodyText": "@eellison has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", - "author": { - "login": "facebook-github-bot" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1057203495 + "login": "atalman" }, { - "bodyText": "Thanks a lot for taking a look, @eellison! To fix this error, we would enable Bazel build for oneDNN Graph.", - "author": { - "login": "sanchitintel" - }, - "authorAssociation": "CONTRIBUTOR", - "editor": { - "login": "sanchitintel" - }, - "databaseId": 1061230087 + "login": "mehtanirav" }, { - "bodyText": "@eellison has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", - "author": { - "login": "facebook-github-bot" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1063276600 + "login": "osalpekar" }, { - "bodyText": "@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", - "author": { - "login": "facebook-github-bot" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1074355779 + "login": "janeyx99" }, { - "bodyText": "And graph_rewriter.cpp is full of DOS newlines...", - "author": { - "login": "malfet" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1074407452 + "login": "zengk95" }, { - "bodyText": "Hey @chunyuan-w.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", - "author": { - "login": "github-actions" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 1074471758 + "login": "clee2000" }, { - "bodyText": "Thanks a ton for your help, @malfet & @eellison! 
:)\nWe'll incorporate your suggestions in subsequent PR(s).", - "author": { - "login": "sanchitintel" - }, - "authorAssociation": "CONTRIBUTOR", - "editor": { - "login": "sanchitintel" - }, - "databaseId": 1074492365 + "login": "izaitsevfb" + }, + { + "login": "weiwangmeta" } ], "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOOYM_0Q==", - "hasPreviousPage": false + "hasNextPage": false, + "endCursor": "Y3Vyc29yOnYyOpHOBoQSVA==" } } } } } }, - "query_sha=0e2a29eda6405cea4c9de20fb80ae7924910e17272a7b251040182e7d8c390e0 name=pytorch number=73969 owner=pytorch": { + "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=None name=qwertyuiop org=pytorch": { + "data": { + "organization": { + "team": null + } + } + }, + "query_sha=81fd873151c3cded18314e9e53bf54a93ffb0afa9c52fa2cbafb2ceab7df5e45 name=pytorch number=82169 owner=pytorch": { "data": { "repository": { "pullRequest": { "closed": true, - "isCrossRepository": true, + "isCrossRepository": false, "author": { - "login": "malfet" + "login": "ezyang" }, - "title": "Dummy change", - "body": "Test Plan: None at all\n\nDifferential Revision: D34753911\n\n", - "headRefName": "export-D34753911", + "title": "Move test_dtypes so it runs later", + "body": "Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):\n* __->__ #82169\n\nThe error messages it gives are very unhelpful (because a failure\ngets translated into \"dtype was not supported\" rather than the\nactual backtrace), so I'd rather get error messages about this after\nI've tested basic functionality.\n\nSigned-off-by: Edward Z. Yang ", + "headRefName": "gh/ezyang/1279/head", "headRepository": { - "nameWithOwner": "malfet/pytorch" + "nameWithOwner": "pytorch/pytorch" }, - "baseRefName": "master", + "baseRefName": "gh/ezyang/1279/base", "baseRepository": { "nameWithOwner": "pytorch/pytorch", "isPrivate": false, @@ -12913,20 +20886,44 @@ "commit": { "author": { "user": { - "login": "malfet" + "login": "ezyang" }, - "email": "nshulga@fb.com", - "name": "Nikita Shulga" + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" }, - "oid": "4746da707a9912356f5179625da89616b228dc21" + "oid": "cef34da55a59da5a32494bff218ccd4978b659d3" + } + }, + { + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" + }, + "oid": "83ad7e73a07111ac1d85e931d14360cc22c01edd" + } + }, + { + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. 
Yang" + }, + "oid": "28140e4008289251b695385acfb48ac7a47cd49c" } } ], "pageInfo": { - "endCursor": "MQ", + "endCursor": "Mw", "hasNextPage": false }, - "totalCount": 1 + "totalCount": 3 }, "commits": { "nodes": [ @@ -12942,148 +20939,61 @@ }, "workflowRun": { "workflow": { - "name": "linux-vulkan-bionic-py3.7-clang9" + "name": "Lint" } }, "checkRuns": { "nodes": [ { - "name": "build", + "name": "lintrunner", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928580?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310707890" }, { - "name": "test (default, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483086020?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRQMQ=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592963" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-QM=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928547?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2aM=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592965" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-QU=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-bionic-rocm4.5-py3.7" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "build", + "name": "Test collect_env (with_torch)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928602?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310708140" }, { - "name": "test (default, 1, 2, linux.rocm.gpu)", + "name": "Test collect_env (without_torch)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483235366?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310708223" }, { - "name": "test (default, 2, 2, linux.rocm.gpu)", + "name": "Test collect_env (older_python_version)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483235570?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310708332" }, { - "name": "test (distributed, 1, 1, linux.rocm.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483235708?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbTiXw=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592966" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-QY=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "win-vs2019-cuda11.3-py3" - } - }, - "checkRuns": { 
- "nodes": [ - { - "name": "build", + "name": "quick-checks", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928594?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310708496" }, { - "name": "test (force_on_cpu, 1, 1, windows.4xlarge)", + "name": "toc", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483593208?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310708710" }, { - "name": "test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "name": "Test tools", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483593337?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310708937" }, { - "name": "test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "name": "workflow-checks", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483593461?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823981/jobs/4310709169" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbY_vU=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcGj1lc=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592967" + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696649" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Qc=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRc8k=" }, { "node": { @@ -13093,26 +21003,20 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" } }, "checkRuns": { - "nodes": [ - { - "name": "build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928554?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2ao=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592969" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696651" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Qk=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRc8s=" }, { "node": { @@ -13122,36 +21026,26 @@ }, "workflowRun": { "workflow": { - "name": "linux-docs" + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" } }, "checkRuns": { "nodes": [ { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928595?check_suite_focus=true" - }, - { - "name": "build-docs (cpp)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483078289?check_suite_focus=true" - }, - { - "name": "build-docs (python)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483078365?check_suite_focus=true" + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747823982/jobs/4310707884" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRIt0=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcGjz0w=", "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": 
"https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592970" + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696656" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Qo=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRc9A=" }, { "node": { @@ -13161,41 +21055,20 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3.7-gcc7" + "name": "Lint" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928553?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483074693?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483074951?check_suite_focus=true" - }, - { - "name": "test (distributed, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483075182?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRFm4=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592971" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696660" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Qs=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRc9Q=" }, { "node": { @@ -13205,26 +21078,20 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3-clang5-mobile-build" + "name": "pull" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928556?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2aw=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592974" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696715" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Q4=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdAs=" }, { "node": { @@ -13234,103 +21101,362 @@ }, "workflowRun": { "workflow": { - "name": "Lint" + "name": "pull" } }, "checkRuns": { "nodes": [ { - "name": "shellcheck", + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310708487" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310708713" + }, + { + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310708942" + }, + { + "name": "linux-focal-py3.7-clang7-asan / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310709174" + }, + { + "name": "linux-bionic-py3_7-clang8-xla / build", + "conclusion": "SUCCESS", + 
"detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310709340" + }, + { + "name": "linux-focal-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310709579" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310709844" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310710003" + }, + { + "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310710175" + }, + { + "name": "win-vs2019-cuda11.6-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310710516" + }, + { + "name": "linux-focal-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310710716" + }, + { + "name": "win-vs2019-cpu-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310710890" + }, + { + "name": "linux-focal-py3.7-gcc7-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310711097" + }, + { + "name": "linux-focal-py3.7-clang10-onnx / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310711234" + }, + { + "name": "linux-xenial-py3-clang5-mobile-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310711429" + }, + { + "name": "linux-focal-rocm5.2-py3.7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310711603" + }, + { + "name": "linux-jammy-cuda11.6-cudnn8-py3.8-clang12 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310711765" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310711946" + }, + { + "name": "linux-xenial-cuda11_3-py3_7-gcc7-deploy / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310712129" + }, + { + "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4310712276" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311194495" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311194591" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311194659" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311194749" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (dynamo, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311194858" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (dynamo, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311194934" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (functorch, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311195003" + }, + { + "name": "linux-focal-py3.7-clang10-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311220458" + }, + { + "name": "linux-focal-py3.7-clang10-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311220540" + }, + { + "name": "linux-docs / build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311222725" + }, + { + "name": "linux-docs / build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311222869" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223128" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223225" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223324" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (functorch, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223396" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223496" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223569" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311223690" + }, + { + "name": "linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311224360" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928552?check_suite_focus=true" + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311230050" }, { - "name": "quick-checks", + "name": "linux-focal-py3.7-clang7-asan / test (default, 1, 5, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928797?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311301930" }, { - "name": "clang-tidy", + "name": "linux-focal-py3.7-clang7-asan / test (default, 2, 5, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482929069?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311302152" }, { - "name": "clang-format", + "name": "linux-focal-py3.7-clang7-asan / test (default, 3, 5, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482929350?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311302303" }, { - "name": "cmakelint", + "name": "linux-focal-py3.7-clang7-asan / test (default, 4, 5, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482929628?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311302433" }, { - "name": "toc", + "name": "linux-focal-py3.7-clang7-asan / test (default, 5, 5, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482929838?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311302531" }, { - "name": "py2-setup-validate-errormsg", + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482929972?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311491082" }, { - "name": "flake8-py3", + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482930102?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311491172" }, { - "name": "mypy", + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311491232" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311491289" + }, + { + "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482930251?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2747824048/jobs/4311491348" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO4Es=", - "hasNextPage": false + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcG0YME=", + "hasNextPage": true } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592975" + "url": 
"https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696836" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Q8=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdIQ=" }, { "node": { "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" - } + "name": "Facebook GitHub Tools", + "databaseId": 12274 }, + "workflowRun": null, "checkRuns": { "nodes": [ { - "name": "build-and-test", + "name": "Facebook CLA Check", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928573?check_suite_focus=true" + "detailsUrl": "https://code.intern.facebook.com/cla/" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2b0=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcGjyQg=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592976" + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696896" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-RA=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdMA=" + }, + { + "node": { + "app": { + "name": "Netlify", + "databaseId": 13473 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546697185" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdeE=" + }, + { + "node": { + "app": { + "name": "Azure Pipelines", + "databaseId": 9426 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546697205" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdfU=" + }, + { + "node": { + "app": { + "name": "Dependabot", + "databaseId": 29110 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546697224" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdgg=" } ], "pageInfo": { "hasNextPage": true } }, - "pushedDate": "2022-03-09T15:57:16Z", - "oid": "4746da707a9912356f5179625da89616b228dc21" + "status": null, + "pushedDate": "2022-07-27T15:34:17Z", + "oid": "28140e4008289251b695385acfb48ac7a47cd49c" } } ] @@ -13339,7 +21465,7 @@ "files": { "nodes": [ { - "path": "tools/build_variables.bzl" + "path": "test/test_ops.py" } ], "pageInfo": { @@ -13348,54 +21474,88 @@ } }, "reviews": { - "nodes": [], + "nodes": [ + { + "author": { + "login": "zou3519" + }, + "state": "APPROVED" + }, + { + "author": { + "login": "Chillee" + }, + "state": "APPROVED" + } + ], "pageInfo": { - "startCursor": null, + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNy0yNVQxNDo0NTozNS0wNzowMLkyMDIyLTA3LTI1VDE0OjQ1OjM1LTA3OjAwzj6XYmg=", "hasPreviousPage": false } }, "comments": { "nodes": [ { - "bodyText": "CI Flow Status\n\u269b\ufe0f CI Flow\nRuleset - Version: v1\nRuleset - File: https://github.com/malfet/pytorch/blob/4746da707a9912356f5179625da89616b228dc21/.github/generated-ciflow-ruleset.json\nPR ciflow labels: ciflow/default\nAdd ciflow 
labels to this PR to trigger more builds:\n\n\n\nWorkflows\nLabels (bold enabled)\nStatus\n\n\n\n\nTriggered Workflows\n\n\n\n\nlinux-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nlinux-binary-libtorch-cxx11-abi\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nlinux-binary-libtorch-pre-cxx11\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nlinux-binary-manywheel\nciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nlinux-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk\n\u2705 triggered\n\n\nlinux-bionic-rocm4.5-py3.7\nciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk\n\u2705 triggered\n\n\nlinux-docs\nciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-vulkan-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-bazel-test\nciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-build\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-custom-build-static\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-asan\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-onnx\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build\nciflow/all, ciflow/cpu, ciflow/default, ciflow/libtorch, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7-no-ops\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nmacos-arm64-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nmacos-arm64-binary-wheel\nciflow/binaries, ciflow/binaries_wheel, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-libtorch-cxx11-abi\nciflow/binaries, ciflow/binaries_libtorch, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-libtorch-pre-cxx11\nciflow/binaries, ciflow/binaries_libtorch, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-wheel\nciflow/binaries, ciflow/binaries_wheel, ciflow/default\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nwin-vs2019-cpu-py3\nciflow/all, ciflow/cpu, 
ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nwin-vs2019-cuda11.3-py3\nciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nwindows-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nwindows-binary-libtorch-debug\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nwindows-binary-libtorch-release\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nwindows-binary-wheel\nciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nSkipped Workflows\n\n\n\n\ncaffe2-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\ndocker-builds\nciflow/all, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-custom-ops\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-metal\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda10.2-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-bionic-cuda10.2-py3.9-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-docs-push\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-no-ops\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-arm64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-lite-interpreter-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-11-py3-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nparallelnative-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nperiodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda11.3-py3.7-gcc7-debug\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-win-vs2019-cuda11.5-py3\nciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win\n\ud83d\udeab skipped\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-build\nciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\npytorch-xla-linux-bionic-py3.7-clang8\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla\n\ud83d\udeab skipped", + "bodyText": "@pytorchbot merge -f 
FORCE", + "createdAt": "2022-07-27T17:56:43Z", + "author": { + "login": "malfet" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1197107402 + }, + { + "bodyText": "You need to provide a reason for using force merge, in the format @pytorchbot merge -f '[CATEGORY] Explanation'. With [CATEGORY] being one the following:\nEMERGENCY - an emergency fix to quickly address an issue\nMINOR - a minor fix such as cleaning locally unused variables, which shouldn't break anything\nPRE_TESTED - a previous CI run tested everything and you've only added minor changes like fixing lint\nOTHER - something not covered above", + "createdAt": "2022-07-27T17:56:45Z", "author": { "login": "pytorch-bot" }, "authorAssociation": "NONE", "editor": null, - "databaseId": 1063079053 + "databaseId": 1197107439 }, { - "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/73969\n\ud83d\udcc4 \u00a0Preview docs built from this PR\n\ud83d\udcc4 \u00a0Preview C++ docs built from this PR\n\ud83d\udd27 \u00a0Opt-in to CIFlow to control what jobs run on your PRs\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 4746da7 (more details on the Dr. CI page):\n\n\ud83d\udc9a \ud83d\udc9a Looks good so far! There are no failures yet. \ud83d\udc9a \ud83d\udc9a\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "bodyText": "@pytorchbot merge -f \"[OTHER] normal land failed twice already\"", + "createdAt": "2022-07-27T17:57:28Z", "author": { - "login": "facebook-github-bot" + "login": "malfet" }, "authorAssociation": "MEMBER", - "editor": { - "login": "facebook-github-bot" - }, - "databaseId": 1063079113 + "editor": null, + "databaseId": 1197108130 }, { - "bodyText": "This pull request was exported from Phabricator. Differential Revision: D34753911", + "bodyText": "@pytorchbot successfully started a merge job. Check the current status here", + "createdAt": "2022-07-27T18:08:13Z", "author": { - "login": "facebook-github-bot" + "login": "pytorchmergebot" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1063079731 + "databaseId": 1197119348 + }, + { + "bodyText": "Hey @ezyang.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' 
and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "createdAt": "2022-07-27T18:08:58Z", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1197120095 } ], "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOP11MjQ==", - "hasPreviousPage": false + "startCursor": "Y3Vyc29yOnYyOpHOR1poyg==", + "hasPreviousPage": true } }, "labels": { "edges": [ { "node": { - "name": "fb-exported" + "name": "Merged" } }, { @@ -13409,118 +21569,93 @@ } } }, - "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAU2F-RA= name=pytorch number=73969 owner=pytorch": { + "query_sha=81fd873151c3cded18314e9e53bf54a93ffb0afa9c52fa2cbafb2ceab7df5e45 name=pytorch number=73811 owner=pytorch": { "data": { "repository": { "pullRequest": { - "commits": { - "nodes": [ - { - "commit": { - "oid": "4746da707a9912356f5179625da89616b228dc21", - "checkSuites": { - "edges": [ - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "run-torchbench", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928591?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2c8=", - "hasNextPage": false - } - }, - "conclusion": "SKIPPED", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592977" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-RE=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "Test tools" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928555?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2as=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592978" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-RI=" - }, + "closed": true, + "isCrossRepository": false, + "author": { + "login": "seemethere" + }, + "title": "ci: Migrate metrics credentials to managed IAM", + "body": "Stack from [ghstack](https://github.com/ezyang/ghstack):\n* __->__ #73811\n\r\nMigrates our credentials to upload metrics statistics to managed IAM\r\ncredentials in order to make it easier to know where the credentials are\r\ncoming from and to make it easier to add more permissions / less\r\npermissions later on.\r\n\r\nRelates to work done in [D34535827](https://www.internalfb.com/diff/D34535827)\r\n\r\nSigned-off-by: Eli Uriegas ", + "headRefName": "gh/seemethere/215/head", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "gh/seemethere/215/base", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "seemethere" + }, + "email": "eliuriegas@fb.com", + "name": "Eli Uriegas" + }, + "oid": "13c44d16a876a56bca479b4cf30715d21fa16e99" + } + }, + { + "commit": { + "author": { + "user": { + "login": "seemethere" + }, + 
"email": "eliuriegas@fb.com", + "name": "Eli Uriegas" + }, + "oid": "9d26f4e6d8c8df275ea546180fef42548257d2d7" + } + } + ], + "pageInfo": { + "endCursor": "Mg", + "hasNextPage": false + }, + "totalCount": 2 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "edges": [ { "node": { "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-cuda11.3-py3.7-gcc7" - } + "name": "Facebook GitHub Tools", + "databaseId": 12274 }, + "workflowRun": null, "checkRuns": { "nodes": [ { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928570?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483302702?check_suite_focus=true" - }, - { - "name": "test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483302867?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "name": "Facebook CLA Check", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483303104?check_suite_focus=true" + "detailsUrl": "https://code.intern.facebook.com/cla/" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbUkMA=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqOaHA=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592980" + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658275867" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-RQ=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcBs=" }, { "node": { @@ -13530,26 +21665,20 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3.7-gcc7-no-ops" + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928607?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2d8=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592981" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276090" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-RU=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcPo=" }, { "node": { @@ -13563,61 +21692,16 @@ } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928611?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483400398?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483400575?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbWDX8=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": 
"https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592982" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-RY=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928548?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2aQ=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592983" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276092" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Rc=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcPw=" }, { "node": { @@ -13627,71 +21711,20 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3.7-clang7-asan" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928603?check_suite_focus=true" - }, - { - "name": "test (default, 3, 3, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483138456?check_suite_focus=true" - }, - { - "name": "test (default, 1, 3, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483138698?check_suite_focus=true" - }, - { - "name": "test (default, 2, 3, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483139049?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbSD-k=", - "hasNextPage": false + "name": "linux-xenial-py3-clang5-mobile-build" } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592985" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Rk=" - }, - { - "node": { - "app": { - "name": "Facebook GitHub Tools", - "databaseId": 12274 - }, - "workflowRun": null, "checkRuns": { - "nodes": [ - { - "name": "Facebook CLA Check", - "conclusion": "SUCCESS", - "detailsUrl": "https://code.intern.facebook.com/cla/" - }, - { - "name": "Meta Internal-Only Changes Check", - "conclusion": "NEUTRAL", - "detailsUrl": "https://opensource.facebook.com/" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO574=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592986" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276094" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Ro=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcP4=" }, { "node": { @@ -13701,31 +21734,20 @@ }, "workflowRun": { "workflow": { - "name": "pytorch-xla-linux-bionic-py3.7-clang8" + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/5482928559?check_suite_focus=true" - }, - { - "name": "test (xla, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483141123?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbSGAM=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592987" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276095" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Rs=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcP8=" }, { "node": { @@ -13735,81 +21757,21 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3.7-gcc5.4" + "name": "Lint" } }, - "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928593?check_suite_focus=true" - }, - { - "name": "test (backwards_compat, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483106295?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483106609?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483106835?check_suite_focus=true" - }, - { - "name": "test (distributed, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483107050?check_suite_focus=true" - }, - { - "name": "test (docs_test, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483107208?check_suite_focus=true" - }, - { - "name": "test (jit_legacy, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483107483?check_suite_focus=true" - } - ], + "checkRuns": { + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRlJs=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592997" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276097" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-SU=" - } - ], - "pageInfo": { - "hasNextPage": true - } - } - } - } - ] - } - } - } - } - }, - "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAU2F-SU= name=pytorch number=73969 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "commits": { - "nodes": [ - { - "commit": { - "oid": "4746da707a9912356f5179625da89616b228dc21", - "checkSuites": { - "edges": [ + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQE=" + }, { "node": { "app": { @@ -13818,41 +21780,20 @@ }, "workflowRun": { "workflow": { - "name": "linux-bionic-py3.7-clang9" + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928550?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, linux.2xlarge)", - 
"conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483083368?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483083553?check_suite_focus=true" - }, - { - "name": "test (noarch, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483083767?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRN_c=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595593001" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276098" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Sk=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQI=" }, { "node": { @@ -13862,7 +21803,7 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3.7-clang7-onnx" + "name": "linux-xenial-py3.7-gcc7-no-ops" } }, "checkRuns": { @@ -13870,28 +21811,18 @@ { "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928572?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483120691?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5483120938?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1983602966/jobs/2839950629" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRySo=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUqObRM=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595593014" + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276099" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-TY=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQM=" }, { "node": { @@ -13901,32 +21832,175 @@ }, "workflowRun": { "workflow": { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static" + "name": "Test tools" } }, "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5482928605?check_suite_focus=true" - } - ], + "nodes": [], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2d0=", + "endCursor": null, "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595593026" + "conclusion": "CANCELLED", + "url": "https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276100" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-UI=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQQ=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-asan" + } + }, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": "CANCELLED", + "url": 
"https://github.com/pytorch/pytorch/commit/9d26f4e6d8c8df275ea546180fef42548257d2d7/checks?check_suite_id=5658276101" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVFCcQU=" } ], "pageInfo": { - "hasNextPage": false + "hasNextPage": true } - } + }, + "status": { + "contexts": [ + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17044969?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17045014?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17044975?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + } + ] + }, + "pushedDate": "2022-03-14T23:01:55Z", + "oid": "9d26f4e6d8c8df275ea546180fef42548257d2d7" + } + } + ] + }, + "changedFiles": 3, + "files": { + "nodes": [ + { + "path": ".github/templates/common.yml.j2" + }, + { + "path": ".github/workflows/generated-macos-11-py3-x86-64.yml" + }, + { + "path": ".github/workflows/update_pytorch_labels.yml" + } + ], + "pageInfo": { + "endCursor": "Mw", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "kit1980" + }, + "state": "APPROVED" + }, + { + "author": { + "login": "janeyx99" + }, + "state": "APPROVED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wMy0wNFQxNDoyNDo0OC0wODowMLkyMDIyLTAzLTA0VDE0OjI0OjQ4LTA4OjAwzjWwwqA=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ + { + "bodyText": "Merge failed due to Too many checksuites for commit\nRaised by https://github.com/pytorch/pytorch/actions/runs/1988337976", + "createdAt": "2022-03-15T17:43:28Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1068270969 + }, + { + "bodyText": "@pytorchbot force merge this", + "createdAt": "2022-03-15T20:26:36Z", + "author": { + "login": "seemethere" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1068436128 + }, + { + "bodyText": "Merge failed due to Too many checksuites for commit\nRaised by https://github.com/pytorch/pytorch/actions/runs/1989076952", + "createdAt": "2022-03-15T20:27:47Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1068437098 + }, + { + "bodyText": "@pytorchbot merge this", + "createdAt": "2022-03-15T21:18:55Z", + "author": { + "login": "seemethere" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1068482921 + }, + { + "bodyText": "Hey @seemethere.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' 
and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "createdAt": "2022-03-15T21:20:40Z", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1068484404 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOP6yFeQ==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ + { + "node": { + "name": "cla signed" } } ] @@ -13935,22 +22009,22 @@ } } }, - "query_sha=0e2a29eda6405cea4c9de20fb80ae7924910e17272a7b251040182e7d8c390e0 name=pytorch number=73099 owner=pytorch": { + "query_sha=81fd873151c3cded18314e9e53bf54a93ffb0afa9c52fa2cbafb2ceab7df5e45 name=pytorch number=31093 owner=pytorch": { "data": { "repository": { "pullRequest": { "closed": true, - "isCrossRepository": false, + "isCrossRepository": true, "author": { - "login": "BowenBao" + "login": "mingxiaoh" }, - "title": "[ONNX] Make graph name spec-compliant (#71961)", - "body": "Stack from [ghstack](https://github.com/ezyang/ghstack):\n* #73104\n* #73103\n* #73102\n* #73101\n* #73100\n* __->__ #73099\n\n[According to the ONNX spec](https://github.com/onnx/onnx/blob/main/docs/IR.md#names-within-a-graph),\nall names must adhere to C90 identifier syntax rules, which means no\ndashes.\n\nFixes: #30952", - "headRefName": "gh/BowenBao/138/head", + "title": "improve mkldnn convolution test coverage", + "body": "This pr will improve the test coverage of mkldnn convolution.\r\n1.test input: specific sensitive numbers\r\n2.pass criteria: output of mkldnn convolution matches output of thnn convolution\r\n3.coverage: by using coverage tool, we found out the following sensitive parameters. Overall the case will test 4352 patterns, takes 8.8s on my machine.\r\n\r\nto run the test case:\r\n\r\npython test_mkldnn_conv2d_ext.py\r\nor\r\npython run_test.py -i mkldnn_conv2d_ext\r\n\r\nIn case of failure, the pattern will be printed in the log for further debugging.\r\n\r\nactually, this PR is created to replace and improve that PR we created before(https://github.com/pytorch/pytorch/pull/25085) ", + "headRefName": "master", "headRepository": { - "nameWithOwner": "pytorch/pytorch" + "nameWithOwner": "mingxiaoh/pytorch" }, - "baseRefName": "gh/BowenBao/138/base", + "baseRefName": "master", "baseRepository": { "nameWithOwner": "pytorch/pytorch", "isPrivate": false, @@ -13965,12 +22039,12 @@ "commit": { "author": { "user": { - "login": "BowenBao" + "login": "11pikachu" }, - "email": "bowbao@microsoft.com", - "name": "BowenBao" + "email": "junx.du@intel.com", + "name": "dujun" }, - "oid": "3038b939eb2069653305c419326a0f47d2598e39" + "oid": "29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9" } } ], @@ -13994,157 +22068,26 @@ }, "workflowRun": { "workflow": { - "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "run-torchbench", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252161498?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNn9o=", - "hasNextPage": false - } - }, - "conclusion": "SKIPPED", - "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189561" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS7k=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-cuda11.3-py3.7-gcc7" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "build", - 
"conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252161648?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252387496?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252387628?check_suite_focus=true" - }, - { - "name": "test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252387825?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkRE_E=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189562" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS7o=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3.7-gcc7-no-ops" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252161681?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoJE=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189563" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS7s=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3-clang5-mobile-build" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252161670?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoIY=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189564" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS7w=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + "name": "clang-format" } }, "checkRuns": { "nodes": [ { - "name": "build-and-test", + "name": "clang-format", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252161691?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676797?check_suite_focus=true" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoJs=", + "endCursor": "Y3Vyc29yOnYyOpHOQYu8fQ==", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189566" + "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1175281097" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS74=" + "cursor": "Y3Vyc29yOnYyOpHORg1dyQ==" }, { "node": { @@ -14154,866 +22097,2625 @@ }, "workflowRun": { "workflow": { - "name": "linux-bionic-py3.7-clang9" + "name": "Lint" } }, "checkRuns": { "nodes": [ { - "name": "build", + "name": "flake8-py3", "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/5252161678?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676800?check_suite_focus=true" }, { - "name": "test (default, 1, 2, linux.2xlarge)", + "name": "quick-checks", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252286900?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676817?check_suite_focus=true" }, { - "name": "test (noarch, 1, 1, linux.2xlarge)", + "name": "clang-tidy", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252287072?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676829?check_suite_focus=true" }, { - "name": "test (default, 2, 2, linux.2xlarge)", + "name": "cmakelint", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252287232?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/runs/1099676840?check_suite_focus=true" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPiwA=", + "endCursor": "Y3Vyc29yOnYyOpHOQYu8qA==", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189567" + "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1175281099" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS78=" + "cursor": "Y3Vyc29yOnYyOpHORg1dyw==" }, { "node": { "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-vulkan-bionic-py3.7-clang9" - } + "name": "Codecov", + "databaseId": 254 }, + "workflowRun": null, "checkRuns": { "nodes": [ { - "name": "build", + "name": "codecov/project", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252161699?check_suite_focus=true" + "detailsUrl": "https://codecov.io" }, { - "name": "test (default, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252302340?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPxgQ=", - "hasNextPage": false - } - }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189568" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS8A=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "build", + "name": "codecov/patch", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252161696?check_suite_focus=true" + "detailsUrl": "https://codecov.io" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoKA=", + "endCursor": "Y3Vyc29yOnYyOpHOQZhcFQ==", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189570" + "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1176100822" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS8I=" + "cursor": "Y3Vyc29yOnYyOpHORhnf1g==" }, { "node": { "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "win-vs2019-cpu-py3" - } + "name": 
"Codecov", + "databaseId": 254 }, + "workflowRun": null, "checkRuns": { "nodes": [ { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252161646?check_suite_focus=true" - }, - { - "name": "test (default, 1, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252830090?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, windows.4xlarge)", + "name": "codecov/patch", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252830141?check_suite_focus=true" + "detailsUrl": "https://codecov.io" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkX070=", + "endCursor": "Y3Vyc29yOnYyOpHOQZZsEQ==", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189571" + "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1176100824" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS8M=" + "cursor": "Y3Vyc29yOnYyOpHORhnf2A==" }, { "node": { "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "linux-xenial-py3.7-gcc7" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252161666?check_suite_focus=true" - }, - { - "name": "test (distributed, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252286386?check_suite_focus=true" - }, - { - "name": "test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252286526?check_suite_focus=true" - }, + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ { - "name": "test (default, 1, 2, linux.2xlarge)", + "name": "Facebook CLA Check", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5252286720?check_suite_focus=true" + "detailsUrl": "https://code.facebook.com/cla/" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPiQA=", + "endCursor": "Y3Vyc29yOnYyOpHOUquzJg==", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189572" + "url": "https://github.com/pytorch/pytorch/commit/29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9/checks?check_suite_id=1487517306" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS8Q=" + "cursor": "Y3Vyc29yOnYyOpHOWKm2eg==" } ], "pageInfo": { - "hasNextPage": true + "hasNextPage": false } }, - "pushedDate": "2022-02-18T18:46:28Z", - "oid": "3038b939eb2069653305c419326a0f47d2598e39" + "status": { + "contexts": [ + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406538?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406947?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build", + "state": 
"SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406544?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406931?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_windows_libtorch_3_7_cpu_debug_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406550?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_windows_libtorch_3_7_cpu_debug_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406887?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_windows_libtorch_3_7_cpu_release_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406526?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: binary_windows_libtorch_3_7_cpu_release_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406707?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: caffe2_onnx_main_py3_6_clang7_ubuntu16_04_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406533?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: caffe2_onnx_main_py3_6_clang7_ubuntu16_04_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407256?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: caffe2_onnx_ort1_py3_6_clang7_ubuntu16_04_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407254?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: caffe2_onnx_ort2_py3_6_clang7_ubuntu16_04_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407255?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-cuda10.2-cudnn7-py3.6-clang9", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406556?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-cuda10.2-cudnn7-py3.8-gcc9", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406532?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-cuda11.0-cudnn8-py3.6-gcc9", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406527?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-cuda11.0-cudnn8-py3.8-gcc9", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406553?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": 
"ci/circleci: docker-pytorch-linux-bionic-py3.6-clang9", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406537?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-py3.8-gcc9", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406529?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-rocm3.5.1-py3.6", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406554?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-bionic-rocm3.7-py3.6", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406545?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-cuda10-cudnn7-py3-gcc7", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406543?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406536?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406552?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406535?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406540?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc7", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406528?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406541?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-asan", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406549?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3.6-clang7", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406555?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3.6-gcc4.8", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406546?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" 
+ }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3.6-gcc5.4", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406531?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3.6-gcc7", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406534?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3.6-gcc7.2", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406523?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3.8", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406539?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-rocm3.3-py3.6", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406547?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-rocm3.5.1-py3.6", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406551?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407209?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406611?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_bazel_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406607?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_bazel_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406984?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_cpp_doc_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407013?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_doc_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407011?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_ios_11_2_1_x86_64_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406548?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_libtorch_linux_xenial_cuda11_0_cudnn8_py3_gcc7_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406563?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: 
pytorch_libtorch_linux_xenial_cuda11_0_cudnn8_py3_gcc7_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7408680?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_backward_compatibility_check_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407014?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_bionic_py3_6_clang9_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406567?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_bionic_py3_6_clang9_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406945?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_bionic_py3_8_gcc9_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406561?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_bionic_py3_8_gcc9_coverage_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407422?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_bionic_rocm3_7_py3_6_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406562?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406612?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7408107?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_ge_config_legacy_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7408111?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_cuda10_2_cudnn7_py3_ge_config_profiling_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7408101?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406613?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_6_gcc5_4_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406565?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_6_gcc5_4_ge_config_legacy_test", + "state": "SUCCESS", + "targetUrl": 
"https://circleci.com/gh/pytorch/pytorch/7407017?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_6_gcc5_4_ge_config_profiling_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407019?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407012?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_6_gcc5_4_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407016?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_vulkan_x86_32_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406608?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406609?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_asan_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406606?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_asan_test1", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407435?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_asan_test2", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407436?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_mobile_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406605?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_mobile_custom_build_dynamic", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406610?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_macos_10_13_py3_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406525?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_macos_10_13_py3_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407415?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_python_doc_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407018?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_vulkan_linux_bionic_py3_6_clang9_build", + "state": "SUCCESS", + 
"targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406566?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_vulkan_linux_bionic_py3_6_clang9_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406946?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_windows_vs2019_py36_cpu_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406542?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_windows_vs2019_py36_cuda10.1_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406530?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_windows_vs2019_py36_cuda10.1_test1", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407028?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_windows_vs2019_py36_cuda10.1_test2", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407027?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_windows_vs2019_py36_cuda11.0_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406524?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_xla_linux_bionic_py3_6_clang9_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7406572?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_xla_linux_bionic_py3_6_clang9_test", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/7407253?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "codecov/patch", + "state": "SUCCESS", + "targetUrl": "https://codecov.io/gh/pytorch/pytorch/compare/69f6d94caa3559d4f50745c26af5df041b83fee8...29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9" + }, + { + "context": "codecov/project", + "state": "SUCCESS", + "targetUrl": "https://codecov.io/gh/pytorch/pytorch/compare/69f6d94caa3559d4f50745c26af5df041b83fee8...29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9" + }, + { + "context": "pr/caffe2-pytorch-linux-bionic-rocm3.7-py3.6-test", + "state": "SUCCESS", + "targetUrl": "https://ci.pytorch.org/jenkins/job/caffe2-builds/job/pytorch-linux-bionic-rocm3.7-py3.6-trigger-test/2319/" + }, + { + "context": "pr/pytorch-linux-bionic-rocm3.7-py3.6", + "state": "SUCCESS", + "targetUrl": "https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm3.7-py3.6-trigger/2325/" + } + ] + }, + "pushedDate": "2020-09-11T01:58:24Z", + "oid": "29f6aa6ecc2ece3fa58170ff4561f9d8d5c129f9" } } ] }, - "changedFiles": 162, + "changedFiles": 5, "files": { "nodes": [ { - "path": "test/onnx/expect/TestOperators.test_acos.expect" + "path": "test/math_libraries/convolutions.py" + }, + { + "path": "test/math_libraries/convolutions_cases/shapes_googlenet_v3.json" + }, + { + "path": "test/math_libraries/convolutions_cases/shapes_maskrcnn_p1.json" + }, + { + "path": "test/math_libraries/convolutions_cases/shapes_mobilenet.json" + }, + { + 
"path": "test/math_libraries/convolutions_cases/shapes_resnet_50.json" + } + ], + "pageInfo": { + "endCursor": "NQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "CHANGES_REQUESTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_add_broadcast.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_add_left_broadcast.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_add_size1_broadcast.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_add_size1_right_broadcast.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_add_size1_singleton_broadcast.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_addconstant.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_addmm.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_arange_dynamic.expect" + "author": { + "login": "mruberry" + }, + "state": "CHANGES_REQUESTED" }, { - "path": "test/onnx/expect/TestOperators.test_argmax.expect" + "author": { + "login": "ailzhang" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_asin.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_at_op.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_atan.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_aten_embedding_1.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_aten_embedding_2.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_avg_pool2d.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_baddbmm.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_basic.expect" + "author": { + "login": "ngimel" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_batchnorm.expect" + "author": { + "login": "VitalyFedyunin" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_batchnorm_1d.expect" + "author": { + "login": "ngimel" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_batchnorm_noaffine.expect" + "author": { + "login": "mingxiaoh" + }, + "state": "COMMENTED" }, { - "path": 
"test/onnx/expect/TestOperators.test_batchnorm_onnx_irv4.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_batchnorm_training.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_bitshift.expect" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "mingxiaoh" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "VitalyFedyunin" + }, + "state": "COMMENTED" + }, + { + "author": { + "login": "VitalyFedyunin" + }, + "state": "APPROVED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAxOS0xMi0zMFQxMDoxOToxMS0wODowMLkyMDE5LTEyLTMwVDEwOjE5OjExLTA4OjAwzhQZLuY=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ + { + "bodyText": "I cloned your repo and ran the tests:\n~/pytorch/test/math_libraries$ python convolutions.py\nFFFF\n======================================================================\nFAIL: test_conv2d_ext_cpu_float32 (__main__.TestConvExtCPU)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float16 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File 
\"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float32 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float64 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n----------------------------------------------------------------------\nRan 4 tests in 33.838s\n\nFAILED (failures=4)\n\nStill fails.\n\n@mruberry It is suggested by @VitalyFedyunin that, we need to display fail test to avoid invalid inputs, I guess we should set it as expected failures under the pytest test framework, right? we will change it as expected failure cases under pytest test framework. The result will looks like be low, is it ok?\n2500 passed, 136 skipped, 0 failed, 0 errors, 2 expected failures, 0 unexpected passes", + "createdAt": "2020-08-14T01:36:20Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": { + "login": "mingxiaoh" + }, + "databaseId": 673816925 + }, + { + "bodyText": "Displaying tests that fail is fine, but I don't think @VitalyFedyunin meant that it was OK if the tests didn't pass. If these are expected failures then yes, you can use with self.assertRaises(RuntimeError):... when testing them. If you also want to report that the test has test cases with these properties you can print or warn, which will appear in the test output.", + "createdAt": "2020-08-14T03:09:37Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 673858224 + }, + { + "bodyText": "Codecov Report\n\nMerging #31093 into master will not change coverage.\nThe diff coverage is n/a.\n\n\n@@ Coverage Diff @@\n## master #31093 +/- ##\n=======================================\n Coverage 68.00% 68.00% \n=======================================\n Files 382 382 \n Lines 49527 49527 \n=======================================\n Hits 33679 33679 \n Misses 15848 15848 \n\nContinue to review full report at Codecov.\n\nLegend - Click here to learn more\n\u0394 = absolute (impact), \u00f8 = not affected, ? = missing data\nPowered by Codecov. Last update 69f6d94...29f6aa6. 
Read the comment docs.", + "createdAt": "2020-09-04T05:41:01Z", + "author": { + "login": "codecov" + }, + "authorAssociation": "NONE", + "editor": { + "login": "codecov" + }, + "databaseId": 686921371 + }, + { + "bodyText": "Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale. Feel free to remove the Stale label if you feel this was a mistake. If you are unable to remove the Stale label please contact a maintainer in order to do so. Stale pull requests will automatically be closed 30 days after being marked Stale", + "createdAt": "2022-04-12T02:35:37Z", + "author": { + "login": "pytorchbot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1095860944 + }, + { + "bodyText": "Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale. Feel free to remove the Stale label if you feel this was a mistake. If you are unable to remove the Stale label please contact a maintainer in order to do so. If you want the bot to never mark this PR stale again, add the no-stale label.Stale pull requests will automatically be closed after 30 days of inactivity.", + "createdAt": "2022-06-11T04:40:16Z", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1152854802 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOKCmhXQ==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ + { + "node": { + "name": "triaged" + } + }, + { + "node": { + "name": "open source" + } + }, + { + "node": { + "name": "cla signed" + } + }, + { + "node": { + "name": "Stale" + } + } + ] + } + } + } + } + }, + "query_sha=2e2877d2452c4f233f042b7ccd50ab9c2a6e9a73d8819a0c876203c12364e8a3 cursor=Y3Vyc29yOnYyOpHOKCmhXQ== name=pytorch number=31093 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "comments": { + "nodes": [ + { + "bodyText": "Hi, @mingfeima @soumith @Jianhui-Li\nthis will improve the test coverage of mkldnn convolution, would you please review it?\nThe current code is forward only, do we need to cover backward, if yes, we can add backward.", + "createdAt": "2019-12-12T01:19:02Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 564806270 + }, + { + "bodyText": "@mingxiaoh, what is the value in testing DNNL as part of Pytorch validation for the Pytorch developers? Shouldn't having these tests run in DNNL validation be enough?", + "createdAt": "2019-12-12T01:28:32Z", + "author": { + "login": "vpirogov" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 564808528 + }, + { + "bodyText": "@vpirogov The main value is to serve as a blind test to DNNL. If DNNL adds these test to DNNL test sets, it lost the value as a blind test. The spirit of validation is to cross check.\n@gottbrath @gchanan The test was developed per the request of Pytorch team. Mingxiao made an effort to reduce the execution time to a few second but still with good coverage. Although the test today is focused on DNNL, it could be easily extended to be blind test for any conv implementation used in Pytorch.", + "createdAt": "2019-12-20T07:44:30Z", + "author": { + "login": "Jianhui-Li" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 567826907 }, { - "path": "test/onnx/expect/TestOperators.test_c2_op.expect" + "bodyText": "@mruberry thanks for the comment. 
As for the chainer dependency, we import it is because we would like to use its testing function for pytest test cases combinations, other wise we need to write much more code to achieve same effect. So, can we use it?", + "createdAt": "2020-01-15T09:04:34Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 574563012 }, { - "path": "test/onnx/expect/TestOperators.test_chunk.expect" + "bodyText": "@mingxiaoh You cannot import chainer. Looking at the code you should be able to achieve the same effect without it.", + "createdAt": "2020-01-16T17:59:46Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 575272358 }, { - "path": "test/onnx/expect/TestOperators.test_clip.expect" + "bodyText": "@mruberry ok, we will change it according to your requirement. Thanks", + "createdAt": "2020-02-10T00:59:34Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 583917522 }, { - "path": "test/onnx/expect/TestOperators.test_clip_max.expect" + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/31093\n\ud83d\udd27 \u00a0Opt-in to CIFlow to control what jobs run on your PRs\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 29f6aa6 (more details on the Dr. CI page):\n\nCommit 29f6aa6 was recently pushed. Waiting for builds...\n\nThis comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "createdAt": "2020-05-14T08:04:30Z", + "author": { + "login": "dr-ci" + }, + "authorAssociation": "NONE", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 628466876 }, { - "path": "test/onnx/expect/TestOperators.test_clip_min.expect" + "bodyText": "@mruberry how about those cudnn UT error? we add check for it but it should be NV to fix cudnn bugs.", + "createdAt": "2020-05-18T05:34:11Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 629955767 }, { - "path": "test/onnx/expect/TestOperators.test_concat2.expect" + "bodyText": "Hey @mingxiaoh! You're right, of course, that you shouldn't have to fix cuDNN bugs. Would you please:\n\nAssert that the test case fails, so we know it's failing and if someone fixes it they'll know what test to update.\nFile a new issue explaining the behavior and providing a short PyTorch program to reproduce the issue.\n\nThen we can ping NVIDIA on that issue.", + "createdAt": "2020-05-18T07:27:08Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 629997129 }, { - "path": "test/onnx/expect/TestOperators.test_conv.expect" + "bodyText": "about the suggestion 'Assert that the test case fails, so we know it's failing and if someone fixes it they'll know what test to update. ', if we only assert it and continue the following test, I guess users might always ignore them in later test. 
Anyway, any similar example case for reference?", + "createdAt": "2020-05-18T07:55:08Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 630010734 }, { - "path": "test/onnx/expect/TestOperators.test_conv_onnx_irv4.expect" + "bodyText": "In this recent PR https://github.com/pytorch/pytorch/pull/38505/files, for example, you can see that the construction of bool tensors wasn't working properly, so the test author cited the relevant issue and asserted that the incorrect behavior happened, as expected. You can also see how these lines are being removed by https://github.com/pytorch/pytorch/pull/38392/files, which fixes the issue.\nAnother common pattern is to use with self.assertRaises(RuntimeError/AssertionError/etc.):.", + "createdAt": "2020-05-18T08:02:13Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 630014823 }, { - "path": "test/onnx/expect/TestOperators.test_conv_onnx_irv4_opset8.expect" + "bodyText": "@mruberry the failed UT case is not introduced by our modification, how to handle this issue?", + "createdAt": "2020-05-20T01:59:13Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 631187735 }, { - "path": "test/onnx/expect/TestOperators.test_convtranspose.expect" + "bodyText": "@mingxiaoh You mean the failures on ROCm? You may ignore them. Be sure to re-request review when you're ready.", + "createdAt": "2020-05-20T02:12:58Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 631191425 }, { - "path": "test/onnx/expect/TestOperators.test_cos.expect" + "bodyText": "@mruberry we already skipped those ROCm errors, but there are stil somel error caused by the original code, they are not introduced by our modification.", + "createdAt": "2020-05-21T05:18:07Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 631886529 }, { - "path": "test/onnx/expect/TestOperators.test_cumsum.expect" + "bodyText": "I understand. Let me know when you're ready for me to review.", + "createdAt": "2020-05-21T06:24:15Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 631908011 }, { - "path": "test/onnx/expect/TestOperators.test_det.expect" + "bodyText": "@mruberry thanks, we are ready for review now.", + "createdAt": "2020-05-21T06:28:11Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 631909442 }, { - "path": "test/onnx/expect/TestOperators.test_dict.expect" + "bodyText": "@mingxiaoh Great! I'll take a look ASAP.", + "createdAt": "2020-05-21T06:31:10Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 631910556 }, { - "path": "test/onnx/expect/TestOperators.test_dict_str.expect" + "bodyText": "@mruberry we just pull the latest code and updated the patch according to your comment, may you please help double check it? BTW, the new failed case in preci is not introduced by our modification.", + "createdAt": "2020-05-25T07:44:58Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 633430458 }, { - "path": "test/onnx/expect/TestOperators.test_dim.expect" + "bodyText": "@ailzhang would you please check the comment below? 
Thanks.\nIs there a reason why this TestConv2dExt is a new class instead a test inside TestNN?\n//comment: it is actually suggested by Tongzhou Wang in another thread before.\nAlthough this test sits in generic testing framework, it's actually comparing thnn/mkldnn/cudnn results specially. I feel it's better to make it truly generic so that it compares any device result with CPU result. Alternatively you can mark this test only run when torch.backends.mkldnn.is_available()=True\n//comment: but our goal is to compare the result with that of thnn. Anyway, if you insist, we can start to compare it with cpu.", + "createdAt": "2020-05-27T05:11:08Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": { + "login": "mingxiaoh" + }, + "databaseId": 634432326 }, { - "path": "test/onnx/expect/TestOperators.test_dropout.expect" + "bodyText": "Pruning reviewers. @ngimel, @VitalyFedyunin, this PR is looking pretty good from a test framework perspective. Would one of you like to review?", + "createdAt": "2020-05-27T09:58:42Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 634557563 }, { - "path": "test/onnx/expect/TestOperators.test_dropout_default.expect" + "bodyText": "@mruberry Thanks, would you please help review it again. BTW: failed case is not introduced by our modification.", + "createdAt": "2020-05-28T10:26:32Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 635256214 }, { - "path": "test/onnx/expect/TestOperators.test_dropout_opset12.expect" + "bodyText": "@mruberry we moved our case to TestNNDeviceType class, would you please help review it again? BTW, those failed cases are not introduced by our code", + "createdAt": "2020-06-02T08:00:01Z", + "author": { + "login": "1pikachu" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 637364148 }, { - "path": "test/onnx/expect/TestOperators.test_dropout_training.expect" + "bodyText": "@mruberry we moved our case to TestNNDeviceType class, would you please help review it again? BTW, those failed cases are not introduced by our code\n\n@ngimel will follow-up on the test itself sometime this week or early next week.", + "createdAt": "2020-06-02T10:23:47Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 637444457 }, { - "path": "test/onnx/expect/TestOperators.test_dropout_training_opset12.expect" + "bodyText": "@mruberry we moved our case to TestNNDeviceType class, would you please help review it again? BTW, those failed cases are not introduced by our code\n\n@ngimel will follow-up on the test itself sometime this week or early next week.\n\n@mruberry thank you", + "createdAt": "2020-06-02T11:32:06Z", + "author": { + "login": "1pikachu" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 637479226 }, { - "path": "test/onnx/expect/TestOperators.test_dynamic_axes_add.expect" + "bodyText": "Improving test coverage of math libraries is certainly a good goal and this PR is moving towards it. I have some doubts about implementation decisions made, and about running this PR as part of regular pytorch CI.\nIf the primary goal of this PR is to test correctness of the convolution implementations in the vendor library, then it does not serve this purpose. 
The absolute majority of the 4000+ test cases come from group 1, where different kernel sizes/strides/dilations are used to produce the output of size 1x1. This can test whether pytorch correctly passes convolution parameters to the backends (although there are cheaper ways to do that), but as actual library correctness check it is almost useless - libraries use very different kernels depending in the input/output sizes, and tests with toy sizes like this don't invoke the real bread-and-butter kernels.\nAlso, if this test suite is meant as primary a means of testing vendor libraries (which is a good goal!) it does not have a place as a part of pytorch regular CI, and should be run when the corresponding vendor libraries are updated. I'd suggest moving this test out into a separate file (maybe even outside of torch/test directory) and have it as a part of library update/qualification process rather than regular CI.\nAlso, if the primary goal is to enable easier testing of vendor libraries correctness, perhaps we should rethink the mechanism of the generation of test cases. It should be easy to add a test case with a particular set of parameters that was found to be buggy. Also, running a cross-product of cases in a multi-dimensional space (as this PR does) is rarely an efficient way of getting a signal, some forms of random sampling usually provide a way to get better correctness signal why using less resources.\nAlso, when testing libraries it is important to test both forward and backward functions, whereas this PR does forward only. I'm openminded on whether convTransposed should be tested or not - if we are testing vendor libraries, then it's not necessary, convTransposed calls the same underlying functions, if we are testing pytorch, then it makes sense to test it separately because it takes different codepaths.", + "createdAt": "2020-06-02T21:56:33Z", + "author": { + "login": "ngimel" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 637827507 }, { - "path": "test/onnx/expect/TestOperators.test_dynamic_axes_add_inputs_same_symbolic_shape.expect" + "bodyText": "@mruberry ngimel is quite responsible, but it seems that she is not familiar with the background of this pull-request, since this pull-request is pending for so such a long time, each time we are almost done, then reviewer changes, each reviewer has different idea, it is good, but, would it be better if you help review it or ask the same reviewer to review it considering that you are more familiar with the background/change history? Thanks in advance.", + "createdAt": "2020-06-03T02:16:07Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 637912105 }, { - "path": "test/onnx/expect/TestOperators.test_dynamic_axes_matmul.expect" + "bodyText": "@mruberry ngimel is quite responsible, but it seems that she is not familiar with the background of this pull-request, since this pull-request is pending for so such a long time, each time we are almost done, then reviewer changes, each reviewer has different idea, it is good, but, would it be better if you help review it or ask the same reviewer to review it considering that you are more familiar with the background/change history? Thanks in advance.\n\nWe know this PR has been open for awhile and we respect that your time is valuable, but we want to make sure we're making the right change here, and I think @ngimel's comments reflect that and should not be too difficult to address. 
As I understand, her points are:\n\nThis is a good PR with an exciting idea. To let it run longer and test more cases maybe it should run outside the regular PyTorch CI.\nTo remedy this, let's create a test/math_libraries folder and put this test there: test/math_libaries/convolutions.py. Yes, this is different from our requests in the past, which is our mistake, but it should be an easy change.\nTo make the test more interesting it'd be good for the test cases to resemble convolutions used in practice. The current test cases seem like similar \"toy\" examples. Without time pressure we should be able to run larger, more computationally intensive convolutions.\nLet's change the test cases to include some practical convolutions, make it easy to add test cases, and think about how we might generate other interesting cases. (We should also test backwards once we have more time!)\n\nAnd I think these are good points. Maybe the PR doesn't create a new way to generate interesting convolutions to start and instead only runs a few representative convolutions, but @ngimel is positioning the work for success so that it's useful and we can continue to improve on it in the future.\nDoes that make sense?", + "createdAt": "2020-06-03T03:04:55Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 637924703 }, { - "path": "test/onnx/expect/TestOperators.test_dynamic_axes_reduce_mean.expect" + "bodyText": "@mruberry we were required to finish the test in limited time long long before, at that time, jianhui discussed this issue with you, and you are all agreed with the current test scope and test case number and test time, so you meant you change your mind now? you are not care about the test time currently? Sorry, this issue is pending so long, we are struggling with it now and would like to finish it asap. Given this, it would be be better if you raise all the requirement at a time, considering that we have many tasks at hand, we are hoping so eagerly that we can finish this PR and use it for further test for bugs finding.", + "createdAt": "2020-06-03T05:22:43Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": { + "login": "mingxiaoh" + }, + "databaseId": 637960626 }, { - "path": "test/onnx/expect/TestOperators.test_dynamic_axes_unchange.expect" + "bodyText": "@mruberry we were required to finish the test in limited time long long before, at that time, jianhui discussed this issue with you, and you are all agreed with the current test scope and test case number and test time, so you meant you change your mind now? you are not care about the test time currently? Sorry, this issue is pending so long, we are struggling with it now and would like to finish it asap. Given this, it would be be better if you raise all the requirement at a time, considering that we have many tasks at hand, we are hoping so eagerly that we can finish this PR and use it for further test for bugs finding.\n\nI'm sorry, I don't think I've talked to @Jianhui-Li before. It's true that the team we expressed a concern about timing if the test was to be run in the CI initially, but I think now that we understand what the test is trying to do better we're not sure the CI is the best place for it. The PR was also closed after a lengthy period of inactivity, and we assumed it had simply been abandoned.\nDo you know who @Jianhui-Li spoke with about this issue originally? 
Maybe I can follow-up with them for more context.", + "createdAt": "2020-06-03T05:42:28Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 637967153 }, { - "path": "test/onnx/expect/TestOperators.test_elu.expect" + "bodyText": "@mruberry it is reviewed and discussed with @soumith before. Anyway, since current reviewer is you, so, it should be decided by you. So, what we should do next?", + "createdAt": "2020-06-03T06:13:14Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 637978356 }, { - "path": "test/onnx/expect/TestOperators.test_embedding_bags.expect" + "bodyText": "@mruberry it is reviewed and discussed with @soumith before. Anyway, since current reviewer is you, so, it should be decided by you. So, what we should do next?\n\nI think this will be easier to discuss at the regular Intel-FB meeting.", + "createdAt": "2020-06-03T20:34:05Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 638446723 }, { - "path": "test/onnx/expect/TestOperators.test_empty_like.expect" + "bodyText": "@mruberry it is reviewed and discussed with @soumith before. Anyway, since current reviewer is you, so, it should be decided by you. So, what we should do next?\n\nI think this will be easier to discuss at the regular Intel-FB meeting.\n\nLet me sync with Mingxiao and follow up with this. Thanks.", + "createdAt": "2020-06-03T20:44:44Z", + "author": { + "login": "Jianhui-Li" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 638451670 }, { - "path": "test/onnx/expect/TestOperators.test_empty_like_opset7.expect" + "bodyText": "@mruberry would you please help review it again?", + "createdAt": "2020-07-02T14:09:23Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 653028208 }, { - "path": "test/onnx/expect/TestOperators.test_equal.expect" + "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?", + "createdAt": "2020-07-06T20:15:04Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 654443242 }, { - "path": "test/onnx/expect/TestOperators.test_erf.expect" + "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?\n\nyes, we talked it with jianhui, and we decided to follow your ideas. Anyway, we would like to do so modification later, will contact you for review tomorrow. Thanks", + "createdAt": "2020-07-09T11:04:06Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 656062287 }, { - "path": "test/onnx/expect/TestOperators.test_exp.expect" + "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?\n\nyes, we talked it with jianhui, and we decided to follow your ideas. Anyway, we would like to do so modification later, will contact you for review tomorrow. 
Thanks\n\n@mruberry the code is ready for review now, would you please take time for it? Thanks.", + "createdAt": "2020-07-14T09:16:48Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 658071151 }, { - "path": "test/onnx/expect/TestOperators.test_expand.expect" + "bodyText": "super nit: renaming files to .json will make it more IDE friendly.", + "createdAt": "2020-07-14T23:38:37Z", + "author": { + "login": "VitalyFedyunin" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 658464685 }, { - "path": "test/onnx/expect/TestOperators.test_flatten.expect" + "bodyText": "@mruberry would you please help review it again?\n\nHappy to help out, but as last discussed this needs some follow-up at the Intel-FB meeting. Did you get a chance to discuss it there, yet? If so, what did you decide?\n\nyes, we talked it with jianhui, and we decided to follow your ideas. Anyway, we would like to do so modification later, will contact you for review tomorrow. Thanks\n\n@mruberry the code is ready for review now, would you please take time for it? Thanks.\n\nCool! I took a look with @ngimel, once these issues are addressed I think we're good to go!", + "createdAt": "2020-07-16T05:17:29Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 659164401 }, { - "path": "test/onnx/expect/TestOperators.test_flatten2D.expect" + "bodyText": "@ngimel & @VitalyFedyunin We have changed the code according to your suggestions, would you please review it again? Thanks.", + "createdAt": "2020-07-20T08:30:01Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 660884305 }, { - "path": "test/onnx/expect/TestOperators.test_fmod.expect" + "bodyText": "@ngimel & @VitalyFedyunin We have changed the code according to your suggestions, would you please review it again? Thanks.\n\nUpdated: one more question about tolerances, one code cleanup recommendation, and one task leftover from the last review.", + "createdAt": "2020-07-22T20:26:42Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 662678464 }, { - "path": "test/onnx/expect/TestOperators.test_frobenius_norm.expect" + "bodyText": "Updated: one more question about tolerances, one code cleanup recommendation, and one task leftover from the last review.\n@mruberry we have finished the modification according to your comment, would you please review it again? 
Thanks.", + "createdAt": "2020-07-23T10:24:26Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 662930687 }, { - "path": "test/onnx/expect/TestOperators.test_full.expect" + "bodyText": "The code looks good, but I tried running the test suite and hit the following failures:\n======================================================================\nFAIL: test_conv2d_ext_cuda_float16 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 241, in instantiated_test\n result = test(self, device_arg, dtype)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 542, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 411, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 102, in test_conv2d_ext\n msg=msg\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 1085, in assertEqual\n self.assertTrue(result, msg=msg)\nAssertionError: False is not true : device:cuda:0, dtype:torch.float16, group:1, batchsize:22input channel:448, output channel:384, bias:False, padding:[1, 1], dilation:[1, 1], stride:[1, 1], kernel:[3, 3]\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float32 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 241, in instantiated_test\n result = test(self, device_arg, dtype)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 542, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 411, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 102, in test_conv2d_ext\n msg=msg\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 1085, in assertEqual\n self.assertTrue(result, msg=msg)\nAssertionError: False is not true : device:cuda:0, dtype:torch.float32, group:1, batchsize:22input channel:80, output channel:192, bias:False, padding:[0, 0], dilation:[1, 1], stride:[1, 1], kernel:[3, 3]\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float64 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File 
\"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 777, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 241, in instantiated_test\n result = test(self, device_arg, dtype)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 542, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 411, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 106, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\nLooking at the first invalid convolution, for example, it's:\n {\n \"case_name\":\"masknet_p1:conv33\",\n \"mb\":1,\n \"g\":1,\n \"ic\":512,\n \"ih\":64,\n \"iw\":64,\n \"oc\":12,\n \"kh\":1,\n \"kw\":1,\n \"sh\":1,\n \"sw\":1,\n \"ph\":0,\n \"pw\":0,\n \"dh\":0,\n \"dw\":0,\n \"bias\":\"False\"\n },\n\nwhich has a dh and dw of zero, causing it to be added to invalid cases here:\ndh, dw = case['dh'], case['dw']\n has_bias = case['bias']\n if dh == 0 or dw == 0:\n invalid_cases.append(case_name)", + "createdAt": "2020-07-23T21:25:19Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": { + "login": "mruberry" + }, + "databaseId": 663240268 }, { - "path": "test/onnx/expect/TestOperators.test_full_like.expect" + "bodyText": "@mruberry the failure was not detected is because we did not export the cudnn path. Yes, you are right, we need to a large atol of 1e-2 . Would you please help review it again? Thanks.", + "createdAt": "2020-07-27T12:43:44Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 664373079 }, { - "path": "test/onnx/expect/TestOperators.test_gather.expect" + "bodyText": "@mruberry the failure was not detected is because we did not export the cudnn path. Yes, you are right, we need to a large atol of 1e-2 . Would you please help review it again? 
Thanks.\n\nBefore I run these tests again, is an atol of 1e-2 needed for all types or just half? Also, how does 1e-2 compare to the values that are being compared?", + "createdAt": "2020-07-27T18:39:27Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 664569507 }, { - "path": "test/onnx/expect/TestOperators.test_gather_opset11.expect" + "bodyText": "@mruberry 1e-2 is experimental result, details see below, random means it might be failed sometimes.\n\n\n\natol,rtol\n1e-2,1e-2\n1e-2,1e-3\n1e-3,1e-2\n1e-3,1e-3\n1e-4,1e-3\n1e-3,1e-4\n1e-4,1e-4\n1e-4,1e-5\n1e-5,1e-4\n\n\n\n\nCuda float16\npass\npass\npass\npass\npass\nfail\nFail\nFail\nfail\n\n\nCuda float32\npass\nrandom\nrandom\nrandom\nrandom\nrandom\nrandom\nrandom\nfail", + "createdAt": "2020-07-31T03:33:27Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 666894774 }, { - "path": "test/onnx/expect/TestOperators.test_ge.expect" + "bodyText": "@mruberry would you please find time to review it again? Thanks.", + "createdAt": "2020-08-04T05:01:20Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 668380451 }, { - "path": "test/onnx/expect/TestOperators.test_gelu.expect" + "bodyText": "@mruberry would you please find time to review it again? Thanks.\n\nI was just about to try and run this again locally but it looks like the files describing the convolutions are missing?", + "createdAt": "2020-08-07T03:49:44Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 670306210 }, { - "path": "test/onnx/expect/TestOperators.test_gt.expect" + "bodyText": "@mruberry sorry but what is missing actually?", + "createdAt": "2020-08-07T05:00:20Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 670322557 }, { - "path": "test/onnx/expect/TestOperators.test_hardtanh.expect" + "bodyText": "@mruberry sorry but what is missing actually?\n\nThe JSON files.", + "createdAt": "2020-08-07T16:06:41Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 670591170 }, { - "path": "test/onnx/expect/TestOperators.test_implicit_expand.expect" + "bodyText": "@mruberry sorry but what is missing actually?\n\nThe JSON files.\n\n@mruberry sorry, we add them now, would you please check it again? 
Thanks.", + "createdAt": "2020-08-13T10:40:11Z", + "author": { + "login": "mingxiaoh" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 673402901 }, { - "path": "test/onnx/expect/TestOperators.test_index.expect" - }, + "bodyText": "I cloned your repo and ran the tests:\n~/pytorch/test/math_libraries$ python convolutions.py\nFFFF\n======================================================================\nFAIL: test_conv2d_ext_cpu_float32 (__main__.TestConvExtCPU)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float16 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 
114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float32 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n======================================================================\nFAIL: test_conv2d_ext_cuda_float64 (__main__.TestConvExtCUDA)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_utils.py\", line 815, in wrapper\n method(*args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 244, in instantiated_test\n result = test(self, *args)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 615, in only_fn\n return fn(self, device, *args, **kwargs)\n File \"/private/home/mruberry/git/pytorch/torch/testing/_internal/common_device_type.py\", line 472, in dep_fn\n return fn(slf, device, *args, **kwargs)\n File \"convolutions.py\", line 114, in test_conv2d_ext\n \"invalid cases:\" + \",\".join(invalid_cases)\nAssertionError: invalid 
cases:masknet_p1:conv33,masknet_p1:conv8,masknet_p1:conv2*4,masknet_p1:conv12,masknet_p1:conv4*3,masknet_p1:conv19,masknet_p1:conv4,masknet_p1:conv4,masknet_p1:conv27,masknet_p1:conv39,masknet_p1:conv23,masknet_p1:conv20,masknet_p1:conv25,masknet_p1:conv17,masknet_p1:conv9*4,masknet_p1:conv36,masknet_p1:conv18,masknet_p1:conv5,masknet_p1:conv38,masknet_p1:conv31,masknet_p1:conv14,masknet_p1:conv26,masknet_p1:conv2,masknet_p1:conv5*2,masknet_p1:conv28,masknet_p1:conv16,masknet_p1:conv20*3,masknet_p1:conv9,masknet_p1:conv14*23,masknet_p1:conv32,masknet_p1:conv30,masknet_p1:conv35,masknet_p1:conv37,masknet_p1:conv3,masknet_p1:conv24,masknet_p1:conv13,masknet_p1:conv21*3,masknet_p1:conv10,masknet_p1:conv7,masknet_p1:conv34,masknet_p1:conv13*24,masknet_p1:conv10*4,masknet_p1:conv22*2,masknet_p1:conv6,masknet_p1:conv22,masknet_p1:conv11,masknet_p1:conv40,masknet_p1:conv15,masknet_p1:conv17*23,masknet_p1:conv29,masknet_p1:conv21,masknet_p1:conv1,masknet_p1:conv11*3,mobilenet:conv3,mobilenet:conv2*4,mobilenet:conv6,mobilenet:conv7,mobilenet:conv5*4,mobilenet:conv4*4,mobilenet:conv7*4,mobilenet:conv1*3,mobilenet:conv10,mobilenet:conv2,mobilenet:conv5,mobilenet:conv4,mobilenet:conv9*4,mobilenet:conv8,mobilenet:conv9,mobilenet:conv6*4,mobilenet:conv10*4,mobilenet:conv11,mobilenet:conv8*20,mobilenet:conv1,mobilenet:conv11*4,mobilenet:conv3*4\n\n----------------------------------------------------------------------\nRan 4 tests in 33.838s\n\nFAILED (failures=4)\n\nStill fails.", + "createdAt": "2020-08-13T23:35:00Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 673760580 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOIapCfg==", + "hasPreviousPage": false + } + } + } + } + } + }, + "query_sha=81fd873151c3cded18314e9e53bf54a93ffb0afa9c52fa2cbafb2ceab7df5e45 name=pytorch number=76118 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": false, + "author": { + "login": "malfet" + }, + "title": "Dummy change with lots of commits", + "body": "Draft PR with 100+ commits, to test mergebot ", + "headRefName": "malfet/pr-with-lots-of-commits", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ { - "path": "test/onnx/expect/TestOperators.test_isnan.expect" + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "3067f2240afc7a29dc348000aa19eccbd9772303" + } }, { - "path": "test/onnx/expect/TestOperators.test_layer_norm_aten.expect" + "commit": { + "author": { + "user": { + "login": "andrewor14" + }, + "email": "andrewor@fb.com", + "name": "Andrew Or" + }, + "oid": "2f655b71f70c496c4e645f6cdb27d7bb7e825701" + } }, { - "path": "test/onnx/expect/TestOperators.test_le.expect" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "0c6dcaa7f58a19c42a530f4ee14bb6f0f03ca9fb" + } }, { - "path": "test/onnx/expect/TestOperators.test_linear.expect" + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "cad11c563d41ebcffb1683fe1f1288b8157413b3" + } }, { - "path": "test/onnx/expect/TestOperators.test_log_sigmoid.expect" + "commit": { + 
"author": { + "user": null, + "email": "jwtan@fb.com", + "name": "Jiewen Tan" + }, + "oid": "4dfd0875a68d87fccb5ad0d81692db480043b86e" + } }, { - "path": "test/onnx/expect/TestOperators.test_logsoftmax.expect" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "2d37e74690582a4a26890e4c8b98f1f80e589c82" + } }, { - "path": "test/onnx/expect/TestOperators.test_lstm_none_sequence_lens.expect" + "commit": { + "author": { + "user": null, + "email": "jwtan@fb.com", + "name": "Jiewen Tan" + }, + "oid": "d4aee60947e1a3ef23c7c42990621e0746fdd0a8" + } }, { - "path": "test/onnx/expect/TestOperators.test_lt.expect" + "commit": { + "author": { + "user": { + "login": "peterbell10" + }, + "email": "peterbell10@live.co.uk", + "name": "Peter Bell" + }, + "oid": "aac6204bf710beb5e50a383d426ae6222396335a" + } }, { - "path": "test/onnx/expect/TestOperators.test_master_opset.expect" + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "4b0362cab884584c24f5834b3874f5f357f56b5d" + } }, { - "path": "test/onnx/expect/TestOperators.test_max.expect" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "7536df613cbc645a9e68e6a3b0a8450753260fd1" + } }, { - "path": "test/onnx/expect/TestOperators.test_maxpool.expect" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "20a50cb966d28d7bf82924adf781cf72a01ef90e" + } }, { - "path": "test/onnx/expect/TestOperators.test_maxpool_dilations.expect" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "486387e8644afb46edff5aa5925b55c8119f67f0" + } }, { - "path": "test/onnx/expect/TestOperators.test_maxpool_indices.expect" + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "acb9d78b9b732d3667b881727e6ed9f92a8c549f" + } }, { - "path": "test/onnx/expect/TestOperators.test_mean.expect" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "683bb7959a5b973f8470c081ad02e8fc508e784a" + } }, { - "path": "test/onnx/expect/TestOperators.test_mean_dtype.expect" + "commit": { + "author": { + "user": { + "login": "qihqi" + }, + "email": "qihan@fb.com", + "name": "Han Qi" + }, + "oid": "a870cb40af65adf0b77d55f6b554d7093d284d7a" + } }, { - "path": "test/onnx/expect/TestOperators.test_meshgrid.expect" + "commit": { + "author": { + "user": { + "login": "Krovatkin" + }, + "email": "korovaikon@gmail.com", + "name": "Nikolay Korovaiko" + }, + "oid": "70793b9f328ddf52cc86336104c3a064c8582ef4" + } }, { - "path": "test/onnx/expect/TestOperators.test_min.expect" + "commit": { + "author": { + "user": { + "login": "suo" + }, + "email": "suo@fb.com", + "name": "Michael Suo" + }, + "oid": "f70b31f62b1c5159eef2725484b175983517c88c" + } }, { - "path": "test/onnx/expect/TestOperators.test_mm.expect" + "commit": { + "author": { + "user": { + "login": "dagitses" + }, + "email": "mikeyd@fb.com", + "name": "Michael Andreas Dagitses" + }, + "oid": "04d3ec1db60defe1c6904bf77e9f8dfa87dc0b63" + } }, { - "path": "test/onnx/expect/TestOperators.test_narrow.expect" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": 
"46b754a55b63e3168ad5854ad412c124934b675d" + } }, { - "path": "test/onnx/expect/TestOperators.test_ne.expect" + "commit": { + "author": { + "user": { + "login": "robieta" + }, + "email": "taylorrobie@fb.com", + "name": "Taylor Robie" + }, + "oid": "13df69e13ee571fdd716139419a00aec47ade7d6" + } }, { - "path": "test/onnx/expect/TestOperators.test_nonzero.expect" + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "70642e911ec80a47cdbf4a50aac475c11aa129b6" + } }, { - "path": "test/onnx/expect/TestOperators.test_norm_p1.expect" + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "59bb7c39384bf3e0b284a037adef8b3caa53c1c4" + } }, { - "path": "test/onnx/expect/TestOperators.test_norm_p2.expect" + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "007cfb97b55d70ff63e1ed71d1a674638f847376" + } }, { - "path": "test/onnx/expect/TestOperators.test_ones_like.expect" + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "0a7b858a5af1393fa3cf2853f92eca0e1d408dde" + } }, { - "path": "test/onnx/expect/TestOperators.test_pad.expect" + "commit": { + "author": { + "user": { + "login": "qihqi" + }, + "email": "qihan@fb.com", + "name": "Han Qi" + }, + "oid": "7917d789f0a523715041ade5177d271082628236" + } }, { - "path": "test/onnx/expect/TestOperators.test_params.expect" + "commit": { + "author": { + "user": { + "login": "kit1980" + }, + "email": "sdym@fb.com", + "name": "Sergii Dymchenko (Meta Employee)" + }, + "oid": "91eb6017f0fb8a1b29e8cb48fac93bc9709f73b3" + } }, { - "path": "test/onnx/expect/TestOperators.test_params_onnx_irv4.expect" + "commit": { + "author": { + "user": { + "login": "dagitses" + }, + "email": "mikeyd@fb.com", + "name": "Michael Andreas Dagitses" + }, + "oid": "bd04dca5fabb0c2a51ac87063a515f256ef274fa" + } }, { - "path": "test/onnx/expect/TestOperators.test_permute2.expect" - } - ], - "pageInfo": { - "endCursor": "MTAw", - "hasNextPage": true - } - }, - "reviews": { - "nodes": [ - { - "author": { - "login": "garymm" - }, - "state": "APPROVED" - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wMi0xOFQxOToxODo0NC0wNjowMLkyMDIyLTAyLTE4VDE5OjE4OjQ0LTA2OjAwzjTr0H0=", - "hasPreviousPage": false - } - }, - "comments": { - "nodes": [ - { - "bodyText": "This PR cannot be merged by bot due to changing > 100 files. @malfet \n \n \n pytorch/.github/scripts/trymerge.py\n \n \n Line 63\n in\n 932adf2\n \n \n \n \n\n \n \n files(last: 100) { \n \n \n \n\n Can this be relaxed? If not please import.", - "author": { - "login": "BowenBao" - }, - "authorAssociation": "COLLABORATOR", - "editor": null, - "databaseId": 1048084569 + "commit": { + "author": { + "user": { + "login": "dagitses" + }, + "email": "mikeyd@fb.com", + "name": "Michael Andreas Dagitses" + }, + "oid": "1f805a5defda7dabc49d0059edb9ccb06bc29352" + } }, { - "bodyText": "This PR cannot be merged by bot due to changing > 100 files. @malfet\nCan this be relaxed? If not please import.\n\nWow, you've hit a really interesting problem. 100 is a limitation enforced by GitHub, see https://docs.github.com/en/graphql/overview/resource-limitations, but I can implement a pagination. 
Do you mind keeping it like that for a bit, want to land a fix soonish.", - "author": { - "login": "malfet" - }, - "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1048088691 + "commit": { + "author": { + "user": null, + "email": "mruberry@fb.com", + "name": "Mike Ruberry" + }, + "oid": "4982c0a8db8f23d15ec4bfcbca4ce939afc04954" + } }, { - "bodyText": "@malfet Thank you for info. Sure, I have separated the rest of stack from this one, we'll wait for the fix to try again.", - "author": { - "login": "BowenBao" - }, - "authorAssociation": "COLLABORATOR", - "editor": null, - "databaseId": 1048090640 + "commit": { + "author": { + "user": { + "login": "pearu" + }, + "email": "pearu.peterson@gmail.com", + "name": "Pearu Peterson" + }, + "oid": "28502265cb5925cb7db8dcb2dd2334963092714a" + } }, { - "bodyText": "@pytorchbot merge this", - "author": { - "login": "BowenBao" - }, - "authorAssociation": "COLLABORATOR", - "editor": null, - "databaseId": 1050293881 + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "e03fcaedb1342e6d65c7f7f20243000938ba60b2" + } }, { - "bodyText": "Hey @BowenBao.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", - "author": { - "login": "github-actions" - }, - "authorAssociation": "NONE", - "editor": null, - "databaseId": 1050295451 - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOPniAWQ==", - "hasPreviousPage": true - } - }, - "labels": { - "edges": [ + "commit": { + "author": { + "user": { + "login": "pritamdamania" + }, + "email": "pritam.damania@fb.com", + "name": "pritam" + }, + "oid": "efb28f5a1a5d18aa96bd668ab2ab5c651be359f3" + } + }, { - "node": { - "name": "oncall: jit" + "commit": { + "author": { + "user": { + "login": "MagiaSN" + }, + "email": "magialiao@tencent.com", + "name": "magialiao" + }, + "oid": "52cc1b9994f861ebdd3908759ed1ab11cba1f8de" } }, { - "node": { - "name": "open source" + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "3cd99f23d1acd6a5bedf6f3b02be79d64350a5b6" } }, { - "node": { - "name": "cla signed" + "commit": { + "author": { + "user": { + "login": "awgu" + }, + "email": "andgu@fb.com", + "name": "Andrew Gu" + }, + "oid": "b00502c634a5146f4d996bd90e84d317f049e7b0" } }, { - "node": { - "name": "release notes: onnx" + "commit": { + "author": { + "user": { + "login": "davidberard98" + }, + "email": "dberard@fb.com", + "name": "David Berard" + }, + "oid": "976eb7cee799dddfbe6a4122b249aaee1b6c8854" } }, { - "node": { - "name": "topic: bug fixes" + "commit": { + "author": { + "user": { + "login": "ngimel" + }, + "email": "ngimel@fb.com", + "name": "Natalia Gimelshein" + }, + "oid": "9608ab28744d5cae32f371490557b248c9549c66" } - } - ] - } - } - } - } - }, - "query_sha=0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98 cursor=MTAw name=pytorch number=73099 owner=pytorch": { - "data": { - 
"repository": { - "pullRequest": { - "files": { - "nodes": [ + }, { - "path": "test/onnx/expect/TestOperators.test_pixel_shuffle.expect" + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "4e119f0c39eb5ff0777f0e71561e6b633d85fb34" + } }, { - "path": "test/onnx/expect/TestOperators.test_pow.expect" + "commit": { + "author": { + "user": { + "login": "rohan-varma" + }, + "email": "rvarm1@fb.com", + "name": "Rohan Varma" + }, + "oid": "447580dc565f3660eddb2c996c6ed25b88338684" + } }, { - "path": "test/onnx/expect/TestOperators.test_prelu.expect" + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "2bc8f43e9233008ea23053fab87b83ab36fca5e3" + } }, { - "path": "test/onnx/expect/TestOperators.test_prod.expect" + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "c13a8e891c3e3e714f60649ca1e3b082e090e9fe" + } }, { - "path": "test/onnx/expect/TestOperators.test_prod_dtype.expect" + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "fddc861b7ee473f57d3c2161e4618a2663a237e8" + } }, { - "path": "test/onnx/expect/TestOperators.test_rand.expect" + "commit": { + "author": { + "user": { + "login": "jiyuanzFB" + }, + "email": "jiyuanz@fb.com", + "name": "Jiyuan Zhang" + }, + "oid": "e2336dbc539d6c021720cbe43c92c9e4c8463299" + } }, { - "path": "test/onnx/expect/TestOperators.test_randn.expect" + "commit": { + "author": { + "user": { + "login": "bdhirsh" + }, + "email": "hirsheybar@fb.com", + "name": "Brian Hirsh" + }, + "oid": "26e2759d1ad59aac12168b74d1ca55e42ba9455c" + } }, { - "path": "test/onnx/expect/TestOperators.test_reduce_sum_negative_indices.expect" + "commit": { + "author": { + "user": { + "login": "bdhirsh" + }, + "email": "hirsheybar@fb.com", + "name": "Brian Hirsh" + }, + "oid": "ad7aa914ee3b3d1252e31514f010ba96c40aae87" + } }, { - "path": "test/onnx/expect/TestOperators.test_reduced_mean.expect" + "commit": { + "author": { + "user": { + "login": "bdhirsh" + }, + "email": "hirsheybar@fb.com", + "name": "Brian Hirsh" + }, + "oid": "f113c5d78065aafbe7b1c0e611945bfe9f67b3c0" + } }, { - "path": "test/onnx/expect/TestOperators.test_reduced_mean_dtype.expect" + "commit": { + "author": { + "user": { + "login": "bdhirsh" + }, + "email": "hirsheybar@fb.com", + "name": "Brian Hirsh" + }, + "oid": "a366fd01136292544b7862968ae92feba4b6d8fe" + } }, { - "path": "test/onnx/expect/TestOperators.test_reduced_mean_keepdim.expect" + "commit": { + "author": { + "user": { + "login": "seemethere" + }, + "email": "eliuriegas@fb.com", + "name": "Eli Uriegas" + }, + "oid": "afeba0773749da5883c378a2e6ac066e1ce62ca0" + } }, { - "path": "test/onnx/expect/TestOperators.test_reduced_prod.expect" + "commit": { + "author": { + "user": { + "login": "bdhirsh" + }, + "email": "hirsheybar@fb.com", + "name": "Brian Hirsh" + }, + "oid": "d306c99addc543908f64666baeecacbd0749f4a7" + } }, { - "path": "test/onnx/expect/TestOperators.test_reduced_prod_dtype.expect" + "commit": { + "author": { + "user": { + "login": "awgu" + }, + "email": "andgu@fb.com", + "name": "Andrew Gu" + }, + "oid": "c2456ea658f41f64ea054a422edf22a9c977399f" + } }, { - "path": "test/onnx/expect/TestOperators.test_reduced_prod_keepdim.expect" + "commit": { + "author": { + "user": { + "login": "awgu" + }, + "email": "andgu@fb.com", + "name": "Andrew Gu" 
+ }, + "oid": "a8b0a1b681c9fe41e0d553c962a5c93e81d92503" + } }, { - "path": "test/onnx/expect/TestOperators.test_reduced_sum.expect" + "commit": { + "author": { + "user": { + "login": "anjali411" + }, + "email": "chourdiaanjali123@gmail.com", + "name": "anjali411" + }, + "oid": "af761d9a5d058c9188f16589bae4f307d35185be" + } }, { - "path": "test/onnx/expect/TestOperators.test_reduced_sum_dtype.expect" + "commit": { + "author": { + "user": { + "login": "clee2000" + }, + "email": "csl@fb.com", + "name": "Catherine Lee" + }, + "oid": "beceb417baef35b15c2716e23178fb49f7fd6f9d" + } }, { - "path": "test/onnx/expect/TestOperators.test_reduced_sum_keepdim.expect" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "1516554e22136db89d0aeba43a1a1a987e995d68" + } }, { - "path": "test/onnx/expect/TestOperators.test_reducemax.expect" + "commit": { + "author": { + "user": { + "login": "qihqi" + }, + "email": "qihan@fb.com", + "name": "Han Qi" + }, + "oid": "68eb1fa8374eff6cbdcf0be5e37ed6775d22e722" + } }, { - "path": "test/onnx/expect/TestOperators.test_reducemin.expect" + "commit": { + "author": { + "user": { + "login": "janeyx99" + }, + "email": "janeyx@fb.com", + "name": "Jane Xu" + }, + "oid": "3c7bcb99b5c0c879c2610f427880b03881f82f38" + } }, { - "path": "test/onnx/expect/TestOperators.test_remainder.expect" + "commit": { + "author": { + "user": { + "login": "janeyx99" + }, + "email": "janeyx@fb.com", + "name": "Jane Xu" + }, + "oid": "38c1a2028090353e40a019c673c9ab16b39e4825" + } }, { - "path": "test/onnx/expect/TestOperators.test_repeat.expect" + "commit": { + "author": { + "user": { + "login": "albanD" + }, + "email": "albandes@fb.com", + "name": "Alban Desmaison" + }, + "oid": "8091cbea2c95ed2c4c406b3c61547a27c6319bae" + } }, { - "path": "test/onnx/expect/TestOperators.test_repeat_dim_overflow.expect" + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" + }, + "oid": "d81f59121969a47c8b2213a88e02cf9be0219be9" + } }, { - "path": "test/onnx/expect/TestOperators.test_round.expect" + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. 
Yang" + }, + "oid": "20d798b319cd107a767fe220f7a3027c18a1c844" + } }, { - "path": "test/onnx/expect/TestOperators.test_rrelu.expect" + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "eb35381a770b58c1cd41e935910cb4df2f3d8f14" + } }, { - "path": "test/onnx/expect/TestOperators.test_rsqrt.expect" + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "e6498a657b9aa47546dcd92d1b4ffb2e1a50ebdb" + } }, { - "path": "test/onnx/expect/TestOperators.test_rsub.expect" + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "7f821382db5ad08efe5b09a145c606852b8a9272" + } }, { - "path": "test/onnx/expect/TestOperators.test_scatter_add.expect" + "commit": { + "author": { + "user": { + "login": "albanD" + }, + "email": "albandes@fb.com", + "name": "Alban Desmaison" + }, + "oid": "995c0e11a97d854ff969962bd81d7341e46ecb07" + } }, { - "path": "test/onnx/expect/TestOperators.test_scatter_add_opset11.expect" + "commit": { + "author": { + "user": { + "login": "davidberard98" + }, + "email": "dberard@fb.com", + "name": "David Berard" + }, + "oid": "28d6258e62c9fc361a18689877c962c69889dc23" + } }, { - "path": "test/onnx/expect/TestOperators.test_selu.expect" + "commit": { + "author": { + "user": { + "login": "HarborYuan" + }, + "email": "yuanhaobo@whu.edu.cn", + "name": "Haobo Yuan" + }, + "oid": "2350fad8391367ebf81c7236a2c883644b4ff622" + } }, { - "path": "test/onnx/expect/TestOperators.test_shape_value_map.expect" + "commit": { + "author": { + "user": { + "login": "zou3519" + }, + "email": "zou3519@gmail.com", + "name": "Richard Zou" + }, + "oid": "3f789c9ccecdd7e2e52269453646e992a68c6b92" + } }, { - "path": "test/onnx/expect/TestOperators.test_sign.expect" + "commit": { + "author": { + "user": { + "login": "jeffdaily" + }, + "email": "jeff.daily@amd.com", + "name": "Jeff Daily" + }, + "oid": "20f79f610c1a3314da96d49515bbfbee9442e4f8" + } }, { - "path": "test/onnx/expect/TestOperators.test_sin.expect" + "commit": { + "author": { + "user": { + "login": "peterbell10" + }, + "email": "peterbell10@live.co.uk", + "name": "Peter Bell" + }, + "oid": "5823958f047f3b71a5dc8c52a20eb8ae3291bd3e" + } }, { - "path": "test/onnx/expect/TestOperators.test_slice.expect" + "commit": { + "author": { + "user": { + "login": "peterbell10" + }, + "email": "peterbell10@live.co.uk", + "name": "Peter Bell" + }, + "oid": "a0b15c49ecf3844daf2c0dcaef44f0214259db20" + } }, { - "path": "test/onnx/expect/TestOperators.test_slice_dynamic.expect" + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" + }, + "oid": "4afc38c25ca2ca126ba4987a419a58a5c572223b" + } }, { - "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy.expect" + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. 
Yang" + }, + "oid": "b606f58d4a36683fbe0a7d02adfdde7d5cc694c2" + } }, { - "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_3d.expect" + "commit": { + "author": { + "user": { + "login": "albanD" + }, + "email": "albandes@fb.com", + "name": "Alban Desmaison" + }, + "oid": "2d61b4d630f6482a6c3cc7437091fad6d27c347e" + } }, { - "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_3d_none.expect" + "commit": { + "author": { + "user": { + "login": "george-qi" + }, + "email": "georgeqi94@gmail.com", + "name": "George Qi" + }, + "oid": "bc5384c47036a6cda94129f3e2f9e43c43393698" + } }, { - "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_4d.expect" + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "60fc3277634365b64465712b13db2acb76d6c890" + } }, { - "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_ignore_index.expect" + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "1b8762e95bc38d1847fe99ed3230546c8b800bfd" + } }, { - "path": "test/onnx/expect/TestOperators.test_softmaxcrossentropy_weights.expect" + "commit": { + "author": { + "user": { + "login": "jerryzh168" + }, + "email": "jerryzh168@gmail.com", + "name": "Jerry Zhang" + }, + "oid": "6acf60f95f59ecbc6e8ce830dea0abba7d3ec763" + } }, { - "path": "test/onnx/expect/TestOperators.test_split.expect" + "commit": { + "author": { + "user": { + "login": "ysiraichi" + }, + "email": "yukio.siraichi@gmail.com", + "name": "Yukio Siraichi" + }, + "oid": "8fb0276561fdd530c5a06ea195e930e0584f8705" + } }, { - "path": "test/onnx/expect/TestOperators.test_split_with_sizes.expect" + "commit": { + "author": { + "user": { + "login": "albanD" + }, + "email": "albandes@fb.com", + "name": "Alban Desmaison" + }, + "oid": "1da7aed95a8700406671425eac1e4bbc2c7a24b5" + } }, { - "path": "test/onnx/expect/TestOperators.test_sqrt.expect" + "commit": { + "author": { + "user": { + "login": "thiagocrepaldi" + }, + "email": "thiago.crepaldi@microsoft.com", + "name": "Thiago Crepaldi" + }, + "oid": "83208e7dee4503c1bee1df9f6632794694dffa01" + } }, { - "path": "test/onnx/expect/TestOperators.test_std.expect" + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "1a46cf08dcd3d3564604c17b2c02d7e4eb45a7ff" + } }, { - "path": "test/onnx/expect/TestOperators.test_sum.expect" + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nshulga@fb.com", + "name": "Nikita Shulga" + }, + "oid": "b7f9b6689445f826c83694652fea5f7cfc7070d7" + } }, { - "path": "test/onnx/expect/TestOperators.test_sum_dtype.expect" + "commit": { + "author": { + "user": { + "login": "fatcat-z" + }, + "email": "jiz@microsoft.com", + "name": "Jay Zhang" + }, + "oid": "f273961c1696b156e35f8c76f7ad37934031050d" + } }, { - "path": "test/onnx/expect/TestOperators.test_tan.expect" + "commit": { + "author": { + "user": { + "login": "pavithranrao" + }, + "email": "pavithran@fb.com", + "name": "Pavithran Ramachandran" + }, + "oid": "eb410a51fcbc716873fd80a970eb932d4aaaea61" + } }, { - "path": "test/onnx/expect/TestOperators.test_topk.expect" + "commit": { + "author": { + "user": { + "login": "ngimel" + }, + "email": "ngimel@fb.com", + "name": "Natalia Gimelshein" + }, + "oid": "7dbb12cdc02332fa64264ed0df576511a5070d7e" + } }, { - "path": 
"test/onnx/expect/TestOperators.test_topk_smallest_unsorted.expect" + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "43675665fa6b5154de8b25125dd03d7be35c884f" + } }, { - "path": "test/onnx/expect/TestOperators.test_transpose.expect" + "commit": { + "author": { + "user": { + "login": "albanD" + }, + "email": "albandes@fb.com", + "name": "Alban Desmaison" + }, + "oid": "6c4d23c402c413667463770d9a2fa801f493d3c5" + } }, { - "path": "test/onnx/expect/TestOperators.test_type_as.expect" + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "cf3778a35129a40dee14366515201b7ed2c0f346" + } }, { - "path": "test/onnx/expect/TestOperators.test_unfold.expect" + "commit": { + "author": { + "user": { + "login": "dzdang" + }, + "email": "dzdang@umich.edu", + "name": "dzdang" + }, + "oid": "9d00a051373cb81f79cb6375942cf3ec9fff2fe6" + } }, { - "path": "test/onnx/expect/TestOperators.test_unique.expect" + "commit": { + "author": { + "user": { + "login": "pytorchmergebot" + }, + "email": "pytorchmergebot@users.noreply.github.com", + "name": "PyTorch MergeBot" + }, + "oid": "1eae67cf404aa8dffb80b8e85180f943878d52a6" + } }, { - "path": "test/onnx/expect/TestOperators.test_unsqueeze.expect" + "commit": { + "author": { + "user": { + "login": "janeyx99" + }, + "email": "janeyx@fb.com", + "name": "Jane Xu" + }, + "oid": "ce0e69dcda0fe41a6e964d6ac70ce8016979c71a" + } }, { - "path": "test/onnx/expect/TestOperators.test_upsample_nearest_scale.expect" + "commit": { + "author": { + "user": { + "login": "swolchok" + }, + "email": "swolchok@fb.com", + "name": "Scott Wolchok" + }, + "oid": "6faba554f6e49777f24911928edb3061b6ed0e3d" + } }, { - "path": "test/onnx/expect/TestOperators.test_upsample_nearest_scale_default_scale_factor.expect" + "commit": { + "author": { + "user": { + "login": "IvanYashchuk" + }, + "email": "ivan.yashchuk@aalto.fi", + "name": "Ivan Yashchuk" + }, + "oid": "d1d0e03f57a359f8f95331f9a34b8bed3e7cc845" + } }, { - "path": "test/onnx/expect/TestOperators.test_upsample_nearest_size.expect" + "commit": { + "author": { + "user": { + "login": "Chillee" + }, + "email": "chilli@fb.com", + "name": "Horace He" + }, + "oid": "bb46bd9233a9fc631802a902cb48a4c13c2722ca" + } }, { - "path": "test/onnx/expect/TestOperators.test_view.expect" + "commit": { + "author": { + "user": { + "login": "mehtanirav" + }, + "email": "niravmehta@fb.com", + "name": "Nirav Mehta" + }, + "oid": "3b1007fe4be12e483f2620fbac67cae42e703efc" + } }, { - "path": "test/onnx/expect/TestOperators.test_view_flatten.expect" + "commit": { + "author": { + "user": { + "login": "mehtanirav" + }, + "email": "niravmehta@fb.com", + "name": "Nirav Mehta" + }, + "oid": "b4b65228dd0c109f5fdf17c7d9e56f60a98e398b" + } }, { - "path": "test/onnx/expect/TestOperators.test_zeros_like.expect" + "commit": { + "author": { + "user": { + "login": "albanD" + }, + "email": "albandes@fb.com", + "name": "Alban Desmaison" + }, + "oid": "d629e300705196d3ae0bac5ed983b197101fa2ee" + } }, { - "path": "torch/csrc/jit/serialization/export.cpp" + "commit": { + "author": { + "user": { + "login": "bigfootjon" + }, + "email": "jonjanzen@fb.com", + "name": "Jon Janzen" + }, + "oid": "52754b9e515f378f8476ad44d75b0a692bad8cde" + } }, - { - "path": "torch/csrc/jit/serialization/export.h" - } - ], - "pageInfo": { - "endCursor": "MTYy", - 
"hasNextPage": false - } - } - } - } - } - }, - "query_sha=0e2a29eda6405cea4c9de20fb80ae7924910e17272a7b251040182e7d8c390e0 name=pytorch number=74649 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "closed": true, - "isCrossRepository": false, - "author": { - "login": "malfet" - }, - "title": "This should fail flake8", - "body": "Test issue for GHF mandatory checks", - "headRefName": "malfet-patch-8", - "headRepository": { - "nameWithOwner": "pytorch/pytorch" - }, - "baseRefName": "master", - "baseRepository": { - "nameWithOwner": "pytorch/pytorch", - "isPrivate": false, - "defaultBranchRef": { - "name": "master" - } - }, - "mergeCommit": null, - "commits_with_authors": { - "nodes": [ { "commit": { "author": { "user": { - "login": "malfet" + "login": "samdow" }, - "email": "nshulga@fb.com", - "name": "Nikita Shulga" + "email": "samdow@fb.com", + "name": "samdow" }, - "oid": "57c86ff1c5ab948888fd329986c9d55796680e33" + "oid": "128c3ad747093f4970329a82c7c4720420faeff2" } }, { "commit": { "author": { "user": { - "login": "malfet" + "login": "arindamroy-eng" }, - "email": "nshulga@fb.com", - "name": "Nikita Shulga" + "email": "61168652+arindamroy-eng@users.noreply.github.com", + "name": "arindamroy-eng" }, - "oid": "6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4" + "oid": "2a0bda7d32a5bcc9827f7254a7b77cceb16ba973" } } ], "pageInfo": { - "endCursor": "Mg", - "hasNextPage": false + "endCursor": "MTAw", + "hasNextPage": true }, - "totalCount": 2 + "totalCount": 131 }, "commits": { "nodes": [ @@ -15037,14 +24739,14 @@ } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHsK3w=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAWuNRg4=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018129" + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693698" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1E=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRAI=" }, { "node": { @@ -15061,9 +24763,9 @@ } }, "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018131" + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693712" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1M=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRBA=" }, { "node": { @@ -15080,9 +24782,9 @@ } }, "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018132" + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693725" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1Q=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRB0=" }, { "node": { @@ -15099,9 +24801,9 @@ } }, "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018134" + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693741" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1Y=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRC0=" }, { "node": { @@ -15118,9 +24820,9 @@ } }, "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018139" + "url": 
"https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693761" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1s=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsREE=" }, { "node": { @@ -15137,9 +24839,9 @@ } }, "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018142" + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193693774" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj14=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRE4=" }, { "node": { @@ -15149,110 +24851,85 @@ }, "workflowRun": { "workflow": { - "name": "Lint" + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" } }, "checkRuns": { "nodes": [ { - "name": "clang-format", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669399915?check_suite_focus=true" - }, - { - "name": "clang-tidy", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669399990?check_suite_focus=true" - }, - { - "name": "cmakelint", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669400052?check_suite_focus=true" - }, - { - "name": "flake8-py3", - "conclusion": "FAILURE", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669400154?check_suite_focus=true" - }, - { - "name": "mypy", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669400239?check_suite_focus=true" - }, + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192463/jobs/3232430975" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAWuNR-Y=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193694412" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRsw=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ { "name": "Test collect_env (with_torch)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669400327?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461134" }, { "name": "Test collect_env (without_torch)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669400361?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461211" }, { - "name": "Test tools", + "name": "toc", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669400470?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461301" }, { - "name": "py2-setup-validate-errormsg", + "name": "Test tools", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669400681?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461386" }, { "name": "quick-checks", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669400789?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461521" }, { - "name": "toc", + "name": "lintrunner", "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/5669400953?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461634" }, { - "name": "shellcheck", + "name": "workflow-checks", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669401126?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHsMiY=", - "hasNextPage": false - } - }, - "conclusion": "FAILURE", - "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018384" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlkFA=" - }, - { - "node": { - "app": { - "name": "GitHub Actions", - "databaseId": 15368 - }, - "workflowRun": { - "workflow": { - "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" - } - }, - "checkRuns": { - "nodes": [ - { - "name": "run-torchbench", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669399917?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192461/jobs/3232461717" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHsLW0=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAWuN84s=", "hasNextPage": false } }, - "conclusion": "SKIPPED", - "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018395" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193694417" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlkFs=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRtE=" }, { "node": { @@ -15268,2704 +24945,7298 @@ "checkRuns": { "nodes": [ { - "name": "pytorch-xla-linux-bionic-py3.7-clang8", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669414276?check_suite_focus=true" + "name": "linux-xenial-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232460797" }, { - "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "name": "linux-bionic-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669414324?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232460951" }, { - "name": "linux-bionic-py3.7-clang9 / build", + "name": "linux-xenial-py3.7-clang7-onnx / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669414430?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461088" }, { - "name": "linux-bionic-rocm4.5-py3.7 / build", + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669414605?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461294" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669414697?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461410" }, { - "name": "linux-xenial-py3.7-gcc5.4 / build", + "name": "linux-xenial-py3.7-clang7-asan / build", "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/5669414841?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461543" }, { - "name": "linux-xenial-py3-clang5-mobile-build / build", + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669414951?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461628" }, { - "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "name": "linux-bionic-rocm5.0-py3.7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669415003?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461719" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "name": "linux-vulkan-bionic-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669415060?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461789" }, { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669415120?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461869" }, { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669415166?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232461946" }, { - "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669415236?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462044" }, { - "name": "win-vs2019-cuda11.3-py3 / build", + "name": "linux-xenial-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669415288?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462112" }, { - "name": "win-vs2019-cpu-py3 / build", + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669415348?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462244" }, { - "name": "linux-xenial-py3.7-gcc7-no-ops / build", + "name": "win-vs2019-cuda11.3-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669415451?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462360" }, { - "name": "linux-xenial-py3.7-gcc7 / build", + "name": "linux-xenial-py3-clang5-mobile-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669415561?check_suite_focus=true" + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462432" }, { - "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", + "name": "win-vs2019-cpu-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669415607?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462521" }, { - "name": "linux-xenial-py3.7-clang7-onnx / build", + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669415642?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462621" }, { - "name": "pytorch-xla-linux-bionic-py3.7-clang8", - "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669415706?check_suite_focus=true" + "name": "linux-xenial-py3.7-gcc5.4 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462683" }, { - "name": "linux-xenial-py3.7-clang7-asan / build", + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669415757?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232462738" }, { "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669488974?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232545510" }, { "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669489019?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232545571" }, { "name": "linux-docs / build-docs (cpp)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669492162?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547522" }, { "name": "linux-docs / build-docs (python)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669492211?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547612" }, { "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669492293?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547714" }, { "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669492341?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547764" }, { "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669492396?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547824" }, { "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/5669492440?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547869" }, { "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669492497?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547909" }, { "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669492558?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232547973" }, { - "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669496296?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232553452" }, { - "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669496350?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232553558" }, { - "name": "linux-bionic-py3.7-clang9 / test (noarch, 1, 1, linux.2xlarge)", + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669496393?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232553605" }, { - "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669498726?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232553650" }, { "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669500818?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232563716" }, { "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669500848?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232563763" }, { "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669518721?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232582650" }, { "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669518760?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232582703" }, { "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge)", "conclusion": "SUCCESS", - 
"detailsUrl": "https://github.com/pytorch/pytorch/runs/5669518798?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232582741" }, { - "name": "linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669549301?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232590204" }, { - "name": "linux-bionic-rocm4.5-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "name": "linux-bionic-rocm5.0-py3.7 / test (default, 1, 2, linux.rocm.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669549318?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232608872" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "name": "linux-bionic-rocm5.0-py3.7 / test (default, 2, 2, linux.rocm.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669559843?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232608976" }, { "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669567414?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232637097" }, { "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669567499?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232637199" }, { "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669567553?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232637259" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232639932" }, { "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669619773?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232687012" }, { "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669619803?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232687074" }, { "name": "win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669724420?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232785088" }, { "name": "win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669724451?check_suite_focus=true" - 
}, - { - "name": "win-vs2019-cuda11.3-py3 / test (force_on_cpu, 1, 1, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/5669724478?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2197192471/jobs/3232785153" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHxIT4=", - "hasNextPage": false + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAWuVD9M=", + "hasNextPage": true } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018405" + "url": "https://github.com/pytorch/pytorch/commit/5696e8357cf38f852ef3d680381513e26f202371/checks?check_suite_id=6193694439" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlkGU=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXEsRuc=" } ], "pageInfo": { "hasNextPage": false } }, - "pushedDate": "2022-03-24T00:42:33Z", - "oid": "6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4" + "status": null, + "pushedDate": "2022-04-20T17:10:41Z", + "oid": "5696e8357cf38f852ef3d680381513e26f202371" } } ] }, - "changedFiles": 1, + "changedFiles": 348, "files": { "nodes": [ { - "path": "torch/nn/cpp.py" - } - ], - "pageInfo": { - "endCursor": "MQ", - "hasNextPage": false - } - }, - "reviews": { - "nodes": [ + "path": ".circleci/cimodel/data/pytorch_build_data.py" + }, { - "author": { - "login": "seemethere" - }, - "state": "APPROVED" - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wMy0yM1QxNzo1MDo0NS0wNTowMLkyMDIyLTAzLTIzVDE3OjUwOjQ1LTA1OjAwzjbPEDg=", - "hasPreviousPage": false - } - }, - "comments": { - "nodes": [ + "path": ".circleci/cimodel/data/pytorch_build_definitions.py" + }, { - "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/74649\n\u21a9\ufe0f \u00a0[fb-only] Re-run with SSH instructions\nNeed help or want to give feedback on the CI? Visit our office hours\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 6c3c3de (more details on the Dr. CI page):\n\n\n1/1 failures introduced in this PR\n\n\n1 failure not recognized by patterns:\n\n\n\nJob\nStep\nAction\n\n\n\n\n Lint / flake8-py3\nFail if there were any warnings\n\ud83d\udd01 rerun\n\n\n\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. 
CI Users group.\nClick here to manually regenerate this comment.", - "author": { - "login": "facebook-github-bot" - }, - "authorAssociation": "MEMBER", - "editor": { - "login": "facebook-github-bot" - }, - "databaseId": 1076891218 - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOQDAOUg==", - "hasPreviousPage": false - } - }, - "labels": { - "edges": [ + "path": ".circleci/scripts/cpp_doc_push_script.sh" + }, { - "node": { - "name": "cla signed" - } - } - ] - } - } - } - } - }, - "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=None name=metamates org=pytorch": { - "data": { - "organization": { - "team": { - "members": { - "nodes": [ + "path": ".circleci/scripts/python_doc_push_script.sh" + }, { - "login": "dreiss" + "path": ".github/actions/checkout-pytorch/action.yml" }, { - "login": "kumpera" + "path": ".github/merge_rules.json" }, { - "login": "ezyang" + "path": ".github/scripts/gitutils.py" }, { - "login": "yaroslavvb" + "path": ".github/scripts/gql_mocks.json" }, { - "login": "stephenroller" + "path": ".github/scripts/trymerge.py" }, { - "login": "swolchok" + "path": ".github/workflows/_bazel-build-test.yml" }, { - "login": "hyuen" + "path": ".github/workflows/_linux-build.yml" }, { - "login": "orionr" + "path": ".github/workflows/_linux-test.yml" }, { - "login": "dhruvbird" + "path": ".github/workflows/_mac-test.yml" }, { - "login": "likethesky" + "path": ".github/workflows/_rocm-test.yml" }, { - "login": "lw" + "path": ".github/workflows/_win-test.yml" }, { - "login": "raziel" + "path": ".github/workflows/buck_build_test.yml" }, { - "login": "simpkins" + "path": ".github/workflows/lint.yml" }, { - "login": "ebyrne" + "path": ".github/workflows/periodic.yml" }, { - "login": "Babar" + "path": ".github/workflows/pull.yml" }, { - "login": "kostmo" + "path": ".github/workflows/trunk.yml" }, { - "login": "0x00b1" + "path": ".jenkins/pytorch/macos-test.sh" }, { - "login": "bhosmer" + "path": ".jenkins/pytorch/test.sh" }, { - "login": "zdevito" + "path": ".jenkins/pytorch/win-test.sh" }, { - "login": "bugra" + "path": ".lintrunner.toml" }, { - "login": "caraya10" + "path": "BUILD.bazel" }, { - "login": "kit1980" + "path": "CODEOWNERS" }, { - "login": "shoumikhin" + "path": "README.md" }, { - "login": "huydhn" + "path": "aten/src/ATen/BatchingRegistrations.cpp" }, { - "login": "teytaud" + "path": "aten/src/ATen/Dispatch.h" }, { - "login": "xuzhao9" + "path": "aten/src/ATen/ExpandUtils.h" }, { - "login": "jansel" + "path": "aten/src/ATen/FunctionalInverses.cpp" + }, + { + "path": "aten/src/ATen/FunctionalStorageImpl.cpp" + }, + { + "path": "aten/src/ATen/FunctionalStorageImpl.h" + }, + { + "path": "aten/src/ATen/FunctionalTensorWrapper.cpp" + }, + { + "path": "aten/src/ATen/FunctionalTensorWrapper.h" + }, + { + "path": "aten/src/ATen/FunctionalizeFallbackKernel.cpp" + }, + { + "path": "aten/src/ATen/NestedTensorImpl.cpp" + }, + { + "path": "aten/src/ATen/OpMathType.h" + }, + { + "path": "aten/src/ATen/SparseCsrTensorUtils.h" + }, + { + "path": "aten/src/ATen/ThreadLocalState.cpp" + }, + { + "path": "aten/src/ATen/ThreadLocalState.h" + }, + { + "path": "aten/src/ATen/autocast_mode.cpp" + }, + { + "path": "aten/src/ATen/autocast_mode.h" + }, + { + "path": "aten/src/ATen/core/SymIntArrayRef.cpp" + }, + { + "path": "aten/src/ATen/core/SymIntArrayRef.h" + }, + { + "path": "aten/src/ATen/core/TensorBase.h" + }, + { + "path": "aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h" + }, + { + "path": "aten/src/ATen/core/dispatch/Dispatcher.h" + 
}, + { + "path": "aten/src/ATen/core/interned_strings.h" + }, + { + "path": "aten/src/ATen/core/ivalue.cpp" + }, + { + "path": "aten/src/ATen/core/ivalue.h" + }, + { + "path": "aten/src/ATen/core/ivalue_inl.h" + }, + { + "path": "aten/src/ATen/core/jit_type.h" + }, + { + "path": "aten/src/ATen/core/jit_type_base.h" + }, + { + "path": "aten/src/ATen/core/type.cpp" + }, + { + "path": "aten/src/ATen/cuda/CUDASparse.h" + }, + { + "path": "aten/src/ATen/cuda/llvm_complex.cpp" + }, + { + "path": "aten/src/ATen/cuda/llvm_jit_strings.h" + }, + { + "path": "aten/src/ATen/native/Blas.cpp" + }, + { + "path": "aten/src/ATen/native/Itertools.cpp" + }, + { + "path": "aten/src/ATen/native/LinearAlgebra.cpp" + }, + { + "path": "aten/src/ATen/native/SoftMax.cpp" + }, + { + "path": "aten/src/ATen/native/TensorConversions.cpp" + }, + { + "path": "aten/src/ATen/native/TensorShape.cpp" + }, + { + "path": "aten/src/ATen/native/TensorShape.h" + }, + { + "path": "aten/src/ATen/native/Unique.cpp" + }, + { + "path": "aten/src/ATen/native/cuda/BinaryMiscBackwardOpsKernels.cu" }, { - "login": "abhinavarora" + "path": "aten/src/ATen/native/cuda/CUDAJitLoops.cuh" }, { - "login": "b0noI" + "path": "aten/src/ATen/native/cuda/JitLoops.cuh" }, { - "login": "djthorne" + "path": "aten/src/ATen/native/cuda/Lerp.cu" }, { - "login": "nairbv" + "path": "aten/src/ATen/native/cuda/PersistentSoftmax.cuh" }, { - "login": "Mortimerp9" + "path": "aten/src/ATen/native/cuda/SoftMax.cu" }, { - "login": "dadkins20" + "path": "aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu" }, { - "login": "colesbury" + "path": "aten/src/ATen/native/cuda/Unique.cu" }, { - "login": "laurencer" + "path": "aten/src/ATen/native/cuda/jit_utils.cpp" }, { - "login": "nickgg" + "path": "aten/src/ATen/native/cuda/jit_utils.h" }, { - "login": "yzhao30" + "path": "aten/src/ATen/native/native_functions.yaml" }, { - "login": "rmaz" + "path": "aten/src/ATen/native/nested/NestedTensorMath.cpp" }, { - "login": "bearzx" + "path": "aten/src/ATen/native/quantized/cpu/qembeddingbag.cpp" }, { - "login": "mattjgalloway" + "path": "aten/src/ATen/native/quantized/cpu/qsoftmax.cpp" }, { - "login": "chenyang78" + "path": "aten/src/ATen/native/quantized/cudnn/BinaryOps.cpp" }, { - "login": "yns88" + "path": "aten/src/ATen/native/quantized/cudnn/Linear.cpp" }, { - "login": "lc0" + "path": "aten/src/ATen/native/quantized/cudnn/utils.h" }, { - "login": "wenleix" + "path": "aten/src/ATen/native/sparse/SparseCsrTensor.cpp" }, { - "login": "jingsh" + "path": "aten/src/ATen/native/ts_native_functions.yaml" }, { - "login": "mthrok" + "path": "aten/src/ATen/record_function.cpp" }, { - "login": "drdarshan" + "path": "aten/src/ATen/record_function.h" }, { - "login": "tvalentius" + "path": "aten/src/ATen/templates/Operators.h" }, { - "login": "d4l3k" + "path": "aten/src/ATen/templates/RegisterFunctionalization.cpp" }, { - "login": "jamiemccrindle" + "path": "aten/src/ATen/test/basic.cpp" }, { - "login": "kazhang" + "path": "aten/src/ATen/test/vmap_test.cpp" }, { - "login": "simonhollis" + "path": "binaries/record_function_benchmark.cc" }, { - "login": "lqiao" + "path": "c10/core/DispatchKey.cpp" }, { - "login": "ajyu" + "path": "c10/core/DispatchKey.h" }, { - "login": "govardhan" + "path": "c10/core/DispatchKeySet.h" }, { - "login": "yinghai" + "path": "c10/test/core/DispatchKeySet_test.cpp" }, { - "login": "zyan0" + "path": "c10/util/ArrayRef.h" }, { - "login": "ajtulloch" + "path": "caffe2/core/tensor.h" }, { - "login": "vtlam" + "path": "docs/source/conf.py" }, { - "login": "pbelevich" + 
"path": "docs/source/fx.rst" + } + ], + "pageInfo": { + "endCursor": "MTAw", + "hasNextPage": true + } + }, + "reviews": { + "nodes": [], + "pageInfo": { + "startCursor": null, + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ + { + "bodyText": "Merge failed due to Matched rule superuser, but it was not reviewed yet by any of:zou3519,abhikrish,mehtanirav,wconstab,lc0, ...", + "createdAt": "2022-04-20T17:26:18Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1104215370 }, { - "login": "VitalyFedyunin" + "bodyText": "Merge failed due to Matched rule superuser, but PR has not been reviewed yet", + "createdAt": "2022-04-20T17:31:26Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1104220908 }, { - "login": "dbish" + "bodyText": "@pytorchbot merge this", + "createdAt": "2022-04-20T19:30:50Z", + "author": { + "login": "malfet" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1104378397 }, { - "login": "NicolasHug" + "bodyText": "Merge failed due to Matched rule superuser, but PR has not been reviewed yet\nRaised by https://github.com/pytorch/pytorch/actions/runs/2197877090", + "createdAt": "2022-04-20T19:32:10Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1104379712 }, { - "login": "efaust" + "bodyText": "Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale. Feel free to remove the Stale label if you feel this was a mistake. If you are unable to remove the Stale label please contact a maintainer in order to do so. If you want the bot to never mark this PR stale again, add the no-stale label.Stale pull requests will automatically be closed after 30 days of inactivity.", + "createdAt": "2022-06-20T16:44:05Z", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1160658699 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQdD9Sg==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ + { + "node": { + "name": "cla signed" + } }, { - "login": "jfix71" + "node": { + "name": "Stale" + } + } + ] + } + } + } + } + }, + "query_sha=81fd873151c3cded18314e9e53bf54a93ffb0afa9c52fa2cbafb2ceab7df5e45 name=pytorch number=76123 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": true, + "author": { + "login": "kumpera" + }, + "title": "Introduce distributed checkpoint with ShardedTensor.", + "body": "Co-authored-by: Wen Zhang \r\nCo-authored-by: Yifu Wang \r\n\r\n", + "headRefName": "st_checkpoint", + "headRepository": { + "nameWithOwner": "kumpera/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "kumpera" + }, + "email": "kumpera@fb.com", + "name": "Rodrigo Kumpera" + }, + "oid": "6bf248bc20a71f248064b795f38276326fe43aae" + } }, { - "login": "atuljangra" + "commit": { + "author": { + "user": { + "login": "kumpera" + }, + "email": "kumpera@fb.com", + "name": "Rodrigo Kumpera" + }, + "oid": "10f84fb90bf02d7062e565ebf2c1da6352b64db7" + } }, { - "login": "idning" + "commit": { + "author": { + "user": { + "login": "kumpera" + }, + "email": 
"kumpera@fb.com", + "name": "Rodrigo Kumpera" + }, + "oid": "96c5299740ec791f3cf0975c03a40a7b219b6747" + } + } + ], + "pageInfo": { + "endCursor": "Mw", + "hasNextPage": false + }, + "totalCount": 3 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS", + "detailsUrl": "https://code.intern.facebook.com/cla/" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXgS2l4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6380755666" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXxSmtI=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063614/jobs/3379894109" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXd2r3Q=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6380755785" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXxSm0k=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894107" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894332" + }, + { + "name": "lintrunner", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894444" + }, + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894520" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894567" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894616" + }, + { + "name": "workflow-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063615/jobs/3379894672" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXd2shU=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6380755786" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXxSm0o=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pull" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902301" + }, + { + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", 
+ "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902363" + }, + { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902507" + }, + { + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902560" + }, + { + "name": "win-vs2019-cpu-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902579" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902603" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902637" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902685" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902740" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902761" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902794" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379902874" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903006" + }, + { + "name": "linux-xenial-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903111" + }, + { + "name": "linux-xenial-py3-clang5-mobile-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903193" + }, + { + "name": "linux-xenial-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903284" + }, + { + "name": "win-vs2019-cuda11.3-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903357" + }, + { + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903446" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903512" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379903546" + }, + { + "name": 
"linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379944655" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379944695" + }, + { + "name": "linux-docs / build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946308" + }, + { + "name": "linux-docs / build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946337" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946359" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946391" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946423" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946453" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946496" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379946529" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379950041" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379950137" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379950165" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379950192" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379950646" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379951202" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379951230" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 4, linux.2xlarge)", + 
"conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379963877" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379963928" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379963976" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 4, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379964018" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379966372" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379996173" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379996218" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379997861" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379998374" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379998397" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379998422" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3379998441" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2273063632/jobs/3380042106" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXd5yuY=", + "hasNextPage": true + } + }, + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6380755806" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXxSm14=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "lintrunner", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387419477" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387419699" + }, + { + "name": "Test collect_env (with_torch)", + 
"conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387419923" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387419992" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387420129" + }, + { + "name": "workflow-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387420208" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796859/jobs/3387420309" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXgS3SE=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6390363240" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXzlNGg=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796862/jobs/3387419465" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXgS1-o=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6390363271" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXzlNIc=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pull" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "linux-bionic-rocm5.1-py3.7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387419999" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387420164" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387420316" + }, + { + "name": "linux-xenial-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387420477" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387420675" + }, + { + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387420934" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387421278" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387421672" + }, + { + "name": "linux-xenial-py3-clang5-mobile-build / build", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387421888" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387421982" + }, + { + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387422191" + }, + { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387422303" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387422476" + }, + { + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387422715" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387422963" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387423092" + }, + { + "name": "linux-xenial-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387423234" + }, + { + "name": "win-vs2019-cpu-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387423421" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387423622" + }, + { + "name": "win-vs2019-cuda11.3-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387423739" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387545789" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387546032" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387546119" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387553028" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387553144" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387553251" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", + 
"conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387553438" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387553556" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387553668" + }, + { + "name": "linux-docs / build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387554002" + }, + { + "name": "linux-docs / build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387554098" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387558927" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387559016" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387559071" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387559139" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387563803" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387563894" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387580868" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387580936" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387580993" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 4, 4, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387581053" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387592286" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387631950" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387632035" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387649916" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387649974" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387650084" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387650151" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387650373" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2276796865/jobs/3387753429" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAXgaCXo=", + "hasNextPage": true + } + }, + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/96c5299740ec791f3cf0975c03a40a7b219b6747/checks?check_suite_id=6390363300" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXzlNKQ=" + } + ], + "pageInfo": { + "hasNextPage": false + } + }, + "status": null, + "pushedDate": "2022-05-05T00:34:26Z", + "oid": "96c5299740ec791f3cf0975c03a40a7b219b6747" + } + } + ] + }, + "changedFiles": 11, + "files": { + "nodes": [ + { + "path": "test/distributed/_shard/checkpoint/test_checkpoint.py" }, { - "login": "soumith" + "path": "test/distributed/_shard/checkpoint/test_file_system_checkpoint.py" }, { - "login": "nimin98" + "path": "test/distributed/_shard/sharded_tensor/test_sharded_tensor.py" }, { - "login": "chaekit" + "path": "torch/distributed/_shard/checkpoint/__init__.py" }, { - "login": "radkris-git" + "path": "torch/distributed/_shard/checkpoint/filesystem.py" }, { - "login": "xunnanxu" + "path": "torch/distributed/_shard/checkpoint/metadata.py" }, { - "login": "javier-m" + "path": "torch/distributed/_shard/checkpoint/resharding.py" }, { - "login": "jmdetloff" + "path": "torch/distributed/_shard/checkpoint/state_dict_loader.py" }, { - "login": "mostafaelhoushi" + "path": "torch/distributed/_shard/checkpoint/state_dict_saver.py" }, { - "login": "brianjo" + "path": "torch/distributed/_shard/checkpoint/storage.py" }, { - "login": "ShijunK" - }, + "path": "torch/testing/_internal/distributed/_shard/sharded_tensor/_test_st_common.py" + } + ], + "pageInfo": { + "endCursor": "MTE", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ { - "login": "suo" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "vkuzo" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "seemethere" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "cpuhrsch" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "qihqi" + "author": { + "login": "zzzwen" + }, + "state": "COMMENTED" }, { - "login": "jackm321" + 
"author": { + "login": "zzzwen" + }, + "state": "COMMENTED" }, { - "login": "linbinyu" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "neerajprad" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "gnadathur" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "rsemenov" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "ziky90" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "gmagogsfm" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "zzzwen" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "ikriv" + "author": { + "login": "wanchaol" + }, + "state": "COMMENTED" }, { - "login": "deeptigp" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "andrewor14" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "jianyuh" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "cykustcc" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "highker" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "beauby" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "jeffreyksmithjr" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "suphoff" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "smessmer" - } - ], - "pageInfo": { - "hasNextPage": true, - "endCursor": "Y3Vyc29yOnYyOpHOACQ5JQ==" - } - } - } - } - } - }, - "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=Y3Vyc29yOnYyOpHOACQ5JQ== name=metamates org=pytorch": { - "data": { - "organization": { - "team": { - "members": { - "nodes": [ + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + }, { - "login": "ananthsub" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "d1jang" + "author": { + "login": "zzzwen" + }, + "state": "COMMENTED" }, { - "login": "firstprayer" + "author": { + "login": "zzzwen" + }, + "state": "COMMENTED" }, { - "login": "malfet" + "author": { + "login": "simpkins" + }, + "state": "COMMENTED" }, { - "login": "fegin" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "hanton" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "zanqi" + "author": { + "login": "zzzwen" + }, + "state": "COMMENTED" }, { - "login": "bujar" + "author": { + "login": "zzzwen" + }, + "state": "COMMENTED" }, { - "login": "supriyar" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "kausv" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "divchenko" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "dagitses" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "rahuln32" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "bilgeacun" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "caogao" + "author": { + "login": "simpkins" + }, + "state": "COMMENTED" }, { - "login": "miguelmartin75" + "author": { + "login": "simpkins" + }, + "state": "COMMENTED" }, { - "login": "penguinwu" + "author": { + "login": "pritamdamania87" + }, + "state": "COMMENTED" }, { - "login": "shz117" + "author": { + "login": "pritamdamania87" + }, + "state": "COMMENTED" }, { - 
"login": "ajliu" + "author": { + "login": "pritamdamania87" + }, + "state": "COMMENTED" }, { - "login": "saketh-are" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "msaroufim" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "mdundas" + "author": { + "login": "wilson100hong" + }, + "state": "COMMENTED" }, { - "login": "davides" + "author": { + "login": "wilson100hong" + }, + "state": "COMMENTED" }, { - "login": "alannnna" + "author": { + "login": "wilson100hong" + }, + "state": "COMMENTED" }, { - "login": "hlin09" + "author": { + "login": "xunnanxu" + }, + "state": "DISMISSED" }, { - "login": "hudeven" + "author": { + "login": "xunnanxu" + }, + "state": "COMMENTED" }, { - "login": "terrychenism" + "author": { + "login": "xunnanxu" + }, + "state": "COMMENTED" }, { - "login": "xiaomengy" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "jisaacso" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "fkhan1337" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "xing-liu" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "alanadakotashine" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "desertfire" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "YosuaMichael" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "banitag1" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "letterx" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "gchanan" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "dbort" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "bilalsal" + "author": { + "login": "xunnanxu" + }, + "state": "COMMENTED" }, { - "login": "DanilBaibak" + "author": { + "login": "xunnanxu" + }, + "state": "COMMENTED" }, { - "login": "serhaty" + "author": { + "login": "xunnanxu" + }, + "state": "COMMENTED" }, { - "login": "yf225" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "yifuwang" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "piyushmh" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "z-a-f" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "superzgc" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "bertmaher" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "chauhang" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "ZainRizvi" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "jiayisuse" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "bochko" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "jeanschmidt" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "bradleyhd" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "ZolotukhinM" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "jamesr66a" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "mullachv" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "voznesenskym" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" 
}, { - "login": "charliechen0401" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "bwasti" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "cryptopic" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "chinannyang" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "NivekT" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "zhxchen17" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "jerryzh168" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "MohammadMahdiJavanmard" + "author": { + "login": "pritamdamania87" + }, + "state": "COMMENTED" }, { - "login": "rajkar86" + "author": { + "login": "pritamdamania87" + }, + "state": "COMMENTED" }, { - "login": "wconstab" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "Hangjun" + "author": { + "login": "pritamdamania87" + }, + "state": "COMMENTED" }, { - "login": "davidberard98" + "author": { + "login": "pritamdamania87" + }, + "state": "APPROVED" }, { - "login": "Krovatkin" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "CamiWilliams" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "J0Nreynolds" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "datumbox" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "aartibasant" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "xta0" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "zou3519" + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" }, { - "login": "xman1979" - }, + "author": { + "login": "kumpera" + }, + "state": "COMMENTED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNC0yNVQxMTozNTowMS0wNzowMLkyMDIyLTA0LTI1VDExOjM1OjAwLTA3OjAwzjjC2d0=", + "hasPreviousPage": true + } + }, + "comments": { + "nodes": [ { - "login": "suraj813" + "bodyText": "Merge failed due to Can't fetch all PR reviews\nRaised by https://github.com/pytorch/pytorch/actions/runs/2275691136", + "createdAt": "2022-05-05T12:35:49Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1118495479 }, { - "login": "gqchen" + "bodyText": "Merge failed due to Can't fetch all PR reviews\nRaised by https://github.com/pytorch/pytorch/actions/runs/2275691136", + "createdAt": "2022-05-05T12:53:15Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1118511287 }, { - "login": "george-qi" + "bodyText": "Merge failed due to Can't fetch all PR reviews\nRaised by https://github.com/pytorch/pytorch/actions/runs/2275691136", + "createdAt": "2022-05-05T15:00:08Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1118662274 }, { - "login": "abhikrish" + "bodyText": "Merge failed due to Can't fetch all PR reviews Raised by https://github.com/pytorch/pytorch/actions/runs/2275691136\n\n@osalpekar @malfet This is failing because there are 109 review comments on this PR but we only fetch the first 100. 
This could be solved with a similar concept as how we fetch more comments/check_runs.", + "createdAt": "2022-05-05T15:20:46Z", + "author": { + "login": "janeyx99" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1118689010 }, { - "login": "zhangguanheng66" - }, + "bodyText": "On a side note, has the test_fsdp_clip_grad_norm_norm_type_2_0_nested_fsdp_False_cpu_offload_CPUOffload failure on the distributed test first shard of this PR been addressed?", + "createdAt": "2022-05-05T15:24:08Z", + "author": { + "login": "janeyx99" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1118693497 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQqri9w==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ { - "login": "mikeiovine" + "node": { + "name": "oncall: distributed" + } }, { - "login": "Adolfo-Karim" - }, + "node": { + "name": "cla signed" + } + } + ] + } + } + } + } + }, + "query_sha=81fd873151c3cded18314e9e53bf54a93ffb0afa9c52fa2cbafb2ceab7df5e45 name=pytorch number=71759 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": true, + "author": { + "login": "coolteemf" + }, + "title": "Optimize grid sample 3d", + "body": "Fixes #71415\r\nI have implemented the changes that replicate what @to-mi did in this [PR](https://github.com/pytorch/pytorch/pull/65986#issue-1012959443) for the 3D case :\r\n\r\n> Fixes #64977\r\n> \r\n> Avoids creating a tensor for and calculating `input` gradient if it's not needed in the backward pass of `grid_sample` (2d case, native CPU & CUDA kernels). Especially the tensor creation seemed time consuming (see #64977).\r\n> \r\n> Brief description of the changes:\r\n> \r\n> * I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).\r\n> \r\n> * Changed the `native_functions.yaml` and `derivatives.yaml` so that the gradient input mask is passed to the functions.\r\n> \r\n> * Changed the CPU kernels:\r\n> (1) added `bool input_requires_grad` template parameter to the `backward` function,\r\n> (2) added if branches based on it to remove `input` gradient computations if it's not requested,\r\n> (3) feed in `TensorAccessor* gInp_slice_ptr` instead of `TensorAccessor& gInp_slice` so that I can pass a `nullptr` in case gradient for `input` is not requested. (A bit inelegant perhaps, but allows to keep one signature for `backward` function and not require breaking it to smaller pieces. 
Perhaps there's a more elegant way to achieve this?)\r\n> \r\n> * Changed CUDA kernel:\r\n> (1) added ~`bool input_requires_grad` template parameter~ `const bool input_requires_grad` argument to the `backward` function,\r\n> (2) added if branches based on it to remove `input` gradient computations if it's not requested,\r\n> (3) feed in `TensorInfo()` instead of `getTensorInfo(grad_input)` in case gradient for `input` is not requested.\r\n> \r\n> * Modified tests in `test/test_nn.py` so that they run also cases with no `input` gradient needed.\r\n> \r\n> * Have not touched the CPU fallback kernel.\r\n\r\nNote: the changes number (3) are N/A in this case.\r\n\r\n", + "headRefName": "optimize_grid_sample_3d", + "headRepository": { + "nameWithOwner": "coolteemf/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ { - "login": "Chillee" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "e0b0d1e695aeddceaf265da602c4704592053e9e" + } }, { - "login": "albanD" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "563ec73747ad53b63b36736c47c4342f962c2a09" + } }, { - "login": "bigfootjon" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "51abe41a132d9dd5b1c0551bdca902aacc028ff8" + } }, { - "login": "robotal" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "be9898205992034a00e8ace8a55c2ecdcee2c2f8" + } }, { - "login": "MarcioPorto" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "2929c60b64384c2deae0f7dea8bab94ad4bc9ec8" + } }, { - "login": "srsuryadev" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "9241b737e7e2b257905cc74ad9c50b737d7f9d0a" + } }, { - "login": "IvanKobzarev" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "64d6b795d0636928a8aa2fd3da01302fb5f5f7af" + } }, { - "login": "eprivezentsev" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "4503577e53760a0006f1e80ca6bfe04d2be90470" + } }, { - "login": "kwen2501" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "b16f4b11ffbbbf2ca2098f9702af4ef6b6fc5e1f" + } }, { - "login": "linux-jedi" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "7ffc23368a604afdc92d2818747f730ce31a2bb5" + } }, { - "login": "chandlerzuo" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "b85292604b9ad6c31706b76b5a5498c4f6d94309" + } }, { - "login": "prateek1404" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "9d81d7bae8ad91aaa24b3ceab83e3138894dbc69" + } }, { - "login": 
"otsneh" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "e79f6a2202512b294c55bf4bfb2e0524fafd4c48" + } }, { - "login": "husthyc" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "f683e8aec7aea76097a264eec01511e704c31154" + } }, { - "login": "briancoutinho" + "commit": { + "author": { + "user": { + "login": "coolteemf" + }, + "email": "67541941+coolteemf@users.noreply.github.com", + "name": "Fran\u00e7ois Lecomte" + }, + "oid": "b932e9e286c22aaf352375186df851ef060b295a" + } }, { - "login": "fduwjj" + "commit": { + "author": { + "user": null, + "email": "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", + "name": "coolteemf" + }, + "oid": "346e0c547953d98eb84d23c1391a95badb9c4a22" + } } ], "pageInfo": { - "hasNextPage": true, - "endCursor": "Y3Vyc29yOnYyOpHOAGncmA==" - } - } - } - } - } - }, - "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=Y3Vyc29yOnYyOpHOAGncmA== name=metamates org=pytorch": { - "data": { - "organization": { - "team": { - "members": { + "endCursor": "MTY", + "hasNextPage": false + }, + "totalCount": 16 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS", + "detailsUrl": "https://code.intern.facebook.com/cla/" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGYqY=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801320" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_T6g=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-onnx" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754066/jobs/2663109808" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754066/jobs/2663214802" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754066/jobs/2663214856" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwIob0=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801849" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Ubk=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-build" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754064/jobs/2663109676" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ1E=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801852" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Ubw=" 
+ }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-rocm4.5-py3.7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754065/jobs/2663109684" + }, + { + "name": "test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754065/jobs/2663401083" + }, + { + "name": "test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754065/jobs/2663401143" + }, + { + "name": "test (distributed, 1, 1, linux.rocm.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754065/jobs/2663401186" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwMsZY=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801853" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Ub0=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cuda11.3-py3" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754068/jobs/2663109680" + }, + { + "name": "test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754068/jobs/2663995756" + }, + { + "name": "test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754068/jobs/2663995819" + }, + { + "name": "test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754068/jobs/2663995900" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwZbzg=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801855" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Ub8=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "mypy", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663109683" + }, + { + "name": "shellcheck", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663109827" + }, + { + "name": "py2-setup-validate-errormsg", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663109962" + }, + { + "name": "clang-format", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663110044" + }, + { + "name": "cmakelint", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663110132" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663110233" + }, + { + "name": "quick-checks", + "conclusion": 
"SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663110320" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663110461" + }, + { + "name": "flake8-py3", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754069/jobs/2663110575" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGbAQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801856" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_UcA=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-clang7-asan" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754070/jobs/2663109804" + }, + { + "name": "test (default, 3, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754070/jobs/2663233675" + }, + { + "name": "test (default, 1, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754070/jobs/2663233731" + }, + { + "name": "test (default, 2, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754070/jobs/2663233805" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwJC4U=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801857" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_UcE=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754076/jobs/2663109810" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ_w=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801862" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_UcY=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc5.4" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663109777" + }, + { + "name": "test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663201383" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663201458" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663201512" + }, + { + "name": "test (distributed, 1, 1, 
linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663201580" + }, + { + "name": "test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663201672" + }, + { + "name": "test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754078/jobs/2663201839" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwIWu4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801866" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Uco=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1886754079/jobs/2663109681" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATwGZ1k=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/346e0c547953d98eb84d23c1391a95badb9c4a22/checks?check_suite_id=5414801869" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAUK_Uc0=" + } + ], + "pageInfo": { + "hasNextPage": true + } + }, + "status": { + "contexts": [ + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17017798?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17017799?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17017816?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17017800?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + } + ] + }, + "pushedDate": "2022-02-23T10:39:30Z", + "oid": "346e0c547953d98eb84d23c1391a95badb9c4a22" + } + } + ] + }, + "changedFiles": 9, + "files": { "nodes": [ { - "login": "frank-wei" - }, - { - "login": "esqu1" - }, - { - "login": "prabhat00155" - }, - { - "login": "Gamrix" - }, - { - "login": "QuentinDuval" - }, - { - "login": "atalman" - }, - { - "login": "xush6528" - }, - { - "login": "dracifer" - }, - { - "login": "SS-JIA" - }, - { - "login": "helunwencser" - }, - { - "login": "xw285cornell" - }, - { - "login": "hhbyyh" - }, - { - "login": "rohan-varma" - }, - { - "login": "teng-li" - }, - { - "login": "larryliu0820" - }, - { - "login": "lyoka" - }, - { - "login": "cbalioglu" - }, - { - "login": "hl475" - }, - { - "login": "hwangjeff" - }, - { - "login": "Jack-Khuu" - }, - { - "login": "mehtanirav" - }, - { - "login": "nateanl" - }, - { - "login": "fuqianz" - }, - { - 
"login": "boyuantan" - }, - { - "login": "muntaqim" - }, - { - "login": "ymao1993" - }, - { - "login": "fmassa" - }, - { - "login": "esantorella" - }, - { - "login": "HamidShojanazeri" - }, - { - "login": "akshayParashar1995" - }, - { - "login": "jubinchheda" - }, - { - "login": "mehdimashayekhi" - }, - { - "login": "rkindi" - }, - { - "login": "wanchaol" - }, - { - "login": "zephirefaith" - }, - { - "login": "alexbeloi" - }, - { - "login": "kapilsh" - }, - { - "login": "plahera" + "path": "aten/src/ATen/native/GridSampler.cpp" }, { - "login": "SherlockNoMad" + "path": "aten/src/ATen/native/cpu/GridSamplerKernel.cpp" }, { - "login": "pritamdamania87" + "path": "aten/src/ATen/native/cuda/GridSampler.cpp" }, { - "login": "psavla2" + "path": "aten/src/ATen/native/cuda/GridSampler.cu" }, { - "login": "rahxephon89" + "path": "aten/src/ATen/native/cuda/GridSampler.h" }, { - "login": "migeed-z" + "path": "aten/src/ATen/native/native_functions.yaml" }, { - "login": "iseeyuan" + "path": "test/forward_backward_compatibility/check_forward_backward_compatibility.py" }, { - "login": "Matphyler" + "path": "test/test_nn.py" }, { - "login": "protonu" - }, + "path": "tools/autograd/derivatives.yaml" + } + ], + "pageInfo": { + "endCursor": "OQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ { - "login": "terhuhf" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "login": "aruntonic" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "login": "gcatron" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "login": "yingrliu" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "login": "alexanderguzhva" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "login": "angelayi" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "login": "zhaoalex" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "login": "shahofblah" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "login": "vivekmig" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "login": "jspisak" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "login": "akshaypandian" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "login": "tktrungna" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "login": "eellison" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "login": "ziab" + "author": { + "login": "coolteemf" + }, + "state": "COMMENTED" }, { - "login": "NarineK" + "author": { + "login": "albanD" + }, + "state": "COMMENTED" }, { - "login": "andrewconnors" + "author": { + "login": "albanD" + }, + "state": "APPROVED" }, { - "login": "wenwei202" - }, + "author": { + "login": "albanD" + }, + "state": "APPROVED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wMS0yNVQwODoyODoxMC0wODowMLkyMDIyLTAxLTI1VDA3OjU0OjA1LTA4OjAwzjNooqI=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ { - "login": "jg2912" + "bodyText": "Merge failed due to 'NoneType' object is not subscriptable\nRaised by https://github.com/pytorch/pytorch/actions/runs/1887945630", + "createdAt": "2022-02-23T14:55:36Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1048868910 }, { - "login": "robieta" + "bodyText": "Thanks for the update! 
The windows failure is not your fault, you can ignore it!\n\nThank you very much for all of your feedback and sorry for the delay !", + "createdAt": "2022-02-23T16:44:36Z", + "author": { + "login": "coolteemf" + }, + "authorAssociation": "CONTRIBUTOR", + "editor": null, + "databaseId": 1048983572 }, { - "login": "davidxili" + "bodyText": "@coolteemf can you please send either me or @albanD an email? (or I can send you and invite to collab on private repo)", + "createdAt": "2022-02-23T17:49:55Z", + "author": { + "login": "malfet" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1049048119 }, { - "login": "mreso" + "bodyText": "@pytorchbot merge this please", + "createdAt": "2022-02-23T19:23:55Z", + "author": { + "login": "albanD" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1049131992 }, { - "login": "soulitzer" - }, + "bodyText": "Hey @coolteemf.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "createdAt": "2022-02-23T19:26:51Z", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1049134520 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOPoR4Lg==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ { - "login": "prigoyal" + "node": { + "name": "triaged" + } }, { - "login": "PaliC" + "node": { + "name": "open source" + } }, { - "login": "aovladi" + "node": { + "name": "cla signed" + } }, { - "login": "anijain2305" + "node": { + "name": "release notes: nn" + } }, { - "login": "pvtuan10" - }, + "node": { + "name": "topic: performance" + } + } + ] + } + } + } + } + }, + "query_sha=81fd873151c3cded18314e9e53bf54a93ffb0afa9c52fa2cbafb2ceab7df5e45 name=pytorch number=75095 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": false, + "author": { + "login": "mruberry" + }, + "title": "Initial prims, references, and test architecture for them", + "body": "This PR adds an initial set of experimental primitive operations and Python references that reimplement existing PyTorch operations using them. See https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-0/577 for additional context.\r\n\r\nThe following experimental primitives are added:\r\n\r\n- Elementwise unary prims -- abs, acos, acosh, asin, atan, cos, cosh, bessel_i0e, bessel_i1e, cbrt, ceil, digamma, erf, erf_inv, erfc, exp, expm1, floor, igamma, igammac, is_finite, lgamma, log, log1p, neg, reciprocal, round, sign, sinh, sqrt, square, tan. 
\r\n- Elementwise binary prims -- add, atan2, bitwise_and, bitwise_not, bitwise_or, bitwise_xor, div, eq, ge, gt, le, lt, max, min, mul, ne, nextafter, pow, rsqrt, shift_left, shift_right_arithmetic\r\n- View prims -- brodcast_in_dim, collapse_view, split_dim, squeeze\r\n- Shape prims -- collapse, concatenate, reshape\r\n- Conditional prims -- select\r\n- Data conversion & movement prims -- convert_element_type, device_put\r\n- Inplace prims -- copy_to, resize\r\n\r\nThese primitives do not add any new functionality to PyTorch, but are intended to be the semantic building blocks for reference operators. We have tried to make them consistent with the operations in [jax.lax](https://jax.readthedocs.io/en/latest/jax.lax.html) where possible (because PyTorch prefers being consistent with other frameworks), although there are key differences between these prims and operations in jax.lax. Most notably is that these prims model view semantics and inplace operations.\r\n\r\nIn addition to these primitives the following elementwise binary Python references are added:\r\n\r\n- Elementwise binary Python references -- add, atan2, bitwise_and, bitwise_left_shift, bitwise_or, bitwise_right_shift, bitwise_xor, eq, float_power, ge, gt, le, lt, maximum, minimum, mul, ne, nextafter, pow, sub, true_divide\r\n- Conditional Python references - where\r\n- Data conversion & movement references - copy_to\r\n\r\nA Python reference implements the same behavior as its corresponding PyTorch operator (excepting slight numerical differences, bug fixes, and in some cases additional features). \r\n\r\nThe start of an OpInfo-based test architecture for these references is also included in this PR. A new list, `python_ref_db`, is added to `common_methods_invocations.py`. This list introduces the new `ElementwiseBinaryPythonRefInfo`, which inherits input arguments from the original operators' OpInfo, allows them to be overridden, and then constructs the OpInfo for the Python reference using the (potentially modified) arguments. OpInfo-based tests can opt-into testing references by including this new list in the Sequence passed to the `@ops` decorator. 
\r\n\r\ncc @ngimel @csarofeen @kevinstephano @Lezcano ", + "headRefName": "prims_and_references", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ { - "login": "huangyi1979" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "a790467c650be92775103cde5e866c90b56f5376" + } }, { - "login": "osalpekar" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "bd6fcf50692e208ebecdc2eaa517a2bfcdcd35cf" + } }, { - "login": "xiaohui-zhang" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "4a119c8f21529fe1375e7e8789b91f41a3df80c5" + } }, { - "login": "jerry39213gh" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "ea6750dc34d66be759fdfe84b09fb0e23ee59c79" + } }, { - "login": "jarodhou" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "2eef8a55fe0227e1921b51bf1f56f9d0a29b49ac" + } }, { - "login": "hlu1" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "b886ed6c20dd1785fd31ed6fa6a8c5b6d0d0b16c" + } }, { - "login": "huiguoo" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "9ad9b63d09aa4f7a8549bcf1d88ea4ff0674299c" + } }, { - "login": "H-Huang" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "63fdd580118477416ae160e0670ae722ea248090" + } }, { - "login": "vtsyvina" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "0ccf7dc292af1d40d0a094eb2b2fb0c7ab4ccc70" + } }, { - "login": "qchip" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "e8a8a4d1fbe35f20eb88e1a43cf5a653883638e5" + } }, { - "login": "Nitrokitty" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "186634dfdd25645c05b58a212f9e8d77c4125fc0" + } }, { - "login": "satgera" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "f5b4741312b5c42a79f6c8a1d3930b79db38ed8f" + } }, { - "login": "ngimel" + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" + }, + "oid": "23d50391bb0fd12111fd3171591c4235ffb2fc1a" + } }, { - "login": "dongreenberg" + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" + }, + "oid": "bac9d45422d58f513b60b4b854441cfdc253d4c5" + } }, { - "login": "sijiac" + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" + }, + "oid": "13240ae0b4a0332c3167b65ac026a3172da90cb7" + } }, { - "login": "markkm" + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. 
Yang" + }, + "oid": "1ee34468cb1db3dc6cbae204669f4fec20e2a466" + } }, { - "login": "EscapeZero" + "commit": { + "author": { + "user": { + "login": "ezyang" + }, + "email": "ezyang@fb.com", + "name": "Edward Z. Yang" + }, + "oid": "561d132bc686d00e8911f7feb3da5901b2bdc574" + } }, { - "login": "bdhirsh" + "commit": { + "author": { + "user": { + "login": "ngimel" + }, + "email": "ngimel@fb.com", + "name": "Natalia Gimelshein" + }, + "oid": "ac42bedc84b7c96256376ad09917263bb020b2c3" + } }, { - "login": "cccclai" + "commit": { + "author": { + "user": { + "login": "ngimel" + }, + "email": "ngimel@fb.com", + "name": "Natalia Gimelshein" + }, + "oid": "7f7d5ba40a0b5e10526d90b018b30b54673d12d8" + } }, { - "login": "carolineechen" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "37a6b4a8b1adb712d5777c7c3479866c27fb3c4e" + } }, { - "login": "tugsbayasgalan" + "commit": { + "author": { + "user": { + "login": "ngimel" + }, + "email": "ngimel@fb.com", + "name": "Natalia Gimelshein" + }, + "oid": "65b613868c44e519c1777af79b9fd3498c5a7e58" + } }, { - "login": "agunapal" + "commit": { + "author": { + "user": { + "login": "ngimel" + }, + "email": "ngimel@fb.com", + "name": "Natalia Gimelshein" + }, + "oid": "442c405e9da0d66744ef03e379224c41eedf5b57" + } }, { - "login": "frankseide" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "031ac49ae9c192989385986b6707fa781e3229e0" + } }, { - "login": "YazhiGao" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "9a6c3b00039c0c985c1c9cb59490012d1c0b38ba" + } }, { - "login": "pavithranrao" + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "d5c30e408af1889b90012d2e09f6ec3cda333bcb" + } }, { - "login": "VirgileHlav" - }, + "commit": { + "author": { + "user": null, + "email": "mruberry@devfair044.h1.fair", + "name": "Mike Ruberry" + }, + "oid": "db355d55655bb252a699cd532441bb98e52b98d5" + } + } + ], + "pageInfo": { + "endCursor": "MjY", + "hasNextPage": false + }, + "totalCount": 26 + }, + "commits": { + "nodes": [ { - "login": "mrshenli" + "commit": { + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS", + "detailsUrl": "https://code.intern.facebook.com/cla/" + }, + { + "name": "Meta Internal-Only Changes Check", + "conclusion": "SUCCESS", + "detailsUrl": "https://opensource.facebook.com/" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAW6ux14=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454954" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC2o=" + }, + { + "node": { + "app": { + "name": "Netlify", + "databaseId": 13473 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454956" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC2w=" + }, + { + "node": { + "app": { + "name": "Azure Pipelines", + "databaseId": 9426 + }, + "workflowRun": null, + 
"checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454965" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC3U=" + }, + { + "node": { + "app": { + "name": "Dependabot", + "databaseId": 29110 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454970" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC3o=" + }, + { + "node": { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454974" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC34=" + }, + { + "node": { + "app": { + "name": "PyTorch Bot", + "databaseId": 40112 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241454977" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFC4E=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622865/jobs/3270915028" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAW6e-c8=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241455322" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFDNo=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915027" + }, + { + "name": "lintrunner", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915071" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915141" + }, + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915194" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915229" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915283" + }, + { + "name": "workflow-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622869/jobs/3270915321" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAW6e-zM=", + "hasNextPage": 
false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241455334" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFDOY=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pull" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927344" + }, + { + "name": "linux-bionic-rocm5.0-py3.7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927442" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927507" + }, + { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927567" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927674" + }, + { + "name": "win-vs2019-cuda11.3-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927727" + }, + { + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927802" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927853" + }, + { + "name": "linux-xenial-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927948" + }, + { + "name": "linux-xenial-py3-clang5-mobile-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270927996" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928061" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928116" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928198" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928256" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928291" + }, + { + "name": "win-vs2019-cpu-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928317" + }, + { + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928338" + }, + { + "name": "linux-xenial-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928367" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928410" + }, + { + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270928445" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991071" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991125" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991162" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991195" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991233" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991261" + }, + { + "name": "linux-docs / build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991305" + }, + { + "name": "linux-docs / build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270991349" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270996024" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270996068" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270996092" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270996505" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270998987" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3270999027" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 
1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271006886" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271006941" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271018097" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271018135" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271018162" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271021143" + }, + { + "name": "linux-bionic-rocm5.0-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271034041" + }, + { + "name": "linux-bionic-rocm5.0-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271034072" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271048218" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271049553" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271049587" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271049616" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271068293" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271068336" + }, + { + "name": "win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271149276" + }, + { + "name": "win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2217622878/jobs/3271149321" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAW6jVK8=", + "hasNextPage": true + } + }, + "conclusion": "SUCCESS", + "url": 
"https://github.com/pytorch/pytorch/commit/db355d55655bb252a699cd532441bb98e52b98d5/checks?check_suite_id=6241455360" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAXQFDQA=" + } + ], + "pageInfo": { + "hasNextPage": false + } + }, + "status": null, + "pushedDate": "2022-04-25T02:30:31Z", + "oid": "db355d55655bb252a699cd532441bb98e52b98d5" + } } - ], - "pageInfo": { - "hasNextPage": true, - "endCursor": "Y3Vyc29yOnYyOpHOAQNk0w==" - } - } - } - } - } - }, - "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=Y3Vyc29yOnYyOpHOAQNk0w== name=metamates org=pytorch": { - "data": { - "organization": { - "team": { - "members": { + ] + }, + "changedFiles": 5, + "files": { "nodes": [ { - "login": "lena-kashtelyan" - }, - { - "login": "brad-mengchi" - }, - { - "login": "kimishpatel" - }, - { - "login": "aaronenyeshi" - }, - { - "login": "shajrawi" - }, - { - "login": "samdow" - }, - { - "login": "dzhulgakov" - }, - { - "login": "great-way" - }, - { - "login": "ashkan-software" - }, - { - "login": "garroud" - }, - { - "login": "jbitton" - }, - { - "login": "jdsgomes" - }, - { - "login": "zhangxy988" - }, - { - "login": "samlurye" - }, - { - "login": "EdwardTyantov" - }, - { - "login": "anjali411" - }, - { - "login": "kryanchun" - }, - { - "login": "842974287" - }, - { - "login": "JacobSzwejbka" - }, - { - "login": "macandro96" - }, - { - "login": "nishantpdce" - }, - { - "login": "srinivas212" - }, - { - "login": "cherie11" - }, - { - "login": "shreyanb98" - }, - { - "login": "kavoor" - }, - { - "login": "dzdang" - }, - { - "login": "yushangdi" - }, - { - "login": "naveedgol" - }, - { - "login": "Nayef211" - }, - { - "login": "zrphercule" - }, - { - "login": "HengruiX" - }, - { - "login": "langong347" - }, - { - "login": "soapisnotfat" - }, - { - "login": "ebsmothers" - }, - { - "login": "swang392" - }, - { - "login": "anshuljain1" - }, - { - "login": "b-koopman" - }, - { - "login": "salilsdesai" - }, - { - "login": "vmoens" - }, - { - "login": "LinjianMa" - }, - { - "login": "printfoo" - }, - { - "login": "xinyang0" - }, - { - "login": "ramvenkat98" - }, - { - "login": "fbbradheintz" - }, - { - "login": "davidchencsl" - }, - { - "login": "kauterry" - }, - { - "login": "VenkatSubramaniam" - }, - { - "login": "yxia11" - }, - { - "login": "anirbanraywork" - }, - { - "login": "houseroad" - }, - { - "login": "erichan1" - }, - { - "login": "hsrussell" - }, - { - "login": "ilia-cher" - }, - { - "login": "ajitmaths" - }, - { - "login": "awgu" - }, - { - "login": "wz337" - }, - { - "login": "qxy11" - }, - { - "login": "janeyx99" - }, - { - "login": "msedwar" - }, - { - "login": "dustinh1999" - }, - { - "login": "glaringlee" - }, - { - "login": "anj-s" - }, - { - "login": "liuchen9494" - }, - { - "login": "drisspg" - }, - { - "login": "kmh4321" - }, - { - "login": "RdoubleA" - }, - { - "login": "jramseyer" - }, - { - "login": "goldenxuett" - }, - { - "login": "zengk95" - }, - { - "login": "gtarjun" - }, - { - "login": "mikaylagawarecki" - }, - { - "login": "xianxl" - }, - { - "login": "mingzhe09088" - }, - { - "login": "Vucibatina" - }, - { - "login": "aazzolini" - }, - { - "login": "nataliakliushkina" - }, - { - "login": "mruberry" - }, - { - "login": "HDCharles" - }, - { - "login": "mcr229" - }, - { - "login": "manuelcandales" - }, - { - "login": "guangy10" - }, - { - "login": "mengwa41" - }, - { - "login": "YulunW" - }, - { - "login": "hx89" + "path": "test/test_ops.py" }, { - "login": "hanhsienhuang" + "path": "torch/_prims/__init__.py" }, { - "login": "clee2000" + "path": 
"torch/_prims/utils.py" }, { - "login": "lhuang04" + "path": "torch/_refs/__init__.py" }, { - "login": "sidneyfletcher" - }, + "path": "torch/testing/_internal/common_methods_invocations.py" + } + ], + "pageInfo": { + "endCursor": "NQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ { - "login": "gottbrath" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "login": "lessw2020" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "mmh683" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "dwarakrajagopal" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "YifanShenSZ" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "lazysjb" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "zhaojuanmao" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "johncalab" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "dhthompson" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "login": "superwizard2019" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "login": "fbhuba" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "login": "shunting314" - } - ], - "pageInfo": { - "hasNextPage": true, - "endCursor": "Y3Vyc29yOnYyOpHOAyJyuA==" - } - } - } - } - } - }, - "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=Y3Vyc29yOnYyOpHOAyJyuA== name=metamates org=pytorch": { - "data": { - "organization": { - "team": { - "members": { - "nodes": [ + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" + }, { - "login": "edward-io" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "sean-ngo" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "bzinodev" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "login": "skim0514" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "login": "xcheng16" + "author": { + "login": "ngimel" + }, + "state": "COMMENTED" }, { - "login": "adamomainz" + "author": { + "login": "ngimel" + }, + "state": "COMMENTED" }, { - "login": "sluks" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "login": "poojahp" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "ansley" + "author": { + "login": "zou3519" + }, + "state": "COMMENTED" }, { - "login": "mvsampath" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "cheetah2216" + "author": { + "login": "peterbell10" + }, + "state": "COMMENTED" }, { - "login": "pinaki-mukerji" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "hongxiayang" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "kyulee-com" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "sstsai-adl" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "login": "dahsh" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "login": "ohgnoes" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "szewaiyuen7" + "author": { + "login": "ngimel" + }, + "state": "COMMENTED" }, { - "login": "byterover" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "asl3" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" 
}, { - "login": "ejguan" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "nimaelyasi" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "nikithamalgifb" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "rohan-ahluwalia" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "qxu-fb" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "sshawnwu" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "andrewyounkins" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "njuvekar" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "iramazanli" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "jnkwok1" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "kurman" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "jbschlosser" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "ccongge" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "haichuan-fb" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "janghyuncho" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "wwang84" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "JustinPinero" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "gcramer23" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "yuguo68" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "c-odrin" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "chowarfb" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "priyaramani" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "yidawang-oss" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "asalioufb" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "four4fish" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "kkosik20" + "author": { + "login": "ngimel" + }, + "state": "COMMENTED" }, { - "login": "pmabbo13" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "login": "KZFB" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "dborkovic" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "sisilmehta2000" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "henryliu-bluehills" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "madhu-fb" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "muchulee8" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "anirbanr-fb-r2p" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "kirklandsign" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "o-hanna" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "izaitsevfb" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "login": "weiwangmeta" - } - ], - "pageInfo": { - "hasNextPage": false, - "endCursor": "Y3Vyc29yOnYyOpHOBoQSVA==" - } - } - } - } - } - }, - 
"query_sha=0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98 cursor=MTAw name=pytorch number=76118 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "files": { - "nodes": [ - { - "path": "docs/source/quantization.rst" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "docs/source/scripts/build_quantization_configs.py" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/allowlist_for_publicAPI.json" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "path": "test/cpp/jit/source_range_test.cpp" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "path": "test/cpp/jit/test_backend.cpp" + "author": { + "login": "lezcano" + }, + "state": "COMMENTED" }, { - "path": "test/cpp/jit/test_flatbuffer.cpp" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "path": "test/cpp/jit/test_misc.cpp" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "path": "test/cpp/jit/test_utils.h" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_float_v2.ptl.ff" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_inplace_float_v2.ptl.ff" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_inplace_int_v2.ptl.ff" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_int_v2.ptl.ff" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_reciprocal_float_v2.ptl.ff" + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" }, { - "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_reciprocal_int_v2.ptl.ff" + "author": { + "login": "ngimel" + }, + "state": "APPROVED" }, { - "path": "test/cpp/jit/upgrader_models/test_versioned_div_scalar_scalar_v2.ptl.ff" + "author": { + "login": "ezyang" + }, + "state": "COMMENTED" }, { - "path": "test/cpp/jit/upgrader_models/test_versioned_div_tensor_inplace_v2.ptl.ff" - }, + "author": { + "login": "mruberry" + }, + "state": "COMMENTED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNC0wNlQxMjo1NjoyNC0wNzowMLkyMDIyLTA0LTA2VDA4OjQwOjM4LTA3OjAwzjenO6Y=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ { - "path": "test/cpp/jit/upgrader_models/test_versioned_div_tensor_out_v2.ptl.ff" + "bodyText": "Ref implementations by themselves can handle any shapes (and broadcast ops by themselves don't bake in any shapes). 
The question is can we decide if a particular trace is applicable for a different input, but that depends on the tracing technology and what we are caching on, so out of scope for initial PR.", + "createdAt": "2022-04-21T19:00:28Z", + "author": { + "login": "ngimel" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1105643418 }, { - "path": "test/cpp/jit/upgrader_models/test_versioned_div_tensor_v2.ptl.ff" + "bodyText": "@pytorchbot merge this please", + "createdAt": "2022-04-25T04:42:29Z", + "author": { + "login": "mruberry" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1108072887 }, { - "path": "test/cpp/profiler/record_function.cpp" + "bodyText": "Merge failed due to 'mruberry'\nRaised by https://github.com/pytorch/pytorch/actions/runs/2218044244", + "createdAt": "2022-04-25T04:43:54Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1108073536 }, { - "path": "test/distributed/_shard/sharded_tensor/test_sharded_tensor.py" + "bodyText": "@mruberry has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", + "createdAt": "2022-04-25T04:51:11Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1108075965 }, { - "path": "test/distributed/_shard/test_replicated_tensor.py" + "bodyText": "Hey @mruberry.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' 
and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "createdAt": "2022-04-25T09:57:56Z", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1108351107 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQebHmg==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ + { + "node": { + "name": "cla signed" + } }, { - "path": "test/distributed/fsdp/test_fsdp_comm.py" + "node": { + "name": "topic: not user facing" + } }, { - "path": "test/distributed/fsdp/test_fsdp_optim_state.py" - }, + "node": { + "name": "module: primTorch" + } + } + ] + } + } + } + } + }, + "query_sha=81fd873151c3cded18314e9e53bf54a93ffb0afa9c52fa2cbafb2ceab7df5e45 name=pytorch number=77700 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": false, + "author": { + "login": "kit1980" + }, + "title": "Move pull linux-docs job to Ubuntu 20.04", + "body": "", + "headRefName": "sdym/pull-xenial-focal-linux-docs", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ { - "path": "test/distributed/optim/test_zero_redundancy_optimizer.py" - }, + "commit": { + "author": { + "user": { + "login": "kit1980" + }, + "email": "sdym@fb.com", + "name": "Sergii Dymchenko" + }, + "oid": "81261599614423baa17df72300b8e109677b6799" + } + } + ], + "pageInfo": { + "endCursor": "MQ", + "hasNextPage": false + }, + "totalCount": 1 + }, + "commits": { + "nodes": [ { - "path": "test/jit/test_export_modes.py" - }, + "commit": { + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS", + "detailsUrl": "https://code.facebook.com/cla/" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNmNqE=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147714" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuMI=" + }, + { + "node": { + "app": { + "name": "Netlify", + "databaseId": 13473 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147726" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuM4=" + }, + { + "node": { + "app": { + "name": "Azure Pipelines", + "databaseId": 9426 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147733" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuNU=" + }, + { + "node": { + "app": { + "name": "Dependabot", + "databaseId": 29110 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": 
"https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147746" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuOI=" + }, + { + "node": { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147762" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuPI=" + }, + { + "node": { + "app": { + "name": "PyTorch Bot", + "databaseId": 40112 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567147780" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuQQ=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "lintrunner", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528127876" + }, + { + "name": "workflow-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528128023" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528128196" + }, + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528128519" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528128575" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528128663" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867841/jobs/3528128857" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNdYVY=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567148336" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuzA=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867843/jobs/3528127882" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNdXEg=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567148344" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduuzg=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "docker-builds" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "docker-build (pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7)", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528127883" + }, + { + "name": "docker-build (pytorch-linux-bionic-cuda11.3-cudnn8-py3-clang9)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528127945" + }, + { + "name": "docker-build (pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128001" + }, + { + "name": "docker-build (pytorch-linux-bionic-py3.7-clang9)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128067" + }, + { + "name": "docker-build (pytorch-linux-bionic-rocm5.0-py3.7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128124" + }, + { + "name": "docker-build (pytorch-linux-bionic-rocm5.1-py3.7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128191" + }, + { + "name": "docker-build (pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128259" + }, + { + "name": "docker-build (pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128321" + }, + { + "name": "docker-build (pytorch-linux-xenial-py3-clang5-android-ndk-r19c)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128365" + }, + { + "name": "docker-build (pytorch-linux-xenial-py3-clang5-asan)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128446" + }, + { + "name": "docker-build (pytorch-linux-xenial-py3-clang7-asan)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128507" + }, + { + "name": "docker-build (pytorch-linux-xenial-py3-clang7-onnx)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128563" + }, + { + "name": "docker-build (pytorch-linux-xenial-py3.7-gcc5.4)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128639" + }, + { + "name": "docker-build (pytorch-linux-xenial-py3.7-gcc7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128687" + }, + { + "name": "docker-build (pytorch-linux-focal-py3.7-gcc7)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867844/jobs/3528128741" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNdYLI=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567148352" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduu0A=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pull" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528150762" + }, + { + "name": "linux-focal-py3.7-gcc7 / build", + "conclusion": 
"SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528150903" + }, + { + "name": "linux-xenial-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528151086" + }, + { + "name": "linux-xenial-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528151258" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528151511" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528151776" + }, + { + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528151896" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152014" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152139" + }, + { + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152216" + }, + { + "name": "win-vs2019-cuda11.3-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152378" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152516" + }, + { + "name": "linux-xenial-py3-clang5-mobile-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152599" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152723" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152802" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152913" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528152969" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528153005" + }, + { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528153062" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / build", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528153125" + }, + { + "name": "win-vs2019-cpu-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528153207" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528242483" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528242528" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528245875" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528245914" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528245964" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528246008" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528248520" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528255086" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528255128" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528274064" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528274097" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528274133" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 4, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528274173" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 5, 5, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528274209" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528277014" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528308958" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528309747" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528309810" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528309837" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528309864" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528309895" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528309925" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528310044" + }, + { + "name": "linux-bionic-rocm5.1-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528310101" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528384337" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528384379" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528384408" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528384441" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2348867849/jobs/3528384471" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAYNi1Nc=", + "hasNextPage": true + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/81261599614423baa17df72300b8e109677b6799/checks?check_suite_id=6567148369" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAYduu1E=" + } + ], + "pageInfo": { + "hasNextPage": false + } + }, + "status": null, + "pushedDate": "2022-05-19T00:02:11Z", + "oid": "81261599614423baa17df72300b8e109677b6799" + } + } + ] + }, + "changedFiles": 3, + "files": { + "nodes": [ { - "path": "test/jit/test_if_hoisting.py" + "path": ".circleci/docker/build.sh" }, { - "path": "test/jit/test_tracer.py" + 
"path": ".circleci/docker/common/install_katex.sh" }, { - "path": "test/jit/test_upgraders.py" - }, + "path": ".github/workflows/pull.yml" + } + ], + "pageInfo": { + "endCursor": "Mw", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ { - "path": "test/mobile/test_lite_script_type.py" + "author": { + "login": "suo" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/expect/TestOperators.test_layer_norm_aten.expect" + "author": { + "login": "kit1980" + }, + "state": "COMMENTED" }, { - "path": "test/onnx/test_operators.py" - }, + "author": { + "login": "janeyx99" + }, + "state": "APPROVED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNS0xOFQxMjo0MTowNS0wNzowMLkyMDIyLTA1LTE4VDEyOjQxOjA0LTA3OjAwzjpD7es=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ { - "path": "test/onnx/test_pytorch_onnx_onnxruntime.py" + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/77700\n\ud83d\udcc4 \u00a0Preview Python docs built from this PR\n\ud83d\udcc4 \u00a0Preview C++ docs built from this PR\n\u2753Need help or want to give feedback on the CI? Visit our office hours\n\n\u2705 No Failures (0 Pending)\nAs of commit 8126159 (more details on the Dr. CI page):\nExpand to see more\n\n\ud83d\udc9a \ud83d\udc9a Looks good so far! There are no failures yet. \ud83d\udc9a \ud83d\udc9a\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "createdAt": "2022-05-17T23:01:48Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 1129400934 }, { - "path": "test/quantization/ao_migration/test_quantization_fx.py" + "bodyText": "@pytorchbot merge", + "createdAt": "2022-05-19T15:39:05Z", + "author": { + "login": "kit1980" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1131884232 }, { - "path": "test/quantization/core/test_quantized_op.py" + "bodyText": "Merge failed due to Refusing to merge as mandatory check(s) linux-docs / build-docs (cpp), linux-docs / build-docs (python) are pending/not yet run for rule OSS CI\nRaised by https://github.com/pytorch/pytorch/actions/runs/2353067846", + "createdAt": "2022-05-19T15:40:59Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1131886153 }, { - "path": "test/quantization/core/test_quantized_tensor.py" + "bodyText": "@pytorchbot merge -f", + "createdAt": "2022-05-19T16:41:29Z", + "author": { + "login": "kit1980" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1131945610 }, { - "path": "test/quantization/fx/test_numeric_suite_fx.py" - }, + "bodyText": "Hey @kit1980.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' 
and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "createdAt": "2022-05-19T16:43:37Z", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1131947473 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQ1FKZg==", + "hasPreviousPage": false + } + }, + "labels": { + "edges": [ { - "path": "test/quantization/fx/test_quantize_fx.py" + "node": { + "name": "Merged" + } }, { - "path": "test/test_autograd.py" - }, + "node": { + "name": "cla signed" + } + } + ] + } + } + } + } + }, + "query_sha=81fd873151c3cded18314e9e53bf54a93ffb0afa9c52fa2cbafb2ceab7df5e45 name=pytorch number=68111 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": true, + "author": { + "login": "chunyuan-w" + }, + "title": "Add JIT graph fuser for oneDNN Graph API (Preview4)", + "body": "## Description\r\nPreview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).\r\n\r\nOn the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:\r\n\r\n- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used\r\n- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.\r\n\r\n### User API:\r\nThe optimization pass is disabled by default. Users could enable it by:\r\n```\r\ntorch.jit.enable_onednn_fusion(True)\r\n```\r\n\r\n### Performance:\r\n[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:\r\n- SkyLake 8180 (1 socket of 28 cores):\r\n\r\n ![image](https://user-images.githubusercontent.com/65992142/151162305-05e44425-a24e-4d5e-94e1-743b40b87a8c.png)\r\n\r\n- SkyLake 8180 (single thread):\r\n\r\n ![image](https://user-images.githubusercontent.com/65992142/151162528-69f90b79-d08d-46b8-8775-d80a6ccbce8a.png)\r\n \\* By mapping hardswish to oneDNN Graph, it\u2019s 8% faster than PyTorch JIT (NNC + OFI)\r\n \\** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops\r\n\r\n\r\n### Directory structure of the integration code\r\nFuser-related code are placed under:\r\n```\r\ntorch/csrc/jit/codegen/onednn/\r\n```\r\n\r\nOptimization pass registration is done in:\r\n```\r\ntorch/csrc/jit/passes/onednn_graph_fuser.h\r\n```\r\n\r\nCMake for the integration code is:\r\n```\r\ncaffe2/CMakeLists.txt\r\n```\r\n\r\n## Limitations\r\n\r\n- In this PR, we have only supported the optimization on Linux platform. 
The support on Windows and MacOS will be enabled as the next step.\r\n- We have only optimized the inference use case.", + "headRefName": "chunyuan/llga_preview2", + "headRepository": { + "nameWithOwner": "chunyuan-w/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ { - "path": "test/test_binary_ufuncs.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "0096fcc49f277fd8e006fcb42e0cb28a1422ec98" + } }, { - "path": "test/test_expanded_weights.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "7bcc4de26a5472f1d252735dd425b46794b0844f" + } }, { - "path": "test/test_functionalization.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "3a2a588bfe6bbf9bf74d88d441cd22affda207da" + } }, { - "path": "test/test_fx_experimental.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "ca7df12fbfaa3ddbabeca39b76300d17f4a33f2f" + } }, { - "path": "test/test_jit.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "81d44f35b8bc043c38837d0694e5bc072203b832" + } }, { - "path": "test/test_jit_cuda_fuser.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "14fd5d1bfc2c58a71379f778871e3fca0a8e79b2" + } }, { - "path": "test/test_linalg.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "954dc23663125897f4b199eb2a8607dc5fca3274" + } }, { - "path": "test/test_nestedtensor.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "9f77a0b476accc678b6f0569e4ff33fa6bbe97fc" + } }, { - "path": "test/test_nn.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "fbf3b23bc1288697e1aec539a7c4ee3dc0bcb84c" + } }, { - "path": "test/test_ops.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "f8b8e78f786586c3cdf3966fd83ffa124d3eda70" + } }, { - "path": "test/test_ops_gradients.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "6fffa2f7453ee7e0f8d8e2f73ea8a65230539589" + } }, { - "path": "test/test_ops_jit.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "849385404e6f3cd1cf7cef19f931ecf4fa28afdb" + } }, { - "path": "test/test_optim.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "adbae7b77f8c0dbc59fccf15207d97ba86cfade2" + } }, { - "path": "test/test_overrides.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": 
"chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "6dcf2a4981aff24fa16fc7461ae4ec29690f956f" + } }, { - "path": "test/test_profiler.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "54f3e05ad524cffd0911ee93be3c50f589b51f58" + } }, { - "path": "test/test_public_bindings.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "edbfc640ea79a0af85757d9e73796dcc90231519" + } }, { - "path": "test/test_pytree.py" + "commit": { + "author": { + "user": { + "login": "chunyuan-w" + }, + "email": "chunyuan.wu@intel.com", + "name": "chunyuan" + }, + "oid": "67654db7cba562809d1b4a44cdda58af5cc9daaf" + } }, { - "path": "test/test_reductions.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "9c9d99b930b11af9ff03f52d45bf49c652df758d" + } }, { - "path": "test/test_sort_and_select.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "ffb25119cd9ce815cc4d9d14a2317fcbbfa9ea86" + } }, { - "path": "test/test_sparse.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "ab9eee84512ca1bdfbc81e25c6eb67b29d0f302a" + } }, { - "path": "test/test_sparse_csr.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "62a4642cf3330524990a69ac29e002c97812320a" + } }, { - "path": "test/test_spectral_ops.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "ca9b1223be4af2c8b4929303d498eafd71793128" + } }, { - "path": "test/test_tensor_creation_ops.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "6f4a23d24514a02954d2ec792830085f612223c9" + } }, { - "path": "test/test_tensorboard.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "b2a9a9c0926b02d0b2e87722ed61450f224a61d0" + } }, { - "path": "test/test_testing.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "e88b492be733f24b6aa395829c76add67d0901e7" + } }, { - "path": "test/test_torch.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "c44336d7a914952bfb78e012e08d9a6d6dde5937" + } }, { - "path": "test/test_unary_ufuncs.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "5157930f7b3921d41a586260582b574c915f6ca1" + } }, { - "path": "third_party/BUCK.github" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "04cb8353813f6bbd0d913a994923cc7e1e291406" + } }, { - "path": "third_party/fbgemm" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + 
}, + "oid": "62991eaad0e638bb0bced327e03f932f66f68732" + } }, { - "path": "tools/autograd/derivatives.yaml" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "7496bf1588050191595d833d23b8972b2f22655e" + } }, { - "path": "tools/autograd/gen_inplace_or_view_type.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "d9d35f23cca0cd29c78a845731b24826152dcf1c" + } }, { - "path": "tools/autograd/load_derivatives.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "f74ec134f18a65a7c72455bdf44f72e3ebb27105" + } }, { - "path": "tools/build_variables.bzl" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "eb32cc65a975361160948bfc3d6a577991ea262e" + } }, { - "path": "tools/codegen/api/autograd.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "c7665f8d695b680c54db0bad2b7b7df46d886b50" + } }, { - "path": "tools/codegen/api/cpp.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "e6321ad8f59ea01130568c202d186448bb9cb9d0" + } }, { - "path": "tools/codegen/api/dispatcher.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "a72cd0d02693f45e5354a70654581ad514581ec7" + } }, { - "path": "tools/codegen/api/functionalization.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "b3cd3028b4ed31805e82f7eaf02217ab74ca59b9" + } }, { - "path": "tools/codegen/api/lazy.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "49a592d9788d08e6cd0593882f867e129057c1cc" + } }, { - "path": "tools/codegen/api/meta.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "0575766b2144b13f6a38227c4e2b8d22ec8db80f" + } }, { - "path": "tools/codegen/api/native.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "b5c9b10ff87d622350e8ca64fae3a476eb70d5aa" + } }, { - "path": "tools/codegen/api/python.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "66bc652a30ccc329adb929870a4ac726bb98b38c" + } }, { - "path": "tools/codegen/api/structured.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "72b9ca9c8e2dac98cbb7199b3dfac7c7305b80c5" + } }, { - "path": "tools/codegen/api/translate.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "a7892ed7373207d96406c8b5734a089643c5cdbd" + } }, { - "path": "tools/codegen/api/types.py" + "commit": { + "author": { + "user": { + "login": 
"sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "d54cb084e1daad8a08c3f8de0ad3f7afb5b05ac1" + } }, { - "path": "tools/codegen/api/ufunc.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "aef71d692a8a159e0ca56be363e2cc1225ce7647" + } }, { - "path": "tools/codegen/api/unboxing.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "bf618e205ec31cff962dcc8ab478e0a699a9572d" + } }, { - "path": "tools/codegen/code_template.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "e4a331f1088448f7d7d86256ce71e0e71da006b0" + } }, { - "path": "tools/codegen/context.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "0b743523d1430fec759d5fefbb687f17c89335a5" + } }, { - "path": "tools/codegen/decompositions/gen_jit_decompositions.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "e80a351a62d98b810ec8985c4b25257af1d6c5bb" + } }, { - "path": "tools/codegen/dest/__init__.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "c189eca154b6691919d0e21489d1c322c7435c0b" + } }, { - "path": "tools/codegen/dest/lazy_ir.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "e080a067c75d7b888a8a362682a2d5ba70e0c3a8" + } }, { - "path": "tools/codegen/dest/lazy_ts_lowering.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "028561fbf8f3ed90e074e6e0e3a4ca4dd7ffa2a8" + } }, { - "path": "tools/codegen/dest/native_functions.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "d550cf14037badd4caa2f52202e2f20bc4db8432" + } }, { - "path": "tools/codegen/dest/register_dispatch_key.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "574159ebadd1dec24daaf883879ffeca8d9e71b7" + } }, { - "path": "tools/codegen/dest/ufunc.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "9eb3ee98ea756067ed1c8f52f309f6d3e211a904" + } }, { - "path": "tools/codegen/gen.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "29929f48be03dcdd1bbfade572de7feafa825547" + } }, { - "path": "tools/codegen/gen_backend_stubs.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "8a7358ca8da547b40ea1a99ddc57ebed19959684" + } }, { - "path": "tools/codegen/gen_functionalization_type.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": 
"6606637d2c5525b43e294a8b366a85052e1be0c6" + } }, { - "path": "tools/codegen/gen_lazy_tensor.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "5ecfd1f28b87045deb8bc8ffe33b3d8b906f3264" + } }, { - "path": "tools/codegen/local.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchit.jain" + }, + "oid": "be2d4345c65442c4cfbe8afdfb2ae0893945da42" + } }, { - "path": "tools/codegen/model.py" + "commit": { + "author": { + "user": { + "login": "sanchitintel" + }, + "email": "sanchit.jain@intel.com", + "name": "sanchitintel" + }, + "oid": "b5b89d3644a43e2dbda841cafb71b32edbe07c8a" + } }, { - "path": "tools/codegen/operator_versions/gen_mobile_upgraders.py" + "commit": { + "author": { + "user": { + "login": "malfet" + }, + "email": "nikita.shulga@gmail.com", + "name": "Nikita Shulga" + }, + "oid": "73881411e2bfb3aaa2e89926a82390b4c587ad75" + } } ], "pageInfo": { - "endCursor": "MjAw", - "hasNextPage": true - } - } - } - } - } - }, - "query_sha=0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98 cursor=MjAw name=pytorch number=76118 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { + "endCursor": "NjI", + "hasNextPage": false + }, + "totalCount": 62 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ + { + "name": "Facebook CLA Check", + "conclusion": "SUCCESS", + "detailsUrl": "https://code.intern.facebook.com/cla/" + }, + { + "name": "Meta Internal-Only Changes Check", + "conclusion": "SUCCESS", + "detailsUrl": "https://opensource.facebook.com/" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_NXnc=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/73881411e2bfb3aaa2e89926a82390b4c587ad75/checks?check_suite_id=5743625010" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVZYwzI=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "clang-format", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903895825" + }, + { + "name": "py2-setup-validate-errormsg", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903895911" + }, + { + "name": "quick-checks", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903895963" + }, + { + "name": "shellcheck", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896134" + }, + { + "name": "toc", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896253" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896371" + }, + { + "name": "cmakelint", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896525" + }, + { + "name": "flake8-py3", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896658" + }, + { + "name": "Test collect_env (with_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896771" + }, + { + "name": "Test collect_env (without_torch)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896795" + }, + { + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896838" + }, + { + "name": "mypy", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440028/jobs/2903896897" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_NZqw=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/73881411e2bfb3aaa2e89926a82390b4c587ad75/checks?check_suite_id=5743625458" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVZYxPI=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "run-torchbench", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440031/jobs/2903895828" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_NYIw=", + "hasNextPage": false + } + }, + "conclusion": "SKIPPED", + "url": "https://github.com/pytorch/pytorch/commit/73881411e2bfb3aaa2e89926a82390b4c587ad75/checks?check_suite_id=5743625463" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVZYxPc=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "pull" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896014" + }, + { + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896165" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896394" + }, + { + "name": "linux-bionic-rocm4.5-py3.7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896572" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896666" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896778" + }, + { + "name": "linux-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896837" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896896" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903896936" + }, + { + "name": 
"linux-xenial-py3-clang5-mobile-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897025" + }, + { + "name": "linux-xenial-py3.7-gcc7-no-ops / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897161" + }, + { + "name": "linux-xenial-py3.7-gcc7 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897213" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897280" + }, + { + "name": "win-vs2019-cpu-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897368" + }, + { + "name": "win-vs2019-cuda11.3-py3 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897431" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897476" + }, + { + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897578" + }, + { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897630" + }, + { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897699" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2903897733" + }, + { + "name": "linux-docs / build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904327787" + }, + { + "name": "linux-docs / build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904327838" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904327956" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904327997" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904328035" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904328093" + }, + { + "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904328131" + }, + { + "name": 
"linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904328177" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904333962" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904334006" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904430419" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904430459" + }, + { + "name": "linux-bionic-py3.7-clang9 / test (noarch, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904430508" + }, + { + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904430573" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904443663" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904443723" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904443787" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904454239" + }, + { + "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904454303" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904554602" + }, + { + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904554698" + }, + { + "name": "win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904588855" + }, + { + "name": "win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904588886" + }, + { + "name": "win-vs2019-cuda11.3-py3 / test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904588924" + }, + { + "name": 
"linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904655702" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904656104" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904656150" + }, + { + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904656192" + }, + { + "name": "linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904706520" + }, + { + "name": "linux-bionic-rocm4.5-py3.7 / test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2018440039/jobs/2904706565" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAU_fN1g=", + "hasNextPage": false + } + }, + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/73881411e2bfb3aaa2e89926a82390b4c587ad75/checks?check_suite_id=5743625483" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVZYxQs=" + } + ], + "pageInfo": { + "hasNextPage": false + } + }, + "status": { + "contexts": [ + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17048428?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17048429?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17048431?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17048430?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + } + ] + }, + "pushedDate": "2022-03-21T19:58:52Z", + "oid": "73881411e2bfb3aaa2e89926a82390b4c587ad75" + } + } + ] + }, + "changedFiles": 37, "files": { "nodes": [ { - "path": "tools/codegen/selective_build/operator.py" - }, - { - "path": "tools/codegen/selective_build/selector.py" - }, - { - "path": "tools/codegen/shape_functions/gen_jit_shape_functions.py" - }, - { - "path": "tools/codegen/static_runtime/config.py" - }, - { - "path": "tools/codegen/static_runtime/gen_static_runtime_ops.py" - }, - { - "path": "tools/codegen/static_runtime/gen_structured.py" - }, - { - "path": "tools/codegen/utils.py" + "path": "aten/src/ATen/core/interned_strings.h" }, { - "path": "tools/linter/adapters/circleci_linter.py" + "path": "caffe2/CMakeLists.txt" }, { - "path": 
"tools/linter/adapters/clangformat_linter.py" + "path": "cmake/Dependencies.cmake" }, { - "path": "tools/linter/adapters/grep_linter.py" + "path": "cmake/Modules/FindMKLDNN.cmake" }, { - "path": "tools/linter/adapters/nativefunctions_linter.py" + "path": "cmake/public/mkldnn.cmake" }, { - "path": "tools/setup_helpers/BUILD.bazel" + "path": "docs/source/jit.rst" }, { - "path": "tools/setup_helpers/generate_code.py" + "path": "test/test_jit_llga_fuser.py" }, { "path": "torch/_C/__init__.pyi.in" }, { - "path": "torch/amp/autocast_mode.py" - }, - { - "path": "torch/ao/ns/fx/pattern_utils.py" - }, - { - "path": "torch/ao/quantization/backend_config/README.md" - }, - { - "path": "torch/ao/quantization/backend_config/__init__.py" - }, - { - "path": "torch/ao/quantization/backend_config/native.py" - }, - { - "path": "torch/ao/quantization/backend_config/observation_type.py" - }, - { - "path": "torch/ao/quantization/backend_config/tensorrt.py" - }, - { - "path": "torch/ao/quantization/backend_config/utils.py" - }, - { - "path": "torch/ao/quantization/fx/__init__.py" - }, - { - "path": "torch/ao/quantization/fx/backend_config/fuse_handler.py" - }, - { - "path": "torch/ao/quantization/fx/backend_config/quantize_handler.py" - }, - { - "path": "torch/ao/quantization/fx/backend_config_utils.py" - }, - { - "path": "torch/ao/quantization/fx/convert.py" - }, - { - "path": "torch/ao/quantization/fx/fuse.py" - }, - { - "path": "torch/ao/quantization/fx/fusion_patterns.py" - }, - { - "path": "torch/ao/quantization/fx/match_utils.py" - }, - { - "path": "torch/ao/quantization/fx/pattern_utils.py" - }, - { - "path": "torch/ao/quantization/fx/prepare.py" - }, - { - "path": "torch/ao/quantization/fx/quantization_patterns.py" - }, - { - "path": "torch/ao/quantization/qconfig.py" - }, - { - "path": "torch/ao/quantization/quantization_types.py" - }, - { - "path": "torch/ao/quantization/quantize_fx.py" - }, - { - "path": "torch/autograd/__init__.py" - }, - { - "path": "torch/csrc/Module.cpp" - }, - { - "path": "torch/csrc/autograd/FunctionsManual.cpp" - }, - { - "path": "torch/csrc/autograd/FunctionsManual.h" - }, - { - "path": "torch/csrc/autograd/engine.cpp" - }, - { - "path": "torch/csrc/autograd/function.h" - }, - { - "path": "torch/csrc/autograd/functions/accumulate_grad.h" - }, - { - "path": "torch/csrc/autograd/init.cpp" - }, - { - "path": "torch/csrc/autograd/python_torch_functions_manual.cpp" - }, - { - "path": "torch/csrc/autograd/python_variable.cpp" + "path": "torch/csrc/jit/codegen/onednn/LlgaTensorImpl.cpp" }, { - "path": "torch/csrc/autograd/record_function_ops.h" + "path": "torch/csrc/jit/codegen/onednn/LlgaTensorImpl.h" }, { - "path": "torch/csrc/autograd/utils/grad_layout_contract.h" + "path": "torch/csrc/jit/codegen/onednn/README.md" }, { - "path": "torch/csrc/deploy/CMakeLists.txt" + "path": "torch/csrc/jit/codegen/onednn/defer_size_check.cpp" }, { - "path": "torch/csrc/distributed/c10d/logger.cpp" + "path": "torch/csrc/jit/codegen/onednn/defer_size_check.h" }, { - "path": "torch/csrc/jit/codegen/cuda/graph_fuser.cpp" + "path": "torch/csrc/jit/codegen/onednn/graph_fuser.cpp" }, { - "path": "torch/csrc/jit/codegen/cuda/parser.cpp" + "path": "torch/csrc/jit/codegen/onednn/graph_fuser.h" }, { - "path": "torch/csrc/jit/frontend/function_schema_parser.cpp" + "path": "torch/csrc/jit/codegen/onednn/graph_helper.cpp" }, { - "path": "torch/csrc/jit/frontend/lexer.h" + "path": "torch/csrc/jit/codegen/onednn/graph_helper.h" }, { - "path": "torch/csrc/jit/frontend/parser.cpp" + "path": 
"torch/csrc/jit/codegen/onednn/graph_rewriter.cpp" }, { - "path": "torch/csrc/jit/frontend/parser.h" + "path": "torch/csrc/jit/codegen/onednn/guard_shape.cpp" }, { - "path": "torch/csrc/jit/frontend/script_type_parser.cpp" + "path": "torch/csrc/jit/codegen/onednn/guard_shape.h" }, { - "path": "torch/csrc/jit/frontend/source_range.cpp" + "path": "torch/csrc/jit/codegen/onednn/interface.cpp" }, { - "path": "torch/csrc/jit/frontend/source_range.h" + "path": "torch/csrc/jit/codegen/onednn/interface.h" }, { - "path": "torch/csrc/jit/frontend/source_ref.h" + "path": "torch/csrc/jit/codegen/onednn/kernel.cpp" }, { - "path": "torch/csrc/jit/frontend/tracer.cpp" + "path": "torch/csrc/jit/codegen/onednn/kernel.h" }, { - "path": "torch/csrc/jit/frontend/tracer.h" + "path": "torch/csrc/jit/codegen/onednn/layout_propagation.cpp" }, { - "path": "torch/csrc/jit/mobile/debug_info.cpp" + "path": "torch/csrc/jit/codegen/onednn/layout_propagation.h" }, { - "path": "torch/csrc/jit/mobile/debug_info.h" + "path": "torch/csrc/jit/codegen/onednn/operator.h" }, { - "path": "torch/csrc/jit/mobile/flatbuffer_loader.cpp" + "path": "torch/csrc/jit/codegen/onednn/prepare_binary.cpp" }, { - "path": "torch/csrc/jit/mobile/module.h" + "path": "torch/csrc/jit/codegen/onednn/prepare_binary.h" }, { - "path": "torch/csrc/jit/passes/common_expression_hoisting.cpp" + "path": "torch/csrc/jit/codegen/onednn/register_interface.cpp" }, { - "path": "torch/csrc/jit/passes/common_expression_hoisting.h" + "path": "torch/csrc/jit/ir/alias_analysis.cpp" }, { - "path": "torch/csrc/jit/passes/frozen_graph_optimizations.cpp" + "path": "torch/csrc/jit/ir/ir.cpp" }, { - "path": "torch/csrc/jit/passes/onnx/pattern_conversion/common.cpp" + "path": "torch/csrc/jit/passes/inline_autodiff_subgraphs.cpp" }, { - "path": "torch/csrc/jit/passes/onnx/scalar_type_analysis.cpp" + "path": "torch/csrc/jit/passes/onednn_graph_fuser.h" }, { "path": "torch/csrc/jit/python/init.cpp" }, { - "path": "torch/csrc/jit/python/python_tree_views.cpp" - }, - { - "path": "torch/csrc/jit/python/script_init.cpp" - }, - { - "path": "torch/csrc/jit/runtime/graph_executor.cpp" - }, - { - "path": "torch/csrc/jit/runtime/interpreter.cpp" - }, - { - "path": "torch/csrc/jit/runtime/profiling_graph_executor_impl.cpp" - }, - { - "path": "torch/csrc/jit/runtime/script_profile.cpp" - }, - { - "path": "torch/csrc/jit/runtime/serialized_shape_function_registry.cpp" - }, - { - "path": "torch/csrc/jit/runtime/serialized_shape_function_registry.h" - }, - { - "path": "torch/csrc/jit/runtime/shape_function_registry.h" - }, - { - "path": "torch/csrc/jit/runtime/shape_functions.h" - }, - { - "path": "torch/csrc/jit/runtime/shape_functions_1.h" - }, - { - "path": "torch/csrc/jit/runtime/static/impl.cpp" - }, - { - "path": "torch/csrc/jit/runtime/static/passes.cpp" - }, - { - "path": "torch/csrc/jit/runtime/symbolic_shape_registry.cpp" - }, - { - "path": "torch/csrc/jit/runtime/symbolic_shape_registry.h" + "path": "torch/csrc/jit/runtime/operator.cpp" }, { - "path": "torch/csrc/jit/serialization/export_module.cpp" - }, + "path": "torch/jit/__init__.py" + } + ], + "pageInfo": { + "endCursor": "Mzc", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ { - "path": "torch/csrc/jit/serialization/flatbuffer_serializer.cpp" + "author": { + "login": "pinzhenx" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/jit/serialization/import.cpp" + "author": { + "login": "pinzhenx" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/jit/serialization/import_export_helpers.cpp" + "author": { + 
"login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/jit/serialization/import_export_helpers.h" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/jit/serialization/import_source.cpp" + "author": { + "login": "pinzhenx" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/jit/serialization/import_source.h" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/jit/serialization/source_range_serialization.cpp" + "author": { + "login": "chunyuan-w" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/jit/serialization/source_range_serialization.h" + "author": { + "login": "eellison" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/jit/testing/file_check.cpp" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/lazy/core/dynamic_ir.cpp" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/lazy/core/dynamic_ir.h" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/lazy/ts_backend/ts_eager_fallback.cpp" - } - ], - "pageInfo": { - "endCursor": "MzAw", - "hasNextPage": true - } - } - } - } - } - }, - "query_sha=0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98 cursor=MzAw name=pytorch number=76118 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "files": { - "nodes": [ - { - "path": "torch/csrc/lazy/ts_backend/ts_native_functions.cpp" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/utils/python_arg_parser.cpp" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/utils/python_arg_parser.h" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/utils/tensor_list.cpp" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/utils/tensor_new.cpp" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/csrc/utils/tensor_new.h" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/_shard/__init__.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/_shard/api.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/_shard/replicated_tensor.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/_shard/sharded_tensor/__init__.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/_shard/sharded_tensor/api.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/_shard/sharded_tensor/utils.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/algorithms/ddp_comm_hooks/debugging_hooks.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/algorithms/model_averaging/utils.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/fsdp/_optim_utils.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/fsdp/fully_sharded_data_parallel.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": 
"torch/distributed/nn/__init__.py" + "author": { + "login": "wukong1992" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/nn/functional.py" + "author": { + "login": "eellison" + }, + "state": "COMMENTED" }, { - "path": "torch/distributed/optim/functional_adagrad.py" + "author": { + "login": "eellison" + }, + "state": "COMMENTED" }, { - "path": "torch/fx/experimental/meta_tracer.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/fx/graph.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/jit/_shape_functions.py" + "author": { + "login": "eellison" + }, + "state": "COMMENTED" }, { - "path": "torch/nn/parallel/_replicated_tensor_ddp_interop.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/nn/parallel/_replicated_tensor_ddp_utils.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/nn/parallel/distributed.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/nn/utils/_expanded_weights/__init__.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/nn/utils/_expanded_weights/instance_norm_expanded_weights.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/onnx/symbolic_opset11.py" + "author": { + "login": "eellison" + }, + "state": "APPROVED" }, { - "path": "torch/onnx/symbolic_opset12.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/onnx/symbolic_opset9.py" + "author": { + "login": "eellison" + }, + "state": "COMMENTED" }, { - "path": "torch/optim/adagrad.py" + "author": { + "login": "malfet" + }, + "state": "COMMENTED" }, { - "path": "torch/optim/lr_scheduler.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/overrides.py" + "author": { + "login": "malfet" + }, + "state": "COMMENTED" }, { - "path": "torch/quantization/fx/pattern_utils.py" + "author": { + "login": "malfet" + }, + "state": "COMMENTED" }, { - "path": "torch/quantization/fx/quantization_patterns.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/quantization/fx/quantization_types.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/return_types.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" }, { - "path": "torch/testing/_internal/common_device_type.py" + "author": { + "login": "sanchitintel" + }, + "state": "COMMENTED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMS0xMi0xMFQwOToyNDoxOS0wODowMLkyMDIxLTEyLTEwVDA5OjI0OjE5LTA4OjAwzjFryLE=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ + { + "bodyText": "Looks like this broke master https://hud.pytorch.org/pytorch/pytorch/commit/7dd08230117f4fa8bb82b3524e90fb00340198c7. 
I am reverting.", + "createdAt": "2022-03-21T22:51:38Z", + "author": { + "login": "suo" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1074498483 }, { - "path": "torch/testing/_internal/common_distributed.py" + "bodyText": "@pytorchbot revert this", + "createdAt": "2022-03-21T22:51:44Z", + "author": { + "login": "suo" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1074498550 }, { - "path": "torch/testing/_internal/common_fx2trt.py" + "bodyText": "Looks like this broke master https://hud.pytorch.org/pytorch/pytorch/commit/7dd08230117f4fa8bb82b3524e90fb00340198c7. I am reverting.\n\nOops! Will fix it ASAP.", + "createdAt": "2022-03-21T22:53:34Z", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1074499668 }, { - "path": "torch/testing/_internal/common_methods_invocations.py" + "bodyText": "This pull request has been reverted by e5bf879. To re-land this change, please open another pull request, assignthe same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).", + "createdAt": "2022-03-21T23:07:23Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1074508608 }, { - "path": "torch/testing/_internal/common_utils.py" - }, + "bodyText": "This pull request has been reverted by e5bf879. To re-land this change, please open another pull request, assignthe same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).", + "createdAt": "2022-03-30T00:53:50Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1082508130 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQAuLsw==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ { - "path": "torch/testing/_internal/composite_compliance.py" + "node": { + "name": "oncall: jit" + } }, { - "path": "torch/testing/_internal/distributed/distributed_test.py" + "node": { + "name": "triaged" + } }, { - "path": "torch/testing/_internal/jit_metaprogramming_utils.py" + "node": { + "name": "open source" + } }, { - "path": "torch/utils/cpp_extension.py" + "node": { + "name": "cla signed" + } }, { - "path": "torch/utils/data/datapipes/_typing.py" + "node": { + "name": "Reverted" + } }, { - "path": "torch/utils/model_dump/__init__.py" + "node": { + "name": "intel priority" + } } - ], - "pageInfo": { - "endCursor": "MzQ4", - "hasNextPage": false - } + ] } } } } }, - "query_sha=4c16925415d1fcc12ac0f5f7ce73b8e6122997d2f51c4c2757c2543e6493c60d cr_cursor=Y3Vyc29yOnYyOpHPAAAAAWuVD9M= cs_cursor=Y3Vyc29yOnYyOpHPAAAAAXEsRtE= name=pytorch number=76118 owner=pytorch": { + "query_sha=2e2877d2452c4f233f042b7ccd50ab9c2a6e9a73d8819a0c876203c12364e8a3 cursor=Y3Vyc29yOnYyOpHOQAuLsw== name=pytorch number=68111 owner=pytorch": { "data": { "repository": { "pullRequest": { - "commits": { + "comments": { "nodes": [ { - "commit": { - "oid": "5696e8357cf38f852ef3d680381513e26f202371", - "checkSuites": { - "nodes": [ - { - "checkRuns": { - "nodes": [ - { - "name": "win-vs2019-cuda11.3-py3 / test (force_on_cpu, 1, 1, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/6099898412?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": 
"Y3Vyc29yOnYyOpHPAAAAAWuVECw=", - "hasNextPage": false - } - } - } - ] - } - } - } - ] - } - } - } - } - }, - "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=None name=pytorch-dev-infra org=pytorch": { - "data": { - "organization": { - "team": { - "members": { - "nodes": [ + "bodyText": "CI Flow Status\n\u269b\ufe0f CI Flow\nRuleset - Version: v1\nRuleset - File: https://github.com/chunyuan-w/pytorch/blob/7496bf1588050191595d833d23b8972b2f22655e/.github/generated-ciflow-ruleset.json\nPR ciflow labels: ciflow/default\n\n\n\nWorkflows\nLabels (bold enabled)\nStatus\n\n\n\n\nTriggered Workflows\n\n\n\n\nlinux-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk\n\u2705 triggered\n\n\nlinux-docs\nciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-vulkan-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-bazel-test\nciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-build\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-custom-build-static\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-asan\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-onnx\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7-no-ops\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nwin-vs2019-cpu-py3\nciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nwin-vs2019-cuda11.3-py3\nciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nSkipped Workflows\n\n\n\n\ncaffe2-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\ndocker-builds\nciflow/all, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-custom-ops\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-full-jit\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-metal\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab 
skipped\n\n\nios-12-5-1-x86-64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64-full-jit\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda10.2-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-binary-conda\nciflow/binaries, ciflow/binaries/conda\n\ud83d\udeab skipped\n\n\nlinux-binary-libtorch-cxx11-abi\nciflow/binaries, ciflow/binaries/libtorch\n\ud83d\udeab skipped\n\n\nlinux-binary-libtorch-pre-cxx11\nciflow/binaries, ciflow/binaries/libtorch\n\ud83d\udeab skipped\n\n\nlinux-binary-manywheel\nciflow/binaries, ciflow/binaries/wheel\n\ud83d\udeab skipped\n\n\nlinux-bionic-cuda10.2-py3.9-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-docs-push\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-no-ops\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-arm64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-lite-interpreter-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-11-py3-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nparallelnative-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nperiodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda11.1-py3.7-gcc7-debug\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-win-vs2019-cuda11.1-py3\nciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win\n\ud83d\udeab skipped\n\n\nperiodic-win-vs2019-cuda11.5-py3\nciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win\n\ud83d\udeab skipped\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-build\nciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\n\n\nYou can add a comment to the PR and tag @pytorchbot with the following commands:\n\n# ciflow rerun, \"ciflow/default\" will always be added automatically\n@pytorchbot ciflow rerun\n\n# ciflow rerun with additional labels \"-l \", which is equivalent to adding these labels manually and trigger the rerun\n@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow\n\nFor more information, please take a look at the CI Flow Wiki.", + "createdAt": "2021-11-10T08:42:49Z", + "author": { + "login": "pytorch-probot" + }, + "authorAssociation": "NONE", + "editor": { + "login": "pytorch-probot" + }, + "databaseId": 964902865 + }, { - "login": "kit1980" + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea 
\u00a0See artifacts and rendered test results at hud.pytorch.org/pr/68111\nNeed help or want to give feedback on the CI? Visit our office hours\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 7388141 (more details on the Dr. CI page):\n\n\n29/29 failures introduced in this PR\n\n\n\ud83d\udd75\ufe0f 29 new failures recognized by patterns\nThe following CI failures do not appear to be due to upstream breakages:\n pull / linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge) (1/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:31:38.6978776Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:31:38.3001628Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:31:38.5169168Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:31:38.5362923Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:31:38.5413452Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:31:38.5458747Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:31:38.5484014Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:31:38.5497924Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:31:38.5656491Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:31:38.5678893Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:31:38.6888479Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0f6488c20adb4dca4\n2022-03-21T21:31:38.6978776Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:31:38.6992648Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:31:38.7003010Z ##[error]Process completed with exit code 2.\n2022-03-21T21:31:38.7044027Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:31:38.7044261Z with:\n2022-03-21T21:31:38.7044413Z env:\n2022-03-21T21:31:38.7044565Z IN_CI: 1\n2022-03-21T21:31:38.7044709Z IS_GHA: 1\n2022-03-21T21:31:38.7044885Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:31:38.7045067Z ##[endgroup]\n2022-03-21T21:31:38.7060958Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge) (2/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:35:19.2635222Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:35:18.9028722Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:35:19.1132721Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:35:19.1310590Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages 
(1.19.12)\n2022-03-21T21:35:19.1360251Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:35:19.1386865Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:35:19.1429182Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:35:19.1441925Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:35:19.1468280Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:35:19.1617667Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:35:19.2545368Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-098be2985e0392130\n2022-03-21T21:35:19.2635222Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:35:19.2648463Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:35:19.2658727Z ##[error]Process completed with exit code 2.\n2022-03-21T21:35:19.2706355Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:35:19.2706591Z with:\n2022-03-21T21:35:19.2706748Z env:\n2022-03-21T21:35:19.2706908Z IN_CI: 1\n2022-03-21T21:35:19.2707061Z IS_GHA: 1\n2022-03-21T21:35:19.2707246Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:35:19.2707438Z ##[endgroup]\n2022-03-21T21:35:19.2724554Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / win-vs2019-cuda11.3-py3 / test (force_on_cpu, 1, 1, windows.4xlarge) (3/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T23:11:57.5531419Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T23:11:52.7662022Z Downloading botocore-1.22.12-py3-none-any.whl (8.1 MB)\n2022-03-21T23:11:53.1213298Z ---------------------------------------- 8.1/8.1 MB 23.6 MB/s eta 0:00:00\n2022-03-21T23:11:53.1644665Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T23:11:53.2218699Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T23:11:53.2389674Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T23:11:53.2787295Z -------------------------------------- 247.7/247.7 KB 7.4 MB/s eta 0:00:00\n2022-03-21T23:11:53.3761842Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T23:11:53.5457622Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T23:11:57.4175080Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T23:11:57.5296815Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0105d4db093574f40\n2022-03-21T23:11:57.5531419Z 
C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T23:11:57.5564814Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T23:11:57.5587712Z ##[error]Process completed with exit code 2.\n2022-03-21T23:11:57.5790311Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T23:11:57.5790832Z with:\n2022-03-21T23:11:57.5791104Z env:\n2022-03-21T23:11:57.5791358Z IN_CI: 1\n2022-03-21T23:11:57.5791620Z IS_GHA: 1\n2022-03-21T23:11:57.5791939Z GIT_DEFAULT_BRANCH: master\n2022-03-21T23:11:57.5792425Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T23:11:57.5792884Z ##[endgroup]\n\n\n pull / linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu) (4/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T02:17:12.6257577Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T02:17:11.9280556Z Using cached https://files.pythonhosted.org/packages/7b/9c/f51775ebe7df5a7aa4e7c79ed671bde94e154bd968aca8d65bb24aba0c8c/s3transfer-0.5.2-py3-none-any.whl\n2022-03-22T02:17:11.9335199Z Collecting urllib3<1.27,>=1.25.4 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T02:17:11.9682045Z Using cached https://files.pythonhosted.org/packages/ec/03/062e6444ce4baf1eac17a6a0ebfe36bb1ad05e1df0e20b110de59c278498/urllib3-1.26.9-py2.py3-none-any.whl\n2022-03-22T02:17:11.9850357Z Collecting python-dateutil<3.0.0,>=2.1 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T02:17:12.0403171Z Using cached https://files.pythonhosted.org/packages/36/7a/87837f39d0296e723bb9b62bbb257d0355c7f6128853c78955f57342a56d/python_dateutil-2.8.2-py2.py3-none-any.whl\n2022-03-22T02:17:12.0468875Z Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T02:17:12.0590000Z Using cached https://files.pythonhosted.org/packages/d9/5a/e7c31adbe875f2abbb91bd84cf2dc52d792b5a01506781dbcf25c91daf11/six-1.16.0-py2.py3-none-any.whl\n2022-03-22T02:17:12.0607093Z Installing collected packages: jmespath, urllib3, six, python-dateutil, botocore, s3transfer, boto3\n2022-03-22T02:17:12.5273459Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2 six-1.16.0 urllib3-1.26.9\n2022-03-22T02:17:12.6032812Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 worker-rocm-amd-114\n2022-03-22T02:17:12.6257577Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T02:17:12.6259543Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T02:17:12.6291924Z ##[error]Process completed with exit code 2.\n2022-03-22T02:17:12.6387977Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-22T02:17:12.6388298Z with:\n2022-03-22T02:17:12.6388521Z wait-ssh: false\n2022-03-22T02:17:12.6388727Z env:\n2022-03-22T02:17:12.6388932Z IN_CI: 1\n2022-03-22T02:17:12.6389143Z IS_GHA: 1\n2022-03-22T02:17:12.6389368Z GIT_DEFAULT_BRANCH: master\n2022-03-22T02:17:12.6389669Z DOCKER_HOST: unix:///run/user/1121/docker.sock\n\n\n pull / linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge) (5/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:19:24.4890693Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or 
directory\n\n2022-03-21T22:19:24.0962005Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:19:24.3152253Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:19:24.3341183Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:19:24.3391374Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:19:24.3436392Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:19:24.3448982Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:19:24.3474092Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:19:24.3502003Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:19:24.3655072Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:19:24.4799309Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0bc9250521f338cae\n2022-03-21T22:19:24.4890693Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:19:24.4903625Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:19:24.4913841Z ##[error]Process completed with exit code 2.\n2022-03-21T22:19:24.4957338Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:19:24.4957575Z with:\n2022-03-21T22:19:24.4957735Z env:\n2022-03-21T22:19:24.4957900Z IN_CI: 1\n2022-03-21T22:19:24.4958055Z IS_GHA: 1\n2022-03-21T22:19:24.4958246Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:19:24.4958437Z ##[endgroup]\n2022-03-21T22:19:24.4989649Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-bionic-rocm4.5-py3.7 / test (default, 2, 2, linux.rocm.gpu) (6/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T01:05:07.6983899Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T01:05:06.8364546Z Using cached https://files.pythonhosted.org/packages/7b/9c/f51775ebe7df5a7aa4e7c79ed671bde94e154bd968aca8d65bb24aba0c8c/s3transfer-0.5.2-py3-none-any.whl\n2022-03-22T01:05:06.8431763Z Collecting urllib3<1.27,>=1.25.4 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T01:05:06.8949391Z Using cached https://files.pythonhosted.org/packages/ec/03/062e6444ce4baf1eac17a6a0ebfe36bb1ad05e1df0e20b110de59c278498/urllib3-1.26.9-py2.py3-none-any.whl\n2022-03-22T01:05:06.9180079Z Collecting python-dateutil<3.0.0,>=2.1 (from botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T01:05:06.9803351Z Using cached https://files.pythonhosted.org/packages/36/7a/87837f39d0296e723bb9b62bbb257d0355c7f6128853c78955f57342a56d/python_dateutil-2.8.2-py2.py3-none-any.whl\n2022-03-22T01:05:06.9882133Z Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12)\n2022-03-22T01:05:07.0067062Z Using cached 
https://files.pythonhosted.org/packages/d9/5a/e7c31adbe875f2abbb91bd84cf2dc52d792b5a01506781dbcf25c91daf11/six-1.16.0-py2.py3-none-any.whl\n2022-03-22T01:05:07.0088676Z Installing collected packages: urllib3, jmespath, six, python-dateutil, botocore, s3transfer, boto3\n2022-03-22T01:05:07.5819667Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2 six-1.16.0 urllib3-1.26.9\n2022-03-22T01:05:07.6774717Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 worker-rocm-amd-60\n2022-03-22T01:05:07.6983899Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T01:05:07.6988652Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T01:05:07.7023073Z ##[error]Process completed with exit code 2.\n2022-03-22T01:05:07.7102087Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-22T01:05:07.7102389Z with:\n2022-03-22T01:05:07.7102603Z wait-ssh: false\n2022-03-22T01:05:07.7102820Z env:\n2022-03-22T01:05:07.7103015Z IN_CI: 1\n2022-03-22T01:05:07.7103224Z IS_GHA: 1\n2022-03-22T01:05:07.7103458Z GIT_DEFAULT_BRANCH: master\n2022-03-22T01:05:07.7103737Z DOCKER_HOST: unix:///run/user/1502/docker.sock\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge) (7/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:51:39.3637996Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T20:51:39.2041249Z Attempting uninstall: s3transfer\n2022-03-21T20:51:39.2043010Z Found existing installation: s3transfer 0.3.7\n2022-03-21T20:51:39.2083799Z Uninstalling s3transfer-0.3.7:\n2022-03-21T20:51:39.2089675Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T20:51:39.2480546Z Attempting uninstall: boto3\n2022-03-21T20:51:39.2482953Z Found existing installation: boto3 1.16.34\n2022-03-21T20:51:39.2584292Z Uninstalling boto3-1.16.34:\n2022-03-21T20:51:39.2599474Z Successfully uninstalled boto3-1.16.34\n2022-03-21T20:51:39.3130921Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T20:51:39.3550598Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-03ef7efc3078e3da5\n2022-03-21T20:51:39.3637996Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T20:51:39.3650651Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T20:51:39.3660484Z ##[error]Process completed with exit code 2.\n2022-03-21T20:51:39.3696465Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T20:51:39.3696693Z with:\n2022-03-21T20:51:39.3696850Z env:\n2022-03-21T20:51:39.3697012Z IN_CI: 1\n2022-03-21T20:51:39.3697161Z IS_GHA: 1\n2022-03-21T20:51:39.3697342Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:51:39.3697528Z ##[endgroup]\n2022-03-21T20:51:39.3730420Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge) (8/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:03:36.3916860Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:03:36.0096309Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:03:36.2278560Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:03:36.2461618Z Requirement already satisfied: boto3==1.19.12 in 
/home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:03:36.2513260Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:03:36.2541524Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:03:36.2554899Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:03:36.2598277Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:03:36.2758299Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:03:36.2780690Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:03:36.3825021Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0a4a552890e6ef7d3\n2022-03-21T21:03:36.3916860Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:03:36.3930343Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:03:36.3941263Z ##[error]Process completed with exit code 2.\n2022-03-21T21:03:36.3979258Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:03:36.3979496Z with:\n2022-03-21T21:03:36.3979654Z env:\n2022-03-21T21:03:36.3979814Z IN_CI: 1\n2022-03-21T21:03:36.3979968Z IS_GHA: 1\n2022-03-21T21:03:36.3980157Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:03:36.3980360Z ##[endgroup]\n2022-03-21T21:03:36.3996257Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu) (9/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T00:41:15.5325784Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T00:41:10.3015614Z Downloading s3transfer-0.5.2-py3-none-any.whl (79 kB)\n2022-03-22T00:41:10.3625659Z ---------------------------------------- 79.5/79.5 KB 1.1 MB/s eta 0:00:00\n2022-03-22T00:41:10.4120236Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-22T00:41:10.4170155Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-22T00:41:10.4722115Z -------------------------------------- 247.7/247.7 KB 5.2 MB/s eta 0:00:00\n2022-03-22T00:41:10.4843512Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-22T00:41:10.6596108Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-22T00:41:10.8733354Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-22T00:41:15.3745408Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-22T00:41:15.4987162Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 
i-09cacc848abc3dd32\n2022-03-22T00:41:15.5325784Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T00:41:15.5373630Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T00:41:15.5404353Z ##[error]Process completed with exit code 2.\n2022-03-22T00:41:15.5790508Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-22T00:41:15.5791192Z with:\n2022-03-22T00:41:15.5791530Z env:\n2022-03-22T00:41:15.5791849Z IN_CI: 1\n2022-03-22T00:41:15.5792186Z IS_GHA: 1\n2022-03-22T00:41:15.5792599Z GIT_DEFAULT_BRANCH: master\n2022-03-22T00:41:15.5793237Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-22T00:41:15.5793831Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge) (10/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:50:32.9799307Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T20:50:32.8167560Z Attempting uninstall: s3transfer\n2022-03-21T20:50:32.8169351Z Found existing installation: s3transfer 0.3.7\n2022-03-21T20:50:32.8213295Z Uninstalling s3transfer-0.3.7:\n2022-03-21T20:50:32.8219209Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T20:50:32.8602320Z Attempting uninstall: boto3\n2022-03-21T20:50:32.8603289Z Found existing installation: boto3 1.16.34\n2022-03-21T20:50:32.8704535Z Uninstalling boto3-1.16.34:\n2022-03-21T20:50:32.8719403Z Successfully uninstalled boto3-1.16.34\n2022-03-21T20:50:32.9244278Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T20:50:32.9710449Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0c568461a276d4a71\n2022-03-21T20:50:32.9799307Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T20:50:32.9812238Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T20:50:32.9823052Z ##[error]Process completed with exit code 2.\n2022-03-21T20:50:32.9859290Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T20:50:32.9859527Z with:\n2022-03-21T20:50:32.9859664Z env:\n2022-03-21T20:50:32.9859817Z IN_CI: 1\n2022-03-21T20:50:32.9859977Z IS_GHA: 1\n2022-03-21T20:50:32.9860144Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:50:32.9860327Z ##[endgroup]\n2022-03-21T20:50:32.9893642Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge) (11/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:05:00.7163042Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in\n\n2022-03-21T21:05:00.6660824Z #10 0x55fc8a3ea801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037\n2022-03-21T21:05:00.6661768Z #11 0x55fc8a3f57a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961\n2022-03-21T21:05:00.6662455Z #12 0x55fc8a3f580b in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455\n2022-03-21T21:05:00.6663570Z #13 0x55fc8a3f5908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420\n2022-03-21T21:05:00.6663952Z #14 0x55fc8a3f5908 in pymain_run_python 
/tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907\n2022-03-21T21:05:00.6664431Z #15 0x55fc8a3f5908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460\n2022-03-21T21:05:00.6665304Z #16 0x55fc8a3f5ccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495\n2022-03-21T21:05:00.7162113Z #17 0x7f940d00f83f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291\n2022-03-21T21:05:00.7162534Z #18 0x55fc8a39a554 in _start (/opt/conda/bin/python3.7+0x1d7554)\n2022-03-21T21:05:00.7162711Z \n2022-03-21T21:05:00.7163042Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in \n2022-03-21T21:05:00.7334595Z + retcode=1\n2022-03-21T21:05:00.7334954Z + set -e\n2022-03-21T21:05:00.7335215Z + return 1\n2022-03-21T21:05:00.7338688Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]\n2022-03-21T21:05:00.7339232Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X ]]\n2022-03-21T21:05:00.7340113Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]\n2022-03-21T21:05:00.7340612Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\2 ]]\n2022-03-21T21:05:00.7341187Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]\n2022-03-21T21:05:00.7341668Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\5\\1\\2 ]]\n2022-03-21T21:05:00.7344466Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]\n\n\n pull / linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge) (12/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:06:03.4437430Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:06:03.0752199Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:06:03.2853252Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:06:03.3032326Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:06:03.3081589Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:06:03.3093911Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:06:03.3120244Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:06:03.3162406Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:06:03.3188431Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:06:03.3337181Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:06:03.4348072Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0ee48c8811fafc444\n2022-03-21T22:06:03.4437430Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:06:03.4450920Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:06:03.4461263Z ##[error]Process completed with 
exit code 2.\n2022-03-21T22:06:03.4502346Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:06:03.4502576Z with:\n2022-03-21T22:06:03.4502730Z env:\n2022-03-21T22:06:03.4502888Z IN_CI: 1\n2022-03-21T22:06:03.4503038Z IS_GHA: 1\n2022-03-21T22:06:03.4503302Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:06:03.4503492Z ##[endgroup]\n2022-03-21T22:06:03.4519156Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (13/29)\nStep: \"Test\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:50:13.2205634Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T20:50:12.8679322Z + python3 -m pip install boto3==1.19.12\n2022-03-21T20:50:13.0744228Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T20:50:13.0916284Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T20:50:13.0964264Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T20:50:13.1005656Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T20:50:13.1017299Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T20:50:13.1041042Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T20:50:13.1189450Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T20:50:13.1208751Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T20:50:13.2119445Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0d02da60fd18c22f5\n2022-03-21T20:50:13.2205634Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T20:50:13.2217939Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T20:50:13.2220259Z ##[error]Process completed with exit code 2.\n2022-03-21T20:50:13.2248664Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T20:50:13.2249012Z with:\n2022-03-21T20:50:13.2249260Z env:\n2022-03-21T20:50:13.2249500Z IN_CI: 1\n2022-03-21T20:50:13.2249738Z IS_GHA: 1\n2022-03-21T20:50:13.2250025Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:50:13.2250329Z ##[endgroup]\n2022-03-21T20:50:13.2272735Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu) (14/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T23:47:38.0451999Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T23:47:37.5554508Z + python3 -m pip install boto3==1.19.12\n2022-03-21T23:47:37.8411473Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T23:47:37.8631484Z Requirement already satisfied: boto3==1.19.12 in 
/home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T23:47:37.8699561Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T23:47:37.8737037Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T23:47:37.8754443Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T23:47:37.8814393Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T23:47:37.8849540Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T23:47:37.9059579Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T23:47:38.0336298Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0b44f47f4292089a2\n2022-03-21T23:47:38.0451999Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T23:47:38.0469471Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T23:47:38.0484106Z ##[error]Process completed with exit code 2.\n2022-03-21T23:47:38.0532678Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T23:47:38.0533007Z with:\n2022-03-21T23:47:38.0533223Z env:\n2022-03-21T23:47:38.0533440Z IN_CI: 1\n2022-03-21T23:47:38.0533649Z IS_GHA: 1\n2022-03-21T23:47:38.0533902Z GIT_DEFAULT_BRANCH: master\n2022-03-21T23:47:38.0534170Z GPU_FLAG: --gpus all\n2022-03-21T23:47:38.0534401Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge) (15/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:04:59.3115800Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in\n\n2022-03-21T21:04:59.2595213Z #10 0x55a7f39a4801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037\n2022-03-21T21:04:59.2595707Z #11 0x55a7f39af7a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961\n2022-03-21T21:04:59.2597203Z #12 0x55a7f39af80b in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455\n2022-03-21T21:04:59.2598205Z #13 0x55a7f39af908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420\n2022-03-21T21:04:59.2598697Z #14 0x55a7f39af908 in pymain_run_python /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907\n2022-03-21T21:04:59.2599178Z #15 0x55a7f39af908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460\n2022-03-21T21:04:59.2599747Z #16 0x55a7f39afccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495\n2022-03-21T21:04:59.3114751Z #17 0x7f3b3822383f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291\n2022-03-21T21:04:59.3115277Z #18 0x55a7f3954554 in _start (/opt/conda/bin/python3.7+0x1d7554)\n2022-03-21T21:04:59.3115468Z \n2022-03-21T21:04:59.3115800Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
/var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in \n2022-03-21T21:04:59.3292385Z + retcode=1\n2022-03-21T21:04:59.3292781Z + set -e\n2022-03-21T21:04:59.3293062Z + return 1\n2022-03-21T21:04:59.3295462Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]\n2022-03-21T21:04:59.3295802Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X ]]\n2022-03-21T21:04:59.3296394Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]\n2022-03-21T21:04:59.3296700Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\2 ]]\n2022-03-21T21:04:59.3297055Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]\n2022-03-21T21:04:59.3297416Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\5\\1\\2 ]]\n2022-03-21T21:04:59.3299623Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]\n\n\n pull / win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge) (16/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:14:31.7846086Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:14:25.5525714Z Collecting jmespath<1.0.0,>=0.7.1\n2022-03-21T22:14:25.5568155Z Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)\n2022-03-21T22:14:25.5952617Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T22:14:25.6169392Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T22:14:25.6629996Z -------------------------------------- 247.7/247.7 KB 5.1 MB/s eta 0:00:00\n2022-03-21T22:14:25.6710247Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:14:25.8284354Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:14:25.9816751Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T22:14:31.6672236Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T22:14:31.7630473Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0ed0915ecee5d2424\n2022-03-21T22:14:31.7846086Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:14:31.7876742Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:14:31.7897140Z ##[error]Process completed with exit code 2.\n2022-03-21T22:14:31.8195621Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T22:14:31.8196110Z with:\n2022-03-21T22:14:31.8196356Z env:\n2022-03-21T22:14:31.8196614Z IN_CI: 1\n2022-03-21T22:14:31.8196876Z IS_GHA: 1\n2022-03-21T22:14:31.8197169Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:14:31.8197652Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T22:14:31.8198093Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge) (17/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:19:15.8845728Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:19:15.5116060Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:19:15.7231476Z Defaulting to user 
installation because normal site-packages is not writeable\n2022-03-21T21:19:15.7409711Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:19:15.7458478Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:19:15.7470508Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:19:15.7496799Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:19:15.7538362Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:19:15.7566161Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:19:15.7711630Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:19:15.8753543Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0e2b3b4ddb246ff2a\n2022-03-21T21:19:15.8845728Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:19:15.8859814Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:19:15.8870165Z ##[error]Process completed with exit code 2.\n2022-03-21T21:19:15.8917039Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:19:15.8917279Z with:\n2022-03-21T21:19:15.8917433Z env:\n2022-03-21T21:19:15.8917586Z IN_CI: 1\n2022-03-21T21:19:15.8917734Z IS_GHA: 1\n2022-03-21T21:19:15.8917917Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:19:15.8918102Z ##[endgroup]\n2022-03-21T21:19:15.8934572Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu) (18/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T23:19:48.5900162Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T23:19:48.0742254Z + python3 -m pip install boto3==1.19.12\n2022-03-21T23:19:48.3742563Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T23:19:48.3976536Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T23:19:48.4048700Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T23:19:48.4065374Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T23:19:48.4128076Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T23:19:48.4164273Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T23:19:48.4202610Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in 
/home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T23:19:48.4416723Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T23:19:48.5773033Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-07ab7a3c4a5402af2\n2022-03-21T23:19:48.5900162Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T23:19:48.5919822Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T23:19:48.5936087Z ##[error]Process completed with exit code 2.\n2022-03-21T23:19:48.6007930Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T23:19:48.6008268Z with:\n2022-03-21T23:19:48.6008483Z env:\n2022-03-21T23:19:48.6008701Z IN_CI: 1\n2022-03-21T23:19:48.6008920Z IS_GHA: 1\n2022-03-21T23:19:48.6009170Z GIT_DEFAULT_BRANCH: master\n2022-03-21T23:19:48.6009440Z GPU_FLAG: --gpus all\n2022-03-21T23:19:48.6009671Z ##[endgroup]\n\n\n pull / win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu) (19/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:54:04.2844259Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:53:59.0889659Z Downloading botocore-1.22.12-py3-none-any.whl (8.1 MB)\n2022-03-21T22:53:59.6881416Z ---------------------------------------- 8.1/8.1 MB 14.0 MB/s eta 0:00:00\n2022-03-21T22:53:59.7427779Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:53:59.7691882Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T22:53:59.7779847Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T22:53:59.8281663Z -------------------------------------- 247.7/247.7 KB 5.1 MB/s eta 0:00:00\n2022-03-21T22:54:00.0185115Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:54:00.2359770Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T22:54:04.1208891Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T22:54:04.2505862Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-03b4fbe63be8ef4b0\n2022-03-21T22:54:04.2844259Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:54:04.2891082Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:54:04.2919900Z ##[error]Process completed with exit code 2.\n2022-03-21T22:54:04.3377901Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T22:54:04.3378575Z with:\n2022-03-21T22:54:04.3378930Z env:\n2022-03-21T22:54:04.3379275Z IN_CI: 1\n2022-03-21T22:54:04.3379600Z IS_GHA: 1\n2022-03-21T22:54:04.3380023Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:54:04.3380691Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T22:54:04.3381278Z ##[endgroup]\n\n\n pull / linux-bionic-py3.7-clang9 / test (noarch, 1, 1, 
linux.2xlarge) (20/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:09:34.0074610Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:09:33.6365531Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:09:33.8475619Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:09:33.8655152Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:09:33.8704395Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:09:33.8716774Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:09:33.8760145Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:09:33.8785000Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:09:33.8811316Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:09:33.8960134Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:09:33.9984866Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0d325eb9fd156146f\n2022-03-21T22:09:34.0074610Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:09:34.0087465Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:09:34.0101743Z ##[error]Process completed with exit code 2.\n2022-03-21T22:09:34.0154014Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:09:34.0154246Z with:\n2022-03-21T22:09:34.0154412Z env:\n2022-03-21T22:09:34.0154574Z IN_CI: 1\n2022-03-21T22:09:34.0154728Z IS_GHA: 1\n2022-03-21T22:09:34.0154917Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:09:34.0155112Z ##[endgroup]\n2022-03-21T22:09:34.0191047Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge) (21/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:03:17.8502655Z [E request_callbac...yUniqueId(created_on=0, local_id=0) to be created.\n\n2022-03-21T21:03:14.4669960Z INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpxgdsmeer\n2022-03-21T21:03:14.4671407Z INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpxgdsmeer/_remote_module_non_sriptable.py\n2022-03-21T21:03:14.4973023Z INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmp1i2hfmpc\n2022-03-21T21:03:14.4973800Z INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmp1i2hfmpc/_remote_module_non_sriptable.py\n2022-03-21T21:03:14.5532339Z INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpgx4da7b0\n2022-03-21T21:03:14.5533064Z INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpgx4da7b0/_remote_module_non_sriptable.py\n2022-03-21T21:03:14.7050673Z 
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 0\n2022-03-21T21:03:14.7097127Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 3\n2022-03-21T21:03:14.7398339Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 2\n2022-03-21T21:03:14.7922283Z INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 1\n2022-03-21T21:03:17.8502655Z [E request_callback_no_python.cpp:559] Received error while processing request type 261: false INTERNAL ASSERT FAILED at \"/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp\":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.\n2022-03-21T21:03:17.8503603Z Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):\n2022-03-21T21:03:17.8504385Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0x69 (0x7f180df19e19 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)\n2022-03-21T21:03:17.8505131Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string, std::allocator > const&) + 0xd2 (0x7f180df160e2 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)\n2022-03-21T21:03:17.8505927Z frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string, std::allocator > const&) + 0x4e (0x7f180df17a7e in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)\n2022-03-21T21:03:17.8506674Z frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x4b4 (0x7f18118b7b64 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n2022-03-21T21:03:17.8507642Z frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr >) const + 0x70 (0x7f18118a7bf0 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n2022-03-21T21:03:17.8508613Z frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector >) const + 0xc8 (0x7f1819736208 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)\n2022-03-21T21:03:17.8509749Z frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector >) const + 0x194 (0x7f18118ac914 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n2022-03-21T21:03:17.8510708Z frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector >) const + 0x65 (0x7f1819735865 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)\n2022-03-21T21:03:17.8511369Z frame #8: + 0x375249a (0x7f18118a949a in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test (22/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T20:01:07.7015580Z \ufffd[36;1m echo \"ERR...t available for the merge-base of your 
branch\"\ufffd[0m\n\n2022-03-21T20:01:07.7012399Z \ufffd[36;1mfi\ufffd[0m\n2022-03-21T20:01:07.7012634Z \ufffd[36;1m# Covers the case where a previous tag doesn't exist for the tree\ufffd[0m\n2022-03-21T20:01:07.7012992Z \ufffd[36;1m# this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly\ufffd[0m\n2022-03-21T20:01:07.7013373Z \ufffd[36;1mif ! git rev-parse \"$MERGE_BASE:.circleci/docker\"; then\ufffd[0m\n2022-03-21T20:01:07.7013784Z \ufffd[36;1m echo \"Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit\"\ufffd[0m\n2022-03-21T20:01:07.7014149Z \ufffd[36;1m exit 1\ufffd[0m\n2022-03-21T20:01:07.7014325Z \ufffd[36;1mfi\ufffd[0m\n2022-03-21T20:01:07.7014573Z \ufffd[36;1mPREVIOUS_DOCKER_TAG=$(git rev-parse \"$MERGE_BASE:.circleci/docker\")\ufffd[0m\n2022-03-21T20:01:07.7014907Z \ufffd[36;1m# If no image exists but the hash is the same as the previous hash then we should error out here\ufffd[0m\n2022-03-21T20:01:07.7015231Z \ufffd[36;1mif [[ \"${PREVIOUS_DOCKER_TAG}\" = \"${DOCKER_TAG}\" ]]; then\ufffd[0m\n2022-03-21T20:01:07.7015580Z \ufffd[36;1m echo \"ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch\"\ufffd[0m\n2022-03-21T20:01:07.7015931Z \ufffd[36;1m echo \" contact the PyTorch team to restore the original images\"\ufffd[0m\n2022-03-21T20:01:07.7016225Z \ufffd[36;1m exit 1\ufffd[0m\n2022-03-21T20:01:07.7016400Z \ufffd[36;1mfi\ufffd[0m\n2022-03-21T20:01:07.7016608Z \ufffd[36;1mecho ::set-output name=rebuild::yes\ufffd[0m\n2022-03-21T20:01:07.7027605Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}\n2022-03-21T20:01:07.7027837Z env:\n2022-03-21T20:01:07.7028006Z IN_CI: 1\n2022-03-21T20:01:07.7028159Z IS_GHA: 1\n2022-03-21T20:01:07.7028346Z GIT_DEFAULT_BRANCH: master\n2022-03-21T20:01:07.7028589Z BASE_REVISION: 6643522db9ff595f564b8081de58b3a33c546178\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu) (23/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-22T00:49:54.2949572Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-22T00:49:53.8049151Z + python3 -m pip install boto3==1.19.12\n2022-03-22T00:49:54.0981629Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-22T00:49:54.1207562Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-22T00:49:54.1277146Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-22T00:49:54.1315027Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-22T00:49:54.1331813Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-22T00:49:54.1391622Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-22T00:49:54.1609217Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-22T00:49:54.1637417Z Requirement already satisfied: six>=1.5 in 
/home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-22T00:49:54.2830197Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0f7c32fe13be12fea\n2022-03-22T00:49:54.2949572Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-22T00:49:54.2966933Z + GHA_WORKFLOW_JOB_ID=\n2022-03-22T00:49:54.2982588Z ##[error]Process completed with exit code 2.\n2022-03-22T00:49:54.3031464Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-22T00:49:54.3031794Z with:\n2022-03-22T00:49:54.3032012Z env:\n2022-03-22T00:49:54.3032227Z IN_CI: 1\n2022-03-22T00:49:54.3032434Z IS_GHA: 1\n2022-03-22T00:49:54.3032681Z GIT_DEFAULT_BRANCH: master\n2022-03-22T00:49:54.3033084Z GPU_FLAG: --gpus all\n2022-03-22T00:49:54.3033312Z ##[endgroup]\n\n\n pull / win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge) (24/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:56:12.5872636Z C:\\actions-runner\\...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:56:07.3365589Z Downloading botocore-1.22.12-py3-none-any.whl (8.1 MB)\n2022-03-21T21:56:07.7926584Z ---------------------------------------- 8.1/8.1 MB 17.3 MB/s eta 0:00:00\n2022-03-21T21:56:07.9319362Z Collecting python-dateutil<3.0.0,>=2.1\n2022-03-21T21:56:07.9366132Z Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n2022-03-21T21:56:08.0077590Z -------------------------------------- 247.7/247.7 KB 3.0 MB/s eta 0:00:00\n2022-03-21T21:56:08.0164070Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:56:08.1775537Z Requirement already satisfied: six>=1.5 in c:\\actions-runner\\_work\\_tool\\python\\3.10.3\\x64\\lib\\site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:56:08.3393469Z Installing collected packages: python-dateutil, jmespath, botocore, s3transfer, boto3\n2022-03-21T21:56:12.4576766Z Successfully installed boto3-1.19.12 botocore-1.22.12 jmespath-0.10.0 python-dateutil-2.8.2 s3transfer-0.5.2\n2022-03-21T21:56:12.5641959Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0afad69838118af0e\n2022-03-21T21:56:12.5872636Z C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\\python3.exe: can't open file 'C:\\\\actions-runner\\\\_work\\\\pytorch\\\\pytorch\\\\.github\\\\scripts\\\\get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:56:12.5905611Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:56:12.5927729Z ##[error]Process completed with exit code 2.\n2022-03-21T21:56:12.6239531Z ##[group]Run pytorch/pytorch/.github/actions/teardown-win@master\n2022-03-21T21:56:12.6240039Z with:\n2022-03-21T21:56:12.6240299Z env:\n2022-03-21T21:56:12.6240557Z IN_CI: 1\n2022-03-21T21:56:12.6240805Z IS_GHA: 1\n2022-03-21T21:56:12.6241118Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:56:12.6241613Z pythonLocation: C:\\actions-runner\\_work\\_tool\\Python\\3.10.3\\x64\n2022-03-21T21:56:12.6242052Z ##[endgroup]\n\n\n pull / linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge) (25/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:46:39.5474616Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or 
directory\n\n2022-03-21T21:46:39.1884210Z + python3 -m pip install boto3==1.19.12\n2022-03-21T21:46:39.3928976Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T21:46:39.4105069Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T21:46:39.4152571Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T21:46:39.4194931Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T21:46:39.4218947Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T21:46:39.4230812Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T21:46:39.4380089Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T21:46:39.4399461Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T21:46:39.5387703Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0888bed1149cca415\n2022-03-21T21:46:39.5474616Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:46:39.5487145Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:46:39.5497480Z ##[error]Process completed with exit code 2.\n2022-03-21T21:46:39.5541319Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:46:39.5541544Z with:\n2022-03-21T21:46:39.5541698Z env:\n2022-03-21T21:46:39.5541851Z IN_CI: 1\n2022-03-21T21:46:39.5541997Z IS_GHA: 1\n2022-03-21T21:46:39.5542176Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:46:39.5542361Z ##[endgroup]\n2022-03-21T21:46:39.5557878Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge) (26/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:34:57.0623859Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:34:56.9039884Z Attempting uninstall: s3transfer\n2022-03-21T21:34:56.9041446Z Found existing installation: s3transfer 0.3.7\n2022-03-21T21:34:56.9090783Z Uninstalling s3transfer-0.3.7:\n2022-03-21T21:34:56.9095968Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T21:34:56.9453014Z Attempting uninstall: boto3\n2022-03-21T21:34:56.9454356Z Found existing installation: boto3 1.16.34\n2022-03-21T21:34:56.9564320Z Uninstalling boto3-1.16.34:\n2022-03-21T21:34:56.9578035Z Successfully uninstalled boto3-1.16.34\n2022-03-21T21:34:57.0091363Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T21:34:57.0536230Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-034a3afd5d80b91fd\n2022-03-21T21:34:57.0623859Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:34:57.0637167Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:34:57.0647396Z ##[error]Process completed with exit code 
2.\n2022-03-21T21:34:57.0688237Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:34:57.0688481Z with:\n2022-03-21T21:34:57.0688631Z env:\n2022-03-21T21:34:57.0688769Z IN_CI: 1\n2022-03-21T21:34:57.0688930Z IS_GHA: 1\n2022-03-21T21:34:57.0689109Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:34:57.0689462Z ##[endgroup]\n2022-03-21T21:34:57.0704768Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n pull / linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge) (27/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:05:00.7896545Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in\n\n2022-03-21T21:05:00.7395504Z #10 0x5597fd5a9801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037\n2022-03-21T21:05:00.7396330Z #11 0x5597fd5b47a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961\n2022-03-21T21:05:00.7396688Z #12 0x5597fd5b480b in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455\n2022-03-21T21:05:00.7398664Z #13 0x5597fd5b4908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420\n2022-03-21T21:05:00.7399177Z #14 0x5597fd5b4908 in pymain_run_python /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907\n2022-03-21T21:05:00.7399663Z #15 0x5597fd5b4908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460\n2022-03-21T21:05:00.7399986Z #16 0x5597fd5b4ccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495\n2022-03-21T21:05:00.7895241Z #17 0x7f0a5905983f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291\n2022-03-21T21:05:00.7895772Z #18 0x5597fd559554 in _start (/opt/conda/bin/python3.7+0x1d7554)\n2022-03-21T21:05:00.7896033Z \n2022-03-21T21:05:00.7896545Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in \n2022-03-21T21:05:00.8063448Z + retcode=1\n2022-03-21T21:05:00.8063787Z + set -e\n2022-03-21T21:05:00.8064058Z + return 1\n2022-03-21T21:05:00.8067638Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]\n2022-03-21T21:05:00.8068127Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X ]]\n2022-03-21T21:05:00.8069018Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]\n2022-03-21T21:05:00.8069500Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\2 ]]\n2022-03-21T21:05:00.8070105Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]\n2022-03-21T21:05:00.8070580Z + [[ default == \\n\\o\\g\\p\\u\\_\\N\\O\\_\\A\\V\\X\\5\\1\\2 ]]\n2022-03-21T21:05:00.8072640Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]\n\n\n pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu) (28/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T22:48:17.3384813Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T22:48:16.8599645Z + python3 -m pip install boto3==1.19.12\n2022-03-21T22:48:17.1464241Z Defaulting to user installation because normal site-packages is not writeable\n2022-03-21T22:48:17.1685222Z Requirement already satisfied: boto3==1.19.12 in /home/ec2-user/.local/lib/python3.7/site-packages (1.19.12)\n2022-03-21T22:48:17.1754164Z Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in 
/home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.10.0)\n2022-03-21T22:48:17.1771662Z Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (0.5.2)\n2022-03-21T22:48:17.1808722Z Requirement already satisfied: botocore<1.23.0,>=1.22.12 in /home/ec2-user/.local/lib/python3.7/site-packages (from boto3==1.19.12) (1.22.12)\n2022-03-21T22:48:17.1868636Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (2.8.2)\n2022-03-21T22:48:17.1903889Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/.local/lib/python3.7/site-packages (from botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.26.9)\n2022-03-21T22:48:17.2113746Z Requirement already satisfied: six>=1.5 in /home/ec2-user/.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.23.0,>=1.22.12->boto3==1.19.12) (1.16.0)\n2022-03-21T22:48:17.3267404Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-01fe178c405417375\n2022-03-21T22:48:17.3384813Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T22:48:17.3402286Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T22:48:17.3418376Z ##[error]Process completed with exit code 2.\n2022-03-21T22:48:17.3470528Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T22:48:17.3470874Z with:\n2022-03-21T22:48:17.3471096Z env:\n2022-03-21T22:48:17.3471327Z IN_CI: 1\n2022-03-21T22:48:17.3471538Z IS_GHA: 1\n2022-03-21T22:48:17.3471802Z GIT_DEFAULT_BRANCH: master\n2022-03-21T22:48:17.3472083Z GPU_FLAG: --gpus all\n2022-03-21T22:48:17.3472322Z ##[endgroup]\n\n\n pull / linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge) (29/29)\nStep: \"Upload test statistics\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-03-21T21:16:38.9646300Z python3: can't ope...ow_job_id.py': [Errno 2] No such file or directory\n\n2022-03-21T21:16:38.7995969Z Attempting uninstall: s3transfer\n2022-03-21T21:16:38.7998039Z Found existing installation: s3transfer 0.3.7\n2022-03-21T21:16:38.8066994Z Uninstalling s3transfer-0.3.7:\n2022-03-21T21:16:38.8072844Z Successfully uninstalled s3transfer-0.3.7\n2022-03-21T21:16:38.8449275Z Attempting uninstall: boto3\n2022-03-21T21:16:38.8451430Z Found existing installation: boto3 1.16.34\n2022-03-21T21:16:38.8559828Z Uninstalling boto3-1.16.34:\n2022-03-21T21:16:38.8574290Z Successfully uninstalled boto3-1.16.34\n2022-03-21T21:16:38.9100438Z Successfully installed boto3-1.19.12 botocore-1.22.12 s3transfer-0.5.2\n2022-03-21T21:16:38.9558098Z ++ python3 .github/scripts/get_workflow_job_id.py 2018440039 i-0d779c59d277d32ee\n2022-03-21T21:16:38.9646300Z python3: can't open file '.github/scripts/get_workflow_job_id.py': [Errno 2] No such file or directory\n2022-03-21T21:16:38.9658894Z + GHA_WORKFLOW_JOB_ID=\n2022-03-21T21:16:38.9673240Z ##[error]Process completed with exit code 2.\n2022-03-21T21:16:38.9720106Z ##[group]Run pytorch/pytorch/.github/actions/teardown-linux@master\n2022-03-21T21:16:38.9720333Z with:\n2022-03-21T21:16:38.9720485Z env:\n2022-03-21T21:16:38.9720645Z IN_CI: 1\n2022-03-21T21:16:38.9720793Z IS_GHA: 1\n2022-03-21T21:16:38.9720970Z GIT_DEFAULT_BRANCH: master\n2022-03-21T21:16:38.9721151Z ##[endgroup]\n2022-03-21T21:16:38.9736762Z ##[group]Run # ignore expansion of \"docker ps -q\" since it could be empty\n\n\n\nThis comment was 
automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "createdAt": "2021-11-10T08:42:52Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 964902894 }, { - "login": "huydhn" + "bodyText": "@vitaly-fedyunin @gottbrath FYI that this is the oneDNN Graph API integration. It depends on the #63748.", + "createdAt": "2021-11-16T16:36:52Z", + "author": { + "login": "Jianhui-Li" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 970451860 }, { - "login": "b0noI" + "bodyText": "CI failures are currently being caused by some issues in the CI infra, and are also occurring with other PRs.", + "createdAt": "2021-12-10T05:59:17Z", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 990641309 }, { - "login": "seemethere" + "bodyText": "CI failures are unrelated.", + "createdAt": "2021-12-10T20:44:09Z", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 991281407 }, { - "login": "malfet" + "bodyText": "The CI failure is unrelated.", + "createdAt": "2021-12-16T02:45:59Z", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 995389295 }, { - "login": "DanilBaibak" + "bodyText": "Hi, thank you for the PR!\nDo you mind running a larger amount of torchbench and reporting numbers ? You can look at Jason's post here for what models are supported in script. Initially just the vision models would be useful. @Krovatkin also did some benchmarking of a traced Bert model and found on average a ~16% speedup with this PR.", + "createdAt": "2022-01-18T18:22:34Z", + "author": { + "login": "eellison" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1015689390 }, { - "login": "ZainRizvi" + "bodyText": "Thanks a lot for reviewing, @eellison & @Krovatkin!\nWe just wanted to let you know that we're working on the benchmarking & will get back to you in a day, or two.\nUPDATE (Jan 21): While running some TorchBench models, we discovered some composability issues, and are working to ensure that oneDNN Graph would complement PyTorch's existing fusion capabilities, not hinder them.\nUPDATE (Jan 24): We've resolved the issues & will update this PR later today. Thanks!", + "createdAt": "2022-01-20T00:31:01Z", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "COLLABORATOR", + "editor": { + "login": "sanchitintel" + }, + "databaseId": 1016996190 }, { - "login": "jeanschmidt" + "bodyText": "Hello @eellison,\nWe used this TorchBench branch for comparison. compare_llga.sh can be run for comparison.\nFor benchmarking mobilenet_v3_large with hardswish support in oneDNN Graph, this oneDNN Graph branch can be used in third_party/ideep/mkl-dnn. 
It delivers a speedup over PyTorch JIT (NNC + OFI) because 21 additional reorders are prevented (the major factor here), and fusion with conv also helps further.\nThe next release of oneDNN Graph would have hardswish support.\nWe're also exploring adding a hardsigmoid op in oneDNN Graph.\nThank you!", + "createdAt": "2022-01-26T23:51:38Z", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "COLLABORATOR", + "editor": { + "login": "sanchitintel" + }, + "databaseId": 1022709513 }, { - "login": "atalman" + "bodyText": "Please note that this PR should be merged after #71546, as #71546 changes the third_party/ideep commit (this PR also uses that ideep commit, but it'd probably be better to merge #71546 first, so that oneDNN v2.5.2 upgrade would be in a separate PR). Thank you!", + "createdAt": "2022-01-31T23:57:21Z", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1026330085 }, { - "login": "mehtanirav" + "bodyText": "@sanchitintel mind rebasing and i'll land ?", + "createdAt": "2022-03-01T20:07:57Z", + "author": { + "login": "eellison" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1055813984 }, { - "login": "osalpekar" + "bodyText": "@eellison has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", + "createdAt": "2022-03-02T17:44:47Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1057203495 + }, + { + "bodyText": "Thanks a lot for taking a look, @eellison! To fix this error, we would enable Bazel build for oneDNN Graph.", + "createdAt": "2022-03-07T23:03:45Z", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "COLLABORATOR", + "editor": { + "login": "sanchitintel" + }, + "databaseId": 1061230087 }, { - "login": "swang392" + "bodyText": "@eellison has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", + "createdAt": "2022-03-09T19:24:13Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1063276600 }, { - "login": "janeyx99" + "bodyText": "@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", + "createdAt": "2022-03-21T19:59:41Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1074355779 }, { - "login": "clee2000" + "bodyText": "And graph_rewriter.cpp is full of DOS newlines...", + "createdAt": "2022-03-21T20:53:40Z", + "author": { + "login": "malfet" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1074407452 }, { - "login": "izaitsevfb" + "bodyText": "Hey @chunyuan-w.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' 
and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "createdAt": "2022-03-21T22:12:51Z", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1074471758 }, { - "login": "weiwangmeta" + "bodyText": "Thanks a ton for your help, @malfet & @eellison! :)\nWe'll incorporate your suggestions in subsequent PR(s).", + "createdAt": "2022-03-21T22:41:25Z", + "author": { + "login": "sanchitintel" + }, + "authorAssociation": "COLLABORATOR", + "editor": { + "login": "sanchitintel" + }, + "databaseId": 1074492365 } ], "pageInfo": { - "hasNextPage": false, - "endCursor": "Y3Vyc29yOnYyOpHOBoQSVA==" + "startCursor": "Y3Vyc29yOnYyOpHOOYM_0Q==", + "hasPreviousPage": false } } } } } }, - "query_sha=a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5 cursor=None name=qwertyuiop org=pytorch": { - "data": { - "organization": { - "team": null - } - } - }, - "query_sha=0e2a29eda6405cea4c9de20fb80ae7924910e17272a7b251040182e7d8c390e0 name=pytorch number=82169 owner=pytorch": { + "query_sha=81fd873151c3cded18314e9e53bf54a93ffb0afa9c52fa2cbafb2ceab7df5e45 name=pytorch number=73969 owner=pytorch": { "data": { "repository": { "pullRequest": { "closed": true, - "isCrossRepository": false, + "isCrossRepository": true, "author": { - "login": "ezyang" + "login": "malfet" }, - "title": "Move test_dtypes so it runs later", - "body": "Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):\n* __->__ #82169\n\nThe error messages it gives are very unhelpful (because a failure\ngets translated into \"dtype was not supported\" rather than the\nactual backtrace), so I'd rather get error messages about this after\nI've tested basic functionality.\n\nSigned-off-by: Edward Z. Yang ", - "headRefName": "gh/ezyang/1279/head", + "title": "Dummy change", + "body": "Test Plan: None at all\n\nDifferential Revision: D34753911\n\n", + "headRefName": "export-D34753911", "headRepository": { - "nameWithOwner": "pytorch/pytorch" + "nameWithOwner": "malfet/pytorch" }, - "baseRefName": "gh/ezyang/1279/base", + "baseRefName": "master", "baseRepository": { "nameWithOwner": "pytorch/pytorch", "isPrivate": false, @@ -17980,44 +32251,20 @@ "commit": { "author": { "user": { - "login": "ezyang" - }, - "email": "ezyang@fb.com", - "name": "Edward Z. Yang" - }, - "oid": "cef34da55a59da5a32494bff218ccd4978b659d3" - } - }, - { - "commit": { - "author": { - "user": { - "login": "ezyang" - }, - "email": "ezyang@fb.com", - "name": "Edward Z. Yang" - }, - "oid": "83ad7e73a07111ac1d85e931d14360cc22c01edd" - } - }, - { - "commit": { - "author": { - "user": { - "login": "ezyang" + "login": "malfet" }, - "email": "ezyang@fb.com", - "name": "Edward Z. 
Yang" + "email": "nshulga@fb.com", + "name": "Nikita Shulga" }, - "oid": "28140e4008289251b695385acfb48ac7a47cd49c" + "oid": "4746da707a9912356f5179625da89616b228dc21" } } ], "pageInfo": { - "endCursor": "Mw", + "endCursor": "MQ", "hasNextPage": false }, - "totalCount": 3 + "totalCount": 1 }, "commits": { "nodes": [ @@ -18033,61 +32280,358 @@ }, "workflowRun": { "workflow": { - "name": "Lint" + "name": "linux-vulkan-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280134/jobs/2794078044" + }, + { + "name": "test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280134/jobs/2794189060" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRQMQ=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592963" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-QM=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280135/jobs/2794078023" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2aM=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592965" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-QU=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-rocm4.5-py3.7" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280132/jobs/2794078060" + }, + { + "name": "test (default, 1, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280132/jobs/2794292071" + }, + { + "name": "test (default, 2, 2, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280132/jobs/2794292205" + }, + { + "name": "test (distributed, 1, 1, linux.rocm.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280132/jobs/2794292306" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbTiXw=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592966" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-QY=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cuda11.3-py3" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280139/jobs/2794078053" + }, + { + "name": "test (force_on_cpu, 1, 1, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280139/jobs/2794536907" + }, + { + "name": "test (default, 2, 2, 
windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280139/jobs/2794536998" + }, + { + "name": "test (default, 1, 2, windows.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280139/jobs/2794537089" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbY_vU=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592967" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Qc=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280136/jobs/2794078031" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2ao=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592969" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Qk=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-docs" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280138/jobs/2794078055" + }, + { + "name": "build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280138/jobs/2794183768" + }, + { + "name": "build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280138/jobs/2794183828" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRIt0=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592970" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Qo=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7" } }, "checkRuns": { "nodes": [ { - "name": "lintrunner", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543705427?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280140/jobs/2794078017" }, { - "name": "Test collect_env (with_torch)", + "name": "test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543705796?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280140/jobs/2794181109" }, { - "name": "Test collect_env (without_torch)", + "name": "test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543705914?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280140/jobs/2794181305" }, { - "name": "Test collect_env (older_python_version)", + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/1958280140/jobs/2794181488" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbRFm4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592971" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Qs=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-build" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280143/jobs/2794078025" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2aw=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592974" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Q4=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "Lint" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "shellcheck", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543706071?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078028" }, { "name": "quick-checks", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543706300?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078196" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078407" + }, + { + "name": "clang-format", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078610" + }, + { + "name": "cmakelint", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078760" }, { "name": "toc", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543706581?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078898" }, { - "name": "Test tools", + "name": "py2-setup-validate-errormsg", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543706911?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794078999" }, { - "name": "workflow-checks", + "name": "flake8-py3", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794079087" + }, + { + "name": "mypy", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543707223?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280145/jobs/2794079199" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcGj1lc=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO4Es=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696649" + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592975" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRc8k=" + 
"cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-Q8=" }, { "node": { @@ -18097,21 +32641,185 @@ }, "workflowRun": { "workflow": { - "name": "TorchBench CI (pytorch-linux-py3.7-cu102)" + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit" } }, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build-and-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1958280146/jobs/2794078040" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAUbO2b0=", "hasNextPage": false } }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696651" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/4746da707a9912356f5179625da89616b228dc21/checks?check_suite_id=5595592976" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRc8s=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAU2F-RA=" + } + ], + "pageInfo": { + "hasNextPage": true + } + }, + "status": { + "contexts": [ + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17040614?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17040643?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17040615?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + } + ] + }, + "pushedDate": "2022-03-09T15:57:16Z", + "oid": "4746da707a9912356f5179625da89616b228dc21" + } + } + ] + }, + "changedFiles": 1, + "files": { + "nodes": [ + { + "path": "tools/build_variables.bzl" + } + ], + "pageInfo": { + "endCursor": "MQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [], + "pageInfo": { + "startCursor": null, + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ + { + "bodyText": "CI Flow Status\n\u269b\ufe0f CI Flow\nRuleset - Version: v1\nRuleset - File: https://github.com/malfet/pytorch/blob/4746da707a9912356f5179625da89616b228dc21/.github/generated-ciflow-ruleset.json\nPR ciflow labels: ciflow/default\nAdd ciflow labels to this PR to trigger more builds:\n\n\n\nWorkflows\nLabels (bold enabled)\nStatus\n\n\n\n\nTriggered Workflows\n\n\n\n\nlinux-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nlinux-binary-libtorch-cxx11-abi\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nlinux-binary-libtorch-pre-cxx11\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nlinux-binary-manywheel\nciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nlinux-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk\n\u2705 triggered\n\n\nlinux-bionic-rocm4.5-py3.7\nciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk\n\u2705 triggered\n\n\nlinux-docs\nciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk\n\u2705 
triggered\n\n\nlinux-vulkan-bionic-py3.7-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-bazel-test\nciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-build\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3-clang5-mobile-custom-build-static\nciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-asan\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-clang7-onnx\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build\nciflow/all, ciflow/cpu, ciflow/default, ciflow/libtorch, ciflow/linux, ciflow/mobile, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nlinux-xenial-py3.7-gcc7-no-ops\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nmacos-arm64-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nmacos-arm64-binary-wheel\nciflow/binaries, ciflow/binaries_wheel, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-libtorch-cxx11-abi\nciflow/binaries, ciflow/binaries_libtorch, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-libtorch-pre-cxx11\nciflow/binaries, ciflow/binaries_libtorch, ciflow/default\n\u2705 triggered\n\n\nmacos-binary-wheel\nciflow/binaries, ciflow/binaries_wheel, ciflow/default\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit\nciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk\n\u2705 triggered\n\n\nwin-vs2019-cpu-py3\nciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nwin-vs2019-cuda11.3-py3\nciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win\n\u2705 triggered\n\n\nwindows-binary-conda\nciflow/binaries, ciflow/binaries_conda, ciflow/default\n\u2705 triggered\n\n\nwindows-binary-libtorch-debug\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nwindows-binary-libtorch-release\nciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nwindows-binary-wheel\nciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk\n\u2705 triggered\n\n\nSkipped Workflows\n\n\n\n\ncaffe2-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\ndocker-builds\nciflow/all, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab 
skipped\n\n\nios-12-5-1-arm64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-custom-ops\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-arm64-metal\nciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nios-12-5-1-x86-64-coreml\nciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda10.2-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda11.3-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-bionic-cuda10.2-py3.9-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk\n\ud83d\udeab skipped\n\n\nlinux-docs-push\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nlinux-xenial-cuda11.3-py3.7-gcc7-no-ops\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-arm64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-10-15-py3-lite-interpreter-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nmacos-11-py3-x86-64\nciflow/all, ciflow/macos, ciflow/trunk\n\ud83d\udeab skipped\n\n\nparallelnative-linux-xenial-py3.7-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\nperiodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-bionic-cuda11.5-py3.7-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda11.3-py3.7-gcc7-debug\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-win-vs2019-cuda11.5-py3\nciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win\n\ud83d\udeab skipped\n\n\npytorch-linux-xenial-py3-clang5-android-ndk-r19c-build\nciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk\n\ud83d\udeab skipped\n\n\npytorch-xla-linux-bionic-py3.7-clang8\nciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla\n\ud83d\udeab skipped", + "createdAt": "2022-03-09T15:57:11Z", + "author": { + "login": "pytorch-bot" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1063079053 + }, + { + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/73969\n\ud83d\udcc4 \u00a0Preview docs built from this PR\n\ud83d\udcc4 \u00a0Preview C++ docs built from this PR\n\ud83d\udd27 \u00a0Opt-in to CIFlow to control what jobs run on your PRs\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 4746da7 (more details on the Dr. CI page):\n\n\ud83d\udc9a \ud83d\udc9a Looks good so far! There are no failures yet. \ud83d\udc9a \ud83d\udc9a\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. 
CI Users group.\nClick here to manually regenerate this comment.", + "createdAt": "2022-03-09T15:57:12Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 1063079113 + }, + { + "bodyText": "This pull request was exported from Phabricator. Differential Revision: D34753911", + "createdAt": "2022-03-09T15:57:34Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1063079731 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOP11MjQ==", + "hasPreviousPage": false + } + }, + "labels": { + "edges": [ + { + "node": { + "name": "fb-exported" + } + }, + { + "node": { + "name": "cla signed" + } + } + ] + } + } + } + } + }, + "query_sha=81fd873151c3cded18314e9e53bf54a93ffb0afa9c52fa2cbafb2ceab7df5e45 name=pytorch number=73099 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": false, + "author": { + "login": "BowenBao" + }, + "title": "[ONNX] Make graph name spec-compliant (#71961)", + "body": "Stack from [ghstack](https://github.com/ezyang/ghstack):\n* #73104\n* #73103\n* #73102\n* #73101\n* #73100\n* __->__ #73099\n\n[According to the ONNX spec](https://github.com/onnx/onnx/blob/main/docs/IR.md#names-within-a-graph),\nall names must adhere to C90 identifier syntax rules, which means no\ndashes.\n\nFixes: #30952", + "headRefName": "gh/BowenBao/138/head", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "gh/BowenBao/138/base", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "BowenBao" + }, + "email": "bowbao@microsoft.com", + "name": "BowenBao" + }, + "oid": "3038b939eb2069653305c419326a0f47d2598e39" + } + } + ], + "pageInfo": { + "endCursor": "MQ", + "hasNextPage": false + }, + "totalCount": 1 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "edges": [ { "node": { "app": { @@ -18128,18 +32836,18 @@ { "name": "run-torchbench", "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543705420?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041786/jobs/2626264278" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcGjz0w=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNn9o=", "hasNextPage": false } }, "conclusion": "SKIPPED", - "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696656" + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189561" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRc9A=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS7k=" }, { "node": { @@ -18149,20 +32857,41 @@ }, "workflowRun": { "workflow": { - "name": "Lint" + "name": "linux-xenial-cuda11.3-py3.7-gcc7" } }, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041785/jobs/2626264385" + }, + { + "name": "test (default, 1, 2, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041785/jobs/2626417658" + }, + { + "name": "test (default, 2, 2, linux.4xlarge.nvidia.gpu)", + 
"conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041785/jobs/2626417743" + }, + { + "name": "test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041785/jobs/2626417885" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkRE_E=", "hasNextPage": false } }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696660" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189562" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRc9Q=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS7o=" }, { "node": { @@ -18172,20 +32901,26 @@ }, "workflowRun": { "workflow": { - "name": "pull" + "name": "linux-xenial-py3.7-gcc7-no-ops" } }, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041789/jobs/2626264416" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoJE=", "hasNextPage": false } }, - "conclusion": "CANCELLED", - "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696715" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189563" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdAs=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS7s=" }, { "node": { @@ -18195,447 +32930,659 @@ }, "workflowRun": { "workflow": { - "name": "pull" + "name": "linux-xenial-py3-clang5-mobile-build" } - }, - "checkRuns": { - "nodes": [ - { - "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543706290?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543706587?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543706915?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-clang7-asan / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543707231?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3_7-clang8-xla / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543707459?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7-no-ops / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543707794?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543708127?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543708379?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/7543708606?check_suite_focus=true" - }, - { - "name": "win-vs2019-cuda11.6-py3 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543709052?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543709309?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543709535?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7-mobile-lightweight-dispatch-build / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543709809?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-clang10-onnx / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543709986?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-mobile-build / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543710238?check_suite_focus=true" - }, - { - "name": "linux-focal-rocm5.2-py3.7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543710467?check_suite_focus=true" - }, - { - "name": "linux-jammy-cuda11.6-cudnn8-py3.8-clang12 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543710675?check_suite_focus=true" - }, - { - "name": "linux-vulkan-bionic-py3.7-clang9 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543710925?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11_3-py3_7-gcc7-deploy / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543711166?check_suite_focus=true" - }, - { - "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7543711347?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544378552?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544378697?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544378800?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544378922?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (dynamo, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544379063?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (dynamo, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544379177?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (functorch, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/7544379274?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-clang10-onnx / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544414957?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-clang10-onnx / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544415089?check_suite_focus=true" - }, - { - "name": "linux-docs / build-docs (cpp)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544418146?check_suite_focus=true" - }, - { - "name": "linux-docs / build-docs (python)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544418325?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544418649?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544418760?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544418892?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / test (functorch, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544418988?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / test (docs_test, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544419111?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / test (jit_legacy, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544419210?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-gcc7 / test (backwards_compat, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544419367?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544420236?check_suite_focus=true" - }, - { - "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544427790?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-clang7-asan / test (default, 1, 5, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544526201?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-clang7-asan / test (default, 2, 5, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544526466?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-clang7-asan / test (default, 3, 5, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544526651?check_suite_focus=true" - }, - { - "name": "linux-focal-py3.7-clang7-asan / test (default, 4, 5, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544526810?check_suite_focus=true" - }, + }, + "checkRuns": { 
+ "nodes": [ { - "name": "linux-focal-py3.7-clang7-asan / test (default, 5, 5, linux.2xlarge)", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544526939?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041787/jobs/2626264407" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoIY=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189564" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS7w=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test" + } + }, + "checkRuns": { + "nodes": [ { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", + "name": "build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544790873?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041788/jobs/2626264422" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoJs=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189566" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS74=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-bionic-py3.7-clang9" + } + }, + "checkRuns": { + "nodes": [ { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544790983?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041790/jobs/2626264414" }, { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", + "name": "test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544791069?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041790/jobs/2626349405" }, { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", + "name": "test (noarch, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544791145?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041790/jobs/2626349522" }, { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", + "name": "test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544791233?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041790/jobs/2626349618" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcG0YME=", - "hasNextPage": true + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPiwA=", + "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696836" + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189567" }, - 
"cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdIQ=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS78=" }, { "node": { "app": { - "name": "Facebook GitHub Tools", - "databaseId": 12274 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-vulkan-bionic-py3.7-clang9" + } }, - "workflowRun": null, "checkRuns": { "nodes": [ { - "name": "Facebook CLA Check", + "name": "build", "conclusion": "SUCCESS", - "detailsUrl": "https://code.intern.facebook.com/cla/" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041793/jobs/2626264431" + }, + { + "name": "test (default, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041793/jobs/2626359364" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcGjyQg=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPxgQ=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546696896" + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189568" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdMA=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS8A=" }, { "node": { "app": { - "name": "Netlify", - "databaseId": 13473 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3-clang5-mobile-custom-build-static" + } }, - "workflowRun": null, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041792/jobs/2626264427" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkNoKA=", "hasNextPage": false } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546697185" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189570" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdeE=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS8I=" }, { "node": { "app": { - "name": "Azure Pipelines", - "databaseId": 9426 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "win-vs2019-cpu-py3" + } }, - "workflowRun": null, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041791/jobs/2626264386" + }, + { + "name": "test (default, 1, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041791/jobs/2626722677" + }, + { + "name": "test (default, 2, 2, windows.4xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041791/jobs/2626722710" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkX070=", "hasNextPage": false } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546697205" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189571" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdfU=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS8M=" }, { "node": { "app": { - "name": "Dependabot", - 
"databaseId": 29110 + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-xenial-py3.7-gcc7" + } }, - "workflowRun": null, "checkRuns": { - "nodes": [], + "nodes": [ + { + "name": "build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041803/jobs/2626264401" + }, + { + "name": "test (distributed, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041803/jobs/2626349045" + }, + { + "name": "test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041803/jobs/2626349141" + }, + { + "name": "test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/1866041803/jobs/2626349272" + } + ], "pageInfo": { - "endCursor": null, + "endCursor": "Y3Vyc29yOnYyOpHPAAAAATkPiQA=", "hasNextPage": false } }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546697224" + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/3038b939eb2069653305c419326a0f47d2598e39/checks?check_suite_id=5365189572" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdgg=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAT_KS8Q=" } ], "pageInfo": { "hasNextPage": true } }, - "pushedDate": "2022-07-27T15:34:17Z", - "oid": "28140e4008289251b695385acfb48ac7a47cd49c" + "status": { + "contexts": [ + { + "context": "ci/circleci: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17010288?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17010289?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17010488?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + }, + { + "context": "ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build", + "state": "SUCCESS", + "targetUrl": "https://circleci.com/gh/pytorch/pytorch/17010326?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link" + } + ] + }, + "pushedDate": "2022-02-18T18:46:28Z", + "oid": "3038b939eb2069653305c419326a0f47d2598e39" } } ] }, - "changedFiles": 1, + "changedFiles": 162, "files": { "nodes": [ { - "path": "test/test_ops.py" + "path": "test/onnx/expect/TestOperators.test_acos.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_left_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_size1_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_size1_right_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_add_size1_singleton_broadcast.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_addconstant.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_addmm.expect" + }, + { + "path": 
"test/onnx/expect/TestOperators.test_arange_dynamic.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_argmax.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_asin.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_at_op.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_atan.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_aten_embedding_1.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_aten_embedding_2.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_avg_pool2d.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_baddbmm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_basic.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm_1d.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm_noaffine.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm_onnx_irv4.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_batchnorm_training.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_bitshift.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_c2_op.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_chunk.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_clip.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_clip_max.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_clip_min.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_concat2.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_conv.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_conv_onnx_irv4.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_conv_onnx_irv4_opset8.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_convtranspose.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_cos.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_cumsum.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_det.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dict.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dict_str.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dim.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dropout.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dropout_default.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dropout_opset12.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dropout_training.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dropout_training_opset12.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_add.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_add_inputs_same_symbolic_shape.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_matmul.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_reduce_mean.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_dynamic_axes_unchange.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_elu.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_embedding_bags.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_empty_like.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_empty_like_opset7.expect" + }, + { + "path": 
"test/onnx/expect/TestOperators.test_equal.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_erf.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_exp.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_expand.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_flatten.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_flatten2D.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_fmod.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_frobenius_norm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_full.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_full_like.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_gather.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_gather_opset11.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_ge.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_gelu.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_gt.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_hardtanh.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_implicit_expand.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_index.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_isnan.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_layer_norm_aten.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_le.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_linear.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_log_sigmoid.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_logsoftmax.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_lstm_none_sequence_lens.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_lt.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_master_opset.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_max.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_maxpool.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_maxpool_dilations.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_maxpool_indices.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_mean.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_mean_dtype.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_meshgrid.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_min.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_mm.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_narrow.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_ne.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_nonzero.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_norm_p1.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_norm_p2.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_ones_like.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_pad.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_params.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_params_onnx_irv4.expect" + }, + { + "path": "test/onnx/expect/TestOperators.test_permute2.expect" } ], "pageInfo": { - "endCursor": "MQ", - "hasNextPage": false + "endCursor": "MTAw", + "hasNextPage": true } }, "reviews": { "nodes": [ { "author": { - "login": "zou3519" - }, - "state": "APPROVED" - }, - { - "author": { - "login": 
"Chillee" + "login": "garymm" }, "state": "APPROVED" } ], "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNy0yNVQxNDo0NTozNS0wNzowMLkyMDIyLTA3LTI1VDE0OjQ1OjM1LTA3OjAwzj6XYmg=", + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wMi0xOFQxNzoxODo0NC0wODowMLkyMDIyLTAyLTE4VDE3OjE4OjQ0LTA4OjAwzjTr0H0=", "hasPreviousPage": false } }, "comments": { "nodes": [ { - "bodyText": "@pytorchbot merge -f FORCE", + "bodyText": "This PR cannot be merged by bot due to changing > 100 files. @malfet \n \n \n pytorch/.github/scripts/trymerge.py\n \n \n Line 63\n in\n 932adf2\n \n \n \n \n\n \n \n files(last: 100) { \n \n \n \n\n Can this be relaxed? If not please import.", + "createdAt": "2022-02-22T18:22:40Z", "author": { - "login": "malfet" + "login": "BowenBao" }, - "authorAssociation": "MEMBER", + "authorAssociation": "COLLABORATOR", "editor": null, - "databaseId": 1197107402 + "databaseId": 1048084569 }, { - "bodyText": "You need to provide a reason for using force merge, in the format @pytorchbot merge -f '[CATEGORY] Explanation'. With [CATEGORY] being one the following:\nEMERGENCY - an emergency fix to quickly address an issue\nMINOR - a minor fix such as cleaning locally unused variables, which shouldn't break anything\nPRE_TESTED - a previous CI run tested everything and you've only added minor changes like fixing lint\nOTHER - something not covered above", + "bodyText": "This PR cannot be merged by bot due to changing > 100 files. @malfet\nCan this be relaxed? If not please import.\n\nWow, you've hit a really interesting problem. 100 is a limitation enforced by GitHub, see https://docs.github.com/en/graphql/overview/resource-limitations, but I can implement a pagination. Do you mind keeping it like that for a bit, want to land a fix soonish.", + "createdAt": "2022-02-22T18:27:29Z", "author": { - "login": "pytorch-bot" + "login": "malfet" }, - "authorAssociation": "NONE", + "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1197107439 + "databaseId": 1048088691 }, { - "bodyText": "@pytorchbot merge -f \"[OTHER] normal land failed twice already\"", + "bodyText": "@malfet Thank you for info. Sure, I have separated the rest of stack from this one, we'll wait for the fix to try again.", + "createdAt": "2022-02-22T18:29:48Z", "author": { - "login": "malfet" + "login": "BowenBao" }, - "authorAssociation": "MEMBER", + "authorAssociation": "COLLABORATOR", "editor": null, - "databaseId": 1197108130 + "databaseId": 1048090640 }, { - "bodyText": "@pytorchbot successfully started a merge job. Check the current status here", + "bodyText": "@pytorchbot merge this", + "createdAt": "2022-02-24T21:42:36Z", "author": { - "login": "pytorchmergebot" + "login": "BowenBao" }, - "authorAssociation": "MEMBER", + "authorAssociation": "COLLABORATOR", "editor": null, - "databaseId": 1197119348 + "databaseId": 1050293881 }, { - "bodyText": "Hey @ezyang.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' 
and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "bodyText": "Hey @BowenBao.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "createdAt": "2022-02-24T21:44:39Z", "author": { "login": "github-actions" }, "authorAssociation": "NONE", "editor": null, - "databaseId": 1197120095 + "databaseId": 1050295451 } ], "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHOR1poyg==", + "startCursor": "Y3Vyc29yOnYyOpHOPniAWQ==", "hasPreviousPage": true } }, @@ -18643,283 +33590,91 @@ "edges": [ { "node": { - "name": "Merged" + "name": "oncall: jit" } }, { "node": { - "name": "cla signed" - } - } - ] - } - } - } - } - }, - "query_sha=4c16925415d1fcc12ac0f5f7ce73b8e6122997d2f51c4c2757c2543e6493c60d cr_cursor=Y3Vyc29yOnYyOpHPAAAAAcG0YME= cs_cursor=Y3Vyc29yOnYyOpHPAAAAAcHRdAs= name=pytorch number=82169 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "commits": { - "nodes": [ - { - "commit": { - "oid": "28140e4008289251b695385acfb48ac7a47cd49c", - "checkSuites": { - "nodes": [ - { - "checkRuns": { - "nodes": [ - { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544791308?check_suite_focus=true" - }, - { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (functorch, 1, 1, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544791418?check_suite_focus=true" - }, - { - "name": "linux-xenial-cuda11_3-py3_7-gcc7-deploy / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544791778?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544877177?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544877276?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / test (functorch, 1, 1, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7544877367?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAcG1sTc=", - "hasNextPage": false - } - } - } - ] - } - } - } - ] - } - } - } - } - }, - "query_sha=4fa42dda073cf7ac75b2bbf595a8ef67b6dfff4bd248668750ff33ea913bf75f cursor=Y3Vyc29yOnYyOpHPAAAAAcHRdgg= name=pytorch number=82169 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "commits": { - "nodes": [ - { - "commit": { - "oid": "28140e4008289251b695385acfb48ac7a47cd49c", - "checkSuites": { - "edges": [ - { - "node": { - "app": { - "name": "Codecov", - "databaseId": 254 - }, - "workflowRun": null, - "checkRuns": { - "nodes": [], - "pageInfo": { - 
"endCursor": null, - "hasNextPage": false - } - }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546697240" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdhg=" - }, - { - "node": { - "app": { - "name": "PyTorch Bot", - "databaseId": 40112 - }, - "workflowRun": null, - "checkRuns": { - "nodes": [], - "pageInfo": { - "endCursor": null, - "hasNextPage": false - } - }, - "conclusion": null, - "url": "https://github.com/pytorch/pytorch/commit/28140e4008289251b695385acfb48ac7a47cd49c/checks?check_suite_id=7546697255" - }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAcHRdic=" - } - ], - "pageInfo": { - "hasNextPage": false - } - } - } - } - ] - } - } - } - } - }, - "query_sha=0e2a29eda6405cea4c9de20fb80ae7924910e17272a7b251040182e7d8c390e0 name=pytorch number=79694 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "closed": false, - "isCrossRepository": true, - "author": { - "login": "kshitij12345" - }, - "title": "[complex] conv_transpose1d", - "body": "Reference: https://github.com/pytorch/pytorch/issues/71108", - "headRefName": "develop/complex/conv_transpose1d", - "headRepository": { - "nameWithOwner": "kshitij12345/pytorch" - }, - "baseRefName": "master", - "baseRepository": { - "nameWithOwner": "pytorch/pytorch", - "isPrivate": false, - "defaultBranchRef": { - "name": "master" - } - }, - "mergeCommit": null, - "commits_with_authors": { - "nodes": [ - { - "commit": { - "author": { - "user": { - "login": "kshitij12345" - }, - "email": "kshitijkalambarkar@gmail.com", - "name": "kshitij12345" - }, - "oid": "d1ea948e65ac6d31ad056287ab65d38ecc68b30d" - } - }, - { - "commit": { - "author": { - "user": { - "login": "kshitij12345" - }, - "email": "kshitijkalambarkar@gmail.com", - "name": "kshitij12345" - }, - "oid": "b4ba1db9a3a71bd8c03158dcd1b68711360633d8" - } - }, - { - "commit": { - "author": { - "user": { - "login": "kshitij12345" - }, - "email": "kshitijkalambarkar@gmail.com", - "name": "kshitij12345" - }, - "oid": "655a4220beae163bfe578f0318a130df01ec05d6" - } - }, - { - "commit": { - "author": { - "user": { - "login": "kshitij12345" - }, - "email": "kshitijkalambarkar@gmail.com", - "name": "Kshiteej K" - }, - "oid": "8181716be7a8005eb13ad5c3f2e1279ed1c60aff" + "name": "open source" } }, { - "commit": { - "author": { - "user": { - "login": "kshitij12345" - }, - "email": "kshitijkalambarkar@gmail.com", - "name": "kshitij12345" - }, - "oid": "9e5ca3663e7471786eeebebfdf84aea5d761712f" + "node": { + "name": "cla signed" } }, { - "commit": { - "author": { - "user": { - "login": "kshitij12345" - }, - "email": "kshitijkalambarkar@gmail.com", - "name": "kshitij12345" - }, - "oid": "9c110f39bcdc4e56386b6f9c4e2c082c8940ade6" + "node": { + "name": "release notes: onnx" } }, { - "commit": { - "author": { - "user": { - "login": "kshitij12345" - }, - "email": "kshitijkalambarkar@gmail.com", - "name": "kshitij12345" - }, - "oid": "49315e79d0eee8008e2a74575c6fc0f6a9531ee4" + "node": { + "name": "topic: bug fixes" } - }, + } + ] + } + } + } + } + }, + "query_sha=81fd873151c3cded18314e9e53bf54a93ffb0afa9c52fa2cbafb2ceab7df5e45 name=pytorch number=74649 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": false, + "author": { + "login": "malfet" + }, + "title": "This should fail flake8", + "body": "Test issue for GHF mandatory checks", + "headRefName": "malfet-patch-8", + "headRepository": { + "nameWithOwner": "pytorch/pytorch" + }, + "baseRefName": "master", + 
"baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ { "commit": { "author": { "user": { - "login": "kshitij12345" + "login": "malfet" }, - "email": "kshitijkalambarkar@gmail.com", - "name": "kshitij12345" + "email": "nshulga@fb.com", + "name": "Nikita Shulga" }, - "oid": "728752480760226270c374a0acc08e28b9b133f3" + "oid": "57c86ff1c5ab948888fd329986c9d55796680e33" } }, { "commit": { "author": { "user": { - "login": "kshitij12345" + "login": "malfet" }, - "email": "kshitijkalambarkar@gmail.com", - "name": "kshitij12345" + "email": "nshulga@fb.com", + "name": "Nikita Shulga" }, - "oid": "ffe43399d6f60ef7844523a5f465c11d9a67062f" + "oid": "6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4" } } ], "pageInfo": { - "endCursor": "OQ", + "endCursor": "Mg", "hasNextPage": false }, - "totalCount": 9 + "totalCount": 2 }, "commits": { "nodes": [ @@ -18943,14 +33698,109 @@ } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAboNCRo=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHsK3w=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/ffe43399d6f60ef7844523a5f465c11d9a67062f/checks?check_suite_id=7428002306" + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018129" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1E=" + }, + { + "node": { + "app": { + "name": "Netlify", + "databaseId": 13473 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018131" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1M=" + }, + { + "node": { + "app": { + "name": "Azure Pipelines", + "databaseId": 9426 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018132" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1Q=" + }, + { + "node": { + "app": { + "name": "Dependabot", + "databaseId": 29110 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018134" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1Y=" + }, + { + "node": { + "app": { + "name": "Codecov", + "databaseId": 254 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018139" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj1s=" + }, + { + "node": { + "app": { + "name": "PyTorch Bot", + "databaseId": 40112 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [], + "pageInfo": { + "endCursor": null, + "hasNextPage": false + } + }, + "conclusion": null, + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018142" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbq-UgI=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlj14=" }, { "node": { @@ 
-18966,55 +33816,75 @@ "checkRuns": { "nodes": [ { - "name": "lintrunner", + "name": "clang-format", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925132" + }, + { + "name": "clang-tidy", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925189" + }, + { + "name": "cmakelint", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925230" + }, + { + "name": "flake8-py3", + "conclusion": "FAILURE", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925307" + }, + { + "name": "mypy", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426574264?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925365" }, { "name": "Test collect_env (with_torch)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426574600?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925427" }, { "name": "Test collect_env (without_torch)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426574693?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925449" }, { - "name": "Test collect_env (older_python_version)", + "name": "Test tools", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426574832?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925537" }, { - "name": "Test tools", + "name": "py2-setup-validate-errormsg", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426575043?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925644" }, { - "name": "toc", + "name": "quick-checks", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426575297?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925688" }, { - "name": "workflow-checks", + "name": "toc", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426575617?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925809" }, { - "name": "quick-checks", + "name": "shellcheck", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426575807?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576283/jobs/2928925945" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbqojb8=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHsMiY=", "hasNextPage": false } }, - "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/ffe43399d6f60ef7844523a5f465c11d9a67062f/checks?check_suite_id=7437320797" + "conclusion": "FAILURE", + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018384" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbtMgl0=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlkFA=" }, { "node": { @@ -19032,18 +33902,18 @@ { "name": "run-torchbench", "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426574246?check_suite_focus=true" + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2031576288/jobs/2928925134" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbqoh6Y=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHsLW0=", "hasNextPage": false } }, "conclusion": "SKIPPED", - "url": "https://github.com/pytorch/pytorch/commit/ffe43399d6f60ef7844523a5f465c11d9a67062f/checks?check_suite_id=7437320800" + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018395" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbtMgmA=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlkFs=" }, { "node": { @@ -19059,265 +33929,561 @@ "checkRuns": { "nodes": [ { - "name": "linux-focal-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426574798?check_suite_focus=true" + "name": "pytorch-xla-linux-bionic-py3.7-clang8", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928935743" }, { - "name": "linux-focal-py3.7-gcc7-no-ops / build", + "name": "linux-vulkan-bionic-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426575118?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928935775" }, { "name": "linux-bionic-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426575476?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928935850" }, { - "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", + "name": "linux-bionic-rocm4.5-py3.7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426575622?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928935994" }, { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / build", + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426575875?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936064" }, { - "name": "linux-focal-py3.7-clang7-asan / build", + "name": "linux-xenial-py3.7-gcc5.4 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426576118?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936179" }, { - "name": "linux-focal-py3.7-clang10-onnx / build", + "name": "linux-xenial-py3-clang5-mobile-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426576360?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936265" }, { - "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "name": "linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426576522?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936309" }, { - "name": "linux-jammy-cuda11.6-cudnn8-py3.8-clang12 / build", + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426576694?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936353" }, 
{ - "name": "linux-focal-py3.7-gcc7-mobile-lightweight-dispatch-build / build", + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426576858?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936395" }, { - "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426577069?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936426" }, { - "name": "linux-focal-rocm5.1-py3.7 / build", + "name": "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426577340?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936483" }, { - "name": "win-vs2019-cuda11.6-py3 / build", + "name": "win-vs2019-cuda11.3-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426577507?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936516" }, { - "name": "linux-xenial-cuda11_3-py3_7-gcc7-deploy / build", + "name": "win-vs2019-cpu-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426577677?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936558" }, { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "name": "linux-xenial-py3.7-gcc7-no-ops / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426577906?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936633" }, { - "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "name": "linux-xenial-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426578065?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936705" }, { - "name": "linux-bionic-py3_7-clang8-xla / build", + "name": "deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426578285?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936736" }, { - "name": "linux-xenial-py3-clang5-mobile-build / build", + "name": "linux-xenial-py3.7-clang7-onnx / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936756" + }, + { + "name": "pytorch-xla-linux-bionic-py3.7-clang8", + "conclusion": "NEUTRAL", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936796" + }, + { + "name": "linux-xenial-py3.7-clang7-asan / build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928936823" + }, + { + "name": "linux-xenial-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/7426578423?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928990551" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "name": "linux-xenial-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426578533?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928990588" }, { - "name": "win-vs2019-cpu-py3 / build", + "name": "linux-docs / build-docs (cpp)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426578766?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928992832" }, { - "name": "linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", + "name": "linux-docs / build-docs (python)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426768328?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928992868" }, { - "name": "linux-focal-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426768494?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928992932" }, { - "name": "linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge)", + "name": "linux-xenial-py3.7-gcc5.4 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426768635?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928992965" }, { - "name": "linux-focal-py3.7-gcc7 / test (docs_test, 1, 1, linux.2xlarge)", + "name": "linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426768797?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928993011" }, { - "name": "linux-focal-py3.7-gcc7 / test (jit_legacy, 1, 1, linux.2xlarge)", + "name": "linux-xenial-py3.7-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426768904?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928993042" }, { - "name": "linux-docs / build-docs (cpp)", + "name": "linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426769059?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928993086" }, { - "name": "linux-docs / build-docs (python)", + "name": "linux-xenial-py3.7-gcc5.4 / test (jit_legacy, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426769221?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928993128" }, { "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426794528?check_suite_focus=true" + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928995802" }, { "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426794681?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426794811?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928995853" }, { - "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", + "name": "linux-bionic-py3.7-clang9 / test (noarch, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426794965?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928995889" }, { - "name": "linux-bionic-py3.7-clang9 / test (dynamo, 1, 2, linux.2xlarge)", + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426795132?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928997626" }, { - "name": "linux-bionic-py3.7-clang9 / test (dynamo, 2, 2, linux.2xlarge)", + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426795278?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928999058" }, { - "name": "linux-bionic-py3.7-clang9 / test (functorch, 1, 1, linux.2xlarge)", + "name": "linux-xenial-py3.7-clang7-onnx / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426795396?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2928999075" }, { - "name": "linux-focal-py3.7-clang10-onnx / test (default, 1, 2, linux.2xlarge)", + "name": "linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426815145?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929012407" }, { - "name": "linux-focal-py3.7-clang10-onnx / test (default, 2, 2, linux.2xlarge)", + "name": "linux-xenial-py3.7-clang7-asan / test (default, 2, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426815265?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929012438" }, { - "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "name": "linux-xenial-py3.7-clang7-asan / test (default, 3, 3, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426818878?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929012469" }, { - "name": "linux-focal-py3.7-clang7-asan / test (default, 1, 5, linux.2xlarge)", + "name": "linux-bionic-rocm4.5-py3.7 / test (default, 1, 2, linux.rocm.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426857383?check_suite_focus=true" + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929034328" }, { - "name": "linux-focal-py3.7-clang7-asan / test (default, 2, 5, linux.2xlarge)", + "name": "linux-bionic-rocm4.5-py3.7 / test (default, 2, 2, linux.rocm.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426857577?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929034340" }, { - "name": "linux-focal-py3.7-clang7-asan / test (default, 3, 5, linux.2xlarge)", + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426857720?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929040801" }, { - "name": "linux-focal-py3.7-clang7-asan / test (default, 4, 5, linux.2xlarge)", + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426857893?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929045939" }, { - "name": "linux-focal-py3.7-clang7-asan / test (default, 5, 5, linux.2xlarge)", + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 2, 2, linux.4xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426858145?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929046016" }, { - "name": "linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge)", + "name": "linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426883486?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929046063" }, { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", + "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426949849?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929082254" }, { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", + "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426950005?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929082275" }, { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", + "name": "win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426950152?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929157614" }, { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", + "name": "win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426950337?check_suite_focus=true" + "detailsUrl": 
"https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929157635" }, { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", + "name": "win-vs2019-cuda11.3-py3 / test (force_on_cpu, 1, 1, windows.4xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426950460?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2031576300/jobs/2929157656" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAVHxIT4=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4/checks?check_suite_id=5778018405" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAVhlkGU=" + } + ], + "pageInfo": { + "hasNextPage": false + } + }, + "status": null, + "pushedDate": "2022-03-24T00:42:33Z", + "oid": "6c3c3de6a5c1183d9a08f3c54148bc0b5de11bb4" + } + } + ] + }, + "changedFiles": 1, + "files": { + "nodes": [ + { + "path": "torch/nn/cpp.py" + } + ], + "pageInfo": { + "endCursor": "MQ", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "seemethere" + }, + "state": "APPROVED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wMy0yM1QxNTo1MDo0NS0wNzowMLkyMDIyLTAzLTIzVDE1OjUwOjQ1LTA3OjAwzjbPEDg=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ + { + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/74649\n\u21a9\ufe0f \u00a0[fb-only] Re-run with SSH instructions\nNeed help or want to give feedback on the CI? Visit our office hours\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 6c3c3de (more details on the Dr. CI page):\n\n\n1/1 failures introduced in this PR\n\n\n1 failure not recognized by patterns:\n\n\n\nJob\nStep\nAction\n\n\n\n\n Lint / flake8-py3\nFail if there were any warnings\n\ud83d\udd01 rerun\n\n\n\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. 
CI Users group.\nClick here to manually regenerate this comment.", + "createdAt": "2022-03-23T22:40:51Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 1076891218 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQDAOUg==", + "hasPreviousPage": false + } + }, + "labels": { + "edges": [ + { + "node": { + "name": "cla signed" + } + } + ] + } + } + } + } + }, + "query_sha=81fd873151c3cded18314e9e53bf54a93ffb0afa9c52fa2cbafb2ceab7df5e45 name=pytorch number=79694 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "closed": true, + "isCrossRepository": true, + "author": { + "login": "kshitij12345" + }, + "title": "[complex] conv_transpose1d", + "body": "Reference: https://github.com/pytorch/pytorch/issues/71108", + "headRefName": "develop/complex/conv_transpose1d", + "headRepository": { + "nameWithOwner": "kshitij12345/pytorch" + }, + "baseRefName": "master", + "baseRepository": { + "nameWithOwner": "pytorch/pytorch", + "isPrivate": false, + "defaultBranchRef": { + "name": "master" + } + }, + "mergeCommit": null, + "commits_with_authors": { + "nodes": [ + { + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "d1ea948e65ac6d31ad056287ab65d38ecc68b30d" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "b4ba1db9a3a71bd8c03158dcd1b68711360633d8" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "655a4220beae163bfe578f0318a130df01ec05d6" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "Kshiteej K" + }, + "oid": "8181716be7a8005eb13ad5c3f2e1279ed1c60aff" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "9e5ca3663e7471786eeebebfdf84aea5d761712f" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "9c110f39bcdc4e56386b6f9c4e2c082c8940ade6" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "49315e79d0eee8008e2a74575c6fc0f6a9531ee4" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "728752480760226270c374a0acc08e28b9b133f3" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "ffe43399d6f60ef7844523a5f465c11d9a67062f" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "9672a2198472567bae4ac6f55d004f7e1fa8a9fa" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "48a0ebf32b895286f036b36c871f671dc867e400" + } + }, + { + "commit": { + "author": { + "user": { + "login": 
"kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "52fbe80d5c8a94e03d816c0bd21fd82019dcd5ac" + } + }, + { + "commit": { + "author": { + "user": { + "login": "kshitij12345" + }, + "email": "kshitijkalambarkar@gmail.com", + "name": "kshitij12345" + }, + "oid": "2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce" + } + } + ], + "pageInfo": { + "endCursor": "MTM", + "hasNextPage": false + }, + "totalCount": 13 + }, + "commits": { + "nodes": [ + { + "commit": { + "checkSuites": { + "edges": [ + { + "node": { + "app": { + "name": "Facebook GitHub Tools", + "databaseId": 12274 + }, + "workflowRun": null, + "checkRuns": { + "nodes": [ { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "name": "Facebook CLA Check", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426950568?check_suite_focus=true" + "detailsUrl": "https://code.facebook.com/cla/" }, { - "name": "linux-xenial-cuda11_3-py3_7-gcc7-deploy / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "name": "Meta Internal-Only Changes Check", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7426961175?check_suite_focus=true" + "detailsUrl": "https://opensource.facebook.com/" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbqubxc=", - "hasNextPage": true + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdtq8Hc=", + "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/ffe43399d6f60ef7844523a5f465c11d9a67062f/checks?check_suite_id=7437320828" + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7929899098" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbtMgnw=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdioqFo=" }, { "node": { @@ -19335,18 +34501,18 @@ { "name": "run-torchbench", "conclusion": "NEUTRAL", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453692770?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393316/jobs/4628529923" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbxGU2I=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdqTEwk=", "hasNextPage": false } }, "conclusion": "SKIPPED", - "url": "https://github.com/pytorch/pytorch/commit/ffe43399d6f60ef7844523a5f465c11d9a67062f/checks?check_suite_id=7463496300" + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7929899387" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbzb6mw=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdioqXs=" }, { "node": { @@ -19364,53 +34530,58 @@ { "name": "lintrunner", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453692736?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628529910" }, { - "name": "toc", + "name": "quick-checks", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453693139?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628530162" }, { - "name": "workflow-checks", + "name": "Test collect_env (with_torch)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453693588?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628530698" }, { - "name": "quick-checks", + "name": "Test collect_env (without_torch)", "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/7453693942?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628530867" }, { - "name": "Test tools", + "name": "Test collect_env (older_python_version)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453694270?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628530989" }, { - "name": "Test collect_env (with_torch)", + "name": "pr-sanity-checks", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453694519?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628531151" }, { - "name": "Test collect_env (without_torch)", + "name": "workflow-checks", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453694654?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628531475" }, { - "name": "Test collect_env (older_python_version)", + "name": "Test tools", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628531753" + }, + { + "name": "toc", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453694759?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393315/jobs/4628531853" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbxGWyc=", + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdqTHFY=", "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/ffe43399d6f60ef7844523a5f465c11d9a67062f/checks?check_suite_id=7463496306" + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7929899388" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbzb6nI=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdioqXw=" }, { "node": { @@ -19426,273 +34597,478 @@ "checkRuns": { "nodes": [ { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / build", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453693883?check_suite_focus=true" - }, - { - "name": "linux-bionic-py3.7-clang9 / build", + "name": "linux-focal-py3.7-clang7-asan / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453694269?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628531149" }, { - "name": "linux-focal-py3.7-gcc7-no-ops / build", + "name": "linux-bionic-cuda11.6-py3.10-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453694482?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628531473" }, { - "name": "linux-bionic-py3_7-clang8-xla / build", + "name": "linux-bionic-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453694773?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628531754" }, { "name": "linux-jammy-cuda11.6-cudnn8-py3.8-clang12 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453695048?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628531857" }, { - "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", + "name": "linux-focal-py3.7-gcc7-pch 
/ build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453695376?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628532179" }, { - "name": "linux-focal-py3.7-gcc7 / build", + "name": "linux-focal-py3.7-clang10-onnx / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453695572?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628532543" }, { - "name": "linux-focal-py3.7-clang10-onnx / build", + "name": "linux-bionic-cuda11.3-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453695789?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628532694" }, { - "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", + "name": "linux-focal-py3.7-gcc7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453696094?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628532918" }, { - "name": "win-vs2019-cpu-py3 / build", + "name": "linux-vulkan-bionic-py3.7-clang9 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453696262?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628533033" }, { - "name": "linux-focal-py3.7-gcc7-mobile-lightweight-dispatch-build / build", + "name": "linux-focal-py3.7-gcc7-no-ops / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453696440?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628533181" }, { - "name": "linux-focal-py3.7-clang7-asan / build", + "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453696619?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628533420" }, { - "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", + "name": "linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453696913?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628533630" }, { - "name": "linux-focal-rocm5.1-py3.7 / build", + "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit / build-and-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453697192?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628533825" }, { - "name": "win-vs2019-cuda11.6-py3 / build", + "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453697504?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628533959" }, { - "name": "linux-xenial-cuda11_3-py3_7-gcc7-deploy / build", + "name": "linux-xenial-py3-clang5-mobile-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453697701?check_suite_focus=true" + 
"detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628534129" }, { - "name": "linux-xenial-py3-clang5-mobile-custom-build-static / build", + "name": "linux-bionic-py3_7-clang8-xla / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453697927?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628534256" }, { - "name": "linux-vulkan-bionic-py3.7-clang9 / build", + "name": "linux-focal-rocm5.2-py3.7 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453698388?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628534388" }, { - "name": "linux-xenial-py3-clang5-mobile-build / build", + "name": "linux-focal-py3.7-gcc7-mobile-lightweight-dispatch-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453698629?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628534571" }, { - "name": "linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single / build-and-test", + "name": "linux-bionic-cuda11_6-py3_10-gcc7-deploy / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453698800?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628534714" }, { - "name": "linux-docs / build-docs (cpp)", + "name": "win-vs2019-cuda11.6-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453870481?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628534989" }, { - "name": "linux-docs / build-docs (python)", + "name": "win-vs2019-cpu-py3 / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453870600?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628535311" }, { "name": "linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453870806?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639115" }, { "name": "linux-focal-py3.7-gcc7 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453870899?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639198" }, { "name": "linux-focal-py3.7-gcc7 / test (distributed, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453871006?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639265" + }, + { + "name": "linux-focal-py3.7-gcc7 / test (functorch, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639339" }, { "name": "linux-focal-py3.7-gcc7 / test (docs_test, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453871108?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639395" }, { "name": "linux-focal-py3.7-gcc7 / test (jit_legacy, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": 
"https://github.com/pytorch/pytorch/runs/7453871214?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639450" }, { "name": "linux-focal-py3.7-gcc7 / test (backwards_compat, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453871379?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639509" + }, + { + "name": "linux-docs / build-docs (cpp)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639572" + }, + { + "name": "linux-docs / build-docs (python)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628639635" + }, + { + "name": "linux-focal-py3.7-clang10-onnx / test (default, 1, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647047" + }, + { + "name": "linux-focal-py3.7-clang10-onnx / test (default, 2, 2, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647119" }, { "name": "linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453877423?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647215" }, { "name": "linux-bionic-py3.7-clang9 / test (default, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453877577?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647277" }, { "name": "linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453877679?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647348" }, { "name": "linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453877783?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647432" }, { "name": "linux-bionic-py3.7-clang9 / test (dynamo, 1, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453877932?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647522" }, { "name": "linux-bionic-py3.7-clang9 / test (dynamo, 2, 2, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453878058?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647641" }, { "name": "linux-bionic-py3.7-clang9 / test (functorch, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453878178?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628647762" }, { - "name": "linux-focal-py3.7-clang10-onnx / test (default, 1, 2, linux.2xlarge)", + "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453882847?check_suite_focus=true" 
+ "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628653797" }, { - "name": "linux-focal-py3.7-clang10-onnx / test (default, 2, 2, linux.2xlarge)", + "name": "linux-focal-py3.7-clang7-asan / test (default, 1, 5, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453882949?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628679376" }, { - "name": "linux-vulkan-bionic-py3.7-clang9 / test (default, 1, 1, linux.2xlarge)", + "name": "linux-focal-py3.7-clang7-asan / test (default, 2, 5, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453888149?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628679431" }, { - "name": "linux-focal-py3.7-clang7-asan / test (default, 1, 5, linux.2xlarge)", + "name": "linux-focal-py3.7-clang7-asan / test (default, 3, 5, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453922173?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628679469" }, { - "name": "linux-focal-py3.7-clang7-asan / test (default, 2, 5, linux.2xlarge)", + "name": "linux-focal-py3.7-clang7-asan / test (default, 4, 5, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453922275?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628679519" }, { - "name": "linux-focal-py3.7-clang7-asan / test (default, 3, 5, linux.2xlarge)", + "name": "linux-focal-py3.7-clang7-asan / test (default, 5, 5, linux.2xlarge)", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453922371?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628679594" }, { - "name": "linux-focal-py3.7-clang7-asan / test (default, 4, 5, linux.2xlarge)", + "name": "linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628681226" + }, + { + "name": "linux-bionic-cuda11_6-py3_10-gcc7-deploy / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628854932" + }, + { + "name": "linux-bionic-cuda11.6-py3.10-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628856434" + }, + { + "name": "linux-bionic-cuda11.6-py3.10-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628856501" + }, + { + "name": "linux-bionic-cuda11.6-py3.10-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2907393329/jobs/4628856575" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdqZ2fA=", + "hasNextPage": true + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7929899419" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdioqZs=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 
15368 + }, + "workflowRun": { + "workflow": { + "name": "windows-binary-libtorch-debug" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "libtorch-cpu-shared-with-deps-debug-build", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351637/jobs/4634503587" + }, + { + "name": "libtorch-cpu-shared-with-deps-debug-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351637/jobs/4635312938" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsbsmM=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953056" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUSuA=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "windows-binary-wheel" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "wheel-py3_7-cuda11_3-build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453922449?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351640/jobs/4634503571" }, { - "name": "linux-focal-py3.7-clang7-asan / test (default, 5, 5, linux.2xlarge)", + "name": "wheel-py3_7-cuda11_3-test", + "conclusion": "SUCCESS", + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351640/jobs/4636146265" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsskcw=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953059" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUSuM=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "windows-binary-libtorch-release" + } + }, + "checkRuns": { + "nodes": [ + { + "name": "libtorch-cpu-shared-with-deps-release-build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453922527?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351643/jobs/4634503570" }, { - "name": "linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge)", + "name": "libtorch-cpu-shared-with-deps-release-test", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7453931393?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351643/jobs/4635003925" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsVbD8=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953061" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUSuU=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-binary-libtorch-cxx11-abi" + } + }, + "checkRuns": { + "nodes": [ { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu)", + "name": "libtorch-cpu-shared-with-deps-cxx11-abi-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7454011679?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351698/jobs/4634504079" }, { - "name": 
"linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 2, 4, linux.4xlarge.nvidia.gpu)", + "name": "libtorch-cpu-shared-with-deps-cxx11-abi-test / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7454011783?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351698/jobs/4635072931" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsW5Aw=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953185" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUS2E=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-binary-libtorch-pre-cxx11" + } + }, + "checkRuns": { + "nodes": [ { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 3, 4, linux.4xlarge.nvidia.gpu)", + "name": "libtorch-cpu-shared-with-deps-cxx11-abi-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7454011866?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351700/jobs/4634503897" }, { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (default, 4, 4, linux.4xlarge.nvidia.gpu)", + "name": "libtorch-cpu-shared-with-deps-cxx11-abi-test / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7454011976?check_suite_focus=true" - }, + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351700/jobs/4635077148" + } + ], + "pageInfo": { + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsW-jo=", + "hasNextPage": false + } + }, + "conclusion": "SUCCESS", + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953186" + }, + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUS2I=" + }, + { + "node": { + "app": { + "name": "GitHub Actions", + "databaseId": 15368 + }, + "workflowRun": { + "workflow": { + "name": "linux-binary-manywheel" + } + }, + "checkRuns": { + "nodes": [ { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)", + "name": "manywheel-py3_7-cuda10_2-build / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7454012075?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351699/jobs/4634503896" }, { - "name": "linux-bionic-cuda11.6-py3.7-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)", + "name": "manywheel-py3_7-cuda10_2-test / build", "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7454012177?check_suite_focus=true" + "detailsUrl": "https://github.com/pytorch/pytorch/actions/runs/2910351699/jobs/4635934290" } ], "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbxLMxE=", - "hasNextPage": true + "endCursor": "Y3Vyc29yOnYyOpHPAAAAAdsoMEA=", + "hasNextPage": false } }, "conclusion": "SUCCESS", - "url": "https://github.com/pytorch/pytorch/commit/ffe43399d6f60ef7844523a5f465c11d9a67062f/checks?check_suite_id=7463496361" + "url": "https://github.com/pytorch/pytorch/commit/2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce/checks?check_suite_id=7936953187" }, - "cursor": "Y3Vyc29yOnYyOpHPAAAAAbzb6qk=" + "cursor": "Y3Vyc29yOnYyOpHPAAAAAdkUS2M=" } ], "pageInfo": { - "hasNextPage": false + "hasNextPage": true } }, - "pushedDate": "2022-07-19T19:21:58Z", - "oid": "ffe43399d6f60ef7844523a5f465c11d9a67062f" 
+ "status": null, + "pushedDate": "2022-08-22T22:04:19Z", + "oid": "2fd08f1c669bbb0f2e14ae40e76f9e0d3195f4ce" } } ] @@ -19701,263 +35077,789 @@ "files": { "nodes": [ { - "path": "aten/src/ATen/native/Convolution.cpp" + "path": "aten/src/ATen/native/Convolution.cpp" + }, + { + "path": "torch/testing/_internal/common_methods_invocations.py" + }, + { + "path": "torch/testing/_internal/common_modules.py" + } + ], + "pageInfo": { + "endCursor": "Mw", + "hasNextPage": false + } + }, + "reviews": { + "nodes": [ + { + "author": { + "login": "ngimel" + }, + "state": "APPROVED" + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNy0xOVQxMDowNzo1NC0wNzowMLkyMDIyLTA3LTE5VDEwOjA3OjU0LTA3OjAwzj43QcY=", + "hasPreviousPage": false + } + }, + "comments": { + "nodes": [ + { + "bodyText": "@pytorchbot merge -g\nAll is green internally!", + "createdAt": "2022-08-23T19:29:55Z", + "author": { + "login": "albanD" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1224702749 + }, + { + "bodyText": "@pytorchbot successfully started a merge job. Check the current status here.\nThe merge job was triggered with the green (-g) flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.\nPlease reach out to the PyTorch DevX Team with feedback or questions!", + "createdAt": "2022-08-23T19:31:18Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1224705564 + }, + { + "bodyText": "Thanks for looking into it \ud83d\ude42 @albanD @jeanschmidt", + "createdAt": "2022-08-23T19:34:36Z", + "author": { + "login": "kshitij12345" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1224712351 + }, + { + "bodyText": "Hey @kshitij12345.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "createdAt": "2022-08-23T22:31:58Z", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1224956051 + }, + { + "bodyText": "Yeah, discussed with my manager and I got the required permissions to do so. Sorry for not responding promptly yesterday. 
But I am available from now on to provide assistance :)", + "createdAt": "2022-08-24T09:24:04Z", + "author": { + "login": "jeanschmidt" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1225462612 + } + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOSP97HQ==", + "hasPreviousPage": true + } + }, + "labels": { + "edges": [ + { + "node": { + "name": "open source" + } + }, + { + "node": { + "name": "Merged" + } + }, + { + "node": { + "name": "cla signed" + } + }, + { + "node": { + "name": "Reverted" + } + }, + { + "node": { + "name": "ciflow/trunk" + } + }, + { + "node": { + "name": "ciflow/periodic" + } + } + ] + } + } + } + } + }, + "query_sha=2e2877d2452c4f233f042b7ccd50ab9c2a6e9a73d8819a0c876203c12364e8a3 cursor=Y3Vyc29yOnYyOpHOSP97HQ== name=pytorch number=79694 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "comments": { + "nodes": [ + { + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/79694\n\ud83d\udcc4 \u00a0Preview Python docs built from this PR\n\ud83d\udcc4 \u00a0Preview C++ docs built from this PR\n\u2753Need help or want to give feedback on the CI? Visit our office hours\n\n\u2705 No Failures (0 Pending)\nAs of commit 2fd08f1 (more details on the Dr. CI page):\nExpand to see more\n\n\ud83d\udc9a \ud83d\udc9a Looks good so far! There are no failures yet. \ud83d\udc9a \ud83d\udc9a\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "createdAt": "2022-06-16T09:43:16Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 1157454523 + }, + { + "bodyText": "Unable to reproduce jit failure locally (will skip the test)\nCI Failure : https://github.com/pytorch/pytorch/runs/6926187074?check_suite_focus=true#step:9:20230\npytest test/test_ops_jit.py -k test_variant_consistency_jit_nn_functional_conv_transpose1d_cpu_complex64 -v\n=============================================================== test session starts ===============================================================\nplatform linux -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/kshiteej/.conda/envs/pytorch-cuda-dev/bin/python\ncachedir: .pytest_cache\nhypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/kshiteej/Pytorch/pytorch_complex_convolution.py/.hypothesis/examples')\nrootdir: /home/kshiteej/Pytorch/pytorch_complex_convolution.py, configfile: pytest.ini\nplugins: hypothesis-6.23.2, repeat-0.9.1\ncollected 1976 items / 1975 deselected / 1 selected \n\ntest/test_ops_jit.py::TestJitCPU::test_variant_consistency_jit_nn_functional_conv_transpose1d_cpu_complex64 PASSED [100%]\n\n================================================================ warnings summary =================================================================\n../../.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/testing/_internal/common_cuda.py:9\n /home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/testing/_internal/common_cuda.py:9: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. 
Use setuptools or check PEP 632 for potential alternatives\n from distutils.version import LooseVersion\n\n../../.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/backends/cudnn/__init__.py:91\n /home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/backends/cudnn/__init__.py:91: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.\n warnings.warn(\n\n-- Docs: https://docs.pytest.org/en/stable/warnings.html\n================================================= 1 passed, 1975 deselected, 2 warnings in 4.90s =================================================", + "createdAt": "2022-07-18T09:05:35Z", + "author": { + "login": "kshitij12345" + }, + "authorAssociation": "COLLABORATOR", + "editor": { + "login": "kshitij12345" + }, + "databaseId": 1186949486 + }, + { + "bodyText": "@pytorchbot merge", + "createdAt": "2022-07-19T17:12:23Z", + "author": { + "login": "ngimel" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189347786 + }, + { + "bodyText": "@pytorchbot successfully started a merge job. Check the current status here", + "createdAt": "2022-07-19T17:13:42Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189350009 + }, + { + "bodyText": "Hey @kshitij12345.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "createdAt": "2022-07-19T17:14:25Z", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1189350932 + }, + { + "bodyText": "@pytorchbot revert -m \"broke slow test https://github.com/pytorch/pytorch/runs/7414560957?check_suite_focus=true#step:9:31516\" -c \"nosignal\"", + "createdAt": "2022-07-19T19:15:41Z", + "author": { + "login": "kshitij12345" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1189459845 + }, + { + "bodyText": "@pytorchbot successfully started a revert job. Check the current status here", + "createdAt": "2022-07-19T19:16:59Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189460926 + }, + { + "bodyText": "Will not revert as @kshitij12345 is not a MEMBER, but COLLABORATOR", + "createdAt": "2022-07-19T19:17:00Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189460942 }, { - "path": "torch/testing/_internal/common_methods_invocations.py" + "bodyText": "@pytorchbot revert -m \"broke slow test https://github.com/pytorch/pytorch/runs/7414560957?check_suite_focus=true#step:9:31516\" -c \"nosignal\"", + "createdAt": "2022-07-19T20:40:04Z", + "author": { + "login": "anjali411" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189529734 }, { - "path": "torch/testing/_internal/common_modules.py" + "bodyText": "@pytorchbot successfully started a revert job. 
Check the current status here", + "createdAt": "2022-07-19T20:41:20Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189530756 + }, + { + "bodyText": "@kshitij12345 your PR has been successfully reverted.", + "createdAt": "2022-07-19T20:41:25Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1189530831 + }, + { + "bodyText": "@pytorchbot merge -g", + "createdAt": "2022-07-20T09:53:08Z", + "author": { + "login": "kshitij12345" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1190070141 + }, + { + "bodyText": "@pytorchbot successfully started a merge job. Check the current status here", + "createdAt": "2022-07-20T09:54:24Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1190071424 + }, + { + "bodyText": "Hey @kshitij12345.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "createdAt": "2022-07-20T13:00:51Z", + "author": { + "login": "github-actions" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1190258272 + }, + { + "bodyText": "commit is breaking internal builds/tests https://pastebin.com/HX4RUusH (pytorch/functorch/test:test_eager_transforms)", + "createdAt": "2022-07-21T10:39:01Z", + "author": { + "login": "jeanschmidt" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1191327616 + }, + { + "bodyText": "@pytorchbot revert -m \"breaking internal builds\" -c \"ghfirst\"", + "createdAt": "2022-07-21T10:39:27Z", + "author": { + "login": "jeanschmidt" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1191328013 + }, + { + "bodyText": "@pytorchbot revert -m \"breaking internal builds\" -c \"ghfirst\"", + "createdAt": "2022-07-21T10:41:23Z", + "author": { + "login": "jeanschmidt" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1191329792 + }, + { + "bodyText": "@pytorchbot successfully started a revert job. Check the current status here", + "createdAt": "2022-07-21T10:42:16Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1191330586 + }, + { + "bodyText": "@kshitij12345 your PR has been successfully reverted.", + "createdAt": "2022-07-21T10:42:23Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1191330690 + }, + { + "bodyText": "@jeanschmidt which test is it failing on? I tried running the test_eager_transforms in functorch but couldn't reproduce it.", + "createdAt": "2022-07-25T07:11:19Z", + "author": { + "login": "kshitij12345" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1193667568 + }, + { + "bodyText": "@jbschlosser have added a ref as discussed offline. Can you please take a look? 
And if it looks good, can you import the PR to check if it is breaking anything internally.\nThanks", + "createdAt": "2022-08-03T18:30:17Z", + "author": { + "login": "kshitij12345" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1204329491 + }, + { + "bodyText": "@jbschlosser @jeanschmidt @albanD anything we can do to unblock this on our side?", + "createdAt": "2022-08-20T09:27:17Z", + "author": { + "login": "lezcano" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1221266218 + }, + { + "bodyText": "Functorch tests should be running here now so can you rebase on top of master please?", + "createdAt": "2022-08-22T21:42:37Z", + "author": { + "login": "albanD" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1223129944 + }, + { + "bodyText": "@albanD have rebased on latest master.", + "createdAt": "2022-08-23T08:49:10Z", + "author": { + "login": "kshitij12345" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1223758571 + }, + { + "bodyText": "I triggered all the tests not to have any issues with slow tests again", + "createdAt": "2022-08-23T09:20:18Z", + "author": { + "login": "lezcano" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1223796413 + }, + { + "bodyText": "Thanks @lezcano! However, last time it was reverted for internal failures. So it would be great if someone can import and verify that.\ncc: @albanD @jeanschmidt", + "createdAt": "2022-08-23T10:17:50Z", + "author": { + "login": "kshitij12345" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1223863075 + }, + { + "bodyText": "@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", + "createdAt": "2022-08-23T14:43:02Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1224175731 + }, + { + "bodyText": "I am not the right person to provide assistence, as currently I am not based in a Tier 1 location, so my permissions to access are so restricted that I am not able to import this commit, run the tests and provide meaningful responses.", + "createdAt": "2022-08-23T15:57:48Z", + "author": { + "login": "jeanschmidt" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1224272324 + }, + { + "bodyText": "@jeanschmidt has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.", + "createdAt": "2022-08-23T17:00:53Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1224351135 } ], "pageInfo": { - "endCursor": "Mw", - "hasNextPage": false + "startCursor": "Y3Vyc29yOnYyOpHORP1auw==", + "hasPreviousPage": false } - }, - "reviews": { + } + } + } + } + }, + "query_sha=2e2877d2452c4f233f042b7ccd50ab9c2a6e9a73d8819a0c876203c12364e8a3 cursor=Y3Vyc29yOnYyOpHOR1poyg== name=pytorch number=82169 owner=pytorch": { + "data": { + "repository": { + "pullRequest": { + "comments": { "nodes": [ { + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/82169\n\ud83d\udcc4 \u00a0Preview Python docs built from this PR\n\ud83d\udcc4 \u00a0Preview C++ docs built from this PR\n\u2753Need help or want to give feedback on the CI? Visit our office hours\n\n\u2705 No Failures (0 Pending)\nAs of commit 28140e4 (more details on the Dr. 
CI page):\nExpand to see more\n\n\ud83d\udc9a \ud83d\udc9a Looks good so far! There are no failures yet. \ud83d\udc9a \ud83d\udc9a\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "createdAt": "2022-07-25T21:41:41Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 1194667199 + }, + { + "bodyText": "@pytorchbot merge -g", + "createdAt": "2022-07-25T21:46:04Z", + "author": { + "login": "ezyang" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1194671445 + }, + { + "bodyText": "@pytorchbot successfully started a merge job. Check the current status here", + "createdAt": "2022-07-25T21:47:25Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1194672744 + }, + { + "bodyText": "Merge failed due to Refusing to merge as mandatory check(s) pull failed for rule superuser\nRaised by https://github.com/pytorch/pytorch/actions/runs/2735501647", + "createdAt": "2022-07-25T23:22:45Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1194761219 + }, + { + "bodyText": "@pytorchbot rebase", + "createdAt": "2022-07-26T00:54:17Z", + "author": { + "login": "ezyang" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1194839920 + }, + { + "bodyText": "@pytorchbot successfully started a rebase job. Check the current status here", + "createdAt": "2022-07-26T01:01:32Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1194846575 + }, + { + "bodyText": "Successfully rebased gh/ezyang/1279/orig onto refs/remotes/origin/master, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/82169)", + "createdAt": "2022-07-26T01:01:53Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1194846838 + }, + { + "bodyText": "@pytorchbot rebase", + "createdAt": "2022-07-27T15:32:13Z", + "author": { + "login": "ezyang" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1196915484 + }, + { + "bodyText": "@pytorchbot successfully started a rebase job. 
Check the current status here", + "createdAt": "2022-07-27T15:33:49Z", "author": { - "login": "ngimel" + "login": "pytorchmergebot" }, - "state": "APPROVED" - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpO5MjAyMi0wNy0xOVQxMDowNzo1NC0wNzowMLkyMDIyLTA3LTE5VDEwOjA3OjU0LTA3OjAwzj43QcY=", - "hasPreviousPage": false - } - }, - "comments": { - "nodes": [ + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1196917359 + }, { - "bodyText": "@pytorchbot revert -m \"breaking internal builds\" -c \"ghfirst\"", + "bodyText": "Successfully rebased gh/ezyang/1279/orig onto refs/remotes/origin/master, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/82169)", + "createdAt": "2022-07-27T15:34:03Z", "author": { - "login": "jeanschmidt" + "login": "pytorchmergebot" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1191328013 + "databaseId": 1196917609 }, { - "bodyText": "@pytorchbot revert -m \"breaking internal builds\" -c \"ghfirst\"", + "bodyText": "@pytorchbot merge -g", + "createdAt": "2022-07-27T15:41:52Z", "author": { - "login": "jeanschmidt" + "login": "ezyang" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1191329792 + "databaseId": 1196927174 }, { - "bodyText": "@pytorchbot successfully started a revert job. Check the current status here", + "bodyText": "@pytorchbot successfully started a merge job. Check the current status here", + "createdAt": "2022-07-27T15:43:11Z", "author": { "login": "pytorchmergebot" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1191330586 + "databaseId": 1196928771 }, { - "bodyText": "@kshitij12345 your PR has been successfully reverted.", + "bodyText": "Merge failed due to Refusing to merge as mandatory check(s) Lint failed for rule superuser\nRaised by https://github.com/pytorch/pytorch/actions/runs/2747872935", + "createdAt": "2022-07-27T15:43:14Z", "author": { "login": "pytorchmergebot" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1191330690 + "databaseId": 1196928849 }, { - "bodyText": "@jeanschmidt which test is it failing on? I tried running the test_eager_transforms in functorch but couldn't reproduce it.", + "bodyText": "@pytorchbot merge -g", + "createdAt": "2022-07-27T16:59:37Z", "author": { - "login": "kshitij12345" + "login": "ezyang" }, - "authorAssociation": "COLLABORATOR", + "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1193667568 - } - ], - "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHORwI5DQ==", - "hasPreviousPage": true - } - }, - "labels": { - "edges": [ + "databaseId": 1197046487 + }, { - "node": { - "name": "open source" - } + "bodyText": "@pytorchbot successfully started a merge job. 
Check the current status here", + "createdAt": "2022-07-27T17:07:32Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1197055101 }, { - "node": { - "name": "Merged" - } + "bodyText": "Merge failed due to Refusing to merge as mandatory check(s) Lint failed for rule superuser\nRaised by https://github.com/pytorch/pytorch/actions/runs/2748317347", + "createdAt": "2022-07-27T17:07:36Z", + "author": { + "login": "pytorchmergebot" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1197055259 }, { - "node": { - "name": "cla signed" - } + "bodyText": "@pytorchbot merge -f", + "createdAt": "2022-07-27T17:56:26Z", + "author": { + "login": "malfet" + }, + "authorAssociation": "MEMBER", + "editor": null, + "databaseId": 1197107106 }, { - "node": { - "name": "Reverted" - } + "bodyText": "\u274c \ud83e\udd16 pytorchbot command failed:\n@pytorchbot merge: error: argument -f/--force: expected one argument\n\nusage: @pytorchbot merge [-g | -f FORCE | -l]\n\nTry @pytorchbot --help for more info.", + "createdAt": "2022-07-27T17:56:27Z", + "author": { + "login": "pytorch-bot" + }, + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1197107129 } - ] + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHORzUsvw==", + "hasPreviousPage": false + } } } } } }, - "query_sha=62ce809793481ce6ddce6e1a19d9b0761755ff0ff75decaf8a79419eaf793110 cursor=Y3Vyc29yOnYyOpHORwI5DQ== name=pytorch number=79694 owner=pytorch": { + "query_sha=2e2877d2452c4f233f042b7ccd50ab9c2a6e9a73d8819a0c876203c12364e8a3 cursor=Y3Vyc29yOnYyOpHOPoR4Lg== name=pytorch number=71759 owner=pytorch": { "data": { "repository": { "pullRequest": { "comments": { "nodes": [ { - "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/79694\n\ud83d\udcc4 \u00a0Preview Python docs built from this PR\n\ud83d\udcc4 \u00a0Preview C++ docs built from this PR\n\u2753Need help or want to give feedback on the CI? Visit our office hours\n\n\u2705 No Failures (0 Pending)\nAs of commit ffe4339 (more details on the Dr. CI page):\nExpand to see more\n\n\ud83d\udc9a \ud83d\udc9a Looks good so far! There are no failures yet. \ud83d\udc9a \ud83d\udc9a\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. 
CI Users group.\nClick here to manually regenerate this comment.", - "author": { - "login": "facebook-github-bot" - }, - "authorAssociation": "MEMBER", - "editor": { - "login": "facebook-github-bot" - }, - "databaseId": 1157454523 - }, - { - "bodyText": "Unable to reproduce jit failure locally (will skip the test)\nCI Failure : https://github.com/pytorch/pytorch/runs/6926187074?check_suite_focus=true#step:9:20230\npytest test/test_ops_jit.py -k test_variant_consistency_jit_nn_functional_conv_transpose1d_cpu_complex64 -v\n=============================================================== test session starts ===============================================================\nplatform linux -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/kshiteej/.conda/envs/pytorch-cuda-dev/bin/python\ncachedir: .pytest_cache\nhypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/kshiteej/Pytorch/pytorch_complex_convolution.py/.hypothesis/examples')\nrootdir: /home/kshiteej/Pytorch/pytorch_complex_convolution.py, configfile: pytest.ini\nplugins: hypothesis-6.23.2, repeat-0.9.1\ncollected 1976 items / 1975 deselected / 1 selected \n\ntest/test_ops_jit.py::TestJitCPU::test_variant_consistency_jit_nn_functional_conv_transpose1d_cpu_complex64 PASSED [100%]\n\n================================================================ warnings summary =================================================================\n../../.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/testing/_internal/common_cuda.py:9\n /home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/testing/_internal/common_cuda.py:9: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives\n from distutils.version import LooseVersion\n\n../../.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/backends/cudnn/__init__.py:91\n /home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/backends/cudnn/__init__.py:91: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. 
To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.\n warnings.warn(\n\n-- Docs: https://docs.pytest.org/en/stable/warnings.html\n================================================= 1 passed, 1975 deselected, 2 warnings in 4.90s =================================================", + "bodyText": "CI Flow Status\n\u269b\ufe0f CI Flow\nRuleset - Version: v1\nRuleset - File: https://github.com/coolteemf/pytorch/blob/7647f7953a68e4f1c3feaa19c77d925abfe8e377/.github/generated-ciflow-ruleset.json\nPR ciflow labels: ciflow/default\nAdd ciflow labels to this PR to trigger more builds:\n\n\n\nWorkflows\nLabels (bold enabled)\nStatus\n\n\n\n\nTriggered Workflows\n\n\n\n\nlinux-bionic-py3.6-clang9\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla\n\u2705 triggered\n\n\nlinux-xenial-cuda11.3-py3.6-gcc7\nciflow/all, ciflow/cuda, ciflow/default, ciflow/linux\n\u2705 triggered\n\n\nlinux-xenial-py3.6-clang7-asan\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers\n\u2705 triggered\n\n\nlinux-xenial-py3.6-gcc5.4\nciflow/all, ciflow/cpu, ciflow/default, ciflow/linux\n\u2705 triggered\n\n\nlinux-xenial-py3.6-gcc7-bazel-test\nciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux\n\u2705 triggered\n\n\nwin-vs2019-cpu-py3\nciflow/all, ciflow/cpu, ciflow/default, ciflow/win\n\u2705 triggered\n\n\nwin-vs2019-cuda11.3-py3\nciflow/all, ciflow/cuda, ciflow/default, ciflow/win\n\u2705 triggered\n\n\nSkipped Workflows\n\n\n\n\nlibtorch-linux-xenial-cuda10.2-py3.6-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux\n\ud83d\udeab skipped\n\n\nlibtorch-linux-xenial-cuda11.3-py3.6-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux\n\ud83d\udeab skipped\n\n\nlinux-bionic-cuda10.2-py3.9-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow\n\ud83d\udeab skipped\n\n\nlinux-xenial-cuda10.2-py3.6-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow\n\ud83d\udeab skipped\n\n\nparallelnative-linux-xenial-py3.6-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux\n\ud83d\udeab skipped\n\n\nperiodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7\nciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-linux-xenial-cuda11.1-py3.6-gcc7\nciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled\n\ud83d\udeab skipped\n\n\nperiodic-win-vs2019-cuda11.1-py3\nciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win\n\ud83d\udeab skipped\n\n\npuretorch-linux-xenial-py3.6-gcc5.4\nciflow/all, ciflow/cpu, ciflow/linux\n\ud83d\udeab skipped", + "createdAt": "2022-01-25T09:31:05Z", "author": { - "login": "kshitij12345" - }, - "authorAssociation": "COLLABORATOR", - "editor": { - "login": "kshitij12345" + "login": "pytorch-bot" }, - "databaseId": 1186949486 + "authorAssociation": "NONE", + "editor": null, + "databaseId": 1020983378 }, { - "bodyText": "@pytorchbot merge", + "bodyText": "Hi @coolteemf!\nThank you for your pull request and welcome to our community.\nAction Required\nIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.\nProcess\nIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. 
If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.\nOnce the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.\nIf you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!", + "createdAt": "2022-01-25T09:31:06Z", "author": { - "login": "ngimel" + "login": "facebook-github-bot" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1189347786 + "databaseId": 1020983383 }, { - "bodyText": "@pytorchbot successfully started a merge job. Check the current status here", + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/71759\n\ud83d\udcc4 \u00a0Preview docs built from this PR\n\ud83d\udcc4 \u00a0Preview C++ docs built from this PR\n\ud83d\udd27 \u00a0Opt-in to CIFlow to control what jobs run on your PRs\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit 346e0c5 (more details on the Dr. CI page):\n\n\n2/3 failures introduced in this PR\n1/3 tentatively recognized as flaky \u2744\ufe0f\n\nClick here to rerun these jobs\n\n\n\n\n\ud83d\udd75\ufe0f 2 new failures recognized by patterns\nThe following CI failures do not appear to be due to upstream breakages:\n win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge) (1/2)\nStep: \"Test\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-02-23T14:12:58.9371445Z FAIL [0.010s]: test_sparse_addmm_cpu_bfloat16 (__main__.TestSparseCPU)\n\n2022-02-23T14:12:58.9258506Z test_sparse_zeros_tanh_cpu_float64 (__main__.TestSparseUnaryUfuncsCPU) ... ok (0.002s)\n2022-02-23T14:12:58.9274771Z test_sparse_zeros_tanh_cpu_int16 (__main__.TestSparseUnaryUfuncsCPU) ... ok (0.001s)\n2022-02-23T14:12:58.9290805Z test_sparse_zeros_tanh_cpu_int32 (__main__.TestSparseUnaryUfuncsCPU) ... ok (0.001s)\n2022-02-23T14:12:58.9306695Z test_sparse_zeros_tanh_cpu_int64 (__main__.TestSparseUnaryUfuncsCPU) ... ok (0.000s)\n2022-02-23T14:12:58.9322595Z test_sparse_zeros_tanh_cpu_int8 (__main__.TestSparseUnaryUfuncsCPU) ... ok (0.000s)\n2022-02-23T14:12:58.9338535Z test_sparse_zeros_tanh_cpu_uint8 (__main__.TestSparseUnaryUfuncsCPU) ... ok (0.000s)\n2022-02-23T14:12:58.9354468Z test_sparse_zeros_trunc_cpu_float32 (__main__.TestSparseUnaryUfuncsCPU) ... ok (0.000s)\n2022-02-23T14:12:58.9370208Z test_sparse_zeros_trunc_cpu_float64 (__main__.TestSparseUnaryUfuncsCPU) ... 
ok (0.000s)\n2022-02-23T14:12:58.9370712Z \n2022-02-23T14:12:58.9370976Z ======================================================================\n2022-02-23T14:12:58.9371445Z FAIL [0.010s]: test_sparse_addmm_cpu_bfloat16 (__main__.TestSparseCPU)\n2022-02-23T14:12:58.9372134Z ----------------------------------------------------------------------\n2022-02-23T14:12:58.9372597Z Traceback (most recent call last):\n2022-02-23T14:12:58.9374021Z File \"C:\\actions-runner\\_work\\pytorch\\pytorch\\build\\win_tmp\\build\\torch\\testing\\_internal\\common_device_type.py\", line 376, in instantiated_test\n2022-02-23T14:12:58.9374740Z result = test(self, **param_kwargs)\n2022-02-23T14:12:58.9375570Z File \"C:\\actions-runner\\_work\\pytorch\\pytorch\\build\\win_tmp\\build\\torch\\testing\\_internal\\common_utils.py\", line 2951, in wrapped\n2022-02-23T14:12:58.9376266Z f(self, *args, **kwargs, coalesced=False)\n2022-02-23T14:12:58.9376972Z File \"test_sparse.py\", line 1272, in test_sparse_addmm\n2022-02-23T14:12:58.9377402Z test_shape(7, 8, 9, 20, True, None)\n2022-02-23T14:12:58.9377939Z File \"test_sparse.py\", line 1264, in test_shape\n2022-02-23T14:12:58.9378373Z self.assertEqual(Y, Y_dense)\n\n\n win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu) (2/2)\nStep: \"Test\" (full log | diagnosis details | \ud83d\udd01 rerun)\n\n\n2022-02-23T15:20:20.5710678Z FAIL [0.031s]: test_sparse_addmm_cpu_bfloat16 (__main__.TestSparseCPU)\n\n2022-02-23T15:20:20.5569146Z test_sparse_zeros_tanh_cuda_float64 (__main__.TestSparseUnaryUfuncsCUDA) ... ok (0.000s)\n2022-02-23T15:20:20.5589083Z test_sparse_zeros_tanh_cuda_int16 (__main__.TestSparseUnaryUfuncsCUDA) ... ok (0.000s)\n2022-02-23T15:20:20.5609025Z test_sparse_zeros_tanh_cuda_int32 (__main__.TestSparseUnaryUfuncsCUDA) ... ok (0.000s)\n2022-02-23T15:20:20.5629080Z test_sparse_zeros_tanh_cuda_int64 (__main__.TestSparseUnaryUfuncsCUDA) ... ok (0.016s)\n2022-02-23T15:20:20.5649102Z test_sparse_zeros_tanh_cuda_int8 (__main__.TestSparseUnaryUfuncsCUDA) ... ok (0.000s)\n2022-02-23T15:20:20.5668867Z test_sparse_zeros_tanh_cuda_uint8 (__main__.TestSparseUnaryUfuncsCUDA) ... ok (0.000s)\n2022-02-23T15:20:20.5688700Z test_sparse_zeros_trunc_cuda_float32 (__main__.TestSparseUnaryUfuncsCUDA) ... ok (0.000s)\n2022-02-23T15:20:20.5708285Z test_sparse_zeros_trunc_cuda_float64 (__main__.TestSparseUnaryUfuncsCUDA) ... 
ok (0.000s)\n2022-02-23T15:20:20.5709405Z \n2022-02-23T15:20:20.5709879Z ======================================================================\n2022-02-23T15:20:20.5710678Z FAIL [0.031s]: test_sparse_addmm_cpu_bfloat16 (__main__.TestSparseCPU)\n2022-02-23T15:20:20.5711399Z ----------------------------------------------------------------------\n2022-02-23T15:20:20.5712013Z Traceback (most recent call last):\n2022-02-23T15:20:20.5713280Z File \"C:\\actions-runner\\_work\\pytorch\\pytorch\\build\\win_tmp\\build\\torch\\testing\\_internal\\common_device_type.py\", line 376, in instantiated_test\n2022-02-23T15:20:20.5714267Z result = test(self, **param_kwargs)\n2022-02-23T15:20:20.5715299Z File \"C:\\actions-runner\\_work\\pytorch\\pytorch\\build\\win_tmp\\build\\torch\\testing\\_internal\\common_utils.py\", line 2951, in wrapped\n2022-02-23T15:20:20.5716240Z f(self, *args, **kwargs, coalesced=False)\n2022-02-23T15:20:20.5716943Z File \"test_sparse.py\", line 1275, in test_sparse_addmm\n2022-02-23T15:20:20.5717516Z test_shape(7, 8, 9, 20, False, (1, 1))\n2022-02-23T15:20:20.5718323Z File \"test_sparse.py\", line 1264, in test_shape\n2022-02-23T15:20:20.5718915Z self.assertEqual(Y, Y_dense)\n\n\n\n\u2744\ufe0f 1 failure tentatively classified as flaky\nbut reruns have not yet been triggered to confirm:\n linux-bionic-rocm4.5-py3.7 / test (distributed, 1, 1, linux.rocm.gpu) (1/1)\nStep: \"Test\" (full log | diagnosis details | \ud83d\udd01 rerun) \u2744\ufe0f\n\n\n2022-02-23T16:16:26.7221984Z RuntimeError: Proc...ated or timed out after 100.06913685798645 seconds\n\n2022-02-23T16:16:26.7207909Z ERROR [100.093s]: test_collect_shards (__main__.TestZeroRedundancyOptimizerDistributed)\n2022-02-23T16:16:26.7209206Z Check the state consolidation mechanism, and the state dict exposed by ZeroRedundancyOptimizer\n2022-02-23T16:16:26.7213073Z ----------------------------------------------------------------------\n2022-02-23T16:16:26.7213996Z Traceback (most recent call last):\n2022-02-23T16:16:26.7215434Z File \"/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py\", line 483, in wrapper\n2022-02-23T16:16:26.7216409Z self._join_processes(fn)\n2022-02-23T16:16:26.7217801Z File \"/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py\", line 702, in _join_processes\n2022-02-23T16:16:26.7218822Z self._check_return_codes(elapsed_time)\n2022-02-23T16:16:26.7220266Z File \"/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py\", line 754, in _check_return_codes\n2022-02-23T16:16:26.7221201Z i, elapsed_time\n2022-02-23T16:16:26.7221984Z RuntimeError: Process 0 terminated or timed out after 100.06913685798645 seconds\n2022-02-23T16:16:26.7222551Z \n2022-02-23T16:16:26.7223245Z ----------------------------------------------------------------------\n2022-02-23T16:16:26.7224032Z Ran 26 tests in 303.663s\n2022-02-23T16:16:26.7224400Z \n2022-02-23T16:16:26.7224780Z FAILED (errors=1, skipped=8, unexpected successes=3)\n2022-02-23T16:16:26.7225718Z \n2022-02-23T16:16:26.7225992Z Generating XML reports...\n2022-02-23T16:16:26.7336797Z Generated XML report: test-reports/python-unittest/distributed.optim.test_zero_redundancy_optimizer/TEST-TestZeroRedundancyOptimizerDistributed-20220223161123.xml\n2022-02-23T16:16:26.7349296Z Generated XML report: test-reports/python-unittest/distributed.optim.test_zero_redundancy_optimizer/TEST-TestZeroRedundancyOptimizerSingleRank-20220223161123.xml\n2022-02-23T16:16:27.6823633Z Traceback (most 
recent call last):\n\n\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. CI Users group.\nClick here to manually regenerate this comment.", + "createdAt": "2022-01-25T09:31:08Z", "author": { - "login": "pytorchmergebot" + "login": "facebook-github-bot" }, "authorAssociation": "MEMBER", - "editor": null, - "databaseId": 1189350009 + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 1020983433 }, { - "bodyText": "Hey @kshitij12345.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "bodyText": "Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!", + "createdAt": "2022-01-25T18:07:45Z", "author": { - "login": "github-actions" + "login": "facebook-github-bot" }, - "authorAssociation": "NONE", + "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1189350932 + "databaseId": 1021467314 }, { - "bodyText": "@pytorchbot revert -m \"broke slow test https://github.com/pytorch/pytorch/runs/7414560957?check_suite_focus=true#step:9:31516\" -c \"nosignal\"", + "bodyText": "@albanD Is there something that needs to be done to correct the failed check ?", + "createdAt": "2022-02-04T13:18:05Z", "author": { - "login": "kshitij12345" + "login": "coolteemf" }, - "authorAssociation": "COLLABORATOR", + "authorAssociation": "CONTRIBUTOR", "editor": null, - "databaseId": 1189459845 + "databaseId": 1029978104 }, { - "bodyText": "@pytorchbot successfully started a revert job. Check the current status here", + "bodyText": "Hi,\nI think you didn't do the merge properly as there are now a lot more commits than it should be in this PR.\nYou can either clean up the branch locally and force push here or open a new clean PR.\nNote that in general, it is better to rebase on top of master than merge master into your branch!", + "createdAt": "2022-02-04T14:28:28Z", "author": { - "login": "pytorchmergebot" + "login": "albanD" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1189460926 + "databaseId": 1030038719 }, { - "bodyText": "Will not revert as @kshitij12345 is not a MEMBER, but COLLABORATOR", + "bodyText": "Okay thank you for the heads up", + "createdAt": "2022-02-04T16:44:46Z", "author": { - "login": "pytorchmergebot" + "login": "coolteemf" }, - "authorAssociation": "MEMBER", + "authorAssociation": "CONTRIBUTOR", "editor": null, - "databaseId": 1189460942 + "databaseId": 1030159616 }, { - "bodyText": "@pytorchbot revert -m \"broke slow test https://github.com/pytorch/pytorch/runs/7414560957?check_suite_focus=true#step:9:31516\" -c \"nosignal\"", + "bodyText": "@albanD I just rebased and updated the branch to take into account changes from 28388b4. 
Is it all clear for merging ?", + "createdAt": "2022-02-16T15:34:59Z", "author": { - "login": "anjali411" + "login": "coolteemf" }, - "authorAssociation": "MEMBER", + "authorAssociation": "CONTRIBUTOR", "editor": null, - "databaseId": 1189529734 + "databaseId": 1041720345 }, { - "bodyText": "@pytorchbot successfully started a revert job. Check the current status here", + "bodyText": "Thanks! The CI needs fixing for bc-compat and lint though\n\nThe lint should be fixed, however I didn't find clear instructions on how to fix the bc compat.\nI guess output_mask could be made optional, however in the case of native_group_norm_backward the same argument is not optional.", + "createdAt": "2022-02-17T08:04:30Z", "author": { - "login": "pytorchmergebot" + "login": "coolteemf" }, - "authorAssociation": "MEMBER", + "authorAssociation": "CONTRIBUTOR", "editor": null, - "databaseId": 1189530756 + "databaseId": 1042672732 }, { - "bodyText": "@kshitij12345 your PR has been successfully reverted.", + "bodyText": "Since we are changing the signature on purpose here, you can add it to the list at https://github.com/pytorch/pytorch/blob/master/test/forward_backward_compatibility/check_forward_backward_compatibility.py#L29 to silence the test.", + "createdAt": "2022-02-17T14:41:16Z", "author": { - "login": "pytorchmergebot" + "login": "albanD" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1189530831 + "databaseId": 1043020903 }, { - "bodyText": "@pytorchbot merge -g", + "bodyText": "@pytorchbot merge this please", + "createdAt": "2022-02-23T14:48:05Z", "author": { - "login": "kshitij12345" + "login": "albanD" }, - "authorAssociation": "COLLABORATOR", + "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1190070141 + "databaseId": 1048861185 }, { - "bodyText": "@pytorchbot successfully started a merge job. Check the current status here", + "bodyText": "Merge failed due to 'NoneType' object is not subscriptable\nRaised by https://github.com/pytorch/pytorch/actions/runs/1887914411", + "createdAt": "2022-02-23T14:49:16Z", "author": { "login": "pytorchmergebot" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1190071424 + "databaseId": 1048862374 }, { - "bodyText": "Hey @kshitij12345.\nYou've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.\nFor changes that are 'topic: not user facing' there is no need for a release notes label.", + "bodyText": "@coolteemf you can ignore me playing with the bot. 
Nothing is needed on your end anymore, I'll take it from here.", + "createdAt": "2022-02-23T14:52:10Z", "author": { - "login": "github-actions" + "login": "albanD" }, - "authorAssociation": "NONE", + "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1190258272 + "databaseId": 1048865236 }, { - "bodyText": "commit is breaking internal builds/tests https://pastebin.com/HX4RUusH (pytorch/functorch/test:test_eager_transforms)", + "bodyText": "@pytorchbot merge this", + "createdAt": "2022-02-23T14:54:23Z", "author": { - "login": "jeanschmidt" + "login": "malfet" }, "authorAssociation": "MEMBER", "editor": null, - "databaseId": 1191327616 + "databaseId": 1048867615 } ], "pageInfo": { - "startCursor": "Y3Vyc29yOnYyOpHORP1auw==", + "startCursor": "Y3Vyc29yOnYyOpHOPNr4Ug==", "hasPreviousPage": false } } @@ -19965,88 +35867,39 @@ } } }, - "query_sha=4c16925415d1fcc12ac0f5f7ce73b8e6122997d2f51c4c2757c2543e6493c60d cr_cursor=Y3Vyc29yOnYyOpHPAAAAAbqubxc= cs_cursor=Y3Vyc29yOnYyOpHPAAAAAbtMgmA= name=pytorch number=79694 owner=pytorch": { + "query_sha=2e2877d2452c4f233f042b7ccd50ab9c2a6e9a73d8819a0c876203c12364e8a3 cursor=Y3Vyc29yOnYyOpHOQebHmg== name=pytorch number=75095 owner=pytorch": { "data": { "repository": { "pullRequest": { - "commits": { + "comments": { "nodes": [ { - "commit": { - "oid": "ffe43399d6f60ef7844523a5f465c11d9a67062f", - "checkSuites": { - "nodes": [ - { - "checkRuns": { - "nodes": [ - { - "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7427036779?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7427036925?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbqvlv0=", - "hasNextPage": false - } - } - } - ] - } - } - } - ] - } - } - } - } - }, - "query_sha=4c16925415d1fcc12ac0f5f7ce73b8e6122997d2f51c4c2757c2543e6493c60d cr_cursor=Y3Vyc29yOnYyOpHPAAAAAbxLMxE= cs_cursor=Y3Vyc29yOnYyOpHPAAAAAbzb6nI= name=pytorch number=79694 owner=pytorch": { - "data": { - "repository": { - "pullRequest": { - "commits": { - "nodes": [ + "bodyText": "\ud83d\udd17 Helpful links\n\n\ud83e\uddea \u00a0See artifacts and rendered test results at hud.pytorch.org/pr/75095\n\ud83d\udcc4 \u00a0Preview Python docs built from this PR\n\ud83d\udcc4 \u00a0Preview C++ docs built from this PR\n\u2753Need help or want to give feedback on the CI? Visit our office hours\n\n\ud83d\udc8a CI failures summary and remediations\nAs of commit db355d5 (more details on the Dr. CI page):\nExpand to see more\n\n\ud83d\udc9a \ud83d\udc9a Looks good so far! There are no failures yet. \ud83d\udc9a \ud83d\udc9a\n\nThis comment was automatically generated by Dr. CI (expand for details).\nPlease report bugs/suggestions to the (internal) Dr. 
CI Users group.\nClick here to manually regenerate this comment.", + "createdAt": "2022-04-01T08:49:06Z", + "author": { + "login": "facebook-github-bot" + }, + "authorAssociation": "MEMBER", + "editor": { + "login": "facebook-github-bot" + }, + "databaseId": 1085625658 + }, { - "commit": { - "oid": "ffe43399d6f60ef7844523a5f465c11d9a67062f", - "checkSuites": { - "nodes": [ - { - "checkRuns": { - "nodes": [ - { - "name": "linux-xenial-cuda11_3-py3_7-gcc7-deploy / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7454025911?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7454189584?check_suite_focus=true" - }, - { - "name": "win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge)", - "conclusion": "SUCCESS", - "detailsUrl": "https://github.com/pytorch/pytorch/runs/7454189772?check_suite_focus=true" - } - ], - "pageInfo": { - "endCursor": "Y3Vyc29yOnYyOpHPAAAAAbxN6Mw=", - "hasNextPage": false - } - } - } - ] - } - } + "bodyText": "High level question: how do we plan to validate that our ref implementations are compatible with somewhat-symbolic shapes? There are multiple ways to write the shape processing logic to be compatible vs not, it'd be good to catch such instances early. Does it make sense to throw in some proxy objects (that have state of 0,1,N) in tests early on? (maybe in a follow up PR). Otherwise it's not clear to me that squeeze/broadcast/etc are the right set of primitives for symbolic shapes", + "createdAt": "2022-04-21T18:51:24Z", + "author": { + "login": "dzhulgakov" + }, + "authorAssociation": "COLLABORATOR", + "editor": null, + "databaseId": 1105634766 } - ] + ], + "pageInfo": { + "startCursor": "Y3Vyc29yOnYyOpHOQLVVOg==", + "hasPreviousPage": false + } } } } diff --git a/.github/scripts/install_nvidia_utils_linux.sh b/.github/scripts/install_nvidia_utils_linux.sh deleted file mode 100755 index b5274fb5805f..000000000000 --- a/.github/scripts/install_nvidia_utils_linux.sh +++ /dev/null @@ -1,57 +0,0 @@ -#!/usr/bin/env bash - -set -eou pipefail - - -DISTRIBUTION=$(. 
/etc/os-release;echo $ID$VERSION_ID) \ -DRIVER_FN="NVIDIA-Linux-x86_64-515.57.run" -YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo" - -install_nvidia_docker2_amzn2() { - ( - set -x - # Needed for yum-config-manager - sudo yum install -y yum-utils - sudo yum-config-manager --add-repo "${YUM_REPO_URL}" - sudo yum install -y nvidia-docker2 - sudo systemctl restart docker - ) -} - -install_nvidia_driver_amzn2() { - ( - set -x - sudo yum groupinstall -y "Development Tools" - # ensure our kernel install is the same as our underlying kernel, - # groupinstall "Development Tools" has a habit of mismatching kernel headers - sudo yum install -y "kernel-devel-uname-r == $(uname -r)" - sudo modprobe backlight - sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN" - sudo /bin/bash /tmp/nvidia_driver -s --no-drm || (sudo cat /var/log/nvidia-installer.log && false) - sudo rm -fv /tmp/nvidia_driver - nvidia-smi - ) -} - -# Install container toolkit based on distribution -echo "== Installing nvidia container toolkit for ${DISTRIBUTION} ==" -case "${DISTRIBUTION}" in - amzn*) - install_nvidia_docker2_amzn2 - ;; - *) - echo "ERROR: Unknown distribution ${DISTRIBUTION}" - exit 1 - ;; -esac - -echo "== Installing nvidia driver ${DRIVER_FN} ==" -case "${DISTRIBUTION}" in - amzn*) - install_nvidia_driver_amzn2 - ;; - *) - echo "ERROR: Unknown distribution ${DISTRIBUTION}" - exit 1 - ;; -esac diff --git a/.github/scripts/parse_ref.py b/.github/scripts/parse_ref.py index 036146f734c3..59a454fe3025 100755 --- a/.github/scripts/parse_ref.py +++ b/.github/scripts/parse_ref.py @@ -4,18 +4,26 @@ import re +def set_output(name: str, val: str) -> None: + if os.getenv("GITHUB_OUTPUT"): + with open(str(os.getenv("GITHUB_OUTPUT")), "a") as env: + print(f"{name}={val}", file=env) + else: + print(f"::set-output name={name}::{val}") + + def main() -> None: - ref = os.environ['GITHUB_REF'] + ref = os.environ["GITHUB_REF"] m = re.match(r'^refs/(\w+)/(.*)$', ref) if m: category, stripped = m.groups() - if category == 'heads': - print(f'::set-output name=branch::{stripped}') - elif category == 'pull': - print(f'::set-output name=branch::pull/{stripped.split("/")[0]}') - elif category == 'tags': - print(f'::set-output name=tag::{stripped}') + if category == "heads": + set_output("branch", stripped) + elif category == "pull": + set_output("branch", "pull/" + stripped.split("/")[0]) + elif category == "tags": + set_output("tag", stripped) -if __name__ == '__main__': +if __name__ == "__main__": main() diff --git a/.github/scripts/pr-sanity-check.sh b/.github/scripts/pr-sanity-check.sh new file mode 100644 index 000000000000..13d037d5eaab --- /dev/null +++ b/.github/scripts/pr-sanity-check.sh @@ -0,0 +1,60 @@ +#!/usr/bin/env bash + +set -eou pipefail + +GIT_TOP_DIR=$(git rev-parse --show-toplevel) + +TMPFILE=$(mktemp) +trap "rm -rf ${TMPFILE}" EXIT + +# By default just run against the latest commit +BASE=${BASE:-HEAD~1} +HEAD=${HEAD:-HEAD} + +ancestor=$(git merge-base "${BASE}" "${HEAD}") +echo "INFO: Checking aginst the following stats" +( + set -x + git diff --stat "$ancestor" "${HEAD}" | sed '$d' > "${TMPFILE}" +) + +while read -r git_attribute; do + if echo "${git_attribute}" | grep "linguist-generated=true" >/dev/null 2>/dev/null; then + pattern=$(echo ${git_attribute} | cut -d' ' -f1) + escaped_pattern=$(printf '%s\n' "$pattern" | sed -e 's/[\/&]/\\&/g') + # Delete known generated files + sed -i '/'"${escaped_pattern}"'/d' "${TMPFILE}" + fi 
+done < "${GIT_TOP_DIR}/.gitattributes" + +echo "INFO: Showing non-generated files:" +( + set -x + cat "${TMPFILE}" +) + +# Get only files that have changed +changed_files=$(cut -d' ' -f2 "${TMPFILE}" | xargs) + +details=$(git diff --shortstat "$ancestor" "${HEAD}" -- ${changed_files}) +add=$(echo "$details" | grep -o '[0-9]* insertion' | grep -o '[0-9]*' || true) +remove=$(echo "$details" | grep -o '[0-9]* deletion' | grep -o '[0-9]*' || true) +pr_size=0 +if [ "$add" ]; then + pr_size=$(("$pr_size" + "$add")) +fi +if [ "$remove" ]; then + pr_size=$(("$pr_size" + "$remove")) +fi +echo "INFO: PR SIZE is ${pr_size}" + +if ((pr_size > 2000)); then + echo + echo 'Your PR is '"$pr_size"' LOC which is more than the 2000 maximum' + echo 'allowed within PyTorch infra. PLease make sure to split up' + echo 'your PR into smaller pieces that can be reviewed.' + echo 'If you think that this rule should not apply to your PR,' + echo 'please contact @albanD or @seemethere.' + echo + exit 1 +fi diff --git a/.github/scripts/process_commit.py b/.github/scripts/process_commit.py deleted file mode 100644 index 1bfca3237984..000000000000 --- a/.github/scripts/process_commit.py +++ /dev/null @@ -1,106 +0,0 @@ -#!/usr/bin/env python3 -""" -This script finds the user/pr creator responsible for labeling a PR by a commit SHA. It is used by the workflow in -'.github/workflows/pr-labels.yml'. If there exists no PR associated with the commit or the PR is properly labeled, -this script is a no-op. - -Note: we ping the user only, not the reviewers, as the reviewers can sometimes be external to pytorch -with no labeling responsibility, so we don't want to bother them. -This script is based on: https://github.com/pytorch/vision/blob/main/.github/process_commit.py -""" - -import sys -from typing import Any, Set, Tuple, List -import re -import os -import json -import requests - -# For a PR to be properly labeled it should have release notes label and one topic label -PULL_REQUEST_EXP = "Pull Request resolved:.*pull/(.*)" -PRIMARY_LABEL_FILTER = "release notes:" -SECONDARY_LABELS = { - "topic: bc_breaking", - "topic: deprecation", - "topic: new feature", - "topic: improvements", - "topic: bug fixes", - "topic: performance", - "topic: documentation", - "topic: developer feature", - "topic: not user facing", -} -# This secondary does not require a primary -ALLOWED_ONLY_SECONDARY = {"topic: not user facing"} -PYTORCH_REPO = "https://api.github.com/repos/pytorch/pytorch" -GITHUB_TOKEN = os.environ.get('GITHUB_TOKEN') -REQUEST_HEADERS = {'Accept': 'application/vnd.github.v3+json', 'Authorization': f'token {GITHUB_TOKEN}'} - - -def query_pytorch(cmd: str) -> Any: - response = requests.get(f"{PYTORCH_REPO}/{cmd}", headers=REQUEST_HEADERS) - return response.json() - - -def get_pr_number(commit_hash: str) -> Any: - data = query_pytorch(f"commits/{commit_hash}") - if not data or (not data["commit"]["message"]): - return None - message = data["commit"]["message"] - p = re.compile(PULL_REQUEST_EXP) - result = p.search(message) - if not result: - return None - return result.group(1) - - -def get_pr_author_and_labels(pr_number: int) -> Tuple[str, Set[str]]: - # See https://docs.github.com/en/rest/reference/pulls#get-a-pull-request - data = query_pytorch(f"pulls/{pr_number}") - user = data["user"]["login"] - labels = {label["name"] for label in data["labels"]} - return user, labels - -def get_repo_labels() -> List[str]: - collected_labels: List[str] = list() - for page in range(0, 10): - response = 
query_pytorch(f"labels?per_page=100&page={page}") - page_labels = list(map(lambda x: str(x["name"]), response)) - if not page_labels: - break - collected_labels += page_labels - return collected_labels - -def post_pytorch_comment(pr_number: int, merger: str) -> Any: - message = {'body' : f"Hey @{merger}." + """ -You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. \ -Please add one of each to the PR. The 'release notes: ...' label should represent the part of \ -PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should \ -represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). \ -The list of valid labels can be found [here](https://github.com/pytorch/pytorch/labels?q=release+notes) \ -for the 'release notes: ...' and [here](https://github.com/pytorch/pytorch/labels?q=topic) for the \ -'topics: ...'. -For changes that are 'topic: not user facing' there is no need for a release notes label."""} - - response = requests.post( - f"{PYTORCH_REPO}/issues/{pr_number}/comments", - json.dumps(message), - headers=REQUEST_HEADERS) - return response.json() - -if __name__ == "__main__": - commit_hash = sys.argv[1] - pr_number = get_pr_number(commit_hash) - - if not pr_number: - sys.exit(0) - - user, labels = get_pr_author_and_labels(pr_number) - repo_labels = get_repo_labels() - - primary_labels = set(filter(lambda x: x.startswith(PRIMARY_LABEL_FILTER), repo_labels)) - has_both_labels = bool(primary_labels.intersection(labels) and SECONDARY_LABELS.intersection(labels)) - is_properly_labeled = has_both_labels or bool(ALLOWED_ONLY_SECONDARY.intersection(labels)) - - if not is_properly_labeled: - post_pytorch_comment(pr_number, user) diff --git a/.github/scripts/run_torchbench.py b/.github/scripts/run_torchbench.py index 44e53f6a14e2..352da69c8158 100644 --- a/.github/scripts/run_torchbench.py +++ b/.github/scripts/run_torchbench.py @@ -13,10 +13,12 @@ # 1. Does not reuse the build artifact in other CI workflows # 2. CI jobs are serialized because there is only one worker import os +import boto3 # type: ignore[import] import git # type: ignore[import] import pathlib import argparse import subprocess +from pathlib import Path from typing import List, Tuple @@ -31,6 +33,25 @@ direction: decrease timeout: 720 tests:""" +S3_BUCKET = "ossci-metrics" +S3_PREFIX = "torchbench-pr-test" +S3_URL_BASE = f"https://{S3_BUCKET}.s3.amazonaws.com/" + +class S3Client: + def __init__(self, bucket: str = S3_BUCKET, prefix: str = S3_PREFIX): + self.s3 = boto3.client('s3') + self.resource = boto3.resource('s3') + self.bucket = bucket + self.prefix = prefix + + def upload_file(self, file_path: Path, filekey_prefix: str) -> None: + assert file_path.is_file(), f"Specified file path {file_path} does not exist or not file." 
+ file_name = file_path.name + s3_key = f"{self.prefix}/{filekey_prefix}/{file_name}" + print(f"Uploading file {file_name} to S3 with key: {s3_key}") + self.s3.upload_file(str(file_path), self.bucket, s3_key) + # output the result URL + print(f"Uploaded the result file {file_name} to {S3_URL_BASE}{s3_key}") def gen_abtest_config(control: str, treatment: str, models: List[str]) -> str: d = {} @@ -121,6 +142,7 @@ def run_torchbench(pytorch_path: str, torchbench_path: str, output_dir: str) -> "--pytorch-src", pytorch_path, "--torchbench-src", torchbench_path, "--config", os.path.join(output_dir, TORCHBENCH_CONFIG_NAME), "--output", os.path.join(output_dir, "result.txt")] + print(f"Running torchbench command: {command}") subprocess.check_call(command, cwd=torchbench_path, env=env) def run_userbenchmarks(pytorch_path: str, torchbench_path: str, base_sha: str, head_sha: str, @@ -133,11 +155,24 @@ def run_userbenchmarks(pytorch_path: str, torchbench_path: str, base_sha: str, h "--head", head_sha, "--userbenchmark", userbenchmark, "--output-dir", output_dir] + print(f"Running torchbench userbenchmark command: {command}") subprocess.check_call(command, cwd=torchbench_path, env=env) +def process_upload_s3(result_dir: str) -> None: + # validate result directory + result_dir_path = Path(result_dir) + assert result_dir_path.exists(), f"Specified result directory {result_dir} doesn't exist." + # upload all files to S3 bucket oss-ci-metrics + files = [x for x in result_dir_path.iterdir() if x.is_file()] + # upload file to S3 bucket + s3_client: S3Client = S3Client() + filekey_prefix = result_dir_path.name + for f in files: + s3_client.upload_file(f, filekey_prefix) + if __name__ == "__main__": parser = argparse.ArgumentParser(description='Run TorchBench tests based on PR') - parser.add_argument('--pr-body', required=True, help="The file that contains body of a Pull Request") + parser.add_argument('--pr-body', help="The file that contains body of a Pull Request") subparsers = parser.add_subparsers(dest='command') # parser for setup the torchbench branch name env @@ -149,6 +184,9 @@ def run_userbenchmarks(pytorch_path: str, torchbench_path: str, base_sha: str, h run_parser.add_argument('--pr-head-sha', required=True, type=str, help="The Pull Request head hash") run_parser.add_argument('--pytorch-path', required=True, type=str, help="Path to pytorch repository") run_parser.add_argument('--torchbench-path', required=True, type=str, help="Path to TorchBench repository") + # parser to upload results to S3 + upload_parser = subparsers.add_parser("upload-s3") + upload_parser.add_argument('--result-dir', required=True, type=str, help="Path to benchmark output") args = parser.parse_args() if args.command == 'set-torchbench-branch': @@ -179,6 +217,8 @@ def run_userbenchmarks(pytorch_path: str, torchbench_path: str, base_sha: str, h if not models and not userbenchmarks: print("Can't parse valid models or userbenchmarks from the pr body. 
Quit.") exit(-1) + elif args.command == 'upload-s3': + process_upload_s3(args.result_dir) else: print(f"The command {args.command} is not supported.") exit(-1) diff --git a/.github/scripts/test_check_labels.py b/.github/scripts/test_check_labels.py new file mode 100644 index 000000000000..64e91dcd8ecb --- /dev/null +++ b/.github/scripts/test_check_labels.py @@ -0,0 +1,77 @@ +"""test_check_labels.py""" + +from typing import Any +from unittest import TestCase, mock, main + +from trymerge import GitHubPR +from test_trymerge import mocked_gh_graphql +from check_labels import has_required_labels + +release_notes_labels = [ + "release notes: AO frontend", + "release notes: autograd", + "release notes: benchmark", + "release notes: build", + "release notes: complex", + "release notes: composability", + "release notes: cpp", + "release notes: cuda", + "release notes: cudnn", + "release notes: dataloader", + "release notes: distributed (c10d)", + "release notes: distributed (ddp)", + "release notes: distributed (fsdp)", + "release notes: distributed (pipeline)", + "release notes: distributed (rpc)", + "release notes: distributed (sharded)", + "release notes: foreach_frontend", + "release notes: functorch", + "release notes: fx", + "release notes: hub", + "release notes: jit", + "release notes: lazy", + "release notes: linalg_frontend", + "release notes: memory format", + "release notes: Meta API", + "release notes: mobile", + "release notes: mps", + "release notes: nested tensor", + "release notes: nn", + "release notes: onnx", + "release notes: package/deploy", + "release notes: performance_as_product", + "release notes: profiler", + "release notes: python_frontend", + "release notes: quantization", + "release notes: releng", + "release notes: rocm", + "release notes: sparse", + "release notes: visualization", + "release notes: vulkan", +] + + +class TestCheckLabels(TestCase): + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) + @mock.patch('check_labels.get_release_notes_labels', return_value=release_notes_labels) + def test_pr_with_missing_labels(self, mocked_rn_labels: Any, mocked_gql: Any) -> None: + "Test PR with no 'release notes:' label or 'topic: not user facing' label" + pr = GitHubPR("pytorch", "pytorch", 82169) + self.assertFalse(has_required_labels(pr)) + + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) + @mock.patch('check_labels.get_release_notes_labels', return_value=release_notes_labels) + def test_pr_with_release_notes_label(self, mocked_rn_labels: Any, mocked_gql: Any) -> None: + "Test PR with 'release notes: nn' label" + pr = GitHubPR("pytorch", "pytorch", 71759) + self.assertTrue(has_required_labels(pr)) + + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) + @mock.patch('check_labels.get_release_notes_labels', return_value=release_notes_labels) + def test_pr_with_not_user_facing_label(self, mocked_rn_labels: Any, mocked_gql: Any) -> None: + "Test PR with 'topic: not user facing' label" + pr = GitHubPR("pytorch", "pytorch", 75095) + self.assertTrue(has_required_labels(pr)) + +if __name__ == "__main__": + main() diff --git a/.github/scripts/test_fetch_latest_green_commit.py b/.github/scripts/test_fetch_latest_green_commit.py index 2f84658e6394..f88e7f262fb9 100644 --- a/.github/scripts/test_fetch_latest_green_commit.py +++ b/.github/scripts/test_fetch_latest_green_commit.py @@ -81,13 +81,12 @@ def test_necessary_failed(self, mock_get_commit_results: Any) -> None: @mock.patch('fetch_latest_green_commit.get_commit_results', 
return_value=TestChecks().make_test_checks()) def test_skippable_failed(self, mock_get_commit_results: Any) -> None: - "Test with skippable job (ex: docker-release-builds) failing" + "Test with failing skippable jobs (ex: docker-release-builds) should pass" workflow_checks = mock_get_commit_results() workflow_checks = set_workflow_job_status(workflow_checks, "periodic", "skipped") workflow_checks = set_workflow_job_status(workflow_checks, "docker-release-builds", "failed") result = isGreen("sha", workflow_checks) - self.assertFalse(result[0]) - self.assertEqual(result[1], "docker-release-builds checks were not successful") + self.assertTrue(result[0]) @mock.patch('fetch_latest_green_commit.get_commit_results', return_value={}) def test_no_workflows(self, mock_get_commit_results: Any) -> None: diff --git a/.github/scripts/test_filter_test_configs.py b/.github/scripts/test_filter_test_configs.py new file mode 100755 index 000000000000..55410e846c97 --- /dev/null +++ b/.github/scripts/test_filter_test_configs.py @@ -0,0 +1,118 @@ +#!/usr/bin/env python3 + +import os +import yaml +import json +from unittest import TestCase, main, mock +from filter_test_configs import ( + get_labels, + filter, + set_periodic_modes, + PREFIX, + VALID_TEST_CONFIG_LABELS, + SUPPORTED_PERIODICAL_MODES +) +import requests +from requests.models import Response +from typing import Any, Dict + + +def mocked_gh_get_labels_failed(url: str, headers: Dict[str, str]) -> Response: + mocked_response = Response() + mocked_response.status_code = requests.codes.bad_request + return mocked_response + + +def mocked_gh_get_labels(url: str, headers: Dict[str, str]) -> Response: + mocked_response = Response() + mocked_response.status_code = requests.codes.ok + mocked_response._content = b'[{"name": "foo"}, {"name": "bar"}, {}, {"name": ""}]' + return mocked_response + + +class TestConfigFilter(TestCase): + + def setUp(self) -> None: + os.environ["GITHUB_TOKEN"] = "GITHUB_TOKEN" + if os.getenv("GITHUB_OUTPUT"): + del os.environ["GITHUB_OUTPUT"] + + @mock.patch("filter_test_configs.requests.get", side_effect=mocked_gh_get_labels) + def test_get_labels(self, mocked_gh: Any) -> None: + labels = get_labels(pr_number=12345) + self.assertSetEqual({"foo", "bar"}, labels) + + @mock.patch("filter_test_configs.requests.get", side_effect=mocked_gh_get_labels_failed) + def test_get_labels_failed(self, mocked_gh: Any) -> None: + labels = get_labels(pr_number=54321) + self.assertFalse(labels) + + def test_filter(self) -> None: + mocked_labels = {f"{PREFIX}cfg", "ciflow/trunk", "plain-cfg"} + testcases = [ + { + "test_matrix": '{include: [{config: "default", runner: "linux"}]}', + "expected": '{"include": [{"config": "default", "runner": "linux"}]}', + "description": "No match, keep the same test matrix", + }, + { + "test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "plain-cfg"}]}', + "expected": '{"include": [{"config": "default", "runner": "linux"}, {"config": "plain-cfg"}]}', + "description": "No match because there is no prefix or suffix, keep the same test matrix", + }, + { + "test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", shard: 1}]}', + "expected": '{"include": [{"config": "cfg", "shard": 1}]}', + "description": "Found a match, only keep that", + }, + ] + + for case in testcases: + filtered_test_matrix = filter(yaml.safe_load(case["test_matrix"]), mocked_labels) + self.assertEqual(case["expected"], json.dumps(filtered_test_matrix)) + + def test_filter_with_valid_label(self) -> None: + 
mocked_labels = {f"{PREFIX}cfg", "ciflow/trunk"} + VALID_TEST_CONFIG_LABELS.add(f"{PREFIX}cfg") + + testcases = [ + { + "test_matrix": '{include: [{config: "default", runner: "linux"}]}', + "expected": '{"include": []}', + "description": "Found a valid label in the PR body, return the filtered test matrix", + }, + { + "test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", shard: 1}]}', + "expected": '{"include": [{"config": "cfg", "shard": 1}]}', + "description": "Found a match, only keep that", + }, + ] + + for case in testcases: + filtered_test_matrix = filter(yaml.safe_load(case["test_matrix"]), mocked_labels) + self.assertEqual(case["expected"], json.dumps(filtered_test_matrix)) + + + def test_set_periodic_modes(self) -> None: + testcases = [ + { + "test_matrix": "{include: []}", + "description": "Empty test matrix", + }, + { + "test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", runner: "macos"}]}', + "descripion": "Replicate each periodic mode in a different config", + }, + ] + + for case in testcases: + test_matrix = yaml.safe_load(case["test_matrix"]) + scheduled_test_matrix = set_periodic_modes(test_matrix) + self.assertEqual( + len(test_matrix["include"]) * len(SUPPORTED_PERIODICAL_MODES), + len(scheduled_test_matrix["include"]) + ) + + +if __name__ == '__main__': + main() diff --git a/.github/scripts/test_trymerge.py b/.github/scripts/test_trymerge.py index af3faf8cd094..7d5dfe7f0a3a 100755 --- a/.github/scripts/test_trymerge.py +++ b/.github/scripts/test_trymerge.py @@ -18,9 +18,12 @@ gh_get_team_members, read_merge_rules, validate_revert, + filter_pending_checks, + filter_failed_checks, GitHubPR, MergeRule, MandatoryChecksMissingError, + WorkflowCheckState, main as trymerge_main) from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo from typing import Any, List, Optional @@ -90,7 +93,7 @@ def mock_revert(repo: GitRepo, pr: GitHubPR, *, def mock_merge(pr_num: int, repo: GitRepo, dry_run: bool = False, - force: bool = False, + skip_mandatory_checks: bool = False, comment_id: Optional[int] = None, mandatory_only: bool = False, on_green: bool = False, @@ -127,6 +130,11 @@ def mocked_read_merge_rules(repo: Any, org: str, project: str) -> List[MergeRule ), ] + +def mocked_read_merge_rules_raise(repo: Any, org: str, project: str) -> List[MergeRule]: + raise RuntimeError("testing") + + class DummyGitRepo(GitRepo): def __init__(self) -> None: super().__init__(get_git_repo_dir(), get_git_remote_name()) @@ -139,7 +147,7 @@ def commit_message(self, ref: str) -> str: class TestGitHubPR(TestCase): def test_merge_rules_valid(self) -> None: - "Test that merge_rules.json can be parsed" + "Test that merge_rules.yaml can be parsed" repo = DummyGitRepo() self.assertGreater(len(read_merge_rules(repo, "pytorch", "pytorch")), 1) @@ -151,6 +159,14 @@ def test_match_rules(self, mocked_gql: Any, mocked_rmr: Any) -> None: repo = DummyGitRepo() self.assertTrue(find_matching_merge_rule(pr, repo) is not None) + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) + @mock.patch('trymerge.read_merge_rules', side_effect=mocked_read_merge_rules_raise) + def test_read_merge_rules_fails(self, mocked_gql: Any, mocked_rmr: Any) -> None: + "Tests that PR fails to read the merge rules" + pr = GitHubPR("pytorch", "pytorch", 77700) + repo = DummyGitRepo() + self.assertRaisesRegex(RuntimeError, "testing", lambda: find_matching_merge_rule(pr, repo)) + @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) 
@mock.patch('trymerge.read_merge_rules', side_effect=mocked_read_merge_rules) def test_lint_fails(self, mocked_gql: Any, mocked_rmr: Any) -> None: @@ -203,7 +219,7 @@ def test_internal_changes(self, mocked_gql: Any) -> None: def test_checksuites_pagination(self, mocked_gql: Any) -> None: "Tests that PR with lots of checksuits can be fetched" pr = GitHubPR("pytorch", "pytorch", 73811) - self.assertEqual(len(pr.get_checkrun_conclusions()), 104) + self.assertEqual(len(pr.get_checkrun_conclusions()), 107) @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) def test_comments_pagination(self, mocked_gql: Any) -> None: @@ -310,7 +326,7 @@ def test_main_force(self, mock_merge: Any, mock_parse_args: Any, mock_gh_get_inf mock_merge.assert_called_once_with(mock.ANY, mock.ANY, dry_run=mock.ANY, - force=True, + skip_mandatory_checks=True, comment_id=mock.ANY, on_green=False, land_checks=False, @@ -324,18 +340,35 @@ def test_main_merge(self, mock_merge: Any, mock_parse_args: Any, mock_gh_get_inf mock_merge.assert_called_once_with(mock.ANY, mock.ANY, dry_run=mock.ANY, - force=False, + skip_mandatory_checks=False, comment_id=mock.ANY, on_green=False, land_checks=False, mandatory_only=False) @mock.patch('trymerge.gh_graphql', side_effect=mocked_gh_graphql) - def test_revert_rules(self, mock_gql: Any) -> None: + @mock.patch('trymerge.read_merge_rules', side_effect=mocked_read_merge_rules) + def test_revert_rules(self, mock_gql: Any, mock_mr: Any) -> None: """ Tests that reverts from collaborators are allowed """ pr = GitHubPR("pytorch", "pytorch", 79694) repo = DummyGitRepo() self.assertIsNotNone(validate_revert(repo, pr, comment_id=1189459845)) + def test_checks_filter(self) -> None: + checks = [ + WorkflowCheckState(name="check0", status="SUCCESS", url="url0"), + WorkflowCheckState(name="check1", status="FAILURE", url="url1"), + WorkflowCheckState(name="check2", status="STARTUP_FAILURE", url="url2"), + WorkflowCheckState(name="check3", status=None, url="url3"), + ] + + checks_dict = {check.name : check for check in checks} + + pending_checks = filter_pending_checks(checks_dict) + failing_checks = filter_failed_checks(checks_dict) + + self.assertListEqual(failing_checks, [checks[1], checks[2]]) + self.assertListEqual(pending_checks, [checks[3]]) + if __name__ == "__main__": main() diff --git a/.github/scripts/trymerge.py b/.github/scripts/trymerge.py index 8be44f240162..697b4b94faac 100755 --- a/.github/scripts/trymerge.py +++ b/.github/scripts/trymerge.py @@ -9,6 +9,7 @@ from dataclasses import dataclass from datetime import datetime from functools import lru_cache +import yaml from typing import ( Any, Callable, @@ -20,10 +21,12 @@ Tuple, Union, cast, + NamedTuple ) from urllib.error import HTTPError from urllib.request import Request, urlopen from warnings import warn +from pathlib import Path from gitutils import ( GitRepo, @@ -33,10 +36,14 @@ ) from trymerge_explainer import ( TryMergeExplainer, - get_land_check_troubleshooting_message, get_revert_message, ) +class WorkflowCheckState(NamedTuple): + status: Optional[str] + url: str + name: str + GH_PR_REVIEWS_FRAGMENT = """ fragment PRReviews on PullRequestReviewConnection { nodes { @@ -144,6 +151,13 @@ checkSuites(first: 10) { ...PRCheckSuites } + status { + contexts { + context + state + targetUrl + } + } pushedDate oid } @@ -165,6 +179,7 @@ comments(last: 5) { nodes { bodyText + createdAt author { login } @@ -322,6 +337,7 @@ comments(last: 100, before: $cursor) { nodes { bodyText + createdAt author { login } @@ -391,9 +407,12 @@ 
r'https://github.com/(?P[^/]+)/(?P[^/]+)/pull/(?P[0-9]+)', re.MULTILINE ) +RE_PR_CC_LINE = re.compile(r'^cc:? @\w+.*\r?\n?$', re.MULTILINE) RE_DIFF_REV = re.compile(r'^Differential Revision:.+?(D[0-9]+)', re.MULTILINE) CIFLOW_LABEL = re.compile(r"^ciflow/.+") CIFLOW_TRUNK_LABEL = re.compile(r"^ciflow/trunk") +MERGE_RULE_PATH = Path(".github") / "merge_rules.yaml" + def _fetch_url(url: str, *, headers: Optional[Dict[str, str]] = None, @@ -485,12 +504,11 @@ def get_check_run_name_prefix(workflow_run: Any) -> str: else: return f'{workflow_run["workflow"]["name"]} / ' - def add_workflow_conclusions( checksuites: Any, get_next_checkruns_page: Callable[[List[Dict[str, Dict[str, Any]]], int, Any], Any], get_next_checksuites: Callable[[Any], Any] -) -> Dict[str, Tuple[str, str]]: +) -> Dict[str, WorkflowCheckState]: conclusions = {} def add_conclusions(edges: Any) -> None: @@ -504,7 +522,10 @@ def add_conclusions(edges: Any) -> None: # Do not override existing status with cancelled if workflow_conclusion == "CANCELLED" and workflow_name in conclusions: continue - conclusions[workflow_name] = (workflow_conclusion, node["url"]) + conclusions[workflow_name] = WorkflowCheckState( + name=workflow_name, + status=workflow_conclusion, + url=node["url"]) has_failing_check = False while checkruns is not None: for checkrun_node in checkruns["nodes"]: @@ -513,8 +534,11 @@ def add_conclusions(edges: Any) -> None: continue if checkrun_node["conclusion"] == 'FAILURE': has_failing_check = True - conclusions[f'{get_check_run_name_prefix(workflow_run)}{checkrun_node["name"]}'] = ( - checkrun_node["conclusion"], checkrun_node["detailsUrl"] + checkrun_name = f'{get_check_run_name_prefix(workflow_run)}{checkrun_node["name"]}' + conclusions[checkrun_name] = WorkflowCheckState( + name=checkrun_name, + status=checkrun_node["conclusion"], + url=checkrun_node["detailsUrl"] ) if bool(checkruns["pageInfo"]["hasNextPage"]): checkruns = get_next_checkruns_page(edges, edge_idx, checkruns) @@ -522,7 +546,11 @@ def add_conclusions(edges: Any) -> None: checkruns = None # Github doesn't set conclusion to failure if a job is still pending if workflow_run is not None and has_failing_check: - conclusions[workflow_run["workflow"]["name"]] = ("FAILURE", node["url"]) + workflow_name = workflow_run["workflow"]["name"] + conclusions[workflow_name] = WorkflowCheckState( + name=workflow_name, + status="FAILURE", + url=node["url"]) add_conclusions(checksuites["edges"]) while bool(checksuites["pageInfo"]["hasNextPage"]): @@ -558,6 +586,7 @@ def can_skip_internal_checks(pr: "GitHubPR", comment_id: Optional[int] = None) - @dataclass class GitHubComment: body_text: str + created_at: str author_login: str author_association: str editor_login: Optional[str] @@ -573,7 +602,7 @@ def __init__(self, org: str, project: str, pr_num: int) -> None: self.info = gh_get_pr_info(org, project, pr_num) self.changed_files: Optional[List[str]] = None self.labels: Optional[List[str]] = None - self.conclusions: Optional[Dict[str, Tuple[str, str]]] = None + self.conclusions: Optional[Dict[str, WorkflowCheckState]] = None self.comments: Optional[List[GitHubComment]] = None self._authors: Optional[List[Tuple[str, str]]] = None self._reviews: Optional[List[Tuple[str, str]]] = None @@ -701,7 +730,7 @@ def get_labels(self) -> List[str]: self.labels = labels return self.labels - def get_checkrun_conclusions(self) -> Dict[str, Tuple[str, str]]: + def get_checkrun_conclusions(self) -> Dict[str, WorkflowCheckState]: """ Returns dict of checkrun -> [conclusion, url] """ if 
self.conclusions is not None: return self.conclusions @@ -733,6 +762,13 @@ def get_pr_next_checksuites(checksuites: Any) -> Any: checksuites = orig_last_commit["checkSuites"] self.conclusions = add_workflow_conclusions(checksuites, get_pr_next_check_runs, get_pr_next_checksuites) + + # Append old style statuses(like ones populated by CircleCI or EasyCLA) to conclusions + if orig_last_commit["status"] and orig_last_commit["status"]["contexts"]: + for status in orig_last_commit["status"]["contexts"]: + name = status["context"] + self.conclusions[name] = WorkflowCheckState(name=name, status=status["state"], url=status["targetUrl"]) + return self.conclusions def get_authors(self) -> Dict[str, str]: @@ -775,6 +811,7 @@ def get_pr_url(self) -> str: def _comment_from_node(node: Any) -> GitHubComment: editor = node["editor"] return GitHubComment(body_text=node["bodyText"], + created_at=node["createdAt"] if "createdAt" in node else "", author_login=node["author"]["login"], author_association=node["authorAssociation"], editor_login=editor["login"] if editor else None, @@ -826,9 +863,15 @@ def has_internal_changes(self) -> bool: checks = self.get_checkrun_conclusions() if checks is None or checkrun_name not in checks: return False - return checks[checkrun_name][0] != "SUCCESS" - - def merge_ghstack_into(self, repo: GitRepo, force: bool, comment_id: Optional[int] = None) -> None: + return checks[checkrun_name].status != "SUCCESS" + + def merge_ghstack_into( + self, + repo: GitRepo, + skip_mandatory_checks: bool, + comment_id: Optional[int] = None, + land_check_commit: Optional[str] = None + ) -> None: assert self.is_ghstack_pr() # For ghstack, cherry-pick commits based from origin orig_ref = f"{repo.remote}/{re.sub(r'/head$', '/orig', self.head_ref())}" @@ -849,7 +892,12 @@ def merge_ghstack_into(self, repo: GitRepo, force: bool, comment_id: Optional[in continue commit_msg = pr.gen_commit_message(filter_ghstack=True) # Raises exception if matching rule is not found - find_matching_merge_rule(pr, repo, force=force, skip_internal_checks=can_skip_internal_checks(self, comment_id)) + find_matching_merge_rule( + pr, + repo, + skip_mandatory_checks=skip_mandatory_checks, + skip_internal_checks=can_skip_internal_checks(self, comment_id), + land_check_commit=land_check_commit) repo.cherry_pick(rev) repo.amend_commit_message(commit_msg) @@ -860,28 +908,41 @@ def gen_commit_message(self, filter_ghstack: bool = False) -> str: filters out ghstack info """ # Adding the url here makes it clickable within the Github UI approved_by_urls = ', '.join(prefix_with_github_url(login) for login in self.get_approved_by()) + # Remove "cc: " line from the message body + msg_body = re.sub(RE_PR_CC_LINE, "", self.get_body()) + if filter_ghstack: + msg_body = re.sub(RE_GHSTACK_DESC, "", msg_body) msg = self.get_title() + f" (#{self.pr_num})\n\n" - msg += self.get_body() if not filter_ghstack else re.sub(RE_GHSTACK_DESC, "", self.get_body()) + msg += msg_body msg += f"\nPull Request resolved: {self.get_pr_url()}\n" msg += f"Approved by: {approved_by_urls}\n" return msg def merge_into(self, repo: GitRepo, *, - force: bool = False, + skip_mandatory_checks: bool = False, dry_run: bool = False, - comment_id: Optional[int] = None) -> None: + comment_id: Optional[int] = None, + land_check_commit: Optional[str] = None) -> None: # Raises exception if matching rule is not found - find_matching_merge_rule(self, repo, force=force, skip_internal_checks=can_skip_internal_checks(self, comment_id)) - self.merge_changes(repo, force, comment_id) + 
find_matching_merge_rule( + self, + repo, + skip_mandatory_checks=skip_mandatory_checks, + skip_internal_checks=can_skip_internal_checks(self, comment_id), + land_check_commit=land_check_commit) + self.merge_changes(repo, skip_mandatory_checks, comment_id, land_check_commit=land_check_commit) repo.push(self.default_branch(), dry_run) if not dry_run: + if land_check_commit: + self.delete_land_time_check_branch(repo) gh_add_labels(self.org, self.project, self.pr_num, ["merged"]) def merge_changes(self, repo: GitRepo, - force: bool = False, + skip_mandatory_checks: bool = False, comment_id: Optional[int] = None, + land_check_commit: Optional[str] = None, branch: Optional[str] = None) -> None: branch_to_merge_into = self.default_branch() if branch is None else branch if repo.current_branch() != branch_to_merge_into: @@ -893,14 +954,25 @@ def merge_changes(self, repo._run_git("merge", "--squash", pr_branch_name) repo._run_git("commit", f"--author=\"{self.get_author()}\"", "-m", msg) else: - self.merge_ghstack_into(repo, force, comment_id=comment_id) + self.merge_ghstack_into( + repo, + skip_mandatory_checks, + comment_id=comment_id, + land_check_commit=land_check_commit + ) def create_land_time_check_branch(self, repo: GitRepo, branch: str, - force: bool = False, + skip_mandatory_checks: bool = False, comment_id: Optional[int] = None,) -> str: - self.merge_changes(repo, branch=branch, force=force, comment_id=comment_id) + orig_branch = repo.current_branch() + self.merge_changes( + repo, + branch=branch, + skip_mandatory_checks=skip_mandatory_checks, + comment_id=comment_id + ) land_check_branch = f'landchecks/{self.pr_num}' try: repo._run_git('branch', "-D", land_check_branch) @@ -909,8 +981,16 @@ def create_land_time_check_branch(self, repo._run_git('checkout', "-b", land_check_branch) repo._run_git('push', '-u', 'origin', land_check_branch, '--force') commit = repo.get_commit('HEAD').commit_hash + # Important, return to original branch + if repo.current_branch() != orig_branch: + repo.checkout(orig_branch) return commit + def delete_land_time_check_branch(self, + repo: GitRepo) -> None: + land_check_branch = f'landchecks/{self.pr_num}' + repo._run_git('push', 'origin', '-d', land_check_branch) + class MandatoryChecksMissingError(Exception): pass @@ -927,10 +1007,20 @@ class MergeRule: mandatory_checks_name: Optional[List[str]] -def read_merge_rules(repo: Optional[GitRepo], org: str, project: str) -> List[MergeRule]: - from pathlib import Path +def gen_new_issue_link( + org: str, + project: str, + labels: List[str], + template: str = "bug-report.yml" +) -> str: + labels_str = ",". join(labels) + return (f"https://github.com/{org}/{project}/issues/new?" 
+ f"labels={urllib.parse.quote(labels_str)}&" + f"template={urllib.parse.quote(template)}") + - repo_relative_rules_path = Path(".github") / "merge_rules.json" +def read_merge_rules(repo: Optional[GitRepo], org: str, project: str) -> List[MergeRule]: + repo_relative_rules_path = MERGE_RULE_PATH if repo is None: json_data = _fetch_url( f"https://api.github.com/repos/{org}/{project}/contents/{repo_relative_rules_path}", @@ -938,28 +1028,46 @@ def read_merge_rules(repo: Optional[GitRepo], org: str, project: str) -> List[Me reader=json.load, ) content = base64.b64decode(json_data["content"]) - return cast(List[MergeRule], json.loads(content, object_hook=lambda x: MergeRule(**x))) + return [MergeRule(**x) for x in yaml.safe_load(content)] else: rules_path = Path(repo.repo_dir) / repo_relative_rules_path if not rules_path.exists(): print(f"{rules_path} does not exist, returning empty rules") return [] with open(rules_path) as fp: - rc = json.load(fp, object_hook=lambda x: MergeRule(**x)) - return cast(List[MergeRule], rc) + rc = yaml.safe_load(fp) + return [MergeRule(**x) for x in rc] def find_matching_merge_rule(pr: GitHubPR, repo: Optional[GitRepo] = None, - force: bool = False, - skip_internal_checks: bool = False + skip_mandatory_checks: bool = False, + skip_internal_checks: bool = False, + land_check_commit: Optional[str] = None, ) -> MergeRule: """Returns merge rule matching to this pr or raises an exception""" changed_files = pr.get_changed_files() approved_by = set(pr.get_approved_by()) + issue_link = gen_new_issue_link( + org=pr.org, + project=pr.project, + labels=["module: ci"], + ) + reject_reason = f"No rule found to match PR. Please [report]{issue_link} this issue to DevX team." + rules = read_merge_rules(repo, pr.org, pr.project) - reject_reason = f"PR {pr.pr_num} does not match merge rules" - # Used to determine best rejection reason + if not rules: + reject_reason = f"Rejecting the merge as no rules are defined for the repository in {MERGE_RULE_PATH}" + raise RuntimeError(reject_reason) + + # PRs can fail multiple merge rules, but it only needs to pass one rule to be approved. + # If it fails all rules, we need to find the rule that it came closest to passing and report + # that to the dev. + # + # reject_reason_score ranks rules by relevancy. The higher the score, the more relevant the + # rule & rejection reason, and we only care about the most relevant rule/reason + # + # reject_reason_score intrepretation: # Score 0 to 10K - how many files rule matched # Score 10K - matched all files, but no overlapping approvers # Score 20K - matched all files and approvers, but mandatory checks are pending @@ -969,6 +1077,8 @@ def find_matching_merge_rule(pr: GitHubPR, rule_name = rule.name patterns_re = patterns_to_regex(rule.patterns) non_matching_files = [] + + # Does this rule apply to all the files? for fname in changed_files: if not patterns_re.match(fname): non_matching_files.append(fname) @@ -976,16 +1086,21 @@ def find_matching_merge_rule(pr: GitHubPR, num_matching_files = len(changed_files) - len(non_matching_files) if num_matching_files > reject_reason_score: reject_reason_score = num_matching_files - reject_reason = (f"{num_matching_files} files matched rule {rule_name}, but there are still non-matching files: " + - f"{','.join(non_matching_files[:5])}{', ...' if len(non_matching_files) > 5 else ''}") + reject_reason = "\n".join(( + f"Not all files match rule `{rule_name}`." 
+ f"{num_matching_files} files matched, but there are still non-matching files:" + f"{','.join(non_matching_files[:5])}{', ...' if len(non_matching_files) > 5 else ''}" + )) continue + # If rule needs approvers but PR has not been reviewed, skip it if len(rule.approved_by) > 0 and len(approved_by) == 0: if reject_reason_score < 10000: reject_reason_score = 10000 - reject_reason = f"Matched rule {rule_name}, but PR #{pr.pr_num} has not been reviewed yet" + reject_reason = f"PR #{pr.pr_num} has not been reviewed yet (Rule {rule_name})" continue + # Does the PR have the required approvals for this rule? rule_approvers_set = set() for approver in rule.approved_by: if "/" in approver: @@ -998,35 +1113,51 @@ def find_matching_merge_rule(pr: GitHubPR, if len(approvers_intersection) == 0 and len(rule_approvers_set) > 0: if reject_reason_score < 10000: reject_reason_score = 10000 - reject_reason = (f"Matched rule {rule_name}, but PR #{pr.pr_num} was not reviewed yet by any of: " + - f"{', '.join(list(rule_approvers_set)[:5])}{', ...' if len(rule_approvers_set) > 5 else ''}") + reject_reason = "\n".join(( + f"Approval needed from one of the following (Rule '{rule_name}'):", + f"{', '.join(list(rule_approvers_set)[:5])}{', ...' if len(rule_approvers_set) > 5 else ''}" + )) continue + + # Does the PR pass the checks required by this rule? mandatory_checks = rule.mandatory_checks_name if rule.mandatory_checks_name is not None else [] - checks = pr.get_checkrun_conclusions() - required_checks = filter(lambda x: force is False or "CLA Check" in x, mandatory_checks) + checks = get_combined_checks_from_pr_and_land_validation(pr, land_check_commit) + required_checks = filter(lambda x: skip_mandatory_checks is False or "EasyCLA" in x, mandatory_checks) [pending_checks, failed_checks] = categorize_checks(checks, required_checks) + hud_link = f"https://hud.pytorch.org/{pr.org}/{pr.project}/commit/{pr.last_commit()['oid']}" if len(failed_checks) > 0: if reject_reason_score < 30000: reject_reason_score = 30000 - reject_reason = ("Refusing to merge as mandatory check(s) " + - checks_to_str(failed_checks) + f" failed for rule {rule_name}") + reject_reason = "\n".join(( + f"The following mandatory check(s) failed (Rule `{rule_name}`):", + *checks_to_markdown_bullets(failed_checks), + "", + f"Dig deeper by [viewing the failures on hud]({hud_link})" + )) continue elif len(pending_checks) > 0: if reject_reason_score < 20000: reject_reason_score = 20000 - reject_reason = f"Refusing to merge as mandatory check(s) {checks_to_str(pending_checks)}" - reject_reason += f" are pending/not yet run for rule {rule_name}" + reject_reason = "\n".join(( + f"The following mandatory check(s) are pending/not yet run (Rule `{rule_name}`):", + *checks_to_markdown_bullets(pending_checks), + "", + f"Dig deeper by [viewing the pending checks on hud]({hud_link})" + )) continue + if not skip_internal_checks and pr.has_internal_changes(): raise RuntimeError("This PR has internal changes and must be landed via Phabricator") + return rule + if reject_reason_score == 20000: raise MandatoryChecksMissingError(reject_reason) raise RuntimeError(reject_reason) -def get_land_checkrun_conclusions(org: str, project: str, commit: str) -> Dict[str, Tuple[str, str]]: +def get_land_checkrun_conclusions(org: str, project: str, commit: str) -> Dict[str, WorkflowCheckState]: def get_commit_next_check_runs(edges: List[Dict[str, Dict[str, Any]]], edge_idx: int, checkruns: Any) -> Any: rc = gh_graphql(GH_GET_COMMIT_NEXT_CHECK_RUNS, @@ -1055,18 +1186,48 @@ def 
get_commit_next_checksuites(checksuites: Any) -> Any: def checks_to_str(checks: List[Tuple[str, Optional[str]]]) -> str: return ", ".join(f"[{c[0]}]({c[1]})" if c[1] is not None else c[0] for c in checks) -def pr_get_checks_with_lambda(pr: GitHubPR, status_check: Callable[[Optional[str]], bool]) -> List[Tuple[str, str]]: - checks = pr.get_checkrun_conclusions() - return [(name, status[1]) for name, status in checks.items() if status_check(status[0])] +def checks_to_markdown_bullets(checks: List[Tuple[str, Optional[str]]]) -> List[str]: + return [f"- [{c[0]}]({c[1]})" if c[1] is not None else f"- {c[0]}" for c in checks] + +def get_combined_checks_from_pr_and_land_validation( + pr: GitHubPR, + land_check_commit: Optional[str] +) -> Dict[str, WorkflowCheckState]: + """ + Combines checks from both the PR and land validation to get a holistic view + of all checks. + + This helps us cover the corner case where certain workflows may have been + requested on the PR but are not part of land validation (e.g. nightly + builds) or are implicitly run on PRs but not on land validation branches + (like CLA Checks). -def pr_get_pending_checks(pr: GitHubPR) -> List[Tuple[str, str]]: - return pr_get_checks_with_lambda(pr, lambda x: x is None) + At the same time, we prioritize the signal workflows which do run on land + validation. + E.g. if a workflow fails on the PR but passes on land validation then we'd + use the successful result from the land validation. + """ -def pr_get_failed_checks(pr: GitHubPR) -> List[Tuple[str, str]]: - return pr_get_checks_with_lambda(pr, lambda x: x in ["FAILURE", "STARTUP_FAILURE"]) + pr_checks = pr.get_checkrun_conclusions() + land_validation_checks = get_land_checkrun_conclusions(pr.org, pr.project, land_check_commit) if land_check_commit else {} + # Merge the two checks together. 
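# (Illustrative, hypothetical check name: if "pull / linux-bionic / test" is FAILURE in pr_checks
# but SUCCESS in land_validation_checks, the later dict expansion below wins and SUCCESS is kept.)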
Land validation check results (if any) overwrite pr check results + merged_checks = {**pr_checks, **land_validation_checks} # explanation: https://stackoverflow.com/a/26853961/21539 + return merged_checks + +def filter_checks_with_lambda( + checks: Dict[str, WorkflowCheckState], + status_filter: Callable[[Optional[str]], bool] +) -> List[WorkflowCheckState]: + return [check for check in checks.values() if status_filter(check.status)] + +def filter_pending_checks(checks: Dict[str, WorkflowCheckState]) -> List[WorkflowCheckState]: + return filter_checks_with_lambda(checks, lambda x: x is None) + +def filter_failed_checks(checks: Dict[str, WorkflowCheckState]) -> List[WorkflowCheckState]: + return filter_checks_with_lambda(checks, lambda x: x in ["FAILURE", "STARTUP_FAILURE"]) def validate_revert(repo: GitRepo, pr: GitHubPR, *, comment_id: Optional[int] = None) -> Tuple[str, str]: @@ -1087,7 +1248,7 @@ def validate_revert(repo: GitRepo, pr: GitHubPR, *, skip_internal_checks = can_skip_internal_checks(pr, comment_id) # Raises exception if matching rule is not found, but ignores all status checks - find_matching_merge_rule(pr, repo, force=True, skip_internal_checks=skip_internal_checks) + find_matching_merge_rule(pr, repo, skip_mandatory_checks=True, skip_internal_checks=skip_internal_checks) commit_sha = pr.get_merge_commit() if commit_sha is None: commits = repo.commits_resolving_gh_pr(pr.pr_num) @@ -1129,8 +1290,8 @@ def post_comment(msg: str) -> None: def prefix_with_github_url(suffix_str: str) -> str: return f"https://github.com/{suffix_str}" -def check_for_sev(org: str, project: str, force: bool) -> None: - if force: +def check_for_sev(org: str, project: str, skip_mandatory_checks: bool) -> None: + if skip_mandatory_checks: return response = cast( Dict[str, Any], @@ -1164,24 +1325,22 @@ def validate_land_time_checks(org: str, project: str, commit: str) -> None: def has_label(labels: List[str], pattern: Pattern[str] = CIFLOW_LABEL) -> bool: return len(list(filter(pattern.match, labels))) > 0 -def categorize_checks(check_runs: Dict[str, Tuple[str, str]], +def categorize_checks(check_runs: Dict[str, WorkflowCheckState], required_checks: Iterable[str]) -> Tuple[List[Tuple[str, Optional[str]]], List[Tuple[str, Optional[str]]]]: pending_checks: List[Tuple[str, Optional[str]]] = [] failed_checks: List[Tuple[str, Optional[str]]] = [] for checkname in required_checks: if checkname not in check_runs: pending_checks.append((checkname, None)) - elif check_runs[checkname][0] is None: - pending_checks.append((checkname, check_runs[checkname][1])) - elif (check_runs[checkname][0].upper() != 'SUCCESS' - and check_runs[checkname][0].upper() != 'SKIPPED' - and check_runs[checkname][0].upper() != 'NEUTRAL'): - failed_checks.append((checkname, check_runs[checkname][1])) + elif check_runs[checkname].status is None: + pending_checks.append((checkname, check_runs[checkname].url)) + elif (str(check_runs[checkname].status).upper() not in ['SUCCESS', 'SKIPPED', 'NEUTRAL']): + failed_checks.append((checkname, check_runs[checkname].url)) return (pending_checks, failed_checks) def merge(pr_num: int, repo: GitRepo, dry_run: bool = False, - force: bool = False, + skip_mandatory_checks: bool = False, comment_id: Optional[int] = None, mandatory_only: bool = False, on_green: bool = False, @@ -1192,65 +1351,100 @@ def merge(pr_num: int, repo: GitRepo, org, project = repo.gh_owner_and_name() pr = GitHubPR(org, project, pr_num) initial_commit_sha = pr.last_commit()['oid'] - explainer = TryMergeExplainer(force, on_green, 
land_checks, pr.get_labels(), pr.pr_num, org, project) + explainer = TryMergeExplainer(skip_mandatory_checks, on_green, land_checks, pr.get_labels(), pr.pr_num, org, project) on_green, land_checks = explainer.get_flags() land_check_commit = None - check_for_sev(org, project, force) + check_for_sev(org, project, skip_mandatory_checks) - if force or can_skip_internal_checks(pr, comment_id): + if skip_mandatory_checks or can_skip_internal_checks(pr, comment_id): # do not wait for any pending signals if PR is closed as part of co-development process gh_post_pr_comment(org, project, pr.pr_num, explainer.get_merge_message()) - return pr.merge_into(repo, dry_run=dry_run, force=force, comment_id=comment_id) + return pr.merge_into( + repo, + dry_run=dry_run, + skip_mandatory_checks=skip_mandatory_checks, + comment_id=comment_id + ) - if land_checks: - land_check_commit = pr.create_land_time_check_branch(repo, 'viable/strict', force=force, comment_id=comment_id) + # Important: check for merge rule once before starting land checks + # because we want to make sure that only approved PRs can start CI + # jobs. If there's missing approval, a RuntimeError will be raised + # here to stop the merge process right away + find_matching_merge_rule(pr, repo, skip_mandatory_checks=True) + + if land_checks and not dry_run: + land_check_commit = pr.create_land_time_check_branch( + repo, + 'viable/strict', + skip_mandatory_checks=skip_mandatory_checks, + comment_id=comment_id + ) gh_post_pr_comment(org, project, pr.pr_num, explainer.get_merge_message(land_check_commit)) if (datetime.utcnow() - pr.last_pushed_at()).days > stale_pr_days: - raise RuntimeError("This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.") + if land_checks and not dry_run: + pr.delete_land_time_check_branch(repo) + raise RuntimeError(f"This PR is too stale; the last push date was more than {stale_pr_days} days ago. " + "Please rebase and try again. You can rebase by leaving the following comment on this PR:\n" + "`@pytorchbot rebase`") start_time = time.time() last_exception = '' elapsed_time = 0.0 while elapsed_time < timeout_minutes * 60: - check_for_sev(org, project, force) + check_for_sev(org, project, skip_mandatory_checks) current_time = time.time() elapsed_time = current_time - start_time print(f"Attempting merge of https://github.com/{org}/{project}/pull/{pr_num} ({elapsed_time / 60} minutes elapsed)") pr = GitHubPR(org, project, pr_num) if initial_commit_sha != pr.last_commit()['oid']: + if land_checks and not dry_run: + pr.delete_land_time_check_branch(repo) raise RuntimeError("New commits were pushed while merging. Please rerun the merge command.") try: find_matching_merge_rule(pr, repo) - pending = pr_get_pending_checks(pr) - failing = pr_get_failed_checks(pr) + checks = get_combined_checks_from_pr_and_land_validation(pr, land_check_commit) + pending = filter_pending_checks(checks) + failing = filter_failed_checks(checks) # HACK until GitHub will be better about surfacing those - startup_failures = pr_get_checks_with_lambda(pr, lambda x: x == "STARTUP_FAILURE") + startup_failures = filter_checks_with_lambda(checks, lambda status: status == "STARTUP_FAILURE") if len(startup_failures) > 0: raise RuntimeError(f"{len(failing)} STARTUP failures reported, please check workflows syntax! 
" + - ' ,'.join(f"[{x[0]}]({x[1]})" for x in startup_failures[:5])) + ' ,'.join(f"[{x.name}]({x.url})" for x in startup_failures[:5])) # END of HACK if (not mandatory_only and on_green) and len(failing) > 0: raise RuntimeError(f"{len(failing)} additional jobs have failed, first few of them are: " + - ' ,'.join(f"[{x[0]}]({x[1]})" for x in failing[:5])) + ' ,'.join(f"[{x.name}]({x.url})" for x in failing[:5])) if (not mandatory_only and on_green) and len(pending) > 0: raise MandatoryChecksMissingError(f"Still waiting for {len(pending)} additional jobs to finish, " + - f"first few of them are: {' ,'.join(x[0] for x in pending[:5])}") + f"first few of them are: {' ,'.join(x.name for x in pending[:5])}") if land_checks and land_check_commit is not None: validate_land_time_checks(org, project, land_check_commit) - return pr.merge_into(repo, dry_run=dry_run, force=force, comment_id=comment_id) + return pr.merge_into( + repo, + dry_run=dry_run, + skip_mandatory_checks=skip_mandatory_checks, + comment_id=comment_id, + land_check_commit=land_check_commit + ) except MandatoryChecksMissingError as ex: last_exception = str(ex) print(f"Merge of https://github.com/{org}/{project}/pull/{pr_num} failed due to: {ex}. Retrying in 5 min") time.sleep(5 * 60) + except RuntimeError: + if land_checks and not dry_run: + pr.delete_land_time_check_branch(repo) + raise # Finally report timeout back msg = f"Merged timed out after {timeout_minutes} minutes. Please contact the pytorch_dev_infra team." msg += f"The last exception was: {last_exception}" if not dry_run: + if land_checks: + pr.delete_land_time_check_branch(repo) gh_add_labels(org, project, pr_num, ["land-failed"]) raise RuntimeError(msg) @@ -1260,13 +1454,26 @@ def main() -> None: org, project = repo.gh_owner_and_name() pr = GitHubPR(org, project, args.pr_num) - def handle_exception(e: Exception, msg: str = "Merge failed") -> None: - msg += f" due to {e}" + def handle_exception(e: Exception, title: str = "Merge failed") -> None: + exception = f"**Reason**: {e}" + + internal_debugging = "" run_url = os.getenv("GH_RUN_URL") if run_url is not None: - msg += f"\nRaised by {run_url}" - if args.land_checks: - msg += get_land_check_troubleshooting_message() + # Hide this behind a collapsed bullet since it's not helpful to most devs + internal_debugging = "\n".join(( + "
<details><summary>Details for Dev Infra team</summary>", + f"Raised by <a href=\"{run_url}\">workflow job</a>", + "</details>
" + )) + + msg = "\n".join(( + f"## {title}", + f"{exception}", + "", + f"{internal_debugging}" + )) + gh_post_pr_comment(org, project, args.pr_num, msg, dry_run=args.dry_run) import traceback traceback.print_exc() @@ -1290,7 +1497,7 @@ def handle_exception(e: Exception, msg: str = "Merge failed") -> None: try: merge(args.pr_num, repo, dry_run=args.dry_run, - force=args.force, + skip_mandatory_checks=args.force, comment_id=args.comment_id, on_green=args.on_green, mandatory_only=args.on_mandatory, diff --git a/.github/scripts/trymerge_explainer.py b/.github/scripts/trymerge_explainer.py index e59307f10854..a7be2f78c4bc 100644 --- a/.github/scripts/trymerge_explainer.py +++ b/.github/scripts/trymerge_explainer.py @@ -9,12 +9,10 @@ CIFLOW_TRUNK_LABEL = re.compile(r"^ciflow/trunk") OFFICE_HOURS_LINK = "https://github.com/pytorch/pytorch/wiki/Dev-Infra-Office-Hours" -CONTACT_US = f"Please reach out to the [PyTorch DevX Team]({OFFICE_HOURS_LINK}) with feedback or questions!" +CONTACT_US = f"Questions? Feedback? Please reach out to the [PyTorch DevX Team]({OFFICE_HOURS_LINK})" ALTERNATIVES = ( - "If this is not the intended behavior, feel free to use some " - + f"of the other merge options in the [wiki]({BOT_COMMANDS_WIKI})." + f"Learn more about merging in the [wiki]({BOT_COMMANDS_WIKI})." ) -LAND_CHECK_ROLLOUT = "https://github.com/pytorch/test-infra/blob/main/torchci/lib/bot/rolloutUtils.ts#L1-L34" def has_label(labels: List[str], pattern: Pattern[str] = CIFLOW_LABEL) -> bool: @@ -62,68 +60,49 @@ def get_flags(self) -> Tuple[bool, bool]: def _get_flag_msg(self) -> str: if self.force: - return " the force (-f) flag." + return "Your change will be merged immediately since you used the force (-f) flag, " + \ + "**bypassing any CI checks** (ETA: 1-5 minutes)." elif self.on_green: - return " the green (-g) flag." + return "Your change will be merged once all checks on your PR pass since you used the green (-g) flag (ETA: 0-4 Hours)." elif self.land_checks: - return ( - " the land checks (-l) flag." - + " If you did not specify this flag yourself, " - + f" you are likely enrolled in the [land checks rollout]({LAND_CHECK_ROLLOUT})." - ) + flag_msg = \ + "**The `-l` land checks flag is deprecated and no longer needed.** Instead we now automatically " + \ + "add the `ciflow\\trunk` label to your PR once it's approved\n\n" + + if self.has_trunk_label: + flag_msg += "Your change will be merged once all checks on your PR pass (ETA 0-4 Hours)." + else: + flag_msg += "Your change will be merged once the land checks pass (**ETA 4 Hours**)." + + return flag_msg else: - return "out a flag." + return "Your change will be merged once all checks pass (ETA 0-4 Hours)." def _get_land_check_progress(self, commit: Optional[str]) -> str: if commit is not None: return ( " and land check " - + f"progress [here](https://hud.pytorch.org/{self.org}/{self.project}/commit/{commit})" + + f"progress here" ) else: return "" - def _get_flag_explanation_message(self) -> str: - if self.force: - return "This means your change will be merged **immediately**, bypassing any CI checks (ETA: 1-5 minutes)." - elif self.on_green: - return "This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours)." - elif self.land_checks: - if self.has_trunk_label: - land_check_msg_suffix = "have passed since you have added the `ciflow/trunk` label to your PR (ETA 0-4 Hours)." - else: - land_check_msg_suffix = ( - "and the land checks have passed (**ETA 4 Hours**). 
" - ) - land_check_msg_suffix += "If you need to coordinate lands between different changes and cannot risk a land race, " - land_check_msg_suffix += "please add the `ciflow/trunk` label to your PR and wait for signal to complete, " - land_check_msg_suffix += "and then land your changes in proper order." - land_check_msg_suffix += ( - " Having `trunk`, `pull`, and `Lint` pre-run on a " - ) - land_check_msg_suffix += ( - "PR will bypass land checks and the ETA should be immediate." - ) - - return ( - "This means that your change will be merged once all checks on your PR " - + land_check_msg_suffix - ) - else: - return "This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours)." - def get_merge_message(self, commit: Optional[str] = None) -> str: - message_prefix = "@pytorchbot successfully started a merge job." - progress_links = f"Check the current status [here]({os.getenv('GH_RUN_URL')}){self._get_land_check_progress(commit)}." - flag_message = f"The merge job was triggered with{self._get_flag_msg()}" - explanation_message = self._get_flag_explanation_message() - - msg = message_prefix + " " - msg += progress_links + "\n" - msg += flag_message + " " - msg += explanation_message + " " - msg += ALTERNATIVES + "\n" + title = "### Merge started" + main_message = self._get_flag_msg() + + advanced_debugging = "\n".join(( + "
<details><summary>Advanced Debugging</summary>", + "Check the merge workflow status ", + f"<a href=\"{os.getenv('GH_RUN_URL')}\">here</a>{self._get_land_check_progress(commit)}", + "</details>
" + )) + + msg = title + "\n" + msg += main_message + "\n\n" + msg += ALTERNATIVES + "\n\n" msg += CONTACT_US + msg += advanced_debugging return msg @@ -134,13 +113,3 @@ def get_revert_message(org: str, project: str, pr_num: int) -> str: ) msg += CONTACT_US return msg - - -def get_land_check_troubleshooting_message() -> str: - return ( - " If you believe this is an error, you can use the old behavior with `@pytorchbot merge -g`" - + " (optionally with the `ciflow/trunk` to get land checks)" - + ' or use `@pytorchbot merge -f "some reason here"`.' - + f" For more information, see the [bot wiki]({BOT_COMMANDS_WIKI}). \n" - + CONTACT_US - ) diff --git a/.github/scripts/tryrebase.py b/.github/scripts/tryrebase.py index 1b69f653e525..2e8987e9faaa 100755 --- a/.github/scripts/tryrebase.py +++ b/.github/scripts/tryrebase.py @@ -69,6 +69,7 @@ def rebase_ghstack_onto(pr: GitHubPR, repo: GitRepo, onto_branch: str, dry_run: push_result = ghstack_result.stdout.decode("utf-8") print(push_result) if ghstack_result.returncode != 0: + print(ghstack_result.stderr.decode("utf-8")) raise Exception(f"\n```{push_result}```") # The contents of a successful push result should look like: # Summary of changes (ghstack 0.6.0) diff --git a/.github/scripts/update_commit_hashes.py b/.github/scripts/update_commit_hashes.py index 5dad5877ca4a..4b638cf11c90 100644 --- a/.github/scripts/update_commit_hashes.py +++ b/.github/scripts/update_commit_hashes.py @@ -136,6 +136,7 @@ def main() -> None: ) with open(f".github/ci_commit_pins/{args.repo_name}.txt", "r+") as f: old_hash = f.read().strip() + subprocess.run(f"git checkout {old_hash}".split(), cwd=args.repo_name) f.seek(0) f.truncate() f.write(f"{hash}\n") diff --git a/.github/scripts/wait_for_ssh_to_drain.sh b/.github/scripts/wait_for_ssh_to_drain.sh deleted file mode 100755 index f33d80764033..000000000000 --- a/.github/scripts/wait_for_ssh_to_drain.sh +++ /dev/null @@ -1,13 +0,0 @@ -#!/usr/bin/env bash - -set -eou pipefail - -echo "Holding runner for 2 hours until all ssh sessions have logged out" -for _ in $(seq 1440); do - # Break if no ssh session exists anymore - if [ "$(who)" = "" ]; then - break - fi - echo "." - sleep 5 -done diff --git a/.github/templates/common.yml.j2 b/.github/templates/common.yml.j2 index f0f3e3a430f7..edb652ff16ce 100644 --- a/.github/templates/common.yml.j2 +++ b/.github/templates/common.yml.j2 @@ -1,10 +1,8 @@ {%- set upload_artifact_s3_action = "seemethere/upload-artifact-s3@v5" -%} {%- set download_artifact_s3_action = "seemethere/download-artifact-s3@v4" -%} +{%- set upload_artifact_action = "actions/upload-artifact@v3" -%} +{%- set download_artifact_action = "actions/download-artifact@v3" -%} -{# squid_proxy is an private ELB that only available for GHA custom runners #} -{%- set squid_proxy = "http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -%} -{# squid_no_proxy is a list of common set of fixed domains or IPs that we don't need to proxy. 
See https://docs.aws.amazon.com/AmazonECS/latest/developerguide/http_proxy_config.html#windows-proxy #} -{%- set squid_no_proxy = "localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" -%} {%- set timeout_minutes = 240 -%} # NOTE: If testing pytorch/builder changes you can change this variable to change what pytorch/builder reference @@ -17,43 +15,6 @@ concurrency: cancel-in-progress: true {%- endmacro -%} -{%- macro add_retry_to_env() -%} - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } -{%- endmacro -%} - -{%- macro gen_dispatch_rules(on_pull_request, is_scheduled, ciflow_labels, branches = ['master', 'main', 'release/*'], enable_doc_jobs = True) -%} -on: -{%- if on_pull_request %} - pull_request: -{%- endif %} - push: -{%- if enable_doc_jobs and is_scheduled %} - tags: - # NOTE: Binary build pipelines should only get triggered on release candidate builds - # Release candidate tags look like: v1.11.0-rc1 - - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+ -{%- endif %} -{%- for label in ciflow_labels | sort %} - {%- if loop.first and not (enable_doc_jobs and is_scheduled) %} - tags: - {%- endif %} - - '!{{ label }}/*' -{%- endfor %} -{%- if not is_scheduled %} - branches: -{%- for branch in branches %} - - !{{ branch }} -{%- endfor %} -{%- endif %} -{%- if is_scheduled %} - schedule: - - cron: !{{ is_scheduled }} -{%- endif %} - workflow_dispatch: -{%- endmacro -%} - {%- macro display_ec2_information() -%} - name: Display EC2 information shell: bash @@ -71,52 +32,6 @@ on: echo "system info $(uname -a)" {%- endmacro -%} -{%- macro parse_ref(pytorch_directory="") -%} - - name: Parse ref - shell: bash -{%- if pytorch_directory %} - working-directory: !{{ pytorch_directory }} -{%- endif %} - id: parse-ref - run: ./.github/scripts/parse_ref.py -{%- endmacro -%} - -{%- macro upload_test_statistics(build_environment, when="always()", pytorch_directory="", needs_credentials=False) -%} - - name: Upload test statistics -{%- if pytorch_directory %} - working-directory: !{{ pytorch_directory }} -{%- endif %} - if: !{{ when }} - env: - AWS_DEFAULT_REGION: us-east-1 - GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} - BRANCH: ${{ steps.parse-ref.outputs.branch }} - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - TAG: ${{ steps.parse-ref.outputs.tag }} - WORKFLOW_ID: '${{ github.run_id }}' - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} -{%- if needs_credentials %} - AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} -{%- endif %} - shell: bash - run: | - set -x - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - GHA_WORKFLOW_JOB_ID=$(python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}") - export GHA_WORKFLOW_JOB_ID - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test -{%- endmacro -%} - -{%- macro chown_dir(dir) -%} - - name: Chown artifacts - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "!{{ dir }}:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
-{%- endmacro -%} {%- macro setup_ec2_windows() -%} !{{ display_ec2_information() }} @@ -136,27 +51,6 @@ on: Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore {%- endmacro -%} -{%- macro setup_ec2_linux() -%} - - name: Checkout PyTorch - uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - - name: Setup Linux - uses: ./.github/actions/setup-linux - - name: Chown workspace - run: | - !{{ add_retry_to_env() }} - retry docker pull "${ALPINE_IMAGE}" - # Ensure the working directory gets chowned back to the current user - docker run --pull=never --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Clean workspace - run: | - rm -rf "${GITHUB_WORKSPACE}" - mkdir "${GITHUB_WORKSPACE}" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} -{%- endmacro -%} - {%- macro setup_rocm_linux() -%} - name: Clean workspace run: | @@ -184,7 +78,12 @@ on: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -197,29 +96,6 @@ on: env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" {%- endmacro -%} -{%- macro teardown_ec2_linux(pytorch_directory="") -%} - - name: Hold runner for 2 hours or until ssh sessions have drained -{%- if pytorch_directory %} - working-directory: !{{ pytorch_directory }} -{%- endif %} - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af -{%- endmacro -%} - {%- macro teardown_rocm_linux() -%} - name: Kill containers, clean up images if: always() @@ -260,186 +136,6 @@ on: {%- endif %} {%- endmacro -%} -{%- macro upload_downloaded_files(name, config=None, shard=None, num_shards=None, runner=None, artifact_name="", use_s3=True, when="always()") -%} - - name: Zip JSONs for upload - if: !{{ when }} - env: -{%- if name == 'linux' or name == 'windows' or name == 'macos' %} - FILE_SUFFIX: '${{ github.job }}-!{{ config }}-!{{ shard }}-!{{ num_shards }}-!{{ runner }}'{%- else %} - FILE_SUFFIX: '!{{ name }}-${{ github.job }}' -{%- endif %} -{%- if name == 'windows' %} - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\*.json' -{%- else %} - run: | - # Remove any previous test jsons if they exist - rm -f test-jsons-*.zip - zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' -{%- endif %} -{%- if use_s3 %} - - uses: !{{ upload_artifact_s3_action }} - name: Store Test Downloaded JSONs on S3 -{%- else %} - - uses: actions/upload-artifact@v2 - name: Store Test Downloaded JSONs on Github -{%- endif %} - if: !{{ when }} - with: -{%- if artifact_name != "" %} - name: !{{ artifact_name }} -{%- endif %} - retention-days: 14 - if-no-files-found: warn - path: - test-jsons-*.zip -{%- endmacro -%} - -{%- macro upload_test_reports(name, config=None, shard=None, num_shards=None, runner=None, artifact_name="", use_s3=True) -%} - - name: Zip test reports for upload - if: always() - env: -{%- if name == 'linux' or name == 'windows' or name == 'macos' %} - FILE_SUFFIX: '${{ github.job }}-!{{ config }}-!{{ shard }}-!{{ num_shards }}-!{{ runner }}' -{%- else %} - FILE_SUFFIX: '!{{ name }}-${{ github.job }}' -{%- endif %} -{%- if name == 'windows' %} - shell: powershell - run: | - # -ir => recursive include all files in pattern - 7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' -{%- else %} - run: | - # Remove any previous test reports if they exist - rm -f test-reports-*.zip - zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' -{%- endif %} -{%- if use_s3 %} - - uses: !{{ upload_artifact_s3_action }} - name: Store Test Reports on S3 -{%- else %} - - uses: actions/upload-artifact@v2 - name: Store Test Reports on Github -{%- endif %} - if: always() - with: -{%- if artifact_name != "" %} - name: !{{ artifact_name }} -{%- endif %} - retention-days: 14 - if-no-files-found: error - path: - test-reports-*.zip -{%- endmacro -%} - -{%- macro upload_cores(artifact_name="coredumps", config=None, shard=None, use_s3=True) -%} -{%- if use_s3 %}- uses: !{{ upload_artifact_s3_action }} - name: Store Core dumps on S3 -{%- else %}- uses: actions/upload-artifact@v2 - name: Store Core dumps on Github -{%- endif %} - if: failure() - with: -{%- if config != "" and shard != "" %} - name: !{{ artifact_name }}-!{{ config }}-!{{ shard }} -{%- else %} - name: !{{ artifact_name }} -{%- endif %} - retention-days: 14 - if-no-files-found: ignore - path: - ./**/core.[1-9]* -{%- endmacro -%} - -{%- macro render_test_results() -%} - - name: Install render_test_results dependencies - if: always() - shell: bash - run: | - python3 -m pip install junitparser==2.1.1 rich==10.9.0 - - name: "[[ Click me for rendered test results (useful for 
finding failing tests) ]]" - if: always() - shell: bash - # Encoding is weird on windows, just try to default to utf-8 if possible - env: - PYTHONIOENCODING: "utf-8" - run: | - python3 tools/render_junit.py test/ -{%- endmacro -%} - -{%- macro calculate_docker_image(always_rebuild) -%} - - name: Calculate docker image tag - id: calculate-tag - run: | - DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker) - echo "DOCKER_TAG=${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "DOCKER_IMAGE=${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" >> "${GITHUB_ENV}" - echo "::set-output name=docker_tag::${DOCKER_TAG}" - echo "::set-output name=docker_image::${DOCKER_IMAGE_BASE}:${DOCKER_TAG}" - - name: Check if image should be built - id: check - env: - BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }} - run: | - set -x -{%- if not always_rebuild %} - # Check if image already exists, if it does then skip building it - if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then - exit 0 - fi - if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then - # if we're on the base branch then use the parent commit - MERGE_BASE=$(git rev-parse HEAD~) - else - # otherwise we're on a PR, so use the most recent base commit - MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") - fi - # Covers the case where a previous tag doesn't exist for the tree - # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly - if ! git rev-parse "$MERGE_BASE:.circleci/docker"; then - echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit" - exit 1 - fi - PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker") - # If no image exists but the hash is the same as the previous hash then we should error out here - if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then - echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch" - echo " contact the PyTorch team to restore the original images" - exit 1 - fi -{%- endif %} - echo ::set-output name=rebuild::yes - - name: Build and push docker image - if: ${{ steps.check.outputs.rebuild }} - env: - DOCKER_SKIP_S3_UPLOAD: 1 - working-directory: .circleci/docker - run: | - export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/} - ./build_docker.sh -{%- endmacro -%} - -{%- macro setup_miniconda(python_version, activate_environment=True) -%} - - name: Setup miniconda - uses: conda-incubator/setup-miniconda@v2 - with: - auto-update-conda: true - python-version: !{{ python_version }} -{%- if activate_environment %} - activate-environment: build -{%- endif %} -{%- endmacro -%} - -{%- macro set_xcode_version(xcode_version) -%} -{%- if xcode_version != '' %} - # Set xcode xcode version to !{{ xcode_version }} - DEVELOPER_DIR: /Applications/Xcode_!{{ xcode_version }}.app/Contents/Developer -{%- endif %} -{%- endmacro -%} - {%- macro wait_and_kill_ssh_windows(pytorch_directory="") -%} - name: Wait until all sessions have drained shell: powershell diff --git a/.github/templates/linux_binary_build_workflow.yml.j2 b/.github/templates/linux_binary_build_workflow.yml.j2 index 2879da9dad9c..2c6529d32b66 100644 --- a/.github/templates/linux_binary_build_workflow.yml.j2 +++ b/.github/templates/linux_binary_build_workflow.yml.j2 @@ -52,6 +52,9 @@ jobs: with:!{{ upload.binary_env_as_input(config) }} build_name: !{{ config["build_name"] }} build_environment: !{{ build_environment }} + {%- if 
config.pytorch_extra_install_requirements is defined %} + PYTORCH_EXTRA_INSTALL_REQUIREMENTS: !{{ config.pytorch_extra_install_requirements }} + {%- endif %} secrets: github-token: ${{ secrets.GITHUB_TOKEN }} @@ -78,7 +81,7 @@ jobs: !{{ upload.binary_env(config) }} steps: !{{ common.setup_rocm_linux() }} - - uses: !{{ common.download_artifact_s3_action }} + - uses: !{{ common.download_artifact_action }} name: Download Build Artifacts with: name: !{{ config["build_name"] }} @@ -89,7 +92,7 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: docker-image: !{{ config["container_image"] }} - name: Test Pytorch binary diff --git a/.github/templates/macos_binary_build_workflow.yml.j2 b/.github/templates/macos_binary_build_workflow.yml.j2 index 64bc3653e8de..eb0c2ff4b373 100644 --- a/.github/templates/macos_binary_build_workflow.yml.j2 +++ b/.github/templates/macos_binary_build_workflow.yml.j2 @@ -58,17 +58,8 @@ jobs: {%- for config in build_configs %} !{{ config["build_name"] }}-build: if: ${{ github.repository_owner == 'pytorch' }} - {%- if config["package_type"] == "libtorch" %} - runs-on: macos-10.15 - {%- else %} runs-on: macos-12-xl - {%- endif %} -{%- if config["package_type"] == "libtorch" %} - # libtorch builds take a long time on github hosted runners - timeout-minutes: 720 -{%- else %} timeout-minutes: !{{ common.timeout_minutes }} -{%- endif %} !{{ upload.binary_env(config, true) }} # For sccache access (only on non-forked PRs) AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} @@ -78,18 +69,24 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" !{{ common.checkout(deep_clone=False, directory="pytorch") }} !{{ common.checkout(deep_clone=False, directory="builder", repository="pytorch/builder", branch=common.builder_branch) }} - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -100,7 +97,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" 
"${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: !{{ config["build_name"] }} diff --git a/.github/templates/windows_binary_build_workflow.yml.j2 b/.github/templates/windows_binary_build_workflow.yml.j2 index 6b0cbbd18740..9f68df06b704 100644 --- a/.github/templates/windows_binary_build_workflow.yml.j2 +++ b/.github/templates/windows_binary_build_workflow.yml.j2 @@ -72,7 +72,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: !{{ common.upload_artifact_s3_action }} + - uses: !{{ common.upload_artifact_action }} if: always() with: name: !{{ config["build_name"] }} @@ -93,7 +93,7 @@ jobs: steps: !{{ common.setup_ec2_windows() }} !{{ set_runner_specific_vars() }} - - uses: !{{ common.download_artifact_s3_action }} + - uses: !{{ common.download_artifact_action }} name: Download Build Artifacts with: name: !{{ config["build_name"] }} diff --git a/.github/workflows/_android-build-test.yml b/.github/workflows/_android-build-test.yml index 4d3e07826eae..dfa48daa84ac 100644 --- a/.github/workflows/_android-build-test.yml +++ b/.github/workflows/_android-build-test.yml @@ -28,6 +28,11 @@ jobs: if: github.repository_owner == 'pytorch' runs-on: [self-hosted, linux.2xlarge] steps: + - name: Setup SSH (Click me for login details) + uses: pytorch/test-infra/.github/actions/setup-ssh@main + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + # [see note: pytorch repo ref] - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master @@ -35,11 +40,6 @@ jobs: - name: Setup Linux uses: ./.github/actions/setup-linux - - name: Setup SSH (Click me for login details) - uses: ./.github/actions/setup-ssh - with: - github-secret: ${{ secrets.GITHUB_TOKEN }} - - name: Calculate docker image id: calculate-docker-image uses: ./.github/actions/calculate-docker-image @@ -48,7 +48,7 @@ jobs: xla: ${{ contains(inputs.build-environment, 'xla') }} - name: Pull docker image - uses: ./.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} @@ -112,5 +112,5 @@ jobs: if: always() - name: Teardown Linux - uses: ./.github/actions/teardown-linux + uses: pytorch/test-infra/.github/actions/teardown-linux@main if: always() diff --git a/.github/workflows/_android-full-build-test.yml b/.github/workflows/_android-full-build-test.yml index efc66846db7a..ea07fda814b1 100644 --- a/.github/workflows/_android-full-build-test.yml +++ b/.github/workflows/_android-full-build-test.yml @@ -19,23 +19,6 @@ on: If this is set, our linter will use this to make sure that every other job with the same `sync-tag` is identical. 
- secrets: - SONATYPE_NEXUS_USERNAME: - description: nexus user - required: true - SONATYPE_NEXUS_PASSWORD: - description: nexus pass - required: true - ANDROID_SIGN_KEY: - description: android key - required: true - ANDROID_SIGN_PASS: - description: android pass - required: true - SCRIBE_GRAPHQL_ACCESS_TOKEN: - description: token for writing to scribe/scuba - required: true - env: GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} @@ -45,6 +28,11 @@ jobs: if: github.repository_owner == 'pytorch' runs-on: [self-hosted, linux.2xlarge] steps: + - name: Setup SSH (Click me for login details) + uses: pytorch/test-infra/.github/actions/setup-ssh@main + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + # [see note: pytorch repo ref] - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master @@ -52,11 +40,6 @@ jobs: - name: Setup Linux uses: ./.github/actions/setup-linux - - name: Setup SSH (Click me for login details) - uses: ./.github/actions/setup-ssh - with: - github-secret: ${{ secrets.GITHUB_TOKEN }} - - name: Calculate docker image id: calculate-docker-image uses: ./.github/actions/calculate-docker-image @@ -64,7 +47,7 @@ jobs: docker-image-name: ${{ inputs.docker-image-name }} - name: Pull docker image - uses: ./.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} @@ -145,7 +128,7 @@ jobs: # run gradle buildRelease (echo "./.circleci/scripts/build_android_gradle.sh" | docker exec \ - -e BUILD_ENVIRONMENT="pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build" \ + -e BUILD_ENVIRONMENT="pytorch-linux-focal-py3-clang7-android-ndk-r19c-gradle-build" \ -e MAX_JOBS="$(nproc --ignore=2)" \ -e AWS_DEFAULT_REGION \ -e PR_NUMBER \ @@ -160,25 +143,6 @@ jobs: mkdir -p "${GITHUB_WORKSPACE}/build_android_artifacts" docker cp "${ID_X86_32}:/var/lib/jenkins/workspace/android/artifacts.tgz" "${GITHUB_WORKSPACE}/build_android_artifacts/" - - name: Publish android snapshot - if: ${{ github.event_name == 'push' && github.event.ref == 'refs/heads/nightly' }} - env: - SONATYPE_NEXUS_USERNAME: ${{ secrets.SONATYPE_NEXUS_USERNAME }} - SONATYPE_NEXUS_PASSWORD: ${{ secrets.SONATYPE_NEXUS_PASSWORD }} - ANDROID_SIGN_KEY: ${{ secrets.ANDROID_SIGN_KEY }} - ANDROID_SIGN_PASS: ${{ secrets.ANDROID_SIGN_PASS }} - ID_X86_32: ${{ steps.build-x86_32.outputs.container_id }} - run: | - set -eux - (echo "./.circleci/scripts/publish_android_snapshot.sh" | docker exec \ - -e BUILD_ENVIRONMENT="pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-publish-snapshot" \ - -e SONATYPE_NEXUS_USERNAME \ - -e SONATYPE_NEXUS_PASSWORD \ - -e ANDROID_SIGN_KEY \ - -e ANDROID_SIGN_PASS \ - -e http_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e https_proxy="http://internal-tf-lb-20210727220640487900000002-835786077.us-east-1.elb.amazonaws.com:3128" -e no_proxy="localhost,127.0.0.1,github.com,amazonaws.com,s3.amazonaws.com,169.254.169.254,169.254.170.2,/var/run/docker.sock" \ - -u jenkins -i "${ID_X86_32}" bash) 2>&1 - - name: Store PyTorch Android Build Artifacts on S3 uses: seemethere/upload-artifact-s3@v5 with: @@ -192,5 +156,5 @@ jobs: if: always() - name: Teardown Linux - uses: ./.github/actions/teardown-linux + uses: pytorch/test-infra/.github/actions/teardown-linux@main if: always() diff --git a/.github/workflows/_bazel-build-test.yml b/.github/workflows/_bazel-build-test.yml index 
06786d237f07..79445e1dad6c 100644 --- a/.github/workflows/_bazel-build-test.yml +++ b/.github/workflows/_bazel-build-test.yml @@ -28,6 +28,11 @@ jobs: if: github.repository_owner == 'pytorch' runs-on: [self-hosted, linux.2xlarge] steps: + - name: Setup SSH (Click me for login details) + uses: pytorch/test-infra/.github/actions/setup-ssh@main + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + # [see note: pytorch repo ref] - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master @@ -35,11 +40,6 @@ jobs: - name: Setup Linux uses: ./.github/actions/setup-linux - - name: Setup SSH (Click me for login details) - uses: ./.github/actions/setup-ssh - with: - github-secret: ${{ secrets.GITHUB_TOKEN }} - - name: Calculate docker image id: calculate-docker-image uses: ./.github/actions/calculate-docker-image @@ -47,7 +47,7 @@ jobs: docker-image-name: ${{ inputs.docker-image-name }} - name: Pull docker image - uses: ./.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} @@ -197,5 +197,5 @@ jobs: python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - name: Teardown Linux - uses: ./.github/actions/teardown-linux + uses: pytorch/test-infra/.github/actions/teardown-linux@main if: always() diff --git a/.github/workflows/_binary-build-linux.yml b/.github/workflows/_binary-build-linux.yml index b1b88a5b32f8..192ca251b79f 100644 --- a/.github/workflows/_binary-build-linux.yml +++ b/.github/workflows/_binary-build-linux.yml @@ -55,6 +55,11 @@ on: required: false type: string description: Desired python version + PYTORCH_EXTRA_INSTALL_REQUIREMENTS: + required: false + type: string + description: Extra install requirements + default: "" secrets: github-token: required: true @@ -62,8 +67,8 @@ on: jobs: build: - runs-on: linux.4xlarge - timeout-minutes: 240 + runs-on: linux.12xlarge + timeout-minutes: 150 env: PYTORCH_ROOT: ${{ inputs.PYTORCH_ROOT }} BUILDER_ROOT: ${{ inputs.BUILDER_ROOT }} @@ -79,6 +84,7 @@ jobs: LIBTORCH_VARIANT: ${{ inputs.LIBTORCH_VARIANT }} DESIRED_DEVTOOLSET: ${{ inputs.DESIRED_DEVTOOLSET }} DESIRED_PYTHON: ${{ inputs.DESIRED_PYTHON }} + PYTORCH_EXTRA_INSTALL_REQUIREMENTS: ${{ inputs.PYTORCH_EXTRA_INSTALL_REQUIREMENTS }} # Needed for conda builds ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" ANACONDA_USER: pytorch @@ -97,7 +103,6 @@ jobs: echo "PYTORCH_ROOT=${{ env.PYTORCH_ROOT }}" echo "BUILDER_ROOT=${{ env.BUILDER_ROOT }}" echo "PACKAGE_TYPE=${{ env.PACKAGE_TYPE }}" - echo "DESIRED_CUDA=${{ env.DESIRED_CUDA }}" echo "GPU_ARCH_VERSION=${{ env.GPU_ARCH_VERSION }}" echo "GPU_ARCH_TYPE=${{ env.GPU_ARCH_TYPE }}" @@ -107,12 +112,13 @@ jobs: echo "LIBTORCH_VARIANT=${{ env.LIBTORCH_VARIANT }}" echo "DESIRED_DEVTOOLSET=${{ env.DESIRED_DEVTOOLSET }}" echo "DESIRED_PYTHON=${{ env.DESIRED_PYTHON }}" - + echo "PYTORCH_EXTRA_INSTALL_REQUIREMENTS=${{ env.PYTORCH_EXTRA_INSTALL_REQUIREMENTS }}" echo "ALPINE_IMAGE=${{ env.ALPINE_IMAGE }}" echo "ANACONDA_USER=${{ env.ANACONDA_USER }}" echo "AWS_DEFAULT_REGION=${{ env.AWS_DEFAULT_REGION }}" echo "BINARY_ENV_FILE=${{ env.BINARY_ENV_FILE }}" echo "BUILD_ENVIRONMENT=${{ env.BUILD_ENVIRONMENT }}" + echo "BUILD_NAME=${{ env.BUILD_NAME }}" echo "PR_NUMBER=${{ env.PR_NUMBER }}" echo "PYTORCH_FINAL_PACKAGE_DIR=${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" echo "SHA1=${{ env.SHA1 }}" @@ -120,16 +126,16 @@ jobs: - name: List the env shell: bash run: env + - name: "[FB 
EMPLOYEES] Enable SSH (Click me for login details)" + uses: pytorch/test-infra/.github/actions/setup-ssh@main + with: + github-secret: ${{ secrets.github-token }} - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - name: Setup Linux uses: ./.github/actions/setup-linux - name: Chown workspace uses: ./.github/actions/chown-workspace - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: ./.github/actions/setup-ssh - with: - github-secret: ${{ secrets.github-token }} - name: Clean workspace shell: bash run: | @@ -161,17 +167,10 @@ jobs: git clean -fxd working-directory: builder - - name: Set BUILD_SPLIT_CUDA - if: ${{ inputs.GPU_ARCH_TYPE == 'cuda' && startsWith(inputs.GPU_ARCH_VERSION, '11') }} - shell: bash - run: | - echo "BUILD_SPLIT_CUDA='ON'" >> "$GITHUB_ENV" - name: Pull Docker image - run: | - retry () { - "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") - } - retry docker pull "${DOCKER_IMAGE}" + uses: pytorch/test-infra/.github/actions/pull-docker-image@main + with: + docker-image: ${{ inputs.DOCKER_IMAGE }} - name: Build PyTorch binary run: | set -x @@ -180,7 +179,6 @@ jobs: -e BINARY_ENV_FILE \ -e BUILDER_ROOT \ -e BUILD_ENVIRONMENT \ - -e BUILD_SPLIT_CUDA \ -e DESIRED_CUDA \ -e DESIRED_DEVTOOLSET \ -e DESIRED_PYTHON \ @@ -192,6 +190,7 @@ jobs: -e PYTORCH_FINAL_PACKAGE_DIR \ -e PYTORCH_ROOT \ -e SKIP_ALL_TESTS \ + -e PYTORCH_EXTRA_INSTALL_REQUIREMENTS \ --tty \ --detach \ -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \ @@ -209,29 +208,17 @@ jobs: # Ensure the working directory gets chowned back to the current user docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 with: name: ${{ inputs.build_name }} - retention-days: 14 if-no-files-found: error path: ${{ runner.temp }}/artifacts/* - - name: Hold runner for 2 hours or until ssh sessions have drained - working-directory: pytorch/ - # Always hold for active ssh sessions + - name: Teardown Linux if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh + uses: pytorch/test-infra/.github/actions/teardown-linux@main + - name: Chown workspace if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af + uses: ./pytorch/.github/actions/chown-workspace diff --git a/.github/workflows/_binary-test-linux.yml b/.github/workflows/_binary-test-linux.yml index 5c29288b8246..471a2af88b8f 100644 --- a/.github/workflows/_binary-test-linux.yml +++ b/.github/workflows/_binary-test-linux.yml @@ -122,6 +122,10 @@ jobs: echo "SHA1=${{ env.SHA1 }}" } >> "${GITHUB_ENV} }}" + - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" + uses: pytorch/test-infra/.github/actions/setup-ssh@main + with: + github-secret: ${{ secrets.github-token }} # Setup the environment - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master @@ -129,23 +133,12 @@ jobs: uses: ./.github/actions/setup-linux - name: Chown workspace uses: ./.github/actions/chown-workspace - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: ./.github/actions/setup-ssh - with: - github-secret: ${{ secrets.github-token }} - name: Clean workspace shell: bash run: | rm -rf "${GITHUB_WORKSPACE}" mkdir "${GITHUB_WORKSPACE}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: ${{ inputs.build_name }} - path: "${{ runner.temp }}/artifacts/" - - - name: Checkout PyTorch to pytorch dir uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -171,42 +164,28 @@ jobs: git clean -fxd working-directory: builder + - uses: actions/download-artifact@v3 + name: Download Build Artifacts + with: + name: ${{ inputs.build_name }} + path: "${{ runner.temp }}/artifacts/" + - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: pytorch/test-infra/.github/actions/setup-nvidia@main if: ${{ inputs.GPU_ARCH_TYPE == 'cuda' }} - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - pushd pytorch - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - popd - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: docker-image: ${{ inputs.DOCKER_IMAGE }} - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - - name: Hold runner for 2 hours or until ssh sessions have drained - working-directory: pytorch/ - # Always hold for active ssh sessions + - name: Teardown Linux if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh + uses: pytorch/test-infra/.github/actions/teardown-linux@main + - name: Chown workspace if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . 
- - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af + uses: ./pytorch/.github/actions/chown-workspace diff --git a/.github/workflows/_binary-upload.yml b/.github/workflows/_binary-upload.yml index cf47de9ccf21..2dc77dba09bc 100644 --- a/.github/workflows/_binary-upload.yml +++ b/.github/workflows/_binary-upload.yml @@ -70,7 +70,9 @@ on: description: Conda PyTorchBot token jobs: build: - runs-on: linux.2xlarge + runs-on: ubuntu-22.04 + container: + image: continuumio/miniconda3:4.12.0 env: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder @@ -86,40 +88,20 @@ jobs: LIBTORCH_VARIANT: ${{ inputs.LIBTORCH_VARIANT }} DESIRED_DEVTOOLSET: ${{ inputs.DESIRED_DEVTOOLSET }} DESIRED_PYTHON: ${{ inputs.DESIRED_PYTHON }} - # Needed for conda builds - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" ANACONDA_USER: pytorch - AWS_DEFAULT_REGION: us-east-1 BINARY_ENV_FILE: /tmp/env GITHUB_TOKEN: ${{ secrets.github-token }} PR_NUMBER: ${{ github.event.pull_request.number }} PYTORCH_FINAL_PACKAGE_DIR: /artifacts SHA1: ${{ github.event.pull_request.head.sha || github.sha }} steps: - - name: List the env - shell: bash - run: env - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - - name: Setup Linux - uses: ./.github/actions/setup-linux - - name: Chown workspace - uses: ./.github/actions/chown-workspace - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: ./.github/actions/setup-ssh with: - github-secret: ${{ secrets.github-token }} + no-sudo: true - - name: Download Build Artifacts with S3 - uses: seemethere/download-artifact-s3@v4 - if: ${{ inputs.use_s3 }} - with: - name: ${{ inputs.build_name }} - path: "${{ runner.temp }}/artifacts/" - - - name: Download Build Artifacts without S3 - uses: actions/download-artifact@v2 - if: ${{ !inputs.use_s3 }} + - name: Download Build Artifacts + uses: actions/download-artifact@v3 with: name: ${{ inputs.build_name }} path: "${{ runner.temp }}/artifacts/" @@ -130,6 +112,7 @@ jobs: echo "DRY_RUN=disabled" >> "$GITHUB_ENV" - name: Set UPLOAD_CHANNEL (only for tagged pushes) if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/') }} + shell: bash -e -l {0} run: | # reference ends with an RC suffix if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then @@ -143,36 +126,7 @@ jobs: AWS_ACCESS_KEY_ID: ${{ secrets.aws-access-key-id }} AWS_SECRET_ACCESS_KEY: ${{ secrets.aws-pytorch-uploader-secret-access-key }} ANACONDA_API_TOKEN: ${{ secrets.conda-pytorchbot-token }} + BUILD_NAME: ${{ inputs.build_name }} run: | - docker run --rm -i \ - -e ANACONDA_API_TOKEN \ - -e AWS_ACCESS_KEY_ID \ - -e AWS_SECRET_ACCESS_KEY \ - -e DRY_RUN \ - -e PACKAGE_TYPE \ - -e PKG_DIR=/artifacts \ - -e UPLOAD_CHANNEL \ - -e UPLOAD_SUBFOLDER \ - -v "${RUNNER_TEMP}/artifacts:/artifacts" \ - -v "${GITHUB_WORKSPACE}:/v" \ - -w /v \ - 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/miniconda3:4.10.3 \ - bash -c '.circleci/scripts/binary_upload.sh' - - - name: Hold runner for 2 hours or until ssh sessions have drained - # Always hold for active ssh sessions - if: always() - run: .github/scripts/wait_for_ssh_to_drain.sh - - name: Chown workspace - if: always() - run: | - # Ensure the working directory gets chowned back to the current user - docker run --rm -v 
"$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" . - - name: Kill containers, clean up images - if: always() - run: | - # ignore expansion of "docker ps -q" since it could be empty - # shellcheck disable=SC2046 - docker stop $(docker ps -q) || true - # Prune all of the docker images - docker system prune -af + set -ex + bash .circleci/scripts/binary_upload.sh diff --git a/.github/workflows/_buck-build-test.yml b/.github/workflows/_buck-build-test.yml index 59fb21bf8965..07f41299c711 100644 --- a/.github/workflows/_buck-build-test.yml +++ b/.github/workflows/_buck-build-test.yml @@ -21,32 +21,13 @@ jobs: distribution: 'temurin' - name: Setup miniconda - uses: conda-incubator/setup-miniconda@v2 + uses: pytorch/test-infra/.github/actions/setup-miniconda@main with: - auto-update-conda: true python-version: 3.8 - activate-environment: build - - - name: Install dependencies - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - with: - timeout_minutes: 10 - max_attempts: 5 - command: | - conda install -y \ - cffi \ - cmake \ - mkl \ - mkl-include \ - ninja \ - numpy \ - pyyaml \ - requests \ - setuptools \ - typing_extensions + environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }} - name: Install Buck - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 with: timeout_minutes: 10 max_attempts: 5 @@ -56,7 +37,7 @@ jobs: sudo apt install ./buck.2021.01.12.01_all.deb - name: Download third party libraries and generate wrappers - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 with: timeout_minutes: 10 max_attempts: 5 diff --git a/.github/workflows/_docs.yml b/.github/workflows/_docs.yml index a925beaf768e..318471e7c786 100644 --- a/.github/workflows/_docs.yml +++ b/.github/workflows/_docs.yml @@ -38,11 +38,41 @@ jobs: build-docs: # Don't run on forked repos. if: github.repository_owner == 'pytorch' - runs-on: [self-hosted, linux.4xlarge] + runs-on: ${{ matrix.runner }} strategy: matrix: - docs_type: [cpp, python] + include: + - docs_type: cpp + # We recently seeing lots of exit code 137 running this in Docker indicating + # an OOM issue when running the job, so this upgrades the runner from 4xlarge + # to the next available tier of 12xlarge. So much memory just to generate cpp + # doc + runner: linux.12xlarge + # TODO: Nightly cpp docs take longer and longer to finish (more than 3h now) + # Let's try to figure out how this can be improved + timeout-minutes: 240 + - docs_type: python + runner: linux.2xlarge + # It takes less than 30m to finish python docs unless there are issues + timeout-minutes: 30 + - docs_type: functorch + runner: linux.2xlarge + # It takes less than 15m to finish functorch docs unless there are issues + timeout-minutes: 15 + # Set a fixed name for this job instead of using the current matrix-generated name, i.e. 
build-docs (cpp, linux.12xlarge, 180) + # The current name requires updating the Rockset last docs push query from test-infra every time the matrix is updated + name: build-docs-${{ matrix.docs_type }}-${{ inputs.push }} steps: + - name: Setup SSH (Click me for login details) + uses: pytorch/test-infra/.github/actions/setup-ssh@main + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + instructions: | + All builds are done inside the container, to start an interactive session run: + docker exec -it $(docker container ps --format '{{.ID}}') bash + To start Python docs build type: + cd docs && make html && make coverage + # [see note: pytorch repo ref] - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master @@ -50,13 +80,8 @@ jobs: - name: Setup Linux uses: ./.github/actions/setup-linux - - name: Setup SSH (Click me for login details) - uses: ./.github/actions/setup-ssh - with: - github-secret: ${{ secrets.GITHUB_TOKEN }} - - name: Pull docker image - uses: ./.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: docker-image: ${{ inputs.docker-image }} @@ -76,8 +101,13 @@ jobs: echo "password ${GITHUB_PYTORCHBOT_TOKEN}" >> "${RUNNER_TEMP}/.netrc" - name: Build ${{ matrix.docs_type }} docs + timeout-minutes: ${{ matrix.timeout-minutes }} + id: build-docs env: - WITH_PUSH: ${{ github.event_name == 'schedule' || startsWith(github.event.ref, 'refs/tags/v') }} + # After https://github.com/pytorch/pytorch/pull/88373, pull workflow can now be run periodically, + # so using a schedule event to determine if the docs should be pushed or not doesn't hold true + # anymore + WITH_PUSH: ${{ inputs.push }} DOCKER_IMAGE: ${{ inputs.docker-image }} DOCS_TYPE: ${{ matrix.docs_type }} RUN_DOXYGEN: ${{ inputs.run-doxygen }} @@ -110,7 +140,7 @@ jobs: -w /var/lib/jenkins/workspace \ "${DOCKER_IMAGE}" ) - docker exec -t "${container_name}" bash -c "sudo chown -R jenkins . && pip install dist/*.whl && ./.circleci/scripts/${DOCS_TYPE}_doc_push_script.sh" + docker exec -t "${container_name}" bash -c "sudo chown -R jenkins . 
&& pip install $(echo dist/*.whl)[opt-einsum] && ./.circleci/scripts/${DOCS_TYPE}_doc_push_script.sh" - name: Chown workspace uses: ./.github/actions/chown-workspace @@ -118,7 +148,7 @@ jobs: - name: Upload Python Docs Preview uses: seemethere/upload-artifact-s3@v5 - if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'python' }} + if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'python' && steps.build-docs.outcome == 'success' }} with: retention-days: 14 s3-bucket: doc-previews @@ -128,10 +158,23 @@ jobs: - name: Upload C++ Docs Preview uses: seemethere/upload-artifact-s3@v5 - if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'cpp' }} + if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'cpp' && steps.build-docs.outcome == 'success' }} with: retention-days: 14 if-no-files-found: error s3-bucket: doc-previews path: cppdocs/ s3-prefix: pytorch/${{ github.event.pull_request.number }}/cppdocs + + - name: Upload functorch Docs Preview + uses: seemethere/upload-artifact-s3@v5 + if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'functorch' && steps.build-docs.outcome == 'success' }} + with: + retention-days: 14 + s3-bucket: doc-previews + if-no-files-found: error + path: functorch_ghpages/nightly/ + s3-prefix: pytorch/${{ github.event.pull_request.number }}/functorchdocs + - name: Teardown Linux + uses: pytorch/test-infra/.github/actions/teardown-linux@main + if: always() diff --git a/.github/workflows/_ios-build-test.yml b/.github/workflows/_ios-build-test.yml index 56443419ef1d..269ad3f153ca 100644 --- a/.github/workflows/_ios-build-test.yml +++ b/.github/workflows/_ios-build-test.yml @@ -23,20 +23,6 @@ on: If this is set, our linter will use this to make sure that every other job with the same `sync-tag` is identical. 
- secrets: - IOS_CERT_KEY_2022: - required: true - description: ios cert - IOS_CERT_SECRET: - required: true - description: ios cert - IOS_DEV_TEAM_ID: - required: true - description: ios cert - IOS_SIGN_KEY_2022: - required: true - description: ios cert - env: GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} BUILD_ENVIRONMENT: ${{ inputs.build-environment }} @@ -45,16 +31,8 @@ env: jobs: build: - # NOTE: These builds will not run successfully without running on `pytorch/pytorch` due to the limitations - # of accessing secrets from forked pull requests and IOS' dependency on secrets for their build/test - if: github.repository_owner == 'pytorch' runs-on: macos-12 timeout-minutes: 240 - env: - IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} - IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET }} - IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID }} - IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} steps: # [see note: pytorch repo ref] - name: Checkout PyTorch @@ -90,47 +68,34 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" conda install -y \ - cffi \ - cmake \ - mkl \ - mkl-include \ - ninja \ - numpy \ - pyyaml \ - requests \ - setuptools \ - typing_extensions + blas=1.0 \ + cffi=1.15.1 \ + cmake=3.22.1 \ + mkl=2022.1.0 \ + mkl-include=2022.1.0 \ + ninja=1.10.2 \ + numpy=1.23.3 \ + pyyaml=6.0 \ + requests=2.28.1 \ + setuptools=63.4.1 \ + typing_extensions=4.3.0 - - name: Run Fastlane + - name: Setup Fastlane run: | set -x cd ios/TestApp # install fastlane sudo gem install bundler && bundle install bundle update fastlane - # install certificates - echo "${IOS_CERT_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o Certificates.p12 - rm cert.txt - bundle exec fastlane install_root_cert - bundle exec fastlane install_dev_cert - # install the provisioning profile - PROFILE=PyTorch_CI_2022.mobileprovision - PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles - mkdir -pv "${PROVISIONING_PROFILES}" - cd "${PROVISIONING_PROFILES}" - echo "${IOS_SIGN_KEY_2022}" >> cert.txt - base64 --decode cert.txt -o ${PROFILE} - rm cert.txt - - name: Build + - name: Build PyTorch Mobile Runtime run: | # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" @@ -139,20 +104,16 @@ jobs: export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"} scripts/build_ios.sh - - name: Run Build Test - timeout-minutes: 5 + - name: Build TestApp + if: inputs.ios-platform == 'SIMULATOR' + timeout-minutes: 15 run: | - PROFILE=PyTorch_CI_2022 # run the ruby build script if ! [ -x "$(command -v xcodebuild)" ]; then echo 'Error: xcodebuild is not installed.' 
exit 1 fi - if [ "${IOS_PLATFORM}" != "SIMULATOR" ]; then - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" -c "${PROFILE}" -t "${IOS_DEV_TEAM_ID}" - else - ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" - fi + ruby scripts/xcode_build.rb -i build_ios/install -x ios/TestApp/TestApp.xcodeproj -p "${IOS_PLATFORM}" - name: Run Simulator Tests if: inputs.ios-platform == 'SIMULATOR' @@ -191,6 +152,7 @@ jobs: else bundle exec fastlane scan --only_testing TestAppTests/TestAppTests/testFullJIT fi + - name: Dump Simulator Tests On a Failure if: | failure() && inputs.ios-platform == 'SIMULATOR' diff --git a/.github/workflows/_linux-build.yml b/.github/workflows/_linux-build.yml index 09a400c4d502..be3d2ce98c03 100644 --- a/.github/workflows/_linux-build.yml +++ b/.github/workflows/_linux-build.yml @@ -28,21 +28,49 @@ on: description: | If this is set, our linter will use this to make sure that every other job with the same `sync-tag` is identical. + cuda-arch-list: + required: false + type: string + default: "5.2" + description: | + List of CUDA architectures CI build should target. + runner: + required: false + type: string + default: "linux.2xlarge" + description: | + List of CUDA architectures CI build should target. + + test-matrix: + required: false + type: string + description: | + An option JSON description of what test configs to run later on. This + is moved here from the Linux test workflow so that we can apply filter + logic using test-config labels earlier and skip unnecessary builds outputs: docker-image: value: ${{ jobs.build.outputs.docker-image }} description: The docker image containing the built PyTorch. + test-matrix: + value: ${{ inputs.test-matrix }} + description: An optional JSON description of what test configs to run later on. jobs: build: - # Don't run on forked repos. + # Don't run on forked repos if: github.repository_owner == 'pytorch' - runs-on: [self-hosted, linux.2xlarge] + runs-on: ${{ inputs.runner }} timeout-minutes: 240 outputs: docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} steps: + - name: Setup SSH (Click me for login details) + uses: pytorch/test-infra/.github/actions/setup-ssh@main + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + # [pytorch repo ref] # Use a pytorch/pytorch reference instead of a reference to the local # checkout because when we run this action we don't *have* a local @@ -50,21 +78,9 @@ jobs: - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - - name: Check for new workflows - run: | - if [ ! -f "./.github/actions/setup-linux/action.yml" ]; then - echo "::error::Your PR is based on a version of master that is too old for our CI to work. Please rebase your PR on latest master and resubmit." 
- exit 1 - fi - - name: Setup Linux uses: ./.github/actions/setup-linux - - name: Setup SSH (Click me for login details) - uses: ./.github/actions/setup-ssh - with: - github-secret: ${{ secrets.GITHUB_TOKEN }} - - name: Calculate docker image id: calculate-docker-image uses: ./.github/actions/calculate-docker-image @@ -73,7 +89,7 @@ jobs: xla: ${{ contains(inputs.build-environment, 'xla') }} - name: Pull docker image - uses: ./.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }} @@ -88,7 +104,17 @@ jobs: with: github-token: ${{ secrets.GITHUB_TOKEN }} + # Apply the filter logic to the build step too if the test-config label is already there + - name: Select all requested test configurations (if the test matrix is available) + id: filter + uses: ./.github/actions/filter-test-configs + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + test-matrix: ${{ inputs.test-matrix }} + - name: Build + if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == '' + id: build env: BUILD_ENVIRONMENT: ${{ inputs.build-environment }} BRANCH: ${{ steps.parse-ref.outputs.branch }} @@ -100,7 +126,7 @@ jobs: SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }} XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }} - TORCH_CUDA_ARCH_LIST: 5.2 + TORCH_CUDA_ARCH_LIST: ${{ inputs.cuda-arch-list }} DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }} XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }} DEBUG: ${{ inputs.build-with-debug && '1' || '0' }} @@ -135,13 +161,13 @@ jobs: docker exec -t "${container_name}" sh -c '.jenkins/pytorch/build.sh' - name: Archive artifacts into zip - if: inputs.build-generates-artifacts + if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' run: | zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .pytorch-test-times.json - name: Store PyTorch Build Artifacts on S3 uses: seemethere/upload-artifact-s3@v5 - if: inputs.build-generates-artifacts + if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' with: name: ${{ inputs.build-environment }} retention-days: 14 @@ -149,6 +175,7 @@ jobs: path: artifacts.zip - name: Upload sccache stats + if: steps.build.outcome != 'skipped' uses: seemethere/upload-artifact-s3@v5 with: s3-prefix: | @@ -158,5 +185,5 @@ jobs: path: sccache-stats-*.json - name: Teardown Linux - uses: ./.github/actions/teardown-linux + uses: pytorch/test-infra/.github/actions/teardown-linux@main if: always() diff --git a/.github/workflows/_linux-test.yml b/.github/workflows/_linux-test.yml index aa81647c53fc..a444a5fc530a 100644 --- a/.github/workflows/_linux-test.yml +++ b/.github/workflows/_linux-test.yml @@ -22,46 +22,70 @@ on: description: | If this is set, our linter will use this to make sure that every other job with the same `sync-tag` is identical. 
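For orientation, a minimal sketch of how a calling workflow might exercise the new _linux-build.yml inputs and outputs shown above. The job names, build environment, Docker image, and runner labels are hypothetical, and a real caller would also supply whatever other inputs and secrets the reusable workflows declare outside these hunks:

  jobs:
    build:
      uses: ./.github/workflows/_linux-build.yml
      with:
        build-environment: linux-bionic-cuda11.6-py3.7-gcc7              # hypothetical value
        docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 # hypothetical value
        cuda-arch-list: "8.6"       # overrides the "5.2" default added above
        runner: linux.8xlarge       # hypothetical runner label
        test-matrix: |
          {"include": [{"config": "default", "shard": 1, "num_shards": 1, "runner": "linux.4xlarge.nvidia.gpu"}]}

    test:
      needs: build
      uses: ./.github/workflows/_linux-test.yml
      with:
        build-environment: linux-bionic-cuda11.6-py3.7-gcc7
        docker-image: ${{ needs.build.outputs.docker-image }}
        # the build workflow echoes inputs.test-matrix back out as an output,
        # so the test workflow can consume the same matrix downstream
        test-matrix: ${{ needs.build.outputs.test-matrix }}

Threading the matrix through the build workflow is what lets the filter step above skip the build entirely when every test config has been filtered out by test-config labels.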
+ timeout-minutes: + required: false + type: number + default: 240 + description: | + Set the maximum (in minutes) how long the workflow should take to finish env: GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} jobs: + # This needs to be run right before the test starts so that it can gather the + # latest labels from the PR + filter: + runs-on: [self-hosted, linux.large] + outputs: + test-matrix: ${{ steps.filter.outputs.test-matrix }} + is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }} + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + with: + fetch-depth: 1 + submodules: false + + - name: Select all requested test configurations + id: filter + uses: ./.github/actions/filter-test-configs + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + test-matrix: ${{ inputs.test-matrix }} + test: - # Don't run on forked repos. - if: github.repository_owner == 'pytorch' + needs: filter + # Don't run on forked repos or empty test matrix + if: github.repository_owner == 'pytorch' && needs.filter.outputs.is-test-matrix-empty == 'False' strategy: - matrix: ${{ fromJSON(inputs.test-matrix) }} + matrix: ${{ fromJSON(needs.filter.outputs.test-matrix) }} fail-fast: false runs-on: ${{ matrix.runner }} + timeout-minutes: ${{ inputs.timeout-minutes }} steps: - # [see note: pytorch repo ref] + - name: Setup SSH (Click me for login details) + uses: pytorch/test-infra/.github/actions/setup-ssh@main + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + instructions: | + All testing is done inside the container, to start an interactive session run: + docker exec -it $(docker container ps --format '{{.ID}}') bash + - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - name: Setup Linux uses: ./.github/actions/setup-linux - - name: Setup SSH (Click me for login details) - uses: ./.github/actions/setup-ssh - with: - github-secret: ${{ secrets.GITHUB_TOKEN }} - - name: Pull docker image - uses: ./.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: docker-image: ${{ inputs.docker-image }} - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a + uses: pytorch/test-infra/.github/actions/setup-nvidia@main if: contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/install_nvidia_utils_linux.sh - echo "GPU_FLAG=--gpus all" >> "${GITHUB_ENV}" - name: Start monitoring script id: monitor-script @@ -70,7 +94,7 @@ jobs: python3 -m pip install psutil==5.9.1 python3 -m pip install pynvml==11.4.1 python3 -m tools.stats.monitor > usage_log.txt 2>&1 & - echo "::set-output name=monitor-script-pid::${!}" + echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}" - name: Download build artifacts uses: ./.github/actions/download-build-artifacts @@ -96,11 +120,13 @@ jobs: NUM_TEST_SHARDS: ${{ matrix.num_shards }} PR_BODY: ${{ github.event.pull_request.body }} SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 + SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }} SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }} DOCKER_IMAGE: ${{ inputs.docker-image }} XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }} XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla - timeout-minutes: 240 + PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: 
${{ matrix.mem_leak_check && '1' || '0' }} + PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }} run: | set -x @@ -150,8 +176,11 @@ jobs: -e PR_LABELS \ -e MAX_JOBS="$(nproc --ignore=2)" \ -e SCCACHE_BUCKET \ + -e SCCACHE_S3_KEY_PREFIX \ -e XLA_CUDA \ -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ + -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK \ + -e PYTORCH_TEST_RERUN_DISABLED_TESTS \ --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ --ulimit stack=10485760:83886080 \ --security-opt seccomp=unconfined \ @@ -166,7 +195,8 @@ jobs: -w /var/lib/jenkins/workspace \ "${DOCKER_IMAGE}" ) - docker exec -t "${container_name}" sh -c "pip install dist/*.whl && ${TEST_COMMAND}" + echo "DOCKER_CONTAINER_ID=${container_name}" >> "${GITHUB_ENV}" + docker exec -t "${container_name}" sh -c "pip install $(echo dist/*.whl)[opt-einsum] && ${TEST_COMMAND}" - name: Get workflow job id id: get-job-id @@ -178,6 +208,7 @@ jobs: - name: Stop monitoring script if: always() && steps.monitor-script.outputs.monitor-script-pid shell: bash + continue-on-error: true env: MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }} run: | @@ -189,6 +220,12 @@ jobs: with: file-suffix: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }} + - name: Collect backtraces from coredumps (if any) + if: always() + run: | + # shellcheck disable=SC2156 + find . -iname "core.[1-9]*" -exec docker exec "${DOCKER_CONTAINER_ID}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \; + - name: Store Core dumps on S3 uses: seemethere/upload-artifact-s3@v5 if: failure() @@ -223,5 +260,5 @@ jobs: python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test - name: Teardown Linux - uses: ./.github/actions/teardown-linux + uses: pytorch/test-infra/.github/actions/teardown-linux@main if: always() diff --git a/.github/workflows/_mac-build.yml b/.github/workflows/_mac-build.yml index 316656b6ec9b..5ee909f02c22 100644 --- a/.github/workflows/_mac-build.yml +++ b/.github/workflows/_mac-build.yml @@ -33,6 +33,25 @@ on: default: "3.8" description: | The python version to be used. Will be 3.8 by default + environment-file: + required: false + type: string + description: Set the conda environment file used to setup macOS build. + test-matrix: + required: false + type: string + description: | + An option JSON description of what test configs to run later on. This + is moved here from the Linux test workflow so that we can apply filter + logic using test-config labels earlier and skip unnecessary builds + + outputs: + test-matrix: + value: ${{ inputs.test-matrix }} + description: An optional JSON description of what test configs to run later on. + build-outcome: + value: ${{ jobs.build.outputs.build-outcome }} + description: The outcome of the build step. This is used to influence test filtering logic later on. secrets: MACOS_SCCACHE_S3_ACCESS_KEY_ID: @@ -42,11 +61,6 @@ on: required: true description: Secret for S3 bucket for macOS sccache. -# For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179 -defaults: - run: - shell: bash -e -l {0} - jobs: build: # Don't run on forked repos. 
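The filter job added to the Linux test workflow above, and repeated below for the macOS, ROCm, and Windows variants, is a small job-to-job data-flow pattern: one job republishes its step outputs as job outputs, and the dependent job gates itself and builds its matrix from them. A stripped-down sketch, with a hypothetical inline filter step standing in for ./.github/actions/filter-test-configs:

  jobs:
    filter:
      runs-on: ubuntu-latest
      outputs:
        test-matrix: ${{ steps.filter.outputs.test-matrix }}
        is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }}
      steps:
        - name: Select all requested test configurations
          id: filter
          run: |
            # stand-in for the real filter action: emit a one-config matrix
            echo 'test-matrix={"include":[{"config":"default","runner":"linux.2xlarge"}]}' >> "${GITHUB_OUTPUT}"
            echo 'is-test-matrix-empty=False' >> "${GITHUB_OUTPUT}"

    test:
      needs: filter
      if: needs.filter.outputs.is-test-matrix-empty == 'False'
      strategy:
        fail-fast: false
        matrix: ${{ fromJSON(needs.filter.outputs.test-matrix) }}
      runs-on: ${{ matrix.runner }}
      steps:
        - run: echo "running test config ${{ matrix.config }} on ${{ matrix.runner }}"

Running the filter as its own job immediately before the tests means the latest PR labels are consulted at test time rather than when the workflow was first dispatched.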
@@ -57,6 +71,8 @@ jobs: AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} BUILD_ENVIRONMENT: ${{ inputs.build-environment }} + outputs: + build-outcome: ${{ steps.build.outcome }} steps: # [see note: pytorch repo ref] - name: Checkout PyTorch @@ -71,25 +87,39 @@ jobs: fi - name: Setup miniconda - uses: conda-incubator/setup-miniconda@v2 + if: inputs.environment-file == '' + uses: pytorch/test-infra/.github/actions/setup-miniconda@main + with: + python-version: ${{ inputs.python_version }} + environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }} + + # This option is used when cross-compiling arm64 from x86-64. Specifically, we need arm64 conda + # environment even though the arch is x86-64 + - name: Setup miniconda using the provided environment file + if: inputs.environment-file != '' + uses: pytorch/test-infra/.github/actions/setup-miniconda@main with: - auto-update-conda: true python-version: ${{ inputs.python_version }} - activate-environment: build - miniconda-version: 4.7.12 + environment-file: ${{ inputs.environment-file }} - name: Install macOS homebrew dependencies run: | # Install dependencies brew install libomp + brew link --force libomp - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - echo "SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + echo "SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${GITHUB_ENV}" - name: Get workflow job id id: get-job-id @@ -98,21 +128,31 @@ jobs: with: github-token: ${{ secrets.GITHUB_TOKEN }} + # Apply the filter logic to the build step too if the test-config label is already there + - name: Select all requested test configurations (if the test matrix is available) + id: filter + uses: ./.github/actions/filter-test-configs + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + test-matrix: ${{ inputs.test-matrix }} + - name: Build + if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == '' + id: build env: OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }} run: | echo "CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${GITHUB_ENV}" - .jenkins/pytorch/macos-build.sh + ${CONDA_RUN} .jenkins/pytorch/macos-build.sh - name: Archive artifacts into zip - if: inputs.build-generates-artifacts + if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' run: | zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json - name: Store PyTorch Build Artifacts on GHA - uses: actions/upload-artifact@v2 - if: inputs.build-generates-artifacts + uses: actions/upload-artifact@v3 + if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' with: name: ${{ env.BUILD_ENVIRONMENT }} retention-days: 14 @@ -120,9 
+160,9 @@ jobs: path: artifacts.zip - name: Upload sccache stats to GHA - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v3 # Only if sccache is installed, see above - if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} + if: ${{ (github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository) && steps.build.outcome != 'skipped' }} with: name: sccache-stats-${{ inputs.build-environment }}-runattempt${{ github.run_attempt }}-${{ steps.get-job-id.outputs.job-id }} retention-days: 14 diff --git a/.github/workflows/_mac-test-mps.yml b/.github/workflows/_mac-test-mps.yml index fa189307358a..24203e005153 100644 --- a/.github/workflows/_mac-test-mps.yml +++ b/.github/workflows/_mac-test-mps.yml @@ -15,7 +15,6 @@ on: If this is set, our linter will use this to make sure that every other job with the same `sync-tag` is identical. - jobs: run_mps_test: name: "Run MPS tests" @@ -38,6 +37,22 @@ jobs: name: ${{ inputs.build-environment }} use-gha: true + # This is copied from the main macos test workflow. It was missed in the earlier fix because macos M1 + # runners are shared and not ephemeral, so the issue wasn't manifested if the runners with the fix were + # used + - name: Install macOS homebrew dependencies + run: | + # Install dependencies + brew install libomp + brew link --force libomp + + - name: Setup miniconda + uses: pytorch/test-infra/.github/actions/setup-miniconda@main + with: + python-version: 3.9 + environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }} + pip-requirements-file: .github/requirements/pip-requirements-${{ runner.os }}.txt + - name: Install PyTorch env: ENV_NAME: conda-test-env-${{ github.run_id }} @@ -45,24 +60,33 @@ jobs: shell: arch -arch arm64 bash {0} run: | # shellcheck disable=SC1090 - . ~/miniconda3/etc/profile.d/conda.sh set -ex - conda create -yp "${ENV_NAME}" "python=${PY_VERS}" numpy expecttest pyyaml # As wheels are cross-compiled they are reported as x86_64 ones ORIG_WHLNAME=$(ls -1 dist/*.whl); ARM_WHLNAME=${ORIG_WHLNAME/x86_64/arm64}; mv ${ORIG_WHLNAME} ${ARM_WHLNAME} - conda run -p "${ENV_NAME}" python3 -mpip install dist/*.whl + ${CONDA_RUN} python3 -mpip install --no-index --no-deps dist/*.whl - name: Run MPS tests + id: test env: ENV_NAME: conda-test-env-${{ github.run_id }} shell: arch -arch arm64 bash {0} run: | # shellcheck disable=SC1090 - . ~/miniconda3/etc/profile.d/conda.sh set -ex # TODO(https://github.com/pytorch/pytorch/issues/79293) - # This step currently fails if we actually run as if we're in CI. 
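The ${CONDA_RUN} prefix that replaces the explicit conda.sh sourcing and `conda activate`/`conda run -p` calls in these macOS hunks appears to be exported by the pytorch/test-infra setup-miniconda action; the sketch below shows the intended usage under that assumption, with illustrative step names and commands:

      - name: Setup miniconda
        uses: pytorch/test-infra/.github/actions/setup-miniconda@main
        with:
          python-version: "3.9"
          environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}

      - name: Run commands inside the conda environment
        run: |
          # CONDA_RUN wraps each command so it executes in the environment the
          # action created, without activating conda in every subsequent shell
          ${CONDA_RUN} python3 --version
          ${CONDA_RUN} python3 -mpip install --no-index --no-deps dist/*.whl   # illustrative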
- unset CI - conda run --cwd test -p "${ENV_NAME}" python3 test_mps.py -v - conda env remove -p "${ENV_NAME}" + ${CONDA_RUN} python3 test/run_test.py --mps --verbose + + - name: Get workflow job id + id: get-job-id + uses: ./.github/actions/get-workflow-job-id + if: always() + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Upload test artifacts + uses: ./.github/actions/upload-test-artifacts + if: always() && (steps.test.conclusion == 'success' || steps.test.conclusion == 'failure') + with: + use-gha: true + file-suffix: ${{ github.job }}-mps-1-1-macos-m1-12_${{ steps.get-job-id.outputs.job-id }} diff --git a/.github/workflows/_mac-test.yml b/.github/workflows/_mac-test.yml index 8b648c7a8762..cbc3372e1c42 100644 --- a/.github/workflows/_mac-test.yml +++ b/.github/workflows/_mac-test.yml @@ -32,18 +32,39 @@ on: required: true description: secret acess key for test stats upload - jobs: + # This needs to be run right before the test starts so that it can gather the + # latest labels from the PR + filter: + runs-on: [self-hosted, linux.large] + outputs: + test-matrix: ${{ steps.filter.outputs.test-matrix }} + is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }} + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + with: + fetch-depth: 1 + submodules: false + + - name: Select all requested test configurations + id: filter + uses: ./.github/actions/filter-test-configs + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + test-matrix: ${{ inputs.test-matrix }} + test: - # Don't run on forked repos. - if: github.repository_owner == 'pytorch' + needs: filter + # Don't run on forked repos or empty test matrix + if: github.repository_owner == 'pytorch' && needs.filter.outputs.is-test-matrix-empty == 'False' # For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179 # Also ensure that we always run with the right architecture defaults: run: shell: arch -arch ${{ inputs.arch }} bash -e -l {0} strategy: - matrix: ${{ fromJSON(inputs.test-matrix) }} + matrix: ${{ fromJSON(needs.filter.outputs.test-matrix) }} fail-fast: false runs-on: ${{ matrix.runner }} timeout-minutes: 240 @@ -61,43 +82,39 @@ jobs: - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - - name: Start monitoring script - id: monitor-script - run: | - python3 -m pip install psutil==5.9.1 - python3 -m pip install pynvml==11.4.1 - python3 -m tools.stats.monitor > usage_log.txt 2>&1 & - echo "::set-output name=monitor-script-pid::${!}" - - name: Download build artifacts uses: ./.github/actions/download-build-artifacts with: name: ${{ inputs.build-environment }} use-gha: true - - name: Setup miniconda for x86 - if: inputs.build-environment == 'macos-12-py3-x86-64' - uses: conda-incubator/setup-miniconda@v2 + - name: Setup miniconda (x86, py3.8) + if: ${{ runner.arch == 'X64' }} + uses: pytorch/test-infra/.github/actions/setup-miniconda@main with: - auto-update-conda: true python-version: 3.8 - activate-environment: build - miniconda-version: 4.7.12 + environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }} + pip-requirements-file: .github/requirements/pip-requirements-${{ runner.os }}.txt - - name: Setup miniconda for arm64 - if: inputs.build-environment == 'macos-12-py3-arm64' + - name: Setup miniconda (arm64, py3.9) + if: ${{ runner.arch == 'ARM64' }} + uses: pytorch/test-infra/.github/actions/setup-miniconda@main + with: + python-version: 3.9 + environment-file: 
.github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }} + pip-requirements-file: .github/requirements/pip-requirements-${{ runner.os }}.txt + + - name: Start monitoring script + id: monitor-script run: | - # Conda is already installed and setup for bash here - # Cleanup lingering conda environment and create - # a new one for this run - conda env remove -n build - conda create -n build python=3.9.12 - conda list + ${CONDA_RUN} python3 -m tools.stats.monitor > usage_log.txt 2>&1 & + echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}" - name: Install macOS homebrew dependencies run: | # Install dependencies brew install libomp + brew link --force libomp - name: Parse ref id: parse-ref @@ -111,6 +128,9 @@ jobs: - name: Test id: test + env: + PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: ${{ matrix.mem_leak_check && '1' || '0' }} + PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }} run: | COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}") @@ -127,18 +147,8 @@ jobs: export PR_BODY="${PR_BODY//[\'\"]}" arch - # This is a no-op for x86 - conda activate build - - python3 -mpip install dist/*.whl - .jenkins/pytorch/macos-test.sh - - - name: Cleanup miniconda for arm64 - if: inputs.build-environment == 'macos-12-py3-arm64' - run: | - # Cleanup conda env - conda deactivate - conda env remove -n build + ${CONDA_RUN} python3 -mpip install --no-index --no-deps $(echo dist/*.whl) + ${CONDA_RUN} .jenkins/pytorch/macos-test.sh - name: Get workflow job id id: get-job-id @@ -149,6 +159,7 @@ jobs: - name: Stop monitoring script if: always() && ${{ steps.monitor-script.outputs.monitor-script-pid }} + continue-on-error: true env: MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }} run: | @@ -182,6 +193,4 @@ jobs: GHA_WORKFLOW_JOB_ID: ${{ steps.get-job-id.outputs.job-id }} run: | set -x - python3 -m pip install -r requirements.txt - python3 -m pip install boto3==1.19.12 - python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test + ${CONDA_RUN} python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test diff --git a/.github/workflows/_rocm-test.yml b/.github/workflows/_rocm-test.yml index b5550fdda7f0..be4a5c9dcc6c 100644 --- a/.github/workflows/_rocm-test.yml +++ b/.github/workflows/_rocm-test.yml @@ -39,12 +39,34 @@ env: GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} jobs: + # This needs to be run right before the test starts so that it can gather the + # latest labels from the PR + filter: + runs-on: [self-hosted, linux.large] + outputs: + test-matrix: ${{ steps.filter.outputs.test-matrix }} + is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }} + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + with: + fetch-depth: 1 + submodules: false + + - name: Select all requested test configurations + id: filter + uses: ./.github/actions/filter-test-configs + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + test-matrix: ${{ inputs.test-matrix }} + test: - # Don't run on forked repos. 
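Several hunks above, and the ROCm and Windows test workflows below, replace the deprecated `::set-output` workflow command with writes to the file named by GITHUB_OUTPUT. A minimal sketch of the monitor-script pattern these workflows use; the kill in the stop step is assumed, since that step's body sits outside these hunks:

      - name: Start monitoring script
        id: monitor-script
        run: |
          python3 -m tools.stats.monitor > usage_log.txt 2>&1 &
          # deprecated form: echo "::set-output name=monitor-script-pid::${!}"
          echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

      - name: Stop monitoring script
        if: always() && steps.monitor-script.outputs.monitor-script-pid
        continue-on-error: true
        env:
          MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }}
        run: kill "${MONITOR_SCRIPT_PID}"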
- if: github.repository_owner == 'pytorch' + needs: filter + # Don't run on forked repos or empty test matrix + if: github.repository_owner == 'pytorch' && needs.filter.outputs.is-test-matrix-empty == 'False' timeout-minutes: 300 strategy: - matrix: ${{ fromJSON(inputs.test-matrix) }} + matrix: ${{ fromJSON(needs.filter.outputs.test-matrix) }} fail-fast: false runs-on: ${{ matrix.runner }} steps: @@ -58,7 +80,7 @@ jobs: uses: ./.github/actions/setup-rocm - name: Pull docker image - uses: ./.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: docker-image: ${{ inputs.docker-image }} @@ -69,7 +91,7 @@ jobs: python3 -m pip install psutil==5.9.1 python3 -m pip install pynvml==11.4.1 python3 -m tools.stats.monitor > usage_log.txt 2>&1 & - echo "::set-output name=monitor-script-pid::${!}" + echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}" - name: Download build artifacts uses: ./.github/actions/download-build-artifacts @@ -96,6 +118,9 @@ jobs: SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 DOCKER_IMAGE: ${{ inputs.docker-image }} XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla + PYTORCH_JIT_ENABLE_NVFUSER: 1 + PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: ${{ matrix.mem_leak_check && '1' || '0' }} + PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }} timeout-minutes: 270 run: | set -x @@ -145,6 +170,8 @@ jobs: -e MAX_JOBS="$(nproc --ignore=2)" \ -e SCCACHE_BUCKET \ -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ + -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK \ + -e PYTORCH_TEST_RERUN_DISABLED_TESTS \ --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ --ulimit stack=10485760:83886080 \ --security-opt seccomp=unconfined \ @@ -179,6 +206,7 @@ jobs: - name: Stop monitoring script if: always() && steps.monitor-script.outputs.monitor-script-pid shell: bash + continue-on-error: true env: MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }} run: | diff --git a/.github/workflows/_run_android_tests.yml b/.github/workflows/_run_android_tests.yml index 273ec2db81ae..ae992baab11a 100644 --- a/.github/workflows/_run_android_tests.yml +++ b/.github/workflows/_run_android_tests.yml @@ -21,16 +21,16 @@ jobs: - name: Install dependencies run: | conda install -y \ - cffi \ - cmake \ - mkl \ - mkl-include \ - ninja \ - numpy \ - pyyaml \ - requests \ - setuptools \ - typing_extensions + cffi=1.15.1 \ + cmake=3.22.1 \ + mkl=2022.1.0 \ + mkl-include=2022.1.0 \ + ninja=1.10.2 \ + numpy=1.23.3 \ + pyyaml=6.0 \ + requests=2.28.1 \ + setuptools=65.5.0 \ + typing_extensions=4.3.0 # [see note: pytorch repo ref] - name: Checkout PyTorch diff --git a/.github/workflows/_update-commit-hash.yml b/.github/workflows/_update-commit-hash.yml index 42e12d9dca9f..416e05c0cc53 100644 --- a/.github/workflows/_update-commit-hash.yml +++ b/.github/workflows/_update-commit-hash.yml @@ -27,7 +27,7 @@ jobs: runs-on: ubuntu-latest steps: - name: Checkout repo - uses: actions/checkout@v2 + uses: actions/checkout@v3 with: fetch-depth: 1 submodules: false diff --git a/.github/workflows/_win-build.yml b/.github/workflows/_win-build.yml index fb2195fafce6..8baaca498d17 100644 --- a/.github/workflows/_win-build.yml +++ b/.github/workflows/_win-build.yml @@ -23,6 +23,18 @@ on: description: | If this is set, our linter will use this to make sure that every other job with the same `sync-tag` is identical. + test-matrix: + required: false + type: string + description: | + An option JSON description of what test configs to run later on. 
This + is moved here from the Linux test workflow so that we can apply filter + logic using test-config labels earlier and skip unnecessary builds + + outputs: + test-matrix: + value: ${{ inputs.test-matrix }} + description: An optional JSON description of what test configs to run later on. env: GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} @@ -34,6 +46,20 @@ jobs: runs-on: [self-hosted, windows.4xlarge] timeout-minutes: 240 steps: + - name: Setup SSH (Click me for login details) + uses: pytorch/test-infra/.github/actions/setup-ssh@main + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + instructions: | + To forward remote desktop on your local machine ssh as follows: + ssh -L 3389:localhost:3389 %%username%%@%%hostname%% + And then change password using `passwd` command. + + To start build locally, change working folder to \actions-runner\_work\pytorch\pytorch, + Activate miniconda and Visual Studio environment, but running: + call C:\Jenkins\Miniconda3\Scripts\activate.bat C:\Jenkins\Miniconda3 + call "C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Auxiliary\Build\vcvarsall.bat" x64 + # [see note: pytorch repo ref] - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master @@ -45,11 +71,6 @@ jobs: with: cuda-version: ${{ inputs.cuda-version }} - - name: Setup SSH (Click me for login details) - uses: ./.github/actions/setup-ssh - with: - github-secret: ${{ secrets.GITHUB_TOKEN }} - - name: Parse ref id: parse-ref run: .github/scripts/parse_ref.py @@ -61,7 +82,17 @@ jobs: with: github-token: ${{ secrets.GITHUB_TOKEN }} + # Apply the filter logic to the build step too if the test-config label is already there + - name: Select all requested test configurations (if the test matrix is available) + id: filter + uses: ./.github/actions/filter-test-configs + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + test-matrix: ${{ inputs.test-matrix }} + - name: Build + if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == '' + id: build shell: bash env: PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/ @@ -89,6 +120,7 @@ jobs: # Upload to github so that people can click and download artifacts - name: Upload artifacts to s3 + if: steps.build.outcome != 'skipped' uses: seemethere/upload-artifact-s3@v5 with: retention-days: 14 @@ -97,6 +129,7 @@ jobs: path: C:\${{ github.run_id }}\build-results - name: Upload sccache stats + if: steps.build.outcome != 'skipped' uses: seemethere/upload-artifact-s3@v5 with: s3-prefix: | diff --git a/.github/workflows/_win-test.yml b/.github/workflows/_win-test.yml index 560c0fe84e1d..0cabb8ec469a 100644 --- a/.github/workflows/_win-test.yml +++ b/.github/workflows/_win-test.yml @@ -27,15 +27,42 @@ env: GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }} jobs: + # This needs to be run right before the test starts so that it can gather the + # latest labels from the PR + filter: + runs-on: [self-hosted, linux.large] + outputs: + test-matrix: ${{ steps.filter.outputs.test-matrix }} + is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }} + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + with: + fetch-depth: 1 + submodules: false + + - name: Select all requested test configurations + id: filter + uses: ./.github/actions/filter-test-configs + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + test-matrix: ${{ inputs.test-matrix }} + test: - # Don't run on forked repos. 
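The PYTORCH_TEST_CUDA_MEM_LEAK_CHECK and PYTORCH_TEST_RERUN_DISABLED_TESTS variables added throughout these test workflows are driven by optional per-entry keys in the test matrix. A sketch of a matrix entry that opts in; the entry itself is hypothetical, while the mapping expressions are the ones used in the hunks above:

        test-matrix: |
          {"include": [
            {"config": "default", "shard": 1, "num_shards": 1,
             "runner": "linux.4xlarge.nvidia.gpu",
             "mem_leak_check": true, "rerun_disabled_tests": true}
          ]}

      env:
        # true in the matrix entry resolves to '1'; a missing or false key is
        # falsy, so both variables default to '0'
        PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: ${{ matrix.mem_leak_check && '1' || '0' }}
        PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }}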
- if: github.repository_owner == 'pytorch' + needs: filter + # Don't run on forked repos or empty test matrix + if: github.repository_owner == 'pytorch' && needs.filter.outputs.is-test-matrix-empty == 'False' strategy: - matrix: ${{ fromJSON(inputs.test-matrix) }} + matrix: ${{ fromJSON(needs.filter.outputs.test-matrix) }} fail-fast: false runs-on: ${{ matrix.runner }} timeout-minutes: 300 steps: + - name: Enable git symlinks on Windows + shell: bash + run: | + git config --global core.symlinks true + # [see note: pytorch repo ref] - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master @@ -48,7 +75,7 @@ jobs: cuda-version: ${{ inputs.cuda-version }} - name: Setup SSH (Click me for login details) - uses: ./.github/actions/setup-ssh + uses: pytorch/test-infra/.github/actions/setup-ssh@main with: github-secret: ${{ secrets.GITHUB_TOKEN }} @@ -59,7 +86,7 @@ jobs: python3 -m pip install psutil==5.9.1 python3 -m pip install pynvml==11.4.1 python3 -m tools.stats.monitor > usage_log.txt 2>&1 & - echo "::set-output name=monitor-script-pid::${!}" + echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}" - name: Download PyTorch Build Artifacts uses: seemethere/download-artifact-s3@v4 @@ -97,6 +124,8 @@ jobs: TEST_CONFIG: ${{ matrix.config }} PR_BODY: ${{ github.event.pull_request.body }} TORCH_CUDA_ARCH_LIST: "7.0" + PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: ${{ matrix.mem_leak_check && '1' || '0' }} + PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }} run: | COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}") @@ -124,6 +153,7 @@ jobs: - name: Stop monitoring script if: always() && steps.monitor-script.outputs.monitor-script-pid shell: bash + continue-on-error: true env: MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }} run: | diff --git a/.github/workflows/auto_request_review.yml b/.github/workflows/auto_request_review.yml new file mode 100644 index 000000000000..7c98c2990fba --- /dev/null +++ b/.github/workflows/auto_request_review.yml @@ -0,0 +1,22 @@ +name: Auto Request Review + +on: + pull_request: + types: [opened, ready_for_review, reopened] + +jobs: + auto-request-review: + # Don't run on forked repos + if: ${{ !github.event.pull_request.head.repo.fork }} + name: Auto Request Review + runs-on: ubuntu-latest + steps: + - name: Request review based on files changes and/or groups the author belongs to + # v0.7.0 + uses: necojackarc/auto-request-review@e08cdffa277d50854744de3f76230260e61c67f4 + with: + token: ${{ secrets.GITHUB_TOKEN }} + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true diff --git a/.github/workflows/build-triton-wheel.yml b/.github/workflows/build-triton-wheel.yml new file mode 100644 index 000000000000..fac2a1340b42 --- /dev/null +++ b/.github/workflows/build-triton-wheel.yml @@ -0,0 +1,149 @@ +name: Build Triton wheels + +on: + push: + branches: + - main + - master + paths: + - .github/workflows/build-triton-wheel.yml + - .github/scripts/build_triton_wheel.py + - .github/ci_commit_pins/triton.txt + pull_request: + paths: + - .github/workflows/build-triton-wheel.yml + - .github/scripts/build_triton_wheel.py + - .github/ci_commit_pins/triton.txt + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + build-wheel: + runs-on: 
[self-hosted, linux.2xlarge] + strategy: + fail-fast: false + matrix: + py_vers: [ "3.7", "3.8", "3.9", "3.10", "3.11" ] + timeout-minutes: 40 + env: + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + PY_VERS: ${{ matrix.py_vers }} + steps: + - name: Setup SSH (Click me for login details) + uses: pytorch/test-infra/.github/actions/setup-ssh@main + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + with: + submodules: false + + - name: Setup Linux + uses: ./.github/actions/setup-linux + + - name: Pull Docker image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main + with: + docker-image: ${{ env.DOCKER_IMAGE }} + + - name: Build Triton wheel + run: | + set -x + mkdir -p "${RUNNER_TEMP}/artifacts/" + container_name=$(docker run \ + --tty \ + --detach \ + -v "${GITHUB_WORKSPACE}:/pytorch" \ + -v "${RUNNER_TEMP}/artifacts:/artifacts" \ + -w /artifacts/ \ + "${DOCKER_IMAGE}" \ + ) + + # Determine python executable for given version + case $PY_VERS in + 3.7) + PYTHON_EXECUTABLE=/opt/python/cp37-cp37m/bin/python + ;; + 3.8) + PYTHON_EXECUTABLE=/opt/python/cp38-cp38/bin/python + ;; + 3.9) + PYTHON_EXECUTABLE=/opt/python/cp39-cp39/bin/python + ;; + 3.10) + PYTHON_EXECUTABLE=/opt/python/cp310-cp310/bin/python + ;; + 3.11) + PYTHON_EXECUTABLE=/opt/python/cp311-cp311/bin/python + ;; + *) + echo "Unsupported python version ${PY_VERS}" + exit 1 + ;; + esac + + docker exec -t "${container_name}" yum install -y llvm11 llvm11-devel llvm11-static llvm11-libs zlib-devel + docker exec -t "${container_name}" "${PYTHON_EXECUTABLE}" /pytorch/.github/scripts/build_triton_wheel.py + docker exec -t "${container_name}" chown -R 1000.1000 /artifacts + + - uses: actions/upload-artifact@v3 + with: + name: "pytorch-triton-${{ matrix.py_vers }}" + if-no-files-found: error + path: + ${{ runner.temp }}/artifacts/* + + - name: Teardown Linux + uses: pytorch/test-infra/.github/actions/teardown-linux@main + if: always() + upload-wheel: + runs-on: linux.20_04.4x + needs: build-wheel + container: + image: continuumio/miniconda3:4.12.0 + env: + GITHUB_TOKEN: ${{ secrets.github-token }} + steps: + - name: Download Build Artifacts (3.7) + uses: actions/download-artifact@v3 + with: + name: "pytorch-triton-3.7" + path: "${{ runner.temp }}/artifacts/" + - name: Download Build Artifacts (3.8) + uses: actions/download-artifact@v3 + with: + name: "pytorch-triton-3.8" + path: "${{ runner.temp }}/artifacts/" + - name: Download Build Artifacts (3.9) + uses: actions/download-artifact@v3 + with: + name: "pytorch-triton-3.9" + path: "${{ runner.temp }}/artifacts/" + - name: Download Build Artifacts (3.10) + uses: actions/download-artifact@v3 + with: + name: "pytorch-triton-3.10" + path: "${{ runner.temp }}/artifacts/" + - name: Download Build Artifacts (3.11) + uses: actions/download-artifact@v3 + with: + name: "pytorch-triton-3.11" + path: "${{ runner.temp }}/artifacts/" + - name: Upload binaries + if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/master' || github.event.ref == 'refs/heads/main') }} + env: + PKG_DIR: "${{ runner.temp }}/artifacts" + # When running these on pull_request events these should be blank + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_S3_UPDATE_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_S3_UPDATE_SECRET_ACCESS_KEY }} + UPLOAD_BUCKET: "s3://pytorch" + run: | + set -ex + pip install -q awscli + s3_dir="${UPLOAD_BUCKET}/whl/nightly/" + for pkg in "${PKG_DIR}/"*.whl; do + aws s3 cp 
--no-progress --acl public-read "${pkg}" "${s3_dir}" + done diff --git a/.github/workflows/check-labels.yml b/.github/workflows/check-labels.yml new file mode 100644 index 000000000000..5fa5fed16daf --- /dev/null +++ b/.github/workflows/check-labels.yml @@ -0,0 +1,44 @@ +name: Check Labels + +on: + pull_request: + types: [opened, synchronize, reopened, labeled, unlabeled] + workflow_dispatch: + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + check-labels: + name: Check labels + runs-on: linux.20_04.4x + steps: + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + with: + submodules: false + fetch-depth: 1 + + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: '3.8' + architecture: x64 + check-latest: false + cache: pip + cache-dependency-path: | + **/.github/requirements-gha-cache.txt + + - name: Install requirements + id: requirements + run: | + pip install -r .github/requirements-gha-cache.txt --user + + - name: Check labels + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + PR_NUM: ${{ github.event.number }} + run: | + set -ex + python3 .github/scripts/check_labels.py "${PR_NUM}" diff --git a/.github/workflows/docker-builds.yml b/.github/workflows/docker-builds.yml index 974ac458d4ca..3108f4b926a8 100644 --- a/.github/workflows/docker-builds.yml +++ b/.github/workflows/docker-builds.yml @@ -33,20 +33,15 @@ jobs: strategy: matrix: include: - - docker-image-name: pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7 - docker-image-name: pytorch-linux-bionic-cuda11.3-cudnn8-py3-clang9 - docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 - docker-image-name: pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7 - docker-image-name: pytorch-linux-bionic-py3.7-clang9 - - docker-image-name: pytorch-linux-focal-rocm5.1-py3.7 - - docker-image-name: pytorch-linux-focal-rocm5.2-py3.7 + - docker-image-name: pytorch-linux-focal-rocm5.1-py3.8 + - docker-image-name: pytorch-linux-focal-rocm5.2-py3.8 - docker-image-name: pytorch-linux-jammy-cuda11.6-cudnn8-py3.8-clang12 - docker-image-name: pytorch-linux-jammy-cuda11.7-cudnn8-py3.8-clang12 - - docker-image-name: pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7 - - docker-image-name: pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 - - docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c - - docker-image-name: pytorch-linux-xenial-py3-clang5-asan - - docker-image-name: pytorch-linux-xenial-py3-clang7-onnx + - docker-image-name: pytorch-linux-focal-py3-clang7-android-ndk-r19c - docker-image-name: pytorch-linux-focal-py3.7-gcc7 - docker-image-name: pytorch-linux-focal-py3-clang7-asan - docker-image-name: pytorch-linux-focal-py3-clang10-onnx @@ -81,7 +76,7 @@ jobs: push-ghcr-image: ${{ github.event_name == 'push' }} - name: Pull docker image - uses: ./.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: docker-image: ${{ steps.build-docker-image.outputs.docker-image }} @@ -90,5 +85,6 @@ jobs: if: always() - name: Teardown Linux - uses: ./.github/actions/teardown-linux + uses: pytorch/test-infra/.github/actions/teardown-linux@main + if: always() diff --git a/.github/workflows/docker-release.yml b/.github/workflows/docker-release.yml new file mode 100644 index 000000000000..0f9638e210ad --- /dev/null +++ b/.github/workflows/docker-release.yml @@ -0,0 +1,110 @@ +name: Build Official Docker 
Images + +on: + workflow_dispatch: + pull_request: + paths: + - Dockerfile + - docker.Makefile + - .github/workflows/docker-release.yml + push: + branches: + - nightly + tags: + # Release candidate tags look like: v1.11.0-rc1 + - v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+ + - ciflow/nightly/* + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +env: + BUILD_PROGRESS: plain + BUILD_TYPE: official + DOCKER_ORG: pytorch + DOCKER_REGISTRY: ghcr.io + NO_BUILD_SUFFIX: true + USE_BUILDX: 1 + WITH_PUSH: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }} + +jobs: + build: + if: ${{ github.repository == 'pytorch/pytorch' }} + runs-on: [self-hosted, linux.2xlarge] + timeout-minutes: 240 + strategy: + matrix: + include: + # nvidia specific images don't exist for arm64 so only build the runtime image + - image_type: runtime + platform: linux/arm64,linux/amd64 + - image_type: devel + platform: linux/amd64 + env: + BUILD_IMAGE_TYPE: ${{ matrix.image_type }} + BUILD_PLATFORMS: ${{ matrix.platform }} + steps: + - name: Setup SSH (Click me for login details) + uses: pytorch/test-infra/.github/actions/setup-ssh@main + with: + github-secret: ${{ secrets.GITHUB_TOKEN }} + # [see note: pytorch repo ref] + # deep clone (fetch-depth 0) required for git merge-base + - name: Checkout PyTorch + uses: actions/checkout@v3 + with: + fetch-depth: 0 + submodules: 'recursive' + - name: Setup Linux + uses: ./.github/actions/setup-linux + - name: Login to GitHub Container Registry + if: ${{ env.WITH_PUSH == 'true' }} + uses: docker/login-action@v2 + with: + registry: ghcr.io + username: pytorch + password: ${{ secrets.GHCR_PAT }} + # Setup multi-arch image builds + - name: Set up QEMU + uses: docker/setup-qemu-action@v2 + env: + QEMU_BINARY_PATH: ${{ runner.temp }}/bin + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v2 + - name: Setup job specific variables + run: | + set -eou pipefail + # To get QEMU binaries in our PATh + echo "${RUNNER_TEMP}/bin" >> "${GITHUB_PATH}" + # Generate PyTorch version to use + echo "PYTORCH_VERSION=$(python3 .github/scripts/generate_pytorch_version.py)" >> "${GITHUB_ENV}" + - name: Setup nightly specific variables + if: ${{ github.event.ref == 'refs/heads/nightly' || startsWith(github.event.ref, 'refs/tags/ciflow/nightly/') }} + run: | + { + echo "DOCKER_IMAGE=pytorch-nightly"; + echo "INSTALL_CHANNEL=pytorch-nightly"; + echo "TRITON_VERSION=2.0.0+$(cut -c -10 .github/ci_commit_pins/triton.txt)"; + } >> "${GITHUB_ENV}" + - name: Run docker build / push + # WITH_PUSH is used here to determine whether or not to add the --push flag + run: | + make -f docker.Makefile "${BUILD_IMAGE_TYPE}-image" + - name: Push nightly tags + if: ${{ github.event.ref == 'refs/heads/nightly' && matrix.image_type == 'runtime' }} + run: | + PYTORCH_DOCKER_TAG="${PYTORCH_VERSION}-runtime" + CUDA_VERSION=$(python3 -c "import re;print(re.search('CUDA_VERSION\s+=\s+([0-9\.]+)',open('docker.Makefile').read())[1],end='')") + PYTORCH_NIGHTLY_COMMIT=$(docker run ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_DOCKER_TAG}" \ + python -c 'import torch; print(torch.version.git_version[:7],end="")') + docker tag ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_DOCKER_TAG}" \ + ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION}" + docker push 
ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION}" + + docker tag ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION}" \ + ghcr.io/pytorch/pytorch-nightly:latest + docker push ghcr.io/pytorch/pytorch-nightly:latest + - name: Teardown Linux + uses: pytorch/test-infra/.github/actions/teardown-linux@main + if: always() diff --git a/.github/workflows/generated-linux-binary-conda-nightly.yml b/.github/workflows/generated-linux-binary-conda-nightly.yml index 81f779f2f014..f37b8de5144c 100644 --- a/.github/workflows/generated-linux-binary-conda-nightly.yml +++ b/.github/workflows/generated-linux-binary-conda-nightly.yml @@ -93,126 +93,6 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_7-cuda10_2-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 - DESIRED_PYTHON: "3.7" - build_name: conda-py3_7-cuda10_2 - build_environment: linux-binary-conda - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - conda-py3_7-cuda10_2-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_7-cuda10_2-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 - DESIRED_PYTHON: "3.7" - build_name: conda-py3_7-cuda10_2 - build_environment: linux-binary-conda - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - conda-py3_7-cuda10_2-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_7-cuda10_2-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 - DESIRED_PYTHON: "3.7" - build_name: conda-py3_7-cuda10_2 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - conda-py3_7-cuda11_3-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 - DESIRED_PYTHON: "3.7" - build_name: conda-py3_7-cuda11_3 - build_environment: linux-binary-conda - secrets: - github-token: ${{ 
secrets.GITHUB_TOKEN }} - - conda-py3_7-cuda11_3-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_7-cuda11_3-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 - DESIRED_PYTHON: "3.7" - build_name: conda-py3_7-cuda11_3 - build_environment: linux-binary-conda - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - conda-py3_7-cuda11_3-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_7-cuda11_3-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 - DESIRED_PYTHON: "3.7" - build_name: conda-py3_7-cuda11_3 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml conda-py3_7-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml @@ -390,126 +270,6 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_8-cuda10_2-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 - DESIRED_PYTHON: "3.8" - build_name: conda-py3_8-cuda10_2 - build_environment: linux-binary-conda - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - conda-py3_8-cuda10_2-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda10_2-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 - DESIRED_PYTHON: "3.8" - build_name: conda-py3_8-cuda10_2 - build_environment: linux-binary-conda - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - conda-py3_8-cuda10_2-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda10_2-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: 
pytorch/conda-builder:cuda10.2 - DESIRED_PYTHON: "3.8" - build_name: conda-py3_8-cuda10_2 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - conda-py3_8-cuda11_3-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 - DESIRED_PYTHON: "3.8" - build_name: conda-py3_8-cuda11_3 - build_environment: linux-binary-conda - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - conda-py3_8-cuda11_3-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda11_3-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 - DESIRED_PYTHON: "3.8" - build_name: conda-py3_8-cuda11_3 - build_environment: linux-binary-conda - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - conda-py3_8-cuda11_3-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda11_3-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 - DESIRED_PYTHON: "3.8" - build_name: conda-py3_8-cuda11_3 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml conda-py3_8-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml @@ -687,126 +447,6 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_9-cuda10_2-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 - DESIRED_PYTHON: "3.9" - build_name: conda-py3_9-cuda10_2 - build_environment: linux-binary-conda - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - conda-py3_9-cuda10_2-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: 
conda-py3_9-cuda10_2-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 - DESIRED_PYTHON: "3.9" - build_name: conda-py3_9-cuda10_2 - build_environment: linux-binary-conda - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - conda-py3_9-cuda10_2-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda10_2-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 - DESIRED_PYTHON: "3.9" - build_name: conda-py3_9-cuda10_2 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - conda-py3_9-cuda11_3-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 - DESIRED_PYTHON: "3.9" - build_name: conda-py3_9-cuda11_3 - build_environment: linux-binary-conda - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - conda-py3_9-cuda11_3-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda11_3-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 - DESIRED_PYTHON: "3.9" - build_name: conda-py3_9-cuda11_3 - build_environment: linux-binary-conda - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - conda-py3_9-cuda11_3-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda11_3-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 - DESIRED_PYTHON: "3.9" - build_name: conda-py3_9-cuda11_3 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml conda-py3_9-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} 
uses: ./.github/workflows/_binary-build-linux.yml @@ -984,126 +624,6 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_10-cuda10_2-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 - DESIRED_PYTHON: "3.10" - build_name: conda-py3_10-cuda10_2 - build_environment: linux-binary-conda - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - conda-py3_10-cuda10_2-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cuda10_2-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 - DESIRED_PYTHON: "3.10" - build_name: conda-py3_10-cuda10_2 - build_environment: linux-binary-conda - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - conda-py3_10-cuda10_2-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cuda10_2-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda10.2 - DESIRED_PYTHON: "3.10" - build_name: conda-py3_10-cuda10_2 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - conda-py3_10-cuda11_3-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 - DESIRED_PYTHON: "3.10" - build_name: conda-py3_10-cuda11_3 - build_environment: linux-binary-conda - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - conda-py3_10-cuda11_3-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cuda11_3-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 - DESIRED_PYTHON: "3.10" - build_name: conda-py3_10-cuda11_3 - build_environment: linux-binary-conda - 
runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - conda-py3_10-cuda11_3-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cuda11_3-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/conda-builder:cuda11.3 - DESIRED_PYTHON: "3.10" - build_name: conda-py3_10-cuda11_3 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml conda-py3_10-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml diff --git a/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-nightly.yml b/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-nightly.yml index a4cfb807988b..6b1765b9a405 100644 --- a/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-nightly.yml +++ b/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-nightly.yml @@ -276,510 +276,6 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda10_2-shared-with-deps-cxx11-abi-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda10.2 - LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda10_2-shared-with-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda10_2-shared-with-deps-cxx11-abi-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-shared-with-deps-cxx11-abi-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda10.2 - LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda10_2-shared-with-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda10_2-shared-with-deps-cxx11-abi-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-shared-with-deps-cxx11-abi-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 
10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda10.2 - LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda10_2-shared-with-deps-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda10_2-shared-without-deps-cxx11-abi-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda10.2 - LIBTORCH_VARIANT: shared-without-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda10_2-shared-without-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda10_2-shared-without-deps-cxx11-abi-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-shared-without-deps-cxx11-abi-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda10.2 - LIBTORCH_VARIANT: shared-without-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda10_2-shared-without-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda10_2-shared-without-deps-cxx11-abi-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-shared-without-deps-cxx11-abi-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda10.2 - LIBTORCH_VARIANT: shared-without-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda10_2-shared-without-deps-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda10_2-static-with-deps-cxx11-abi-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda10.2 - LIBTORCH_VARIANT: 
static-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda10_2-static-with-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda10_2-static-with-deps-cxx11-abi-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-static-with-deps-cxx11-abi-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda10.2 - LIBTORCH_VARIANT: static-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda10_2-static-with-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda10_2-static-with-deps-cxx11-abi-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-static-with-deps-cxx11-abi-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda10.2 - LIBTORCH_VARIANT: static-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda10_2-static-with-deps-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda10_2-static-without-deps-cxx11-abi-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda10.2 - LIBTORCH_VARIANT: static-without-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda10_2-static-without-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda10_2-static-without-deps-cxx11-abi-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-static-without-deps-cxx11-abi-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda10.2 - LIBTORCH_VARIANT: static-without-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda10_2-static-without-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - 
libtorch-cuda10_2-static-without-deps-cxx11-abi-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-static-without-deps-cxx11-abi-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda10.2 - LIBTORCH_VARIANT: static-without-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda10_2-static-without-deps-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-shared-with-deps-cxx11-abi-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.3 - LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda11_3-shared-with-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda11_3-shared-with-deps-cxx11-abi-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-with-deps-cxx11-abi-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.3 - LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda11_3-shared-with-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda11_3-shared-with-deps-cxx11-abi-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-with-deps-cxx11-abi-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.3 - LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda11_3-shared-with-deps-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-shared-without-deps-cxx11-abi-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: 
./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.3 - LIBTORCH_VARIANT: shared-without-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda11_3-shared-without-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda11_3-shared-without-deps-cxx11-abi-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-without-deps-cxx11-abi-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.3 - LIBTORCH_VARIANT: shared-without-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda11_3-shared-without-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda11_3-shared-without-deps-cxx11-abi-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-without-deps-cxx11-abi-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.3 - LIBTORCH_VARIANT: shared-without-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda11_3-shared-without-deps-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-static-with-deps-cxx11-abi-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.3 - LIBTORCH_VARIANT: static-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda11_3-static-with-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda11_3-static-with-deps-cxx11-abi-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-with-deps-cxx11-abi-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 
11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.3 - LIBTORCH_VARIANT: static-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda11_3-static-with-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda11_3-static-with-deps-cxx11-abi-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-with-deps-cxx11-abi-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.3 - LIBTORCH_VARIANT: static-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda11_3-static-with-deps-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-static-without-deps-cxx11-abi-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.3 - LIBTORCH_VARIANT: static-without-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda11_3-static-without-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda11_3-static-without-deps-cxx11-abi-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-without-deps-cxx11-abi-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.3 - LIBTORCH_VARIANT: static-without-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda11_3-static-without-deps-cxx11-abi - build_environment: linux-binary-libtorch-cxx11-abi - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda11_3-static-without-deps-cxx11-abi-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-without-deps-cxx11-abi-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.3 - LIBTORCH_VARIANT: static-without-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cuda11_3-static-without-deps-cxx11-abi - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: 
${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml libtorch-cuda11_6-shared-with-deps-cxx11-abi-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml @@ -1284,7 +780,7 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - libtorch-rocm5_1_1-shared-with-deps-cxx11-abi-build: + libtorch-rocm5_2-shared-with-deps-cxx11-abi-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1293,20 +789,20 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.2 LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-rocm5_1_1-shared-with-deps-cxx11-abi + build_name: libtorch-rocm5_2-shared-with-deps-cxx11-abi build_environment: linux-binary-libtorch-cxx11-abi secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-rocm5_1_1-shared-with-deps-cxx11-abi-test: # Testing + libtorch-rocm5_2-shared-with-deps-cxx11-abi-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-rocm5_1_1-shared-with-deps-cxx11-abi-build + needs: libtorch-rocm5_2-shared-with-deps-cxx11-abi-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -1315,11 +811,11 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.2 LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: @@ -1349,7 +845,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -1360,10 +861,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: libtorch-rocm5_1_1-shared-with-deps-cxx11-abi + name: libtorch-rocm5_2-shared-with-deps-cxx11-abi path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1392,9 +893,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/libtorch-cxx11-builder:rocm5.1.1 + docker-image: pytorch/libtorch-cxx11-builder:rocm5.2 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -1405,29 +906,29 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - libtorch-rocm5_1_1-shared-with-deps-cxx11-abi-upload: # Uploading + libtorch-rocm5_2-shared-with-deps-cxx11-abi-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-rocm5_1_1-shared-with-deps-cxx11-abi-test + needs: libtorch-rocm5_2-shared-with-deps-cxx11-abi-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.2 LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-rocm5_1_1-shared-with-deps-cxx11-abi + build_name: libtorch-rocm5_2-shared-with-deps-cxx11-abi secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - libtorch-rocm5_1_1-static-with-deps-cxx11-abi-build: + libtorch-rocm5_2-static-with-deps-cxx11-abi-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1436,20 +937,20 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.2 LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-rocm5_1_1-static-with-deps-cxx11-abi + build_name: libtorch-rocm5_2-static-with-deps-cxx11-abi build_environment: linux-binary-libtorch-cxx11-abi secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-rocm5_1_1-static-with-deps-cxx11-abi-test: # Testing + libtorch-rocm5_2-static-with-deps-cxx11-abi-test: # Testing if: ${{ 
github.repository_owner == 'pytorch' }} - needs: libtorch-rocm5_1_1-static-with-deps-cxx11-abi-build + needs: libtorch-rocm5_2-static-with-deps-cxx11-abi-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -1458,11 +959,11 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.2 LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: @@ -1492,7 +993,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -1503,10 +1009,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: libtorch-rocm5_1_1-static-with-deps-cxx11-abi + name: libtorch-rocm5_2-static-with-deps-cxx11-abi path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1535,9 +1041,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/libtorch-cxx11-builder:rocm5.1.1 + docker-image: pytorch/libtorch-cxx11-builder:rocm5.2 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -1548,29 +1054,29 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - libtorch-rocm5_1_1-static-with-deps-cxx11-abi-upload: # Uploading + libtorch-rocm5_2-static-with-deps-cxx11-abi-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-rocm5_1_1-static-with-deps-cxx11-abi-test + needs: libtorch-rocm5_2-static-with-deps-cxx11-abi-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.2 LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-rocm5_1_1-static-with-deps-cxx11-abi + build_name: libtorch-rocm5_2-static-with-deps-cxx11-abi secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: 
${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - libtorch-rocm5_2-shared-with-deps-cxx11-abi-build: + libtorch-rocm5_3-shared-with-deps-cxx11-abi-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1579,20 +1085,20 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.2 + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.3 LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-rocm5_2-shared-with-deps-cxx11-abi + build_name: libtorch-rocm5_3-shared-with-deps-cxx11-abi build_environment: linux-binary-libtorch-cxx11-abi secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-rocm5_2-shared-with-deps-cxx11-abi-test: # Testing + libtorch-rocm5_3-shared-with-deps-cxx11-abi-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-rocm5_2-shared-with-deps-cxx11-abi-build + needs: libtorch-rocm5_3-shared-with-deps-cxx11-abi-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -1601,11 +1107,11 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.2 + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.3 LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: @@ -1635,7 +1141,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -1646,10 +1157,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: libtorch-rocm5_2-shared-with-deps-cxx11-abi + name: libtorch-rocm5_3-shared-with-deps-cxx11-abi path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1678,9 +1189,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/libtorch-cxx11-builder:rocm5.2 + docker-image: pytorch/libtorch-cxx11-builder:rocm5.3 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -1691,29 +1202,29 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - libtorch-rocm5_2-shared-with-deps-cxx11-abi-upload: # Uploading + libtorch-rocm5_3-shared-with-deps-cxx11-abi-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-rocm5_2-shared-with-deps-cxx11-abi-test + needs: libtorch-rocm5_3-shared-with-deps-cxx11-abi-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.2 + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.3 LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-rocm5_2-shared-with-deps-cxx11-abi + build_name: libtorch-rocm5_3-shared-with-deps-cxx11-abi secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - libtorch-rocm5_2-static-with-deps-cxx11-abi-build: + libtorch-rocm5_3-static-with-deps-cxx11-abi-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1722,20 +1233,20 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.2 + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.3 LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-rocm5_2-static-with-deps-cxx11-abi + build_name: libtorch-rocm5_3-static-with-deps-cxx11-abi build_environment: linux-binary-libtorch-cxx11-abi secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-rocm5_2-static-with-deps-cxx11-abi-test: # Testing + libtorch-rocm5_3-static-with-deps-cxx11-abi-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - 
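(Note on the runner health-check hunk that recurs in the ROCm test jobs above: the patch replaces the single "Failed to detect GPUs on the runner" message with a diagnostic that distinguishes "no GPUs detected" from "an unexpected GPU count" and asks the operator to report the faulty runner. Consolidated from the +/- lines into one readable step, the post-patch check is roughly the following; the run script and messages are taken from the hunks, while the step name and surrounding job context are assumed for illustration:

  - name: ROCm runner health check
    run: |
      # Healthy ROCm runners are expected to expose exactly 2 or 4 GPUs.
      ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx')
      if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then
        if [[ $ngpu -eq 0 ]]; then
          echo "Error: Failed to detect any GPUs on the runner"
        else
          echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected"
        fi
        echo "Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified"
        exit 1
      fi

The generated diff below continues unchanged.)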
needs: libtorch-rocm5_2-static-with-deps-cxx11-abi-build + needs: libtorch-rocm5_3-static-with-deps-cxx11-abi-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -1744,11 +1255,11 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.2 + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.3 LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi steps: @@ -1778,7 +1289,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -1789,10 +1305,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: libtorch-rocm5_2-static-with-deps-cxx11-abi + name: libtorch-rocm5_3-static-with-deps-cxx11-abi path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1821,9 +1337,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/libtorch-cxx11-builder:rocm5.2 + docker-image: pytorch/libtorch-cxx11-builder:rocm5.3 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -1834,22 +1350,22 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - libtorch-rocm5_2-static-with-deps-cxx11-abi-upload: # Uploading + libtorch-rocm5_3-static-with-deps-cxx11-abi-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-rocm5_2-static-with-deps-cxx11-abi-test + needs: libtorch-rocm5_3-static-with-deps-cxx11-abi-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.2 + DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.3 LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-rocm5_2-static-with-deps-cxx11-abi + build_name: libtorch-rocm5_3-static-with-deps-cxx11-abi secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} diff --git a/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-master.yml b/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-master.yml index 
edacb2e949b0..39e41e67853a 100644 --- a/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-master.yml +++ b/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-master.yml @@ -31,7 +31,7 @@ concurrency: cancel-in-progress: true jobs: - libtorch-cpu-shared-with-deps-cxx11-abi-build: + libtorch-cpu-shared-with-deps-pre-cxx11-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -42,17 +42,17 @@ jobs: # favor of GPU_ARCH_VERSION DESIRED_CUDA: cpu GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu + DOCKER_IMAGE: pytorch/manylinux-builder:cpu LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cpu-shared-with-deps-cxx11-abi + DESIRED_DEVTOOLSET: pre-cxx11 + build_name: libtorch-cpu-shared-with-deps-pre-cxx11 build_environment: linux-binary-libtorch-pre-cxx11 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cpu-shared-with-deps-cxx11-abi-test: # Testing + libtorch-cpu-shared-with-deps-pre-cxx11-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cpu-shared-with-deps-cxx11-abi-build + needs: libtorch-cpu-shared-with-deps-pre-cxx11-build uses: ./.github/workflows/_binary-test-linux.yml with: PYTORCH_ROOT: /pytorch @@ -62,10 +62,10 @@ jobs: # favor of GPU_ARCH_VERSION DESIRED_CUDA: cpu GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu + DOCKER_IMAGE: pytorch/manylinux-builder:cpu LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: cxx11-abi - build_name: libtorch-cpu-shared-with-deps-cxx11-abi + DESIRED_DEVTOOLSET: pre-cxx11 + build_name: libtorch-cpu-shared-with-deps-pre-cxx11 build_environment: linux-binary-libtorch-pre-cxx11 runs_on: linux.4xlarge secrets: diff --git a/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-nightly.yml b/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-nightly.yml index a09ce3c930a3..eaa928f3e09a 100644 --- a/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-nightly.yml +++ b/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-nightly.yml @@ -276,510 +276,6 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda10_2-shared-with-deps-pre-cxx11-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda10_2-shared-with-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda10_2-shared-with-deps-pre-cxx11-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-shared-with-deps-pre-cxx11-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - 
DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda10_2-shared-with-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda10_2-shared-with-deps-pre-cxx11-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-shared-with-deps-pre-cxx11-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda10_2-shared-with-deps-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda10_2-shared-without-deps-pre-cxx11-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - LIBTORCH_VARIANT: shared-without-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda10_2-shared-without-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda10_2-shared-without-deps-pre-cxx11-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-shared-without-deps-pre-cxx11-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - LIBTORCH_VARIANT: shared-without-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda10_2-shared-without-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda10_2-shared-without-deps-pre-cxx11-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-shared-without-deps-pre-cxx11-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - LIBTORCH_VARIANT: shared-without-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda10_2-shared-without-deps-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - 
aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda10_2-static-with-deps-pre-cxx11-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - LIBTORCH_VARIANT: static-with-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda10_2-static-with-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda10_2-static-with-deps-pre-cxx11-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-static-with-deps-pre-cxx11-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - LIBTORCH_VARIANT: static-with-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda10_2-static-with-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda10_2-static-with-deps-pre-cxx11-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-static-with-deps-pre-cxx11-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - LIBTORCH_VARIANT: static-with-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda10_2-static-with-deps-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda10_2-static-without-deps-pre-cxx11-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - LIBTORCH_VARIANT: static-without-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda10_2-static-without-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda10_2-static-without-deps-pre-cxx11-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-static-without-deps-pre-cxx11-build 
- uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - LIBTORCH_VARIANT: static-without-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda10_2-static-without-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda10_2-static-without-deps-pre-cxx11-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda10_2-static-without-deps-pre-cxx11-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - LIBTORCH_VARIANT: static-without-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda10_2-static-without-deps-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-shared-with-deps-pre-cxx11-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda11_3-shared-with-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda11_3-shared-with-deps-pre-cxx11-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-with-deps-pre-cxx11-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda11_3-shared-with-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda11_3-shared-with-deps-pre-cxx11-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-with-deps-pre-cxx11-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - 
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - LIBTORCH_VARIANT: shared-with-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda11_3-shared-with-deps-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-shared-without-deps-pre-cxx11-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - LIBTORCH_VARIANT: shared-without-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda11_3-shared-without-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda11_3-shared-without-deps-pre-cxx11-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-without-deps-pre-cxx11-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - LIBTORCH_VARIANT: shared-without-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda11_3-shared-without-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda11_3-shared-without-deps-pre-cxx11-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-without-deps-pre-cxx11-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - LIBTORCH_VARIANT: shared-without-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda11_3-shared-without-deps-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-static-with-deps-pre-cxx11-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - LIBTORCH_VARIANT: static-with-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: 
libtorch-cuda11_3-static-with-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda11_3-static-with-deps-pre-cxx11-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-with-deps-pre-cxx11-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - LIBTORCH_VARIANT: static-with-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda11_3-static-with-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda11_3-static-with-deps-pre-cxx11-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-with-deps-pre-cxx11-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - LIBTORCH_VARIANT: static-with-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda11_3-static-with-deps-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-static-without-deps-pre-cxx11-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - LIBTORCH_VARIANT: static-without-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda11_3-static-without-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - libtorch-cuda11_3-static-without-deps-pre-cxx11-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-without-deps-pre-cxx11-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - LIBTORCH_VARIANT: static-without-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda11_3-static-without-deps-pre-cxx11 - build_environment: linux-binary-libtorch-pre-cxx11 - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-cuda11_3-static-without-deps-pre-cxx11-upload: # Uploading - if: ${{ github.repository_owner == 
'pytorch' }} - needs: libtorch-cuda11_3-static-without-deps-pre-cxx11-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - LIBTORCH_VARIANT: static-without-deps - DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-cuda11_3-static-without-deps-pre-cxx11 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml libtorch-cuda11_6-shared-with-deps-pre-cxx11-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml @@ -1284,7 +780,7 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - libtorch-rocm5_1_1-shared-with-deps-pre-cxx11-build: + libtorch-rocm5_2-shared-with-deps-pre-cxx11-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1293,20 +789,20 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-rocm5_1_1-shared-with-deps-pre-cxx11 + build_name: libtorch-rocm5_2-shared-with-deps-pre-cxx11 build_environment: linux-binary-libtorch-pre-cxx11 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-rocm5_1_1-shared-with-deps-pre-cxx11-test: # Testing + libtorch-rocm5_2-shared-with-deps-pre-cxx11-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-rocm5_1_1-shared-with-deps-pre-cxx11-build + needs: libtorch-rocm5_2-shared-with-deps-pre-cxx11-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -1315,11 +811,11 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: @@ -1349,7 +845,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -1360,10 +861,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: libtorch-rocm5_1_1-shared-with-deps-pre-cxx11 + name: libtorch-rocm5_2-shared-with-deps-pre-cxx11 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1392,9 +893,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/manylinux-builder:rocm5.1.1 + docker-image: pytorch/manylinux-builder:rocm5.2 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -1405,29 +906,29 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - libtorch-rocm5_1_1-shared-with-deps-pre-cxx11-upload: # Uploading + libtorch-rocm5_2-shared-with-deps-pre-cxx11-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-rocm5_1_1-shared-with-deps-pre-cxx11-test + needs: libtorch-rocm5_2-shared-with-deps-pre-cxx11-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-rocm5_1_1-shared-with-deps-pre-cxx11 + build_name: libtorch-rocm5_2-shared-with-deps-pre-cxx11 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - libtorch-rocm5_1_1-static-with-deps-pre-cxx11-build: + libtorch-rocm5_2-static-with-deps-pre-cxx11-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1436,20 +937,20 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-rocm5_1_1-static-with-deps-pre-cxx11 + build_name: libtorch-rocm5_2-static-with-deps-pre-cxx11 build_environment: linux-binary-libtorch-pre-cxx11 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-rocm5_1_1-static-with-deps-pre-cxx11-test: # Testing + libtorch-rocm5_2-static-with-deps-pre-cxx11-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: 
libtorch-rocm5_1_1-static-with-deps-pre-cxx11-build + needs: libtorch-rocm5_2-static-with-deps-pre-cxx11-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -1458,11 +959,11 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: @@ -1492,7 +993,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -1503,10 +1009,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: libtorch-rocm5_1_1-static-with-deps-pre-cxx11 + name: libtorch-rocm5_2-static-with-deps-pre-cxx11 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1535,9 +1041,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/manylinux-builder:rocm5.1.1 + docker-image: pytorch/manylinux-builder:rocm5.2 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -1548,29 +1054,29 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - libtorch-rocm5_1_1-static-with-deps-pre-cxx11-upload: # Uploading + libtorch-rocm5_2-static-with-deps-pre-cxx11-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-rocm5_1_1-static-with-deps-pre-cxx11-test + needs: libtorch-rocm5_2-static-with-deps-pre-cxx11-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-rocm5_1_1-static-with-deps-pre-cxx11 + build_name: libtorch-rocm5_2-static-with-deps-pre-cxx11 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: 
./.github/workflows/_binary-upload.yml - libtorch-rocm5_2-shared-with-deps-pre-cxx11-build: + libtorch-rocm5_3-shared-with-deps-pre-cxx11-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1579,20 +1085,20 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-rocm5_2-shared-with-deps-pre-cxx11 + build_name: libtorch-rocm5_3-shared-with-deps-pre-cxx11 build_environment: linux-binary-libtorch-pre-cxx11 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-rocm5_2-shared-with-deps-pre-cxx11-test: # Testing + libtorch-rocm5_3-shared-with-deps-pre-cxx11-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-rocm5_2-shared-with-deps-pre-cxx11-build + needs: libtorch-rocm5_3-shared-with-deps-pre-cxx11-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -1601,11 +1107,11 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: @@ -1635,7 +1141,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -1646,10 +1157,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: libtorch-rocm5_2-shared-with-deps-pre-cxx11 + name: libtorch-rocm5_3-shared-with-deps-pre-cxx11 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1678,9 +1189,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/manylinux-builder:rocm5.2 + docker-image: pytorch/manylinux-builder:rocm5.3 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -1691,29 +1202,29 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - libtorch-rocm5_2-shared-with-deps-pre-cxx11-upload: # Uploading + libtorch-rocm5_3-shared-with-deps-pre-cxx11-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-rocm5_2-shared-with-deps-pre-cxx11-test + needs: libtorch-rocm5_3-shared-with-deps-pre-cxx11-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 LIBTORCH_VARIANT: shared-with-deps DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-rocm5_2-shared-with-deps-pre-cxx11 + build_name: libtorch-rocm5_3-shared-with-deps-pre-cxx11 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - libtorch-rocm5_2-static-with-deps-pre-cxx11-build: + libtorch-rocm5_3-static-with-deps-pre-cxx11-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1722,20 +1233,20 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-rocm5_2-static-with-deps-pre-cxx11 + build_name: libtorch-rocm5_3-static-with-deps-pre-cxx11 build_environment: linux-binary-libtorch-pre-cxx11 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - libtorch-rocm5_2-static-with-deps-pre-cxx11-test: # Testing + libtorch-rocm5_3-static-with-deps-pre-cxx11-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: 
libtorch-rocm5_2-static-with-deps-pre-cxx11-build + needs: libtorch-rocm5_3-static-with-deps-pre-cxx11-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -1744,11 +1255,11 @@ jobs: PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 steps: @@ -1778,7 +1289,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -1789,10 +1305,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: libtorch-rocm5_2-static-with-deps-pre-cxx11 + name: libtorch-rocm5_3-static-with-deps-pre-cxx11 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1821,9 +1337,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/manylinux-builder:rocm5.2 + docker-image: pytorch/manylinux-builder:rocm5.3 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -1834,22 +1350,22 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - libtorch-rocm5_2-static-with-deps-pre-cxx11-upload: # Uploading + libtorch-rocm5_3-static-with-deps-pre-cxx11-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-rocm5_2-static-with-deps-pre-cxx11-test + needs: libtorch-rocm5_3-static-with-deps-pre-cxx11-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: libtorch # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 LIBTORCH_VARIANT: static-with-deps DESIRED_DEVTOOLSET: pre-cxx11 - build_name: libtorch-rocm5_2-static-with-deps-pre-cxx11 + build_name: libtorch-rocm5_3-static-with-deps-pre-cxx11 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} diff --git a/.github/workflows/generated-linux-binary-manywheel-master.yml b/.github/workflows/generated-linux-binary-manywheel-master.yml index 6412c82b0c46..e085fb5eb5fb 100644 --- 
a/.github/workflows/generated-linux-binary-manywheel-master.yml +++ b/.github/workflows/generated-linux-binary-manywheel-master.yml @@ -31,7 +31,7 @@ concurrency: cancel-in-progress: true jobs: - manywheel-py3_7-cuda10_2-build: + manywheel-py3_7-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -40,19 +40,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-cuda10_2 + build_name: manywheel-py3_7-cuda11_6 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_7-cuda10_2-test: # Testing + manywheel-py3_7-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-cuda10_2-build + needs: manywheel-py3_7-cuda11_6-build uses: ./.github/workflows/_binary-test-linux.yml with: PYTORCH_ROOT: /pytorch @@ -60,12 +60,12 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-cuda10_2 + build_name: manywheel-py3_7-cuda11_6 build_environment: linux-binary-manywheel runs_on: linux.4xlarge.nvidia.gpu secrets: diff --git a/.github/workflows/generated-linux-binary-manywheel-nightly.yml b/.github/workflows/generated-linux-binary-manywheel-nightly.yml index 1867ce103a14..b93f797d7e01 100644 --- a/.github/workflows/generated-linux-binary-manywheel-nightly.yml +++ b/.github/workflows/generated-linux-binary-manywheel-nightly.yml @@ -93,67 +93,7 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_7-cuda10_2-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-cuda10_2 - build_environment: linux-binary-manywheel - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - manywheel-py3_7-cuda10_2-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-cuda10_2-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-cuda10_2 - build_environment: linux-binary-manywheel - runs_on: 
linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_7-cuda10_2-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-cuda10_2-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-cuda10_2 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_7-cuda11_3-build: + manywheel-py3_7-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -162,19 +102,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-cuda11_3 + build_name: manywheel-py3_7-cuda11_6 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_7-cuda11_3-test: # Testing + manywheel-py3_7-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-cuda11_3-build + needs: manywheel-py3_7-cuda11_6-build uses: ./.github/workflows/_binary-test-linux.yml with: PYTORCH_ROOT: /pytorch @@ -182,38 +122,38 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-cuda11_3 + build_name: manywheel-py3_7-cuda11_6 build_environment: linux-binary-manywheel runs_on: linux.4xlarge.nvidia.gpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_7-cuda11_3-upload: # Uploading + manywheel-py3_7-cuda11_6-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-cuda11_3-test + needs: manywheel-py3_7-cuda11_6-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-cuda11_3 + build_name: manywheel-py3_7-cuda11_6 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: 
./.github/workflows/_binary-upload.yml - manywheel-py3_7-cuda11_6-build: + manywheel-py3_7-cuda11_7-with-pypi-cudnn-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -222,19 +162,20 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-cuda11_6 + build_name: manywheel-py3_7-cuda11_7-with-pypi-cudnn build_environment: linux-binary-manywheel + PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-runtime-cu11; platform_system == 'Linux' | nvidia-cudnn-cu11==8.5.0.96; platform_system == 'Linux' | nvidia-cublas-cu11==11.10.3.66; platform_system == 'Linux' secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_7-cuda11_6-test: # Testing + manywheel-py3_7-cuda11_7-with-pypi-cudnn-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-cuda11_6-build + needs: manywheel-py3_7-cuda11_7-with-pypi-cudnn-build uses: ./.github/workflows/_binary-test-linux.yml with: PYTORCH_ROOT: /pytorch @@ -242,31 +183,31 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-cuda11_6 + build_name: manywheel-py3_7-cuda11_7-with-pypi-cudnn build_environment: linux-binary-manywheel runs_on: linux.4xlarge.nvidia.gpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_7-cuda11_6-upload: # Uploading + manywheel-py3_7-cuda11_7-with-pypi-cudnn-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-cuda11_6-test + needs: manywheel-py3_7-cuda11_7-with-pypi-cudnn-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-cuda11_6 + build_name: manywheel-py3_7-cuda11_7-with-pypi-cudnn secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} @@ -333,7 +274,7 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_7-rocm5_1_1-build: + manywheel-py3_7-rocm5_2-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -342,19 +283,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: 
pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-rocm5_1_1 + build_name: manywheel-py3_7-rocm5_2 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_7-rocm5_1_1-test: # Testing + manywheel-py3_7-rocm5_2-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-rocm5_1_1-build + needs: manywheel-py3_7-rocm5_2-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -363,11 +304,11 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 DESIRED_PYTHON: "3.7" steps: - name: Clean workspace @@ -396,7 +337,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -407,10 +353,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: manywheel-py3_7-rocm5_1_1 + name: manywheel-py3_7-rocm5_2 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -439,9 +385,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/manylinux-builder:rocm5.1.1 + docker-image: pytorch/manylinux-builder:rocm5.2 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -452,28 +398,28 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_7-rocm5_1_1-upload: # Uploading + manywheel-py3_7-rocm5_2-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-rocm5_1_1-test + needs: manywheel-py3_7-rocm5_2-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-rocm5_1_1 + build_name: manywheel-py3_7-rocm5_2 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ 
secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_7-rocm5_2-build: + manywheel-py3_7-rocm5_3-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -482,19 +428,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-rocm5_2 + build_name: manywheel-py3_7-rocm5_3 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_7-rocm5_2-test: # Testing + manywheel-py3_7-rocm5_3-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-rocm5_2-build + needs: manywheel-py3_7-rocm5_3-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -503,11 +449,11 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 DESIRED_PYTHON: "3.7" steps: - name: Clean workspace @@ -536,7 +482,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -547,10 +498,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: manywheel-py3_7-rocm5_2 + name: manywheel-py3_7-rocm5_3 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -579,9 +530,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/manylinux-builder:rocm5.2 + docker-image: pytorch/manylinux-builder:rocm5.3 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -592,21 +543,21 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_7-rocm5_2-upload: # Uploading + manywheel-py3_7-rocm5_3-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_7-rocm5_2-test + needs: manywheel-py3_7-rocm5_3-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 DESIRED_PYTHON: "3.7" - build_name: manywheel-py3_7-rocm5_2 + build_name: manywheel-py3_7-rocm5_3 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} @@ -670,67 +621,7 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_8-cuda10_2-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - DESIRED_PYTHON: "3.8" - build_name: manywheel-py3_8-cuda10_2 - build_environment: linux-binary-manywheel - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - manywheel-py3_8-cuda10_2-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cuda10_2-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - DESIRED_PYTHON: "3.8" - build_name: manywheel-py3_8-cuda10_2 - build_environment: linux-binary-manywheel - runs_on: linux.4xlarge.nvidia.gpu - 
secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_8-cuda10_2-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cuda10_2-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - DESIRED_PYTHON: "3.8" - build_name: manywheel-py3_8-cuda10_2 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_8-cuda11_3-build: + manywheel-py3_8-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -739,19 +630,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.8" - build_name: manywheel-py3_8-cuda11_3 + build_name: manywheel-py3_8-cuda11_6 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_8-cuda11_3-test: # Testing + manywheel-py3_8-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cuda11_3-build + needs: manywheel-py3_8-cuda11_6-build uses: ./.github/workflows/_binary-test-linux.yml with: PYTORCH_ROOT: /pytorch @@ -759,38 +650,38 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.8" - build_name: manywheel-py3_8-cuda11_3 + build_name: manywheel-py3_8-cuda11_6 build_environment: linux-binary-manywheel runs_on: linux.4xlarge.nvidia.gpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_8-cuda11_3-upload: # Uploading + manywheel-py3_8-cuda11_6-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cuda11_3-test + needs: manywheel-py3_8-cuda11_6-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.8" - build_name: manywheel-py3_8-cuda11_3 + build_name: manywheel-py3_8-cuda11_6 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: 
./.github/workflows/_binary-upload.yml - manywheel-py3_8-cuda11_6-build: + manywheel-py3_8-cuda11_7-with-pypi-cudnn-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -799,19 +690,20 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.8" - build_name: manywheel-py3_8-cuda11_6 + build_name: manywheel-py3_8-cuda11_7-with-pypi-cudnn build_environment: linux-binary-manywheel + PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-runtime-cu11; platform_system == 'Linux' | nvidia-cudnn-cu11==8.5.0.96; platform_system == 'Linux' | nvidia-cublas-cu11==11.10.3.66; platform_system == 'Linux' secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_8-cuda11_6-test: # Testing + manywheel-py3_8-cuda11_7-with-pypi-cudnn-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cuda11_6-build + needs: manywheel-py3_8-cuda11_7-with-pypi-cudnn-build uses: ./.github/workflows/_binary-test-linux.yml with: PYTORCH_ROOT: /pytorch @@ -819,31 +711,31 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.8" - build_name: manywheel-py3_8-cuda11_6 + build_name: manywheel-py3_8-cuda11_7-with-pypi-cudnn build_environment: linux-binary-manywheel runs_on: linux.4xlarge.nvidia.gpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_8-cuda11_6-upload: # Uploading + manywheel-py3_8-cuda11_7-with-pypi-cudnn-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-cuda11_6-test + needs: manywheel-py3_8-cuda11_7-with-pypi-cudnn-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.8" - build_name: manywheel-py3_8-cuda11_6 + build_name: manywheel-py3_8-cuda11_7-with-pypi-cudnn secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} @@ -910,7 +802,7 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_8-rocm5_1_1-build: + manywheel-py3_8-rocm5_2-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -919,19 +811,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: 
pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 DESIRED_PYTHON: "3.8" - build_name: manywheel-py3_8-rocm5_1_1 + build_name: manywheel-py3_8-rocm5_2 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_8-rocm5_1_1-test: # Testing + manywheel-py3_8-rocm5_2-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-rocm5_1_1-build + needs: manywheel-py3_8-rocm5_2-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -940,11 +832,11 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 DESIRED_PYTHON: "3.8" steps: - name: Clean workspace @@ -973,7 +865,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -984,10 +881,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-rocm5_1_1 + name: manywheel-py3_8-rocm5_2 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1016,9 +913,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/manylinux-builder:rocm5.1.1 + docker-image: pytorch/manylinux-builder:rocm5.2 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -1029,28 +926,28 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-rocm5_1_1-upload: # Uploading + manywheel-py3_8-rocm5_2-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-rocm5_1_1-test + needs: manywheel-py3_8-rocm5_2-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 DESIRED_PYTHON: "3.8" - build_name: manywheel-py3_8-rocm5_1_1 + build_name: manywheel-py3_8-rocm5_2 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: 
${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_8-rocm5_2-build: + manywheel-py3_8-rocm5_3-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1059,19 +956,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 DESIRED_PYTHON: "3.8" - build_name: manywheel-py3_8-rocm5_2 + build_name: manywheel-py3_8-rocm5_3 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_8-rocm5_2-test: # Testing + manywheel-py3_8-rocm5_3-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-rocm5_2-build + needs: manywheel-py3_8-rocm5_3-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -1080,11 +977,11 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 DESIRED_PYTHON: "3.8" steps: - name: Clean workspace @@ -1113,7 +1010,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -1124,10 +1026,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: manywheel-py3_8-rocm5_2 + name: manywheel-py3_8-rocm5_3 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1156,9 +1058,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/manylinux-builder:rocm5.2 + docker-image: pytorch/manylinux-builder:rocm5.3 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -1169,21 +1071,21 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_8-rocm5_2-upload: # Uploading + manywheel-py3_8-rocm5_3-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_8-rocm5_2-test + needs: manywheel-py3_8-rocm5_3-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 DESIRED_PYTHON: "3.8" - build_name: manywheel-py3_8-rocm5_2 + build_name: manywheel-py3_8-rocm5_3 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} @@ -1247,7 +1149,7 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_9-cuda10_2-build: + manywheel-py3_9-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1256,19 +1158,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.9" - build_name: manywheel-py3_9-cuda10_2 + build_name: manywheel-py3_9-cuda11_6 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_9-cuda10_2-test: # Testing + manywheel-py3_9-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cuda10_2-build + needs: manywheel-py3_9-cuda11_6-build uses: ./.github/workflows/_binary-test-linux.yml with: PYTORCH_ROOT: /pytorch @@ -1276,98 +1178,38 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 + 
DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.9" - build_name: manywheel-py3_9-cuda10_2 + build_name: manywheel-py3_9-cuda11_6 build_environment: linux-binary-manywheel runs_on: linux.4xlarge.nvidia.gpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_9-cuda10_2-upload: # Uploading + manywheel-py3_9-cuda11_6-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cuda10_2-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - DESIRED_PYTHON: "3.9" - build_name: manywheel-py3_9-cuda10_2 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_9-cuda11_3-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - DESIRED_PYTHON: "3.9" - build_name: manywheel-py3_9-cuda11_3 - build_environment: linux-binary-manywheel - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - manywheel-py3_9-cuda11_3-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cuda11_3-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 - DESIRED_PYTHON: "3.9" - build_name: manywheel-py3_9-cuda11_3 - build_environment: linux-binary-manywheel - runs_on: linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_9-cuda11_3-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cuda11_3-test + needs: manywheel-py3_9-cuda11_6-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.9" - build_name: manywheel-py3_9-cuda11_3 + build_name: manywheel-py3_9-cuda11_6 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: 
./.github/workflows/_binary-upload.yml - manywheel-py3_9-cuda11_6-build: + manywheel-py3_9-cuda11_7-with-pypi-cudnn-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1376,19 +1218,20 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.9" - build_name: manywheel-py3_9-cuda11_6 + build_name: manywheel-py3_9-cuda11_7-with-pypi-cudnn build_environment: linux-binary-manywheel + PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-runtime-cu11; platform_system == 'Linux' | nvidia-cudnn-cu11==8.5.0.96; platform_system == 'Linux' | nvidia-cublas-cu11==11.10.3.66; platform_system == 'Linux' secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_9-cuda11_6-test: # Testing + manywheel-py3_9-cuda11_7-with-pypi-cudnn-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cuda11_6-build + needs: manywheel-py3_9-cuda11_7-with-pypi-cudnn-build uses: ./.github/workflows/_binary-test-linux.yml with: PYTORCH_ROOT: /pytorch @@ -1396,31 +1239,31 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.9" - build_name: manywheel-py3_9-cuda11_6 + build_name: manywheel-py3_9-cuda11_7-with-pypi-cudnn build_environment: linux-binary-manywheel runs_on: linux.4xlarge.nvidia.gpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_9-cuda11_6-upload: # Uploading + manywheel-py3_9-cuda11_7-with-pypi-cudnn-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-cuda11_6-test + needs: manywheel-py3_9-cuda11_7-with-pypi-cudnn-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.9" - build_name: manywheel-py3_9-cuda11_6 + build_name: manywheel-py3_9-cuda11_7-with-pypi-cudnn secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} @@ -1487,7 +1330,7 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_9-rocm5_1_1-build: + manywheel-py3_9-rocm5_2-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1496,19 +1339,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: 
pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 DESIRED_PYTHON: "3.9" - build_name: manywheel-py3_9-rocm5_1_1 + build_name: manywheel-py3_9-rocm5_2 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_9-rocm5_1_1-test: # Testing + manywheel-py3_9-rocm5_2-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-rocm5_1_1-build + needs: manywheel-py3_9-rocm5_2-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -1517,11 +1360,11 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 DESIRED_PYTHON: "3.9" steps: - name: Clean workspace @@ -1550,7 +1393,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -1561,10 +1409,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-rocm5_1_1 + name: manywheel-py3_9-rocm5_2 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1593,9 +1441,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/manylinux-builder:rocm5.1.1 + docker-image: pytorch/manylinux-builder:rocm5.2 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -1606,28 +1454,28 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-rocm5_1_1-upload: # Uploading + manywheel-py3_9-rocm5_2-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-rocm5_1_1-test + needs: manywheel-py3_9-rocm5_2-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 DESIRED_PYTHON: "3.9" - build_name: manywheel-py3_9-rocm5_1_1 + build_name: manywheel-py3_9-rocm5_2 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} 
aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_9-rocm5_2-build: + manywheel-py3_9-rocm5_3-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1636,19 +1484,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 DESIRED_PYTHON: "3.9" - build_name: manywheel-py3_9-rocm5_2 + build_name: manywheel-py3_9-rocm5_3 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_9-rocm5_2-test: # Testing + manywheel-py3_9-rocm5_3-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-rocm5_2-build + needs: manywheel-py3_9-rocm5_3-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -1657,11 +1505,11 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 DESIRED_PYTHON: "3.9" steps: - name: Clean workspace @@ -1690,7 +1538,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -1701,10 +1554,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: manywheel-py3_9-rocm5_2 + name: manywheel-py3_9-rocm5_3 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1733,9 +1586,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/manylinux-builder:rocm5.2 + docker-image: pytorch/manylinux-builder:rocm5.3 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -1746,21 +1599,21 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_9-rocm5_2-upload: # Uploading + manywheel-py3_9-rocm5_3-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_9-rocm5_2-test + needs: manywheel-py3_9-rocm5_3-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 DESIRED_PYTHON: "3.9" - build_name: manywheel-py3_9-rocm5_2 + build_name: manywheel-py3_9-rocm5_3 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} @@ -1824,67 +1677,7 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_10-cuda10_2-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - DESIRED_PYTHON: "3.10" - build_name: manywheel-py3_10-cuda10_2 - build_environment: linux-binary-manywheel - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - manywheel-py3_10-cuda10_2-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cuda10_2-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - DESIRED_PYTHON: "3.10" - build_name: manywheel-py3_10-cuda10_2 - build_environment: linux-binary-manywheel - runs_on: 
linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_10-cuda10_2-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cuda10_2-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - DESIRED_PYTHON: "3.10" - build_name: manywheel-py3_10-cuda10_2 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_10-cuda11_3-build: + manywheel-py3_10-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1893,19 +1686,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.10" - build_name: manywheel-py3_10-cuda11_3 + build_name: manywheel-py3_10-cuda11_6 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_10-cuda11_3-test: # Testing + manywheel-py3_10-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cuda11_3-build + needs: manywheel-py3_10-cuda11_6-build uses: ./.github/workflows/_binary-test-linux.yml with: PYTORCH_ROOT: /pytorch @@ -1913,38 +1706,38 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.10" - build_name: manywheel-py3_10-cuda11_3 + build_name: manywheel-py3_10-cuda11_6 build_environment: linux-binary-manywheel runs_on: linux.4xlarge.nvidia.gpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_10-cuda11_3-upload: # Uploading + manywheel-py3_10-cuda11_6-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cuda11_3-test + needs: manywheel-py3_10-cuda11_6-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.10" - build_name: manywheel-py3_10-cuda11_3 + build_name: manywheel-py3_10-cuda11_6 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ 
secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_10-cuda11_6-build: + manywheel-py3_10-cuda11_7-with-pypi-cudnn-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -1953,19 +1746,20 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.10" - build_name: manywheel-py3_10-cuda11_6 + build_name: manywheel-py3_10-cuda11_7-with-pypi-cudnn build_environment: linux-binary-manywheel + PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-runtime-cu11; platform_system == 'Linux' | nvidia-cudnn-cu11==8.5.0.96; platform_system == 'Linux' | nvidia-cublas-cu11==11.10.3.66; platform_system == 'Linux' secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_10-cuda11_6-test: # Testing + manywheel-py3_10-cuda11_7-with-pypi-cudnn-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cuda11_6-build + needs: manywheel-py3_10-cuda11_7-with-pypi-cudnn-build uses: ./.github/workflows/_binary-test-linux.yml with: PYTORCH_ROOT: /pytorch @@ -1973,31 +1767,31 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.10" - build_name: manywheel-py3_10-cuda11_6 + build_name: manywheel-py3_10-cuda11_7-with-pypi-cudnn build_environment: linux-binary-manywheel runs_on: linux.4xlarge.nvidia.gpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_10-cuda11_6-upload: # Uploading + manywheel-py3_10-cuda11_7-with-pypi-cudnn-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-cuda11_6-test + needs: manywheel-py3_10-cuda11_7-with-pypi-cudnn-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.10" - build_name: manywheel-py3_10-cuda11_6 + build_name: manywheel-py3_10-cuda11_7-with-pypi-cudnn secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} @@ -2064,7 +1858,7 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_10-rocm5_1_1-build: + manywheel-py3_10-rocm5_2-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -2073,19 +1867,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + 
GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 DESIRED_PYTHON: "3.10" - build_name: manywheel-py3_10-rocm5_1_1 + build_name: manywheel-py3_10-rocm5_2 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_10-rocm5_1_1-test: # Testing + manywheel-py3_10-rocm5_2-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-rocm5_1_1-build + needs: manywheel-py3_10-rocm5_2-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -2094,11 +1888,11 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 DESIRED_PYTHON: "3.10" steps: - name: Clean workspace @@ -2127,7 +1921,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -2138,10 +1937,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: manywheel-py3_10-rocm5_1_1 + name: manywheel-py3_10-rocm5_2 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2170,9 +1969,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/manylinux-builder:rocm5.1.1 + docker-image: pytorch/manylinux-builder:rocm5.2 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -2183,28 +1982,28 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-rocm5_1_1-upload: # Uploading + manywheel-py3_10-rocm5_2-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-rocm5_1_1-test + needs: manywheel-py3_10-rocm5_2-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.1.1 - GPU_ARCH_VERSION: 5.1.1 + DESIRED_CUDA: rocm5.2 + GPU_ARCH_VERSION: 5.2 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.1.1 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 DESIRED_PYTHON: "3.10" - build_name: manywheel-py3_10-rocm5_1_1 + build_name: manywheel-py3_10-rocm5_2 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ 
secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_10-rocm5_2-build: + manywheel-py3_10-rocm5_3-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -2213,19 +2012,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 DESIRED_PYTHON: "3.10" - build_name: manywheel-py3_10-rocm5_2 + build_name: manywheel-py3_10-rocm5_3 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_10-rocm5_2-test: # Testing + manywheel-py3_10-rocm5_3-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-rocm5_2-build + needs: manywheel-py3_10-rocm5_3-build runs-on: linux.rocm.gpu timeout-minutes: 240 env: @@ -2234,11 +2033,11 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm SKIP_ALL_TESTS: 1 - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 DESIRED_PYTHON: "3.10" steps: - name: Clean workspace @@ -2267,7 +2066,12 @@ jobs: run: | ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') if [[ "x$ngpu" != "x2" && "x$ngpu" != "x4" ]]; then - echo "Failed to detect GPUs on the runner" + if [[ $ngpu -eq 0 ]]; then + echo "Error: Failed to detect any GPUs on the runner" + else + echo "Error: Detected $ngpu GPUs on the runner, when only 2 or 4 were expected" + fi + echo "Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified" exit 1 fi - name: Runner health check disconnect on failure @@ -2278,10 +2082,10 @@ jobs: run: | env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: manywheel-py3_10-rocm5_2 + name: manywheel-py3_10-rocm5_3 path: "${{ runner.temp }}/artifacts/" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2310,9 +2114,9 @@ jobs: run: | echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}" - name: Pull Docker image - uses: ./pytorch/.github/actions/pull-docker-image + uses: pytorch/test-infra/.github/actions/pull-docker-image@main with: - docker-image: pytorch/manylinux-builder:rocm5.2 + docker-image: pytorch/manylinux-builder:rocm5.3 - name: Test Pytorch binary uses: ./pytorch/.github/actions/test-pytorch-binary - name: Kill containers, clean up images @@ -2323,21 +2127,21 @@ jobs: docker stop $(docker ps -q) || true # Prune all of the docker images docker system prune -af - manywheel-py3_10-rocm5_2-upload: # Uploading + manywheel-py3_10-rocm5_3-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_10-rocm5_2-test + needs: manywheel-py3_10-rocm5_3-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: rocm5.2 - GPU_ARCH_VERSION: 5.2 + DESIRED_CUDA: rocm5.3 + GPU_ARCH_VERSION: 5.3 GPU_ARCH_TYPE: rocm - DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.2 + DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.3 DESIRED_PYTHON: "3.10" - build_name: manywheel-py3_10-rocm5_2 + build_name: manywheel-py3_10-rocm5_3 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} @@ -2401,67 +2205,7 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_11-cuda10_2-build: - if: ${{ github.repository_owner == 'pytorch' }} - uses: ./.github/workflows/_binary-build-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - DESIRED_PYTHON: "3.11" - build_name: manywheel-py3_11-cuda10_2 - build_environment: linux-binary-manywheel - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - - manywheel-py3_11-cuda10_2-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_11-cuda10_2-build - uses: ./.github/workflows/_binary-test-linux.yml - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - DESIRED_PYTHON: "3.11" - build_name: manywheel-py3_11-cuda10_2 - build_environment: linux-binary-manywheel - runs_on: 
linux.4xlarge.nvidia.gpu - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_11-cuda10_2-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_11-cuda10_2-test - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: manywheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu102 - GPU_ARCH_VERSION: 10.2 - GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda10.2 - DESIRED_PYTHON: "3.11" - build_name: manywheel-py3_11-cuda10_2 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_11-cuda11_3-build: + manywheel-py3_11-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -2470,19 +2214,19 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.11" - build_name: manywheel-py3_11-cuda11_3 + build_name: manywheel-py3_11-cuda11_6 build_environment: linux-binary-manywheel secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_11-cuda11_3-test: # Testing + manywheel-py3_11-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_11-cuda11_3-build + needs: manywheel-py3_11-cuda11_6-build uses: ./.github/workflows/_binary-test-linux.yml with: PYTORCH_ROOT: /pytorch @@ -2490,38 +2234,38 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.11" - build_name: manywheel-py3_11-cuda11_3 + build_name: manywheel-py3_11-cuda11_6 build_environment: linux-binary-manywheel runs_on: linux.4xlarge.nvidia.gpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_11-cuda11_3-upload: # Uploading + manywheel-py3_11-cuda11_6-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_11-cuda11_3-test + needs: manywheel-py3_11-cuda11_6-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.3 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 DESIRED_PYTHON: "3.11" - build_name: manywheel-py3_11-cuda11_3 + build_name: manywheel-py3_11-cuda11_6 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ 
secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - manywheel-py3_11-cuda11_6-build: + manywheel-py3_11-cuda11_7-with-pypi-cudnn-build: if: ${{ github.repository_owner == 'pytorch' }} uses: ./.github/workflows/_binary-build-linux.yml with: @@ -2530,19 +2274,20 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.11" - build_name: manywheel-py3_11-cuda11_6 + build_name: manywheel-py3_11-cuda11_7-with-pypi-cudnn build_environment: linux-binary-manywheel + PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-runtime-cu11; platform_system == 'Linux' | nvidia-cudnn-cu11==8.5.0.96; platform_system == 'Linux' | nvidia-cublas-cu11==11.10.3.66; platform_system == 'Linux' secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_11-cuda11_6-test: # Testing + manywheel-py3_11-cuda11_7-with-pypi-cudnn-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_11-cuda11_6-build + needs: manywheel-py3_11-cuda11_7-with-pypi-cudnn-build uses: ./.github/workflows/_binary-test-linux.yml with: PYTORCH_ROOT: /pytorch @@ -2550,31 +2295,31 @@ jobs: PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.11" - build_name: manywheel-py3_11-cuda11_6 + build_name: manywheel-py3_11-cuda11_7-with-pypi-cudnn build_environment: linux-binary-manywheel runs_on: linux.4xlarge.nvidia.gpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} - manywheel-py3_11-cuda11_6-upload: # Uploading + manywheel-py3_11-cuda11_7-with-pypi-cudnn-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: manywheel-py3_11-cuda11_6-test + needs: manywheel-py3_11-cuda11_7-with-pypi-cudnn-test with: PYTORCH_ROOT: /pytorch BUILDER_ROOT: /builder PACKAGE_TYPE: manywheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda - DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.6 + DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.7 DESIRED_PYTHON: "3.11" - build_name: manywheel-py3_11-cuda11_6 + build_name: manywheel-py3_11-cuda11_7-with-pypi-cudnn secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} diff --git a/.github/workflows/generated-macos-arm64-binary-conda-nightly.yml b/.github/workflows/generated-macos-arm64-binary-conda-nightly.yml index a6210cf4ad67..c88b107a90a9 100644 --- a/.github/workflows/generated-macos-arm64-binary-conda-nightly.yml +++ b/.github/workflows/generated-macos-arm64-binary-conda-nightly.yml @@ -67,10 +67,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors 
-o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -95,11 +96,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -110,7 +116,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: conda-py3_8-cpu @@ -171,10 +177,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -199,11 +206,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -214,7 +226,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: 
conda-py3_9-cpu @@ -275,10 +287,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -303,11 +316,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -318,7 +336,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: conda-py3_10-cpu diff --git a/.github/workflows/generated-macos-arm64-binary-wheel-nightly.yml b/.github/workflows/generated-macos-arm64-binary-wheel-nightly.yml index 61217b639ad5..c8858fd0501b 100644 --- a/.github/workflows/generated-macos-arm64-binary-wheel-nightly.yml +++ b/.github/workflows/generated-macos-arm64-binary-wheel-nightly.yml @@ -34,110 +34,6 @@ concurrency: cancel-in-progress: true jobs: - wheel-py3_7-cpu-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: macos-12-xl - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.7" - # For sccache access (only on non-forked PRs) - AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} - steps: - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. 
- - name: Populate binary env - shell: bash - run: | - # shellcheck disable=SC2129 - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - # shellcheck disable=SC2129 - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - # shellcheck disable=SC2129 - echo "MAC_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}" - - name: Install conda and dependencies - run: | - # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh - chmod +x "${RUNNER_TEMP}/conda.sh" - /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" - echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Install sccache (only for non-forked PRs, and pushes to trunk) - if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - - name: Populate binary env - run: | - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - run: | - # shellcheck disable=SC1091 - source "${RUNNER_TEMP}/anaconda/bin/activate" - "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 - if: always() - with: - name: wheel-py3_7-cpu - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - wheel-py3_7-cpu-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_7-cpu-build - with: - PYTORCH_ROOT: /pytorch - BUILDER_ROOT: /builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - DOCKER_IMAGE: pytorch/manylinux-builder:cpu - DESIRED_PYTHON: "3.7" - build_name: wheel-py3_7-cpu - use_s3: False - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml wheel-py3_8-cpu-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: macos-12-xl @@ -171,10 +67,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o 
"${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -199,11 +96,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -214,7 +116,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: wheel-py3_8-cpu @@ -275,10 +177,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -303,11 +206,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -318,7 +226,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" 
"${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: wheel-py3_9-cpu @@ -379,10 +287,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -407,11 +316,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -422,7 +336,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: wheel-py3_10-cpu diff --git a/.github/workflows/generated-macos-binary-conda-nightly.yml b/.github/workflows/generated-macos-binary-conda-nightly.yml index 174650de54d7..52cfb3d98f76 100644 --- a/.github/workflows/generated-macos-binary-conda-nightly.yml +++ b/.github/workflows/generated-macos-binary-conda-nightly.yml @@ -65,10 +65,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -93,11 +94,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 
https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -108,7 +114,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: conda-py3_7-cpu @@ -169,10 +175,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -197,11 +204,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -212,7 +224,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: conda-py3_8-cpu @@ -273,10 +285,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: 
zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -301,11 +314,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -316,7 +334,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: conda-py3_9-cpu @@ -377,10 +395,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -405,11 +424,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -420,7 +444,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: conda-py3_10-cpu diff --git a/.github/workflows/generated-macos-binary-libtorch-cxx11-abi-nightly.yml b/.github/workflows/generated-macos-binary-libtorch-cxx11-abi-nightly.yml index a6e4119c387e..cd9ad45ba561 100644 --- a/.github/workflows/generated-macos-binary-libtorch-cxx11-abi-nightly.yml +++ b/.github/workflows/generated-macos-binary-libtorch-cxx11-abi-nightly.yml @@ -34,9 
+34,8 @@ concurrency: jobs: libtorch-cpu-shared-with-deps-cxx11-abi-build: if: ${{ github.repository_owner == 'pytorch' }} - runs-on: macos-10.15 - # libtorch builds take a long time on github hosted runners - timeout-minutes: 720 + runs-on: macos-12-xl + timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder @@ -70,10 +69,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -98,11 +98,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -113,7 +118,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-shared-with-deps-cxx11-abi @@ -144,9 +149,8 @@ jobs: uses: ./.github/workflows/_binary-upload.yml libtorch-cpu-shared-without-deps-cxx11-abi-build: if: ${{ github.repository_owner == 'pytorch' }} - runs-on: macos-10.15 - # libtorch builds take a long time on github hosted runners - timeout-minutes: 720 + runs-on: macos-12-xl + timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder @@ -180,10 +184,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -208,11 
+213,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -223,7 +233,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-shared-without-deps-cxx11-abi @@ -254,9 +264,8 @@ jobs: uses: ./.github/workflows/_binary-upload.yml libtorch-cpu-static-with-deps-cxx11-abi-build: if: ${{ github.repository_owner == 'pytorch' }} - runs-on: macos-10.15 - # libtorch builds take a long time on github hosted runners - timeout-minutes: 720 + runs-on: macos-12-xl + timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder @@ -290,10 +299,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -318,11 +328,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -333,7 +348,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: 
libtorch-cpu-static-with-deps-cxx11-abi @@ -364,9 +379,8 @@ jobs: uses: ./.github/workflows/_binary-upload.yml libtorch-cpu-static-without-deps-cxx11-abi-build: if: ${{ github.repository_owner == 'pytorch' }} - runs-on: macos-10.15 - # libtorch builds take a long time on github hosted runners - timeout-minutes: 720 + runs-on: macos-12-xl + timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder @@ -400,10 +414,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -428,11 +443,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -443,7 +463,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-static-without-deps-cxx11-abi diff --git a/.github/workflows/generated-macos-binary-libtorch-pre-cxx11-nightly.yml b/.github/workflows/generated-macos-binary-libtorch-pre-cxx11-nightly.yml index 7853e8009393..4ce5c6f32c36 100644 --- a/.github/workflows/generated-macos-binary-libtorch-pre-cxx11-nightly.yml +++ b/.github/workflows/generated-macos-binary-libtorch-pre-cxx11-nightly.yml @@ -34,9 +34,8 @@ concurrency: jobs: libtorch-cpu-shared-with-deps-pre-cxx11-build: if: ${{ github.repository_owner == 'pytorch' }} - runs-on: macos-10.15 - # libtorch builds take a long time on github hosted runners - timeout-minutes: 720 + runs-on: macos-12-xl + timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder @@ -70,10 +69,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" 
https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -98,11 +98,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -113,7 +118,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-shared-with-deps-pre-cxx11 @@ -144,9 +149,8 @@ jobs: uses: ./.github/workflows/_binary-upload.yml libtorch-cpu-shared-without-deps-pre-cxx11-build: if: ${{ github.repository_owner == 'pytorch' }} - runs-on: macos-10.15 - # libtorch builds take a long time on github hosted runners - timeout-minutes: 720 + runs-on: macos-12-xl + timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder @@ -180,10 +184,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -208,11 +213,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x 
/usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -223,7 +233,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-shared-without-deps-pre-cxx11 @@ -254,9 +264,8 @@ jobs: uses: ./.github/workflows/_binary-upload.yml libtorch-cpu-static-with-deps-pre-cxx11-build: if: ${{ github.repository_owner == 'pytorch' }} - runs-on: macos-10.15 - # libtorch builds take a long time on github hosted runners - timeout-minutes: 720 + runs-on: macos-12-xl + timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder @@ -290,10 +299,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -318,11 +328,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -333,7 +348,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-static-with-deps-pre-cxx11 @@ -364,9 +379,8 @@ jobs: uses: ./.github/workflows/_binary-upload.yml libtorch-cpu-static-without-deps-pre-cxx11-build: if: ${{ github.repository_owner == 'pytorch' }} - runs-on: macos-10.15 - # libtorch builds take a long time on github hosted runners - timeout-minutes: 720 + runs-on: macos-12-xl + timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder @@ -400,10 +414,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" 
https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -428,11 +443,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -443,7 +463,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-static-without-deps-pre-cxx11 diff --git a/.github/workflows/generated-macos-binary-wheel-nightly.yml b/.github/workflows/generated-macos-binary-wheel-nightly.yml index 47442efef269..a3839d6e8a14 100644 --- a/.github/workflows/generated-macos-binary-wheel-nightly.yml +++ b/.github/workflows/generated-macos-binary-wheel-nightly.yml @@ -65,10 +65,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -93,11 +94,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + 
sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -108,7 +114,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: wheel-py3_7-cpu @@ -169,10 +175,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -197,11 +204,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -212,7 +224,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: wheel-py3_8-cpu @@ -273,10 +285,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -301,11 +314,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output 
/usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -316,7 +334,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: wheel-py3_9-cpu @@ -377,10 +395,11 @@ jobs: - name: Install conda and dependencies run: | # Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on - curl --retry 3 -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh + curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh chmod +x "${RUNNER_TEMP}/conda.sh" /bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda" echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}" + echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 with: @@ -405,11 +424,16 @@ jobs: git clean -fxd working-directory: builder - name: Install sccache (only for non-forked PRs, and pushes to trunk) + uses: nick-fields/retry@v2.8.2 if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }} - run: | - sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache - sudo chmod +x /usr/local/bin/sccache - echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" + with: + timeout_minutes: 5 + max_attempts: 3 + retry_wait_seconds: 90 + command: | + sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache + sudo chmod +x /usr/local/bin/sccache + echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}" - name: Populate binary env run: | # shellcheck disable=SC1091 @@ -420,7 +444,7 @@ jobs: # shellcheck disable=SC1091 source "${RUNNER_TEMP}/anaconda/bin/activate" "${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh" - - uses: actions/upload-artifact@v2 + - uses: actions/upload-artifact@v3 if: always() with: name: wheel-py3_10-cpu diff --git a/.github/workflows/generated-windows-binary-conda-nightly.yml b/.github/workflows/generated-windows-binary-conda-nightly.yml index b4633b15c661..df7cc13d8a26 100644 --- a/.github/workflows/generated-windows-binary-conda-nightly.yml +++ b/.github/workflows/generated-windows-binary-conda-nightly.yml @@ -115,7 +115,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: conda-py3_7-cpu @@ -188,7 +188,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: 
actions/download-artifact@v3 name: Download Build Artifacts with: name: conda-py3_7-cpu @@ -256,7 +256,7 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_7-cuda11_3-build: + conda-py3_7-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -266,8 +266,8 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" @@ -340,10 +340,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: conda-py3_7-cuda11_3 + name: conda-py3_7-cuda11_6 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -360,9 +360,9 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_7-cuda11_3-test: # Testing + conda-py3_7-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_7-cuda11_3-build + needs: conda-py3_7-cuda11_6-build runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: @@ -371,8 +371,8 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" @@ -414,10 +414,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: conda-py3_7-cuda11_3 + name: conda-py3_7-cuda11_6 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -463,27 +463,27 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_7-cuda11_3-upload: # Uploading + conda-py3_7-cuda11_6-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_7-cuda11_3-test + needs: conda-py3_7-cuda11_6-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda DESIRED_PYTHON: "3.7" - build_name: conda-py3_7-cuda11_3 + build_name: conda-py3_7-cuda11_6 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_7-cuda11_6-build: + conda-py3_7-cuda11_7-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -493,8 +493,8 @@ jobs: PACKAGE_TYPE: conda # 
TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" @@ -567,10 +567,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: conda-py3_7-cuda11_6 + name: conda-py3_7-cuda11_7 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -587,9 +587,9 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_7-cuda11_6-test: # Testing + conda-py3_7-cuda11_7-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_7-cuda11_6-build + needs: conda-py3_7-cuda11_7-build runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: @@ -598,8 +598,8 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" @@ -641,10 +641,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: conda-py3_7-cuda11_6 + name: conda-py3_7-cuda11_7 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -690,27 +690,27 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_7-cuda11_6-upload: # Uploading + conda-py3_7-cuda11_7-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_7-cuda11_6-test + needs: conda-py3_7-cuda11_7-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda DESIRED_PYTHON: "3.7" - build_name: conda-py3_7-cuda11_6 + build_name: conda-py3_7-cuda11_7 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_7-cuda11_7-build: + conda-py3_8-cpu-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -720,11 +720,10 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.7" + DESIRED_PYTHON: "3.8" steps: - name: Display EC2 information shell: bash @@ -794,10 +793,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: 
always() with: - name: conda-py3_7-cuda11_7 + name: conda-py3_8-cpu retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -814,10 +813,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_7-cuda11_7-test: # Testing + conda-py3_8-cpu-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_7-cuda11_7-build - runs-on: windows.8xlarge.nvidia.gpu + needs: conda-py3_8-cpu-build + runs-on: windows.4xlarge timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -825,11 +824,10 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.7" + DESIRED_PYTHON: "3.8" steps: - name: Display EC2 information shell: bash @@ -868,10 +866,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: conda-py3_7-cuda11_7 + name: conda-py3_8-cpu path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -917,27 +915,26 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_7-cuda11_7-upload: # Uploading + conda-py3_8-cpu-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_7-cuda11_7-test + needs: conda-py3_8-cpu-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 - GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: "3.7" - build_name: conda-py3_7-cuda11_7 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DESIRED_PYTHON: "3.8" + build_name: conda-py3_8-cpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_8-cpu-build: + conda-py3_8-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -947,8 +944,9 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: @@ -1020,10 +1018,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: conda-py3_8-cpu + name: conda-py3_8-cuda11_6 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1040,10 +1038,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_8-cpu-test: # Testing + conda-py3_8-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - 
needs: conda-py3_8-cpu-build - runs-on: windows.4xlarge + needs: conda-py3_8-cuda11_6-build + runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -1051,8 +1049,9 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: @@ -1093,10 +1092,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: conda-py3_8-cpu + name: conda-py3_8-cuda11_6 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1142,26 +1141,27 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_8-cpu-upload: # Uploading + conda-py3_8-cuda11_6-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cpu-test + needs: conda-py3_8-cuda11_6-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda DESIRED_PYTHON: "3.8" - build_name: conda-py3_8-cpu + build_name: conda-py3_8-cuda11_6 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_8-cuda11_3-build: + conda-py3_8-cuda11_7-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -1171,8 +1171,8 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" @@ -1245,10 +1245,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: conda-py3_8-cuda11_3 + name: conda-py3_8-cuda11_7 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1265,9 +1265,9 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_8-cuda11_3-test: # Testing + conda-py3_8-cuda11_7-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda11_3-build + needs: conda-py3_8-cuda11_7-build runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: @@ -1276,8 +1276,8 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" @@ -1319,10 +1319,10 
@@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: conda-py3_8-cuda11_3 + name: conda-py3_8-cuda11_7 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1368,27 +1368,27 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_8-cuda11_3-upload: # Uploading + conda-py3_8-cuda11_7-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda11_3-test + needs: conda-py3_8-cuda11_7-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda DESIRED_PYTHON: "3.8" - build_name: conda-py3_8-cuda11_3 + build_name: conda-py3_8-cuda11_7 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_8-cuda11_6-build: + conda-py3_9-cpu-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -1398,11 +1398,10 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.9" steps: - name: Display EC2 information shell: bash @@ -1472,10 +1471,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: conda-py3_8-cuda11_6 + name: conda-py3_9-cpu retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1492,10 +1491,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_8-cuda11_6-test: # Testing + conda-py3_9-cpu-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda11_6-build - runs-on: windows.8xlarge.nvidia.gpu + needs: conda-py3_9-cpu-build + runs-on: windows.4xlarge timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -1503,11 +1502,10 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.9" steps: - name: Display EC2 information shell: bash @@ -1546,10 +1544,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build 
Artifacts with: - name: conda-py3_8-cuda11_6 + name: conda-py3_9-cpu path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1595,27 +1593,26 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_8-cuda11_6-upload: # Uploading + conda-py3_9-cpu-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda11_6-test + needs: conda-py3_9-cpu-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 - GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: "3.8" - build_name: conda-py3_8-cuda11_6 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DESIRED_PYTHON: "3.9" + build_name: conda-py3_9-cpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_8-cuda11_7-build: + conda-py3_9-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -1625,11 +1622,11 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.9" steps: - name: Display EC2 information shell: bash @@ -1699,10 +1696,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: conda-py3_8-cuda11_7 + name: conda-py3_9-cuda11_6 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1719,9 +1716,9 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_8-cuda11_7-test: # Testing + conda-py3_9-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda11_7-build + needs: conda-py3_9-cuda11_6-build runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: @@ -1730,11 +1727,11 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.9" steps: - name: Display EC2 information shell: bash @@ -1773,10 +1770,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: conda-py3_8-cuda11_7 + name: conda-py3_9-cuda11_6 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1822,27 +1819,27 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_8-cuda11_7-upload: # 
Uploading + conda-py3_9-cuda11_6-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_8-cuda11_7-test + needs: conda-py3_9-cuda11_6-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: "3.8" - build_name: conda-py3_8-cuda11_7 + DESIRED_PYTHON: "3.9" + build_name: conda-py3_9-cuda11_6 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_9-cpu-build: + conda-py3_9-cuda11_7-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -1852,8 +1849,9 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: @@ -1925,10 +1923,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: conda-py3_9-cpu + name: conda-py3_9-cuda11_7 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1945,10 +1943,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_9-cpu-test: # Testing + conda-py3_9-cuda11_7-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cpu-build - runs-on: windows.4xlarge + needs: conda-py3_9-cuda11_7-build + runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -1956,8 +1954,9 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: @@ -1998,10 +1997,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: conda-py3_9-cpu + name: conda-py3_9-cuda11_7 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2047,26 +2046,27 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_9-cpu-upload: # Uploading + conda-py3_9-cuda11_7-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cpu-test + needs: conda-py3_9-cuda11_7-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu 
+ DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 + GPU_ARCH_TYPE: cuda DESIRED_PYTHON: "3.9" - build_name: conda-py3_9-cpu + build_name: conda-py3_9-cuda11_7 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_9-cuda11_3-build: + conda-py3_10-cpu-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -2076,11 +2076,10 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.10" steps: - name: Display EC2 information shell: bash @@ -2150,10 +2149,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: conda-py3_9-cuda11_3 + name: conda-py3_10-cpu retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -2170,10 +2169,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_9-cuda11_3-test: # Testing + conda-py3_10-cpu-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda11_3-build - runs-on: windows.8xlarge.nvidia.gpu + needs: conda-py3_10-cpu-build + runs-on: windows.4xlarge timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -2181,11 +2180,10 @@ jobs: PACKAGE_TYPE: conda # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.10" steps: - name: Display EC2 information shell: bash @@ -2224,10 +2222,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: conda-py3_9-cuda11_3 + name: conda-py3_10-cpu path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2273,688 +2271,9 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_9-cuda11_3-upload: # Uploading + conda-py3_10-cpu-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda11_3-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: "3.9" - build_name: conda-py3_9-cuda11_3 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: 
${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - conda-py3_9-cuda11_6-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. 
- - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: conda-py3_9-cuda11_6 - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_9-cuda11_6-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda11_6-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
- shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: conda-py3_9-cuda11_6 - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_9-cuda11_6-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda11_6-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 - GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: "3.9" - build_name: conda-py3_9-cuda11_6 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - conda-py3_9-cuda11_7-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata 
endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: conda-py3_9-cuda11_7 - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_9-cuda11_7-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda11_7-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor 
of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: conda-py3_9-cuda11_7 - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_9-cuda11_7-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_9-cuda11_7-test - with: - 
PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 - GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: "3.9" - build_name: conda-py3_9-cuda11_7 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - conda-py3_10-cpu-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.10" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. 
- - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: conda-py3_10-cpu - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_10-cpu-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cpu-build - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.10" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
- shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: conda-py3_10-cpu - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_10-cpu-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cpu-test + needs: conda-py3_10-cpu-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder @@ -2971,233 +2290,6 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - conda-py3_10-cuda11_3-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.10" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata 
instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: conda-py3_10-cuda11_3 - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_10-cuda11_3-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cuda11_3-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.10" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see 
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: conda-py3_10-cuda11_3 - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - conda-py3_10-cuda11_3-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: conda-py3_10-cuda11_3-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: conda - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: 
"3.10" - build_name: conda-py3_10-cuda11_3 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml conda-py3_10-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge @@ -3282,7 +2374,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: conda-py3_10-cuda11_6 @@ -3356,7 +2448,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: conda-py3_10-cuda11_6 @@ -3509,7 +2601,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: conda-py3_10-cuda11_7 @@ -3583,7 +2675,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: conda-py3_10-cuda11_7 diff --git a/.github/workflows/generated-windows-binary-libtorch-debug-master.yml b/.github/workflows/generated-windows-binary-libtorch-debug-master.yml index c34cb5250018..e52949eadf68 100644 --- a/.github/workflows/generated-windows-binary-libtorch-debug-master.yml +++ b/.github/workflows/generated-windows-binary-libtorch-debug-master.yml @@ -114,7 +114,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-shared-with-deps-debug @@ -191,7 +191,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-with-deps-debug diff --git a/.github/workflows/generated-windows-binary-libtorch-debug-nightly.yml b/.github/workflows/generated-windows-binary-libtorch-debug-nightly.yml index de660d9ef218..c0b5ddae71fa 100644 --- a/.github/workflows/generated-windows-binary-libtorch-debug-nightly.yml +++ b/.github/workflows/generated-windows-binary-libtorch-debug-nightly.yml @@ -119,7 +119,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-shared-with-deps-debug @@ -196,7 +196,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: 
libtorch-cpu-shared-with-deps-debug @@ -355,7 +355,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-shared-without-deps-debug @@ -432,7 +432,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-without-deps-debug @@ -591,7 +591,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-static-with-deps-debug @@ -668,7 +668,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-with-deps-debug @@ -827,7 +827,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-static-without-deps-debug @@ -904,7 +904,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-without-deps-debug @@ -976,962 +976,6 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-shared-with-deps-debug-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: debug - LIBTORCH_VARIANT: shared-with-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary 
builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: libtorch-cuda11_3-shared-with-deps-debug - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-shared-with-deps-debug-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-with-deps-debug-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: debug - LIBTORCH_VARIANT: shared-with-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint 
for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: libtorch-cuda11_3-shared-with-deps-debug - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-shared-with-deps-debug-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-with-deps-debug-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - 
DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - LIBTORCH_CONFIG: debug - LIBTORCH_VARIANT: shared-with-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - build_name: libtorch-cuda11_3-shared-with-deps-debug - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-shared-without-deps-debug-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: debug - LIBTORCH_VARIANT: shared-without-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. 
- - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: libtorch-cuda11_3-shared-without-deps-debug - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-shared-without-deps-debug-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-without-deps-debug-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: debug - LIBTORCH_VARIANT: shared-without-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name 
"LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: libtorch-cuda11_3-shared-without-deps-debug - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-shared-without-deps-debug-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-without-deps-debug-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - LIBTORCH_CONFIG: debug - LIBTORCH_VARIANT: shared-without-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - build_name: libtorch-cuda11_3-shared-without-deps-debug - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-static-with-deps-debug-build: - if: ${{ github.repository_owner == 'pytorch' }} - 
runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: debug - LIBTORCH_VARIANT: static-with-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. 
- - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: libtorch-cuda11_3-static-with-deps-debug - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-static-with-deps-debug-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-with-deps-debug-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: debug - LIBTORCH_VARIANT: static-with-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name 
"LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: libtorch-cuda11_3-static-with-deps-debug - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-static-with-deps-debug-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-with-deps-debug-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - LIBTORCH_CONFIG: debug - LIBTORCH_VARIANT: static-with-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - build_name: libtorch-cuda11_3-static-with-deps-debug - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-static-without-deps-debug-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: 
windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: debug - LIBTORCH_VARIANT: static-without-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. 
- - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: libtorch-cuda11_3-static-without-deps-debug - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-static-without-deps-debug-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-without-deps-debug-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: debug - LIBTORCH_VARIANT: static-without-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name 
"LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: libtorch-cuda11_3-static-without-deps-debug - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-static-without-deps-debug-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-without-deps-debug-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - LIBTORCH_CONFIG: debug - LIBTORCH_VARIANT: static-without-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - build_name: libtorch-cuda11_3-static-without-deps-debug - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml libtorch-cuda11_6-shared-with-deps-debug-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: 
windows.4xlarge
@@ -2020,7 +1064,7 @@ jobs:
        shell: bash
        run: |
          "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
-      - uses: seemethere/upload-artifact-s3@v5
+      - uses: actions/upload-artifact@v3
        if: always()
        with:
          name: libtorch-cuda11_6-shared-with-deps-debug
@@ -2098,7 +1142,7 @@ jobs:
          echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
          echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
          echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
-      - uses: seemethere/download-artifact-s3@v4
+      - uses: actions/download-artifact@v3
        name: Download Build Artifacts
        with:
          name: libtorch-cuda11_6-shared-with-deps-debug
@@ -2259,7 +1303,7 @@ jobs:
        shell: bash
        run: |
          "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
-      - uses: seemethere/upload-artifact-s3@v5
+      - uses: actions/upload-artifact@v3
        if: always()
        with:
          name: libtorch-cuda11_6-shared-without-deps-debug
@@ -2337,7 +1381,7 @@ jobs:
          echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
          echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
          echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
-      - uses: seemethere/download-artifact-s3@v4
+      - uses: actions/download-artifact@v3
        name: Download Build Artifacts
        with:
          name: libtorch-cuda11_6-shared-without-deps-debug
@@ -2498,7 +1542,7 @@ jobs:
        shell: bash
        run: |
          "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
-      - uses: seemethere/upload-artifact-s3@v5
+      - uses: actions/upload-artifact@v3
        if: always()
        with:
          name: libtorch-cuda11_6-static-with-deps-debug
@@ -2576,7 +1620,7 @@ jobs:
          echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
          echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
          echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
-      - uses: seemethere/download-artifact-s3@v4
+      - uses: actions/download-artifact@v3
        name: Download Build Artifacts
        with:
          name: libtorch-cuda11_6-static-with-deps-debug
@@ -2737,7 +1781,7 @@ jobs:
        shell: bash
        run: |
          "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
-      - uses: seemethere/upload-artifact-s3@v5
+      - uses: actions/upload-artifact@v3
        if: always()
        with:
          name: libtorch-cuda11_6-static-without-deps-debug
@@ -2815,7 +1859,7 @@ jobs:
          echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
          echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
          echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
-      - uses: seemethere/download-artifact-s3@v4
+      - uses: actions/download-artifact@v3
        name: Download Build Artifacts
        with:
          name: libtorch-cuda11_6-static-without-deps-debug
@@ -2976,7 +2020,7 @@ jobs:
        shell: bash
        run: |
          "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
-      - uses: seemethere/upload-artifact-s3@v5
+      - uses: actions/upload-artifact@v3
        if: always()
        with:
          name: libtorch-cuda11_7-shared-with-deps-debug
@@ -3054,7 +2098,7 @@ jobs:
          echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
          echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
          echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
-      - uses: seemethere/download-artifact-s3@v4
+      - uses: actions/download-artifact@v3
        name: Download Build Artifacts
        with:
          name: libtorch-cuda11_7-shared-with-deps-debug
@@ -3215,7 +2259,7 @@ jobs:
        shell: bash
        run: |
          "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
-      - uses: seemethere/upload-artifact-s3@v5
+      - uses: actions/upload-artifact@v3
        if: always()
        with:
          name: libtorch-cuda11_7-shared-without-deps-debug
@@ -3293,7 +2337,7 @@ jobs:
          echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
          echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
          echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
-      - uses: seemethere/download-artifact-s3@v4
+      - uses: actions/download-artifact@v3
        name: Download Build Artifacts
        with:
          name: libtorch-cuda11_7-shared-without-deps-debug
@@ -3454,7 +2498,7 @@ jobs:
        shell: bash
        run: |
          "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
-      - uses: seemethere/upload-artifact-s3@v5
+      - uses: actions/upload-artifact@v3
        if: always()
        with:
          name: libtorch-cuda11_7-static-with-deps-debug
@@ -3532,7 +2576,7 @@ jobs:
          echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
          echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
          echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
-      - uses: seemethere/download-artifact-s3@v4
+      - uses: actions/download-artifact@v3
        name: Download Build Artifacts
        with:
          name: libtorch-cuda11_7-static-with-deps-debug
@@ -3693,7 +2737,7 @@ jobs:
        shell: bash
        run: |
          "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
-      - uses: seemethere/upload-artifact-s3@v5
+      - uses: actions/upload-artifact@v3
        if: always()
        with:
          name: libtorch-cuda11_7-static-without-deps-debug
@@ -3771,7 +2815,7 @@ jobs:
          echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
          echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
          echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
-      - uses: seemethere/download-artifact-s3@v4
+      - uses: actions/download-artifact@v3
        name: Download Build Artifacts
        with:
          name: libtorch-cuda11_7-static-without-deps-debug
diff --git a/.github/workflows/generated-windows-binary-libtorch-release-master.yml b/.github/workflows/generated-windows-binary-libtorch-release-master.yml
index 7765834eeda7..ada48aa7768c 100644
--- a/.github/workflows/generated-windows-binary-libtorch-release-master.yml
+++ b/.github/workflows/generated-windows-binary-libtorch-release-master.yml
@@ -114,7 +114,7 @@ jobs:
        shell: bash
        run: |
          "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
-      - uses: seemethere/upload-artifact-s3@v5
+      - uses: actions/upload-artifact@v3
        if: always()
        with:
          name: libtorch-cpu-shared-with-deps-release
@@ -191,7 +191,7 @@ jobs:
          echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
          echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
          echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
-      - uses: seemethere/download-artifact-s3@v4
+      - uses: actions/download-artifact@v3
        name: Download Build Artifacts
        with:
          name: libtorch-cpu-shared-with-deps-release
diff --git a/.github/workflows/generated-windows-binary-libtorch-release-nightly.yml b/.github/workflows/generated-windows-binary-libtorch-release-nightly.yml
index e96ddc5fb635..f2f1d3badfe3 100644
--- a/.github/workflows/generated-windows-binary-libtorch-release-nightly.yml
+++ b/.github/workflows/generated-windows-binary-libtorch-release-nightly.yml
@@ -119,7 +119,7 @@ jobs:
        shell: bash
        run: |
          "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
-      - uses: seemethere/upload-artifact-s3@v5
+      - uses: actions/upload-artifact@v3
        if: always()
        with:
          name: libtorch-cpu-shared-with-deps-release
@@ -196,7 +196,7 @@ jobs:
          echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
          echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
          echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
-      - uses: seemethere/download-artifact-s3@v4
+      - uses: actions/download-artifact@v3
        name: Download Build Artifacts
        with:
          name: libtorch-cpu-shared-with-deps-release
@@ -355,7 +355,7 @@ jobs:
        shell: bash
        run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-shared-without-deps-release @@ -432,7 +432,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cpu-shared-without-deps-release @@ -591,7 +591,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-static-with-deps-release @@ -668,7 +668,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-with-deps-release @@ -827,7 +827,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cpu-static-without-deps-release @@ -904,7 +904,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cpu-static-without-deps-release @@ -976,962 +976,6 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-shared-with-deps-release-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: release - LIBTORCH_VARIANT: shared-with-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: 
https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: libtorch-cuda11_3-shared-with-deps-release - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-shared-with-deps-release-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-with-deps-release-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: release - LIBTORCH_VARIANT: shared-with-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 
- # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: libtorch-cuda11_3-shared-with-deps-release - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-shared-with-deps-release-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-with-deps-release-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: 
cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - LIBTORCH_CONFIG: release - LIBTORCH_VARIANT: shared-with-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - build_name: libtorch-cuda11_3-shared-with-deps-release - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-shared-without-deps-release-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: release - LIBTORCH_VARIANT: shared-without-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. 
- - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: libtorch-cuda11_3-shared-without-deps-release - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-shared-without-deps-release-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-without-deps-release-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: release - LIBTORCH_VARIANT: shared-without-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path 
"HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: libtorch-cuda11_3-shared-without-deps-release - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-shared-without-deps-release-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-shared-without-deps-release-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - LIBTORCH_CONFIG: release - LIBTORCH_VARIANT: shared-without-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - build_name: libtorch-cuda11_3-shared-without-deps-release - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - 
libtorch-cuda11_3-static-with-deps-release-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: release - LIBTORCH_VARIANT: static-with-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. 
- - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: libtorch-cuda11_3-static-with-deps-release - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-static-with-deps-release-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-with-deps-release-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: release - LIBTORCH_VARIANT: static-with-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name 
"LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: libtorch-cuda11_3-static-with-deps-release - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-static-with-deps-release-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-with-deps-release-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - LIBTORCH_CONFIG: release - LIBTORCH_VARIANT: static-with-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - build_name: libtorch-cuda11_3-static-with-deps-release - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - libtorch-cuda11_3-static-without-deps-release-build: - if: ${{ github.repository_owner == 'pytorch' }} - 
runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: release - LIBTORCH_VARIANT: static-without-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. 
- - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: libtorch-cuda11_3-static-without-deps-release - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-static-without-deps-release-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-without-deps-release-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - LIBTORCH_CONFIG: release - LIBTORCH_VARIANT: static-without-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path 
"HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: libtorch-cuda11_3-static-without-deps-release - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - libtorch-cuda11_3-static-without-deps-release-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: libtorch-cuda11_3-static-without-deps-release-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: libtorch - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - LIBTORCH_CONFIG: release - LIBTORCH_VARIANT: static-without-deps - # This is a dummy value for libtorch to work correctly with our batch scripts - # without this value pip does not get installed for some reason - DESIRED_PYTHON: "3.7" - build_name: libtorch-cuda11_3-static-without-deps-release - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml 
libtorch-cuda11_6-shared-with-deps-release-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge @@ -2020,7 +1064,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cuda11_6-shared-with-deps-release @@ -2098,7 +1142,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cuda11_6-shared-with-deps-release @@ -2259,7 +1303,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cuda11_6-shared-without-deps-release @@ -2337,7 +1381,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cuda11_6-shared-without-deps-release @@ -2498,7 +1542,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cuda11_6-static-with-deps-release @@ -2576,7 +1620,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cuda11_6-static-with-deps-release @@ -2737,7 +1781,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cuda11_6-static-without-deps-release @@ -2815,7 +1859,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cuda11_6-static-without-deps-release @@ -2976,7 +2020,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cuda11_7-shared-with-deps-release @@ -3054,7 +2098,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cuda11_7-shared-with-deps-release @@ -3215,7 +2259,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: 
libtorch-cuda11_7-shared-without-deps-release @@ -3293,7 +2337,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cuda11_7-shared-without-deps-release @@ -3454,7 +2498,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cuda11_7-static-with-deps-release @@ -3532,7 +2576,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cuda11_7-static-with-deps-release @@ -3693,7 +2737,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: libtorch-cuda11_7-static-without-deps-release @@ -3771,7 +2815,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: libtorch-cuda11_7-static-without-deps-release diff --git a/.github/workflows/generated-windows-binary-wheel-master.yml b/.github/workflows/generated-windows-binary-wheel-master.yml deleted file mode 100644 index 175507d4fd5a..000000000000 --- a/.github/workflows/generated-windows-binary-wheel-master.yml +++ /dev/null @@ -1,236 +0,0 @@ -# @generated DO NOT EDIT MANUALLY - -# Template is at: .github/templates/windows_binary_build_workflow.yml.j2 -# Generation script: .github/scripts/generate_ci_workflows.py -name: windows-binary-wheel - -on: - push: - branches: - - master - tags: - - 'ciflow/trunk/*' - workflow_dispatch: - -env: - # Needed for conda builds - ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine" - ANACONDA_USER: pytorch - AWS_DEFAULT_REGION: us-east-1 - BUILD_ENVIRONMENT: windows-binary-wheel - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - PR_NUMBER: ${{ github.event.pull_request.number }} - SHA1: ${{ github.event.pull_request.head.sha || github.sha }} - SKIP_ALL_TESTS: 1 -concurrency: - group: windows-binary-wheel-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true - -jobs: - wheel-py3_7-cuda11_3-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 
- # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: wheel-py3_7-cuda11_3 - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_7-cuda11_3-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_7-cuda11_3-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of 
GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.7" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: wheel-py3_7-cuda11_3 - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 diff --git a/.github/workflows/generated-windows-binary-wheel-nightly.yml b/.github/workflows/generated-windows-binary-wheel-nightly.yml index 
df5ce57fff06..026c81e6bb58 100644 --- a/.github/workflows/generated-windows-binary-wheel-nightly.yml +++ b/.github/workflows/generated-windows-binary-wheel-nightly.yml @@ -115,7 +115,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: wheel-py3_7-cpu @@ -188,7 +188,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: wheel-py3_7-cpu @@ -256,7 +256,7 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - wheel-py3_7-cuda11_3-build: + wheel-py3_7-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -266,8 +266,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" @@ -340,10 +340,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: wheel-py3_7-cuda11_3 + name: wheel-py3_7-cuda11_6 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -360,9 +360,9 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_7-cuda11_3-test: # Testing + wheel-py3_7-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_7-cuda11_3-build + needs: wheel-py3_7-cuda11_6-build runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: @@ -371,8 +371,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" @@ -414,10 +414,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: wheel-py3_7-cuda11_3 + name: wheel-py3_7-cuda11_6 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -463,27 +463,27 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_7-cuda11_3-upload: # Uploading + wheel-py3_7-cuda11_6-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_7-cuda11_3-test + needs: wheel-py3_7-cuda11_6-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu116 + 
GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda DESIRED_PYTHON: "3.7" - build_name: wheel-py3_7-cuda11_3 + build_name: wheel-py3_7-cuda11_6 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - wheel-py3_7-cuda11_6-build: + wheel-py3_7-cuda11_7-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -493,8 +493,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" @@ -567,10 +567,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: wheel-py3_7-cuda11_6 + name: wheel-py3_7-cuda11_7 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -587,9 +587,9 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_7-cuda11_6-test: # Testing + wheel-py3_7-cuda11_7-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_7-cuda11_6-build + needs: wheel-py3_7-cuda11_7-build runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: @@ -598,8 +598,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.7" @@ -641,10 +641,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: wheel-py3_7-cuda11_6 + name: wheel-py3_7-cuda11_7 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -690,27 +690,27 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_7-cuda11_6-upload: # Uploading + wheel-py3_7-cuda11_7-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_7-cuda11_6-test + needs: wheel-py3_7-cuda11_7-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda DESIRED_PYTHON: "3.7" - build_name: wheel-py3_7-cuda11_6 + build_name: wheel-py3_7-cuda11_7 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - wheel-py3_7-cuda11_7-build: + wheel-py3_8-cpu-build: if: ${{ 
github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -720,11 +720,10 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.7" + DESIRED_PYTHON: "3.8" steps: - name: Display EC2 information shell: bash @@ -794,10 +793,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: wheel-py3_7-cuda11_7 + name: wheel-py3_8-cpu retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -814,10 +813,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_7-cuda11_7-test: # Testing + wheel-py3_8-cpu-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_7-cuda11_7-build - runs-on: windows.8xlarge.nvidia.gpu + needs: wheel-py3_8-cpu-build + runs-on: windows.4xlarge timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -825,11 +824,10 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.7" + DESIRED_PYTHON: "3.8" steps: - name: Display EC2 information shell: bash @@ -868,10 +866,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: wheel-py3_7-cuda11_7 + name: wheel-py3_8-cpu path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -917,27 +915,26 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_7-cuda11_7-upload: # Uploading + wheel-py3_8-cpu-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_7-cuda11_7-test + needs: wheel-py3_8-cpu-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 - GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: "3.7" - build_name: wheel-py3_7-cuda11_7 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DESIRED_PYTHON: "3.8" + build_name: wheel-py3_8-cpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - wheel-py3_8-cpu-build: + wheel-py3_8-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -947,8 +944,9 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu116 + 
GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: @@ -1020,10 +1018,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: wheel-py3_8-cpu + name: wheel-py3_8-cuda11_6 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1040,10 +1038,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cpu-test: # Testing + wheel-py3_8-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cpu-build - runs-on: windows.4xlarge + needs: wheel-py3_8-cuda11_6-build + runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -1051,8 +1049,9 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" steps: @@ -1093,10 +1092,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: wheel-py3_8-cpu + name: wheel-py3_8-cuda11_6 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1142,26 +1141,27 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cpu-upload: # Uploading + wheel-py3_8-cuda11_6-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cpu-test + needs: wheel-py3_8-cuda11_6-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 + GPU_ARCH_TYPE: cuda DESIRED_PYTHON: "3.8" - build_name: wheel-py3_8-cpu + build_name: wheel-py3_8-cuda11_6 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - wheel-py3_8-cuda11_3-build: + wheel-py3_8-cuda11_7-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -1171,8 +1171,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" @@ -1245,10 +1245,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: wheel-py3_8-cuda11_3 + name: wheel-py3_8-cuda11_7 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1265,9 +1265,9 
@@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cuda11_3-test: # Testing + wheel-py3_8-cuda11_7-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cuda11_3-build + needs: wheel-py3_8-cuda11_7-build runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: @@ -1276,8 +1276,8 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.8" @@ -1319,10 +1319,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: wheel-py3_8-cuda11_3 + name: wheel-py3_8-cuda11_7 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1368,27 +1368,27 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cuda11_3-upload: # Uploading + wheel-py3_8-cuda11_7-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cuda11_3-test + needs: wheel-py3_8-cuda11_7-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 GPU_ARCH_TYPE: cuda DESIRED_PYTHON: "3.8" - build_name: wheel-py3_8-cuda11_3 + build_name: wheel-py3_8-cuda11_7 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - wheel-py3_8-cuda11_6-build: + wheel-py3_9-cpu-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -1398,11 +1398,10 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.9" steps: - name: Display EC2 information shell: bash @@ -1472,10 +1471,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: wheel-py3_8-cuda11_6 + name: wheel-py3_9-cpu retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1492,10 +1491,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cuda11_6-test: # Testing + wheel-py3_9-cpu-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cuda11_6-build - runs-on: windows.8xlarge.nvidia.gpu + needs: wheel-py3_9-cpu-build + runs-on: windows.4xlarge timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -1503,11 +1502,10 @@ jobs: 
PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.9" steps: - name: Display EC2 information shell: bash @@ -1546,10 +1544,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: wheel-py3_8-cuda11_6 + name: wheel-py3_9-cpu path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1595,27 +1593,26 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cuda11_6-upload: # Uploading + wheel-py3_9-cpu-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cuda11_6-test + needs: wheel-py3_9-cpu-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 - GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: "3.8" - build_name: wheel-py3_8-cuda11_6 + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu + DESIRED_PYTHON: "3.9" + build_name: wheel-py3_9-cpu secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - wheel-py3_8-cuda11_7-build: + wheel-py3_9-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -1625,11 +1622,11 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.9" steps: - name: Display EC2 information shell: bash @@ -1699,10 +1696,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: wheel-py3_8-cuda11_7 + name: wheel-py3_9-cuda11_6 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1719,9 +1716,9 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cuda11_7-test: # Testing + wheel-py3_9-cuda11_6-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cuda11_7-build + needs: wheel-py3_9-cuda11_6-build runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: @@ -1730,11 +1727,11 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.8" + DESIRED_PYTHON: "3.9" steps: - name: Display EC2 information shell: 
bash @@ -1773,10 +1770,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: wheel-py3_8-cuda11_7 + name: wheel-py3_9-cuda11_6 path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -1822,27 +1819,27 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_8-cuda11_7-upload: # Uploading + wheel-py3_9-cuda11_6-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_8-cuda11_7-test + needs: wheel-py3_9-cuda11_6-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 + DESIRED_CUDA: cu116 + GPU_ARCH_VERSION: 11.6 GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: "3.8" - build_name: wheel-py3_8-cuda11_7 + DESIRED_PYTHON: "3.9" + build_name: wheel-py3_9-cuda11_6 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - wheel-py3_9-cpu-build: + wheel-py3_9-cuda11_7-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -1852,8 +1849,9 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: @@ -1925,10 +1923,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: wheel-py3_9-cpu + name: wheel-py3_9-cuda11_7 retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -1945,10 +1943,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cpu-test: # Testing + wheel-py3_9-cuda11_7-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cpu-build - runs-on: windows.4xlarge + needs: wheel-py3_9-cuda11_7-build + runs-on: windows.8xlarge.nvidia.gpu timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -1956,8 +1954,9 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 + GPU_ARCH_TYPE: cuda SKIP_ALL_TESTS: 1 DESIRED_PYTHON: "3.9" steps: @@ -1998,10 +1997,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: wheel-py3_9-cpu + name: wheel-py3_9-cuda11_7 path: "${{ 
env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2047,26 +2046,27 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cpu-upload: # Uploading + wheel-py3_9-cuda11_7-upload: # Uploading if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cpu-test + needs: wheel-py3_9-cuda11_7-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu + DESIRED_CUDA: cu117 + GPU_ARCH_VERSION: 11.7 + GPU_ARCH_TYPE: cuda DESIRED_PYTHON: "3.9" - build_name: wheel-py3_9-cpu + build_name: wheel-py3_9-cuda11_7 secrets: github-token: ${{ secrets.GITHUB_TOKEN }} aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - wheel-py3_9-cuda11_3-build: + wheel-py3_10-cpu-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge timeout-minutes: 240 @@ -2076,11 +2076,10 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.10" steps: - name: Display EC2 information shell: bash @@ -2150,10 +2149,10 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: - name: wheel-py3_9-cuda11_3 + name: wheel-py3_10-cpu retention-days: 14 if-no-files-found: error path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" @@ -2170,10 +2169,10 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cuda11_3-test: # Testing + wheel-py3_10-cpu-test: # Testing if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cuda11_3-build - runs-on: windows.8xlarge.nvidia.gpu + needs: wheel-py3_10-cpu-build + runs-on: windows.4xlarge timeout-minutes: 240 env: PYTORCH_ROOT: ${{ github.workspace }}/pytorch @@ -2181,11 +2180,10 @@ jobs: PACKAGE_TYPE: wheel # TODO: This is a legacy variable that we eventually want to get rid of in # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda + DESIRED_CUDA: cpu + GPU_ARCH_TYPE: cpu SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" + DESIRED_PYTHON: "3.10" steps: - name: Display EC2 information shell: bash @@ -2224,10 +2222,10 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: - name: wheel-py3_9-cuda11_3 + name: wheel-py3_10-cpu path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - name: Checkout PyTorch uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 @@ -2273,688 +2271,9 @@ jobs: if: always() run: | .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cuda11_3-upload: # Uploading + wheel-py3_10-cpu-upload: # Uploading if: ${{ 
github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cuda11_3-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: "3.9" - build_name: wheel-py3_9-cuda11_3 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - wheel-py3_9-cuda11_6-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. 
- - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: wheel-py3_9-cuda11_6 - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cuda11_6-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cuda11_6-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
- shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: wheel-py3_9-cuda11_6 - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cuda11_6-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cuda11_6-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu116 - GPU_ARCH_VERSION: 11.6 - GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: "3.9" - build_name: wheel-py3_9-cuda11_6 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - wheel-py3_9-cuda11_7-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata 
endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: wheel-py3_9-cuda11_7 - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cuda11_7-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cuda11_7-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor 
of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.9" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: wheel-py3_9-cuda11_7 - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_9-cuda11_7-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_9-cuda11_7-test - with: - 
PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu117 - GPU_ARCH_VERSION: 11.7 - GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: "3.9" - build_name: wheel-py3_9-cuda11_7 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml - wheel-py3_10-cpu-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.10" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. 
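[Editor's aside] Each *-upload job in this generated file is a thin caller of the reusable _binary-upload.yml workflow. A trimmed sketch of that shape, taking the wheel-py3_9-cuda11_7 job from the hunks above (the generated file also passes PYTORCH_ROOT/BUILDER_ROOT and the AWS/conda upload secrets; this is only the skeleton):

  wheel-py3_9-cuda11_7-upload:  # Uploading
    if: ${{ github.repository_owner == 'pytorch' }}
    needs: wheel-py3_9-cuda11_7-test
    uses: ./.github/workflows/_binary-upload.yml
    with:
      PACKAGE_TYPE: wheel
      DESIRED_CUDA: cu117
      GPU_ARCH_VERSION: 11.7
      GPU_ARCH_TYPE: cuda
      DESIRED_PYTHON: "3.9"
      build_name: wheel-py3_9-cuda11_7
    secrets:
      github-token: ${{ secrets.GITHUB_TOKEN }}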
- - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: wheel-py3_10-cpu - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_10-cpu-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_10-cpu-build - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cpu - GPU_ARCH_TYPE: cpu - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.10" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. 
- shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: wheel-py3_10-cpu - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_10-cpu-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_10-cpu-test + needs: wheel-py3_10-cpu-test with: PYTORCH_ROOT: ${{ github.workspace }}/pytorch BUILDER_ROOT: ${{ github.workspace }}/builder @@ -2971,233 +2290,6 @@ jobs: aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} uses: ./.github/workflows/_binary-upload.yml - wheel-py3_10-cuda11_3-build: - if: ${{ github.repository_owner == 'pytorch' }} - runs-on: windows.4xlarge - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.10" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata 
instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Build PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 - if: always() - with: - name: wheel-py3_10-cuda11_3 - retention-days: 14 - if-no-files-found: error - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_10-cuda11_3-test: # Testing - if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_10-cuda11_3-build - runs-on: windows.8xlarge.nvidia.gpu - timeout-minutes: 240 - env: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - SKIP_ALL_TESTS: 1 - DESIRED_PYTHON: "3.10" - steps: - - name: Display EC2 information - shell: bash - run: | - set -euo pipefail - function get_ec2_metadata() { - # Pulled from instance metadata endpoint for EC2 - # see 
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html - category=$1 - curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" - } - echo "ami-id: $(get_ec2_metadata ami-id)" - echo "instance-id: $(get_ec2_metadata instance-id)" - echo "instance-type: $(get_ec2_metadata instance-type)" - echo "system info $(uname -a)" - - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)" - uses: seemethere/add-github-ssh-key@v1 - with: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - # Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560 - - name: Enable long paths on Windows - shell: powershell - run: | - Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 - # Since it's just a defensive command, the workflow should continue even the command fails - - name: Disables Windows Defender scheduled and real-time scanning for files in pytorch directory. - shell: powershell - run: | - Add-MpPreference -ExclusionPath $(Get-Location).tostring() -ErrorAction Ignore - # NOTE: These environment variables are put here so that they can be applied on every job equally - # They are also here because setting them at a workflow level doesn't give us access to the - # runner.temp variable, which we need. - - name: Populate binary env - shell: bash - run: | - echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" - echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" - echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 - name: Download Build Artifacts - with: - name: wheel-py3_10-cuda11_3 - path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}" - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - submodules: recursive - path: pytorch - - name: Clean PyTorch checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: pytorch - - name: Checkout pytorch/builder - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: main - submodules: recursive - repository: pytorch/builder - path: builder - - name: Clean pytorch/builder checkout - run: | - # Remove any artifacts from the previous checkouts - git clean -fxd - working-directory: builder - - name: Populate binary env - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh" - - name: Test PyTorch binary - shell: bash - run: | - "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh" - - name: Wait until all sessions have drained - shell: powershell - working-directory: pytorch - if: always() - timeout-minutes: 120 - run: | - .github\scripts\wait_for_ssh_to_drain.ps1 - - name: Kill active ssh sessions if still around (Useful if workflow was cancelled) - shell: powershell - working-directory: pytorch - if: always() - run: | - .github\scripts\kill_active_ssh_sessions.ps1 - wheel-py3_10-cuda11_3-upload: # Uploading - if: ${{ github.repository_owner == 'pytorch' }} - needs: wheel-py3_10-cuda11_3-test - with: - PYTORCH_ROOT: ${{ github.workspace }}/pytorch - BUILDER_ROOT: ${{ github.workspace }}/builder - PACKAGE_TYPE: wheel - # TODO: This is a legacy variable that we eventually want to get rid of in - # favor of GPU_ARCH_VERSION - DESIRED_CUDA: cu113 - GPU_ARCH_VERSION: 11.3 - GPU_ARCH_TYPE: cuda - DESIRED_PYTHON: 
"3.10" - build_name: wheel-py3_10-cuda11_3 - secrets: - github-token: ${{ secrets.GITHUB_TOKEN }} - aws-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }} - aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }} - conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }} - uses: ./.github/workflows/_binary-upload.yml wheel-py3_10-cuda11_6-build: if: ${{ github.repository_owner == 'pytorch' }} runs-on: windows.4xlarge @@ -3282,7 +2374,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: wheel-py3_10-cuda11_6 @@ -3356,7 +2448,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: wheel-py3_10-cuda11_6 @@ -3509,7 +2601,7 @@ jobs: shell: bash run: | "${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh" - - uses: seemethere/upload-artifact-s3@v5 + - uses: actions/upload-artifact@v3 if: always() with: name: wheel-py3_10-cuda11_7 @@ -3583,7 +2675,7 @@ jobs: echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}" echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}" echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}" - - uses: seemethere/download-artifact-s3@v4 + - uses: actions/download-artifact@v3 name: Download Build Artifacts with: name: wheel-py3_10-cuda11_7 diff --git a/.github/workflows/inductor.yml b/.github/workflows/inductor.yml new file mode 100644 index 000000000000..9179b186e918 --- /dev/null +++ b/.github/workflows/inductor.yml @@ -0,0 +1,41 @@ +name: inductor + +on: + schedule: + - cron: 45 1,5,9,13,17,21 * * * + push: + tags: + - ciflow/inductor/* + - ciflow/periodic/* + workflow_dispatch: + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true + +jobs: + linux-bionic-cuda11_6-py3_10-gcc7-inductor-build: + name: cuda11.6-py3.10-gcc7-sm86 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-cuda11.6-py3.10-gcc7-sm86 + docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 + cuda-arch-list: '8.6' + test-matrix: | + { include: [ + { config: "inductor", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" }, + { config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" }, + { config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, + { config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" }, + { config: "inductor_torchbench", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" }, + { config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" }, + ]} + + linux-bionic-cuda11_6-py3_10-gcc7-inductor-test: + name: cuda11.6-py3.10-gcc7-sm86 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_6-py3_10-gcc7-inductor-build + with: + build-environment: linux-bionic-cuda11.6-py3.10-gcc7-sm86 + docker-image: ${{ needs.linux-bionic-cuda11_6-py3_10-gcc7-inductor-build.outputs.docker-image }} + test-matrix: ${{ 
needs.linux-bionic-cuda11_6-py3_10-gcc7-inductor-build.outputs.test-matrix }} diff --git a/.github/workflows/labeler.yml b/.github/workflows/labeler.yml new file mode 100644 index 000000000000..bdef7a1367bf --- /dev/null +++ b/.github/workflows/labeler.yml @@ -0,0 +1,20 @@ +name: Labeler + +on: +- pull_request_target + +jobs: + triage: + permissions: + contents: read + pull-requests: write + runs-on: ubuntu-latest + steps: + - uses: actions/labeler@v4 + with: + repo-token: "${{ secrets.GITHUB_TOKEN }}" + sync-labels: '' + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + cancel-in-progress: true diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml index 763d284280f6..1f47e1defc2f 100644 --- a/.github/workflows/lint.yml +++ b/.github/workflows/lint.yml @@ -14,19 +14,25 @@ jobs: lintrunner: runs-on: linux.20_04.16x steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master with: submodules: false + fetch-depth: 1 + + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: '3.8' + architecture: x64 + check-latest: false + cache: pip + cache-dependency-path: | + **/.github/requirements-gha-cache.txt - - name: Install lintrunner - run: pip install lintrunner==0.9.* + - name: Install requirements + run: | + pip install -r .github/requirements-gha-cache.txt --user - name: Initialize lint dependencies run: lintrunner init @@ -64,11 +70,6 @@ jobs: name: quick-checks runs-on: linux.20_04.4x steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.x - architecture: x64 # [see note: pytorch repo ref] - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master @@ -79,9 +80,18 @@ jobs: run: | # Remove any artifacts from the previous checkouts git clean -fxd + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: '3.x' + architecture: x64 + check-latest: false + cache: pip + cache-dependency-path: | + **/requirements.txt - name: Install requirements id: requirements - run: pip3 install -r requirements.txt --user + run: pip install -r requirements.txt --user - name: Ensure no non-breaking spaces if: always() run: | @@ -111,7 +121,7 @@ jobs: name: pr-sanity-checks runs-on: linux.20_04.4x # Only run this on pull requests - if: github.event_name == 'pull_request' + if: github.event_name == 'pull_request' && !contains(github.event.pull_request.labels.*.name, 'skip-pr-sanity-checks') steps: - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master @@ -123,56 +133,35 @@ jobs: BASE: ${{ github.event.pull_request.base.sha }} HEAD: ${{ github.event.pull_request.head.sha }} run: | - set -x - - ancestor=$(git merge-base "${BASE}" "${HEAD}") - details=$(git diff --shortstat "$ancestor" "${HEAD}") - add=$(echo "$details" | grep -o '[0-9]* insertion' | grep -o '[0-9]*' || true) - remove=$(echo "$details" | grep -o '[0-9]* deletion' | grep -o '[0-9]*' || true) - - pr_size=0 - if [ "$add" ]; then - pr_size=$(("$pr_size" + "$add")) - fi - if [ "$remove" ]; then - pr_size=$(("$pr_size" + "$remove")) - fi - - if ((pr_size > 2000)); then - echo - echo 'Your PR is '"$pr_size"' LOC which is more than the 2000 maximum' - echo 'allowed within PyTorch infra. 
PLease make sure to split up' - echo 'your PR into smaller pieces that can be reviewed.' - echo 'If you think that this rule should not apply to your PR,' - echo 'please contact @albanD or @seemethere.' - echo - false - fi - - + bash .github/scripts/pr-sanity-check.sh workflow-checks: name: workflow-checks runs-on: linux.20_04.4x steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.x - architecture: x64 # [see note: pytorch repo ref] - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master with: submodules: false fetch-depth: 1 + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: '3.x' + architecture: x64 + check-latest: false + cache: pip + cache-dependency-path: | + **/requirements.txt + **/.github/requirements-gha-cache.txt - name: Install requirements id: requirements run: | - pip3 install -r requirements.txt --user + pip install -r requirements.txt --user - name: Install Jinja2 run: | - pip3 install Jinja2==3.0.1 --user + pip install Jinja2==3.0.1 --user - name: Regenerate workflows id: generate_workflows run: .github/scripts/generate_ci_workflows.py @@ -202,14 +191,15 @@ jobs: env: NPM_CONFIG_PREFIX: ~/.npm-global steps: - - name: Setup Node - uses: actions/setup-node@v2 # [see note: pytorch repo ref] - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master with: submodules: false fetch-depth: 1 + # This is not a node project so there is no package-lock.json to cache + - name: Setup Node + uses: actions/setup-node@v3 - name: Install markdown-toc run: npm install -g markdown-toc - name: Regenerate ToCs and check that they didn't change @@ -241,29 +231,36 @@ jobs: if: ${{ github.repository == 'pytorch/pytorch' }} runs-on: linux.20_04.4x steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 # [see note: pytorch repo ref] # deep clone (fetch-depth 0) required, to allow us to use git log - name: Checkout PyTorch uses: pytorch/pytorch/.github/actions/checkout-pytorch@master with: submodules: false + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: '3.8' + architecture: x64 + check-latest: false + cache: pip + cache-dependency-path: | + **/requirements.txt + **/requirements-flake8.txt + **/.circleci/docker/requirements-ci.txt + **/.github/requirements-gha-cache.txt - name: Install dependencies # mypy and boto3 versions copied from # .circleci/docker/common/install_conda.sh run: | set -eux - python3 -mpip install -r requirements.txt - python3 -mpip install boto3==1.16.34 - pip3 install typing-extensions==3.10 --user - pip3 install -r requirements-flake8.txt --user - python3 -mpip install rockset==0.8.10 --user - python3 -mpip install -r requirements.txt --user - python3 -mpip install mypy==0.960 --user + pip install -r requirements.txt + pip install boto3==1.19.12 + pip install typing-extensions==3.10 --user + pip install -r requirements-flake8.txt --user + pip install rockset==0.8.10 --user + pip install -r requirements.txt --user + pip install mypy==0.960 --user make setup_lint - name: Test tools run: | @@ -278,28 +275,37 @@ jobs: matrix: test_type: [with_torch, without_torch, older_python_version] steps: + # [see note: pytorch repo ref] + # deep clone (fetch-depth 0) required, to allow us to use git log + - name: Checkout PyTorch + uses: pytorch/pytorch/.github/actions/checkout-pytorch@master + with: + submodules: false + fetch-depth: 1 - name: Setup Python 3.5 if: 
matrix.test_type == 'older_python_version' - uses: actions/setup-python@v2 + uses: actions/setup-python@v4 with: - python-version: 3.5 + python-version: '3.5' architecture: x64 + check-latest: false + cache: pip + cache-dependency-path: | + **/requirements.txt - name: Setup Python 3.8 if: matrix.test_type != 'older_python_version' - uses: actions/setup-python@v2 + uses: actions/setup-python@v4 with: - python-version: 3.8 + python-version: '3.8' architecture: x64 - # [see note: pytorch repo ref] - # deep clone (fetch-depth 0) required, to allow us to use git log - - name: Checkout PyTorch - uses: pytorch/pytorch/.github/actions/checkout-pytorch@master - with: - submodules: false - fetch-depth: 1 + check-latest: false + cache: pip + cache-dependency-path: | + **/requirements.txt - name: Install torch if: matrix.test_type == 'with_torch' run: | + pip install -r requirements.txt # Doesn't really matter what torch version, we just need ANY torch installed pip install 'torch==1.*' - name: Run collect_env.py diff --git a/.github/workflows/mac-mps.yml b/.github/workflows/mac-mps.yml index 8fc2dd8336bf..5df7299cc507 100644 --- a/.github/workflows/mac-mps.yml +++ b/.github/workflows/mac-mps.yml @@ -22,6 +22,10 @@ jobs: build-generates-artifacts: true # To match the one pre-installed in the m1 runners python_version: 3.9.12 + # We need to set the environment file here instead of trying to detect it automatically because + # MacOS arm64 is cross-compiled from x86-64. Specifically, it means that arm64 conda environment + # is needed when building PyTorch MacOS arm64 from x86-64 + environment-file: .github/requirements/conda-env-macOS-ARM64 secrets: MACOS_SCCACHE_S3_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} MACOS_SCCACHE_S3_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly.yml index 133aa56865c7..a8de37ca85be 100644 --- a/.github/workflows/nightly.yml +++ b/.github/workflows/nightly.yml @@ -35,3 +35,12 @@ jobs: run-doxygen: true secrets: GH_PYTORCHBOT_TOKEN: ${{ secrets.GH_PYTORCHBOT_TOKEN }} + + update-vision-commit-hash: + uses: ./.github/workflows/_update-commit-hash.yml + with: + repo-name: vision + branch: main + secrets: + MERGEBOT_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} + PYTORCHBOT_TOKEN: ${{ secrets.GH_PYTORCHBOT_TOKEN }} diff --git a/.github/workflows/periodic.yml b/.github/workflows/periodic.yml index 7fbd04f8f161..9a188345899d 100644 --- a/.github/workflows/periodic.yml +++ b/.github/workflows/periodic.yml @@ -3,99 +3,106 @@ name: periodic on: schedule: - cron: 45 0,4,8,12,16,20 * * * + - cron: 29 8 * * * # about 1:29am PDT, for mem leak check and rerun disabled tests push: tags: - ciflow/periodic/* workflow_dispatch: concurrency: - group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }} cancel-in-progress: true jobs: - linux-xenial-cuda10_2-py3-gcc7-slow-gradcheck-build: - name: linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck + linux-bionic-cuda11_6-py3-gcc7-slow-gradcheck-build: + name: linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck uses: ./.github/workflows/_linux-build.yml with: - build-environment: 
linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck - docker-image-name: pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7 - - linux-xenial-cuda10_2-py3-gcc7-slow-gradcheck-test: - name: linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck - uses: ./.github/workflows/_linux-test.yml - needs: linux-xenial-cuda10_2-py3-gcc7-slow-gradcheck-build - with: - build-environment: linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck - docker-image: ${{ needs.linux-xenial-cuda10_2-py3-gcc7-slow-gradcheck-build.outputs.docker-image }} + build-environment: linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck + docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 test-matrix: | { include: [ { config: "default", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, { config: "default", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, ]} - linux-focal-rocm5_2-py3_7-slow-build: - name: linux-focal-rocm5.2-py3.7-slow - uses: ./.github/workflows/_linux-build.yml + linux-bionic-cuda11_6-py3-gcc7-slow-gradcheck-test: + name: linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_6-py3-gcc7-slow-gradcheck-build with: - build-environment: linux-focal-rocm5.2-py3.7 - docker-image-name: pytorch-linux-focal-rocm5.2-py3.7 + build-environment: linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck + docker-image: ${{ needs.linux-bionic-cuda11_6-py3-gcc7-slow-gradcheck-build.outputs.docker-image }} + test-matrix: ${{ needs.linux-bionic-cuda11_6-py3-gcc7-slow-gradcheck-build.outputs.test-matrix }} + timeout-minutes: 300 - linux-focal-rocm5_2-py3_7-slow-test: - name: linux-focal-rocm5.2-py3.7-slow - uses: ./.github/workflows/_rocm-test.yml - needs: linux-focal-rocm5_2-py3_7-slow-build + linux-focal-rocm5_2-py3_8-slow-build: + name: linux-focal-rocm5.2-py3.8-slow + uses: ./.github/workflows/_linux-build.yml with: - build-environment: linux-focal-rocm5.2-py3.7 - docker-image: ${{ needs.linux-focal-rocm5_2-py3_7-slow-build.outputs.docker-image }} + build-environment: linux-focal-rocm5.2-py3.8 + docker-image-name: pytorch-linux-focal-rocm5.2-py3.8 test-matrix: | { include: [ { config: "slow", shard: 1, num_shards: 1, runner: "linux.rocm.gpu" }, ]} + + linux-focal-rocm5_2-py3_8-slow-test: + name: linux-focal-rocm5.2-py3.8-slow + uses: ./.github/workflows/_rocm-test.yml + needs: linux-focal-rocm5_2-py3_8-slow-build + with: + build-environment: linux-focal-rocm5.2-py3.8 + docker-image: ${{ needs.linux-focal-rocm5_2-py3_8-slow-build.outputs.docker-image }} + test-matrix: ${{ needs.linux-focal-rocm5_2-py3_8-slow-build.outputs.test-matrix }} secrets: AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} - linux-focal-rocm5_2-py3_7-distributed-build: - name: linux-focal-rocm5.2-py3.7-distributed + linux-focal-rocm5_2-py3_8-distributed-build: + name: linux-focal-rocm5.2-py3.8-distributed uses: ./.github/workflows/_linux-build.yml with: - build-environment: linux-focal-rocm5.2-py3.7 - docker-image-name: pytorch-linux-focal-rocm5.2-py3.7 - - linux-focal-rocm5_2-py3_7-distributed-test: - name: linux-focal-rocm5.2-py3.7-distributed - uses: ./.github/workflows/_rocm-test.yml - needs: linux-focal-rocm5_2-py3_7-distributed-build - with: - build-environment: linux-focal-rocm5.2-py3.7 - docker-image: ${{ needs.linux-focal-rocm5_2-py3_7-distributed-build.outputs.docker-image }} + build-environment: linux-focal-rocm5.2-py3.8 + docker-image-name: 
pytorch-linux-focal-rocm5.2-py3.8 test-matrix: | { include: [ { config: "distributed", shard: 1, num_shards: 2, runner: "linux.rocm.gpu" }, { config: "distributed", shard: 2, num_shards: 2, runner: "linux.rocm.gpu" }, ]} + + linux-focal-rocm5_2-py3_8-distributed-test: + name: linux-focal-rocm5.2-py3.8-distributed + uses: ./.github/workflows/_rocm-test.yml + needs: linux-focal-rocm5_2-py3_8-distributed-build + with: + build-environment: linux-focal-rocm5.2-py3.8 + docker-image: ${{ needs.linux-focal-rocm5_2-py3_8-distributed-build.outputs.docker-image }} + test-matrix: ${{ needs.linux-focal-rocm5_2-py3_8-distributed-build.outputs.test-matrix }} secrets: AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} - linux-bionic-cuda10_2-py3_9-gcc7-build: - name: linux-bionic-cuda10.2-py3.9-gcc7 + linux-bionic-cuda11_6-py3_9-gcc7-build: + name: linux-bionic-cuda11.6-py3.9-gcc7 uses: ./.github/workflows/_linux-build.yml with: - build-environment: linux-bionic-cuda10.2-py3.9-gcc7 - docker-image-name: pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7 - - linux-bionic-cuda10_2-py3_9-gcc7-test: - name: linux-bionic-cuda10.2-py3.9-gcc7 - uses: ./.github/workflows/_linux-test.yml - needs: linux-bionic-cuda10_2-py3_9-gcc7-build - with: - build-environment: linux-bionic-cuda10.2-py3.9-gcc7 - docker-image: ${{ needs.linux-bionic-cuda10_2-py3_9-gcc7-build.outputs.docker-image }} + build-environment: linux-bionic-cuda11.6-py3.9-gcc7 + docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 test-matrix: | { include: [ { config: "multigpu", shard: 1, num_shards: 1, runner: "linux.16xlarge.nvidia.gpu" }, ]} + build-with-debug: false + + linux-bionic-cuda11_6-py3_9-gcc7-test: + name: linux-bionic-cuda11.6-py3.9-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_6-py3_9-gcc7-build + with: + build-environment: linux-bionic-cuda11.6-py3.9-gcc7 + docker-image: ${{ needs.linux-bionic-cuda11_6-py3_9-gcc7-build.outputs.docker-image }} + test-matrix: ${{ needs.linux-bionic-cuda11_6-py3_9-gcc7-build.outputs.test-matrix }} linux-bionic-cuda11_6-py3_7-gcc7-debug-build: name: linux-bionic-cuda11.6-py3.7-gcc7-debug @@ -104,6 +111,13 @@ jobs: build-environment: linux-bionic-cuda11.6-py3.7-gcc7-debug docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 build-with-debug: true + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 3, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + ]} linux-bionic-cuda11_6-py3_7-gcc7-debug-test: name: linux-bionic-cuda11.6-py3.7-gcc7-debug @@ -112,13 +126,7 @@ jobs: with: build-environment: linux-bionic-cuda11.6-py3.7-gcc7-debug docker-image: ${{ needs.linux-bionic-cuda11_6-py3_7-gcc7-debug-build.outputs.docker-image }} - test-matrix: | - { include: [ - { config: "default", shard: 1, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, - { config: "default", shard: 2, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, - { config: "default", shard: 3, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, - { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, - ]} + test-matrix: ${{ 
needs.linux-bionic-cuda11_6-py3_7-gcc7-debug-build.outputs.test-matrix }} linux-bionic-cuda11_7-py3_7-gcc7-debug-build: name: linux-bionic-cuda11.7-py3.7-gcc7-debug @@ -127,6 +135,13 @@ jobs: build-environment: linux-bionic-cuda11.7-py3.7-gcc7-debug docker-image-name: pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7 build-with-debug: true + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 3, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + ]} linux-bionic-cuda11_7-py3_7-gcc7-debug-test: name: linux-bionic-cuda11.7-py3.7-gcc7-debug @@ -135,13 +150,7 @@ jobs: with: build-environment: linux-bionic-cuda11.7-py3.7-gcc7-debug docker-image: ${{ needs.linux-bionic-cuda11_7-py3_7-gcc7-debug-build.outputs.docker-image }} - test-matrix: | - { include: [ - { config: "default", shard: 1, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, - { config: "default", shard: 2, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, - { config: "default", shard: 3, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, - { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, - ]} + test-matrix: ${{ needs.linux-bionic-cuda11_7-py3_7-gcc7-debug-build.outputs.test-matrix }} libtorch-linux-bionic-cuda11_7-py3_7-gcc7-build: name: libtorch-linux-bionic-cuda11.7-py3.7-gcc7 @@ -157,6 +166,13 @@ jobs: with: build-environment: win-vs2019-cuda11.7-py3 cuda-version: "11.7" + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 3, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 3, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "default", shard: 3, num_shards: 3, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "force_on_cpu", shard: 1, num_shards: 1, runner: "windows.4xlarge" }, + ]} win-vs2019-cuda11_7-py3-test: name: win-vs2019-cuda11.7-py3 @@ -165,12 +181,7 @@ jobs: with: build-environment: win-vs2019-cuda11.7-py3 cuda-version: "11.7" - test-matrix: | - { include: [ - { config: "default", shard: 1, num_shards: 2, runner: "windows.8xlarge.nvidia.gpu" }, - { config: "default", shard: 2, num_shards: 2, runner: "windows.8xlarge.nvidia.gpu" }, - { config: "force_on_cpu", shard: 1, num_shards: 1, runner: "windows.4xlarge" }, - ]} + test-matrix: ${{ needs.win-vs2019-cuda11_7-py3-build.outputs.test-matrix }} ios-12-5-1-x86-64-coreml: name: ios-12-5-1-x86-64-coreml @@ -179,11 +190,6 @@ jobs: build-environment: ios-12-5-1-x86-64-coreml ios-platform: SIMULATOR ios-arch: x86_64 - secrets: - IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} - IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET}} - IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID}} - IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} ios-12-5-1-arm64: name: ios-12-5-1-arm64 @@ -192,11 +198,6 @@ jobs: build-environment: ios-12-5-1-arm64 ios-platform: OS ios-arch: arm64 - secrets: - IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} - IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET}} - IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID}} - IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} ios-12-5-1-arm64-coreml: name: ios-12-5-1-arm64-coreml @@ -205,11 +206,6 @@ jobs: build-environment: ios-12-5-1-arm64-coreml ios-platform: OS ios-arch: arm64 - secrets: - IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} - 
IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET}} - IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID}} - IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} ios-12-5-1-arm64-custom-ops: name: ios-12-5-1-arm64-custom-ops @@ -218,11 +214,6 @@ jobs: build-environment: ios-12-5-1-arm64-custom-ops ios-platform: OS ios-arch: arm64 - secrets: - IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} - IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET}} - IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID}} - IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} ios-12-5-1-arm64-metal: name: ios-12-5-1-arm64-metal @@ -231,11 +222,6 @@ jobs: build-environment: ios-12-5-1-arm64-metal ios-platform: OS ios-arch: arm64 - secrets: - IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} - IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET}} - IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID}} - IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} buck-build-test: name: buck-build-test diff --git a/.github/workflows/pr-labels.yml b/.github/workflows/pr-labels.yml deleted file mode 100644 index 7313d0b8e968..000000000000 --- a/.github/workflows/pr-labels.yml +++ /dev/null @@ -1,32 +0,0 @@ -name: pr-labels - -on: - push: - branches: - - master - - main - -jobs: - is-properly-labeled: - runs-on: ubuntu-latest - - steps: - - name: Set up python - uses: actions/setup-python@v2 - - - name: Install requests - run: pip3 install requests==2.26 - - - name: Checkout repository - uses: actions/checkout@v2 - - - name: Process commit and find merger responsible for labeling - id: commit - env: - SHA1: ${{ github.sha }} - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - run: python .github/scripts/process_commit.py "${SHA1}" - -concurrency: - group: pr-labels-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true diff --git a/.github/workflows/pull.yml b/.github/workflows/pull.yml index 76656febf928..3642c7fc1769 100644 --- a/.github/workflows/pull.yml +++ b/.github/workflows/pull.yml @@ -9,9 +9,11 @@ on: - release/* - landchecks/* workflow_dispatch: + schedule: + - cron: 29 8 * * * # about 1:29am PDT concurrency: - group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }} cancel-in-progress: true jobs: @@ -21,25 +23,27 @@ jobs: with: build-environment: linux-focal-py3.7-gcc7 docker-image-name: pytorch-linux-focal-py3.7-gcc7 - - linux-focal-py3_7-gcc7-test: - name: linux-focal-py3.7-gcc7 - uses: ./.github/workflows/_linux-test.yml - needs: linux-focal-py3_7-gcc7-build - with: - build-environment: linux-focal-py3.7-gcc7 - docker-image: ${{ needs.linux-focal-py3_7-gcc7-build.outputs.docker-image }} test-matrix: | { include: [ { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, - { config: "distributed", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "distributed", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "distributed", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, { config: "functorch", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, { config: "docs_test", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, { config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, { config: 
"backwards_compat", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, ]} + linux-focal-py3_7-gcc7-test: + name: linux-focal-py3.7-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: linux-focal-py3_7-gcc7-build + with: + build-environment: linux-focal-py3.7-gcc7 + docker-image: ${{ needs.linux-focal-py3_7-gcc7-build.outputs.docker-image }} + test-matrix: ${{ needs.linux-focal-py3_7-gcc7-build.outputs.test-matrix }} + linux-docs: name: linux-docs uses: ./.github/workflows/_docs.yml @@ -68,6 +72,15 @@ jobs: with: build-environment: linux-focal-py3.7-clang7-asan docker-image-name: pytorch-linux-focal-py3-clang7-asan + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 5, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 5, runner: "linux.2xlarge" }, + { config: "default", shard: 3, num_shards: 5, runner: "linux.2xlarge" }, + { config: "default", shard: 4, num_shards: 5, runner: "linux.4xlarge" }, + { config: "default", shard: 5, num_shards: 5, runner: "linux.4xlarge" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} linux-focal-py3_7-clang7-asan-test: name: linux-focal-py3.7-clang7-asan @@ -76,14 +89,7 @@ jobs: with: build-environment: linux-focal-py3.7-clang7-asan docker-image: ${{ needs.linux-focal-py3_7-clang7-asan-build.outputs.docker-image }} - test-matrix: | - { include: [ - { config: "default", shard: 1, num_shards: 5, runner: "linux.2xlarge" }, - { config: "default", shard: 2, num_shards: 5, runner: "linux.2xlarge" }, - { config: "default", shard: 3, num_shards: 5, runner: "linux.2xlarge" }, - { config: "default", shard: 4, num_shards: 5, runner: "linux.2xlarge" }, - { config: "default", shard: 5, num_shards: 5, runner: "linux.2xlarge" }, - ]} + test-matrix: ${{ needs.linux-focal-py3_7-clang7-asan-build.outputs.test-matrix }} linux-focal-py3_7-clang10-onnx-build: name: linux-focal-py3.7-clang10-onnx @@ -91,6 +97,11 @@ jobs: with: build-environment: linux-focal-py3.7-clang10-onnx docker-image-name: pytorch-linux-focal-py3-clang10-onnx + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + ]} linux-focal-py3_7-clang10-onnx-test: name: linux-focal-py3.7-clang10-onnx @@ -99,11 +110,7 @@ jobs: with: build-environment: linux-focal-py3.7-clang10-onnx docker-image: ${{ needs.linux-focal-py3_7-clang10-onnx-build.outputs.docker-image }} - test-matrix: | - { include: [ - { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, - { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, - ]} + test-matrix: ${{ needs.linux-focal-py3_7-clang10-onnx-build.outputs.test-matrix }} linux-bionic-py3_7-clang9-build: name: linux-bionic-py3.7-clang9 @@ -111,14 +118,6 @@ jobs: with: build-environment: linux-bionic-py3.7-clang9 docker-image-name: pytorch-linux-bionic-py3.7-clang9 - - linux-bionic-py3_7-clang9-test: - name: linux-bionic-py3.7-clang9 - uses: ./.github/workflows/_linux-test.yml - needs: linux-bionic-py3_7-clang9-build - with: - build-environment: linux-bionic-py3.7-clang9 - docker-image: ${{ needs.linux-bionic-py3_7-clang9-build.outputs.docker-image }} test-matrix: | { include: [ { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, @@ -130,12 +129,14 @@ jobs: { config: "functorch", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, ]} - linux-bionic-cuda11_3-py3_7-clang9-build: - name: linux-bionic-cuda11.3-py3.7-clang9 - uses: 
./.github/workflows/_linux-build.yml + linux-bionic-py3_7-clang9-test: + name: linux-bionic-py3.7-clang9 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-py3_7-clang9-build with: - build-environment: linux-bionic-cuda11.3-py3.7-clang9 - docker-image-name: pytorch-linux-bionic-cuda11.3-cudnn8-py3-clang9 + build-environment: linux-bionic-py3.7-clang9 + docker-image: ${{ needs.linux-bionic-py3_7-clang9-build.outputs.docker-image }} + test-matrix: ${{ needs.linux-bionic-py3_7-clang9-build.outputs.test-matrix }} linux-vulkan-bionic-py3_7-clang9-build: name: linux-vulkan-bionic-py3.7-clang9 @@ -143,6 +144,10 @@ jobs: with: build-environment: linux-vulkan-bionic-py3.7-clang9 docker-image-name: pytorch-linux-bionic-py3.7-clang9 + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} linux-vulkan-bionic-py3_7-clang9-test: name: linux-vulkan-bionic-py3.7-clang9 @@ -151,10 +156,7 @@ jobs: with: build-environment: linux-vulkan-bionic-py3.7-clang9 docker-image: ${{ needs.linux-vulkan-bionic-py3_7-clang9-build.outputs.docker-image }} - test-matrix: | - { include: [ - { config: "default", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, - ]} + test-matrix: ${{ needs.linux-vulkan-bionic-py3_7-clang9-build.outputs.test-matrix }} linux-bionic-cuda11_6-py3_10-gcc7-build: name: linux-bionic-cuda11.6-py3.10-gcc7 @@ -162,31 +164,34 @@ jobs: with: build-environment: linux-bionic-cuda11.6-py3.10-gcc7 docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 - - linux-bionic-cuda11_6-py3_10-gcc7-test: - name: linux-bionic-cuda11.6-py3.10-gcc7 - uses: ./.github/workflows/_linux-test.yml - needs: linux-bionic-cuda11_6-py3_10-gcc7-build - with: - build-environment: linux-bionic-cuda11.6-py3.10-gcc7 - docker-image: ${{ needs.linux-bionic-cuda11_6-py3_10-gcc7-build.outputs.docker-image }} test-matrix: | { include: [ { config: "default", shard: 1, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, { config: "default", shard: 2, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, { config: "default", shard: 3, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, - { config: "distributed", shard: 1, num_shards: 2, runner: "linux.8xlarge.nvidia.gpu" }, - { config: "distributed", shard: 2, num_shards: 2, runner: "linux.8xlarge.nvidia.gpu" }, + { config: "distributed", shard: 1, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" }, + { config: "distributed", shard: 2, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" }, + { config: "distributed", shard: 3, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" }, { config: "functorch", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "deploy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, ]} - linux-xenial-py3-clang5-mobile-build: - name: linux-xenial-py3-clang5-mobile-build + linux-bionic-cuda11_6-py3_10-gcc7-test: + name: linux-bionic-cuda11.6-py3.10-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_6-py3_10-gcc7-build + with: + build-environment: linux-bionic-cuda11.6-py3.10-gcc7 + docker-image: ${{ needs.linux-bionic-cuda11_6-py3_10-gcc7-build.outputs.docker-image }} + test-matrix: ${{ needs.linux-bionic-cuda11_6-py3_10-gcc7-build.outputs.test-matrix }} + + linux-focal-py3-clang7-mobile-build: + name: linux-focal-py3-clang7-mobile-build uses: ./.github/workflows/_linux-build.yml with: - build-environment: 
linux-xenial-py3-clang5-mobile-build - docker-image-name: pytorch-linux-xenial-py3-clang5-asan + build-environment: linux-focal-py3-clang7-mobile-build + docker-image-name: pytorch-linux-focal-py3-clang7-asan build-generates-artifacts: false linux-jammy-cuda-11_6-cudnn8-py3_8-clang12-build: @@ -196,12 +201,12 @@ jobs: build-environment: linux-jammy-cuda11.6-cudnn8-py3.8-clang12 docker-image-name: pytorch-linux-jammy-cuda11.6-cudnn8-py3.8-clang12 - linux-xenial-py3-clang5-mobile-custom-build-static: - name: linux-xenial-py3-clang5-mobile-custom-build-static + linux-focal-py3-clang7-mobile-custom-build-static: + name: linux-focal-py3-clang7-mobile-custom-build-static uses: ./.github/workflows/_linux-build.yml with: - build-environment: linux-xenial-py3-clang5-mobile-custom-build-static - docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c + build-environment: linux-focal-py3-clang7-mobile-custom-build-static + docker-image-name: pytorch-linux-focal-py3-clang7-android-ndk-r19c build-generates-artifacts: false linux-bionic-py3_7-clang8-xla-build: @@ -210,6 +215,10 @@ jobs: with: build-environment: linux-bionic-py3_7-clang8-xla docker-image-name: xla_base + test-matrix: | + { include: [ + { config: "xla", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} linux-bionic-py3_7-clang8-xla-test: name: linux-bionic-py3_7-clang8-xla @@ -218,10 +227,7 @@ jobs: with: build-environment: linux-bionic-py3_7-clang8-xla docker-image: ${{ needs.linux-bionic-py3_7-clang8-xla-build.outputs.docker-image }} - test-matrix: | - { include: [ - { config: "xla", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, - ]} + test-matrix: ${{ needs.linux-bionic-py3_7-clang8-xla-build.outputs.test-matrix }} win-vs2019-cpu-py3-build: name: win-vs2019-cpu-py3 @@ -229,6 +235,12 @@ jobs: with: build-environment: win-vs2019-cpu-py3 cuda-version: cpu + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "windows.4xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "windows.4xlarge" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "windows.4xlarge" }, + ]} win-vs2019-cpu-py3-test: name: win-vs2019-cpu-py3 @@ -237,12 +249,7 @@ jobs: with: build-environment: win-vs2019-cpu-py3 cuda-version: cpu - test-matrix: | - { include: [ - { config: "default", shard: 1, num_shards: 2, runner: "windows.4xlarge" }, - { config: "default", shard: 2, num_shards: 2, runner: "windows.4xlarge" }, - { config: "functorch", shard: 1, num_shards: 1, runner: "windows.4xlarge" }, - ]} + test-matrix: ${{ needs.win-vs2019-cpu-py3-build.outputs.test-matrix }} win-vs2019-cuda11_6-py3-build: if: github.event_name == 'pull_request' @@ -252,27 +259,37 @@ jobs: build-environment: win-vs2019-cuda11.6-py3 cuda-version: "11.6" sync-tag: win-cuda-build + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 5, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 5, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "default", shard: 3, num_shards: 5, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "default", shard: 4, num_shards: 5, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "default", shard: 5, num_shards: 5, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "windows.8xlarge.nvidia.gpu" }, + { config: "force_on_cpu", shard: 1, num_shards: 1, runner: "windows.4xlarge" }, + ]} - linux-xenial-cuda11_3-py3_7-gcc7-bazel-test: - name: linux-xenial-cuda11.3-py3.7-gcc7-bazel-test + 
linux-bionic-cuda11_6-py3_10-gcc7-bazel-test: + name: linux-bionic-cuda11.6-py3.10-gcc7-bazel-test uses: ./.github/workflows/_bazel-build-test.yml with: - build-environment: linux-xenial-cuda11.3-py3.7-gcc7-bazel-test - docker-image-name: pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 + build-environment: linux-bionic-cuda11.6-py3.10-gcc7-bazel-test + docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 - linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single: - name: linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single + linux-focal-py3-clang7-android-ndk-r19c-gradle-custom-build-single: + name: linux-focal-py3-clang7-android-ndk-r19c-gradle-custom-build-single uses: ./.github/workflows/_android-build-test.yml with: - build-environment: linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single - docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c + build-environment: linux-focal-py3-clang7-android-ndk-r19c-gradle-custom-build-single + docker-image-name: pytorch-linux-focal-py3-clang7-android-ndk-r19c - linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit: - name: linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit + linux-focal-py3-clang7-android-ndk-r19c-gradle-custom-build-single-full-jit: + name: linux-focal-py3-clang7-android-ndk-r19c-gradle-custom-build-single-full-jit uses: ./.github/workflows/_android-build-test.yml with: - build-environment: linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit - docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c + build-environment: linux-focal-py3-clang7-android-ndk-r19c-gradle-custom-build-single-full-jit + docker-image-name: pytorch-linux-focal-py3-clang7-android-ndk-r19c linux-focal-py3_7-gcc7-mobile-lightweight-dispatch-build: name: linux-focal-py3.7-gcc7-mobile-lightweight-dispatch-build @@ -282,31 +299,17 @@ jobs: docker-image-name: pytorch-linux-focal-py3.7-gcc7 build-generates-artifacts: false - linux-xenial-cuda11_3-py3_7-gcc7-deploy-build: - name: linux-xenial-cuda11_3-py3_7-gcc7-deploy - uses: ./.github/workflows/_linux-build.yml - with: - build-environment: linux-xenial-cuda11.3-py3.7-gcc7-deploy - docker-image-name: pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 - - deploy-linux-xenial-cuda11_3-py3_7-gcc7-test: - name: linux-xenial-cuda11_3-py3_7-gcc7-deploy - uses: ./.github/workflows/_linux-test.yml - needs: linux-xenial-cuda11_3-py3_7-gcc7-deploy-build - with: - build-environment: linux-xenial-cuda11.3-py3.7-gcc7-deploy - docker-image: ${{ needs.linux-xenial-cuda11_3-py3_7-gcc7-deploy-build.outputs.docker-image }} - test-matrix: | - { include: [ - { config: "deploy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, - ]} - - linux-focal-rocm5_2-py3_7-build: + linux-focal-rocm5_2-py3_8-build: # don't run build twice on master if: github.event_name == 'pull_request' - name: linux-focal-rocm5.2-py3.7 + name: linux-focal-rocm5.2-py3.8 uses: ./.github/workflows/_linux-build.yml with: - build-environment: linux-focal-rocm5.2-py3.7 - docker-image-name: pytorch-linux-focal-rocm5.2-py3.7 + build-environment: linux-focal-rocm5.2-py3.8 + docker-image-name: pytorch-linux-focal-rocm5.2-py3.8 sync-tag: rocm-build + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.rocm.gpu" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.rocm.gpu" }, + ]} diff --git a/.github/workflows/push_nightly_docker_ghcr.yml 
b/.github/workflows/push_nightly_docker_ghcr.yml deleted file mode 100644 index ca30c9651ff8..000000000000 --- a/.github/workflows/push_nightly_docker_ghcr.yml +++ /dev/null @@ -1,39 +0,0 @@ -name: docker-release-builds -on: - schedule: - # Push the nightly docker daily at 1 PM UTC - - cron: '0 13 * * *' - # Trigger when we modify something related to these images - pull_request: - paths: - - .github/scripts/build_publish_nightly_docker.sh - - .github/workflows/push_nightly_docker_ghcr.yml - - Dockerfile - - docker.Makefile - # Have the ability to trigger this job manually using the API as well - workflow_dispatch: - -jobs: - docker-release-build: - if: ${{ github.repository == 'pytorch/pytorch' }} - runs-on: linux.2xlarge - env: - GHCR_PAT: ${{ secrets.GHCR_PAT }} - WITH_PUSH: ${{ github.event_name == 'schedule' }} - steps: - - name: Checkout PyTorch - uses: zhouzhuojie/checkout@05b13c9a0d21f08f6d5e64a1d5042246d13619d9 - with: - ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }} - - uses: nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a - name: Build and upload nightly docker - with: - timeout_minutes: 10 - max_attempts: 3 - command: | - set -ex - bash .github/scripts/build_publish_nightly_docker.sh - -concurrency: - group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }} - cancel-in-progress: true diff --git a/.github/workflows/revert.yml b/.github/workflows/revert.yml index 1fbdacc82071..2a2fff27044e 100644 --- a/.github/workflows/revert.yml +++ b/.github/workflows/revert.yml @@ -8,18 +8,25 @@ jobs: do_revert: name: try_revert_pr_${{ github.event.client_payload.pr_num }} runs-on: linux.20_04.4x + env: + GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - name: Checkout repo uses: actions/checkout@v2 + id: checkout with: fetch-depth: 0 token: ${{ secrets.MERGEBOT_TOKEN }} + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: '3.8' + architecture: x64 + check-latest: false + cache: pip + - run: pip install pyyaml==6.0 + - name: Setup committer id run: | git config --global user.email "pytorchmergebot@users.noreply.github.com" @@ -30,7 +37,6 @@ jobs: PR_NUM: ${{ github.event.client_payload.pr_num }} COMMENT_ID: ${{ github.event.client_payload.comment_id }} REASON: ${{ github.event.client_payload.reason }} - GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} run: | set -ex if [ -n "${COMMENT_ID}" ]; then @@ -46,5 +52,14 @@ jobs: python3 .github/scripts/trymerge.py --revert "${PR_NUM}" fi fi + - name: Comment on Canceled + if: ${{ cancelled() && steps.checkout.outcome == 'success' }} + continue-on-error: true + env: + GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} + PR_NUM: ${{ github.event.client_payload.pr_num }} + run: | + set -ex + python3 .github/scripts/comment_on_pr.py "${PR_NUM}" "revert" concurrency: try-revert diff --git a/.github/workflows/run_torchbench.yml b/.github/workflows/run_torchbench.yml index 1ec238fe4d32..b6c870fa7839 100644 --- a/.github/workflows/run_torchbench.yml +++ b/.github/workflows/run_torchbench.yml @@ -1,17 +1,18 @@ -name: TorchBench CI (pytorch-linux-py3.7-cu102) +name: TorchBench CI (pytorch-linux-py3.8-cu116) on: pull_request: env: PYTHON_VERSION: "3.8" - CUDA_VERSION: "11.3" - MAGMA_VERSION: 
"magma-cuda113" # must be consistent with https://github.com/pytorch/benchmark/blob/main/requirements.txt#L19 NUMPY_VERSION: "1.21.2" + SETUP_SCRIPT: "/data/nvme/bin/setup_instance.sh" PR_NUM: ${{ github.event.number }} PR_BODY: ${{ github.event.pull_request.body }} PR_BASE_SHA: ${{ github.event.pull_request.base.sha }} PR_HEAD_SHA: ${{ github.event.pull_request.head.sha }} + AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} + AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} jobs: run-torchbench: @@ -35,20 +36,19 @@ jobs: - name: Create conda environment and install deps run: | conda create -y -n pr-ci python="${PYTHON_VERSION}" - # shellcheck disable=SC1091 - . "${HOME}"/anaconda3/etc/profile.d/conda.sh + # shellcheck source=/dev/null + . "${SETUP_SCRIPT}" conda activate pr-ci # pin cmake version to 3.22 since 3.23 breaks pytorch build # see details at: https://github.com/pytorch/pytorch/issues/74985 conda install -y numpy="${NUMPY_VERSION}" requests ninja pyyaml mkl mkl-include \ - setuptools cmake=3.22 cffi typing_extensions \ + setuptools cmake=3.22 cffi typing_extensions boto3 \ future six dataclasses pillow pytest tabulate gitpython git-lfs tqdm psutil - # install magma - conda install -y -c pytorch "${MAGMA_VERSION}" + pip install --pre torch torchvision torchtext -f https://download.pytorch.org/whl/nightly/cu116/torch_nightly.html - name: Setup TorchBench branch run: | - # shellcheck disable=SC1091 - . "${HOME}"/anaconda3/etc/profile.d/conda.sh + # shellcheck source=/dev/null + . "${SETUP_SCRIPT}" conda activate pr-ci PR_BODY_FILE=/tmp/pr-body.txt echo "$PR_BODY" > ${PR_BODY_FILE} @@ -60,15 +60,19 @@ jobs: path: benchmark lfs: false ref: ${{ env.TORCHBENCH_BRANCH }} + - name: GPU Info + run: | + nvidia-smi - name: Run TorchBench run: | + set -x pushd "${HOME}"/pytorch PR_MERGE_BASE=$(git merge-base "$PR_BASE_SHA" "$PR_HEAD_SHA") popd PR_BODY_FILE=/tmp/pr-body.txt echo "$PR_BODY" > ${PR_BODY_FILE} - # shellcheck disable=SC1091 - . "${HOME}"/anaconda3/etc/profile.d/conda.sh + # shellcheck source=/dev/null + . "${SETUP_SCRIPT}" conda activate pr-ci python3 pytorch/.github/scripts/run_torchbench.py \ --pr-body "$PR_BODY_FILE" \ @@ -78,12 +82,20 @@ jobs: --pr-num "$PR_NUM" \ --pr-base-sha "$PR_MERGE_BASE" \ --pr-head-sha "$PR_HEAD_SHA" + - name: Upload result to S3 + run: | + # shellcheck source=/dev/null + . "${SETUP_SCRIPT}" + conda activate pr-ci + python3 pytorch/.github/scripts/run_torchbench.py \ + upload-s3 \ + --result-dir "${HOME}/.torchbench/bisection/pr${{ github.event.number }}" - name: Remove conda environment and cleanup run: | conda env remove --name pr-ci rm /tmp/pr-body.txt - name: Upload artifact - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v3 with: name: TorchBench result path: ~/.torchbench/bisection/pr${{ github.event.number }} diff --git a/.github/workflows/scorecards.yml b/.github/workflows/scorecards.yml new file mode 100644 index 000000000000..8abee79cf400 --- /dev/null +++ b/.github/workflows/scorecards.yml @@ -0,0 +1,55 @@ +name: ossf-scorecard +on: + # Only the default branch is supported. + branch_protection_rule: + workflow_dispatch: + schedule: + - cron: '32 16 * * 3' + push: + branches: [ "master" ] + +# Declare default permissions as read only. +permissions: read-all + +jobs: + analysis: + name: Scorecards analysis + runs-on: ubuntu-latest + permissions: + # Needed to upload the results to code-scanning dashboard. + security-events: write + # Used to receive a badge. 
+ id-token: write + + if: false && github.repository == 'pytorch/pytorch' # don't run on forks + + steps: + - name: "Checkout code" + uses: actions/checkout@v3 + with: + persist-credentials: false + + - name: "Run analysis" + uses: ossf/scorecard-action@865b4092859256271290c77adbd10a43f4779972 # tag=v2.0.3 + with: + results_file: results.sarif + results_format: sarif + + # Publish the results for public repositories to enable scorecard badges. For more details, see + # https://github.com/ossf/scorecard-action#publishing-results. + publish_results: true + + # Upload the results as artifacts (optional). Commenting out will disable uploads of run results in SARIF + # format to the repository Actions tab. + - name: "Upload artifact" + uses: actions/upload-artifact@v3 + with: + name: SARIF file + path: results.sarif + retention-days: 5 + + # Upload the results to GitHub's code scanning dashboard. + - name: "Upload to code-scanning" + uses: github/codeql-action/upload-sarif@5f532563584d71fdef14ee64d17bafb34f751ce5 # tag=v1.0.26 + with: + sarif_file: results.sarif diff --git a/.github/workflows/trunk.yml b/.github/workflows/trunk.yml index 0b4c147386a3..6779a362209c 100644 --- a/.github/workflows/trunk.yml +++ b/.github/workflows/trunk.yml @@ -10,9 +10,11 @@ on: tags: - ciflow/trunk/* workflow_dispatch: + schedule: + - cron: 29 8 * * * # about 1:29am PDT concurrency: - group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }} + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }} cancel-in-progress: true jobs: @@ -22,6 +24,11 @@ jobs: with: build-environment: parallelnative-linux-focal-py3.7-gcc7 docker-image-name: pytorch-linux-focal-py3.7-gcc7 + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, + { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, + ]} parallelnative-linux-focal-py3_7-gcc7-test: name: parallelnative-linux-focal-py3.7-gcc7 @@ -30,11 +37,7 @@ jobs: with: build-environment: parallelnative-linux-focal-py3.7-gcc7 docker-image: ${{ needs.parallelnative-linux-focal-py3_7-gcc7-build.outputs.docker-image }} - test-matrix: | - { include: [ - { config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" }, - { config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" }, - ]} + test-matrix: ${{ needs.parallelnative-linux-focal-py3_7-gcc7-build.outputs.test-matrix }} # Build PyTorch with BUILD_CAFFE2=ON caffe2-linux-focal-py3_7-gcc7-build: @@ -44,34 +47,63 @@ jobs: build-environment: caffe2-linux-focal-py3.7-gcc7 docker-image-name: pytorch-linux-focal-py3.7-gcc7 - linux-bionic-cuda10_2-py3_9-gcc7-build: - name: linux-bionic-cuda10.2-py3.9-gcc7 + linux-bionic-cuda11_7-py3_10-gcc7-build: + name: linux-bionic-cuda11.7-py3.10-gcc7 uses: ./.github/workflows/_linux-build.yml with: - build-environment: linux-bionic-cuda10.2-py3.9-gcc7 - docker-image-name: pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7 - - linux-bionic-cuda10_2-py3_9-gcc7-test: - name: linux-bionic-cuda10.2-py3.9-gcc7 - uses: ./.github/workflows/_linux-test.yml - needs: linux-bionic-cuda10_2-py3_9-gcc7-build - with: - build-environment: linux-bionic-cuda10.2-py3.9-gcc7 - docker-image: ${{ 
needs.linux-bionic-cuda10_2-py3_9-gcc7-build.outputs.docker-image }} + build-environment: linux-bionic-cuda11.7-py3.10-gcc7 + docker-image-name: pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7 test-matrix: | { include: [ - { config: "default", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, - { config: "default", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 1, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 3, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, + { config: "default", shard: 4, num_shards: 4, runner: "linux.4xlarge.nvidia.gpu" }, { config: "functorch", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, { config: "slow", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, { config: "slow", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" }, { config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, { config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, { config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" }, - { config: "distributed", shard: 1, num_shards: 2, runner: "linux.8xlarge.nvidia.gpu" }, - { config: "distributed", shard: 2, num_shards: 2, runner: "linux.8xlarge.nvidia.gpu" }, + { config: "distributed", shard: 1, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" }, + { config: "distributed", shard: 2, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" }, + { config: "distributed", shard: 3, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" }, + ]} + + linux-bionic-cuda11_7-py3_10-gcc7-test: + name: linux-bionic-cuda11.7-py3.10-gcc7 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_7-py3_10-gcc7-build + with: + build-environment: linux-bionic-cuda11.7-py3.10-gcc7 + docker-image: ${{ needs.linux-bionic-cuda11_7-py3_10-gcc7-build.outputs.docker-image }} + test-matrix: ${{ needs.linux-bionic-cuda11_7-py3_10-gcc7-build.outputs.test-matrix }} + + linux-bionic-cuda11_6-py3_10-gcc7-sm86-build: + name: cuda11.6-py3.10-gcc7-sm86 + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-bionic-cuda11.6-py3.10-gcc7-sm86 + docker-image-name: pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 + cuda-arch-list: 8.6 + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 4, runner: "linux.g5.4xlarge.nvidia.gpu" }, + { config: "default", shard: 2, num_shards: 4, runner: "linux.g5.4xlarge.nvidia.gpu" }, + { config: "default", shard: 3, num_shards: 4, runner: "linux.g5.4xlarge.nvidia.gpu" }, + { config: "default", shard: 4, num_shards: 4, runner: "linux.g5.4xlarge.nvidia.gpu" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" }, ]} + linux-bionic-cuda11_6-py3_10-gcc7-sm86-test: + name: cuda11.6-py3.10-gcc7-sm86 + uses: ./.github/workflows/_linux-test.yml + needs: linux-bionic-cuda11_6-py3_10-gcc7-sm86-build + with: + build-environment: linux-bionic-cuda11.6-py3.10-gcc7-sm86 + docker-image: ${{ needs.linux-bionic-cuda11_6-py3_10-gcc7-sm86-build.outputs.docker-image }} + test-matrix: ${{ needs.linux-bionic-cuda11_6-py3_10-gcc7-sm86-build.outputs.test-matrix }} + libtorch-linux-bionic-cuda11_6-py3_7-gcc7-build: name: libtorch-linux-bionic-cuda11.6-py3.7-gcc7 uses: ./.github/workflows/_linux-build.yml @@ -79,27 +111,22 @@ jobs: build-environment: libtorch-linux-bionic-cuda11.6-py3.7-gcc7 docker-image-name: 
pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7 build-generates-artifacts: false + runner: linux.4xlarge # no-ops builds test USE_PER_OPERATOR_HEADERS=0 where ATen/ops is not generated - linux-xenial-cuda11_3-py3_7-gcc7-no-ops-build: - name: linux-xenial-cuda11.3-py3.7-gcc7-no-ops + linux-bionic-cuda11_7-py3_10-gcc7-no-ops-build: + name: linux-bionic-cuda11.7-py3.10-gcc7-no-ops uses: ./.github/workflows/_linux-build.yml with: - build-environment: linux-xenial-cuda11.3-py3.7-gcc7-no-ops - docker-image-name: pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7 + build-environment: linux-bionic-cuda11.7-py3.10-gcc7-no-ops + docker-image-name: pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7 - pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build: - name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build + pytorch-linux-focal-py3-clang7-android-ndk-r19c-build: + name: pytorch-linux-focal-py3-clang7-android-ndk-r19c-build uses: ./.github/workflows/_android-full-build-test.yml with: - build-environment: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build - docker-image-name: pytorch-linux-xenial-py3-clang5-android-ndk-r19c - secrets: - SONATYPE_NEXUS_USERNAME: ${{ secrets.SONATYPE_NEXUS_USERNAME }} - SONATYPE_NEXUS_PASSWORD: ${{ secrets.SONATYPE_NEXUS_PASSWORD }} - ANDROID_SIGN_KEY: ${{ secrets.ANDROID_SIGN_KEY }} - ANDROID_SIGN_PASS: ${{ secrets.ANDROID_SIGN_PASS }} - SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }} + build-environment: pytorch-linux-focal-py3-clang7-android-ndk-r19c-build + docker-image-name: pytorch-linux-focal-py3-clang7-android-ndk-r19c linux-bionic-py3_7-clang9-slow-build: name: linux-bionic-py3.7-clang9-slow @@ -107,6 +134,10 @@ jobs: with: build-environment: linux-bionic-py3.7-clang9-slow docker-image-name: pytorch-linux-bionic-py3.7-clang9 + test-matrix: | + { include: [ + { config: "slow", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + ]} linux-bionic-py3_7-clang9-slow-test: name: linux-bionic-py3.7-clang9-slow @@ -115,11 +146,28 @@ jobs: with: build-environment: linux-bionic-py3.7-clang9-slow docker-image: ${{ needs.linux-bionic-py3_7-clang9-slow-build.outputs.docker-image }} + test-matrix: ${{ needs.linux-bionic-py3_7-clang9-slow-build.outputs.test-matrix }} + + linux-focal-py3_7-clang7-tsan-build: + name: linux-focal-py3.7-clang7-tsan + uses: ./.github/workflows/_linux-build.yml + with: + build-environment: linux-focal-py3.7-clang7-tsan + docker-image-name: pytorch-linux-focal-py3-clang7-asan test-matrix: | { include: [ - { config: "slow", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, + { config: "tsan", shard: 1, num_shards: 1, runner: "linux.2xlarge" }, ]} + linux-focal-py3_7-clang7-tsan-test: + name: linux-focal-py3.7-clang7-tsan + uses: ./.github/workflows/_linux-test.yml + needs: linux-focal-py3_7-clang7-tsan-build + with: + build-environment: linux-focal-py3.7-clang7-tsan + docker-image: ${{ needs.linux-focal-py3_7-clang7-tsan-build.outputs.docker-image }} + test-matrix: ${{ needs.linux-focal-py3_7-clang7-tsan-build.outputs.test-matrix }} + ios-12-5-1-x86-64: name: ios-12-5-1-x86-64 uses: ./.github/workflows/_ios-build-test.yml @@ -127,11 +175,6 @@ jobs: build-environment: ios-12-5-1-x86-64 ios-platform: SIMULATOR ios-arch: x86_64 - secrets: - IOS_CERT_KEY_2022: ${{ secrets.IOS_CERT_KEY_2022 }} - IOS_CERT_SECRET: ${{ secrets.IOS_CERT_SECRET}} - IOS_DEV_TEAM_ID: ${{ secrets.IOS_DEV_TEAM_ID}} - IOS_SIGN_KEY_2022: ${{ secrets.IOS_SIGN_KEY_2022 }} macos-12-py3-x86-64-build: name: macos-12-py3-x86-64 @@ -141,6 +184,12 @@ jobs: 
xcode-version: "13.3.1" runner-type: macos-12-xl build-generates-artifacts: true + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "macos-12" }, + { config: "default", shard: 2, num_shards: 2, runner: "macos-12" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "macos-12" }, + ]} secrets: MACOS_SCCACHE_S3_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} MACOS_SCCACHE_S3_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} @@ -151,12 +200,7 @@ jobs: needs: macos-12-py3-x86-64-build with: build-environment: macos-12-py3-x86-64 - test-matrix: | - { include: [ - { config: "default", shard: 1, num_shards: 2, runner: "macos-12" }, - { config: "default", shard: 2, num_shards: 2, runner: "macos-12" }, - { config: "functorch", shard: 1, num_shards: 1, runner: "macos-12" }, - ]} + test-matrix: ${{ needs.macos-12-py3-x86-64-build.outputs.test-matrix }} arch: x86_64 secrets: AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} @@ -185,6 +229,16 @@ jobs: build-generates-artifacts: true # To match the one pre-installed in the m1 runners python_version: 3.9.12 + # We need to set the environment file here instead of trying to detect it automatically because + # MacOS arm64 is cross-compiled from x86-64. Specifically, it means that arm64 conda environment + # is needed when building PyTorch MacOS arm64 from x86-64 + environment-file: .github/requirements/conda-env-macOS-ARM64 + test-matrix: | + { include: [ + { config: "default", shard: 1, num_shards: 2, runner: "macos-m1-12" }, + { config: "default", shard: 2, num_shards: 2, runner: "macos-m1-12" }, + { config: "functorch", shard: 1, num_shards: 1, runner: "macos-m1-12" }, + ]} secrets: MACOS_SCCACHE_S3_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }} MACOS_SCCACHE_S3_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }} @@ -193,6 +247,7 @@ jobs: name: macos-12-py3-arm64-mps uses: ./.github/workflows/_mac-test-mps.yml needs: macos-12-py3-arm64-build + if: needs.macos-12-py3-arm64-build.outputs.build-outcome == 'success' with: sync-tag: macos-12-py3-arm64-mps-test build-environment: macos-12-py3-arm64 @@ -203,11 +258,7 @@ jobs: needs: macos-12-py3-arm64-build with: build-environment: macos-12-py3-arm64 - test-matrix: | - { include: [ - { config: "default", shard: 1, num_shards: 2, runner: "macos-m1-12" }, - { config: "default", shard: 2, num_shards: 2, runner: "macos-m1-12" }, - ]} + test-matrix: ${{ needs.macos-12-py3-arm64-build.outputs.test-matrix }} arch: arm64 secrets: AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} @@ -220,14 +271,6 @@ jobs: build-environment: win-vs2019-cuda11.6-py3 cuda-version: "11.6" sync-tag: win-cuda-build - - win-vs2019-cuda11_6-py3-test: - name: win-vs2019-cuda11.6-py3 - uses: ./.github/workflows/_win-test.yml - needs: win-vs2019-cuda11_6-py3-build - with: - build-environment: win-vs2019-cuda11.6-py3 - cuda-version: "11.6" test-matrix: | { include: [ { config: "default", shard: 1, num_shards: 5, runner: "windows.8xlarge.nvidia.gpu" }, @@ -239,26 +282,36 @@ jobs: { config: "force_on_cpu", shard: 1, num_shards: 1, runner: "windows.4xlarge" }, ]} - linux-focal-rocm5_2-py3_7-build: - name: linux-focal-rocm5.2-py3.7 - uses: ./.github/workflows/_linux-build.yml + win-vs2019-cuda11_6-py3-test: + name: win-vs2019-cuda11.6-py3 + uses: ./.github/workflows/_win-test.yml + needs: win-vs2019-cuda11_6-py3-build with: - build-environment: linux-focal-rocm5.2-py3.7 - 
docker-image-name: pytorch-linux-focal-rocm5.2-py3.7 - sync-tag: rocm-build + build-environment: win-vs2019-cuda11.6-py3 + cuda-version: "11.6" + test-matrix: ${{ needs.win-vs2019-cuda11_6-py3-build.outputs.test-matrix }} - linux-focal-rocm5_2-py3_7-test: - name: linux-focal-rocm5.2-py3.7 - uses: ./.github/workflows/_rocm-test.yml - needs: linux-focal-rocm5_2-py3_7-build + linux-focal-rocm5_2-py3_8-build: + name: linux-focal-rocm5.2-py3.8 + uses: ./.github/workflows/_linux-build.yml with: - build-environment: linux-focal-rocm5.2-py3.7 - docker-image: ${{ needs.linux-focal-rocm5_2-py3_7-build.outputs.docker-image }} + build-environment: linux-focal-rocm5.2-py3.8 + docker-image-name: pytorch-linux-focal-rocm5.2-py3.8 + sync-tag: rocm-build test-matrix: | { include: [ { config: "default", shard: 1, num_shards: 2, runner: "linux.rocm.gpu" }, { config: "default", shard: 2, num_shards: 2, runner: "linux.rocm.gpu" }, ]} + + linux-focal-rocm5_2-py3_8-test: + name: linux-focal-rocm5.2-py3.8 + uses: ./.github/workflows/_rocm-test.yml + needs: linux-focal-rocm5_2-py3_8-build + with: + build-environment: linux-focal-rocm5.2-py3.8 + docker-image: ${{ needs.linux-focal-rocm5_2-py3_8-build.outputs.docker-image }} + test-matrix: ${{ needs.linux-focal-rocm5_2-py3_8-build.outputs.test-matrix }} secrets: AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_V2_ACCESS_KEY_ID }} AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_V2_SECRET_ACCESS_KEY }} diff --git a/.github/workflows/trymerge.yml b/.github/workflows/trymerge.yml index 8db7b0c97c5c..3d1d92967d88 100644 --- a/.github/workflows/trymerge.yml +++ b/.github/workflows/trymerge.yml @@ -8,18 +8,25 @@ jobs: do_merge: name: try_merge_pr_${{ github.event.client_payload.pr_num }} runs-on: linux.20_04.4x + env: + GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - name: Checkout repo - uses: actions/checkout@v2 + id: checkout + uses: actions/checkout@v3 with: fetch-depth: 0 token: ${{ secrets.MERGEBOT_TOKEN }} + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: '3.8' + check-latest: false + cache: pip + architecture: x64 + - run: pip install pyyaml==6.0 + - name: Setup committer id run: | git config --global user.email "pytorchmergebot@users.noreply.github.com" @@ -28,13 +35,21 @@ jobs: env: GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} PR_NUM: ${{ github.event.client_payload.pr_num }} - GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} FORCE: ${{ github.event.client_payload.force}} ON_GREEN: ${{ github.event.client_payload.on_green}} LAND_CHECKS: ${{ github.event.client_payload.land_checks }} COMMENT_ID: ${{ github.event.client_payload.comment_id }} + REBASE: ${{ github.event.client_payload.rebase }} run: | set -ex + if [ -n "${REBASE}" ]; then + python3 .github/scripts/tryrebase.py "${PR_NUM}" --branch "${REBASE}" + git checkout master + git fetch -p + # give github some time between the push and start workflows so that Github's messages + # on the PR appear in chronological order (timing issues can shuffle them around) + sleep 60 + fi if [ -n "${FORCE}" ]; then if [ -n "${COMMENT_ID}" ]; then python3 .github/scripts/trymerge.py --force --comment-id "${COMMENT_ID}" "${PR_NUM}" @@ -50,6 +65,15 @@ jobs: else python3 .github/scripts/trymerge.py "${PR_NUM}" fi + - name: Comment on Canceled + if: ${{ 
cancelled() && steps.checkout.outcome == 'success' }} + continue-on-error: true + env: + GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} + PR_NUM: ${{ github.event.client_payload.pr_num }} + run: | + set -ex + python3 .github/scripts/comment_on_pr.py "${PR_NUM}" "merge" # We want newer merge commands to supercede old ones concurrency: diff --git a/.github/workflows/tryrebase.yml b/.github/workflows/tryrebase.yml index 748127ff2d62..53434310c3d0 100644 --- a/.github/workflows/tryrebase.yml +++ b/.github/workflows/tryrebase.yml @@ -7,19 +7,25 @@ on: jobs: do_rebase: runs-on: ubuntu-20.04 + env: + GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - - name: Checkout repo + id: checkout uses: actions/checkout@v2 with: fetch-depth: 0 token: ${{ secrets.MERGEBOT_TOKEN }} + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: '3.8' + architecture: x64 + check-latest: false + cache: pip + - run: pip install pyyaml==6.0 + - name: Setup committer id run: | git config --global user.email "pytorchmergebot@users.noreply.github.com" @@ -29,7 +35,6 @@ jobs: env: GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} PR_NUM: ${{ github.event.client_payload.pr_num }} - GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }} BRANCH: ${{ github.event.client_payload.branch }} run: | set -ex @@ -38,3 +43,12 @@ jobs: else python3 .github/scripts/tryrebase.py "${PR_NUM}" fi + - name: Comment on Canceled + if: ${{ cancelled() && steps.checkout.outcome == 'success' }} + continue-on-error: true + env: + GITHUB_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} + PR_NUM: ${{ github.event.client_payload.pr_num }} + run: | + set -ex + python3 .github/scripts/comment_on_pr.py "${PR_NUM}" "rebase" diff --git a/.github/workflows/update-commit-hashes.yml b/.github/workflows/update-commit-hashes.yml deleted file mode 100644 index 6c72492d93ac..000000000000 --- a/.github/workflows/update-commit-hashes.yml +++ /dev/null @@ -1,37 +0,0 @@ -name: update-commit-hashes - -on: - schedule: - # Every day at 7:37am UTC = 12:27am PST - # Choose a random time near midnight PST because it may be delayed if there are high loads - # See https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule - - cron: 37 7 * * * - workflow_dispatch: - -jobs: - update-xla-commit-hash: - uses: ./.github/workflows/_update-commit-hash.yml - with: - repo-name: xla - branch: master - secrets: - MERGEBOT_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} - PYTORCHBOT_TOKEN: ${{ secrets.GH_PYTORCHBOT_TOKEN }} - - update-torchdynamo-commit-hash: - uses: ./.github/workflows/_update-commit-hash.yml - with: - repo-name: torchdynamo - branch: main - secrets: - MERGEBOT_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} - PYTORCHBOT_TOKEN: ${{ secrets.GH_PYTORCHBOT_TOKEN }} - - update-vision-commit-hash: - uses: ./.github/workflows/_update-commit-hash.yml - with: - repo-name: vision - branch: main - secrets: - MERGEBOT_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} - PYTORCHBOT_TOKEN: ${{ secrets.GH_PYTORCHBOT_TOKEN }} diff --git a/.github/workflows/update-viablestrict.yml b/.github/workflows/update-viablestrict.yml index 872d8f5c1428..12bf4e271f92 100644 --- a/.github/workflows/update-viablestrict.yml +++ b/.github/workflows/update-viablestrict.yml @@ -7,24 +7,29 @@ on: concurrency: group: ${{ github.workflow }} - cancel-in-progress: true + cancel-in-progress: false jobs: 
do_update_viablestrict: runs-on: ubuntu-20.04 steps: - - name: Setup Python - uses: actions/setup-python@v2 - with: - python-version: 3.8 - architecture: x64 - - name: Checkout repo - uses: actions/checkout@v2 + uses: actions/checkout@v3 with: fetch-depth: 0 token: ${{ secrets.MERGEBOT_TOKEN }} + - name: Setup Python + uses: actions/setup-python@v4 + with: + python-version: '3.8' + architecture: x64 + check-latest: false + cache: pip + cache-dependency-path: | + **/.circleci/docker/requirements-ci.txt + **/.github/requirements-gha-cache.txt + - name: Install Python Packages run: | pip3 install rockset==0.8.10 @@ -36,7 +41,7 @@ jobs: ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }} run: | output=$(python3 .github/scripts/fetch_latest_green_commit.py) - echo "::set-output name=latest_viable_sha::$output" + echo "latest_viable_sha=$output" >> "${GITHUB_OUTPUT}" id: get-latest-commit - name: Push SHA to viable/strict branch @@ -47,4 +52,6 @@ jobs: git config --global user.email "pytorchmergebot@users.noreply.github.com" git config --global user.name "PyTorch MergeBot" echo "Set the latest sha variable to be ${{ steps.get-latest-commit.outputs.latest_viable_sha }}" - git push origin "${{ steps.get-latest-commit.outputs.latest_viable_sha }}":viable/strict + # Pushing an older green commit here will fail because it's non-fast-forward, which is ok + # to ignore because we already have the later green commit in visable/strict + git push origin "${{ steps.get-latest-commit.outputs.latest_viable_sha }}":viable/strict || true diff --git a/.github/workflows/update_pytorch_labels.yml b/.github/workflows/update_pytorch_labels.yml index f19347070ece..31bbab78e2f9 100644 --- a/.github/workflows/update_pytorch_labels.yml +++ b/.github/workflows/update_pytorch_labels.yml @@ -10,7 +10,7 @@ concurrency: jobs: update-labels-in-S3: - runs-on: ubuntu-18.04 + runs-on: ubuntu-22.04 if: ${{ github.repository == 'pytorch/pytorch' }} steps: - name: Checkout PyTorch diff --git a/.github/workflows/update_s3_htmls.yml b/.github/workflows/update_s3_htmls.yml index 5f3ff056c5a4..d68b58911bed 100644 --- a/.github/workflows/update_s3_htmls.yml +++ b/.github/workflows/update_s3_htmls.yml @@ -8,7 +8,7 @@ on: jobs: update-html: - runs-on: ubuntu-18.04 + runs-on: ubuntu-22.04 if: ${{ github.repository == 'pytorch/pytorch' }} strategy: matrix: diff --git a/.github/workflows/upload-test-stats.yml b/.github/workflows/upload-test-stats.yml index b649aac2c7c5..3f3db80670d8 100644 --- a/.github/workflows/upload-test-stats.yml +++ b/.github/workflows/upload-test-stats.yml @@ -2,7 +2,7 @@ name: Upload test stats on: workflow_run: - workflows: [pull, trunk, periodic] + workflows: [pull, trunk, periodic, inductor] types: - completed @@ -58,6 +58,31 @@ jobs: python3 -m tools.stats.upload_test_stats --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --head-branch "${HEAD_BRANCH}" python3 -m tools.stats.upload_sccache_stats --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" + - name: Upload test artifacts + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + WORKFLOW_ARTIFACTS_URL: ${{ github.event.workflow_run.artifacts_url }} + WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }} + WORKFLOW_RUN_ATTEMPT: ${{ github.event.workflow_run.run_attempt }} + REPO_FULLNAME: ${{ github.event.workflow_run.repository.full_name }} + run: | + echo "${WORKFLOW_ARTIFACTS_URL}" + + # Note that in the case of Linux and Windows, their artifacts have already been uploaded to S3, so there 
simply won't be + # anything on GitHub to upload. The command should return right away + python3 -m tools.stats.upload_artifacts --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --repo "${REPO_FULLNAME}" + + - name: Analyze disabled tests rerun + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + WORKFLOW_ARTIFACTS_URL: ${{ github.event.workflow_run.artifacts_url }} + WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }} + WORKFLOW_RUN_ATTEMPT: ${{ github.event.workflow_run.run_attempt }} + REPO_FULLNAME: ${{ github.event.workflow_run.repository.full_name }} + run: | + # Analyze the results from disable tests rerun and upload them to S3 + python3 -m tools.stats.check_disabled_tests --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --repo "${REPO_FULLNAME}" + check-api-rate: if: ${{ always() }} runs-on: [self-hosted, linux.2xlarge] @@ -66,5 +91,9 @@ jobs: - name: Get our GITHUB_TOKEN API limit usage env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + PYTORCHBOT_TOKEN: ${{ secrets.GH_PYTORCHBOT_TOKEN}} + MERGEBOT_TOKEN: ${{ secrets.MERGEBOT_TOKEN}} run: | curl -H "Accept: application/vnd.github.v3+json" -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit + curl -H "Accept: application/vnd.github.v3+json" -H "Authorization: token $PYTORCHBOT_TOKEN" https://api.github.com/rate_limit + curl -H "Accept: application/vnd.github.v3+json" -H "Authorization: token $MERGEBOT_TOKEN" https://api.github.com/rate_limit diff --git a/.github/workflows/weekly.yml b/.github/workflows/weekly.yml new file mode 100644 index 000000000000..d87c610e1426 --- /dev/null +++ b/.github/workflows/weekly.yml @@ -0,0 +1,19 @@ +name: weekly + +on: + schedule: + # Mondays at 7:37am UTC = 12:27am PST + # Choose a random time near midnight PST because it may be delayed if there are high loads + # See https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule + - cron: 37 7 * * 1 + workflow_dispatch: + +jobs: + update-xla-commit-hash: + uses: ./.github/workflows/_update-commit-hash.yml + with: + repo-name: xla + branch: master + secrets: + MERGEBOT_TOKEN: ${{ secrets.MERGEBOT_TOKEN }} + PYTORCHBOT_TOKEN: ${{ secrets.GH_PYTORCHBOT_TOKEN }} diff --git a/.gitignore b/.gitignore index 88d472b456f4..597ae390abe9 100644 --- a/.gitignore +++ b/.gitignore @@ -46,6 +46,7 @@ docs/source/generated/ log usage_log.txt test-reports/ +test/*.bak test/.coverage test/.hypothesis/ test/cpp/api/mnist @@ -78,10 +79,6 @@ torch/testing/_internal/generated/annotated_fn_args.py torch/testing/_internal/data/*.pt torch/csrc/api/include/torch/version.h torch/csrc/cudnn/cuDNN.cpp -torch/csrc/deploy/example/generated -torch/csrc/deploy/interpreter/cpython -torch/csrc/deploy/interpreter/frozen -torch/csrc/deploy/interpreter/third_party/typing_extensions.py torch/csrc/generated torch/csrc/generic/TensorMethods.cpp torch/csrc/jit/generated/* @@ -117,6 +114,7 @@ torch/test/ torch/utils/benchmark/utils/valgrind_wrapper/callgrind.h torch/utils/benchmark/utils/valgrind_wrapper/valgrind.h torch/version.py +minifier_launcher.py # Root level file used in CI to specify certain env configs. 
# E.g., see .circleci/config.yaml env @@ -307,6 +305,9 @@ TAGS # bazel symlinks bazel-* +# xla repo +xla/ + # direnv, posh-direnv .envrc .psenvrc @@ -335,3 +336,9 @@ buck-out/ # Downloaded libraries third_party/ruy/ third_party/glog/ + +# Virtualenv +venv/ + +# Log files +*.log diff --git a/.gitmodules b/.gitmodules index 538967d31764..282746ed0b53 100644 --- a/.gitmodules +++ b/.gitmodules @@ -148,3 +148,9 @@ [submodule "third_party/nlohmann"] path = third_party/nlohmann url = https://github.com/nlohmann/json.git +[submodule "third_party/VulkanMemoryAllocator"] + path = third_party/VulkanMemoryAllocator + url = https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator.git +[submodule "third_party/cutlass"] + path = third_party/cutlass + url = https://github.com/NVIDIA/cutlass.git diff --git a/.jenkins/caffe2/bench.sh b/.jenkins/caffe2/bench.sh deleted file mode 100755 index 55ac4e94df21..000000000000 --- a/.jenkins/caffe2/bench.sh +++ /dev/null @@ -1,54 +0,0 @@ -#!/bin/bash - -# shellcheck source=./common.sh -source "$(dirname "${BASH_SOURCE[0]}")/common.sh" - -# Anywhere except $ROOT_DIR should work. This is so the python import doesn't -# get confused by any 'caffe2' directory in cwd -cd "$INSTALL_PREFIX" - -if [[ $BUILD_ENVIRONMENT == *-cuda* ]]; then - num_gpus=$(nvidia-smi -L | wc -l) -elif [[ $BUILD_ENVIRONMENT == *-rocm* ]]; then - num_gpus=$(rocminfo | grep 'Device Type.*GPU' | wc -l) -else - num_gpus=0 -fi - -caffe2_pypath="$(cd /usr && $PYTHON -c 'import os; import caffe2; print(os.path.dirname(os.path.realpath(caffe2.__file__)))')" -# Resnet50 -if (( $num_gpus == 0 )); then - "$PYTHON" "$caffe2_pypath/python/examples/imagenet_trainer.py" --train_data null --batch_size 128 --epoch_size 12800 --num_epochs 2 --use_cpu -fi -if (( $num_gpus >= 1 )); then - "$PYTHON" "$caffe2_pypath/python/examples/imagenet_trainer.py" --train_data null --batch_size 128 --epoch_size 12800 --num_epochs 2 --num_gpus 1 - # Let's skip the fp16 bench runs for now, as it recompiles the miopen kernels and can take 10+min to run. 
- # We can resume when we (1) bindmount the miopen cache folder in jenkins; (2) install the pre-compiled miopen kernel library in the docker - # "$PYTHON" "$caffe2_pypath/python/examples/imagenet_trainer.py" --train_data null --batch_size 256 --epoch_size 25600 --num_epochs 2 --num_gpus 1 --float16_compute --dtype float16 -fi -if (( $num_gpus >= 4 )); then - "$PYTHON" "$caffe2_pypath/python/examples/imagenet_trainer.py" --train_data null --batch_size 512 --epoch_size 51200 --num_epochs 2 --num_gpus 4 -fi - -# ResNext -if (( $num_gpus == 0 )); then - "$PYTHON" "$caffe2_pypath/python/examples/imagenet_trainer.py" --resnext_num_groups 32 --resnext_width_per_group 4 --num_layers 101 --train_data null --batch_size 32 --epoch_size 3200 --num_epochs 2 --use_cpu -fi -if (( $num_gpus >= 1 )); then - "$PYTHON" "$caffe2_pypath/python/examples/imagenet_trainer.py" --resnext_num_groups 32 --resnext_width_per_group 4 --num_layers 101 --train_data null --batch_size 32 --epoch_size 3200 --num_epochs 2 --num_gpus 1 - # "$PYTHON" "$caffe2_pypath/python/examples/imagenet_trainer.py" --resnext_num_groups 32 --resnext_width_per_group 4 --num_layers 101 --train_data null --batch_size 64 --epoch_size 3200 --num_epochs 2 --num_gpus 1 --float16_compute --dtype float16 -fi -if (( $num_gpus >= 4 )); then - "$PYTHON" "$caffe2_pypath/python/examples/imagenet_trainer.py" --resnext_num_groups 32 --resnext_width_per_group 4 --num_layers 101 --train_data null --batch_size 128 --epoch_size 12800 --num_epochs 2 --num_gpus 4 -fi - -# Shufflenet -if (( $num_gpus == 0 )); then - "$PYTHON" "$caffe2_pypath/python/examples/imagenet_trainer.py" --train_data null --batch_size 32 --epoch_size 3200 --num_epochs 2 --use_cpu --model shufflenet -fi -if (( $num_gpus >= 1 )); then - "$PYTHON" "$caffe2_pypath/python/examples/imagenet_trainer.py" --train_data null --batch_size 32 --epoch_size 3200 --num_epochs 2 --num_gpus 1 --model shufflenet -fi -if (( $num_gpus >= 4 )); then - "$PYTHON" "$caffe2_pypath/python/examples/imagenet_trainer.py" --train_data null --batch_size 128 --epoch_size 12800 --num_epochs 2 --num_gpus 4 --model shufflenet -fi diff --git a/.jenkins/caffe2/build.sh b/.jenkins/caffe2/build.sh deleted file mode 100755 index e6e06c1d7db5..000000000000 --- a/.jenkins/caffe2/build.sh +++ /dev/null @@ -1,231 +0,0 @@ -#!/bin/bash - -set -ex - -# shellcheck source=./common.sh -source "$(dirname "${BASH_SOURCE[0]}")/common.sh" - -# CMAKE_ARGS are only passed to 'cmake' and the -Dfoo=bar does not work with -# setup.py, so we build a list of foo=bars and then either convert it to -# -Dfoo=bars or export them before running setup.py -build_args=() -build_to_cmake () { - cmake_args=() - for build_arg in $*; do - cmake_args+=("-D$build_arg") - done - echo ${cmake_args[@]} -} - - -SCCACHE="$(which sccache)" - -# Setup ccache if configured to use it (and not sccache) -if [ -z "${SCCACHE}" ] && which ccache > /dev/null; then - mkdir -p ./ccache - ln -sf "$(which ccache)" ./ccache/cc - ln -sf "$(which ccache)" ./ccache/c++ - ln -sf "$(which ccache)" ./ccache/gcc - ln -sf "$(which ccache)" ./ccache/g++ - ln -sf "$(which ccache)" ./ccache/x86_64-linux-gnu-gcc - if [[ "${BUILD_ENVIRONMENT}" == *-cuda* ]]; then - mkdir -p ./ccache/cuda - ln -sf "$(which ccache)" ./ccache/cuda/nvcc - fi - export CACHE_WRAPPER_DIR="$PWD/ccache" - export PATH="$CACHE_WRAPPER_DIR:$PATH" -fi - -# sccache will fail for CUDA builds if all cores are used for compiling -if [ -z "$MAX_JOBS" ]; then - if [[ "${BUILD_ENVIRONMENT}" == *-cuda* ]] && [ -n "${SCCACHE}" ]; then 
- MAX_JOBS=`expr $(nproc) - 1` - else - MAX_JOBS=$(nproc) - fi -fi - -report_compile_cache_stats() { - if [[ -n "${SCCACHE}" ]]; then - "$SCCACHE" --show-stats - elif which ccache > /dev/null; then - ccache -s - fi -} - - -############################################################################### -# Use special scripts for Android and setup builds -############################################################################### -if [[ "${BUILD_ENVIRONMENT}" == *-android* ]]; then - export ANDROID_NDK=/opt/ndk - build_args+=("BUILD_BINARY=ON") - build_args+=("BUILD_TEST=ON") - build_args+=("USE_OBSERVERS=ON") - build_args+=("USE_ZSTD=ON") - BUILD_CAFFE2_MOBILE=1 "${ROOT_DIR}/scripts/build_android.sh" $(build_to_cmake ${build_args[@]}) "$@" - exit 0 -fi - -############################################################################### -# Set parameters -############################################################################### -if [[ "$BUILD_ENVIRONMENT" == *cmake* ]]; then - build_args+=("BUILD_PYTHON=OFF") -else - build_args+=("BUILD_PYTHON=ON") - build_args+=("PYTHON_EXECUTABLE=${PYTHON}") -fi -if [[ $BUILD_ENVIRONMENT == *mkl* ]]; then - build_args+=("BLAS=MKL") - build_args+=("USE_MKLDNN=ON") -fi -build_args+=("BUILD_BINARY=ON") -build_args+=("BUILD_TEST=ON") -build_args+=("INSTALL_TEST=ON") -build_args+=("USE_ZSTD=ON") - -if [[ $BUILD_ENVIRONMENT == *cuda* ]]; then - build_args+=("USE_CUDA=ON") - build_args+=("USE_NNPACK=OFF") - - # Target only our CI GPU machine's CUDA arch to speed up the build - build_args+=("TORCH_CUDA_ARCH_LIST=Maxwell") - - # Explicitly set path to NVCC such that the symlink to ccache or sccache is used - if [ -n "${CACHE_WRAPPER_DIR}" ]; then - build_args+=("CUDA_NVCC_EXECUTABLE=${CACHE_WRAPPER_DIR}/cuda/nvcc") - build_args+=("CMAKE_CUDA_COMPILER_LAUNCHER=${CACHE_WRAPPER_DIR}/ccache") - fi - - # Ensure FindCUDA.cmake can infer the right path to the CUDA toolkit. - # Setting PATH to resolve to the right nvcc alone isn't enough. - # See /usr/share/cmake-3.5/Modules/FindCUDA.cmake, block at line 589. - export CUDA_PATH="/usr/local/cuda" - - # Ensure the ccache symlink can still find the real nvcc binary. - export PATH="/usr/local/cuda/bin:$PATH" -fi -if [[ $BUILD_ENVIRONMENT == *rocm* ]]; then - if [[ -n "$CI" && -z "$PYTORCH_ROCM_ARCH" ]]; then - # Set ROCM_ARCH to gfx900 and gfx906 for CI builds, if user doesn't override. - echo "Limiting PYTORCH_ROCM_ARCH to gfx90[06] for CI builds" - export PYTORCH_ROCM_ARCH="gfx900;gfx906" - fi - # This is needed to enable ImageInput operator in resnet50_trainer - build_args+=("USE_OPENCV=ON") - # This is needed to read datasets from https://download.caffe2.ai/databases/resnet_trainer.zip - build_args+=("USE_LMDB=ON") - # hcc used to run out of memory, silently exiting without stopping - # the build process, leaving undefined symbols in the shared lib, - # causing undefined symbol errors when later running tests. - # We used to set MAX_JOBS to 4 to avoid, but this is no longer an issue. 
- if [ -z "$MAX_JOBS" ]; then - export MAX_JOBS=$(($(nproc) - 1)) - fi - - ########## HIPIFY Caffe2 operators - ${PYTHON} "${ROOT_DIR}/tools/amd_build/build_amd.py" -fi - -# Try to include Redis support for Linux builds -if [ "$(uname)" == "Linux" ]; then - build_args+=("USE_REDIS=ON") -fi - -# Use a specialized onnx namespace in CI to catch hardcoded onnx namespace -build_args+=("ONNX_NAMESPACE=ONNX_NAMESPACE_FOR_C2_CI") - -############################################################################### -# Configure and make -############################################################################### - -if [[ "$BUILD_ENVIRONMENT" == *cmake* ]]; then - # cmake-only non-setup.py build, to test cpp only bits. This installs into - # /usr/local/caffe2 and installs no Python tests - build_args+=("CMAKE_INSTALL_PREFIX=${INSTALL_PREFIX}") - - # Run cmake from ./build_caffe2 directory so it doesn't conflict with - # standard PyTorch build directory. Eventually these won't need to - # be separate. - rm -rf build_caffe2 - mkdir build_caffe2 - cd ./build_caffe2 - - # We test the presence of cmake3 (for platforms like Centos and Ubuntu 14.04) - # and use that if so. - if [[ -x "$(command -v cmake3)" ]]; then - CMAKE_BINARY=cmake3 - else - CMAKE_BINARY=cmake - fi - - # Configure - ${CMAKE_BINARY} "${ROOT_DIR}" $(build_to_cmake ${build_args[@]}) "$@" - - # Build - if [ "$(uname)" == "Linux" ]; then - make "-j${MAX_JOBS}" install - else - echo "Don't know how to build on $(uname)" - exit 1 - fi - - # This is to save test binaries for testing - mv "$INSTALL_PREFIX/test/" "$INSTALL_PREFIX/cpp_test/" - - ls -lah $INSTALL_PREFIX - -else - # Python build. Uses setup.py to install into site-packages - build_args+=("USE_LEVELDB=ON") - build_args+=("USE_LMDB=ON") - build_args+=("USE_OPENCV=ON") - build_args+=("BUILD_TEST=ON") - # These flags preserve the flags that were used before this refactor (blame - # me) - build_args+=("USE_GLOG=ON") - build_args+=("USE_GFLAGS=ON") - build_args+=("USE_FBGEMM=OFF") - build_args+=("USE_MKLDNN=OFF") - build_args+=("USE_DISTRIBUTED=ON") - for build_arg in "${build_args[@]}"; do - export $build_arg - done - - # sccache will be stuck if all cores are used for compiling - # see https://github.com/pytorch/pytorch/pull/7361 - if [[ -n "${SCCACHE}" && $BUILD_ENVIRONMENT != *rocm* ]]; then - export MAX_JOBS=`expr $(nproc) - 1` - fi - - pip install --user dataclasses typing_extensions - - $PYTHON setup.py install --user - - report_compile_cache_stats -fi - -############################################################################### -# Install ONNX -############################################################################### - -# Install ONNX into a local directory -pip install --user "file://${ROOT_DIR}/third_party/onnx#egg=onnx" - -report_compile_cache_stats - -if [[ $BUILD_ENVIRONMENT == *rocm* ]]; then - # remove sccache wrappers post-build; runtime compilation of MIOpen kernels does not yet fully support them - sudo rm -f /opt/cache/bin/cc - sudo rm -f /opt/cache/bin/c++ - sudo rm -f /opt/cache/bin/gcc - sudo rm -f /opt/cache/bin/g++ - pushd /opt/rocm/llvm/bin - if [[ -d original ]]; then - sudo mv original/clang . - sudo mv original/clang++ . 
- fi - sudo rm -rf original - popd -fi diff --git a/.jenkins/caffe2/dirty.sh b/.jenkins/caffe2/dirty.sh deleted file mode 100755 index 6b9ba544dab9..000000000000 --- a/.jenkins/caffe2/dirty.sh +++ /dev/null @@ -1,7 +0,0 @@ -#!/bin/bash -set -ex -upstream="$1" -pr="$2" -git diff --name-only "$upstream" "$pr" -# For safety, unconditionally trigger for any changes. -#git diff --name-only "$upstream" "$pr" | grep -Eq '^(CMakeLists.txt|Makefile|.gitmodules|.jenkins/caffe2|binaries|caffe|caffe2|cmake|conda|docker|docs/caffe2|modules|scripts|third_party)' diff --git a/.jenkins/caffe2/test.sh b/.jenkins/caffe2/test.sh index 3c1f42aa9d64..d245dabda4da 100755 --- a/.jenkins/caffe2/test.sh +++ b/.jenkins/caffe2/test.sh @@ -149,6 +149,9 @@ export DNNL_MAX_CPU_ISA=AVX2 # Should still run even in the absence of SHARD_NUMBER if [[ "${SHARD_NUMBER:-1}" == "1" ]]; then + # TODO(sdym@meta.com) remove this when the linked issue is resolved. + # py is temporary until https://github.com/Teemu/pytest-sugar/issues/241 is fixed + pip install --user py==1.11.0 pip install --user pytest-sugar # NB: Warnings are disabled because they make it harder to see what # the actual erroring test is @@ -173,7 +176,9 @@ fi ############## if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then pip install -q --user --no-use-pep517 "git+https://github.com/pytorch/vision.git@$(cat .github/ci_commit_pins/vision.txt)" - pip install -q --user ninja flatbuffers==2.0 numpy==1.21.5 onnxruntime==1.12.1 + pip install -q --user ninja flatbuffers==2.0 numpy==1.21.5 onnxruntime==1.12.1 beartype==0.10.4 onnx==1.12.0 + # TODO: change this when onnx-script is on TestPyPI + pip install 'onnx-script @ git+https://github.com/microsoft/onnx-script' # numba requires numpy <= 1.20, onnxruntime requires numpy >= 1.21. # We don't actually need it for our tests, but it's imported if it's present, so uninstall. pip uninstall -q --yes numba diff --git a/.jenkins/pytorch/build-asan.sh b/.jenkins/pytorch/build-asan.sh index d46f4bd2a685..91953c322f22 100755 --- a/.jenkins/pytorch/build-asan.sh +++ b/.jenkins/pytorch/build-asan.sh @@ -26,7 +26,7 @@ CC="clang" CXX="clang++" LDSHARED="clang --shared" \ CFLAGS="-fsanitize=address -fsanitize=undefined -fno-sanitize-recover=all -fsanitize-address-use-after-scope -shared-libasan" \ USE_ASAN=1 USE_CUDA=0 USE_MKLDNN=0 \ python setup.py bdist_wheel - python -mpip install dist/*.whl + pip_install_whl "$(echo dist/*.whl)" # Test building via the sdist source tarball python setup.py sdist diff --git a/.jenkins/pytorch/build-tsan.sh b/.jenkins/pytorch/build-tsan.sh new file mode 100755 index 000000000000..e10edb310d81 --- /dev/null +++ b/.jenkins/pytorch/build-tsan.sh @@ -0,0 +1,29 @@ +#!/bin/bash + +# Required environment variable: $BUILD_ENVIRONMENT +# (This is set by default in the Docker images we build, so you don't +# need to set it yourself.)
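+# As a rough illustration (the BUILD_ENVIRONMENT value below is only a hypothetical example), a local run of this script would look something like: +# BUILD_ENVIRONMENT=linux-focal-py3.7-clang7-tsan .jenkins/pytorch/build-tsan.sh +# build.sh execs into this script whenever $BUILD_ENVIRONMENT matches *-clang7-tsan*.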
+ +# shellcheck source=./common.sh +source "$(dirname "${BASH_SOURCE[0]}")/common.sh" +# shellcheck source=./common-build.sh +source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh" + +echo "Clang version:" +clang --version + +python tools/stats/export_test_times.py + +if [ -n "$(which conda)" ]; then + export CMAKE_PREFIX_PATH=/opt/conda +fi + +CC="clang" CXX="clang++" LDSHARED="clang --shared" \ + CFLAGS="-fsanitize=thread" \ + USE_TSAN=1 USE_CUDA=0 USE_MKLDNN=0 \ + python setup.py bdist_wheel + pip_install_whl "$(echo dist/*.whl)" + +print_sccache_stats + +assert_git_not_dirty diff --git a/.jenkins/pytorch/build.sh b/.jenkins/pytorch/build.sh index e258c8e9b6b1..bb7b2c5d03c8 100755 --- a/.jenkins/pytorch/build.sh +++ b/.jenkins/pytorch/build.sh @@ -15,14 +15,12 @@ if [[ "$BUILD_ENVIRONMENT" == *-clang7-asan* ]]; then exec "$(dirname "${BASH_SOURCE[0]}")/build-asan.sh" "$@" fi -if [[ "$BUILD_ENVIRONMENT" == *-mobile-*build* ]]; then - exec "$(dirname "${BASH_SOURCE[0]}")/build-mobile.sh" "$@" +if [[ "$BUILD_ENVIRONMENT" == *-clang7-tsan* ]]; then + exec "$(dirname "${BASH_SOURCE[0]}")/build-tsan.sh" "$@" fi -if [[ "$BUILD_ENVIRONMENT" == *deploy* ]]; then - # Enabling DEPLOY build (embedded torch python interpreter, experimental) - # only on one config for now, can expand later - export USE_DEPLOY=ON +if [[ "$BUILD_ENVIRONMENT" == *-mobile-*build* ]]; then - exec "$(dirname "${BASH_SOURCE[0]}")/build-mobile.sh" "$@" fi echo "Python version:" @@ -43,9 +41,9 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then fi if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then - # enable split torch_cuda build option in CMake - export BUILD_SPLIT_CUDA=ON - if [[ "$BUILD_ENVIRONMENT" != *cuda11.3* ]]; then + if [[ "$BUILD_ENVIRONMENT" != *cuda11.3* && "$BUILD_ENVIRONMENT" != *clang* ]]; then + # TODO: there is a linking issue when building with UCC using clang, + # disable it for now, to be fixed later. export USE_UCC=1 export USE_SYSTEM_UCC=1 fi @@ -62,20 +60,20 @@ elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then export ATEN_THREADING=NATIVE fi -# TODO: Don't run this... -pip_install -r requirements.txt || true - # Enable LLVM dependency for TensorExpr testing -export USE_LLVM=/opt/llvm -export LLVM_DIR=/opt/llvm/lib/cmake/llvm +if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then + export USE_LLVM=/opt/rocm/llvm + export LLVM_DIR=/opt/rocm/llvm/lib/cmake/llvm +else + export USE_LLVM=/opt/llvm + export LLVM_DIR=/opt/llvm/lib/cmake/llvm +fi -# TODO: Don't install this here if ! which conda; then # In ROCm CIs, we are doing cross compilation on build machines with # intel cpu and later run tests on machines with amd cpu. # Also leave out two builds to make sure non-mkldnn builds still work. if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then - pip_install mkl mkl-devel export USE_MKLDNN=1 else export USE_MKLDNN=0 @@ -144,9 +142,9 @@ if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then fi if [[ -n "$CI" && -z "$PYTORCH_ROCM_ARCH" ]]; then - # Set ROCM_ARCH to gfx900 and gfx906 for CI builds, if user doesn't override. - echo "Limiting PYTORCH_ROCM_ARCH to gfx90[06] for CI builds" - export PYTORCH_ROCM_ARCH="gfx900;gfx906" + # Set ROCM_ARCH to gfx906 for CI builds, if user doesn't override.
+ echo "Limiting PYTORCH_ROCM_ARCH to gfx906 for CI builds" + export PYTORCH_ROCM_ARCH="gfx906" fi # hipify sources @@ -161,8 +159,11 @@ if [ -z "$MAX_JOBS" ]; then fi fi -# Target only our CI GPU machine's CUDA arch to speed up the build -export TORCH_CUDA_ARCH_LIST="5.2" +# TORCH_CUDA_ARCH_LIST must be passed from an environment variable +if [[ "$BUILD_ENVIRONMENT" == *cuda* && -z "$TORCH_CUDA_ARCH_LIST" ]]; then + echo "TORCH_CUDA_ARCH_LIST must be defined" + exit 1 +fi if [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then export CC=clang @@ -181,17 +182,8 @@ if [[ "${BUILD_ENVIRONMENT}" == *linux-focal-py3.7-gcc7-build* ]]; then export USE_GLOO_WITH_OPENSSL=ON fi -# TODO: Remove after xenial->focal migration -if [[ "${BUILD_ENVIRONMENT}" == pytorch-linux-xenial-py3* ]]; then - if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then - export BUILD_STATIC_RUNTIME_BENCHMARK=ON - fi -fi - -if [[ "${BUILD_ENVIRONMENT}" == pytorch-linux-focal-py3* ]]; then - if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then - export BUILD_STATIC_RUNTIME_BENCHMARK=ON - fi +if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then + export BUILD_STATIC_RUNTIME_BENCHMARK=ON fi if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then @@ -222,7 +214,7 @@ else else python setup.py bdist_wheel fi - python -mpip install dist/*.whl + pip_install_whl "$(echo dist/*.whl)" # TODO: I'm not sure why, but somehow we lose verbose commands set -x diff --git a/.jenkins/pytorch/common.sh b/.jenkins/pytorch/common.sh index c71acc7e66cf..d8330243db57 100644 --- a/.jenkins/pytorch/common.sh +++ b/.jenkins/pytorch/common.sh @@ -23,28 +23,6 @@ fi # shellcheck disable=SC2034 BUILD_TEST_LIBTORCH=0 -# Use conda cmake in some CI build. Conda cmake will be newer than our supported -# min version (3.5 for xenial and 3.10 for bionic), -# so we only do it in four builds that we know should use conda. -# Linux bionic cannot find conda mkl with cmake 3.10, so we need a cmake from conda. -# Alternatively we could point cmake to the right place -# export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"} -if [[ "${TEST_CONFIG:-}" == *xla* ]] || \ - [[ "$BUILD_ENVIRONMENT" == *centos* ]] || \ - [[ "$BUILD_ENVIRONMENT" == *linux-bionic* ]] || \ - [[ "$BUILD_ENVIRONMENT" == *linux-focal* ]]; then - if ! 
which conda; then - echo "Expected ${BUILD_ENVIRONMENT} to use conda, but 'which conda' returns empty" - exit 1 - else - conda install -q -y cmake - fi - if [[ "$BUILD_ENVIRONMENT" == *centos* ]]; then - # cmake3 package will conflict with conda cmake - sudo yum -y remove cmake3 || true - fi -fi - retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") } diff --git a/.jenkins/pytorch/common_utils.sh b/.jenkins/pytorch/common_utils.sh index 0584ddab9e2a..6d3c96b9278f 100644 --- a/.jenkins/pytorch/common_utils.sh +++ b/.jenkins/pytorch/common_utils.sh @@ -9,6 +9,10 @@ log() { printf '%s\n' "$*"; } error() { log "ERROR: $*" >&2; } fatal() { error "$@"; exit 1; } +retry () { + "$@" || (sleep 10 && "$@") || (sleep 20 && "$@") || (sleep 40 && "$@") +} + # compositional trap taken from https://stackoverflow.com/a/7287873/23845 # appends a command to a trap # @@ -49,6 +53,12 @@ function assert_git_not_dirty() { fi } +function pip_install_whl() { + # This is used to install PyTorch and other build artifacts wheel locally + # without using any network connection + python3 -mpip install --no-index --no-deps "$@" +} + function pip_install() { # retry 3 times # old versions of pip don't have the "--progress-bar" flag @@ -72,12 +82,12 @@ function get_exit_code() { function get_bazel() { if [[ $(uname) == "Darwin" ]]; then # download bazel version - curl https://github.com/bazelbuild/bazel/releases/download/4.2.1/bazel-4.2.1-darwin-x86_64 -Lo tools/bazel + retry curl https://github.com/bazelbuild/bazel/releases/download/4.2.1/bazel-4.2.1-darwin-x86_64 -Lo tools/bazel # verify content echo '74d93848f0c9d592e341e48341c53c87e3cb304a54a2a1ee9cff3df422f0b23c tools/bazel' | shasum -a 256 -c >/dev/null else # download bazel version - curl https://ossci-linux.s3.amazonaws.com/bazel-4.2.1-linux-x86_64 -o tools/bazel + retry curl https://ossci-linux.s3.amazonaws.com/bazel-4.2.1-linux-x86_64 -o tools/bazel # verify content echo '1a4f3a3ce292307bceeb44f459883859c793436d564b95319aacb8af1f20557c tools/bazel' | shasum -a 256 -c >/dev/null fi @@ -95,20 +105,16 @@ function get_pinned_commit() { cat .github/ci_commit_pins/"${1}".txt } -function install_torchvision() { +function install_torchtext() { local commit - commit=$(get_pinned_commit vision) - pip_install --no-use-pep517 --user "git+https://github.com/pytorch/vision.git@${commit}" + commit=$(get_pinned_commit text) + pip_install --no-use-pep517 --user "git+https://github.com/pytorch/text.git@${commit}" } -function checkout_install_torchvision() { +function install_torchvision() { local commit commit=$(get_pinned_commit vision) - git clone https://github.com/pytorch/vision - pushd vision - git checkout "${commit}" - time python setup.py install - popd + pip_install --no-use-pep517 --user "git+https://github.com/pytorch/vision.git@${commit}" } function clone_pytorch_xla() { @@ -117,31 +123,81 @@ function clone_pytorch_xla() { pushd xla # pin the xla hash so that we don't get broken by changes to xla git checkout "$(cat ../.github/ci_commit_pins/xla.txt)" + git submodule sync + git submodule update --init --recursive popd fi } -function install_torchdynamo() { +function install_filelock() { + pip_install filelock +} + +function install_triton() { local commit - commit=$(get_pinned_commit torchdynamo) - pip_install --user "git+https://github.com/pytorch/torchdynamo.git@${commit}" + if [[ "${TEST_CONFIG}" == *rocm* ]]; then + echo "skipping triton due to rocm" + else + commit=$(get_pinned_commit triton) + pip_install --user 
"git+https://github.com/openai/triton@${commit}#subdirectory=python" + pip_install --user jinja2 + fi +} + +function setup_torchdeploy_deps(){ + conda install -y cmake + conda install -y -c conda-forge libpython-static=3.10 + local CC + local CXX + CC="$(which gcc)" + CXX="$(which g++)" + export CC + export CXX + pip install --upgrade pip } -function checkout_install_torchdynamo() { +function checkout_install_torchdeploy() { local commit - commit=$(get_pinned_commit torchdynamo) + setup_torchdeploy_deps pushd .. - git clone https://github.com/pytorch/torchdynamo - pushd torchdynamo - git checkout "${commit}" - time python setup.py develop + git clone --recurse-submodules https://github.com/pytorch/multipy.git + pushd multipy + python multipy/runtime/example/generate_examples.py + pip install -e . --install-option="--cudatests" popd popd } -function install_functorch() { - pushd functorch - time python setup.py develop +function test_torch_deploy(){ + pushd .. + pushd multipy + ./multipy/runtime/build/test_deploy + ./multipy/runtime/build/test_deploy_gpu + popd + popd +} + +function install_huggingface() { + local commit + commit=$(get_pinned_commit huggingface) + pip_install pandas + pip_install scipy + pip_install "git+https://github.com/huggingface/transformers.git@${commit}#egg=transformers" +} + +function install_timm() { + local commit + commit=$(get_pinned_commit timm) + pip_install pandas + pip_install scipy + pip_install "git+https://github.com/rwightman/pytorch-image-models@${commit}" +} + +function checkout_install_torchbench() { + git clone https://github.com/pytorch/benchmark torchbench + pushd torchbench + git checkout no_torchaudio + python install.py popd } diff --git a/.jenkins/pytorch/dirty.sh b/.jenkins/pytorch/dirty.sh deleted file mode 100755 index 230d69606664..000000000000 --- a/.jenkins/pytorch/dirty.sh +++ /dev/null @@ -1,9 +0,0 @@ -#!/bin/bash -set -ex -upstream="$1" -pr="$2" -git diff --name-only "$upstream" "$pr" -# Now that PyTorch build depends on Caffe2, unconditionally trigger -# for any changes. -# TODO: Replace this with a NEGATIVE regex that allows us to skip builds when they are unnecessary -#git diff --name-only "$upstream" "$pr" | grep -Eq '^(aten/|caffe2/|.jenkins/pytorch|docs/(make.bat|Makefile|requirements.txt|source)|mypy|requirements.txt|setup.py|test/|third_party/|tools/|\.gitmodules|torch/)' diff --git a/.jenkins/pytorch/macos-build.sh b/.jenkins/pytorch/macos-build.sh index d40ec521520b..dbba68081d3e 100755 --- a/.jenkins/pytorch/macos-build.sh +++ b/.jenkins/pytorch/macos-build.sh @@ -35,11 +35,13 @@ fi cross_compile_arm64() { # Cross compilation for arm64 - USE_DISTRIBUTED=1 CMAKE_OSX_ARCHITECTURES=arm64 MACOSX_DEPLOYMENT_TARGET=11.0 USE_MKLDNN=OFF USE_QNNPACK=OFF WERROR=1 BUILD_TEST=OFF python setup.py bdist_wheel + # Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests + # that building with USE_DISTRIBUTED=0 works at all. 
See https://github.com/pytorch/pytorch/issues/86448 + USE_DISTRIBUTED=0 CMAKE_OSX_ARCHITECTURES=arm64 MACOSX_DEPLOYMENT_TARGET=11.0 USE_MKLDNN=OFF USE_QNNPACK=OFF WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel } compile_x86_64() { - USE_DISTRIBUTED=1 WERROR=1 python setup.py bdist_wheel + USE_DISTRIBUTED=0 WERROR=1 python setup.py bdist_wheel } build_lite_interpreter() { diff --git a/.jenkins/pytorch/macos-common.sh b/.jenkins/pytorch/macos-common.sh index 4df378d505ec..d1b31ec94188 100755 --- a/.jenkins/pytorch/macos-common.sh +++ b/.jenkins/pytorch/macos-common.sh @@ -7,52 +7,6 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh" sysctl -a | grep machdep.cpu -if [[ ${BUILD_ENVIRONMENT} = *arm64* ]]; then - # We use different versions here as the arm build/tests runs on python 3.9 - # while the x86 one runs on python 3.8 - retry conda install -y \ - numpy=1.22.3 \ - pyyaml=6.0 \ - setuptools=61.2.0 \ - cmake=3.22.1 \ - cffi \ - ninja \ - typing_extensions \ - dataclasses \ - pip -else - # NOTE: mkl 2021.3.0+ cmake requires sub-command PREPEND, may break the build - retry conda install -y \ - mkl=2021.2.0 \ - mkl-include=2021.2.0 \ - numpy=1.18.5 \ - pyyaml=5.3 \ - setuptools=46.0.0 \ - cmake=3.19 \ - cffi \ - ninja \ - typing_extensions \ - dataclasses \ - pip -fi - -# The torch.hub tests make requests to GitHub. -# -# The certifi package from conda-forge is new enough to make the -# following error disappear (included for future reference): -# -# > ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] -# > certificate verify failed: unable to get local issuer certificate -# > (_ssl.c:1056) -# -retry conda install -y -c conda-forge certifi wheel=0.36.2 - -# Needed by torchvision, which is imported from TestHub in test_utils.py. -retry conda install -y pillow - -# Building with USE_DISTRIBUTED=1 requires libuv (for Gloo). -retry conda install -y libuv pkg-config - # These are required for both the build job and the test job. # In the latter to test cpp extensions. 
export MACOSX_DEPLOYMENT_TARGET=10.9 diff --git a/.jenkins/pytorch/macos-test.sh b/.jenkins/pytorch/macos-test.sh index 68f7f2619209..4beab880ddbb 100755 --- a/.jenkins/pytorch/macos-test.sh +++ b/.jenkins/pytorch/macos-test.sh @@ -4,22 +4,6 @@ # shellcheck source=./macos-common.sh source "$(dirname "${BASH_SOURCE[0]}")/macos-common.sh" -conda install -y six -if [[ ${BUILD_ENVIRONMENT} = *arm64* ]]; then - pip install hypothesis "expecttest==0.1.3" "librosa>=0.6.2" "numba==0.56.0" psutil "scipy==1.9.0" -else - pip install hypothesis "expecttest==0.1.3" "librosa>=0.6.2" "numba<=0.49.1" psutil "scipy==1.6.3" -fi - -# TODO move this to docker -# Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014 -pip install "unittest-xml-reporting<=3.2.0,>=2.0.0" \ - pytest \ - pytest-xdist \ - pytest-rerunfailures - # TODO: enable xdoctest later - # xdoctest - if [ -z "${CI}" ]; then rm -rf "${WORKSPACE_DIR}"/miniconda3/lib/python3.6/site-packages/torch* fi @@ -170,14 +154,7 @@ test_jit_hooks() { assert_git_not_dirty } -test_dynamo() { - pushd ../torchdynamo - pytest tests - popd -} - if [[ "${TEST_CONFIG}" == *functorch* ]]; then - install_functorch test_functorch elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then test_python_shard "${SHARD_NUMBER}" @@ -189,11 +166,9 @@ elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then test_custom_backend fi else - checkout_install_torchdynamo test_python_all test_libtorch test_custom_script_ops test_jit_hooks test_custom_backend - test_dynamo fi diff --git a/.jenkins/pytorch/multigpu-test.sh b/.jenkins/pytorch/multigpu-test.sh index d75d701e8e18..9d7efc969823 100755 --- a/.jenkins/pytorch/multigpu-test.sh +++ b/.jenkins/pytorch/multigpu-test.sh @@ -7,12 +7,7 @@ # shellcheck source=./common.sh source "$(dirname "${BASH_SOURCE[0]}")/common.sh" -echo "Testing pytorch (distributed only)" -if [ -n "${CI}" ]; then - # TODO move this to docker - # Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014 - pip_install "unittest-xml-reporting<=3.2.0,>=2.0.0" -fi +echo "Testing pytorch" # Disabling tests to see if they solve timeout issues; see https://github.com/pytorch/pytorch/issues/70015 # python tools/download_mnist.py --quiet -d test/cpp/api/mnist @@ -28,8 +23,8 @@ time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_a # FSDP tests for f in test/distributed/fsdp/*.py ; do time python test/run_test.py --verbose -i "${f#*/}" ; done # ShardedTensor tests -time python test/run_test.py --verbose -i distributed/_shard/checkpoint/test_checkpoint -time python test/run_test.py --verbose -i distributed/_shard/checkpoint/test_file_system_checkpoint +time python test/run_test.py --verbose -i distributed/checkpoint/test_checkpoint +time python test/run_test.py --verbose -i distributed/checkpoint/test_file_system_checkpoint time python test/run_test.py --verbose -i distributed/_shard/sharding_spec/test_sharding_spec time python test/run_test.py --verbose -i distributed/_shard/sharding_plan/test_sharding_plan time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_megatron_prototype @@ -48,4 +43,6 @@ time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/ops/ time python test/run_test.py --verbose -i distributed/_shard/sharded_optim/test_sharded_optim time python test/run_test.py --verbose -i distributed/_shard/test_partial_tensor time python test/run_test.py --verbose -i 
distributed/_shard/test_replicated_tensor +# Other tests +time python test/run_test.py --verbose -i test_cuda_primary_ctx assert_git_not_dirty diff --git a/.jenkins/pytorch/test.sh b/.jenkins/pytorch/test.sh index 9c767500477c..ca50a31beb60 100755 --- a/.jenkins/pytorch/test.sh +++ b/.jenkins/pytorch/test.sh @@ -15,6 +15,45 @@ BUILD_DIR="build" BUILD_RENAMED_DIR="build_renamed" BUILD_BIN_DIR="$BUILD_DIR"/bin +export VALGRIND=ON +if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then + # clang9 appears to miscompile code involving c10::optional<c10::SymInt>, + # such that valgrind complains along these lines: + # + # Conditional jump or move depends on uninitialised value(s) + # at 0x40303A: ~optional_base (Optional.h:281) + # by 0x40303A: call (Dispatcher.h:448) + # by 0x40303A: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) (basic.cpp:10) + # by 0x403700: main (basic.cpp:16) + # Uninitialised value was created by a stack allocation + # at 0x402AAA: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) (basic.cpp:6) + # + # The problem does not appear with gcc or newer versions of clang (we tested + # clang14). So we suppress valgrind testing for clang9 specifically. + # You may need to suppress it for other versions of clang if they still have + # the bug. + # + # A minimal repro for the valgrind error is below: + # + # #include <ATen/ATen.h> + # #include <ATen/core/dispatch/Dispatcher.h> + # + # using namespace at; + # + # Tensor call(const at::Tensor & self, c10::SymIntArrayRef size, c10::SymIntArrayRef stride, c10::optional<c10::SymInt> storage_offset) { + # auto op = c10::Dispatcher::singleton() + # .findSchemaOrThrow(at::_ops::as_strided::name, at::_ops::as_strided::overload_name) + # .typed<at::_ops::as_strided::schema>(); + # return op.call(self, size, stride, storage_offset); + # } + # + # int main(int argv) { + # Tensor b = empty({3, 4}); + # auto z = call(b, b.sym_sizes(), b.sym_strides(), c10::nullopt); + # } + export VALGRIND=OFF +fi + # Get fully qualified path using realpath if [[ "$BUILD_ENVIRONMENT" != *bazel* ]]; then CUSTOM_TEST_ARTIFACT_BUILD_DIR=$(realpath "${CUSTOM_TEST_ARTIFACT_BUILD_DIR:-"build/custom_test_artifacts"}") @@ -58,10 +97,6 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* || "$BUILD_ENVIRONMENT" == *rocm* ]]; then export PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" fi -if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then - export BUILD_SPLIT_CUDA=ON -fi - if [[ "$TEST_CONFIG" == *crossref* ]]; then export PYTORCH_TEST_WITH_CROSSREF=1 fi @@ -70,12 +105,8 @@ if [[ "$TEST_CONFIG" == *dynamo* ]]; then export PYTORCH_TEST_WITH_DYNAMO=1 fi -# TODO: this condition is never true, need to fix this. -if [[ -n "$PR_NUMBER" ]] && [[ -z "$CI_MASTER" || "$CI_MASTER" == "false" ]]; then - # skip expensive checks when on PR and CI_MASTER flag is not set - export PYTORCH_TEST_SKIP_CUDA_MEM_LEAK_CHECK=1 -else - export PYTORCH_TEST_SKIP_CUDA_MEM_LEAK_CHECK=0 +if [[ "$TEST_CONFIG" == *inductor* ]]; then + export PYTORCH_TEST_WITH_INDUCTOR=1 fi if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then @@ -86,7 +117,7 @@ fi if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then # JIT C++ extensions require ninja. - pip_install --user ninja + pip_install --user "ninja==1.10.2" # ninja is installed in $HOME/.local/bin, e.g., /var/lib/jenkins/.local/bin for CI user jenkins # but this script should be runnable by any user, including root export PATH="$HOME/.local/bin:$PATH" @@ -96,9 +127,8 @@ fi # if you're not careful.
Check this if you made some changes and the # ASAN test is not working if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then - # Suppress vptr violations arising from multiple copies of pybind11 export ASAN_OPTIONS=detect_leaks=0:symbolize=1:detect_stack_use_after_return=1:strict_init_order=true:detect_odr_violation=0 - export UBSAN_OPTIONS=print_stacktrace=1:suppressions=$PWD/ubsan.supp + export UBSAN_OPTIONS=print_stacktrace=1 export PYTORCH_TEST_WITH_ASAN=1 export PYTORCH_TEST_WITH_UBSAN=1 # TODO: Figure out how to avoid hard-coding these paths @@ -141,12 +171,17 @@ if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then ulimit -s 81920 (cd test && python -c "import torch; print(torch.__version__, torch.version.git_version)") - echo "The next three invocations are expected to crash; if they don't that means ASAN/UBSAN is misconfigured" + echo "The next four invocations are expected to crash; if they don't that means ASAN/UBSAN is misconfigured" (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_csrc_asan(3)") (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_csrc_ubsan(0)") + (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_vptr_ubsan()") (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_aten_asan(3)") fi +if [[ "$BUILD_ENVIRONMENT" == *-tsan* ]]; then + export PYTORCH_TEST_WITH_TSAN=1 +fi + if [[ $TEST_CONFIG == 'nogpu_NO_AVX2' ]]; then export ATEN_CPU_CAPABILITY=default elif [[ $TEST_CONFIG == 'nogpu_AVX512' ]]; then @@ -163,7 +198,9 @@ test_python_shard() { echo "NUM_TEST_SHARDS must be defined to run a Python test shard" exit 1 fi + time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard "$1" "$NUM_TEST_SHARDS" --verbose + assert_git_not_dirty } @@ -192,16 +229,84 @@ test_dynamo_shard() { test_reductions \ test_namedtensor \ test_namedtuple_return_api \ - test_profiler \ - test_profiler_tree \ + profiler/test_profiler \ + profiler/test_profiler_tree \ test_overrides \ test_python_dispatch \ test_fx \ + test_package \ + test_vmap \ --shard "$1" "$NUM_TEST_SHARDS" \ --verbose assert_git_not_dirty } +test_inductor_distributed() { + # this runs on both single-gpu and multi-gpu instances. It should be smart about skipping tests that aren't supported + # if the required number of GPUs isn't available + PYTORCH_TEST_WITH_INDUCTOR=0 python test/run_test.py --include distributed/test_dynamo_distributed --verbose + assert_git_not_dirty +} + +test_inductor() { + python test/run_test.py --include test_modules test_ops --verbose + PYTORCH_TEST_WITH_INDUCTOR=0 python test/run_test.py --include inductor/test_torchinductor --include inductor/test_torchinductor_opinfo --verbose + # TODO: investigate "RuntimeError: CUDA driver API confirmed a leak" + # seen in test_ops_gradients.py + # pytest test/test_ops_gradients.py --verbose -k "not _complex and not test_inplace_grad_acos_cuda_float64" +} + +test_inductor_huggingface() { + # Use test-reports directory under test folder will allow the CI to automatically pick up + # the test reports and upload them to S3.
Need to use full path here otherwise the script + # will bark about file not found later on + TEST_REPORTS_DIR=$(pwd)/test/test-reports + mkdir -p "$TEST_REPORTS_DIR" + # Check inference with --float32 + python benchmarks/dynamo/huggingface.py --ci --accuracy \ + --device cuda --inductor --float32 --output "$TEST_REPORTS_DIR"/inductor_inference_huggingface.csv + python benchmarks/dynamo/check_csv.py -f "$TEST_REPORTS_DIR"/inductor_inference_huggingface.csv + # Check training with --amp + python benchmarks/dynamo/huggingface.py --ci --training --accuracy \ + --device cuda --inductor --amp --output "$TEST_REPORTS_DIR"/inductor_training_huggingface.csv + python benchmarks/dynamo/check_csv.py -f "$TEST_REPORTS_DIR"/inductor_training_huggingface.csv +} + +test_inductor_timm_shard() { + if [[ -z "$NUM_TEST_SHARDS" ]]; then + echo "NUM_TEST_SHARDS must be defined to run a Python test shard" + exit 1 + fi + # Use test-reports directory under test folder will allow the CI to automatically pick up + # the test reports and upload them to S3. Need to use full path here otherwise the script + # will bark about file not found later on + TEST_REPORTS_DIR=$(pwd)/test/test-reports + mkdir -p "$TEST_REPORTS_DIR" + # Check inference with --float32 + python benchmarks/dynamo/timm_models.py --ci --accuracy \ + --device cuda --inductor --float32 --total-partitions 2 --partition-id "$1" \ + --output "$TEST_REPORTS_DIR"/inductor_inference_timm_"$1".csv + python benchmarks/dynamo/check_csv.py -f "$TEST_REPORTS_DIR"/inductor_inference_timm_"$1".csv + # Check training with --amp + python benchmarks/dynamo/timm_models.py --ci --training --accuracy \ + --device cuda --inductor --amp --total-partitions 2 --partition-id "$1" \ + --output "$TEST_REPORTS_DIR"/inductor_training_timm_"$1".csv + python benchmarks/dynamo/check_csv.py -f "$TEST_REPORTS_DIR"/inductor_training_timm_"$1".csv +} + +test_inductor_torchbench() { + TEST_REPORTS_DIR=$(pwd)/test/test-reports + mkdir -p "$TEST_REPORTS_DIR" + # Check inference with --float32 + PYTHONPATH=$(pwd)/torchbench python benchmarks/dynamo/torchbench.py --ci --accuracy \ + --device cuda --inductor --float32 --output "$TEST_REPORTS_DIR"/inductor_inference_torchbench.csv + python benchmarks/dynamo/check_csv.py -f "$TEST_REPORTS_DIR"/inductor_inference_torchbench.csv + # Check training with --amp + PYTHONPATH=$(pwd)/torchbench python benchmarks/dynamo/torchbench.py --ci --training --accuracy \ + --device cuda --inductor --amp --output "$TEST_REPORTS_DIR"/inductor_training_torchbench.csv + python benchmarks/dynamo/check_csv.py -f "$TEST_REPORTS_DIR"/inductor_training_torchbench.csv +} + test_python_gloo_with_tls() { source "$(dirname "${BASH_SOURCE[0]}")/run_glootls_test.sh" assert_git_not_dirty @@ -290,8 +395,11 @@ test_libtorch() { TEST_REPORTS_DIR=test/test-reports/cpp-unittest/test_libtorch mkdir -p $TEST_REPORTS_DIR - # Run JIT cpp tests - python test/cpp/jit/tests_setup.py setup + if [[ "$BUILD_ENVIRONMENT" != *-tsan* ]]; then + # Run JIT cpp tests + python test/cpp/jit/tests_setup.py setup + fi + if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then "$TORCH_BIN_DIR"/test_jit --gtest_output=xml:$TEST_REPORTS_DIR/test_jit.xml else @@ -305,19 +413,19 @@ test_libtorch() { "$TORCH_BIN_DIR"/test_lazy --gtest_output=xml:$TEST_REPORTS_DIR/test_lazy.xml fi - python test/cpp/jit/tests_setup.py shutdown + if [[ "$BUILD_ENVIRONMENT" != *-tsan* ]]; then + python test/cpp/jit/tests_setup.py shutdown + fi + # Wait for background download to finish wait # Exclude IMethodTest that relies on 
torch::deploy, which will instead be ran in test_deploy. OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="test/cpp/api/mnist" "$TORCH_BIN_DIR"/test_api --gtest_filter='-IMethodTest.*' --gtest_output=xml:$TEST_REPORTS_DIR/test_api.xml "$TORCH_BIN_DIR"/test_tensorexpr --gtest_output=xml:$TEST_REPORTS_DIR/test_tensorexpr.xml - # TODO: this condition is never (BUILD_ENVIRONMENT doesn't start with pytorch-), need to fix this. - if [[ "${BUILD_ENVIRONMENT}" == pytorch-linux-xenial-py3* ]]; then - if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* && "${BUILD_ENVIRONMENT}" != *asan* ]]; then - # TODO: Consider to run static_runtime_test from $TORCH_BIN_DIR (may need modify build script) - "$BUILD_BIN_DIR"/static_runtime_test --gtest_output=xml:$TEST_REPORTS_DIR/static_runtime_test.xml - fi + if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* && "${BUILD_ENVIRONMENT}" != *asan* ]]; then + # TODO: Consider to run static_runtime_test from $TORCH_BIN_DIR (may need modify build script) + "$BUILD_BIN_DIR"/static_runtime_test --gtest_output=xml:$TEST_REPORTS_DIR/static_runtime_test.xml fi assert_git_not_dirty fi @@ -325,6 +433,14 @@ test_libtorch() { test_aot_compilation() { echo "Testing Ahead of Time compilation" + ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR" + ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR" + + # Make test_reports directory + # NB: the ending test_libtorch must match the current function name for the current + # test reporting process (in print_test_stats.py) to function as expected. + TEST_REPORTS_DIR=test/test-reports/cpp-unittest/test_aot_compilation + mkdir -p $TEST_REPORTS_DIR if [ -f "$TORCH_BIN_DIR"/test_mobile_nnc ]; then "$TORCH_BIN_DIR"/test_mobile_nnc --gtest_output=xml:$TEST_REPORTS_DIR/test_mobile_nnc.xml; fi # shellcheck source=test/mobile/nnc/test_aot_compile.sh if [ -f "$TORCH_BIN_DIR"/aot_model_compiler_test ]; then source test/mobile/nnc/test_aot_compile.sh; fi @@ -457,6 +573,11 @@ build_xla() { clone_pytorch_xla # shellcheck disable=SC1091 source "xla/.circleci/common.sh" + + # TODO: The torch pin #73164 is involved in the sev https://github.com/pytorch/pytorch/issues/86093 + # so this is temporarily removed until XLA fixes the weird logic in https://github.com/pytorch/xla/blob/master/scripts/apply_patches.sh#L17-L18 + rm "${XLA_DIR}/torch_patches/.torch_pin" || true + apply_patches SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')" # These functions are defined in .circleci/common.sh in pytorch/xla repo @@ -593,39 +714,19 @@ test_vec256() { fi } -test_dynamo() { - pushd ../torchdynamo - pytest tests - popd -} - -test_torch_deploy() { - python torch/csrc/deploy/example/generate_examples.py - ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR" - ln -sf "$TORCH_LIB_DIR"/libshm* "$TORCH_BIN_DIR" - ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR" - "$TORCH_BIN_DIR"/test_deploy - "$TORCH_BIN_DIR"/test_deploy_gpu - assert_git_not_dirty -} - test_docs_test() { .jenkins/pytorch/docs-test.sh } -if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then +if ! 
[[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* || "${BUILD_ENVIRONMENT}" == *-tsan* ]]; then (cd test && python -c "import torch; print(torch.__config__.show())") (cd test && python -c "import torch; print(torch.__config__.parallel_info())") fi -if [[ "${TEST_CONFIG}" == *deploy* ]]; then - install_torchdynamo - test_torch_deploy -elif [[ "${TEST_CONFIG}" == *backward* ]]; then +if [[ "${TEST_CONFIG}" == *backward* ]]; then test_forward_backward_compatibility # Do NOT add tests after bc check tests, see its comment. elif [[ "${TEST_CONFIG}" == *xla* ]]; then install_torchvision - install_torchdynamo build_xla test_xla elif [[ "$TEST_CONFIG" == 'jit_legacy' ]]; then @@ -634,32 +735,67 @@ elif [[ "${BUILD_ENVIRONMENT}" == *libtorch* ]]; then # TODO: run some C++ tests echo "no-op at the moment" elif [[ "$TEST_CONFIG" == distributed ]]; then - install_torchdynamo + install_filelock + install_triton test_distributed # Only run RPC C++ tests on the first shard if [[ "${SHARD_NUMBER}" == 1 ]]; then test_rpc fi +elif [[ "$TEST_CONFIG" == deploy ]]; then + checkout_install_torchdeploy + test_torch_deploy +elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then + install_filelock + install_triton + install_huggingface + test_inductor_distributed elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then test_without_numpy install_torchvision - install_torchdynamo + install_triton test_dynamo_shard 1 test_aten elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then install_torchvision - checkout_install_torchdynamo + install_filelock + install_triton test_dynamo_shard 2 - test_dynamo +elif [[ "${TEST_CONFIG}" == *inductor_huggingface* ]]; then + install_torchvision + install_filelock + install_triton + install_huggingface + test_inductor_huggingface +elif [[ "${TEST_CONFIG}" == *inductor_timm* && $NUM_TEST_SHARDS -gt 1 ]]; then + install_torchvision + install_filelock + install_triton + install_timm + id=$((SHARD_NUMBER-1)) + test_inductor_timm_shard $id +elif [[ "${TEST_CONFIG}" == *inductor_torchbench* ]]; then + install_torchtext + install_torchvision + install_filelock + install_triton + checkout_install_torchbench + test_inductor_torchbench +elif [[ "${TEST_CONFIG}" == *inductor* && "${SHARD_NUMBER}" == 1 ]]; then + install_torchvision + install_filelock + install_triton + test_inductor + test_inductor_distributed elif [[ "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then test_without_numpy install_torchvision - install_torchdynamo + install_triton test_python_shard 1 test_aten elif [[ "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then install_torchvision - checkout_install_torchdynamo + install_triton test_python_shard 2 test_libtorch test_aot_compilation @@ -668,7 +804,7 @@ elif [[ "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then test_torch_function_benchmark elif [[ "${SHARD_NUMBER}" -gt 2 ]]; then # Handle arbitrary number of shards - install_torchdynamo + install_triton test_python_shard "$SHARD_NUMBER" elif [[ "${BUILD_ENVIRONMENT}" == *vulkan* ]]; then test_vulkan @@ -676,14 +812,17 @@ elif [[ "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then test_bazel elif [[ "${BUILD_ENVIRONMENT}" == *-mobile-lightweight-dispatch* ]]; then test_libtorch +elif [[ "${BUILD_ENVIRONMENT}" == *-tsan* ]]; then + # TODO: TSAN check is currently failing with 415 data race warnings. 
This will + # be addressed later, the first PR can be merged first to setup the CI jobs + test_libtorch || true elif [[ "${TEST_CONFIG}" = docs_test ]]; then test_docs_test elif [[ "${TEST_CONFIG}" == *functorch* ]]; then - install_functorch test_functorch else install_torchvision - install_torchdynamo + install_triton install_monkeytype test_python test_aten diff --git a/.jenkins/pytorch/win-test-helpers/build_pytorch.bat b/.jenkins/pytorch/win-test-helpers/build_pytorch.bat index 7edeca96ed8d..da28956cae97 100644 --- a/.jenkins/pytorch/win-test-helpers/build_pytorch.bat +++ b/.jenkins/pytorch/win-test-helpers/build_pytorch.bat @@ -135,16 +135,17 @@ if "%REBUILD%" == "" ( if not errorlevel 0 exit /b ) ) -:: tests if BUILD_ENVIRONMENT contains cuda11 as a substring -if not x%BUILD_ENVIRONMENT:cuda11=%==x%BUILD_ENVIRONMENT% ( - set BUILD_SPLIT_CUDA=ON -) -python setup.py install --cmake && sccache --show-stats && ( +python setup.py bdist_wheel +if errorlevel 1 exit /b +if not errorlevel 0 exit /b +sccache --show-stats +python -c "import os, glob; os.system('python -mpip install ' + glob.glob('dist/*.whl')[0] + '[opt-einsum]')" +( if "%BUILD_ENVIRONMENT%"=="" ( echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3` in Command Prompt before running Git Bash. ) else ( - 7z a %TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torchgen %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\caffe2 && copy /Y "%TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z" "%PYTORCH_FINAL_PACKAGE_DIR%\" + 7z a %TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torchgen %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\functorch && copy /Y "%TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z" "%PYTORCH_FINAL_PACKAGE_DIR%\" if errorlevel 1 exit /b if not errorlevel 0 exit /b diff --git a/.jenkins/pytorch/win-test-helpers/install_test_functorch.bat b/.jenkins/pytorch/win-test-helpers/install_test_functorch.bat index 7679bffbc70e..d06d46f3dd22 100644 --- a/.jenkins/pytorch/win-test-helpers/install_test_functorch.bat +++ b/.jenkins/pytorch/win-test-helpers/install_test_functorch.bat @@ -6,15 +6,6 @@ if not errorlevel 0 ( exit /b ) -pushd functorch -echo "Install functorch" -:: --no-deps because for some reason, on windows, `torch` isn't found in -:: `pip list` despite being installed. With just `python setup.py develop`, -:: setuptools explicitly checks for the existence of torch and can't find it. 
-python setup.py develop --no-deps -popd -if ERRORLEVEL 1 goto fail - echo "Installing test dependencies" pip install networkx if errorlevel 1 exit /b diff --git a/.jenkins/pytorch/win-test-helpers/installation-helpers/activate_miniconda3.bat b/.jenkins/pytorch/win-test-helpers/installation-helpers/activate_miniconda3.bat index e6660a17b389..0552d85a407a 100644 --- a/.jenkins/pytorch/win-test-helpers/installation-helpers/activate_miniconda3.bat +++ b/.jenkins/pytorch/win-test-helpers/installation-helpers/activate_miniconda3.bat @@ -13,7 +13,7 @@ if not exist %CONDA_PARENT_DIR%\Miniconda3 ( ) if "%INSTALL_FRESH_CONDA%"=="1" ( - curl --retry 3 -k https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe --output %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe + curl --retry 3 --retry-all-errors -k https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe --output %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe if errorlevel 1 exit /b if not errorlevel 0 exit /b diff --git a/.jenkins/pytorch/win-test-helpers/installation-helpers/install_magma.bat b/.jenkins/pytorch/win-test-helpers/installation-helpers/install_magma.bat index d9f3ab1cf821..d0fbf5b20d88 100644 --- a/.jenkins/pytorch/win-test-helpers/installation-helpers/install_magma.bat +++ b/.jenkins/pytorch/win-test-helpers/installation-helpers/install_magma.bat @@ -24,7 +24,7 @@ if "%CUDA_SUFFIX%" == "" ( if "%REBUILD%"=="" ( if "%BUILD_ENVIRONMENT%"=="" ( - curl --retry 3 -k https://s3.amazonaws.com/ossci-windows/magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --output %TMP_DIR_WIN%\magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z + curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --output %TMP_DIR_WIN%\magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z ) else ( aws s3 cp s3://ossci-windows/magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z %TMP_DIR_WIN%\magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --quiet ) diff --git a/.jenkins/pytorch/win-test-helpers/installation-helpers/install_mkl.bat b/.jenkins/pytorch/win-test-helpers/installation-helpers/install_mkl.bat index c700a04a1e4a..6c676d1baede 100644 --- a/.jenkins/pytorch/win-test-helpers/installation-helpers/install_mkl.bat +++ b/.jenkins/pytorch/win-test-helpers/installation-helpers/install_mkl.bat @@ -1,6 +1,6 @@ if "%REBUILD%"=="" ( if "%BUILD_ENVIRONMENT%"=="" ( - curl --retry 3 -k https://s3.amazonaws.com/ossci-windows/mkl_2020.2.254.7z --output %TMP_DIR_WIN%\mkl.7z + curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/mkl_2020.2.254.7z --output %TMP_DIR_WIN%\mkl.7z ) else ( aws s3 cp s3://ossci-windows/mkl_2020.2.254.7z %TMP_DIR_WIN%\mkl.7z --quiet ) diff --git a/.jenkins/pytorch/win-test-helpers/installation-helpers/install_sccache.bat b/.jenkins/pytorch/win-test-helpers/installation-helpers/install_sccache.bat index 0165604400dd..6f8cc15ba868 100644 --- a/.jenkins/pytorch/win-test-helpers/installation-helpers/install_sccache.bat +++ b/.jenkins/pytorch/win-test-helpers/installation-helpers/install_sccache.bat @@ -7,8 +7,8 @@ if "%REBUILD%"=="" ( del %TMP_DIR_WIN%\bin\sccache.exe || ver > nul del %TMP_DIR_WIN%\bin\sccache-cl.exe || ver > nul if "%BUILD_ENVIRONMENT%"=="" ( - curl --retry 3 -k https://s3.amazonaws.com/ossci-windows/sccache.exe --output %TMP_DIR_WIN%\bin\sccache.exe - curl --retry 3 -k https://s3.amazonaws.com/ossci-windows/sccache-cl.exe --output %TMP_DIR_WIN%\bin\sccache-cl.exe + curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/sccache.exe 
--output %TMP_DIR_WIN%\bin\sccache.exe + curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/sccache-cl.exe --output %TMP_DIR_WIN%\bin\sccache-cl.exe ) else ( aws s3 cp s3://ossci-windows/sccache.exe %TMP_DIR_WIN%\bin\sccache.exe aws s3 cp s3://ossci-windows/sccache-cl.exe %TMP_DIR_WIN%\bin\sccache-cl.exe diff --git a/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat b/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat index c598a04e0f97..29c213ad4246 100644 --- a/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat +++ b/.jenkins/pytorch/win-test-helpers/setup_pytorch_env.bat @@ -36,8 +36,7 @@ popd ======= :: Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014 -pip install "ninja==1.10.0.post1" future "hypothesis==5.35.1" "expecttest==0.1.3" "librosa>=0.6.2" "scipy==1.6.3" psutil pillow "unittest-xml-reporting<=3.2.0,>=2.0.0" pytest pytest-xdist pytest-rerunfailures -:: # TODO: enable xdoctest later +pip install "ninja==1.10.0.post1" future "hypothesis==5.35.1" "expecttest==0.1.3" "librosa>=0.6.2" "scipy==1.6.3" psutil pillow "unittest-xml-reporting<=3.2.0,>=2.0.0" pytest pytest-xdist pytest-shard pytest-rerunfailures sympy "xdoctest==1.0.2" "pygments==2.12.0" "opt-einsum>=3.3" if errorlevel 1 exit /b if not errorlevel 0 exit /b diff --git a/.jenkins/pytorch/win-test.sh b/.jenkins/pytorch/win-test.sh index dc2852120487..560b039dbf67 100755 --- a/.jenkins/pytorch/win-test.sh +++ b/.jenkins/pytorch/win-test.sh @@ -39,10 +39,6 @@ fi export SCRIPT_HELPERS_DIR=$SCRIPT_PARENT_DIR/win-test-helpers -if [[ "${BUILD_ENVIRONMENT}" == *cuda11* ]]; then - export BUILD_SPLIT_CUDA=ON -fi - if [[ "$TEST_CONFIG" = "force_on_cpu" ]]; then # run the full test suite for force_on_cpu test export USE_CUDA=0 diff --git a/.lintrunner.toml b/.lintrunner.toml index 4c206c5fc744..34b673c7e09a 100644 --- a/.lintrunner.toml +++ b/.lintrunner.toml @@ -99,6 +99,8 @@ include_patterns = [ exclude_patterns = [ 'torch/include/**', 'torch/csrc/**', + 'torch/_dynamo/**/*.py', + 'torch/_inductor/**/*.py', 'torch/distributed/elastic/agent/server/api.py', 'torch/testing/_internal/**', 'torch/distributed/fsdp/fully_sharded_data_parallel.py', @@ -154,6 +156,7 @@ include_patterns = [ exclude_patterns = [ # (linbinyu) copied from internal repo 'tools/code_analyzer/gen_operators_yaml.py', + 'tools/dynamo/verify_dynamo.py', 'tools/gen_vulkan_spv.py', 'tools/test/gen_operators_yaml_test.py', 'tools/test/gen_oplist_test.py', @@ -170,7 +173,6 @@ command = [ [[linter]] code = 'CLANGTIDY' include_patterns = [ - 'torch/csrc/deploy/**/*.cpp', 'torch/csrc/fx/**/*.cpp', 'torch/csrc/generic/**/*.cpp', 'torch/csrc/onnx/**/*.cpp', @@ -183,7 +185,6 @@ exclude_patterns = [ # FunctionsManual.cpp is excluded to keep this diff clean. It will be fixed # in a follow up PR. # /torch/csrc/generic/*.cpp is excluded because those files aren't actually built. 
- # deploy/interpreter files are excluded due to using macros and other techniquies # that are not easily converted to accepted c++ 'torch/csrc/jit/passes/onnx/helper.cpp', 'torch/csrc/jit/passes/onnx/shape_type_inference.cpp', @@ -197,11 +198,6 @@ exclude_patterns = [ 'torch/csrc/autograd/FunctionsManual.cpp', 'torch/csrc/generic/*.cpp', 'torch/csrc/jit/codegen/cuda/runtime/*', - 'torch/csrc/deploy/interactive_embedded_interpreter.cpp', - 'torch/csrc/deploy/interpreter/**', - 'torch/csrc/deploy/test_deploy_python_ext.cpp', - 'torch/csrc/deploy/test_deploy_missing_interpreter.cpp', - 'torch/csrc/deploy/test_deploy_gpu.cpp', 'torch/csrc/utils/disable_torch_function.cpp', ] init_command = [ @@ -293,8 +289,10 @@ include_patterns=['**'] exclude_patterns=[ '**/contrib/**', 'third_party/**', + '**/*.bat', '**/*.expect', '**/*.ipynb', + '**/*.ps1', '**/*.ptl', 'tools/clang_format_hash/**', 'test/cpp/jit/upgrader_models/*.ptl', @@ -339,6 +337,7 @@ include_patterns = ['**'] exclude_patterns = [ '**/*.svg', '**/*Makefile', + '**/*Makefile_dashboard', '**/contrib/**', 'third_party/**', '**/.gitattributes', @@ -373,6 +372,7 @@ include_patterns = [ exclude_patterns = [ 'aten/src/ATen/native/quantized/cpu/qnnpack/**', 'aten/src/ATen/native/vulkan/api/vk_mem_alloc.h', + 'aten/src/ATen/native/vulkan/glsl/**', 'torch/csrc/jit/serialization/mobile_bytecode_generated.h', ] command = [ @@ -422,6 +422,35 @@ command = [ '@{{PATHSFILE}}' ] +[[linter]] +code = 'ERROR_PRONE_ISINSTANCE' +include_patterns = [ + 'torch/_refs/**/*.py', + 'torch/_prims/**/*.py', + 'torch/_prims_common/**/*.py', + 'torch/_decomp/**/*.py', + 'torch/_meta_registrations.py', +] +command = [ + 'python3', + 'tools/linter/adapters/grep_linter.py', + '--pattern=isinstance\([^)]+(int|float)\)', + '--linter-name=ERROR_PRONE_ISINSTANCE', + '--error-name=error prone isinstance', + """--error-description=\ + This line has an isinstance call that directly refers to \ + int or float. This is error-prone because you may also \ + have wanted to allow SymInt or SymFloat in your test. \ + To suppress this lint, use an appropriate type alias defined \ + in torch._prims_common; use IntLike/FloatLike when you would accept \ + both regular and symbolic numbers, Dim for ints representing \ + dimensions, or IntWithoutSymInt/FloatWithoutSymFloat if you really \ + meant to exclude symbolic numbers. 
+ """, + '--', + '@{{PATHSFILE}}' +] + [[linter]] code = 'PYBIND11_SPECIALIZATION' include_patterns = [ @@ -710,6 +739,11 @@ include_patterns = [ 'test/onnx/**/*.py', 'test/test_dynamo_cudagraphs.py', 'tools/**/*.py', + 'torch/_dynamo/**/*.py', + 'test/dynamo/**/*.py', + 'benchmarks/dynamo/**/*.py', + 'torch/_inductor/**/*.py', + 'test/inductor/**/*.py', 'torch/onnx/**/*.py', 'torch/package/**/*.py', 'torch/_decomp/**/*.py', @@ -719,6 +753,7 @@ include_patterns = [ 'torch/_refs/**/*.py', 'torch/_subclasses/**/*.py', 'torch/_*.py', + 'torch/testing/_internal/opinfo/**/*.py', 'torchgen/**/*.py', 'functorch/functorch/_src/aot_autograd.py', 'functorch/functorch/_src/compilers.py', @@ -737,6 +772,7 @@ init_command = [ 'python3', 'tools/linter/adapters/pip_init.py', '--dry-run={{DRYRUN}}', + '--no-black-binary', 'black==22.3.0', 'ufmt==1.3.3', 'usort==1.0.2', diff --git a/BUILD.bazel b/BUILD.bazel index 4c0791bffbb4..172a31723a0b 100644 --- a/BUILD.bazel +++ b/BUILD.bazel @@ -28,6 +28,10 @@ COMMON_COPTS = [ ] + if_cuda([ "-DUSE_CUDA", "-DUSE_CUDNN", + # TODO: This should be passed only when building for CUDA-11.5 or newer + # use cub in a safe manner, see: + # https://github.com/pytorch/pytorch/pull/55292 + "-DCUB_WRAPPED_NAMESPACE=at_cuda_detail", ]) aten_generation_srcs = ["aten/src/ATen/native/native_functions.yaml"] + ["aten/src/ATen/native/tags.yaml"] + glob(["aten/src/ATen/templates/**"]) @@ -47,6 +51,7 @@ generated_cpu_cpp = [ "aten/src/ATen/RegisterSparseCsrCPU.cpp", "aten/src/ATen/RegisterZeroTensor.cpp", "aten/src/ATen/RegisterCompositeImplicitAutograd.cpp", + "aten/src/ATen/RegisterCompositeImplicitAutogradNestedTensor.cpp", "aten/src/ATen/RegisterCompositeExplicitAutograd.cpp", "aten/src/ATen/RegisterCompositeExplicitAutogradNonFunctional.cpp", "aten/src/ATen/RegisterMeta.cpp", @@ -62,6 +67,8 @@ generated_cpu_cpp = [ "aten/src/ATen/CompositeExplicitAutogradNonFunctionalFunctions_inl.h", "aten/src/ATen/CompositeImplicitAutogradFunctions.h", "aten/src/ATen/CompositeImplicitAutogradFunctions_inl.h", + "aten/src/ATen/CompositeImplicitAutogradNestedTensorFunctions.h", + "aten/src/ATen/CompositeImplicitAutogradNestedTensorFunctions_inl.h", "aten/src/ATen/CompositeViewCopyKernels.cpp", "aten/src/ATen/FunctionalInverses.h", "aten/src/ATen/Functions.h", @@ -126,6 +133,7 @@ filegroup( name = "aten_base_cpp", srcs = glob([ "aten/src/ATen/*.cpp", + "aten/src/ATen/functorch/*.cpp", "aten/src/ATen/detail/*.cpp", "aten/src/ATen/cpu/*.cpp", ]), @@ -421,6 +429,7 @@ cu_library( "@cuda//:cublas", "@cuda//:cufft", "@cuda//:cusparse", + "@cutlass", ], alwayslink = True, ) @@ -1665,6 +1674,7 @@ cc_library( ] + if_cuda([ ":torch_distributed_cuda", "@cuda//:nvToolsExt", + "@cutlass", ]), alwayslink = True, ) @@ -1740,10 +1750,28 @@ cc_library( # Torch integration tests rely on a labeled data set from the MNIST database. # http://yann.lecun.com/exdb/mnist/ -# imethod.cpp is excluded since torch/csrc/deploy* build is not yet supported. 
cpp_api_tests = glob( ["test/cpp/api/*.cpp"], - exclude = ["test/cpp/api/imethod.cpp"], + exclude = [ + "test/cpp/api/imethod.cpp", + "test/cpp/api/integration.cpp", + ], +) + +cc_test( + name = "integration_test", + size = "medium", + srcs = ["test/cpp/api/integration.cpp"], + deps = [ + ":test_support", + "@com_google_googletest//:gtest_main", + ], + tags = [ + "gpu-required", + ], + data = [ + ":download_mnist", + ], ) [ @@ -1885,3 +1913,15 @@ test_suite( "torch/csrc/lazy/ts_backend/ts_native_functions.cpp", ] ] + +genrule( + name = "download_mnist", + srcs = ["//:tools/download_mnist.py"], + outs = [ + "mnist/train-images-idx3-ubyte", + "mnist/train-labels-idx1-ubyte", + "mnist/t10k-images-idx3-ubyte", + "mnist/t10k-labels-idx1-ubyte", + ], + cmd = "python3 tools/download_mnist.py -d $(RULEDIR)/mnist", +) diff --git a/CITATION b/CITATION deleted file mode 100644 index f7db31f23627..000000000000 --- a/CITATION +++ /dev/null @@ -1,10 +0,0 @@ -@incollection{NEURIPS2019_9015, -title = {PyTorch: An Imperative Style, High-Performance Deep Learning Library}, -author = {Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and Bai, Junjie and Chintala, Soumith}, -booktitle = {Advances in Neural Information Processing Systems 32}, -editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R. Garnett}, -pages = {8024--8035}, -year = {2019}, -publisher = {Curran Associates, Inc.}, -url = {http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf} -} diff --git a/CITATION.cff b/CITATION.cff new file mode 100644 index 000000000000..2bebc947bfb2 --- /dev/null +++ b/CITATION.cff @@ -0,0 +1,73 @@ +cff-version: 1.2.0 +message: If you use this software, please cite it as below. +title: PyTorch +authors: + - family-names: PyTorch Team +url: https://pytorch.org +preferred-citation: + type: conference-paper + title: "PyTorch: An Imperative Style, High-Performance Deep Learning Library" + authors: + - family-names: Paszke + given-names: Adam + - family-names: Gross + given-names: Sam + - family-names: Massa + given-names: Francisco + - family-names: Lerer + given-names: Adam + - family-names: Bradbury + given-names: James + - family-names: Chanan + given-names: Gregory + - family-names: Killeen + given-names: Trevor + - family-names: Lin + given-names: Zeming + - family-names: Gimelshein + given-names: Natalia + - family-names: Antiga + given-names: Luca + - family-names: Desmaison + given-names: Alban + - family-names: Kopf + given-names: Andreas + - family-names: Yang + given-names: Edward + - family-names: DeVito + given-names: Zachary + - family-names: Raison + given-names: Martin + - family-names: Tejani + given-names: Alykhan + - family-names: Chilamkurthy + given-names: Sasank + - family-names: Steiner + given-names: Benoit + - family-names: Fang + given-names: Lu + - family-names: Bai + given-names: Junjie + - family-names: Chintala + given-names: Soumith + collection-title: Advances in Neural Information Processing Systems 32 + collection-type: proceedings + editors: + - family-names: Wallach + given-names: H. + - family-names: Larochelle + given-names: H. 
+ - family-names: Beygelzimer + given-names: A. + - family-names: "d'Alché-Buc" + given-names: F. + - family-names: Fox + given-names: E. + - family-names: Garnett + given-names: R. + start: 8024 + end: 8035 + year: 2019 + publisher: + name: Curran Associates, Inc. + url: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf diff --git a/CMakeLists.txt b/CMakeLists.txt index 5bd4fb954b4d..784b52841704 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -1,4 +1,4 @@ -cmake_minimum_required(VERSION 3.13 FATAL_ERROR) +cmake_minimum_required(VERSION 3.18 FATAL_ERROR) #cmake_policy(SET CMP0022 NEW) #cmake_policy(SET CMP0023 NEW) @@ -11,13 +11,9 @@ cmake_policy(SET CMP0025 NEW) # Suppress warning flags in default MSVC configuration. It's not # mandatory that we do this (and we don't if cmake is old), but it's # nice when it's possible, and it's possible on our Windows configs. -if(NOT CMAKE_VERSION VERSION_LESS 3.15.0) - cmake_policy(SET CMP0092 NEW) -endif() +cmake_policy(SET CMP0092 NEW) -if(NOT CMAKE_VERSION VERSION_LESS 3.10) - set(FIND_CUDA_MODULE_DEPRECATED ON) -endif() +set(FIND_CUDA_MODULE_DEPRECATED ON) # ---[ Project and semantic versioning. project(Torch CXX C) @@ -165,9 +161,6 @@ option(BUILD_LITE_INTERPRETER "Master flag to build Lite Interpreter" OFF) cmake_dependent_option( BUILD_CAFFE2_OPS "Build Caffe2 operators" ON "BUILD_CAFFE2" OFF) -cmake_dependent_option( - BUILD_CAFFE2_MOBILE "Build libcaffe2 for mobile (deprecating)" OFF - "BUILD_CAFFE2" OFF) option(BUILD_SHARED_LIBS "Build libcaffe2.so" ON) cmake_dependent_option( CAFFE2_LINK_LOCAL_PROTOBUF "If set, build protobuf inside libcaffe2.so." ON @@ -186,21 +179,14 @@ cmake_dependent_option( INSTALL_TEST "Install test binaries if BUILD_TEST is on" ON "BUILD_TEST" OFF) option(USE_CPP_CODE_COVERAGE "Compile C/C++ with code coverage flags" OFF) -option(COLORIZE_OUTPUT "Colorize output during compilation" ON) -option(USE_ASAN "Use Address Sanitizer" OFF) +option(USE_COLORIZE_OUTPUT "Colorize output during compilation" ON) +option(USE_ASAN "Use Address+Undefined Sanitizers" OFF) option(USE_TSAN "Use Thread Sanitizer" OFF) option(USE_CUDA "Use CUDA" ON) -# BUILD_SPLIT_CUDA must also be exported as an environment variable before building, with -# `export BUILD_SPLIT_CUDA=1` because cpp_extension.py can only work properly if this variable -# also exists in the environment. -# This option is incompatible with CUDA_SEPARABLE_COMPILATION. -cmake_dependent_option( - BUILD_SPLIT_CUDA "Split torch_cuda library into torch_cuda_cu and torch_cuda_cpp" OFF - "USE_CUDA AND NOT CUDA_SEPARABLE_COMPILATION" OFF) cmake_dependent_option( BUILD_LAZY_CUDA_LINALG "Build cuda linalg ops as separate library" ON "USE_CUDA AND LINUX AND BUILD_PYTHON" OFF) option(USE_FAST_NVCC "Use parallel NVCC build" OFF) -option(USE_ROCM "Use ROCm" ON) +cmake_dependent_option(USE_ROCM "Use ROCm" ON "LINUX" OFF) option(CAFFE2_STATIC_LINK_CUDA "Statically link CUDA libraries" OFF) cmake_dependent_option( USE_CUDNN "Use cuDNN" ON @@ -295,6 +281,7 @@ if(NOT USE_XNNPACK AND CMAKE_VERSION VERSION_LESS ${XNNPACK_MIN_CMAKE_VER}) endif() option(USE_ZMQ "Use ZMQ" OFF) option(USE_ZSTD "Use ZSTD" OFF) +option(TORCH_DISABLE_GPU_ASSERTS "Disable GPU asserts by default" OFF) # Ensure that an ITT build is the default for x86 CPUs cmake_dependent_option( USE_ITT "Use Intel(R) VTune Profiler ITT functionality" ON @@ -348,9 +335,6 @@ cmake_dependent_option( option(ONNX_ML "Enable traditional ONNX ML API." 
ON) option(HAVE_SOVERSION "Whether to add SOVERSION to the shared objects" OFF) option(BUILD_LIBTORCH_CPU_WITH_DEBUG "Enable RelWithDebInfo for libtorch_cpu target only" OFF) -cmake_dependent_option( - USE_DEPLOY "Build embedded torch::deploy interpreter. See torch/csrc/deploy/README.md for more info." OFF - "BUILD_PYTHON" OFF) cmake_dependent_option(USE_CCACHE "Attempt using CCache to wrap the compilation" ON "UNIX" OFF) option(WERROR "Build with -Werror supported by the compiler" OFF) option(USE_COREML_DELEGATE "Use the CoreML backend through delegate APIs" OFF) @@ -358,6 +342,8 @@ option(USE_PER_OPERATOR_HEADERS "Whether ATen should generate separate headers f cmake_dependent_option( BUILD_LAZY_TS_BACKEND "Build the lazy Torchscript backend, not compatible with mobile builds" ON "NOT INTERN_BUILD_MOBILE" OFF) +cmake_dependent_option( + BUILD_FUNCTORCH "Build Functorch" ON "BUILD_PYTHON" OFF) if(USE_CCACHE) @@ -556,6 +542,9 @@ if(MSVC) # Try harder string(APPEND CMAKE_CUDA_FLAGS " -Xcompiler /w -w") + + string(APPEND CMAKE_CXX_FLAGS " /FS") + string(APPEND CMAKE_CUDA_FLAGS " -Xcompiler /FS") endif(MSVC) string(APPEND CMAKE_CUDA_FLAGS " -Xfatbin -compress-all") @@ -575,6 +564,22 @@ if(ANDROID OR IOS OR DEFINED ENV{BUILD_PYTORCH_MOBILE_WITH_HOST_TOOLCHAIN}) message(WARNING "INTERN_BUILD_MOBILE is on, disabling BUILD_LAZY_TS_BACKEND") set(BUILD_LAZY_TS_BACKEND OFF) + # Set -ffunction-sections and -fdata-sections so that each method has its own + # text section. This allows the linker to remove unused section when the flag + # -Wl,-gc-sections is provided at link time. + string(APPEND CMAKE_CXX_FLAGS " -ffunction-sections") + string(APPEND CMAKE_C_FLAGS " -ffunction-sections") + string(APPEND CMAKE_CXX_FLAGS " -fdata-sections") + string(APPEND CMAKE_C_FLAGS " -fdata-sections") + + # Please note that the use of the following flags is required when linking + # against libtorch_cpu.a for mobile builds. + # -Wl,--whole-archive -ltorch_cpu -Wl,--no-whole-archive + # + # This allows global constructors to be included and run. Global + # constructors are used for operator/kernel registration with the + # PyTorch Dispatcher. + if(DEFINED ENV{BUILD_PYTORCH_MOBILE_WITH_HOST_TOOLCHAIN}) # C10_MOBILE is derived from Android/iOS toolchain macros in # c10/macros/Macros.h, so it needs to be explicitly set here. @@ -591,18 +596,15 @@ if(ANDROID OR IOS OR DEFINED ENV{BUILD_PYTORCH_MOBILE_WITH_HOST_TOOLCHAIN}) endif() # INTERN_BUILD_ATEN_OPS is used to control whether to build ATen/TH operators. -# It's disabled for caffe2 mobile library. -if(INTERN_BUILD_MOBILE AND BUILD_CAFFE2_MOBILE) - set(INTERN_BUILD_ATEN_OPS OFF) -else() - set(INTERN_BUILD_ATEN_OPS ON) +set(INTERN_BUILD_ATEN_OPS ON) + +if(NOT DEFINED USE_BLAS) + set(USE_BLAS ON) endif() -# BUILD_CAFFE2_MOBILE is the master switch to choose between libcaffe2 v.s. libtorch mobile build. 
-# When it's enabled it builds original libcaffe2 mobile library without ATen/TH ops nor TorchScript support; -# When it's disabled it builds libtorch mobile library, which contains ATen/TH ops and native support for +# Build libtorch mobile library, which contains ATen/TH ops and native support for # TorchScript model, but doesn't contain not-yet-unified caffe2 ops; -if(INTERN_BUILD_MOBILE AND NOT BUILD_CAFFE2_MOBILE) +if(INTERN_BUILD_MOBILE) if(NOT BUILD_SHARED_LIBS AND NOT "${SELECTED_OP_LIST}" STREQUAL "") string(APPEND CMAKE_CXX_FLAGS " -DNO_EXPORT") endif() @@ -612,13 +614,18 @@ if(INTERN_BUILD_MOBILE AND NOT BUILD_CAFFE2_MOBILE) set(INTERN_DISABLE_AUTOGRAD ON) endif() set(BUILD_PYTHON OFF) + set(BUILD_FUNCTORCH OFF) set(BUILD_CAFFE2_OPS OFF) set(USE_DISTRIBUTED OFF) set(NO_API ON) set(USE_FBGEMM OFF) set(USE_QNNPACK OFF) set(INTERN_DISABLE_ONNX ON) - set(INTERN_USE_EIGEN_BLAS ON) + if(USE_BLAS) + set(INTERN_USE_EIGEN_BLAS ON) + else() + set(INTERN_USE_EIGEN_BLAS OFF) + endif() # Disable developing mobile interpreter for actual mobile build. # Enable it elsewhere to capture build error. set(INTERN_DISABLE_MOBILE_INTERP ON) @@ -707,6 +714,16 @@ set(BUILD_ONEDNN_GRAPH OFF) include(cmake/Dependencies.cmake) +# Moved this cmake set option down here because CMAKE_CUDA_COMPILER_VERSION is not available until now +cmake_dependent_option( + USE_FLASH_ATTENTION + "Whether to build the flash_attention kernel for scaled dot product attention" ON + "USE_CUDA AND NOT ROCM AND NOT MSVC AND NOT CMAKE_CUDA_COMPILER_VERSION VERSION_LESS 11.6" OFF) +if(USE_FLASH_ATTENTION) + ADD_DEFINITIONS(-DUSE_FLASH_ATTENTION) +ENDIF() + + if(USE_CUDA AND (CMAKE_CUDA_COMPILER_VERSION VERSION_LESS 10.2) AND (CMAKE_HOST_SYSTEM_NAME MATCHES "Windows")) # CUDA < 10.2 doesn't support compiling and extracting header dependencies in # one call, so instead CMake calls nvcc twice with && in between. @@ -794,27 +811,27 @@ endif() # ---[ Build flags if(NOT MSVC) string(APPEND CMAKE_CXX_FLAGS " -O2 -fPIC") - string(APPEND CMAKE_CXX_FLAGS " -Wno-narrowing") # Eigen fails to build with some versions, so convert this to a warning # Details at http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1459 string(APPEND CMAKE_CXX_FLAGS " -Wall") string(APPEND CMAKE_CXX_FLAGS " -Wextra") append_cxx_flag_if_supported("-Werror=return-type" CMAKE_CXX_FLAGS) - if(NOT USE_CUDNN) - # Temporary fix to ignore non virtual dtor error if cudnn is used.
A - # separate PR to cudnn_frontend is needed to address this later on - append_cxx_flag_if_supported("-Werror=non-virtual-dtor" CMAKE_CXX_FLAGS) - endif() + append_cxx_flag_if_supported("-Werror=non-virtual-dtor" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Werror=braced-scalar-init" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Werror=range-loop-construct" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wnarrowing" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Wno-missing-field-initializers" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Wno-type-limits" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Wno-array-bounds" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Wno-unknown-pragmas" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wunused-local-typedefs" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Wno-unused-parameter" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Wno-unused-function" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Wno-unused-result" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Wno-strict-overflow" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Wno-strict-aliasing" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Wno-error=deprecated-declarations" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wvla-extension" CMAKE_CXX_FLAGS) if("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang") string(APPEND CMAKE_CXX_FLAGS " -Wno-range-loop-analysis") string(APPEND CMAKE_CXX_FLAGS " -Wno-pass-failed") @@ -870,12 +887,14 @@ if(NOT MSVC) append_cxx_flag_if_supported("-Wno-c++14-extensions" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Wno-constexpr-not-const" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Wno-missing-braces" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wunused-lambda-capture" CMAKE_CXX_FLAGS) + append_cxx_flag_if_supported("-Wunused-local-typedef" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Qunused-arguments" CMAKE_CXX_FLAGS) - if(${COLORIZE_OUTPUT}) + if(${USE_COLORIZE_OUTPUT}) endif() endif() - if(${COLORIZE_OUTPUT}) + if(${USE_COLORIZE_OUTPUT}) append_cxx_flag_if_supported("-fcolor-diagnostics" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-fdiagnostics-color=always" CMAKE_CXX_FLAGS) endif() @@ -903,14 +922,16 @@ if(NOT MSVC) append_cxx_flag_if_supported("-fno-trapping-math" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Werror=format" CMAKE_CXX_FLAGS) append_cxx_flag_if_supported("-Werror=cast-function-type" CMAKE_CXX_FLAGS) - check_cxx_compiler_flag("-Werror=sign-compare" HAS_WERROR_SIGN_COMPARE) - # This doesn't work globally so we use the test on specific - # target_compile_options endif() if(USE_ASAN) - string(APPEND CMAKE_CXX_FLAGS_DEBUG " -fsanitize=address") - string(APPEND CMAKE_LINKER_FLAGS_DEBUG " -fsanitize=address") + string(APPEND CMAKE_CXX_FLAGS_DEBUG " -fsanitize=address -fsanitize=undefined") + string(APPEND CMAKE_LINKER_FLAGS_DEBUG " -fsanitize=address -fsanitize=undefined") +endif() + +if(USE_TSAN) + string(APPEND CMAKE_CXX_FLAGS_DEBUG " -fsanitize=thread") + string(APPEND CMAKE_LINKER_FLAGS_DEBUG " -fsanitize=thread") endif() if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64") @@ -1150,7 +1171,6 @@ endif() include(cmake/Summary.cmake) caffe2_print_configuration_summary() -# ---[ Torch Deploy -if(USE_DEPLOY) - add_subdirectory(torch/csrc/deploy) +if(BUILD_FUNCTORCH) + add_subdirectory(functorch) endif() diff --git a/CODEOWNERS b/CODEOWNERS index ccd111beba86..179e87198dba 100644 --- a/CODEOWNERS +++ b/CODEOWNERS @@ -1,3 +1,8 @@ +# IMPORTANT: +# This file is ONLY used to subscribe for notifications for PRs +# related to 
a specific file path. Approvals from people in this +# file are not required for merges. + # This is a comment. # Each line is a file pattern followed by one or more owners. # For module labels => owners mapping, please see https://github.com/pytorch/pytorch/issues/24422. @@ -9,13 +14,30 @@ /torch/csrc/autograd/ @albanD @soulitzer /torch/autograd/ @albanD @soulitzer /tools/autograd/ @albanD @soulitzer -/torch/nn/ @albanD @jbschlosser @saketh-are +/torch/nn/ @albanD @jbschlosser /torch/optim/ @albanD /test/test_public_bindings.py @albanD /test/allowlist_for_publicAPI.json @albanD @anjali411 /docs/source/conf.py @albanD /aten/src/ATen/native/tags.yaml @anjali411 +# Architecture Optimization (quantization, sparsity, etc.) +/aten/src/ATen/native/ao_sparse @z-a-f @salilsdesai @kimishpatel @digantdesai @jianyuh +/aten/src/ATen/native/quantized @jerryzh168 @z-a-f @salilsdesai @kimishpatel @digantdesai @jianyuh +/aten/src/ATen/native/quantized/cpu @jerryzh168 @z-a-f @salilsdesai @kimishpatel @digantdesai @jianyuh +/aten/src/ATen/native/quantized/cuda @jerryzh168 @dzdang +/aten/src/ATen/native/quantized/cudnn @jerryzh168 @dzdang +/test/test_quantization.py @jerryzh168 +/test/ao/ @jerryzh168 @z-a-f @hdcharles +/test/quantization/ @jerryzh168 @z-a-f +/torch/quantization/ @jerryzh168 +ao/sparisty/ @z-a-f @hdcharles +ao/quantization/ @jerryzh168 +nn/intrinsic/ @jerryzh168 +nn/quantized/ @jerryzh168 +nn/quantizable/ @jerryzh168 @z-a-f +nn/qat/ @jerryzh168 + # Tensorpipe RPC Agent. /torch/csrc/distributed/rpc/tensorpipe_agent.cpp @jiayisuse @osalpekar @lw @beauby /torch/csrc/distributed/rpc/tensorpipe_agent.h @jiayisuse @osalpekar @lw @beauby @@ -23,22 +45,23 @@ # Distributed package # This list is mostly if you'd like to be tagged as reviewer, feel free to add # or remove yourself from it. -/torch/csrc/distributed/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @mingzhe09088 @H-Huang @awgu -/torch/distributed/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @mingzhe09088 @H-Huang @awgu -/torch/nn/parallel/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @mingzhe09088 @H-Huang @awgu +/torch/csrc/distributed/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @H-Huang @awgu @kwen2501 +/torch/distributed/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @H-Huang @awgu @kwen2501 +/torch/distributed/_composable @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @H-Huang @awgu @kwen2501 @yhcharles +/torch/nn/parallel/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @H-Huang @awgu @kwen2501 # Distributed tests # This list is mostly if you'd like to be tagged as reviewer, feel free to add # or remove yourself from it. 
-/test/distributed @mrshenli @pritamdamania87 @zhaojuanmao @rohan-varma @H-Huang @awgu -/torch/testing/_internal/distributed @mrshenli @pritamdamania87 @zhaojuanmao @rohan-varma @H-Huang @awgu +/test/distributed @mrshenli @pritamdamania87 @zhaojuanmao @rohan-varma @H-Huang @awgu @kwen2501 +/torch/testing/_internal/distributed @mrshenli @pritamdamania87 @zhaojuanmao @rohan-varma @H-Huang @awgu @kwen2501 # ONNX Export -/torch/csrc/jit/passes/onnx.h @bowenbao @shubhambhokare1 -/torch/csrc/jit/passes/onnx.cpp @bowenbao @shubhambhokare1 -/torch/csrc/jit/passes/onnx/ @bowenbao @shubhambhokare1 -/torch/onnx/ @bowenbao @shubhambhokare1 -/test/onnx/ @bowenbao @shubhambhokare1 +/torch/csrc/jit/passes/onnx.h @bowenbao @abock +/torch/csrc/jit/passes/onnx.cpp @bowenbao @abock +/torch/csrc/jit/passes/onnx/ @bowenbao @abock +/torch/onnx/ @bowenbao @abock +/test/onnx/ @bowenbao @abock # Docker /.circleci/docker/ @jeffdaily @@ -68,12 +91,19 @@ /torch/testing/_internal/common_methods_invocations.py @mruberry @ngimel /torch/testing/_internal/common_device_type.py @mruberry @ngimel test/test_ops.py @mruberry @ngimel -test/test_ops_gradients.py @mruberry @ngimel +test/test_ops_gradients.py @mruberry @ngimel @soulitzer +test/test_ops_fwd_gradients.py @mruberry @ngimel @soulitzer test/test_unary_ufuncs.py @mruberry @ngimel test/test_binary_ufuncs.py @mruberry @ngimel test/test_reductions.py @mruberry @ngimel test/test_type_promotion.py @mruberry @ngimel +# functorch-related things +# This list is for people wanting to be notified every time there's a change +# Useful for e.g. auditing xfails that other folks add to tests +test/functorch/test_ops.py @zou3519 +test/functorch/test_vmap.py @zou3519 + # torch MPS test/test_mps.py @kulinseth aten/src/ATen/mps/ @kulinseth @@ -84,3 +114,6 @@ torch/csrc/autograd/profiler* @robieta torch/autograd/profiler* @robieta torch/csrc/profiler/ @robieta torch/profiler/ @robieta + +# AOTDispatch tests +test/functorch/test_aotdispatch.py @ezyang @Chillee diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 7b4a1246d002..eaf81b19eefa 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -16,7 +16,9 @@ - [Running `mypy`](#running-mypy) - [C++ Unit Testing](#c-unit-testing) - [Run Specific CI Jobs](#run-specific-ci-jobs) +- [Merging your Change](#merging-your-change) - [Writing documentation](#writing-documentation) + - [Docstring type formatting](#docstring-type-formatting) - [Building documentation](#building-documentation) - [Tips](#tips) - [Building C++ Documentation](#building-c-documentation) @@ -116,21 +118,9 @@ git submodule sync --recursive git submodule update --init --recursive --jobs 0 ``` -If you want to have no-op incremental rebuilds (which are fast), see the section below titled "Make no-op build fast." +If you want to have no-op incremental rebuilds (which are fast), see [Make no-op build fast](#make-no-op-build-fast) below. -3. Follow the instructions for [installing PyTorch from source](https://github.com/pytorch/pytorch#from-source), except when it's time to install PyTorch instead of invoking `setup.py install` you'll want to call `setup.py develop` instead: - -Specifically, the change you have to make is to replace - -```bash -python setup.py install -``` - -with - -```bash -python setup.py develop -``` +3. Follow the instructions for [installing PyTorch from source](https://github.com/pytorch/pytorch#from-source), but instead of installing PyTorch via `python setup.py install`, use `python setup.py develop`. 
This mode will symlink the Python files from the current local source tree into the Python install. This way when you modify a Python file, you @@ -434,6 +424,17 @@ ghstack submit [`ghstack`](https://github.com/ezyang/ghstack). It creates a large commit that is of very low signal to reviewers. +## Merging your Change +If you know the right people or team that should approve your PR (and you have the required permissions to do so), add them to the Reviewers list. + +If not, leave the Reviewers section empty. Our triage squad will review your PR, add a module label, and assign it to the appropriate reviewer in a couple of business days. The reviewer will then look at your PR and respond. + +Occasionally, things might fall through the cracks (sorry!). In case your PR either doesn't get assigned to a reviewer or doesn't get any response from the reviewer for 4 business days, please leave a comment on the PR (mentioning the reviewer if one has been assigned). That'll get it nudged back onto people's radar. + +If that still doesn't help, come see us during [our office hours](https://github.com/pytorch/pytorch/wiki/Contact-Pytorch-Dev-Infra-Office). + +Once your PR is approved, you can merge it in by entering a comment with the content `@pytorchmergebot merge` ([what's this bot?](https://github.com/pytorch/pytorch/wiki/Bot-commands)). + ## Writing documentation So you want to write some documentation and don't know where to start? @@ -447,9 +448,47 @@ If you're interested in adding new developer docs, please read this [page on the The rest of this section is about user-facing documentation. -PyTorch uses [Google style](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) +PyTorch uses [Google style](https://www.sphinx-doc.org/en/master/usage/extensions/example_google.html) for formatting docstrings. Each line inside a docstrings block must be limited to 80 characters so that it fits into Jupyter documentation popups. + +### Docstring type formatting + +In addition to the standard Google Style docstring formatting rules, the following guidelines should be followed for docstring types (docstring types are the type information contained in the round brackets after the variable name): + +* The "`Callable`", "`Any`", "`Iterable`", "`Iterator`", "`Generator`" types should have their first letter capitalized. + +* The "`list`" and "`tuple`" types should be completely lowercase. + +* Types should not be made plural. For example: `tuple of int` should be used instead of `tuple of ints`. + +* The only acceptable delimiter words for types are `or` and `of`. No other non-type words should be used other than `optional`. + +* The word `optional` should only be used after the types, and it is only used if the user does not have to specify a value for the variable. Default values are listed after the variable description. Example: + + ``` + my_var (int, optional): Variable description. Default: 1 + ``` + +* Basic Python types should match their type name so that the [Intersphinx](https://www.sphinx-doc.org/en/master/usage/extensions/intersphinx.html) extension can correctly identify them. For example: + * Use `str` instead of `string`. + * Use `bool` instead of `boolean`. + * Use `dict` instead of `dictionary`. + +* Square brackets should be used for the dictionary type. For example: + + ``` + my_var (dict[str, int]): Variable description. + ``` + +* If a variable has two different possible types, then the word `or` should be used without a comma.
Otherwise variables with 3 or more types should use commas to separate the types. Example: + + ``` + x (type1 or type2): Variable description. + y (type1, type2, or type3): Variable description. + ``` + + ### Building documentation To build the documentation: @@ -1207,13 +1246,6 @@ In 2018, we merged Caffe2 into the PyTorch source repository. While the steady state aspiration is that Caffe2 and PyTorch share code freely, in the meantime there will be some separation. -If you submit a PR to only PyTorch or only Caffe2 code, CI will only -run for the project you edited. The logic for this is implemented -in `.jenkins/pytorch/dirty.sh` and `.jenkins/caffe2/dirty.sh`; you -can look at this to see what path prefixes constitute changes. -This also means if you ADD a new top-level path, or you start -sharing code between projects, you need to modify these files. - There are a few "unusual" directories which, for historical reasons, are Caffe2/PyTorch specific. Here they are: @@ -1246,8 +1278,9 @@ our [CI wiki](https://github.com/pytorch/pytorch/wiki/Debugging-using-with-ssh-f ### Which commit is used in CI? For CI run on `master`, this repository is checked out for a given `master` -commit, and CI is run on that commit (there isn't really any other choice). For -PRs, however, it's a bit more complicated. Consider this commit graph, where +commit, and CI is run on that commit (there isn't really any other choice). + +For PRs, however, it's a bit more complicated. Consider this commit graph, where `master` is at commit `A`, and the branch for PR #42 (just a placeholder) is at commit `B`: @@ -1256,7 +1289,7 @@ commit `B`: / \ / C (refs/pull/42/merge) / / ----o---o---o---A (refs/heads/master) +---o---o---o---A (merge-destination) - usually master ``` There are two possible choices for which commit to use: @@ -1264,37 +1297,18 @@ There are two possible choices for which commit to use: 1. Checkout commit `B`, the head of the PR (manually committed by the PR author). 2. Checkout commit `C`, the hypothetical result of what would happen if the PR - were merged into `master` (automatically generated by GitHub). - -This choice depends on several factors; here is the decision tree as of -2021-03-30: - -- For CI jobs on CircleCI: - - If the name of the job (or one of its ancestors in the workflow DAG) - contains "xla" or "gcc5", choice **2** is used. This includes the following - jobs: - - pytorch_linux_xenial_py3_6_gcc5_4_build - - pytorch_cpp_doc_build - - pytorch_doc_test - - pytorch_linux_forward_backward_compatibility_check_test - - pytorch_linux_xenial_py3_6_gcc5_4_jit_legacy_test - - pytorch_linux_xenial_py3_6_gcc5_4_test - - pytorch_python_doc_build - - pytorch_xla_linux_bionic_py3_6_clang9_build - - pytorch_xla_linux_bionic_py3_6_clang9_test - - Otherwise, choice **1** is used. -- For CI jobs on GitHub Actions: - - If the PR was created using [`ghstack`](https://github.com/ezyang/ghstack), - choice **1** is used. - - Otherwise, choice **2** is used. - -This is important to be aware of, because if you see a CI failure on your PR and -choice **2** is being used for that CI job, it is possible that the failure is -nondeterministically caused by a commit that does not exist in the ancestry of -your PR branch. If you happen to have write access to this repo, you can choose -to use `ghstack` to eliminate this nondeterminism for GitHub Actions jobs on -your PRs, but it will still be present for the select CircleCI jobs listed -above. + were merged into it's destination (usually `master`). 
+ +For all practical purposes, most people can think of the commit being used as +commit `B` (choice **1**). + +However, if workflow files (which govern CI behavior) were modified (either by your PR or since your dev branch was created) there's +a nuance to know about: +The workflow files themselves get taken from commit `C`, the merge of your +PR and the `master` branch. But only the workflow files get taken from that merged +commit. Everything else (tests, code, etc.) gets taken directly from your +PR's commit (commit `B`). Please note that this scenario would never affect PRs authored by `ghstack`, as they do not automatically ingest the updates from the default branch. + ## Dev Infra Office Hours [Dev Infra Office Hours](https://github.com/pytorch/pytorch/wiki/Dev-Infra-Office-Hours) are hosted every Friday to answer any questions regarding developer experience, Green HUD, and CI. diff --git a/Dockerfile b/Dockerfile index 1bd522a62406..e125271607c9 100644 --- a/Dockerfile +++ b/Dockerfile @@ -11,8 +11,7 @@ ARG BASE_IMAGE=ubuntu:18.04 ARG PYTHON_VERSION=3.8 FROM ${BASE_IMAGE} as dev-base -RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt \ - apt-get update && apt-get install -y --no-install-recommends \ +RUN apt-get update && apt-get install -y --no-install-recommends \ build-essential \ ca-certificates \ ccache \ @@ -28,9 +27,16 @@ ENV PATH /opt/conda/bin:$PATH FROM dev-base as conda ARG PYTHON_VERSION=3.8 +# Automatically set by buildx +ARG TARGETPLATFORM +# translating Docker's TARGETPLATFORM into miniconda arches +RUN case ${TARGETPLATFORM} in \ + "linux/arm64") MINICONDA_ARCH=aarch64 ;; \ + *) MINICONDA_ARCH=x86_64 ;; \ + esac && \ + curl -fsSL -v -o ~/miniconda.sh -O "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-${MINICONDA_ARCH}.sh" COPY requirements.txt . -RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \ - chmod +x ~/miniconda.sh && \ +RUN chmod +x ~/miniconda.sh && \ ~/miniconda.sh -b -p /opt/conda && \ rm ~/miniconda.sh && \ /opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build pyyaml numpy ipython && \ @@ -53,19 +59,28 @@ RUN --mount=type=cache,target=/opt/ccache \ FROM conda as conda-installs ARG PYTHON_VERSION=3.8 -ARG CUDA_VERSION=11.3 +ARG CUDA_VERSION=11.6 ARG CUDA_CHANNEL=nvidia ARG INSTALL_CHANNEL=pytorch-nightly -ENV CONDA_OVERRIDE_CUDA=${CUDA_VERSION} -RUN /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -c "${CUDA_CHANNEL}" -y python=${PYTHON_VERSION} pytorch torchvision torchtext "cudatoolkit=${CUDA_VERSION}" && \ +# Automatically set by buildx +RUN /opt/conda/bin/conda update -y conda +RUN /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -y python=${PYTHON_VERSION} +ARG TARGETPLATFORM +ARG TRITON_VERSION + +# On arm64 we can only install wheel packages +RUN case ${TARGETPLATFORM} in \ + "linux/arm64") pip install --extra-index-url https://download.pytorch.org/whl/cpu/ torch torchvision torchtext ;; \ + *) /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -c "${CUDA_CHANNEL}" -y "python=${PYTHON_VERSION}" pytorch torchvision torchtext "pytorch-cuda=$(echo $CUDA_VERSION | cut -d'.'
-f 1-2)" ;; \ + esac && \ /opt/conda/bin/conda clean -ya RUN /opt/conda/bin/pip install torchelastic +RUN if test -n "${TRITON_VERSION}" -a "${TARGETPLATFORM}" != "linux/arm64"; then /opt/conda/bin/pip install "torchtriton==${TRITON_VERSION}" --extra-index-url https://download.pytorch.org/whl/nightly/cpu ; fi FROM ${BASE_IMAGE} as official ARG PYTORCH_VERSION LABEL com.nvidia.volumes.needed="nvidia_driver" -RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ - apt-get update && apt-get install -y --no-install-recommends \ +RUN apt-get update && apt-get install -y --no-install-recommends \ ca-certificates \ libjpeg-dev \ libpng-dev && \ diff --git a/MANIFEST.in b/MANIFEST.in index acf4c7291f43..403b90b702df 100644 --- a/MANIFEST.in +++ b/MANIFEST.in @@ -26,5 +26,6 @@ recursive-include benchmarks *.* recursive-include scripts *.* recursive-include mypy_plugins *.* recursive-include modules *.* +recursive-include functorch *.* prune */__pycache__ global-exclude *.o *.so *.dylib *.a .git *.pyc *.swp diff --git a/Makefile b/Makefile index 21745f42a887..45dfeb8cda26 100644 --- a/Makefile +++ b/Makefile @@ -31,3 +31,7 @@ lint: quicklint: lintrunner + +triton: + $(PIP) uninstall -y triton + $(PIP) install -U "git+https://github.com/openai/triton@$(shell cat .github/ci_commit_pins/triton.txt)#subdirectory=python" diff --git a/README.md b/README.md index 10d14d354cc8..bcce2997b25b 100644 --- a/README.md +++ b/README.md @@ -72,7 +72,7 @@ PyTorch provides Tensors that can live either on the CPU or the GPU and accelera computation by a huge amount. We provide a wide variety of tensor routines to accelerate and fit your scientific computation needs -such as slicing, indexing, math operations, linear algebra, reductions. +such as slicing, indexing, mathematical operations, linear algebra, reductions. And they are fast! ### Dynamic Neural Networks: Tape-Based Autograd @@ -234,7 +234,7 @@ python tools/amd_build/build_amd.py Install PyTorch ```bash export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"} -python setup.py install +python setup.py develop ``` Note that if you are using [Anaconda](https://www.anaconda.com/distribution/#download-section), you may experience an error caused by the linker: @@ -245,13 +245,13 @@ collect2: error: ld returned 1 exit status error: command 'g++' failed with exit status 1 ``` -This is caused by `ld` from Conda environment shadowing the system `ld`. You should use a newer version of Python that fixes this issue. The recommended Python version is 3.7.6+ and 3.8.1+. +This is caused by `ld` from the Conda environment shadowing the system `ld`. You should use a newer version of Python that fixes this issue. The recommended Python version is 3.7.6+ and 3.8.1+. **On macOS** ```bash export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"} -MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py install +MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py develop ``` **On Windows** @@ -274,7 +274,7 @@ In this mode PyTorch computations will run on your CPU, not your GPU ```cmd conda activate -python setup.py install +python setup.py develop ``` Note on OpenMP: The desired OpenMP implementation is Intel OpenMP (iomp). In order to link against iomp, you'll need to manually download the library and set up the building environment by tweaking `CMAKE_INCLUDE_PATH` and `LIB`. 
The instruction [here](https://github.com/pytorch/pytorch/blob/master/docs/source/notes/windows.rst#building-from-source) is an example for setting up both MKL and Intel OpenMP. Without these configurations for CMake, Microsoft Visual C OpenMP runtime (vcomp) will be used. @@ -284,7 +284,7 @@ Note on OpenMP: The desired OpenMP implementation is Intel OpenMP (iomp). In ord In this mode PyTorch computations will leverage your GPU via CUDA for faster number crunching [NVTX](https://docs.nvidia.com/gameworks/content/gameworkslibrary/nvtx/nvidia_tools_extension_library_nvtx.htm) is needed to build Pytorch with CUDA. -NVTX is a part of CUDA distributive, where it is called "Nsight Compute". To install it onto already installed CUDA run CUDA installation once again and check the corresponding checkbox. +NVTX is a part of CUDA distributive, where it is called "Nsight Compute". To install it onto an already installed CUDA run CUDA installation once again and check the corresponding checkbox. Make sure that CUDA with Nsight Compute is installed after Visual Studio. Currently, VS 2017 / 2019, and Ninja are supported as the generator of CMake. If `ninja.exe` is detected in `PATH`, then Ninja will be used as the default generator, otherwise, it will use VS 2017 / 2019. @@ -299,7 +299,7 @@ You can refer to the [build_pytorch.bat](https://github.com/pytorch/pytorch/blob ```cmd cmd -:: Set the environment variables after you have downloaded and upzipped the mkl package, +:: Set the environment variables after you have downloaded and unzipped the mkl package, :: else CMake would throw an error as `Could NOT find OpenMP`. set CMAKE_INCLUDE_PATH={Your directory}\mkl\include set LIB={Your directory}\mkl\lib;%LIB% @@ -315,7 +315,7 @@ for /f "usebackq tokens=*" %i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\ :: [Optional] If you want to override the CUDA host compiler set CUDAHOSTCXX=C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64\cl.exe -python setup.py install +python setup.py develop ``` diff --git a/RELEASE.md b/RELEASE.md index 32f71e124141..d13ca5d11e10 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -14,7 +14,7 @@ - [Release Candidate health validation](#release-candidate-health-validation) - [Cherry Picking Fixes](#cherry-picking-fixes) - [Promoting RCs to Stable](#promoting-rcs-to-stable) - - [Additonal Steps to prepare for release day](#additonal-steps-to-prepare-for-release-day) + - [Additional Steps to prepare for release day](#additional-steps-to-prepare-for-release-day) - [Modify release matrix](#modify-release-matrix) - [Open Google Colab issue](#open-google-colab-issue) - [Patch Releases](#patch-releases) @@ -186,7 +186,7 @@ Promotion should occur in two steps: **NOTE**: The promotion of wheels to PyPI can only be done once so take caution when attempting to promote wheels to PyPI, (see https://github.com/pypa/warehouse/issues/726 for a discussion on potential draft releases within PyPI) -## Additonal Steps to prepare for release day +## Additional Steps to prepare for release day The following should be prepared for the release day @@ -264,7 +264,7 @@ For versions of Python that we support we follow the [NEP 29 policy](https://num ## Accelerator Software -For acclerator software like CUDA and ROCm we will typically use the following criteria: +For accelerator software like CUDA and ROCm we will typically use the following criteria: * Support latest 2 minor versions ### Special support cases @@ -281,7 +281,7 @@ need to support these 
particular versions of software. In the event a submodule cannot be fast forwarded and a patch must be applied we can take two different approaches: -* (preferred) Fork the said repository under the pytorch Github organization, apply the patches we need there, and then switch our submodule to accept our fork. +* (preferred) Fork the said repository under the pytorch GitHub organization, apply the patches we need there, and then switch our submodule to accept our fork. * Get the dependencies maintainers to support a release branch for us Editing submodule remotes can be easily done with: (running from the root of the git repository) diff --git a/WORKSPACE b/WORKSPACE index d26dfca5a333..e8591f291abd 100644 --- a/WORKSPACE +++ b/WORKSPACE @@ -84,10 +84,17 @@ new_local_repository( path = "third_party/eigen", ) +new_local_repository( + name = "cutlass", + build_file = "//third_party:cutlass.BUILD", + path = "third_party/cutlass", +) + new_local_repository( name = "fbgemm", build_file = "//third_party:fbgemm/BUILD.bazel", path = "third_party/fbgemm", + repo_mapping = {"@cpuinfo" : "@org_pytorch_cpuinfo"} ) new_local_repository( @@ -103,8 +110,8 @@ new_local_repository( ) new_local_repository( - name = "cpuinfo", - build_file = "//third_party:cpuinfo.BUILD", + name = "org_pytorch_cpuinfo", + build_file = "//third_party:cpuinfo/BUILD.bazel", path = "third_party/cpuinfo", ) diff --git a/android/gradle.properties b/android/gradle.properties index 9d2640f9a185..ecefc09a587b 100644 --- a/android/gradle.properties +++ b/android/gradle.properties @@ -1,6 +1,6 @@ ABI_FILTERS=armeabi-v7a,arm64-v8a,x86,x86_64 -VERSION_NAME=1.13.0-SNAPSHOT +VERSION_NAME=1.14.0-SNAPSHOT GROUP=org.pytorch MAVEN_GROUP=org.pytorch SONATYPE_STAGING_PROFILE=orgpytorch diff --git a/android/pytorch_android/src/main/cpp/pytorch_jni_common.cpp b/android/pytorch_android/src/main/cpp/pytorch_jni_common.cpp index 5ed0c9978e83..beafc0a7114a 100644 --- a/android/pytorch_android/src/main/cpp/pytorch_jni_common.cpp +++ b/android/pytorch_android/src/main/cpp/pytorch_jni_common.cpp @@ -303,7 +303,7 @@ facebook::jni::local_ref JIValue::newJIValueFromStringDict( facebook::jni::alias_ref>::create(); for (auto& pair : dict) { jmap->put( - facebook::jni::make_jstring(pair.key().toString()->string()), + facebook::jni::make_jstring(pair.key().toStringRef()), JIValue::newJIValueFromAtIValue(pair.value())); } return jMethodDictStringKey(JIValue::javaClassStatic(), jmap); diff --git a/android/pytorch_android/src/main/cpp/pytorch_jni_jit.cpp b/android/pytorch_android/src/main/cpp/pytorch_jni_jit.cpp index 1b0d54784d76..6ef4f462df16 100644 --- a/android/pytorch_android/src/main/cpp/pytorch_jni_jit.cpp +++ b/android/pytorch_android/src/main/cpp/pytorch_jni_jit.cpp @@ -195,14 +195,16 @@ class PytorchJni : public facebook::jni::HybridClass { std::vector inputs{}; size_t n = jinputs->size(); inputs.reserve(n); + const bool requires_backend_transfers = + module_.attr("requires_backend_transfers", at::IValue(true)).toBool(); for (size_t i = 0; i < n; i++) { at::IValue atIValue = JIValue::JIValueToAtIValue(jinputs->getElement(i)); - if (at::kVulkan == deviceType_) { + if (at::kVulkan == deviceType_ && requires_backend_transfers) { inputs.push_back( atIValue.isTensor() ? 
at::IValue{atIValue.toTensor().vulkan()} : std::move(atIValue)); } else { - TORCH_CHECK(at::kCPU == deviceType_); + TORCH_CHECK(at::kCPU == deviceType_ || !requires_backend_transfers); inputs.push_back(std::move(atIValue)); } } @@ -223,14 +225,16 @@ class PytorchJni : public facebook::jni::HybridClass { std::vector inputs{}; size_t n = jinputs->size(); inputs.reserve(n); + const bool requires_backend_transfers = + module_.attr("requires_backend_transfers", at::IValue(true)).toBool(); for (size_t i = 0; i < n; i++) { at::IValue atIValue = JIValue::JIValueToAtIValue(jinputs->getElement(i)); - if (at::kVulkan == deviceType_) { + if (at::kVulkan == deviceType_ && requires_backend_transfers) { inputs.push_back( atIValue.isTensor() ? at::IValue{atIValue.toTensor().vulkan()} : std::move(atIValue)); } else { - TORCH_CHECK(at::kCPU == deviceType_); + TORCH_CHECK(at::kCPU == deviceType_ || !requires_backend_transfers); inputs.push_back(std::move(atIValue)); } } diff --git a/android/pytorch_android/src/main/cpp/pytorch_jni_lite.cpp b/android/pytorch_android/src/main/cpp/pytorch_jni_lite.cpp index 86fd1e2260f9..802bb801a1f9 100644 --- a/android/pytorch_android/src/main/cpp/pytorch_jni_lite.cpp +++ b/android/pytorch_android/src/main/cpp/pytorch_jni_lite.cpp @@ -158,14 +158,16 @@ class PytorchJni : public facebook::jni::HybridClass { std::vector inputs{}; size_t n = jinputs->size(); inputs.reserve(n); + const bool requires_backend_transfers = + module_.attr("requires_backend_transfers", at::IValue(true)).toBool(); for (const auto i : c10::irange(n)) { at::IValue atIValue = JIValue::JIValueToAtIValue(jinputs->getElement(i)); - if (at::kVulkan == deviceType_) { + if (at::kVulkan == deviceType_ && requires_backend_transfers) { inputs.push_back( atIValue.isTensor() ? at::IValue{atIValue.toTensor().vulkan()} : std::move(atIValue)); } else { - TORCH_CHECK(at::kCPU == deviceType_); + TORCH_CHECK(at::kCPU == deviceType_ || !requires_backend_transfers); inputs.push_back(std::move(atIValue)); } } @@ -187,14 +189,16 @@ class PytorchJni : public facebook::jni::HybridClass { std::vector inputs{}; size_t n = jinputs->size(); inputs.reserve(n); + const bool requires_backend_transfers = + module_.attr("requires_backend_transfers", at::IValue(true)).toBool(); for (const auto i : c10::irange(n)) { at::IValue atIValue = JIValue::JIValueToAtIValue(jinputs->getElement(i)); - if (at::kVulkan == deviceType_) { + if (at::kVulkan == deviceType_ && requires_backend_transfers) { inputs.push_back( atIValue.isTensor() ? at::IValue{atIValue.toTensor().vulkan()} : std::move(atIValue)); } else { - TORCH_CHECK(at::kCPU == deviceType_); + TORCH_CHECK(at::kCPU == deviceType_ || !requires_backend_transfers); inputs.push_back(std::move(atIValue)); } } diff --git a/aten/CMakeLists.txt b/aten/CMakeLists.txt index 9c3757f346cd..9ba141c29e42 100644 --- a/aten/CMakeLists.txt +++ b/aten/CMakeLists.txt @@ -33,6 +33,7 @@ set(ATen_HIP_SRCS_W_SORT_BY_KEY) set(ATen_HIP_TEST_SRCS) set(ATen_HIP_INCLUDE) set(ATen_MPS_SRCS) +set(ATen_MPS_TEST_SRCS) set(ATen_VULKAN_TEST_SRCS) set(ATen_CPU_DEPENDENCY_LIBS) set(ATen_CUDA_DEPENDENCY_LIBS) @@ -55,7 +56,7 @@ set(TH_CPU_INCLUDE list(APPEND ATen_CPU_INCLUDE ${TH_CPU_INCLUDE}) if(USE_VULKAN) - list(APPEND ATen_CPU_INCLUDE ${CMAKE_BINARY_DIR}/vulkan) + list(APPEND ATen_CPU_INCLUDE ${CMAKE_BINARY_DIR}/vulkan ${CMAKE_CURRENT_SOURCE_DIR}/../third_party/VulkanMemoryAllocator) endif() # Find the HIP package, set the HIP paths, load the HIP CMake. 
@@ -106,6 +107,7 @@ set(ATen_CUDA_SRCS_W_SORT_BY_KEY ${ATen_CUDA_SRCS_W_SORT_BY_KEY} PARENT_SCOPE) set(ATen_CUDA_CU_SRCS_W_SORT_BY_KEY ${ATen_CUDA_CU_SRCS_W_SORT_BY_KEY} PARENT_SCOPE) set(ATen_HIP_SRCS ${ATen_HIP_SRCS} PARENT_SCOPE) set(ATen_MPS_SRCS ${ATen_MPS_SRCS} PARENT_SCOPE) +set(ATen_MPS_TEST_SRCS ${ATen_MPS_TEST_SRCS} PARENT_SCOPE) set(ATen_HIP_SRCS_W_SORT_BY_KEY ${ATen_HIP_SRCS_W_SORT_BY_KEY} PARENT_SCOPE) set(ATen_NVRTC_STUB_SRCS ${ATen_NVRTC_STUB_SRCS} PARENT_SCOPE) set(ATen_CPU_TEST_SRCS ${ATen_CPU_TEST_SRCS} PARENT_SCOPE) diff --git a/aten/src/ATen/ATen.h b/aten/src/ATen/ATen.h index 1be43cbe7def..4a5a949f0dd7 100644 --- a/aten/src/ATen/ATen.h +++ b/aten/src/ATen/ATen.h @@ -31,3 +31,7 @@ #include #include #include + +// TODO: try to remove this +// There is some back story, see https://github.com/pytorch/pytorch/issues/48684 +#include diff --git a/aten/src/ATen/BatchedTensorImpl.cpp b/aten/src/ATen/BatchedTensorImpl.cpp index d5ab588de53d..fdedfa7c6316 100644 --- a/aten/src/ATen/BatchedTensorImpl.cpp +++ b/aten/src/ATen/BatchedTensorImpl.cpp @@ -17,7 +17,7 @@ BatchedTensorImpl::BatchedTensorImpl(Tensor value, BatchDims bdims) { TORCH_INTERNAL_ASSERT(value_.defined()); set_storage_access_should_throw(); - set_sizes_strides_policy(SizesStridesPolicy::CustomStrides); + set_custom_sizes_strides(SizesStridesPolicy::CustomStrides); checkInvariants(); const auto public_dims = value_.dim() - bdims_.size(); diff --git a/aten/src/ATen/BatchingRegistrations.cpp b/aten/src/ATen/BatchingRegistrations.cpp index a269f82fa817..5a01f949745f 100644 --- a/aten/src/ATen/BatchingRegistrations.cpp +++ b/aten/src/ATen/BatchingRegistrations.cpp @@ -4,6 +4,7 @@ #include #include #include +#include #include #include @@ -185,14 +186,6 @@ Tensor expand_batching_rule(const Tensor& self, IntArrayRef size, bool implicit) return self_physical.getPhysicalToLogicalMap().apply(result); } -Tensor expand_symint_batching_rule(const Tensor& self, SymIntArrayRef psize, bool implicit) { - return self.expand(asIntArrayRefSlow(psize), implicit); -} - -Tensor sum_symint_batching_rule(const Tensor& input_t, c10::SymIntArrayRef dim, bool keepdim, optional opt_dtype) { - return input_t.sum(c10::asIntArrayRefSlow(dim), keepdim, opt_dtype); -} - std::vector chunk_batching_rule(const Tensor& self, int64_t chunks, int64_t dim) { auto self_physical = MultiBatchVmapTransform::logicalToPhysical(self); auto dim_physical = self_physical.getPhysicalDim(dim); @@ -472,10 +465,6 @@ Tensor view_batching_rule(const Tensor& self, IntArrayRef size) { return self_physical.getPhysicalToLogicalMap().apply(result); } -Tensor view_symint_batching_rule(const Tensor& self, c10::SymIntArrayRef size) { - return self.view(asIntArrayRefSlow(size)); -} - Tensor view_as_complex_batching_rule(const Tensor& self) { // guard against the user passing in a batch of scalar tensors with batch // size equal to 2. 
@@ -928,7 +917,7 @@ Tensor mm_batching_rule(const Tensor& self, const Tensor& other) { TORCH_INTERNAL_ASSERT(false, "either self or other must be a BatchedTensor"); } -Tensor cat_batching_rule(TensorList tensors, int64_t dim) { +Tensor cat_batching_rule(const ITensorListRef& tensors, int64_t dim) { auto physical_views = MultiBatchVmapTransform::logicalToPhysical(tensors); auto physical_tensors = fmap( physical_views, [](const VmapPhysicalView& view) -> Tensor { return view.tensor(); }); @@ -1006,16 +995,6 @@ Tensor new_empty_batching_rule( return physical_view.getPhysicalToLogicalMap().apply(result); } -Tensor new_empty_symint_batching_rule( - const Tensor& self, - c10::SymIntArrayRef size, - c10::optional dtype, - c10::optional layout, - c10::optional device, - c10::optional pin_memory) { - return new_empty_batching_rule(self, asIntArrayRefSlow(size), dtype, layout, device, pin_memory); -} - Tensor new_empty_strided_batching_rule( const Tensor& self, IntArrayRef size, @@ -1100,7 +1079,6 @@ TORCH_LIBRARY_IMPL(aten, Batched, m) { m.impl("_new_zeros_with_same_feature_meta", _new_zeros_with_same_feature_meta_batching_rule); m.impl("sum.dim_IntList", sum_batching_rule); - m.impl("sum.SymInt", sum_symint_batching_rule); m.impl("is_complex", native::is_complex); // inplace operations @@ -1115,13 +1093,12 @@ TORCH_LIBRARY_IMPL(aten, Batched, m) { m.impl("tensor_split.indices", tensor_split_indices_batching_rule); m.impl("diagonal", diagonal_batching_rule); m.impl("expand", expand_batching_rule); - m.impl("expand.SymInt", expand_symint_batching_rule); m.impl("expand_as", native::expand_as); // composite wrt autograd m.impl("movedim.intlist", movedim_batching_rule); m.impl("movedim.int", static_cast(native::movedim)); // composite wrt autograd - // NB: static_cast because there's another variant of narrow. However, we don't + // There is another variant of narrow. However, we don't // want to support the other variant yet bc it isn't documented... 
- m.impl("narrow", static_cast(native::narrow)); // composite wrt autograd + m.impl("narrow", native::narrow_symint); // composite wrt autograd m.impl("numpy_T", native::numpy_T); // composite wrt autograd m.impl("matrix_H", native::matrix_H); // composite wrt autograd m.impl("mT", native::mT); // composite wrt autograd @@ -1144,7 +1121,6 @@ TORCH_LIBRARY_IMPL(aten, Batched, m) { m.impl("unfold", unfold_batching_rule); m.impl("unsqueeze", unsqueeze_batching_rule); m.impl("view", view_batching_rule); - m.impl("view.SymInt", view_symint_batching_rule); m.impl("view_as", native::view_as); // composite wrt autograd // clamp operations @@ -1283,7 +1259,6 @@ TORCH_LIBRARY_IMPL(aten, Batched, m) { // Tensor.new_* operators m.impl("new_empty", new_empty_batching_rule); - m.impl("new_empty.SymInt", new_empty_symint_batching_rule); m.impl("new_empty_strided", new_empty_strided_batching_rule); m.impl("new_zeros", new_zeros_batching_rule); diff --git a/aten/src/ATen/CMakeLists.txt b/aten/src/ATen/CMakeLists.txt index 286d59f3e97d..613c6a6834e3 100644 --- a/aten/src/ATen/CMakeLists.txt +++ b/aten/src/ATen/CMakeLists.txt @@ -56,8 +56,8 @@ if(NOT BUILD_CAFFE2 AND NOT BUILD_LITE_INTERPRETER) EXCLUDE(ATen_CORE_TEST_SRCS "${ATen_CORE_TEST_SRCS}" ${ATen_CORE_EXCLUDED_TEST_SRCS}) endif() -file(GLOB base_h "*.h" "detail/*.h" "cpu/*.h" "cpu/vec/vec512/*.h" "cpu/vec/vec256/*.h" "cpu/vec/*.h" "quantized/*.h") -file(GLOB base_cpp "*.cpp" "detail/*.cpp" "cpu/*.cpp") +file(GLOB base_h "*.h" "detail/*.h" "cpu/*.h" "cpu/vec/vec512/*.h" "cpu/vec/vec256/*.h" "cpu/vec/vec256/vsx/*.h" "cpu/vec/*.h" "quantized/*.h" "functorch/*.h") +file(GLOB base_cpp "*.cpp" "detail/*.cpp" "cpu/*.cpp" "functorch/*.cpp") file(GLOB cuda_h "cuda/*.h" "cuda/detail/*.h" "cuda/*.cuh" "cuda/detail/*.cuh") file(GLOB cuda_cpp "cuda/*.cpp" "cuda/detail/*.cpp") file(GLOB cuda_nvrtc_stub_h "cuda/nvrtc_stub/*.h") @@ -130,15 +130,13 @@ file(GLOB native_cuda_h "native/cuda/*.h" "native/cuda/*.cuh") file(GLOB native_cuda_linalg_cpp "native/cuda/linalg/*.cpp") file(GLOB native_hip_h "native/hip/*.h" "native/hip/*.cuh") file(GLOB native_cudnn_cpp "native/cudnn/*.cpp") -file(GLOB native_nested_cuda_cu "native/nested/cuda/*.cu") -file(GLOB native_nested_cuda_cpp "native/nested/cuda/*.cpp") file(GLOB native_sparse_cuda_cu "native/sparse/cuda/*.cu") file(GLOB native_sparse_cuda_cpp "native/sparse/cuda/*.cpp") file(GLOB native_quantized_cuda_cu "native/quantized/cuda/*.cu") file(GLOB native_quantized_cuda_cpp "native/quantized/cuda/*.cpp") file(GLOB native_quantized_cudnn_cpp "native/quantized/cudnn/*.cpp") -file(GLOB native_transformers_cuda_cu "native/transformers/cuda/*.cu") -file(GLOB native_transformers_cuda_cpp "native/transformers/cuda/*.cpp") +file(GLOB native_nested_cuda_cu "native/nested/cuda/*.cu") +file(GLOB native_nested_cuda_cpp "native/nested/cuda/*.cpp") file(GLOB native_hip_hip "native/hip/*.hip") file(GLOB native_hip_cpp "native/hip/*.cpp") @@ -151,11 +149,31 @@ file(GLOB native_sparse_hip_hip "native/sparse/hip/*.hip") file(GLOB native_sparse_hip_cpp "native/sparse/hip/*.cpp") file(GLOB native_quantized_hip_hip "native/quantized/hip/*.hip") file(GLOB native_quantized_hip_cpp "native/quantized/hip/*.cpp") +file(GLOB native_transformers_cuda_cu "native/transformers/cuda/*.cu") +file(GLOB native_transformers_cuda_cpp "native/transformers/cuda/*.cpp") file(GLOB native_transformers_hip_hip "native/transformers/hip/*.hip") file(GLOB native_transformers_hip_cpp "native/transformers/hip/*.cpp") file(GLOB native_quantized_cudnn_hip_cpp 
"native/quantized/cudnn/hip/*.cpp") file(GLOB native_utils_cpp "native/utils/*.cpp") +# flash_attention sources +file(GLOB flash_attention_cuda_cu "native/transformers/cuda/flash_attn/*.cu") +file(GLOB flash_attention_cuda_cpp "native/transformers/cuda/flash_attn/*.cpp") + +#Mem_eff attention sources +file(GLOB mem_eff_attention_cuda_cu "native/transformers/cuda/mem_eff_attention/*.cu") +file(GLOB mem_eff_attention_cuda_kernels_cu "native/transformers/cuda/mem_eff_attention/kernels/*.cu") +file(GLOB mem_eff_attention_cuda_cpp "native/transformers/cuda/mem_eff_attention/*.cpp") + +if(USE_FLASH_ATTENTION) + list(APPEND native_transformers_cuda_cu ${flash_attention_cuda_cu}) + list(APPEND native_transformers_cuda_cpp ${flash_attention_cuda_cpp}) + + list(APPEND native_transformers_cuda_cu ${mem_eff_attention_cuda_cu}) + list(APPEND native_transformers_cuda_cu ${mem_eff_attention_cuda_kernels_cu}) + list(APPEND native_transformers_cuda_cpp ${mem_eff_attention_cuda_cpp}) +endif() + # XNNPACK file(GLOB native_xnnpack "native/xnnpack/*.cpp") @@ -415,6 +433,7 @@ if(NOT MSVC AND NOT EMSCRIPTEN AND NOT INTERN_BUILD_MOBILE) endif() if(USE_CUDA AND NOT USE_ROCM) + list(APPEND ATen_CUDA_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/cutlass/include) if($ENV{ATEN_STATIC_CUDA}) list(APPEND ATen_CUDA_DEPENDENCY_LIBS ${CUDA_LIBRARIES} @@ -593,6 +612,7 @@ set(ATen_MOBILE_BENCHMARK_SRCS ${ATen_MOBILE_BENCHMARK_SRCS} PARENT_SCOPE) set(ATen_MOBILE_TEST_SRCS ${ATen_MOBILE_TEST_SRCS} ${ATen_VULKAN_TEST_SRCS} PARENT_SCOPE) set(ATen_VEC_TEST_SRCS ${ATen_VEC_TEST_SRCS} PARENT_SCOPE) set(ATen_QUANTIZED_TEST_SRCS ${ATen_QUANTIZED_TEST_SRCS} PARENT_SCOPE) +set(ATen_MPS_TEST_SRCS ${ATen_MPS_TEST_SRCS} PARENT_SCOPE) set(ATen_CPU_INCLUDE ${ATen_CPU_INCLUDE} PARENT_SCOPE) set(ATen_THIRD_PARTY_INCLUDE ${ATen_THIRD_PARTY_INCLUDE} PARENT_SCOPE) set(ATen_CUDA_INCLUDE ${ATen_CUDA_INCLUDE} PARENT_SCOPE) diff --git a/aten/src/ATen/Context.cpp b/aten/src/ATen/Context.cpp index 4e8c9cae04f7..7086a05ab6c7 100644 --- a/aten/src/ATen/Context.cpp +++ b/aten/src/ATen/Context.cpp @@ -104,6 +104,30 @@ void Context::setAllowTF32CuDNN(bool b) { allow_tf32_cudnn = b; } +bool Context::userEnabledFlashSDP() const { + return enabled_flashSDP; +} + +void Context::setSDPUseFlash(bool e) { + enabled_flashSDP = e; +} + +bool Context::userEnabledMemEfficientSDP() const { + return enabled_mem_efficientSDP; +} + +void Context::setSDPUseMemEfficient(bool e) { + enabled_mem_efficientSDP = e; +} + +bool Context::userEnabledMathSDP() const { + return enabled_mathSDP; +} + +void Context::setSDPUseMath(bool e) { + enabled_mathSDP = e; +} + // NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays) static const char cublas_config_var_name[] = "CUBLAS_WORKSPACE_CONFIG"; // NOLINTNEXTLINE(cppcoreguidelines-avoid-c-arrays,modernize-avoid-c-arrays) @@ -125,7 +149,11 @@ bool Context::checkCuBLASConfigDeterministic() { void Context::alertCuBLASConfigNotDeterministic() const { static bool cublas_config_deterministic = checkCuBLASConfigDeterministic(); - TORCH_CHECK(!deterministicAlgorithms() || cublas_config_deterministic, + if (C10_LIKELY(!deterministicAlgorithms() || cublas_config_deterministic)) { + return; + } + + auto msg = c10::str( "Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or ", "`at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because ", "it uses CuBLAS and you have CUDA >= 10.2. 
To enable deterministic behavior in this ", @@ -134,6 +162,12 @@ void Context::alertCuBLASConfigNotDeterministic() const { cublas_config_var_name, "=", cublas_deterministic_configs[1], ". For more information, go to ", "https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility" ); + + if (deterministicAlgorithmsWarnOnly()) { + TORCH_WARN(msg); + } else { + TORCH_CHECK(false, msg); + } } bool Context::benchmarkCuDNN() const { @@ -298,9 +332,12 @@ const std::vector& Context::supportedQEngines() { #ifdef USE_FBGEMM if (fbgemm::fbgemmSupportedCPU()) { + // The X86 qengine is available if and only if FBGEMM is available + engines.push_back(at::kX86); engines.push_back(at::kFBGEMM); } #endif + return engines; }(); return supported_qengines; diff --git a/aten/src/ATen/Context.h b/aten/src/ATen/Context.h index 8f3928376473..48e3c935a2c0 100644 --- a/aten/src/ATen/Context.h +++ b/aten/src/ATen/Context.h @@ -126,6 +126,26 @@ class TORCH_API Context { bool deterministicCuDNN() const; void setDeterministicCuDNN(bool); + // Note [Disabling Fused SDP Kernels] + // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + // Flash and Memory Efficient SDP kernels are enabled by default. + // However, they can be disabled by setting + // at::globalContext().setUserEnabledFlashSDP(false) flag. + // This is useful for debugging purposes. For example, if you want to + // compare the performance of the flash SDP kernels with the unfused + // kernel, you can disable the flash SDP kernels. By disabling + // the math SDP kernel, you can force your code to use flash kernels. + // The math SDP kernel can be disabled by setting + // at::globalContext().setUserEnabledMathSDP(false) flag. + void setSDPUseFlash(bool); + bool userEnabledFlashSDP() const; + + void setSDPUseMemEfficient(bool); + bool userEnabledMemEfficientSDP() const; + + void setSDPUseMath(bool); + bool userEnabledMathSDP() const; + at::LinalgBackend linalgPreferredBackend() const; void setLinalgPreferredBackend(at::LinalgBackend); @@ -253,7 +273,14 @@ class TORCH_API Context { bool deterministic_cudnn = false; bool _deterministic_algorithms = false; bool _deterministic_algorithms_warn_only = false; + bool enabled_flashSDP = true; + bool enabled_mem_efficientSDP = true; + bool enabled_mathSDP = true; +#ifdef USE_ROCM + bool benchmark_cudnn = true; +#else bool benchmark_cudnn = false; +#endif Float32MatmulPrecision float32_matmul_precision = at::Float32MatmulPrecision::HIGHEST; int benchmark_limit_cudnn = 10; diff --git a/aten/src/ATen/DLConvertor.cpp b/aten/src/ATen/DLConvertor.cpp index fb3f3596e1fe..614dc46158e8 100644 --- a/aten/src/ATen/DLConvertor.cpp +++ b/aten/src/ATen/DLConvertor.cpp @@ -215,11 +215,22 @@ void deleter(DLManagedTensor* arg) { // This function returns a shared_ptr to memory managed DLpack tensor // constructed out of ATen tensor DLManagedTensor* toDLPack(const Tensor& src) { + // create a new tensor with possibly normalized strides + // gh-83069 + auto shape = src.sizes(); + auto strides = src.strides().vec(); + for (int i=0; ihandle = src; + atDLMTensor->handle = view; atDLMTensor->tensor.manager_ctx = atDLMTensor; atDLMTensor->tensor.deleter = &deleter; - atDLMTensor->tensor.dl_tensor.data = src.data_ptr(); + atDLMTensor->tensor.dl_tensor.data = view.data_ptr(); int64_t device_id = 0; if (src.is_cuda()) { device_id = src.get_device(); @@ -229,10 +240,10 @@ DLManagedTensor* toDLPack(const Tensor& src) { atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src); atDLMTensor->tensor.dl_tensor.shape = // 
NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast) - const_cast(src.sizes().data()); + const_cast(view.sizes().data()); atDLMTensor->tensor.dl_tensor.strides = // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast) - const_cast(src.strides().data()); + const_cast(view.strides().data()); atDLMTensor->tensor.dl_tensor.byte_offset = 0; return &(atDLMTensor->tensor); } @@ -241,8 +252,10 @@ Tensor fromDLPack(const DLManagedTensor* src) { Device device = getATenDevice(src->dl_tensor.device); ScalarType stype = toScalarType(src->dl_tensor.dtype); auto deleter = [src](void* self) { - // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast) - src->deleter(const_cast(src)); + if (src->deleter) { + // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast) + src->deleter(const_cast(src)); + } }; if (!src->dl_tensor.strides) { return at::from_blob(src->dl_tensor.data, diff --git a/aten/src/ATen/DeviceGuard.h b/aten/src/ATen/DeviceGuard.h index a827a1ccc7fa..83bb31d7fd42 100644 --- a/aten/src/ATen/DeviceGuard.h +++ b/aten/src/ATen/DeviceGuard.h @@ -1,5 +1,6 @@ #pragma once +#include #include #include #include // TensorList whyyyyy @@ -29,7 +30,7 @@ inline c10::optional device_of(const c10::optional& t) { /// Return the Device of a TensorList, if the list is non-empty and /// the first Tensor is defined. (This function implicitly assumes /// that all tensors in the list have the same device.) -inline c10::optional device_of(TensorList t) { +inline c10::optional device_of(ITensorListRef t) { if (!t.empty()) { return device_of(t.front()); } else { diff --git a/aten/src/ATen/Dispatch.h b/aten/src/ATen/Dispatch.h index 08d41126a161..d2f5a244ad57 100644 --- a/aten/src/ATen/Dispatch.h +++ b/aten/src/ATen/Dispatch.h @@ -8,6 +8,10 @@ #include #include +#ifdef __CUDACC__ +#include // For CUDA_VERSION +#endif + #ifdef TEMPLATE_SELECTIVE_BUILD #include #else @@ -72,10 +76,20 @@ TORCH_API void record_kernel_function_dtype(std::string name); }) #endif +// Workaround for C10_UNUSED because CUDA 10.2 and below fails to handle unused +// attribute in the type aliasing context. Keep name long and verbose to avoid +// macro collisions. +#if defined(__CUDACC__) && CUDA_VERSION < 11000 +#define C10_UNUSED_DISPATCH_CUDA_WORKAROUND +#else +#define C10_UNUSED_DISPATCH_CUDA_WORKAROUND C10_UNUSED +#endif + #define AT_PRIVATE_CASE_TYPE_USING_HINT(enum_type, HINT, ...) \ case enum_type: { \ AT_PRIVATE_CHECK_SELECTIVE_BUILD(enum_type); \ - using HINT = c10::impl::ScalarTypeToCPPTypeT; \ + using HINT C10_UNUSED_DISPATCH_CUDA_WORKAROUND = \ + c10::impl::ScalarTypeToCPPTypeT; \ return __VA_ARGS__(); \ } @@ -186,7 +200,7 @@ inline void deprecated_AT_DISPATCH_ALL_TYPES_AND_HALF_AND_COMPLEX() {} // conditionally compile fragments of the case statements such // that the kernel functions are specialized only for the dtypes // that are needed. The NAME parameter *must* be a build time -// cons char* (can't be std::string, etc...) +// const char* (can't be std::string, etc...) // // Please ensure that the NAME is unique for every implementation // or you run the risk of over-including code for the kernel diff --git a/aten/src/ATen/EmptyTensor.cpp b/aten/src/ATen/EmptyTensor.cpp index ff91aa0bd14d..daf0b6842365 100644 --- a/aten/src/ATen/EmptyTensor.cpp +++ b/aten/src/ATen/EmptyTensor.cpp @@ -106,6 +106,35 @@ size_t computeStorageNbytes( #endif } +// not including mobile-only macros in this function, +// since mobile shouldn't be using symints. 
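For context on the Dispatch.h hunk above: AT_PRIVATE_CASE_TYPE_USING_HINT expands inside the AT_DISPATCH_* macros, and the HINT alias it declares (conventionally scalar_t) is exactly what the new CUDA < 11 workaround guards. A typical call site looks like the sketch below; the function name and kernel body are illustrative, and it assumes a contiguous CPU tensor.

    #include <ATen/ATen.h>
    #include <ATen/Dispatch.h>

    // Hypothetical example: fill a contiguous CPU tensor with ones, dispatching
    // on its dtype. Inside the lambda, `scalar_t` is the type alias produced by
    // AT_PRIVATE_CASE_TYPE_USING_HINT for the selected ScalarType.
    void fill_with_one_example(at::Tensor& t) {
      AT_DISPATCH_ALL_TYPES(t.scalar_type(), "fill_with_one_example", [&] {
        scalar_t* data = t.data_ptr<scalar_t>();
        for (int64_t i = 0; i < t.numel(); ++i) {
          data[i] = static_cast<scalar_t>(1);
        }
      });
    }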
+SymInt computeStorageNbytes( + SymIntArrayRef sizes, + SymIntArrayRef strides, + SymInt itemsize_bytes, + SymInt storage_offset + ) { + TORCH_CHECK( + sizes.size() == strides.size(), + "dimensionality of sizes (", + sizes.size(), + ") must match dimensionality of strides (", + strides.size(), + ")"); + + // size of the underlying storage is 1 bigger than the offset + // of the last element according to stride + SymInt size = 1; + for (const auto i : c10::irange(sizes.size())) { + if (sizes[i] == 0) { + return 0; + } + + size += strides[i] * (sizes[i] - 1); + } + return itemsize_bytes * (storage_offset + size); +} + TensorBase empty_generic( IntArrayRef size, c10::Allocator* allocator, @@ -140,20 +169,20 @@ TensorBase empty_generic( return tensor; } -TensorBase empty_strided_generic( - IntArrayRef size, - IntArrayRef stride, +template +TensorBase _empty_strided_generic( + T size, + T stride, c10::Allocator* allocator, c10::DispatchKeySet ks, ScalarType scalar_type) { at::detail::check_size_nonnegative(size); at::detail::raise_warning_for_complex_half(scalar_type); caffe2::TypeMeta dtype = scalarTypeToTypeMeta(scalar_type); - size_t size_bytes = computeStorageNbytes(size, stride, dtype.itemsize()); + auto size_bytes = computeStorageNbytes(size, stride, dtype.itemsize()); auto storage_impl = c10::make_intrusive( c10::StorageImpl::use_byte_size_t(), size_bytes, - allocator->allocate(size_bytes), allocator, /*resizeable=*/true); @@ -163,6 +192,24 @@ TensorBase empty_strided_generic( return tensor; } +TensorBase empty_strided_generic( + IntArrayRef size, + IntArrayRef stride, + c10::Allocator* allocator, + c10::DispatchKeySet ks, + ScalarType scalar_type) { + return _empty_strided_generic(size, stride, allocator, ks, scalar_type); +} + +TensorBase empty_strided_symint_generic( + SymIntArrayRef size, + SymIntArrayRef stride, + c10::Allocator* allocator, + c10::DispatchKeySet ks, + ScalarType scalar_type) { + return _empty_strided_generic(size, stride, allocator, ks, scalar_type); +} + TensorBase empty_cpu(IntArrayRef size, ScalarType dtype, bool pin_memory, c10::optional memory_format_opt) { auto allocator = GetCPUAllocatorMaybePinned(pin_memory); @@ -303,9 +350,7 @@ TensorBase empty_symint_meta( auto scalar_type = dtype_or_default(dtype_opt); auto *allocator = GetAllocator(kMeta); constexpr c10::DispatchKeySet meta_dks(c10::DispatchKey::Meta); - // TODO: do this. 
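To make the storage-size rule in the SymInt computeStorageNbytes overload above concrete: the required bytes are itemsize * (storage_offset + 1 + sum_i strides[i] * (sizes[i] - 1)), and 0 if any dimension has size 0. A small standalone sketch of the same arithmetic (the function name is ours, plain int64_t in place of SymInt):

    #include <cstdint>
    #include <vector>

    int64_t storage_nbytes_example(const std::vector<int64_t>& sizes,
                                   const std::vector<int64_t>& strides,
                                   int64_t itemsize,
                                   int64_t storage_offset) {
      // Storage must hold one element past the offset of the last element.
      int64_t size = 1;
      for (size_t i = 0; i < sizes.size(); ++i) {
        if (sizes[i] == 0) {
          return 0;  // empty tensors need no storage
        }
        size += strides[i] * (sizes[i] - 1);
      }
      return itemsize * (storage_offset + size);
    }

    // e.g. a contiguous 2x3 float tensor: sizes {2, 3}, strides {3, 1}, itemsize 4,
    // storage_offset 0 -> 4 * (0 + 1 + 3*1 + 1*2) = 24 bytes.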
Note that naive implementation will choke on truly - // unknown sizes without on the fly reasoning - // at::detail::check_size_nonnegative(size); + at::detail::check_size_nonnegative(size); at::detail::raise_warning_for_complex_half(scalar_type); caffe2::TypeMeta dtype = scalarTypeToTypeMeta(scalar_type); SymInt size_bytes = dtype.itemsize(); @@ -343,7 +388,7 @@ TensorBase empty_symint_meta( TORCH_CHECK(0, "other memory format not implemented yet"); } - tensor.unsafeGetTensorImpl()->set_sym_sizes_and_strides(size, strides); + tensor.unsafeGetTensorImpl()->set_sizes_and_strides(size, strides); return tensor; } @@ -395,4 +440,40 @@ TensorBase empty_strided_meta( options.pinned_memory_opt()); } +TensorBase empty_strided_symint_meta(SymIntArrayRef size, SymIntArrayRef stride, + ScalarType dtype) { + auto *allocator = GetAllocator(kMeta); + constexpr c10::DispatchKeySet meta_dks(c10::DispatchKey::Meta); + return at::detail::empty_strided_symint_generic( + size, stride, allocator, meta_dks, dtype); +} + +TensorBase empty_strided_symint_meta( + SymIntArrayRef size, + SymIntArrayRef stride, + c10::optional dtype_opt, + c10::optional layout_opt, + c10::optional device_opt, + c10::optional pin_memory_opt) { + auto device = device_or_default(device_opt); + TORCH_INTERNAL_ASSERT_DEBUG_ONLY(device.type() == DeviceType::Meta); + TORCH_INTERNAL_ASSERT_DEBUG_ONLY(layout_or_default(layout_opt) == Layout::Strided); + + auto dtype = dtype_or_default(dtype_opt); + return at::detail::empty_strided_symint_meta(size, stride, dtype); +} + +TensorBase empty_strided_symint_meta( + SymIntArrayRef size, + SymIntArrayRef stride, + const TensorOptions &options) { + return at::detail::empty_strided_symint_meta( + size, + stride, + optTypeMetaToScalarType(options.dtype_opt()), + options.layout_opt(), + options.device_opt(), + options.pinned_memory_opt()); +} + }} // namespace at::detail diff --git a/aten/src/ATen/EmptyTensor.h b/aten/src/ATen/EmptyTensor.h index 06a33601a154..969eeb6dc5ee 100644 --- a/aten/src/ATen/EmptyTensor.h +++ b/aten/src/ATen/EmptyTensor.h @@ -4,7 +4,8 @@ namespace at { namespace detail { -inline void check_size_nonnegative(IntArrayRef size) { +template +inline void check_size_nonnegative(ArrayRefType size) { for (auto x : size) { TORCH_CHECK( x >= 0, @@ -24,6 +25,11 @@ TORCH_API size_t computeStorageNbytes( IntArrayRef strides, size_t itemsize, size_t storage_offset = 0); +TORCH_API SymInt computeStorageNbytes( + SymIntArrayRef sizes, + SymIntArrayRef strides, + SymInt itemsize, + SymInt storage_offset = 0); TORCH_API TensorBase empty_generic( IntArrayRef size, @@ -39,6 +45,13 @@ TORCH_API TensorBase empty_strided_generic( c10::DispatchKeySet ks, ScalarType scalar_type); +TORCH_API TensorBase empty_strided_symint_generic( + SymIntArrayRef size, + SymIntArrayRef stride, + c10::Allocator* allocator, + c10::DispatchKeySet ks, + ScalarType scalar_type); + TORCH_API TensorBase empty_cpu( IntArrayRef size, ScalarType dtype, @@ -113,5 +126,23 @@ TORCH_API TensorBase empty_strided_meta( IntArrayRef stride, const TensorOptions& options); +TORCH_API TensorBase empty_strided_symint_meta( + SymIntArrayRef size, + SymIntArrayRef stride, + ScalarType dtype); + +TORCH_API TensorBase empty_strided_symint_meta( + SymIntArrayRef size, + SymIntArrayRef stride, + c10::optional dtype_opt, + c10::optional layout_opt, + c10::optional device_opt, + c10::optional pin_memory_opt); + +TORCH_API TensorBase empty_strided_symint_meta( + SymIntArrayRef size, + SymIntArrayRef stride, + const TensorOptions& options); + } // 
namespace detail } // namespace at diff --git a/aten/src/ATen/ExpandUtils.cpp b/aten/src/ATen/ExpandUtils.cpp index a44005a2ef81..ee846c9b82e3 100644 --- a/aten/src/ATen/ExpandUtils.cpp +++ b/aten/src/ATen/ExpandUtils.cpp @@ -13,8 +13,8 @@ TensorBase expand_slow_path(const TensorBase &self, IntArrayRef size) { namespace { // NOTE: are_expandable did a similar check, please keep them sync if change is needed -template -Container infer_size_impl(IntArrayRef a, IntArrayRef b) { +template +Container infer_size_impl(ArrayType a, ArrayType b) { size_t dimsA = a.size(); size_t dimsB = b.size(); size_t ndim = dimsA > dimsB ? dimsA : dimsB; @@ -25,8 +25,8 @@ Container infer_size_impl(IntArrayRef a, IntArrayRef b) { ptrdiff_t offset = ndim - 1 - i; ptrdiff_t dimA = dimsA - 1 - offset; ptrdiff_t dimB = dimsB - 1 - offset; - int64_t sizeA = (dimA >= 0) ? a[dimA] : 1; - int64_t sizeB = (dimB >= 0) ? b[dimB] : 1; + auto sizeA = (dimA >= 0) ? a[dimA] : 1; + auto sizeB = (dimB >= 0) ? b[dimB] : 1; TORCH_CHECK( sizeA == sizeB || sizeA == 1 || sizeB == 1, @@ -35,7 +35,7 @@ Container infer_size_impl(IntArrayRef a, IntArrayRef b) { ") at non-singleton dimension ", i); // 1s map to the other size (even 0). - expandedSizes[i] = sizeA == 1 ? sizeB : sizeA; + expandedSizes[i] = sizeA == 1 ? std::move(sizeB) : std::move(sizeA); } return expandedSizes; diff --git a/aten/src/ATen/ExpandUtils.h b/aten/src/ATen/ExpandUtils.h index 7a81076a7dd0..9e48421e540f 100644 --- a/aten/src/ATen/ExpandUtils.h +++ b/aten/src/ATen/ExpandUtils.h @@ -3,6 +3,7 @@ #ifndef AT_PER_OPERATOR_HEADERS #include #else +#include #include #endif @@ -20,6 +21,8 @@ namespace at { TORCH_API std::vector infer_size(IntArrayRef a, IntArrayRef b); TORCH_API DimVector infer_size_dimvector(IntArrayRef a, IntArrayRef b); +TORCH_API SymDimVector +infer_size_symdimvector(SymIntArrayRef a, SymIntArrayRef b); // Named type instead of a pair/tuple so that we can be sure to // construct the vectors in place and get NRVO. @@ -93,10 +96,11 @@ inline void check_defined( inline c10::MaybeOwned expand_inplace( const Tensor& tensor, const Tensor& to_expand) { - if (tensor.sizes().equals(to_expand.sizes())) { + if (tensor.sym_sizes().equals(to_expand.sym_sizes())) { return c10::MaybeOwned::borrowed(to_expand); } - return c10::MaybeOwned::owned(to_expand.expand(tensor.sizes())); + return c10::MaybeOwned::owned( + to_expand.expand_symint(tensor.sym_sizes())); } inline c10::MaybeOwned expand_inplace( @@ -437,16 +441,17 @@ inline std::vector expand_outplace(TensorList to_expand) { return result; } -static inline Tensor sum_to( +template +inline Tensor _sum_to( Tensor tensor, - const c10::SymIntArrayRef shape, + const c10::ArrayRef shape, bool always_return_non_view = false) { if (shape.size() == 0) { return tensor.sum(); } - auto sizes = tensor.sym_sizes(); - c10::SmallVector reduce_dims; + auto sizes = at::symint::sizes(tensor); + c10::SmallVector reduce_dims; const int64_t leading_dims = sizes.size() - shape.size(); for (const auto i : c10::irange(leading_dims)) { reduce_dims.push_back(i); @@ -458,29 +463,34 @@ static inline Tensor sum_to( } if (!reduce_dims.empty()) { - tensor = tensor.sum_symint(reduce_dims, /*keepdim=*/true); + tensor = tensor.sum(reduce_dims, /*keepdim=*/true); } if (always_return_non_view) { // This is only actually used by the functionalization pass. // We want to be able to guarantee that this function doesn't return a view // of the input. - return leading_dims > 0 ? at::view_copy_symint(tensor, shape) + return leading_dims > 0 ? 
at::symint::view_copy(tensor, shape) : tensor.clone(); } else { - return leading_dims > 0 ? tensor.view_symint(shape) : tensor; + return leading_dims > 0 ? at::symint::view(tensor, shape) : tensor; } } +inline Tensor sum_to( + Tensor tensor, + const c10::SymIntArrayRef shape, + bool always_return_non_view = false) { + return _sum_to(tensor, shape, always_return_non_view); +} + // Sums `tensor` repeatedly to produce a tensor of shape `shape`. // Precondition: is_expandable_to(shape, tensor.sizes()) must be true -static inline Tensor sum_to( +inline Tensor sum_to( Tensor tensor, const IntArrayRef shape, bool always_return_non_view = false) { - auto sym_size = c10::SymIntArrayRef( - reinterpret_cast(shape.data()), shape.size()); - return sum_to(tensor, sym_size, always_return_non_view); + return _sum_to(tensor, shape, always_return_non_view); } static inline bool is_expandable_to( diff --git a/aten/src/ATen/FunctionalInverses.cpp b/aten/src/ATen/FunctionalInverses.cpp index 471c74a73c95..2bdc76c7764a 100644 --- a/aten/src/ATen/FunctionalInverses.cpp +++ b/aten/src/ATen/FunctionalInverses.cpp @@ -127,9 +127,9 @@ Tensor FunctionalInverses::_neg_view_copy_inverse(const Tensor& base, const Tens } } -Tensor FunctionalInverses::as_strided_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, at::IntArrayRef size, at::IntArrayRef stride, c10::optional storage_offset) { +Tensor FunctionalInverses::as_strided_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, at::SymIntArrayRef size, at::SymIntArrayRef stride, c10::optional storage_offset) { // Pessimism: we can't reapply views for as_strided_scatter. - return base.as_strided_scatter(mutated_view, size, stride, storage_offset); + return base.as_strided_scatter_symint(mutated_view, size, stride, storage_offset); } Tensor FunctionalInverses::diagonal_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, int64_t offset, int64_t dim1, int64_t dim2) { @@ -137,19 +137,15 @@ Tensor FunctionalInverses::diagonal_copy_inverse(const Tensor& base, const Tenso return base.diagonal_scatter(mutated_view, offset, dim1, dim2); } -Tensor FunctionalInverses::expand_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, at::IntArrayRef size, bool implicit) { - return at::sum_to(mutated_view, base.sizes(),/*always_return_non_view=*/!reapply_views); -} - -Tensor FunctionalInverses::expand_copy_SymInt_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, c10::SymIntArrayRef size, bool implicit) { - return at::sum_to(mutated_view, c10::asIntArrayRefSlow(base.sym_sizes()),/*always_return_non_view=*/!reapply_views); +Tensor FunctionalInverses::expand_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, at::SymIntArrayRef size, bool implicit) { + return at::sum_to(mutated_view, base.sym_sizes(),/*always_return_non_view=*/!reapply_views); } Tensor FunctionalInverses::permute_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, at::IntArrayRef dims) { return at::functionalization::permute_copy_inverse(mutated_view, dims, reapply_views); } -Tensor FunctionalInverses::_reshape_alias_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, at::IntArrayRef size, at::IntArrayRef stride) { +Tensor FunctionalInverses::_reshape_alias_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, at::SymIntArrayRef size, at::SymIntArrayRef stride) { // Note that 
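Two quick worked examples for the ExpandUtils changes above, with shapes that are ours rather than from the patch. infer_size applies the broadcasting rule ("1s map to the other size"), and sum_to is its gradient-side counterpart, reducing a broadcasted tensor back to a target shape:

    #include <vector>

    #include <ATen/ATen.h>
    #include <ATen/ExpandUtils.h>

    void expand_utils_examples() {
      // Broadcasting: trailing dims are aligned, size-1 dims stretch to match.
      auto out_shape = at::infer_size({3, 1, 5}, {4, 5});  // -> {3, 4, 5}
      (void)out_shape;

      // sum_to: reduce a {4, 3, 5} tensor down to shape {3, 1}. The leading dim
      // is summed away and dim 2 is summed with keepdim, then viewed, so every
      // entry of the result equals 4 * 5 = 20 here.
      at::Tensor grad = at::ones({4, 3, 5});
      std::vector<int64_t> target_shape = {3, 1};
      at::Tensor reduced = at::sum_to(grad, target_shape);
      (void)reduced;
    }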
I'm directly calling reshape(), and ignoring the strides. // _reshape_alias() isn't available from user code, and is an implementation detail of reshape(). // Specifically, passing in the strides directly can get us into trouble in cases like: @@ -157,16 +153,17 @@ Tensor FunctionalInverses::_reshape_alias_copy_inverse(const Tensor& base, const // When we eventually run the _reshape_alias_inverse() call here, if we were to pass in both sizes and strides, // The call would fail because `mutated_view` doesn't have enough bytes of storage. if (reapply_views) { - return at::_reshape_alias(mutated_view, base.sizes(), base.strides()); + return at::_reshape_alias_symint(mutated_view, base.sym_sizes(), base.sym_strides()); } else { - return at::_reshape_alias_copy(mutated_view, base.sizes(), base.strides()); + return at::_reshape_alias_copy_symint(mutated_view, base.sym_sizes(), base.sym_strides()); } } -Tensor FunctionalInverses::select_copy_int_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, int64_t dim, int64_t index) { +Tensor FunctionalInverses::select_copy_int_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, int64_t dim, c10::SymInt index) { // Pessimism: we can't reapply views for slice_scatter. - return base.select_scatter(mutated_view, dim, index); + return base.select_scatter_symint(mutated_view, dim, index); } + Tensor FunctionalInverses::detach_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views) { // the functionalization pass doesn't care about autograd metadata - as a view, I think detach() is just an identity function return mutated_view; @@ -176,36 +173,36 @@ Tensor FunctionalInverses::lift_fresh_copy_inverse(const Tensor& base, const Ten return mutated_view; } -Tensor FunctionalInverses::slice_copy_Tensor_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, int64_t dim, c10::optional start, c10::optional end, int64_t step) { +Tensor FunctionalInverses::slice_copy_Tensor_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, int64_t dim, c10::optional start, c10::optional end, c10::SymInt step) { // Pessimism: we can't reapply views for slice_scatter. - return base.slice_scatter(mutated_view, dim, start, end, step); + return base.slice_scatter_symint(mutated_view, dim, start, end, step); } -Tensor FunctionalInverses::split_copy_Tensor_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, int64_t mutated_view_idx, int64_t split_size, int64_t dim) { +Tensor FunctionalInverses::split_copy_Tensor_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, int64_t mutated_view_idx, c10::SymInt split_size, int64_t dim) { // It would be nice if this logic could be re-used from autograd's split_backward(), but I don't think it can. // For functionalization, we have only have one of the tensors from the TensorList outputed by split(), and we want to layer i // on top of the base tensor. // For autograd, we have all of the tensors outputted by split() and we just want to stack them. - dim = at::maybe_wrap_dim(dim, base.sizes().size()); - auto dim_size = base.size(dim); - auto start = mutated_view_idx * split_size; - auto end = start + split_size; + dim = at::maybe_wrap_dim(dim, base.dim()); + auto dim_size = base.sym_size(dim); + auto start = split_size * mutated_view_idx; + auto end = split_size + start; if (end > dim_size) end = dim_size; // Pessimism: we can't reapply views for slice_scatter. 
- return base.slice_scatter(mutated_view, dim, start, end, 1); + return base.slice_scatter_symint(mutated_view, dim, start, end, 1); } -Tensor FunctionalInverses::split_with_sizes_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, int64_t mutated_view_idx, at::IntArrayRef split_sizes, int64_t dim) { - dim = at::maybe_wrap_dim(dim, base.sizes().size()); - auto dim_size = base.size(dim); - int64_t start = 0; +Tensor FunctionalInverses::split_with_sizes_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, int64_t mutated_view_idx, c10::SymIntArrayRef split_sizes, int64_t dim) { + dim = at::maybe_wrap_dim(dim, base.dim()); + auto dim_size = base.sym_size(dim); + c10::SymInt start = 0; for (auto i = 0; i < mutated_view_idx; ++i) { start += split_sizes[i]; } auto end = start + split_sizes[mutated_view_idx]; if (end > dim_size) end = dim_size; // Pessimism: we can't reapply views for slice_scatter. - return base.slice_scatter(mutated_view, dim, start, end, 1); + return base.slice_scatter_symint(mutated_view, dim, start, end, 1); } Tensor FunctionalInverses::squeeze_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views) { @@ -232,6 +229,11 @@ Tensor FunctionalInverses::transpose_copy_int_inverse(const Tensor& base, const } } +Tensor FunctionalInverses::_nested_view_from_buffer_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, const Tensor& nested_size_tensor, const Tensor& nested_stride_tensor, IntArrayRef offsets) { + TORCH_INTERNAL_ASSERT(false, "Attempted to call _nested_view_from_buffer() during the functionalization pass. For now, nested tensors aren't supported during functionalization"); + return Tensor(); +} + Tensor FunctionalInverses::unsqueeze_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, int64_t dim) { if (reapply_views) { return at::squeeze(mutated_view, dim); @@ -291,15 +293,7 @@ Tensor FunctionalInverses::unbind_copy_int_inverse(const Tensor& base, const Ten return base.select_scatter(mutated_view, dim, mutated_view_idx); } -Tensor FunctionalInverses::view_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, at::IntArrayRef size) { - if (reapply_views) { - return mutated_view.view(base.sizes()); - } else { - return at::view_copy(mutated_view, base.sizes()); - } -} - -Tensor FunctionalInverses::view_copy_SymInt_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, c10::SymIntArrayRef size) { +Tensor FunctionalInverses::view_copy_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, at::SymIntArrayRef size) { if (reapply_views) { return mutated_view.view_symint(base.sym_sizes()); } else { @@ -307,6 +301,7 @@ Tensor FunctionalInverses::view_copy_SymInt_inverse(const Tensor& base, const Te } } + Tensor FunctionalInverses::view_copy_dtype_inverse(const Tensor& base, const Tensor& mutated_view, bool reapply_views, at::ScalarType dtype) { if (reapply_views) { return mutated_view.view(base.scalar_type()); diff --git a/aten/src/ATen/FunctionalStorageImpl.cpp b/aten/src/ATen/FunctionalStorageImpl.cpp index 2fad6bfad606..8e80ce0ca7dd 100644 --- a/aten/src/ATen/FunctionalStorageImpl.cpp +++ b/aten/src/ATen/FunctionalStorageImpl.cpp @@ -1,7 +1,9 @@ #include +#include #include #include +#include #include #include @@ -13,23 +15,9 @@ ViewMeta ViewMeta::to_out_idx(int64_t out_idx) { return ViewMeta(forward_fn, reverse_fn, out_idx); } -Alias::Alias(const at::Tensor& base) { - 
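The start/end bookkeeping in the split inverses above is plain prefix-summing over the split sizes; here is a standalone sketch with hypothetical numbers (the helper name is ours):

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Compute the [start, end) slice along `dim` that view `mutated_view_idx`
    // of split_with_sizes occupies, clamped to the dimension size, mirroring
    // split_with_sizes_copy_inverse.
    std::pair<int64_t, int64_t> split_range_example(
        const std::vector<int64_t>& split_sizes,
        int64_t mutated_view_idx,
        int64_t dim_size) {
      int64_t start = 0;
      for (int64_t i = 0; i < mutated_view_idx; ++i) {
        start += split_sizes[i];
      }
      int64_t end = std::min(start + split_sizes[mutated_view_idx], dim_size);
      return {start, end};
    }

    // e.g. split_sizes {2, 3, 4}, mutated_view_idx 2, dim_size 9 -> {5, 9}:
    // the third view is scattered back with slice_scatter over [5, 9).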
TORCH_INTERNAL_ASSERT(!at::functionalization::impl::isFunctionalTensor(base)); - base_ = base; -} - -const at::Tensor& Alias::base() const { - return base_; -} - -void Alias::add_update(const at::Tensor& updated_val, const std::vector& metas) { - updates_.push_back({updated_val, metas}); - generation_++; -} - // Note [Functionalization: Alias Removal Part 2] // See Note [Functionalization: Alias Removal] for more details. -// This function applies a single update from one of the views to the Alias object. +// This function applies a single update from one of the views to the StorageImpl. // We start out with and , and our goal is to end up with . // Consider this program: // @@ -44,15 +32,15 @@ void Alias::add_update(const at::Tensor& updated_val, const std::vectorhas_symbolic_sizes_strides()) { + // Today, the two implementations of SymInt are in Python (proxy tensor), + // and lazy tensor (LTC/XLA). + // LTC hasn't implemented SymInt support yet though + // Once it does, we should remove this check. + if (value.key_set().has(c10::DispatchKey::Python)) { + return value.storage().sym_nbytes(); + } + } + // XLA storage objects also do not properly track nbytes. + return at::detail::computeStorageNbytes(value.sizes(), value.strides(), value.dtype().itemsize(), value.storage_offset()); +} + +FunctionalStorageImpl::FunctionalStorageImpl(const Tensor& base) + : c10::StorageImpl( + c10::StorageImpl::use_byte_size_t(), + get_nbytes(base), + DataPtr{nullptr, base.device()}, + GetAllocator(kMeta), + /*resizeable=*/true + ), + base_(base) + { + TORCH_INTERNAL_ASSERT(!at::functionalization::impl::isFunctionalTensor(base_)); +} + +void FunctionalStorageImpl::add_update(const Tensor& updated_val, const std::vector& metas) { + TORCH_CHECK(!frozen_, "cannot mutate tensors with frozen storage"); + updates_.push_back({updated_val, metas}); + generation_++; +} + +bool FunctionalStorageImpl::apply_updates() { // N.B:none of the tensors used in this function should be FunctionalTensorWrappers at this point. // The only reason we currently need the TLS exclude guard here is because of functorch's DynamicLayer stack. // It adds the Functionalize key into TLS before redispatching to the functionalization kernels, @@ -89,33 +111,5 @@ bool Alias::apply_updates() { return any_updates; } -FunctionalStorageImpl::FunctionalStorageImpl(const Tensor& value) - : c10::StorageImpl( - c10::StorageImpl::use_byte_size_t(), - value.numel() * value.dtype().itemsize(), - DataPtr{nullptr, value.device()}, - // Using a null allocator, since FunctionalTensorImpl's aren't resizeable. - nullptr, - /*resizeable=*/false - ), - alias_(Alias(value)) - {} - -void FunctionalStorageImpl::add_update(const Tensor& updated_val, const std::vector& view_metas) { - alias_.add_update(updated_val, view_metas); -} - -bool FunctionalStorageImpl::apply_updates() { - return alias_.apply_updates(); -} - -const Tensor& FunctionalStorageImpl::base() { - return alias_.base(); -} - -size_t FunctionalStorageImpl::generation() const { - return alias_.generation(); -} - } // namespace functionalization } // namespace at diff --git a/aten/src/ATen/FunctionalStorageImpl.h b/aten/src/ATen/FunctionalStorageImpl.h index 6caeac2737fd..dbaf30c9963d 100644 --- a/aten/src/ATen/FunctionalStorageImpl.h +++ b/aten/src/ATen/FunctionalStorageImpl.h @@ -46,13 +46,18 @@ struct ViewMeta { ViewMeta to_out_idx(int64_t out_idx); }; -// Alias represents the state shared by (potentially multiple) views of the same -// tensor. 
For example, in the following code: +// FunctionalStorageImpl is a subclass of StorageImpl used by the +// functionalization pass. It has no underlying data (similar to meta storage). +// It also knows how to reflect mutations to tensors in the absence of a valid +// data pointer. +// +// A storage represents the state shared by (potentially multiple) views of the +// same tensor. For example, in the following code: // // b = a.view1(...) // c = b.view2(...) // b.add_(1) -// --> alias.add_update(b, {view1_meta}) +// --> storage.add_update(b, {view1_meta}) // // The call to add_(1) will result in a call to alias.add_update(b, // {view1_meta}), queueing up the mutation from b onto the alias. Later, suppose @@ -65,58 +70,49 @@ struct ViewMeta { // --> c.sync_() // --> alias.apply_updates() // after this, the alias will be updated to // reflect the mutation to b -class Alias { +struct TORCH_API FunctionalStorageImpl : public c10::StorageImpl { public: struct Update { const at::Tensor new_val; const std::vector view_metas; }; - explicit Alias(const at::Tensor& base); - const at::Tensor& base() const; + + explicit FunctionalStorageImpl(const Tensor& value); + + void add_update( + const Tensor& updated_val, + const std::vector& view_metas); + bool apply_updates(); + const Tensor& base() { + return base_; + } size_t generation() const { return generation_; } - void add_update( - const at::Tensor& updated_val, - const std::vector& metas); - bool apply_updates(); + void freeze() { + frozen_ = true; + } + + ~FunctionalStorageImpl() override = default; private: // NB: base_ should always point to a tensor BELOW the current // functionalization layer. This is mainly to avoid reference cycles. e.g. // given `b = a.view(...)` Both a.storage_ and b.storage_ are a - // FunctionStorageImpl containing an Alias, with contains a Tensor `base_`. In - // this case (where a and b are FunctionalTensorWrapper's), base_ should point - // not to a, but to a's unwrapped value, a.value_` See Note - // [Functionalization: Alias Removal] for a diagram that shows this visually. + // FunctionStorageImpl containing an Walualias, with contains a Tensor + // `base_`. In this case (where a and b are FunctionalTensorWrapper's), base_ + // should point not to a, but to a's unwrapped value, a.value_` See Note + // [Functionalization: Walualias Removal] for a diagram that shows this + // visually. at::Tensor base_; std::vector updates_; // generation_ gets incremented every time a mutation is queued onto the // alias. It is used to determine if a given tensor is "up to date", or if it // needs to be regenerated from the alias. size_t generation_ = 0; -}; - -// FunctionalStorageImpl is a subclass of StorageImpl used by the -// functionalization pass. It has no underlying data (similar to meta storage). -// It also knows how to reflect mutations to tensors in the absence of a valid -// data pointer. It does this by separately storing an Alias object, which knows -// how to reflect mutations that may have happened to views of the original -// tensor. -struct TORCH_API FunctionalStorageImpl : public c10::StorageImpl { - explicit FunctionalStorageImpl(const Tensor& value); - - void add_update( - const Tensor& updated_val, - const std::vector& view_metas); - bool apply_updates(); - const Tensor& base(); - size_t generation() const; - - ~FunctionalStorageImpl() override = default; - - private: - at::functionalization::Alias alias_; + // If frozen, no more mutations are allowed on this storage. 
Once frozen, a + // storage cannot be unfrozen. + bool frozen_ = false; }; } // namespace functionalization diff --git a/aten/src/ATen/FunctionalTensorWrapper.cpp b/aten/src/ATen/FunctionalTensorWrapper.cpp index a8c58466a052..2c3a12020eb6 100644 --- a/aten/src/ATen/FunctionalTensorWrapper.cpp +++ b/aten/src/ATen/FunctionalTensorWrapper.cpp @@ -4,6 +4,7 @@ #include #include #include +#include #include #include @@ -36,19 +37,10 @@ void FunctionalTensorWrapper::set_constructor_metadata() { // Functorch transforms all have their own wrapper tensors (e.g. BatchedTensorImpl) which expect // to participate in the functorch transforms. key_set_ = key_set_ - c10::functorch_transforms_ks - c10::python_ks; - // For better error handling, - // we also don't want our wrapper tensor to be able to dispatch directly - // to a backend kernel. - // Dispatching directly to e.g. a CPU kernel would always segfault, - // because wrapper tensors don't have any real data. - // (This should never happen because we should always hit a functionalization kernel, - // but can help make bugs less nasty). - // Here, we defensively remove any backend keys from the wrapper's keyset. - // We don't want to remove actual backend bits though (say we're redispatching to autograd; - // we need to know if we're dispatching to AutogradCPU or AutogradXLA). - // Instead, it's sufficient to remove the `Dense` dispatch key, - // which prevents us from accidentally trying to directly run a CPU/CUDA kernel. - key_set_ = key_set_.remove(c10::DispatchKey::Dense); + // We override a bunch of _custom(), so make sure they get called + // TODO: metadata copying may not actually be necessary then + set_custom_sizes_strides(SizesStridesPolicy::CustomSizes); + set_custom_device(true); } FunctionalTensorWrapper::FunctionalTensorWrapper(const Tensor& value) @@ -62,6 +54,10 @@ FunctionalTensorWrapper::FunctionalTensorWrapper(const Tensor& value) set_constructor_metadata(); } +void FunctionalTensorWrapper::freeze_storage() const { + functional_storage_impl()->freeze(); +} + // Note [Functionalization: Alias Removal] // When someone calls a view() op during the functionalization pass, e.g. 'b = a.view(...)', // we link `b` and `a` to a shared Alias object to preserve the aliasing relationship. @@ -202,12 +198,7 @@ void FunctionalTensorWrapper::replace_(const Tensor& other) { value_ = other; // out= ops are allowed to resize the output tensors, mutating both the data and metadata of the tensor. // We need to propagate that metadata mutation to the wrapper (new size). - if (sizes() != value_.sizes() || strides() != value_.strides()) { - set_sizes_and_strides(value_.sizes(), value_.strides()); - } - if (storage_offset() != value_.storage_offset()) { - set_storage_offset(value_.storage_offset()); - } + set_sizes_and_strides(value_.sym_sizes(), value_.sym_strides(), value_.sym_storage_offset()); if (dtype() != value_.unsafeGetTensorImpl()->dtype() || layout() != value_.unsafeGetTensorImpl()->layout()) { // .to() should not re-entrantly go through functionalization. 
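A brief, hypothetical usage sketch for the freeze machinery above. The call sequence and variable names are ours; only to_functional_tensor, freeze_functional_tensor and the "cannot mutate tensors with frozen storage" check come from this patch.

    #include <ATen/ATen.h>
    #include <ATen/FunctionalTensorWrapper.h>

    void freeze_sketch() {
      at::Tensor inner = at::ones({2, 2});
      // Wrap the tensor for the functionalization pass, then freeze its
      // FunctionalStorageImpl so no further mutations can be queued on it.
      at::Tensor wrapped = at::functionalization::impl::to_functional_tensor(inner);
      at::functionalization::impl::freeze_functional_tensor(wrapped);
      // Any mutation the pass later tries to record via add_update() now fails
      // with "cannot mutate tensors with frozen storage".
    }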
at::AutoDispatchSkipFunctionalize guard; @@ -296,19 +287,23 @@ c10::intrusive_ptr FunctionalTensorWrapper::shallow_copy_and_detach_ bool allow_tensor_metadata_change) const { if (key_set_.has(DispatchKey::Python) && !c10::impl::tls_is_dispatch_key_excluded(DispatchKey::Python)) { - auto r = pyobj_interpreter_.load(std::memory_order_acquire)->detach(this); + auto r = (*pyobj_interpreter_.load(std::memory_order_acquire))->detach(this); if (r) { r->set_version_counter(std::forward(version_counter)); r->set_allow_tensor_metadata_change(allow_tensor_metadata_change); return r; } } + auto impl = c10::make_intrusive(value_); copy_tensor_metadata( /*src_impl=*/this, /*dest_impl=*/impl.get(), /*version_counter=*/std::forward(version_counter), /*allow_tensor_metadata_change=*/allow_tensor_metadata_change); + impl->level_ = level_; + impl->generation_ = generation_; + impl->view_metas_ = view_metas_; impl->refresh_numel(); impl->refresh_contiguous(); return impl; @@ -328,6 +323,9 @@ c10::intrusive_ptr FunctionalTensorWrapper::shallow_copy_and_detach( std::move(version_counter), allow_tensor_metadata_change); } +c10::Device FunctionalTensorWrapper::device_custom() const { + return value_.unsafeGetTensorImpl()->device(); +} at::IntArrayRef FunctionalTensorWrapper::sizes_custom() const { return value_.unsafeGetTensorImpl()->sizes(); } @@ -343,12 +341,18 @@ int64_t FunctionalTensorWrapper::numel_custom() const { bool FunctionalTensorWrapper::is_contiguous_custom(at::MemoryFormat memory_format) const { return value_.unsafeGetTensorImpl()->is_contiguous(); } -c10::SymIntArrayRef FunctionalTensorWrapper::sym_sizes() const { - return value_.unsafeGetTensorImpl()->sym_sizes(); -} c10::SymIntArrayRef FunctionalTensorWrapper::sym_sizes_custom() const { return value_.unsafeGetTensorImpl()->sym_sizes(); } +c10::SymIntArrayRef FunctionalTensorWrapper::sym_strides_custom() const { + return value_.unsafeGetTensorImpl()->sym_strides(); +} +c10::SymInt FunctionalTensorWrapper::sym_size_custom(int64_t d) const { + return value_.unsafeGetTensorImpl()->sym_size(d); +} +c10::SymInt FunctionalTensorWrapper::sym_storage_offset_custom() const { + return value_.unsafeGetTensorImpl()->sym_storage_offset(); +} namespace functionalization { namespace impl { @@ -367,14 +371,6 @@ c10::optional to_functional_tensor(const c10::optional& tensor) } return c10::nullopt; } -c10::List to_functional_tensor(const c10::List& t_list) { - c10::List outputs; - outputs.reserve(t_list.size()); - for (const auto i : c10::irange(t_list.size())) { - outputs.push_back(to_functional_tensor(t_list[i])); - } - return outputs; -} c10::List> to_functional_tensor(const c10::List>& t_list) { c10::List> outputs; outputs.reserve(t_list.size()); @@ -383,17 +379,11 @@ c10::List> to_functional_tensor(const c10::List to_functional_tensor(const std::vector& t_list) { - std::vector outputs(t_list.size()); - for (const auto i : c10::irange(t_list.size())) { - outputs[i] = to_functional_tensor(t_list[i]); - } - return outputs; -} -std::vector to_functional_tensor(const TensorList& t_list) { - std::vector outputs(t_list.size()); - for (const auto i : c10::irange(t_list.size())) { - outputs[i] = to_functional_tensor(t_list[i]); +std::vector to_functional_tensor(ITensorListRef t_list) { + std::vector outputs; + outputs.reserve(t_list.size()); + for (const auto& tensor : t_list) { + outputs.push_back(to_functional_tensor(tensor)); } return outputs; } @@ -419,17 +409,17 @@ c10::optional from_functional_tensor(const c10::optional& t, boo } return c10::nullopt; } 
-c10::List from_functional_tensor(const c10::List& t_list) { - c10::List outputs; +std::vector from_functional_tensor(ITensorListRef t_list) { + std::vector outputs; outputs.reserve(t_list.size()); - for (const auto i : c10::irange(t_list.size())) { + for (const auto& tensor : t_list) { // from_functional_tensor(Tensor) has asserts to make sure you don't accidentally call // it on a non-functional input, // but from_functional_tensor(TensorList) can recieve a list containing both // functional and non-functional tensors. // Example of when that can happen: torch.cat(function_input_tensor, global_state_tensor). // When that happens, we're okay with only unwrapping the functional tensors. - outputs.push_back(from_functional_tensor(t_list[i], /*assert_functional=*/false)); + outputs.push_back(from_functional_tensor(tensor, /*assert_functional=*/false)); } return outputs; } @@ -441,13 +431,6 @@ c10::List> from_functional_tensor(const c10::List from_functional_tensor(const TensorList& t_list) { - std::vector outputs(t_list.size()); - for (const auto i : c10::irange(t_list.size())) { - outputs[i] = from_functional_tensor(t_list[i], /*assert_functional=*/false); - } - return outputs; -} void sync(const Tensor& t) { if (t.unsafeGetTensorImpl()->is_wrapped_number()) { @@ -471,13 +454,8 @@ void sync(const c10::optional& t) { sync(*t); } } -void sync(const c10::List t_list) { - for (const auto i : c10::irange(t_list.size())) { - sync(t_list[i]); - } -} -void sync(const at::TensorList t_list) { - for (auto t: t_list) { +void sync(ITensorListRef t_list) { + for (const auto& t : t_list) { sync(t); } } @@ -492,22 +470,24 @@ void replace_(const Tensor& functional_tensor, const Tensor& other) { unsafeGetFunctionalWrapper(functional_tensor)->replace_(other); } -void replace_(const TensorList functional_tensor, TensorList other) { +void replace_(const ITensorListRef functional_tensor, ITensorListRef other) { TORCH_INTERNAL_ASSERT_DEBUG_ONLY(functional_tensor.size() == other.size()); + auto functional_tensor_it = functional_tensor.begin(); + auto other_it = other.begin(); for (const auto i : c10::irange(functional_tensor.size())) { - replace_(functional_tensor[i], other[i]); + (void)i; // Suppress unused variable warning + replace_(*functional_tensor_it++, *other_it++); } } - void commit_update(const Tensor& functional_tensor) { TORCH_INTERNAL_ASSERT_DEBUG_ONLY(isFunctionalTensor(functional_tensor)); unsafeGetFunctionalWrapper(functional_tensor)->commit_update(); } -void commit_update(const TensorList functional_tensor) { - for (const auto i : c10::irange(functional_tensor.size())) { - commit_update(functional_tensor[i]); +void commit_update(ITensorListRef functional_tensor) { + for (const auto& t : functional_tensor) { + commit_update(t); } } @@ -523,21 +503,6 @@ bool isFunctionalTensor(const c10::optional& t) { } } -// For lists that have a mix of functional and nonfunctional tensors, -// functionalization machinery should just unwrap the functional wrappers -// and leave the ordinary tensors alone. 
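Much of this hunk swaps the c10::List/TensorList overloads for a single ITensorListRef parameter, which binds to several concrete list types and is iterated uniformly, as the loops above do. A small sketch (function name ours; include paths assumed):

    #include <ATen/ATen.h>
    #include <ATen/core/IListRef.h>

    // Count the defined tensors in any list kind that binds to ITensorListRef
    // (e.g. std::vector<at::Tensor> or at::TensorList).
    int64_t count_defined_example(at::ITensorListRef tensors) {
      int64_t n = 0;
      for (const auto& t : tensors) {
        if (t.defined()) {
          ++n;
        }
      }
      return n;
    }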
-bool isFunctionalTensor(const c10::List& t_list) { - if (t_list.size() == 0) return false; - auto functional_count = 0; - for (const auto i : c10::irange(t_list.size())) { - if (!t_list[i].defined()) continue; - if (isFunctionalTensor(t_list[i])) { - ++functional_count; - } - } - return functional_count > 0; -} - bool isFunctionalTensor(const c10::List>& t_list) { if (t_list.size() == 0) return false; auto functional_count = 0; @@ -550,18 +515,29 @@ bool isFunctionalTensor(const c10::List>& t_list) { return functional_count > 0; } -bool isFunctionalTensor(const c10::ArrayRef t_list) { - if (t_list.size() == 0) return false; +template +bool isFunctionalTensorIListRef(c10::IListRef list) { + if (list.size() == 0) return false; auto functional_count = 0; - for (const auto i : c10::irange(t_list.size())) { - if (!t_list[i].defined()) continue; - if (isFunctionalTensor(t_list[i])) { + for (const auto& tensor : list) { + if (!tensor.defined()) continue; + if (isFunctionalTensor(tensor)) { ++functional_count; } } return functional_count > 0; } +bool isFunctionalTensor(ITensorListRef list) { + return isFunctionalTensorIListRef(list); +} + +void freeze_functional_tensor(const Tensor& tensor) { + TORCH_INTERNAL_ASSERT(at::functionalization::impl::isFunctionalTensor(tensor)); + auto functional_base_impl = at::functionalization::impl::unsafeGetFunctionalWrapper(tensor); + functional_base_impl->freeze_storage(); +} + Tensor create_functional_tensor_with_view_meta(const at::Tensor& view_to_wrap, const at::Tensor& base, functionalization::ViewMeta meta, int64_t out_idx) { TORCH_INTERNAL_ASSERT(!at::functionalization::impl::isFunctionalTensor(view_to_wrap)); TORCH_INTERNAL_ASSERT(at::functionalization::impl::isFunctionalTensor(base)); @@ -575,18 +551,12 @@ Tensor create_functional_tensor_with_view_meta(const at::Tensor& view_to_wrap, c return at::detail::make_tensor(view_to_wrap, functional_base_impl, meta); } -std::vector create_functional_tensor_with_view_meta(const c10::List& view_to_wrap, const at::Tensor& base, functionalization::ViewMeta meta) { - std::vector outputs(view_to_wrap.size()); - for (const auto i : c10::irange(view_to_wrap.size())) { - outputs[i] = create_functional_tensor_with_view_meta(view_to_wrap[i], base, meta, i); - } - return outputs; -} - -std::vector create_functional_tensor_with_view_meta(const std::vector& view_to_wrap, const at::Tensor& base, functionalization::ViewMeta meta) { +std::vector create_functional_tensor_with_view_meta(ITensorListRef view_to_wrap, const at::Tensor& base, functionalization::ViewMeta meta) { std::vector outputs(view_to_wrap.size()); - for (const auto i : c10::irange(view_to_wrap.size())) { - outputs[i] = create_functional_tensor_with_view_meta(view_to_wrap[i], base, meta, i); + int64_t i = 0; + for (const auto& tensor : view_to_wrap) { + outputs[i] = create_functional_tensor_with_view_meta(tensor, base, meta, i); + i++; } return outputs; } @@ -602,8 +572,7 @@ void mutate_view_meta(const at::Tensor& self, functionalization::ViewMeta meta) // calls each {view} reference implementations with meta tensors. // The output meta tensor's stride info serves as a reference for what the correct strides should be. 
void set_sizes_strides_offset(const Tensor& out, const Tensor& reference_out) { - out.unsafeGetTensorImpl()->set_sizes_and_strides(reference_out.sizes(), reference_out.strides()); - out.unsafeGetTensorImpl()->set_storage_offset(reference_out.storage_offset()); + out.unsafeGetTensorImpl()->set_sizes_and_strides(reference_out.sym_sizes(), reference_out.sym_strides(), reference_out.sym_storage_offset()); } void set_sizes_strides_offset(const std::vector& outs, const std::vector& reference_outs) { diff --git a/aten/src/ATen/FunctionalTensorWrapper.h b/aten/src/ATen/FunctionalTensorWrapper.h index c5c0339fc1bf..0762fb1f7f9b 100644 --- a/aten/src/ATen/FunctionalTensorWrapper.h +++ b/aten/src/ATen/FunctionalTensorWrapper.h @@ -3,6 +3,7 @@ #include #include +#include #include #include #include @@ -99,6 +100,8 @@ struct TORCH_API FunctionalTensorWrapper : public c10::TensorImpl { // used to determine if it's up-to-date with its alias. The act of syncing a // tensor will set a tensor's generation equal to its alias's generation. bool is_up_to_date() const; + // Freezes the storage of this tensor, preventing subsequent mutations + void freeze_storage() const; // Every FunctionalTensorWrapper contains a vector objects // describing the series of view ops that ran to generate the current tensor // from the base tensor. This method is used by inplace-view ops like @@ -134,15 +137,18 @@ struct TORCH_API FunctionalTensorWrapper : public c10::TensorImpl { ~FunctionalTensorWrapper() override = default; // FunctionalTensorWrapper overrides all custom size/stride function, - // so that if the inner tensor has a custo implementation + // so that if the inner tensor has a custom implementation // we make sure to call that implementation. at::IntArrayRef sizes_custom() const override; at::IntArrayRef strides_custom() const override; int64_t dim_custom() const override; int64_t numel_custom() const override; bool is_contiguous_custom(at::MemoryFormat memory_format) const override; - c10::SymIntArrayRef sym_sizes() const override; c10::SymIntArrayRef sym_sizes_custom() const override; + c10::SymInt sym_size_custom(int64_t d) const override; + c10::SymIntArrayRef sym_strides_custom() const override; + c10::SymInt sym_storage_offset_custom() const override; + c10::Device device_custom() const override; private: const char* tensorimpl_type_name() const override; @@ -183,44 +189,40 @@ TORCH_API inline FunctionalTensorWrapper* unsafeGetFunctionalWrapper( TORCH_API bool isFunctionalTensor(const at::Tensor& tensor); TORCH_API bool isFunctionalTensor(const c10::optional& t); -TORCH_API bool isFunctionalTensor(const c10::List& t_list); TORCH_API bool isFunctionalTensor( const c10::List>& t_list); -TORCH_API bool isFunctionalTensor(const c10::ArrayRef t_list); +TORCH_API bool isFunctionalTensor(ITensorListRef list); TORCH_API Tensor to_functional_tensor(const Tensor& tensor); TORCH_API c10::optional to_functional_tensor( const c10::optional& tensor); -TORCH_API c10::List to_functional_tensor( - const c10::List& t_list); TORCH_API c10::List> to_functional_tensor( const c10::List>& t_list); -TORCH_API std::vector to_functional_tensor( - const std::vector& t_list); -TORCH_API std::vector to_functional_tensor(const TensorList& t_list); +TORCH_API std::vector to_functional_tensor(ITensorListRef t_list); + +TORCH_API void freeze_functional_tensor(const Tensor& tensor); TORCH_API Tensor from_functional_tensor(const Tensor& tensor, bool assert_functional = true); TORCH_API c10::optional from_functional_tensor( const c10::optional& 
t, bool assert_functional = true); -TORCH_API c10::List from_functional_tensor( - const c10::List& t_list); TORCH_API c10::List> from_functional_tensor( const c10::List>& t_list); -TORCH_API std::vector from_functional_tensor(const TensorList& tensors); +TORCH_API std::vector from_functional_tensor(ITensorListRef t_list); TORCH_API void sync(const at::Tensor& t); TORCH_API void sync(const c10::optional& t); -TORCH_API void sync(const c10::List t_list); -TORCH_API void sync(const at::TensorList t_list); TORCH_API void sync(const c10::List> t_list); +TORCH_API void sync(ITensorListRef t_list); TORCH_API void replace_(const Tensor& functional_tensor, const Tensor& other); -TORCH_API void replace_(const TensorList functional_tensor, TensorList other); +TORCH_API void replace_( + const ITensorListRef functional_tensor, + ITensorListRef other); TORCH_API void commit_update(const Tensor& functional_tensor); -TORCH_API void commit_update(const TensorList functional_tensor); +TORCH_API void commit_update(ITensorListRef functional_tensor); Tensor create_functional_tensor_with_view_meta( const Tensor& view_to_wrap, @@ -228,11 +230,7 @@ Tensor create_functional_tensor_with_view_meta( functionalization::ViewMeta meta, int64_t out_idx = 0); std::vector create_functional_tensor_with_view_meta( - const c10::List& view_to_wrap, - const Tensor& base, - functionalization::ViewMeta meta); -std::vector create_functional_tensor_with_view_meta( - const std::vector& view_to_wrap, + ITensorListRef view_to_wrap, const Tensor& base, functionalization::ViewMeta meta); @@ -280,18 +278,21 @@ TORCH_API void functionalize_op_helper( const c10::OperatorHandle& op, torch::jit::Stack* stack); -template +template struct _functionalize_aten_op final {}; -template -struct _functionalize_aten_op final { - static ReturnType call(ParameterTypes... args) { +template +struct _functionalize_aten_op final { + static ReturnType call( + typename c10::maybe_keep_symint::type... args) { + using FuncType = ReturnType( + typename c10::maybe_keep_symint::type...); auto op = c10::Dispatcher::singleton() .findSchemaOrThrow( (const char*)Op::name, (const char*)Op::overload_name) - .typed(); + .typed(); - return c10::impl::BoxedKernelWrapper::call( + return c10::impl::BoxedKernelWrapper::call( c10::BoxedKernel::makeFromFunction(), op, // BoxedKernelWrapper knows to ignore this keyset argument, @@ -302,7 +303,12 @@ struct _functionalize_aten_op final { }; template -using functionalize_aten_op = _functionalize_aten_op; +using functionalize_aten_op = + _functionalize_aten_op; + +template +using functionalize_aten_op_symint = + _functionalize_aten_op; } // namespace functionalization } // namespace at diff --git a/aten/src/ATen/FunctionalizeFallbackKernel.cpp b/aten/src/ATen/FunctionalizeFallbackKernel.cpp index 25c81165f883..3b7d3361133b 100644 --- a/aten/src/ATen/FunctionalizeFallbackKernel.cpp +++ b/aten/src/ATen/FunctionalizeFallbackKernel.cpp @@ -256,35 +256,35 @@ at::Tensor _to_copy_functionalize( // The idea with _unsafe_view is that you're guaranteed that the input // is a temporary, and don't actually have to worry about propagating // mutations between the input and output. 
-at::Tensor _unsafe_view_functionalize(const at::Tensor & self, at::IntArrayRef size) { +at::Tensor _unsafe_view_functionalize(const at::Tensor & self, at::SymIntArrayRef size) { if (!at::functionalization::impl::isFunctionalTensor(self)) { at::AutoDispatchSkipFunctionalize guard; - return at::_unsafe_view(self, size); + return at::_unsafe_view_symint(self, size); } auto self_ = at::functionalization::impl::from_functional_tensor(self); at::Tensor tmp_output; { at::AutoDispatchSkipFunctionalize guard; - tmp_output = at::_unsafe_view(self_, size); + tmp_output = at::_unsafe_view_symint(self_, size); } at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta( [size = size.vec()](const at::Tensor & base, int64_t mutated_view_idx) -> at::Tensor { - return at::_unsafe_view(base, size); + return at::_unsafe_view_symint(base, size); }, [size = size.vec()](const at::Tensor & base, const at::Tensor & mutated_view, int64_t mutated_view_idx) -> at::Tensor { - return at::_unsafe_view(mutated_view, base.sizes()); + return at::_unsafe_view_symint(mutated_view, base.sym_sizes()); } ); auto out = at::functionalization::impl::create_functional_tensor_with_view_meta(tmp_output, self, view_meta); // See Note [Propagating strides in the functionalization pass] // (for _unsafe_view, I'm just manually doing the shape inference rule here instead of calling the meta function for unsafe_view) - auto inferred_size = at::infer_size_dv(size, self.numel()); - auto stride = at::detail::computeStride(self.sizes(), self.strides(), inferred_size); + auto inferred_size = at::infer_size_dv(size, self.sym_numel()); + auto stride = at::detail::computeStride(self.sym_sizes(), self.sym_strides(), inferred_size); TORCH_INTERNAL_ASSERT(stride.has_value()); - out.unsafeGetTensorImpl()->set_sizes_and_strides(size, stride.value()); + out.unsafeGetTensorImpl()->set_sizes_and_strides(inferred_size, stride.value()); return out; } diff --git a/aten/src/ATen/InferSize.h b/aten/src/ATen/InferSize.h index e0bedb751bf2..111c7eb8f5fc 100644 --- a/aten/src/ATen/InferSize.h +++ b/aten/src/ATen/InferSize.h @@ -2,6 +2,8 @@ #include #include +#include +#include #include #include #include @@ -14,9 +16,13 @@ namespace at { // templated to handle std::vector and DimVector use cases, see // below // -template -inline void infer_size_impl(IntArrayRef shape, int64_t numel, ResultVec& res) { - int64_t newsize = 1; +template +inline void infer_size_impl( + InputArrayRef shape, + NumelType numel, + ResultVec& res) { + NumelType newsize = 1; + // N.B. this is an index, not a sym dim! 
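The infer_size_impl generalisation in the InferSize.h hunk above keeps the usual "-1 dimension" rule: at most one entry of the requested shape may be -1, and it is inferred so that the product of the dims matches numel. A tiny example using the existing non-symbolic entry point (values are ours):

    #include <ATen/InferSize.h>

    void infer_size_dv_example() {
      const int64_t shape[] = {2, -1, 4};
      // 2 * 4 = 8 known elements, so the -1 is inferred as 24 / 8 = 3.
      auto dims = at::infer_size_dv(c10::IntArrayRef(shape), /*numel=*/24);  // -> {2, 3, 4}
      (void)dims;
    }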
auto infer_dim = c10::optional(); for (int64_t dim = 0, ndim = shape.size(); dim != ndim; dim++) { if (shape[dim] == -1) { @@ -69,4 +75,13 @@ inline at::DimVector infer_size_dv(IntArrayRef shape, int64_t numel) { return res; } +inline at::SymDimVector infer_size_dv( + c10::SymIntArrayRef shape, + c10::SymInt numel) { + auto res = at::SymDimVector(shape); + infer_size_impl( + shape, std::move(numel), res); + return res; +} + } // namespace at diff --git a/aten/src/ATen/NamedTensorUtils.cpp b/aten/src/ATen/NamedTensorUtils.cpp index ca38f7be31bd..13d5ddb902de 100644 --- a/aten/src/ATen/NamedTensorUtils.cpp +++ b/aten/src/ATen/NamedTensorUtils.cpp @@ -234,7 +234,7 @@ std::vector compute_squeeze_outnames(const Tensor& tensor) { std::vector outnames; auto tensor_names = tensor.names(); for (const auto d : c10::irange(tensor.dim())) { - if (tensor.sizes()[d] != 1) { + if (tensor.sym_sizes()[d] != 1) { outnames.push_back(tensor_names[d]); } } @@ -410,12 +410,12 @@ std::vector broadcast_to_outnames( return unify_from_right(reference_names, tensor_names); } -std::vector compute_cat_outnames(ITensorListRef tensors) { +std::vector compute_cat_outnames(const MaterializedITensorListRef& tensors) { if (!at::has_names(tensors)) { return {}; } std::vector result; - for (const auto& tensor : tensors) { + for (const Tensor& tensor : tensors) { const auto tensor_names = tensor.names(); TORCH_CHECK(tensor_names.size() > 0, "zero-dimensional tensor cannot be concatenated"); TORCH_CHECK(result.empty() || tensor_names.size() == result.size(), diff --git a/aten/src/ATen/NamedTensorUtils.h b/aten/src/ATen/NamedTensorUtils.h index a77f38501f53..c9ff27c2d1b2 100644 --- a/aten/src/ATen/NamedTensorUtils.h +++ b/aten/src/ATen/NamedTensorUtils.h @@ -118,7 +118,8 @@ TORCH_API void propagate_names_for_expand( const Tensor& result, const Tensor& self); -TORCH_API std::vector compute_cat_outnames(ITensorListRef tensors); +TORCH_API std::vector compute_cat_outnames( + const MaterializedITensorListRef& tensors); TORCH_API std::vector compute_broadcast_outnames( const Tensor& self, diff --git a/aten/src/ATen/NestedTensorImpl.cpp b/aten/src/ATen/NestedTensorImpl.cpp index 122c6f10a7d6..4ed527cfd486 100644 --- a/aten/src/ATen/NestedTensorImpl.cpp +++ b/aten/src/ATen/NestedTensorImpl.cpp @@ -4,8 +4,70 @@ #include #include #include +#include #include +#include +#include +#include +#include + +namespace { +inline void validate_nested_tensor_metadata( + const at::Tensor& nested_sizes, + const at::Tensor& nested_strides, + const std::vector& offsets) { + TORCH_INTERNAL_ASSERT(nested_sizes.is_contiguous()); + int64_t size_dim = nested_sizes.dim(); + TORCH_INTERNAL_ASSERT(size_dim == 0 || size_dim == 2); + TORCH_INTERNAL_ASSERT(nested_strides.is_contiguous()); + TORCH_INTERNAL_ASSERT(nested_strides.dim() == size_dim); + TORCH_INTERNAL_ASSERT(nested_sizes.sizes() == nested_strides.sizes()); + TORCH_INTERNAL_ASSERT( + (size_dim == 0 && (int64_t)offsets.empty()) || + (size_dim == 2 && nested_sizes.size(0) == (int64_t)offsets.size())); +} + +/** + * Generates a nested key_set from a non-nested tensor. 
+ * + * When creating a nested tensor from a non-nested tensor + * We want to maintain the same keyset as the buffer but + * swap non nested keys for nested ones + * + * @return Appropriate key set for nested tensor + */ +inline c10::DispatchKeySet generate_nested_key_set_from_buffer( + const at::Tensor& buffer) { + auto nested_key_set = buffer.key_set(); + const bool has_autograd = nested_key_set.has_any(c10::autograd_dispatch_keyset); + // Remove non_nested tensor specific keys + nested_key_set = nested_key_set - + c10::DispatchKeySet{c10::DispatchKey::Dense, c10::DispatchKey::Autograd}; + + // Add nested tensor specific keys + nested_key_set = + nested_key_set | c10::DispatchKeySet{c10::DispatchKey::NestedTensor}; + nested_key_set = + has_autograd ? nested_key_set | c10::autograd_nested : nested_key_set; + return nested_key_set; +} + +/** + * Generates a the correct view keyset. + * + * When creating a nested tensor view of base + * The appropriate keyset will be dependent on the nested + * status of the base + * + * @return Appropriate key set for nested tensor + */ +c10::DispatchKeySet get_view_key_set(const at::Tensor& base) { + return base.is_nested() ? base.key_set() + : generate_nested_key_set_from_buffer(base); +} + +} // namespace namespace at { namespace native { @@ -67,14 +129,22 @@ inline at::Tensor construct_nested_stride_tensor(const at::Tensor& sizes) { return strides; } -// assume contiguous, we can construct offsets from size +/** + * Create a vector of offsets assuming the nested tensor is contiguous + * + * This function iterates over the implicit ntensor outer dimension + * populating a vector with the num_elements in each implicit tensor. + * The first element is always 0 and the length of the returned vector + * is n_tensor. + * + * @return A vector of offsets + */ inline std::vector construct_offsets(const at::Tensor& sizes) { // empty `sizes` means empty nested tensor, so return empty strides if (sizes.dim() == 0) { return std::vector(); } - int64_t ntensors = sizes.size(0), - orig_dim = sizes.size(1); + int64_t ntensors = sizes.size(0), orig_dim = sizes.size(1); std::vector offsets(ntensors); // nesting scalars has easy offsets if (orig_dim == 0) { @@ -83,44 +153,27 @@ inline std::vector construct_offsets(const at::Tensor& sizes) { } const int64_t* sizes_ptr = sizes.data_ptr(); offsets[0] = 0; - for (int64_t i = 0; i < ntensors - 1; i++) { - int64_t row_product = sizes_ptr[0]; - for (int64_t j = 1; j < orig_dim; j++) { - row_product *= sizes_ptr[j]; - } + for (const auto i : c10::irange(ntensors - 1)) { + const int64_t row_product = std::accumulate(sizes_ptr, sizes_ptr + orig_dim, 1, std::multiplies()); offsets[i + 1] = offsets[i] + row_product; sizes_ptr += orig_dim; } return offsets; } -// [Note: Nested Tensor Autograd] The Nested Tensor key is a functionality -// key and therefore getAutogradRelatedKeySetFromBackend will return the -// wrong autograd key. 
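A worked example of the contiguous-offset construction documented above: for a nested tensor holding constituents of sizes {2, 3} and {4, 5}, the first starts at element 0 and the second starts right after the 2*3 = 6 elements of the first, so the offsets are {0, 6}. The sketch below mirrors that loop with plain std::vector inputs (names ours):

    #include <cstdint>
    #include <functional>
    #include <numeric>
    #include <vector>

    std::vector<int64_t> construct_offsets_example(
        const std::vector<std::vector<int64_t>>& nested_sizes) {
      std::vector<int64_t> offsets(nested_sizes.size(), 0);
      for (size_t i = 0; i + 1 < nested_sizes.size(); ++i) {
        // Number of elements in constituent i; the next constituent starts after it.
        const int64_t row_product = std::accumulate(
            nested_sizes[i].begin(), nested_sizes[i].end(),
            static_cast<int64_t>(1), std::multiplies<int64_t>());
        offsets[i + 1] = offsets[i] + row_product;
      }
      return offsets;
    }

    // construct_offsets_example({{2, 3}, {4, 5}}) -> {0, 6}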
For this specific impl we make sure to register the -// correct Autograd key which is AutogradNestedTensor -c10::DispatchKeySet generate_nested_key_set(at::Tensor buffer) { - c10::DispatchKeySet key_set = - c10::DispatchKeySet(DispatchKey::NestedTensor) | c10::DispatchKeySet{buffer.key_set().highestBackendKey()}; - - // Add AutogradNestedTensor specific keys - key_set = key_set | inplace_or_view_ks | autograd_nested; - return key_set; -} - NestedTensorImpl::NestedTensorImpl( - int64_t buffer_size, Storage storage, c10::DispatchKeySet key_set, const caffe2::TypeMeta data_type, at::Tensor nested_size_tensor, at::Tensor nested_stride_tensor, - std::vector offsets) + std::vector&& offsets) : TensorImpl(std::move(storage), key_set, data_type), - buffer_size_(buffer_size), nested_size_tensor_(std::move(nested_size_tensor)), nested_stride_tensor_(std::move(nested_stride_tensor)), - offsets_(std::move(offsets)), + storage_offsets_(std::move(offsets)), opt_sizes_(construct_opt_sizes(nested_size_tensor_)) { + C10_LOG_API_USAGE_ONCE("torch.NestedTensor"); TORCH_WARN_ONCE( "The PyTorch API of nested tensors is in prototype stage and will change " "in the near future."); @@ -129,34 +182,23 @@ NestedTensorImpl::NestedTensorImpl( storage_device.is_cpu() || storage_device.is_cuda(), "NestedTensorImpl storage must be either CUDA or CPU but got ", storage_device); - TORCH_INTERNAL_ASSERT(nested_size_tensor_.is_contiguous()); - int64_t size_dim = nested_size_tensor_.dim(); - TORCH_INTERNAL_ASSERT(size_dim == 0 || size_dim == 2); - TORCH_INTERNAL_ASSERT(nested_stride_tensor_.is_contiguous()); - TORCH_INTERNAL_ASSERT(nested_stride_tensor_.dim() == size_dim); - TORCH_INTERNAL_ASSERT( - nested_stride_tensor_.sizes() == nested_size_tensor_.sizes()); - TORCH_INTERNAL_ASSERT( - (size_dim == 0 && (int64_t)offsets_.empty()) || - (size_dim == 2 && - nested_size_tensor_.size(0) == (int64_t)offsets_.size())); + validate_nested_tensor_metadata(nested_size_tensor_, nested_stride_tensor_, storage_offsets_); refresh_dim(); - set_sizes_strides_policy(c10::TensorImpl::SizesStridesPolicy::CustomSizes); + set_custom_sizes_strides(c10::TensorImpl::SizesStridesPolicy::CustomSizes); } NestedTensorImpl::NestedTensorImpl( at::Tensor buffer, at::Tensor nested_size_tensor, at::Tensor nested_stride_tensor, - std::vector offsets) + std::vector&& offsets) : NestedTensorImpl( - buffer.sizes()[0], buffer.storage(), - generate_nested_key_set(buffer), + generate_nested_key_set_from_buffer(buffer), buffer.dtype(), nested_size_tensor, nested_stride_tensor, - offsets) { + std::move(offsets)) { TORCH_INTERNAL_ASSERT( buffer.dim() == 1, @@ -177,6 +219,22 @@ NestedTensorImpl::NestedTensorImpl( construct_offsets(nested_size_tensor)) {} +NestedTensorImpl::NestedTensorImpl( + c10::TensorImpl::ImplType impl_type, + const at::Tensor& base_tensor, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets) + : TensorImpl(impl_type, Storage(base_tensor.storage()), get_view_key_set(base_tensor), base_tensor.dtype()), + nested_size_tensor_(std::move(nested_size_tensor)), + nested_stride_tensor_(std::move(nested_stride_tensor)), + storage_offsets_(std::move(offsets)), + opt_sizes_(construct_opt_sizes(nested_size_tensor_)) { + validate_nested_tensor_metadata(nested_size_tensor_, nested_stride_tensor_, storage_offsets_); + refresh_dim(); + set_custom_sizes_strides(c10::TensorImpl::SizesStridesPolicy::CustomSizes); +} + void NestedTensorImpl::refresh_dim() { const auto my_dim = nested_size_tensor_.dim() ? 
nested_size_tensor_.sizes()[1] + 1 : 1; sizes_and_strides_.resize(my_dim); @@ -227,8 +285,8 @@ c10::SymIntArrayRef NestedTensorImpl::sym_sizes_custom() const { TORCH_CHECK(false, "Internal error: NestedTensorImpl doesn't support sizes. Please file an issue on https://github.com/pytorch/nestedtensor"); } -c10::SymIntArrayRef NestedTensorImpl::sym_sizes() const { - return sym_sizes_custom(); +c10::SymIntArrayRef NestedTensorImpl::sym_strides_custom() const { + TORCH_CHECK(false, "Internal error: NestedTensorImpl doesn't support strides. Please file an issue on https://github.com/pytorch/nestedtensor"); } IntArrayRef NestedTensorImpl::strides_custom() const { @@ -246,7 +304,7 @@ c10::intrusive_ptr NestedTensorImpl::shallow_copy_and_detach_core( bool allow_tensor_metadata_change) const { if (key_set_.has(DispatchKey::Python) && !c10::impl::tls_is_dispatch_key_excluded(DispatchKey::Python)) { - auto r = pyobj_interpreter_.load(std::memory_order_acquire)->detach(this); + auto r = (*pyobj_interpreter_.load(std::memory_order_acquire))->detach(this); if (r) { r->set_version_counter(std::forward(version_counter)); r->set_allow_tensor_metadata_change(allow_tensor_metadata_change); @@ -256,13 +314,12 @@ c10::intrusive_ptr NestedTensorImpl::shallow_copy_and_detach_core( // the interpreter is dead no one can call us out on it } auto impl = c10::make_intrusive( - buffer_size_, storage_, key_set_, data_type_, nested_size_tensor_, nested_stride_tensor_, - offsets_); + std::vector(storage_offsets_)); copy_tensor_metadata( /*src_impl=*/this, diff --git a/aten/src/ATen/NestedTensorImpl.h b/aten/src/ATen/NestedTensorImpl.h index d2e0381425f4..4790cda1cbcb 100644 --- a/aten/src/ATen/NestedTensorImpl.h +++ b/aten/src/ATen/NestedTensorImpl.h @@ -1,6 +1,8 @@ #pragma once #include #include +#include +#include #include #include #include @@ -10,26 +12,35 @@ namespace at { namespace native { +struct NestedTensorImpl; +inline bool nested_tensor_impl_is_contiguous(const NestedTensorImpl* nt); struct TORCH_API NestedTensorImpl : public c10::TensorImpl { explicit NestedTensorImpl( - int64_t buffer_size, Storage storage, c10::DispatchKeySet key_set, const caffe2::TypeMeta data_type, at::Tensor nested_size_tensor, at::Tensor nested_stride_tensor, - std::vector offsets); + std::vector&& offsets); explicit NestedTensorImpl( at::Tensor buffer, at::Tensor nested_size_tensor, at::Tensor nested_stride_tensor, - std::vector offsets); + std::vector&& offsets); // assume contiguous, `nested_stride_tensor` and `offsets` // can be infered from `nested_size_tensor` explicit NestedTensorImpl(at::Tensor buffer, at::Tensor nested_size_tensor); + // This constructor is used creating view tensors from nested tensors + explicit NestedTensorImpl( + c10::TensorImpl::ImplType impl_type, + const at::Tensor& base_tensor, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets); + // TODO: don't expose private implementation details like this; in // particular, resizing this tensor will mess up our dim() and // callers cannot fix it. @@ -40,8 +51,8 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { const Tensor& get_nested_stride_tensor() const { return nested_stride_tensor_; } - const std::vector& get_offsets() const { - return offsets_; + const std::vector& get_storage_offsets() const { + return storage_offsets_; } // Returns nullopt if the ith dimension is irregular. 
The ith dimension // of a NestedTensor is regular if the unbound tensors match in @@ -63,16 +74,41 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { " is irregular and does not have a size."); return *optional_size; } - + /** + * Return a view of the nested tensor as a 1 dimensional contiguous tensor. + * + * The buffer tensor created by this function shares the same storage_impl as + * the original nested tensor, and therefore can be seen as a view. + * + * @return A newly constructed view tensor + */ at::Tensor get_buffer() const { - auto buffer_key_set_ = c10::DispatchKeySet{c10::DispatchKey::Dense} | - c10::DispatchKeySet{this->key_set_.highestBackendKey()}; + TORCH_CHECK( + nested_tensor_impl_is_contiguous(this), + "NestedTensor must be contiguous to get buffer."); + return get_unsafe_storage_as_tensor(); + } + /** + * If possible use get_buffer() instead. This function returns the storage + * as a tensor directly, which is not safe to use in general. If using this + * function, The caller must ensure to account for nested_sizes, + * nested_strides and storage_offsets. + * + * @return A newly constructed view tensor + */ + at::Tensor get_unsafe_storage_as_tensor() const { + auto buffer_key_set_ = generate_buffer_key_set(); + const auto buffer_size = get_buffer_size(); auto buffer_tensor_impl = c10::make_intrusive( - Storage(storage_), buffer_key_set_, data_type_); - buffer_tensor_impl->set_sizes_contiguous(c10::makeArrayRef(buffer_size_)); + c10::TensorImpl::VIEW, Storage(storage_), buffer_key_set_, data_type_); + buffer_tensor_impl->set_sizes_contiguous(c10::makeArrayRef(buffer_size)); return Tensor(buffer_tensor_impl); } + int64_t get_buffer_size() const { + return storage_.nbytes() / data_type_.itemsize(); + } + protected: const char* tensorimpl_type_name() const override; @@ -89,8 +125,8 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { } IntArrayRef sizes_custom() const override; c10::SymIntArrayRef sym_sizes_custom() const override; - c10::SymIntArrayRef sym_sizes() const override; IntArrayRef strides_custom() const override; + c10::SymIntArrayRef sym_strides_custom() const override; // this one is real int64_t dim_custom() const override; @@ -115,10 +151,7 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { // Must be called after any changes to our dim() to sync the state // to TensorImpl. void refresh_dim(); - // Store the size of the buffer for use in get_buffer(). - // get_buffer constructs a flat, contiguous tensor from the NestedTensor - // storage - int64_t buffer_size_; + const at::Tensor nested_size_tensor_, nested_stride_tensor_; // The starting positions of the underlying tensors in contiguous buffer // i.e. the buffer memory offsets to get the underlying tensors @@ -132,7 +165,7 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { // Some strong enough constraints are: // 1. every underlying tensor is contiguous in memory // && nesting in ascending order - std::vector offsets_; + std::vector storage_offsets_; // NOTE: -1 here means the size is missing // TODO: maybe we can remove this metadata since // we can compute it from `nested_size_tensor_` @@ -142,6 +175,34 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl { c10::intrusive_ptr shallow_copy_and_detach_core( VariableVersion&& version_counter, bool allow_tensor_metadata_change) const; + + /** + * Generates a non-nested key_set from a nested tensor. 
+ * + * For many nested tensor kernel implementations a buffer tensor + * is generated and redispatched to a non-nested kernel this function + * generates the key set used by that buffer tensor + * + * @return Appropriate key set for non-nested tensor + */ + inline c10::DispatchKeySet generate_buffer_key_set() const { + auto buffer_key_set = this->key_set(); + const bool Autograd = buffer_key_set.has_any(c10::autograd_dispatch_keyset); + // Remove nested tensor specific keys + buffer_key_set = buffer_key_set - + c10::DispatchKeySet{ + c10::DispatchKey::NestedTensor, + c10::DispatchKey::AutogradNestedTensor}; + + // Add dense tensor specific keys + buffer_key_set = + buffer_key_set | c10::DispatchKeySet{c10::DispatchKey::Dense}; + buffer_key_set = Autograd + ? c10::DispatchKeySet{c10::DispatchKey::Autograd} | buffer_key_set + : buffer_key_set; + + return buffer_key_set; + } }; inline NestedTensorImpl* get_nested_tensor_impl_or_null( @@ -165,7 +226,7 @@ inline bool nested_tensor_impl_is_contiguous(const NestedTensorImpl* nt) { } const Tensor &sizemat = nt->get_nested_size_tensor(), &stridemat = nt->get_nested_stride_tensor(); - const auto& offsets = nt->get_offsets(); + const auto& offsets = nt->get_storage_offsets(); int64_t orig_dim = sizemat.size(1); // nesting scalars if (orig_dim == 0) { diff --git a/aten/src/ATen/NumericUtils.h b/aten/src/ATen/NumericUtils.h index 816cc4e8a44b..241addbc5c28 100644 --- a/aten/src/ATen/NumericUtils.h +++ b/aten/src/ATen/NumericUtils.h @@ -7,8 +7,9 @@ #include #include #include +#include + #include -#include #include namespace at { diff --git a/aten/src/ATen/OpaqueTensorImpl.h b/aten/src/ATen/OpaqueTensorImpl.h index c87fcab77bd2..e6c6413815bb 100644 --- a/aten/src/ATen/OpaqueTensorImpl.h +++ b/aten/src/ATen/OpaqueTensorImpl.h @@ -30,7 +30,7 @@ struct TORCH_API OpaqueTensorImpl : public TensorImpl { : TensorImpl(key_set, data_type, device), opaque_handle_(std::move(opaque_handle)) { set_storage_access_should_throw(); - set_sizes_strides_policy(SizesStridesPolicy::CustomStrides); + set_custom_sizes_strides(SizesStridesPolicy::CustomStrides); sizes_and_strides_.set_sizes(sizes); refresh_numel(); is_non_overlapping_and_dense_ = is_non_overlapping_and_dense; @@ -77,7 +77,7 @@ struct TORCH_API OpaqueTensorImpl : public TensorImpl { dtype(), device(), opaque_handle_, - asIntArrayRefSlow(sizes_and_strides_.sizes_arrayref())); + sizes_and_strides_.sizes_arrayref()); copy_tensor_metadata( /*src_opaque_impl=*/this, /*dest_opaque_impl=*/impl.get(), @@ -101,7 +101,7 @@ struct TORCH_API OpaqueTensorImpl : public TensorImpl { dtype(), device(), opaque_handle_, - asIntArrayRefSlow(sizes_and_strides_.sizes_arrayref())); + sizes_and_strides_.sizes_arrayref()); copy_tensor_metadata( /*src_opaque_impl=*/this, /*dest_opaque_impl=*/impl.get(), diff --git a/aten/src/ATen/PadNd.h b/aten/src/ATen/PadNd.h new file mode 100644 index 000000000000..573d1a7b88ab --- /dev/null +++ b/aten/src/ATen/PadNd.h @@ -0,0 +1,28 @@ +#pragma once +#include +#include + +namespace at { + +enum class padding_mode { + reflect, + replicate, + circular, + constant, +}; + +static inline c10::string_view padding_mode_string(padding_mode m) { + switch (m) { + case padding_mode::reflect: + return "reflect"; + case padding_mode::replicate: + return "replicate"; + case padding_mode::circular: + return "circular"; + case padding_mode::constant: + return "constant"; + } + TORCH_CHECK(false, "Invalid padding mode (", static_cast(m), ")"); +} + +} // namespace at diff --git a/aten/src/ATen/Parallel.h 
b/aten/src/ATen/Parallel.h index 6c99fcd422cb..4693997624e9 100644 --- a/aten/src/ATen/Parallel.h +++ b/aten/src/ATen/Parallel.h @@ -2,6 +2,7 @@ #include #include #include +#include namespace at { diff --git a/aten/src/ATen/PythonTorchFunctionTLS.cpp b/aten/src/ATen/PythonTorchFunctionTLS.cpp index ae9f722de60a..c9487c6958cb 100644 --- a/aten/src/ATen/PythonTorchFunctionTLS.cpp +++ b/aten/src/ATen/PythonTorchFunctionTLS.cpp @@ -6,16 +6,24 @@ namespace impl { static thread_local PythonTorchFunctionTLS pythonTorchFunctionState; -void PythonTorchFunctionTLS::set_mode(std::shared_ptr mode) { - pythonTorchFunctionState.mode_ = std::move(mode); +void PythonTorchFunctionTLS::push_onto_stack(std::shared_ptr mode) { + pythonTorchFunctionState.stack_.push_back(std::move(mode)); } -const std::shared_ptr& PythonTorchFunctionTLS::get_mode() { - return pythonTorchFunctionState.mode_; +const std::shared_ptr PythonTorchFunctionTLS::pop_stack() { + TORCH_CHECK(pythonTorchFunctionState.stack_.size() > 0, "trying to pop from empty mode stack"); + const auto out = pythonTorchFunctionState.stack_.back(); + pythonTorchFunctionState.stack_.pop_back(); + return out; } -void PythonTorchFunctionTLS::swap_mode(std::shared_ptr& mode) { - pythonTorchFunctionState.mode_.swap(mode); +const std::shared_ptr& PythonTorchFunctionTLS::get_stack_at(int64_t idx) { + TORCH_CHECK(idx < static_cast(pythonTorchFunctionState.stack_.size()), "Tried to get stack at idx that's too big"); + return pythonTorchFunctionState.stack_[idx]; +} + +int64_t PythonTorchFunctionTLS::stack_len() { + return pythonTorchFunctionState.stack_.size(); } void PythonTorchFunctionTLS::set_disabled(bool disabled) { @@ -34,5 +42,9 @@ const PythonTorchFunctionTLS& PythonTorchFunctionTLS::get_state() { return pythonTorchFunctionState; } +bool torch_function_mode_enabled() { + return PythonTorchFunctionTLS::stack_len() > 0; +} + } // namespace impl } // namespace at diff --git a/aten/src/ATen/PythonTorchFunctionTLS.h b/aten/src/ATen/PythonTorchFunctionTLS.h index 003dcef1e90f..5940fb6f2dee 100644 --- a/aten/src/ATen/PythonTorchFunctionTLS.h +++ b/aten/src/ATen/PythonTorchFunctionTLS.h @@ -10,17 +10,25 @@ struct TORCH_API PythonTorchFunctionTLS { static void set_disabled(bool); static bool is_disabled(); - static void set_mode(std::shared_ptr); - static const std::shared_ptr& get_mode(); - static void swap_mode(std::shared_ptr&); + static void push_onto_stack(std::shared_ptr mode); + static const std::shared_ptr pop_stack(); + static const std::shared_ptr& get_stack_at(int64_t idx); + static int64_t stack_len(); - static void set_state(const PythonTorchFunctionTLS& state); static const PythonTorchFunctionTLS& get_state(); + static void set_state(const PythonTorchFunctionTLS& state); private: + // The mode TLS is split into + // - disabled_, which says whether or not to disable all torch function + // modes + // - stack_, which is a vector of modes representing the stack of user + // defined modes bool disabled_; - std::shared_ptr mode_; + std::vector> stack_; }; +TORCH_API bool torch_function_mode_enabled(); + } // namespace impl } // namespace at diff --git a/aten/src/ATen/SavedTensorHooks.cpp b/aten/src/ATen/SavedTensorHooks.cpp index aff6ddd1b06e..6b3b63f5987e 100644 --- a/aten/src/ATen/SavedTensorHooks.cpp +++ b/aten/src/ATen/SavedTensorHooks.cpp @@ -5,46 +5,78 @@ namespace at { namespace { - // PyObject is defined in c10/util/python_stub.h - thread_local std::stack> stack; + thread_local impl::SavedTensorDefaultHooksTLS tls; // This flag is set to true the 
first time default hooks are registered // and left at true for the rest of the execution. // It's an optimization so that users who never use default hooks don't need to // read the thread_local variables pack_hook_ and unpack_hook_. - static bool is_enabled(false); + static bool is_initialized(false); +} + +static void assertSavedTensorHooksNotDisabled() { + TORCH_CHECK(SavedTensorDefaultHooks::is_enabled(), tls.disabled_error_message.value()); +} + +bool SavedTensorDefaultHooks::is_enabled() { + // See NOTE: [disabled_error_message invariant] + return !tls.disabled_error_message.has_value(); +} + +void SavedTensorDefaultHooks::disable(const std::string& message) { + tls.disabled_error_message = message; + if (tls.stack.size() > 0) { + assertSavedTensorHooksNotDisabled(); + } } void SavedTensorDefaultHooks::enable() { - is_enabled = true; + tls.disabled_error_message = c10::nullopt; +} + +const c10::optional& SavedTensorDefaultHooks::get_disabled_error_message() { + return tls.disabled_error_message; +} + +const impl::SavedTensorDefaultHooksTLS& SavedTensorDefaultHooks::get_tls_state() { + return tls; +} + +void SavedTensorDefaultHooks::set_tls_state(const impl::SavedTensorDefaultHooksTLS& state) { + tls = state; +} + +void SavedTensorDefaultHooks::lazy_initialize() { + is_initialized = true; } void SavedTensorDefaultHooks::push_hooks(PyObject* pack_hook, PyObject* unpack_hook) { // Reference counting is handled by the caller of `push_hooks` - TORCH_INTERNAL_ASSERT(is_enabled); + TORCH_INTERNAL_ASSERT(is_initialized); TORCH_INTERNAL_ASSERT(pack_hook != nullptr && unpack_hook != nullptr); - stack.push(std::make_pair(pack_hook, unpack_hook)); + assertSavedTensorHooksNotDisabled(); + tls.stack.push(std::make_pair(pack_hook, unpack_hook)); } void SavedTensorDefaultHooks::pop_hooks() { // Reference counting is handled by the caller of `pop_hooks` - TORCH_INTERNAL_ASSERT(is_enabled && !stack.empty()); - stack.pop(); + TORCH_INTERNAL_ASSERT(is_initialized && !tls.stack.empty()); + tls.stack.pop(); } std::pair SavedTensorDefaultHooks::get_hooks() { - if (!is_enabled || stack.empty()) { + if (!is_initialized || tls.stack.empty()) { return std::make_pair(nullptr, nullptr); } - return stack.top(); + return tls.stack.top(); } std::stack> SavedTensorDefaultHooks::get_stack() { - return stack; + return tls.stack; } void SavedTensorDefaultHooks::set_stack(std::stack> stack_) { - stack = stack_; + tls.stack = stack_; } } diff --git a/aten/src/ATen/SavedTensorHooks.h b/aten/src/ATen/SavedTensorHooks.h index 0cdfa3c9ecc3..af821cb908c6 100644 --- a/aten/src/ATen/SavedTensorHooks.h +++ b/aten/src/ATen/SavedTensorHooks.h @@ -1,20 +1,52 @@ #pragma once #include +#include #include #include +#include #include namespace at { +namespace impl { + +struct TORCH_API SavedTensorDefaultHooksTLS { + // PyObject is defined in c10/util/python_stub.h + std::stack> stack; + + // See NOTE: [Disabling SavedTensorDefaultHooks] for context + // NOTE: [disabled_error_message invariant] + // disabled_error_message is nullopt IFF Saved Tensor hooks is enabled + // We did this for efficiency (so we didn't have to keep a separate bool + // around) + c10::optional disabled_error_message; +}; + +} // namespace impl + struct TORCH_API SavedTensorDefaultHooks { static void push_hooks(PyObject* pack_hook, PyObject* unpack_hook); static void pop_hooks(); static std::pair get_hooks(); - static void enable(); + static void lazy_initialize(); static std::stack> get_stack(); static void set_stack(std::stack>); + + static const 
impl::SavedTensorDefaultHooksTLS& get_tls_state(); + static void set_tls_state(const impl::SavedTensorDefaultHooksTLS& tls); + + // NOTE: [Disabling SavedTensorDefaultHooks] + // A developer of a PyTorch feature may choose to disable SavedTensorDefault + // hooks, especially if their feature does not work with it. If they are + // disabled, then the following will raise an error: + // - Attempting to push_hooks + // - calling disable(message) with a non-zero stack (from get_stack) size + static void disable(const std::string& error_message); + static void enable(); + static bool is_enabled(); + static const c10::optional& get_disabled_error_message(); }; } // namespace at diff --git a/aten/src/ATen/SparseCsrTensorImpl.cpp b/aten/src/ATen/SparseCsrTensorImpl.cpp index dab45065fa71..f07bd176d6c4 100644 --- a/aten/src/ATen/SparseCsrTensorImpl.cpp +++ b/aten/src/ATen/SparseCsrTensorImpl.cpp @@ -8,22 +8,10 @@ #include namespace at { -namespace { -DeviceType SparseCsrTensorSetToDeviceType(DispatchKeySet key_set) { - if (key_set.has(DispatchKey::SparseCsrCPU)) { - return kCPU; - } else if (key_set.has(DispatchKey::SparseCsrCUDA)) { - return kCUDA; - } else { - TORCH_CHECK(false, - "Cannot construct SparseCsrTensor with non-sparse tensor type ID ", - key_set); - } -} -} // namespace SparseCsrTensorImpl::SparseCsrTensorImpl( at::DispatchKeySet key_set, + at::Device device, at::Layout layout, const caffe2::TypeMeta data_type) : SparseCsrTensorImpl( @@ -32,19 +20,19 @@ SparseCsrTensorImpl::SparseCsrTensorImpl( at::empty( {0}, at::initialTensorOptions() - .device(SparseCsrTensorSetToDeviceType(key_set)) + .device(device) .dtype(ScalarType::Int)) // crow_indices , at::empty( {0}, at::initialTensorOptions() - .device(SparseCsrTensorSetToDeviceType(key_set)) + .device(device) .dtype(ScalarType::Int)) // col_indices , at::empty( {0}, at::initialTensorOptions() - .device(SparseCsrTensorSetToDeviceType(key_set)) + .device(device) .dtype(data_type)) // values , layout @@ -66,15 +54,24 @@ SparseCsrTensorImpl::SparseCsrTensorImpl( TORCH_WARN_ONCE("Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensor support is in beta state. " "If you miss a functionality in the sparse tensor support, please submit a feature request " "to https://github.com/pytorch/pytorch/issues."); + + TORCH_INTERNAL_ASSERT(((key_set.has(DispatchKey::SparseCsrCPU) && device().type() == kCPU) + || (key_set.has(DispatchKey::SparseCsrCUDA) && device().type() == kCUDA)), + "Inconsistent key_set (=", key_set, ") and device (=", device(), ")"); + set_storage_access_should_throw(); is_non_overlapping_and_dense_ = false; - set_sizes_strides_policy(SizesStridesPolicy::CustomStrides); + set_custom_sizes_strides(SizesStridesPolicy::CustomStrides); // TODO: If this check ever shows up as a bottleneck, which is unlikely given that // comparing devices only involves comparing the type and index (two integers), we // can move this to a DEBUG only assert. Until then this confirms and maintains a // crucial invariance. 
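// Illustrative standalone sketch (not part of this patch) of the
// NOTE: [disabled_error_message invariant] referenced above: a single
// optional<string> doubles as the "hooks enabled" flag and the error
// message, so no separate bool is kept. std::optional stands in for
// c10::optional; all names below are invented for illustration only.
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

namespace sketch {

struct HooksTLS {
  // nullopt IFF saved-tensor hooks are enabled.
  std::optional<std::string> disabled_error_message;
};

thread_local HooksTLS tls;

bool is_enabled() { return !tls.disabled_error_message.has_value(); }

void disable(std::string message) {
  tls.disabled_error_message = std::move(message);
}

void enable() { tls.disabled_error_message = std::nullopt; }

void push_hook_or_throw() {
  if (!is_enabled()) {
    // Mirrors the role of assertSavedTensorHooksNotDisabled():
    // surface the stored reason instead of a generic failure.
    throw std::runtime_error(*tls.disabled_error_message);
  }
  // ... push the pack/unpack hook pair onto the per-thread stack here ...
}

} // namespace sketch

int main() {
  assert(sketch::is_enabled());
  sketch::disable("saved-tensor hooks are not supported under feature X");
  assert(!sketch::is_enabled());
  sketch::enable();
  sketch::push_hook_or_throw(); // succeeds again once re-enabled
  return 0;
}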
- TORCH_CHECK(values_.device() == crow_indices_.device(), "Values and crow_indices need to be on the same device."); - TORCH_CHECK(values_.device() == col_indices_.device(), "Values and col_indices need to be on the same device."); + TORCH_CHECK(values_.device() == crow_indices_.device(), "Values and ", + at::sparse_csr::compressedIndicesName(layout_), " need to be on the same device."); + TORCH_CHECK(values_.device() == col_indices_.device(), "Values and ", + at::sparse_csr::plainIndicesName(layout_), " need to be on the same device."); + TORCH_INTERNAL_ASSERT(values_.device() == device(), + "Values and compressed sparse tensor instance need to have the same device."); } const char* SparseCsrTensorImpl::tensorimpl_type_name() const { @@ -104,23 +101,83 @@ void SparseCsrTensorImpl::resize_(int64_t nnz, IntArrayRef size) { sizes_and_strides_.set_sizes(size); } -void SparseCsrTensorImpl::resize_as_sparse_csr_tensor_(const Tensor& src) { +void SparseCsrTensorImpl::resize_and_clear_(int64_t sparse_dim, IntArrayRef size) { TORCH_CHECK( !has_symbolic_sizes_strides_, - "resize_as_sparse_csr_tensor_ called on tensor with symbolic shape") - set_layout(src.layout()); - crow_indices_ = at::empty_like( - src.crow_indices(), - src.crow_indices().options(), - src.crow_indices().suggest_memory_format()); - col_indices_ = at::empty_like( - src.col_indices(), - src.col_indices().options(), - src.col_indices().suggest_memory_format()); - values_ = at::empty_like( - src.values(), - src.values().options(), - src.values().suggest_memory_format()); + "resize_and_clear_ called on tensor with symbolic shape"); + TORCH_CHECK(sparse_dim >= 2, "resize_and_clear_ sparse dimensionality must be at least 2, got ", sparse_dim); + TORCH_CHECK(static_cast(size.size()) >= sparse_dim, "resize_and_clear_ size length must be at least sparse dimensionality (=", + sparse_dim, "), got ", size.size()); + auto batch_dim = sparse_dim - 2; + auto batchsize = size.slice(0, batch_dim); + auto densesize = size.slice(batch_dim + 2, size.size() - batch_dim - 2); + + auto values_size = DimVector(batchsize); + values_size.push_back(0); // nse + values_size.append(densesize.begin(), densesize.end()); + + auto col_indices_size = DimVector(batchsize); + col_indices_size.push_back(0); // nse + + auto n_compressed_indices = AT_DISPATCH_ROW_SPARSE_COMPRESSED_LAYOUTS(layout_, "resize_and_clear_", + [&] () -> int64_t { return size[batch_dim]; }, + [&] () -> int64_t { return size[batch_dim + 1]; } + ); + AT_DISPATCH_PLAIN_SPARSE_COMPRESSED_LAYOUTS(layout_, + "resize_and_clear_", + [] () {}, + [&] () { + auto blocksize = this->values_.sizes().slice(this->batch_dim() + 1, 2); + values_size.append(blocksize.begin(), blocksize.end()); + n_compressed_indices /= blocksize[(the_layout == kSparseBsr ? 
0 : 1)]; + }); + auto crow_indices_size = DimVector(batchsize); + crow_indices_size.push_back(n_compressed_indices + 1); + + crow_indices_.resize_(crow_indices_size); + crow_indices_.zero_(); + col_indices_.resize_(col_indices_size); + values_.resize_(values_size); + sizes_and_strides_.set_sizes(size); + refresh_numel(); +} + +void SparseCsrTensorImpl::resize_as_sparse_compressed_tensor_( + const Tensor& src) { + TORCH_CHECK( + !has_symbolic_sizes_strides_, + "resize_as_sparse_compressed_tensor_ called on tensor with symbolic shape"); + + // We cannot resize as other layout and preserve the invariants for self + // layout + TORCH_CHECK( + src.layout() == layout_, + "resize_as_sparse_compressed_tensor_: self and src must have the same layout, but got: self (", + layout_, + ") and source (", + src.layout(), + ")"); + + Tensor compressed_indices; + Tensor plain_indices; + std::tie(compressed_indices, plain_indices) = + sparse_csr::getCompressedPlainIndices(src); + // reuse self indices storage + if (crow_indices_.sizes() != compressed_indices.sizes()) { + crow_indices_.resize_as_(compressed_indices); + } + if (col_indices_.sizes() != plain_indices.sizes()) { + col_indices_.resize_as_(plain_indices); + } + // Update indices data to ensure result is valid under invariants check + if ((sizes() != src.sizes()) || (dense_dim() != src.dense_dim())) { + crow_indices_.copy_(compressed_indices); + col_indices_.copy_(plain_indices); + } + // Reuse values storage + if (values_.sizes() != src.values().sizes()) { + values_.resize_as_(src.values()); + } sizes_and_strides_.set_sizes(src.sizes()); refresh_numel(); } @@ -132,7 +189,7 @@ void SparseCsrTensorImpl::set_member_tensors( IntArrayRef size) { TORCH_CHECK( !has_symbolic_sizes_strides_, - "set_member_tensors called on tensor with symbolic shape") + "set_member_tensors called on tensor with symbolic shape"); // CSR Type Invariants TORCH_CHECK( @@ -142,7 +199,6 @@ void SparseCsrTensorImpl::set_member_tensors( ") must match dtype of sparse tensor (", typeMetaToScalarType(dtype()), ")"); - crow_indices_ = crow_indices; col_indices_ = col_indices; values_ = values; @@ -153,13 +209,20 @@ void SparseCsrTensorImpl::set_member_tensors( // comparing devices only involves comparing the type and index (two integers), we // can move this to a DEBUG only assert. Until then this confirms and maintains a // crucial invariance. 
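// Illustrative standalone sketch (not part of this patch) of the shape
// bookkeeping that resize_and_clear_ above performs for a plain (non-block)
// CSR layout: a size [B..., nrows, ncols, D...] is split into batch, sparse
// and dense parts, and the empty index/value shapes are derived from it
// (in the patch the number of batch dims is computed as sparse_dim - 2).
// Plain std::vector stands in for DimVector; names are invented.
#include <cassert>
#include <cstdint>
#include <vector>

using Shape = std::vector<int64_t>;

struct CsrShapes {
  Shape compressed_indices; // batchsize + {nrows + 1}
  Shape plain_indices;      // batchsize + {0}  (nse starts at zero)
  Shape values;             // batchsize + {0} + densesize
};

CsrShapes empty_csr_shapes(const Shape& size, int64_t batch_dim) {
  assert(batch_dim >= 0 && static_cast<int64_t>(size.size()) >= batch_dim + 2);
  const Shape batchsize(size.begin(), size.begin() + batch_dim);
  const Shape densesize(size.begin() + batch_dim + 2, size.end());
  const int64_t n_compressed = size[batch_dim]; // rows for a CSR layout

  CsrShapes out;
  out.compressed_indices = batchsize;
  out.compressed_indices.push_back(n_compressed + 1);
  out.plain_indices = batchsize;
  out.plain_indices.push_back(0);
  out.values = batchsize;
  out.values.push_back(0);
  out.values.insert(out.values.end(), densesize.begin(), densesize.end());
  return out;
}

int main() {
  // A batched CSR tensor of shape [2, 4, 5] with one batch dimension:
  auto s = empty_csr_shapes({2, 4, 5}, /*batch_dim=*/1);
  assert((s.compressed_indices == Shape{2, 5})); // 4 rows -> 5 row pointers
  assert((s.plain_indices == Shape{2, 0}));
  assert((s.values == Shape{2, 0}));
  return 0;
}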
- TORCH_CHECK(values_.device() == crow_indices_.device(), "Values and crow_indices need to be on the same device."); - TORCH_CHECK(values_.device() == col_indices_.device(), "Values and col_indices need to be on the same device."); + TORCH_CHECK(values_.device() == crow_indices_.device(), "Values and ", + at::sparse_csr::compressedIndicesName(layout_), " need to be on the same device."); + TORCH_CHECK(values_.device() == col_indices_.device(), "Values and ", + at::sparse_csr::plainIndicesName(layout_), " need to be on the same device."); + TORCH_CHECK(values_.device() == device(), + "Values and compressed tensor instance need to be on the same device."); } IntArrayRef SparseCsrTensorImpl::strides_custom() const { TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have strides"); } +SymIntArrayRef SparseCsrTensorImpl::sym_strides_custom() const { + TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have strides"); +} void SparseCsrTensorImpl::set_size(int64_t dim, int64_t new_size) { TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have set_size."); } @@ -169,5 +232,8 @@ void SparseCsrTensorImpl::set_stride(int64_t dim, int64_t new_stride) { void SparseCsrTensorImpl::set_storage_offset(int64_t storage_offset) { TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have set_storage_offset."); } +bool SparseCsrTensorImpl::is_contiguous_custom(MemoryFormat) const { + TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have is_contiguous"); +} } // namespace at diff --git a/aten/src/ATen/SparseCsrTensorImpl.h b/aten/src/ATen/SparseCsrTensorImpl.h index 878c465962b8..12ef1de24ff7 100644 --- a/aten/src/ATen/SparseCsrTensorImpl.h +++ b/aten/src/ATen/SparseCsrTensorImpl.h @@ -3,7 +3,6 @@ #include #include #include - namespace at { // Struct implementing a sparse CSR tensor. 
It uses three 1-D tensors for @@ -33,11 +32,13 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl { public: explicit SparseCsrTensorImpl( at::DispatchKeySet, + at::Device device, Layout layout, const caffe2::TypeMeta); void resize_(int64_t nnz, IntArrayRef size); - void resize_as_sparse_csr_tensor_(const Tensor& src); + void resize_and_clear_(int64_t sparse_dim, IntArrayRef size); + void resize_as_sparse_compressed_tensor_(const Tensor& src); void set_member_tensors( const Tensor& crow_indices, const Tensor& col_indices, @@ -76,6 +77,8 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl { protected: IntArrayRef strides_custom() const override; + SymIntArrayRef sym_strides_custom() const override; + bool is_contiguous_custom(MemoryFormat) const override; public: void set_size(int64_t dim, int64_t new_size) override; @@ -107,7 +110,7 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl { const c10::VariableVersion& version_counter, bool allow_tensor_metadata_change) const override { auto impl = c10::make_intrusive( - key_set(), layout_impl(), dtype()); + key_set(), device(), layout_impl(), dtype()); copy_tensor_metadata( /*src_impl=*/this, /*dest_impl=*/impl.get(), @@ -127,7 +130,7 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl { c10::VariableVersion&& version_counter, bool allow_tensor_metadata_change) const override { auto impl = c10::make_intrusive( - key_set(), layout_impl(), dtype()); + key_set(), device(), layout_impl(), dtype()); copy_tensor_metadata( /*src_impl=*/this, /*dest_impl=*/impl.get(), diff --git a/aten/src/ATen/SparseCsrTensorUtils.h b/aten/src/ATen/SparseCsrTensorUtils.h index 24b5ae47df7d..e76d2707c6f4 100644 --- a/aten/src/ATen/SparseCsrTensorUtils.h +++ b/aten/src/ATen/SparseCsrTensorUtils.h @@ -127,6 +127,22 @@ namespace sparse_csr { using SparseCsrTensor = Tensor; +inline bool is_sparse_compressed(const Layout& layout) { + switch (layout) { + case kSparseCsr: + case kSparseCsc: + case kSparseBsr: + case kSparseBsc: + return true; + default:; + } + return false; +} + +inline bool is_sparse_compressed(const Tensor& self) { + return is_sparse_compressed(self.layout()); +} + inline SparseCsrTensorImpl* get_sparse_csr_impl(const SparseCsrTensor& self) { AT_DISPATCH_ALL_SPARSE_COMPRESSED_LAYOUTS( self.layout(), "get_sparse_csr_impl", [&] {}); @@ -235,5 +251,41 @@ inline int plainDimension( return size.size() - dense_ndim - (isCompressedRow(layout) ? 
1 : 2); } +inline int64_t numBatchDimensions(Tensor const& self) { + return AT_DISPATCH_ROW_SPARSE_COMPRESSED_LAYOUTS( + self.layout(), + "numBatchDimensions", + [&self] { return self.crow_indices().dim() - 1; }, + [&self] { return self.ccol_indices().dim() - 1; }); +} + +inline std::pair getCompressedPlainIndices(Tensor const& self) { + return AT_DISPATCH_ROW_SPARSE_COMPRESSED_LAYOUTS( + self.layout(), + "getCompressedPlainIndices", + [&self] { + return std::make_pair(self.crow_indices(), self.col_indices()); + }, + [&self] { + return std::make_pair(self.ccol_indices(), self.row_indices()); + }); +} + +inline Layout flip_compressed_layout(Layout layout) { + switch (layout) { + case kSparseCsr: + return kSparseCsc; + case kSparseCsc: + return kSparseCsr; + case kSparseBsr: + return kSparseBsc; + case kSparseBsc: + return kSparseBsr; + default: + TORCH_CHECK(false, "Not a sparse compressed layout:", layout); + return kSparseCsr; + } +} + } // namespace sparse_csr } // namespace at diff --git a/aten/src/ATen/SparseTensorImpl.cpp b/aten/src/ATen/SparseTensorImpl.cpp index 03999da97312..36c93b706db8 100644 --- a/aten/src/ATen/SparseTensorImpl.cpp +++ b/aten/src/ATen/SparseTensorImpl.cpp @@ -46,7 +46,7 @@ SparseTensorImpl::SparseTensorImpl(at::DispatchKeySet key_set, const caffe2::Typ is_non_overlapping_and_dense_ = false; set_storage_access_should_throw(); - set_sizes_strides_policy(SizesStridesPolicy::CustomStrides); + set_custom_sizes_strides(SizesStridesPolicy::CustomStrides); } // Destructor doesn't call release_resources because it's @@ -89,16 +89,16 @@ void SparseTensorImpl::set_indices_and_values_unsafe(const Tensor& indices, cons TORCH_CHECK(indices.options().backend() == values.options().backend(), "backend of indices (", indices.options().backend(), ") must match backend of values (", values.options().backend(), ")"); TORCH_CHECK(!indices.is_cuda() || indices.get_device() == values.get_device(), "device of indices (", indices.get_device(), ") must match device of values (", values.get_device(), ")"); - TORCH_CHECK(indices.dim() == 2, "indices must be sparse_dim x nnz, but got: ", indices.sizes()); - TORCH_CHECK(indices.size(1) == values.size(0), "indices and values must have same nnz, but got nnz from indices: ", indices.size(1), ", nnz from values: ", values.size(0)); - TORCH_CHECK(indices.size(0) == sparse_dim_, "indices has incorrect first dimension, expected ", sparse_dim_, ", got ", indices.size(0)); + TORCH_CHECK(indices.dim() == 2, "indices must be sparse_dim x nnz, but got: ", indices.sym_sizes()); + TORCH_CHECK(indices.sym_size(1) == values.sym_size(0), "indices and values must have same nnz, but got nnz from indices: ", indices.sym_size(1), ", nnz from values: ", values.sym_size(0)); + TORCH_CHECK(indices.sym_size(0) == sparse_dim_, "indices has incorrect first dimension, expected ", sparse_dim_, ", got ", indices.sym_size(0)); TORCH_CHECK(values.dim() == dense_dim_ + 1, "values has incorrect number of dimensions, expected ", dense_dim_ + 1, ", got ", values.dim()); - auto dense_size_original = sizes().slice(sparse_dim_); - std::vector expected_values_size_vec = {values.size(0)}; + auto dense_size_original = sym_sizes().slice(sparse_dim_); + std::vector expected_values_size_vec = {values.sym_size(0)}; expected_values_size_vec.insert(expected_values_size_vec.end(), dense_size_original.begin(), dense_size_original.end()); - IntArrayRef expected_values_size(expected_values_size_vec); - auto new_values_size = values.sizes(); + SymIntArrayRef 
expected_values_size(expected_values_size_vec); + auto new_values_size = values.sym_sizes(); TORCH_CHECK( std::equal(expected_values_size.begin(), expected_values_size.end(), new_values_size.begin()), "values has incorrect size, expected ", expected_values_size, ", got ", new_values_size @@ -109,7 +109,7 @@ void SparseTensorImpl::set_indices_and_values_unsafe(const Tensor& indices, cons AT_ASSERT(device() == values_.device()); AT_ASSERT(values_.device() == indices_.device()); - coalesced_ = false; + coalesced_ = sym_nnz() < 2; } diff --git a/aten/src/ATen/SparseTensorImpl.h b/aten/src/ATen/SparseTensorImpl.h index 9bbe3b86bc09..d90734100ca6 100644 --- a/aten/src/ATen/SparseTensorImpl.h +++ b/aten/src/ATen/SparseTensorImpl.h @@ -9,6 +9,7 @@ #include #else #include +#include #endif namespace at { @@ -51,6 +52,10 @@ struct TORCH_API SparseTensorImpl : public TensorImpl { int64_t nnz() const { return values_.size(0); } + + c10::SymInt sym_nnz() const { + return values_.sym_size(0); + } int64_t sparse_dim() const { return sparse_dim_; } @@ -85,7 +90,7 @@ struct TORCH_API SparseTensorImpl : public TensorImpl { TORCH_CHECK( !has_symbolic_sizes_strides_, "raw_resize_ called on tensor with symbolic shape") - sizes_and_strides_.set_sizes(size); + set_sizes_and_strides(size, std::vector(size.size())); sparse_dim_ = sparse_dim; dense_dim_ = dense_dim; refresh_numel(); @@ -116,7 +121,8 @@ struct TORCH_API SparseTensorImpl : public TensorImpl { // 4. When we attempt to shrink the size of any of the sparse dimensions on a // non-empty sparse tensor (this could make some of the stored indices // out-of-bound and thus unsafe). - void resize_(int64_t sparse_dim, int64_t dense_dim, IntArrayRef size) { + template + void _resize_(int64_t sparse_dim, int64_t dense_dim, ArrayRef size) { TORCH_CHECK( allow_tensor_metadata_change(), "resize_ ", @@ -160,7 +166,7 @@ struct TORCH_API SparseTensorImpl : public TensorImpl { bool shrinking_sparse_dims = false; bool shrinking_dense_dim = false; - auto sparse_size_original = sizes().slice(0, sparse_dim); + auto sparse_size_original = generic_sizes().slice(0, sparse_dim); auto sparse_size_new = size.slice(0, sparse_dim); for (const auto i : c10::irange(sparse_dim)) { if (sparse_size_new[i] < sparse_size_original[i]) { @@ -168,7 +174,7 @@ struct TORCH_API SparseTensorImpl : public TensorImpl { break; } } - auto dense_size_original = sizes().slice(sparse_dim); + auto dense_size_original = generic_sizes().slice(sparse_dim); auto dense_size_new = size.slice(sparse_dim); for (const auto i : c10::irange(dense_dim)) { if (dense_size_new[i] < dense_size_original[i]) { @@ -196,8 +202,7 @@ struct TORCH_API SparseTensorImpl : public TensorImpl { alt_options_msg); } - IntArrayRef sizes_and_strides = - asIntArrayRefSlow(sizes_and_strides_.sizes_arrayref()); + auto sizes_and_strides = generic_sizes(); const bool size_equals_sizes = std::equal( size.begin(), size.end(), @@ -205,23 +210,34 @@ struct TORCH_API SparseTensorImpl : public TensorImpl { sizes_and_strides.end()); if ((!size_equals_sizes) || (sparse_dim != sparse_dim_) || (dense_dim != dense_dim_)) { - auto nnz = values().size(0); - std::vector values_size = {nnz}; + auto nnz = at::symint::sizes(values())[0]; + std::vector values_size = {nnz}; auto dense_size = size.slice(sparse_dim); values_size.insert( values_size.end(), dense_size.begin(), dense_size.end()); - values_.resize_(values_size); - indices_.resize_({sparse_dim, nnz}); + at::symint::resize_(values_, values_size); + at::symint::resize_(indices_, {T(sparse_dim), nnz}); 
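// Illustrative standalone sketch (not part of this patch) of why the
// SparseTensorImpl resize above becomes a template _resize_<T>: the same
// shape arithmetic builds the new `values` and `indices` sizes whether T is
// a concrete int64_t or a symbolic integer type. Shown here with int64_t
// only; the struct and function names are invented for illustration.
#include <cassert>
#include <cstdint>
#include <vector>

template <typename T>
struct SparseResizePlan {
  std::vector<T> values_size;  // {nnz, dense_size...}
  std::vector<T> indices_size; // {sparse_dim, nnz}
};

template <typename T>
SparseResizePlan<T> plan_resize(const std::vector<T>& new_size,
                                int64_t sparse_dim,
                                T nnz) {
  SparseResizePlan<T> plan;
  plan.values_size.push_back(nnz);
  plan.values_size.insert(plan.values_size.end(),
                          new_size.begin() + sparse_dim, new_size.end());
  plan.indices_size = {T(sparse_dim), nnz};
  return plan;
}

int main() {
  // A COO tensor resized to shape [10, 10, 3] with sparse_dim = 2, nnz = 7:
  auto plan = plan_resize<int64_t>({10, 10, 3}, /*sparse_dim=*/2, /*nnz=*/7);
  assert((plan.values_size == std::vector<int64_t>{7, 3}));
  assert((plan.indices_size == std::vector<int64_t>{2, 7}));
  return 0;
}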
} if (!size_equals_sizes) { - sizes_and_strides_.set_sizes(size); + set_sizes_and_strides(size, std::vector(size.size())); } sparse_dim_ = sparse_dim; dense_dim_ = dense_dim; refresh_numel(); } + void resize_(int64_t sparse_dim, int64_t dense_dim, ArrayRef size) { + return _resize_(sparse_dim, dense_dim, size); + } + + void resize_( + int64_t sparse_dim, + int64_t dense_dim, + ArrayRef size) { + return _resize_(sparse_dim, dense_dim, size); + } + // NOTE: this function will resize the sparse tensor and also set `indices` // and `values` to empty. void resize_and_clear_( @@ -244,7 +260,7 @@ struct TORCH_API SparseTensorImpl : public TensorImpl { "), but got ", size.size()); - sizes_and_strides_.set_sizes(size); + set_sizes_and_strides(size, std::vector(size.size())); sparse_dim_ = sparse_dim; dense_dim_ = dense_dim; @@ -275,6 +291,9 @@ struct TORCH_API SparseTensorImpl : public TensorImpl { AT_ASSERT(new_nnz <= nnz()); indices_ = indices_.narrow(1, 0, new_nnz); values_ = values_.narrow(0, 0, new_nnz); + if (new_nnz < 2) { + coalesced_ = true; + } } // Takes indices and values and directly puts them into the sparse tensor, no diff --git a/aten/src/ATen/TensorGeometry.cpp b/aten/src/ATen/TensorGeometry.cpp index 164a7b279129..7dbc5973b7fe 100644 --- a/aten/src/ATen/TensorGeometry.cpp +++ b/aten/src/ATen/TensorGeometry.cpp @@ -6,10 +6,11 @@ namespace at { // See TensorGeometry.h on why this is useful now that we cache is_contiguous. -bool geometry_is_contiguous(IntArrayRef sizes, IntArrayRef strides) { +template +bool _geometry_is_contiguous(ArrayRef sizes, ArrayRef strides) { assert(!overflows(sizes.size())); auto dim = static_cast(sizes.size()); - int64_t expected_stride = 1; + T expected_stride = 1; bool contig_if_nonempty = true; for (int64_t i = dim - 1; i >= 0; i--) { if (sizes[i] == 0) { @@ -25,11 +26,15 @@ bool geometry_is_contiguous(IntArrayRef sizes, IntArrayRef strides) { return contig_if_nonempty; } +bool geometry_is_contiguous(IntArrayRef sizes, IntArrayRef strides) { + return _geometry_is_contiguous(sizes, strides); +} + bool TensorGeometry::is_contiguous() const { if (numel_ == 0) { return true; } - return at::geometry_is_contiguous(sizes_, strides_); + return at::_geometry_is_contiguous(sizes_, strides_); } } // namespace at diff --git a/aten/src/ATen/TensorGeometry.h b/aten/src/ATen/TensorGeometry.h index e89a666a8c56..110f2356c3a5 100644 --- a/aten/src/ATen/TensorGeometry.h +++ b/aten/src/ATen/TensorGeometry.h @@ -15,10 +15,14 @@ TORCH_API bool geometry_is_contiguous(IntArrayRef sizes, IntArrayRef strides); struct TORCH_API TensorGeometry { TensorGeometry() : storage_offset_(0) {} - explicit TensorGeometry(IntArrayRef sizes) - : sizes_(sizes.vec()), strides_(sizes.size()), storage_offset_(0) { + explicit TensorGeometry(c10::SymIntArrayRef sizes) + : sizes_(sizes.vec()), + strides_(sizes.size()), + storage_offset_(0), + has_symbolic_sizes_strides_( + !c10::asIntArrayRefSlowOpt(sizes).has_value()) { int64_t dim = sizes.size(); - int64_t expected_stride = 1; + c10::SymInt expected_stride = 1; for (int64_t i = dim - 1; i >= 0; i--) { strides_[i] = expected_stride; expected_stride *= sizes_[i]; @@ -27,10 +31,12 @@ struct TORCH_API TensorGeometry { } explicit TensorGeometry(const TensorBase& t) - : sizes_(t.sizes().vec()), - strides_(t.strides().vec()), - storage_offset_(t.storage_offset()), - numel_(t.numel()) {} + : sizes_(t.sym_sizes().vec()), + strides_(t.sym_strides().vec()), + storage_offset_(t.sym_storage_offset()), + numel_(t.sym_numel()), + has_symbolic_sizes_strides_( + 
t.unsafeGetTensorImpl()->has_symbolic_sizes_strides()) {} // true if the tensor is contiguous bool is_contiguous() const; @@ -38,24 +44,52 @@ struct TORCH_API TensorGeometry { int64_t dim() const { return sizes_.size(); } + int64_t size(int64_t dim) const { + TORCH_INTERNAL_ASSERT(!has_symbolic_sizes_strides_); dim = c10::maybe_wrap_dim(dim, this->dim()); - return sizes_.at(static_cast(dim)); + return sizes_.at(static_cast(dim)).as_int_unchecked(); } - IntArrayRef sizes() const { - return IntArrayRef{sizes_}; + c10::IntArrayRef sizes() const { + TORCH_INTERNAL_ASSERT(!has_symbolic_sizes_strides_); + return c10::asIntArrayRefUnchecked(sizes_); } int64_t stride(int64_t dim) const { + TORCH_INTERNAL_ASSERT(!has_symbolic_sizes_strides_); dim = c10::maybe_wrap_dim(dim, this->dim()); - return strides_.at(static_cast(dim)); + return strides_.at(static_cast(dim)).as_int_unchecked(); } - IntArrayRef strides() const { - return IntArrayRef{strides_}; + c10::IntArrayRef strides() const { + TORCH_INTERNAL_ASSERT(!has_symbolic_sizes_strides_); + return c10::asIntArrayRefUnchecked(strides_); } int64_t storage_offset() const { - return storage_offset_; + TORCH_INTERNAL_ASSERT(!has_symbolic_sizes_strides_); + return storage_offset_.as_int_unchecked(); } int64_t numel() const { + TORCH_INTERNAL_ASSERT(!has_symbolic_sizes_strides_); + return numel_.as_int_unchecked(); + } + + c10::SymInt sym_size(int64_t dim) const { + dim = c10::maybe_wrap_dim(dim, this->dim()); + return sizes_.at(static_cast(dim)); + } + c10::SymIntArrayRef sym_sizes() const { + return sizes_; + } + c10::SymInt sym_stride(int64_t dim) const { + dim = c10::maybe_wrap_dim(dim, this->dim()); + return strides_.at(static_cast(dim)); + } + c10::SymIntArrayRef sym_strides() const { + return strides_; + } + c10::SymInt sym_storage_offset() const { + return storage_offset_; + } + c10::SymInt sym_numel() const { return numel_; } @@ -80,10 +114,12 @@ struct TORCH_API TensorGeometry { return r; } - std::vector sizes_; - std::vector strides_; - int64_t storage_offset_; - int64_t numel_; + private: + std::vector sizes_; + std::vector strides_; + c10::SymInt storage_offset_; + c10::SymInt numel_; + bool has_symbolic_sizes_strides_; }; } // namespace at diff --git a/aten/src/ATen/TensorIndexing.h b/aten/src/ATen/TensorIndexing.h index f1eedfa83ef9..19333741a981 100644 --- a/aten/src/ATen/TensorIndexing.h +++ b/aten/src/ATen/TensorIndexing.h @@ -1,15 +1,22 @@ #pragma once #include -#include #include +#include #include +#include #include #include -// TODO: try to remove this -// There is some back story, see https://github.com/pytorch/pytorch/issues/48684 +#ifndef AT_PER_OPERATOR_HEADERS +#include #include +#else +#include +#include +#include +#include +#endif #include @@ -211,7 +218,7 @@ static inline Tensor applySlice( int64_t step, bool disable_slice_optimization, const at::Device& self_device, - const c10::optional& self_sizes) { + const c10::optional& self_sizes) { // TODO: implement negative step TORCH_CHECK_VALUE(step > 0, "step must be greater than zero"); @@ -220,10 +227,10 @@ static inline Tensor applySlice( // Skip this optimization if we are tracing, as the trace may be polymorphic // over the shape of the `self` tensor, and we still want to record // the slice. - int64_t length = (self_device == at::kCPU || self_device == at::kCUDA) + SymInt length = (self_device == at::kCPU || self_device == at::kCUDA) ? 
(*self_sizes)[dim] - : self.size(dim); - if (!disable_slice_optimization && start == 0 && stop == length && + : self.sym_size(dim); + if (!disable_slice_optimization && start == 0 && length == stop && step == 1) { return self; } @@ -237,7 +244,7 @@ static inline Tensor applySelect( int64_t index, int64_t real_dim, const at::Device& /*self_device*/, - const c10::optional& self_sizes) { + const c10::optional& self_sizes) { // See NOTE [nested tensor size for indexing] if (self_sizes.has_value()) { TORCH_CHECK_INDEX( @@ -245,9 +252,9 @@ static inline Tensor applySelect( "invalid index of a 0-dim tensor. ", "Use `tensor.item()` in Python or `tensor.item()` in C++ to convert a 0-dim tensor to a number"); - int64_t size = (*self_sizes)[dim]; + auto size = (*self_sizes)[dim]; TORCH_CHECK_INDEX( - index >= -size && index < size, + size >= -index && size > index, "index ", index, " is out of bounds for dimension ", @@ -383,7 +390,7 @@ static inline Tensor scalarToTensor( // To match numpy semantics: // As a special case for backwards compatibility, // strip away unit dimensions from the left of 'src' -static inline IntArrayRef slicePrefix1sSize(const IntArrayRef& sizes) { +static inline SymIntArrayRef slicePrefix1sSize(const SymIntArrayRef& sizes) { size_t first_non1_src = sizes.size(); for (const auto i : c10::irange(sizes.size())) { if (sizes[i] != 1) { @@ -396,7 +403,7 @@ static inline IntArrayRef slicePrefix1sSize(const IntArrayRef& sizes) { } static inline void copy_to(const Tensor& dst, const Tensor& src) { - if (dst.sizes().equals(src.sizes())) { + if (dst.sym_sizes().equals(src.sym_sizes())) { // A shortcut to avoid generating hard-coded constant sizes during tracing. // This is not a perfect solution: when src & dst have different shapes, // constants will still appear. Users can workaround that case by @@ -407,7 +414,7 @@ static inline void copy_to(const Tensor& dst, const Tensor& src) { dst.fill_(src); return; } - auto src_view = src.view(slicePrefix1sSize(src.sizes())); + auto src_view = src.view_symint(slicePrefix1sSize(src.sym_sizes())); c10::MaybeOwned b_src = expand_inplace(dst, src_view, "setitem"); dst.copy_(*b_src); } @@ -424,7 +431,7 @@ static inline Tensor handleDimInMultiDimIndexing( std::vector& outIndices, bool disable_slice_optimization, const at::Device& original_tensor_device, - const c10::optional& prev_dim_result_sizes) { + const c10::optional& prev_dim_result_sizes) { if (index.is_integer()) { return impl::applySelect( prev_dim_result, @@ -508,7 +515,7 @@ static inline Tensor applySlicing( std::vector& outIndices, bool disable_slice_optimization, const at::Device& self_device, - const c10::optional& self_sizes) { + const c10::optional& self_sizes) { int64_t dim = 0; int64_t specified_dims = impl::count_specified_dimensions(indices); @@ -524,9 +531,9 @@ static inline Tensor applySlicing( for (const auto i : c10::irange(indices.size())) { auto& obj = indices[i]; // See NOTE [nested tensor size for indexing] - c10::optional result_sizes = result.is_nested() - ? c10::optional(c10::nullopt) - : c10::optional(result.sizes()); + c10::optional result_sizes = result.is_nested() + ? 
c10::optional(c10::nullopt) + : c10::optional(result.sym_sizes()); result = handleDimInMultiDimIndexing( /*prev_dim_result=*/result, /*original_tensor=*/self, @@ -600,9 +607,9 @@ static inline Tensor get_item( // nested tensor does not have a size (yet) so for now we represent its size // as null may need to be changed after we reach a better solution for nested // tensor size - c10::optional self_sizes = self.is_nested() - ? c10::optional(c10::nullopt) - : c10::optional(self.sizes()); + c10::optional self_sizes = self.is_nested() + ? c10::optional(c10::nullopt) + : c10::optional(self.sym_sizes()); // handle simple types: integers, slices, none, ellipsis, bool if (indices.size() == 1) { @@ -663,7 +670,7 @@ static inline void set_item( const Tensor& value, bool disable_slice_optimization = false) { at::Device self_device = self.device(); - IntArrayRef self_sizes = self.sizes(); + SymIntArrayRef self_sizes = self.sym_sizes(); // handle simple types: integers, slices, ellipsis, bool if (indices.size() == 1) { @@ -713,11 +720,11 @@ static inline void set_item( return; } - IntArrayRef valueSizes = value.sizes(); - IntArrayRef slicedValueSizes = slicePrefix1sSize(valueSizes); + SymIntArrayRef valueSizes = value.sym_sizes(); + SymIntArrayRef slicedValueSizes = slicePrefix1sSize(valueSizes); Tensor valuesSliced; if (!valueSizes.equals(slicedValueSizes)) { - valuesSliced = value.view(slicedValueSizes); + valuesSliced = value.view_symint(slicedValueSizes); } else { valuesSliced = value; } diff --git a/aten/src/ATen/TensorIterator.cpp b/aten/src/ATen/TensorIterator.cpp index a4715a2caabb..7e86163f1ca4 100644 --- a/aten/src/ATen/TensorIterator.cpp +++ b/aten/src/ATen/TensorIterator.cpp @@ -431,7 +431,7 @@ void TensorIteratorBase::compute_types(const TensorIteratorConfig& config) { } // Computes a common dtype, if needed - if (has_different_input_dtypes && config.promote_inputs_to_common_dtype_) { + if ((has_different_input_dtypes || all_ops_are_scalars_) && config.promote_inputs_to_common_dtype_) { common_dtype_ = compute_common_dtype(); } @@ -1237,11 +1237,12 @@ void TensorIteratorBase::compute_shape(const TensorIteratorConfig& config) { shape_ = infer_size_dimvector(shape_, shape); } } + all_ops_are_scalars_ = !has_tensors; } void TensorIteratorBase::compute_strides(const TensorIteratorConfig& config) { for (auto& op : operands_) { - if (op.tensor_base().defined()) { + if (op.tensor_base().defined() && !op.will_resize) { IntArrayRef original_shape = config.static_shape_ ? shape_ : op.tensor_base().sizes(); auto original_stride = op.tensor_base().strides(); auto element_size_in_bytes = op.tensor_base().element_size(); @@ -1491,10 +1492,19 @@ void TensorIteratorBase::build(TensorIteratorConfig& config) { if (is_meta_) return; + auto has_storage = true; + for (auto& op : operands_) { + has_storage &= op.tensor_base().has_storage(); + } + auto privateuse1_without_storage = + common_device_.type() == DeviceType::PrivateUse1 && + !has_storage; + // XLA and lazy tensors don't have storage, so they don't have an underlying data pointer. // Nothing beyond this point is important for meta functions, so it's fine to exit early here. // Extend the condition to ORT tesnors as ORT tensors also don't have storage. 
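// Illustrative standalone sketch (not part of this patch) of the
// slicePrefix1sSize() behaviour used by set_item above: to match numpy
// semantics, unit dimensions are stripped from the *left* of the value's
// shape before it is broadcast into the destination. Plain int64_t vectors
// stand in for SymIntArrayRef; the function name is invented.
#include <cassert>
#include <cstdint>
#include <vector>

using Shape = std::vector<int64_t>;

Shape slice_prefix_ones(const Shape& sizes) {
  size_t first_non1 = sizes.size();
  for (size_t i = 0; i < sizes.size(); ++i) {
    if (sizes[i] != 1) {
      first_non1 = i;
      break;
    }
  }
  return Shape(sizes.begin() + first_non1, sizes.end());
}

int main() {
  assert((slice_prefix_ones({1, 1, 3, 4}) == Shape{3, 4}));
  assert((slice_prefix_ones({3, 1, 4}) == Shape{3, 1, 4})); // only leading 1s go
  assert((slice_prefix_ones({1, 1}) == Shape{}));           // all-ones collapses away
  return 0;
}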
- if (common_device_.type() == DeviceType::XLA || + if (privateuse1_without_storage || + common_device_.type() == DeviceType::XLA || common_device_.type() == DeviceType::IPU || common_device_.type() == DeviceType::Lazy || common_device_.type() == DeviceType::ORT || diff --git a/aten/src/ATen/TensorIterator.h b/aten/src/ATen/TensorIterator.h index fdf86cbba6af..31ae65466870 100644 --- a/aten/src/ATen/TensorIterator.h +++ b/aten/src/ATen/TensorIterator.h @@ -473,6 +473,10 @@ struct TORCH_API TensorIteratorBase : public impl::MetaBase { } bool has_contiguous_first_dim() const { + if (ndim() == 0) { + return true; + } + int num_tensors = ntensors(); for (const auto i : c10::irange(num_tensors)) { if (strides(i)[0] != element_size(i)) { @@ -655,9 +659,12 @@ struct TORCH_API TensorIteratorBase : public impl::MetaBase { /// in operands_). int num_outputs_ = 0; - /// Whether or not all operands have the same shape. Having all the same - /// shape affects whether or not the iterator is eligible for fast setup. + /// Whether or not all operands have the same shape and are 1d+. Having all + /// the same shape affects whether or not the iterator is eligible for fast + /// setup. bool all_ops_same_shape_ = false; + /// Whether or not all operands are 0d, this affects type promotion + bool all_ops_are_scalars_ = false; /// The "computation" dtype of TensorIterator, specifying what the dtype /// we will do the internal computation in TensorIterator. Typically, diff --git a/aten/src/ATen/TensorMeta.h b/aten/src/ATen/TensorMeta.h index 97124611ca13..07631c3552fd 100644 --- a/aten/src/ATen/TensorMeta.h +++ b/aten/src/ATen/TensorMeta.h @@ -71,6 +71,7 @@ namespace impl { struct TORCH_API MetaBase { virtual const Tensor& maybe_get_output(int64_t output_idx) = 0; + // Note: [set_output_*] // See: https://github.com/pytorch/pytorch/issues/69813 // Whenever defining the output properties in the META function of a // structured kernel (what was usually done with `set_output`), use one of diff --git a/aten/src/ATen/TensorSubclassLikeUtils.h b/aten/src/ATen/TensorSubclassLikeUtils.h index 5c01ce979040..44b422324590 100644 --- a/aten/src/ATen/TensorSubclassLikeUtils.h +++ b/aten/src/ATen/TensorSubclassLikeUtils.h @@ -1,5 +1,13 @@ #pragma once -#include +#include +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif namespace at { @@ -23,7 +31,9 @@ namespace at { // or returning a regular non-Tensor-subclass Tensor! 
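// Illustrative standalone sketch (not part of this patch) of the
// has_contiguous_first_dim() check patched above: every operand's innermost
// stride (in bytes) must equal its element size, and a 0-d iteration is
// trivially contiguous, which is exactly the early return the patch adds.
// Plain structs stand in for TensorIterator operands; names are invented.
#include <cassert>
#include <cstdint>
#include <vector>

struct Operand {
  std::vector<int64_t> byte_strides; // iteration strides, in bytes
  int64_t element_size;              // size of one element, in bytes
};

bool has_contiguous_first_dim(const std::vector<Operand>& ops, int64_t ndim) {
  if (ndim == 0) {
    return true; // nothing to stride over: 0-d is trivially contiguous
  }
  for (const auto& op : ops) {
    if (op.byte_strides[0] != op.element_size) {
      return false;
    }
  }
  return true;
}

int main() {
  // Two float operands iterated over a contiguous 1-d loop.
  std::vector<Operand> ops = {{{4}, 4}, {{4}, 4}};
  assert(has_contiguous_first_dim(ops, /*ndim=*/1));

  // A scalar (0-d) computation: no first dim to check.
  assert(has_contiguous_first_dim({}, /*ndim=*/0));

  // A strided operand (e.g. every other element) breaks the fast path.
  ops[1].byte_strides = {8};
  assert(!has_contiguous_first_dim(ops, /*ndim=*/1));
  return 0;
}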
constexpr auto kFunctorchWrappedTensors = DispatchKeySet( - {DispatchKey::FuncTorchGradWrapper, DispatchKey::FuncTorchBatched}); + {DispatchKey::FuncTorchGradWrapper, + DispatchKey::FuncTorchBatched, + DispatchKey::Functionalize}); constexpr auto kTensorSubclassLike = kFunctorchWrappedTensors | @@ -39,16 +49,22 @@ constexpr auto kTensorSubclassLike = DispatchKeySet(BackendComponent::MetaBit); inline bool isTensorSubclassLike(const Tensor& tensor) { + if (c10::impl::dispatch_mode_enabled()) + return true; auto key_set = tensor.unsafeGetTensorImpl()->key_set(); return !(key_set & kTensorSubclassLike).empty(); } inline bool areAnyTensorSubclassLike(TensorList tensors) { + if (c10::impl::dispatch_mode_enabled()) + return true; return std::any_of(tensors.begin(), tensors.end(), isTensorSubclassLike); } inline bool areAnyOptionalTensorSubclassLike( const c10::List>& tensors) { + if (c10::impl::dispatch_mode_enabled()) + return true; return std::any_of( tensors.begin(), tensors.end(), [](const optional& opt_tensor) { return ( @@ -56,4 +72,16 @@ inline bool areAnyOptionalTensorSubclassLike( }); } +// Helper function to deal testing truthfulness of a scalar tensor +// in a Composite Compliant manner. +// NOTE: This function expects a scalar tensor of boolean dtype. +// Eg. +// Non-Composite Compliant Pattern : (t == 0).all().item() +// Composite Compliant Patter : is_salar_tensor_true((t == 0).all()) +inline bool is_scalar_tensor_true(const Tensor& t) { + TORCH_INTERNAL_ASSERT(t.dim() == 0) + TORCH_INTERNAL_ASSERT(t.scalar_type() == kBool) + return at::equal(t, t.new_ones({}, t.options())); +} + } // namespace at diff --git a/aten/src/ATen/TensorUtils.cpp b/aten/src/ATen/TensorUtils.cpp index 7fbddd7a3482..4f211df680ec 100644 --- a/aten/src/ATen/TensorUtils.cpp +++ b/aten/src/ATen/TensorUtils.cpp @@ -75,6 +75,14 @@ void checkSize(CheckedFrom c, const TensorGeometryArg& t, IntArrayRef sizes) { " for ", t, " (while checking arguments for ", c, ")"); } +void checkSize_symint(CheckedFrom c, const TensorGeometryArg& t, c10::SymIntArrayRef sizes) { + checkDim(c, t, sizes.size()); + TORCH_CHECK( + t->sym_sizes().equals(sizes), + "Expected tensor of size ", sizes, ", but got tensor of size ", t->sizes(), + " for ", t, " (while checking arguments for ", c, ")"); +} + void checkSize(CheckedFrom c, const TensorGeometryArg& t, int64_t dim, int64_t size) { TORCH_CHECK( t->size(dim) == size, @@ -83,6 +91,14 @@ void checkSize(CheckedFrom c, const TensorGeometryArg& t, int64_t dim, int64_t s " (while checking arguments for ", c, ")"); } +void checkSize_symint(CheckedFrom c, const TensorGeometryArg& t, int64_t dim, c10::SymInt size) { + TORCH_CHECK( + t->sym_size(dim) == size, + "Expected tensor to have size ", size, " at dimension ", dim, + ", but got size ", t->size(dim), " for ", t, + " (while checking arguments for ", c, ")"); +} + void checkAllSame(CheckedFrom c, ArrayRef tensors, void(*fn)(CheckedFrom, const TensorArg&, const TensorArg&)) { const TensorArg* t0 = nullptr; for (auto& t : tensors) { @@ -310,12 +326,12 @@ std::vector defaultStrides(IntArrayRef sizes) { // templatized for DimVector and IntArrayRef use cases, // see overloads of computeStride() below. 
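// Illustrative standalone sketch (not part of this patch) of the simplest
// case handled by the computeStride() overloads below: when the source is
// contiguous, the strides of a reshaped view are just the contiguous strides
// of the new shape (the general algorithm additionally matches "chunks" of
// dims whose element counts agree). int64_t stands in for both the plain
// and the SymInt instantiations; the function name is invented.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using Shape = std::vector<int64_t>;

Shape contiguous_strides(const Shape& shape) {
  Shape strides(shape.size());
  int64_t expected = 1;
  for (int64_t d = static_cast<int64_t>(shape.size()) - 1; d >= 0; --d) {
    strides[d] = expected;
    expected *= std::max<int64_t>(shape[d], 1); // size-0 dims still get a stride
  }
  return strides;
}

int main() {
  // A contiguous [2, 3, 4] tensor has strides [12, 4, 1]...
  assert((contiguous_strides({2, 3, 4}) == Shape{12, 4, 1}));
  // ...and viewing it as [6, 4] yields the contiguous strides of [6, 4].
  assert((contiguous_strides({6, 4}) == Shape{4, 1}));
  return 0;
}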
// -template +template inline c10::optional computeStride_impl( - IntArrayRef oldshape, - IntArrayRef oldstride, + const NewShapeVec& oldshape, + const NewShapeVec& oldstride, const NewShapeVec& newshape, - ResultVec toResult(const IntArrayRef&) + ResultVec toResult(const NewShapeVec&) ) { if (oldshape.empty()) { return ResultVec(newshape.size(), 1); @@ -326,7 +342,7 @@ inline c10::optional computeStride_impl( // we use the stride as if it were computed via resize. // This could perhaps be combined with the below code, but the complexity // didn't seem worth it. - const int64_t numel = c10::multiply_integers(oldshape); + const Numel numel = c10::multiply_integers(oldshape); if (numel == 0 && oldshape.equals(newshape)) { return toResult(oldstride); } @@ -338,7 +354,7 @@ inline c10::optional computeStride_impl( newstride[view_d] = 1; } else { newstride[view_d] = - std::max(newshape[view_d+1], 1) * newstride[view_d+1]; + std::max(newshape[view_d+1], Numel(1)) * newstride[view_d+1]; } } return newstride; @@ -346,10 +362,10 @@ inline c10::optional computeStride_impl( int64_t view_d = (int64_t)newshape.size() - 1; // stride for each subspace in the chunk - int64_t chunk_base_stride = oldstride.back(); + Numel chunk_base_stride = oldstride.back(); // numel in current chunk - int64_t tensor_numel = 1; - int64_t view_numel = 1; + Numel tensor_numel = 1; + Numel view_numel = 1; for (int64_t tensor_d = oldshape.size() - 1; tensor_d >= 0; tensor_d--) { tensor_numel *= oldshape[tensor_d]; // if end of tensor size chunk, check view @@ -383,7 +399,15 @@ c10::optional> computeStride( IntArrayRef oldstride, IntArrayRef newshape) { auto toResult = [](const IntArrayRef& a) { return a.vec(); }; - return computeStride_impl, IntArrayRef>(oldshape, oldstride, newshape, toResult); + return computeStride_impl, IntArrayRef, int64_t>(oldshape, oldstride, newshape, toResult); +} + +c10::optional computeStride( + c10::SymIntArrayRef oldshape, + c10::SymIntArrayRef oldstride, + c10::SymIntArrayRef newshape) { + auto toResult = [](const SymIntArrayRef& a) { return SymDimVector(a); }; + return computeStride_impl(oldshape, oldstride, newshape, toResult); } c10::optional computeStride( @@ -391,7 +415,7 @@ c10::optional computeStride( IntArrayRef oldstride, const DimVector& newshape) { auto toResult = [](const IntArrayRef& a) { return DimVector(a); }; - return computeStride_impl(oldshape, oldstride, newshape, toResult); + return computeStride_impl(oldshape, oldstride, newshape, toResult); } } // namespace detail diff --git a/aten/src/ATen/TensorUtils.h b/aten/src/ATen/TensorUtils.h index 4bfe87c9de44..8f5d2687c9c3 100644 --- a/aten/src/ATen/TensorUtils.h +++ b/aten/src/ATen/TensorUtils.h @@ -85,11 +85,20 @@ TORCH_API void checkSize( CheckedFrom c, const TensorGeometryArg& t, IntArrayRef sizes); +TORCH_API void checkSize_symint( + CheckedFrom c, + const TensorGeometryArg& t, + c10::SymIntArrayRef sizes); TORCH_API void checkSize( CheckedFrom c, const TensorGeometryArg& t, int64_t dim, int64_t size); +TORCH_API void checkSize_symint( + CheckedFrom c, + const TensorGeometryArg& t, + int64_t dim, + c10::SymInt size); TORCH_API void checkNumel( CheckedFrom c, const TensorGeometryArg& t, @@ -157,6 +166,11 @@ TORCH_API c10::optional> computeStride( IntArrayRef oldstride, IntArrayRef newshape); +TORCH_API c10::optional computeStride( + c10::SymIntArrayRef oldshape, + c10::SymIntArrayRef oldstride, + c10::SymIntArrayRef newshape); + TORCH_API c10::optional computeStride( IntArrayRef oldshape, IntArrayRef oldstride, diff --git 
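Assuming the new SymInt overload of computeStride lands in at::detail, as the namespace-closing comment in the hunk suggests, a caller could look roughly like the sketch below (strides_if_viewable is a hypothetical name, and the exact optional/SymDimVector spelling of the return type follows the declaration above):

    #include <ATen/ATen.h>
    #include <ATen/TensorUtils.h>

    // Ask whether reshaping `self` to a symbolic `new_shape` is expressible as a
    // view, and if so with which (symbolic) strides; returns an optional
    // SymDimVector per the declaration above, nullopt when no view is possible.
    auto strides_if_viewable(const at::Tensor& self, c10::SymIntArrayRef new_shape) {
      return at::detail::computeStride(self.sym_sizes(), self.sym_strides(), new_shape);
    }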
a/aten/src/ATen/ThreadLocalState.cpp b/aten/src/ATen/ThreadLocalState.cpp index 8315ddad97b2..5c8214b7d882 100644 --- a/aten/src/ATen/ThreadLocalState.cpp +++ b/aten/src/ATen/ThreadLocalState.cpp @@ -6,6 +6,7 @@ #include #include +#include namespace at { @@ -14,18 +15,24 @@ ThreadLocalState::ThreadLocalState() debug_info_(c10::ThreadLocalDebugInfo::current()), functorch_tls_(functorch::getCopyOfFuncTorchTLS()), autograd_tls_(c10::AutogradState::get_tls_state()), - python_torch_function_state_(at::impl::PythonTorchFunctionTLS::get_state()) { + python_dispatcher_state_(c10::impl::PythonDispatcherTLS::get_state()), + python_torch_function_state_(at::impl::PythonTorchFunctionTLS::get_state()), + functionalization_reapply_views_state_(at::functionalization::impl::getFunctionalizationReapplyViewsTLS()) { rf_tls_ = at::get_record_function_tls_(); - saved_tensors_default_hooks_ = at::SavedTensorDefaultHooks::get_stack(); + saved_tensors_default_hooks_state_ = at::SavedTensorDefaultHooks::get_tls_state(); - torch_dispatch_mode_state_ = at::impl::TorchDispatchModeTLS::get_state(); + torch_dispatch_mode_state_ = c10::impl::TorchDispatchModeTLS::get_state(); } void ThreadLocalState::set_grad_mode(bool enabled) { autograd_tls_.set_grad_mode(enabled); } +void ThreadLocalState::set_multithreading_enabled(bool enabled) { + autograd_tls_.set_multithreading_enabled(enabled); +} + /* static */ void ThreadLocalState::setThreadLocalState( const ThreadLocalState& state) { @@ -33,19 +40,23 @@ void ThreadLocalState::setThreadLocalState( // restore the dispatch key set TLS at the same time. c10::AutogradState::set_tls_state(state.autograd_tls_); - at::impl::TorchDispatchModeTLS::set_state(state.torch_dispatch_mode_state_); + c10::impl::TorchDispatchModeTLS::set_state(state.torch_dispatch_mode_state_); at::impl::PythonTorchFunctionTLS::set_state(state.python_torch_function_state_); at::set_record_function_tls_(state.rf_tls_); - at::SavedTensorDefaultHooks::set_stack(state.saved_tensors_default_hooks_); + at::SavedTensorDefaultHooks::set_tls_state(state.saved_tensors_default_hooks_state_); + + c10::impl::PythonDispatcherTLS::set_state(state.python_dispatcher_state_); c10::ThreadLocalDebugInfo::_forceCurrentDebugInfo(state.debug_info_); c10::impl::_force_tls_local_dispatch_key_set(state.dispatch_key_); functorch::setFuncTorchTLS(state.functorch_tls_); + + at::functionalization::impl::setFunctionalizationReapplyViewsTLS(state.functionalization_reapply_views_state_); } } // namespace at diff --git a/aten/src/ATen/ThreadLocalState.h b/aten/src/ATen/ThreadLocalState.h index a21ee6a674f3..0184cc9b82c4 100644 --- a/aten/src/ATen/ThreadLocalState.h +++ b/aten/src/ATen/ThreadLocalState.h @@ -9,8 +9,10 @@ #include #include -#include +#include #include +#include +#include namespace at { @@ -28,6 +30,12 @@ class TORCH_API ThreadLocalState { // autograd engine. void set_grad_mode(bool enabled); + // set_multithreading_enabled - force the value of the multithreading-enabled TLS in + // the current state object. This is used for example in the + // autograd engine.
+ void set_multithreading_enabled(bool enabled); + // Sets thread local variables in the current thread, // according to the thread boundary specified static void setThreadLocalState(const ThreadLocalState& state); @@ -55,13 +63,18 @@ class TORCH_API ThreadLocalState { AutogradState autograd_tls_; // TLS for enable_torch_dispatch_mode - std::shared_ptr torch_dispatch_mode_state_; + c10::impl::TorchDispatchModeTLS torch_dispatch_mode_state_; + + // TLS for enable_python_dispatcher + c10::impl::PyInterpreter* python_dispatcher_state_; // TLS for __torch_function__ (mode and disable_torch_function) at::impl::PythonTorchFunctionTLS python_torch_function_state_; // TLS for saved tensors default hooks - std::stack> saved_tensors_default_hooks_; + at::impl::SavedTensorDefaultHooksTLS saved_tensors_default_hooks_state_; + + bool functionalization_reapply_views_state_; friend class ThreadLocalStateGuard; }; diff --git a/aten/src/ATen/Utils.h b/aten/src/ATen/Utils.h index bbc235182f1e..61c9c58fa437 100644 --- a/aten/src/ATen/Utils.h +++ b/aten/src/ATen/Utils.h @@ -26,59 +26,6 @@ namespace at { TORCH_API int _crash_if_asan(int); -// TODO: This unwrapping code is ONLY used for TH bindings; once TH goes -// away, we can delete this function -static inline TensorImpl* checked_dense_tensor_unwrap( - const Tensor& expr, - const char* name, - int pos, - const char* api, - bool allowNull, - DeviceType device_type, - ScalarType scalar_type) { - if (allowNull && !expr.defined()) { - return nullptr; - } - if (expr.layout() != Layout::Strided) { - AT_ERROR( - "Expected dense tensor but got ", - expr.layout(), - " for argument #", - pos, - " '", - name, - "' in call to ", - api); - } - if (expr.device().type() != device_type) { - AT_ERROR( - "Expected object of device type ", - device_type, - " but got device type ", - expr.device().type(), - " for argument #", - pos, - " '", - name, - "' in call to ", - api); - } - if (expr.scalar_type() != scalar_type) { - AT_ERROR( - "Expected object of scalar type ", - scalar_type, - " but got scalar type ", - expr.scalar_type(), - " for argument #", - pos, - " '", - name, - "' in call to ", - api); - } - return expr.unsafeGetTensorImpl(); -} - // Converts a TensorList (i.e. ArrayRef to vector of TensorImpl*) // NB: This is ONLY used by legacy TH bindings, and ONLY used by cat. // Once cat is ported entirely to ATen this can be deleted! diff --git a/aten/src/ATen/VmapTransforms.cpp b/aten/src/ATen/VmapTransforms.cpp index 20c792f73709..71ef7a169026 100644 --- a/aten/src/ATen/VmapTransforms.cpp +++ b/aten/src/ATen/VmapTransforms.cpp @@ -1,5 +1,6 @@ #include #include +#include #include namespace at { @@ -188,7 +189,7 @@ static Tensor alignBatchDimsAtFront( // 4. Expand each physical tensor so that they have output batch size equal // to `batch_sizes` VmapPhysicalViewVec -MultiBatchVmapTransform::logicalToPhysical(TensorList logical_tensors) { +MultiBatchVmapTransform::logicalToPhysical(ITensorListRef logical_tensors) { // Figure out all of the collective vmap levels in `logical_tensors`. std::bitset collective_levels; for (const auto& logical_tensor : logical_tensors) { diff --git a/aten/src/ATen/VmapTransforms.h b/aten/src/ATen/VmapTransforms.h index 53e476e2243f..cece52dcbc41 100644 --- a/aten/src/ATen/VmapTransforms.h +++ b/aten/src/ATen/VmapTransforms.h @@ -1,6 +1,7 @@ #pragma once #include +#include namespace at { @@ -55,7 +56,7 @@ using VmapDimVector = SmallVector; // and returns a VmapPhysicalView on the tensor(s). 
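The ThreadLocalState members added above snapshot more TLS so that work handed to another thread observes the caller's settings. A sketch of the capture/restore pattern these fields support, assuming only the ThreadLocalState and ThreadLocalStateGuard types from this header (run_with_callers_tls is a hypothetical wrapper, not part of the patch):

    #include <ATen/ThreadLocalState.h>
    #include <functional>
    #include <thread>

    // Capture the caller's TLS (grad mode, dispatch modes, saved-tensor hooks,
    // the Python dispatcher and functionalization flags, ...) and re-apply it on
    // a worker thread before running the task.
    void run_with_callers_tls(std::function<void()> fn) {
      at::ThreadLocalState snapshot;  // captured on the launching thread
      std::thread worker([fn = std::move(fn), snapshot]() {
        at::ThreadLocalStateGuard guard(snapshot);  // restored on the worker
        fn();
      });
      worker.join();
    }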
struct TORCH_API MultiBatchVmapTransform { static VmapPhysicalView logicalToPhysical(const Tensor& logical_tensor); - static VmapPhysicalViewVec logicalToPhysical(TensorList logical_tensors); + static VmapPhysicalViewVec logicalToPhysical(ITensorListRef logical_tensors); }; // VmapTransform for operators that broadcast all inputs. diff --git a/aten/src/ATen/WrapDimUtils.h b/aten/src/ATen/WrapDimUtils.h index 13f8658c354d..b0bc583b90c2 100644 --- a/aten/src/ATen/WrapDimUtils.h +++ b/aten/src/ATen/WrapDimUtils.h @@ -8,22 +8,17 @@ namespace at { -static inline int64_t maybe_wrap_dim( - int64_t dim, - int64_t dim_post_expr, - bool wrap_scalar = true) { - // if dim_post_expr is 0 and wrap_scalar is true, then dim must be in the - // range [-1, 0]. This is a special case for scalar tensors and manifests in - // e.g. torch.sum(scalar_tensor, 0) Otherwise, dim should be in the range - // [-dim_post_expr, dim_post_expr-1]. - return c10::maybe_wrap_dim(dim, dim_post_expr, wrap_scalar); -} +// if dim_post_expr is 0 and wrap_scalar is true, then dim must be in the +// range [-1, 0]. This is a special case for scalar tensors and manifests in +// e.g. torch.sum(scalar_tensor, 0) Otherwise, dim should be in the range +// [-dim_post_expr, dim_post_expr-1]. +using c10::maybe_wrap_dim; -static inline int64_t maybe_wrap_dim(int64_t dim, TensorImpl* tensor) { +inline int64_t maybe_wrap_dim(int64_t dim, TensorImpl* tensor) { return maybe_wrap_dim(dim, tensor->dim()); } -static inline int64_t maybe_wrap_dim(int64_t dim, TensorList tensors) { +inline int64_t maybe_wrap_dim(int64_t dim, TensorList tensors) { if (tensors.size() == 0) { // can't wrap empty TensorList; rely on underlying implementation to throw // error if necessary. @@ -32,7 +27,7 @@ static inline int64_t maybe_wrap_dim(int64_t dim, TensorList tensors) { return maybe_wrap_dim(dim, tensors[0].dim()); } -static inline int64_t maybe_wrap_dim( +inline int64_t maybe_wrap_dim( int64_t dim, const std::vector>& tensor_sizes) { if (tensor_sizes.size() == 0) { @@ -43,14 +38,29 @@ static inline int64_t maybe_wrap_dim( return maybe_wrap_dim(dim, tensor_sizes[0].size()); } -// wrap each dim in the dims array, taking dim_post_expr as the true number of -// dimensions -static inline void maybe_wrap_dims_n( +// Given an array of dimensions `dims` of length `ndims`, this function "Wraps" +// each dim in-place for a tensor of rank `dim_post_expr`, allowing dims to be +// specified using negative indices. +// +// Additionally, if `wrap_scalar` is true then scalar tensors with rank 0, will +// allow dimensions in the range [-1, 0]. Otherwise, an IndexError is raised for +// dimensions not in the range [-dim_post_expr, dim_post_expr). +inline void maybe_wrap_dims_n( int64_t* dims, int64_t ndims, - int64_t dim_post_expr) { + int64_t dim_post_expr, + bool wrap_scalars = true) { if (dim_post_expr <= 0) { - dim_post_expr = 1; // this will make range [-1, 0] + if (wrap_scalars) { + dim_post_expr = 1; // this will make range [-1, 0] + } else { + TORCH_CHECK_INDEX( + ndims == 0, + "Dimension specified as ", + dims[0], + " but tensor has no dimensions"); + return; + } } int64_t min = -dim_post_expr; int64_t max = dim_post_expr - 1; @@ -72,11 +82,20 @@ static inline void maybe_wrap_dims_n( } } -// Wrap each dim in a contiguous container, taking dim_post_expr as the true -// number of dimensions E.g. 
could also be std::array or c10::SmallVector +// Given a contiguous container of dimensions `dims`, this function "Wraps" +// each dim in-place for a tensor of rank `dim_post_expr`, allowing dims to be +// specified using negative indices. +// +// Additionally, if `wrap_scalar` is true then scalar tensors with rank 0, will +// allow dimensions in the range [-1, 0]. Otherwise, an IndexError is raised for +// dimensions not in the range [-dim_post_expr, dim_post_expr). template -inline void maybe_wrap_dims(Container& dims, int64_t dim_post_expr) { - return maybe_wrap_dims_n(dims.data(), dims.size(), dim_post_expr); +inline void maybe_wrap_dims( + Container& dims, + int64_t dim_post_expr, + bool wrap_scalars = true) { + return maybe_wrap_dims_n( + dims.data(), dims.size(), dim_post_expr, wrap_scalars); } // previously, size [0] tensors were the only possible empty tensors; thus, it @@ -85,11 +104,12 @@ inline void maybe_wrap_dims(Container& dims, int64_t dim_post_expr) { // dimension behavior and dimension size checking). We maintain this behavior // for backwards compatibility, but only for this specific size (i.e. other // empty sizes are not skipped). -static inline int64_t legacy_cat_wrap_dim( +template +inline int64_t _legacy_cat_wrap_dim( int64_t dim, - const std::vector>& tensor_sizes) { + const std::vector>& tensor_sizes) { for (auto& sizes : tensor_sizes) { - if (sizes == std::vector({0})) { + if (sizes.size() == 1 && sizes[0] == 0) { continue; } return maybe_wrap_dim(dim, sizes.size()); @@ -97,8 +117,22 @@ static inline int64_t legacy_cat_wrap_dim( return dim; } -static inline int64_t legacy_cat_wrap_dim(int64_t dim, ITensorListRef tensors) { - for (auto& tensor : tensors) { +inline int64_t legacy_cat_wrap_dim( + int64_t dim, + const std::vector>& tensor_sizes) { + return _legacy_cat_wrap_dim(dim, tensor_sizes); +} + +inline int64_t legacy_cat_wrap_dim_symint( + int64_t dim, + const std::vector>& tensor_sizes) { + return _legacy_cat_wrap_dim(dim, tensor_sizes); +} + +inline int64_t legacy_cat_wrap_dim( + int64_t dim, + const MaterializedITensorListRef& tensors) { + for (const Tensor& tensor : tensors) { if (tensor.dim() == 1 && tensor.sizes()[0] == 0) { continue; } @@ -108,7 +142,7 @@ static inline int64_t legacy_cat_wrap_dim(int64_t dim, ITensorListRef tensors) { } // wrap negative dims in a vector -static inline void wrap_all_dims( +inline void wrap_all_dims( std::vector& dims_to_wrap, int64_t tensor_total_dims) { for (const auto i : c10::irange(dims_to_wrap.size())) { diff --git a/aten/src/ATen/autocast_mode.cpp b/aten/src/ATen/autocast_mode.cpp index da0a87b02d1d..ee8b4b30b152 100644 --- a/aten/src/ATen/autocast_mode.cpp +++ b/aten/src/ATen/autocast_mode.cpp @@ -2,12 +2,14 @@ #include #include #include +#include #include #include #include #include +#include namespace at { namespace autocast { @@ -36,6 +38,14 @@ void set_xpu_enabled(bool new_enabled) { c10::impl::tls_set_dispatch_key_excluded(DispatchKey::AutocastXPU, !new_enabled); } +bool is_hpu_enabled() { + return !c10::impl::tls_is_dispatch_key_excluded(DispatchKey::AutocastHPU); +} + +void set_hpu_enabled(bool new_enabled) { + c10::impl::tls_set_dispatch_key_excluded(DispatchKey::AutocastHPU, !new_enabled); +} + namespace { // Imitate Apex and cache some of the casts to streamline parameter reuse. // Our heuristic is to cache lower_precision_fp casts of fp32 model weights (see cached_cast below). @@ -55,7 +65,8 @@ namespace { // directly against incoming TensorImpl*s. 
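A hypothetical helper showing what the dim-wrapping utilities documented above do in practice (normalized_dims is illustrative only and not part of the patch):

    #include <ATen/WrapDimUtils.h>
    #include <cstdint>
    #include <vector>

    // Normalize user-supplied (possibly negative) dims against a tensor rank,
    // in place: e.g. {-1, 0, -3} with rank 4 becomes {3, 0, 1}. Out-of-range
    // entries fail the TORCH_CHECK_INDEX inside maybe_wrap_dims_n.
    std::vector<int64_t> normalized_dims(std::vector<int64_t> dims, int64_t rank) {
      at::maybe_wrap_dims(dims, rank);  // wrap_scalars defaults to true
      return dims;
    }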
using weakref_type = c10::weak_intrusive_ptr; using val_type = std::tuple; -thread_local std::unordered_map cached_casts; +std::unordered_map cached_casts; +std::mutex cached_casts_mutex; // nesting tracks the nesting depth of the Python-side context manager. // When the autocast context manager exits to a nesting level that's outside @@ -69,6 +80,9 @@ thread_local at::ScalarType autocast_cpu_dtype = at::kBFloat16; // autocast_xpu_dtype is the lower_precision_fp used by AutocastXPU. thread_local at::ScalarType autocast_xpu_dtype = at::kBFloat16; +// autocast_hpu_dtype is the lower_precision_fp used by AutocastHPU. +thread_local at::ScalarType autocast_hpu_dtype = at::kBFloat16; + // should we enabled the cache inside autocast. thread_local bool cache_enabled = true; @@ -77,6 +91,7 @@ thread_local at::ScalarType autocast_gpu_dtype = at::kHalf; } void clear_cache() { + const std::lock_guard lock(cached_casts_mutex); cached_casts.clear(); } @@ -100,6 +115,10 @@ at::ScalarType get_autocast_xpu_dtype() { return autocast_xpu_dtype; } +at::ScalarType get_autocast_hpu_dtype() { + return autocast_hpu_dtype; +} + void set_autocast_cpu_dtype(at::ScalarType dtype) { TORCH_CHECK( dtype == at::kBFloat16, @@ -115,6 +134,10 @@ void set_autocast_xpu_dtype(at::ScalarType dtype) { autocast_xpu_dtype = dtype; } +void set_autocast_hpu_dtype(at::ScalarType dtype) { + autocast_hpu_dtype = dtype; +} + bool is_autocast_cache_enabled() { return cache_enabled; } @@ -135,6 +158,7 @@ Tensor cached_cast(at::ScalarType to_type, const Tensor& arg, DeviceType device_ arg.scalar_type() == at::kFloat && arg.requires_grad() && arg.is_leaf() && !arg.is_view() && cache_enabled); if (can_try_cache) { + const std::lock_guard lock(cached_casts_mutex); auto it = cached_casts.find(arg.unsafeGetTensorImpl()); if (it != cached_casts.end()) { return std::get<1>(it->second); @@ -304,9 +328,12 @@ Therefore, for the moment, this is all copy pasted in from VariableTypeEverythin // Common cases where registration signature matches redispatch signature // (that's why SIGNATURE is repeated in the WrapFunction instantiation) -#define KERNEL(FUNC, REGISTER_NAME, SIGNATURE, POLICY) \ - m.impl(TORCH_SELECTIVE_NAME("aten::" REGISTER_NAME), \ - &WrapFunction::type::call); +#define KERNEL(OP, POLICY) \ + m.impl(TORCH_SELECTIVE_NAME("aten::" #OP), \ + &WrapFunction::type::call); +#define KERNEL2(OP, OVERLOAD, POLICY) \ + m.impl(TORCH_SELECTIVE_NAME("aten::" #OP "." #OVERLOAD), \ + &WrapFunction::type::call); // Less-common but still useful case: redispatching to a function with a new signature (e.g. appending a dtype) #define KERNEL_DIFFERENT_REDISPATCH_SIGNATURE(REDISPATCH_FUNC, REGISTER_NAME, REGISTER_SIGNATURE, REDISPATCH_SIGNATURE, POLICY) \ @@ -314,9 +341,12 @@ Therefore, for the moment, this is all copy pasted in from VariableTypeEverythin &WrapFunction::type::call); // KERNEL_CPU registration for AutocastCPU -#define KERNEL_CPU(FUNC, REGISTER_NAME, SIGNATURE, POLICY) \ - m.impl(TORCH_SELECTIVE_NAME("aten::" REGISTER_NAME), \ - &WrapFunction::type::call); +#define KERNEL_CPU(OP, POLICY) \ + m.impl(TORCH_SELECTIVE_NAME("aten::" #OP), \ + &WrapFunction::type::call); +#define KERNEL_CPU2(OP, OVERLOAD, POLICY) \ + m.impl(TORCH_SELECTIVE_NAME("aten::" #OP "." 
#OVERLOAD), \ + &WrapFunction::type::call); /***************************************** Explicit registration for out-of-place ops @@ -327,136 +357,110 @@ TORCH_LIBRARY_IMPL(_, Autocast, m) { TORCH_LIBRARY_IMPL(aten, Autocast, m) { // lower_precision_fp - KERNEL(ADD_NS(_convolution), "_convolution.deprecated", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, bool, IntArrayRef, int64_t, bool, bool, bool), lower_precision_fp) - KERNEL(ADD_NS(_convolution), "_convolution", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, bool, IntArrayRef, int64_t, bool, bool, bool, bool), lower_precision_fp) - KERNEL(ADD_NS(conv1d), "conv1d", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, int64_t), lower_precision_fp) - KERNEL(ADD_NS(conv2d), "conv2d", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, int64_t), lower_precision_fp) - KERNEL(ADD_NS(conv3d), "conv3d", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, int64_t), lower_precision_fp) - KERNEL(ADD_NS(conv_tbc), "conv_tbc", Tensor (const Tensor &, const Tensor &, const Tensor &, int64_t), lower_precision_fp) - KERNEL(ADD_NS(conv_transpose1d), "conv_transpose1d", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, int64_t, IntArrayRef), lower_precision_fp) - KERNEL(ADD_NS(conv_transpose2d), "conv_transpose2d.input", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, int64_t, IntArrayRef), lower_precision_fp) - KERNEL(ADD_NS(conv_transpose3d), "conv_transpose3d.input", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, int64_t, IntArrayRef), lower_precision_fp) - KERNEL(ADD_NS(convolution), "convolution", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, bool, IntArrayRef, int64_t), lower_precision_fp) - KERNEL(ADD_NS(cudnn_convolution), "cudnn_convolution", Tensor (const Tensor &, const Tensor &, IntArrayRef, IntArrayRef, IntArrayRef, int64_t, bool, bool, bool), lower_precision_fp) - KERNEL(ADD_NS(cudnn_convolution_transpose), "cudnn_convolution_transpose", Tensor (const Tensor &, const Tensor &, IntArrayRef, IntArrayRef, IntArrayRef, IntArrayRef, int64_t, bool, bool, bool), lower_precision_fp) - KERNEL(ADD_NS(prelu), "prelu", Tensor (const Tensor &, const Tensor &), lower_precision_fp) - KERNEL(ADD_NS(addmm), "addmm", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&, const Scalar&), lower_precision_fp) - KERNEL(ADD_NS(addmv), "addmv", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&, const Scalar&), lower_precision_fp) - KERNEL(ADD_NS(addr), "addr", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&, const Scalar&), lower_precision_fp) - KERNEL(ADD_NS(matmul), "matmul", Tensor (const Tensor &, const Tensor &), lower_precision_fp) - KERNEL(ADD_NS(einsum), "einsum", Tensor (c10::string_view, TensorList), lower_precision_fp) - KERNEL(ADD_NS(mm), "mm", Tensor (const Tensor &, const Tensor &), lower_precision_fp) - KERNEL(ADD_NS(mv), "mv", Tensor (const Tensor &, const Tensor &), lower_precision_fp) - KERNEL(ADD_NS(linear), "linear", Tensor (const Tensor &, const Tensor &, const c10::optional&), lower_precision_fp) - KERNEL(ADD_NS(addbmm), 
"addbmm", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&, const Scalar&), lower_precision_fp) - KERNEL(ADD_NS(baddbmm), "baddbmm", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&, const Scalar&), lower_precision_fp) - KERNEL(ADD_NS(bmm), "bmm", Tensor (const Tensor &, const Tensor &), lower_precision_fp) - KERNEL(ADD_NS(chain_matmul), "chain_matmul", Tensor (TensorList), lower_precision_fp) - KERNEL(ADD_NS(linalg_multi_dot), "linalg_multi_dot", Tensor (TensorList), lower_precision_fp) - // The macro doesn't like these (I think it chokes on commas inside <>) so write them manually - m.impl(TORCH_SELECTIVE_NAME("aten::_thnn_fused_lstm_cell"), - TORCH_FN((&WrapFunction (const Tensor &, const Tensor &, const Tensor &, const c10::optional&, const c10::optional&), - std::tuple (const Tensor &, const Tensor &, const Tensor &, const c10::optional&, const c10::optional&), - &ADD_NS(_thnn_fused_lstm_cell)>::type::call))); - m.impl("_thnn_fused_gru_cell", - TORCH_FN((&WrapFunction (const Tensor &, const Tensor &, const Tensor &, const c10::optional&, const c10::optional&), - std::tuple (const Tensor &, const Tensor &, const Tensor &, const c10::optional&, const c10::optional&), - &ADD_NS(_thnn_fused_gru_cell)>::type::call))); - m.impl("lstm_cell", - TORCH_FN((&WrapFunction (const Tensor &, TensorList, const Tensor &, const Tensor &, const c10::optional&, const c10::optional&), - std::tuple (const Tensor &, TensorList, const Tensor &, const Tensor &, const c10::optional&, const c10::optional&), - &ADD_NS(lstm_cell)>::type::call))); - m.impl("gru_cell", - TORCH_FN((&WrapFunction&, const c10::optional&), - Tensor (const Tensor &, const Tensor &, const Tensor &, const Tensor &, const c10::optional&, const c10::optional&), - &ADD_NS(gru_cell)>::type::call))); - m.impl("rnn_tanh_cell", // tanh unary op is executed as a cuda math library call. 
- TORCH_FN((&WrapFunction&, const c10::optional&), - Tensor (const Tensor &, const Tensor &, const Tensor &, const Tensor &, const c10::optional&, const c10::optional&), - &ADD_NS(rnn_tanh_cell)>::type::call))); - m.impl("rnn_relu_cell", - TORCH_FN((&WrapFunction&, const c10::optional&), - Tensor (const Tensor &, const Tensor &, const Tensor &, const Tensor &, const c10::optional&, const c10::optional&), - &ADD_NS(rnn_relu_cell)>::type::call))); + KERNEL2(_convolution, deprecated, lower_precision_fp) + KERNEL(_convolution, lower_precision_fp) + KERNEL(conv1d, lower_precision_fp) + KERNEL(conv2d, lower_precision_fp) + KERNEL(conv3d, lower_precision_fp) + KERNEL(conv_tbc, lower_precision_fp) + KERNEL(conv_transpose1d, lower_precision_fp) + KERNEL2(conv_transpose2d, input, lower_precision_fp) + KERNEL2(conv_transpose3d, input, lower_precision_fp) + KERNEL(convolution, lower_precision_fp) + KERNEL(cudnn_convolution, lower_precision_fp) + KERNEL(cudnn_convolution_transpose, lower_precision_fp) + KERNEL(prelu, lower_precision_fp) + KERNEL(addmm, lower_precision_fp) + KERNEL(addmv, lower_precision_fp) + KERNEL(addr, lower_precision_fp) + KERNEL(matmul, lower_precision_fp) + KERNEL(einsum, lower_precision_fp) + KERNEL(mm, lower_precision_fp) + KERNEL(mv, lower_precision_fp) + KERNEL(linear, lower_precision_fp) + KERNEL(addbmm, lower_precision_fp) + KERNEL(baddbmm, lower_precision_fp) + KERNEL(bmm, lower_precision_fp) + KERNEL(chain_matmul, lower_precision_fp) + KERNEL(linalg_multi_dot, lower_precision_fp) + KERNEL(_thnn_fused_lstm_cell, lower_precision_fp) + KERNEL(_thnn_fused_gru_cell, lower_precision_fp) + KERNEL(lstm_cell, lower_precision_fp) + KERNEL(gru_cell, lower_precision_fp) + KERNEL(rnn_tanh_cell, lower_precision_fp) + KERNEL(rnn_relu_cell, lower_precision_fp) + // fp32 - KERNEL(ADD_NS(acos), "acos", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(asin), "asin", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(cosh), "cosh", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(erfinv), "erfinv", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(exp), "exp", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(expm1), "expm1", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(log), "log", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(log10), "log10", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(log2), "log2", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(log1p), "log1p", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(reciprocal), "reciprocal", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(rsqrt), "rsqrt", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(sinh), "sinh", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(tan), "tan", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(pow), "pow.Tensor_Scalar", Tensor (const Tensor &, const Scalar&), fp32) - KERNEL(ADD_NS(pow), "pow.Tensor_Tensor", Tensor (const Tensor &, const Tensor &), fp32) - KERNEL(ADD_NS(pow), "pow.Scalar", Tensor (const Scalar&, const Tensor &), fp32) - KERNEL(ADD_NS(softplus), "softplus", Tensor (const Tensor &, const Scalar&, const Scalar&), fp32) - KERNEL(ADD_NS(layer_norm), "layer_norm", Tensor (const Tensor &, IntArrayRef, const c10::optional&, const c10::optional&, double, bool), fp32) - // The macro doesn't like this one (I think it chokes on commas inside <>) so write it manually - m.impl(TORCH_SELECTIVE_NAME("aten::native_layer_norm"), - TORCH_FN((&WrapFunction (const Tensor&, IntArrayRef, const c10::optional&, const c10::optional&, double), - std::tuple (const Tensor&, IntArrayRef, const c10::optional&, const c10::optional&, double), 
- &ADD_NS(native_layer_norm)>::type::call))); - KERNEL(ADD_NS(group_norm), "group_norm", Tensor (const Tensor &, int64_t, const c10::optional&, const c10::optional&, double, bool), fp32) - KERNEL(ADD_NS(frobenius_norm), "frobenius_norm", Tensor (const Tensor &), fp32) - KERNEL(ADD_NS(frobenius_norm), "frobenius_norm.dim", Tensor (const Tensor &, IntArrayRef, bool), fp32) - KERNEL(ADD_NS(nuclear_norm), "nuclear_norm", Tensor (const Tensor &, bool), fp32) - KERNEL(ADD_NS(nuclear_norm), "nuclear_norm.dim", Tensor (const Tensor &, IntArrayRef, bool), fp32) - KERNEL(ADD_NS(cosine_similarity), "cosine_similarity", Tensor (const Tensor &, const Tensor &, int64_t, double), fp32) - KERNEL(ADD_NS(poisson_nll_loss), "poisson_nll_loss", Tensor (const Tensor &, const Tensor &, bool, bool, double, int64_t), fp32) - KERNEL(ADD_NS(cosine_embedding_loss), "cosine_embedding_loss", Tensor (const Tensor &, const Tensor &, const Tensor &, double, int64_t), fp32) - KERNEL(ADD_NS(nll_loss), "nll_loss", Tensor (const Tensor &, const Tensor &, const c10::optional&, int64_t, int64_t), fp32) - KERNEL(ADD_NS(nll_loss2d), "nll_loss2d", Tensor (const Tensor &, const Tensor &, const c10::optional&, int64_t, int64_t), fp32) - KERNEL(ADD_NS(hinge_embedding_loss), "hinge_embedding_loss", Tensor (const Tensor &, const Tensor &, double, int64_t), fp32) - KERNEL(ADD_NS(kl_div), "kl_div", Tensor (const Tensor &, const Tensor &, int64_t, bool), fp32) - KERNEL(ADD_NS(l1_loss), "l1_loss", Tensor (const Tensor &, const Tensor &, int64_t), fp32) - KERNEL(ADD_NS(smooth_l1_loss), "smooth_l1_loss", Tensor (const Tensor &, const Tensor &, int64_t, double), fp32) - KERNEL(ADD_NS(huber_loss), "huber_loss", Tensor (const Tensor &, const Tensor &, int64_t, double), fp32) - KERNEL(ADD_NS(mse_loss), "mse_loss", Tensor (const Tensor &, const Tensor &, int64_t), fp32) - KERNEL(ADD_NS(margin_ranking_loss), "margin_ranking_loss", Tensor (const Tensor &, const Tensor &, const Tensor &, double, int64_t), fp32) - KERNEL(ADD_NS(multilabel_margin_loss), "multilabel_margin_loss", Tensor (const Tensor &, const Tensor &, int64_t), fp32) - KERNEL(ADD_NS(soft_margin_loss), "soft_margin_loss", Tensor (const Tensor &, const Tensor &, int64_t), fp32) - KERNEL(ADD_NS(triplet_margin_loss), "triplet_margin_loss", Tensor (const Tensor &, const Tensor &, const Tensor &, double, double, double, bool, int64_t), fp32) - KERNEL(ADD_NS(multi_margin_loss), "multi_margin_loss", Tensor (const Tensor &, const Tensor &, const Scalar&, const Scalar&, const c10::optional&, int64_t), fp32) - KERNEL(ADD_NS(binary_cross_entropy_with_logits), "binary_cross_entropy_with_logits", Tensor (const Tensor &, const Tensor &, const c10::optional&, const c10::optional&, int64_t), fp32) - KERNEL(ADD_NS(dist), "dist", Tensor (const Tensor &, const Tensor &, const Scalar&), fp32) - KERNEL(ADD_NS(pdist), "pdist", Tensor (const Tensor &, double), fp32) - KERNEL(ADD_NS(cdist), "cdist", Tensor (const Tensor &, const Tensor &, double, c10::optional), fp32) - KERNEL(ADD_NS(renorm), "renorm", Tensor (const Tensor &, const Scalar&, int64_t, const Scalar&), fp32) - KERNEL(ADD_NS(logsumexp), "logsumexp", Tensor (const Tensor &, IntArrayRef, bool), fp32) + KERNEL(acos, fp32) + KERNEL(asin, fp32) + KERNEL(cosh, fp32) + KERNEL(erfinv, fp32) + KERNEL(exp, fp32) + KERNEL(expm1, fp32) + KERNEL(log, fp32) + KERNEL(log10, fp32) + KERNEL(log2, fp32) + KERNEL(log1p, fp32) + KERNEL(reciprocal, fp32) + KERNEL(rsqrt, fp32) + KERNEL(sinh, fp32) + KERNEL(tan, fp32) + KERNEL2(pow, Tensor_Scalar, fp32) + KERNEL2(pow, 
Tensor_Tensor, fp32) + KERNEL2(pow, Scalar, fp32) + KERNEL(softplus, fp32) + KERNEL(layer_norm, fp32) + KERNEL(native_layer_norm, fp32) + KERNEL(group_norm, fp32) + KERNEL(frobenius_norm, fp32) + KERNEL2(frobenius_norm, dim, fp32) + KERNEL(nuclear_norm, fp32) + KERNEL2(nuclear_norm, dim, fp32) + KERNEL(cosine_similarity, fp32) + KERNEL(poisson_nll_loss, fp32) + KERNEL(cosine_embedding_loss, fp32) + KERNEL(nll_loss, fp32) + KERNEL(nll_loss2d, fp32) + KERNEL(hinge_embedding_loss, fp32) + KERNEL(kl_div, fp32) + KERNEL(l1_loss, fp32) + KERNEL(smooth_l1_loss, fp32) + KERNEL(huber_loss, fp32) + KERNEL(mse_loss, fp32) + KERNEL(margin_ranking_loss, fp32) + KERNEL(multilabel_margin_loss, fp32) + KERNEL(soft_margin_loss, fp32) + KERNEL(triplet_margin_loss, fp32) + KERNEL(multi_margin_loss, fp32) + KERNEL(binary_cross_entropy_with_logits, fp32) + KERNEL(dist, fp32) + KERNEL(pdist, fp32) + KERNEL(cdist, fp32) + KERNEL(renorm, fp32) + KERNEL(logsumexp, fp32) // fp32_set_opt_dtype - KERNEL(ADD_NS(prod), "prod", Tensor (const Tensor &, c10::optional), fp32_set_opt_dtype) - KERNEL(ADD_NS(prod), "prod.dim_int", Tensor (const Tensor &, int64_t, bool, c10::optional), fp32_set_opt_dtype) - KERNEL(ADD_NS(prod), "prod.dim_Dimname", Tensor (const Tensor &, Dimname, bool, c10::optional), fp32_set_opt_dtype) - KERNEL(ADD_NS(softmax), "softmax.int", Tensor (const Tensor &, int64_t, c10::optional), fp32_set_opt_dtype) - KERNEL(ADD_NS(softmax), "softmax.Dimname", Tensor (const Tensor &, Dimname, c10::optional), fp32_set_opt_dtype) - KERNEL(ADD_NS(log_softmax), "log_softmax.int", Tensor (const Tensor &, int64_t, c10::optional), fp32_set_opt_dtype) - KERNEL(ADD_NS(log_softmax), "log_softmax.Dimname", Tensor (const Tensor &, Dimname, c10::optional), fp32_set_opt_dtype) - KERNEL(ADD_NS(cumprod), "cumprod", Tensor (const Tensor &, int64_t, c10::optional), fp32_set_opt_dtype) - KERNEL(ADD_NS(cumprod), "cumprod.dimname", Tensor (const Tensor &, Dimname, c10::optional), fp32_set_opt_dtype) - KERNEL(ADD_NS(cumsum), "cumsum", Tensor (const Tensor &, int64_t, c10::optional), fp32_set_opt_dtype) - KERNEL(ADD_NS(cumsum), "cumsum.dimname", Tensor (const Tensor &, Dimname, c10::optional), fp32_set_opt_dtype) + KERNEL(prod, fp32_set_opt_dtype) + KERNEL2(prod, dim_int, fp32_set_opt_dtype) + KERNEL2(prod, dim_Dimname, fp32_set_opt_dtype) + KERNEL2(softmax, int, fp32_set_opt_dtype) + KERNEL2(softmax, Dimname, fp32_set_opt_dtype) + KERNEL2(log_softmax, int, fp32_set_opt_dtype) + KERNEL2(log_softmax, Dimname, fp32_set_opt_dtype) + KERNEL(cumprod, fp32_set_opt_dtype) + KERNEL2(cumprod, dimname, fp32_set_opt_dtype) + KERNEL(cumsum, fp32_set_opt_dtype) + KERNEL2(cumsum, dimname, fp32_set_opt_dtype) + KERNEL(linalg_vector_norm, fp32_set_opt_dtype) + KERNEL(linalg_matrix_norm, fp32_set_opt_dtype) + KERNEL2(linalg_matrix_norm, str_ord, fp32_set_opt_dtype) // commenting these out because they accept an explicit (not-optional) dtype, and we shouldn't try to flip that even // when autocasting. 
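The registrations in this section lean entirely on the name-based KERNEL/KERNEL2 (and KERNEL_CPU/KERNEL_CPU2) forms defined earlier in the file, so each line only has to spell the op, the overload if any, and the cast policy. A rough stand-in for how those macros assemble the "aten::op.overload" schema string (simplified; the real macros dispatch through WrapFunction rather than a plain function):

    #include <cstdio>

    // Toy registry call so the sketch is self-contained.
    static void register_autocast_rule(const char* schema, const char* policy) {
      std::printf("register %s -> %s\n", schema, policy);
    }

    // The op name and overload are stringized and pasted into the schema.
    #define MY_KERNEL(OP, POLICY) register_autocast_rule("aten::" #OP, #POLICY);
    #define MY_KERNEL2(OP, OVERLOAD, POLICY) \
      register_autocast_rule("aten::" #OP "." #OVERLOAD, #POLICY);

    int main() {
      MY_KERNEL(addmm, lower_precision_fp)      // -> "aten::addmm"
      MY_KERNEL2(pow, Tensor_Scalar, fp32)      // -> "aten::pow.Tensor_Scalar"
    }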
- // KERNEL(ADD_NS(norm), "norm.ScalarOpt_dtype", Tensor (const Tensor &, c10::optional, ScalarType), fp32_set_opt_dtype) - // KERNEL(ADD_NS(norm), "norm.ScalarOpt_dim_dtype", Tensor (const Tensor &, c10::optional, IntArrayRef, bool, ScalarType), fp32_set_opt_dtype) - // KERNEL(ADD_NS(norm), "norm.names_ScalarOpt_dim_dtype", Tensor (const Tensor &, c10::optional, DimnameList, bool, ScalarType), fp32_set_opt_dtype) - KERNEL(ADD_NS(sum), "sum", Tensor (const Tensor &, c10::optional), fp32_set_opt_dtype) - KERNEL(ADD_NS(sum), "sum.dim_IntList", Tensor (const Tensor &, OptionalIntArrayRef, bool, c10::optional), fp32_set_opt_dtype) - KERNEL(ADD_NS(sum), "sum.dim_DimnameList", Tensor (const Tensor &, DimnameList, bool, c10::optional), fp32_set_opt_dtype) + // KERNEL2(norm, ScalarOpt_dtype, fp32_set_opt_dtype) + // KERNEL2(norm, ScalarOpt_dim_dtype, fp32_set_opt_dtype) + // KERNEL2(norm, names_ScalarOpt_dim_dtype, fp32_set_opt_dtype) + KERNEL(sum, fp32_set_opt_dtype) + KERNEL2(sum, dim_IntList, fp32_set_opt_dtype) + KERNEL2(sum, dim_DimnameList, fp32_set_opt_dtype) // fp32_append_dtype // The fp32_append_dtype wrapper overrides implicit promotion behavior. // norm does not implicitly promote, but be aware when adding new ops to this policy. @@ -464,16 +468,16 @@ TORCH_LIBRARY_IMPL(aten, Autocast, m) { KERNEL_DIFFERENT_REDISPATCH_SIGNATURE(ADD_NS(norm), "norm.ScalarOpt_dim", Tensor (const Tensor &, const c10::optional&, IntArrayRef, bool), Tensor (const Tensor &, const c10::optional&, IntArrayRef, bool, ScalarType), fp32_append_dtype) KERNEL_DIFFERENT_REDISPATCH_SIGNATURE(ADD_NS(norm), "norm.names_ScalarOpt_dim", Tensor (const Tensor &, const c10::optional&, DimnameList, bool), Tensor (const Tensor &, const c10::optional&, DimnameList, bool, ScalarType), fp32_append_dtype) // promote - KERNEL(ADD_NS(addcdiv), "addcdiv", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&), promote) - KERNEL(ADD_NS(addcmul), "addcmul", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&), promote) - KERNEL(ADD_NS(atan2), "atan2", Tensor (const Tensor &, const Tensor &), promote) - KERNEL(ADD_NS(bilinear), "bilinear", Tensor (const Tensor &, const Tensor &, const Tensor &, const c10::optional&), promote) - KERNEL(ADD_NS(cross), "cross", Tensor (const Tensor &, const Tensor &, c10::optional), promote) - KERNEL(ADD_NS(dot), "dot", Tensor (const Tensor &, const Tensor &), promote) - KERNEL(ADD_NS(grid_sampler), "grid_sampler", Tensor (const Tensor &, const Tensor &, int64_t, int64_t, bool), promote) - KERNEL(ADD_NS(index_put), "index_put", Tensor (const Tensor &, const torch::List>&, const Tensor &, bool), promote) - KERNEL(ADD_NS(tensordot), "tensordot", Tensor (const Tensor &, const Tensor &, IntArrayRef, IntArrayRef), promote) - KERNEL(ADD_NS(scatter_add), "scatter_add", Tensor (const Tensor&, int64_t, const Tensor&, const Tensor&), promote) + KERNEL(addcdiv, promote) + KERNEL(addcmul, promote) + KERNEL(atan2, promote) + KERNEL(bilinear, promote) + KERNEL(cross, promote) + KERNEL(dot, promote) + KERNEL(grid_sampler, promote) + KERNEL(index_put, promote) + KERNEL(tensordot, promote) + KERNEL(scatter_add, promote) m.impl(TORCH_SELECTIVE_NAME("aten::binary_cross_entropy"), TORCH_FN((&at::autocast::binary_cross_entropy_banned))); @@ -486,223 +490,134 @@ TORCH_LIBRARY_IMPL(_, AutocastCPU, m) { TORCH_LIBRARY_IMPL(aten, AutocastCPU, m) { // lower_precision_fp cast policy - KERNEL_CPU(ADD_NS(conv1d), "conv1d", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, 
IntArrayRef, IntArrayRef, int64_t), lower_precision_fp) - KERNEL_CPU(ADD_NS(conv1d), "conv1d.padding", Tensor (const Tensor&, const Tensor&, const c10::optional&, IntArrayRef, c10::string_view, IntArrayRef, int64_t groups), lower_precision_fp) - KERNEL_CPU(ADD_NS(conv2d), "conv2d", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, int64_t), lower_precision_fp) - KERNEL_CPU(ADD_NS(conv2d), "conv2d.padding", Tensor (const Tensor&, const Tensor&, const c10::optional&, IntArrayRef, c10::string_view, IntArrayRef, int64_t groups), lower_precision_fp) - KERNEL_CPU(ADD_NS(conv3d), "conv3d", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, int64_t), lower_precision_fp) - KERNEL_CPU(ADD_NS(conv3d), "conv3d.padding", Tensor (const Tensor&, const Tensor&, const c10::optional&, IntArrayRef, c10::string_view, IntArrayRef, int64_t groups), lower_precision_fp) - KERNEL_CPU(ADD_NS(bmm), "bmm", Tensor (const Tensor &, const Tensor &), lower_precision_fp) - KERNEL_CPU(ADD_NS(mm), "mm", Tensor (const Tensor &, const Tensor &), lower_precision_fp) - KERNEL_CPU(ADD_NS(baddbmm), "baddbmm", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&, const Scalar&), lower_precision_fp) - KERNEL_CPU(ADD_NS(addmm), "addmm", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&, const Scalar&), lower_precision_fp) - KERNEL_CPU(ADD_NS(addbmm), "addbmm", Tensor (const Tensor &, const Tensor &, const Tensor &, const Scalar&, const Scalar&), lower_precision_fp) - KERNEL_CPU(ADD_NS(linear), "linear", Tensor (const Tensor &, const Tensor &, const c10::optional &), lower_precision_fp) - KERNEL_CPU(ADD_NS(_convolution), "_convolution.deprecated", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, bool, IntArrayRef, int64_t, bool, bool, bool), lower_precision_fp) - KERNEL_CPU(ADD_NS(_convolution), "_convolution", Tensor (const Tensor &, const Tensor &, const c10::optional&, IntArrayRef, IntArrayRef, IntArrayRef, bool, IntArrayRef, int64_t, bool, bool, bool, bool), lower_precision_fp) - KERNEL_CPU(ADD_NS(matmul), "matmul", Tensor (const Tensor &, const Tensor &), lower_precision_fp) - KERNEL_CPU(ADD_NS(conv_tbc), "conv_tbc", Tensor(const Tensor &, const Tensor &, const Tensor &, int64_t), lower_precision_fp) + KERNEL_CPU(conv1d, lower_precision_fp) + KERNEL_CPU2(conv1d, padding, lower_precision_fp) + KERNEL_CPU(conv2d, lower_precision_fp) + KERNEL_CPU2(conv2d, padding, lower_precision_fp) + KERNEL_CPU(conv3d, lower_precision_fp) + KERNEL_CPU2(conv3d, padding, lower_precision_fp) + KERNEL_CPU(bmm, lower_precision_fp) + KERNEL_CPU(mm, lower_precision_fp) + KERNEL_CPU(baddbmm, lower_precision_fp) + KERNEL_CPU(addmm, lower_precision_fp) + KERNEL_CPU(addbmm, lower_precision_fp) + KERNEL_CPU(linear, lower_precision_fp) + KERNEL_CPU2(_convolution, deprecated, lower_precision_fp) + KERNEL_CPU(matmul, lower_precision_fp) + KERNEL_CPU(conv_tbc, lower_precision_fp) // fp32 cast policy - KERNEL_CPU(ADD_NS(conv_transpose1d), "conv_transpose1d", Tensor (const Tensor &, const Tensor &, const c10::optional &, IntArrayRef, IntArrayRef, IntArrayRef, int64_t, IntArrayRef), fp32) - KERNEL_CPU(ADD_NS(conv_transpose2d), "conv_transpose2d.input", Tensor (const Tensor &, const Tensor &, const c10::optional &, IntArrayRef, IntArrayRef, IntArrayRef, int64_t, IntArrayRef), fp32) - KERNEL_CPU(ADD_NS(conv_transpose3d), "conv_transpose3d.input", Tensor (const Tensor &, const 
Tensor &, const c10::optional &, IntArrayRef, IntArrayRef, IntArrayRef, int64_t, IntArrayRef), fp32) - KERNEL_CPU(ADD_NS(avg_pool3d), "avg_pool3d", Tensor (const Tensor &, IntArrayRef, IntArrayRef, IntArrayRef, bool, bool, c10::optional), fp32) - KERNEL_CPU(ADD_NS(binary_cross_entropy), "binary_cross_entropy", Tensor (const Tensor &, const Tensor &, const c10::optional&, int64_t), fp32) - KERNEL_CPU(ADD_NS(grid_sampler), "grid_sampler", Tensor(const Tensor &, const Tensor &, int64_t, int64_t, bool), fp32) - KERNEL_CPU(ADD_NS(polar), "polar", Tensor(const Tensor &, const Tensor &), fp32) - KERNEL_CPU(ADD_NS(prod), "prod", Tensor(const Tensor &, c10::optional), fp32) - KERNEL_CPU(ADD_NS(prod), "prod.dim_int", Tensor(const Tensor &, int64_t, bool, c10::optional), fp32) - KERNEL_CPU(ADD_NS(prod), "prod.dim_Dimname", Tensor(const Tensor &, at::Dimname, bool, c10::optional), fp32) - KERNEL_CPU(ADD_NS(quantile), "quantile", Tensor(const Tensor &, const Tensor &, c10::optional, bool, c10::string_view), fp32) - KERNEL_CPU(ADD_NS(quantile), "quantile.scalar", Tensor(const Tensor &, double, c10::optional, bool, c10::string_view), fp32) - KERNEL_CPU(ADD_NS(nanquantile), "nanquantile", Tensor(const Tensor &, const Tensor &, c10::optional, bool, c10::string_view), fp32) - KERNEL_CPU(ADD_NS(nanquantile), "nanquantile.scalar", Tensor(const Tensor &, double, c10::optional, bool, c10::string_view), fp32) - KERNEL_CPU(ADD_NS(stft), "stft", Tensor(const Tensor &, int64_t, c10::optional, c10::optional, const c10::optional &, bool, c10::optional, c10::optional), fp32) - KERNEL_CPU(ADD_NS(stft), "stft.center", Tensor(const Tensor &, int64_t, c10::optional, c10::optional, const c10::optional &, bool, c10::string_view, bool, c10::optional, c10::optional), fp32) - KERNEL_CPU(ADD_NS(cdist), "cdist", Tensor(const Tensor &, const Tensor &, double, c10::optional), fp32) - KERNEL_CPU(ADD_NS(grid_sampler_2d), "grid_sampler_2d", Tensor(const Tensor &, const Tensor &, int64_t, int64_t, bool), fp32) - KERNEL_CPU(ADD_NS(_grid_sampler_2d_cpu_fallback), "_grid_sampler_2d_cpu_fallback", Tensor(const Tensor &, const Tensor &, int64_t, int64_t, bool), fp32) - KERNEL_CPU(ADD_NS(grid_sampler_3d), "grid_sampler_3d", Tensor(const Tensor &, const Tensor &, int64_t, int64_t, bool), fp32) - KERNEL_CPU(ADD_NS(trace), "trace", Tensor(const Tensor &), fp32) - KERNEL_CPU(ADD_NS(view_as_complex), "view_as_complex", Tensor(const Tensor &), fp32) - KERNEL_CPU(ADD_NS(cholesky), "cholesky", Tensor(const Tensor &, bool), fp32) - KERNEL_CPU(ADD_NS(cholesky_inverse), "cholesky_inverse", Tensor(const Tensor &, bool), fp32) - KERNEL_CPU(ADD_NS(cholesky_solve), "cholesky_solve", Tensor(const Tensor &, const Tensor &, bool), fp32) - KERNEL_CPU(ADD_NS(inverse), "inverse", Tensor(const Tensor &), fp32) - KERNEL_CPU(ADD_NS(lu_solve), "lu_solve", Tensor(const Tensor &, const Tensor &, const Tensor &), fp32) - KERNEL_CPU(ADD_NS(matrix_rank), "matrix_rank", Tensor(const Tensor &, bool), fp32) - KERNEL_CPU(ADD_NS(orgqr), "orgqr", Tensor(const Tensor &, const Tensor &), fp32) - KERNEL_CPU(ADD_NS(ormqr), "ormqr", Tensor(const Tensor &, const Tensor &, const Tensor &, bool, bool), fp32) - KERNEL_CPU(ADD_NS(pinverse), "pinverse", Tensor(const Tensor &, double), fp32) - KERNEL_CPU(ADD_NS(max_pool3d), "max_pool3d", Tensor(const Tensor &, IntArrayRef, IntArrayRef, IntArrayRef, IntArrayRef, bool), fp32) - KERNEL_CPU(ADD_NS(max_unpool2d), "max_unpool2d", Tensor(const Tensor &, const Tensor &, IntArrayRef), fp32) - KERNEL_CPU(ADD_NS(max_unpool3d), "max_unpool3d", 
Tensor(const Tensor &, const Tensor &, IntArrayRef, IntArrayRef, IntArrayRef), fp32) - KERNEL_CPU(ADD_NS(adaptive_avg_pool3d), "adaptive_avg_pool3d", Tensor(const Tensor &, IntArrayRef), fp32) - KERNEL_CPU(ADD_NS(reflection_pad1d), "reflection_pad1d", Tensor(const Tensor &, IntArrayRef), fp32) - KERNEL_CPU(ADD_NS(reflection_pad2d), "reflection_pad2d", Tensor(const Tensor &, IntArrayRef), fp32) - KERNEL_CPU(ADD_NS(replication_pad1d), "replication_pad1d", Tensor(const Tensor &, IntArrayRef), fp32) - KERNEL_CPU(ADD_NS(replication_pad2d), "replication_pad2d", Tensor(const Tensor &, IntArrayRef), fp32) - KERNEL_CPU(ADD_NS(replication_pad3d), "replication_pad3d", Tensor(const Tensor &, IntArrayRef), fp32) - KERNEL_CPU(ADD_NS(mse_loss), "mse_loss", Tensor(const Tensor &, const Tensor &, int64_t), fp32) - KERNEL_CPU(ADD_NS(ctc_loss), "ctc_loss.IntList", Tensor(const Tensor &, const Tensor &, IntArrayRef, IntArrayRef, int64_t, int64_t, bool), fp32) - KERNEL_CPU(ADD_NS(ctc_loss), "ctc_loss.Tensor", Tensor(const Tensor &, const Tensor &, const Tensor &, const Tensor &, int64_t, int64_t, bool), fp32) - KERNEL_CPU(ADD_NS(kl_div), "kl_div", Tensor(const Tensor &, const Tensor &, int64_t, bool), fp32) - KERNEL_CPU(ADD_NS(multilabel_margin_loss), "multilabel_margin_loss", Tensor(const Tensor &, const Tensor &, int64_t), fp32) - KERNEL_CPU(ADD_NS(fft_fft), "fft_fft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_ifft), "fft_ifft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_fft2), "fft_fft2", Tensor(const Tensor &, at::OptionalIntArrayRef, at::IntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_ifft2), "fft_ifft2", Tensor(const Tensor &, at::OptionalIntArrayRef, at::IntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_fftn), "fft_fftn", Tensor(const Tensor &, at::OptionalIntArrayRef, at::OptionalIntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_ifftn), "fft_ifftn", Tensor(const Tensor &, at::OptionalIntArrayRef, at::OptionalIntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_rfft), "fft_rfft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_irfft), "fft_irfft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_rfft2), "fft_rfft2", Tensor(const Tensor &, at::OptionalIntArrayRef, at::IntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_irfft2), "fft_irfft2", Tensor(const Tensor &, at::OptionalIntArrayRef, at::IntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_rfftn), "fft_rfftn", Tensor(const Tensor &, at::OptionalIntArrayRef, at::OptionalIntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_irfftn), "fft_irfftn", Tensor(const Tensor &, at::OptionalIntArrayRef, at::OptionalIntArrayRef, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_hfft), "fft_hfft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) - KERNEL_CPU(ADD_NS(fft_ihfft), "fft_ihfft", Tensor(const Tensor &, c10::optional, int64_t, c10::optional), fp32) - KERNEL_CPU(ADD_NS(linalg_matrix_norm), "linalg_matrix_norm", Tensor(const Tensor &, const at::Scalar &, at::IntArrayRef, bool, c10::optional), fp32) - KERNEL_CPU(ADD_NS(linalg_matrix_norm), "linalg_matrix_norm.str_ord", Tensor(const Tensor &, c10::string_view, at::IntArrayRef, bool, c10::optional), fp32) - KERNEL_CPU(ADD_NS(linalg_cond), "linalg_cond", Tensor(const Tensor &, const c10::optional &), fp32) - KERNEL_CPU(ADD_NS(linalg_cond), 
"linalg_cond.p_str", Tensor(const Tensor &, c10::string_view), fp32) - KERNEL_CPU(ADD_NS(linalg_matrix_rank), "linalg_matrix_rank", Tensor(const Tensor &, double, bool), fp32) - KERNEL_CPU(ADD_NS(linalg_matrix_rank), "linalg_matrix_rank.tol_tensor", Tensor(const Tensor &, const Tensor &, bool), fp32) - KERNEL_CPU(ADD_NS(linalg_matrix_rank), "linalg_matrix_rank.atol_rtol_tensor", Tensor(const Tensor &, const c10::optional &, const c10::optional &, bool), fp32) - KERNEL_CPU(ADD_NS(linalg_matrix_rank), "linalg_matrix_rank.atol_rtol_float", Tensor(const Tensor &, c10::optional, c10::optional, bool), fp32) - KERNEL_CPU(ADD_NS(linalg_solve), "linalg_solve", Tensor(const Tensor &, const Tensor &, bool), fp32) - KERNEL_CPU(ADD_NS(linalg_cholesky), "linalg_cholesky", Tensor(const Tensor &, bool), fp32) - KERNEL_CPU(ADD_NS(linalg_svdvals), "linalg_svdvals", Tensor(const Tensor &, c10::optional), fp32) - KERNEL_CPU(ADD_NS(linalg_eigvals), "linalg_eigvals", Tensor(const Tensor &), fp32) - KERNEL_CPU(ADD_NS(linalg_eigvalsh), "linalg_eigvalsh", Tensor(const Tensor &, c10::string_view), fp32) - KERNEL_CPU(ADD_NS(linalg_inv), "linalg_inv", Tensor(const Tensor &), fp32) - KERNEL_CPU(ADD_NS(linalg_householder_product), "linalg_householder_product", Tensor(const Tensor &, const Tensor &), fp32) - KERNEL_CPU(ADD_NS(linalg_tensorinv), "linalg_tensorinv", Tensor(const Tensor &, int64_t), fp32) - KERNEL_CPU(ADD_NS(linalg_tensorsolve), "linalg_tensorsolve", Tensor(const Tensor &, const Tensor &, at::OptionalIntArrayRef), fp32) - KERNEL_CPU(ADD_NS(fake_quantize_per_tensor_affine), "fake_quantize_per_tensor_affine", Tensor (const Tensor &, double, int64_t, int64_t, int64_t), fp32) - - m.impl(TORCH_SELECTIVE_NAME("aten::eig"), - TORCH_FN((&WrapFunction (const Tensor &, bool), - std::tuple (const Tensor &, bool), - &ADD_NS(eig)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::geqrf"), - TORCH_FN((&WrapFunction (const Tensor &), - std::tuple (const Tensor &), - &ADD_NS(geqrf)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::lstsq"), - TORCH_FN((&WrapFunction (const Tensor &, const Tensor &), - std::tuple (const Tensor &, const Tensor &), - &ADD_NS(lstsq)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::_lu_with_info"), - TORCH_FN((&WrapFunction (const Tensor &, bool, bool), - std::tuple (const Tensor &, bool, bool), - &ADD_NS(_lu_with_info)>::type::call))); - - - m.impl(TORCH_SELECTIVE_NAME("aten::qr"), - TORCH_FN((&WrapFunction (const Tensor &, bool), - std::tuple (const Tensor &, bool), - &ADD_NS(qr)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::svd"), - TORCH_FN((&WrapFunction (const Tensor &, bool, bool), - std::tuple (const Tensor &, bool, bool), - &ADD_NS(svd)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::symeig"), - TORCH_FN((&WrapFunction (const Tensor &, bool, bool), - std::tuple (const Tensor &, bool, bool), - &ADD_NS(symeig)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::triangular_solve"), - TORCH_FN((&WrapFunction (const Tensor &, const Tensor &, bool, bool, bool), - std::tuple (const Tensor &, const Tensor &, bool, bool, bool), - &ADD_NS(triangular_solve)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::fractional_max_pool2d"), - TORCH_FN((&WrapFunction (const Tensor &, IntArrayRef, IntArrayRef, const Tensor &), - std::tuple (const Tensor &, IntArrayRef, IntArrayRef, const Tensor &), - &ADD_NS(fractional_max_pool2d)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::fractional_max_pool3d"), - TORCH_FN((&WrapFunction (const Tensor &, 
IntArrayRef, IntArrayRef, const Tensor &), - std::tuple (const Tensor &, IntArrayRef, IntArrayRef, const Tensor &), - &ADD_NS(fractional_max_pool3d)>::type::call))); - - - m.impl(TORCH_SELECTIVE_NAME("aten::adaptive_max_pool3d"), - TORCH_FN((&WrapFunction (const Tensor &, IntArrayRef), - std::tuple (const Tensor &, IntArrayRef), - &ADD_NS(adaptive_max_pool3d)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::multilabel_margin_loss_forward"), - TORCH_FN((&WrapFunction (const Tensor &, const Tensor &, int64_t), - std::tuple (const Tensor &, const Tensor &, int64_t), - &ADD_NS(multilabel_margin_loss_forward)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::linalg_qr"), - TORCH_FN((&WrapFunction (const Tensor &, c10::string_view), - std::tuple (const Tensor &, c10::string_view), - &ADD_NS(linalg_qr)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::linalg_cholesky_ex"), - TORCH_FN((&WrapFunction (const Tensor &, bool, bool), - std::tuple (const Tensor &, bool, bool), - &ADD_NS(linalg_cholesky_ex)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::linalg_svd"), - TORCH_FN((&WrapFunction (const Tensor &, bool, c10::optional), - std::tuple (const Tensor &, bool, c10::optional), - &ADD_NS(linalg_svd)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::linalg_eig"), - TORCH_FN((&WrapFunction (const Tensor &), - std::tuple (const Tensor &), - &ADD_NS(linalg_eig)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::linalg_eigh"), - TORCH_FN((&WrapFunction (const Tensor &, c10::string_view), - std::tuple (const Tensor &, c10::string_view), - &ADD_NS(linalg_eigh)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::linalg_lstsq"), - TORCH_FN((&WrapFunction (const Tensor &, const Tensor &, c10::optional, c10::optional), - std::tuple (const Tensor &, const Tensor &, c10::optional, c10::optional), - &ADD_NS(linalg_lstsq)>::type::call))); - - m.impl(TORCH_SELECTIVE_NAME("aten::linalg_inv_ex"), - TORCH_FN((&WrapFunction (const Tensor &, bool), - std::tuple (const Tensor &, bool), - &ADD_NS(linalg_inv_ex)>::type::call))); + KERNEL_CPU(conv_transpose1d, fp32) + KERNEL_CPU2(conv_transpose2d, input, fp32) + KERNEL_CPU2(conv_transpose3d, input, fp32) + KERNEL_CPU(avg_pool3d, fp32) + KERNEL_CPU(binary_cross_entropy, fp32) + KERNEL_CPU(grid_sampler, fp32) + KERNEL_CPU(polar, fp32) + KERNEL_CPU(prod, fp32) + KERNEL_CPU2(prod, dim_int, fp32) + KERNEL_CPU2(prod, dim_Dimname, fp32) + KERNEL_CPU(quantile, fp32) + KERNEL_CPU2(quantile, scalar, fp32) + KERNEL_CPU(nanquantile, fp32) + KERNEL_CPU2(nanquantile, scalar, fp32) + KERNEL_CPU(stft, fp32) + KERNEL_CPU2(stft, center, fp32) + KERNEL_CPU(cdist, fp32) + KERNEL_CPU(grid_sampler_2d, fp32) + KERNEL_CPU(_grid_sampler_2d_cpu_fallback, fp32) + KERNEL_CPU(grid_sampler_3d, fp32) + KERNEL_CPU(trace, fp32) + KERNEL_CPU(view_as_complex, fp32) + KERNEL_CPU(cholesky, fp32) + KERNEL_CPU(cholesky_inverse, fp32) + KERNEL_CPU(cholesky_solve, fp32) + KERNEL_CPU(inverse, fp32) + KERNEL_CPU(lu_solve, fp32) + KERNEL_CPU(orgqr, fp32) + KERNEL_CPU(ormqr, fp32) + KERNEL_CPU(pinverse, fp32) + KERNEL_CPU(max_pool3d, fp32) + KERNEL_CPU(max_unpool2d, fp32) + KERNEL_CPU(max_unpool3d, fp32) + KERNEL_CPU(adaptive_avg_pool3d, fp32) + KERNEL_CPU(reflection_pad1d, fp32) + KERNEL_CPU(reflection_pad2d, fp32) + KERNEL_CPU(replication_pad1d, fp32) + KERNEL_CPU(replication_pad2d, fp32) + KERNEL_CPU(replication_pad3d, fp32) + KERNEL_CPU(mse_loss, fp32) + KERNEL_CPU(cosine_embedding_loss, fp32) + KERNEL_CPU(nll_loss, fp32) + KERNEL_CPU(nll_loss2d, fp32) + 
KERNEL_CPU(hinge_embedding_loss, fp32) + KERNEL_CPU(poisson_nll_loss, fp32) + KERNEL_CPU(smooth_l1_loss, fp32) + KERNEL_CPU(cross_entropy_loss, fp32) + KERNEL_CPU(l1_loss, fp32) + KERNEL_CPU(huber_loss, fp32) + KERNEL_CPU(margin_ranking_loss, fp32) + KERNEL_CPU(soft_margin_loss, fp32) + KERNEL_CPU(triplet_margin_loss, fp32) + KERNEL_CPU(multi_margin_loss, fp32) + KERNEL_CPU2(ctc_loss, IntList, fp32) + KERNEL_CPU2(ctc_loss, Tensor, fp32) + KERNEL_CPU(kl_div, fp32) + KERNEL_CPU(multilabel_margin_loss, fp32) + KERNEL_CPU(binary_cross_entropy_with_logits, fp32) + KERNEL_CPU(fft_fft, fp32) + KERNEL_CPU(fft_ifft, fp32) + KERNEL_CPU(fft_fft2, fp32) + KERNEL_CPU(fft_ifft2, fp32) + KERNEL_CPU(fft_fftn, fp32) + KERNEL_CPU(fft_ifftn, fp32) + KERNEL_CPU(fft_rfft, fp32) + KERNEL_CPU(fft_irfft, fp32) + KERNEL_CPU(fft_rfft2, fp32) + KERNEL_CPU(fft_irfft2, fp32) + KERNEL_CPU(fft_rfftn, fp32) + KERNEL_CPU(fft_irfftn, fp32) + KERNEL_CPU(fft_hfft, fp32) + KERNEL_CPU(fft_ihfft, fp32) + KERNEL_CPU(linalg_cond, fp32) + KERNEL_CPU2(linalg_cond, p_str, fp32) + KERNEL_CPU(linalg_matrix_rank, fp32) + KERNEL_CPU2(linalg_matrix_rank, tol_tensor, fp32) + KERNEL_CPU2(linalg_matrix_rank, atol_rtol_tensor, fp32) + KERNEL_CPU2(linalg_matrix_rank, atol_rtol_float, fp32) + KERNEL_CPU(linalg_solve, fp32) + KERNEL_CPU(linalg_cholesky, fp32) + KERNEL_CPU(linalg_svdvals, fp32) + KERNEL_CPU(linalg_eigvals, fp32) + KERNEL_CPU(linalg_eigvalsh, fp32) + KERNEL_CPU(linalg_inv, fp32) + KERNEL_CPU(linalg_householder_product, fp32) + KERNEL_CPU(linalg_tensorinv, fp32) + KERNEL_CPU(linalg_tensorsolve, fp32) + KERNEL_CPU(fake_quantize_per_tensor_affine, fp32) + KERNEL_CPU(geqrf, fp32) + KERNEL_CPU(_lu_with_info, fp32) + KERNEL_CPU(qr, fp32) + KERNEL_CPU(svd, fp32) + KERNEL_CPU(symeig, fp32) + KERNEL_CPU(triangular_solve, fp32) + KERNEL_CPU(fractional_max_pool2d, fp32) + KERNEL_CPU(fractional_max_pool3d, fp32) + KERNEL_CPU(adaptive_max_pool3d, fp32) + KERNEL_CPU(multilabel_margin_loss_forward, fp32) + KERNEL_CPU(linalg_qr, fp32) + KERNEL_CPU(linalg_cholesky_ex, fp32) + KERNEL_CPU(linalg_svd, fp32) + KERNEL_CPU(linalg_eig, fp32) + KERNEL_CPU(linalg_eigh, fp32) + KERNEL_CPU(linalg_lstsq, fp32) + KERNEL_CPU(linalg_inv_ex, fp32) // promote - KERNEL_CPU(ADD_NS(cat), "cat", Tensor (TensorList, int64_t), promote) - KERNEL_CPU(ADD_NS(stack), "stack", Tensor (TensorList, int64_t), promote) - KERNEL_CPU(ADD_NS(index_copy), "index_copy", Tensor (const Tensor &, int64_t, const Tensor &, const Tensor &), promote) - KERNEL_CPU(ADD_NS(index_copy), "index_copy.dimname", Tensor (const Tensor &, at::Dimname, const Tensor &, const Tensor &), promote) + KERNEL_CPU(stack, promote) + KERNEL_CPU(cat, promote) + KERNEL_CPU(index_copy, promote) + KERNEL_CPU2(index_copy, dimname, promote) } } // namespace diff --git a/aten/src/ATen/autocast_mode.h b/aten/src/ATen/autocast_mode.h index f5e88a0b88f1..3d57ac923116 100644 --- a/aten/src/ATen/autocast_mode.h +++ b/aten/src/ATen/autocast_mode.h @@ -20,6 +20,10 @@ TORCH_API bool is_xpu_enabled(); TORCH_API void set_xpu_enabled(bool enabled); TORCH_API at::ScalarType get_autocast_xpu_dtype(); TORCH_API void set_autocast_xpu_dtype(at::ScalarType dtype); +TORCH_API bool is_hpu_enabled(); +TORCH_API void set_hpu_enabled(bool enabled); +TORCH_API at::ScalarType get_autocast_hpu_dtype(); +TORCH_API void set_autocast_hpu_dtype(at::ScalarType dtype); TORCH_API bool is_autocast_cache_enabled(); TORCH_API void set_autocast_cache_enabled(bool enabled); @@ -34,6 +38,8 @@ bool is_autocast_eligible(const Tensor& tensor, DeviceType 
device_type) { tensor.is_floating_point(); case DeviceType::XPU: return tensor.is_xpu() && tensor.is_floating_point(); + case DeviceType::HPU: + return tensor.is_hpu() && tensor.is_floating_point(); default: return false; } @@ -49,6 +55,8 @@ inline DispatchKey get_autocast_dispatch_key_from_device_type( return DispatchKey::AutocastCPU; case DeviceType::XPU: return DispatchKey::AutocastXPU; + case DeviceType::HPU: + return DispatchKey::AutocastHPU; default: throw std::runtime_error( "unknown device type for autocast in get_autocast_dispatch_key_from_device_type"); @@ -64,6 +72,8 @@ inline at::ScalarType get_lower_precision_fp_from_device_type( return get_autocast_cpu_dtype(); case DeviceType::XPU: return get_autocast_xpu_dtype(); + case DeviceType::HPU: + return get_autocast_hpu_dtype(); default: throw std::runtime_error( "unknown device type for autocast in get_lower_precision_fp_from_device_type"); @@ -116,6 +126,16 @@ inline at::ScalarType prioritize( return current; } +inline at::ScalarType prioritize( + at::ScalarType current, + const ITensorListRef& list, + DeviceType device_type = DeviceType::CUDA) { + for (const auto& tensor : list) { + current = prioritize(current, tensor, device_type); + } + return current; +} + // Template to catch non-Tensor args (no-op that returns current best guess) template inline at::ScalarType prioritize( @@ -186,6 +206,18 @@ inline std::vector cached_cast( return vec; } +inline std::vector cached_cast( + at::ScalarType to_type, + const ITensorListRef& arg, + DeviceType device_type = DeviceType::CUDA) { + std::vector vec; + vec.reserve(arg.size()); + for (const auto& t : arg) { + vec.push_back(cached_cast(to_type, t, device_type)); + } + return vec; +} + // Template to catch non-Tensor args. template inline T cached_cast( diff --git a/aten/src/ATen/core/ATen_fwd.h b/aten/src/ATen/core/ATen_fwd.h index f6676a0c4ff1..63d576797251 100644 --- a/aten/src/ATen/core/ATen_fwd.h +++ b/aten/src/ATen/core/ATen_fwd.h @@ -35,6 +35,7 @@ using IOptTensorListRef = c10::IListRef; using DimnameList = c10::ArrayRef; using IntArrayRef = c10::ArrayRef; using OptionalIntArrayRef = c10::OptionalArrayRef; +using OptionalSymIntArrayRef = c10::OptionalArrayRef; using c10::Stream; using c10::Storage; diff --git a/aten/src/ATen/core/Formatting.cpp b/aten/src/ATen/core/Formatting.cpp index 832059ed1980..4537adff5aa4 100644 --- a/aten/src/ATen/core/Formatting.cpp +++ b/aten/src/ATen/core/Formatting.cpp @@ -13,7 +13,7 @@ std::ostream& operator<<(std::ostream & out, Backend b) { return out << toString(b); } -std::ostream& operator<<(std::ostream & out, Scalar s) { +std::ostream& operator<<(std::ostream & out, const Scalar& s) { if (s.isFloatingPoint()) { return out << s.toDouble(); } @@ -23,13 +23,19 @@ std::ostream& operator<<(std::ostream & out, Scalar s) { if (s.isBoolean()) { return out << (s.toBool() ? 
"true" : "false"); } + if (s.isSymInt()) { + return out << (s.toSymInt()); + } + if (s.isSymFloat()) { + return out << (s.toSymFloat()); + } if (s.isIntegral(false)) { return out << s.toLong(); } throw std::logic_error("Unknown type in Scalar"); } -std::string toString(Scalar s) { +std::string toString(const Scalar& s) { std::stringstream out; out << s; return out.str(); @@ -83,14 +89,9 @@ static std::tuple __printFormat(std::ostream& stream, const Ten break; } } - // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - double expMin; - // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - double expMax; - if(offset == size) { - expMin = 1; - expMax = 1; - } else { + double expMin = 1; + double expMax = 1; + if(offset != size) { expMin = fabs(self_p[offset]); expMax = fabs(self_p[offset]); for (const auto i : c10::irange(offset, size)) { @@ -116,8 +117,7 @@ static std::tuple __printFormat(std::ostream& stream, const Ten } } double scale = 1; - // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - int64_t sz; + int64_t sz = 11; if(intMode) { if(expMax > 9) { sz = 11; @@ -153,8 +153,7 @@ static std::tuple __printFormat(std::ostream& stream, const Ten static void __printIndent(std::ostream &stream, int64_t indent) { - for (const auto i : c10::irange(indent)) { - (void)i; //Suppress unused variable warning + for (C10_UNUSED const auto i : c10::irange(indent)) { stream << " "; } } @@ -165,10 +164,8 @@ static void printScale(std::ostream & stream, double scale) { } static void __printMatrix(std::ostream& stream, const Tensor& self, int64_t linesize, int64_t indent) { - // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - double scale; - // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - int64_t sz; + double scale = 0.0; + int64_t sz = 0; std::tie(scale, sz) = __printFormat(stream, self); __printIndent(stream, indent); @@ -277,6 +274,9 @@ std::ostream& print(std::ostream& stream, const Tensor & tensor_, int64_t linesi } else if (tensor_.is_mkldnn()) { stream << "MKLDNN Tensor: "; tensor = tensor_.to_dense().to(kCPU, kDouble).contiguous(); + } else if (tensor_.is_mps()) { + // MPS does not support double tensors, so first copy then convert + tensor = tensor_.to(kCPU).to(kDouble).contiguous(); } else { tensor = tensor_.to(kCPU, kDouble).contiguous(); } @@ -285,10 +285,8 @@ std::ostream& print(std::ostream& stream, const Tensor & tensor_, int64_t linesi stream << "[ " << tensor_.toString() << "{}"; } else if(tensor.ndimension() == 1) { if (tensor.numel() > 0) { - // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - double scale; - // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - int64_t sz; + double scale = 0.0; + int64_t sz = 0; std::tie(scale, sz) = __printFormat(stream, tensor); if(scale != 1) { printScale(stream, scale); diff --git a/aten/src/ATen/core/Formatting.h b/aten/src/ATen/core/Formatting.h index 6dcfc6c7b3cd..9dcd14e1902e 100644 --- a/aten/src/ATen/core/Formatting.h +++ b/aten/src/ATen/core/Formatting.h @@ -8,8 +8,8 @@ namespace c10 { TORCH_API std::ostream& operator<<(std::ostream& out, Backend b); -TORCH_API std::ostream& operator<<(std::ostream & out, Scalar s); -TORCH_API std::string toString(Scalar s); +TORCH_API std::ostream& operator<<(std::ostream & out, const Scalar& s); +TORCH_API std::string toString(const Scalar& s); } namespace at { diff --git a/aten/src/ATen/core/IListRef.h b/aten/src/ATen/core/IListRef.h index 35ac34b22020..0b0ff67b02e2 100644 --- a/aten/src/ATen/core/IListRef.h +++ b/aten/src/ATen/core/IListRef.h @@ -313,7 +313,10 @@ using 
_MaterializedIListRefElem = typename std::conditional< T>::type; template -using MaterializedIListRef = std::vector<_MaterializedIListRefElem>>; +using MaterializedIListRefElem = _MaterializedIListRefElem>; + +template +using MaterializedIListRef = std::vector>; } // namespace detail @@ -388,6 +391,9 @@ class IListRefIterator : public std::iterator; using const_iterator = IListRefIterator; + using reverse_iterator = std::reverse_iterator; using value_type = typename iterator::value_type; IListRef() : tag_(IListRefTag::None) {} diff --git a/aten/src/ATen/core/IListRef_inl.h b/aten/src/ATen/core/IListRef_inl.h index a14bcfddae2d..534272f69b64 100644 --- a/aten/src/ATen/core/IListRef_inl.h +++ b/aten/src/ATen/core/IListRef_inl.h @@ -93,9 +93,9 @@ class IListRefTagImplBase { * implementation for `IListRefTag::Materialized`. */ template -class IListRefTagImplBase> { +class IListRefTagImplBase> { public: - using elem_type = _MaterializedIListRefElem; + using elem_type = MaterializedIListRefElem; using list_type = MaterializedIListRef; static const list_type& unwrap(const IListRef& ilist) { @@ -141,7 +141,7 @@ class IListRefTagImpl : public IListRefTagImplBase< IListRefTag::Materialized, at::Tensor, - _MaterializedIListRefElem> {}; + MaterializedIListRefElem> {}; /* * [Note: IOptTensorListRef] @@ -182,7 +182,7 @@ class IListRefTagImpl : public IListRefTagImplBase< IListRefTag::Materialized, at::OptionalTensorRef, - _MaterializedIListRefElem> {}; + MaterializedIListRefElem> {}; } // namespace detail } // namespace c10 diff --git a/aten/src/ATen/core/IListRef_test.cpp b/aten/src/ATen/core/IListRef_test.cpp index 1a609de74f80..67bd6efebfe4 100644 --- a/aten/src/ATen/core/IListRef_test.cpp +++ b/aten/src/ATen/core/IListRef_test.cpp @@ -77,7 +77,7 @@ TEST(ITensorListRefTest, CtorUnboxedIndirect_IsUnboxed) { }; check_is_unboxed(at::ITensorListRef{vec[0]}); check_is_unboxed(at::ITensorListRef{vec.data(), vec.size()}); - check_is_unboxed(at::ITensorListRef{&*vec.begin(), &*vec.end()}); + check_is_unboxed(at::ITensorListRef{vec.data(), vec.data() + vec.size()}); check_is_unboxed(vec); check_is_unboxed({vec[0], vec[1], vec[2]}); } @@ -137,7 +137,7 @@ TEST(ITensorListRefTest, UnboxedIndirect_Equal) { // Implicit constructors check_elements_same(vec[0], std::vector{vec[0]}, /* use_count= */ 3); check_elements_same({vec.data(), vec.size()}, vec, /* use_count= */ 1); - check_elements_same({&*vec.begin(), &*vec.end()}, vec, /* use_count= */ 1); + check_elements_same({vec.data(), vec.data() + vec.size()}, vec, /* use_count= */ 1); // Vector constructor check_elements_same(vec, vec, /* use_count= */ 1); // InitializerList constructor @@ -165,9 +165,15 @@ TEST(ITensorListRefTest, UnboxedMaterialize_Equal) { } TEST(ITensorListRefIteratorTest, CtorEmpty_ThrowsError) { - at::ITensorListRefIterator it; + at::ITensorListRefIterator* it = new at::ITensorListRefIterator(); // NOLINTNEXTLINE(cppcoreguidelines-avoid-goto,hicpp-avoid-goto) - EXPECT_THROW(*it, c10::Error); + EXPECT_THROW(**it, c10::Error); + +#if defined(_MSC_VER) && _ITERATOR_DEBUG_LEVEL == 2 + EXPECT_THROW({ delete it; }, c10::Error); +#else + delete it; +#endif } TEST(ITensorListRefIteratorTest, Boxed_GetFirstElement) { diff --git a/aten/src/ATen/core/List_test.cpp b/aten/src/ATen/core/List_test.cpp index e16e26b6042e..f37f3c008493 100644 --- a/aten/src/ATen/core/List_test.cpp +++ b/aten/src/ATen/core/List_test.cpp @@ -1118,7 +1118,7 @@ TEST(ListTest, canAccessStringByReference) { List list({"one", "two"}); const auto& listRef = list; 
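The IListRef_test change just above swaps `&*vec.begin(), &*vec.end()` for `vec.data(), vec.data() + vec.size()`. As a side note, a minimal standalone sketch of why the pointer form is preferable (dereferencing the past-the-end iterator is undefined behavior even when only its address is taken); `sum_range` is a hypothetical helper, not part of the patch:

#include <cassert>
#include <vector>

// Hypothetical helper: sums the half-open pointer range [first, last).
static int sum_range(const int* first, const int* last) {
  int s = 0;
  for (const int* p = first; p != last; ++p) s += *p;
  return s;
}

int main() {
  std::vector<int> v{1, 2, 3};
  // Well-defined: data() points at contiguous storage and data() + size() is one past the end.
  assert(sum_range(v.data(), v.data() + v.size()) == 6);
  // Also well-defined for an empty vector, where &*empty.end() would dereference end().
  std::vector<int> empty;
  assert(sum_range(empty.data(), empty.data() + empty.size()) == 0);
  return 0;
}
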
static_assert(std::is_same::value, - "const List acccess should be by const reference"); + "const List access should be by const reference"); std::string str = list[1]; const std::string& strRef = listRef[1]; EXPECT_EQ("two", str); @@ -1130,7 +1130,7 @@ TEST(ListTest, canAccessOptionalStringByReference) { const auto& listRef = list; static_assert( std::is_same>>::value, - "List> acccess should be by const reference"); + "List> access should be by const reference"); c10::optional str1 = list[1]; c10::optional str2 = list[2]; decltype(auto) strRef1 = listRef[1]; diff --git a/aten/src/ATen/core/NamedRegistrations.cpp b/aten/src/ATen/core/NamedRegistrations.cpp index a9ae2f12c4dd..b78a563b673b 100644 --- a/aten/src/ATen/core/NamedRegistrations.cpp +++ b/aten/src/ATen/core/NamedRegistrations.cpp @@ -179,7 +179,6 @@ TORCH_LIBRARY_IMPL(aten, Named, m) { m.impl("exp.out", CppFunction::makeFallthrough()); m.impl("exp_", CppFunction::makeFallthrough()); m.impl("expand", CppFunction::makeFallthrough()); - m.impl("expand.SymInt", CppFunction::makeFallthrough()); m.impl("expm1", CppFunction::makeFallthrough()); m.impl("expm1.out", CppFunction::makeFallthrough()); m.impl("expm1_", CppFunction::makeFallthrough()); @@ -467,7 +466,6 @@ TORCH_LIBRARY_IMPL(aten, Named, m) { m.impl("sum.IntList_out", CppFunction::makeFallthrough()); m.impl("sum.dim_DimnameList", CppFunction::makeFallthrough()); m.impl("sum.dim_IntList", CppFunction::makeFallthrough()); - m.impl("sum.SymInt", CppFunction::makeFallthrough()); m.impl("t", CppFunction::makeFallthrough()); m.impl("tan", CppFunction::makeFallthrough()); m.impl("tan.out", CppFunction::makeFallthrough()); diff --git a/aten/src/ATen/core/PhiloxRNGEngine.h b/aten/src/ATen/core/PhiloxRNGEngine.h index a702de8998d9..c6536d29e798 100644 --- a/aten/src/ATen/core/PhiloxRNGEngine.h +++ b/aten/src/ATen/core/PhiloxRNGEngine.h @@ -213,7 +213,6 @@ class philox_engine { inline detail::FLOAT2 normalize_pair_uniform(detail::FLOAT2 in) { // TODO(voz) We use std:: below, and thus need a separate impl for CUDA. 
float u1 = in[0]; - float u2 = in[1]; constexpr float two_pi = 2.0 * M_PI; diff --git a/aten/src/ATen/core/PythonFallbackKernel.cpp b/aten/src/ATen/core/PythonFallbackKernel.cpp index 37b46ae15a3c..2d8834afe59e 100644 --- a/aten/src/ATen/core/PythonFallbackKernel.cpp +++ b/aten/src/ATen/core/PythonFallbackKernel.cpp @@ -1,4 +1,5 @@ -#include +#include +#include #include #include @@ -51,9 +52,10 @@ void pythonFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { // If Torch Dispatch Mode is active, use its PyInterpreter for dispatch - const auto& maybe_torch_dispatch_mode_state = at::impl::TorchDispatchModeTLS::get_state(); - if (maybe_torch_dispatch_mode_state) { - maybe_torch_dispatch_mode_state->pyinterpreter()->dispatch(op, stack); + const auto mode_stack_len = c10::impl::TorchDispatchModeTLS::stack_len(); + if (mode_stack_len > 0) { + const auto& cur_torch_dispatch_mode_state = c10::impl::TorchDispatchModeTLS::get_stack_at(mode_stack_len - 1); + cur_torch_dispatch_mode_state->pyinterpreter()->dispatch(op, stack); return; } @@ -69,16 +71,19 @@ void pythonFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { if (ivalue.isTensor()) { auto* interpreter = ivalue.unsafeToTensorImpl()->pyobj_interpreter(); if (interpreter) { - interpreter->dispatch(op, stack); + (*interpreter)->dispatch(op, stack); return; } - } else if (ivalue.isTensorList()) { + } else if (ivalue.isTensorList() || ivalue.isOptionalTensorList()) { // NB: use toListRef as it doesn't induce refcount bumps (toTensorListRef // is not a thing) for (const auto& nv : ivalue.toListRef()) { + if (nv.isNone()) { + continue; + } auto* interpreter = nv.unsafeToTensorImpl()->pyobj_interpreter(); if (interpreter) { - interpreter->dispatch(op, stack); + (*interpreter)->dispatch(op, stack); return; } } @@ -87,6 +92,12 @@ void pythonFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { TORCH_INTERNAL_ASSERT(0, "Hit Python dispatch key but no arguments had PyInterpreter (no tensor args?)"); } +void pythonDispatcherFallback(const c10::OperatorHandle& op, c10::DispatchKeySet dispatch_keys, torch::jit::Stack* stack) { + auto* state = c10::impl::PythonDispatcherTLS::get_state(); + TORCH_INTERNAL_ASSERT(state, "Hit PythonDispatcher dispatch key but PythonDispatcherTLS was not set"); + (*state)->python_dispatcher(op, dispatch_keys.remove(c10::DispatchKey::PythonDispatcher), stack); +} + void pythonTLSSnapshotFallback(const c10::OperatorHandle &op, c10::DispatchKeySet dispatch_keys, torch::jit::Stack* stack) { // It is ok for the tls to be already set here. // It means that there are multiple calls into the dispatcher not originating from python code. 
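The pythonFallback change above moves from a single optional torch dispatch mode to a thread-local stack of modes and always forwards to the innermost (most recently pushed) one. A rough standalone sketch of that stack discipline, using stand-in types rather than the real TorchDispatchModeTLS/PyInterpreter machinery:

#include <iostream>
#include <string>
#include <vector>

// Stand-in for the per-thread mode stack; the real code keeps interpreter-backed mode state.
thread_local std::vector<std::string> mode_stack;

static void dispatch(const std::string& op) {
  if (!mode_stack.empty()) {
    // The innermost mode (top of the stack) gets the first chance to handle the op.
    std::cout << mode_stack.back() << " handles " << op << "\n";
    return;
  }
  std::cout << "no active mode; regular dispatch for " << op << "\n";
}

int main() {
  dispatch("aten::add");
  mode_stack.push_back("LoggingMode");
  mode_stack.push_back("FakeTensorMode");
  dispatch("aten::add");  // handled by FakeTensorMode, the innermost mode
  mode_stack.pop_back();
  dispatch("aten::add");  // now handled by LoggingMode
  return 0;
}
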
@@ -134,6 +145,10 @@ TORCH_LIBRARY_IMPL(_, Python, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&pythonFallback>()); } +TORCH_LIBRARY_IMPL(_, PythonDispatcher, m) { + m.fallback(torch::CppFunction::makeFromBoxedFunction<&pythonDispatcherFallback>()); +} + TORCH_LIBRARY_IMPL(_, PythonTLSSnapshot, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&pythonTLSSnapshotFallback>()); } diff --git a/aten/src/ATen/core/PythonFallbackKernel.h b/aten/src/ATen/core/PythonFallbackKernel.h index 94cd4e81291a..f38bdd2ada90 100644 --- a/aten/src/ATen/core/PythonFallbackKernel.h +++ b/aten/src/ATen/core/PythonFallbackKernel.h @@ -1,5 +1,5 @@ #pragma once - +#include namespace at { namespace impl { diff --git a/aten/src/ATen/core/PythonOpRegistrationTrampoline.cpp b/aten/src/ATen/core/PythonOpRegistrationTrampoline.cpp new file mode 100644 index 000000000000..2d9b15a6b03c --- /dev/null +++ b/aten/src/ATen/core/PythonOpRegistrationTrampoline.cpp @@ -0,0 +1,28 @@ +#include + +namespace at { +namespace impl { + +// The strategy is that all python interpreters attempt to register themselves +// as the main interpreter, but only one wins. Only that interpreter is +// allowed to interact with the C++ dispatcher. Furthermore, when we execute +// logic on that interpreter, we do so hermetically, never setting pyobj field +// on Tensor. + +std::atomic PythonOpRegistrationTrampoline::interpreter_{nullptr}; + +bool PythonOpRegistrationTrampoline::registerInterpreter(c10::impl::PyInterpreter* interp) { + c10::impl::PyInterpreter* expected = nullptr; + interpreter_.compare_exchange_strong(expected, interp); + if (expected != nullptr) { + // This is the second (or later) Python interpreter, which means we need + // non-trivial hermetic PyObject TLS + c10::impl::HermeticPyObjectTLS::init_state(); + return false; + } else { + return true; + } +} + +} // namespace impl +} // namespace at diff --git a/aten/src/ATen/core/PythonOpRegistrationTrampoline.h b/aten/src/ATen/core/PythonOpRegistrationTrampoline.h new file mode 100644 index 000000000000..00d3c635859a --- /dev/null +++ b/aten/src/ATen/core/PythonOpRegistrationTrampoline.h @@ -0,0 +1,18 @@ +#include + +// TODO: this can probably live in c10 + +namespace at { +namespace impl { + +class TORCH_API PythonOpRegistrationTrampoline final { + static std::atomic interpreter_; + +public: + // Returns true if you successfully registered yourself (that means + // you are in the hot seat for doing the operator registrations!) + static bool registerInterpreter(c10::impl::PyInterpreter*); +}; + +} // namespace impl +} // namespace at diff --git a/aten/src/ATen/core/TensorAccessor.h b/aten/src/ATen/core/TensorAccessor.h index 9c60f84a16b3..fea6c09f262f 100644 --- a/aten/src/ATen/core/TensorAccessor.h +++ b/aten/src/ATen/core/TensorAccessor.h @@ -160,7 +160,7 @@ class GenericPackedTensorAccessorBase { index_t strides_[N]; C10_HOST void bounds_check_(index_t i) const { TORCH_CHECK_INDEX( - 0 <= i && i < N, + 0 <= i && i < index_t{N}, "Index ", i, " is not within bounds of a tensor of dimension ", diff --git a/aten/src/ATen/core/TensorBase.h b/aten/src/ATen/core/TensorBase.h index 334cbba102a2..0ecd4456033b 100644 --- a/aten/src/ATen/core/TensorBase.h +++ b/aten/src/ATen/core/TensorBase.h @@ -48,6 +48,7 @@ inline bool variable_excluded_from_dispatch() { return c10::impl::tls_local_dispatch_key_set().excluded_.isSupersetOf(c10::autograd_dispatch_keyset); #endif } + } // NOTE: [Tensor vs. 
TensorBase] @@ -161,6 +162,14 @@ class TORCH_API TensorBase { return impl_->sym_size(dim); } + c10::SymInt sym_stride(int64_t dim) const { + const auto sizes = this->sym_strides(); + const auto ndim = static_cast(sizes.size()); + // false is passed to maybe_wrap_dim so behavior is identical to array access (but with wrapping) + return sizes[c10::maybe_wrap_dim(dim, ndim, /*wrap_scalar=*/false)]; + + } + int64_t size(int64_t dim) const { return impl_->size(dim); } @@ -225,6 +234,9 @@ class TORCH_API TensorBase { c10::SymIntArrayRef sym_sizes() const { return impl_->sym_sizes(); } + c10::SymIntArrayRef sym_strides() const { + return impl_->sym_strides(); + } IntArrayRef strides() const { return impl_->strides(); } @@ -282,6 +294,14 @@ class TORCH_API TensorBase { return impl_->numel() * impl_->itemsize(); } + c10::SymInt sym_nbytes() const { + TORCH_CHECK(layout () != at::kSparse, + "nbytes is not defined for sparse tensors. If you want the size of the constituent " \ + "tensors, add the nbytes of the indices and values. If you want the size of the " \ + "equivalent dense tensor, multiply numel() by element_size()"); + return impl_->sym_numel() * impl_->itemsize(); + } + int64_t numel() const { return impl_->numel(); } @@ -290,6 +310,10 @@ class TORCH_API TensorBase { return impl_->sym_numel(); } + c10::SymInt sym_storage_offset() const { + return impl_->sym_storage_offset(); + } + // Length of one array element in bytes. This is the traditional // Numpy naming. size_t itemsize() const { @@ -553,6 +577,10 @@ class TORCH_API TensorBase { template class PtrTraits = DefaultPtrTraits> PackedTensorAccessor32 packed_accessor32() const& { + TORCH_CHECK( + impl_->numel() <= + static_cast(std::numeric_limits::max()), + "numel needs to be smaller than int32_t max; otherwise, please use packed_accessor64"); return generic_packed_accessor(); } template class PtrTraits = DefaultPtrTraits> @@ -914,4 +942,34 @@ inline c10::MaybeOwned TensorBase::expect_contiguous(MemoryFormat me return c10::MaybeOwned::owned(__dispatch_contiguous(memory_format)); } } + +namespace symint { + +template +using enable_if_symint = std::enable_if_t::value>; +template +using enable_if_int = std::enable_if_t::value>; + +template > +c10::SymIntArrayRef sizes(const TensorBase& t) { return t.sym_sizes(); } +template > +IntArrayRef sizes(const TensorBase& t) { return t.sizes(); } + +template > +c10::SymInt size(const TensorBase& t, int64_t dim) { return t.sym_size(dim); } +template > +int64_t size(const TensorBase& t, int64_t dim) { return t.size(dim); } + +template > +c10::SymIntArrayRef strides(const TensorBase& t) { return t.sym_strides(); } +template > +IntArrayRef strides(const TensorBase& t) { return t.strides(); } + +template > +c10::SymInt numel(const TensorBase& t) { return t.sym_numel(); } +template > +int64_t numel(const TensorBase& t) { return t.numel(); } + +} // namespace symint + } // namespace at diff --git a/aten/src/ATen/core/TorchDispatchModeTLS.cpp b/aten/src/ATen/core/TorchDispatchModeTLS.cpp deleted file mode 100644 index d224b08d5b54..000000000000 --- a/aten/src/ATen/core/TorchDispatchModeTLS.cpp +++ /dev/null @@ -1,58 +0,0 @@ -#include -#include -#include - -namespace at { namespace impl { - -thread_local std::shared_ptr torchDispatchModeState; - -void TorchDispatchModeTLS::set_state(std::shared_ptr state) { - if (state) { - c10::impl::tls_set_dispatch_key_included(DispatchKey::Python, true); - c10::impl::tls_set_dispatch_key_included(DispatchKey::PythonTLSSnapshot, true); - } else { - 
TorchDispatchModeTLS::reset_state(); - } - torchDispatchModeState = std::move(state); -} - -const std::shared_ptr& TorchDispatchModeTLS::get_state() { - return torchDispatchModeState; -} - -void TorchDispatchModeTLS::reset_state() { - torchDispatchModeState.reset(); - c10::impl::tls_set_dispatch_key_included(DispatchKey::Python, false); - c10::impl::tls_set_dispatch_key_included(DispatchKey::PythonTLSSnapshot, false); -} - -bool dispatch_mode_enabled() { - return static_cast(at::impl::TorchDispatchModeTLS::get_state()); -} - -bool tensor_has_dispatch(const at::Tensor& t) { - DispatchKeySet key_set({DispatchKey::Python, DispatchKey::PythonTLSSnapshot}); - return t.key_set().has_any(key_set); -} - -bool tensorlist_has_dispatch(const at::TensorList& li) { - for (const auto& t: li) { - if (tensor_has_dispatch(t)) { - return true; - } - } - return false; -} - -bool tensorlist_has_dispatch(const c10::List>& li) { - for (auto i : c10::irange(li.size())) { - auto t = li.get(i); - if (t && tensor_has_dispatch(*t)) { - return true; - } - } - return false; -} - -} // namespace impl -} // namespace at diff --git a/aten/src/ATen/core/TorchDispatchModeTLS.h b/aten/src/ATen/core/TorchDispatchModeTLS.h deleted file mode 100644 index 9ae015e6582f..000000000000 --- a/aten/src/ATen/core/TorchDispatchModeTLS.h +++ /dev/null @@ -1,25 +0,0 @@ -#pragma once - -#include -#include -#include -#include -#include - -namespace at { -namespace impl { - -struct TORCH_API TorchDispatchModeTLS { - static void set_state(std::shared_ptr state); - static const std::shared_ptr& get_state(); - static void reset_state(); -}; - -bool dispatch_mode_enabled(); -bool tensor_has_dispatch(const at::Tensor& t); -bool tensorlist_has_dispatch(const at::TensorList& li); -bool tensorlist_has_dispatch(const c10::List>& li); - - -} // namespace impl -} // namespace at diff --git a/aten/src/ATen/core/TorchDispatchUtils.cpp b/aten/src/ATen/core/TorchDispatchUtils.cpp new file mode 100644 index 000000000000..e2f981c6a833 --- /dev/null +++ b/aten/src/ATen/core/TorchDispatchUtils.cpp @@ -0,0 +1,31 @@ +#include + +namespace at { +namespace impl { + +bool tensor_has_dispatch(const at::Tensor& t) { + DispatchKeySet key_set({DispatchKey::Python, DispatchKey::PythonTLSSnapshot}); + return t.key_set().has_any(key_set); +} + +bool tensorlist_has_dispatch(at::ITensorListRef li) { + for (const auto& t : li) { + if (tensor_has_dispatch(t)) { + return true; + } + } + return false; +} + +bool tensorlist_has_dispatch(const c10::List>& li) { + for (auto i : c10::irange(li.size())) { + auto t = li.get(i); + if (t && tensor_has_dispatch(*t)) { + return true; + } + } + return false; +} + +} // namespace impl +} // namespace at diff --git a/aten/src/ATen/core/TorchDispatchUtils.h b/aten/src/ATen/core/TorchDispatchUtils.h new file mode 100644 index 000000000000..ed7b4181095d --- /dev/null +++ b/aten/src/ATen/core/TorchDispatchUtils.h @@ -0,0 +1,17 @@ +#pragma once + +#include +#include +#include +#include +#include + +namespace at { +namespace impl { + +bool tensor_has_dispatch(const at::Tensor& t); +bool tensorlist_has_dispatch(at::ITensorListRef li); +bool tensorlist_has_dispatch(const c10::List>& li); +using c10::impl::dispatch_mode_enabled; + +}} diff --git a/aten/src/ATen/core/Variadic.h b/aten/src/ATen/core/Variadic.h index d33f3d575177..61b6a35a0b1c 100644 --- a/aten/src/ATen/core/Variadic.h +++ b/aten/src/ATen/core/Variadic.h @@ -48,6 +48,15 @@ struct IterArgs { // you may be able to process these structures more efficiently // than handling them 
one-by-one. + template + void operator()(c10::IListRef args) { + for (const auto& arg : args) { + self()(arg); + if (self().short_circuit()) + return; + } + } + template void operator()(at::ArrayRef args) { for (const auto& arg : args) { diff --git a/aten/src/ATen/core/boxing/KernelFunction.h b/aten/src/ATen/core/boxing/KernelFunction.h index 8ab34e95046a..f1bfc9ec6f27 100644 --- a/aten/src/ATen/core/boxing/KernelFunction.h +++ b/aten/src/ATen/core/boxing/KernelFunction.h @@ -1,5 +1,6 @@ #pragma once +#include #include #include #include @@ -14,6 +15,56 @@ class OperatorHandle; struct OperatorKernel; class KernelFunction; +template +using has_symint = + guts::disjunction< + std::is_same>, + std::is_same>, + std::is_same>, + std::is_same, std::decay_t> + >; + +template +struct remove_symint { + using type = T; +}; + +template <> +struct remove_symint { + using type = int64_t; +}; + +template <> +struct remove_symint { + using type = OptionalIntArrayRef; +}; + +template <> +struct remove_symint { + using type = c10::IntArrayRef; +}; + +template <> +struct remove_symint> { + using type = c10::optional; +}; + + +template +struct maybe_keep_symint final {}; + +template +struct maybe_keep_symint { using type = T; }; + +template +struct maybe_keep_symint { using type = typename remove_symint::type; }; + +template +using fn_has_symint = typename guts::typelist::true_for_any_type< + has_symint, + typename guts::infer_function_traits::type::parameter_types +>; + /** * KernelFunction is similar to std::function but stores a kernel function. * You can create a KernelFunction from a boxed or unboxed function/functor/lambda @@ -31,6 +82,7 @@ class TORCH_API KernelFunction final { // Fast path for dispatch to allow not touching the boxed kernel in // the common case where unboxed is available. 
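A few hunks above, PythonOpRegistrationTrampoline.cpp describes a first-writer-wins scheme: every Python interpreter attempts an atomic compare-exchange, only the first one becomes the main interpreter, and later arrivals switch to hermetic PyObject handling. A minimal sketch of that pattern under stand-in types (Interpreter here is not the real c10::impl::PyInterpreter):

#include <atomic>
#include <cassert>

struct Interpreter {};  // stand-in for c10::impl::PyInterpreter

static std::atomic<Interpreter*> main_interpreter{nullptr};

// Returns true only for the first caller; later callers keep the existing winner.
static bool register_interpreter(Interpreter* interp) {
  Interpreter* expected = nullptr;
  main_interpreter.compare_exchange_strong(expected, interp);
  // On failure, expected is updated to the current winner, so it is no longer nullptr.
  return expected == nullptr;
}

int main() {
  Interpreter a, b;
  assert(register_interpreter(&a));   // first registration wins
  assert(!register_interpreter(&b));  // later registrations are told they lost
  assert(main_interpreter.load() == &a);
  return 0;
}
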
bool isValidUnboxed() const; + bool isValidSymUnboxed() const; bool isValid() const; bool isFallthrough() const; @@ -182,13 +234,16 @@ class TORCH_API KernelFunction final { explicit KernelFunction( std::unique_ptr functor, InternalBoxedKernelFunction* boxed_kernel_func, - void* unboxed_kernel_func); + void* unboxed_kernel_func, + void* sym_unboxed_kernel_func); explicit KernelFunction( BoxedKernel boxed_fn, - void* unboxed_kernel_func); + void* unboxed_kernel_func, + void* sym_unboxed_kernel_func); BoxedKernel boxed_kernel_func_; void* unboxed_kernel_func_; + void* sym_unboxed_kernel_func_; }; } diff --git a/aten/src/ATen/core/boxing/KernelFunction_impl.h b/aten/src/ATen/core/boxing/KernelFunction_impl.h index c33175e4b99a..9637f8fc2043 100644 --- a/aten/src/ATen/core/boxing/KernelFunction_impl.h +++ b/aten/src/ATen/core/boxing/KernelFunction_impl.h @@ -8,22 +8,29 @@ namespace c10 { inline KernelFunction::KernelFunction() : boxed_kernel_func_() , unboxed_kernel_func_(nullptr) + , sym_unboxed_kernel_func_(nullptr) {} -inline KernelFunction::KernelFunction(std::unique_ptr functor, InternalBoxedKernelFunction* boxed_kernel_func, void* unboxed_kernel_func) +inline KernelFunction::KernelFunction(std::unique_ptr functor, InternalBoxedKernelFunction* boxed_kernel_func, void* unboxed_kernel_func, void* sym_unboxed_kernel_func = nullptr) : boxed_kernel_func_(std::move(functor), boxed_kernel_func) , unboxed_kernel_func_(unboxed_kernel_func) + , sym_unboxed_kernel_func_(sym_unboxed_kernel_func) {} -inline KernelFunction::KernelFunction(BoxedKernel boxed_fn, void* unboxed_kernel_func) +inline KernelFunction::KernelFunction(BoxedKernel boxed_fn, void* unboxed_kernel_func, void* sym_unboxed_kernel_func = nullptr) : boxed_kernel_func_(std::move(boxed_fn)) , unboxed_kernel_func_(unboxed_kernel_func) + , sym_unboxed_kernel_func_(sym_unboxed_kernel_func) {} inline bool KernelFunction::isValidUnboxed() const { return unboxed_kernel_func_ != nullptr; } +inline bool KernelFunction::isValidSymUnboxed() const { + return sym_unboxed_kernel_func_ != nullptr; +} + inline bool KernelFunction::isValid() const { return boxed_kernel_func_.isValid(); } @@ -43,16 +50,58 @@ inline Return callUnboxedKernelFunction(void* unboxed_kernel_func, OperatorKerne return (*func)(functor, dispatchKeySet, std::forward(args)...); } +// This template requires you to explicitly specify the argument you want to +// forward; it doesn't work if you try to deduce it +// NB: keep this in sync with cloneWithRealTypes in function_schema.cpp + +template +inline typename remove_symint::type unpackSymInt(T x) { return x; } + +template <> +inline typename remove_symint::type unpackSymInt(c10::SymInt x) { + return x.expect_int(); +} + +template <> +inline typename remove_symint::type unpackSymInt(c10::SymIntArrayRef x) { + return c10::asIntArrayRefSlow(x); +} + +template <> +inline typename remove_symint>::type unpackSymInt(c10::optional x) { + return x.has_value() ? c10::make_optional(x->expect_int()) : c10::nullopt; +} + +template <> +inline typename remove_symint::type unpackSymInt(at::OptionalSymIntArrayRef x) { + return x.has_value() ? c10::make_optional(c10::asIntArrayRefSlow(*x)) : c10::nullopt; +} + template C10_ALWAYS_INLINE Return KernelFunction::call(const OperatorHandle& opHandle, DispatchKeySet dispatchKeySet, Args... args) const { // note: Args above is intentionally not Args&&. We don't want perfect // forwarding, which would require Args to be deduced, but instead we // want callers to explicitly specify the Args. 
- if (C10_LIKELY(unboxed_kernel_func_ != nullptr)) { - auto *functor = boxed_kernel_func_.getFunctor(); - return callUnboxedKernelFunction( - unboxed_kernel_func_, functor, dispatchKeySet, std::forward(args)...); + // This should get inlined by compiler + if (guts::disjunction...>::value) { + if (sym_unboxed_kernel_func_ != nullptr) { + auto *functor = boxed_kernel_func_.getFunctor(); + return callUnboxedKernelFunction( + sym_unboxed_kernel_func_, functor, dispatchKeySet, std::forward(args)...); + } + + if (unboxed_kernel_func_ != nullptr) { + auto *functor = boxed_kernel_func_.getFunctor(); + return callUnboxedKernelFunction::type...>( + unboxed_kernel_func_, functor, dispatchKeySet, unpackSymInt(args)...); + } + } else { + if (C10_LIKELY(unboxed_kernel_func_ != nullptr)) { + auto *functor = boxed_kernel_func_.getFunctor(); + return callUnboxedKernelFunction( + unboxed_kernel_func_, functor, dispatchKeySet, std::forward(args)...); + } } return impl::BoxedKernelWrapper::call( @@ -102,10 +151,14 @@ inline KernelFunction KernelFunction::makeFromUnboxedFunctor(std::unique_ptr::value, "Tried to call KernelFunction::makeFromUnboxedFunctor, but the functor doesn't inherit from c10::OperatorKernel. Please have the functor inherit from it."); + auto* unboxed_fn = &impl::wrap_kernel_functor_unboxed::call; + void* void_unboxed_fn = reinterpret_cast(unboxed_fn); + bool is_symint = fn_has_symint::value; return KernelFunction( std::move(kernelFunctor), &impl::make_boxed_from_unboxed_functor::call, - reinterpret_cast(&impl::wrap_kernel_functor_unboxed::call) + is_symint ? nullptr : void_unboxed_fn, + is_symint ? void_unboxed_fn : nullptr ); } diff --git a/aten/src/ATen/core/boxing/impl/kernel_function_legacy_test.cpp b/aten/src/ATen/core/boxing/impl/kernel_function_legacy_test.cpp index 4db6794e50eb..3c87fec710aa 100644 --- a/aten/src/ATen/core/boxing/impl/kernel_function_legacy_test.cpp +++ b/aten/src/ATen/core/boxing/impl/kernel_function_legacy_test.cpp @@ -508,8 +508,8 @@ TEST(OperatorRegistrationTest_LegacyFunctionBasedKernel, givenKernelWithStringLi auto output = std::move(outputs[0]).toList(); EXPECT_EQ(2, output.size()); - EXPECT_EQ("value1", output.get(0).toString()->string()); - EXPECT_EQ("value2", output.get(1).toString()->string()); + EXPECT_EQ("value1", output.get(0).toStringRef()); + EXPECT_EQ("value2", output.get(1).toStringRef()); } int captured_dict_size = 0; @@ -550,7 +550,7 @@ TEST(OperatorRegistrationTest_LegacyFunctionBasedKernel, givenKernelWithDictInpu dict.insert("key2", "value2"); auto outputs = callOp(*op, dict); EXPECT_EQ(1, outputs.size()); - EXPECT_EQ("value2", outputs[0].toString()->string()); + EXPECT_EQ("value2", outputs[0].toStringRef()); } Dict kernelWithDictOutput(Dict input) { @@ -612,7 +612,7 @@ TEST(OperatorRegistrationTest_LegacyFunctionBasedKernel, givenKernelWithUnordere dict.insert("key2", "value2"); auto outputs = callOp(*op, dict); EXPECT_EQ(1, outputs.size()); - EXPECT_EQ("value2", outputs[0].toString()->string()); + EXPECT_EQ("value2", outputs[0].toStringRef()); } std::unordered_map kernelWithUnorderedMapOutput(std::unordered_map input) { @@ -897,7 +897,7 @@ TEST(OperatorRegistrationTest_LegacyFunctionBasedKernel, givenKernelWithOptional EXPECT_EQ(3, outputs.size()); EXPECT_EQ(DispatchKey::CUDA, extractDispatchKey(outputs[0].toTensor())); EXPECT_TRUE(outputs[1].isNone()); - EXPECT_EQ("text", outputs[2].toString()->string()); + EXPECT_EQ("text", outputs[2].toStringRef()); outputs = callOp(*op, dummyTensor(DispatchKey::CPU), c10::IValue(), 4, c10::IValue()); 
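The KernelFunction changes above keep two unboxed entry points (plain and SymInt-aware), use a has_symint-style trait to detect SymInt arguments at compile time, and fall back to unpackSymInt when only the plain kernel pointer is populated. A simplified, self-contained sketch of that selection logic; the SymInt type, kernels, and unpack helpers below are stand-ins, not the real dispatcher code:

#include <cstdint>
#include <iostream>
#include <type_traits>

struct SymInt {  // stand-in for c10::SymInt
  int64_t v;
  int64_t expect_int() const { return v; }
};

static int64_t plain_kernel(int64_t d) { return d * 2; }                   // int64_t-only entry point
static int64_t sym_kernel(const SymInt& d) { return d.expect_int() * 2; }  // SymInt-aware entry point

// In the spirit of unpackSymInt: degrade a SymInt argument to a concrete int64_t.
static int64_t unpack(int64_t x) { return x; }
static int64_t unpack(const SymInt& x) { return x.expect_int(); }

template <typename Arg>
int64_t call(bool have_sym_kernel, Arg arg) {
  // Compile-time check, in the spirit of has_symint / fn_has_symint.
  if constexpr (std::is_same<std::decay_t<Arg>, SymInt>::value) {
    if (have_sym_kernel) {
      return sym_kernel(arg);          // SymInt-aware kernel available: no unpacking needed
    }
    return plain_kernel(unpack(arg));  // otherwise degrade SymInt and use the plain kernel
  } else {
    return plain_kernel(arg);          // no SymInt anywhere: fast path straight to the plain kernel
  }
}

int main() {
  std::cout << call(true, SymInt{4}) << " "    // 8, via sym_kernel
            << call(false, SymInt{4}) << " "   // 8, via unpack + plain_kernel
            << call(false, int64_t{3}) << "\n";  // 6
  return 0;
}
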
EXPECT_EQ(3, outputs.size()); diff --git a/aten/src/ATen/core/boxing/impl/kernel_function_test.cpp b/aten/src/ATen/core/boxing/impl/kernel_function_test.cpp index 10d2a3fdeb2f..b4fe9290b9e2 100644 --- a/aten/src/ATen/core/boxing/impl/kernel_function_test.cpp +++ b/aten/src/ATen/core/boxing/impl/kernel_function_test.cpp @@ -484,7 +484,7 @@ TEST(OperatorRegistrationTest_FunctionBasedKernel, givenKernelWithDictInput_with dict.insert("key2", "value2"); auto outputs = callOp(*op, dict); EXPECT_EQ(1, outputs.size()); - EXPECT_EQ("value2", outputs[0].toString()->string()); + EXPECT_EQ("value2", outputs[0].toStringRef()); } Dict kernelWithDictOutput(Dict input) { @@ -639,7 +639,7 @@ TEST(OperatorRegistrationTest_FunctionBasedKernel, givenKernelWithOptionalInputs EXPECT_EQ(3, outputs.size()); EXPECT_EQ(DispatchKey::CPU, extractDispatchKey(outputs[0].toTensor())); EXPECT_TRUE(outputs[1].isNone()); - EXPECT_EQ("text", outputs[2].toString()->string()); + EXPECT_EQ("text", outputs[2].toStringRef()); outputs = callOp(*op, dummyTensor(DispatchKey::CPU), c10::IValue(), 4, c10::IValue()); EXPECT_EQ(3, outputs.size()); diff --git a/aten/src/ATen/core/boxing/impl/kernel_lambda_legacy_test.cpp b/aten/src/ATen/core/boxing/impl/kernel_lambda_legacy_test.cpp index 0b4d1e8ad6b7..dc527d98eb99 100644 --- a/aten/src/ATen/core/boxing/impl/kernel_lambda_legacy_test.cpp +++ b/aten/src/ATen/core/boxing/impl/kernel_lambda_legacy_test.cpp @@ -456,8 +456,8 @@ TEST(OperatorRegistrationTest_LegacyLambdaBasedKernel, givenKernelWithStringList auto output = std::move(outputs[0]).toList(); EXPECT_EQ(2, output.size()); - EXPECT_EQ("value1", output.get(0).toString()->string()); - EXPECT_EQ("value2", output.get(1).toString()->string()); + EXPECT_EQ("value1", output.get(0).toStringRef()); + EXPECT_EQ("value2", output.get(1).toStringRef()); } TEST(OperatorRegistrationTest_LegacyLambdaBasedKernel, givenKernelWithDictInput_withoutOutput_whenRegistered_thenCanBeCalled) { @@ -494,7 +494,7 @@ TEST(OperatorRegistrationTest_LegacyLambdaBasedKernel, givenKernelWithDictInput_ dict.insert("key2", "value2"); auto outputs = callOp(*op, dict); EXPECT_EQ(1, outputs.size()); - EXPECT_EQ("value2", outputs[0].toString()->string()); + EXPECT_EQ("value2", outputs[0].toStringRef()); } TEST(OperatorRegistrationTest_LegacyLambdaBasedKernel, givenKernelWithDictOutput_whenRegistered_thenCanBeCalled) { @@ -552,7 +552,7 @@ TEST(OperatorRegistrationTest_LegacyLambdaBasedKernel, givenKernelWithUnorderedM dict.insert("key2", "value2"); auto outputs = callOp(*op, dict); EXPECT_EQ(1, outputs.size()); - EXPECT_EQ("value2", outputs[0].toString()->string()); + EXPECT_EQ("value2", outputs[0].toStringRef()); } TEST(OperatorRegistrationTest_LegacyLambdaBasedKernel, givenKernelWithUnorderedMapOutput_whenRegistered_thenCanBeCalled) { @@ -832,7 +832,7 @@ TEST(OperatorRegistrationTest_LegacyLambdaBasedKernel, givenKernelWithOptionalIn EXPECT_EQ(3, outputs.size()); EXPECT_EQ(DispatchKey::CUDA, extractDispatchKey(outputs[0].toTensor())); EXPECT_TRUE(outputs[1].isNone()); - EXPECT_EQ("text", outputs[2].toString()->string()); + EXPECT_EQ("text", outputs[2].toStringRef()); outputs = callOp(*op, dummyTensor(DispatchKey::CPU), c10::IValue(), 4, c10::IValue()); EXPECT_EQ(3, outputs.size()); diff --git a/aten/src/ATen/core/boxing/impl/kernel_lambda_test.cpp b/aten/src/ATen/core/boxing/impl/kernel_lambda_test.cpp index 19f4ee4acbeb..c9b72e23048f 100644 --- a/aten/src/ATen/core/boxing/impl/kernel_lambda_test.cpp +++ b/aten/src/ATen/core/boxing/impl/kernel_lambda_test.cpp @@ -410,7 
+410,7 @@ TEST(OperatorRegistrationTest_LambdaBasedKernel, givenKernelWithDictInput_withOu dict.insert("key2", "value2"); auto outputs = callOp(*op, dict); EXPECT_EQ(1, outputs.size()); - EXPECT_EQ("value2", outputs[0].toString()->string()); + EXPECT_EQ("value2", outputs[0].toStringRef()); } TEST(OperatorRegistrationTest_LambdaBasedKernel, givenKernelWithDictOutput_whenRegistered_thenCanBeCalled) { @@ -554,7 +554,7 @@ TEST(OperatorRegistrationTest_LambdaBasedKernel, givenKernelWithOptionalInputs_w EXPECT_EQ(3, outputs.size()); EXPECT_EQ(DispatchKey::CPU, extractDispatchKey(outputs[0].toTensor())); EXPECT_TRUE(outputs[1].isNone()); - EXPECT_EQ("text", outputs[2].toString()->string()); + EXPECT_EQ("text", outputs[2].toStringRef()); outputs = callOp(*op, dummyTensor(DispatchKey::CPU), c10::IValue(), 4, c10::IValue()); EXPECT_EQ(3, outputs.size()); diff --git a/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h b/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h index 0a28330a0bfb..a99f45040788 100644 --- a/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h +++ b/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h @@ -4,6 +4,7 @@ #include #include #include +#include #include #include @@ -342,6 +343,13 @@ namespace impl { } }; + template + struct ivalue_to_arg final { + static List call(IValue& v) { + return v.toTensorList(); + } + }; + template struct ivalue_to_arg, AllowDeprecatedTypes> final { // If an argument is ArrayRef, convert the IValue to a std::vector and pass that @@ -353,7 +361,27 @@ namespace impl { template struct ivalue_to_arg final { static std::vector call(IValue& v) { - return ivalue_to_arg, AllowDeprecatedTypes>::call(v); + if (v.isIntList()) { + std::vector r; + auto src = v.toIntList(); + std::transform(src.begin(), src.end(), std::back_inserter(r), [](int64_t i) { return c10::SymInt(i); }); + return r; + } else { + return ivalue_to_arg, AllowDeprecatedTypes>::call(v); + } + } + }; + template + struct ivalue_to_arg, AllowDeprecatedTypes> final { + static OptionalArray call(IValue& v) { + if (v.isIntList()) { + std::vector r; + auto src = v.toIntList(); + std::transform(src.begin(), src.end(), std::back_inserter(r), [](int64_t i) { return c10::SymInt(i); }); + return OptionalArray(r); + } else { + return std::move(v).to>(); + } } }; template diff --git a/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor_test.cpp b/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor_test.cpp index 933e1bbdf94c..9eebb55cc34b 100644 --- a/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor_test.cpp +++ b/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor_test.cpp @@ -491,7 +491,7 @@ TEST(OperatorRegistrationTest_FunctorBasedKernel, givenKernelWithDictInput_withO dict.insert("key2", "value2"); auto outputs = callOp(*op, dict); EXPECT_EQ(1, outputs.size()); - EXPECT_EQ("value2", outputs[0].toString()->string()); + EXPECT_EQ("value2", outputs[0].toStringRef()); } struct KernelWithDictOutput final : OperatorKernel { @@ -546,7 +546,7 @@ TEST(OperatorRegistrationTest_FunctorBasedKernel, givenKernelWithTupleInput_with std::tuple tup{"foobar", 123, 420.1337}; auto outputs = callOp(*op, tup); EXPECT_EQ(1, outputs.size()); - EXPECT_EQ("foobar", outputs[0].toString()->string()); + EXPECT_EQ("foobar", outputs[0].toStringRef()); } TEST(OperatorRegistrationTest_FunctorBasedKernel, givenKernelWithCache_thenCacheIsKeptCorrectly) { @@ -774,7 +774,7 @@ TEST(OperatorRegistrationTest_FunctorBasedKernel, 
givenKernelWithOptionalInputs_ EXPECT_EQ(3, outputs.size()); EXPECT_EQ(DispatchKey::CPU, extractDispatchKey(outputs[0].toTensor())); EXPECT_TRUE(outputs[1].isNone()); - EXPECT_EQ("text", outputs[2].toString()->string()); + EXPECT_EQ("text", outputs[2].toStringRef()); outputs = callOp(*op, dummyTensor(DispatchKey::CPU), c10::IValue(), 4, c10::IValue()); EXPECT_EQ(3, outputs.size()); diff --git a/aten/src/ATen/core/class_type.cpp b/aten/src/ATen/core/class_type.cpp index 9d7b38d4d67b..2478bde034bc 100644 --- a/aten/src/ATen/core/class_type.cpp +++ b/aten/src/ATen/core/class_type.cpp @@ -86,7 +86,7 @@ std::string ClassType::getForwardPreHookErrorMessage(int pre_hook_idx) const { std::string pre_hook_schema = pre_hook_name + "(self, input: Tuple[" + input_types + "])"; std::string return_string = - "This error occured while scripting the forward pre-hook '" + + "This error occurred while scripting the forward pre-hook '" + pre_hook_name + "' on module '" + name()->name() + "'. If you did not want to script this pre-hook remove it from the " "original NN module before scripting. Pre-hooks for module '" + @@ -111,7 +111,7 @@ std::string ClassType::getForwardHookErrorMessage(int hook_idx) const { std::string hook_schema = hook_name + "(self, input: Tuple[" + input_types + "], output: " + output_types + ")"; std::string return_string = - "This error occured while scripting the forward hook '" + "This error occurred while scripting the forward hook '" + hook_name + "' on module " + name()->name() + ". If you did not want to script this hook remove it from" + " the original NN module before scripting. This hook was" + diff --git a/aten/src/ATen/core/custom_class.cpp b/aten/src/ATen/core/custom_class.cpp index 2bba7e6df62f..d719dde6ea0c 100644 --- a/aten/src/ATen/core/custom_class.cpp +++ b/aten/src/ATen/core/custom_class.cpp @@ -143,6 +143,7 @@ c10::FunctionSchema class_base::withNewArguments( new_args.emplace_back( default_arg.name_, old_arg.type(), + old_arg.real_type(), old_arg.N(), default_arg.value_); } diff --git a/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h b/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h index 6a46a795be42..7401297c66a6 100644 --- a/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h +++ b/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h @@ -74,7 +74,13 @@ namespace detail { } } } - void operator()(at::ArrayRef>) { + // Structured Tensor[] translates to this case + void operator()(at::ITensorListRef xs) { + for (const auto& x : xs) { + ts = ts | x.key_set(); + } + } + [[noreturn]] void operator()(at::ArrayRef>) { // Just checking that the handling of Tensor?[] didn't change. TORCH_INTERNAL_ASSERT(false); } @@ -114,6 +120,9 @@ namespace detail { * they have been registered as fallthrough. The set of excluded backends * varies from operator, as some operators may have overridden the * fallthrough with custom behavior. 
+ * + * Note - this should maintain identical impl to the py dispatcher key extraction logic + * at pytorch/torch/dispatcher.py */ struct TORCH_API DispatchKeyExtractor final { public: diff --git a/aten/src/ATen/core/dispatch/Dispatcher.cpp b/aten/src/ATen/core/dispatch/Dispatcher.cpp index 667eefdcc5ab..8b2257605161 100644 --- a/aten/src/ATen/core/dispatch/Dispatcher.cpp +++ b/aten/src/ATen/core/dispatch/Dispatcher.cpp @@ -1,6 +1,7 @@ #include #include #include +#include namespace c10 { @@ -9,6 +10,12 @@ bool show_dispatch_trace() { return temp != nullptr; } +static thread_local int64_t dispatch_trace_nesting_value_; + +void dispatch_trace_nesting_incr() { ++dispatch_trace_nesting_value_; } +void dispatch_trace_nesting_decr() { --dispatch_trace_nesting_value_; } +int64_t dispatch_trace_nesting_value() { return dispatch_trace_nesting_value_; } + namespace detail { class RegistrationListenerList final { @@ -44,7 +51,9 @@ Dispatcher::Dispatcher() , operatorLookupTable_() , backendFallbackKernels_() , listeners_(std::make_unique()) -, mutex_() {} +, mutex_() +, cond_var_() +{} Dispatcher::~Dispatcher() = default; @@ -63,6 +72,41 @@ c10::optional Dispatcher::findOp(const OperatorName& overload_na }); } +// NB: If you add more waitFor* implementations, you also have to add +// appropriate notify_all() calls to the relevant register calls + +void Dispatcher::waitForDef(const FunctionSchema& schema) { + using namespace std::chrono_literals; + std::unique_lock lock(mutex_); + bool r = cond_var_.wait_for(lock, 2s, [&]{ + return findOp(schema.operator_name()) != c10::nullopt; + }); + TORCH_INTERNAL_ASSERT(r, + "Expected main interpreter to define ", schema.operator_name(), + ", but this didn't happen within timeout. Are you trying to load " + "different models in the same torchdeploy/multipy instance? You " + "must warmup each interpreter identically, e.g., import all " + "the same dependencies."); +} + +void Dispatcher::waitForImpl(const OperatorName& op_name, c10::optional maybe_dk) { + using namespace std::chrono_literals; + std::unique_lock lock(mutex_); + auto dk = maybe_dk.value_or(DispatchKey::CompositeImplicitAutograd); + auto op = findOrRegisterName_(op_name); + bool r = cond_var_.wait_for(lock, 2s, [&]{ + // NB: this is slightly unsound for overrides, but overrides are + // funny business anyway + return op.hasKernelForDispatchKey(dk); + }); + TORCH_INTERNAL_ASSERT(r, + "Expected main interpreter to implement ", dk, " for ", op_name, + ", but this didn't happen within timeout. Are you trying to load " + "different models in the same torchdeploy/multipy instance? 
You " + "must warmup each interpreter identically, e.g., import all " + "the same dependencies."); +} + c10::optional Dispatcher::findSchema(const OperatorName& overload_name) { auto it = findOp(overload_name); if (it.has_value()) { @@ -169,6 +213,8 @@ RegistrationHandleRAII Dispatcher::registerDef(FunctionSchema schema, std::strin ++op.operatorDef_->def_count; ++op.operatorDef_->def_and_impl_count; + cond_var_.notify_all(); + return RegistrationHandleRAII([this, op, op_name] { deregisterDef_(op, op_name); }); @@ -221,6 +267,8 @@ RegistrationHandleRAII Dispatcher::registerImpl( ++op.operatorDef_->def_and_impl_count; + cond_var_.notify_all(); + return RegistrationHandleRAII([this, op, op_name, dispatch_key, handle] { deregisterImpl_(op, op_name, dispatch_key, handle); }); @@ -243,6 +291,7 @@ RegistrationHandleRAII Dispatcher::registerName(OperatorName op_name) { std::lock_guard lock(mutex_); auto op = findOrRegisterName_(op_name); ++op.operatorDef_->def_and_impl_count; + return RegistrationHandleRAII( [this, op, op_name] { deregisterName_(op, op_name); }); } diff --git a/aten/src/ATen/core/dispatch/Dispatcher.h b/aten/src/ATen/core/dispatch/Dispatcher.h index bc40bc5b62e0..5af8ef1e52de 100644 --- a/aten/src/ATen/core/dispatch/Dispatcher.h +++ b/aten/src/ATen/core/dispatch/Dispatcher.h @@ -11,6 +11,7 @@ #include #include #include +#include #include #include @@ -19,6 +20,14 @@ namespace c10 { TORCH_API bool show_dispatch_trace(); +TORCH_API void dispatch_trace_nesting_incr(); +TORCH_API void dispatch_trace_nesting_decr(); +TORCH_API int64_t dispatch_trace_nesting_value(); + +struct DispatchTraceNestingGuard { + DispatchTraceNestingGuard() { dispatch_trace_nesting_incr(); } + ~DispatchTraceNestingGuard() { dispatch_trace_nesting_decr(); } +}; class TORCH_API OperatorHandle; template class TypedOperatorHandle; @@ -168,6 +177,15 @@ class TORCH_API Dispatcher final { // See Note [Plumbing Keys Through The Dispatcher] void redispatchBoxed(const OperatorHandle& op, DispatchKeySet dispatchKeySet, Stack* stack) const; + bool hasBackendFallbackForDispatchKey(DispatchKey dk) { + auto dispatch_ix = getDispatchTableIndexForDispatchKey(dk); + if (dispatch_ix < 0) return false; + return backendFallbackKernels_[dispatch_ix].kernel.isValid(); + } + + // Used by torchdeploy/multipy for multiple interpreters racing. + void waitForDef(const FunctionSchema& schema); + void waitForImpl(const OperatorName& op_name, c10::optional dispatch_key); // ------------------------------------------------------------------------ // @@ -293,7 +311,23 @@ class TORCH_API Dispatcher final { std::array backendFallbackKernels_; std::unique_ptr listeners_; + + // This mutex protects concurrent access to the dispatcher std::mutex mutex_; + + // This condition variable gets notified whenever we add a new def/impl to the + // dispatch table. This is primarily used by multipy/torchdeploy, when + // we have multiple interpreters trying to register to the dispatch table. + // In this situation, whenever the non-primary interpreter would have tried + // to register to the dispatch table, instead it will check to see if the + // expected registration has already been made, and if it hasn't, wait on + // this condition variable to see if it was just racing with the primary + // interpreter. + // + // We expect it to be rare for there to be any waiters on this condition + // variable. 
This is mostly just to help give better diagnostics if + // something goes horribly wrong + std::condition_variable cond_var_; }; /** @@ -302,6 +336,8 @@ class TORCH_API Dispatcher final { * to lookup a kernel for a certain set of arguments. */ class TORCH_API OperatorHandle { + template friend class std::hash; + public: OperatorHandle(OperatorHandle&&) noexcept = default; OperatorHandle& operator=(OperatorHandle&&) noexcept = default; @@ -333,6 +369,10 @@ class TORCH_API OperatorHandle { return operatorDef_->op.hasKernelForDispatchKey(k); } + bool hasKernelForAnyDispatchKey(DispatchKeySet k) const { + return operatorDef_->op.hasKernelForAnyDispatchKey(k); + } + bool hasComputedKernelForDispatchKey(DispatchKey k) const { return operatorDef_->op.hasComputedKernelForDispatchKey(k); } @@ -388,6 +428,19 @@ class TORCH_API OperatorHandle { c10::Dispatcher::singleton().redispatchBoxed(*this, ks, stack); } + template + PyObject* getPythonOp(c10::impl::PyInterpreter* self_interpreter, F slow_accessor) const { + return operatorDef_->op.getPythonOp(self_interpreter, slow_accessor); + } + + bool operator==(const OperatorHandle& other) const { + return operatorDef_ == other.operatorDef_; + } + + bool operator!=(const OperatorHandle& other) const { + return operatorDef_ != other.operatorDef_; + } + private: explicit OperatorHandle(std::list::iterator operatorIterator) : operatorDef_(&*operatorIterator), operatorIterator_(operatorIterator) {} @@ -568,7 +621,10 @@ C10_ALWAYS_INLINE_UNLESS_MOBILE Return Dispatcher::call(const TypedOperatorHandl auto dispatchKeySet = op.operatorDef_->op.dispatchKeyExtractor() .template getDispatchKeySetUnboxed(args...); #ifndef NDEBUG + DispatchTraceNestingGuard debug_guard; if (show_dispatch_trace()) { + auto nesting_value = dispatch_trace_nesting_value(); + for (int64_t i = 0; i < nesting_value; ++i) std::cerr << " "; std::cerr << "[call] op=[" << op.operator_name() << "], key=[" << toString(dispatchKeySet.highestPriorityTypeId()) << "]" << std::endl; } #endif @@ -588,7 +644,10 @@ inline Return Dispatcher::redispatch(const TypedOperatorHandle detail::unused_arg_(args...); // workaround for a false-positive warning about unused parameters in gcc 5 // do not use RecordFunction on redispatch #ifndef NDEBUG + DispatchTraceNestingGuard debug_guard; if (show_dispatch_trace()) { + auto nesting_value = dispatch_trace_nesting_value(); + for (int64_t i = 0; i < nesting_value; ++i) std::cerr << " "; std::cerr << "[redispatch] op=[" << op.operator_name() << "], key=[" << toString(currentDispatchKeySet.highestPriorityTypeId()) << "]" << std::endl; } #endif @@ -601,7 +660,10 @@ inline void Dispatcher::callBoxed(const OperatorHandle& op, Stack* stack) const const auto& entry = op.operatorDef_->op; auto dispatchKeySet = entry.dispatchKeyExtractor().getDispatchKeySetBoxed(stack); #ifndef NDEBUG + DispatchTraceNestingGuard debug_guard; if (show_dispatch_trace()) { + auto nesting_value = dispatch_trace_nesting_value(); + for (int64_t i = 0; i < nesting_value; ++i) std::cerr << " "; std::cerr << "[callBoxed] op=[" << op.operator_name() << "], key=[" << toString(dispatchKeySet.highestPriorityTypeId()) << "]" << std::endl; } #endif @@ -635,16 +697,26 @@ inline void Dispatcher::callBoxedForDispatchKey(const OperatorHandle& op, Dispat // We still compute this as we're obligated to pass it on to the internal // kernel, if it is a boxed fallback auto dispatchKeySet = entry.dispatchKeyExtractor().getDispatchKeySetBoxed(stack); - const auto& kernel = entry.kernelForDispatchKey(dk); + const auto& 
kernel = ([&]() { + if (op.hasKernelForDispatchKey(dk)) { + return entry.kernelForDispatchKey(dk); + } else { + auto idx = getDispatchTableIndexForDispatchKey(dk); + TORCH_INTERNAL_ASSERT(idx >= 0); + return backendFallbackKernels_[idx].kernel; + } + })(); kernel.callBoxed(op, dispatchKeySet, stack); } - inline void Dispatcher::redispatchBoxed(const OperatorHandle& op, DispatchKeySet dispatchKeySet, Stack* stack) const { // note: this doesn't need the mutex because write operations on the list keep iterators intact. const auto& entry = op.operatorDef_->op; #ifndef NDEBUG + DispatchTraceNestingGuard debug_guard; if (show_dispatch_trace()) { + auto nesting_value = dispatch_trace_nesting_value(); + for (int64_t i = 0; i < nesting_value; ++i) std::cerr << " "; std::cerr << "[redispatchBoxed] op=[" << op.operator_name() << "], key=[" << toString(dispatchKeySet.highestPriorityTypeId()) << "]" << std::endl; } #endif @@ -653,3 +725,14 @@ inline void Dispatcher::redispatchBoxed(const OperatorHandle& op, DispatchKeySet } } // namespace c10 + +namespace std { + +template <> +struct hash { + size_t operator()(c10::OperatorHandle op) const noexcept { + return std::hash{}(static_cast(op.operatorDef_)); + } +}; + +} // namespace std diff --git a/aten/src/ATen/core/dispatch/OperatorEntry.cpp b/aten/src/ATen/core/dispatch/OperatorEntry.cpp index 5c1c42bb6226..5bd5d8abf54d 100644 --- a/aten/src/ATen/core/dispatch/OperatorEntry.cpp +++ b/aten/src/ATen/core/dispatch/OperatorEntry.cpp @@ -26,6 +26,7 @@ OperatorEntry::OperatorEntry(OperatorName&& operator_name) , dispatchKeyExtractor_(DispatchKeyExtractor::makeUninitialized()) , kernels_() , cpp_signature_() +, sym_cpp_signature_() , is_observed_(ObservedOperators::isObserved(name_)) { // Pick up any backend fallbacks that were registered prior to this @@ -34,7 +35,10 @@ OperatorEntry::OperatorEntry(OperatorName&& operator_name) } namespace { - void checkSchema(const OperatorName& name, const FunctionSchema& from_def, const std::string& from_def_debug, const FunctionSchema& inferred, const std::string& inferred_debug) { + void checkSchema(const OperatorName& name, const FunctionSchema& from_def_, const std::string& from_def_debug, const KernelFunction& kernel, const FunctionSchema& inferred_, const std::string& inferred_debug) { + // TODO: figure out if we can just directly save real schema at def time + FunctionSchema from_def = from_def_.cloneWithRealTypes(kernel.isValidSymUnboxed()); + FunctionSchema inferred = inferred_.cloneWithRealTypes(); c10::optional schema_difference = findSchemaDifferences(from_def, inferred); if (schema_difference.has_value()) { TORCH_CHECK(false, @@ -60,12 +64,24 @@ const AnnotatedKernel& OperatorEntry::ambiguousAutogradOtherKernel() const { return kernel; } +void OperatorEntry::assertSignatureIsCorrect(const CppSignature call_signature, bool has_symint) const { + if (has_symint) { + if (C10_UNLIKELY(sym_cpp_signature_.has_value() && (call_signature != sym_cpp_signature_->signature))) { + reportSignatureError(call_signature, *sym_cpp_signature_); + } + } else { + if (C10_UNLIKELY(cpp_signature_.has_value() && (call_signature != cpp_signature_->signature))) { + reportSignatureError(call_signature, *cpp_signature_); + } + } +} + void OperatorEntry::registerSchema(FunctionSchema&& schema, std::string&& debug, std::vector tags) { TORCH_INTERNAL_ASSERT(!schema_.has_value()); for (const auto& kernel : kernels_) { for (const auto &j : kernel.second) { if (j.inferred_function_schema != nullptr) { - checkSchema(name_, schema, debug, 
*j.inferred_function_schema, j.debug); + checkSchema(name_, schema, debug, j.kernel, *j.inferred_function_schema, j.debug); } } } @@ -99,25 +115,26 @@ OperatorEntry::AnnotatedKernelContainerIterator OperatorEntry::registerKernel( // which means if you could validly change the type of a cpp_signature, then // that would also invalidate the old TypedOperatorHandles. if (cpp_signature.has_value()) { - if (cpp_signature_.has_value()) { - TORCH_CHECK(*cpp_signature == cpp_signature_->signature, + auto& local_cpp_signature = kernel.isValidSymUnboxed() ? sym_cpp_signature_ : cpp_signature_; + if (local_cpp_signature.has_value()) { + TORCH_CHECK(*cpp_signature == local_cpp_signature->signature, "\nMismatch in kernel C++ signatures\n", " operator: ", (this->schema_.has_value() ? toString(this->schema_->schema) : toString(name_)), "\n", " ", (this->schema_.has_value() ? this->schema_->debug : "no debug info"), "\n", - " kernel 1: ", cpp_signature_->signature.name(), "\n", - " dispatch key: ", toString(cpp_signature_->dispatch_key), "\n", - " ", cpp_signature_->debug, "\n", + " kernel 1: ", local_cpp_signature->signature.name(), "\n", + " dispatch key: ", toString(local_cpp_signature->dispatch_key), "\n", + " ", local_cpp_signature->debug, "\n", " kernel 2: ", cpp_signature->name(), "\n", " dispatch key: ", toString(dispatch_key), "\n", " ", debug, "\n" ); } else { - cpp_signature_ = CppSignatureWithDebug { *cpp_signature, debug, dispatch_key }; + local_cpp_signature = CppSignatureWithDebug { *cpp_signature, debug, dispatch_key }; } } if (schema_ && inferred_function_schema) { - checkSchema(name_, schema_->schema, schema_->debug, *inferred_function_schema, debug); + checkSchema(name_, schema_->schema, schema_->debug, kernel, *inferred_function_schema, debug); } // Add the kernel to the kernels list, @@ -130,13 +147,17 @@ OperatorEntry::AnnotatedKernelContainerIterator OperatorEntry::registerKernel( #else if (k.size() > 0) { #endif - TORCH_WARN("Overriding a previously registered kernel for the same operator and the same dispatch key\n", - " operator: ", (schema_.has_value() ? toString(schema_->schema) : toString(name_)), "\n", - " ", (this->schema_.has_value() ? this->schema_->debug : "no debug info"), "\n", - " dispatch key: ", toString(dispatch_key), "\n", - " previous kernel: ", (cpp_signature_.has_value() ? cpp_signature_->debug : "no debug info"), "\n", - " new kernel: ", debug - ); + // Suppress the warning for Meta key as we are overriding C++ meta functions with python meta functions + // for some ops + if (dispatch_key != DispatchKey::Meta) { + TORCH_WARN("Overriding a previously registered kernel for the same operator and the same dispatch key\n", + " operator: ", (schema_.has_value() ? toString(schema_->schema) : toString(name_)), "\n", + " ", (this->schema_.has_value() ? this->schema_->debug : "no debug info"), "\n", + " dispatch key: ", toString(dispatch_key), "\n", + " previous kernel: ", (cpp_signature_.has_value() ? cpp_signature_->debug : (sym_cpp_signature_.has_value() ? sym_cpp_signature_->debug : "no debug info")), "\n", + " new kernel: ", debug + ); + } } #ifdef C10_DISPATCHER_ONE_KERNEL_PER_DISPATCH_KEY @@ -303,6 +324,19 @@ std::pair OperatorEntry::computeDispatchTab // For AutogradOther, we return ambiguousAutogradOtherKernel() if there's registration // to any of its backends. // See Note [Undefined in dispatchTable_] for the special handling for Undefined. 
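// The operator== / std::hash<OperatorHandle> additions above make OperatorHandle usable
// as a key in standard hash containers (hashing on the underlying OperatorDef pointer).
// A small usage sketch, assuming OperatorHandle stays copyable; the function name and
// the set are illustrative, not part of the dispatcher:
//
//   #include <unordered_set>
//   #include <ATen/core/dispatch/Dispatcher.h>
//
//   void note_seen(const c10::OperatorHandle& op) {
//     static std::unordered_set<c10::OperatorHandle> seen;  // uses std::hash<c10::OperatorHandle>
//     seen.insert(op);  // equality compares the underlying OperatorDef pointers
//   }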
+ + // If the dispatch key is included in CompositeImplicitAutogradNestedTensor, + // then we register it to nested-tensor kernel rather than + // regular-tensor CompositeImplicitAutograd kernel. + // We have no intention to change the behavior of Undefined, + // so this nested-tensor branch requires `dispatch_key != DispatchKey::Undefined` + // to let the original CompositeImplicitAutograd handle Undefined + if (dispatch_key != DispatchKey::Undefined && isIncludedInAlias(dispatch_key, DispatchKey::CompositeImplicitAutogradNestedTensor)) { + if (auto nested_registration = getKernelForDispatchKey(DispatchKey::CompositeImplicitAutogradNestedTensor)) { + return {*nested_registration, "nested kernel"}; + } + } + if (dispatch_key == DispatchKey::Undefined || isIncludedInAlias(dispatch_key, DispatchKey::CompositeImplicitAutograd)) { if (auto math_registration = getKernelForDispatchKey(DispatchKey::CompositeImplicitAutograd)) { if (dispatch_key == DispatchKey::AutogradOther @@ -452,19 +486,35 @@ std::string OperatorEntry::listAllDispatchKeys() const { return str.str(); } -void OperatorEntry::reportSignatureError(const CppSignature call_signature) const { +void OperatorEntry::reportSignatureError(const CppSignature& call_signature, const CppSignatureWithDebug& saved_signature) const { TORCH_CHECK(false, "\nTried to access or call an operator with a wrong signature.\n", " operator: ", (schema_.has_value() ? toString(schema_->schema) : toString(name_)), "\n", " ", (schema_.has_value() ? schema_->debug : "unknown debug info"), "\n", - " correct signature: ", cpp_signature_->signature.name(), "\n", - " ", cpp_signature_->debug, "\n", + " correct signature: ", saved_signature.signature.name(), "\n", + " ", saved_signature.debug, "\n", " accessed/called as: ", call_signature.name(), "\n", "This likely happened in a call to OperatorHandle::typed(). ", "Please make sure that the function signature matches the signature in the operator registration call." ); }; +std::string post_process_dispatch_key_str(std::string dispatch_key) { + const std::string substr = "PrivateUse1"; + if (substr.size() <= dispatch_key.size() && std::equal(substr.rbegin(), substr.rend(), dispatch_key.rbegin())) { + auto privateuse1_backend = get_privateuse1_backend(); + if (privateuse1_backend != "privateuseone") { + // remove trailing "*PrivateUse1" + dispatch_key.erase(dispatch_key.length() - substr.length()); + // append the registered backend's name. + // AutogradPrivateUse1 -> AutogradFoo + auto backend_name = c10::get_privateuse1_backend(); + dispatch_key = dispatch_key + backend_name; + } + } + return dispatch_key; +} + void OperatorEntry::reportError(DispatchKey dispatchKey) const { // If there is an invariant problem, report it now. checkInvariants(); @@ -479,7 +529,7 @@ void OperatorEntry::reportError(DispatchKey dispatchKey) const { } TORCH_CHECK_NOT_IMPLEMENTED(false, "Could not run '", name_, "' with arguments", - " from the '", toString(dispatchKey), "' backend. This could be because " + " from the '", post_process_dispatch_key_str(toString(dispatchKey)), "' backend. This could be because " "the operator doesn't exist for this backend, or was omitted during ", "the selective/custom build process (if using custom build). 
If you are a ", "Facebook employee using PyTorch on mobile, please visit ", diff --git a/aten/src/ATen/core/dispatch/OperatorEntry.h b/aten/src/ATen/core/dispatch/OperatorEntry.h index 1d9f1495f3c7..c3bd91197f5e 100644 --- a/aten/src/ATen/core/dispatch/OperatorEntry.h +++ b/aten/src/ATen/core/dispatch/OperatorEntry.h @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -163,14 +164,10 @@ class TORCH_API OperatorEntry final { // Asserts that the given FuncType is correct for calling this operator in an unboxed way. template inline void assertSignatureIsCorrect() { - assertSignatureIsCorrect(CppSignature::make()); + assertSignatureIsCorrect(CppSignature::make(), fn_has_symint::value); } - void assertSignatureIsCorrect(const CppSignature call_signature) { - if (C10_UNLIKELY(cpp_signature_.has_value() && (call_signature != cpp_signature_->signature))) { - reportSignatureError(call_signature); - } - } + void assertSignatureIsCorrect(const CppSignature call_signature, bool has_symint) const; [[noreturn]] void reportError(DispatchKey dispatchKey) const; @@ -215,6 +212,11 @@ class TORCH_API OperatorEntry final { // Returns all the operator tags added at the time of registration const std::vector& getTags() const; + template + PyObject* getPythonOp(PyInterpreter* self_interpreter, F slow_accessor) const { + return py_cache_.ptr_or(self_interpreter, slow_accessor); + } + private: OperatorName name_; @@ -224,6 +226,8 @@ class TORCH_API OperatorEntry final { #endif std::array dispatchTable_; DispatchKeyExtractor dispatchKeyExtractor_; + // Pointer to the torch.ops.ns.op.overload object for speed + c10::PyHandleCache py_cache_; // kernels_ stores all registered kernels for the corresponding dispatch key // and catchAllKernels_ stores the catch-all kernels. 
@@ -280,11 +284,12 @@ class TORCH_API OperatorEntry final { c10::optional dispatch_key; }; c10::optional cpp_signature_; + c10::optional sym_cpp_signature_; // Whether this operator needs to be observed with RecordFunction const bool is_observed_; - [[noreturn]] void reportSignatureError(CppSignature call_signature) const; + [[noreturn]] void reportSignatureError(const CppSignature& call_signature, const CppSignatureWithDebug& saved_signature) const; const KernelFunction& computeDispatchTableEntry(const c10::Dispatcher& dispatcher, DispatchKey dispatch_key) const; std::pair computeDispatchTableEntryWithDebug( const c10::Dispatcher& dispatcher, DispatchKey dispatch_key diff --git a/aten/src/ATen/core/dynamic_type.cpp b/aten/src/ATen/core/dynamic_type.cpp index 5920d7c05f1f..49dd593e38d3 100644 --- a/aten/src/ATen/core/dynamic_type.cpp +++ b/aten/src/ATen/core/dynamic_type.cpp @@ -231,8 +231,6 @@ TypePtr DynamicType::fallback() const { return BoolType::get(); case Tag::Int: return IntType::get(); - case Tag::SymInt: - return SymIntType::get(); case Tag::Float: return FloatType::get(); case Tag::Complex: @@ -326,8 +324,6 @@ DynamicType::Ptr IValue::TagType::get(const c10::IValue& v) { return DynamicTypeTrait::getBaseType(); case Tag::Int: return DynamicTypeTrait::getBaseType(); - case Tag::SymInt: - return DynamicTypeTrait::getBaseType(); case Tag::Bool: return DynamicTypeTrait::getBaseType(); case Tag::String: diff --git a/aten/src/ATen/core/dynamic_type.h b/aten/src/ATen/core/dynamic_type.h index a84644ddde04..1f649c8217cb 100644 --- a/aten/src/ATen/core/dynamic_type.h +++ b/aten/src/ATen/core/dynamic_type.h @@ -16,7 +16,6 @@ constexpr DynamicTypeBits kDynamicAnyTypeBit = DYNAMIC_TYPE_BIT(30); constexpr DynamicTypeBits kDynamicNoneTypeBit = DYNAMIC_TYPE_BIT(1); constexpr DynamicTypeBits kDynamicIntTypeBit = DYNAMIC_TYPE_BIT(3); -constexpr DynamicTypeBits kDynamicSymIntTypeBit = DYNAMIC_TYPE_BIT(23); constexpr DynamicTypeBits kDynamicFloatTypeBit = DYNAMIC_TYPE_BIT(4); constexpr DynamicTypeBits kDynamicComplexTypeBit = DYNAMIC_TYPE_BIT(5); constexpr DynamicTypeBits kDynamicListTypeBit = DYNAMIC_TYPE_BIT(7); @@ -29,7 +28,6 @@ constexpr DynamicTypeBits kDynamicClassTypeBit = DYNAMIC_TYPE_BIT(10); _(Bool, DYNAMIC_TYPE_BIT(2), 1) \ _(Int, kDynamicIntTypeBit, 1) \ _(Float, kDynamicFloatTypeBit, 1) \ - _(SymInt, kDynamicSymIntTypeBit, 1) \ _(Complex, kDynamicComplexTypeBit, 1) \ _(Number, \ (kDynamicIntTypeBit | kDynamicFloatTypeBit | kDynamicComplexTypeBit), \ @@ -63,6 +61,7 @@ constexpr DynamicTypeBits kDynamicClassTypeBit = DYNAMIC_TYPE_BIT(10); #define FORALL_DYNAMIC_TYPES_FAKE(_) \ _(ScalarType, kDynamicIntTypeBit, 1) \ _(Layout, kDynamicIntTypeBit, 1) \ + _(SymInt, kDynamicIntTypeBit, 1) \ _(MemoryFormat, kDynamicIntTypeBit, 1) #define FORWARD_DECL_TYPE(NAME, _, __) struct NAME ## Type; diff --git a/aten/src/ATen/core/function_schema.cpp b/aten/src/ATen/core/function_schema.cpp index a3a10862178c..440ee446d499 100644 --- a/aten/src/ATen/core/function_schema.cpp +++ b/aten/src/ATen/core/function_schema.cpp @@ -17,6 +17,37 @@ const std::vector& FunctionSchema::getCorrectList(SchemaArgType type) } } +FunctionSchema FunctionSchema::cloneWithRealTypes(bool with_symint) const { + auto cloneWithRealTypes = [&](const Argument& a) { + if (with_symint) { + return a.cloneWithType(a.real_type()); + } + // Don't use real type if it looks like a SymInt + // NB: keep this in sync with unpackSymInt in KernelFunction_impl.h + if ( + *a.real_type() == *getTypePtr() || + *a.real_type() == *getTypePtr>() || + 
*a.real_type() == *getTypePtr() || + *a.real_type() == *getTypePtr() + ) { + // Keep the fake type + return a.cloneWithType(a.type()); + } else { + return a.cloneWithType(a.real_type()); + } + }; + std::vector new_arguments, new_returns; + std::transform(arguments().begin(), arguments().end(), std::back_inserter(new_arguments), cloneWithRealTypes); + std::transform(returns().begin(), returns().end(), std::back_inserter(new_returns), cloneWithRealTypes); + return FunctionSchema( + name(), + overload_name(), + std::move(new_arguments), + std::move(new_returns), + is_vararg(), + is_varret()); +} + bool FunctionSchema::canAliasTypeSetsAlias(const c10::optional &lhs, const c10::optional &rhs) const { if (!lhs || !rhs) { return false; diff --git a/aten/src/ATen/core/function_schema.h b/aten/src/ATen/core/function_schema.h index 77fdb20f6516..d80eaf6581e0 100644 --- a/aten/src/ATen/core/function_schema.h +++ b/aten/src/ATen/core/function_schema.h @@ -44,7 +44,7 @@ struct Argument { c10::optional alias_info = c10::nullopt) : name_(std::move(name)), type_(fake_type ? std::move(fake_type) : TensorType::get()), - real_type_(real_type ? std::move(real_type) : TensorType::get()), + real_type_(real_type ? std::move(real_type) : type_), N_(std::move(N)), default_value_(std::move(default_value)), alias_info_(alias_info ? std::make_unique(std::move(*alias_info)) : nullptr), @@ -88,6 +88,8 @@ struct Argument { const TypePtr& type() const { return type_; } + // if type() is non-null, this is guaranteed to be non-null (if no real + // type was provided, this takes on type()'s value) const TypePtr& real_type() const { return real_type_; } @@ -214,6 +216,7 @@ enum struct TORCH_API SchemaArgType { input, output }; struct TORCH_API SchemaArgument { SchemaArgType type; size_t index; + SchemaArgument(SchemaArgType tpe, size_t idx) : type(tpe), index(idx) {} bool operator==(const SchemaArgument& rhs) const { return type == rhs.type && index == rhs.index; } @@ -472,6 +475,8 @@ struct TORCH_API FunctionSchema { FunctionSchema cloneWithRemappedTypes( const std::function type_map) const; + FunctionSchema cloneWithRealTypes(bool with_symint=true) const; + // Check that inputs have the correct types and appends any missing default // values. template @@ -546,19 +551,31 @@ inline std::ostream& operator<<(std::ostream& out, const Argument& arg) { // in schema, we have Tensor?(a!) input, and t(a!)?. // however, t?(a!) doesn't work with schema parser. // so we always use Type(alias)? format - auto type = arg.type(); + // real_type versus fake_type: in order to be compatible with FunctionSchema + // parser, printing an argument with either MemoryFormat or Layout type should + // give us the original schema string, hence printing out real_type. + auto type = arg.real_type(); bool is_opt = type->kind() == OptionalType::Kind; auto unopt_type = is_opt ? type->castRaw()->getElementType() : type; - if (unopt_type->kind() == ListType::Kind && arg.N()) { + if (unopt_type->kind() == ListType::Kind) { // sized lists get size N from arg, not type auto list = unopt_type->cast(); - out << list->getElementType()->str() << "[" << *arg.N() << "]"; + out << list->getElementType()->str(); + if (arg.alias_info() && !arg.alias_info()->containedTypes().empty()){ + out << arg.alias_info()->containedTypes()[0]; + } + std::string N = ""; + if (arg.N()) { + N = std::to_string(*arg.N()); + } + out << "[" << N << "]"; } else { out << unopt_type->str(); } - if (arg.alias_info()) { + // print alias info if it has beforeSets. 
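// A hypothetical restatement of the SymInt check inside cloneWithRealTypes above: an
// argument keeps its fake type whenever its real type is SymInt-flavoured. The exact
// type set below is an assumption, and (per the comment above) any such predicate must
// stay in sync with unpackSymInt in KernelFunction_impl.h:
//
//   #include <ATen/core/jit_type.h>
//   #include <c10/core/SymInt.h>
//   #include <c10/util/Optional.h>
//
//   // Hypothetical helper, not the actual lambda.
//   inline bool looks_like_symint(const c10::TypePtr& t) {
//     return *t == *c10::getTypePtr<c10::SymInt>() ||
//            *t == *c10::getTypePtr<c10::SymIntArrayRef>() ||
//            *t == *c10::getTypePtr<c10::optional<c10::SymInt>>();
//   }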
+ if (arg.alias_info() && !arg.alias_info()->beforeSets().empty()) { out << *arg.alias_info(); } diff --git a/aten/src/ATen/core/interned_strings.h b/aten/src/ATen/core/interned_strings.h index dc5860ebf2c4..2abc6217516d 100644 --- a/aten/src/ATen/core/interned_strings.h +++ b/aten/src/ATen/core/interned_strings.h @@ -50,8 +50,11 @@ namespace c10 { _(prim, FunctionalGraph) \ _(prim, add_optional) \ _(prim, view_copy) \ + _(prim, permute_copy) \ _(prim, reshape_copy) \ _(prim, squeeze_copy) \ + _(prim, t_copy) \ + _(prim, transpose_copy) \ _(prim, unsqueeze_copy) \ _(prim, flatten_copy) \ _(prim, expand_copy) \ @@ -236,6 +239,7 @@ namespace c10 { _(onnx, LSTM) \ _(onnx, MatMul) \ _(onnx, Min) \ + _(onnx, Max) \ _(onnx, Mul) \ _(onnx, Pow) \ _(onnx, RNN) \ diff --git a/aten/src/ATen/core/ivalue.cpp b/aten/src/ATen/core/ivalue.cpp index eb977f09cbe6..304ff8cf3f5c 100644 --- a/aten/src/ATen/core/ivalue.cpp +++ b/aten/src/ATen/core/ivalue.cpp @@ -93,6 +93,8 @@ c10::TypePtr IValue::TagType::get(const IValue& v) { return IntType::get(); case Tag::SymInt: return c10::SymIntType::get(); + case Tag::SymFloat: + return c10::SymFloatType::get(); case Tag::Bool: return BoolType::get(); case Tag::String: @@ -302,6 +304,10 @@ IValue IValue::equals(const IValue& rhs) const { return rhs.isInt() && lhs.toInt() == rhs.toInt(); case Tag::SymInt: return rhs.isSymInt() && lhs.toSymInt() == rhs.toSymInt(); + case Tag::SymFloat: + // NB: this doesn't actually work as sym floats don't have equality + // defined + return rhs.isSymFloat() && lhs.toSymFloat() == rhs.toSymFloat(); case Tag::Bool: return rhs.isBool() && lhs.toBool() == rhs.toBool(); case Tag::String: @@ -353,8 +359,11 @@ size_t IValue::hash(const IValue& v) { return c10::get_hash(v.payload.u.as_int); case Tag::Int: return c10::get_hash(v.payload.u.as_int); + // NB: these are technically strict aliasing violations case Tag::SymInt: return c10::get_hash(v.payload.u.as_int); + case Tag::SymFloat: + return c10::get_hash(v.payload.u.as_int); case Tag::String: return c10::get_hash(v.toStringRef()); case Tag::Tuple: @@ -584,6 +593,8 @@ std::ostream& IValue::repr( return out << v.toInt(); case IValue::Tag::SymInt: return out << v.toSymInt(); + case IValue::Tag::SymFloat: + return out << v.toSymFloat(); case IValue::Tag::Bool: return out << (v.toBool() ? "True" : "False"); case IValue::Tag::Tuple: { @@ -772,6 +783,8 @@ std::ostream& operator<<(std::ostream & out, const IValue & v) { return out << v.toInt(); case IValue::Tag::SymInt: return out << v.toSymInt(); + case IValue::Tag::SymFloat: + return out << v.toSymFloat(); case IValue::Tag::Bool: return out << (v.toBool() ? 
"True" : "False"); case IValue::Tag::Tuple: { @@ -906,6 +919,7 @@ IValue IValue::deepcopy( case IValue::Tag::Double: case IValue::Tag::Int: case IValue::Tag::SymInt: + case IValue::Tag::SymFloat: case IValue::Tag::Bool: case IValue::Tag::Device: case IValue::Tag::Uninitialized: { diff --git a/aten/src/ATen/core/ivalue.h b/aten/src/ATen/core/ivalue.h index 8d0199b3c954..3461fe2300e4 100644 --- a/aten/src/ATen/core/ivalue.h +++ b/aten/src/ATen/core/ivalue.h @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -27,6 +28,8 @@ template class Dict; template class List; +template +class IListRef; struct IValue; struct ClassType; struct Type; @@ -145,6 +148,7 @@ struct Capsule { _(ComplexDouble) \ _(Int) \ _(SymInt) \ + _(SymFloat) \ _(Bool) \ _(Tuple) \ _(String) \ @@ -421,6 +425,7 @@ struct TORCH_API IValue final { at::Tensor& toTensor() &; const at::Tensor& toTensor() const&; at::TensorImpl* unsafeToTensorImpl() const { + TORCH_INTERNAL_ASSERT(isTensor()); return payload.as_tensor.unsafeGetTensorImpl(); } @@ -558,21 +563,35 @@ struct TORCH_API IValue final { IValue(c10::SymInt i) { if (i.is_symbolic()) { tag = Tag::SymInt; - payload.u.as_intrusive_ptr = i.toSymIntNodeImpl().release(); + payload.u.as_intrusive_ptr = i.toSymNodeImpl().release(); } else { tag = Tag::Int; payload.u.as_int = i.as_int_unchecked(); } } - IValue(c10::SymIntArrayRef v); - bool isSymInt() const { return Tag::SymInt == tag; } c10::SymInt toSymInt() const; + IValue(c10::SymFloat i) { + if (i.is_symbolic()) { + tag = Tag::SymFloat; + payload.u.as_intrusive_ptr = i.toSymNodeImpl().release(); + } else { + tag = Tag::Double; + payload.u.as_double = i.as_float_unchecked(); + } + } + + bool isSymFloat() const { + return Tag::SymFloat == tag; + } + + c10::SymFloat toSymFloat() const; + // allow you to pass literals (3, 4) without ambiguity IValue(int32_t i) : IValue(static_cast(i)) {} @@ -665,22 +684,59 @@ struct TORCH_API IValue final { c10::ArrayRef toListRef() const; // Some template constructors of IValue calls another constructor recursively. - // This SNIFAEs the called constructor exists. + // This SFINAEs the called constructor exists. template using enable_if_ivalue_constructible = std::enable_if_t::value, std::nullptr_t>; - template = nullptr> + // The rule for lists is more complicated; the generic constructor is only + // acceptable if your element isn't SymInt. If you do have a SymInt element, + // then you must also, at construction time, check if you can decay the list + // into an int list (this is MANDATORY, as at a use site we may expect + // toIntList to work even if at the call site you had a SymIntArrayRef + // argument). In practice, only SymIntArrayRef is used this way, so we + // didn't bother making it work for the other constructors, we just make sure + // they're not selectable. + template + using enable_if_list_is_ivalue_constructible = + std::enable_if_t::value && + !std::is_same::value, std::nullptr_t>; + + template = nullptr> IValue(c10::List&& v); - template = nullptr> + template = nullptr> IValue(const c10::List& v); - template = nullptr> + template = nullptr> IValue(at::ArrayRef v); - template = nullptr> + template = nullptr> IValue(const std::vector& v); template IValue(std::array v); + // Manual constructors for lists of symints, which decay to int list if + // possible. 
To avoid ambiguous overload situations, we template them + // to prevent implicit conversions + template + using enable_if_symint = + std::enable_if_t::value, std::nullptr_t>; + + template = nullptr> + IValue(at::ArrayRef v); + template = nullptr> + IValue(at::OptionalArrayRef v); + template = nullptr> + IValue(const std::vector& v); + + template + using enable_if_ilist_is_ivalue_constructible = std::enable_if_t< + std::is_constructible::value && + std::is_constructible::boxed_type>::value && + !std::is_same::value, + std::nullptr_t>; + + template = nullptr> + IValue(c10::IListRef v); + // GenericDict IValue(c10::Dict v); bool isGenericDict() const { @@ -702,7 +758,7 @@ struct TORCH_API IValue final { template = nullptr> IValue(c10::optional v); - template = nullptr> + template = nullptr> IValue(c10::OptionalArrayRef v); IValue(c10::nullopt_t); @@ -753,7 +809,15 @@ struct TORCH_API IValue final { // Scalar, which gets encoded as either an Int, a Double or a ComplexDouble IValue(const at::Scalar& s) : IValue() { - if (s.isFloatingPoint()) { + // NB: do the symbolic versions first, as isFloatingPoint is true + // for both SymFloat and double + if (s.isSymInt()) { + tag = Tag::SymInt; + payload.u.as_intrusive_ptr = s.toSymInt().toSymNodeImpl().release(); + } else if (s.isSymFloat()) { + tag = Tag::SymFloat; + payload.u.as_intrusive_ptr = s.toSymFloat().toSymNodeImpl().release(); + } else if (s.isFloatingPoint()) { tag = Tag::Double; payload.u.as_double = s.toDouble(); } else if (s.isComplex()) { @@ -769,7 +833,7 @@ struct TORCH_API IValue final { } bool isScalar() const { - return isDouble() || isInt() || isComplexDouble() || isBool(); + return isDouble() || isInt() || isComplexDouble() || isBool() || isSymInt() || isSymFloat(); } at::Scalar toScalar() const { @@ -781,6 +845,10 @@ struct TORCH_API IValue final { return toComplexDouble(); else if (isBool()) return toBool(); + else if (isSymInt()) + return toSymInt(); + else if (isSymFloat()) + return toSymFloat(); throw std::runtime_error("IValue is not a Scalar"); } @@ -1077,6 +1145,8 @@ struct TORCH_API IValue final { return false; case Tag::SymInt: return true; + case Tag::SymFloat: + return true; case Tag::Bool: return false; case Tag::Tuple: @@ -1126,6 +1196,7 @@ struct TORCH_API IValue final { } union Payload { + // [TriviallyCopyablePayload] // We use a nested union here so that we can make the copy easy // and efficient in the non-tensor (i.e., trivially copyable) // case. 
Specifically, we do not have to do a switch-on-tag to @@ -1339,6 +1410,10 @@ struct WeakOrStrongCompilationUnit { return strong_ptr_ != c10::nullopt; } + bool holdingEmptyStrongRef() const { + return holdingStrongRef() && *strong_ptr_ == nullptr; + } + c10::optional> strong_ptr_; c10::optional> weak_ptr_; }; @@ -1362,9 +1437,14 @@ struct TORCH_API WeakOrStrongTypePtr { WeakOrStrongCompilationUnit cu_; TypePtr type_; + bool holds_strong_ref() const { return cu_.holdingStrongRef(); } + + bool holds_empty_strong_ref() const { + return cu_.holdingEmptyStrongRef(); + } }; diff --git a/aten/src/ATen/core/ivalue_inl.h b/aten/src/ATen/core/ivalue_inl.h index 00361c80a01c..bea795c8d81e 100644 --- a/aten/src/ATen/core/ivalue_inl.h +++ b/aten/src/ATen/core/ivalue_inl.h @@ -7,6 +7,7 @@ #include #include +#include #include #include #include @@ -218,12 +219,21 @@ inline at::Generator IValue::toGenerator() const& { inline c10::SymInt IValue::toSymInt() const { AT_ASSERT(isSymInt() || isInt(), "Expected SymInt or int but got ", tagKind()); if (isSymInt()) { - return c10::SymInt::toSymInt(toIntrusivePtr()); + return c10::SymInt(toIntrusivePtr()); } else { return c10::SymInt(payload.u.as_int); } } +inline c10::SymFloat IValue::toSymFloat() const { + AT_ASSERT(isSymFloat() || isDouble(), "Expected SymFloat or double but got ", tagKind()); + if (isSymFloat()) { + return c10::SymFloat(toIntrusivePtr()); + } else { + return c10::SymFloat(payload.u.as_double); + } +} + namespace ivalue { void TORCH_API @@ -1455,6 +1465,10 @@ struct C10_EXPORT ivalue::Object final : c10::intrusive_ptr_target { return !type_.holds_strong_ref(); } + bool is_empty_strong_compilation_ref() const { + return type_.holds_empty_strong_ref(); + } + private: void resizeObject(size_t slot); WeakOrStrongTypePtr type_; @@ -1594,6 +1608,7 @@ DEFINE_TO(at::QScheme, toQScheme) DEFINE_TO(at::Dimname, toDimname) DEFINE_TO(at::Generator, toGenerator) DEFINE_TO(c10::SymInt, toSymInt) +DEFINE_TO(c10::SymFloat, toSymFloat) template struct _fake_type {}; @@ -1987,11 +2002,11 @@ inline IValue::IValue(c10::impl::GenericList v) payload.u.as_intrusive_ptr = null_to_undefined_tensor(v.impl_.release()); } -template > +template > inline IValue::IValue(c10::List&& v) : IValue(impl::toList(std::move(v))) {} -template > +template > inline IValue::IValue(const c10::List& v) : IValue(impl::toList(v)) {} -template > +template > inline IValue::IValue(at::ArrayRef v) : IValue(c10::List()) { auto list = to>(); list.reserve(v.size()); @@ -1999,8 +2014,33 @@ inline IValue::IValue(at::ArrayRef v) : IValue(c10::List()) { list.push_back(e); } } -inline IValue::IValue(c10::SymIntArrayRef v) : IValue(at::ArrayRef(v.data(), v.size())) {} -template > +template > +inline IValue::IValue(at::ArrayRef v) : IValue() { + auto vi = c10::asIntArrayRefSlowOpt(v); + if (vi.has_value()) { + // This list is entirely integers; ensure it is typed as + // an IntList so toIntList works + *this = IValue(*vi); + } else { + // This list has SymInts; type it as a SymInt + *this = IValue(impl::toList(c10::List())); + auto list = to>(); + list.reserve(v.size()); + for (const auto& e : v) { + list.push_back(e); + } + } +} +template > +inline IValue::IValue(at::OptionalArrayRef mb_v) : IValue() { + if (!mb_v.has_value()) return; + *this = IValue(*mb_v); +} +template > +inline IValue::IValue(const std::vector& v) : IValue() { + *this = IValue(at::ArrayRef(v)); +} +template > inline IValue::IValue(const std::vector& v) : IValue(c10::List()) { auto list = to>(); list.reserve(v.size()); @@ -2008,7 
+2048,7 @@ inline IValue::IValue(const std::vector& v) : IValue(c10::List()) { list.push_back(e); } } -template > +template > inline IValue::IValue(c10::OptionalArrayRef v) : IValue() { if (v.has_value()) { *this = IValue(std::move(*v)); @@ -2024,6 +2064,25 @@ inline IValue::IValue(std::array v) : IValue(c10::List()) { } } +template > +inline IValue::IValue(c10::IListRef v) : IValue() { + constexpr bool boxed_type_constructs_ivalue = + std::is_constructible::boxed_type>::value; + // First, we try to use the boxed value. + // If we fail (either it's not in the boxed state, or its boxed type + // can not construct an IValue), we fallback to copying the list. + if (boxed_type_constructs_ivalue && v.isBoxed()) { + *this = IValue(impl::toList(v.toBoxed())); + } else { + c10::List list; + list.reserve(v.size()); + for (const auto& t : v) { + list.push_back(t); + } + *this = IValue(impl::toList(std::move(list))); + } +} + inline IValue::IValue(c10::impl::GenericDict v) : tag(Tag::GenericDict) { payload.u.as_intrusive_ptr = null_to_undefined_tensor(v.impl_.release()); diff --git a/aten/src/ATen/core/jit_type.h b/aten/src/ATen/core/jit_type.h index 50b27a0e8fd8..0a8f5e14d9a5 100644 --- a/aten/src/ATen/core/jit_type.h +++ b/aten/src/ATen/core/jit_type.h @@ -9,6 +9,7 @@ #include #include #include +#include #include #include @@ -1309,7 +1310,6 @@ struct TORCH_API SymIntType : public Type { return "SymInt"; } std::string annotation_str_impl(TypePrinter printer = nullptr) const override { - // TODO: will become a Union[SymIntNodeImpl|int] in the near future return "int"; } static const TypeKind Kind = TypeKind::SymIntType; @@ -1320,6 +1320,26 @@ struct TORCH_API SymIntType : public Type { SymIntType() : Type(TypeKind::SymIntType) {} }; +struct SymFloatType; +using SymFloatTypePtr = SingletonTypePtr; +struct TORCH_API SymFloatType : public Type { + bool equals(const Type& rhs) const override { + return rhs.kind() == kind(); + } + std::string str() const override { + return "SymFloat"; + } + std::string annotation_str_impl(TypePrinter printer = nullptr) const override { + return "float"; + } + static const TypeKind Kind = TypeKind::SymFloatType; + // global singleton + static SymFloatTypePtr get(); + + private: + SymFloatType() : Type(TypeKind::SymFloatType) {} +}; + struct IntType; using IntTypePtr = SingletonTypePtr; // This type represents a Python int number @@ -1738,6 +1758,13 @@ struct getTypePtr_ final { } }; +template +struct getMaybeFakeTypePtr_ final { + static decltype(auto) call() { + return getTypePtr_::call(); + } +}; + template <> struct getTypePtr_ final { static decltype(auto) call() { @@ -1783,33 +1810,35 @@ struct getTypePtr_ final { }; template <> -struct getTypePtr_ final { +struct getMaybeFakeTypePtr_ final { static decltype(auto) call() { return SymIntType::get(); } }; template <> -struct getTypePtr_ final { +struct getMaybeFakeTypePtr_ final { static decltype(auto) call() { return IntType::get(); } }; + template <> -struct getTypePtr_ final { +struct getMaybeFakeTypePtr_ final { static decltype(auto) call() { - return DeviceObjType::get(); + return SymFloatType::get(); } }; template <> -struct getTypePtr_ final { +struct getMaybeFakeTypePtr_ final { static decltype(auto) call() { - return IntType::get(); + return FloatType::get(); } }; + template <> -struct getTypePtr_ final { +struct getTypePtr_ final { static decltype(auto) call() { - return IntType::get(); + return DeviceObjType::get(); } }; template <> @@ -1855,47 +1884,55 @@ struct getTypePtr_ final { return StringType::get(); 
} }; -template -struct getTypePtr_> final { +template +struct getMaybeFakeTypePtr_, fake> final { static const auto& call() { - static auto inner_type = getTypePtr_::call(); + static auto inner_type = getMaybeFakeTypePtr_::call(); // The "per vector" static singleton needs to live in a .cpp file, // otherwise we'll end up with one singleton instance per shared library. static auto type = ListType::get("vector", inner_type); return type; } }; -template -struct getTypePtr_> final { +template +struct getMaybeFakeTypePtr_, fake> final { static const auto& call() { - static auto inner_type = getTypePtr_::call(); + static auto inner_type = getMaybeFakeTypePtr_::call(); // The "per ArrayRef" static singleton needs to live in a .cpp file, // otherwise we'll end up with one singleton instance per shared library. static auto type = ListType::get("ArrayRef", inner_type); return type; } }; -template <> -struct getTypePtr_ final { +template +struct getMaybeFakeTypePtr_ final { static const auto& call() { - static auto type = ListType::create(getTypePtr_::call()); + static auto type = ListType::create(getMaybeFakeTypePtr_::call()); return type; } }; -template -struct getTypePtr_> final { +template +struct getMaybeFakeTypePtr_, fake> final { static const auto& call() { - static auto inner_type = getTypePtr_::call(); + static auto inner_type = getMaybeFakeTypePtr_::call(); // The "per List" static singleton needs to live in a .cpp file, // otherwise we'll end up with one singleton instance per shared library. static auto type = ListType::get("List", inner_type); return type; } }; -template -struct getTypePtr_> final { +template +struct getMaybeFakeTypePtr_, fake> final { static const auto& call() { - static auto inner_type = getTypePtr_::call(); + static auto inner_type = getMaybeFakeTypePtr_::call(); + static auto type = ListType::get("List", inner_type); + return type; + } +}; +template +struct getMaybeFakeTypePtr_, fake> final { + static const auto& call() { + static auto inner_type = getMaybeFakeTypePtr_::call(); // The "per array" static singleton needs to live in a .cpp file, // otherwise we'll end up with one singleton instance per shared library. // (Concatenating the length onto the end of the string because we want a unique @@ -1904,22 +1941,22 @@ struct getTypePtr_> final { return type; } }; -template -struct getTypePtr_> final { +template +struct getMaybeFakeTypePtr_, fake> final { static const auto& call() { - static auto inner_key_type = getTypePtr_::call(); - static auto inner_val_type = getTypePtr_::call(); + static auto inner_key_type = getMaybeFakeTypePtr_::call(); + static auto inner_val_type = getMaybeFakeTypePtr_::call(); // The "per unordered_map" static singleton needs to live in a .cpp file, // otherwise we'll end up with one singleton instance per shared library. static auto type = DictType::get("unordered_map", inner_key_type, inner_val_type); return type; } }; -template -struct getTypePtr_> final { +template +struct getMaybeFakeTypePtr_, fake> final { static const auto& call() { - static auto inner_key_type = getTypePtr_::call(); - static auto inner_val_type = getTypePtr_::call(); + static auto inner_key_type = getMaybeFakeTypePtr_::call(); + static auto inner_val_type = getMaybeFakeTypePtr_::call(); // The "per Dict" static singleton needs to live in a .cpp file, // otherwise we'll end up with one singleton instance per shared library. 
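// Hedged illustration of the fake/real split being wired up here: the fakeness flag
// chosen at the top level propagates into container element types. getFakeTypePtr is
// added a few hunks below; the printed strings in the comments are assumptions:
//
//   #include <ATen/core/jit_type.h>
//   #include <c10/core/SymInt.h>
//   #include <vector>
//
//   void fake_vs_real_types() {
//     auto real_scalar = c10::getTypePtr<c10::SymInt>();                // real: "SymInt"
//     auto fake_scalar = c10::getFakeTypePtr<c10::SymInt>();            // fake: "int"
//     auto real_list   = c10::getTypePtr<std::vector<c10::SymInt>>();   // real: "SymInt[]"
//     auto fake_list   = c10::getFakeTypePtr<std::vector<c10::SymInt>>(); // fake: "int[]"
//     (void)real_scalar; (void)fake_scalar; (void)real_list; (void)fake_list;
//   }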
static auto type = DictType::get("Dict", inner_key_type, inner_val_type); @@ -1927,10 +1964,10 @@ struct getTypePtr_> final { } }; -template -struct getTypePtr_> final { +template +struct getMaybeFakeTypePtr_, fake> final { static const auto& call() { - static auto inner_type = getTypePtr_::call(); + static auto inner_type = getMaybeFakeTypePtr_::call(); // The "per optional" static singleton needs to live in a .cpp file, // otherwise we'll end up with one singleton instance per shared library. static auto type = OptionalType::get(inner_type); @@ -1942,17 +1979,31 @@ struct getTypePtr_> final { template<> struct getTypePtr_ final { static const auto& call() { - static auto type = OptionalType::create(getTypePtr_::call()); + static auto inner_type = getMaybeFakeTypePtr_::call(); + // The "per optional" static singleton needs to live in a .cpp file, + // otherwise we'll end up with one singleton instance per shared library. + static auto type = OptionalType::get(inner_type); return type; } }; -template -struct getTypePtr_> final { +template +struct getMaybeFakeTypePtr_ final { + static const auto& call() { + // The "per optional" static singleton needs to live in a .cpp file, + // otherwise we'll end up with one singleton instance per shared library. + static auto inner_type = getMaybeFakeTypePtr_::call(); + static auto type = OptionalType::get(inner_type); + return type; + } +}; + +template +struct getMaybeFakeTypePtr_, fake> final { static const auto& call() { static auto type = ([]() { std::vector contained_types = { - (getTypePtr_::call())... + (getMaybeFakeTypePtr_::call())... }; return TupleType::create(std::move(contained_types)); })(); @@ -1970,7 +2021,7 @@ template inline decltype(auto) getTypePtr() { // TODO: static_assert that a templated function exists, and throw a friendly // error message if not - return detail::getTypePtr_::call(); + return detail::getMaybeFakeTypePtr_::call(); } template @@ -1980,6 +2031,16 @@ inline TypePtr getTypePtrCopy() { return getTypePtr(); } +template +inline decltype(auto) getFakeTypePtr() { + return detail::getMaybeFakeTypePtr_::call(); +} + +template +inline TypePtr getFakeTypePtrCopy() { + return getFakeTypePtr(); +} + using TypeEnv = std::unordered_map; struct MatchTypeReturn { MatchTypeReturn(std::string reason) : reason_(std::move(reason)) {} @@ -2109,7 +2170,7 @@ struct MemoryFormatType; using MemoryFormatTypePtr = SingletonTypePtr; struct TORCH_API MemoryFormatType : public EnumerationType { std::string str() const override { -return "MemoryFormatType"; +return "MemoryFormat"; } static const TypeKind Kind = TypeKind::MemoryFormatType; // global singleton @@ -2123,7 +2184,7 @@ struct LayoutType; using LayoutTypePtr = SingletonTypePtr; struct TORCH_API LayoutType : public EnumerationType { std::string str() const override { -return "LayoutType"; +return "Layout"; } static const TypeKind Kind = TypeKind::LayoutType; // global singleton @@ -2133,6 +2194,45 @@ static LayoutTypePtr get(); LayoutType() : EnumerationType() {} }; +namespace detail { +template <> +struct getMaybeFakeTypePtr_ final { + static decltype(auto) call() { + return ScalarTypeType::get(); + } +}; +template <> +struct getMaybeFakeTypePtr_ final { + static decltype(auto) call() { + return LayoutType::get(); + } +}; +template <> +struct getMaybeFakeTypePtr_ final { + static decltype(auto) call() { + return MemoryFormatType::get(); + } +}; +template <> +struct getMaybeFakeTypePtr_ final { + static decltype(auto) call() { + return IntType::get(); + } +}; +template <> +struct 
getMaybeFakeTypePtr_ final { + static decltype(auto) call() { + return IntType::get(); + } +}; +template <> +struct getMaybeFakeTypePtr_ final { + static decltype(auto) call() { + return IntType::get(); + } +}; +} // namespace detail + // the common supertype of all lists, // List[T] <: AnyList for all T struct AnyListType; diff --git a/aten/src/ATen/core/jit_type_base.h b/aten/src/ATen/core/jit_type_base.h index 6fee9fe0a113..beb553eb935a 100644 --- a/aten/src/ATen/core/jit_type_base.h +++ b/aten/src/ATen/core/jit_type_base.h @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -52,6 +53,7 @@ namespace c10 { _(AnyTupleType) \ _(AnyClassType) \ _(SymIntType) \ + _(SymFloatType) \ _(UnionType) \ _(DynamicType) diff --git a/aten/src/ATen/core/library.cpp b/aten/src/ATen/core/library.cpp index 5c9cea05ea76..965d3f243d01 100644 --- a/aten/src/ATen/core/library.cpp +++ b/aten/src/ATen/core/library.cpp @@ -89,7 +89,7 @@ Library::Library(Kind kind, std::string ns, c10::optional k, c // merge everything #define DEF_PRELUDE "def(\"", schema.operator_name(), "\"): " -Library& Library::_def(c10::FunctionSchema&& schema, c10::OperatorName* out_name, const std::vector& tags) & { +Library& Library::_def(c10::FunctionSchema&& schema, c10::OperatorName* out_name, const std::vector& tags, _RegisterOrVerify rv) & { TORCH_CHECK(kind_ == DEF || kind_ == FRAGMENT, DEF_PRELUDE, "Cannot define an operator inside of a ", toString(kind_), " block. " @@ -125,13 +125,20 @@ Library& Library::_def(c10::FunctionSchema&& schema, c10::OperatorName* out_name if (out_name) { *out_name = schema.operator_name(); // copy! } - registrars_.emplace_back( - c10::Dispatcher::singleton().registerDef( - std::move(schema), - debugString(file_, line_), - tags - ) - ); + switch (rv) { + case _RegisterOrVerify::REGISTER: + registrars_.emplace_back( + c10::Dispatcher::singleton().registerDef( + std::move(schema), + debugString(file_, line_), + tags + ) + ); + break; + case _RegisterOrVerify::VERIFY: + c10::Dispatcher::singleton().waitForDef(schema); + break; + } return *this; } #undef DEF_PRELUDE @@ -174,11 +181,10 @@ Library& Library::_def(c10::either&& nam } #define IMPL_PRELUDE "impl(\"", name_str, "\", ...): " -Library& Library::_impl(const char* name_str, CppFunction&& f) & { +at::OperatorName Library::_parseNameForLib(const char* name_str) const { auto name = torch::jit::parseName(name_str); auto ns_opt = name.getNamespace(); - // This is kind of similar to the checking in def(), but the error - // messages are a little different for this call site + // This is a copy paste of Library::_impl if (ns_opt.has_value()) { // See Note [Redundancy in registration code is OK] TORCH_CHECK(*ns_opt == *ns_, @@ -193,6 +199,11 @@ Library& Library::_impl(const char* name_str, CppFunction&& f) & { bool b = name.setNamespaceIfNotSet(ns_->c_str()); TORCH_INTERNAL_ASSERT(b, ERROR_CONTEXT); } + return name; +} + +Library& Library::_impl(const char* name_str, CppFunction&& f, _RegisterOrVerify rv) & { + at::OperatorName name = _parseNameForLib(name_str); // See Note [Redundancy in registration code is OK] TORCH_CHECK(!(f.dispatch_key_.has_value() && dispatch_key_.has_value() && @@ -205,19 +216,30 @@ Library& Library::_impl(const char* name_str, CppFunction&& f) & { ERROR_CONTEXT ); auto dispatch_key = f.dispatch_key_.has_value() ? 
f.dispatch_key_ : dispatch_key_; - registrars_.emplace_back( - c10::Dispatcher::singleton().registerImpl( - std::move(name), - dispatch_key, - std::move(f.func_), - // NOLINTNEXTLINE(performance-move-const-arg) - std::move(f.cpp_signature_), - std::move(f.schema_), - debugString(std::move(f.debug_), file_, line_) - ) - ); + switch (rv) { + case _RegisterOrVerify::REGISTER: + registrars_.emplace_back( + c10::Dispatcher::singleton().registerImpl( + std::move(name), + dispatch_key, + std::move(f.func_), + // NOLINTNEXTLINE(performance-move-const-arg) + std::move(f.cpp_signature_), + std::move(f.schema_), + debugString(std::move(f.debug_), file_, line_) + ) + ); + break; + case _RegisterOrVerify::VERIFY: + c10::Dispatcher::singleton().waitForImpl(name, dispatch_key); + break; + } return *this; } + +c10::OperatorName Library::_resolve(const char* name_str) const { + return _parseNameForLib(name_str); +} #undef IMPL_PRELUDE Library& Library::_fallback(CppFunction&& f) & { diff --git a/aten/src/ATen/core/op_registration/adaption.h b/aten/src/ATen/core/op_registration/adaption.h index 5bf1b691ebad..3112a206bb4e 100644 --- a/aten/src/ATen/core/op_registration/adaption.h +++ b/aten/src/ATen/core/op_registration/adaption.h @@ -68,7 +68,7 @@ inline void check_and_update_common_device(optional& common_device, cons } } -inline void check_and_update_common_device(optional& common_device, at::TensorList tensors, at::CheckedFrom methodName, at::CheckedFrom argName) { +inline void check_and_update_common_device(optional& common_device, at::ITensorListRef tensors, at::CheckedFrom methodName, at::CheckedFrom argName) { for (const auto& tensor : tensors) { check_and_update_common_device(common_device, tensor, methodName, argName); } diff --git a/aten/src/ATen/core/op_registration/infer_schema.cpp b/aten/src/ATen/core/op_registration/infer_schema.cpp index df1925aba5ed..e9e93a2556e0 100644 --- a/aten/src/ATen/core/op_registration/infer_schema.cpp +++ b/aten/src/ATen/core/op_registration/infer_schema.cpp @@ -23,7 +23,7 @@ std::vector createArgumentVector(c10::ArrayRef args) { result.reserve(args.size()); for (const auto i : c10::irange(args.size())) { // Arguments are named "_" - result.emplace_back(fastToString(i), (*args[i].getTypeFn)()); + result.emplace_back(fastToString(i), (*args[i].getFakeTypeFn)(), (*args[i].getTypeFn)()); } return result; } diff --git a/aten/src/ATen/core/op_registration/infer_schema.h b/aten/src/ATen/core/op_registration/infer_schema.h index 7539cd59cac9..2938e2a8d564 100644 --- a/aten/src/ATen/core/op_registration/infer_schema.h +++ b/aten/src/ATen/core/op_registration/infer_schema.h @@ -22,8 +22,9 @@ namespace infer_schema { struct ArgumentDef final { using GetTypeFn = TypePtr(); GetTypeFn* getTypeFn; - constexpr ArgumentDef(): getTypeFn(nullptr) {} - explicit constexpr ArgumentDef(GetTypeFn *getTypeFn): getTypeFn(getTypeFn) {} + GetTypeFn* getFakeTypeFn; + constexpr ArgumentDef(): getTypeFn(nullptr), getFakeTypeFn(nullptr) {} + explicit constexpr ArgumentDef(GetTypeFn *getTypeFn, GetTypeFn *getFakeTypeFn): getTypeFn(getTypeFn), getFakeTypeFn(getFakeTypeFn) {} }; template @@ -52,7 +53,8 @@ constexpr std::array createArgumentVectorFromTypes(s checkStaticTypes(), // Create the return value - std::array{ArgumentDef(&getTypePtrCopy>)...} + std::array{ + ArgumentDef(&getTypePtrCopy>, &getFakeTypePtrCopy>)...} ); } diff --git a/aten/src/ATen/core/op_registration/op_registration_test.cpp b/aten/src/ATen/core/op_registration/op_registration_test.cpp index 05294c25548e..ade5da971172 100644 
--- a/aten/src/ATen/core/op_registration/op_registration_test.cpp +++ b/aten/src/ATen/core/op_registration/op_registration_test.cpp @@ -418,7 +418,7 @@ TEST(OperatorRegistrationTest, whenRegisteringMismatchingKernelsInSameOpCall_the } void backend_fallback_kernel(const c10::OperatorHandle& op, c10::Stack* stack) { - (*stack)[1] = (*stack)[1].toString()->string() + op.schema().name(); + (*stack)[1] = (*stack)[1].toStringRef() + op.schema().name(); } TEST(OperatorRegistrationTest, whenRegisteringBackendFallbackKernel_thenCanBeCalled) { @@ -428,7 +428,7 @@ TEST(OperatorRegistrationTest, whenRegisteringBackendFallbackKernel_thenCanBeCal auto op = Dispatcher::singleton().findSchema({"_test::dummy", ""}); ASSERT_TRUE(op.has_value()); auto stack = callOp(*op, dummyTensor(c10::DispatchKey::CPU), "hello "); - EXPECT_EQ("hello _test::dummy", stack[1].toString()->string()); + EXPECT_EQ("hello _test::dummy", stack[1].toStringRef()); } TEST(OperatorRegistrationTest, whenRegisteringBackendFallbackKernelForWrongBackend_thenCannotBeCalled) { @@ -472,7 +472,7 @@ TEST(OperatorRegistrationTest, whenRegisteringBackendFallbackKernelAndRegularKer called = false; auto stack = callOp(*op, dummyTensor(c10::DispatchKey::CPU), "hello "); EXPECT_FALSE(called); - EXPECT_EQ("hello _test::dummy", stack[1].toString()->string()); + EXPECT_EQ("hello _test::dummy", stack[1].toStringRef()); } TEST(OperatorRegistrationTest, whenRegisteringBackendFallbackKernelAndRegularKernelForSameBackend_thenCallsRegularKernel) { @@ -875,7 +875,7 @@ TEST(OperatorRegistrationTest, testAvailableArgTypes) { "(bool a) -> bool"); testArgTypes::test( "string1", [] (const std::string& v) {EXPECT_EQ("string1", v);}, - "string2", [] (const IValue& v) {EXPECT_EQ("string2", v.toString()->string());}, + "string2", [] (const IValue& v) {EXPECT_EQ("string2", v.toStringRef());}, "(str a) -> str"); testArgTypes::test( dummyTensor(c10::DispatchKey::CPU), [] (const Tensor& v) {EXPECT_EQ(c10::DispatchKey::CPU, extractDispatchKey(v));}, @@ -902,7 +902,7 @@ TEST(OperatorRegistrationTest, testAvailableArgTypes) { "(bool? a) -> bool?"); testArgTypes>::test( c10::optional("string1"), [] (const c10::optional& v) {EXPECT_EQ("string1", v.value());}, - c10::optional("string2"), [] (const IValue& v) {EXPECT_EQ("string2", v.toString()->string());}, + c10::optional("string2"), [] (const IValue& v) {EXPECT_EQ("string2", v.toStringRef());}, "(str? 
a) -> str?"); testArgTypes>::test( c10::optional(dummyTensor(c10::DispatchKey::CPU)), [] (const c10::optional& v) {EXPECT_EQ(c10::DispatchKey::CPU, extractDispatchKey(v.value()));}, @@ -1939,7 +1939,7 @@ TEST(NewOperatorRegistrationTest, fallback) { auto op = Dispatcher::singleton().findSchema({"_test::dummy", ""}); ASSERT_TRUE(op.has_value()); auto stack = callOp(*op, dummyTensor(c10::DispatchKey::CPU), "hello "); - EXPECT_EQ("hello _test::dummy", stack[1].toString()->string()); + EXPECT_EQ("hello _test::dummy", stack[1].toStringRef()); } TEST(NewOperatorRegistrationTest, BackendSelectRedispatchesToCPU) { diff --git a/aten/src/ATen/core/type.cpp b/aten/src/ATen/core/type.cpp index 00e4ceffc156..1c291da06c91 100644 --- a/aten/src/ATen/core/type.cpp +++ b/aten/src/ATen/core/type.cpp @@ -329,6 +329,11 @@ SymIntTypePtr SymIntType::get() { return value; } +SymFloatTypePtr SymFloatType::get() { + static SymFloatTypePtr value(new SymFloatType()); + return value; +} + c10::optional unifyTypesImpl(const TypePtr& t1, const TypePtr& t2, bool default_to_union=false, TypePtr type_hint=nullptr) { // check direct subtyping relation if (t1->isSubtypeOf(*t2)) { diff --git a/aten/src/ATen/cpp_custom_type_hack.h b/aten/src/ATen/cpp_custom_type_hack.h index ff5310b80968..75b900c0d694 100644 --- a/aten/src/ATen/cpp_custom_type_hack.h +++ b/aten/src/ATen/cpp_custom_type_hack.h @@ -48,8 +48,14 @@ // STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP // STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP -#include #include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif namespace at { namespace cpp_custom_type_hack { diff --git a/aten/src/ATen/cpu/vec/vec256/vec256.h b/aten/src/ATen/cpu/vec/vec256/vec256.h index 98ec588137ce..d0a8cb03604a 100644 --- a/aten/src/ATen/cpu/vec/vec256/vec256.h +++ b/aten/src/ATen/cpu/vec/vec256/vec256.h @@ -222,6 +222,51 @@ inline deinterleave2(const Vectorized& a, const Vectorized& _mm256_permute2f128_ps(a_grouped, b_grouped, 0b0110001)); // 1, 3. 
4 bits apart } +// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ FLIP ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +template<> +inline Vectorized flip(const Vectorized & v) { + const __m256i mask_float = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7); + return _mm256_permutevar8x32_ps(v, mask_float); +} + +template<> +inline Vectorized flip(const Vectorized & v) { + return _mm256_permute4x64_pd(v, 27); // 27 == _MM_SHUFFLE(0, 1, 2, 3) +} + +template<> +inline Vectorized flip(const Vectorized & v) { + return _mm256_permute4x64_epi64(v, 27); // 27 == _MM_SHUFFLE(0, 1, 2, 3) +} + +template<> +inline Vectorized flip(const Vectorized & v) { + const __m256i mask_int32 = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7); + return _mm256_permutevar8x32_epi32(v, mask_int32); +} + +template<> +inline Vectorized flip(const Vectorized & v) { + const __m256i mask = _mm256_set_epi8( + 1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14, + 1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14 + ); + auto reversed = _mm256_shuffle_epi8(v, mask); + return _mm256_permute2x128_si256(reversed, reversed, 1); +} + +template<> +inline Vectorized flip(const Vectorized & v) { + const __m256i mask_int8 = _mm256_set_epi8( + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 + ); + auto reversed = _mm256_shuffle_epi8(v, mask_int8); + return _mm256_permute2x128_si256(reversed, reversed, 1); +} + + #endif // (defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER) }}} diff --git a/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h b/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h index c64e3e589905..15d8ac269e3d 100644 --- a/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h +++ b/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h @@ -698,6 +698,43 @@ inline void convert(const BFloat16* src, BFloat16* dst, int64_t n) { } } +template <> +inline void convert(const float* src, BFloat16* dst, int64_t n) { + int64_t i; + for (i = 0; i + Vectorized::size() <= n; i += Vectorized::size()) { + __m256 a = _mm256_loadu_ps(&src[i]); + __m256 b = _mm256_loadu_ps(&src[i + 8]); + + __m256i bf = cvtfp32_bf16(a, b); + _mm256_storeu_si256(reinterpret_cast<__m256i*>(&dst[i]), bf); + } + for (; i < n; i++) { + dst[i] = c10::convert(src[i]); + } +} + +template <> +inline void convert(const double* src, BFloat16* dst, int64_t n) { + auto load_float = [](const double *src) -> __m256 { + // Load one float vector from an array of doubles + __m128 a = _mm256_cvtpd_ps(_mm256_loadu_pd(src)); + __m128 b = _mm256_cvtpd_ps(_mm256_loadu_pd(src + 4)); + return _mm256_insertf128_ps(_mm256_castps128_ps256(a), b, 1); + }; + + int64_t i; + for (i = 0; i + Vectorized::size() <= n; i += Vectorized::size()) { + __m256 a = load_float(&src[i]); + __m256 b = load_float(&src[i + 8]); + + __m256i bf = cvtfp32_bf16(a, b); + _mm256_storeu_si256(reinterpret_cast<__m256i*>(&dst[i]), bf); + } + for (; i < n; i++) { + dst[i] = c10::convert(src[i]); + } +} + template <> Vectorized inline fmadd(const Vectorized& a, const Vectorized& b, const Vectorized& c) { diff --git a/aten/src/ATen/cpu/vec/vec256/vec256_double.h b/aten/src/ATen/cpu/vec/vec256/vec256_double.h index 138daf3f588a..7956ff24966a 100644 --- a/aten/src/ATen/cpu/vec/vec256/vec256_double.h +++ b/aten/src/ATen/cpu/vec/vec256/vec256_double.h @@ -412,6 +412,11 @@ template <> Vectorized inline fmadd(const Vectorized& a, const Vectorized& b, const Vectorized& c) { return _mm256_fmadd_pd(a, b, c); } + +template <> +Vectorized inline fmsub(const Vectorized& a, const Vectorized& b, const Vectorized& c) { + return 
_mm256_fmsub_pd(a, b, c); +} #endif #endif diff --git a/aten/src/ATen/cpu/vec/vec256/vec256_float.h b/aten/src/ATen/cpu/vec/vec256/vec256_float.h index 6981676c92c8..440875e59de9 100644 --- a/aten/src/ATen/cpu/vec/vec256/vec256_float.h +++ b/aten/src/ATen/cpu/vec/vec256/vec256_float.h @@ -419,6 +419,11 @@ Vectorized inline fmadd(const Vectorized& a, const Vectorized +Vectorized inline fmsub(const Vectorized& a, const Vectorized& b, const Vectorized& c) { + return _mm256_fmsub_ps(a, b, c); +} + #endif }}} diff --git a/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h b/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h index cbd349083636..76cc7ba3f59c 100644 --- a/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h +++ b/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h @@ -827,6 +827,13 @@ Vectorized inline fmadd(const Vectorized& a, const Vectorized(r0, r1); } +template <> +Vectorized inline fmsub(const Vectorized& a, const Vectorized& b, const Vectorized& c) { + float32x4_t r0 = vfmsq_f32(c.get_low(), a.get_low(), b.get_low()); + float32x4_t r1 = vfmsq_f32(c.get_high(), a.get_high(), b.get_high()); + return Vectorized(r0, r1); +} + #endif /* defined(aarch64) */ }}} diff --git a/aten/src/ATen/cpu/vec/vec256/vec256_int.h b/aten/src/ATen/cpu/vec/vec256/vec256_int.h index 0cc36d590019..391baeb8b6a3 100644 --- a/aten/src/ATen/cpu/vec/vec256/vec256_int.h +++ b/aten/src/ATen/cpu/vec/vec256/vec256_int.h @@ -745,6 +745,257 @@ class Vectorized : public Vectorizedi { Vectorized le(const Vectorized& other) const; }; +template <> +class Vectorized : public Vectorizedi { +private: + static const Vectorized ones; +public: + using value_type = uint8_t; + static constexpr int size() { + return 32; + } + using Vectorizedi::Vectorizedi; + Vectorized() {} + Vectorized(uint8_t v) { values = _mm256_set1_epi8(v); } + Vectorized(uint8_t val1, uint8_t val2, uint8_t val3, uint8_t val4, + uint8_t val5, uint8_t val6, uint8_t val7, uint8_t val8, + uint8_t val9, uint8_t val10, uint8_t val11, uint8_t val12, + uint8_t val13, uint8_t val14, uint8_t val15, uint8_t val16, + uint8_t val17, uint8_t val18, uint8_t val19, uint8_t val20, + uint8_t val21, uint8_t val22, uint8_t val23, uint8_t val24, + uint8_t val25, uint8_t val26, uint8_t val27, uint8_t val28, + uint8_t val29, uint8_t val30, uint8_t val31, uint8_t val32) { + values = _mm256_setr_epi8(val1, val2, val3, val4, val5, val6, val7, val8, + val9, val10, val11, val12, val13, val14, val15, val16, + val17, val18, val19, val20, val21, val22, val23, val24, + val25, val26, val27, val28, val29, val30, val31, val32); + } + template + static Vectorized blend(Vectorized a, Vectorized b) { + __at_align__ uint8_t tmp_values[size()]; + a.store(tmp_values); + if (mask & 0x01) + tmp_values[0] = _mm256_extract_epi8(b.values, 0); + if (mask & 0x02) + tmp_values[1] = _mm256_extract_epi8(b.values, 1); + if (mask & 0x04) + tmp_values[2] = _mm256_extract_epi8(b.values, 2); + if (mask & 0x08) + tmp_values[3] = _mm256_extract_epi8(b.values, 3); + if (mask & 0x10) + tmp_values[4] = _mm256_extract_epi8(b.values, 4); + if (mask & 0x20) + tmp_values[5] = _mm256_extract_epi8(b.values, 5); + if (mask & 0x40) + tmp_values[6] = _mm256_extract_epi8(b.values, 6); + if (mask & 0x80) + tmp_values[7] = _mm256_extract_epi8(b.values, 7); + if (mask & 0x100) + tmp_values[8] = _mm256_extract_epi8(b.values, 8); + if (mask & 0x200) + tmp_values[9] = _mm256_extract_epi8(b.values, 9); + if (mask & 0x400) + tmp_values[10] = _mm256_extract_epi8(b.values, 10); + if (mask & 0x800) + tmp_values[11] = 
_mm256_extract_epi8(b.values, 11); + if (mask & 0x1000) + tmp_values[12] = _mm256_extract_epi8(b.values, 12); + if (mask & 0x2000) + tmp_values[13] = _mm256_extract_epi8(b.values, 13); + if (mask & 0x4000) + tmp_values[14] = _mm256_extract_epi8(b.values, 14); + if (mask & 0x8000) + tmp_values[15] = _mm256_extract_epi8(b.values, 15); + if (mask & 0x010000) + tmp_values[16] = _mm256_extract_epi8(b.values, 16); + if (mask & 0x020000) + tmp_values[17] = _mm256_extract_epi8(b.values, 17); + if (mask & 0x040000) + tmp_values[18] = _mm256_extract_epi8(b.values, 18); + if (mask & 0x080000) + tmp_values[19] = _mm256_extract_epi8(b.values, 19); + if (mask & 0x100000) + tmp_values[20] = _mm256_extract_epi8(b.values, 20); + if (mask & 0x200000) + tmp_values[21] = _mm256_extract_epi8(b.values, 21); + if (mask & 0x400000) + tmp_values[22] = _mm256_extract_epi8(b.values, 22); + if (mask & 0x800000) + tmp_values[23] = _mm256_extract_epi8(b.values, 23); + if (mask & 0x1000000) + tmp_values[24] = _mm256_extract_epi8(b.values, 24); + if (mask & 0x2000000) + tmp_values[25] = _mm256_extract_epi8(b.values, 25); + if (mask & 0x4000000) + tmp_values[26] = _mm256_extract_epi8(b.values, 26); + if (mask & 0x8000000) + tmp_values[27] = _mm256_extract_epi8(b.values, 27); + if (mask & 0x10000000) + tmp_values[28] = _mm256_extract_epi8(b.values, 28); + if (mask & 0x20000000) + tmp_values[29] = _mm256_extract_epi8(b.values, 29); + if (mask & 0x40000000) + tmp_values[30] = _mm256_extract_epi8(b.values, 30); + if (mask & 0x80000000) + tmp_values[31] = _mm256_extract_epi8(b.values, 31); + return loadu(tmp_values); + } + static Vectorized blendv(const Vectorized& a, const Vectorized& b, + const Vectorized& mask) { + return _mm256_blendv_epi8(a.values, b.values, mask.values); + } + template + static Vectorized arange(uint8_t base = 0, step_t step = static_cast(1)) { + return Vectorized( + base, base + step, base + 2 * step, base + 3 * step, + base + 4 * step, base + 5 * step, base + 6 * step, base + 7 * step, + base + 8 * step, base + 9 * step, base + 10 * step, base + 11 * step, + base + 12 * step, base + 13 * step, base + 14 * step, base + 15 * step, + base + 16 * step, base + 17 * step, base + 18 * step, base + 19 * step, + base + 20 * step, base + 21 * step, base + 22 * step, base + 23 * step, + base + 24 * step, base + 25 * step, base + 26 * step, base + 27 * step, + base + 28 * step, base + 29 * step, base + 30 * step, base + 31 * step); + } + static Vectorized + set(Vectorized a, Vectorized b, uint8_t count = size()) { + switch (count) { + case 0: + return a; + case 1: + return blend<0x1>(a, b); + case 2: + return blend<0x3>(a, b); + case 3: + return blend<0x7>(a, b); + case 4: + return blend<0xF>(a, b); + case 5: + return blend<0x1F>(a, b); + case 6: + return blend<0x3F>(a, b); + case 7: + return blend<0x7F>(a, b); + case 8: + return blend<0xFF>(a, b); + case 9: + return blend<0x1FF>(a, b); + case 10: + return blend<0x3FF>(a, b); + case 11: + return blend<0x7FF>(a, b); + case 12: + return blend<0xFFF>(a, b); + case 13: + return blend<0x1FFF>(a, b); + case 14: + return blend<0x3FFF>(a, b); + case 15: + return blend<0x7FFF>(a, b); + case 16: + return blend<0xFFFF>(a, b); + case 17: + return blend<0x1FFFF>(a, b); + case 18: + return blend<0x3FFFF>(a, b); + case 19: + return blend<0x7FFFF>(a, b); + case 20: + return blend<0xFFFFF>(a, b); + case 21: + return blend<0x1FFFFF>(a, b); + case 22: + return blend<0x3FFFFF>(a, b); + case 23: + return blend<0x7FFFFF>(a, b); + case 24: + return blend<0xFFFFFF>(a, b); + case 25: + 
return blend<0x1FFFFFF>(a, b); + case 26: + return blend<0x3FFFFFF>(a, b); + case 27: + return blend<0x7FFFFFF>(a, b); + case 28: + return blend<0xFFFFFFF>(a, b); + case 29: + return blend<0x1FFFFFFF>(a, b); + case 30: + return blend<0x3FFFFFFF>(a, b); + case 31: + return blend<0x7FFFFFFF>(a, b); + } + return b; + } + static Vectorized loadu(const void* ptr) { + return _mm256_loadu_si256(reinterpret_cast(ptr)); + } + static Vectorized loadu(const void* ptr, uint8_t count) { + __at_align__ uint8_t tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, ptr, count * sizeof(uint8_t)); + return loadu(tmp_values); + } + void store(void* ptr, int count = size()) const { + if (count == size()) { + // ptr need not to be aligned here. See + // https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions/intrinsics-for-load-and-store-operations-1/mm256-storeu-si256.html + _mm256_storeu_si256(reinterpret_cast<__m256i*>(ptr), values); + } else if (count > 0) { + __at_align__ uint8_t tmp_values[size()]; + _mm256_storeu_si256(reinterpret_cast<__m256i*>(tmp_values), values); + std::memcpy(ptr, tmp_values, count * sizeof(uint8_t)); + } + } + const uint8_t& operator[](int idx) const = delete; + uint8_t& operator[](int idx) = delete; + Vectorized abs() const { + return values; + } + Vectorized real() const { + return *this; + } + Vectorized imag() const { + return _mm256_set1_epi8(0); + } + Vectorized conj() const { + return *this; + } + Vectorized frac() const; + Vectorized neg() const; + Vectorized operator==(const Vectorized& other) const { + return _mm256_cmpeq_epi8(values, other.values); + } + Vectorized operator!=(const Vectorized& other) const { + return invert(_mm256_cmpeq_epi8(values, other.values)); + } + Vectorized operator<(const Vectorized& other) const { + __m256i max = _mm256_max_epu8(values, other.values); + return invert(_mm256_cmpeq_epi8(max, values)); + } + Vectorized operator<=(const Vectorized& other) const { + __m256i max = _mm256_max_epu8(values, other.values); + return _mm256_cmpeq_epi8(max, other.values); + } + Vectorized operator>(const Vectorized& other) const { + return other < *this; + } + Vectorized operator>=(const Vectorized& other) const { + return other <= *this; + } + + Vectorized eq(const Vectorized& other) const; + Vectorized ne(const Vectorized& other) const; + Vectorized gt(const Vectorized& other) const; + Vectorized ge(const Vectorized& other) const; + Vectorized lt(const Vectorized& other) const; + Vectorized le(const Vectorized& other) const; +}; + template <> Vectorized inline operator+(const Vectorized& a, const Vectorized& b) { return _mm256_add_epi64(a, b); @@ -765,6 +1016,12 @@ Vectorized inline operator+(const Vectorized& a, const Vectorize return _mm256_add_epi8(a, b); } +template <> +Vectorized inline operator+(const Vectorized& a, const Vectorized& b) { + return _mm256_add_epi8(a, b); +} + + template <> Vectorized inline operator-(const Vectorized& a, const Vectorized& b) { return _mm256_sub_epi64(a, b); @@ -785,6 +1042,11 @@ Vectorized inline operator-(const Vectorized& a, const 
Vectorize return _mm256_sub_epi8(a, b); } +template <> +Vectorized inline operator-(const Vectorized& a, const Vectorized& b) { + return _mm256_sub_epi8(a, b); +} + // Negation. Defined here so we can utilize operator- inline Vectorized Vectorized::neg() const { return Vectorized(0) - *this; @@ -802,6 +1064,10 @@ inline Vectorized Vectorized::neg() const { return Vectorized(0) - *this; } +inline Vectorized Vectorized::neg() const { + return Vectorized(0) - *this; +} + // Emulate operations with no native 64-bit support in avx, // by extracting each element, performing the operation pointwise, // then combining the results into a vector. @@ -888,6 +1154,12 @@ Vectorized inline operator*(const Vectorized& a, const Vectorize return int_elementwise_binary_256(a, b, std::multiplies()); } +template <> +Vectorized inline operator*(const Vectorized& a, const Vectorized& b) { + // We don't have an instruction for multiplying uint8_t + return int_elementwise_binary_256(a, b, std::multiplies()); +} + template <> Vectorized inline minimum(const Vectorized& a, const Vectorized& b) { return emulate(a, b, [](int64_t a_point, int64_t b_point) {return std::min(a_point, b_point);}); @@ -908,6 +1180,11 @@ Vectorized inline minimum(const Vectorized& a, const Vectorized< return _mm256_min_epi8(a, b); } +template <> +Vectorized inline minimum(const Vectorized& a, const Vectorized& b) { + return _mm256_min_epu8(a, b); +} + template <> Vectorized inline maximum(const Vectorized& a, const Vectorized& b) { return emulate(a, b, [](int64_t a_point, int64_t b_point) {return std::max(a_point, b_point);}); @@ -928,6 +1205,11 @@ Vectorized inline maximum(const Vectorized& a, const Vectorized< return _mm256_max_epi8(a, b); } +template <> +Vectorized inline maximum(const Vectorized& a, const Vectorized& b) { + return _mm256_max_epu8(a, b); +} + template <> Vectorized inline clamp(const Vectorized& a, const Vectorized& min_val, const Vectorized& max_val) { return emulate(a, min_val, max_val, [](int64_t a_point, int64_t min_point, int64_t max_point) {return std::min(max_point, std::max(a_point, min_point));}); @@ -948,6 +1230,11 @@ Vectorized inline clamp(const Vectorized& a, const Vectorized +Vectorized inline clamp(const Vectorized& a, const Vectorized& min_val, const Vectorized& max_val) { + return _mm256_min_epu8(max_val, _mm256_max_epu8(a, min_val)); +} + template <> Vectorized inline clamp_max(const Vectorized& a, const Vectorized& max_val) { return emulate(a, max_val, [](int64_t a_point, int64_t max_point) {return std::min(max_point, a_point);}); @@ -968,6 +1255,11 @@ Vectorized inline clamp_max(const Vectorized& a, const Vectorize return _mm256_min_epi8(max_val, a); } +template <> +Vectorized inline clamp_max(const Vectorized& a, const Vectorized& max_val) { + return _mm256_min_epu8(max_val, a); +} + template <> Vectorized inline clamp_min(const Vectorized& a, const Vectorized& min_val) { return emulate(a, min_val, [](int64_t a_point, int64_t min_point) {return std::max(min_point, a_point);}); @@ -988,6 +1280,11 @@ Vectorized inline clamp_min(const Vectorized& a, const Vectorize return _mm256_max_epi8(min_val, a); } +template <> +Vectorized inline clamp_min(const Vectorized& a, const Vectorized& min_val) { + return _mm256_max_epu8(min_val, a); +} + template Vectorized inline convert_to_int32(const T* ptr) { return Vectorized::loadu(ptr); @@ -1019,6 +1316,10 @@ template <> Vectorized inline operator/(const Vectorized& a, const Vectorized& b) { return int_elementwise_binary_256(a, b, std::divides()); } +template <> 
+Vectorized inline operator/(const Vectorized& a, const Vectorized& b) { + return int_elementwise_binary_256(a, b, std::divides()); +} template>::value, int> = 0> inline Vectorized operator&(const Vectorized& a, const Vectorized& b) { @@ -1133,6 +1434,292 @@ inline Vectorized Vectorized::le(const Vectorized& other return (*this <= other) & Vectorized(1); } +inline Vectorized Vectorized::eq(const Vectorized& other) const { + return (*this == other) & Vectorized(1); +} + +inline Vectorized Vectorized::ne(const Vectorized& other) const { + return (*this != other) & Vectorized(1); +} + +inline Vectorized Vectorized::gt(const Vectorized& other) const { + return (*this > other) & Vectorized(1); +} + +inline Vectorized Vectorized::ge(const Vectorized& other) const { + return (*this >= other) & Vectorized(1); +} + +inline Vectorized Vectorized::lt(const Vectorized& other) const { + return (*this < other) & Vectorized(1); +} + +inline Vectorized Vectorized::le(const Vectorized& other) const { + return (*this <= other) & Vectorized(1); +} + +template +Vectorized inline shift_256_16(const Vectorized& a, const Vectorized& b) { + // No vector instruction for shifting int16_t, so emulating it instead. + + // Control masks for shuffle operation, treating 256 bits as an + // array of 16-bit elements, and considering pairs of neighboring + // elements. Specifially, a mask named "ctl_M_N" (M,N in [0,1], and + // M!=N) is set so that shuffle will move element with index M from + // input pair into element with index N in output pair, and element + // with index M in output pair will be set to all 0s. + __m256i ctl_0_1 = _mm256_set_epi8(29, 28, 0x80, 0x80, 25, 24, 0x80, 0x80, + 21, 20, 0x80, 0x80, 17, 16, 0x80, 0x80, + 13, 12, 0x80, 0x80, 9, 8, 0x80, 0x80, + 5, 4, 0x80, 0x80, 1, 0, 0x80, 0x80); + __m256i ctl_1_0 = _mm256_set_epi8(0x80, 0x80, 31, 30, 0x80, 0x80, 27, 26, + 0x80, 0x80, 23, 22, 0x80, 0x80, 19, 18, + 0x80, 0x80, 15, 14, 0x80, 0x80, 11, 10, + 0x80, 0x80, 7, 6, 0x80, 0x80, 3, 2); + + // Masks for bitwise and operation, treating 256 bits as an array of + // 16-bit elements, and considering them in pairs of neighboring + // elements. A mask named "keep_M" (M in [0,1]) is set so that + // bitwise and will copy element with index M from input pair into + // element with the same index in output pair, while the other + // element in output pair will be set to all 0s. + __m256i keep_0 = _mm256_set1_epi32(0xFFFF); + __m256i keep_1 = _mm256_set1_epi32(0xFFFF0000); + + // Take each 16-bit element with idx%2==0 from input array to be + // shifted and extend it to 32 bits so that 0s are added to the + // right. Then, perform shifting on this 32-bit number. Upper 16 + // bits will be proper result of shifting original 16-bit number, so + // write them to result array, into the same position from which + // corresponding input element is taken. Also, make sure that + // result array elements with idx%2!=0 are set to all 0s. + // + // Note that number of bits to shift for is extended to 32 bits by + // adding 0s to the left. That means this number is not properly + // sign-extended for negative values. However, number of bits to + // shift is treated as an unsigned integer by respective shift + // intrinsics anyway so if negative then either with or without + // proper sign extension, it will be interpreted as a number greater + // than 32, and the shifting result will be the same. 
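In scalar terms, the widening trick the comment above describes boils down to the following minimal sketch for a single even-indexed lane (illustrative only, not part of the diff; left_shift mirrors the function's template parameter, and the intrinsics' count-saturation behaviour is modelled explicitly):

    #include <cstdint>

    // Place the 16-bit element in the upper half of a 32-bit temporary (zeros
    // below), do the 32-bit variable shift, then read the upper 16 bits back.
    // The shuffle/shift/shuffle sequence below does this for all lanes at once.
    inline int16_t shift16_via_32(int16_t x, uint32_t s, bool left_shift) {
      uint32_t widened = uint32_t(uint16_t(x)) << 16;            // x in bits 31..16
      uint32_t shifted = left_shift
          ? (s < 32 ? widened << s : 0)                          // _mm256_sllv_epi32: counts >= 32 give 0
          : uint32_t(s < 32 ? int32_t(widened) >> s              // _mm256_srav_epi32: counts >= 32
                            : int32_t(widened) >> 31);           //   fill with the sign bit
      return int16_t(shifted >> 16);                             // upper half is the result
    }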
+ __m256i a0 = _mm256_shuffle_epi8(a, ctl_0_1); + __m256i b0 = _mm256_and_si256(b, keep_0); + __m256i c0; + if (left_shift) + c0 = _mm256_sllv_epi32(a0, b0); + else + c0 = _mm256_srav_epi32(a0, b0); + c0 = _mm256_shuffle_epi8(c0, ctl_1_0); + + // Peform shifting the same way for input array elements with + // idx%2==1. + __m256i a1 = _mm256_and_si256(a, keep_1); + __m256i b1 = _mm256_shuffle_epi8(b, ctl_1_0); + __m256i c1; + if (left_shift) + c1 = _mm256_sllv_epi32(a1, b1); + else + c1 = _mm256_srav_epi32(a1, b1); + c1 = _mm256_and_si256(c1, keep_1); + + // Merge partial results into the final result. + __m256i c = _mm256_or_si256(c0, c1); + + return c; +} + +template ::value || std::is_same::value, int> = 0> +Vectorized inline shift_256_8(const Vectorized& a, const Vectorized& b) { + // No vector instruction for shifting int8_t/uint8_t, so emulating + // it instead. + + // Control masks for shuffle operation, treating 256 bits as an + // array of 8-bit elements, and considering quadruples of + // neighboring elements. Specifially, a mask named "ctl_M_N" (M,N + // in [0,1,2,3], and M!=N) is set so that shuffle will move element + // with index M from input quadruple into element with index N in + // output quadruple, and other elements in output quadruple will be + // set to all 0s. + __m256i ctl_0_3 = _mm256_set_epi8(28, 0x80, 0x80, 0x80, 24, 0x80, 0x80, 0x80, + 20, 0x80, 0x80, 0x80, 16, 0x80, 0x80, 0x80, + 12, 0x80, 0x80, 0x80, 8, 0x80, 0x80, 0x80, + 4, 0x80, 0x80, 0x80, 0, 0x80, 0x80, 0x80); + __m256i ctl_1_0 = _mm256_set_epi8(0x80, 0x80, 0x80, 29, 0x80, 0x80, 0x80, 25, + 0x80, 0x80, 0x80, 21, 0x80, 0x80, 0x80, 17, + 0x80, 0x80, 0x80, 13, 0x80, 0x80, 0x80, 9, + 0x80, 0x80, 0x80, 5, 0x80, 0x80, 0x80, 1); + __m256i ctl_1_3 = _mm256_set_epi8(29, 0x80, 0x80, 0x80, 25, 0x80, 0x80, 0x80, + 21, 0x80, 0x80, 0x80, 17, 0x80, 0x80, 0x80, + 13, 0x80, 0x80, 0x80, 9, 0x80, 0x80, 0x80, + 5, 0x80, 0x80, 0x80, 1, 0x80, 0x80, 0x80); + __m256i ctl_2_0 = _mm256_set_epi8(0x80, 0x80, 0x80, 30, 0x80, 0x80, 0x80, 26, + 0x80, 0x80, 0x80, 22, 0x80, 0x80, 0x80, 18, + 0x80, 0x80, 0x80, 14, 0x80, 0x80, 0x80, 10, + 0x80, 0x80, 0x80, 6, 0x80, 0x80, 0x80, 2); + __m256i ctl_2_3 = _mm256_set_epi8(30, 0x80, 0x80, 0x80, 26, 0x80, 0x80, 0x80, + 22, 0x80, 0x80, 0x80, 18, 0x80, 0x80, 0x80, + 14, 0x80, 0x80, 0x80, 10, 0x80, 0x80, 0x80, + 6, 0x80, 0x80, 0x80, 2, 0x80, 0x80, 0x80); + __m256i ctl_3_0 = _mm256_set_epi8(0x80, 0x80, 0x80, 31, 0x80, 0x80, 0x80, 27, + 0x80, 0x80, 0x80, 23, 0x80, 0x80, 0x80, 19, + 0x80, 0x80, 0x80, 15, 0x80, 0x80, 0x80, 11, + 0x80, 0x80, 0x80, 7, 0x80, 0x80, 0x80, 3); + __m256i ctl_3_1 = _mm256_set_epi8(0x80, 0x80, 31, 0x80, 0x80, 0x80, 27, 0x80, + 0x80, 0x80, 23, 0x80, 0x80, 0x80, 19, 0x80, + 0x80, 0x80, 15, 0x80, 0x80, 0x80, 11, 0x80, + 0x80, 0x80, 7, 0x80, 0x80, 0x80, 3, 0x80); + __m256i ctl_3_2 = _mm256_set_epi8(0x80, 31, 0x80, 0x80, 0x80, 27, 0x80, 0x80, + 0x80, 23, 0x80, 0x80, 0x80, 19, 0x80, 0x80, + 0x80, 15, 0x80, 0x80, 0x80, 11, 0x80, 0x80, + 0x80, 7, 0x80, 0x80, 0x80, 3, 0x80, 0x80); + + // Masks for bitwise and operation, treating 256 bits as an array of + // 8-bit elements, and considering them in quadruples of neighboring + // elements. A mask named "keep_M" (M in [0,1,2,3]) is set so that + // bitwise and will copy element with index M from input quadruple + // into element with the same index in output quadruple, while the + // other elements in output quadruple will be set to all 0s. 
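The 8-bit routine below applies the same widening idea, just with quadruples of bytes instead of pairs, and with one extra wrinkle: on the right-shift path the signed instantiation must shift arithmetically (_mm256_srav_epi32 on the widened lanes) while the unsigned one shifts logically (_mm256_srlv_epi32). A scalar sketch of that distinction (illustrative only, assuming shift counts in the range the emulation supports):

    #include <cstdint>

    // int8_t: vacated bits are refilled from the sign (arithmetic shift).
    inline int8_t  shr_i8(int8_t v, uint32_t s)  { return int8_t(int32_t(v) >> (s < 31 ? s : 31)); }
    // uint8_t: vacated bits are refilled with zeros (logical shift).
    inline uint8_t shr_u8(uint8_t v, uint32_t s) { return s < 32 ? uint8_t(uint32_t(v) >> s) : uint8_t(0); }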
+ __m256i keep_0 = _mm256_set1_epi32(0xFF); + __m256i keep_3 = _mm256_set1_epi32(0xFF000000); + + // Take each 8-bit element with idx%4==0 from input array to be + // shifted and extend it to 32 bits so that 0s are added to the + // right. Then, perform shifting on this 32-bit number. Upper 8 + // bits will be proper result of shifting original 8-bit number, so + // write them to result array, into the same position from which + // corresponding input element is taken. Also, make sure that + // result array elements with idx%4!=0 are set to all 0s. + // + // Note that number of bits to shift for is extended to 32 bits by + // adding 0s to the left. That means this number is not properly + // sign-extended for negative values. However, number of bits to + // shift is treated as an unsigned integer by respective shift + // intrinsics anyway so if negative then either with or without + // proper sign extension, it will be interpreted as a number greater + // than 32, and the shifting result will be the same. + __m256i a0 = _mm256_shuffle_epi8(a, ctl_0_3); + __m256i b0 = _mm256_and_si256(b, keep_0); + __m256i c0; + if (left_shift) + c0 = _mm256_sllv_epi32(a0, b0); + else + if (std::is_same::value) + c0 = _mm256_srav_epi32(a0, b0); + else + c0 = _mm256_srlv_epi32(a0, b0); + c0 = _mm256_shuffle_epi8(c0, ctl_3_0); + + // Peform shifting the same way for input array elements with + // idx%4==1. + __m256i a1 = _mm256_shuffle_epi8(a, ctl_1_3); + __m256i b1 = _mm256_shuffle_epi8(b, ctl_1_0); + __m256i c1; + if (left_shift) + c1 = _mm256_sllv_epi32(a1, b1); + else + if (std::is_same::value) + c1 = _mm256_srav_epi32(a1, b1); + else + c1 = _mm256_srlv_epi32(a1, b1); + c1 = _mm256_shuffle_epi8(c1, ctl_3_1); + + // Peform shifting the same way for input array elements with + // idx%4==2. + __m256i a2 = _mm256_shuffle_epi8(a, ctl_2_3); + __m256i b2 = _mm256_shuffle_epi8(b, ctl_2_0); + __m256i c2; + if (left_shift) + c2 = _mm256_sllv_epi32(a2, b2); + else + if (std::is_same::value) + c2 = _mm256_srav_epi32(a2, b2); + else + c2 = _mm256_srlv_epi32(a2, b2); + c2 = _mm256_shuffle_epi8(c2, ctl_3_2); + + // Peform shifting the same way for input array elements with + // idx%4==3. + __m256i a3 = _mm256_and_si256(a, keep_3); + __m256i b3 = _mm256_shuffle_epi8(b, ctl_3_0); + __m256i c3; + if (left_shift) + c3 = _mm256_sllv_epi32(a3, b3); + else + if (std::is_same::value) + c3 = _mm256_srav_epi32(a3, b3); + else + c3 = _mm256_srlv_epi32(a3, b3); + c3 = _mm256_and_si256(c3, keep_3); + + // Merge partial results into the final result. + __m256i c01 = _mm256_or_si256(c0, c1); + __m256i c23 = _mm256_or_si256(c2, c3); + __m256i c = _mm256_or_si256(c01, c23); + + return c; +} + +template <> +Vectorized inline operator<<(const Vectorized& a, const Vectorized& b) { + return _mm256_sllv_epi64(a, b); +} + +template <> +Vectorized inline operator<<(const Vectorized& a, const Vectorized& b) { + return _mm256_sllv_epi32(a, b); +} + +template <> +Vectorized inline operator<<(const Vectorized& a, const Vectorized& b) { + return shift_256_16(a, b); +} + +template <> +Vectorized inline operator<<(const Vectorized& a, const Vectorized& b) { + return shift_256_8(a, b); +} + +template <> +Vectorized inline operator<<(const Vectorized& a, const Vectorized& b) { + return shift_256_8(a, b); +} + +template <> +Vectorized inline operator>>(const Vectorized& a, const Vectorized& b) { + // No vector instruction for right shifting int64_t, so emulating it + // instead. 
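A scalar sketch of the emulation that follows (illustrative; it assumes 0 <= s < 64, and the guard on s == 0 stands in for the fact that _mm256_sllv_epi64 simply returns 0 for a count of 64):

    #include <cstdint>

    // Arithmetic right shift built from a logical one: shift zeros in, then OR
    // the vacated high bits back in from the sign.
    inline int64_t sar64(int64_t a, uint64_t s) {
      uint64_t logical  = uint64_t(a) >> s;                    // _mm256_srlv_epi64
      uint64_t sign_all = a < 0 ? ~uint64_t(0) : uint64_t(0);  // _mm256_cmpgt_epi64(0, a)
      uint64_t sign_ext = s == 0 ? 0 : sign_all << (64 - s);   // _mm256_sllv_epi64 by (64 - s)
      return int64_t(logical | sign_ext);
    }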
+ + // Shift the number logically to the right, thus filling the most + // significant bits with 0s. Then, replace these bits with the sign + // bit. + __m256i sign_bits = _mm256_cmpgt_epi64(_mm256_set1_epi64x(0), a); + __m256i b_inv_mod_64 = _mm256_sub_epi64(_mm256_set1_epi64x(64), b); + __m256i sign_ext = _mm256_sllv_epi64(sign_bits, b_inv_mod_64); + __m256i c = _mm256_srlv_epi64(a, b); + c = _mm256_or_si256(c, sign_ext); + + return c; +} + +template <> +Vectorized inline operator>>(const Vectorized& a, const Vectorized& b) { + return _mm256_srav_epi32(a, b); +} + +template <> +Vectorized inline operator>>(const Vectorized& a, const Vectorized& b) { + return shift_256_16(a, b); +} + +template <> +Vectorized inline operator>>(const Vectorized& a, const Vectorized& b) { + return shift_256_8(a, b); +} + +template <> +Vectorized inline operator>>(const Vectorized& a, const Vectorized& b) { + return shift_256_8(a, b); +} + #endif }}} diff --git a/aten/src/ATen/cpu/vec/vec256/vec256_qint.h b/aten/src/ATen/cpu/vec/vec256/vec256_qint.h index 1dc624f6668a..0ee43b53e635 100644 --- a/aten/src/ATen/cpu/vec/vec256/vec256_qint.h +++ b/aten/src/ATen/cpu/vec/vec256/vec256_qint.h @@ -257,6 +257,19 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return _mm256_loadu_si256((const __m256i*)tmp_values); + } + float_vec_return_type dequantize( Vectorized scale, Vectorized /*zero_point*/, @@ -436,6 +449,19 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return _mm256_loadu_si256((const __m256i*)tmp_values); + } + private: __m256i cvtepi8_epi32(__m128i epi8_vals) const { return _mm256_cvtepi8_epi32(epi8_vals); @@ -601,6 +627,19 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. 
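The counted loadu overloads added here, and the matching ones for the other quantized types and the AVX-512 classes further down, all follow one pattern: zero a stack buffer of full vector width, copy only count valid elements into it, and then issue a normal full-width load from the buffer. A condensed sketch of the idiom (names are illustrative, and it value-initializes the buffer directly rather than with the explicit loop the diff keeps for codegen reasons):

    #include <array>
    #include <cstdint>
    #include <cstring>

    // Lanes past `count` read as zero instead of uninitialized stack memory.
    template <typename T, int N>
    std::array<T, N> load_partial(const T* src, int64_t count) {
      std::array<T, N> buf{};                        // one full vector, zero-filled
      std::memcpy(buf.data(), src, count * sizeof(T));
      return buf;                                    // safe input for a full-width vector load
    }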
+ for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return _mm256_loadu_si256((const __m256i*)tmp_values); + } + private: __m256i cvtepu8_epi32(__m128i epu8_vals) const { return _mm256_cvtepu8_epi32(epu8_vals); @@ -820,6 +859,19 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return Vectorized(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, @@ -952,6 +1004,19 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return Vectorized(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, @@ -1072,6 +1137,19 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. 
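On the caller side, the counted overload is typically only taken for the final, possibly short, chunk of a buffer. A sketch with an assumed vector type V whose loadu/store interfaces follow the ones in this diff:

    #include <cstdint>

    // Process n elements in full-width chunks; the tail uses the counted
    // load/store so no out-of-bounds memory is touched.
    template <typename V, typename T>
    void copy_through_vec(const T* src, T* dst, int64_t n) {
      int64_t i = 0;
      for (; i + V::size() <= n; i += V::size()) {
        V::loadu(src + i).store(dst + i);
      }
      if (i < n) {
        V::loadu(src + i, n - i).store(dst + i, n - i);
      }
    }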
+ for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy( + tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return Vectorized(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, diff --git a/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h b/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h index 77cf3695ab91..8fe6cc25f0ee 100644 --- a/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h +++ b/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h @@ -256,29 +256,29 @@ class Vectorized { } Vectorized C10_ALWAYS_INLINE acos() const { - return {Sleef_acosf4_u10vsx(_vec0), Sleef_acosf4_u10vsx(_vec1)}; + return {Sleef_acosf4_u10vsx(_vec0), Sleef_acosf4_u10vsx(_vec1)}; } Vectorized C10_ALWAYS_INLINE asin() const { - return {Sleef_asinf4_u10vsx(_vec0), Sleef_asinf4_u10vsx(_vec1)}; + return {Sleef_asinf4_u10vsx(_vec0), Sleef_asinf4_u10vsx(_vec1)}; } Vectorized atan() const { - return {Sleef_atanf4_u10vsx(_vec0), Sleef_atanf4_u10vsx(_vec1)}; + return {Sleef_atanf4_u10vsx(_vec0), Sleef_atanf4_u10vsx(_vec1)}; } Vectorized atan2(const Vectorized& b) const { - return {Sleef_atan2f4_u10vsx(_vec0, b._vec0), Sleef_atan2f4_u10vsx(_vec1, b._vec1)}; + return {Sleef_atan2f4_u10vsx(_vec0, b._vec0), Sleef_atan2f4_u10vsx(_vec1, b._vec1)}; } Vectorized copysign(const Vectorized &sign) const { return {Sleef_copysignf4_vsx(_vec0, sign._vec0), Sleef_copysignf4_vsx(_vec1, sign._vec1)}; } Vectorized lgamma() const { - return {Sleef_lgammaf4_u10vsx(_vec0), Sleef_lgammaf4_u10vsx(_vec1)}; + return {Sleef_lgammaf4_u10vsx(_vec0), Sleef_lgammaf4_u10vsx(_vec1)}; } Vectorized erf() const { - return {Sleef_erff4_u10vsx(_vec0), Sleef_erff4_u10vsx(_vec1)}; + return {Sleef_erff4_u10vsx(_vec0), Sleef_erff4_u10vsx(_vec1)}; } Vectorized erfc() const { - return {Sleef_erfcf4_u15vsx(_vec0), Sleef_erfcf4_u15vsx(_vec1)}; + return {Sleef_erfcf4_u15vsx(_vec0), Sleef_erfcf4_u15vsx(_vec1)}; } Vectorized erfinv() const { @@ -301,133 +301,32 @@ class Vectorized { } Vectorized C10_ALWAYS_INLINE exp() const { - // implementation logic from avx_mathfun with some modifications from sleef - // Express e**x = e**g 2**n - /// = e**g e**( n loge(2) ) - /// = e**( g + n loge(2) ) - // - auto tmp_x = *this; - auto fx = (tmp_x * log2e_inv).round(); - - auto x = fx.madd(negln2f_hi, tmp_x); - x = fx.madd(negln2f_lo, x); - auto z = x * x; - auto y = x.madd(exp_p0, exp_p1); - y = y.madd(x, exp_p2); - y = y.madd(x, exp_p3); - y = y.madd(x, exp_p4); - y = y.madd(x, exp_p5); - y = y.madd(z, x) + one; - - // vm_pow2n 2^n - vint32 imm0 = vec_signed(fx._vec0); - vint32 imm1 = vec_signed(fx._vec1); - // this pow2n logic is from Sleef code - vint32 imm00 = imm0 >> 1; //>>1 - vint32 imm01 = imm1 >> 1; - vint32 imm10 = imm0 - imm00; - vint32 imm11 = imm1 - imm01; - imm00 = (imm00 + v0x7f) << vu_23; - imm01 = (imm01 + v0x7f) << vu_23; - imm10 = (imm10 + v0x7f) << vu_23; - imm11 = (imm11 + v0x7f) << vu_23; - // treat imm as float vector without conversion - - y._vec0 = (y._vec0 * (vfloat32)imm00) * (vfloat32)imm10; - y._vec1 = (y._vec1 * (vfloat32)imm01) * (vfloat32)imm11; - // boundary check - auto tmp = blendv(y, v_inf, (Vectorized(exp_hi) <= tmp_x)); - y = blendv(tmp, zero, (tmp_x < Vectorized(exp_lo))); - - return y; + return {Sleef_expf4_u10vsx(_vec0), Sleef_expf4_u10vsx(_vec1)}; } Vectorized expm1() const { - return exp() - one; + return {Sleef_expm1f4_u10vsx(_vec0), Sleef_expm1f4_u10vsx(_vec1)}; } Vectorized C10_ALWAYS_INLINE log() const { return {Sleef_logf4_u10vsx(_vec0), 
Sleef_logf4_u10vsx(_vec1)}; } Vectorized C10_ALWAYS_INLINE log10() const { - return {Sleef_log10f4_u10vsx(_vec0), Sleef_log10f4_u10vsx(_vec1)}; + return {Sleef_log10f4_u10vsx(_vec0), Sleef_log10f4_u10vsx(_vec1)}; } Vectorized C10_ALWAYS_INLINE log1p() const { - return {Sleef_log1pf4_u10vsx(_vec0), Sleef_log1pf4_u10vsx(_vec1)}; + return {Sleef_log1pf4_u10vsx(_vec0), Sleef_log1pf4_u10vsx(_vec1)}; } Vectorized C10_ALWAYS_INLINE log2() const { - return {Sleef_log2f4_u10vsx(_vec0), Sleef_log2f4_u10vsx(_vec1)}; + return {Sleef_log2f4_u10vsx(_vec0), Sleef_log2f4_u10vsx(_vec1)}; } Vectorized C10_ALWAYS_INLINE ceil() const { return {vec_ceil(_vec0), vec_ceil(_vec1)}; } Vectorized C10_ALWAYS_INLINE cos() const { - // take the absolute value - auto x = abs(); - // extract the sign bit (upper one) - auto sign_bit = (*this) & sign_mask; - // scale by 4/Pi - auto y = x * _4div_pi; - // store the integer part of y in mm0 - // j=(j+1) & (~1) (see the cephes sources) - vint32 imm0 = (vec_signed(y._vec0) + vi_1) & vi_inv1; - vint32 imm1 = (vec_signed(y._vec1) + vi_1) & vi_inv1; - y._vec0 = vec_float(imm0); - y._vec1 = vec_float(imm1); - - imm0 = imm0 - vi_2; - imm1 = imm1 - vi_2; - Vectorized poly_mask; - // get the swap sign flag - vint32 tmp0 = vec_and(vec_nand(imm0, imm0), vi_4); - vint32 tmp1 = vec_and(vec_nand(imm1, imm1), vi_4); - sign_bit._vecb0 = (vbool32)vec_sl(tmp0, vu_29); - sign_bit._vecb1 = (vbool32)vec_sl(tmp1, vu_29); - // get the polynom selection mask - // there is one polynom for 0 <= x <= Pi / 4 - // and another one for Pi / 4 < x <= Pi / 2 - // Both branches will be computed. - - poly_mask._vecb0 = (vbool32)vec_cmpeq((imm0 & vi_2), vi_0); - poly_mask._vecb1 = (vbool32)vec_cmpeq((imm1 & vi_2), vi_0); - - // The magic pass: "Extended precision modular arithmetic" - // x = ((x - y * DP1) - y * DP2) - y * DP3; - x = y.madd(minus_cephes_dp1, x); - x = y.madd(minus_cephes_dp2, x); - x = y.madd(minus_cephes_dp3, x); - - // Evaluate the first polynom (0 <= x <= Pi/4) - auto z = x * x; - y = z.madd(coscof_p0, coscof_p1); - y = y.madd(z, coscof_p2); - y = y * z * z; - y = y - z * half + one; - - // Evaluate the second polynom (Pi/4 <= x <= 0) - auto y_2 = z.madd(sincof_p0, sincof_p1); - y_2 = y_2.madd(z, sincof_p2); - y_2 = y_2 * z; - y_2 = y_2.madd(x, x); - - // select the correct result from the two polynoms - y = blendv(y, y_2, poly_mask); - // update the sign - y = y ^ sign_bit; - - return y; + return {Sleef_cosf4_u10vsx(_vec0), Sleef_cosf4_u10vsx(_vec1)}; } Vectorized C10_ALWAYS_INLINE cosh() const { - // cosh = 1/2 * (e^x + e^-x) - auto x = abs(); - auto e_x = x.exp(); - auto ret = (e_x + Vectorized(one) / e_x) * half; - // inf and nan checks -#if 0 - ret = blendv(ret, v_inf, x >= vf_89); - ret = blendv(ret, v_inf, ret.isnan()); - ret = blendv(ret, v_nan, this->isnan()); -#endif - return ret; + return {Sleef_coshf4_u10vsx(_vec0), Sleef_coshf4_u10vsx(_vec1)}; } Vectorized C10_ALWAYS_INLINE floor() const { return {vec_floor(_vec0), vec_floor(_vec1)}; @@ -440,97 +339,16 @@ class Vectorized { return {vec_round(_vec0), vec_round(_vec1)}; } Vectorized C10_ALWAYS_INLINE sin() const { - // take the absolute value and xtract sign - auto x = abs(); - auto sign_bit = (*this) & sign_mask; - - // scale by 4/Pi - auto y = x * _4div_pi; - // store the integer part of y in mm0 - - // j=(j+1) & (~1) (see the cephes sources) - vint32 imm0 = (vec_signed(y._vec0) + vi_1) & vi_inv1; - vint32 imm1 = (vec_signed(y._vec1) + vi_1) & vi_inv1; - y._vec0 = vec_float(imm0); - y._vec1 = vec_float(imm1); - // get the swap 
sign flag - Vectorized swap_sign_bit, poly_mask; - swap_sign_bit._vecb0 = (vbool32)vec_sl(imm0 & vi_4, vu_29); - swap_sign_bit._vecb1 = (vbool32)vec_sl(imm1 & vi_4, vu_29); - // get the polynom selection mask - // there is one polynom for 0 <= x <= Pi/4 - // and another one for Pi/4 C10_ALWAYS_INLINE sinh() const { - auto temp_abs = abs(); - // get exponent - auto ret = temp_abs.exp(); - auto recp = Vectorized(half) / ret; - auto v = ret * half - recp; - // extract the sign bit (upper one) - auto sign_bit = (*this) & sign_mask; - auto z = temp_abs * temp_abs; - auto y = z.madd(p0, p1); - y = y.madd(z, p2); - y = (y * z).madd(temp_abs, temp_abs); - // check and select - auto result = blendv(y, v, temp_abs > one); - return result | sign_bit; + return {Sleef_sinhf4_u10vsx(_vec0), Sleef_sinhf4_u10vsx(_vec1)}; } Vectorized C10_ALWAYS_INLINE tan() const { - return {Sleef_tanf4_u10vsx(_vec0), Sleef_tanf4_u10vsx(_vec1)}; + return {Sleef_tanf4_u10vsx(_vec0), Sleef_tanf4_u10vsx(_vec1)}; } Vectorized C10_ALWAYS_INLINE tanh() const { - auto x = *this; - auto vabs = abs(); - // get exponent - auto exp2x = (vabs + vabs).exp(); - auto vv = Vectorized(one) - Vectorized(two) / (exp2x + one); - // extract the sign bit (upper one) - auto sign_bit = (*this) & sign_mask; - auto z = vabs * vabs; - auto y = z.madd(tanh_p0, tanh_p1); - auto tmp = y.madd(z, tanh_p2); - y = z.madd(tmp, tanh_p3); - tmp = y.madd(z, tanh_p4); - y = tmp * z; - tmp = y.madd(x, x); - // add sign - vv = vv | sign_bit; - // check and select - auto sel_mask = vabs >= tanh_0p625; - auto max_mask = vabs > tanh_half_max; - auto max_ret = sign_bit ^ one; - return blendv(blendv(tmp, vv, sel_mask), max_ret, max_mask); + return {Sleef_tanhf4_u10vsx(_vec0), Sleef_tanhf4_u10vsx(_vec1)}; } Vectorized C10_ALWAYS_INLINE trunc() const { return {vec_trunc(_vec0), vec_trunc(_vec1)}; @@ -555,15 +373,15 @@ class Vectorized { } Vectorized fmod(const Vectorized& b) const { - return {Sleef_fmodf4_vsx(_vec0, b._vec0),Sleef_fmodf4_vsx(_vec1, b._vec1)}; + return {Sleef_fmodf4_vsx(_vec0, b._vec0),Sleef_fmodf4_vsx(_vec1, b._vec1)}; } Vectorized hypot(const Vectorized& b) const { - return {Sleef_hypotf4_u05vsx(_vec0, b._vec0), Sleef_hypotf4_u05vsx(_vec1, b._vec1)}; + return {Sleef_hypotf4_u05vsx(_vec0, b._vec0), Sleef_hypotf4_u05vsx(_vec1, b._vec1)}; } Vectorized nextafter(const Vectorized& b) const { - return {Sleef_nextafterf4_vsx(_vec0, b._vec0), Sleef_nextafterf4_vsx(_vec1, b._vec1)}; + return {Sleef_nextafterf4_vsx(_vec0, b._vec0), Sleef_nextafterf4_vsx(_vec1, b._vec1)}; } Vectorized igamma(const Vectorized& x) const { diff --git a/aten/src/ATen/cpu/vec/vec512/vec512.h b/aten/src/ATen/cpu/vec/vec512/vec512.h index 0c6f33fa08a0..dd1235e82ece 100644 --- a/aten/src/ATen/cpu/vec/vec512/vec512.h +++ b/aten/src/ATen/cpu/vec/vec512/vec512.h @@ -190,6 +190,56 @@ inline deinterleave2(const Vectorized& a, const Vectorized& _mm512_mask_permutex2var_ps(a, 0xffff, idx2, b)); } +// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ FLIP ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +template<> +inline Vectorized flip(const Vectorized & v) { + const __m512i mask = _mm512_set_epi32(0, 1, 2, 3, 4, 5, 6, 7, + 8, 9, 10, 11, 12, 13, 14, 15); + return _mm512_permutexvar_ps(mask, v); +} + +template<> +inline Vectorized flip(const Vectorized & v) { + const __m512i mask = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7); + return _mm512_permutexvar_pd(mask, v); +} + +template<> +inline Vectorized flip(const Vectorized & v) { + const __m512i mask = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7); + return 
_mm512_permutexvar_epi64(mask, v); +} + +template<> +inline Vectorized flip(const Vectorized & v) { + const __m512i mask = _mm512_set_epi32(0, 1, 2, 3, 4, 5, 6, 7, + 8, 9, 10, 11, 12, 13, 14, 15); + return _mm512_permutexvar_epi32(mask, v); +} + +template<> +inline Vectorized flip(const Vectorized & v) { + const __m512i mask = _mm512_set_epi16( + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, + 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 + ); + return _mm512_permutexvar_epi16(mask, v); +} + +template<> +inline Vectorized flip(const Vectorized & v) { + const __m512i mask1 = _mm512_set_epi8( + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 + ); + const __m512i mask2 = _mm512_set_epi64(1, 0, 3, 2, 5, 4, 7, 6); + auto reversed_vec = _mm512_shuffle_epi8(v, mask1); + return _mm512_permutexvar_epi64(mask2, reversed_vec); +} + #endif // defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER) }}} diff --git a/aten/src/ATen/cpu/vec/vec512/vec512_bfloat16.h b/aten/src/ATen/cpu/vec/vec512/vec512_bfloat16.h index c690682a4aa4..65fca9154215 100644 --- a/aten/src/ATen/cpu/vec/vec512/vec512_bfloat16.h +++ b/aten/src/ATen/cpu/vec/vec512/vec512_bfloat16.h @@ -800,6 +800,43 @@ inline void convert(const BFloat16* src, BFloat16* dst, int64_t n) { } } +template <> +inline void convert(const float* src, BFloat16* dst, int64_t n) { + int64_t i; + for (i = 0; i + Vectorized::size() <= n; i += Vectorized::size()) { + __m512 a = _mm512_loadu_ps(&src[i]); + __m512 b = _mm512_loadu_ps(&src[i + 16]); + + __m512i bf = cvtfp32_bf16(a, b); + _mm512_storeu_si512(reinterpret_cast<__m512i*>(&dst[i]), bf); + } + for (; i < n; i++) { + dst[i] = c10::convert(src[i]); + } +} + +template <> +inline void convert(const double* src, BFloat16* dst, int64_t n) { + auto load_float = [](const double *src) -> __m512 { + // Load one float vector from an array of doubles + __m256 a = _mm512_cvtpd_ps(_mm512_loadu_pd(src)); + __m256 b = _mm512_cvtpd_ps(_mm512_loadu_pd(src + 8)); + return _mm512_insertf32x8(_mm512_castps256_ps512(a), b, 1); + }; + + int64_t i; + for (i = 0; i + Vectorized::size() <= n; i += Vectorized::size()) { + __m512 a = load_float(&src[i]); + __m512 b = load_float(&src[i + 16]); + + __m512i bf = cvtfp32_bf16(a, b); + _mm512_storeu_si512(reinterpret_cast<__m512i*>(&dst[i]), bf); + } + for (; i < n; i++) { + dst[i] = c10::convert(src[i]); + } +} + template <> Vectorized inline fmadd(const Vectorized& a, const Vectorized& b, const Vectorized& c) { @@ -831,7 +868,9 @@ inline std::tuple, Vectorized> convert_bfloat16_float(c __at_align__ float arr[K]; __at_align__ BFloat16 arr2[K]; a.store(arr2); - convert(arr2, arr, K); + for (const auto k : c10::irange(K)) { + arr[k] = c10::convert(arr2[k]); + } return std::make_tuple( Vectorized::loadu(arr), Vectorized::loadu(arr + Vectorized::size())); @@ -843,7 +882,9 @@ inline Vectorized convert_float_bfloat16(const Vectorized& a, c __at_align__ BFloat16 arr2[K]; a.store(arr); b.store(arr + Vectorized::size()); - convert(arr, arr2, K); + for (const auto k : c10::irange(K)) { + arr2[k] = c10::convert(arr[k]); + } return Vectorized::loadu(arr2); } diff --git a/aten/src/ATen/cpu/vec/vec512/vec512_double.h b/aten/src/ATen/cpu/vec/vec512/vec512_double.h index 077ce2381cdc..c4c0749d14c2 100644 --- a/aten/src/ATen/cpu/vec/vec512/vec512_double.h +++ b/aten/src/ATen/cpu/vec/vec512/vec512_double.h @@ 
-450,6 +450,11 @@ Vectorized inline fmadd(const Vectorized& a, const Vectorized +Vectorized inline fmsub(const Vectorized& a, const Vectorized& b, const Vectorized& c) { + return _mm512_fmsub_pd(a, b, c); +} + #endif }}} diff --git a/aten/src/ATen/cpu/vec/vec512/vec512_float.h b/aten/src/ATen/cpu/vec/vec512/vec512_float.h index e0c93a834118..849e1320f55a 100644 --- a/aten/src/ATen/cpu/vec/vec512/vec512_float.h +++ b/aten/src/ATen/cpu/vec/vec512/vec512_float.h @@ -465,6 +465,11 @@ Vectorized inline fmadd(const Vectorized& a, const Vectorized +Vectorized inline fmsub(const Vectorized& a, const Vectorized& b, const Vectorized& c) { + return _mm512_fmsub_ps(a, b, c); +} + #endif }}} diff --git a/aten/src/ATen/cpu/vec/vec512/vec512_int.h b/aten/src/ATen/cpu/vec/vec512/vec512_int.h index c2cbc0b1d7f9..a2550fbfc1df 100644 --- a/aten/src/ATen/cpu/vec/vec512/vec512_int.h +++ b/aten/src/ATen/cpu/vec/vec512/vec512_int.h @@ -828,6 +828,280 @@ class Vectorized : public Vectorizedi { Vectorized le(const Vectorized& other) const; }; +template <> +class Vectorized : public Vectorizedi { +private: + static constexpr __m512i zero_vector {0, 0, 0, 0, 0, 0, 0, 0}; + static const Vectorized ones; +public: + using value_type = uint8_t; + static constexpr int size() { + return 64; + } + using Vectorizedi::Vectorizedi; + Vectorized() {} + Vectorized(uint8_t v) { values = _mm512_set1_epi8(v); } + Vectorized(uint8_t val1, uint8_t val2, uint8_t val3, uint8_t val4, + uint8_t val5, uint8_t val6, uint8_t val7, uint8_t val8, + uint8_t val9, uint8_t val10, uint8_t val11, uint8_t val12, + uint8_t val13, uint8_t val14, uint8_t val15, uint8_t val16, + uint8_t val17, uint8_t val18, uint8_t val19, uint8_t val20, + uint8_t val21, uint8_t val22, uint8_t val23, uint8_t val24, + uint8_t val25, uint8_t val26, uint8_t val27, uint8_t val28, + uint8_t val29, uint8_t val30, uint8_t val31, uint8_t val32, + uint8_t val33, uint8_t val34, uint8_t val35, uint8_t val36, + uint8_t val37, uint8_t val38, uint8_t val39, uint8_t val40, + uint8_t val41, uint8_t val42, uint8_t val43, uint8_t val44, + uint8_t val45, uint8_t val46, uint8_t val47, uint8_t val48, + uint8_t val49, uint8_t val50, uint8_t val51, uint8_t val52, + uint8_t val53, uint8_t val54, uint8_t val55, uint8_t val56, + uint8_t val57, uint8_t val58, uint8_t val59, uint8_t val60, + uint8_t val61, uint8_t val62, uint8_t val63, uint8_t val64){ + values = _mm512_set_epi8(val64, val63, val62, val61, val60, val59, val58, val57, + val56, val55, val54, val53,val52, val51, val50, val49, + val48, val47, val46, val45, val44, val43, val42, val41, + val40, val39, val38, val37, val36, val35, val34, val33, + val32, val31, val30, val29, val28, val27, val26, val25, + val24, val23, val22, val21, val20, val19, val18, val17, + val16, val15, val14, val13, val12, val11, val10, val9, + val8, val7, val6, val5, val4, val3, val2, val1); + } + template + static Vectorized blend(Vectorized a, Vectorized b) { + return _mm512_mask_blend_epi8(mask, a.values, b.values); + } + static Vectorized blendv(const Vectorized& a, const Vectorized& b, + const Vectorized& mask) { + auto msb_one = _mm512_set1_epi8(0xFF); + auto mask_ = _mm512_cmp_epu8_mask(mask, msb_one, _MM_CMPINT_EQ); + return _mm512_mask_blend_epi8(mask_, a.values, b.values); + } + template + static Vectorized arange(uint8_t base = 0, step_t step = static_cast(1)) { + return Vectorized( + base, base + step, base + 2 * step, base + 3 * step, + base + 4 * step, base + 5 * step, base + 6 * step, base + 7 * step, + base + 8 * step, base + 9 * step, base + 10 * 
step, base + 11 * step, + base + 12 * step, base + 13 * step, base + 14 * step, base + 15 * step, + base + 16 * step, base + 17 * step, base + 18 * step, base + 19 * step, + base + 20 * step, base + 21 * step, base + 22 * step, base + 23 * step, + base + 24 * step, base + 25 * step, base + 26 * step, base + 27 * step, + base + 28 * step, base + 29 * step, base + 30 * step, base + 31 * step, + base + 32 * step, base + 33 * step, base + 34 * step, base + 35 * step, + base + 36 * step, base + 37 * step, base + 38 * step, base + 39 * step, + base + 40 * step, base + 41 * step, base + 42 * step, base + 43 * step, + base + 44 * step, base + 45 * step, base + 46 * step, base + 47 * step, + base + 48 * step, base + 49 * step, base + 50 * step, base + 51 * step, + base + 52 * step, base + 53 * step, base + 54 * step, base + 55 * step, + base + 56 * step, base + 57 * step, base + 58 * step, base + 59 * step, + base + 60 * step, base + 61 * step, base + 62 * step, base + 63 * step); + } + static Vectorized + set(Vectorized a, Vectorized b, uint8_t count = size()) { + switch (count) { + case 0: + return a; + case 1: + return blend<0x1>(a, b); + case 2: + return blend<0x3>(a, b); + case 3: + return blend<0x7>(a, b); + case 4: + return blend<0xF>(a, b); + case 5: + return blend<0x1F>(a, b); + case 6: + return blend<0x3F>(a, b); + case 7: + return blend<0x7F>(a, b); + case 8: + return blend<0xFF>(a, b); + case 9: + return blend<0x1FF>(a, b); + case 10: + return blend<0x3FF>(a, b); + case 11: + return blend<0x7FF>(a, b); + case 12: + return blend<0xFFF>(a, b); + case 13: + return blend<0x1FFF>(a, b); + case 14: + return blend<0x3FFF>(a, b); + case 15: + return blend<0x7FFF>(a, b); + case 16: + return blend<0xFFFF>(a, b); + case 17: + return blend<0x1FFFF>(a, b); + case 18: + return blend<0x3FFFF>(a, b); + case 19: + return blend<0x7FFFF>(a, b); + case 20: + return blend<0xFFFFF>(a, b); + case 21: + return blend<0x1FFFFF>(a, b); + case 22: + return blend<0x3FFFFF>(a, b); + case 23: + return blend<0x7FFFFF>(a, b); + case 24: + return blend<0xFFFFFF>(a, b); + case 25: + return blend<0x1FFFFFF>(a, b); + case 26: + return blend<0x3FFFFFF>(a, b); + case 27: + return blend<0x7FFFFFF>(a, b); + case 28: + return blend<0xFFFFFFF>(a, b); + case 29: + return blend<0x1FFFFFFF>(a, b); + case 30: + return blend<0x3FFFFFFF>(a, b); + case 31: + return blend<0x7FFFFFFF>(a, b); + case 32: + return blend<0xFFFFFFFF>(a, b); + case 33: + return blend<0x1FFFFFFFF>(a, b); + case 34: + return blend<0x3FFFFFFFF>(a, b); + case 35: + return blend<0x7FFFFFFFF>(a, b); + case 36: + return blend<0xFFFFFFFFF>(a, b); + case 37: + return blend<0x1FFFFFFFFF>(a, b); + case 38: + return blend<0x3FFFFFFFFF>(a, b); + case 39: + return blend<0x7FFFFFFFFF>(a, b); + case 40: + return blend<0xFFFFFFFFFF>(a, b); + case 41: + return blend<0x1FFFFFFFFFF>(a, b); + case 42: + return blend<0x3FFFFFFFFFF>(a, b); + case 43: + return blend<0x7FFFFFFFFFF>(a, b); + case 44: + return blend<0xFFFFFFFFFFF>(a, b); + case 45: + return blend<0x1FFFFFFFFFFF>(a, b); + case 46: + return blend<0x3FFFFFFFFFFF>(a, b); + case 47: + return blend<0x7FFFFFFFFFFF>(a, b); + case 48: + return blend<0xFFFFFFFFFFFF>(a, b); + case 49: + return blend<0x1FFFFFFFFFFFF>(a, b); + case 50: + return blend<0x3FFFFFFFFFFFF>(a, b); + case 51: + return blend<0x7FFFFFFFFFFFF>(a, b); + case 52: + return blend<0xFFFFFFFFFFFFF>(a, b); + case 53: + return blend<0x1FFFFFFFFFFFFF>(a, b); + case 54: + return blend<0x3FFFFFFFFFFFFF>(a, b); + case 55: + return blend<0x7FFFFFFFFFFFFF>(a, b); + case 56: 
+ return blend<0xFFFFFFFFFFFFFF>(a, b); + case 57: + return blend<0x1FFFFFFFFFFFFFF>(a, b); + case 58: + return blend<0x3FFFFFFFFFFFFFF>(a, b); + case 59: + return blend<0x7FFFFFFFFFFFFFF>(a, b); + case 60: + return blend<0xFFFFFFFFFFFFFFF>(a, b); + case 61: + return blend<0x1FFFFFFFFFFFFFFF>(a, b); + case 62: + return blend<0x3FFFFFFFFFFFFFFF>(a, b); + case 63: + return blend<0x7FFFFFFFFFFFFFFF>(a, b); + } + return b; + } + static Vectorized loadu(const void* ptr) { + return _mm512_loadu_si512(reinterpret_cast(ptr)); + } + static Vectorized loadu(const void* ptr, uint8_t count) { + __at_align__ uint8_t tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, ptr, count * sizeof(uint8_t)); + return loadu(tmp_values); + } + void store(void* ptr, int count = size()) const { + if (count == size()) { + // ptr need not to be aligned here. See + // https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions/intrinsics-for-load-and-store-operations-1/mm512-storeu-si512.html + _mm512_storeu_si512(reinterpret_cast<__m512i*>(ptr), values); + } else if (count > 0) { + __at_align__ uint8_t tmp_values[size()]; + _mm512_storeu_si512(reinterpret_cast<__m512i*>(tmp_values), values); + std::memcpy(ptr, tmp_values, count * sizeof(uint8_t)); + } + } + const uint8_t& operator[](int idx) const = delete; + uint8_t& operator[](int idx) = delete; + Vectorized abs() const { + return values; + } + Vectorized real() const { + return *this; + } + Vectorized imag() const { + return _mm512_set1_epi8(0); + } + Vectorized conj() const { + return *this; + } + Vectorized frac() const; + Vectorized neg() const; + Vectorized operator==(const Vectorized& other) const { + auto mask = _mm512_cmpeq_epu8_mask(values, other.values); + return _mm512_mask_set1_epi8(zero_vector, mask, 0xFF); + } + Vectorized operator!=(const Vectorized& other) const { + auto mask = _mm512_cmpneq_epu8_mask(values, other.values); + return _mm512_mask_set1_epi8(zero_vector, mask, 0xFF); + } + Vectorized operator<(const Vectorized& other) const { + auto mask = _mm512_cmplt_epu8_mask(values, other.values); + return _mm512_mask_set1_epi8(zero_vector, mask, 0xFF); + } + Vectorized operator<=(const Vectorized& other) const { + auto mask = _mm512_cmple_epu8_mask(values, other.values); + return _mm512_mask_set1_epi8(zero_vector, mask, 0xFF); + } + Vectorized operator>(const Vectorized& other) const { + return other < *this; + } + Vectorized operator>=(const Vectorized& other) const { + return other <= *this; + } + + Vectorized eq(const Vectorized& other) const; + Vectorized ne(const Vectorized& other) const; + Vectorized gt(const Vectorized& other) const; + Vectorized ge(const Vectorized& other) const; + Vectorized lt(const Vectorized& other) const; + Vectorized le(const Vectorized& other) const; +}; + template <> Vectorized inline operator+(const Vectorized& a, const Vectorized& b) { return _mm512_add_epi64(a, b); @@ -848,6 +1122,11 @@ Vectorized inline operator+(const Vectorized& a, const Vectorize return _mm512_add_epi8(a, b); } +template <> +Vectorized inline operator+(const 
Vectorized& a, const Vectorized& b) { + return _mm512_add_epi8(a, b); +} + template <> Vectorized inline operator-(const Vectorized& a, const Vectorized& b) { return _mm512_sub_epi64(a, b); @@ -868,6 +1147,11 @@ Vectorized inline operator-(const Vectorized& a, const Vectorize return _mm512_sub_epi8(a, b); } +template <> +Vectorized inline operator-(const Vectorized& a, const Vectorized& b) { + return _mm512_sub_epi8(a, b); +} + // Negation. Defined here so we can utilize operator- inline Vectorized Vectorized::neg() const { return Vectorized(0) - *this; @@ -885,6 +1169,10 @@ inline Vectorized Vectorized::neg() const { return Vectorized(0) - *this; } +inline Vectorized Vectorized::neg() const { + return Vectorized(0) - *this; +} + template <> Vectorized inline operator*(const Vectorized& a, const Vectorized& b) { return _mm512_mullo_epi64(a, b); @@ -918,6 +1206,12 @@ Vectorized inline operator*(const Vectorized& a, const Vectorize return int_elementwise_binary_512(a, b, std::multiplies()); } +template <> +Vectorized inline operator*(const Vectorized& a, const Vectorized& b) { + // We don't have an instruction for multiplying uint8_t + return int_elementwise_binary_512(a, b, std::multiplies()); +} + template <> Vectorized inline minimum(const Vectorized& a, const Vectorized& b) { return _mm512_min_epi64(a, b); @@ -938,6 +1232,11 @@ Vectorized inline minimum(const Vectorized& a, const Vectorized< return _mm512_min_epi8(a, b); } +template <> +Vectorized inline minimum(const Vectorized& a, const Vectorized& b) { + return _mm512_min_epu8(a, b); +} + template <> Vectorized inline maximum(const Vectorized& a, const Vectorized& b) { return _mm512_max_epi64(a, b); @@ -958,6 +1257,11 @@ Vectorized inline maximum(const Vectorized& a, const Vectorized< return _mm512_max_epi8(a, b); } +template <> +Vectorized inline maximum(const Vectorized& a, const Vectorized& b) { + return _mm512_max_epi8(a, b); +} + template <> Vectorized inline clamp(const Vectorized& a, const Vectorized& min_val, const Vectorized& max_val) { return _mm512_min_epi64(max_val, _mm512_max_epi64(a, min_val)); @@ -978,6 +1282,11 @@ Vectorized inline clamp(const Vectorized& a, const Vectorized +Vectorized inline clamp(const Vectorized& a, const Vectorized& min_val, const Vectorized& max_val) { + return _mm512_min_epu8(max_val, _mm512_max_epu8(a, min_val)); +} + template <> Vectorized inline clamp_max(const Vectorized& a, const Vectorized& max_val) { return _mm512_min_epi64(max_val, a); @@ -998,6 +1307,11 @@ Vectorized inline clamp_max(const Vectorized& a, const Vectorize return _mm512_min_epi8(max_val, a); } +template <> +Vectorized inline clamp_max(const Vectorized& a, const Vectorized& max_val) { + return _mm512_min_epu8(max_val, a); +} + template <> Vectorized inline clamp_min(const Vectorized& a, const Vectorized& min_val) { return _mm512_max_epi64(min_val, a); @@ -1018,6 +1332,11 @@ Vectorized inline clamp_min(const Vectorized& a, const Vectorize return _mm512_max_epi8(min_val, a); } +template <> +Vectorized inline clamp_min(const Vectorized& a, const Vectorized& min_val) { + return _mm512_max_epu8(min_val, a); +} + template Vectorized inline convert_to_int32(const T* ptr) { return Vectorized::loadu(ptr); @@ -1049,6 +1368,10 @@ template <> Vectorized inline operator/(const Vectorized& a, const Vectorized& b) { return int_elementwise_binary_512(a, b, std::divides()); } +template <> +Vectorized inline operator/(const Vectorized& a, const Vectorized& b) { + return int_elementwise_binary_512(a, b, std::divides()); +} template>::value, 
int> = 0> inline Vectorized operator&(const Vectorized& a, const Vectorized& b) { @@ -1163,6 +1486,164 @@ inline Vectorized Vectorized::le(const Vectorized& other return (*this <= other) & Vectorized(1); } +inline Vectorized Vectorized::eq(const Vectorized& other) const { + return (*this == other) & Vectorized(1); +} + +inline Vectorized Vectorized::ne(const Vectorized& other) const { + return (*this != other) & Vectorized(1); +} + +inline Vectorized Vectorized::gt(const Vectorized& other) const { + return (*this > other) & Vectorized(1); +} + +inline Vectorized Vectorized::ge(const Vectorized& other) const { + return (*this >= other) & Vectorized(1); +} + +inline Vectorized Vectorized::lt(const Vectorized& other) const { + return (*this < other) & Vectorized(1); +} + +inline Vectorized Vectorized::le(const Vectorized& other) const { + return (*this <= other) & Vectorized(1); +} + +template ::value || std::is_same::value, int> = 0> +Vectorized inline shift_512_8(const Vectorized& a, const Vectorized& b) { + // No vector instruction for shifting int8_t/uint8_t, so emulating + // it instead. + + // Control masks for shuffle operation, treating 512 bits as an + // array of 8-bit elements, and considering pairs of neighboring + // elements. Specifially, a mask named "ctl_M_N" (M,N in [0,1], and + // M!=N) is set so that shuffle will move element with index M from + // input pair into element with index N in output pair, and element + // with index M in output pair will be set to all 0s. + __m512i ctl_0_1 = _mm512_set_epi8(62, 0x80, 60, 0x80, 58, 0x80, 56, 0x80, + 54, 0x80, 52, 0x80, 50, 0x80, 48, 0x80, + 46, 0x80, 44, 0x80, 42, 0x80, 40, 0x80, + 38, 0x80, 36, 0x80, 34, 0x80, 32, 0x80, + 30, 0x80, 28, 0x80, 26, 0x80, 24, 0x80, + 22, 0x80, 20, 0x80, 18, 0x80, 16, 0x80, + 14, 0x80, 12, 0x80, 10, 0x80, 8, 0x80, + 6, 0x80, 4, 0x80, 2, 0x80, 0, 0x80); + __m512i ctl_1_0 = _mm512_set_epi8(0x80, 63, 0x80, 61, 0x80, 59, 0x80, 57, + 0x80, 55, 0x80, 53, 0x80, 51, 0x80, 49, + 0x80, 47, 0x80, 45, 0x80, 43, 0x80, 41, + 0x80, 39, 0x80, 37, 0x80, 35, 0x80, 33, + 0x80, 31, 0x80, 29, 0x80, 27, 0x80, 25, + 0x80, 23, 0x80, 21, 0x80, 19, 0x80, 17, + 0x80, 15, 0x80, 13, 0x80, 11, 0x80, 9, + 0x80, 7, 0x80, 5, 0x80, 3, 0x80, 1); + + // Masks for bitwise and operation, treating 512 bits as an array of + // 8-bit elements, and considering them in pairs of neighboring + // elements. A mask named "keep_M" (M in [0,1]) is set so that + // bitwise and will copy element with index M from input pair into + // element with the same index in output pair, while the other + // element in output pair will be set to all 0s. + __m512i keep_0 = _mm512_set1_epi16(0xFF); + __m512i keep_1 = _mm512_set1_epi16(0xFF00); + + // Take each 8-bit element with idx%2==0 from input array to be + // shifted and extend it to 16 bits so that 0s are added to the + // right. Then, perform shifting on this 16-bit number. Upper 8 + // bits will be proper result of shifting original 8-bit number, so + // write them to result array, into the same position from which + // corresponding input element is taken. Also, make sure that + // result array elements with idx%2!=0 are set to all 0s. + // + // Note that number of bits to shift for is extended to 16 bits by + // adding 0s to the left. That means this number is not properly + // sign-extended for negative values. 
However, number of bits to + // shift is treated as an unsigned integer by respective shift + // intrinsics anyway so if negative then either with or without + // proper sign extension, it will be interpreted as a number greater + // than 32, and the shifting result will be the same. + __m512i a0 = _mm512_shuffle_epi8(a, ctl_0_1); + __m512i b0 = _mm512_and_si512(b, keep_0); + __m512i c0; + if (left_shift) + c0 = _mm512_sllv_epi16(a0, b0); + else + if (std::is_same::value) + c0 = _mm512_srav_epi16(a0, b0); + else + c0 = _mm512_srlv_epi16(a0, b0); + c0 = _mm512_shuffle_epi8(c0, ctl_1_0); + + // Peform shifting the same way for input array elements with + // idx%2==1. + __m512i a1 = _mm512_and_si512(a, keep_1); + __m512i b1 = _mm512_shuffle_epi8(b, ctl_1_0); + __m512i c1; + if (left_shift) + c1 = _mm512_sllv_epi16(a1, b1); + else + if (std::is_same::value) + c1 = _mm512_srav_epi16(a1, b1); + else + c1 = _mm512_srlv_epi16(a1, b1); + c1 = _mm512_and_si512(c1, keep_1); + + // Merge partial results into the final result. + __m512i c = _mm512_or_si512(c0, c1); + + return c; +} + +template <> +Vectorized inline operator<<(const Vectorized& a, const Vectorized& b) { + return _mm512_sllv_epi64(a, b); +} + +template <> +Vectorized inline operator<<(const Vectorized& a, const Vectorized& b) { + return _mm512_sllv_epi32(a, b); +} + +template <> +Vectorized inline operator<<(const Vectorized& a, const Vectorized& b) { + return _mm512_sllv_epi16(a, b); +} + +template <> +Vectorized inline operator<<(const Vectorized& a, const Vectorized& b) { + return shift_512_8(a, b); +} + +template <> +Vectorized inline operator<<(const Vectorized& a, const Vectorized& b) { + return shift_512_8(a, b); +} + +template <> +Vectorized inline operator>>(const Vectorized& a, const Vectorized& b) { + return _mm512_srav_epi64(a, b); +} + +template <> +Vectorized inline operator>>(const Vectorized& a, const Vectorized& b) { + return _mm512_srav_epi32(a, b); +} + +template <> +Vectorized inline operator>>(const Vectorized& a, const Vectorized& b) { + return _mm512_srav_epi16(a, b); +} + +template <> +Vectorized inline operator>>(const Vectorized& a, const Vectorized& b) { + return shift_512_8(a, b); +} + +template <> +Vectorized inline operator>>(const Vectorized& a, const Vectorized& b) { + return shift_512_8(a, b); +} + #endif }}} diff --git a/aten/src/ATen/cpu/vec/vec512/vec512_qint.h b/aten/src/ATen/cpu/vec/vec512/vec512_qint.h index 0f3474eaa2ad..87cf44283c0b 100644 --- a/aten/src/ATen/cpu/vec/vec512/vec512_qint.h +++ b/aten/src/ATen/cpu/vec/vec512/vec512_qint.h @@ -268,6 +268,18 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. 
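The widening trick shift_512_8 relies on can be checked against a scalar model: place the 8-bit value in the upper byte of a 16-bit lane, shift the 16-bit lane, and read the upper byte back. A minimal sketch assuming only standard C++; shift8_via_16 is an illustrative name, not part of the patch.

#include <cstdint>
#include <type_traits>

// Scalar model of one lane of shift_512_8: the value sits in bits [15:8] of a
// 16-bit lane, the 16-bit shift is performed, and bits [15:8] of the result
// are the correctly shifted (and, for int8_t right shifts, sign-extended) byte.
template <bool left_shift, typename T>
T shift8_via_16(T value, uint8_t count) {
  static_assert(std::is_same<T, int8_t>::value || std::is_same<T, uint8_t>::value,
                "8-bit lanes only");
  uint16_t widened = static_cast<uint16_t>(static_cast<uint8_t>(value)) << 8;
  uint16_t shifted;
  if (left_shift) {
    shifted = static_cast<uint16_t>(widened << count);                          // like _mm512_sllv_epi16
  } else if (std::is_same<T, int8_t>::value) {
    shifted = static_cast<uint16_t>(static_cast<int16_t>(widened) >> count);    // like _mm512_srav_epi16
  } else {
    shifted = static_cast<uint16_t>(widened >> count);                          // like _mm512_srlv_epi16
  }
  return static_cast<T>(static_cast<uint8_t>(shifted >> 8));
}

// e.g. shift8_via_16<false>(int8_t(-32), 2) == int8_t(-8), matching an
// arithmetic right shift on the original 8-bit value.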
+ for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + float_vec_return_type dequantize( Vectorized scale, Vectorized zero_point, @@ -447,6 +459,18 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + private: __m512i cvtepi8_epi32(__m128i epi8_vals) const { return _mm512_cvtepi8_epi32(epi8_vals); @@ -611,6 +635,18 @@ struct Vectorized : public Vectorizedqi { return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + private: __m512i cvtepu8_epi32(__m128i epu8_vals) const { return _mm512_cvtepu8_epi32(epu8_vals); @@ -833,6 +869,18 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, @@ -965,6 +1013,18 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. 
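The same masked-load pattern repeats for each quantized vector type in this file: zero a full-width staging buffer, copy only the valid elements, then perform an ordinary full-width load. A generic sketch of that pattern with illustrative names (StagingLoad, kLanes); it assumes 0 <= count <= kLanes.

#include <cstdint>
#include <cstring>

// Generic form of the loadu(ptr, count) overloads: the tail lanes are zeroed
// explicitly so the subsequent full-width load never reads uninitialized
// memory, and only `count` elements are copied from the source.
template <typename T, int kLanes>
struct StagingLoad {
  alignas(64) T tmp[kLanes];

  const T* fill(const void* src, int64_t count) {
    for (int i = 0; i < kLanes; ++i) {
      tmp[i] = T(0);
    }
    std::memcpy(tmp, src, count * sizeof(T));
    return tmp;  // safe input for a full-width vector load
  }
};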
+ for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, @@ -1085,6 +1145,18 @@ struct Vectorized : public VectorizedQuantizedConverter< return Vectorized(ptr); } + static Vectorized loadu(const void* ptr, int64_t count) { + __at_align__ value_type tmp_values[size()]; + // Ensure uninitialized memory does not change the output value See https://github.com/pytorch/pytorch/issues/32502 + // for more details. We do not initialize arrays to zero using "={0}" because gcc would compile it to two + // instructions while a loop would be compiled to one instruction. + for (const auto i : c10::irange(size())) { + tmp_values[i] = 0; + } + std::memcpy(tmp_values, reinterpret_cast(ptr), count * sizeof(value_type)); + return loadu(tmp_values); + } + static Vectorized quantize( const float_vec_return_type& rhs, float scale, diff --git a/aten/src/ATen/cpu/vec/vec_base.h b/aten/src/ATen/cpu/vec/vec_base.h index 3bf1010efd68..abf106e8d5b3 100644 --- a/aten/src/ATen/cpu/vec/vec_base.h +++ b/aten/src/ATen/cpu/vec/vec_base.h @@ -33,6 +33,7 @@ #include #include #include +#include // These macros helped us unify vec_base.h #ifdef CPU_CAPABILITY_AVX512 @@ -131,8 +132,9 @@ struct Vectorized { // versions GCC/Clang have buggy determinations on whether or not an // identifier is odr-used or not, and in any case it's hard to tell if // a variable is odr-used or not. So best to just cut the problem at the root. + static constexpr size_type size_T = sizeof(T); // Workaround to compile with VS2022. static constexpr size_type size() { - return VECTOR_WIDTH / sizeof(T); + return VECTOR_WIDTH / size_T; } Vectorized() : values{static_cast(0)} {} Vectorized(T val) { @@ -797,6 +799,21 @@ inline Vectorized operator~(const Vectorized& a) { return a ^ ones; } +template Vectorized inline operator<<(const Vectorized &a, const Vectorized &b) { + Vectorized c; + for (int i = 0; i != Vectorized::size(); i++) { + c[i] = a[i] << b[i]; + } + return c; +} + +template Vectorized inline operator>>(const Vectorized &a, const Vectorized &b) { + Vectorized c; + for (int i = 0; i != Vectorized::size(); i++) { + c[i] = a[i] >> b[i]; + } + return c; +} template inline Vectorized& operator += (Vectorized& a, const Vectorized& b) { @@ -824,11 +841,28 @@ inline Vectorized& operator *= (Vectorized& a, const Vectorized& b) { return a; } +template +inline Vectorized& operator <<= (Vectorized& a, const Vectorized& b) { + a = a << b; + return a; +} + +template +inline Vectorized& operator >>= (Vectorized& a, const Vectorized& b) { + a = a >> b; + return a; +} + template inline Vectorized fmadd(const Vectorized& a, const Vectorized& b, const Vectorized& c) { return a * b + c; } +template +inline Vectorized fmsub(const Vectorized& a, const Vectorized& b, const Vectorized& c) { + return a * b - c; +} + template std::enable_if_t> inline gather(T const* base_addr, const Vectorized>& vindex) { @@ -975,10 +1009,22 @@ inline void convert(const src_T *src, dst_T *dst, int64_t n) { #endif for (const auto i : c10::irange(n)) { (void)i; //Suppress unused variable warning - *dst = c10::static_cast_with_inter_type::apply(*src); + *dst = c10::convert(c10::load(src)); src++; dst++; } } +template +inline Vectorized flip(const Vectorized & data) { + static constexpr int size = Vectorized::size(); + T output[size]; + T buffer[size]; + data.store(static_cast(buffer)); + for 
(const auto i : c10::irange(size)) { + output[i] = buffer[size - i - 1]; + } + return Vectorized::loadu(static_cast(output)); +} + }}} diff --git a/aten/src/ATen/cuda/Atomic.cuh b/aten/src/ATen/cuda/Atomic.cuh index 079b289ef8c3..3d60b672e972 100644 --- a/aten/src/ATen/cuda/Atomic.cuh +++ b/aten/src/ATen/cuda/Atomic.cuh @@ -6,6 +6,10 @@ #include +#if !(defined(USE_ROCM) || ((defined(CUDA_VERSION) && CUDA_VERSION < 11000) || (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 800)))) +#include +#endif + template struct AtomicFPOp; @@ -164,6 +168,7 @@ Atomic##NAME##IntegerImpl()(address, } \ ATOMIC_INTEGER_IMPL(Add) +GPU_ATOMIC_INTEGER(Add, a || b, bool) // Don't instantiate gpuAtomicAdd with the macro as it seems non-standard (see int32, int64) static inline __device__ void gpuAtomicAdd(uint8_t *address, uint8_t val) { @@ -206,10 +211,6 @@ static inline __device__ void gpuAtomicAdd(int64_t *address, int64_t val) { #endif } -static inline __device__ void gpuAtomicAdd(bool *address, bool val) { - *address = address && val; -} - static inline __device__ at::Half gpuAtomicAdd(at::Half *address, at::Half val) { #if defined(USE_ROCM) || ((defined(CUDA_VERSION) && CUDA_VERSION < 10000) || (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 700))) return AtomicFPOp()(address, val, @@ -222,10 +223,15 @@ static inline __device__ at::Half gpuAtomicAdd(at::Half *address, at::Half val) } static inline __device__ at::BFloat16 gpuAtomicAdd(at::BFloat16 *address, at::BFloat16 val) { - return AtomicFPOp()(address, val, - [](at::BFloat16 bsum, at::BFloat16 val) { - return bsum + val; - }); +#if defined(USE_ROCM) || ((defined(CUDA_VERSION) && CUDA_VERSION < 11000) || (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 800))) +return AtomicFPOp()(address, val, + [](at::BFloat16 bsum, at::BFloat16 val) { + return bsum + val; + }); +#else + __nv_bfloat16 r = atomicAdd(reinterpret_cast<__nv_bfloat16*>(address), *reinterpret_cast<__nv_bfloat16*>(&val)); + return *reinterpret_cast(&r); +#endif } #if defined(CUDA_VERSION) && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 600 || CUDA_VERSION < 8000) @@ -256,7 +262,7 @@ static inline __device__ double atomicAdd(double* address, double val) * minimal. 
*/ -#if defined(__HIP_PLATFORM_HCC__) && __hcc_workweek__ < 18312 && !__HIP__ +#if defined(USE_ROCM) && __hcc_workweek__ < 18312 && !__HIP__ // This needs to be defined for the host side pass static inline __device__ double atomicAdd(double *address, double val) { } #endif diff --git a/aten/src/ATen/cuda/CUDABlas.cpp b/aten/src/ATen/cuda/CUDABlas.cpp index e99017289d68..866f53ee7f87 100644 --- a/aten/src/ATen/cuda/CUDABlas.cpp +++ b/aten/src/ATen/cuda/CUDABlas.cpp @@ -1162,7 +1162,7 @@ void vdot>(CUDABLAS_DOT_ARGTYPES(c10::complex)) { reinterpret_cast(result))); } -// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched, getriBatched on platforms other than cuda +// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched on platforms other than cuda #ifdef CUDART_VERSION template <> @@ -1323,67 +1323,6 @@ void getrfBatched>( batchsize)); } -template <> -void getriBatched( - int n, double** dA_array, int ldda, int* ipiv_array, double** dC_array, int lddc, int* info_array, int batchsize) { - auto handle = at::cuda::getCurrentCUDABlasHandle(); - TORCH_CUDABLAS_CHECK(cublasDgetriBatched( - handle, n, dA_array, ldda, ipiv_array, dC_array, lddc, info_array, batchsize)); -} - -template <> -void getriBatched( - int n, float** dA_array, int ldda, int* ipiv_array, float** dC_array, int lddc, int* info_array, int batchsize) { - auto handle = at::cuda::getCurrentCUDABlasHandle(); - TORCH_CUDABLAS_CHECK(cublasSgetriBatched( - handle, n, dA_array, ldda, ipiv_array, dC_array, lddc, info_array, batchsize)); -} - -template <> -void getriBatched>( - int n, - c10::complex** dA_array, - int ldda, - int* ipiv_array, - c10::complex** dC_array, - int lddc, - int* info_array, - int batchsize) { - auto handle = at::cuda::getCurrentCUDABlasHandle(); - TORCH_CUDABLAS_CHECK(cublasZgetriBatched( - handle, - n, - reinterpret_cast(dA_array), - ldda, - ipiv_array, - reinterpret_cast(dC_array), - lddc, - info_array, - batchsize)); -} - -template <> -void getriBatched>( - int n, - c10::complex** dA_array, - int ldda, - int* ipiv_array, - c10::complex** dC_array, - int lddc, - int* info_array, - int batchsize) { - auto handle = at::cuda::getCurrentCUDABlasHandle(); - TORCH_CUDABLAS_CHECK(cublasCgetriBatched( - handle, - n, - reinterpret_cast(dA_array), - ldda, - ipiv_array, - reinterpret_cast(dC_array), - lddc, - info_array, - batchsize)); -} template <> void gelsBatched(CUDABLAS_GELS_BATCHED_ARGTYPES(double)) { diff --git a/aten/src/ATen/cuda/CUDABlas.h b/aten/src/ATen/cuda/CUDABlas.h index 10e589ecd6c9..96c7fc818422 100644 --- a/aten/src/ATen/cuda/CUDABlas.h +++ b/aten/src/ATen/cuda/CUDABlas.h @@ -227,7 +227,7 @@ void vdot>(CUDABLAS_DOT_ARGTYPES(c10::complex)); template <> void vdot>(CUDABLAS_DOT_ARGTYPES(c10::complex)); -// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched, getriBatched on platforms other than cuda +// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched on platforms other than cuda #ifdef CUDART_VERSION #define CUDABLAS_GETRS_ARGTYPES(Dtype) \ @@ -287,22 +287,6 @@ TORCH_CUDA_CU_API void getrfBatched>(CUDABLAS_GETRF_ARGTYPE template<> TORCH_CUDA_CU_API void getrfBatched>(CUDABLAS_GETRF_ARGTYPES(c10::complex)); -#define CUDABLAS_GETRI_ARGTYPES(Dtype) \ - int n, Dtype** dA_array, int ldda, int* ipiv_array, Dtype** dC_array, int lddc, int* info_array, int batchsize - -template -void getriBatched(CUDABLAS_GETRI_ARGTYPES(Dtype)) { - TORCH_CHECK(false, "at::cuda::blas::getriBatched: not implemented for ", typeid(Dtype).name()); -} -template<> 
-TORCH_CUDA_CU_API void getriBatched(CUDABLAS_GETRI_ARGTYPES(float)); -template<> -TORCH_CUDA_CU_API void getriBatched(CUDABLAS_GETRI_ARGTYPES(double)); -template<> -TORCH_CUDA_CU_API void getriBatched>(CUDABLAS_GETRI_ARGTYPES(c10::complex)); -template<> -TORCH_CUDA_CU_API void getriBatched>(CUDABLAS_GETRI_ARGTYPES(c10::complex)); - #define CUDABLAS_GELS_BATCHED_ARGTYPES(Dtype) \ cublasHandle_t handle, cublasOperation_t trans, int m, int n, int nrhs, Dtype** dA_array, int ldda, Dtype** dC_array, int lddc, int* info, int *devInfoArray, int batchSize diff --git a/aten/src/ATen/cuda/CUDAContext.h b/aten/src/ATen/cuda/CUDAContext.h index 0167cd585eaa..12349b709050 100644 --- a/aten/src/ATen/cuda/CUDAContext.h +++ b/aten/src/ATen/cuda/CUDAContext.h @@ -72,6 +72,8 @@ TORCH_CUDA_CPP_API Allocator* getCUDADeviceAllocator(); TORCH_CUDA_CPP_API cusparseHandle_t getCurrentCUDASparseHandle(); TORCH_CUDA_CPP_API cublasHandle_t getCurrentCUDABlasHandle(); +TORCH_CUDA_CPP_API void clearCublasWorkspaces(); + #ifdef CUDART_VERSION TORCH_CUDA_CPP_API cusolverDnHandle_t getCurrentCUDASolverDnHandle(); #endif diff --git a/aten/src/ATen/cuda/CUDADataType.h b/aten/src/ATen/cuda/CUDADataType.h index 5221b233398c..d25722c080ec 100644 --- a/aten/src/ATen/cuda/CUDADataType.h +++ b/aten/src/ATen/cuda/CUDADataType.h @@ -33,7 +33,7 @@ template<> inline cudaDataType getCudaDataType>() { } // HIP doesn't define integral types -#ifndef __HIP_PLATFORM_HCC__ +#ifndef USE_ROCM template<> inline cudaDataType getCudaDataType() { return CUDA_R_8U; } @@ -45,7 +45,7 @@ template<> inline cudaDataType getCudaDataType() { } #endif -#if !defined(__HIP_PLATFORM_HCC__) && defined(CUDA_VERSION) && CUDA_VERSION >= 11000 +#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 11000 template<> inline cudaDataType getCudaDataType() { return CUDA_R_16I; } @@ -60,7 +60,7 @@ template<> inline cudaDataType getCudaDataType() { inline cudaDataType ScalarTypeToCudaDataType(const c10::ScalarType& scalar_type) { switch (scalar_type) { // HIP doesn't define integral types -#ifndef __HIP_PLATFORM_HCC__ +#ifndef USE_ROCM case c10::ScalarType::Byte: return CUDA_R_8U; case c10::ScalarType::Char: @@ -80,7 +80,7 @@ inline cudaDataType ScalarTypeToCudaDataType(const c10::ScalarType& scalar_type) return CUDA_C_32F; case c10::ScalarType::ComplexDouble: return CUDA_C_64F; -#if !defined(__HIP_PLATFORM_HCC__) && defined(CUDA_VERSION) && CUDA_VERSION >= 11000 +#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 11000 case c10::ScalarType::Short: return CUDA_R_16I; case c10::ScalarType::Long: diff --git a/aten/src/ATen/cuda/CUDAEvent.h b/aten/src/ATen/cuda/CUDAEvent.h index 205fad8c1121..1c3c67949e58 100644 --- a/aten/src/ATen/cuda/CUDAEvent.h +++ b/aten/src/ATen/cuda/CUDAEvent.h @@ -48,7 +48,7 @@ struct TORCH_CUDA_CPP_API CUDAEvent { CUDAGuard guard(device_index_); const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); if (C10_UNLIKELY(interp)) { - interp->trace_gpu_event_deletion(reinterpret_cast(event_)); + (*interp)->trace_gpu_event_deletion(reinterpret_cast(event_)); } cudaEventDestroy(event_); } @@ -120,7 +120,7 @@ struct TORCH_CUDA_CPP_API CUDAEvent { AT_CUDA_CHECK(cudaEventRecord(event_, stream)); const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); if (C10_UNLIKELY(interp)) { - interp->trace_gpu_event_record( + (*interp)->trace_gpu_event_record( reinterpret_cast(event_), reinterpret_cast(stream.stream()) ); @@ -136,7 +136,7 @@ struct TORCH_CUDA_CPP_API CUDAEvent { 
AT_CUDA_CHECK(cudaStreamWaitEvent(stream, event_, 0)); const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); if (C10_UNLIKELY(interp)) { - interp->trace_gpu_event_wait( + (*interp)->trace_gpu_event_wait( reinterpret_cast(event_), reinterpret_cast(stream.stream()) ); @@ -157,6 +157,10 @@ struct TORCH_CUDA_CPP_API CUDAEvent { // Note: cudaEventSynchronize can be safely called from any device void synchronize() const { if (is_created_) { + const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); + if (C10_UNLIKELY(interp)) { + (*interp)->trace_gpu_event_synchronization(reinterpret_cast(event_)); + } AT_CUDA_CHECK(cudaEventSynchronize(event_)); } } @@ -185,7 +189,7 @@ struct TORCH_CUDA_CPP_API CUDAEvent { AT_CUDA_CHECK(cudaEventCreateWithFlags(&event_, flags_)); const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace(); if (C10_UNLIKELY(interp)) { - interp->trace_gpu_event_creation(reinterpret_cast(event_)); + (*interp)->trace_gpu_event_creation(reinterpret_cast(event_)); } is_created_ = true; } diff --git a/aten/src/ATen/cuda/CUDAGeneratorImpl.cpp b/aten/src/ATen/cuda/CUDAGeneratorImpl.cpp index 0cac5d6da2d5..a678354dca49 100644 --- a/aten/src/ATen/cuda/CUDAGeneratorImpl.cpp +++ b/aten/src/ATen/cuda/CUDAGeneratorImpl.cpp @@ -231,7 +231,8 @@ uint64_t CUDAGeneratorImpl::philox_offset_per_thread() const { * offset_extragraph is the initial offset at the start of the graphed region. * offset_intragraph tracks the offset in the graphed region. */ -void CUDAGeneratorImpl::capture_prologue(int64_t* offset_extragraph) { +void CUDAGeneratorImpl::capture_prologue(int64_t* seed_extragraph, int64_t* offset_extragraph) { + seed_extragraph_ = seed_extragraph; offset_extragraph_ = offset_extragraph; offset_intragraph_ = 0; graph_expects_this_gen_ = true; @@ -279,7 +280,7 @@ PhiloxCudaState CUDAGeneratorImpl::philox_cuda_state(uint64_t increment) { TORCH_INTERNAL_ASSERT(this->offset_intragraph_ <= std::numeric_limits::max() - increment); this->offset_intragraph_ += increment; - return PhiloxCudaState(this->seed_, + return PhiloxCudaState(this->seed_extragraph_, this->offset_extragraph_, offset); } else { diff --git a/aten/src/ATen/cuda/CUDAGeneratorImpl.h b/aten/src/ATen/cuda/CUDAGeneratorImpl.h index 768f0b7549c2..b8d563343f24 100644 --- a/aten/src/ATen/cuda/CUDAGeneratorImpl.h +++ b/aten/src/ATen/cuda/CUDAGeneratorImpl.h @@ -19,10 +19,10 @@ namespace at { * * A CUDA graph containing multiple RNG ops behaves like a * single giant kernel from the perspective of ops external - * to the graph. During graph capture, logic below records - * the total of all offset increments that occur in the graphed - * region, and records the final total as the offset for the - * entire graph. + * to the graph. During graph capture, logic in CUDAGeneratorImpl + * records the total of all offset increments that occur in the + * graphed region, and records the final total as the offset for + * the entire graph. 
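With this change the seed follows the same pointer-or-value scheme the offset already used during graph capture. A compact sketch of the payload layout and the consumer-side unpacking, using illustrative names (PhiloxStateSketch, unpack_sketch); the actual definitions are in PhiloxCudaStateRaw.cuh and UnpackRaw.cuh later in this patch.

#include <cstdint>
#include <tuple>

// During graph capture the state carries device pointers into the one-element
// seed/offset tensors that the graph refills before every replay; outside
// capture it carries plain values.
struct PhiloxStateSketch {
  union Payload { uint64_t val; int64_t* ptr; };
  Payload seed_;
  Payload offset_;
  uint32_t offset_intragraph_ = 0;
  bool captured_ = false;
};

// Consumer-kernel side: resolve the effective (seed, offset) pair.
inline std::tuple<uint64_t, uint64_t> unpack_sketch(const PhiloxStateSketch& s) {
  if (s.captured_) {
    return std::make_tuple(static_cast<uint64_t>(*s.seed_.ptr),
                           static_cast<uint64_t>(*s.offset_.ptr) + s.offset_intragraph_);
  }
  return std::make_tuple(s.seed_.val, s.offset_.val);
}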
* * When the graph reruns, the logic that reruns it * increments this device's CUDA generator's offset @@ -30,8 +30,8 @@ namespace at { * * Meanwhile, within the graph, at capture time, instead of * populating PhiloxCudaStates with the uint64_t offset pulled - * directly from the global state, PhiloxCudaState instead - * holds a pointer to one-element stream-local int64_t device tensor + * directly from the global state, PhiloxCudaState uses a pointer + * to a one-element stream-local int64_t device tensor * holding an initial offset value, and a uint64_t holding an * intra-graph offset. (The intra-graph offset starts from zero * when capture begins.) In each consumer kernel, @@ -100,7 +100,7 @@ struct TORCH_CUDA_CPP_API CUDAGeneratorImpl : public c10::GeneratorImpl { c10::intrusive_ptr get_state() const override; void set_philox_offset_per_thread(uint64_t offset); uint64_t philox_offset_per_thread() const; - void capture_prologue(int64_t* offset_extragraph); + void capture_prologue(int64_t* seed_extragraph, int64_t* offset_extragraph); uint64_t capture_epilogue(); PhiloxCudaState philox_cuda_state(uint64_t increment); @@ -114,6 +114,7 @@ struct TORCH_CUDA_CPP_API CUDAGeneratorImpl : public c10::GeneratorImpl { CUDAGeneratorImpl* clone_impl() const override; uint64_t seed_ = default_rng_seed_val; uint64_t philox_offset_per_thread_ = 0; + int64_t* seed_extragraph_{}; int64_t* offset_extragraph_{}; uint32_t offset_intragraph_ = 0; bool graph_expects_this_gen_ = false; diff --git a/aten/src/ATen/cuda/CUDAGraph.cpp b/aten/src/ATen/cuda/CUDAGraph.cpp index c7734334f4e2..24ee0b19ab90 100644 --- a/aten/src/ATen/cuda/CUDAGraph.cpp +++ b/aten/src/ATen/cuda/CUDAGraph.cpp @@ -65,9 +65,11 @@ void CUDAGraph::capture_begin(MempoolId_t pool/*=0*/) { c10::nullopt, cuda::detail::getDefaultCUDAGenerator()); auto options = TensorOptions().device(at::kCUDA).dtype(at::kLong); + seed_extragraph_ = at::empty({1}, options); offset_extragraph_ = at::empty({1}, options); - gen->capture_prologue(offset_extragraph_.data_ptr()); + seed_extragraph_.fill_(int64_t(gen->current_seed())); + gen->capture_prologue(seed_extragraph_.data_ptr(), offset_extragraph_.data_ptr()); auto stream = at::cuda::getCurrentCUDAStream(); @@ -131,16 +133,42 @@ void CUDAGraph::capture_end() { TORCH_CHECK(stream == capture_stream_, "Capture must end on the same stream it began on."); - c10::cuda::CUDACachingAllocator::notifyCaptureEnd(capture_dev_, id_); + c10::cuda::CUDACachingAllocator::notifyCaptureAboutToEnd(capture_dev_, id_); AT_CUDA_CHECK(cudaStreamEndCapture(capture_stream_, &graph_)); TORCH_CHECK(graph_ != NULL, "Invalid capture."); has_graph_ = true; - // Trailing NULL, NULL, 0 arguments were recommended by Cuda driver people, - // who prefer not to report error message through these arguments moving forward - // (they prefer return value, or errors on api calls internal to the capture) - AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0)); + c10::cuda::CUDACachingAllocator::notifyCaptureEnded(capture_dev_, id_); + + // In typical graph usage some tensors (e.g. the tensors used for graph IO) are not freed + // between replays. + // If Pytorch compiles and runs with a CUDA 11.4+ toolkit, there's a chance the allocator backend + // is cudaMallocAsync. + // cudaMallocAsync is generally graph-safe, but if some tensors are not freed between replays, + // the graph's internal bookkeeping requires that we instantiate with + // cudaGraphInstantiateFlagAutoFreeOnLaunch. 
See + // cudaGraphLaunch + // https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1g1accfe1da0c605a577c22d9751a09597 + // cudaGraphInstantiateWithFlags + // https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1ga2c652a24ba93e52b99a47bec0888233 +#if CUDA_VERSION >= 11040 + int version; + AT_CUDA_CHECK(cudaDriverGetVersion(&version)); + if (version < 11040) { +#endif + // Trailing NULL, NULL, 0 arguments were recommended by Cuda driver people, + // who prefer not to report error message through these arguments moving forward + // (they prefer return value, or errors on api calls internal to the capture) + AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0)); +#if CUDA_VERSION >= 11040 + } else { + AT_CUDA_CHECK(cudaGraphInstantiateWithFlags(&graph_exec_, + graph_, + cudaGraphInstantiateFlagAutoFreeOnLaunch)); + } +#endif + has_graph_exec_ = true; auto* gen = get_generator_or_default( @@ -175,6 +203,7 @@ void CUDAGraph::replay() { std::lock_guard lock(gen->mutex_); rng_engine_inputs = gen->philox_cuda_state(wholegraph_increment_); } + seed_extragraph_.fill_(int64_t(gen->current_seed())); offset_extragraph_.fill_(int64_t(rng_engine_inputs.offset_.val)); // graph_exec_ may be replayed in any stream. diff --git a/aten/src/ATen/cuda/CUDAGraph.h b/aten/src/ATen/cuda/CUDAGraph.h index 09b0b7b5d800..bacad79102a3 100644 --- a/aten/src/ATen/cuda/CUDAGraph.h +++ b/aten/src/ATen/cuda/CUDAGraph.h @@ -69,6 +69,7 @@ struct TORCH_CUDA_CPP_API CUDAGraph { int capture_dev_; // RNG state trackers + at::Tensor seed_extragraph_; at::Tensor offset_extragraph_; uint64_t wholegraph_increment_; }; diff --git a/aten/src/ATen/cuda/CUDASparse.h b/aten/src/ATen/cuda/CUDASparse.h index ecb7127dfa32..d309cd5d8e31 100644 --- a/aten/src/ATen/cuda/CUDASparse.h +++ b/aten/src/ATen/cuda/CUDASparse.h @@ -4,13 +4,26 @@ // cuSparse Generic API added in CUDA 10.1 // Windows support added in CUDA 11.0 -// ROCm is not enabled #if defined(CUDART_VERSION) && defined(CUSPARSE_VERSION) && ((CUSPARSE_VERSION >= 10300) || (CUSPARSE_VERSION >= 11000 && defined(_WIN32))) #define AT_USE_CUSPARSE_GENERIC_API() 1 #else #define AT_USE_CUSPARSE_GENERIC_API() 0 #endif +// hipSparse Generic API ROCm 5.2 +#if defined(USE_ROCM) && ROCM_VERSION >= 50200 +#define AT_USE_HIPSPARSE_GENERIC_52_API() 1 +#else +#define AT_USE_HIPSPARSE_GENERIC_52_API() 0 +#endif + +// hipSparse Generic API ROCm 5.1 +#if defined(USE_ROCM) && ROCM_VERSION >= 50100 +#define AT_USE_HIPSPARSE_GENERIC_API() 1 +#else +#define AT_USE_HIPSPARSE_GENERIC_API() 0 +#endif + // cuSparse Generic API spsv function was added in CUDA 11.3.0 #if defined(CUDART_VERSION) && defined(CUSPARSE_VERSION) && (CUSPARSE_VERSION >= 11500) #define AT_USE_CUSPARSE_GENERIC_SPSV() 1 diff --git a/aten/src/ATen/cuda/CUDASparseDescriptors.cpp b/aten/src/ATen/cuda/CUDASparseDescriptors.cpp index 3065babf89b6..6319e214ac98 100644 --- a/aten/src/ATen/cuda/CUDASparseDescriptors.cpp +++ b/aten/src/ATen/cuda/CUDASparseDescriptors.cpp @@ -9,7 +9,7 @@ namespace at { namespace cuda { namespace sparse { -#if AT_USE_CUSPARSE_GENERIC_API() +#if AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API() namespace { @@ -53,6 +53,7 @@ cusparseIndexType_t getCuSparseIndexType(const c10::ScalarType& scalar_type) { } } +#if AT_USE_HIPSPARSE_GENERIC_52_API() || AT_USE_CUSPARSE_GENERIC_API() CuSparseDnMatDescriptor::CuSparseDnMatDescriptor(const Tensor& input, int64_t batch_offset) { 
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input.layout() == kStrided); IntArrayRef input_strides = input.strides(); @@ -105,6 +106,7 @@ CuSparseDnMatDescriptor::CuSparseDnMatDescriptor(const Tensor& input, int64_t ba descriptor_.reset(raw_descriptor); } +#endif // AT_USE_HIPSPARSE_GENERIC_52_API() || AT_USE_CUSPARSE_GENERIC_API() CuSparseDnVecDescriptor::CuSparseDnVecDescriptor(const Tensor& input) { // cuSPARSE doesn't support batched vectors @@ -175,7 +177,7 @@ CuSparseSpMatCsrDescriptor::CuSparseSpMatCsrDescriptor(const Tensor& input, int6 value_type // data type of values )); -#if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 +#if AT_USE_HIPSPARSE_GENERIC_52_API() || (defined(CUDA_VERSION) && CUDA_VERSION >= 11000) if (ndim == 3 && batch_offset == -1) { int batch_count = at::native::cuda_int_cast(at::native::batchCount(input), "batch_count"); @@ -204,7 +206,7 @@ CuSparseSpMatCsrDescriptor::CuSparseSpMatCsrDescriptor(const Tensor& input, int6 descriptor_.reset(raw_descriptor); } -#endif // AT_USE_CUSPARSE_GENERIC_API() +#endif // AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API() } // namespace sparse } // namespace cuda diff --git a/aten/src/ATen/cuda/CUDASparseDescriptors.h b/aten/src/ATen/cuda/CUDASparseDescriptors.h index 40078b65df64..60c9ff0ffa88 100644 --- a/aten/src/ATen/cuda/CUDASparseDescriptors.h +++ b/aten/src/ATen/cuda/CUDASparseDescriptors.h @@ -40,6 +40,11 @@ class CuSparseDescriptor { #if defined(USE_ROCM) // hipSPARSE doesn't define this using cusparseMatDescr = std::remove_pointer::type; +using cusparseDnMatDescr = std::remove_pointer::type; +using cusparseDnVecDescr = std::remove_pointer::type; +using cusparseSpMatDescr = std::remove_pointer::type; +using cusparseSpMatDescr = std::remove_pointer::type; +using cusparseSpGEMMDescr = std::remove_pointer::type; #if AT_USE_HIPSPARSE_TRIANGULAR_SOLVE() using bsrsv2Info = std::remove_pointer::type; using bsrsm2Info = std::remove_pointer::type; @@ -92,15 +97,17 @@ class TORCH_CUDA_CPP_API CuSparseBsrsm2Info #endif // AT_USE_HIPSPARSE_TRIANGULAR_SOLVE -#if AT_USE_CUSPARSE_GENERIC_API() +#if AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API() cusparseIndexType_t getCuSparseIndexType(const c10::ScalarType& scalar_type); +#if AT_USE_HIPSPARSE_GENERIC_52_API() || AT_USE_CUSPARSE_GENERIC_API() class TORCH_CUDA_CPP_API CuSparseDnMatDescriptor : public CuSparseDescriptor { public: explicit CuSparseDnMatDescriptor(const Tensor& input, int64_t batch_offset = -1); }; +#endif //AT_USE_HIPSPARSE_GENERIC_52_API() || AT_USE_CUSPARSE_GENERIC_API() class TORCH_CUDA_CPP_API CuSparseDnVecDescriptor : public CuSparseDescriptor { @@ -116,7 +123,7 @@ class TORCH_CUDA_CPP_API CuSparseSpMatCsrDescriptor public: explicit CuSparseSpMatCsrDescriptor(const Tensor& input, int64_t batch_offset = -1); -#if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 +#if defined(USE_ROCM) || (defined(CUDA_VERSION) && CUDA_VERSION >= 11000) std::tuple get_size() { int64_t rows, cols, nnz; TORCH_CUDASPARSE_CHECK(cusparseSpMatGetSize( @@ -190,7 +197,7 @@ class TORCH_CUDA_CPP_API CuSparseSpSMDescriptor }; #endif -#if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 +#if (defined(USE_ROCM) && ROCM_VERSION >= 50200) || (defined(CUDA_VERSION) && CUDA_VERSION >= 11000) class TORCH_CUDA_CPP_API CuSparseSpGEMMDescriptor : public CuSparseDescriptor { public: @@ -202,7 +209,7 @@ class TORCH_CUDA_CPP_API CuSparseSpGEMMDescriptor }; #endif -#endif // AT_USE_CUSPARSE_GENERIC_API() +#endif // AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API() } // namespace 
sparse } // namespace cuda diff --git a/aten/src/ATen/cuda/CublasHandlePool.cpp b/aten/src/ATen/cuda/CublasHandlePool.cpp index 08fa4e4904c9..b168c6bcdfcf 100644 --- a/aten/src/ATen/cuda/CublasHandlePool.cpp +++ b/aten/src/ATen/cuda/CublasHandlePool.cpp @@ -1,9 +1,19 @@ #include #include +#include + +#include + namespace at { namespace cuda { + namespace { +std::map, at::DataPtr>& cublas_handle_stream_to_workspace() { + static auto& instance = *new std::map, at::DataPtr>; + return instance; +} + void createCublasHandle(cublasHandle_t *handle) { TORCH_CUDABLAS_CHECK(cublasCreate(handle)); } @@ -25,6 +35,44 @@ using CuBlasPoolType = DeviceThreadHandlePoolallocate(getChosenWorkspaceSize()); +} + cublasHandle_t getCurrentCUDABlasHandle() { int device; AT_CUDA_CHECK(cudaGetDevice(&device)); @@ -47,6 +95,16 @@ cublasHandle_t getCurrentCUDABlasHandle() { auto handle = myPoolWindow->reserve(device); auto stream = c10::cuda::getCurrentCUDAStream(); TORCH_CUDABLAS_CHECK(cublasSetStream(handle, stream)); +#if !defined(USE_ROCM) && CUDA_VERSION >= 11000 + // cublasSetWorkspace not available on CUDA 10.2 + cudaStream_t _stream = stream; + auto key = std::make_tuple(static_cast(handle), static_cast(_stream)); + auto workspace_it = cublas_handle_stream_to_workspace().find(key); + if (workspace_it == cublas_handle_stream_to_workspace().end()) { + workspace_it = cublas_handle_stream_to_workspace().insert(workspace_it, {key, getNewWorkspace()}); + } + TORCH_CUDABLAS_CHECK(cublasSetWorkspace(handle, workspace_it->second.get(), getChosenWorkspaceSize())); +#endif #if defined(CUDA_VERSION) && CUDA_VERSION >= 11000 // On CUDA >= 11, and architecture >= Ampere, cuBLAS can use TF32 to speedup // FP32 data type calculations based on the value of the allow_tf32 flag. diff --git a/aten/src/ATen/cuda/PeerToPeerAccess.cpp b/aten/src/ATen/cuda/PeerToPeerAccess.cpp index 8d2e16776f9e..4c0e4f9c1f1d 100644 --- a/aten/src/ATen/cuda/PeerToPeerAccess.cpp +++ b/aten/src/ATen/cuda/PeerToPeerAccess.cpp @@ -1,10 +1,11 @@ #include + +#include #include #include #include #include -#include namespace at { namespace cuda { @@ -38,6 +39,12 @@ bool get_p2p_access(int dev, int dev_to_access) { dev_to_access, " is not a device"); TORCH_INTERNAL_ASSERT(num_devices_ >= 0, "p2p access cache not initialized"); +#ifdef USE_ROCM + bool needs_pool_specific_peer_access = false; +#else + bool needs_pool_specific_peer_access = CUDACachingAllocator::get()->needsPoolSpecificPeerAccess(); +#endif + auto &cache = p2pAccessEnabled_[dev * num_devices_ + dev_to_access]; if (cache != -1) { @@ -49,12 +56,30 @@ bool get_p2p_access(int dev, int dev_to_access) { int access = 0; C10_CUDA_CHECK(cudaDeviceCanAccessPeer(&access, dev, dev_to_access)); if (access) { - cudaError_t err = cudaDeviceEnablePeerAccess(dev_to_access, 0); - if (err == cudaErrorPeerAccessAlreadyEnabled) { - // ignore and clear the error if access was already enabled - cudaGetLastError(); + if (needs_pool_specific_peer_access) { +#if CUDA_VERSION >= 11040 + // Double-checks allocator backend hasn't changed, which would definitely be an error. + // cudaMallocAsync pools are unaffected by cudaDeviceEnablePeerAccess. + // We need pool-specific enablement. 
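The CublasHandlePool.cpp change above keys one workspace allocation off each (handle, stream) pair and reuses it on every subsequent call. The lookup-or-allocate pattern, reduced to plain standard C++ with illustrative names (workspace_cache, workspace_for); the real code hands the buffer to cublasSetWorkspace.

#include <cstddef>
#include <map>
#include <memory>
#include <tuple>
#include <vector>

using WorkspaceKey = std::tuple<void*, void*>;  // (cublas handle, CUDA stream)

// Leaked singleton map, mirroring cublas_handle_stream_to_workspace(): the
// cached workspaces must survive static destruction ordering.
std::map<WorkspaceKey, std::unique_ptr<std::vector<char>>>& workspace_cache() {
  static auto& instance = *new std::map<WorkspaceKey, std::unique_ptr<std::vector<char>>>;
  return instance;
}

// Allocate lazily on first use for a given (handle, stream), then reuse.
void* workspace_for(void* handle, void* stream, size_t bytes) {
  auto key = std::make_tuple(handle, stream);
  auto it = workspace_cache().find(key);
  if (it == workspace_cache().end()) {
    it = workspace_cache().emplace_hint(
        it, key, std::make_unique<std::vector<char>>(bytes));
  }
  return it->second->data();
}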
See + // https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-2/ + cudaMemPool_t mempool; + C10_CUDA_CHECK(cudaDeviceGetDefaultMemPool(&mempool, dev_to_access)); + cudaMemAccessDesc desc = {}; + desc.location.type = cudaMemLocationTypeDevice; + desc.location.id = dev; + desc.flags = cudaMemAccessFlagsProtReadWrite; + C10_CUDA_CHECK(cudaMemPoolSetAccess(mempool, &desc, 1 /* numDescs */)); +#else + TORCH_INTERNAL_ASSERT(false); +#endif } else { - C10_CUDA_CHECK(err); + cudaError_t err = cudaDeviceEnablePeerAccess(dev_to_access, 0); + if (err == cudaErrorPeerAccessAlreadyEnabled) { + // ignore and clear the error if access was already enabled + cudaGetLastError(); + } else { + C10_CUDA_CHECK(err); + } } cache = 1; } else { diff --git a/aten/src/ATen/cuda/detail/CUDAHooks.cpp b/aten/src/ATen/cuda/detail/CUDAHooks.cpp index ea335180259e..25e4c2b44fa9 100644 --- a/aten/src/ATen/cuda/detail/CUDAHooks.cpp +++ b/aten/src/ATen/cuda/detail/CUDAHooks.cpp @@ -53,6 +53,20 @@ void set_magma_init_fn(void (*fn)()) { magma_init_fn = fn; } +// Sets the CUDA_MODULE_LOADING environment variable +// if it's not set by the user. +void maybe_set_cuda_module_loading(const std::string &def_value) { + auto value = std::getenv("CUDA_MODULE_LOADING"); + if (!value) { +#ifdef _WIN32 + auto env_var = "CUDA_MODULE_LOADING=" + def_value; + _putenv(env_var.c_str()); +#else + setenv("CUDA_MODULE_LOADING", def_value.c_str(), 1); +#endif + } +} + // NB: deleter is dynamic, because we need it to live in a separate // compilation unit (alt is to have another method in hooks, but // let's not if we don't need to!) @@ -62,12 +76,13 @@ void CUDAHooks::initCUDA() const { // have a chance to enable vitals. at::vitals::VitalsAPI.setVital("CUDA", "used", "true", /* force = */ true); + maybe_set_cuda_module_loading("LAZY"); const auto num_devices = c10::cuda::device_count_ensure_non_zero(); c10::cuda::CUDACachingAllocator::init(num_devices); at::cuda::detail::init_p2p_access_cache(num_devices); #if AT_MAGMA_ENABLED() - TORCH_INTERNAL_ASSERT(magma_init_fn != nullptr, "Cannot initilaize magma, init routine not set"); + TORCH_INTERNAL_ASSERT(magma_init_fn != nullptr, "Cannot initialize magma, init routine not set"); magma_init_fn(); #endif } diff --git a/aten/src/ATen/cuda/detail/KernelUtils.h b/aten/src/ATen/cuda/detail/KernelUtils.h index b36e78c9b9a6..5479f500a3e1 100644 --- a/aten/src/ATen/cuda/detail/KernelUtils.h +++ b/aten/src/ATen/cuda/detail/KernelUtils.h @@ -1,6 +1,7 @@ #pragma once #include +#include namespace at { namespace cuda { namespace detail { diff --git a/aten/src/ATen/cuda/detail/PhiloxCudaStateRaw.cuh b/aten/src/ATen/cuda/detail/PhiloxCudaStateRaw.cuh index e14680f88793..a9b67b41ac45 100644 --- a/aten/src/ATen/cuda/detail/PhiloxCudaStateRaw.cuh +++ b/aten/src/ATen/cuda/detail/PhiloxCudaStateRaw.cuh @@ -13,14 +13,14 @@ struct PhiloxCudaState { // Called if graph capture is not underway PhiloxCudaState(uint64_t seed, uint64_t offset) { - seed_ = seed; + seed_.val = seed; offset_.val = offset; } // Called if graph capture is underway - PhiloxCudaState(uint64_t seed, + PhiloxCudaState(int64_t* seed, int64_t* offset_extragraph, uint32_t offset_intragraph) { - seed_ = seed; + seed_.ptr = seed; offset_.ptr = offset_extragraph; offset_intragraph_ = offset_intragraph; captured_ = true; @@ -34,7 +34,7 @@ struct PhiloxCudaState { int64_t* ptr; }; - uint64_t seed_ = 0; + Payload seed_; Payload offset_; uint32_t offset_intragraph_ = 0; bool captured_ = false; diff --git 
a/aten/src/ATen/cuda/detail/UnpackRaw.cuh b/aten/src/ATen/cuda/detail/UnpackRaw.cuh index e6746fbe4fd0..f8fa4ebbf160 100644 --- a/aten/src/ATen/cuda/detail/UnpackRaw.cuh +++ b/aten/src/ATen/cuda/detail/UnpackRaw.cuh @@ -21,9 +21,9 @@ unpack(at::PhiloxCudaState arg) { // static_cast avoids "warning: invalid narrowing conversion from "long" to "unsigned long". // *(arg.offset_.ptr) is a broadcast load of a single int64_t to the entire kernel. // For most threads' reads it will hit in cache, so it shouldn't hurt performance. - return std::make_tuple(arg.seed_, static_cast(*(arg.offset_.ptr) + arg.offset_intragraph_)); + return std::make_tuple(static_cast(*arg.seed_.ptr), static_cast(*(arg.offset_.ptr) + arg.offset_intragraph_)); } else { - return std::make_tuple(arg.seed_, arg.offset_.val); + return std::make_tuple(arg.seed_.val, arg.offset_.val); } } diff --git a/aten/src/ATen/cuda/jiterator.h b/aten/src/ATen/cuda/jiterator.h index 41a6f719a9e3..ac2c4d7cecf3 100644 --- a/aten/src/ATen/cuda/jiterator.h +++ b/aten/src/ATen/cuda/jiterator.h @@ -33,7 +33,7 @@ TORCH_CUDA_CPP_API c10::SmallVector CompileAndLaunchKernel( const c10::SmallVector& tensors, const c10::SmallVector& extra_args, bool return_by_ref) { - TORCH_CHECK(false, "Jiterator is not supported on ROCm"); + TORCH_CHECK(false, "Jiterator is not supported"); } }} // namespace at::cuda diff --git a/aten/src/ATen/cuda/jiterator_impl.h b/aten/src/ATen/cuda/jiterator_impl.h index 7144b6d8eeaf..5ba251055ad2 100644 --- a/aten/src/ATen/cuda/jiterator_impl.h +++ b/aten/src/ATen/cuda/jiterator_impl.h @@ -27,6 +27,16 @@ namespace native { _(7) \ _(8) +#define AT_FOR_8_CASES_WITH_COMMA(_) \ + _(1) , \ + _(2) , \ + _(3) , \ + _(4) , \ + _(5) , \ + _(6) , \ + _(7) , \ + _(8) + c10::SmallVector get_extra_args_typenames(const c10::SmallVector& extra_args) { c10::SmallVector args_typenames(extra_args.size()); for (auto i = 0; i < extra_args.size(); ++i) { @@ -83,9 +93,9 @@ static std::unique_ptr> make_unique_offset_calculator( template struct OffsetCalculatorVariant { -#define DEFINE_CASE(index) std::unique_ptr>, +#define DEFINE_CASE(index) std::unique_ptr> using OffsetCalculatorTypes = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE @@ -113,9 +123,9 @@ struct OffsetCalculatorVariant { struct ArrayVariant { // works for up to 8 input + 8 outputs -#define DEFINE_CASE(index) at::detail::Array, at::detail::Array, +#define DEFINE_CASE(index) at::detail::Array, at::detail::Array using ArrayTypes = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE @@ -149,9 +159,9 @@ struct ArrayVariant { }; struct TrivialOffsetCalculatorVariant { -#define DEFINE_CASE(index) TrivialOffsetCalculator, +#define DEFINE_CASE(index) TrivialOffsetCalculator using TrivialOffsetCalculatorTypes = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE @@ -177,9 +187,9 @@ struct TrivialOffsetCalculatorVariant { }; struct LoadWithCastVariant { -#define DEFINE_CASE(index) std::unique_ptr>, +#define DEFINE_CASE(index) std::unique_ptr> using LoadWithCastPtr = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE @@ -206,9 +216,9 @@ struct LoadWithCastVariant { }; struct StoreWithCastVariant { -#define DEFINE_CASE(index) std::unique_ptr>, +#define DEFINE_CASE(index) std::unique_ptr> using StoreWithCastPtr = c10::variant< - AT_FOR_8_CASES(DEFINE_CASE) + 
AT_FOR_8_CASES_WITH_COMMA(DEFINE_CASE) >; #undef DEFINE_CASE diff --git a/aten/src/ATen/cuda/llvm_complex.cpp b/aten/src/ATen/cuda/llvm_complex.cpp index 55e39e280272..0bb2c2ba9a09 100644 --- a/aten/src/ATen/cuda/llvm_complex.cpp +++ b/aten/src/ATen/cuda/llvm_complex.cpp @@ -48,6 +48,10 @@ class complex void real(value_type __re) {__re_ = __re;} void imag(value_type __im) {__im_ = __im;} + constexpr operator bool() const { + return real() || imag(); + } + complex& operator= (const value_type& __re) {__re_ = __re; __im_ = value_type(); return *this;} complex& operator+=(const value_type& __re) {__re_ += __re; return *this;} @@ -106,6 +110,10 @@ class complex void real(value_type __re) {__re_ = __re;} void imag(value_type __im) {__im_ = __im;} + constexpr operator bool() const { + return real() || imag(); + } + complex& operator= (float __re) {__re_ = __re; __im_ = value_type(); return *this;} complex& operator+=(float __re) {__re_ += __re; return *this;} @@ -162,6 +170,10 @@ class complex void real(value_type __re) {__re_ = __re;} void imag(value_type __im) {__im_ = __im;} + constexpr operator bool() const { + return real() || imag(); + } + complex& operator= (double __re) {__re_ = __re; __im_ = value_type(); return *this;} complex& operator+=(double __re) {__re_ += __re; return *this;} @@ -482,7 +494,15 @@ inline constexpr bool operator&&(const complex<_Tp>& __x, const complex<_Tp>& __y) { - return (__x.real() || __x.imag()) && (__y.real() || __y.imag()); + return bool(__x) && bool(__y); +} + +template +inline constexpr +bool +operator||(const complex<_Tp>& __x, const complex<_Tp>& __y) +{ + return bool(__x) || bool(__y); } // 26.3.7 values: @@ -834,7 +854,7 @@ complex::type> pow(const complex<_Tp>& __x, const complex<_Up>& __y) { typedef complex::type> result_type; - return _VSTD::pow(result_type(__x), result_type(__y)); + return std::pow(result_type(__x), result_type(__y)); } template @@ -847,7 +867,7 @@ typename enable_if pow(const complex<_Tp>& __x, const _Up& __y) { typedef complex::type> result_type; - return _VSTD::pow(result_type(__x), result_type(__y)); + return std::pow(result_type(__x), result_type(__y)); } template @@ -860,7 +880,7 @@ typename enable_if pow(const _Tp& __x, const complex<_Up>& __y) { typedef complex::type> result_type; - return _VSTD::pow(result_type(__x), result_type(__y)); + return std::pow(result_type(__x), result_type(__y)); } // __sqr, computes pow(x, 2) diff --git a/aten/src/ATen/cudnn/Descriptors.cpp b/aten/src/ATen/cudnn/Descriptors.cpp index f954bbf5623a..0e739a49bb33 100644 --- a/aten/src/ATen/cudnn/Descriptors.cpp +++ b/aten/src/ATen/cudnn/Descriptors.cpp @@ -164,7 +164,7 @@ void FilterDescriptor::set(const at::Tensor &t, const at::MemoryFormat memory_fo filter_format = CUDNN_TENSOR_NHWC; break; default: - TORCH_INTERNAL_ASSERT(false, "unsurpported memory_format for cuDNN filters"); + TORCH_INTERNAL_ASSERT(false, "unsupported memory_format for cuDNN filters"); } set(getDataType(t), (int) dim, size, filter_format); } diff --git a/aten/src/ATen/cudnn/Descriptors.h b/aten/src/ATen/cudnn/Descriptors.h index a7bcb5eb72ea..e111987785cc 100644 --- a/aten/src/ATen/cudnn/Descriptors.h +++ b/aten/src/ATen/cudnn/Descriptors.h @@ -7,11 +7,17 @@ #include #include -#include +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + namespace at { namespace native { std::string cudnnTypeToString(cudnnDataType_t dtype); @@ -40,7 +46,8 @@ inline int dataSize(cudnnDataType_t dataType) // that the stride for dim i is the 
product of the sizes of dims // i+1 to the end. This stride is indeed uniquely determined. This // function modifies 'stride' in place so this invariant holds. -static inline void fixSizeOneDimStride(int dim, const int *size, int *stride, bool nhwc) { +template +static inline void fixSizeOneDimStride(int dim, const T *size, T *stride, bool nhwc) { int64_t z = 1; int index = 0; std::vector permutation(dim); @@ -144,7 +151,7 @@ class TORCH_CUDA_CPP_API TensorDescriptor : public Descriptor< void set(cudnnDataType_t dataType, IntArrayRef sizes, IntArrayRef strides, size_t pad, bool nhwc); void set(cudnnDataType_t dataType, int dim, int* size, int* stride, bool nhwc) { - fixSizeOneDimStride(dim, size, stride, nhwc); + fixSizeOneDimStride(dim, size, stride, nhwc); AT_CUDNN_CHECK(cudnnSetTensorNdDescriptor(mut_desc(), dataType, dim, size, stride)); } }; diff --git a/aten/src/ATen/cudnn/Utils.h b/aten/src/ATen/cudnn/Utils.h index 9552953e88ee..64c13c68aa21 100644 --- a/aten/src/ATen/cudnn/Utils.h +++ b/aten/src/ATen/cudnn/Utils.h @@ -1,6 +1,6 @@ #pragma once -#include +#include #include #include #include diff --git a/aten/src/ATen/detail/FunctionTraits.h b/aten/src/ATen/detail/FunctionTraits.h index aab7300b585f..f49a55e1326d 100644 --- a/aten/src/ATen/detail/FunctionTraits.h +++ b/aten/src/ATen/detail/FunctionTraits.h @@ -76,3 +76,27 @@ struct binary_function_traits { using arg1_t = typename traits::template arg<0>::type; using arg2_t = typename traits::template arg<1>::type; }; + + +// Traits for calling with c10::guts::invoke, where member_functions have a first argument of ClassType +template +struct invoke_traits : public function_traits{ +}; + +template +struct invoke_traits : public invoke_traits{ +}; + +template +struct invoke_traits : public invoke_traits{ +}; + +template +struct invoke_traits : + public function_traits { +}; + +template +struct invoke_traits : + public function_traits { +}; diff --git a/functorch/functorch/csrc/ADInterpreters.cpp b/aten/src/ATen/functorch/ADInterpreters.cpp similarity index 70% rename from functorch/functorch/csrc/ADInterpreters.cpp rename to aten/src/ATen/functorch/ADInterpreters.cpp index 6a269d7e5394..174949bbc3b4 100644 --- a/functorch/functorch/csrc/ADInterpreters.cpp +++ b/aten/src/ATen/functorch/ADInterpreters.cpp @@ -1,9 +1,12 @@ -#include -#include -#include +#include +#include +#include +#include namespace at { namespace functorch { +constexpr size_t default_bitset_size = 64; + static void checkForInvalidMutationOnCaptures( const c10::OperatorHandle& op, const torch::jit::Stack* stack, @@ -14,7 +17,7 @@ static void checkForInvalidMutationOnCaptures( auto args = torch::jit::last(stack, op.schema().arguments().size()); auto mutated_arg = unwrapIfDead(args[0].toTensor()); auto* wrapper = maybeGetTensorWrapper(mutated_arg); - if (wrapper && wrapper->level().has_value() && wrapper->level().value() == cur_level) { + if (wrapper && wrapper->level().has_value() && wrapper->level().value() == cur_level && !(wrapper->is_immutable())) { return; } TORCH_CHECK(false, @@ -25,20 +28,28 @@ static void checkForInvalidMutationOnCaptures( "as inputs."); } -static Tensor materializeGradWrappers(const Tensor& tensor, int64_t current_level) { +Tensor materializeGradWrappers(const Tensor& tensor, int64_t current_level) { if (!tensor.defined()) { return tensor; } auto* wrapper = maybeGetTensorWrapper(tensor); if (!wrapper) { - return makeTensorWrapper(tensor, current_level); + return makeTensorWrapper(tensor, current_level, /*is_immutable=*/true); } 
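Tensors lifted into the transform are now wrapped with an immutability flag, which is what the relaxed condition in checkForInvalidMutationOnCaptures keys on. The rule it enforces, as a minimal standalone sketch; WrapperSketch and mutation_allowed are illustrative stand-ins, not the real TensorWrapper API.

#include <cstdint>

// A wrapper auto-created while lifting a captured tensor is marked immutable;
// only wrappers created at the current level by the user's own computation may
// be mutated in place.
struct WrapperSketch {
  int64_t level;
  bool is_immutable;
};

bool mutation_allowed(const WrapperSketch* wrapper, int64_t cur_level) {
  return wrapper != nullptr
      && wrapper->level == cur_level
      && !wrapper->is_immutable;
}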
TORCH_INTERNAL_ASSERT(wrapper->level().value() <= current_level, "escaped?"); if (wrapper->level().value() == current_level) { TORCH_INTERNAL_ASSERT(tensor.defined()); return tensor; } - return makeTensorWrapper(tensor, current_level); + return makeTensorWrapper(tensor, current_level, /*is_immutable=*/true); +} + +Tensor GradInterpreterPtr::lift(const Tensor& tensor) const { + return materializeGradWrappers(tensor, level()); +} + +Tensor JvpInterpreterPtr::lift(const Tensor& tensor) const { + return materializeGradWrappers(tensor, level()); } static void autogradBasedTransformProcess( @@ -69,7 +80,8 @@ static void autogradBasedTransformSendToNext( int64_t current_level, TransformType transform_type, optional prev_grad_mode, - optional prev_fwd_grad_mode) { + optional prev_fwd_grad_mode, + bool grad_special_case) { if (transform_type == TransformType::Grad) { TORCH_INTERNAL_ASSERT(prev_grad_mode.has_value()); } @@ -91,14 +103,14 @@ static void autogradBasedTransformSendToNext( } return tensor; }; - auto wrap = [&](const Tensor& tensor) { + auto wrap = [&](const Tensor& tensor, bool is_immutable) { if (!tensor.defined()) { return tensor; } // if (c10::show_dispatch_trace_enabled()) { // std::cout << "wrap " << current_level << std::endl; // } - return makeTensorWrapper(tensor, current_level); + return makeTensorWrapper(tensor, current_level, is_immutable); }; // TODO: we only need to do the following (marked with !) on in-place functions @@ -113,11 +125,34 @@ static void autogradBasedTransformSendToNext( // Step 1 & 2 auto args_size = op.schema().arguments().size(); + const auto ret_size = op.schema().returns().size(); // Step 1 auto front = stack->size() - args_size; for (const auto arg_idx : c10::irange(0, args_size)) { stack->push_back((*stack)[front + arg_idx]); } + + std::bitset outputs_aliasing_immutable; // set = 1 for all bits + if(!grad_special_case) { + for (auto idx = stack->size() - args_size; idx < stack->size(); idx++) { + const auto ivalue = (*stack)[idx]; + if (!ivalue.isTensor()) { + continue; // only input that can be aliased is a tensor, not a tensor list (expect in ops without returns) + } + const auto tensor = ivalue.toTensor(); + auto* maybe_tensor_wrapper = maybeGetTensorWrapper(tensor); + if (!maybe_tensor_wrapper || maybe_tensor_wrapper->is_immutable()) { + // if the input is immutable, we find if it aliases anything, noting that + // args are in reverse order on stack, so the last arg is at the top of the stack + const auto relative_pos = idx - (stack->size() - args_size); + const auto aliased_out = findAliasedOutput(op.schema(), relative_pos); + if (aliased_out.has_value()) { + outputs_aliasing_immutable.flip(*aliased_out); // each output aliases at most one input, so we can only hit this once + } + } + } + } + // Step 2 foreachTensorInplace(*stack, stack->size() - args_size, stack->size(), unwrap); @@ -136,12 +171,13 @@ static void autogradBasedTransformSendToNext( if (getDynamicLayerStack().size() == 0) { sanityCheckStack(op, stack); } - op.callBoxed(stack); // Step 4, 5, 6 - auto ret_size = op.schema().returns().size(); + + op.callBoxed(stack); + // Step 4 - foreachTensorInplace(*stack, stack->size() - ret_size, stack->size(), wrap); + foreachTensorInplaceWithFlag(*stack, stack->size() - ret_size, stack->size(), outputs_aliasing_immutable, wrap); // Step 5 auto args_front = stack->size() - args_size - ret_size; @@ -169,10 +205,11 @@ void GradInterpreterPtr::processImpl( void GradInterpreterPtr::sendToNextInterpreterImpl( const c10::OperatorHandle& op, - 
torch::jit::Stack* stack) { + torch::jit::Stack* stack, + bool grad_special_case) { autogradBasedTransformSendToNext( op, stack, level(), - TransformType::Grad, prevGradMode(), nullopt); + TransformType::Grad, prevGradMode(), nullopt, grad_special_case); } void JvpInterpreterPtr::processImpl( @@ -183,10 +220,11 @@ void JvpInterpreterPtr::processImpl( void JvpInterpreterPtr::sendToNextInterpreterImpl( const c10::OperatorHandle& op, - torch::jit::Stack* stack) { + torch::jit::Stack* stack, + bool grad_special_case) { autogradBasedTransformSendToNext( op, stack, level(), - TransformType::Jvp, nullopt, prevFwdGradMode()); + TransformType::Jvp, nullopt, prevFwdGradMode(), grad_special_case); } }} // namespace at::functorch diff --git a/functorch/functorch/csrc/ADInterpreters.h b/aten/src/ATen/functorch/ADInterpreters.h similarity index 71% rename from functorch/functorch/csrc/ADInterpreters.h rename to aten/src/ATen/functorch/ADInterpreters.h index 6f79afc6144f..6ec1cca065d6 100644 --- a/functorch/functorch/csrc/ADInterpreters.h +++ b/aten/src/ATen/functorch/ADInterpreters.h @@ -1,30 +1,36 @@ #pragma once -#include +#include namespace at { namespace functorch { -struct GradInterpreterPtr { +// These are the interpreters for our AD transforms +// (grad, vjp and jvp). +// See NOTE: [functorch interpreter stack] for more details. + +struct TORCH_API GradInterpreterPtr { explicit GradInterpreterPtr(const Interpreter* base): base_(base) { TORCH_INTERNAL_ASSERT(base->key() == TransformType::Grad); } TransformType key() const { return base_->key(); } int64_t level() const { return base_->level(); } void processImpl(const c10::OperatorHandle& op, torch::jit::Stack* stack); - void sendToNextInterpreterImpl(const c10::OperatorHandle& op, torch::jit::Stack* stack); + void sendToNextInterpreterImpl(const c10::OperatorHandle& op, torch::jit::Stack* stack, bool grad_special_case); bool prevGradMode() const { return c10::get(base_->meta()).prevGradMode_; } + Tensor lift(const Tensor& tensor) const; private: const Interpreter* base_; }; -struct JvpInterpreterPtr { +struct TORCH_API JvpInterpreterPtr { explicit JvpInterpreterPtr(const Interpreter* base): base_(base) { TORCH_INTERNAL_ASSERT(base->key() == TransformType::Jvp); } TransformType key() const { return base_->key(); } int64_t level() const { return base_->level(); } void processImpl(const c10::OperatorHandle& op, torch::jit::Stack* stack); - void sendToNextInterpreterImpl(const c10::OperatorHandle& op, torch::jit::Stack* stack); + void sendToNextInterpreterImpl(const c10::OperatorHandle& op, torch::jit::Stack* stack, bool grad_special_case); bool prevFwdGradMode() const { return c10::get(base_->meta()).prevFwdGradMode_; } + Tensor lift(const Tensor& tensor) const; private: const Interpreter* base_; }; diff --git a/functorch/functorch/csrc/BatchRulesActivation.cpp b/aten/src/ATen/functorch/BatchRulesActivation.cpp similarity index 98% rename from functorch/functorch/csrc/BatchRulesActivation.cpp rename to aten/src/ATen/functorch/BatchRulesActivation.cpp index b761c70b1575..d96ab08a7e2f 100644 --- a/functorch/functorch/csrc/BatchRulesActivation.cpp +++ b/aten/src/ATen/functorch/BatchRulesActivation.cpp @@ -4,8 +4,8 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. -#include -#include +#include +#include #include // NB: most activation functions fit pointwise unary or binary rules. 
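Since most of these ops are pointwise, their batch rules all reduce to the same shape: move the vmap dimension to the front, call the underlying op, and report where the batch dimension ended up in the output. A sketch of that shape, assuming the helpers from BatchRulesHelper.h; my_relu_batch_rule is an illustrative name, not a rule added by this patch.

#include <ATen/ATen.h>
#include <ATen/functorch/BatchRulesHelper.h>

namespace at { namespace functorch {

// Pointwise unary rule: the op is layout-agnostic, so it is enough to
// normalize the batch dim to dim 0 and forward it to the output.
std::tuple<Tensor, c10::optional<int64_t>> my_relu_batch_rule(
    const Tensor& self, c10::optional<int64_t> self_bdim) {
  auto self_ = moveBatchDimToFront(self, self_bdim);
  auto out = at::relu(self_);
  return std::make_tuple(
      out, self_bdim.has_value() ? c10::optional<int64_t>(0) : c10::nullopt);
}

}} // namespace at::functorch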
@@ -216,7 +216,7 @@ std::tuple,Tensor,optional> prelu_backward_bat return std::make_tuple(std::get<0>(grads), 0, std::get<1>(grads), (weight_grad_is_batched ? optional(0) : nullopt)); } -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { VMAP_SUPPORT(glu_backward, glu_backward_batch_rule); VMAP_SUPPORT(glu, glu_batch_rule); VMAP_SUPPORT(prelu, prelu_batch_rule) diff --git a/functorch/functorch/csrc/BatchRulesBinaryOps.cpp b/aten/src/ATen/functorch/BatchRulesBinaryOps.cpp similarity index 90% rename from functorch/functorch/csrc/BatchRulesBinaryOps.cpp rename to aten/src/ATen/functorch/BatchRulesBinaryOps.cpp index afc3579eb22e..4e228afdfc61 100644 --- a/functorch/functorch/csrc/BatchRulesBinaryOps.cpp +++ b/aten/src/ATen/functorch/BatchRulesBinaryOps.cpp @@ -4,8 +4,8 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. -#include -#include +#include +#include #include #include @@ -53,7 +53,7 @@ struct BinaryRandomPointwiseBatchRuleHelper; template struct BinaryRandomPointwiseBatchRuleHelper> { static Tensor apply(const Tensor& tensor, const Tensor& other, T... extra_args) { - c10::impl::ExcludeDispatchKeyGuard guard(kVmapModeKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode); auto maybe_layer = maybeCurrentDynamicLayer(); auto cur_level = maybe_layer->layerId(); RandomnessType randomness = maybe_layer->randomness(); @@ -268,6 +268,43 @@ std::tuple> cdist_backward_batch_rule( return std::make_tuple(out, out_bdim); } +void fill__Tensor_batch_rule( + Tensor& self, + optional self_bdim, + const Tensor& other, + optional other_bdim) { + if (!other_bdim.has_value()) { + // Optimization: fill_ is faster than the other path which does + // reshaping + copy_ + self.fill_(other); + return; + } + if (!self_bdim && other_bdim) { + vmapIncompatibleInplaceError("fill_"); + } + auto self_and_other = _binary_pointwise_helper( + self, self_bdim, other, other_bdim, /*do_type_promotion*/false); + std::get<0>(self_and_other).copy_(std::get<1>(self_and_other)); +} + +std::tuple> log_sigmoid_backward_batch_rule( + Tensor& grad, optional grad_bdim, + Tensor& self, optional self_bdim, + Tensor& buffer, optional buffer_bdim) { + // NB: This emulates handle_pointwise_ops except we ignore the last argument, buffer + // when any of the inputs are on cuda. 
+ // We do this because on cuda, buffer is a dummy tensor always of logical rank 1 and + // it becomes an issue when the rest of the inputs are scalar + int64_t out_logical_rank = std::max(rankWithoutBatchDim(grad, grad_bdim), rankWithoutBatchDim(self, self_bdim)); + if (!grad.is_cuda() && !self.is_cuda() && !buffer.is_cuda()) { + out_logical_rank = std::max(out_logical_rank, rankWithoutBatchDim(buffer, buffer_bdim)); + } + Tensor out_grad = maybePadToLogicalRank(moveBatchDimToFront(grad, grad_bdim), grad_bdim, out_logical_rank); + Tensor out_self = maybePadToLogicalRank(moveBatchDimToFront(self, self_bdim), self_bdim, out_logical_rank); + Tensor out_buffer = maybePadToLogicalRank(moveBatchDimToFront(buffer, buffer_bdim), buffer_bdim, out_logical_rank); + return std::make_tuple(at::log_sigmoid_backward(out_grad, out_self, out_buffer), 0); +} + Tensor binomial_wrapper(const Tensor& count, const Tensor& prob, c10::optional gen) { return at::binomial(count, prob.contiguous(), gen); // Bug in PyTorch, prob shouldn't need to be contiguous } @@ -282,7 +319,7 @@ TORCH_LIBRARY_IMPL(aten, FuncTorchVmapMode, m) { m.impl("binomial", BINARY_RANDOM_POINTWISE_BATCH_RULE(at::functorch::binomial_wrapper)); } -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { #define BINARY_POINTWISE2(op, overload) \ VMAP_SUPPORT2(op, overload, BINARY_POINTWISE_BATCH_RULE(ATEN_FN2(op, overload))); #define BINARY_POINTWISE(op) \ @@ -395,7 +432,7 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { // BINARY_POINTWISE(infinitely_differentiable_gelu_backward); BINARY_POINTWISE(leaky_relu_backward); BINARY_POINTWISE(logit_backward); - POINTWISE_BOXED(log_sigmoid_backward); + VMAP_SUPPORT(log_sigmoid_backward, log_sigmoid_backward_batch_rule); VMAP_SUPPORT(gelu_backward, gelu_backward_batch_rule); BINARY_POINTWISE(sigmoid_backward); POINTWISE_BOXED(softplus_backward); @@ -456,7 +493,9 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { #undef SINGLE_ARG #undef LOGICAL_COMPARISON_POINTWISE VMAP_SUPPORT(masked_select, masked_select_batch_rule); - VMAP_SUPPORT(masked_select_backward, masked_select_backward_batch_rule) + VMAP_SUPPORT(masked_select_backward, masked_select_backward_batch_rule); + + VMAP_SUPPORT2(fill_, Tensor, fill__Tensor_batch_rule); } }} diff --git a/functorch/functorch/csrc/BatchRulesConvolution.cpp b/aten/src/ATen/functorch/BatchRulesConvolution.cpp similarity index 82% rename from functorch/functorch/csrc/BatchRulesConvolution.cpp rename to aten/src/ATen/functorch/BatchRulesConvolution.cpp index 8382070283cd..79523ed1fb6d 100644 --- a/functorch/functorch/csrc/BatchRulesConvolution.cpp +++ b/aten/src/ATen/functorch/BatchRulesConvolution.cpp @@ -4,8 +4,8 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. -#include -#include +#include +#include #include namespace at { namespace functorch { @@ -17,7 +17,7 @@ namespace at { namespace functorch { // we do not support batch_group_count (which is needed for convolution backwards). // Instead, there's a convolution_backward op that needs a batching rule. 
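To make the convolution batching strategy above concrete: the simplest case is a batched input with an unbatched weight, where the vmap dimension is folded into the convolution's own N dimension, one real convolution runs, and the result is unfolded. A rough standalone sketch of that reshape trick, assuming a 2-D convolution (an illustration of the idea, not the convolution_batch_rule code itself):

#include <torch/torch.h>
#include <iostream>
#include <vector>

int main() {
  // Logical picture: vmap over B independent conv2d calls, each on an
  // [N, C, H, W] input, all sharing one unbatched weight [O, C, kH, kW].
  int64_t B = 5, N = 2, C = 3, O = 4, H = 8, W = 8;
  auto x = torch::randn({B, N, C, H, W});          // physical input, batch dim 0
  auto w = torch::randn({O, C, 3, 3});

  // Fold the vmap dim into N: [B, N, C, H, W] -> [(B*N), C, H, W]
  auto x_folded = x.reshape({B * N, C, H, W});
  auto out = at::conv2d(x_folded, w);              // one real convolution
  // Unfold: [(B*N), O, H', W'] -> [B, N, O, H', W']
  out = out.reshape({B, N, O, out.size(-2), out.size(-1)});

  // Reference: loop over the vmap dim one sample at a time.
  std::vector<torch::Tensor> per_sample;
  for (int64_t b = 0; b < B; ++b) {
    per_sample.push_back(at::conv2d(x[b], w));
  }
  auto ref = at::stack(per_sample);
  std::cout << std::boolalpha << torch::allclose(out, ref) << "\n";  // true
}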
std::tuple> -convolution_batch_rule(const Tensor& lhs, optional lhs_bdim, const Tensor& rhs, optional rhs_bdim, const optional& bias, optional bias_bdim, IntArrayRef stride, IntArrayRef padding, IntArrayRef dilation, bool transposed, IntArrayRef output_padding, int64_t groups) { +convolution_batch_rule(const Tensor& lhs, optional lhs_bdim, const Tensor& rhs, optional rhs_bdim, const optional& bias, optional bias_bdim, IntArrayRef stride, c10::SymIntArrayRef padding, IntArrayRef dilation, bool transposed, c10::SymIntArrayRef output_padding, int64_t groups) { DimVector lhs_spec(stride.size() + 2); std::iota(lhs_spec.begin(), lhs_spec.end(), 0); DimVector rhs_spec = lhs_spec; @@ -42,36 +42,68 @@ convolution_batch_rule(const Tensor& lhs, optional lhs_bdim, const Tens std::tuple> result; if (lhs_bdim && !rhs_bdim) { auto new_x = reshape_dim_into(*lhs_bdim, lhs_spec[0], lhs); - auto out = at::convolution(new_x, rhs, unbatched_bias, stride, padding, dilation, transposed, output_padding, groups); + auto out = at::convolution_symint(new_x, rhs, unbatched_bias, stride, padding, dilation, transposed, output_padding, groups); out = reshape_dim_outof(out_spec[0], lhs.sizes()[*lhs_bdim], out); result = std::make_tuple(out, out_spec[0]); } else if (!lhs_bdim && rhs_bdim) { if (groups == 1) { auto new_w = reshape_dim_into(*rhs_bdim, rhs_spec[0], rhs); - auto out = at::convolution(lhs, new_w, unbatched_bias, stride, padding, dilation, transposed, output_padding, groups); - out = reshape_dim_outof(out_spec[1], rhs.sizes()[*rhs_bdim], out); + auto out = at::convolution_symint(lhs, new_w, unbatched_bias, stride, padding, dilation, transposed, output_padding, groups); + out = reshape_dim_outof(out_spec[1], rhs.size(*rhs_bdim), out); result = std::make_tuple(out, out_spec[1]); } else { - auto dim_with_groups = transposed ? 
1 : 0; - auto new_w = reshape_dim_outof(rhs_spec[dim_with_groups] + (*rhs_bdim <= rhs_spec[0]), groups, rhs); - new_w = reshape_dim_into(*rhs_bdim + (rhs_spec[0] < rhs_bdim), rhs_spec[0] + 1, new_w); - new_w = reshape_dim_into(rhs_spec[0], rhs_spec[0], new_w); - auto out = at::convolution(lhs, new_w, unbatched_bias, stride, padding, dilation, transposed, output_padding, groups); - out = reshape_dim_outof(out_spec[1], groups, out); - out = reshape_dim_outof(out_spec[1] + 1, rhs.sizes()[*rhs_bdim], out); - out = reshape_dim_into(out_spec[1], out_spec[1] + 1, out); - result = std::make_tuple(out, out_spec[1]); + if (transposed) { + // conv_transpose with groups is normally NIHW, IOHW -> N(GO)HW + // With RHS batched, we do the following: + // NIHW, BIOHW -> NIHW, I(BO)HW -> N(GBO)HW -> BN(GO)HW + // NB: the following isn't written using rhs_spec + // (PyTorch convs have a fixed dimension order) + + // BIOHW -> I(BO)HW + auto new_w = reshape_dim_into(*rhs_bdim, 1, rhs); + // NIHW, I(BO)HW -> N(GBO)HW + auto out = at::convolution_symint(lhs, new_w, unbatched_bias, stride, padding, dilation, transposed, output_padding, groups); + // N(GBO)HW -> NG(BO)HW + out = reshape_dim_outof(1, groups, out); + // NG(BO)HW -> NGBOHW + out = reshape_dim_outof(2, rhs.size(*rhs_bdim), out); + // NGBOHW -> NB(GO)HW + out = reshape_dim_into(1, 2, out); + result = std::make_tuple(out, 1); + } else { + // conv with groups is normally N(GI)HW, (GO)IHW -> N(GO)HW + // With RHS batched, we do the following: + // N(GI)HW, B(GO)IHW -> N(GI)HW, (GBO)IHW -> N(GBO)HW -> BN(GO)HW + // NB: the following isn't written using rhs_spec + // (PyTorch convs have a fixed dimension order) + + // B(GO)IHW -> BGOIHW + auto new_w = reshape_dim_outof(0 + (*rhs_bdim == 0), groups, rhs); + // BGOIHW -> G(BO)IHW + new_w = reshape_dim_into(*rhs_bdim + (*rhs_bdim > 0), 1, new_w); + // G(BO)IHW -> (GBO)IHW + new_w = reshape_dim_into(0, 0, new_w); + // N(GI)HW, (GBO)IHW -> N(GBO)HW + auto out = at::convolution_symint(lhs, new_w, unbatched_bias, stride, padding, dilation, transposed, output_padding, groups); + // N(GBO)HW -> NG(BO)HW + out = reshape_dim_outof(1, groups, out); + // NG(BO)HW -> NGBOHW + out = reshape_dim_outof(2, rhs.size(*rhs_bdim), out); + // NGBOHW -> NB(GO)HW + out = reshape_dim_into(1, 2, out); + result = std::make_tuple(out, 1); + } } } else if (lhs_bdim && rhs_bdim) { auto new_x = reshape_dim_into(*lhs_bdim, lhs_spec[1], lhs); groups *= lhs.sizes()[*lhs_bdim]; auto dim_with_groups = transposed ? 
1 : 0; auto new_w = reshape_dim_into(*rhs_bdim, rhs_spec[dim_with_groups], rhs); - auto out = at::convolution(new_x, new_w, unbatched_bias, stride, padding, dilation, transposed, output_padding, groups); + auto out = at::convolution_symint(new_x, new_w, unbatched_bias, stride, padding, dilation, transposed, output_padding, groups); out = reshape_dim_outof(out_spec[1], lhs.sizes()[*lhs_bdim], out); result = std::make_tuple(out, out_spec[1]); } else { - result = std::make_tuple(at::convolution(lhs, rhs, unbatched_bias, stride, padding, dilation, transposed, output_padding, groups), nullopt); + result = std::make_tuple(at::convolution_symint(lhs, rhs, unbatched_bias, stride, padding, dilation, transposed, output_padding, groups), nullopt); } if (separate_bias) { auto A = std::get<0>(result); @@ -165,7 +197,7 @@ Tensor _convolution_decomp( // std::tie(weight_value, weight_bdim) = unwrapTensorAtLevel(weight, cur_level); // // if (self_bdim.has_value() && self_value.dim() == 5 && first_dim_has_size_1(self_value, *self_bdim) && grad_output_bdim.has_value() && !weight_bdim.has_value()) { -// c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); +// c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); // auto result = cudnn_conv_per_sample_grad_rule( // self_value, self_bdim, // grad_output_value, grad_output_bdim, @@ -212,8 +244,8 @@ convolution_backward_input_batch_rule( const Tensor& grad_output, optional grad_output_bdim, const Tensor& input, optional input_bdim, const Tensor& weight, optional weight_bdim, - IntArrayRef stride, IntArrayRef padding, IntArrayRef dilation, bool transposed, - IntArrayRef output_padding, int64_t groups) { + IntArrayRef stride, c10::SymIntArrayRef padding, IntArrayRef dilation, bool transposed, + c10::SymIntArrayRef output_padding, int64_t groups) { const std::array mask = {true, false, false}; if (grad_output_bdim && weight_bdim) { // regular: BNO, BOI -> N(BO), (BO)I -> N(BI) @@ -222,7 +254,7 @@ convolution_backward_input_batch_rule( const auto grad_output_ = reshape_dim_into(*grad_output_bdim, 1, grad_output); const auto weight_ = reshape_dim_into(*weight_bdim, 0, weight); auto dummy_input = make_dummy(input, input_bdim, 1, batch_size); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output_, dummy_input, weight_, nullopt, stride, padding, dilation, transposed, output_padding, groups * batch_size, mask); const auto grad_input = reshape_dim_outof(1, batch_size, std::get<0>(result)); @@ -233,7 +265,7 @@ convolution_backward_input_batch_rule( const auto batch_size = grad_output.size(*grad_output_bdim); const auto grad_output_ = reshape_dim_into(*grad_output_bdim, 0, grad_output); auto dummy_input = make_dummy(input, input_bdim, 0, batch_size); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output_, dummy_input, weight, nullopt, stride, padding, dilation, transposed, output_padding, groups, mask); const auto grad_input = reshape_dim_outof(0, batch_size, std::get<0>(result)); @@ -246,7 +278,7 @@ convolution_backward_input_batch_rule( const auto in_ch_dim = transposed ? 
0 : 1; const auto weight_ = reshape_dim_into(*weight_bdim, in_ch_dim, weight); auto dummy_input = make_dummy(input, input_bdim, 1, batch_size); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output, dummy_input, weight_, nullopt, stride, padding, dilation, transposed, output_padding, groups, mask); const auto grad_input = reshape_dim_outof(1, batch_size, std::get<0>(result)); @@ -257,7 +289,7 @@ convolution_backward_input_batch_rule( // N(GO), B(GO)I -> N(GO), (GO)(BI) -> N(GBI) const auto weight_ = reshape_dim_into(*weight_bdim, 1, weight); auto dummy_input = make_dummy(input, input_bdim, 1, batch_size); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output, dummy_input, weight_, nullopt, stride, padding, dilation, transposed, output_padding, groups, mask); grad_input = std::get<0>(result); // N(GBI) @@ -268,7 +300,7 @@ convolution_backward_input_batch_rule( weight_ = weight_.transpose(0, 1); // GBIO weight_ = weight_.flatten(0, 2); // (GBI)O const auto dummy_input = make_dummy(input, input_bdim, 1, batch_size); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output, dummy_input, weight_, nullopt, stride, padding, dilation, transposed, output_padding, groups, mask); grad_input = std::get<0>(result); // N(GBI) @@ -282,7 +314,7 @@ convolution_backward_input_batch_rule( } else { TORCH_INTERNAL_ASSERT(input_bdim); const auto dummy_input = make_dummy(input, input_bdim, 0, 1); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output, dummy_input, weight, nullopt, stride, padding, dilation, transposed, output_padding, groups, mask); return std::make_tuple(std::get<0>(result), nullopt); @@ -293,8 +325,8 @@ convolution_backward_weight_batch_rule( const Tensor& grad_output, optional grad_output_bdim, const Tensor& input, optional input_bdim, const Tensor& weight, optional weight_bdim, - IntArrayRef stride, IntArrayRef padding, IntArrayRef dilation, bool transposed, - IntArrayRef output_padding, int64_t groups) { + IntArrayRef stride, c10::SymIntArrayRef padding, IntArrayRef dilation, bool transposed, + c10::SymIntArrayRef output_padding, int64_t groups) { const std::array mask = {false, true, false}; if (grad_output_bdim && input_bdim) { // BNO, BNI -> N(BO), N(BI) -> (BO)I (regular) (BI)O (transposed) @@ -302,7 +334,7 @@ convolution_backward_weight_batch_rule( const auto grad_output_ = reshape_dim_into(*grad_output_bdim, 1, grad_output); const auto input_ = reshape_dim_into(*input_bdim, 1, input); const auto dummy_weight = make_dummy(weight, weight_bdim, 0, batch_size); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output_, input_, dummy_weight, nullopt, stride, padding, dilation, transposed, output_padding, groups * batch_size, mask); auto grad_weight = std::get<1>(result); @@ -316,7 +348,7 @@ convolution_backward_weight_batch_rule( const auto grad_output_ = reshape_dim_into(*grad_output_bdim, 1, grad_output); const auto out_ch_dim = transposed ? 
1 : 0; const auto dummy_weight = make_dummy(weight, weight_bdim, out_ch_dim, batch_size); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output_, input, dummy_weight, nullopt, stride, padding, dilation, transposed, output_padding, groups, mask); auto grad_weight = std::get<1>(result); @@ -330,7 +362,7 @@ convolution_backward_weight_batch_rule( if (!transposed) { // BN(GO), N(GI) -> N(GBO), N(GI) -> (GBO)I const auto dummy_weight = make_dummy(weight, weight_bdim, 0, batch_size); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output_, input, dummy_weight, nullopt, stride, padding, dilation, transposed, output_padding, groups, mask); auto grad_weight = std::get<1>(result); @@ -341,7 +373,7 @@ convolution_backward_weight_batch_rule( } else { // BN(GO), N(GI) -> N(GBO), N(GI) -> (GI)(BO) const auto dummy_weight = make_dummy(weight, weight_bdim, 1, batch_size); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output_, input, dummy_weight, nullopt, stride, padding, dilation, transposed, output_padding, groups, mask); auto grad_weight = std::get<1>(result); @@ -357,7 +389,7 @@ convolution_backward_weight_batch_rule( const auto input_ = reshape_dim_into(*input_bdim, 1, input); const auto in_ch_dim = transposed ? 0 : 1; const auto dummy_weight = make_dummy(weight, weight_bdim, in_ch_dim, batch_size); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output, input_, dummy_weight, nullopt, stride, padding, dilation, transposed, output_padding, groups, mask); auto grad_weight = std::get<1>(result); @@ -371,7 +403,7 @@ convolution_backward_weight_batch_rule( if (!transposed) { // regular: N(GO), BN(GI) -> N(GO), N(GBI) -> (GO)(BI) const auto dummy_weight = make_dummy(weight, weight_bdim, 1, batch_size); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output, input_, dummy_weight, nullopt, stride, padding, dilation, transposed, output_padding, groups, mask); auto grad_weight = std::get<1>(result); @@ -380,7 +412,7 @@ convolution_backward_weight_batch_rule( } else { // transposed: N(GO), BN(GI) -> N(GO), N(GBI) -> (GBI)O const auto dummy_weight = make_dummy(weight, weight_bdim, 0, batch_size); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output, input_, dummy_weight, nullopt, stride, padding, dilation, transposed, output_padding, groups, mask); auto grad_weight = std::get<1>(result); @@ -393,7 +425,7 @@ convolution_backward_weight_batch_rule( } else { TORCH_INTERNAL_ASSERT(weight_bdim); const auto dummy_weight = make_dummy(weight, weight_bdim, 0, 1); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output, input, dummy_weight, nullopt, stride, padding, dilation, transposed, output_padding, groups, mask); return std::make_tuple(std::get<1>(result), nullopt); @@ -403,16 +435,16 @@ convolution_backward_weight_batch_rule( std::tuple convolution_backward_plumbing( const Tensor& grad_output_, const Tensor& input_, const Tensor& weight_, - const c10::OptionalArrayRef bias_sizes_opt, - IntArrayRef stride, IntArrayRef padding, IntArrayRef dilation, bool transposed, - IntArrayRef output_padding, int64_t groups, std::array output_mask) { + const c10::OptionalArrayRef bias_sizes_opt, + IntArrayRef 
stride, c10::SymIntArrayRef padding, IntArrayRef dilation, bool transposed, + c10::SymIntArrayRef output_padding, int64_t groups, std::array output_mask) { const auto maybe_layer = maybeCurrentDynamicLayer(); TORCH_INTERNAL_ASSERT(maybe_layer.has_value()); int64_t cur_level = maybe_layer->layerId(); if (!areAnyBatchedAtLevel({grad_output_, input_, weight_}, cur_level)){ - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); - return at::convolution_backward( + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); + return at::convolution_backward_symint( grad_output_, input_, weight_, bias_sizes_opt, stride, padding, dilation, transposed, output_padding, groups, output_mask); } @@ -448,14 +480,14 @@ std::tuple convolution_backward_plumbing( // BNO, BNI, BOI // AKA one of the model ensembling case if (grad_output_bdim && input_bdim && weight_bdim) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); grad_output = reshape_dim_into(*grad_output_bdim, 1, grad_output); // BNO, BNI, BOI -> N(BO), N(BI), (BO)I const auto batch_size = weight.size(*weight_bdim); input = reshape_dim_into(*input_bdim, 1, input); weight = reshape_dim_into(*weight_bdim, 0, weight); - const auto result = at::convolution_backward( + const auto result = at::convolution_backward_symint( grad_output, input, weight, nullopt, stride, padding, dilation, transposed, output_padding, batch_size * groups, output_mask); // N(BI), (BO)I -> NBI, BOI @@ -471,7 +503,7 @@ std::tuple convolution_backward_plumbing( Tensor grad_input; if (output_mask[0]) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); const auto result = convolution_backward_input_batch_rule( grad_output, grad_output_bdim, input, input_bdim, @@ -482,7 +514,7 @@ std::tuple convolution_backward_plumbing( Tensor grad_weight; if (output_mask[1]) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); const auto result = convolution_backward_weight_batch_rule( grad_output, grad_output_bdim, input, input_bdim, @@ -504,7 +536,7 @@ std::tuple convolution_backward_plumbing( } -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { VMAP_SUPPORT(convolution, convolution_batch_rule); m.impl("_convolution", _convolution_decomp); m.impl("convolution_backward", convolution_backward_plumbing); diff --git a/functorch/functorch/csrc/BatchRulesDecompositions.cpp b/aten/src/ATen/functorch/BatchRulesDecompositions.cpp similarity index 82% rename from functorch/functorch/csrc/BatchRulesDecompositions.cpp rename to aten/src/ATen/functorch/BatchRulesDecompositions.cpp index 46219f542ef1..13dedcfb879a 100644 --- a/functorch/functorch/csrc/BatchRulesDecompositions.cpp +++ b/aten/src/ATen/functorch/BatchRulesDecompositions.cpp @@ -8,24 +8,24 @@ #include #include #include -#include -#include -#include -#include +#include +#include +#include +#include namespace at { namespace functorch { #define OP_DECOMPOSE(op) m.impl(#op, static_cast(native::op)); #define OP_DECOMPOSE2(op, overload) m.impl(#op"."#overload, static_cast(native::op)); -TORCH_LIBRARY_IMPL(aten, FT_VMAP_MODE_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchVmapMode, m) { OP_DECOMPOSE(alpha_dropout_); OP_DECOMPOSE(dropout_); OP_DECOMPOSE(feature_alpha_dropout_); OP_DECOMPOSE(feature_dropout_); } -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { 
+TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { OP_DECOMPOSE2(__and__, Scalar); OP_DECOMPOSE2(__and__, Tensor); OP_DECOMPOSE2(__iand__, Tensor); @@ -44,8 +44,8 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { OP_DECOMPOSE(avg_pool1d); OP_DECOMPOSE(adaptive_max_pool1d); OP_DECOMPOSE(adaptive_avg_pool1d); - OP_DECOMPOSE(adaptive_avg_pool2d); - OP_DECOMPOSE(adaptive_avg_pool3d); + m.impl("adaptive_avg_pool2d", native::adaptive_avg_pool2d_symint); + m.impl("adaptive_avg_pool3d", native::adaptive_avg_pool3d_symint); OP_DECOMPOSE(adjoint); OP_DECOMPOSE(arccos); OP_DECOMPOSE(arccosh); @@ -63,7 +63,7 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { OP_DECOMPOSE2(bitwise_or, Scalar); OP_DECOMPOSE2(bitwise_xor, Scalar); OP_DECOMPOSE(broadcast_tensors); - OP_DECOMPOSE(broadcast_to); + m.impl("broadcast_to", native::broadcast_to_symint); OP_DECOMPOSE(cartesian_prod); OP_DECOMPOSE(cdist); OP_DECOMPOSE(clip); @@ -75,17 +75,16 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { OP_DECOMPOSE(cosine_embedding_loss); OP_DECOMPOSE(cosine_similarity); OP_DECOMPOSE(cov); - OP_DECOMPOSE(cross_entropy_loss); + m.impl("cross_entropy_loss", native::cross_entropy_loss_symint); OP_DECOMPOSE2(cumulative_trapezoid, x); OP_DECOMPOSE2(cumulative_trapezoid, dx); OP_DECOMPOSE2(dsplit, int); OP_DECOMPOSE2(dsplit, array); OP_DECOMPOSE(det); - OP_DECOMPOSE(diag_backward); OP_DECOMPOSE(diff); OP_DECOMPOSE(dstack); OP_DECOMPOSE(einsum); - OP_DECOMPOSE(embedding_backward); + m.impl("embedding_backward", native::embedding_backward_symint); OP_DECOMPOSE(expand_as); OP_DECOMPOSE(fft_fft); OP_DECOMPOSE(fft_fftshift); @@ -126,13 +125,14 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { OP_DECOMPOSE2(hsplit, int); OP_DECOMPOSE2(hsplit, array); OP_DECOMPOSE(hstack); - OP_DECOMPOSE(index_select_backward); + m.impl("index_select_backward", native::index_select_backward_symint); OP_DECOMPOSE(inner); OP_DECOMPOSE(inverse); + OP_DECOMPOSE(concatenate); OP_DECOMPOSE(instance_norm); OP_DECOMPOSE(kron); OP_DECOMPOSE(l1_loss); - OP_DECOMPOSE(layer_norm); + m.impl("layer_norm", native::layer_norm_symint); OP_DECOMPOSE2(ldexp, Tensor); OP_DECOMPOSE2(less_equal, Tensor ); OP_DECOMPOSE2(less, Tensor ); @@ -140,13 +140,16 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { OP_DECOMPOSE(linalg_cholesky); OP_DECOMPOSE(linalg_det); OP_DECOMPOSE(linalg_eigvalsh); + OP_DECOMPOSE(linalg_eigvals); OP_DECOMPOSE(linalg_inv); OP_DECOMPOSE(linalg_matmul); OP_DECOMPOSE(linalg_matrix_norm); OP_DECOMPOSE2(linalg_matrix_norm, str_ord); OP_DECOMPOSE(linalg_multi_dot); OP_DECOMPOSE(linalg_norm); + OP_DECOMPOSE2(linalg_norm, ord_str); OP_DECOMPOSE(linalg_solve); + OP_DECOMPOSE(linalg_solve_ex); OP_DECOMPOSE(linalg_svd); OP_DECOMPOSE(linalg_svdvals); OP_DECOMPOSE(linalg_tensorinv); @@ -165,24 +168,25 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { OP_DECOMPOSE2(movedim, int); OP_DECOMPOSE(msort); OP_DECOMPOSE(mT); - OP_DECOMPOSE(narrow); + m.impl("narrow", native::narrow_symint); OP_DECOMPOSE(negative); OP_DECOMPOSE2(frobenius_norm, dim); OP_DECOMPOSE2(nuclear_norm, dim); OP_DECOMPOSE(nuclear_norm); - OP_DECOMPOSE(nll_loss_nd); - OP_DECOMPOSE(nll_loss); - OP_DECOMPOSE(nll_loss2d); + m.impl("nll_loss_nd", native::nll_loss_nd_symint); + m.impl("nll_loss", native::nll_loss_symint); + m.impl("nll_loss2d", native::nll_loss2d_symint); OP_DECOMPOSE2(not_equal, Tensor ); OP_DECOMPOSE(outer); OP_DECOMPOSE(pairwise_distance); OP_DECOMPOSE(pinverse); OP_DECOMPOSE(poisson_nll_loss); + OP_DECOMPOSE(positive); OP_DECOMPOSE(qr); OP_DECOMPOSE(ravel); - OP_DECOMPOSE2(repeat_interleave, 
self_int); + m.impl("repeat_interleave.self_int", native::repeat_interleave_symint); OP_DECOMPOSE2(repeat_interleave, self_Tensor); - OP_DECOMPOSE(reshape); + m.impl("reshape", native::reshape_symint); OP_DECOMPOSE(resolve_conj); OP_DECOMPOSE(resolve_neg); OP_DECOMPOSE(row_stack); @@ -196,10 +200,11 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { OP_DECOMPOSE(special_multigammaln); OP_DECOMPOSE(special_polygamma); OP_DECOMPOSE(special_softmax); - OP_DECOMPOSE2(split, sizes); + m.impl("split.sizes", native::split_symint); OP_DECOMPOSE(square); OP_DECOMPOSE(numpy_T); OP_DECOMPOSE(reshape_as); + OP_DECOMPOSE(slogdet); OP_DECOMPOSE(t); OP_DECOMPOSE2(result_type, Tensor); OP_DECOMPOSE2(result_type, Scalar); @@ -225,7 +230,7 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { OP_DECOMPOSE2(trapezoid, dx); OP_DECOMPOSE2(trapz, x); OP_DECOMPOSE2(trapz, dx); - OP_DECOMPOSE(value_selecting_reduction_backward); + m.impl("value_selecting_reduction_backward", native::value_selecting_reduction_backward_symint); OP_DECOMPOSE(var); OP_DECOMPOSE2(var, dim); OP_DECOMPOSE(var_mean); @@ -237,7 +242,7 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { OP_DECOMPOSE2(where, ScalarSelf); OP_DECOMPOSE(orgqr); OP_DECOMPOSE2(unflatten, int); - OP_DECOMPOSE(_convolution_double_backward); + m.impl("_convolution_double_backward", native::_convolution_double_backward); OP_DECOMPOSE(conv_transpose1d); OP_DECOMPOSE2(conv_transpose2d, input); OP_DECOMPOSE2(conv_transpose3d, input); @@ -248,14 +253,15 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { OP_DECOMPOSE2(conv2d, padding); OP_DECOMPOSE2(conv3d, padding); OP_DECOMPOSE(_convolution_mode); - OP_DECOMPOSE(frobenius_norm); OP_DECOMPOSE(type_as); OP_DECOMPOSE(linalg_diagonal); - OP_DECOMPOSE(pad); - OP_DECOMPOSE(_pad_circular); + OP_DECOMPOSE(diagonal_copy); + m.impl("pad", native::pad_symint); + m.impl("_pad_circular", native::_pad_circular_symint); OP_DECOMPOSE(t_); OP_DECOMPOSE(swapdims_); OP_DECOMPOSE(swapaxes_); + OP_DECOMPOSE(unfold_copy); // divide, alias for div OP_DECOMPOSE2(divide, Tensor); diff --git a/functorch/functorch/csrc/BatchRulesDynamic.cpp b/aten/src/ATen/functorch/BatchRulesDynamic.cpp similarity index 86% rename from functorch/functorch/csrc/BatchRulesDynamic.cpp rename to aten/src/ATen/functorch/BatchRulesDynamic.cpp index e752d96d168d..a85d7f18953f 100644 --- a/functorch/functorch/csrc/BatchRulesDynamic.cpp +++ b/aten/src/ATen/functorch/BatchRulesDynamic.cpp @@ -5,11 +5,15 @@ // LICENSE file in the root directory of this source tree. #include -#include -#include +#include +#include #include #include +// This file contains batching rules for operations that return Tensors of +// dynamic shape. We generally don't support those with vmap so we raise +// errors for them. 
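Why the ops below get errors rather than batch rules: their output shape depends on the tensor's values, so different elements of a vmap batch can produce differently-sized results, and there is no single physical tensor to stack them into. A small eager-mode demonstration of the underlying issue (plain ATen, no functorch involved):

#include <torch/torch.h>
#include <iostream>

int main() {
  // Two would-be "batch elements" with different numbers of nonzeros.
  auto a = torch::tensor({1, 0, 3});               // 2 nonzero entries
  auto b = torch::tensor({0, 0, 7});               // 1 nonzero entry

  std::cout << at::nonzero(a).sizes() << "\n";     // [2, 1]
  std::cout << at::nonzero(b).sizes() << "\n";     // [1, 1]

  // vmap would need to return one tensor with a leading batch dimension,
  // but the per-sample shapes disagree, which is exactly what the
  // UNSUPPORTED_DYNAMIC registrations guard against.
}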
+ namespace at { namespace functorch { @@ -57,10 +61,13 @@ void unsupportedAllclose(const c10::OperatorHandle& op, torch::jit::Stack* stack "support over at github.com/pytorch/functorch/issues/275"); } -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { UNSUPPORTED_DYNAMIC(nonzero); UNSUPPORTED_DYNAMIC(where); - UNSUPPORTED_DYNAMIC(unique); + UNSUPPORTED_DYNAMIC(unique_dim); + UNSUPPORTED_DYNAMIC(unique_consecutive); + UNSUPPORTED_DYNAMIC(unique_dim_consecutive); + UNSUPPORTED_DYNAMIC(_unique2); m.impl("_local_scalar_dense", torch::CppFunction::makeFromBoxedFunction<&unsupportedLocalScalarDense>()); m.impl("item", torch::CppFunction::makeFromBoxedFunction<&unsupportedItem>()); m.impl("is_nonzero", torch::CppFunction::makeFromBoxedFunction<&unsupportedIsNonzero>()); diff --git a/functorch/functorch/csrc/BatchRulesFactory.cpp b/aten/src/ATen/functorch/BatchRulesFactory.cpp similarity index 73% rename from functorch/functorch/csrc/BatchRulesFactory.cpp rename to aten/src/ATen/functorch/BatchRulesFactory.cpp index 3f63d27a0c8e..06d497959f42 100644 --- a/functorch/functorch/csrc/BatchRulesFactory.cpp +++ b/aten/src/ATen/functorch/BatchRulesFactory.cpp @@ -4,11 +4,30 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. -#include -#include "c10/core/SymIntArrayRef.h" +#include +#include namespace at { namespace functorch { +template +struct NewBlahBatchRuleHelperSymInt; + +template +struct NewBlahBatchRuleHelperSymInt> { + static std::tuple> apply( + const Tensor& tensor, + optional batch_dim, + SymIntArrayRef shape, + T... extra_args) { + const auto bdim_size = tensor.sym_size(batch_dim.value()); + c10::SmallVector new_shape; + new_shape.reserve(shape.size() + 1); + new_shape.emplace_back(bdim_size); + new_shape.insert(new_shape.end(), shape.begin(), shape.end()); + return std::make_tuple(Func(tensor, new_shape, std::forward(extra_args)...), 0); + } +}; + template struct NewBlahBatchRuleHelper; @@ -37,6 +56,12 @@ struct NewBlahBatchRuleHelper> { &fn,\ c10::guts::function_traits::parameter_types>::apply) +#define NEW_BLAH_BATCH_RULE_SYMINT(fn) SINGLE_ARG(\ + NewBlahBatchRuleHelperSymInt<\ + decltype(&fn),\ + &fn,\ + c10::guts::function_traits::parameter_types>::apply) + std::tuple> _new_zeros_with_same_feature_meta_batch_rule( const Tensor& self, optional self_bdim, const Tensor& other, optional other_bdim, @@ -82,18 +107,7 @@ bool _has_same_storage_numel_batch_rule(const Tensor& a, const Tensor& b) { return true; } -Tensor new_empty_symint_decomp( - const Tensor& self, - SymIntArrayRef size, - c10::optional dtype_opt, - c10::optional layout_opt, - c10::optional device_opt, - c10::optional pin_memory_opt - ) { - return self.new_empty(c10::asIntArrayRefSlow(size), dtype_opt, layout_opt, device_opt, pin_memory_opt); -} - -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { m.impl("_has_same_storage_numel", _has_same_storage_numel_batch_rule); VMAP_SUPPORT(ones_like, BASIC_UNARY_BATCH_RULE(ATEN_FN(ones_like))); VMAP_SUPPORT(zeros_like, BASIC_UNARY_BATCH_RULE(ATEN_FN(zeros_like))); @@ -101,11 +115,10 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { VMAP_SUPPORT(randn_like, BASIC_UNARY_BATCH_RULE(ATEN_FN(randn_like))); VMAP_SUPPORT(rand_like, BASIC_UNARY_BATCH_RULE(ATEN_FN(rand_like))); VMAP_SUPPORT(full_like, BASIC_UNARY_BATCH_RULE(ATEN_FN(full_like))); - VMAP_SUPPORT(new_empty, NEW_BLAH_BATCH_RULE(ATEN_FN(new_empty))); - 
m.impl("new_empty.SymInt", new_empty_symint_decomp); - VMAP_SUPPORT(new_zeros, NEW_BLAH_BATCH_RULE(ATEN_FN(new_zeros))); - VMAP_SUPPORT(new_ones, NEW_BLAH_BATCH_RULE(ATEN_FN(new_ones))); - VMAP_SUPPORT(new_full, NEW_BLAH_BATCH_RULE(ATEN_FN(new_full))); + VMAP_SUPPORT(new_empty, NEW_BLAH_BATCH_RULE_SYMINT(ATEN_FN(new_empty))); + VMAP_SUPPORT(new_zeros, NEW_BLAH_BATCH_RULE_SYMINT(ATEN_FN(new_zeros))); + VMAP_SUPPORT(new_ones, NEW_BLAH_BATCH_RULE_SYMINT(ATEN_FN(new_ones))); + VMAP_SUPPORT(new_full, NEW_BLAH_BATCH_RULE_SYMINT(ATEN_FN(new_full))); VMAP_SUPPORT(_new_zeros_with_same_feature_meta, _new_zeros_with_same_feature_meta_batch_rule); // Not sure how to add the ones with irregular args to the mix cleanly (i.e. randint takes an extra int parameter) } diff --git a/functorch/functorch/csrc/BatchRulesHelper.cpp b/aten/src/ATen/functorch/BatchRulesHelper.cpp similarity index 92% rename from functorch/functorch/csrc/BatchRulesHelper.cpp rename to aten/src/ATen/functorch/BatchRulesHelper.cpp index dfd690ac2168..136a23e17088 100644 --- a/functorch/functorch/csrc/BatchRulesHelper.cpp +++ b/aten/src/ATen/functorch/BatchRulesHelper.cpp @@ -4,8 +4,7 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. -#include -#include +#include #include namespace at { namespace functorch { @@ -122,6 +121,16 @@ Tensor reshape_dim_outof(int64_t src, int64_t size1, const Tensor& x) { return at::reshape(x, shape); } +Tensor reshape_dim_outof_symint(int64_t src, c10::SymInt size1, const Tensor& x) { + src = maybe_wrap_dim(src, x.dim()); + c10::SymDimVector shape(x.sym_sizes().begin(), x.sym_sizes().end()); + TORCH_INTERNAL_ASSERT(shape[src] % size1 == 0); + auto size2 = shape[src] / size1; + shape[src] = size1; + shape.insert(shape.begin() + src + 1, size2); + return at::reshape_symint(x, shape); +} + void vmapIncompatibleInplaceError(const char* schema_name) { TORCH_CHECK(false, "vmap: ", schema_name, "(self, *extra_args) is not possible because ", @@ -133,20 +142,6 @@ void vmapIncompatibleInplaceError(const char* schema_name) { "please file a bug report instead."); } -void run_jit_decomposition(const c10::OperatorHandle& op, torch::jit::Stack* stack) { - const auto& schema = op.schema(); - // TODO: templatize based on op and keep static trace_exec - auto * trace_exec = torch::jit::GetDecompositionExecutor(schema); - trace_exec->run((*stack)); - if (stack->back().isTuple()) { - IValue tup = stack->back(); - stack->pop_back(); - for (const auto& elem: tup.toTuple()->elements()) { - stack->push_back(elem); - } - } -} - static void handleScalarTypePromotion(Tensor& logical_scalar_tensor, Tensor& second) { auto result_type = at::native::result_type(logical_scalar_tensor[0], second); if (logical_scalar_tensor.scalar_type() != result_type) { diff --git a/functorch/functorch/csrc/BatchRulesHelper.h b/aten/src/ATen/functorch/BatchRulesHelper.h similarity index 94% rename from functorch/functorch/csrc/BatchRulesHelper.h rename to aten/src/ATen/functorch/BatchRulesHelper.h index 552a38b20e20..219c01c89c56 100644 --- a/functorch/functorch/csrc/BatchRulesHelper.h +++ b/aten/src/ATen/functorch/BatchRulesHelper.h @@ -4,24 +4,26 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. 
-#include #include #include -#include - -#include -#include -#include -#include -#include -#include + +#include +#include +#include +#include +#include +#include #include -#include #include +// This file contains helper functions for batching rules. + namespace at { namespace functorch { -Tensor reshape_dim_into(int64_t src, int64_t dst, const Tensor& x); -Tensor reshape_dim_outof(int64_t src, int64_t size1, const Tensor& x); + +TORCH_API Tensor reshape_dim_into(int64_t src, int64_t dst, const Tensor& x); +TORCH_API Tensor reshape_dim_outof(int64_t src, int64_t size1, const Tensor& x); + +TORCH_API Tensor reshape_dim_outof_symint(int64_t src, c10::SymInt size1, const Tensor& x); Tensor moveBatchDimToFront(const Tensor& tensor, optional maybe_batch_dim); int64_t rankWithoutBatchDim(const Tensor& tensor, optional maybe_batch_dim); @@ -119,7 +121,7 @@ void boxed_tensor_inputs_batch_rule(const c10::OperatorHandle& op, torch::jit::S const auto num_returns = schema.returns().size(); const auto num_arguments = schema.arguments().size(); - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); auto maybe_layer = maybeCurrentDynamicLayer(); TORCH_INTERNAL_ASSERT(maybe_layer.has_value()); int64_t cur_level = maybe_layer->layerId(); @@ -195,12 +197,6 @@ inline void handle_variadic_bdims(std::vector>()); -void run_jit_decomposition(const c10::OperatorHandle& op, torch::jit::Stack* stack); - -#define RUN_JIT_DECOMPOSITION(op) \ - m.impl(#op, torch::CppFunction::makeFromBoxedFunction<&run_jit_decomposition>()); - - using UnpackedBatchedTensor = std::tuple>; inline void find_and_unpack_tensors( @@ -243,7 +239,7 @@ inline void boxed_existing_bdim_all_batch_rule( const auto num_returns = schema.returns().size(); const auto num_arguments = schema.arguments().size(); - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); auto maybe_layer = maybeCurrentDynamicLayer(); TORCH_INTERNAL_ASSERT(maybe_layer.has_value()); int64_t cur_level = maybe_layer->layerId(); @@ -299,7 +295,7 @@ inline void boxed_all_tensors_have_optional_bdim( const auto num_returns = schema.returns().size(); const auto num_arguments = schema.arguments().size(); - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); auto maybe_layer = maybeCurrentDynamicLayer(); TORCH_INTERNAL_ASSERT(maybe_layer.has_value()); int64_t cur_level = maybe_layer->layerId(); @@ -390,7 +386,7 @@ struct ExistingBdimBatchRuleHelper> { T... extra_args) { auto self_ = reshape_dim_into(*self_bdim, 0, self); auto out = Func(self_, std::forward(extra_args)...); - return std::make_tuple(reshape_dim_outof(0, self.sizes()[*self_bdim], out), 0); + return std::make_tuple(reshape_dim_outof_symint(0, self.sym_sizes()[*self_bdim], out), 0); } }; diff --git a/functorch/functorch/csrc/BatchRulesLinearAlgebra.cpp b/aten/src/ATen/functorch/BatchRulesLinearAlgebra.cpp similarity index 59% rename from functorch/functorch/csrc/BatchRulesLinearAlgebra.cpp rename to aten/src/ATen/functorch/BatchRulesLinearAlgebra.cpp index 63efbec1caba..f26a4f79b146 100644 --- a/functorch/functorch/csrc/BatchRulesLinearAlgebra.cpp +++ b/aten/src/ATen/functorch/BatchRulesLinearAlgebra.cpp @@ -4,7 +4,7 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. 
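For readers unfamiliar with the reshape helpers used throughout these rules: reshape_dim_into(src, dst, x) folds dimension src into dimension dst (batch-major within the merged group), and reshape_dim_outof(src, size1, x) splits dimension src back into [size1, remaining]. Their shape semantics can be sketched with plain tensor ops (equivalent in effect for this example, not the actual helper implementations):

#include <torch/torch.h>
#include <iostream>

int main() {
  // Think of dim 0 as the vmap batch dim B, dims 1..2 as logical N, O.
  auto x = torch::randn({4, 6, 5});                // B=4, N=6, O=5

  // reshape_dim_into(0, 1, x): BNO -> N(BO), the batch dim is folded into
  // dim 1, batch-major inside the merged group.
  auto folded = x.movedim(0, 1).flatten(1, 2);     // [6, 20]
  std::cout << folded.sizes() << "\n";

  // reshape_dim_outof(1, 4, folded): N(BO) -> NBO, peel the batch dim back out.
  auto unfolded = folded.unflatten(1, {4, 5});     // [6, 4, 5]
  std::cout << unfolded.sizes() << "\n";

  // The round trip recovers the original tensor up to the dim permutation.
  std::cout << std::boolalpha
            << torch::equal(unfolded.movedim(1, 0), x) << "\n";  // true
}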
-#include +#include namespace at { namespace functorch { @@ -264,6 +264,59 @@ struct LinalgCheckMatrixBinaryRuleHelper> } }; +static void expect_at_least_rank( + const Tensor& tensor, + optional tensor_bdim, + int64_t expected_rank, + const char* name) { + auto rank = rankWithoutBatchDim(tensor, tensor_bdim); + TORCH_CHECK(rank >= expected_rank, + name, " should have at least ", expected_rank, " dimensions, but has ", + rank, " dimensions instead."); +} + +oneOutput linalg_lu_solve_batch_rule( + const Tensor& LU, optional LU_bdim, + const Tensor& pivots, optional pivots_bdim, + const Tensor& B, optional B_bdim, + bool left, bool adjoint) { + const auto LU_min_rank = 2; + const auto pivots_min_rank = 1; + const auto B_min_rank = 2; + + expect_at_least_rank(LU, LU_bdim, LU_min_rank, "LU"); + expect_at_least_rank(pivots, pivots_bdim, pivots_min_rank, "pivots"); + expect_at_least_rank(B, B_bdim, B_min_rank, "B"); + + auto LU_ = moveBatchDimToFront(LU, LU_bdim); + auto pivots_ = moveBatchDimToFront(pivots, pivots_bdim); + auto B_ = moveBatchDimToFront(B, B_bdim); + + // LU and pivots's first {N-2} (for LU), {N-1} (for pivots) dimensions must match + // So if only one of them is being vmapped over, we must expand out that dimension. + if (LU_bdim.has_value() ^ pivots_bdim.has_value()) { + auto bdim_size = get_bdim_size2(LU, LU_bdim, pivots, pivots_bdim); + LU_ = ensure_has_bdim(LU_, LU_bdim.has_value(), bdim_size); + pivots_ = ensure_has_bdim(pivots_, pivots_bdim.has_value(), bdim_size); + pivots_bdim = 0; + LU_bdim = 0; + } + + // Now, {LU, pivots} and B's first dimensions are allowed to broadcast. + // The rest of the logic handles that. + const auto LU_num_batch_dims = rankWithoutBatchDim(LU_, LU_bdim) - LU_min_rank; + const auto pivots_num_batch_dims = rankWithoutBatchDim(pivots_, pivots_bdim) - pivots_min_rank; + const auto B_num_batch_dims = rankWithoutBatchDim(B_, B_bdim) - B_min_rank; + const auto max_num_batch_dims = std::max(std::max(LU_num_batch_dims, pivots_num_batch_dims), B_num_batch_dims); + + LU_ = maybePadToLogicalRank(LU_, LU_bdim, max_num_batch_dims + LU_min_rank); + pivots_ = maybePadToLogicalRank(pivots_, pivots_bdim, max_num_batch_dims + pivots_min_rank); + B_ = maybePadToLogicalRank(B_, B_bdim, max_num_batch_dims + B_min_rank); + + const auto result = at::linalg_lu_solve(LU_, pivots_, B_, left, adjoint); + return std::make_tuple(result, 0); +} + oneOutput cholesky_solve_batch_rule( const Tensor& self, c10::optional self_bdim, const Tensor& A, c10::optional A_bdim, @@ -293,6 +346,151 @@ oneOutput matrix_exp_batch_rule(const Tensor& self, c10::optional self_ return std::make_tuple(at::matrix_exp(self_), 0); } +fourOutputs solve_ex_batch_rule( + const Tensor& A, optional A_bdim, + const Tensor& B, optional B_bdim, + bool left, bool check_errors) { + auto batch_size = get_bdim_size2(A, A_bdim, B, B_bdim); + const auto A_logical_rank = rankWithoutBatchDim(A, A_bdim); + const auto B_logical_rank = rankWithoutBatchDim(B, B_bdim); + const auto max_logical_rank = std::max(A_logical_rank, B_logical_rank); + + TORCH_CHECK(A_logical_rank >= 2, + "linalg.solve: The input tensor A must have at least 2 dimensions."); + + int b_logical_rank = max_logical_rank; + if (A_logical_rank > B_logical_rank) { // vector case: B was a vector or batched vector + // not accurate but matches linalg error message + TORCH_CHECK(B_logical_rank >= 1, "linalg.solve: The input tensor B must have at least 2 dimensions."); + b_logical_rank = max_logical_rank - 1; + } else { // matrix case: A and B are both 
matrices or batches of matrices + TORCH_CHECK(B_logical_rank >= 2, "linalg.solve: The input tensor B must have at least 2 dimensions."); + } + + // basically binary pointwise helper but if B was a vector incoming, we must pad it to be 1 dim smaller than A + auto A_ = moveBatchDimToFront(A, A_bdim); + auto B_ = moveBatchDimToFront(B, B_bdim); + A_ = maybePadToLogicalRank(A_, A_bdim, max_logical_rank); + B_ = maybePadToLogicalRank(B_, B_bdim, b_logical_rank); + + A_ = ensure_has_bdim(A_, A_bdim.has_value(), batch_size); + B_ = ensure_has_bdim(B_, B_bdim.has_value(), batch_size); + + // NOTE [ solve_ex Batch Rule Contiguity ] + // A determines whether or not linalg_solve takes an optimized path. We need the check on A_ to match the one run on + // A as BatchedTensor since it might have been saved by autograd (specifically by the jvp) and the autograd behvaior + // differs based on whether or not the optimized path was taken + const auto batched_A_was_contiguous = A_bdim.has_value() ? at::select(A, *A_bdim, 0).is_contiguous() : A.is_contiguous(); + if (batched_A_was_contiguous && !A.is_complex()) { + A_ = A_.contiguous(); + } + const auto res = _linalg_solve_ex(A_, B_, left, check_errors); + return std::make_tuple(std::get<0>(res), 0, std::get<1>(res), 0, std::get<2>(res), 0, std::get<3>(res), 0); +} + +oneOutput cross_batch_rule(const Tensor& self, c10::optional self_bdim, + const Tensor& other, c10::optional other_bdim, const int64_t dim) { + // match cross dimension checks + TORCH_CHECK(rankWithoutBatchDim(self, self_bdim) == rankWithoutBatchDim(other, other_bdim), + "linalg.cross: inputs must have the same number of dimensions." + ); + + const auto batch_size = get_bdim_size2(self, self_bdim, other, other_bdim); + const auto self_other_bundled = _binary_pointwise_helper(self, self_bdim, other, other_bdim, false); + + const auto self_ = ensure_has_bdim(std::get<0>(self_other_bundled), self_bdim.has_value(), batch_size); + const auto other_ = ensure_has_bdim(std::get<1>(self_other_bundled), other_bdim.has_value(), batch_size); + + const auto dim_ = getPhysicalDim(self_, true, dim); + + return std::make_tuple(linalg_cross(self_, other_, dim_), 0); +} + +c10::optional batch_dim_if_not_empty(const Tensor& t) { + if (t.dim() == 1 && t.size(0) == 0) { + return c10::optional(); + } + return c10::optional(0); +} + +fourOutputs linalg_lstsq_batch_rule( + const Tensor& self, c10::optional self_bdim, const Tensor& b, c10::optional b_bdim, + c10::optional rcond, c10::optional driver) { + TORCH_CHECK(rankWithoutBatchDim(self, self_bdim) >= 2, "torch.linalg.lstsq: input must have at least 2 dimensions."); + TORCH_CHECK(rankWithoutBatchDim(b, b_bdim) >= 1, "torch.linalg.lstsq: other must have at least 1 dimension."); + + const auto batch_size = get_bdim_size2(self, self_bdim, b, b_bdim); + const auto tensor_other = _binary_pointwise_helper(self, self_bdim, b, b_bdim, /*do_type_promotion=*/false); + + // because of ambiguity with vector case, lstsq can broadcast [1, 2] -> [batch_size, 2] but not [2] -> [batch_size, 2] + // so could unsqueeze if there's no bdim or just ensure_has_bdim + const auto self_ = ensure_has_bdim(std::get<0>(tensor_other), self_bdim.has_value(), batch_size); + const auto b_ = ensure_has_bdim(std::get<1>(tensor_other), b_bdim.has_value(), batch_size); + + Tensor res, res_1, res_2, res_3; + std::tie(res, res_1, res_2, res_3) = at::linalg_lstsq(self_, b_, rcond, driver); + + // everything but the 0th output are only sometimes computed. 
When they aren't, they're empty tensors without a bdim + const auto res_1_bdim = batch_dim_if_not_empty(res_1); + const auto res_2_bdim = batch_dim_if_not_empty(res_2); + const auto res_3_bdim = batch_dim_if_not_empty(res_3); + return std::make_tuple(res, 0, res_1, res_1_bdim, res_2, res_2_bdim, res_3, res_3_bdim); +} + +template +std::tuple> +atol_rtol_tensor_batch_rule( + F Func, const Tensor& input, optional input_bdim, + const optional& atol, const optional atol_bdim, + const optional& rtol, const optional rtol_bdim, bool hermitian, char const *op_name) { + auto input_logical_rank = rankWithoutBatchDim(input, input_bdim); + + TORCH_CHECK(input_logical_rank >= 2, + op_name, ": The input tensor input must have at least 2 dimensions."); + + // atol and rtol's dims must be broadcastable to the number of batch dims of input + // which is input's dim - 2 (input represents a batch of matrices, so 2 is for the matrix dimensions) + const auto input_logical_num_bdims = input_logical_rank - 2; + const int64_t atol_logical_num_bdims = atol.has_value() ? rankWithoutBatchDim(*atol, atol_bdim) : 0; + const int64_t rtol_logical_num_bdims = rtol.has_value() ? rankWithoutBatchDim(*rtol, rtol_bdim) : 0; + const auto max_logical_bdims = std::max({input_logical_num_bdims, atol_logical_num_bdims, rtol_logical_num_bdims}); + + auto input_ = moveBatchDimToFront(input, input_bdim); + auto atol_ = atol.has_value() ? moveBatchDimToFront(*atol, atol_bdim) : atol; + auto rtol_ = rtol.has_value() ? moveBatchDimToFront(*rtol, rtol_bdim) : rtol; + + // pad all inputs to have the same number of (non-vmap) batch dimensions + input_ = maybePadToLogicalRank(input_, input_bdim, max_logical_bdims + 2); + atol_ = atol_.has_value() ? maybePadToLogicalRank(*atol_, atol_bdim, max_logical_bdims) : atol_; + rtol_ = rtol_.has_value() ? maybePadToLogicalRank(*rtol_, rtol_bdim, max_logical_bdims) : rtol_; + + return std::make_tuple(Func(input_, atol_, rtol_, hermitian), 0); +} + +std::tuple> +matrix_rank_atol_rtol_tensor_batch_rule( + const Tensor& input, c10::optional input_bdim, const optional& atol, + const c10::optional atol_bdim, const optional& rtol, + const c10::optional rtol_bdim, bool hermitian) { + return atol_rtol_tensor_batch_rule(ATEN_FN2(linalg_matrix_rank, atol_rtol_tensor), input, input_bdim, atol, atol_bdim, rtol, rtol_bdim, hermitian, "torch.linalg.matrix_rank"); +} + +std::tuple> +pinv_batch_rule( + const Tensor& input, c10::optional input_bdim, const optional& atol, + const c10::optional atol_bdim, const optional& rtol, + const c10::optional rtol_bdim, bool hermitian) { + return atol_rtol_tensor_batch_rule(ATEN_FN2(linalg_pinv, atol_rtol_tensor), input, input_bdim, atol, atol_bdim, rtol, rtol_bdim, hermitian, "linalg.pinv"); +} + +std::tuple> +matrix_rank_atol_rtol_float_batch_rule( + const Tensor& input, optional input_bdim, optional atol, optional rtol, bool hermitian) { + TORCH_CHECK(rankWithoutBatchDim(input, input_bdim) >= 2, + "torch.linalg.matrix_rank: The input tensor input must have at least 2 dimensions."); + return std::make_tuple(linalg_matrix_rank(moveBatchDimToFront(input, input_bdim), atol, rtol, hermitian), 0); +} + #define LINALG_CHECK_MATRIX_UNARY_BATCH_RULE(fn, num_out) SINGLE_ARG(\ LinalgCheckMatrixUnaryRuleHelper<\ func_string_##fn,\ @@ -317,51 +515,65 @@ oneOutput matrix_exp_batch_rule(const Tensor& self, c10::optional self_ // Define string constants with the function names. 
These will be used as template parameters // C++ doesn't let us use string literals as template parameters, so we have to declare them as consts first +// What is going on with these macros? +// - clang-5 seems to require the constexpr +// - windows compiles with or without the constexpr, but the constexpr causes test problems +// - as a result we have some macro guards. +#if defined(_MSC_VER) #define LINALG_STRING_CONST(fn, op_name) \ const char func_string_##fn[] = #op_name;\ #define LINALG_STRING_CONST2(fn, overload, op_name) \ const char func_string_##fn_##overload[] = #op_name;\ +#else +#define LINALG_STRING_CONST(fn, op_name) \ + constexpr const char func_string_##fn[] = #op_name;\ + +#define LINALG_STRING_CONST2(fn, overload, op_name) \ + constexpr const char func_string_##fn_##overload[] = #op_name;\ + +#endif + #define LINALG_CHECK_MATRIX_UNARY_ONE_OUT(fn, op_name) \ LINALG_STRING_CONST(fn, op_name);\ - TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) {\ + TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) {\ VMAP_SUPPORT(fn, LINALG_CHECK_MATRIX_UNARY_BATCH_RULE(fn, one));\ } #define LINALG_CHECK_MATRIX_UNARY_ONE_OUT2(fn, overload, op_name) \ LINALG_STRING_CONST2(fn, overload, op_name);\ - TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) {\ + TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) {\ VMAP_SUPPORT2(fn, overload, LINALG_CHECK_MATRIX_UNARY_BATCH_RULE2(fn, overload, one));\ } #define LINALG_CHECK_MATRIX_UNARY_TWO_OUT(fn, op_name) \ LINALG_STRING_CONST(fn, op_name);\ - TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) {\ + TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) {\ VMAP_SUPPORT(fn, LINALG_CHECK_MATRIX_UNARY_BATCH_RULE(fn, two));\ } #define LINALG_CHECK_MATRIX_UNARY_THREE_OUT(fn, op_name) \ LINALG_STRING_CONST(fn, op_name);\ - TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) {\ + TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) {\ VMAP_SUPPORT(fn, LINALG_CHECK_MATRIX_UNARY_BATCH_RULE(fn, three));\ } #define LINALG_CHECK_MATRIX_UNARY_FOUR_OUT(fn, op_name) \ LINALG_STRING_CONST(fn, op_name);\ - TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) {\ + TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) {\ VMAP_SUPPORT(fn, LINALG_CHECK_MATRIX_UNARY_BATCH_RULE(fn, four));\ } #define LINALG_CHECK_MATRIX_BINARY_ONE_OUT(fn, op_name) \ LINALG_STRING_CONST(fn, op_name);\ - TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) {\ + TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) {\ VMAP_SUPPORT(fn, LINALG_CHECK_MATRIX_BINARY_BATCH_RULE(fn, one));\ } #define LINALG_CHECK_MATRIX_BINARY_TWO_OUT(fn, op_name) \ LINALG_STRING_CONST(fn, op_name);\ - TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) {\ + TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) {\ VMAP_SUPPORT(fn, LINALG_CHECK_MATRIX_BINARY_BATCH_RULE(fn, two));\ } @@ -370,7 +582,6 @@ LINALG_CHECK_MATRIX_UNARY_ONE_OUT(cholesky, cholesky); LINALG_CHECK_MATRIX_UNARY_ONE_OUT(cholesky_inverse, cholesky_inverse); LINALG_CHECK_MATRIX_UNARY_TWO_OUT(linalg_cholesky_ex, linalg.cholesky); LINALG_CHECK_MATRIX_UNARY_TWO_OUT(linalg_eig, linalg.eig); -LINALG_CHECK_MATRIX_UNARY_ONE_OUT(linalg_eigvals, linalg.eigvals); LINALG_CHECK_MATRIX_UNARY_TWO_OUT(linalg_inv_ex, linalg.inv_ex); LINALG_CHECK_MATRIX_UNARY_THREE_OUT(linalg_ldl_factor_ex, torch.linalg.ldl_factor_ex); LINALG_CHECK_MATRIX_UNARY_ONE_OUT(linalg_matrix_power, linalg.matrix_power); @@ -389,7 +600,7 @@ LINALG_CHECK_MATRIX_UNARY_TWO_OUT(_linalg_eigh, linalg.eigh); LINALG_CHECK_MATRIX_UNARY_FOUR_OUT(_linalg_slogdet, linalg.slogdet); LINALG_CHECK_MATRIX_UNARY_THREE_OUT(_linalg_svd, linalg.svd); -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { 
+TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { VMAP_SUPPORT(bmm, bmm_batch_rule); m.impl("addmv", addmv_decomp); m.impl("addmm", addmm_decomp); @@ -399,10 +610,17 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { VMAP_SUPPORT(mv, mv_batch_rule); VMAP_SUPPORT(mm, mm_batch_rule); m.impl("linear", linear_decomp); + VMAP_SUPPORT(linalg_lu_solve, linalg_lu_solve_batch_rule); VMAP_SUPPORT(linalg_householder_product, householder_product_batch_rule); VMAP_SUPPORT(cholesky_solve, cholesky_solve_batch_rule); // custom dim error + VMAP_SUPPORT(linalg_lstsq, linalg_lstsq_batch_rule); // custom errors and sometimes empty return VMAP_SUPPORT(linalg_lu_factor_ex, linalg_lu_factor_ex_batch_rule); VMAP_SUPPORT(linalg_matrix_exp, matrix_exp_batch_rule); + VMAP_SUPPORT(_linalg_solve_ex, solve_ex_batch_rule); + VMAP_SUPPORT(linalg_cross, cross_batch_rule); + VMAP_SUPPORT2(linalg_matrix_rank, atol_rtol_tensor, matrix_rank_atol_rtol_tensor_batch_rule); + VMAP_SUPPORT2(linalg_matrix_rank, atol_rtol_float, matrix_rank_atol_rtol_float_batch_rule); + VMAP_SUPPORT2(linalg_pinv, atol_rtol_tensor, pinv_batch_rule); VMAP_SUPPORT(_linalg_check_errors, _linalg_check_errors_batch_rule); } diff --git a/functorch/functorch/csrc/BatchRulesLoss.cpp b/aten/src/ATen/functorch/BatchRulesLoss.cpp similarity index 94% rename from functorch/functorch/csrc/BatchRulesLoss.cpp rename to aten/src/ATen/functorch/BatchRulesLoss.cpp index 16ee2fb7e9c1..66c2b7fb3194 100644 --- a/functorch/functorch/csrc/BatchRulesLoss.cpp +++ b/aten/src/ATen/functorch/BatchRulesLoss.cpp @@ -4,9 +4,9 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. -#include -#include -#include +#include +#include +#include #include namespace at { namespace functorch { @@ -64,7 +64,7 @@ Tensor binary_cross_entropy_plumbing( if (!isBatchedAtLevel(self, cur_level) && !isBatchedAtLevel(target, cur_level) && !isBatchedAtLevel(weight, cur_level)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::binary_cross_entropy(self, target, weight, reduction); } @@ -77,7 +77,7 @@ Tensor binary_cross_entropy_plumbing( Tensor result; if (self_bdim || target_bdim) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); const auto bdim_size = get_bdim_size2(self_value, self_bdim, target_value, target_bdim); auto self_ = moveBatchDimToFront(self_value, self_bdim); auto target_ = moveBatchDimToFront(target_value, target_bdim); @@ -86,7 +86,7 @@ Tensor binary_cross_entropy_plumbing( result = at::binary_cross_entropy(self_, target_, nullopt, Reduction::None); result = makeBatched(result, 0, cur_level); } else { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); result = at::binary_cross_entropy(self_value, target_value, nullopt, Reduction::None); } if (weight.has_value() && weight->defined()) { @@ -103,7 +103,7 @@ Tensor binary_cross_entropy_backward_plumbing( int64_t cur_level = maybe_layer->layerId(); if (!areAnyBatchedAtLevel({grad, input, target, weight_opt}, cur_level)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::binary_cross_entropy_backward(grad, input, target, weight_opt, reduction); } @@ -120,7 +120,7 @@ Tensor binary_cross_entropy_backward_plumbing( 
Tensor grad_input; if (grad_bdim || input_bdim || target_bdim) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); const auto bdim_size = get_bdim_size3( grad_value, grad_bdim, input_value, input_bdim, target_value, target_bdim); @@ -136,7 +136,7 @@ Tensor binary_cross_entropy_backward_plumbing( grad_, input_, target_, nullopt, Reduction::None); grad_input = makeBatched(grad_input, 0, cur_level); } else { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); grad_input = at::binary_cross_entropy_backward( grad_value, input_value, target_value, nullopt, Reduction::None); } @@ -276,7 +276,7 @@ at::Tensor nll_loss_backward_decomposition( return grad_input * grad_output_; } -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { m.impl("nll_loss_forward", nll_loss_forward_decomposition); m.impl("nll_loss2d_forward", nll_loss_forward_decomposition); m.impl("nll_loss_backward", nll_loss_backward_decomposition); diff --git a/functorch/functorch/csrc/BatchRulesModules.cpp b/aten/src/ATen/functorch/BatchRulesModules.cpp similarity index 89% rename from functorch/functorch/csrc/BatchRulesModules.cpp rename to aten/src/ATen/functorch/BatchRulesModules.cpp index 3d54ba5d0fe4..f51d63feaa8e 100644 --- a/functorch/functorch/csrc/BatchRulesModules.cpp +++ b/aten/src/ATen/functorch/BatchRulesModules.cpp @@ -4,33 +4,33 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. -#include -#include +#include +#include #include namespace at { namespace functorch { -static Tensor getStepTensor(const Tensor& indices, int64_t bdim_size, int64_t num_embeddings) { +static Tensor getStepTensor(const Tensor& indices, c10::SymInt bdim_size, c10::SymInt num_embeddings) { // [batch_size, 1, 1, 1, ..., 1] - DimVector view_shape(indices.dim(), 1); + c10::SymDimVector view_shape(indices.dim(), 1); view_shape[0] = bdim_size; auto range = at::arange(0, bdim_size * num_embeddings, num_embeddings, indices.options()); - return range.view(view_shape); + return range.view_symint(view_shape); } std::tuple> embedding_batch_rule( const Tensor& weight, optional weight_bdim, const Tensor& indices, optional indices_bdim, - int64_t padding_idx, bool scale_grad_by_freq, bool sparse) { + c10::SymInt padding_idx, bool scale_grad_by_freq, bool sparse) { if (!weight_bdim && indices_bdim) { // B*, ED -> B*D - const auto result = at::embedding(weight, indices, padding_idx, scale_grad_by_freq, sparse); + const auto result = at::embedding_symint(weight, indices, padding_idx, scale_grad_by_freq, sparse); return std::make_tuple(result, indices_bdim); } else if (weight_bdim && !indices_bdim) { // *, BED -> *, E(BD) -> *(BD) -> *BD const auto batch_size = weight.size(*weight_bdim); const auto weight_ = reshape_dim_into(*weight_bdim, /*embedding_dim*/1, weight); - auto result = at::embedding(weight_, indices, padding_idx, scale_grad_by_freq, sparse); + auto result = at::embedding_symint(weight_, indices, padding_idx, scale_grad_by_freq, sparse); result = reshape_dim_outof(-1, batch_size, result); return std::make_tuple(result, result.dim() - 2); } @@ -44,7 +44,7 @@ std::tuple> embedding_batch_rule( const auto range = getStepTensor(indices, batch_size, num_embeddings); indices_ = indices_ + range; - const auto result = at::embedding(weight_, indices_, padding_idx, 
scale_grad_by_freq, sparse); + const auto result = at::embedding_symint(weight_, indices_, padding_idx, scale_grad_by_freq, sparse); return std::make_tuple(result, 0); } @@ -52,15 +52,15 @@ std::tuple> embedding_dense_backward_batch_rule( const Tensor& grad_, optional grad_bdim, const Tensor& indices_, optional indices_bdim, - int64_t num_weights, int64_t padding_idx, bool scale_grad_by_freq) { + c10::SymInt num_weights, c10::SymInt padding_idx, bool scale_grad_by_freq) { Tensor grad = grad_; Tensor indices = indices_; if (!indices_bdim && grad_bdim) { - const auto bdim_size = grad.size(*grad_bdim); + const auto bdim_size = grad.sym_size(*grad_bdim); grad = reshape_dim_into(*grad_bdim, -1, grad); - auto result = at::embedding_dense_backward( + auto result = at::embedding_dense_backward_symint( grad, indices, num_weights, padding_idx, scale_grad_by_freq); - result = reshape_dim_outof(1, bdim_size, result); + result = reshape_dim_outof_symint(1, bdim_size, result); return std::make_tuple(result, 1); } const auto bdim_size = indices.size(*indices_bdim); @@ -68,13 +68,13 @@ embedding_dense_backward_batch_rule( grad = moveBatchDimToFront(grad, grad_bdim); grad = ensure_has_bdim(grad, grad_bdim.has_value(), bdim_size); const auto range = getStepTensor(indices, bdim_size, num_weights); - auto result = at::embedding_dense_backward( + auto result = at::embedding_dense_backward_symint( grad, indices + range, num_weights * bdim_size, -1, scale_grad_by_freq); result = reshape_dim_outof(0, bdim_size, result); // Fill in the padding. We can't do it in the embedding_dense_backward call // because we need to fill in multiple rows! if (padding_idx >= 0) { - result.select(1, padding_idx).fill_(0); + result.select_symint(1, padding_idx).fill_(0); } return std::make_tuple(result, 0); } @@ -295,21 +295,21 @@ template struct UpsampleBackwardBatchRuleHelper> { static std::tuple> apply( const Tensor& grad_output, optional grad_output_bdim, - OptionalArrayRef output_size, IntArrayRef input_size, + c10::SymIntArrayRef output_size, c10::SymIntArrayRef input_size, T... 
extra_args) { auto grad_output_ = reshape_dim_into(*grad_output_bdim, 0, grad_output); TORCH_INTERNAL_ASSERT(input_size.size() > 0); // input_size is wrong so we correct it - DimVector physical_input_size(input_size.begin(), input_size.end()); - physical_input_size[0] = grad_output_.sizes()[0]; + c10::SymDimVector physical_input_size(input_size.begin(), input_size.end()); + physical_input_size[0] = grad_output_.sym_sizes()[0]; auto out = Func( grad_output_, output_size, physical_input_size, std::forward(extra_args)...); - return std::make_tuple(reshape_dim_outof(0, grad_output.sizes()[*grad_output_bdim], out), 0); + return std::make_tuple(reshape_dim_outof_symint(0, grad_output.sym_sizes()[*grad_output_bdim], out), 0); } }; @@ -375,20 +375,20 @@ struct CudnnGridSampleBackwardBatchRuleHelper { #define CUDNN_GRID_SAMPLE_BW_BATCH_RULE(fn)\ CudnnGridSampleBackwardBatchRuleHelper::apply -#define UPSAMPLE_BACKWARD(op, overload) VMAP_SUPPORT2(op, overload, SINGLE_ARG(\ +#define UPSAMPLE_BACKWARD(op) VMAP_SUPPORT(op, SINGLE_ARG(\ UpsampleBackwardBatchRuleHelper<\ - decltype(&ATEN_FN2(op, overload)),\ - &ATEN_FN2(op, overload),\ - c10::guts::function_traits::parameter_types>::apply)) + decltype(&ATEN_FN(op)),\ + &ATEN_FN(op),\ + c10::guts::function_traits::parameter_types>::apply)) #define UPSAMPLE_BATCH(op) \ EXISTING_BDIM2(op, vec); \ EXISTING_BDIM(op); -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { EXISTING_BDIM(im2col); - EXISTING_BDIM(im2col_backward); + EXISTING_BDIM(col2im); VMAP_SUPPORT(embedding, embedding_batch_rule); VMAP_SUPPORT(embedding_dense_backward, embedding_dense_backward_batch_rule); @@ -430,13 +430,13 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { UPSAMPLE_BATCH(upsample_nearest3d); UPSAMPLE_BATCH(upsample_trilinear3d); - UPSAMPLE_BACKWARD(upsample_bicubic2d_backward, vec); - UPSAMPLE_BACKWARD(upsample_bilinear2d_backward, vec); - UPSAMPLE_BACKWARD(upsample_linear1d_backward, vec); - UPSAMPLE_BACKWARD(upsample_nearest1d_backward, vec); - UPSAMPLE_BACKWARD(upsample_nearest2d_backward, vec); - UPSAMPLE_BACKWARD(upsample_nearest3d_backward, vec); - UPSAMPLE_BACKWARD(upsample_trilinear3d_backward, vec); + UPSAMPLE_BACKWARD(upsample_bicubic2d_backward); + UPSAMPLE_BACKWARD(upsample_bilinear2d_backward); + UPSAMPLE_BACKWARD(upsample_linear1d_backward); + UPSAMPLE_BACKWARD(upsample_nearest1d_backward); + UPSAMPLE_BACKWARD(upsample_nearest2d_backward); + UPSAMPLE_BACKWARD(upsample_nearest3d_backward); + UPSAMPLE_BACKWARD(upsample_trilinear3d_backward); m.impl("one_hot", one_hot_decomposition_hack); } }} diff --git a/functorch/functorch/csrc/BatchRulesNorm.cpp b/aten/src/ATen/functorch/BatchRulesNorm.cpp similarity index 93% rename from functorch/functorch/csrc/BatchRulesNorm.cpp rename to aten/src/ATen/functorch/BatchRulesNorm.cpp index e78538329582..d53d4f6a2e97 100644 --- a/functorch/functorch/csrc/BatchRulesNorm.cpp +++ b/aten/src/ATen/functorch/BatchRulesNorm.cpp @@ -4,9 +4,9 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. 
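// Illustrative sketch (not part of the patch): the upsample-backward helper above folds the
// batch dim into dim 0 with reshape_dim_into, fixes up physical_input_size[0], and splits the
// batch dim back out with reshape_dim_outof(_symint). The shape bookkeeping, modeled on plain
// vectors of int64_t; the helper names and signatures below are simplified stand-ins, not the
// real functorch utilities.
#include <cstdint>
#include <iostream>
#include <vector>

using Shape = std::vector<int64_t>;

// Merge dimension `src` into dimension `dst` of what remains.
Shape reshape_dim_into(std::size_t src, std::size_t dst, Shape s) {
  int64_t moved = s[src];
  s.erase(s.begin() + src);
  s[dst] *= moved;
  return s;
}

// Split dimension `dim` back out into (size, old_size / size).
Shape reshape_dim_outof(std::size_t dim, int64_t size, Shape s) {
  int64_t rest = s[dim] / size;
  s[dim] = size;
  s.insert(s.begin() + dim + 1, rest);
  return s;
}

int main() {
  // grad_output is [B, N, C, H, W]; fold B into dim 0, run the unbatched
  // backward on [B*N, C, H, W], then split B back out of dim 0.
  Shape grad_output = {3, 2, 8, 16, 16};
  Shape folded = reshape_dim_into(0, 0, grad_output);   // {6, 8, 16, 16}
  Shape restored = reshape_dim_outof(0, 3, folded);     // {3, 2, 8, 16, 16}
  for (auto d : restored) std::cout << d << ' ';
  std::cout << '\n';
}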
-#include -#include -#include +#include +#include +#include #include namespace at { namespace functorch { @@ -279,7 +279,7 @@ std::tuple batch_norm_backward_plumbing( std::tie(grad_normalized_input_value, grad_normalized_input_bdim) = unwrapTensorAtLevel(grad_normalized_input.transpose(0, 1), cur_level); // [B0, B, C, *] - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); const auto results = batch_norm_backward_no_weight_bias_batch_rule( grad_normalized_input_value, grad_normalized_input_bdim, input_value, input_bdim, @@ -308,7 +308,7 @@ std::tuple native_group_norm_plumbing( int64_t cur_level = maybe_layer->layerId(); if (!areAnyBatchedAtLevel({input, weight_opt, bias_opt}, cur_level)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::native_group_norm(input, weight_opt, bias_opt, N, C, HxW, group, eps); } @@ -323,13 +323,13 @@ std::tuple native_group_norm_plumbing( const auto input_ = reshape_dim_into(*input_bdim, 0, input_value); const auto bdim_size = input_value.size(*input_bdim); - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); const auto result = at::native_group_norm(input_, nullopt, nullopt, N * bdim_size, C, HxW, group, eps); result0 = makeBatched(reshape_dim_outof(0, bdim_size, std::get<0>(result)), 0, cur_level); mean = makeBatched(reshape_dim_outof(0, bdim_size, std::get<1>(result)), 0, cur_level); rstd = makeBatched(reshape_dim_outof(0, bdim_size, std::get<2>(result)), 0, cur_level); } else { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); const auto result = at::native_group_norm(input_value, nullopt, nullopt, N, C, HxW, group, eps); result0 = std::get<0>(result); mean = std::get<1>(result); @@ -397,7 +397,7 @@ std::tuple native_group_norm_backward_plumbing( int64_t cur_level = maybe_layer->layerId(); if (!areAnyBatchedAtLevel({grad_out, input, mean, rstd, weight_opt}, cur_level)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::native_group_norm_backward(grad_out, input, mean, rstd, weight_opt, N, C, HxW, group, output_mask); } @@ -441,7 +441,7 @@ std::tuple native_group_norm_backward_plumbing( std::tie(grad_normalized_input_value, grad_normalized_input_bdim) = unwrapTensorAtLevel(grad_normalized_input, cur_level); - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); const auto res = group_norm_backward_no_weight_bias_batch_rule( grad_normalized_input_value, grad_normalized_input_bdim, input_value, input_bdim, @@ -456,7 +456,7 @@ std::tuple native_group_norm_backward_plumbing( C10_ALWAYS_INLINE bool has_same_shape( const Tensor& tensor, optional tensor_bdim, - IntArrayRef normalized_shape) { + c10::SymIntArrayRef normalized_shape) { if (!tensor.defined()) { return true; } @@ -479,7 +479,7 @@ C10_ALWAYS_INLINE bool has_same_shape( C10_ALWAYS_INLINE void check_same_shape( const Tensor& tensor, optional tensor_bdim, - IntArrayRef normalized_shape, const std::string& name) { + c10::SymIntArrayRef normalized_shape, const std::string& name) { TORCH_CHECK(has_same_shape(tensor, tensor_bdim, normalized_shape), "Expected ", name, " to be of same shape as normalized_shape, 
but got ", name, " of shape ", @@ -490,7 +490,7 @@ C10_ALWAYS_INLINE void check_same_shape( // Ugh, hard to deduplicate C10_ALWAYS_INLINE void _check_layer_norm_inputs( - IntArrayRef normalized_shape, + SymIntArrayRef normalized_shape, const Tensor& weight, optional weight_bdim, const Tensor& bias, optional bias_bdim) { @@ -507,13 +507,13 @@ C10_ALWAYS_INLINE void _check_layer_norm_inputs( std::tuple,Tensor,optional,Tensor,optional> native_layer_norm_batch_rule( const Tensor& input, optional input_bdim, - IntArrayRef normalized_shape, + c10::SymIntArrayRef normalized_shape, const c10::optional& weight_opt, optional weight_bdim, const c10::optional& bias_opt, optional bias_bdim, double eps) { auto input_ = moveBatchDimToFront(input, input_bdim); if (!weight_bdim && !bias_bdim) { - const auto result = at::native_layer_norm(input_, normalized_shape, weight_opt, bias_opt, eps); + const auto result = at::native_layer_norm_symint(input_, normalized_shape, weight_opt, bias_opt, eps); const auto mean = std::get<1>(result); const auto rstd = std::get<2>(result); const auto stats_bdim = compute_stat_bdim(input_bdim, mean); @@ -528,7 +528,7 @@ native_layer_norm_batch_rule( _check_layer_norm_inputs(normalized_shape, weight, weight_bdim, bias, bias_bdim); const auto input_logical_rank = rankWithoutBatchDim(input, input_bdim); - const auto result = at::native_layer_norm(input_, normalized_shape, nullopt, nullopt, eps); + const auto result = at::native_layer_norm_symint(input_, normalized_shape, nullopt, nullopt, eps); auto result0 = std::get<0>(result); const auto mean = std::get<1>(result); const auto rstd = std::get<2>(result); @@ -607,7 +607,7 @@ std::tuple native_layer_norm_backward_plumbing TORCH_INTERNAL_ASSERT(maybe_layer.has_value()); int64_t cur_level = maybe_layer->layerId(); if (!areAnyBatchedAtLevel({grad_out, input, mean, rstd, weight_opt, bias_opt}, cur_level)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::native_layer_norm_backward(grad_out, input, normalized_shape, mean, rstd, weight_opt, bias_opt, output_mask); } @@ -667,7 +667,7 @@ std::tuple native_layer_norm_backward_plumbing std::tie(grad_normalized_input_value, grad_normalized_input_bdim) = unwrapTensorAtLevel(grad_normalized_input, cur_level); - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); const auto results = native_layer_norm_backward_no_weight_bias_batch_rule( grad_normalized_input_value, grad_normalized_input_bdim, input_value, input_bdim, @@ -761,7 +761,7 @@ struct NativeBatchNormBackwardBatchRuleHelper { if (!areAnyBatchedAtLevel({grad_out, input, weight_opt, running_mean_opt, running_var_opt, save_mean_opt, save_rstd_opt}, cur_level)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::native_batch_norm_backward(grad_out, input, weight_opt, running_mean_opt, running_var_opt, save_mean_opt, save_rstd_opt, training, eps, output_mask); @@ -791,7 +791,7 @@ struct CudnnBatchNormBackwardBatchRuleHelper { if (!areAnyBatchedAtLevel({input, grad_out, weight, running_mean_opt, running_var_opt, save_mean_opt, save_rstd_opt, reserve}, cur_level)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::cudnn_batch_norm_backward(input, grad_out, weight, running_mean_opt, 
running_var_opt, save_mean_opt, save_rstd_opt, eps, reserve); } @@ -819,7 +819,7 @@ struct MiopenBatchNormBackwardBatchRuleHelper { if (!areAnyBatchedAtLevel({input, grad_out, weight, running_mean_opt, running_var_opt, save_mean_opt, save_rstd_opt}, cur_level)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::miopen_batch_norm_backward(input, grad_out, weight, running_mean_opt, running_var_opt, save_mean_opt, save_rstd_opt, eps); } @@ -875,10 +875,28 @@ std::tuple cudnn_batch_norm_backward_wrapper( return at::miopen_batch_norm_backward(input, grad_out, weight_opt, running_mean_opt, running_var_opt, save_mean_opt, save_rstd_opt, eps); } -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +// NB: This is NOT good. In the ideal world, we do NOT want to convert the new legit op back into native_batch_norm +// as native_batch_norm has a problematic schema--it promises it is functional when it is not. However, vmap doesn't +// work with dynamo anyway so we gain some buffer room to do wrong things here. The (reasonable) hope is that we will +// make native_batch_norm composite implicit within a few weeks and we can fix this before vmap works with dynamo. +std::tuple _native_batch_norm_legit_batch( + const Tensor& self, const c10::optional& weight_opt, const c10::optional& bias_opt, + Tensor& running_mean, Tensor& running_var, bool train, double momentum, double eps) { + return at::native_batch_norm(self, weight_opt, bias_opt, running_mean, running_var, train, momentum, eps); +} + +std::tuple _native_batch_norm_legit_no_stats_batch( + const Tensor& self, const c10::optional& weight_opt, const c10::optional& bias_opt, + bool train, double momentum, double eps) { + return at::native_batch_norm(self, weight_opt, bias_opt, Tensor(), Tensor(), train, momentum, eps); +} + +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { VMAP_SUPPORT(native_batch_norm, NATIVE_BATCH_NORM_BATCH_RULE(native_batch_norm)); VMAP_SUPPORT(cudnn_batch_norm, CUDNN_BATCH_NORM_BATCH_RULE(cudnn_batch_norm)); VMAP_SUPPORT(miopen_batch_norm, MIOPEN_BATCH_NORM_BATCH_RULE(miopen_batch_norm)); + m.impl("_native_batch_norm_legit", _native_batch_norm_legit_batch); + m.impl("_native_batch_norm_legit.no_stats", _native_batch_norm_legit_no_stats_batch); m.impl("native_batch_norm_backward", NATIVE_BATCH_NORM_BACKWARD_BATCH_RULE(native_batch_norm_backward)); m.impl("cudnn_batch_norm_backward", CUDNN_BATCH_NORM_BACKWARD_BATCH_RULE(at::functorch::cudnn_batch_norm_backward_wrapper)); m.impl("miopen_batch_norm_backward", MIOPEN_BATCH_NORM_BACKWARD_BATCH_RULE(at::functorch::miopen_batch_norm_backward_wrapper)); diff --git a/functorch/functorch/csrc/BatchRulesPooling.cpp b/aten/src/ATen/functorch/BatchRulesPooling.cpp similarity index 92% rename from functorch/functorch/csrc/BatchRulesPooling.cpp rename to aten/src/ATen/functorch/BatchRulesPooling.cpp index a04cba329697..ad79f49bc3b3 100644 --- a/functorch/functorch/csrc/BatchRulesPooling.cpp +++ b/aten/src/ATen/functorch/BatchRulesPooling.cpp @@ -4,9 +4,9 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. 
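// Illustrative sketch (not part of the patch): the kBatchedKey -> DispatchKey::FuncTorchBatched
// change throughout these files keeps relying on the same RAII pattern: exclude the batched key
// for the current scope so the re-dispatch reaches the plain (unbatched) kernel. Below is a tiny
// self-contained model of such a guard; the real one is c10::impl::ExcludeDispatchKeyGuard and
// operates on thread-local dispatch key sets, so everything here is a made-up stand-in.
#include <cassert>
#include <set>
#include <string>

thread_local std::set<std::string> g_excluded_keys;  // stand-in for the TLS exclude set

class ExcludeKeyGuard {
 public:
  explicit ExcludeKeyGuard(std::string key)
      : key_(std::move(key)), was_excluded_(g_excluded_keys.count(key_) > 0) {
    g_excluded_keys.insert(key_);
  }
  ~ExcludeKeyGuard() {
    if (!was_excluded_) g_excluded_keys.erase(key_);  // restore prior state on scope exit
  }
 private:
  std::string key_;
  bool was_excluded_;
};

int main() {
  {
    ExcludeKeyGuard guard("FuncTorchBatched");
    assert(g_excluded_keys.count("FuncTorchBatched") == 1);
    // ... call back into the operator here; it now skips the batched kernel ...
  }
  assert(g_excluded_keys.count("FuncTorchBatched") == 0);  // restored
}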
-#include -#include -#include +#include +#include +#include #include namespace at { namespace functorch { @@ -35,7 +35,7 @@ max_pool2d_with_indices_batch_rule( reshape_dim_outof(0, bdim_size, std::get<1>(result)), 0); } -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { EXISTING_BDIM(_adaptive_avg_pool2d); EXISTING_BDIM_ALL_BOXED(_adaptive_avg_pool2d_backward); EXISTING_BDIM(_adaptive_avg_pool3d); diff --git a/functorch/functorch/csrc/BatchRulesRandomness.cpp b/aten/src/ATen/functorch/BatchRulesRandomness.cpp similarity index 87% rename from functorch/functorch/csrc/BatchRulesRandomness.cpp rename to aten/src/ATen/functorch/BatchRulesRandomness.cpp index a4a9ef9abcb7..159abc4108e8 100644 --- a/functorch/functorch/csrc/BatchRulesRandomness.cpp +++ b/aten/src/ATen/functorch/BatchRulesRandomness.cpp @@ -5,17 +5,23 @@ // LICENSE file in the root directory of this source tree. #include -#include -#include +#include +#include + +// This file contains batching rules for random operations. These are different +// from our regular batching rules: regular batching rules get registered to the +// FuncTorchBatched key, but batching rules for random operations get +// registered to FuncTorchVmapMode. This is because we need to interpose on +// random operations even if they're not on a BatchedTensor. namespace at { namespace functorch { template -Tensor random_batching_rule(IntArrayRef shape, ExtraArgs... extra_args) { - c10::impl::ExcludeDispatchKeyGuard guard(kVmapModeKey); +Tensor random_batching_rule(SymIntArrayRef shape, ExtraArgs... extra_args) { + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode); auto maybe_layer = maybeCurrentDynamicLayer(); - VmapDimVector shapeVec(1, maybe_layer->batchSize()); + c10::SmallVector shapeVec(1, maybe_layer->batchSize()); shapeVec.reserve(shape.size() + 1); shapeVec.insert(shapeVec.end(), shape.begin(), shape.end()); RandomnessType randomness = maybe_layer->randomness(); @@ -29,7 +35,7 @@ Tensor random_batching_rule(IntArrayRef shape, ExtraArgs... extra_args) { template Tensor& random_inplace_batching_rule(Tensor& self, ExtraArgs... extra_args) { - c10::impl::ExcludeDispatchKeyGuard guard(kVmapModeKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode); auto maybe_layer = maybeCurrentDynamicLayer(); const auto cur_level = maybe_layer->layerId(); Tensor self_value; @@ -54,7 +60,7 @@ Tensor& random_inplace_batching_rule(Tensor& self, ExtraArgs... extra_args) { } Tensor& bernoulli_inplace_Tensor_batching_rule(Tensor& self, const Tensor& p_, c10::optional gen) { - c10::impl::ExcludeDispatchKeyGuard guard(kVmapModeKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode); auto maybe_layer = maybeCurrentDynamicLayer(); auto cur_level = maybe_layer->layerId(); RandomnessType randomness = maybe_layer->randomness(); @@ -104,7 +110,7 @@ Tensor& bernoulli_inplace_Tensor_batching_rule(Tensor& self, const Tensor& p_, c template Tensor randperm_batching_rule(int64_t n, ExtraArgs... extra_args) { - c10::impl::ExcludeDispatchKeyGuard guard(kVmapModeKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode); auto maybe_layer = maybeCurrentDynamicLayer(); auto const batch_size = maybe_layer->batchSize(); RandomnessType randomness = maybe_layer->randomness(); @@ -124,7 +130,7 @@ Tensor randperm_batching_rule(int64_t n, ExtraArgs... extra_args) { template Tensor unary_pointwise_random_batch_rule(const Tensor& tensor, ExtraArgs... 
extra_args) { - c10::impl::ExcludeDispatchKeyGuard guard(kVmapModeKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode); auto maybe_layer = maybeCurrentDynamicLayer(); const auto cur_level = maybe_layer->layerId(); @@ -152,7 +158,7 @@ Tensor unary_pointwise_random_batch_rule(const Tensor& tensor, ExtraArgs... extr template Tensor tensor_like_random_batch_rule(const Tensor& self, ExtraArgs... extra_args) { - c10::impl::ExcludeDispatchKeyGuard guard(kVmapModeKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode); auto maybe_layer = maybeCurrentDynamicLayer(); const auto cur_level = maybe_layer->layerId(); RandomnessType randomness = maybe_layer->randomness(); @@ -178,7 +184,7 @@ Tensor tensor_like_random_batch_rule(const Tensor& self, ExtraArgs... extra_args } std::tuple native_dropout_batching_rule(const Tensor& tensor, double p, c10::optional train) { - c10::impl::ExcludeDispatchKeyGuard guard(kVmapModeKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode); auto maybe_layer = maybeCurrentDynamicLayer(); const auto cur_level = maybe_layer->layerId(); RandomnessType randomness = maybe_layer->randomness(); @@ -208,7 +214,7 @@ std::tuple native_dropout_batching_rule(const Tensor& tensor, dou } Tensor multinomial_batching_rule(const Tensor& self, const int64_t num_samples, const bool replacement, const c10::optional generator) { - c10::impl::ExcludeDispatchKeyGuard guard(kVmapModeKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode); auto maybe_layer = maybeCurrentDynamicLayer(); const auto cur_level = maybe_layer->layerId(); @@ -220,24 +226,32 @@ Tensor multinomial_batching_rule(const Tensor& self, const int64_t num_samples, RandomnessType randomness = maybe_layer->randomness(); check_randomness(randomness, self_bdim.has_value()); - if (randomness == RandomnessType::Different && !self_bdim) { - auto shape = self_value.sizes(); - VmapDimVector shapeVec(1, maybe_layer->batchSize()); - shapeVec.reserve(shape.size() + 1); - shapeVec.insert(shapeVec.end(), shape.begin(), shape.end()); - self_value = self_value.expand(shapeVec); - } - if (self_value.dim() == 3 && (self_bdim || randomness == RandomnessType::Different)) { - self_value = reshape_dim_into(1, 0, self_value); - } - auto out = multinomial(self_value, num_samples, replacement, generator); - if (randomness == RandomnessType::Same && !self_bdim) { - return out; - } - if(self_value.dim() == 3 && self_bdim) { - out = out.reshape(self.sizes()); + if (randomness == RandomnessType::Different) { + // 1D cases: S -> BS -> multinomial(BS) + // BS -> multinomial(BS) + // + // 2D cases: MS -> BMS -> (BM)S -> multinomial((BM)S) -> (BM)S -> BMS + // BMS -> (BM)S -> multinomial((BM)S) -> (BM)S -> BMS + const auto is_2D_case = rankWithoutBatchDim(self_value, self_bdim) == 2; + if (!self_bdim.has_value()) { + self_value = ensure_has_bdim(self_value, self_bdim.has_value(), maybe_layer->batchSize()); + } + if (is_2D_case) { + self_value = reshape_dim_into(0, 0, self_value); + } + auto out = multinomial(self_value, num_samples, replacement, generator); + if (is_2D_case) { + out = reshape_dim_outof(0, maybe_layer->batchSize(), out); + } + return makeBatched(out, 0, cur_level);; } - return makeBatched(out, 0, cur_level); + + TORCH_INTERNAL_ASSERT(randomness == RandomnessType::Same); // check_randomness eliminates error randomness + TORCH_INTERNAL_ASSERT(!self_bdim.has_value()); // check_randomness eliminates same randomness with batched input + // Must be 
same randomness with unbatched input + // 1D case: S -> multinomial(S) -> S + // 2D case: MS -> multinomial(MS) -> MS + return multinomial(self_value, num_samples, replacement, generator); } template @@ -245,13 +259,13 @@ struct RandomBatchRuleHelper; template struct RandomBatchRuleHelper> { - static Tensor apply(IntArrayRef shape, T... extra_args) { + static Tensor apply(SymIntArrayRef shape, T... extra_args) { return random_batching_rule(shape, std::forward(extra_args)...); } }; template -Tensor rand_int_wrapper(IntArrayRef shape, int64_t high, T... extra_args) { +Tensor rand_int_wrapper(SymIntArrayRef shape, int64_t high, T... extra_args) { return Func(high, shape, std::forward(extra_args)...); } @@ -270,7 +284,7 @@ struct RandIntBatchRuleHelper; template struct RandIntBatchRuleHelper> { - static Tensor apply(int64_t high, IntArrayRef shape, T... extra_args) { + static Tensor apply(int64_t high, SymIntArrayRef shape, T... extra_args) { return random_batching_rule), &rand_int_wrapper, int64_t, T...>(shape, high, std::forward(extra_args)...); @@ -278,7 +292,7 @@ struct RandIntBatchRuleHelper> { }; template -Tensor rand_int_low_wrapper(IntArrayRef shape, T0 scalar0, T1 scalar1, T... extra_args) { +Tensor rand_int_low_wrapper(SymIntArrayRef shape, T0 scalar0, T1 scalar1, T... extra_args) { return Func(scalar0, scalar1, shape, std::forward(extra_args)...); } @@ -287,7 +301,7 @@ struct RandTwoLeadingScalarsBatchRuleHelper; template struct RandTwoLeadingScalarsBatchRuleHelper> { - static Tensor apply(T0 scalar0, T1 scalar1, IntArrayRef shape, T... extra_args) { + static Tensor apply(T0 scalar0, T1 scalar1, SymIntArrayRef shape, T... extra_args) { return random_batching_rule), &rand_int_low_wrapper, int64_t, int64_t, T...>(shape, scalar0, scalar1, std::forward(extra_args)...); diff --git a/functorch/functorch/csrc/BatchRulesReduceOps.cpp b/aten/src/ATen/functorch/BatchRulesReduceOps.cpp similarity index 96% rename from functorch/functorch/csrc/BatchRulesReduceOps.cpp rename to aten/src/ATen/functorch/BatchRulesReduceOps.cpp index 17f7a263f4ee..9126507e73be 100644 --- a/functorch/functorch/csrc/BatchRulesReduceOps.cpp +++ b/aten/src/ATen/functorch/BatchRulesReduceOps.cpp @@ -4,8 +4,8 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. 
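// Illustrative sketch (not part of the patch): the shape bookkeeping in random_batching_rule
// above. With randomness == Different every vmapped example gets its own sample, so the vmap
// batch size is prepended to the requested shape before calling the random op; with
// randomness == Same the original shape is used and the result is shared across the batch.
// Plain integers stand in for SymInt; the names below are illustrative only.
#include <cstdint>
#include <iostream>
#include <vector>

enum class Randomness { Same, Different };

std::vector<int64_t> physical_random_shape(
    const std::vector<int64_t>& requested, int64_t batch_size, Randomness r) {
  if (r == Randomness::Same) {
    return requested;                       // one sample, shared by every example
  }
  std::vector<int64_t> out;
  out.reserve(requested.size() + 1);
  out.push_back(batch_size);                // [B, *requested]
  out.insert(out.end(), requested.begin(), requested.end());
  return out;
}

int main() {
  auto same = physical_random_shape({4, 5}, 8, Randomness::Same);       // {4, 5}
  auto diff = physical_random_shape({4, 5}, 8, Randomness::Different);  // {8, 4, 5}
  std::cout << same.size() << " dims vs " << diff.size() << " dims\n";  // 2 vs 3
}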
-#include -#include +#include +#include #include #include @@ -20,10 +20,6 @@ Tensor sum_decomp( return at::sum(self, range(0, self.dim()), false, dtype); } -Tensor sum_symint_decomp(const Tensor& input_t, c10::SymIntArrayRef dim, bool keepdim, optional opt_dtype) { - return at::sum(input_t, c10::asIntArrayRefSlow(dim), keepdim, opt_dtype); -} - Tensor mean_decomp( const Tensor& self, optional dtype) { return at::mean(self, range(0, self.dim()), false, dtype); @@ -74,14 +70,14 @@ void boxed_reduction_batch_rule(const c10::OperatorHandle& op, torch::jit::Stack const auto num_returns = schema.returns().size(); const auto num_arguments = schema.arguments().size(); - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); auto maybe_layer = maybeCurrentDynamicLayer(); TORCH_INTERNAL_ASSERT(maybe_layer.has_value()); int64_t cur_level = maybe_layer->layerId(); auto orig_arguments = torch::jit::last(*stack, num_arguments); if (std::none_of(orig_arguments.begin(), orig_arguments.end(), ivalueParticipatesInCurrentLevel)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); op.callBoxed(stack); return; } @@ -172,7 +168,7 @@ void boxed_reduction_batch_rule(const c10::OperatorHandle& op, torch::jit::Stack #define REDUCTION_BOXED_ARGS(op, dim_pos) \ m.impl(#op, torch::CppFunction::makeFromBoxedFunction>()); -// Skipping frobenius/nuclear/all/any since they don't have opinfo tests right now :P +// Skipping all/any since they don't have opinfo tests right now :P Tensor dist_decomp(const Tensor& self, const Tensor& other, const Scalar& p) { return at::norm((self - other), p); @@ -372,7 +368,7 @@ std::tuple> searchsorted_batch_rule( TORCH_INTERNAL_ASSERT(false); } -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { VMAP_SUPPORT2(searchsorted, Tensor, searchsorted_batch_rule); REDUCTION_BOXED(_fft_r2c); REDUCTION_BOXED(_fft_c2r); @@ -426,7 +422,6 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { REDUCTION_BOXED(_log_softmax); REDUCTION_BOXED_ARGS(rot90, 2); VMAP_SUPPORT(aminmax, aminmax_batching_rule); - m.impl("sum.SymInt", sum_symint_decomp); VMAP_SUPPORT(_log_softmax_backward_data, _log_softmax_backward_batch_rule); VMAP_SUPPORT(_softmax_backward_data, _softmax_backward_batch_rule); } diff --git a/functorch/functorch/csrc/BatchRulesScatterOps.cpp b/aten/src/ATen/functorch/BatchRulesScatterOps.cpp similarity index 98% rename from functorch/functorch/csrc/BatchRulesScatterOps.cpp rename to aten/src/ATen/functorch/BatchRulesScatterOps.cpp index da01d464908e..fc51e9d74409 100644 --- a/functorch/functorch/csrc/BatchRulesScatterOps.cpp +++ b/aten/src/ATen/functorch/BatchRulesScatterOps.cpp @@ -4,11 +4,11 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. 
-#include +#include #include #include -#include -#include +#include +#include #include #include #include @@ -317,7 +317,7 @@ std::tuple> index_batch_rule( // plumbing done since we don't support List> in codegen Tensor index_plumbing(const Tensor & self, const List> & indices ) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); auto maybe_layer = maybeCurrentDynamicLayer(); TORCH_INTERNAL_ASSERT(maybe_layer.has_value()); int64_t cur_level = maybe_layer->layerId(); @@ -504,7 +504,7 @@ void index_put__batch_rule( // plumbing done since we don't support List> in codegen Tensor& index_put__plumbing(Tensor & self, const List> & indices , const Tensor & values, bool accumulate) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); auto maybe_layer = maybeCurrentDynamicLayer(); TORCH_INTERNAL_ASSERT(maybe_layer.has_value()); int64_t cur_level = maybe_layer->layerId(); @@ -543,7 +543,7 @@ void _index_put_impl__batch_rule( // plumbing done since we don't support List> in codegen Tensor &_index_put_impl__plumbing(Tensor &self, const List> &indices, const Tensor &values, bool accumulate, bool unsafe) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); auto maybe_layer = maybeCurrentDynamicLayer(); TORCH_INTERNAL_ASSERT(maybe_layer.has_value()); int64_t cur_level = maybe_layer->layerId(); @@ -664,7 +664,7 @@ std::tuple> index_put_batch_rule( // plumbing done since we don't support List> in codegen Tensor index_put_plumbing(const Tensor & self, const List> & indices, const Tensor & values, bool accumulate) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); auto maybe_layer = maybeCurrentDynamicLayer(); TORCH_INTERNAL_ASSERT(maybe_layer.has_value()); int64_t cur_level = maybe_layer->layerId(); @@ -928,6 +928,11 @@ Tensor index_copy_decomp( return at::scatter(self, dim, index_, source); ; } +// Note [Fix vmap slice_scatter] +// registers a decomposition for `slice_scatter` that calls into `slice.src` +// *_scatter operators have some special semantics though, that we can't easily +// through a decomposition: slice_scatter's output needs to have the same +// size, size, strides and storage_offset as the input. Tensor slice_scatter_decomp(const Tensor &self, const Tensor &src, int64_t dim, c10::optional start, c10::optional end, int64_t step) @@ -1050,7 +1055,7 @@ std::tuple> masked_fill_scalar_batch_rule( return std::make_tuple(result, 0); } -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { m.impl("index.Tensor", index_plumbing); m.impl("index_put_", index_put__plumbing); m.impl("index_put", index_put_plumbing); diff --git a/functorch/functorch/csrc/BatchRulesUnaryOps.cpp b/aten/src/ATen/functorch/BatchRulesUnaryOps.cpp similarity index 97% rename from functorch/functorch/csrc/BatchRulesUnaryOps.cpp rename to aten/src/ATen/functorch/BatchRulesUnaryOps.cpp index 660cb1f3c713..ee6391c6e284 100644 --- a/functorch/functorch/csrc/BatchRulesUnaryOps.cpp +++ b/aten/src/ATen/functorch/BatchRulesUnaryOps.cpp @@ -4,8 +4,8 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. 
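// Illustrative sketch (not part of the patch): the Note [Fix vmap slice_scatter] above points out
// that slice_scatter is not just "take a slice" -- its output has the full shape of `self`, with
// only the sliced positions overwritten by `src`. A minimal 1-D model of that contract on plain
// std::vector, purely for illustration; nothing here is ATen code.
#include <cassert>
#include <cstdint>
#include <vector>

std::vector<int> slice_scatter_1d(std::vector<int> self, const std::vector<int>& src,
                                  int64_t start, int64_t end, int64_t step) {
  int64_t i = 0;
  for (int64_t pos = start; pos < end; pos += step) {
    self[pos] = src[i++];                   // overwrite only the sliced positions
  }
  return self;                              // same length as the input, not src's length
}

int main() {
  std::vector<int> self = {0, 1, 2, 3, 4, 5};
  std::vector<int> src = {10, 20, 30};
  auto out = slice_scatter_1d(self, src, /*start=*/1, /*end=*/6, /*step=*/2);
  assert((out == std::vector<int>{0, 10, 2, 20, 4, 30}));
  assert(out.size() == self.size());        // output keeps the input's size
}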
-#include -#include +#include +#include namespace at { namespace functorch { @@ -79,7 +79,7 @@ to_other_batch_rule(const Tensor& self, optional self_bdim, return std::make_tuple(self.to(other, non_blocking, copy, memory_format), self_bdim); } -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { #define UNARY_POINTWISE_ALL2(op, overload) \ POINTWISE_BOXED2(op ## _, overload); \ @@ -139,7 +139,6 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { UNARY_POINTWISE_ALL(mvlgamma); UNARY_POINTWISE_ALL(nan_to_num); UNARY_POINTWISE_ALL(neg); - UNARY_POINTWISE_ALL(positive); UNARY_POINTWISE_ALL(rad2deg); UNARY_POINTWISE_ALL(reciprocal); UNARY_POINTWISE_ALL(round); diff --git a/functorch/functorch/csrc/BatchRulesViews.cpp b/aten/src/ATen/functorch/BatchRulesViews.cpp similarity index 83% rename from functorch/functorch/csrc/BatchRulesViews.cpp rename to aten/src/ATen/functorch/BatchRulesViews.cpp index e4160ea4c98f..98eaf0f387a6 100644 --- a/functorch/functorch/csrc/BatchRulesViews.cpp +++ b/aten/src/ATen/functorch/BatchRulesViews.cpp @@ -4,12 +4,12 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. -#include +#include #include #include -#include -#include +#include +#include #include #include #include @@ -58,7 +58,7 @@ namespace at { namespace functorch { // // Now that we have written `sum_batch_rule`, we have to register it inside a // TORCH_LIBRARY_IMPL block: -// TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +// TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { // ... // VMAP_SUPPORT2(sum, int, sum_batch_rule); // ... @@ -79,7 +79,7 @@ namespace at { namespace functorch { // return torch.add(self, product, value); // } // And register it inside a TORCH_LIBRARY_IMPL block: -// TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +// TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { // ... // m.impl("addcmul", addcmul_decomp); // ... @@ -102,15 +102,15 @@ std::tuple> unsqueeze_batch_rule( std::tuple> repeat_batch_rule( const Tensor& self, optional self_bdim, - IntArrayRef sizes) { + c10::SymIntArrayRef sizes) { - VmapDimVector sizes_with_bdim = { sizes.begin(), sizes.end() }; + SymDimVector sizes_with_bdim = { sizes.begin(), sizes.end() }; sizes_with_bdim.insert(sizes_with_bdim.begin(), 1); auto self_ = moveBatchDimToFront(self, self_bdim); while (self_.dim() < (int64_t)sizes_with_bdim.size()) { self_ = self_.unsqueeze(1); } - return std::make_tuple(self_.repeat(sizes_with_bdim), 0); + return std::make_tuple(self_.repeat_symint(sizes_with_bdim), 0); } @@ -136,22 +136,22 @@ std::tuple> diag_batch_rule( std::tuple> _unsafe_view_batch_rule( const Tensor& self, optional self_bdim, - IntArrayRef size) { + c10::SymIntArrayRef size) { auto self_ = moveBatchDimToFront(self, self_bdim); - VmapDimVector view_size(size); + SymDimVector view_size(size); view_size.insert(view_size.begin(), self_.size(0)); // See if the view is valid. If it's not, then we copy. // It's OK to copy, because _unsafe_view(x) guarantees that x isn't used // anymore. 
- const at::DimVector inferred_size = at::infer_size_dv(view_size, self_.numel()); - const auto stride = at::detail::computeStride(self_.sizes(), - self_.strides(), + const at::SymDimVector inferred_size = at::infer_size_dv(view_size, self_.sym_numel()); + const auto stride = at::detail::computeStride(self_.sym_sizes(), + self_.sym_strides(), inferred_size); if (!stride.has_value()) { self_ = self_.contiguous(); } - return std::make_tuple(at::_unsafe_view(self_, view_size), 0); + return std::make_tuple(at::_unsafe_view_symint(self_, view_size), 0); } std::tuple> flip_batch_rule(const Tensor& self, optional self_bdim, IntArrayRef dims) { @@ -175,7 +175,7 @@ const Tensor& resize__plumbing( TORCH_INTERNAL_ASSERT(maybe_layer.has_value()); int64_t cur_level = maybe_layer->layerId(); if (!isBatchedAtLevel(self, cur_level)) { - c10::impl::ExcludeDispatchKeyGuard guard2(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard2(DispatchKey::FuncTorchBatched); return self.resize_(size, optional_memory_format); } @@ -190,7 +190,7 @@ const Tensor& resize__plumbing( TORCH_INTERNAL_ASSERT(self_bdim.value() == 0, "NYI: resize_ batch rule for batch dim != 0"); // Resize the wrapped tensor - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); self_value = moveBatchDimToFront(self_value, self_bdim); VmapDimVector new_size(size); new_size.insert(new_size.begin(), self_value.size(*self_bdim)); @@ -275,26 +275,26 @@ std::tuple, optional> chunk_batching_rule(const Ten return std::make_tuple(at::chunk(self_, chunks, new_dim), 0); } -std::tuple> select_batching_rule(const Tensor& self, optional bdim, int64_t dim, int64_t index) { +std::tuple> select_batching_rule(const Tensor& self, optional bdim, int64_t dim, c10::SymInt index) { if (!bdim) { - return std::make_tuple(self.select(dim, index), nullopt); + return std::make_tuple(self.select_symint(dim, index), nullopt); } auto _self = moveBatchDimToFront(self, bdim); auto dim_physical = getPhysicalDim(_self, true, dim); - auto result = _self.select(dim_physical, index); + auto result = _self.select_symint(dim_physical, index); return std::make_tuple(result, 0); } -std::tuple> _reshape_alias_batch_rule(const Tensor& self, optional bdim, const IntArrayRef shape, const IntArrayRef strides) { +std::tuple> _reshape_alias_batch_rule(const Tensor& self, optional bdim, const c10::SymIntArrayRef shape, const c10::SymIntArrayRef strides) { (void) strides; TORCH_INTERNAL_ASSERT(bdim.has_value()); auto self_ = moveBatchDimToFront(self, bdim); - c10::SmallBuffer new_shape(shape.size() + 1); - new_shape[0] = self_.size(0); + c10::SymDimVector new_shape(shape.size() + 1); + new_shape[0] = self_.sym_size(0); std::copy(shape.begin(), shape.end(), new_shape.begin() + 1); - return std::make_tuple(at::reshape(self_, new_shape), 0); + return std::make_tuple(at::reshape_symint(self_, new_shape), 0); } std::tuple> roll_batch_rule(const Tensor& self, optional bdim, IntArrayRef shifts, IntArrayRef dims) { @@ -330,15 +330,15 @@ std::tuple> diagonal_batching_rule( std::tuple> diagonal_backward_batch_rule( const Tensor& grad_input, optional grad_input_bdim, - IntArrayRef input_sizes, int64_t offset, int64_t dim1, int64_t dim2) { + c10::SymIntArrayRef input_sizes, int64_t offset, int64_t dim1, int64_t dim2) { auto logical_rank = rankWithoutBatchDim(grad_input, grad_input_bdim); auto grad_input_ = moveBatchDimToFront(grad_input, grad_input_bdim); dim1 = maybe_wrap_dim(dim1, logical_rank + 1) + 1; dim2 = 
maybe_wrap_dim(dim2, logical_rank + 1) + 1; - c10::SmallBuffer input_sizes_(input_sizes.size() + 1); + c10::SymDimVector input_sizes_(input_sizes.size() + 1); input_sizes_[0] = grad_input_.size(0); std::copy(input_sizes.begin(), input_sizes.end(), input_sizes_.begin() + 1); - auto result = at::diagonal_backward(grad_input_, input_sizes_, offset, dim1, dim2); + auto result = at::diagonal_backward_symint(grad_input_, input_sizes_, offset, dim1, dim2); return std::make_tuple(std::move(result), 0); } @@ -346,13 +346,13 @@ std::tuple> slice_batch_rule( const Tensor& self, optional self_bdim, int64_t dim, - c10::optional start, - c10::optional end, - int64_t step) { + c10::optional start, + c10::optional end, + c10::SymInt step) { auto self_ = moveBatchDimToFront(self, self_bdim); dim = getPhysicalDim(self, self_bdim.has_value(), dim); - auto result = self_.slice(dim, start, end, step); + auto result = self_.slice_symint(dim, start, end, step); return std::make_tuple(result, 0); } @@ -402,51 +402,58 @@ std::tuple> permute_batching_rule( std::tuple> select_backward_batch_rule( const Tensor& grad_input, optional grad_input_bdim, - IntArrayRef input_sizes, int64_t dim, int64_t index) { + c10::SymIntArrayRef input_sizes, int64_t dim, c10::SymInt index) { auto logical_rank = rankWithoutBatchDim(grad_input, grad_input_bdim); auto grad_input_ = moveBatchDimToFront(grad_input, grad_input_bdim); dim = maybe_wrap_dim(dim, logical_rank + 1) + 1; - c10::SmallBuffer input_sizes_(input_sizes.size() + 1); - input_sizes_[0] = grad_input_.size(0); + c10::SymDimVector input_sizes_(input_sizes.size() + 1); + input_sizes_[0] = grad_input_.sym_size(0); std::copy(input_sizes.begin(), input_sizes.end(), input_sizes_.begin() + 1); - auto result = at::select_backward(grad_input_, input_sizes_, dim, index); + auto result = at::select_backward_symint(grad_input_, input_sizes_, dim, index); return std::make_tuple(std::move(result), 0); } std::tuple> slice_backward_batch_rule( const Tensor& grad_input, optional grad_input_bdim, - IntArrayRef input_sizes, int64_t dim, int64_t start, int64_t end, int64_t step) { + SymIntArrayRef input_sizes, int64_t dim, c10::SymInt start, c10::SymInt end, c10::SymInt step) { auto logical_rank = rankWithoutBatchDim(grad_input, grad_input_bdim); auto grad_input_ = moveBatchDimToFront(grad_input, grad_input_bdim); dim = maybe_wrap_dim(dim, logical_rank) + 1; - c10::SmallBuffer input_sizes_(input_sizes.size() + 1); + c10::SymDimVector input_sizes_(input_sizes.size() + 1); input_sizes_[0] = grad_input_.size(0); std::copy(input_sizes.begin(), input_sizes.end(), input_sizes_.begin() + 1); - auto result = at::slice_backward(grad_input_, input_sizes_, dim, start, end, step); + auto result = at::slice_backward_symint(grad_input_, input_sizes_, dim, start, end, step); return std::make_tuple(std::move(result), 0); } std::tuple> view_batching_rule( - const Tensor &self, optional self_bdim, IntArrayRef size) + const Tensor &self, optional self_bdim, SymIntArrayRef sym_size) { TORCH_INTERNAL_ASSERT(self_bdim.has_value()); auto self_ = moveBatchDimToFront(self, self_bdim); - VmapDimVector size_(size.size() + 1); + c10::SmallVector size_(sym_size.size() + 1); // copy batch size size_[0] = self_.size(0); - std::copy(size.cbegin(), size.cend(), size_.begin() + 1); - return std::make_tuple(self_.view(size_), 0); + std::copy(sym_size.cbegin(), sym_size.cend(), size_.begin() + 1); + return std::make_tuple(self_.view_symint(size_), 0); } -Tensor view_symint_decomposition(const Tensor& self, - c10::SymIntArrayRef 
size) { - return self.view( c10::asIntArrayRefSlow(size)); +std::tuple> view_copy_batch_rule( + const Tensor& self, + optional self_bdim, + c10::SymIntArrayRef size) { + auto self_ = moveBatchDimToFront(self, self_bdim); + SymDimVector view_size(size.size() + 1); + view_size[0] = self_.size(0); + std::copy(size.cbegin(), size.cend(), view_size.begin() + 1); + + return std::make_tuple(at::view_copy_symint(self_, view_size), 0); } template std::tuple> expand_batch_rule( - const Tensor &self, optional self_bdim, IntArrayRef size, bool implicit) + const Tensor &self, optional self_bdim, SymIntArrayRef size, bool implicit) { auto self_dim = self.dim(); TORCH_CHECK(static_cast(self_dim - 1) <= size.size(), @@ -457,7 +464,7 @@ std::tuple> expand_batch_rule( auto self_sizes = self_.sizes(); auto batch_size = self_sizes[0]; - c10::SmallBuffer size_(size.size() + 1); + c10::SmallVector size_(size.size() + 1); size_[0] = batch_size; std::copy(size.cbegin(), size.cend(), size_.begin() + 1); @@ -471,12 +478,12 @@ std::tuple> expand_batch_rule( // so the strategy here is to view it first as a tensor of size [B0, 1, 3] and // then expand. auto extra_dims = size.size() - (self_dim - 1); - VmapDimVector view_shape(size_.size(), /*init_value*/1); + c10::SmallVector view_shape(size_.size(), /*init_value*/1); view_shape[0] = batch_size; std::copy(self_sizes.cbegin() + 1, self_sizes.cend(), view_shape.begin() + 1 + extra_dims); - return std::make_tuple(Func(self_.view(view_shape), size_, implicit), 0); + return std::make_tuple(Func(self_.view_symint(view_shape), size_, implicit), 0); } std::tuple> unfold_batch_rule( @@ -496,6 +503,18 @@ std::tuple> unfold_batch_rule( return std::make_tuple(result, 0); } +std::tuple> narrow_copy_batch_rule( + const Tensor &self, optional self_bdim, int64_t dim, c10::SymInt start, c10::SymInt length) +{ + TORCH_INTERNAL_ASSERT(self_bdim.has_value()); + auto self_ = moveBatchDimToFront(self, self_bdim); + auto logical_rank = rankWithoutBatchDim(self, self_bdim); + dim = maybe_wrap_dim(dim, logical_rank) + 1; + auto result = self_.narrow_copy_symint(dim, start, length); + + return std::make_tuple(result, 0); +} + std::tuple> movedim_batch_rule(const Tensor& self, optional self_bdim, IntArrayRef source, IntArrayRef destination) { auto self_ = moveBatchDimToFront(self, self_bdim); auto source_ = getPhysicalDims(self_, self_bdim.has_value(), source); @@ -511,20 +530,16 @@ std::tuple> diag_embed_batch_rule(const Tensor& self, return std::make_tuple(at::diag_embed(self_, offset, dim1, dim2), 0); } -// We need to write a real batching rule to fully support symint. -// This requires symint variants of other operations, like `view`, -// which don't exist yet. 
-Tensor expand_symint_decomp_hack(const Tensor& self, SymIntArrayRef packed_size, bool implicit) { - auto size = asIntArrayRefSlow(packed_size); - return self.expand(size, implicit); +Tensor trace_decomp(const Tensor& tensor) { + return tensor.diagonal().sum(); } -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { VMAP_SUPPORT(diag, diag_batch_rule); VMAP_SUPPORT(chunk, chunk_batching_rule); m.impl("flatten.using_ints", static_cast(native::flatten)); VMAP_SUPPORT(flip, flip_batch_rule); - RUN_JIT_DECOMPOSITION(trace) + m.impl("trace", trace_decomp); VMAP_SUPPORT(tril, VARIADIC_BDIMS_BATCH_RULE(ATEN_FN(tril))); VMAP_SUPPORT(triu, VARIADIC_BDIMS_BATCH_RULE(ATEN_FN(triu))); VMAP_SUPPORT(repeat, repeat_batch_rule); @@ -542,6 +557,7 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { VMAP_SUPPORT(select_backward, select_backward_batch_rule); VMAP_SUPPORT(slice_backward, slice_backward_batch_rule); VMAP_SUPPORT(view, view_batching_rule); + VMAP_SUPPORT(view_copy, view_copy_batch_rule); VMAP_SUPPORT(expand, SINGLE_ARG(expand_batch_rule)); VMAP_SUPPORT(expand_copy, SINGLE_ARG(expand_batch_rule)); VMAP_SUPPORT(unfold, unfold_batch_rule); @@ -549,8 +565,7 @@ TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { VMAP_SUPPORT2(slice, Tensor, slice_batch_rule); VMAP_SUPPORT2(transpose, int, transpose_int_batch_rule); VMAP_SUPPORT(diag_embed, diag_embed_batch_rule); - m.impl("expand.SymInt", expand_symint_decomp_hack); - m.impl("view.SymInt", view_symint_decomposition); + VMAP_SUPPORT(narrow_copy, narrow_copy_batch_rule); } }} diff --git a/functorch/functorch/csrc/BatchedFallback.cpp b/aten/src/ATen/functorch/BatchedFallback.cpp similarity index 97% rename from functorch/functorch/csrc/BatchedFallback.cpp rename to aten/src/ATen/functorch/BatchedFallback.cpp index 6b6c58b243ee..87cdcc0fe9fc 100644 --- a/functorch/functorch/csrc/BatchedFallback.cpp +++ b/aten/src/ATen/functorch/BatchedFallback.cpp @@ -4,12 +4,11 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. 
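// Illustrative sketch (not part of the patch): the trace_decomp registered above replaces the old
// RUN_JIT_DECOMPOSITION(trace) with a direct C++ decomposition, trace(x) == x.diagonal().sum(),
// which vmap already knows how to batch through the diagonal and sum rules. A quick standalone
// check of that identity on a small row-major matrix (no ATen involved, just the arithmetic):
#include <cassert>
#include <cstddef>
#include <vector>

double trace_via_diagonal(const std::vector<double>& m, std::size_t n) {
  double acc = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    acc += m[i * n + i];                    // the i-th diagonal element
  }
  return acc;
}

int main() {
  // 3x3 matrix, row-major: trace = 1 + 5 + 9 = 15
  std::vector<double> m = {1, 2, 3, 4, 5, 6, 7, 8, 9};
  assert(trace_via_diagonal(m, 3) == 15.0);
}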
-#include -#include -#include -#include -#include -#include +#include +#include +#include +#include +#include #include #include @@ -268,7 +267,7 @@ void batchedTensorForLoopFallback(const c10::OperatorHandle& op, torch::jit::Sta "We could not generate a fallback."); if (std::none_of(arguments.begin(), arguments.end(), ivalueParticipatesInCurrentLevel)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); op.callBoxed(stack); return; } @@ -354,7 +353,7 @@ void batchedTensorForLoopFallback(const c10::OperatorHandle& op, torch::jit::Sta // argument is a BatchedTensor TORCH_INTERNAL_ASSERT(input_physical_views_iter != input_physical_views.end()); const auto& physical_view_for_argument = *input_physical_views_iter; - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); torch::jit::push(stack, physical_view_for_argument.tensor().index(index)); batched_tensor_inputs_pos_iter++; input_physical_views_iter++; @@ -362,7 +361,7 @@ void batchedTensorForLoopFallback(const c10::OperatorHandle& op, torch::jit::Sta // std::cout << "[Fallback]: "; // at::dump_tensor((*stack)[stack->size() - 1].toTensor()); - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); op.callBoxed(stack); // Store the result into `output_shards`. See NOTE: [Output shards layout] @@ -379,7 +378,7 @@ void batchedTensorForLoopFallback(const c10::OperatorHandle& op, torch::jit::Sta auto output_shards_chunks = MatrixRef(output_shards, num_batches); for (const auto return_idx : c10::irange(0, num_returns)) { auto shards = output_shards_chunks[return_idx]; - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); auto flat_output = safeStack(shards); // See NOTE [vmap through backward and undefined grad] if (!flat_output.defined()) { diff --git a/functorch/functorch/csrc/BatchedFallback.h b/aten/src/ATen/functorch/BatchedFallback.h similarity index 63% rename from functorch/functorch/csrc/BatchedFallback.h rename to aten/src/ATen/functorch/BatchedFallback.h index 9130245f28b1..05d223568a37 100644 --- a/functorch/functorch/csrc/BatchedFallback.h +++ b/aten/src/ATen/functorch/BatchedFallback.h @@ -12,14 +12,19 @@ namespace at { namespace functorch { +// This file contains code for the vmap fallback (also known as the +// BatchedTensor fallback or the Batched fallback). This code runs +// when an operation doesn't have a batching rule implemented. + // If an operator doesn't have a batching rule implemented then we fallback -// to this implementation. The fallback only works on out-of-place operators -// that return only tensors with new memory. (e.g., no in-place operators, no -// view operations). +// to this implementation. The fallback doesn't work on out= variants or +// view operations; that is, it works for out-of-place operations and +// in-place non-view operations. // -// The fallback effectively takes all of the BatchedTensors in `stack`, slices -// them, and runs `op` on all of the corresponding slices to produce slices -// of the outputs. The output slices then get `torch.stack`ed to create the +// For out-of-place operations, the fallback effectively takes all of the +// BatchedTensors in `stack`, slices them, and runs `op` on all of the +// corresponding slices to produce slices of the outputs. 
The output slices +// then get `torch.stack`ed to create the // final returns. // // The performance of the fallback is not very good because it introduces an @@ -27,11 +32,15 @@ namespace functorch { // write batching rules for operators whenever possible. void batchedTensorForLoopFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack); -bool isVmapFallbackWarningEnabled(); -void setVmapFallbackWarningEnabled(bool enabled); +// The vmap fallback emits a warning by default, but it may be disabled if +// the user finds it to be too annoying. +TORCH_API bool isVmapFallbackWarningEnabled(); +TORCH_API void setVmapFallbackWarningEnabled(bool enabled); -bool isVmapFallbackEnabled(); -void setVmapFallbackEnabled(bool enabled); +// Used for testing. The vmap fallback is enabled by default. When it is disabled, +// it raises an error. +TORCH_API bool isVmapFallbackEnabled(); +TORCH_API void setVmapFallbackEnabled(bool enabled); template A vector_to_result(const std::vector& buffer) { return buffer[0].to(); @@ -43,8 +52,8 @@ template std::tuple vector_to_resu return std::make_tuple(buffer[0].to(), buffer[1].to(), buffer[2].to()); } -// This is a way to call the slow fallback from inside some plumbing -// TODO: Probably better way to metaprogram this +// slow_fallback is a way to call the vmap fallback inside some boxed kernel. +// There is probably some better way to metaprogram this. template Ret slow_fallback(const c10::OperatorHandle& op, ArrayRef args) { std::vector stack(args.begin(), args.end()); diff --git a/functorch/functorch/csrc/BatchedTensorImpl.cpp b/aten/src/ATen/functorch/BatchedTensorImpl.cpp similarity index 58% rename from functorch/functorch/csrc/BatchedTensorImpl.cpp rename to aten/src/ATen/functorch/BatchedTensorImpl.cpp index 487df2900071..c5d6eb34030d 100644 --- a/functorch/functorch/csrc/BatchedTensorImpl.cpp +++ b/aten/src/ATen/functorch/BatchedTensorImpl.cpp @@ -3,51 +3,19 @@ // // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. -#include +#include #include #include -#include #include namespace at { namespace functorch { -BatchedTensorImpl::BatchedTensorImpl(Tensor value, int64_t bdim, int64_t level) - : TensorImpl( - c10::DispatchKeySet(kBatchedKey), - value.dtype(), - value.device() - ) - , value_(std::move(value)) - , level_(level) - , bdim_(bdim) -{ - // TODO: I don't think this ctor gets used. 
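// Illustrative sketch (not part of the patch): the BatchedFallback.h comment above describes the
// for-loop fallback -- slice every batched input along its batch dim, run the unbatched op once
// per slice, then stack the per-slice outputs. A self-contained model of that loop, with
// std::vector standing in for a tensor's batch dimension; all names here are made up and none of
// this is ATen code.
#include <functional>
#include <iostream>
#include <vector>

using Slice = std::vector<double>;                  // one example (no batch dim)
using Batched = std::vector<Slice>;                 // batch dim on the outside

Batched for_loop_fallback(const std::function<Slice(const Slice&)>& op,
                          const Batched& input) {
  Batched out_shards;
  out_shards.reserve(input.size());
  for (const auto& slice : input) {                 // one unbatched call per example
    out_shards.push_back(op(slice));
  }
  return out_shards;                                // "stack" of the output slices
}

int main() {
  Batched x = {{1, 2}, {3, 4}, {5, 6}};
  auto doubled = for_loop_fallback(
      [](const Slice& s) { Slice r; for (double v : s) r.push_back(2 * v); return r; }, x);
  std::cout << doubled[2][1] << '\n';               // 12
}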
- TORCH_INTERNAL_ASSERT(false); - TORCH_INTERNAL_ASSERT(value_.defined()); - set_storage_access_should_throw(); - set_sizes_strides_policy(SizesStridesPolicy::CustomStrides); - checkInvariants(); - - const auto public_dims = value_.dim() - 1; - const auto value_sizes = value_.sizes(); - const auto value_strides = value_.strides(); - sizes_and_strides_.resize(public_dims); - for (const auto dim : c10::irange(0, public_dims)) { - auto actual_dim = actualDim(dim, /*wrap_dim=*/false); - sizes_and_strides_.size_at_unchecked(dim) = value_sizes.at(actual_dim); - sizes_and_strides_.stride_at_unchecked(dim) = value_strides.at(actual_dim); - } - storage_offset_= value_.storage_offset(); - refresh_numel(); - refresh_contiguous(); -} - BatchedTensorImpl::BatchedTensorImpl(DispatchKeySet key_set, Tensor value, int64_t bdim, int64_t level) : TensorImpl( - key_set.add(kBatchedKey), + key_set.add(DispatchKey::FuncTorchBatched), value.dtype(), value.device() ) @@ -57,7 +25,7 @@ BatchedTensorImpl::BatchedTensorImpl(DispatchKeySet key_set, Tensor value, int64 { TORCH_INTERNAL_ASSERT(value_.defined()); set_storage_access_should_throw(); - set_sizes_strides_policy(SizesStridesPolicy::CustomStrides); + set_custom_sizes_strides(SizesStridesPolicy::CustomStrides); checkInvariants(); refreshTensorMetadata(); } @@ -82,36 +50,11 @@ int64_t BatchedTensorImpl::actualDim(int64_t dim, bool wrap_dim) const { const auto ndim = sizes_and_strides_.size(); dim = maybe_wrap_dim(dim, ndim); } - auto is_bdim = createBatchDimBitset(bdim_); - - // TODO(vfdev): As BatchedTensorImpl is refactored and has only one dim. - // Below code may be simplified. - - // Example: assume dim = 3, and is_bdim = 10010011000... - // The 1's are batch dims and 0's are normal dims of the underlying value_ Tensor. - // actualDim gives us the index of `dim` in the `value_` Tensor, which is equivalent - // to asking "where does the 3rd (0-indexed) zero occur in the bitset?". - // The answer to that is index 5. - // - // TODO(rzou): the PDEP instruction does exactly this - // (https://stackoverflow.com/questions/7669057/find-nth-set-bit-in-an-int) - // but it might require newer (>= ~2015) CPUs. We should clean this up - // if/when we have dropped support for older CPUs. - int64_t non_bdim_count = 0; - for (int64_t actual_dim = 0; actual_dim < kVmapMaxTensorDims; actual_dim++) { - if (is_bdim[actual_dim]) { - continue; - } - if (non_bdim_count == dim) { - return actual_dim; - } - non_bdim_count++; + if (bdim_ <= dim) { + return dim + 1; + } else { + return dim; } - // If we hit this assert, then that means - // `non_bdim_count` + #num_bdims > kVmapMaxTensorDims. We restrict the number - // of dims a BatchedTensorImpl can have to kVmapMaxTensorDims so this should - // never be hit. 
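// Illustrative sketch (not part of the patch): the simplified actualDim above works because a
// BatchedTensorImpl now carries exactly one batch dim, so a public (logical) dim maps to the
// underlying value_ tensor's dim by skipping over the single hidden bdim_. A standalone check of
// that mapping; the function below is a model, not the real member function.
#include <cassert>
#include <cstdint>

int64_t actual_dim(int64_t logical_dim, int64_t bdim) {
  return bdim <= logical_dim ? logical_dim + 1 : logical_dim;
}

int main() {
  // value_ is 4-D with the batch dim at index 0: logical dims 0,1,2 are physical dims 1,2,3.
  assert(actual_dim(0, /*bdim=*/0) == 1);
  assert(actual_dim(2, /*bdim=*/0) == 3);
  // Batch dim hidden at index 2: logical dims before it are unchanged,
  // logical dims at or after it shift by one.
  assert(actual_dim(1, /*bdim=*/2) == 1);
  assert(actual_dim(2, /*bdim=*/2) == 3);
}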
- TORCH_INTERNAL_ASSERT(false); } void BatchedTensorImpl::checkInvariants() const { @@ -124,6 +67,11 @@ IntArrayRef BatchedTensorImpl::strides_custom() const { return strides_default(); } +SymIntArrayRef BatchedTensorImpl::sym_strides_custom() const { + return sym_strides_default(); +} + + // TODO: implement proper contiguity on batched tensor, then put // sizes_strides_policy back to Default bool BatchedTensorImpl::is_contiguous_custom(at::MemoryFormat memory_format) const { diff --git a/functorch/functorch/csrc/BatchedTensorImpl.h b/aten/src/ATen/functorch/BatchedTensorImpl.h similarity index 82% rename from functorch/functorch/csrc/BatchedTensorImpl.h rename to aten/src/ATen/functorch/BatchedTensorImpl.h index 37294f20695c..320989604570 100644 --- a/functorch/functorch/csrc/BatchedTensorImpl.h +++ b/aten/src/ATen/functorch/BatchedTensorImpl.h @@ -12,9 +12,6 @@ #include #include -#include -#include - namespace at { namespace functorch { @@ -43,8 +40,7 @@ constexpr int64_t kBatchDimsStackSize = 5; // // bt.sizes() returns (5, 7); bt.sum(0) performs a reduction over the (public) // dim 0, which is equivalent to dim 3 in the underlying ones(2, 3, 5, 7) tensor. -struct BatchedTensorImpl : public c10::TensorImpl { - explicit BatchedTensorImpl(Tensor value, int64_t dim, int64_t level); +struct TORCH_API BatchedTensorImpl : public c10::TensorImpl { explicit BatchedTensorImpl(at::DispatchKeySet key_set, Tensor value, int64_t dim, int64_t level); // Returns batch dimension of this tensor @@ -68,6 +64,7 @@ struct BatchedTensorImpl : public c10::TensorImpl { // We have to override this because we opted into CustomStrides IntArrayRef strides_custom() const override; + SymIntArrayRef sym_strides_custom() const override; // Override a bunch of methods inherited from TensorImpl to return error messages. bool is_contiguous_custom(at::MemoryFormat memory_format=at::MemoryFormat::Contiguous) const override; void set_size(int64_t dim, int64_t new_size) override; @@ -78,9 +75,16 @@ struct BatchedTensorImpl : public c10::TensorImpl { #endif void refreshTensorMetadata(); + + // Used in torchdim. torchdim uses non-lexical BatchedTensor; the way it + // accomplishes this is a hack where it is able to modify the levels of + // BatchedTensor to match the level of the current vmap transform. void _unsafe_set_level(int64_t level) { level_ = level; } + + // Used in batching rule for in-place view operations that can change + // the index of the bdim (think squeeze_, unsqueeze_) void unsafe_set_bdim(int64_t bdim) { // NB: you MUST call refreshTensorMetadata after doing this. bdim_ = bdim; @@ -99,7 +103,7 @@ struct BatchedTensorImpl : public c10::TensorImpl { // NB: We use the term "BatchedTensor" to mean a Tensor that is backed with a // BatchedTensorImpl. 
inline bool isBatchedTensor(const Tensor& tensor) { - return tensor.unsafeGetTensorImpl()->key_set().has(kBatchedKey); + return tensor.unsafeGetTensorImpl()->key_set().has(DispatchKey::FuncTorchBatched); } // It is unsafe to call this on a Tensor that is not backed by a @@ -130,11 +134,15 @@ inline std::bitset createVmapLevelsBitset(int64_t level) { } // Use this to construct a BatchedTensor from a regular Tensor -FUNCTORCH_API Tensor makeBatched(const Tensor& tensor, int64_t dim, int64_t level); +TORCH_API Tensor makeBatched(const Tensor& tensor, int64_t dim, int64_t level); // Adds a batch dim to `tensor`, returning a BatchedTensor -FUNCTORCH_API Tensor addBatchDim(const Tensor& tensor, int64_t dim, int64_t level); +TORCH_API Tensor addBatchDim(const Tensor& tensor, int64_t dim, int64_t level); +// Certain dispatch keys must be propagated to the BatchedTensor (or, in general, +// any wrapper Tensor subclasses). This is because there are methods on Tensor +// that skip dispatch and check for the presence of a dispatch key (e.g. is_cpu()). +// TODO: should probably contain more (or all?) backend keys constexpr DispatchKeySet kKeysToPropagateToWrapper({ DispatchKey::Negative, DispatchKey::Conjugate, diff --git a/functorch/functorch/csrc/BatchingMetaprogramming.h b/aten/src/ATen/functorch/BatchingMetaprogramming.h similarity index 92% rename from functorch/functorch/csrc/BatchingMetaprogramming.h rename to aten/src/ATen/functorch/BatchingMetaprogramming.h index e054e58568be..e77960f441fe 100644 --- a/functorch/functorch/csrc/BatchingMetaprogramming.h +++ b/aten/src/ATen/functorch/BatchingMetaprogramming.h @@ -8,6 +8,14 @@ #include #include +// This file contains template metaprogramming things that are used for our +// batching rules. +// +// See NOTE: [vmap plumbing] for more details on why this is necessary. +// The plumbing has a bunch of metaprogramming hacks for determining the signature +// of a batching rule from the signature of the operator, many of which use the +// helper functions in this file. + namespace at { namespace functorch { diff --git a/functorch/functorch/csrc/DynamicLayer.cpp b/aten/src/ATen/functorch/DynamicLayer.cpp similarity index 72% rename from functorch/functorch/csrc/DynamicLayer.cpp rename to aten/src/ATen/functorch/DynamicLayer.cpp index 8bfd388358a0..d152f3c08c2d 100644 --- a/functorch/functorch/csrc/DynamicLayer.cpp +++ b/aten/src/ATen/functorch/DynamicLayer.cpp @@ -4,16 +4,15 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. 
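A quick note on the BatchedTensorImpl hunks above: after the refactor a BatchedTensorImpl carries exactly one batch dim, so actualDim no longer needs the bitset walk and reduces to a constant-time mapping that skips over bdim_. The standalone sketch below (illustrative only, invented names, not part of the patch) shows that mapping:

// --- illustrative sketch: the simplified actualDim mapping (not patch content) ---
#include <cassert>
#include <cstdint>

// With a single batch dim at position `bdim`, a public (logical) dim maps to the
// underlying value_ tensor's dim by skipping over `bdim`.
static int64_t actual_dim_sketch(int64_t logical_dim, int64_t bdim) {
  return (bdim <= logical_dim) ? logical_dim + 1 : logical_dim;
}

int main() {
  // value_ has shape (B, 3, 5): bdim = 0, logical shape (3, 5)
  assert(actual_dim_sketch(0, 0) == 1);
  assert(actual_dim_sketch(1, 0) == 2);
  // value_ has shape (3, B, 5): bdim = 1, logical shape (3, 5)
  assert(actual_dim_sketch(0, 1) == 0);
  assert(actual_dim_sketch(1, 1) == 2);
  return 0;
}
// --- end sketch ---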
-#include -#include -#include -#include +#include +#include +#include +#include #include #include #include #include -#include #include #include #include @@ -22,8 +21,8 @@ namespace at { namespace functorch { void setDynamicLayerFrontBackKeysIncluded(bool included) { - c10::impl::tls_set_dispatch_key_included(kDynamicLayerFrontModeKey, included); - c10::impl::tls_set_dispatch_key_included(kDynamicLayerBackModeKey, included); + c10::impl::tls_set_dispatch_key_included(DispatchKey::FuncTorchDynamicLayerFrontMode, included); + c10::impl::tls_set_dispatch_key_included(DispatchKey::FuncTorchDynamicLayerBackMode, included); } DynamicLayer::DynamicLayer( @@ -75,8 +74,8 @@ RandomnessType DynamicLayer::randomness() const { return VmapInterpreterPtr(&interpreter_).randomness(); } -constexpr DispatchKeySet kFrontBackKeys({kDynamicLayerBackModeKey, kDynamicLayerFrontModeKey}); - +// Maps level to life handle, see NOTE: [Life handles and lexically scoped transforms] +// for details using DynmetaData = std::unordered_map>; DynmetaData kDynMetaDataSingleton; @@ -84,6 +83,13 @@ static DynmetaData& getGlobalDynmetaData() { return kDynMetaDataSingleton; } +// functorch stores some TLS. Inside the TLS is the stack of transforms. +// Unfortunately, since functorch isn't a part of libtorch, we have +// a level of indirection. FuncTorchTLSBase is the interface that lives in libtorch, +// while FuncTorchTLS implements all the methods and stores data. +// +// TODO: after functorch C++ code is moved into PyTorch, we can get rid of +// this layer of indirection. class FuncTorchTLS : public FuncTorchTLSBase { public: FuncTorchTLS() {} @@ -95,7 +101,7 @@ class FuncTorchTLS : public FuncTorchTLSBase { } int64_t checkSupportsAutogradFunction() const override { - TORCH_CHECK(dynamicLayerStack.size() == 0, + TORCH_CHECK(dynamicLayerStack.size() == 0 || getAutogradFunctionAllowed(), "functorch functions (vmap, grad, vjp, etc.) currently do not support the use of autograd.Function. 
", "Please rewrite your function to not use autograd.Function while we work on fixing this"); return 0; @@ -122,6 +128,7 @@ class FuncTorchTLS : public FuncTorchTLSBase { std::vector dynamicLayerStack; bool allow_inplace_requires_grad_ = false; + bool allow_autograd_function_ = false; }; static FuncTorchTLS* getRawFunctorchTLS() { @@ -145,6 +152,15 @@ bool getInplaceRequiresGradAllowed() { return functorch_tls->allow_inplace_requires_grad_; } +void setAutogradFunctionAllowed(bool allowed) { + auto* functorch_tls = getRawFunctorchTLS(); + functorch_tls->allow_autograd_function_ = allowed; +} + +bool getAutogradFunctionAllowed() { + auto* functorch_tls = getRawFunctorchTLS(); + return functorch_tls->allow_autograd_function_; +} static std::vector& dynamicLayerStackAccessor() { return getRawFunctorchTLS()->dynamicLayerStack; @@ -198,7 +214,7 @@ bool areTransformsActive() { return !data.empty(); } -static DynamicLayer popDynamicLayer() { +DynamicLayer popDynamicLayer() { auto& dynamicLayerStack = dynamicLayerStackAccessor(); TORCH_INTERNAL_ASSERT(dynamicLayerStack.size() > 0); auto result = dynamicLayerStack.back(); @@ -216,7 +232,7 @@ static DynamicLayer popDynamicLayer() { return result; } -static int64_t pushDynamicLayer(DynamicLayer&& dynamic_layer) { +int64_t pushDynamicLayer(DynamicLayer&& dynamic_layer) { auto& dynamicLayerStack = dynamicLayerStackAccessor(); int64_t layerId = 1 + dynamicLayerStack.size(); TORCH_INTERNAL_ASSERT(layerId == dynamic_layer.layerId()); @@ -264,17 +280,11 @@ DynamicLayer popDynamicLayerAndDeleteMetadata() { auto level = result.layerId(); // TODO: is this lock safe? No one else should be writing to the same bucket - // if (c10::show_dispatch_trace_enabled()) { - // std::cout << "deleting metadata" << std::endl; - // } auto& data = getGlobalDynmetaData(); auto it = data.find(level); if (it == data.end()) { return result; } - // if (c10::show_dispatch_trace_enabled()) { - // std::cout << "deleted metadata for level " << level << std::endl; - // } // invalidate the thing *(it->second) = false; data.erase(level); @@ -294,10 +304,19 @@ Tensor unwrapIfDead(const Tensor& tensor) { void foreachTensorInplace(std::vector& args, int64_t begin, int64_t end, std::function func) { + auto func_with_bool = [&](const Tensor& tensor, bool unused) { return func(tensor); }; + foreachTensorInplaceWithFlag(args, begin, end, std::bitset<64>(), func_with_bool); +} + +void foreachTensorInplaceWithFlag(std::vector& args, int64_t begin, int64_t end, + const std::bitset<64> use_flag_relative, std::function func){ TORCH_INTERNAL_ASSERT(begin >= 0); TORCH_INTERNAL_ASSERT(end >= 0); TORCH_INTERNAL_ASSERT(begin <= end); - for (int64_t idx = begin; idx < end; idx++) { + for (int64_t relative_idx = 0; relative_idx < end - begin; relative_idx++) { + const bool flag = use_flag_relative[relative_idx] == 1; + + const auto idx = relative_idx + begin; auto ivalue = args[idx]; // Tensor?[] translates to a c10::List so we need to peek inside List if (ivalue.isList()) { @@ -307,7 +326,7 @@ void foreachTensorInplace(std::vector& args, int64_t begin, int64_t end, for (const auto list_idx : c10::irange(0, list.size())) { const auto& elt = list.get(list_idx); if (elt.isTensor()) { - list.set(list_idx, func(elt.toTensor())); + list.set(list_idx, func(elt.toTensor(), flag)); modified = true; } } @@ -319,7 +338,7 @@ void foreachTensorInplace(std::vector& args, int64_t begin, int64_t end, if (ivalue.isTensorList()) { auto list = ivalue.toTensorList(); for (const auto list_idx : c10::irange(0, list.size())) { - 
list[list_idx] = func(list[list_idx]); + list[list_idx] = func(list[list_idx], flag); } args[idx] = list; } @@ -328,7 +347,7 @@ void foreachTensorInplace(std::vector& args, int64_t begin, int64_t end, continue; } Tensor value = ivalue.toTensor(); - Tensor replacement = func(value); + Tensor replacement = func(value, flag); args[idx] = std::move(replacement); // sanity checks if (ivalue.toTensor().defined()) { @@ -371,6 +390,14 @@ bool isInplaceOp(const FunctionSchema& schema) { return return_alias_info && return_alias_info->isWrite(); } +c10::optional findAliasedOutput(const FunctionSchema& schema, const int64_t immutable_input_idx) { + for (size_t res_idx = 0; res_idx != schema.returns().size(); ++res_idx) { + if (schema.may_contain_alias(SchemaArgument(SchemaArgType::input, immutable_input_idx), SchemaArgument(SchemaArgType::output, res_idx))) { + return res_idx; // for everything currently in native_functions, each input aliases at most one output (tensor list counts as one output) + } + } + return nullopt; +} #ifdef HAS_TORCH_SHOW_DISPATCH_TRACE static void dump_local_tls() { @@ -391,43 +418,34 @@ WithoutTop::~WithoutTop() { pushDynamicLayer(std::move(layer_)); } -// NOTE: [forward-mode AD decompositions hack] -// -// The mechanism is: in DynamicLayerFrontMode, IF we are dispatching on the -// jvp transform, AND we have a decomposition for the operation, then run -// the decomposition. +// NOTE: [functorch front and back key fallbacks] // -// Let's break that down. There are a douple of moving pieces. +// Please read NOTE: [functorch interpreter stack] first for some context. +// The following doc also provides some visuals: +// https://docs.google.com/document/d/14qyaa3xIjmVxYiMLlIlQErunYgR_uR1WupsKMZlnGY4/edit // -// 0. How do we know what transform we're dispatching on? -// Easy, check the top of the DynamicLayerStack and read the transform. +// functorch's "stack of transforms" is implemented as the following: +// - each transform is associated with one or more dispatch keys in the PyTorch +// dispatcher. For example, vmap -> {FuncTorchBatched, FuncTorchVmapMode}, +// Autograd -> {Autograd{Backend}, ADInplaceOrView} +// - Whenever a functorch transform is active, the FuncTorchDynamicLayer{Front, Back}Mode +// keys are added to the dispatcher's local dispatch key set. // -// 1. Next, we must identify when an operation (e.g. nll_loss_backward) -// gets dispatched to. -// - register a special kernel to the DynamicLayerFrontMode key -// (see JVP_DECOMP) -// - that special kernel invokes dynamicLayerFrontFallbackOperator with -// an arg indicating we're going to use a decomp +// DynamicLayerFrontMode is responsible for: +// 1. selecting the transform that is at the top of the stack and grabbing its +// interpreter +// 2. Calling interpreter.process(), which does the following: +// 2a. enables/disables a bunch of dispatch keys, so that the only dispatch +// keys that are enabled are the ones that belong to the transform. +// 2b. redispatching // -// 2. Next, we need to call the decomposition. See call_decomposition_for_jvp. -// We currently use python decompositions that we torchscript. +// Eventually, DynamicLayerBackMode captures the redispatch from the transforms. +// DynamicLayerBackMode is responsible for: +// - redirecting back to DynamicLayerFrontMode -// Ideally c10::OperatorHandle would have a field like this -// to identify the operator. -// The stuff here should map 1:1 with the operator name. 
-// aten::nll_loss_backward -> nll_loss_backward -// aten::add.Tensor -> add_Tensor - -static void call_decomposition_for_jvp( +static void dynamicLayerFrontFallback( const c10::OperatorHandle& op, torch::jit::Stack* stack) { - run_jit_decomposition(op, stack); -} - -static void dynamicLayerFrontFallbackOperator( - const c10::OperatorHandle& op, - torch::jit::Stack* stack, - bool decomp_jvp) { auto& dynamicLayerStack = dynamicLayerStackAccessor(); TORCH_INTERNAL_ASSERT(dynamicLayerStack.size() > 0); #ifdef HAS_TORCH_SHOW_DISPATCH_TRACE @@ -436,13 +454,6 @@ static void dynamicLayerFrontFallbackOperator( dump_local_tls(); } #endif - - // Hack: if jvp and we have a decomposition registered, then do the decomposition - if (dynamicLayerStack.back().interpreter().key() == TransformType::Jvp && - decomp_jvp) { - return call_decomposition_for_jvp(op, stack); - } - // Save the current LocalDispatchKeySet (to the current DynamicLayer). // Upon exiting the current scope, that LocalDispatchKeySet gets restored. // When the current DynamicLayer dispatches to the next (inner) DynamicLayer, @@ -462,50 +473,44 @@ restoreLocalDispatchKeySetRAII(const c10::impl::LocalDispatchKeySet& key_set) { return c10::impl::ForceDispatchKeyGuard(key_set); } -void dynamicLayerFrontFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { - return dynamicLayerFrontFallbackOperator(op, stack, false); +// right now grad_special_case as a bool is sufficient because this is the only special case for grad. If we need to add +// more special cases, it's more scalable to add an enum to know which op we're looking at without looking at the schema +void dynamicLayerBack(const c10::OperatorHandle& op, torch::jit::Stack* stack, bool grad_special_case) { + auto& layer = dynamicLayerStackAccessor().back(); + auto restore_guard = restoreLocalDispatchKeySetRAII(layer.interpreter().getSavedLocalDispatchKeySet()); + WithoutTop guard; + + layer.interpreter().sendToNextInterpreter(op, stack, grad_special_case); } -void dynamicLayerFrontFallBackWithDecomp( - const c10::OperatorHandle& op, - torch::jit::Stack* stack) { - return dynamicLayerFrontFallbackOperator(op, stack, true); +// used for functions that have aliasing operations but should be treated like they're out of place (i.e. lift_fresh) +void dynamicLayerBackGradSpecialCase(const c10::OperatorHandle& op, torch::jit::Stack* stack) { + return dynamicLayerBack(op, stack, true); } void dynamicLayerBackFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { - auto& layer = dynamicLayerStackAccessor().back(); - auto restore_guard = restoreLocalDispatchKeySetRAII(layer.interpreter().getSavedLocalDispatchKeySet()); - WithoutTop guard; - - layer.interpreter().sendToNextInterpreter(op, stack); + return dynamicLayerBack(op, stack, false); } -TORCH_LIBRARY_IMPL(_, FT_DYNAMIC_LAYER_FRONT_MODE_KEY, m) { +TORCH_LIBRARY_IMPL(_, FuncTorchDynamicLayerFrontMode, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&dynamicLayerFrontFallback>()); } -TORCH_LIBRARY_IMPL(_, FT_DYNAMIC_LAYER_BACK_MODE_KEY, m) { +TORCH_LIBRARY_IMPL(_, FuncTorchDynamicLayerBackMode, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&dynamicLayerBackFallback>()); } -#define JVP_DECOMP(op) \ - m.impl(#op, torch::CppFunction::makeFromBoxedFunction<&dynamicLayerFrontFallBackWithDecomp>()); -#define JVP_DECOMP2(op, overload) \ - m.impl(#op "." 
#overload, torch::CppFunction::makeFromBoxedFunction<&dynamicLayerFrontFallBackWithDecomp>()); +#define SPECIAL_GRAD_CASE(op) \ + m.impl(#op, torch::CppFunction::makeFromBoxedFunction<&dynamicLayerBackGradSpecialCase>()); -TORCH_LIBRARY_IMPL(aten, FT_DYNAMIC_LAYER_FRONT_MODE_KEY, m) { - JVP_DECOMP(nll_loss_backward); - JVP_DECOMP(nll_loss2d_backward); - JVP_DECOMP(_log_softmax_backward_data); - JVP_DECOMP(_softmax_backward_data); - OP_DECOMPOSE(log_sigmoid); - JVP_DECOMP(log_sigmoid_forward); - JVP_DECOMP(native_layer_norm_backward); - JVP_DECOMP(native_batch_norm_backward); - JVP_DECOMP(cudnn_batch_norm_backward); +TORCH_LIBRARY_IMPL(aten, FuncTorchDynamicLayerBackMode, m) { + // lift_fresh: it must be freshly allocated and should be wrapped. User shouldn't have access to input version + // alias: this is needed for the CompositeImplicit instance norm (running_mean/var get set to be a wrapped value) + // It's not a user-facing function, but is more prone to possible errors + SPECIAL_GRAD_CASE(lift_fresh); + SPECIAL_GRAD_CASE(alias); } - } } // namespace at diff --git a/aten/src/ATen/functorch/DynamicLayer.h b/aten/src/ATen/functorch/DynamicLayer.h new file mode 100644 index 000000000000..6c7139f5c01e --- /dev/null +++ b/aten/src/ATen/functorch/DynamicLayer.h @@ -0,0 +1,131 @@ +// Copyright (c) Facebook, Inc. and its affiliates. +// All rights reserved. +// +// This source code is licensed under the BSD-style license found in the +// LICENSE file in the root directory of this source tree. + +#pragma once +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +// Forward declared +namespace c10 { struct AutogradMetaInterface; } + +namespace at { +namespace functorch { + +// This file contains the implementation of functorch's interpreter stack. +// See NOTE: [functorch interpreter stack] first before reading on. +// +// NB: the functorch interpreter stack is also referred to as: +// - the "dynamic layer stack" -- an older name for "interpreter" was +// "dynamic layer". +// - the "functorch mode stack". You can think of each functorch transform as a +// "mode" (in the same sense as torch_dispatch mode or torch_function mode), +// and functorch being an implementation of a "mode stack" where the modes +// may be arbitrarily composed. + +// DynamicLayer is basically the same thing as an Interpreter. +// It represents a functorch transform and it holds an Interpreter, +// which contains metadata related to the transform and instructions on +// how to perform the transform. +// +// TODO: we can excise DynamicLayer in favor of Interpreter, +// but I am going to leave it for now as a compatibility shim to avoid +// needing to refactor a lot of callsites...
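As a rough mental model of the round-trip described in NOTE: [functorch front and back key fallbacks] above, the toy program below (illustrative only, all names invented) walks a stack of interpreters from the top down and back up, which is the order in which grad(vmap(f)) processes an operator:

// --- illustrative sketch: toy interpreter stack (not patch content) ---
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct ToyInterpreter {
  std::string name;
};

int main() {
  // grad(vmap(f)): the interpreter stack is [Grad, Vmap], with Vmap on top.
  std::vector<ToyInterpreter> stack{{"Grad"}, {"Vmap"}};

  // "front fallback": take the top interpreter, set up its keys, redispatch.
  // "back fallback": pop to the next interpreter and repeat.
  std::function<void(std::size_t)> redispatch = [&](std::size_t depth) {
    if (depth == 0) {
      std::cout << "run the underlying kernel\n";
      return;
    }
    const ToyInterpreter& interp = stack[depth - 1];
    std::cout << "process under " << interp.name << "\n";
    redispatch(depth - 1);  // sendToNextInterpreter
    std::cout << "return through " << interp.name << "\n";
  };
  redispatch(stack.size());
  return 0;
}
// --- end sketch ---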
+struct TORCH_API DynamicLayer { + explicit DynamicLayer( + TransformType transform_type, + int64_t layerId, + optional batchSize = nullopt, + optional randomness = nullopt, + optional prev_grad_mode = nullopt, + optional pre_fwd_grad_mode = nullopt, + optional functionalize_add_back_views = nullopt); + + TransformType key() const; + int64_t layerId() const; + + const Interpreter& interpreter() const { return interpreter_; } + Interpreter& interpreter() { return interpreter_; } + + // Only valid for vmap + int64_t batchSize() const; + RandomnessType randomness() const; + + private: + Interpreter interpreter_; +}; + +TORCH_API int64_t initAndPushDynamicLayer( + TransformType transform_type, + optional batch_size = nullopt, + optional randomness = nullopt, + optional prev_grad_mode = nullopt, + optional prev_fwd_grad_mode = nullopt, + optional functionalize_add_back_views = nullopt); +TORCH_API DynamicLayer popDynamicLayerAndDeleteMetadata(); +TORCH_API c10::optional maybeCurrentDynamicLayer(); +TORCH_API const std::vector& getDynamicLayerStack(); +TORCH_API void setDynamicLayerStack(const std::vector& stack); +TORCH_API void setDynamicLayerFrontBackKeysIncluded(bool included); + +// NB: Not lock safe, you should only call this from Python where the GIL will +// prevent race conditions. +TORCH_API bool areTransformsActive(); + +// NOTE: [Life handles and lexically scoped transforms] +// functorch transforms are lexically scoped. +// Given a level, we store a "life handle" that is a boolean that tells us if the +// transform with that level is active or not. +// +// functorch's TensorWrapper (for grad transforms) stores a life handle. +// If a TensorWrapper escapes from the scope of the transform, then somehow +// it must know it escaped; it can tell by querying the life handle. +// +// NB: not lock safe. TODO: does it need a lock? +TORCH_API std::shared_ptr getLifeHandleForLevel(int64_t level); + +// Returns if an operator is in-place. An operator is inplace if: +// 1. The first argument is a Tensor and it is being written to +// 2. The first argument is being returned +// 3. No other arguments are aliased +// Here is an example of an in-place operator: +// add_(Tensor(a!) self, Tensor other, *, Scalar alpha=1) -> Tensor(a!) +TORCH_API bool isInplaceOp(const c10::FunctionSchema& schema); + +// Given the indices of unwrapped inputs and the schema, this returns the indices of any outputs that should remain unwrapped +TORCH_API c10::optional findAliasedOutput(const FunctionSchema& schema, const int64_t immutable_input); + +TORCH_API Tensor unwrapIfDead(const Tensor& tensor); + +// Pretty printers +TORCH_API std::ostream& operator<<(std::ostream& os, const DynamicLayer& layer); +TORCH_API std::ostream& operator<<(std::ostream& os, const std::vector& dynamicLayerStack); + +// While a functorch transform is active, autograd.Function is disabled +// by default. The following two APIs are APIs for enabling +// autograd.Function. These are not user-facing APIs. +TORCH_API void setAutogradFunctionAllowed(bool allowed); +TORCH_API bool getAutogradFunctionAllowed(); + +// While a functorch grad transform is active, Tensor.requires_grad_() gets +// disabled. These two functions are the mechanism to controlling that. 
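The declarations continue below; first, a hedged usage sketch of the isInplaceOp and findAliasedOutput helpers declared above. It assumes torch::jit::parseSchema is available to build a FunctionSchema from a schema string; the schema literals are only examples and the expected results are noted in comments:

// --- illustrative sketch: exercising isInplaceOp / findAliasedOutput (not patch content) ---
#include <ATen/functorch/DynamicLayer.h>
#include <torch/csrc/jit/frontend/function_schema_parser.h>
#include <iostream>

int main() {
  auto inplace = torch::jit::parseSchema(
      "aten::add_(Tensor(a!) self, Tensor other, *, Scalar alpha=1) -> Tensor(a!)");
  auto out_of_place = torch::jit::parseSchema(
      "aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor");

  std::cout << at::functorch::isInplaceOp(inplace) << "\n";       // expected: 1
  std::cout << at::functorch::isInplaceOp(out_of_place) << "\n";  // expected: 0

  // For the in-place schema, input 0 aliases output 0.
  auto aliased = at::functorch::findAliasedOutput(inplace, /*immutable_input=*/0);
  if (aliased.has_value()) {
    std::cout << *aliased << "\n";  // expected: 0
  }
  return 0;
}
// --- end sketch ---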
+TORCH_API void setInplaceRequiresGradAllowed(bool allowed); +TORCH_API bool getInplaceRequiresGradAllowed(); + +TORCH_API DynamicLayer popDynamicLayer(); +TORCH_API int64_t pushDynamicLayer(DynamicLayer&& layer); + +} +} // namespace at diff --git a/functorch/functorch/csrc/FunctionalizeInterpreter.cpp b/aten/src/ATen/functorch/FunctionalizeInterpreter.cpp similarity index 94% rename from functorch/functorch/csrc/FunctionalizeInterpreter.cpp rename to aten/src/ATen/functorch/FunctionalizeInterpreter.cpp index 4242305636cf..40e22c455509 100644 --- a/functorch/functorch/csrc/FunctionalizeInterpreter.cpp +++ b/aten/src/ATen/functorch/FunctionalizeInterpreter.cpp @@ -1,5 +1,5 @@ -#include -#include +#include +#include #include namespace at { namespace functorch { @@ -47,7 +47,8 @@ void FunctionalizeInterpreterPtr::processImpl( void FunctionalizeInterpreterPtr::sendToNextInterpreterImpl( const c10::OperatorHandle& op, - torch::jit::Stack* stack) { + torch::jit::Stack* stack, + bool grad_special_case) { // For now, we don't support nested functionalization calls. // This check just enforces that - after the functionalize kernel runs // and we hit the BackModeFallback, we'll have unwrapped our FunctionalTensors diff --git a/functorch/functorch/csrc/FunctionalizeInterpreter.h b/aten/src/ATen/functorch/FunctionalizeInterpreter.h similarity index 75% rename from functorch/functorch/csrc/FunctionalizeInterpreter.h rename to aten/src/ATen/functorch/FunctionalizeInterpreter.h index 5475b38f068f..4157eb82d84f 100644 --- a/functorch/functorch/csrc/FunctionalizeInterpreter.h +++ b/aten/src/ATen/functorch/FunctionalizeInterpreter.h @@ -1,14 +1,17 @@ #pragma once -#include +#include namespace at { namespace functorch { +// This is the interpreter that handles the functionalize() transform. +// See NOTE: [functorch interpreter stack] for more details. 
+ struct FunctionalizeInterpreterPtr { explicit FunctionalizeInterpreterPtr(const Interpreter* base): base_(base) { TORCH_INTERNAL_ASSERT(base->key() == TransformType::Functionalize); } TransformType key() const { return base_->key(); } int64_t level() const { return base_->level(); } void processImpl(const c10::OperatorHandle& op, torch::jit::Stack* stack); - void sendToNextInterpreterImpl(const c10::OperatorHandle& op, torch::jit::Stack* stack); + void sendToNextInterpreterImpl(const c10::OperatorHandle& op, torch::jit::Stack* stack, bool grad_special_case); bool functionalizeAddBackViews() const { return c10::get(base_->meta()).functionalizeAddBackViews_; } diff --git a/functorch/functorch/csrc/Interpreter.cpp b/aten/src/ATen/functorch/Interpreter.cpp similarity index 75% rename from functorch/functorch/csrc/Interpreter.cpp rename to aten/src/ATen/functorch/Interpreter.cpp index cce9fa05f70e..2531a49d5f19 100644 --- a/functorch/functorch/csrc/Interpreter.cpp +++ b/aten/src/ATen/functorch/Interpreter.cpp @@ -1,9 +1,9 @@ -#include -#include -#include -#include -#include -#include +#include +#include +#include +#include +#include +#include namespace at { namespace functorch { @@ -12,18 +12,18 @@ static DispatchKeySet get_all_dynlayer_keyset() { // "all dispatch keys between DynamicLayer{Front, Back}Mode, inclusive" auto result = - DispatchKeySet(DispatchKeySet::FULL_AFTER, kDynamicLayerFrontModeKey) - - DispatchKeySet(DispatchKeySet::FULL_AFTER, kDynamicLayerBackModeKey); - result = result | DispatchKeySet({kDynamicLayerFrontModeKey}); + DispatchKeySet(DispatchKeySet::FULL_AFTER, DispatchKey::FuncTorchDynamicLayerFrontMode) - + DispatchKeySet(DispatchKeySet::FULL_AFTER, DispatchKey::FuncTorchDynamicLayerBackMode); + result = result | DispatchKeySet({DispatchKey::FuncTorchDynamicLayerFrontMode}); // Hack: don't handle the autocast dispatch keys. Their interaction with functorch // is weird. result = result - autocast_dispatch_keyset; - // Hack: don't handle kVmapModeKey. We need a better way of modeling this. - // In e.g. grad(vmap(f)), kVmapModeKey makes it so that all random operations, + // Hack: don't handle DispatchKey::FuncTorchVmapMode. We need a better way of modeling this. + // In e.g. grad(vmap(f)), DispatchKey::FuncTorchVmapMode makes it so that all random operations, // even after we are done handling the vmap layer, error out. - result = result.remove(kVmapModeKey); + result = result.remove(DispatchKey::FuncTorchVmapMode); return result; } @@ -34,10 +34,10 @@ static DispatchKeySet all_dynlayer_keyset = get_all_dynlayer_keyset(); static DispatchKeySet keysForEnteringDynamicLayer(TransformType key) { if (key == TransformType::Vmap) { - // NB: Does not include kVmapModeKey. We may modulate the key when + // NB: Does not include DispatchKey::FuncTorchVmapMode. We may modulate the key when // constructing the DynamicLayer, but we don't control it when entering/exiting // the DynamicLayer. 
- return DispatchKeySet({kBatchedKey}); + return DispatchKeySet({DispatchKey::FuncTorchBatched}); } else if (key == TransformType::Grad || key == TransformType::Jvp) { return autograd_dispatch_keyset.add(DispatchKey::ADInplaceOrView); } else if (key == TransformType::Functionalize) { @@ -49,7 +49,7 @@ static DispatchKeySet keysForEnteringDynamicLayer(TransformType key) { DispatchKeySet keysToExcludeWhenEnteringDynamicLayer(TransformType key) { DispatchKeySet exclude = all_dynlayer_keyset; - exclude = exclude.remove(kDynamicLayerBackModeKey); + exclude = exclude.remove(DispatchKey::FuncTorchDynamicLayerBackMode); exclude = exclude - keysForEnteringDynamicLayer(key); return exclude; } @@ -115,8 +115,8 @@ void Interpreter::process(const c10::OperatorHandle& op, torch::jit::Stack* stac INTERPRETER_DISPATCH(key_, SINGLE_ARG(processImpl(op, stack))); } -void Interpreter::sendToNextInterpreter(const c10::OperatorHandle& op, torch::jit::Stack* stack) { - INTERPRETER_DISPATCH(key_, SINGLE_ARG(sendToNextInterpreterImpl(op, stack))); +void Interpreter::sendToNextInterpreter(const c10::OperatorHandle& op, torch::jit::Stack* stack, bool grad_special_case) { + INTERPRETER_DISPATCH(key_, SINGLE_ARG(sendToNextInterpreterImpl(op, stack, grad_special_case))); } }} diff --git a/functorch/functorch/csrc/Interpreter.h b/aten/src/ATen/functorch/Interpreter.h similarity index 91% rename from functorch/functorch/csrc/Interpreter.h rename to aten/src/ATen/functorch/Interpreter.h index 2a1a426824b1..f521e26f2b64 100644 --- a/functorch/functorch/csrc/Interpreter.h +++ b/aten/src/ATen/functorch/Interpreter.h @@ -1,14 +1,11 @@ #pragma once -// variant.h doesn't clean up after itself... -#include -#undef DECLTYPE_AUTO - -#include -#include +#include #include #include #include +#include +#include namespace at { namespace functorch { @@ -143,7 +140,7 @@ struct Interpreter { const InterpreterMeta& meta() const { return meta_; } void process(const c10::OperatorHandle& op, torch::jit::Stack* stack); - void sendToNextInterpreter(const c10::OperatorHandle& op, torch::jit::Stack* stack); + void sendToNextInterpreter(const c10::OperatorHandle& op, torch::jit::Stack* stack, bool grad_special_case); void saveLocalDispatchKeySet(c10::impl::LocalDispatchKeySet keyset) { TORCH_INTERNAL_ASSERT(!savedLocalDispatchKeySet_.has_value()); @@ -178,6 +175,16 @@ struct Interpreter { void foreachTensorInplace(std::vector& args, int64_t begin, int64_t end, std::function func); +// Applies the following for-loop: +// for i in range(begin, end): +// if use_flag_relative[i] == 1: <-- treats use_flag_relative as a bitset +// args[i] = func(args[i], i - begin, true) +// args[i] = func(args[i], i - begin) +void foreachTensorInplaceWithFlag(std::vector& args, int64_t begin, int64_t end, + const std::bitset<64> use_flag_relative, std::function func); + +std::vector findUnwrappedInputs(std::vector& args, int64_t begin, int64_t end); + DispatchKeySet keysToExcludeWhenEnteringDynamicLayer(TransformType key); void setup_dispatch_key_tls(DispatchKeySet exclude, DispatchKeySet include); diff --git a/functorch/functorch/csrc/LegacyBatchingRegistrations.cpp b/aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp similarity index 82% rename from functorch/functorch/csrc/LegacyBatchingRegistrations.cpp rename to aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp index 6de2a4000ede..8456bf0008fa 100644 --- a/functorch/functorch/csrc/LegacyBatchingRegistrations.cpp +++ b/aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp @@ -7,15 +7,14 @@ #include 
#include #include -#include +#include -#include -#include -#include -#include -#include -#include -#include +#include +#include +#include +#include +#include +#include namespace at { namespace functorch { @@ -23,6 +22,10 @@ namespace functorch { // NOTE: [What is a batching rule?] // +// NB: the following description only applies to this file and is about +// the legacy (deprecated) batching rule API. Please see writing_batch_rules.md +// for how to write new-style batching rules. +// // This files contains batching rules written with the legacy (now-deprecated) // batching rule API. // Please try to use the new-style batching rule API (see writing_batch_rules.md) @@ -61,23 +64,20 @@ namespace functorch { // to do steps (1), (2), and (4). // (see NOTE: [What is an VmapTransform?] in VmapTransforms.h) -// Note: [Future plans] -// The API for writing a batching rule isn't stable. In the future, we'd like -// to think about the problem of translating these batching rules to TorchScript. -// Ideally batching rules in eager mode vs TorchScript would look pretty similar, -// if not use the same mechanism. In order to accomplish that we might have to -// do some refactoring. - // PyTorch allows operations to specify dim 0 and dim -1 on a scalar tensor. static bool is_allowed_dim_on_scalar_tensor(int64_t dim) { return dim == 0 || dim == -1; } -// This check should probably go into the dispatcher... -static bool participatesInCurrentLevel(const Tensor& self) { +static int64_t get_current_level() { auto maybe_level = maybeCurrentDynamicLayer(); TORCH_INTERNAL_ASSERT(maybe_level.has_value()); - auto current_level = maybe_level->layerId(); + return maybe_level->layerId(); +} + +// This check should probably go into the dispatcher... +static bool participatesInCurrentLevel(const Tensor& self) { + auto current_level = get_current_level(); auto* maybe_batched_impl = maybeGetBatchedImpl(self); if (!maybe_batched_impl) { return false; @@ -87,7 +87,7 @@ static bool participatesInCurrentLevel(const Tensor& self) { return self_level == current_level; } -static bool participatesInCurrentLevel(TensorList self) { +static bool participatesInCurrentLevel(ITensorListRef self) { for (const Tensor& tensor : self) { if (participatesInCurrentLevel(tensor)) { return true; @@ -109,7 +109,7 @@ bool isPhysicalScalarTensor(const Tensor& logical_tensor) { std::vector chunk_batching_rule(const Tensor& self, int64_t chunks, int64_t dim) { if (!participatesInCurrentLevel(self)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return self.chunk(chunks, dim); } @@ -122,7 +122,7 @@ std::vector chunk_batching_rule(const Tensor& self, int64_t chunks, int6 std::vector tensor_split_sections_batching_rule(const Tensor& self, int64_t sections, int64_t dim) { if (!participatesInCurrentLevel(self)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::tensor_split(self, sections, dim); } auto self_physical = MultiBatchVmapTransform::logicalToPhysical(self); @@ -134,7 +134,7 @@ std::vector tensor_split_sections_batching_rule(const Tensor& self, int6 std::vector tensor_split_indices_batching_rule(const Tensor& self, IntArrayRef indices, int64_t dim) { if (!participatesInCurrentLevel(self)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::tensor_split(self, indices, dim); } 
auto self_physical = MultiBatchVmapTransform::logicalToPhysical(self); @@ -146,7 +146,7 @@ std::vector tensor_split_indices_batching_rule(const Tensor& self, IntAr Tensor& squeeze_dim__batching_rule(Tensor& self, int64_t dim) { if (!participatesInCurrentLevel(self)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return self.squeeze_(dim); } auto* batched = maybeGetBatchedImpl(self); @@ -180,7 +180,7 @@ Tensor& squeeze_dim__batching_rule(Tensor& self, int64_t dim) { Tensor& squeeze__batching_rule(Tensor& self) { if (!participatesInCurrentLevel(self)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return self.squeeze_(); } auto* batched = maybeGetBatchedImpl(self); @@ -217,7 +217,7 @@ Tensor& squeeze__batching_rule(Tensor& self) { Tensor& unsqueeze__batching_rule(Tensor& self, int64_t dim) { if (!participatesInCurrentLevel(self)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return self.unsqueeze_(dim); } auto* batched = maybeGetBatchedImpl(self); @@ -237,7 +237,7 @@ Tensor& unsqueeze__batching_rule(Tensor& self, int64_t dim) { Tensor& transpose__batching_rule(Tensor& self, int64_t dim0, int64_t dim1) { if (!participatesInCurrentLevel(self)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return self.transpose_(dim0, dim1); } auto* batched = maybeGetBatchedImpl(self); @@ -269,7 +269,7 @@ Tensor& transpose__batching_rule(Tensor& self, int64_t dim0, int64_t dim1) { Tensor& fill_inplace_scalar_batching_rule(Tensor& self, Scalar value) { if (!participatesInCurrentLevel(self)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return self.fill_(value); } auto self_physical = MultiBatchVmapTransform::logicalToPhysical(self); @@ -299,7 +299,7 @@ Tensor& zero_inplace_batching_rule(Tensor &self) { Tensor transpose_int_batching_rule(const Tensor& self, int64_t dim0, int64_t dim1) { if (!participatesInCurrentLevel(self)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::transpose(self, dim0, dim1); } // PyTorch has a special case where scalar_tensor.transpose(dim0, dim1) works @@ -324,7 +324,7 @@ static int64_t getGradInputPhysicalDim(int64_t dim, IntArrayRef input_sizes, int Tensor select_backward_batching_rule(const Tensor& grad, IntArrayRef input_sizes, int64_t dim, int64_t index) { if (!participatesInCurrentLevel(grad)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::select_backward(grad, input_sizes, dim, index); } auto grad_physical = MultiBatchVmapTransform::logicalToPhysical(grad); @@ -336,7 +336,7 @@ Tensor select_backward_batching_rule(const Tensor& grad, IntArrayRef input_sizes Tensor slice_backward_batching_rule(const Tensor& grad, IntArrayRef input_sizes, int64_t dim, int64_t start, int64_t end, int64_t step) { if (!participatesInCurrentLevel(grad)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::slice_backward(grad, input_sizes, dim, start, end, step); } auto grad_physical = 
MultiBatchVmapTransform::logicalToPhysical(grad); @@ -348,7 +348,7 @@ Tensor slice_backward_batching_rule(const Tensor& grad, IntArrayRef input_sizes, std::vector split_batching_rule(const Tensor& self, int64_t split_size, int64_t dim) { if (!participatesInCurrentLevel(self)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::split(self, split_size, dim); } auto self_physical = MultiBatchVmapTransform::logicalToPhysical(self); @@ -360,7 +360,7 @@ std::vector split_batching_rule(const Tensor& self, int64_t split_size, std::vector split_with_sizes_batching_rule(const Tensor& self, IntArrayRef split_sizes, int64_t dim) { if (!participatesInCurrentLevel(self)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return split_with_sizes(self, split_sizes, dim); } auto self_physical = MultiBatchVmapTransform::logicalToPhysical(self); @@ -372,7 +372,7 @@ std::vector split_with_sizes_batching_rule(const Tensor& self, IntArrayR std::vector unbind_batching_rule(const Tensor& self, int64_t dim) { if (!participatesInCurrentLevel(self)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::unbind(self, dim); } auto self_physical = MultiBatchVmapTransform::logicalToPhysical(self); @@ -382,35 +382,11 @@ std::vector unbind_batching_rule(const Tensor& self, int64_t dim) { return result; } -// Checks that the smallest batch stride is greater than the largest example -// stride. This is something we can support but we choose not to because it's -// potentially error prone. -static void checkBatchDimsAtFrontInLayout(IntArrayRef physical_strides, int64_t num_batch_dims) { - auto smallest_batch_stride = std::min_element( - physical_strides.begin(), physical_strides.begin() + num_batch_dims); - auto largest_example_stride = std::max_element( - physical_strides.begin() + num_batch_dims, physical_strides.end()); - if (largest_example_stride == physical_strides.end()) { - // No example dimensions - return; - } - if (num_batch_dims == 1 && physical_strides.size() > 0 && physical_strides[0] == 0) { - // degenerate batch dim - return; - } - TORCH_CHECK(*smallest_batch_stride >= *largest_example_stride, - "vmap: Calling Tensor.as_strided is not supported unless the batch dims being ", - "vmapped over are at the front of the tensor (in memory layout). When they are ", - "not at the front of the tensor this operation can be error prone so we " - "actively discourage it; please file us a bug report and/or try to ", - "express the as_strided operation in terms of PyTorch view operations"); -} - // given (sizes, strides, storage_offset) returns the maximum location that // can be indexed (or nullopt if such a location doesn't exist, e.g., tensors // with zero-size dims). 
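The function itself follows immediately below; as plain arithmetic (illustrative only, invented names), the largest flat index such a view can reach is storage_offset plus the sum over dims of (size - 1) * stride, provided no size is zero:

// --- illustrative sketch: the arithmetic behind maximum_indexable_location (not patch content) ---
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int64_t max_indexable_location_sketch(const std::vector<int64_t>& sizes,
                                      const std::vector<int64_t>& strides,
                                      int64_t storage_offset) {
  int64_t loc = storage_offset;
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    if (sizes[i] == 0) {
      return -1;  // stand-in for nullopt: a zero-size dim means nothing is indexable
    }
    loc += (sizes[i] - 1) * strides[i];
  }
  return loc;
}

int main() {
  // A (2, 3) view with strides (3, 1) and storage offset 4 can reach up to 4 + 1*3 + 2*1 = 9.
  std::cout << max_indexable_location_sketch({2, 3}, {3, 1}, 4) << "\n";  // 9
  return 0;
}
// --- end sketch ---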
-static optional maximum_indexable_location( - IntArrayRef sizes, IntArrayRef strides, int64_t storage_offset) { +static optional maximum_indexable_location( + c10::SymIntArrayRef sizes, c10::SymIntArrayRef strides, c10::SymInt storage_offset) { auto result = native::storage_size_for(sizes, strides); if (result == 0) { return nullopt; @@ -425,12 +401,12 @@ static optional maximum_indexable_location( static void checkBasicAsStridedValidForSlice( const Tensor& physical_tensor, int64_t num_batch_dims, - IntArrayRef sizes, - IntArrayRef strides, - optional maybe_storage_offset) { - auto slice_sizes = physical_tensor.sizes().slice(num_batch_dims); - auto slice_strides = physical_tensor.strides().slice(num_batch_dims); - auto base_offset = physical_tensor.storage_offset(); + c10::SymIntArrayRef sizes, + c10::SymIntArrayRef strides, + optional maybe_storage_offset) { + auto slice_sizes = physical_tensor.sym_sizes().slice(num_batch_dims); + auto slice_strides = physical_tensor.sym_strides().slice(num_batch_dims); + auto base_offset = physical_tensor.sym_storage_offset(); auto storage_offset = maybe_storage_offset.value_or(base_offset); @@ -442,7 +418,7 @@ static void checkBasicAsStridedValidForSlice( } if (!max_slice_loc.has_value()) { TORCH_CHECK(false, - "result = tensor.as_strided(", sizes, ",", strides, ",", storage_offset, ")", + "result = tensor.as_strided(", sizes, ", ", strides, ", ", storage_offset, ") ", "can access memory outside of `tensor`. `tensor` has no storage but the ", "passed-in (size, stride, storage_offset) imply a result with some storage. ", "This is not supported inside of vmap, please try to rewrite the ", @@ -451,11 +427,11 @@ static void checkBasicAsStridedValidForSlice( TORCH_CHECK( *max_as_strided_loc <= *max_slice_loc && base_offset <= storage_offset, - "result = tensor.as_strided(", sizes, ",", strides, ",", storage_offset, ")", - "can access memory outside of `tensor`. `result` can access some", + "result = tensor.as_strided(", sizes, ", ", strides, ", ", storage_offset, ") ", + "can access memory outside of `tensor`. `result` can access some ", "memory in range [", storage_offset, ", ", *max_as_strided_loc, "], but ", "`tensor` can only access some memory in range [", base_offset, ", ", - *max_slice_loc, "]. This is not supported inside of vmap, please try to", + *max_slice_loc, "]. This is not supported inside of vmap, please try to ", "rewrite the `as_strided` call as a sequence of PyTorch view operations"); } @@ -483,12 +459,12 @@ static void checkBasicAsStridedValidForSlice( // >>> z = [x[i].as_strided([1], [1], 1 + x[i].storage_offset() - 1) for i in range(4)] Tensor as_strided_batching_rule( const Tensor& tensor, - IntArrayRef sizes, - IntArrayRef strides, - optional storage_offset) { + c10::SymIntArrayRef sizes, + c10::SymIntArrayRef strides, + optional storage_offset) { if (!participatesInCurrentLevel(tensor)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); - return at::as_strided(tensor, sizes, strides, storage_offset); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); + return at::as_strided_symint(tensor, sizes, strides, storage_offset); } auto physical_view = MultiBatchVmapTransform::logicalToPhysical(tensor); auto num_batch_dims = physical_view.numBatchDims(); @@ -502,18 +478,15 @@ Tensor as_strided_batching_rule( "same length! Got size ", sizes, " and stride ", strides); // Sanity checks: - // 1. 
All batch dims are at the front in memory layout (not necessary for - // correctness, but we are worried the user might be doing crazy things) - // 2. as_strided(sizes, strides, storage_offset + tensor[i].offset() - tensor.offset()) + // 1. as_strided(sizes, strides, storage_offset + tensor[i].offset() - tensor.offset()) // is valid for a slice of the input tensor. // See Note: [When will the as_strided batching rule fail?] for details. - checkBatchDimsAtFrontInLayout(physical_tensor.strides(), num_batch_dims); checkBasicAsStridedValidForSlice( physical_tensor, num_batch_dims, sizes, strides, storage_offset); // physical_strides = physical tensor's batch strides + (logical) strides auto batch_strides = physical_tensor.strides().slice(0, num_batch_dims); - VmapDimVector physical_strides; + SymDimVector physical_strides; physical_strides.reserve(num_batch_dims + strides.size()); physical_strides.insert( physical_strides.end(), batch_strides.begin(), batch_strides.end()); @@ -525,7 +498,7 @@ Tensor as_strided_batching_rule( // xs.as_strided(physical_sizes, physical_strides, offset) always succeeds // and creates a tensor y such that each y[i] references the same memory // locations as zi. See NOTE: [When will the as_strided batching rule fail?] - auto result = physical_view.tensor().as_strided( + auto result = physical_view.tensor().as_strided_symint( physical_sizes, physical_strides, storage_offset); return physical_view.getPhysicalToLogicalMap().apply(result); } @@ -618,7 +591,7 @@ Tensor as_strided_batching_rule( template Tensor unwrap_and_call(const Tensor& input, ExtraArgs... args) { if (!participatesInCurrentLevel(input)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return Func(input, args...); } // guard against the user passing in a batch of scalar tensors with batch @@ -630,7 +603,7 @@ Tensor unwrap_and_call(const Tensor& input, ExtraArgs... args) { template Tensor unwrap_and_call_method(const Tensor& input, ExtraArgs... extra_args) { if (!participatesInCurrentLevel(input)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return (input.*Func)(extra_args...); } auto* input_batched = unsafeGetBatchedImpl(input); @@ -638,23 +611,76 @@ Tensor unwrap_and_call_method(const Tensor& input, ExtraArgs... extra_args) { return makeBatched(output_physical, input_batched->bdim(), input_batched->level()); } -Tensor cat_batching_rule(TensorList tensors, int64_t dim) { +Tensor cat_batching_rule(const ITensorListRef& tensors, int64_t dim) { if (!participatesInCurrentLevel(tensors)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::cat(tensors, dim); } - auto physical_views = MultiBatchVmapTransform::logicalToPhysical(tensors); - auto physical_tensors = fmap( - physical_views, [](const VmapPhysicalView& view) -> Tensor { return view.tensor(); }); - TORCH_INTERNAL_ASSERT( - tensors.size() > 0, "The dispatcher should not have dispatched here otherwise."); - auto result = at::cat(physical_tensors, physical_views[0].getPhysicalDim(dim)); - return physical_views[0].getPhysicalToLogicalMap().apply(result); + + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); + + // NB: Probably bad for perf that we're allocating std::vectors for each level, but + // what can you do. 
+ auto materialized = tensors.materialize(); + dim = at::legacy_cat_wrap_dim(dim, materialized); + + // Strategy: + // we're going to unwrap tensors, move their batch dims to the front, + // and put them into `tensors_to_cat`. Tensors that don't have a batch dim + // will get one forced onto them. + // + // Then, we'll do at::cat(tensors_to_cat, ...). + // + // There's a special case where at::cat ignores tensors that have logical shape + // [0]. If we see a Tensor that has logical shape [0] (but physical shape [B, 0]), + // we'll just slice the tensor to get a Tensor of shape [0] to pass to at::cat. + std::vector tensors_to_cat; + tensors_to_cat.reserve(tensors.size()); + c10::optional bdim_size = c10::nullopt; + + // find the bdim size. Might not exist if all BatchedTensors should be skipped + // by cat's special case. + for (const auto& tensor : tensors) { + if (!participatesInCurrentLevel(tensor)) { + continue; + } + if (at::native::cat_should_skip_tensor(tensor)) { + continue; + } + const auto* batched = unsafeGetBatchedImpl(tensor); + bdim_size = batched->value().size(batched->bdim()); + break; + } + + // unwrap batchedtensors; expand out bdims + for (const auto& tensor : tensors) { + if (!participatesInCurrentLevel(tensor)) { + if (at::native::cat_should_skip_tensor(tensor) || !bdim_size.has_value()) { + tensors_to_cat.emplace_back(tensor); + continue; + } + tensors_to_cat.emplace_back(ensure_has_bdim(tensor, /*has_bdim*/false, *bdim_size)); + continue; + } + const auto* batched = unsafeGetBatchedImpl(tensor); + if (at::native::cat_should_skip_tensor(tensor)) { + // Special case: slice the tensor to get something of shape [0] to pass to cat + // We slice instead of allocate a new tensor to propagate requires_gradness... + tensors_to_cat.emplace_back(batched->value().select(/*dim=*/batched->bdim(), /*index=*/0)); + continue; + } + tensors_to_cat.emplace_back(moveBatchDimToFront(batched->value(), batched->bdim())); + } + + auto new_dim = bdim_size.has_value() ? dim + 1 : dim; + c10::optional new_bdim = bdim_size.has_value() ? 
c10::make_optional((int64_t)0) : nullopt; + auto result = at::cat(tensors_to_cat, new_dim); + return makeBatched(result, new_bdim, get_current_level()); } Tensor block_diag_batching_rule(TensorList tensors) { if (!participatesInCurrentLevel(tensors)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::block_diag(tensors); } auto physical_views = MultiBatchVmapTransform::logicalToPhysical(tensors); @@ -682,7 +708,7 @@ Tensor block_diag_batching_rule(TensorList tensors) { Tensor stack_batching_rule(TensorList tensors, int64_t dim) { if (!participatesInCurrentLevel(tensors)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return at::stack(tensors, dim); } auto physical_views = MultiBatchVmapTransform::logicalToPhysical(tensors); @@ -700,14 +726,17 @@ Tensor stack_batching_rule(TensorList tensors, int64_t dim) { Tensor new_empty_strided_batching_rule( const Tensor& self, - IntArrayRef size, - IntArrayRef stride, + SymIntArrayRef sym_size, + SymIntArrayRef sym_stride, optional dtype, optional layout, optional device, optional pin_memory) { + + auto size = c10::asIntArrayRefSlow(sym_size); + auto stride = c10::asIntArrayRefSlow(sym_stride); if (!participatesInCurrentLevel(self)) { - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); return self.new_empty_strided( size, stride, dtype, layout, device, pin_memory); } @@ -763,25 +792,17 @@ Tensor new_empty_strided_batching_rule( return physical_view.getPhysicalToLogicalMap().apply(result); } -bool BatchedTensor_is_leaf(const Tensor& self) { - if (torch::autograd::impl::get_autograd_meta(self)) { - return torch::autograd::impl::get_autograd_meta(self)->grad_fn_ == nullptr; - } else { - return true; - } -} - Tensor& BatchedTensor_requires_grad_(Tensor& self, bool requires_grad) { self.set_requires_grad(requires_grad); return self; } -TORCH_LIBRARY_IMPL(_, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(_, FuncTorchBatched, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&batchedTensorForLoopFallback>()); } -TORCH_LIBRARY_IMPL(aten, FT_BATCHED_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) { // still legacy b/c teturns multiple tensors m.impl("tensor_split.sections", tensor_split_sections_batching_rule); m.impl("tensor_split.indices", tensor_split_indices_batching_rule); diff --git a/functorch/functorch/csrc/LegacyVmapTransforms.cpp b/aten/src/ATen/functorch/LegacyVmapTransforms.cpp similarity index 88% rename from functorch/functorch/csrc/LegacyVmapTransforms.cpp rename to aten/src/ATen/functorch/LegacyVmapTransforms.cpp index 3b57bd35e52e..682169a52622 100644 --- a/functorch/functorch/csrc/LegacyVmapTransforms.cpp +++ b/aten/src/ATen/functorch/LegacyVmapTransforms.cpp @@ -4,8 +4,8 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. 
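To make the shape bookkeeping of the rewritten cat_batching_rule above concrete, here is a small standalone sketch (illustrative only, invented names, no ATen dependency) of what happens when one input is batched at the current level and one is not:

// --- illustrative sketch: cat batching rule shape bookkeeping (not patch content) ---
#include <cassert>
#include <cstdint>
#include <vector>

int main() {
  const int64_t B = 4;                       // vmapped batch size
  std::vector<int64_t> batched{B, 2, 3};     // physical shape, bdim already moved to front
  std::vector<int64_t> plain{2, 3};          // not batched at the current level

  // ensure_has_bdim: the unbatched input gets a size-B leading dim so shapes line up.
  std::vector<int64_t> plain_expanded{B, plain[0], plain[1]};

  // A logical cat over dim 0 becomes a physical cat over dim 1, since a bdim now sits at dim 0.
  const int64_t logical_dim = 0;
  const int64_t new_dim = logical_dim + 1;
  assert(new_dim == 1);

  // at::cat along new_dim would produce physical shape (B, 2 + 2, 3) = (B, 4, 3) ...
  std::vector<int64_t> result{B, batched[1] + plain_expanded[1], 3};
  assert(result == (std::vector<int64_t>{B, 4, 3}));
  // ... which is then re-wrapped with makeBatched(result, /*bdim=*/0, current level).
  return 0;
}
// --- end sketch ---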
-#include -#include +#include +#include #include #include @@ -76,6 +76,15 @@ VmapDimVector VmapPhysicalView::getPhysicalShape(IntArrayRef logical_shape) cons return result; } +SymDimVector VmapPhysicalView::getPhysicalShape(c10::SymIntArrayRef logical_shape) const { + SymDimVector result; + result.reserve(logical_shape.size() + numBatchDims()); + auto tensor_sizes = tensor_.sym_sizes(); + result.insert(result.end(), tensor_sizes.begin(), tensor_sizes.begin() + numBatchDims()); + result.insert(result.end(), logical_shape.begin(), logical_shape.end()); + return result; +} + static std::tuple computeFrontBatchDimsFromLevels(std::bitset levels_bitset) { int64_t level = 0; int64_t dim = 0; @@ -109,7 +118,7 @@ static Tensor moveDimToFrontAndExpand(Tensor tensor, optional dim, int6 // 4. Expand each physical tensor so that they have output batch size equal // to `batch_sizes` VmapPhysicalViewVec -MultiBatchVmapTransform::logicalToPhysical(TensorList logical_tensors) { +MultiBatchVmapTransform::logicalToPhysical(ITensorListRef logical_tensors) { auto cur_level = maybeCurrentDynamicLayer().value().layerId(); auto bdim_size = -1; @@ -134,12 +143,12 @@ MultiBatchVmapTransform::logicalToPhysical(TensorList logical_tensors) { auto* batched = maybeGetBatchedImpl(logical_tensor); if (!batched || (batched->level() != cur_level)) { // Unsqueeze dim 0, expand it to the correct shape - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); auto value = moveDimToFrontAndExpand(logical_tensor, {}, bdim_size); result.emplace_back(std::move(value), levels); continue; } - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); auto physical = batched->value(); auto value = moveDimToFrontAndExpand(physical, batched->bdim(), bdim_size); result.emplace_back(std::move(value), levels); @@ -189,12 +198,12 @@ VmapPhysicalViewVec BroadcastingVmapTransform::logicalToPhysical(TensorList logi auto* batched = maybeGetBatchedImpl(logical_tensor); if (!batched || (batched->level() != cur_level)) { // Unsqueeze dim 0, expand it to the correct shape - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); auto value = moveDimToFrontAndUnsqueeze(logical_tensor, {}, max_example_dim); result.emplace_back(std::move(value), levels); continue; } - c10::impl::ExcludeDispatchKeyGuard guard(kBatchedKey); + c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched); auto physical = batched->value(); auto value = moveDimToFrontAndUnsqueeze(physical, batched->bdim(), max_example_dim); result.emplace_back(std::move(value), levels); diff --git a/functorch/functorch/csrc/LegacyVmapTransforms.h b/aten/src/ATen/functorch/LegacyVmapTransforms.h similarity index 95% rename from functorch/functorch/csrc/LegacyVmapTransforms.h rename to aten/src/ATen/functorch/LegacyVmapTransforms.h index 443c4e867de2..5fc05b6c8038 100644 --- a/functorch/functorch/csrc/LegacyVmapTransforms.h +++ b/aten/src/ATen/functorch/LegacyVmapTransforms.h @@ -6,8 +6,8 @@ #pragma once -#include -#include +#include +#include namespace at { namespace functorch { @@ -62,9 +62,9 @@ using VmapDimVector = SmallVector; // permutes all of the batch dims to the front of the tensor, aligns // and expands the batch dims to match each other (according to their `level`), // and returns a VmapPhysicalView on the tensor(s). 
-struct FUNCTORCH_API MultiBatchVmapTransform { +struct TORCH_API MultiBatchVmapTransform { static VmapPhysicalView logicalToPhysical(const Tensor& logical_tensor); - static VmapPhysicalViewVec logicalToPhysical(TensorList logical_tensors); + static VmapPhysicalViewVec logicalToPhysical(ITensorListRef logical_tensors); }; // VmapTransform for operators that broadcast all inputs. @@ -86,7 +86,7 @@ struct FUNCTORCH_API MultiBatchVmapTransform { // actually *need* to return a tensor of size (1, 2) for the second tensor // because the broadcasting operation takes care of that for us, but we do // it anyways to keep things simple. -struct FUNCTORCH_API BroadcastingVmapTransform { +struct TORCH_API BroadcastingVmapTransform { static VmapPhysicalViewVec logicalToPhysical(TensorList logical_tensors); }; @@ -118,7 +118,7 @@ struct VmapPhysicalToLogicalMap; // ^ // | // levels: 012345 -struct FUNCTORCH_API VmapPhysicalView { +struct TORCH_API VmapPhysicalView { VmapPhysicalView(Tensor&& tensor, std::bitset levels) : levels_(levels), tensor_(tensor) { // TORCH_INTERNAL_ASSERT(!isBatchedTensor(tensor)); @@ -146,6 +146,7 @@ struct FUNCTORCH_API VmapPhysicalView { // Maps a logical shape to a physical shape by pre-pending the batch // sizes to the logical shape. VmapDimVector getPhysicalShape(IntArrayRef logical_shape) const; + SymDimVector getPhysicalShape(c10::SymIntArrayRef logical_shape) const; int64_t numBatchDims() const; @@ -160,7 +161,7 @@ struct FUNCTORCH_API VmapPhysicalView { // to a logical one (BatchedTensor). It holds some levels that are used to do the // mapping and assumes that the batch dimensions in the physical tensor all // occur at the front of the tensor. -struct FUNCTORCH_API VmapPhysicalToLogicalMap { +struct TORCH_API VmapPhysicalToLogicalMap { VmapPhysicalToLogicalMap(std::bitset levels): levels_(levels) {} // Maps a physical tensor to a new logical tensor (BatchedTensor). diff --git a/aten/src/ATen/functorch/Macros.h b/aten/src/ATen/functorch/Macros.h new file mode 100644 index 000000000000..eb0a763261bf --- /dev/null +++ b/aten/src/ATen/functorch/Macros.h @@ -0,0 +1,3 @@ +#pragma once + +#define SINGLE_ARG(...) __VA_ARGS__ diff --git a/functorch/functorch/csrc/PlumbingHelper.cpp b/aten/src/ATen/functorch/PlumbingHelper.cpp similarity index 91% rename from functorch/functorch/csrc/PlumbingHelper.cpp rename to aten/src/ATen/functorch/PlumbingHelper.cpp index e75fb82a3864..5dd01d0abbcb 100644 --- a/functorch/functorch/csrc/PlumbingHelper.cpp +++ b/aten/src/ATen/functorch/PlumbingHelper.cpp @@ -4,9 +4,9 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. -#include -#include -#include +#include +#include +#include namespace at { namespace functorch { @@ -50,7 +50,7 @@ bool isBatchedAtLevel(const c10::optional& maybe_tensor, int64_t level) return isBatchedAtLevel(*maybe_tensor, level); } -bool isBatchedAtLevel(TensorList tensors, int64_t level) { +bool isBatchedAtLevel(ITensorListRef tensors, int64_t level) { for (const auto& tensor : tensors) { if (isBatchedAtLevel(tensor, level)) { return true; diff --git a/aten/src/ATen/functorch/PlumbingHelper.h b/aten/src/ATen/functorch/PlumbingHelper.h new file mode 100644 index 000000000000..9eb486a6eefa --- /dev/null +++ b/aten/src/ATen/functorch/PlumbingHelper.h @@ -0,0 +1,61 @@ +// Copyright (c) Facebook, Inc. and its affiliates. +// All rights reserved. 
+// +// This source code is licensed under the BSD-style license found in the +// LICENSE file in the root directory of this source tree. +#pragma once +#include +#include +#include + +// NOTE: [vmap plumbing] +// +// Here's how "batching rules" work. +// - we register kernels to the Batched key +// - these kernels have the same signatures as the original operators. +// For example, at::sin(Tensor self) accepts a Tensor, and the batched kernel +// must also accept a Tensor +// - However, it is more natural for users to write a batching rule like the +// following: sin_batch_rule(Tensor self, optional self_bdim) +// - There is some codegenerated layer (the "plumbing") that wraps the user +// defined batching rule (e.g. sin_batch_rule) in a kernel that can be +// registered to the Batched key. +// +// The plumbing is responsible for wrapping a batching rule into a form that may +// be registered as the kernel for the batched key. + +namespace at { namespace functorch { + +// Create a BatchedTensor given a tensor, bdim, and level +TORCH_API Tensor makeBatched(const Tensor& tensor, optional bdim, int64_t level); + +// Given a Tensor that may or may not be a BatchedTensor, unwrap it. +// If `tensor` is not a BatchedTensor, or is a BatchedTensor but the level +// doesn't match, then this returns (tensor, nullopt). +// Otherwise, it returns (unwrap(tensor), bdim). +TORCH_API std::tuple> unwrapTensorAtLevel(const Tensor& tensor, int64_t level); + +// Creates a vector of BatchedTensor +TORCH_API std::vector makeBatchedVector(const std::vector& tensors, optional bdim, int64_t level); + +// Returns True if ANY tensor in tensors is batched at level +TORCH_API bool isBatchedAtLevel(ITensorListRef tensors, int64_t level); +TORCH_API bool isBatchedAtLevel(const c10::List> maybe_tensors, int64_t level); +TORCH_API bool isBatchedAtLevel(const Tensor& tensor, int64_t level); +TORCH_API bool isBatchedAtLevel(const c10::optional& maybe_tensor, int64_t level); + +// Convenience helper. Returns true if any tensor is batched at level +TORCH_API bool areAnyBatchedAtLevel(ArrayRef> maybe_tensors, int64_t level); + +inline bool ivalueParticipatesInCurrentLevel(const IValue& ivalue) { + if (ivalue.isTensor()) { + auto maybe_level = maybeCurrentDynamicLayer(); + TORCH_INTERNAL_ASSERT(maybe_level.has_value()); + auto current_level = maybe_level->layerId(); + return isBatchedAtLevel(ivalue.toTensor(), current_level); + } + // TODO: should really check this + return false; +} + +}} diff --git a/functorch/functorch/csrc/PyTorchOperatorHacks.cpp b/aten/src/ATen/functorch/PyTorchOperatorHacks.cpp similarity index 95% rename from functorch/functorch/csrc/PyTorchOperatorHacks.cpp rename to aten/src/ATen/functorch/PyTorchOperatorHacks.cpp index 0bde1f53d254..9f76253f81fc 100644 --- a/functorch/functorch/csrc/PyTorchOperatorHacks.cpp +++ b/aten/src/ATen/functorch/PyTorchOperatorHacks.cpp @@ -1,26 +1,28 @@ -#include -#include +#include #include #include #include -#include -#include +#include +#include #include #include #include #include #include +#include namespace at { namespace functorch { -// TODO: all of these should be fixed in a more blessed way. In particular, -// it is bad if any of these go out-of-sync with the implementations in -// pytorch/pytorch. +// NOTE: [functorch's PyTorch Operator Hacks] // // This file contains hacks for composite PyTorch operators that are problematic. // For example, the composite op might have in-place operations, // or call data_ptr. 
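A simplified sketch of the plumbing idea from NOTE: [vmap plumbing] above. Everything below is a stand-in for illustration (a scalar "Tensor" and toy bodies for sin_batch_rule, unwrapTensorAtLevel and makeBatched); it is not the real ATen/functorch API. The point is that the generated plumbing kernel keeps the operator's signature while delegating to a user-written batch rule.

#include <cmath>
#include <cstdint>
#include <optional>
#include <utility>

// Stand-in "Tensor": a scalar value plus optional batch-dim and level metadata.
struct Tensor { double v = 0.0; std::optional<int64_t> bdim; int64_t level = -1; };

// User-written batching rule: operates on the unwrapped value and the position
// of its batch dimension (nullopt if the input was not batched).
std::pair<Tensor, std::optional<int64_t>> sin_batch_rule(
    const Tensor& self, std::optional<int64_t> self_bdim) {
  return {Tensor{std::sin(self.v)}, self_bdim};
}

// Toy versions of the helpers declared in PlumbingHelper.h.
std::pair<Tensor, std::optional<int64_t>> unwrapTensorAtLevel(const Tensor& t, int64_t level) {
  if (t.level == level) return {Tensor{t.v}, t.bdim};
  return {t, std::nullopt};
}
Tensor makeBatched(Tensor t, std::optional<int64_t> bdim, int64_t level) {
  t.bdim = bdim;
  t.level = level;
  return t;
}

// The "plumbing" kernel has the operator's signature, so it could be registered
// to the batched dispatch key: unwrap the input, call the rule, re-wrap the result.
Tensor sin_plumbing(const Tensor& self, int64_t cur_level) {
  auto [value, bdim] = unwrapTensorAtLevel(self, cur_level);
  auto [out, out_bdim] = sin_batch_rule(value, bdim);
  return makeBatched(out, out_bdim, cur_level);
}

int main() {
  sin_plumbing(Tensor{1.0, 0, 3}, /*cur_level=*/3);
}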
We have some idea of how to fix these things in the long term -// (e.g. functionalization for the in-place operations). +// e.g., upstream the changes to PyTorch. +// +// TODO: all of these should be fixed in a more blessed way. In particular, +// it is bad if any of these go out-of-sync with the implementations in +// pytorch/pytorch. // TODO: upstream into core Tensor index_select_backward_hack(const Tensor& grad, IntArrayRef self_sizes, int64_t dim, const Tensor& index) { @@ -79,8 +81,8 @@ Tensor linear_hack(const Tensor& input, const Tensor& weight, const c10::optiona return at::mkldnn_linear(input, weight, *bias); } #if defined(C10_MOBILE) - if (xnnpack::use_linear(input, weight, *bias)) { - return xnnpack::linear(input, weight, *bias); + if (at::native::xnnpack::use_linear(input, weight, *bias)) { + return at::native::xnnpack::linear(input, weight, *bias); } #endif if (input.dim() == 2 && bias->defined()) { @@ -288,7 +290,7 @@ Tensor& feature_alpha_dropout_(Tensor& input, double p, bool train) { } // dropout_hack -TORCH_LIBRARY_IMPL(aten, FT_DYNAMIC_LAYER_FRONT_MODE_KEY, m) { +TORCH_LIBRARY_IMPL(aten, FuncTorchDynamicLayerFrontMode, m) { m.impl("index_select_backward", index_select_backward_hack); m.impl("linear", linear_hack); m.impl("binary_cross_entropy_with_logits", binary_cross_entropy_with_logits_hack); diff --git a/functorch/functorch/csrc/TensorWrapper.cpp b/aten/src/ATen/functorch/TensorWrapper.cpp similarity index 89% rename from functorch/functorch/csrc/TensorWrapper.cpp rename to aten/src/ATen/functorch/TensorWrapper.cpp index 054be6495c37..afd79943051e 100644 --- a/functorch/functorch/csrc/TensorWrapper.cpp +++ b/aten/src/ATen/functorch/TensorWrapper.cpp @@ -4,9 +4,9 @@ // This source code is licensed under the BSD-style license found in the // LICENSE file in the root directory of this source tree. 
-#include -#include -#include +#include +#include +#include #include #include @@ -62,7 +62,7 @@ c10::intrusive_ptr makeTensorWrapperPtr(const Tensor& tensor, int auto keys_to_propagate = kKeysToPropagateToWrapper | DispatchKeySet({ DispatchKey::AutogradCPU, DispatchKey::AutogradCUDA, DispatchKey::AutogradXLA}); auto key_set = getKeysToPropagateToWrapper(tensor, keys_to_propagate); - key_set = key_set.add(kGradWrapperKey); + key_set = key_set.add(DispatchKey::FuncTorchGradWrapper); if (should_be_alive) { return c10::make_intrusive(key_set, tensor, level, getLifeHandleForLevel(level)); } else { @@ -70,7 +70,7 @@ c10::intrusive_ptr makeTensorWrapperPtr(const Tensor& tensor, int } } -Tensor makeTensorWrapper(const Tensor& tensor, int64_t level) { +Tensor makeTensorWrapper(const Tensor& tensor, int64_t level, bool is_immutable) { auto wrapped = maybeGetTensorWrapper(tensor); if (wrapped) { TORCH_INTERNAL_ASSERT(wrapped->level() < level); @@ -79,10 +79,10 @@ Tensor makeTensorWrapper(const Tensor& tensor, int64_t level) { auto keys_to_propagate = kKeysToPropagateToWrapper | DispatchKeySet({ DispatchKey::AutogradCPU, DispatchKey::AutogradCUDA, DispatchKey::AutogradXLA}); auto key_set = getKeysToPropagateToWrapper(tensor, keys_to_propagate); - key_set = key_set.add(kGradWrapperKey); + key_set = key_set.add(DispatchKey::FuncTorchGradWrapper); auto life_handle = getLifeHandleForLevel(level); - auto result = at::detail::make_tensor(key_set, tensor, level, std::move(life_handle)); - TORCH_INTERNAL_ASSERT(result.key_set().has(kGradWrapperKey)); + auto result = at::detail::make_tensor(key_set, tensor, level, std::move(life_handle), is_immutable); + TORCH_INTERNAL_ASSERT(result.key_set().has(DispatchKey::FuncTorchGradWrapper)); return result; } @@ -121,10 +121,12 @@ TensorWrapper::TensorWrapper( Tensor value, int64_t level, std::shared_ptr is_alive, + bool is_immutable, bool use_value_sizes_strides) : TensorImpl(key_set, value.dtype(), value.device()) , value_(std::move(value)) , level_(level) + , is_immutable_(is_immutable) , is_alive_(std::move(is_alive)) { TORCH_INTERNAL_ASSERT(value_.defined()); @@ -154,7 +156,7 @@ const char* TensorWrapper::tensorimpl_type_name() const { TensorWrapper* maybeGetTensorWrapper(const Tensor& tensor) { - if (!tensor.key_set().has(kGradWrapperKey)) { + if (!tensor.key_set().has(DispatchKey::FuncTorchGradWrapper)) { return nullptr; } return (TensorWrapper*)(tensor.unsafeGetTensorImpl()); @@ -184,7 +186,7 @@ void dead_tensor_wrapper_fallback(const c10::OperatorHandle& op, torch::jit::Sta // TensorWrapper backend fallback: Unwrap and fallthrough. -TORCH_LIBRARY_IMPL(_, FT_GRAD_WRAPPER_KEY, m) { +TORCH_LIBRARY_IMPL(_, FuncTorchGradWrapper, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&dead_tensor_wrapper_fallback>()); } diff --git a/aten/src/ATen/functorch/TensorWrapper.h b/aten/src/ATen/functorch/TensorWrapper.h new file mode 100644 index 000000000000..25da91fd88e8 --- /dev/null +++ b/aten/src/ATen/functorch/TensorWrapper.h @@ -0,0 +1,97 @@ +// Copyright (c) Facebook, Inc. and its affiliates. +// All rights reserved. +// +// This source code is licensed under the BSD-style license found in the +// LICENSE file in the root directory of this source tree. + +#pragma once + +#include +#include + +namespace at { +namespace functorch { + +// NOTE: [functorch's TensorWrapper] +// +// Taking better suggestions for a name. TensorWrapper is the wrapper Tensor +// Subclass for functorch's grad-based transforms (grad, vjp, jvp). 
It is +// analogous to how vmap uses BatchedTensor as the wrapper Tensor subclass. +// +// If you're familiar with the Tensor-Variable merge, TensorWrapper is effectively +// another Variable. +// +// Consider grad(grad(torch.sin))(x). This wraps `x` as TensorWrapper(TensorWrapper(x)). +// The reason why is so that each TensorWrapper can hold its own AutogradMeta and +// participate in a **separate** autograd graph. +// +// There are alternative designs we could have chosen (e.g. each grad transform +// stores a weak map of Tensor -> AutogradMeta); the benefit of the TensorWrapper +// design is that we can re-use existing VariableType kernels (i.e. Autograd kernels) +// without much modification. Since a TensorWrapper looks like a regular Tensor, +// the VariableType kernel can pull out the AutogradMeta struct from where it +// expects and extend the autograd graph + +struct TORCH_API TensorWrapper : public c10::TensorImpl { + explicit TensorWrapper( + c10::DispatchKeySet key_set, + Tensor value, + int64_t level, + std::shared_ptr is_alive, + bool is_immutable = false, // if true, this came from an operation that aliases an immutable tensor + bool use_value_sizes_strides = true); + + // Override a bunch of methods inherited from TensorImpl to return error messages + void set_size(int64_t dim, int64_t new_size) override; + void set_stride(int64_t dim, int64_t new_stride) override; + void set_storage_offset(int64_t storage_offset) override; + + void refreshMetadata(); + + const Tensor& value() const { + return value_; + } + optional level() const { + if (is_alive()) { + return level_; + } + return {}; + } + bool is_immutable() const { + return is_immutable_; + } + bool is_alive() const; + + // Overrides necessary for autograd + c10::intrusive_ptr shallow_copy_and_detach( + const c10::VariableVersion& version_counter, + bool allow_tensor_metadata_change) const override; + c10::intrusive_ptr shallow_copy_and_detach( + c10::VariableVersion&& version_counter, + bool allow_tensor_metadata_change) const override; + void shallow_copy_from(const c10::intrusive_ptr& impl) override; + + private: + const char* tensorimpl_type_name() const override; + Tensor value_; + int64_t level_; + bool is_immutable_; + + // TensorWrapper receives a boolean flag on whether or not the Grad Interpreter + // that created it is still alive or not. + // If the Grad Interpreter is no longer alive then it attempts to behave like + // a regular Tensor. + // + // When we exit the level, this wrapper may be marked as "not alive". 
+ // Wrappers that are not alive: + // 1) May still have autograd metadata on them + // 2) Forward dispatches to the underlying value() + std::shared_ptr is_alive_; +}; + +TORCH_API Tensor makeTensorWrapper(const Tensor& tensor, int64_t level, bool is_immutable=false); +TORCH_API TensorWrapper* maybeGetTensorWrapper(const Tensor& tensor); +TORCH_API void dumpTensor(std::ostream & ss, const Tensor& tensor); +TORCH_API void dumpTensorCout(const Tensor& tensor); +} +} // namespace at diff --git a/functorch/functorch/csrc/VmapInterpreter.cpp b/aten/src/ATen/functorch/VmapInterpreter.cpp similarity index 68% rename from functorch/functorch/csrc/VmapInterpreter.cpp rename to aten/src/ATen/functorch/VmapInterpreter.cpp index a8f0283aa3b7..a7db8f13a031 100644 --- a/functorch/functorch/csrc/VmapInterpreter.cpp +++ b/aten/src/ATen/functorch/VmapInterpreter.cpp @@ -1,5 +1,5 @@ -#include -#include +#include +#include namespace at { namespace functorch { @@ -7,13 +7,14 @@ void VmapInterpreterPtr::processImpl( const c10::OperatorHandle& op, torch::jit::Stack* stack) { DispatchKeySet exclude = keysToExcludeWhenEnteringDynamicLayer(TransformType::Vmap); - setup_dispatch_key_tls(exclude, DispatchKeySet(kVmapModeKey)); + setup_dispatch_key_tls(exclude, DispatchKeySet(DispatchKey::FuncTorchVmapMode)); op.callBoxed(stack); } void VmapInterpreterPtr::sendToNextInterpreterImpl( const c10::OperatorHandle& op, - torch::jit::Stack* stack) { + torch::jit::Stack* stack, + bool grad_special_case) { // Re-dispatch if (getDynamicLayerStack().size() == 0) { sanityCheckStack(op, stack); diff --git a/functorch/functorch/csrc/VmapInterpreter.h b/aten/src/ATen/functorch/VmapInterpreter.h similarity index 76% rename from functorch/functorch/csrc/VmapInterpreter.h rename to aten/src/ATen/functorch/VmapInterpreter.h index 084cea956b28..2e4e6fff212f 100644 --- a/functorch/functorch/csrc/VmapInterpreter.h +++ b/aten/src/ATen/functorch/VmapInterpreter.h @@ -1,14 +1,17 @@ #pragma once -#include +#include namespace at { namespace functorch { +// This is the interpreter that handles the functionalize() transform. +// See NOTE: [functorch interpreter stack] for more details. + struct VmapInterpreterPtr { explicit VmapInterpreterPtr(const Interpreter* base): base_(base) { TORCH_INTERNAL_ASSERT(base->key() == TransformType::Vmap); } TransformType key() const { return base_->key(); } int64_t level() const { return base_->level(); } void processImpl(const c10::OperatorHandle& op, torch::jit::Stack* stack); - void sendToNextInterpreterImpl(const c10::OperatorHandle& op, torch::jit::Stack* stack); + void sendToNextInterpreterImpl(const c10::OperatorHandle& op, torch::jit::Stack* stack, bool grad_special_case); int64_t batchSize() const { return c10::get(base_->meta()).batchSize_; } diff --git a/functorch/functorch/csrc/VmapModeRegistrations.cpp b/aten/src/ATen/functorch/VmapModeRegistrations.cpp similarity index 83% rename from functorch/functorch/csrc/VmapModeRegistrations.cpp rename to aten/src/ATen/functorch/VmapModeRegistrations.cpp index 922b06e93db4..53c8c01ee7c7 100644 --- a/functorch/functorch/csrc/VmapModeRegistrations.cpp +++ b/aten/src/ATen/functorch/VmapModeRegistrations.cpp @@ -6,13 +6,17 @@ #include #include -#include -#include -#include -#include -#include +#include +#include +#include +#include #include +// functorch's vmap has two Dispatch Keys that implement it: +// FuncTorchBatched and FuncTorchVmapMode. 
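A hedged sketch of one way such error-out registrations can be expressed (this is not the actual content of VmapModeRegistrations.cpp): a boxed kernel that raises an error is installed for the FuncTorchVmapMode key, so an unsupported operator reaching that key fails with a clear message. The kernel name and message are illustrative; the TORCH_LIBRARY_IMPL / makeFromBoxedFunction pattern mirrors the registrations elsewhere in this patch and assumes the PyTorch headers are available.

#include <ATen/core/dispatch/Dispatcher.h>
#include <torch/library.h>

namespace {
// Boxed kernel: works for any operator schema, and simply errors out.
void vmap_mode_error_fallback(const c10::OperatorHandle& op, torch::jit::Stack*) {
  TORCH_CHECK(false, op.schema().name(),
              ": not supported inside vmap (illustrative error message)");
}
} // namespace

TORCH_LIBRARY_IMPL(_, FuncTorchVmapMode, m) {
  m.fallback(torch::CppFunction::makeFromBoxedFunction<&vmap_mode_error_fallback>());
}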
This file contains registrations for +// FuncTorchVmapMode -- these registrations are to error out on operations +// that we don't support on regular Tensors. + namespace at { namespace functorch { diff --git a/aten/src/ATen/jit_macros.h b/aten/src/ATen/jit_macros.h index ca765f03afbf..9af826549021 100644 --- a/aten/src/ATen/jit_macros.h +++ b/aten/src/ATen/jit_macros.h @@ -3,12 +3,5 @@ #include // AT_USE_JITERATOR(), controls whether we jit some elementwise kernels -// Currently unsupported on ROCm GPUs -#if !AT_ROCM_ENABLED() #define AT_USE_JITERATOR() true #define jiterator_stringify(...) std::string(#__VA_ARGS__); -#else -#define AT_USE_JITERATOR() false -#define jiterator_stringify(...) \ - static_assert(false, "Jiterator is not supported on ROCm"); -#endif // USE_ROCM diff --git a/aten/src/ATen/jiterator_macros.h b/aten/src/ATen/jiterator_macros.h index 63a7dfa2eb96..3aa4c7ebb0af 100644 --- a/aten/src/ATen/jiterator_macros.h +++ b/aten/src/ATen/jiterator_macros.h @@ -25,8 +25,8 @@ // These `,`s confuse the preprocessor into thinking we are passing // multiple arguments to the macro. #define jiterator_code(...) __VA_ARGS__ -#if defined(__CUDACC__) -// CPU and CUDA case +#if defined(__CUDACC__) || defined(__HIPCC__) +// CPU and CUDA and ROCm case #define stringify_code(...) #__VA_ARGS__ #define jiterator_also_stringify_as(code, str_name) \ code /* define the function */ \ diff --git a/aten/src/ATen/miopen/Descriptors.h b/aten/src/ATen/miopen/Descriptors.h index a376b30315f7..ba7f232c8fd7 100644 --- a/aten/src/ATen/miopen/Descriptors.h +++ b/aten/src/ATen/miopen/Descriptors.h @@ -3,7 +3,7 @@ #include #include -#include +#include #include namespace at { namespace native { diff --git a/aten/src/ATen/miopen/Utils.h b/aten/src/ATen/miopen/Utils.h index 68ec5bafeebb..a0ec83d976bc 100644 --- a/aten/src/ATen/miopen/Utils.h +++ b/aten/src/ATen/miopen/Utils.h @@ -1,6 +1,6 @@ #pragma once -#include +#include #include #include diff --git a/aten/src/ATen/mkl/SparseBlas.cpp b/aten/src/ATen/mkl/SparseBlas.cpp index 1ad464b8d3a3..ac4bbf65f311 100644 --- a/aten/src/ATen/mkl/SparseBlas.cpp +++ b/aten/src/ATen/mkl/SparseBlas.cpp @@ -1,7 +1,7 @@ /* Provides the implementations of MKL Sparse BLAS function templates. 
*/ - +#define TORCH_ASSERT_NO_OPERATORS #include #include diff --git a/aten/src/ATen/mps/EmptyTensor.cpp b/aten/src/ATen/mps/EmptyTensor.cpp index 759aef741ade..1642f4aeddd1 100644 --- a/aten/src/ATen/mps/EmptyTensor.cpp +++ b/aten/src/ATen/mps/EmptyTensor.cpp @@ -5,6 +5,7 @@ #include #include #include +#include #include #include diff --git a/aten/src/ATen/mps/IndexKernels.h b/aten/src/ATen/mps/IndexKernels.h new file mode 100644 index 000000000000..df22c616baac --- /dev/null +++ b/aten/src/ATen/mps/IndexKernels.h @@ -0,0 +1,181 @@ +#pragma once + +namespace at { +namespace mps { + +static const char * indexing_metal_shaders = R"INDEX_METAL( +#include +#include + +using namespace metal; + +constant uint32_t num_indices [[function_constant(0)]]; + +struct IndexAB { + // Allow up to 16 indices + metal::array indexArray [[ id(0) ]]; +}; + +template +kernel void index_select( + constant IndexAB & indexAB [[buffer(0)]], + constant void * indexSizes [[buffer(1)]], + constant void * indexStrides [[buffer(2)]], + constant uint3 * offsets [[buffer(3)]], + constant void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]) { + constant int64_t * index_sizes = (constant int64_t *)indexSizes; + constant int64_t * index_strides = (constant int64_t *)indexStrides; + int64_t offset = 0; + for (uint32_t i = 0; i < num_indices; i++) { + int64_t index = ((constant int64_t*)(indexAB.indexArray[i]))[offsets[thread_index].z / sizeof(int64_t)]; + if (index < 0) { + index += index_sizes[i]; + } + offset += index * index_strides[i]; + } + device T * out = (device T*)((device char*)outputData + offsets[thread_index].x); + constant T * in = (constant T*)((constant char*)inputData + offsets[thread_index].y + offset); + *out = *in; +} + +template +kernel void index_put( + constant IndexAB & indexAB [[buffer(0)]], + constant void * indexSizes [[buffer(1)]], + constant void * indexStrides [[buffer(2)]], + constant uint3 * offsets [[buffer(3)]], + constant void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]) { + + constant int64_t * index_sizes = (constant int64_t *)indexSizes; + constant int64_t * index_strides = (constant int64_t *)indexStrides; + int64_t offset = 0; + for (uint32_t i = 0; i < num_indices; i++) { + int64_t index = ((constant int64_t*)(indexAB.indexArray[i]))[offsets[thread_index].z / sizeof(int64_t)]; + if (index < 0) { + index += index_sizes[i]; + } + offset += index * index_strides[i]; + } + device T * out = (device T*)((device char*)outputData + offsets[thread_index].x + offset); + constant T * in = (constant T*)((constant char*)inputData + offsets[thread_index].y); + *out = *in; +} + +#define REGISTER_INDEX_OP(DTYPE_SIZE, DTYPE, INDEX_OP_TYPE) \ +template \ +[[host_name("index_" #INDEX_OP_TYPE "_" #DTYPE_SIZE)]] \ +kernel void index_ ## INDEX_OP_TYPE( \ + constant IndexAB & indexAB [[buffer(0)]], \ + constant void * indexSizes [[buffer(1)]], \ + constant void * indexStrides [[buffer(2)]], \ + constant uint3 * offsets [[buffer(3)]], \ + constant void * inputData [[buffer(4)]], \ + device void * outputData [[buffer(5)]], \ + uint thread_index [[thread_position_in_grid]]); + +#define REGISTER_INDEX_OP_ALL_DTYPES(INDEX_OP_TYPE) \ + REGISTER_INDEX_OP(8bit, char, INDEX_OP_TYPE); \ + REGISTER_INDEX_OP(16bit, short, INDEX_OP_TYPE); \ + REGISTER_INDEX_OP(32bit, int, INDEX_OP_TYPE); \ + REGISTER_INDEX_OP(64bit, long, INDEX_OP_TYPE); + +REGISTER_INDEX_OP_ALL_DTYPES(select); 
+REGISTER_INDEX_OP_ALL_DTYPES(put); + +kernel void kernel_index_offsets(constant packed_uint3 * strides [[buffer(0)]], + device uint3 * data_offsets [[buffer(1)]], + constant uint * iter_shape [[buffer(2)]], + constant uint & num_dimensions [[buffer(3)]], + constant uint & num_offsets [[buffer(4)]], + uint thread_index [[thread_position_in_grid]]) { + uint32_t idx = thread_index; + for (uint32_t dim = 0; dim < num_dimensions; dim++) { + uint32_t remainder = idx % iter_shape[dim]; + idx /= iter_shape[dim]; + + for (uint32_t offset = 0; offset < num_offsets; offset++) + data_offsets[thread_index][offset] += remainder * strides[dim][offset]; + } +} + +template +kernel void index_put_accumulate_native_dtypes(constant IndexAB & indexAB [[buffer(0)]], + constant void * indexSizes [[buffer(1)]], + constant void * indexStrides [[buffer(2)]], + constant uint3 * offsets [[buffer(3)]], + constant void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]) { + constant int64_t * index_sizes = (constant int64_t *)indexSizes; + constant int64_t * index_strides = (constant int64_t *)indexStrides; + int64_t offset = 0; + for (uint32_t i = 0; i < num_indices; i++) { + int64_t index = ((constant int64_t*)(indexAB.indexArray[i]))[offsets[thread_index].z / sizeof(int64_t)]; + if (index < 0) { + index += index_sizes[i]; + } + offset += index * index_strides[i]; + } + device T * out = (device T*)((device char*)outputData + offsets[thread_index].x + offset); + constant E * in = (constant E*)((constant char*)inputData + offsets[thread_index].y); + atomic_fetch_add_explicit(out, *in, memory_order_relaxed); +} + +template +__attribute__((__always_inline__)) void atomic_fetch_add_relaxed(device void * addr, T value) { + device atomic_uint* uintAddr = (device atomic_uint*)addr; + uint expected = atomic_load_explicit(uintAddr, memory_order_relaxed); + T updated = as_type(expected) + value; + while (!atomic_compare_exchange_weak_explicit(uintAddr, &expected, as_type(updated), memory_order_relaxed, memory_order_relaxed)) { + updated = as_type(expected) + value; + } +} + +template +kernel void atomic_index_put_accumulate(constant IndexAB & indexAB [[buffer(0)]], + constant void * indexSizes [[buffer(1)]], + constant void * indexStrides [[buffer(2)]], + constant uint3 * offsets [[buffer(3)]], + constant void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]) { + constant int64_t * index_sizes = (constant int64_t *)indexSizes; + constant int64_t * index_strides = (constant int64_t *)indexStrides; + int64_t offset = 0; + for (uint32_t i = 0; i < num_indices; i++) { + int64_t index = ((constant int64_t*)(indexAB.indexArray[i]))[offsets[thread_index].z / sizeof(int64_t)]; + if (index < 0) { + index += index_sizes[i]; + } + offset += index * index_strides[i]; + } + device void * out = (device void*)((device char*)outputData + offsets[thread_index].x + offset); + constant T * in = (constant T*)((constant char*)inputData + offsets[thread_index].y); + atomic_fetch_add_relaxed(out, *in); +} + +template +[[host_name("index_put_accumulate_32bit_float")]] +kernel void atomic_index_put_accumulate(constant IndexAB & indexAB [[buffer(0)]], + constant void * indexSizes [[buffer(1)]], + constant void * indexStrides [[buffer(2)]], + constant uint3 * offsets [[buffer(3)]], + constant void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); 
+template +[[host_name("index_put_accumulate_32bit_int")]] +kernel void index_put_accumulate_native_dtypes(constant IndexAB & indexAB [[buffer(0)]], + constant void * indexSizes [[buffer(1)]], + constant void * indexStrides [[buffer(2)]], + constant uint3 * offsets [[buffer(3)]], + constant void * inputData [[buffer(4)]], + device void * outputData [[buffer(5)]], + uint thread_index [[thread_position_in_grid]]); +)INDEX_METAL"; +} +} diff --git a/aten/src/ATen/mps/MPSAllocator.h b/aten/src/ATen/mps/MPSAllocator.h index ee8712d227ce..d739e8956d81 100644 --- a/aten/src/ATen/mps/MPSAllocator.h +++ b/aten/src/ATen/mps/MPSAllocator.h @@ -1,24 +1,11 @@ // Copyright © 2022 Apple Inc. -#include -#include -#include -#include -#include - -#include +#include #include #include #include -#include #include - -#ifdef __OBJC__ -#include -#include -#include -#include -#endif +#include // this implementation is based on CUDACachingAllocator. // It utilizes Metal Heaps to improve the performance with buffer allocation. @@ -47,16 +34,32 @@ namespace HeapAllocator { #define MB(x) round_page(x * 1048576UL) -static const size_t kMaxSmallAlloc = MB(1); // largest "small" allocation is 1 MiB -static const size_t kMinLargeAlloc = MB(10); // allocations between 1 and 10 MiB may use kLargeHeap -static const size_t kSmallHeap = MB(8); // "small" allocations are packed in 8 MiB heaps -static const size_t kLargeHeap = MB(32); // "large" allocations may be packed in 32 MiB heaps -static const size_t kRoundLarge = MB(2); // round up large allocations to 2 MiB - -// TODO: check the caching performance of write-combined mode -constexpr MTLResourceOptions kCPUCacheMode = MTLResourceOptionCPUCacheModeDefault; -constexpr MTLResourceOptions kPrivateResourceOptions = kCPUCacheMode | MTLResourceStorageModePrivate; -constexpr MTLResourceOptions kSharedResourceOptions = kCPUCacheMode | MTLResourceStorageModeShared; +static const size_t kMaxSmallAlloc = MB(1); // largest "small" allocation is 1 MiB +static const size_t kMinLargeAlloc = MB(10); // allocations between 1 and 10 MiB may use kLargeHeap +static const size_t kRoundLarge = MB(2); // round up large allocations to 2 MiB +static const size_t kSmallHeap = MB(8); // "small" allocations are packed in 8 MiB heaps +static const size_t kLargeHeap = MB(32); // "large" allocations may be packed in 32 MiB heaps +static const size_t kXLargeHeapD = MB(128); // "extra large" allocations on Discrete devices may be packed in 128 MiB heaps +static const size_t kXLargeHeapU = MB(1024); // "extra large" allocations on Unified devices may be packed in 1 GiB heaps + +// buffer pools could be customized with a combination of usage flags +enum UsageFlags : uint32_t { + PRIVATE = 0, + SMALL = (1 << 0), // small heaps have sizes of kSmallHeap, and large ones kLargeHeap + SHARED = (1 << 1), // shared pools allocated on devices with unified memory; otherwise, private between host/device + MANAGED = (1 << 2), // managed storage mode + HAZARD = (1 << 3), // enables Automatic Hazard Tracking for the resources allocated on the pool + SCALAR = (1 << 4), // used to import CPU scalar values to GPU and use them in MPS Stream +}; +// debug verbosity flags +enum DebugVerbosity : uint32_t { + SILENT = 0, + PROFILING = (1 << 0), // print generic profiling data for total system memory usage + ALLOCATIONS = (1 << 1), // print buffer allocations + RECYCLES = (1 << 2), // print buffer recycling + RELEASES = (1 << 3), // print buffer releases + LARGE_ONLY = (1 << 4), // only log large buffer pool transactions +}; 
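The DebugVerbosity values above are bit flags, so several logging categories can be enabled at once through the PYTORCH_DEBUG_MPS_ALLOCATOR environment variable (parsed later in this patch with strtol, base 0). A minimal standalone sketch of that parse-and-test pattern; the enum values mirror the ones above and everything else is illustrative.

#include <cstdint>
#include <cstdlib>
#include <iostream>

enum DebugVerbosity : uint32_t {
  SILENT = 0,
  PROFILING = (1 << 0),
  ALLOCATIONS = (1 << 1),
  RECYCLES = (1 << 2),
  RELEASES = (1 << 3),
  LARGE_ONLY = (1 << 4),
};

int main() {
  const char* s = std::getenv("PYTORCH_DEBUG_MPS_ALLOCATOR");
  // base 0 accepts decimal, octal and hex, e.g. "3" or "0x3" for PROFILING|ALLOCATIONS
  uint32_t verbosity = s ? static_cast<uint32_t>(std::strtol(s, nullptr, 0)) : SILENT;
  if (verbosity & ALLOCATIONS) std::cout << "would log buffer allocations\n";
  if (verbosity & RELEASES)    std::cout << "would log buffer releases\n";
}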
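Stepping back to the Metal index kernels above: the atomic_fetch_add_relaxed helper emulates a float atomic add by looping on a 32-bit compare-exchange over the value's bit pattern. Below is a host-side C++ sketch of the same technique, with std::atomic<uint32_t> standing in for device atomic_uint and memcpy standing in for as_type reinterpretation; the function names are illustrative.

#include <atomic>
#include <cstdint>
#include <cstring>

static float bits_to_float(uint32_t u) { float f; std::memcpy(&f, &u, sizeof(f)); return f; }
static uint32_t float_to_bits(float f) { uint32_t u; std::memcpy(&u, &f, sizeof(u)); return u; }

// Add `value` to the float stored in `slot`, retrying until no other thread
// has modified the slot between the load and the exchange.
void atomic_add_float(std::atomic<uint32_t>& slot, float value) {
  uint32_t expected = slot.load(std::memory_order_relaxed);
  uint32_t desired = float_to_bits(bits_to_float(expected) + value);
  while (!slot.compare_exchange_weak(expected, desired,
                                     std::memory_order_relaxed,
                                     std::memory_order_relaxed)) {
    // compare_exchange refreshed `expected` with the current value; recompute.
    desired = float_to_bits(bits_to_float(expected) + value);
  }
}

int main() {
  std::atomic<uint32_t> slot{float_to_bits(1.5f)};
  atomic_add_float(slot, 2.25f);
  return bits_to_float(slot.load()) == 3.75f ? 0 : 1;  // 0 on success
}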
struct HeapBlock; @@ -70,11 +73,16 @@ struct BufferBlock bool in_use; HeapBlock* heap; id_t buf_id; + // counter to candidate least recently used buffers for garbage collection + uint32_t gc_count; + uint32_t use_count; + // counter to assign unique ids to buffer blocks + static uint64_t buffer_counter; BufferBlock(size_t Size, size_t RequestedSize = 0, const id Buffer = nullptr, - HeapBlock* Heap = nullptr, id_t BufID = 0) : + HeapBlock* Heap = nullptr) : buffer(Buffer), size(Size), requested_size(RequestedSize), - in_use(false), heap(Heap), buf_id(BufID) { } + in_use(false), heap(Heap), buf_id(++buffer_counter), gc_count(0), use_count(0) { } static bool Comparator(const BufferBlock* a, const BufferBlock* b) { return (a->size != b->size) ? a->size < b->size : (uintptr_t)a->buffer < (uintptr_t)b->buffer; @@ -83,10 +91,28 @@ struct BufferBlock assert(((Alignment - 1) & Alignment) == 0); return ((Size + Alignment - 1) & ~(Alignment - 1)); } + uint32_t retainCount() const { return [buffer retainCount]; } }; typedef bool (*BufferComparison)(const BufferBlock*, const BufferBlock*); struct BufferPool; +struct AllocParams +{ + AllocParams(size_t Alloc_Size, size_t Requested_Size, BufferPool* Pool) : + search_key(Alloc_Size), pool(Pool), buffer_block(nullptr), + requested_size(Requested_Size), has_memory_pressure(false) { } + size_t size() const { return search_key.size; } + + BufferBlock search_key; + BufferPool* pool; + BufferBlock* buffer_block; + size_t requested_size; + // true if we exceed the low watermark limit. In this case + // we apply strategies to relieve the pressure before allocation. + bool has_memory_pressure; + // true if we're allocating on a unified memory device + bool has_unified_memory; +}; struct HeapBlock { @@ -94,37 +120,68 @@ struct HeapBlock struct { size_t total, available; } size; BufferPool* pool; unsigned int n_buffers; + id_t heap_id; + // indicates if we split this heap to sub-allocate 'several' buffers (otherwise single buffer) + bool is_split; + // counter to assign unique ids to heap blocks + static uint64_t heap_counter; HeapBlock(size_t Size, const id Heap = nullptr, BufferPool *Pool = nullptr) : - heap(Heap), size({.total = Size, .available = Size}), pool(Pool), n_buffers(0) { } + heap(Heap), size({.total = Size, .available = Size}), pool(Pool), + n_buffers(0), heap_id(++heap_counter), is_split(true) { } + + static MTLResourceOptions getOptions(uint32_t usage) { + // TODO: check the caching performance of write-combined mode + MTLResourceOptions options = MTLResourceCPUCacheModeDefaultCache; + + if (usage & UsageFlags::MANAGED) + options |= MTLResourceStorageModeManaged; + else if (usage & UsageFlags::SHARED) + options |= MTLResourceStorageModeShared; + else + options |= MTLResourceStorageModePrivate; - static MTLResourceOptions getOptions(bool SharedStorage = false) { return SharedStorage ? kSharedResourceOptions : kPrivateResourceOptions; } + options |= (usage & UsageFlags::HAZARD) ? MTLResourceHazardTrackingModeTracked : MTLResourceHazardTrackingModeUntracked; + + return options; + } - static id createMTLHeap(id device, size_t size, bool is_shared) { - id heap = nil; + static HeapBlock* createHeapBlock(AllocParams& params, id device, uint32_t usage) { + HeapBlock *heapBlock = nullptr; + bool is_split = true; + const size_t size = params.size(); MTLHeapDescriptor *d = [MTLHeapDescriptor new]; if (d) { + const size_t kXLargeHeap = params.has_unified_memory ? 
kXLargeHeapU : kXLargeHeapD; if (size <= kMaxSmallAlloc) { d.size = kSmallHeap; } else if (size < kMinLargeAlloc) { d.size = kLargeHeap; + } else if (size < kXLargeHeap / 2 && !params.has_memory_pressure) { + d.size = kXLargeHeap; } else { d.size = kRoundLarge * ((size + kRoundLarge - 1) / kRoundLarge); + is_split = false; } - d.storageMode = is_shared ? MTLStorageModeShared : MTLStorageModePrivate; + d.storageMode = (usage & UsageFlags::SHARED) ? MTLStorageModeShared : MTLStorageModePrivate; d.cpuCacheMode = MTLCPUCacheModeDefaultCache; // this automatically handles Metal buffer access synchronizations at the // cost of slightly lower performance. - d.hazardTrackingMode = MTLHazardTrackingModeTracked; - d.resourceOptions = getOptions(is_shared) | (MTLHazardTrackingModeTracked << MTLResourceHazardTrackingModeShift); + d.hazardTrackingMode = (usage & UsageFlags::HAZARD) ? MTLHazardTrackingModeTracked : MTLHazardTrackingModeUntracked; + d.resourceOptions = getOptions(usage); d.type = MTLHeapTypeAutomatic; - heap = [device newHeapWithDescriptor: d]; + id heap = [device newHeapWithDescriptor: d]; if (heap) { [heap setPurgeableState:MTLPurgeableStateNonVolatile]; + const size_t heap_size = heapAvailableSize(heap); + heapBlock = new HeapBlock(heap_size, heap, params.pool); + if (heapBlock) { + heapBlock->is_split = is_split; + } } [d release]; } - return heap; + return heapBlock; } static bool Comparator(const HeapBlock* a, const HeapBlock* b) { return a->size.available < b->size.available; @@ -132,82 +189,106 @@ struct HeapBlock static NSUInteger heapAvailableSize(id heap, size_t Alignment = vm_page_size) { return [heap maxAvailableSizeWithAlignment:Alignment]; } - id newMTLBuffer(size_t length, bool is_shared) { - id buf = [heap newBufferWithLength:length options:getOptions(is_shared)]; + id newMTLBuffer(size_t length, uint32_t usage) { + id buf = [heap newBufferWithLength:length options:getOptions(usage)]; if (buf) { - size.available = heapAvailableSize(heap); + updateAvailableSize(); n_buffers++; } return buf; } - void releaseMTLBuffer(id buffer) { + // returns the retainCount before releasing the buffer + uint32_t releaseMTLBuffer(id& buffer) { + const uint32_t retainCount = [buffer retainCount]; [buffer release]; - size.available = heapAvailableSize(heap); + buffer = nil; + updateAvailableSize(); n_buffers--; + return retainCount; } - void releaseMTLHeap() { + // returns the retainCount before releasing the heap + uint32_t releaseMTLHeap() { + const uint32_t retainCount = [heap retainCount]; TORCH_INTERNAL_ASSERT(!n_buffers); // assert if heap isn't empty + [heap setPurgeableState:MTLPurgeableStateEmpty]; [heap release]; + heap = nil; size.available = 0; + return retainCount; } + uint32_t retainCount() const { return [heap retainCount]; } + void updateAvailableSize() { size.available = heapAvailableSize(heap); } }; typedef bool (*HeapComparison)(const HeapBlock*, const HeapBlock*); struct BufferPool { - BufferPool(const id Device, bool Small, bool Shared) : - device(Device), is_small(Small), is_shared(Shared), - heaps(HeapBlock::Comparator), buffers(BufferBlock::Comparator) { } + BufferPool(const id Device, uint32_t Usage) : + device(Device), usage(Usage), n_buffers(0), allocated_size(0), available_size(0), + heaps(HeapBlock::Comparator), buffers(BufferBlock::Comparator) { } const id device; - // small heaps have sizes of kSmallHeap, and large ones kLargeHeap - const bool is_small; - // private pools allocated on device memory; otherwise, shared between host/device - const bool is_shared; 
+ // usage flags to customize the pool for various purposes (see UsageFlags enum) + const uint32_t usage; + // total number of buffers in the pool + uint32_t n_buffers; + // total allocations size on this pool + size_t allocated_size; + // total memory available in the pool + size_t available_size; // list of heaps ordered by their "available" (not total) memory size std::set heaps; // list of only "available" buffers in the pool (i.e., buffers not in-use) std::set buffers; -}; - -struct AllocParams -{ - AllocParams(size_t Alloc_Size, size_t Requested_Size, BufferPool* Pool) : - search_key(Alloc_Size), pool(Pool), - buffer_block(nullptr), requested_size(Requested_Size) {} - size_t size() const { return search_key.size; } - - BufferBlock search_key; - BufferPool* pool; - BufferBlock* buffer_block; - size_t requested_size; + // list of heaps pending size update + std::unordered_set heaps_pending_update; }; class MPSHeapAllocatorImpl { public: explicit MPSHeapAllocatorImpl() : - m_device(at::mps::MPSDevice::getInstance()->device()), - m_large_pool_shared(m_device, false, true), m_large_pool_private(m_device, false, false), - m_small_pool_shared(m_device, true , true), m_small_pool_private(m_device, true , false), - m_total_allocated_memory(0), m_max_buffer_size([m_device maxBufferLength]), - m_set_fraction(false), m_enable_debug_info(false) { } + m_device(at::mps::MPSDevice::getInstance()->device()), + m_large_pool_shared (m_device, UsageFlags::SHARED | UsageFlags::HAZARD), + m_large_pool_private(m_device, UsageFlags::PRIVATE | UsageFlags::HAZARD), + m_small_pool_shared (m_device, UsageFlags::SMALL | UsageFlags::SHARED | UsageFlags::HAZARD), + m_small_pool_private(m_device, UsageFlags::SMALL | UsageFlags::PRIVATE | UsageFlags::HAZARD), + // no Hazard Tracking required for the Scalar pool (synchronized manually) + m_scalar_pool(m_device, UsageFlags::SMALL | UsageFlags::SHARED | UsageFlags::SCALAR), + m_total_allocated_memory(0), m_max_buffer_size([m_device maxBufferLength]), + m_stream(getDefaultMPSStream()) + { + init_allocator(); + } // interface exposed to at::Allocator - id Malloc(size_t size, bool sharedStorage); - void Free(void* ptr); - void EmptyCache(); + id malloc(size_t size, uint32_t usage); + void free(void* ptr); + void emptyCache(); + // interface exposed to internal MPS operations bool isSharedBuffer(void* ptr); ssize_t getRequestedBufferSize(void* ptr); void setBufferShape(void* ptr, const IntArrayRef& shape); IntArrayRef getBufferShape(void* ptr); - + id allocScalarBufferWithValue(void* value, size_t size); + // this indicates how far (in Megabytes) the current total allocations are from the + // low watermark limit which is used to detect if we're under memory pressure + // This returns zero if we've reached the low watermark limit + ssize_t getLowWatermarkValue(); + + bool getDebugVerbosity() const { return m_debug_verbosity; } + size_t getMaxTotalAllowedSize() const { return m_max_total_allowed_size; } + size_t getLowWatermarkLimit() const { return m_low_watermark_limit; } inline id Device() const { return m_device; } - void enable_debug_info() { m_enable_debug_info = true; } - bool debug_info_enabled() const { return m_enable_debug_info; } - void set_shared_storage_mode(bool useSharedStorage); private: + // (see m_high_watermark_ratio for description) + constexpr static double default_high_watermark_ratio = 0.0; + // (see m_low_watermark_ratio for description) + // on unified memory, we could allocate beyond the recommendedMaxWorkingSetSize + constexpr static double 
default_low_watermark_ratio_unified = 1.5; + constexpr static double default_low_watermark_ratio_discrete = 1.0; + const id m_device; std::mutex m_mutex; // allocated buffers by device pointer @@ -216,40 +297,69 @@ class MPSHeapAllocatorImpl BufferPool m_large_pool_shared, m_large_pool_private; // unallocated cached buffers 1 MB or smaller BufferPool m_small_pool_shared, m_small_pool_private; + // small cached buffers to import scalar values into MPS stream + BufferPool m_scalar_pool; // total memory allocated by HeapAllocator size_t m_total_allocated_memory; // max buffer size allowed by Metal size_t m_max_buffer_size; - // sets a soft upper bound to limit the total allocations - bool m_set_fraction; - // use "PYTORCH_DEBUG_MPS_ALLOCATOR" env-var to enable debug info - bool m_enable_debug_info; - - HeapBlock* get_free_heap(AllocParams& p); - bool get_free_buffer(AllocParams& p); + // maximum total size allowed to be allocated + size_t m_max_total_allowed_size; + // high watermark ratio is a hard limit for the total allowed allocations (between 0 and 1) + // 0 means unlimited (would spill to disk or system failure if OOM) + // 1 is maximum allowed by device.recommendedMaxWorkingSetSize + // (e.g., value 0.95 means we allocate up to 95% of total memory; beyond that allocations fail) + double m_high_watermark_ratio; + // low watermark ratio is a soft limit to attempt limiting memory allocations up to the lower watermark + // level by garbage collection or committing command buffers more frequently (a.k.a, adaptive commit). + // Value between 0 to m_high_watermark_ratio (setting 0.0 disables adaptive commit and garbage collection) + // (e.g., value 0.9 means we 'attempt' to limit allocations up to 90% of total memory) + double m_low_watermark_ratio; + // low watermark size limit (in Bytes) at the time we initialize the allocator + size_t m_low_watermark_limit; + // use "PYTORCH_DEBUG_MPS_ALLOCATOR" env-var to set debug verbosity + uint32_t m_debug_verbosity; + // default MPS stream + MPSStream* m_stream; + + void init_allocator(); + HeapBlock* get_free_heap(AllocParams& params); + bool get_free_buffer(AllocParams& params); BufferBlock* get_allocated_buffer_block(void* ptr); - bool alloc_buffer(AllocParams& p); + BufferBlock* alloc_buffer_block(size_t size, uint32_t usage); + bool alloc_buffer(AllocParams& params); void free_buffer(BufferBlock* buffer_block); - void release_buffer(BufferBlock* buffer_block, bool remove_empty_heap = true); + // returns true if the container heap is also released + bool release_buffer(BufferBlock* buffer_block, bool remove_empty_heap = true); void release_buffers(BufferPool& pool); - bool release_available_cached_buffers(const AllocParams& p); + bool release_available_cached_buffers(AllocParams& params); bool release_cached_buffers(); - void trigger_memory_callbacks(BufferBlock* buffer_block, IMpsAllocatorCallback::EventType event); - - BufferPool& get_pool(size_t Size, bool useShared) { - return Size <= kMaxSmallAlloc ? (useShared ? m_small_pool_shared : m_small_pool_private) : - (useShared ? m_large_pool_shared : m_large_pool_private); + // free unused cached blocks to reclaim GPU memory if memory pressure is high + void garbage_collect_cached_buffers(AllocParams& params); + + BufferPool& get_pool(size_t Size, uint32_t usage) { + if (usage & UsageFlags::SCALAR) + return m_scalar_pool; + return Size <= kMaxSmallAlloc ? ((usage & UsageFlags::SHARED) ? m_small_pool_shared : m_small_pool_private) : + ((usage & UsageFlags::SHARED) ? 
m_large_pool_shared : m_large_pool_private); } - size_t get_allocation_size(size_t Length, bool useShared) { + size_t get_allocation_size(size_t Length, uint32_t usage) const { MTLSizeAndAlign sizeAlign = [m_device heapBufferSizeAndAlignWithLength:Length - options:HeapBlock::getOptions(useShared)]; + options:HeapBlock::getOptions(usage)]; return BufferBlock::alignUp(sizeAlign.size, sizeAlign.align); } - // TODO: make this configurable - static size_t max_split_size() { return std::numeric_limits::max(); } // maximum size of device memory available for allocation in current process - size_t max_available_size() const { return [m_device recommendedMaxWorkingSetSize] - [m_device currentAllocatedSize]; } + size_t max_device_size() const { return [m_device recommendedMaxWorkingSetSize]; } + // there are implicit allocations from MPS backend, so we need to query the 'device' for + // total allocated size instead of manually tracking in MPSAllocator + size_t current_allocated_size() const { return [m_device currentAllocatedSize]; } + + void trigger_memory_callbacks(BufferBlock* buffer_block, IMpsAllocatorCallback::EventType event) const { + for (const auto& name : MPSAllocatorCallbacksRegistry()->Keys()) { + MPSAllocatorCallbacksRegistry()->Create(name)->executeMPSAllocatorCallback(buffer_block->buffer, event); + } + } // TODO: make a common function to do size unit conversions in PyTorch. static std::string format_size(uint64_t size) { @@ -266,5 +376,16 @@ class MPSHeapAllocatorImpl } // namespace HeapAllocator +// interface exposed to internal MPS operations + +// get the requested non-aligned size of an MTL buffer +ssize_t get_requested_buffer_size(void* ptr); +// retrieve the shape of a base tensor from a view tensor +IntArrayRef get_buffer_shape(void* ptr); +// set the shape of a base tensor from a view tensor +void set_buffer_shape(void* ptr, const IntArrayRef& shape); +// allocate a buffer from a specialized pool to import CPU scalars into GPU +DataPtr allocate_scalar_buffer(void* value, size_t size); + } // namespace mps } // namespace at diff --git a/aten/src/ATen/mps/MPSAllocator.mm b/aten/src/ATen/mps/MPSAllocator.mm index 2433acbc050b..a40ddd7992a2 100644 --- a/aten/src/ATen/mps/MPSAllocator.mm +++ b/aten/src/ATen/mps/MPSAllocator.mm @@ -4,6 +4,7 @@ #include #include #include +#include namespace at { namespace mps { @@ -12,129 +13,237 @@ namespace HeapAllocator { -HeapBlock* MPSHeapAllocatorImpl::get_free_heap(AllocParams& p) +uint64_t BufferBlock::buffer_counter = 0; +uint64_t HeapBlock::heap_counter = 0; + +void MPSHeapAllocatorImpl::init_allocator() +{ + // debug verbosity flags (see DebugVerbosity enum) + static const char *verbosity_str = getenv("PYTORCH_DEBUG_MPS_ALLOCATOR"); + m_debug_verbosity = verbosity_str ? strtol(verbosity_str, nullptr, 0) : DebugVerbosity::SILENT; + + // on unified memory, we set the allowed upper bound to twice the size of recommendedMaxWorkingSetSize. + const double high_watermark_upper_bound = m_device.hasUnifiedMemory ? 2.0 : 1.0; + + static const char *high_watermark_ratio_str = getenv("PYTORCH_MPS_HIGH_WATERMARK_RATIO"); + m_high_watermark_ratio = high_watermark_ratio_str ? strtod(high_watermark_ratio_str, nullptr) : default_high_watermark_ratio; + TORCH_CHECK(m_high_watermark_ratio >= 0.0 && m_high_watermark_ratio <= high_watermark_upper_bound, + "invalid high watermark ratio ", m_high_watermark_ratio); + + m_max_total_allowed_size = (m_high_watermark_ratio == 0.0) ? 
std::numeric_limits::max() : + static_cast(m_high_watermark_ratio * (double)max_device_size()); + // used for comparison with lower_watermark_ratio + const double high_watermark_limit = m_high_watermark_ratio == 0.0 ? high_watermark_upper_bound : m_high_watermark_ratio; + const double default_low_watermark_ratio = m_device.hasUnifiedMemory ? default_low_watermark_ratio_unified : + default_low_watermark_ratio_discrete; + static const char *low_watermark_ratio_str = getenv("PYTORCH_MPS_LOW_WATERMARK_RATIO"); + m_low_watermark_ratio = low_watermark_ratio_str ? strtod(low_watermark_ratio_str, nullptr) : default_low_watermark_ratio; + TORCH_CHECK(m_low_watermark_ratio >= 0.0 && m_low_watermark_ratio <= high_watermark_limit, + "invalid low watermark ratio ", m_low_watermark_ratio); + // we use this to detect if there's memory pressure + m_low_watermark_limit = (m_low_watermark_ratio == 0.0) ? std::numeric_limits::max() : + static_cast(m_low_watermark_ratio * (double)max_device_size()); +} + +HeapBlock* MPSHeapAllocatorImpl::get_free_heap(AllocParams& params) { - BufferPool *pool = p.pool; - HeapBlock *heapBlock = nullptr; - HeapBlock search_key(p.size()); - - auto it = pool->heaps.lower_bound(&search_key); - if (it == pool->heaps.end()) { - id heap = HeapBlock::createMTLHeap(pool->device, p.size(), pool->is_shared); - if (heap) { - size_t heap_size = HeapBlock::heapAvailableSize(heap); - heapBlock = new HeapBlock(heap_size, heap, pool); - - if (debug_info_enabled()) { - static unsigned int heap_counter = 0; + BufferPool& pool = *params.pool; + HeapBlock *heap_block = nullptr; + HeapBlock search_key(params.size()); + + auto it = pool.heaps.lower_bound(&search_key); + if (it == pool.heaps.end()) { + heap_block = HeapBlock::createHeapBlock(params, pool.device, pool.usage); + if (heap_block) { + if (m_debug_verbosity & DebugVerbosity::ALLOCATIONS) { std::cerr << "\nAllocated " - << (pool->is_small ? "small " : "large ") - << (pool->is_shared ? "shared " : "private ") - << "heap of size " << format_size(heap_size) - << " (#heaps: " << (++heap_counter) - << ", free memory: " << format_size(max_available_size()) << ")\n"; + << ((pool.usage & UsageFlags::SHARED) ? "shared " : "private ") + << " heap #" << heap_block->heap_id + << " of size " << format_size(heap_block->size.total) + << " (#heaps: " << (pool.heaps.size() + 1) + << ", current allocated: " << format_size(current_allocated_size()) << ")\n"; } } } else { - heapBlock = *it; + heap_block = *it; // remove and re-insert heap in the set later after a buffer is created. // this ensures updating the order of heaps based on their new available sizes - pool->heaps.erase(it); + pool.heaps.erase(it); } - return heapBlock; + return heap_block; } -bool MPSHeapAllocatorImpl::alloc_buffer(AllocParams& p) +bool MPSHeapAllocatorImpl::alloc_buffer(AllocParams& params) { - if (m_set_fraction && m_total_allocated_memory + p.size() > max_available_size()) + if (m_max_total_allowed_size != std::numeric_limits::max() && + current_allocated_size() + params.size() > m_max_total_allowed_size) return false; - HeapBlock *heap = get_free_heap(p); + HeapBlock *heap = get_free_heap(params); if (!heap) return false; // this will cause releasing pool buffers to free up memory - id buffer = heap->newMTLBuffer(p.size(), p.pool->is_shared); + BufferPool& pool = *params.pool; + + id buffer = heap->newMTLBuffer(params.size(), pool.usage); // this should never happen as the backing memory (i.e., heap) was allocated successfully. 
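As a note on the watermark arithmetic set up in init_allocator() above: a ratio of 0 disables the corresponding limit, otherwise the limit is the ratio times the device's recommended working-set size. A minimal standalone sketch of that computation, where device_max stands in for [m_device recommendedMaxWorkingSetSize] and the 8 GiB figure is only an example value.

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>

// ratio == 0 means "unlimited"; otherwise scale the device's working-set size.
size_t watermark_limit(double ratio, size_t device_max) {
  if (ratio == 0.0) return std::numeric_limits<size_t>::max();
  return static_cast<size_t>(ratio * static_cast<double>(device_max));
}

int main() {
  const char* s = std::getenv("PYTORCH_MPS_HIGH_WATERMARK_RATIO");
  double high_ratio = s ? std::strtod(s, nullptr) : 0.0;  // 0.0: no upper limit
  size_t limit = watermark_limit(high_ratio, /*device_max=*/8ull << 30);
  (void)limit;  // allocations beyond `limit` would take the OOM path described below
  return 0;
}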
TORCH_INTERNAL_ASSERT(buffer); // insert heap after a buffer was created on it to update the order of heap's set - p.pool->heaps.insert(heap); - p.buffer_block = new BufferBlock(p.size(), p.requested_size, buffer, heap, m_allocated_buffers.size() + 1); - m_allocated_buffers[p.buffer_block->buffer] = p.buffer_block; - m_total_allocated_memory += p.size(); - - if (debug_info_enabled()) { + pool.heaps.insert(heap); + params.buffer_block = new BufferBlock(params.size(), params.requested_size, buffer, heap); + m_allocated_buffers[params.buffer_block->buffer] = params.buffer_block; + m_total_allocated_memory += params.size(); + pool.allocated_size += params.size(); + pool.n_buffers++; + + if ((m_debug_verbosity & DebugVerbosity::ALLOCATIONS) && + (!(m_debug_verbosity & DebugVerbosity::LARGE_ONLY) || !(pool.usage & UsageFlags::SMALL))) { std::cerr << "Allocated " - << (p.pool->is_shared ? "shared" : "private") - << " buffer #" << p.buffer_block->buf_id - << " of size " << format_size(p.size()) - << " at " << p.buffer_block->buffer - << " (requested size: " << format_size(p.requested_size) - << ", heap size: " << format_size(heap->size.available) - << ", total allocated: " << format_size(m_total_allocated_memory) << ")\n"; + << ((params.pool->usage & UsageFlags::SHARED) ? "shared" : "private") + << ((params.pool->usage & UsageFlags::SCALAR) ? " scalar" : "") + << " buffer #" << params.buffer_block->buf_id + << " of size " << format_size(params.size()) + << " at " << params.buffer_block->buffer + << " from heap #" << heap->heap_id + << " (requested: " << format_size(params.requested_size) + << ", heap: " << format_size(heap->size.available) + << ", total: " << format_size(m_total_allocated_memory) << ")\n"; } return true; } -bool MPSHeapAllocatorImpl::get_free_buffer(AllocParams& p) +bool MPSHeapAllocatorImpl::get_free_buffer(AllocParams& params) { - BufferPool& pool = *p.pool; - auto it = pool.buffers.lower_bound(&p.search_key); - if (it == pool.buffers.end()) - return false; - // do not return an oversized buffer for a large request - // allow oversized buffer size to be rounded up but within a limit - if ((p.size() < max_split_size() && (*it)->size >= max_split_size()) || - ((p.size() >= max_split_size()) && ((*it)->size >= p.size() + kLargeHeap))) + // this helps to monitor "implicit" allocations from MPS backend and to prevent OOM and system failure. + if (m_high_watermark_ratio > 0.0 && current_allocated_size() + params.size() > m_max_total_allowed_size) return false; - p.buffer_block = *it; - pool.buffers.erase(it); - if (debug_info_enabled()) { + BufferPool& pool = *params.pool; + // track buffer reuse intervals only on large pool when low watermark limit is enabled. + if (m_low_watermark_ratio > 0.0 && !(pool.usage & UsageFlags::SMALL)) { + for (auto& b : pool.buffers) { + ++b->gc_count; + } + } + auto it = pool.buffers.lower_bound(¶ms.search_key); + if (it != pool.buffers.end()) { + BufferBlock* buffer_block = *it; + + // the logic in here is simple: keep reusing existing heaps capacity as long as possible (by splitting + // or releasing oversize buffers, if required), and avoid 'new' heap allocations as much as possible. 
+ if (buffer_block->size <= params.size() + kLargeHeap) { + // return the existing buffer if it already fits the requested size (i.e., not oversize) + params.buffer_block = buffer_block; + } else { + HeapBlock search_key(params.size()); + // if there's an 'existing' heap with enough capacity, then don't + // return the oversize buffer and sub-allocate from that existing heap. + if (pool.heaps.lower_bound(&search_key) != pool.heaps.end()) { + params.buffer_block = nullptr; + } else if (buffer_block->retainCount() <= 1) { + // otherwise if buffer is releasable immediately, we make room by releasing the + // buffer and reuse the new space within its heap container for the new smaller buffer allocation + release_buffer(buffer_block, false); + // this will skip unnecessary garbage collection as we'll reuse the newly released space + params.has_memory_pressure = false; + } else if (params.has_memory_pressure) { + // the oversized buffer is busy and not reusable at the moment. So release it (and potentially its heap container) + // in allocator, and ARC will later free up its backing memory when the busy command buffer finishes. + release_buffer(buffer_block, true); + } else { + // only if there's no memory pressure, we'll reuse the oversized buffer + params.buffer_block = buffer_block; + } + } + } + + if (!params.buffer_block) + return false; // this will make allocator to allocate a new buffer + + pool.buffers.erase(params.buffer_block); + params.buffer_block->gc_count = 0; + pool.available_size -= params.buffer_block->size; + + if ((m_debug_verbosity & DebugVerbosity::RECYCLES) && + (!(m_debug_verbosity & DebugVerbosity::LARGE_ONLY) || !(pool.usage & UsageFlags::SMALL))) { std::cerr << "Reusing " - << (p.pool->is_shared ? "shared" : "private") - << " buffer #" << p.buffer_block->buf_id - << " of size " << format_size(p.buffer_block->size) - << " at " << p.buffer_block->buffer - << " (requested size: " << format_size(p.requested_size) << ")\n"; + << ((params.pool->usage & UsageFlags::SHARED) ? "shared" : "private") + << ((params.pool->usage & UsageFlags::SCALAR) ? " scalar" : "") + << " buffer #" << params.buffer_block->buf_id + << " of size " << format_size(params.buffer_block->size) + << " at " << params.buffer_block->buffer + << " (requested: " << format_size(params.requested_size) + << ", use#: " << params.buffer_block->use_count + 1 + << ", retain#: " << params.buffer_block->retainCount() << ")\n"; } return true; } -id MPSHeapAllocatorImpl::Malloc(size_t size, bool sharedStorage) +BufferBlock* MPSHeapAllocatorImpl::alloc_buffer_block(size_t size, uint32_t usage) { TORCH_CHECK(size < m_max_buffer_size, "Invalid buffer size: ", format_size(size)); - std::lock_guard lock(m_mutex); - - size_t alloc_size = get_allocation_size(size, sharedStorage); - auto& pool = get_pool(alloc_size, sharedStorage); + size_t alloc_size = get_allocation_size(size, usage); + auto& pool = get_pool(alloc_size, usage); AllocParams params(alloc_size, size, &pool); + // we care about memory pressure if only we're allocating large buffers when the + // low watermark limit has been reached + params.has_memory_pressure = !(pool.usage & UsageFlags::SMALL) && getLowWatermarkValue() <= 0; + params.has_unified_memory = m_device.hasUnifiedMemory; + + // first, try to get a block from the existing pool. 
+ bool block_found = get_free_buffer(params); + if (!block_found) { + // do garbage collection if memory pressure is high and there's enough memory in pool + if (params.has_memory_pressure && alloc_size < pool.available_size) { + garbage_collect_cached_buffers(params); + } - bool block_found = - // Search pool - get_free_buffer(params) || - // Attempt allocate - alloc_buffer(params) || - // Free enough available cached blocks to satisfy alloc and retry alloc. - (release_available_cached_buffers(params) && alloc_buffer(params)) || - // Free all non-split cached buffers and retry alloc. - (release_cached_buffers() && alloc_buffer(params)); + block_found = + // Attempt allocate + alloc_buffer(params) || + // Free enough available cached blocks to satisfy alloc and retry alloc. + (release_available_cached_buffers(params) && alloc_buffer(params)) || + // Free all cached buffers and retry alloc. + (release_cached_buffers() && alloc_buffer(params)); + } BufferBlock* buffer_block = params.buffer_block; - TORCH_INTERNAL_ASSERT(block_found && buffer_block); + + // the OOM could be triggered if: + // 1- the High Watermark limit has been reached (if enabled) + // 2- ran out of device memory, or the memory fragmentation is so high that a contiguous + // chunk of requested size couldn't be found. + if (!block_found || !buffer_block) { + if (m_high_watermark_ratio > 0.0) { + TORCH_CHECK(false, "MPS backend out of memory (MPS allocated: ", format_size(m_total_allocated_memory), + ", other allocations: ", format_size(current_allocated_size() - m_total_allocated_memory), + ", max allowed: ", format_size(m_max_total_allowed_size), "). Tried to allocate ", format_size(alloc_size), + " on ", ((pool.usage & UsageFlags::SHARED) ? "shared" : "private"), + " pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure)."); + } else { + TORCH_CHECK(false, "MPS backend out of memory (MPS allocated: ", format_size(m_total_allocated_memory), + ", other allocations: ", format_size(current_allocated_size() - m_total_allocated_memory), + "). Tried to allocate ", format_size(alloc_size), + " on ", ((pool.usage & UsageFlags::SHARED) ? "shared" : "private"), " pool."); + } + } buffer_block->in_use = true; - return buffer_block->buffer; + buffer_block->use_count++; + + return buffer_block; } void MPSHeapAllocatorImpl::free_buffer(BufferBlock* buffer_block) { TORCH_INTERNAL_ASSERT(buffer_block->in_use); - trigger_memory_callbacks(buffer_block, IMpsAllocatorCallback::EventType::FREED); - buffer_block->in_use = false; - buffer_block->shape.clear(); // reset shape - BufferPool *pool = buffer_block->heap->pool; + + BufferPool& pool = *buffer_block->heap->pool; // Makes sure the BufferBlock* isn't already present in the pool we're freeing it back into. 
- TORCH_INTERNAL_ASSERT(pool->buffers.insert(buffer_block).second); + TORCH_INTERNAL_ASSERT(pool.buffers.insert(buffer_block).second); + pool.available_size += buffer_block->size; + buffer_block->shape.clear(); // reset shape + buffer_block->in_use = false; } BufferBlock* MPSHeapAllocatorImpl::get_allocated_buffer_block(void* ptr) @@ -146,105 +255,77 @@ return it->second; } -void MPSHeapAllocatorImpl::trigger_memory_callbacks(BufferBlock* buffer_block, IMpsAllocatorCallback::EventType event) { - for (const auto& name : MPSAllocatorCallbacksRegistry()->Keys()) { - MPSAllocatorCallbacksRegistry()->Create(name)->executeMPSAllocatorCallback(buffer_block->buffer, event); - } -} - -bool MPSHeapAllocatorImpl::isSharedBuffer(void* ptr) -{ - std::lock_guard lock(m_mutex); - - BufferBlock *buffer_block = get_allocated_buffer_block(ptr); - // it's OK for the buffer_block to not exist yet - return buffer_block && buffer_block->heap->pool->is_shared; -} - -ssize_t MPSHeapAllocatorImpl::getRequestedBufferSize(void* ptr) -{ - std::lock_guard lock(m_mutex); - - BufferBlock *buffer_block = get_allocated_buffer_block(ptr); - if (buffer_block) - return (ssize_t) buffer_block->requested_size; - // this indicates the passed buffer pointer wasn't found - return -1; -} - -void MPSHeapAllocatorImpl::setBufferShape(void* ptr, const IntArrayRef& shape) -{ - std::lock_guard lock(m_mutex); - - BufferBlock *buffer_block = get_allocated_buffer_block(ptr); - TORCH_INTERNAL_ASSERT(buffer_block, "failed to find the buffer ", ptr); - // note that the IntArrayRef doesn't own the underlying data, and the backing - // memory for shape data must persist as long as the buffer is in use. - // So we need to copy to vector. - buffer_block->shape = shape.vec(); -} - -IntArrayRef MPSHeapAllocatorImpl::getBufferShape(void* ptr) -{ - std::lock_guard lock(m_mutex); - - BufferBlock *buffer_block = get_allocated_buffer_block(ptr); - if (buffer_block && buffer_block->shape.size() > 0) - return IntArrayRef{buffer_block->shape}; - - return IntArrayRef(); -} - -void MPSHeapAllocatorImpl::Free(void* ptr) +bool MPSHeapAllocatorImpl::release_buffer(BufferBlock* buffer_block, bool remove_empty_heap) { - std::lock_guard lock(m_mutex); - - BufferBlock *buffer_block = get_allocated_buffer_block(ptr); - TORCH_INTERNAL_ASSERT(buffer_block); - free_buffer(buffer_block); -} - -void MPSHeapAllocatorImpl::EmptyCache() -{ - std::lock_guard lock(m_mutex); - release_cached_buffers(); -} - -void MPSHeapAllocatorImpl::release_buffer(BufferBlock* buffer_block, bool remove_empty_heap) -{ - trigger_memory_callbacks(buffer_block, IMpsAllocatorCallback::EventType::RELEASED); - - HeapBlock *heap = buffer_block->heap; - BufferPool *pool = heap->pool; + HeapBlock *heap_block = buffer_block->heap; + BufferPool& pool = *heap_block->pool; m_total_allocated_memory -= buffer_block->size; + pool.allocated_size -= buffer_block->size; + pool.available_size -= buffer_block->size; m_allocated_buffers.erase(buffer_block->buffer); - pool->buffers.erase(buffer_block); + pool.buffers.erase(buffer_block); + pool.n_buffers--; // will re-insert later to keep the heaps list sorted based on heap's new available size (if heap not empty) - pool->heaps.erase(heap); - heap->releaseMTLBuffer(buffer_block->buffer); - if (debug_info_enabled()) { + pool.heaps.erase(heap_block); + uint32_t retainCount = heap_block->releaseMTLBuffer(buffer_block->buffer); + + if ((m_debug_verbosity & DebugVerbosity::RELEASES) && + (!(m_debug_verbosity & DebugVerbosity::LARGE_ONLY) || !(pool.usage & 
UsageFlags::SMALL))) { std::cerr << "Released buffer #" << buffer_block->buf_id << " of size " << format_size(buffer_block->size) - << " (heap size: " << format_size(heap->size.available) - << ", total allocated: " << format_size(m_total_allocated_memory) << ")\n"; - + << " from heap #" << heap_block->heap_id + << " (heap size: " << format_size(heap_block->size.available) + << ", use#: " << buffer_block->use_count + << ", retain#: " << retainCount + << ", gc#: " << buffer_block->gc_count << ")\n"; } delete buffer_block; - if (remove_empty_heap && heap->n_buffers == 0) { - heap->releaseMTLHeap(); - if (debug_info_enabled()) { - std::cerr << "Released heap of size " << format_size(heap->size.total) - << " (free memory: " << format_size(max_available_size()) << ")\n"; + if (remove_empty_heap && heap_block->n_buffers == 0) { + pool.heaps_pending_update.erase(heap_block); + retainCount = heap_block->releaseMTLHeap(); + if (m_debug_verbosity & DebugVerbosity::RELEASES) { + std::cerr << "Released heap #" << heap_block->heap_id + << " of size " << format_size(heap_block->size.total) + << " (current allocated: " << format_size(current_allocated_size()) + << ", retain#: " << retainCount << ")\n"; } - delete heap; + delete heap_block; + return true; } else { - pool->heaps.insert(heap); + pool.heaps.insert(heap_block); + // if heap wasn't released and its released buffer is still busy in command buffer, the available + // size of the heap cannot be updated and we should defer updating until command buffer finishes. + if (retainCount > 1) { + pool.heaps_pending_update.insert(heap_block); + m_mutex.unlock(); + m_stream->addCompletedHandler(^(id ) { + std::lock_guard lock(m_mutex); + // check if the heap block still exists + if (pool.heaps_pending_update.find(heap_block) != pool.heaps_pending_update.end()) { + pool.heaps_pending_update.erase(heap_block); + pool.heaps.erase(heap_block); + heap_block->updateAvailableSize(); + pool.heaps.insert(heap_block); + } + }); + m_mutex.lock(); + } } + return false; } void MPSHeapAllocatorImpl::release_buffers(BufferPool& pool) { + if ((m_debug_verbosity & DebugVerbosity::PROFILING) && pool.n_buffers > 0) { + std::cerr << "Releasing " << pool.n_buffers + << " buffers from " + << ((pool.usage & UsageFlags::SMALL ) ? "small " : "large ") + << ((pool.usage & UsageFlags::SHARED) ? "shared" : "private") + << ((pool.usage & UsageFlags::SCALAR) ? " scalar" : "") + << " pool (total size: " << format_size(pool.allocated_size) + << ", free buffers: " << pool.buffers.size() << ")\n"; + } auto it = pool.buffers.begin(); while (it != pool.buffers.end()) { BufferBlock* buffer_block = *it; @@ -253,20 +334,18 @@ } } -bool MPSHeapAllocatorImpl::release_available_cached_buffers(const AllocParams& p) +bool MPSHeapAllocatorImpl::release_available_cached_buffers(AllocParams& params) { - BufferPool& pool = *p.pool; + BufferPool& pool = *params.pool; - if (max_split_size() == std::numeric_limits::max() || pool.buffers.empty()) + if (pool.buffers.empty()) return false; - BufferBlock key = p.search_key; - key.size = (key.size < max_split_size()) ? 
max_split_size() : key.size; - auto it = pool.buffers.lower_bound(&key); + auto it = pool.buffers.lower_bound(¶ms.search_key); if (it == pool.buffers.end()) { size_t totalReleased = 0; --it; - while ((totalReleased < key.size) && ((*it)->size >= max_split_size())) { + while (totalReleased < params.search_key.size) { auto cur = it; totalReleased += (*it)->size; if (it != pool.buffers.begin()) { @@ -277,7 +356,7 @@ break; } } - if (totalReleased < key.size) + if (totalReleased < params.search_key.size) return false; } else { release_buffer(*it); @@ -287,14 +366,179 @@ bool MPSHeapAllocatorImpl::release_cached_buffers() { + if (m_debug_verbosity >= DebugVerbosity::PROFILING) { + std::cerr << "Releasing buffer pools (MPS allocated: " << format_size(m_total_allocated_memory) + << ", other allocations: " << format_size(current_allocated_size() - m_total_allocated_memory) << ")\n"; + } + // before releasing the buffers make sure the command buffer has finished. + // we need to release the lock temporarily as synchronizing may cause deadlock with completion handlers. + m_mutex.unlock(); + m_stream->synchronize(SyncType::COMMIT_AND_WAIT); + m_mutex.lock(); // Free all cached blocks to system allocator release_buffers(m_large_pool_private); release_buffers(m_large_pool_shared); release_buffers(m_small_pool_private); release_buffers(m_small_pool_shared); + release_buffers(m_scalar_pool); return true; } +void MPSHeapAllocatorImpl::garbage_collect_cached_buffers(AllocParams& params) +{ + TORCH_INTERNAL_ASSERT(current_allocated_size() >= m_low_watermark_limit); + // attempt to collect garbage until we reach below low watermark limit + const auto target_size = current_allocated_size() - m_low_watermark_limit; + const BufferPool& pool = *params.pool; + // calculate the total age of the free-able blocks. We'll use it later to get the average age threshold. + double total_age = 0.0; + unsigned int freeable_block_count = 0, freed_count = 0; + size_t gc_reclaimed = 0; + + for (auto& b : pool.buffers) { + if (b->retainCount() <= 1) { + total_age += b->gc_count; + ++freeable_block_count; + } + } + if (freeable_block_count == 0) { + return; + } + // repeat GC until we reach reclaim > target size. + bool block_freed = true; + while (gc_reclaimed < target_size && block_freed && freeable_block_count > 0) { + // free blocks exceeding this age threshold first. + double age_threshold = total_age / freeable_block_count; + // stop iteration if we can no longer free a block. + block_freed = false; + // free blocks of > avg age. Stop garbage collection if we reach below the + // low watermark limit since re-allocation or fragmentation could be costly. + auto it = pool.buffers.begin(); + while (it != pool.buffers.end() && gc_reclaimed < target_size) { + BufferBlock* buffer_block = *it++; + if (buffer_block->gc_count >= age_threshold && buffer_block->retainCount() <= 1) { + block_freed = true; + gc_reclaimed += buffer_block->size; + total_age -= buffer_block->gc_count; + freeable_block_count--; + freed_count++; + release_buffer(buffer_block, !buffer_block->heap->is_split); + } + } + } + if (m_debug_verbosity & DebugVerbosity::RELEASES) { + std::cerr << "Garbage collected " << freed_count + << " buffers from large " + << ((pool.usage & UsageFlags::SHARED) ? 
"shared" : "private") + << " pool (total reclaimed: " << format_size(gc_reclaimed) + << ", #buffers: " << pool.buffers.size() << ")\n"; + } +} + +// public interface to MPSAllocator +id MPSHeapAllocatorImpl::malloc(size_t size, uint32_t usage) +{ + std::lock_guard lock(m_mutex); + + BufferBlock* buffer_block = alloc_buffer_block(size, usage); + return buffer_block ? buffer_block->buffer : nullptr; +} + +bool MPSHeapAllocatorImpl::isSharedBuffer(void* ptr) +{ + std::lock_guard lock(m_mutex); + + BufferBlock *buffer_block = get_allocated_buffer_block(ptr); + // it's OK for the buffer_block to not exist yet + return buffer_block && (buffer_block->heap->pool->usage & UsageFlags::SHARED); +} + +id MPSHeapAllocatorImpl::allocScalarBufferWithValue(void* value, size_t size) +{ + BufferBlock* buffer_block = nullptr; + { + std::lock_guard lock(m_mutex); + + buffer_block = alloc_buffer_block(size, UsageFlags::SCALAR); + if (!buffer_block) + return nullptr; + } + // buffer is out of the pool, so no mutex lock is needed + memcpy([buffer_block->buffer contents], value, size); + return buffer_block->buffer; +} + +ssize_t MPSHeapAllocatorImpl::getRequestedBufferSize(void* ptr) +{ + std::lock_guard lock(m_mutex); + + BufferBlock *buffer_block = get_allocated_buffer_block(ptr); + if (buffer_block) + return (ssize_t) buffer_block->requested_size; + // -1 indicates the passed buffer pointer wasn't found + return -1; +} + +void MPSHeapAllocatorImpl::setBufferShape(void* ptr, const IntArrayRef& shape) +{ + std::lock_guard lock(m_mutex); + + BufferBlock *buffer_block = get_allocated_buffer_block(ptr); + TORCH_INTERNAL_ASSERT(buffer_block, "failed to find the buffer ", ptr); + // note that the IntArrayRef doesn't own the underlying data, and the backing + // memory for shape data must persist as long as the buffer is in use. + // So we need to copy to vector. 
+ buffer_block->shape = shape.vec(); +} + +IntArrayRef MPSHeapAllocatorImpl::getBufferShape(void* ptr) +{ + std::lock_guard lock(m_mutex); + + BufferBlock *buffer_block = get_allocated_buffer_block(ptr); + if (buffer_block && buffer_block->shape.size() > 0) + return IntArrayRef{buffer_block->shape}; + + return IntArrayRef(); +} + +void MPSHeapAllocatorImpl::free(void* ptr) +{ + BufferBlock *buffer_block = nullptr; + { + std::lock_guard lock(m_mutex); + + buffer_block = get_allocated_buffer_block(ptr); + TORCH_INTERNAL_ASSERT(buffer_block); + const BufferPool& pool = *buffer_block->heap->pool; + if (!(pool.usage & UsageFlags::SCALAR)) { + free_buffer(buffer_block); + return; + } + } + // we sync the scalar pool manually with a completion handler at the time the buffer is + // freed, when the MPSScalar instance goes out of scope + m_stream->addCompletedHandler(^(id ) { + std::lock_guard lock(m_mutex); + free_buffer(buffer_block); + }); +} + +void MPSHeapAllocatorImpl::emptyCache() +{ + std::lock_guard lock(m_mutex); + release_cached_buffers(); +} + +ssize_t MPSHeapAllocatorImpl::getLowWatermarkValue() +{ + // check if low watermark limit is disabled + if (m_low_watermark_ratio == 0.0) + return std::numeric_limits::max(); + // current_allocated_size could exceed m_low_watermark_limit (e.g., when swapping to disk) + return std::max(0, (ssize_t)(m_low_watermark_limit - current_allocated_size()) / 1048576L); +} + +} // namespace HeapAllocator // Use "at::mps::GetMPSAllocator()" to acquire a handle to MPS Allocator @@ -308,66 +552,66 @@ // MPS allocator struct to be registered with Pytorch struct TORCH_API MPSAllocator final : public at::Allocator { public: - explicit MPSAllocator(bool useSharedStorage) : - m_has_unified_memory(_getAllocImpl().Device().hasUnifiedMemory), m_use_shared_storage(useSharedStorage) + explicit MPSAllocator(uint32_t Usage) : + m_has_unified_memory(_getAllocImpl().Device().hasUnifiedMemory), m_usage(Usage) { - const bool enable_debug_info = isEnvVarEnabled("PYTORCH_DEBUG_MPS_ALLOCATOR"); - if (enable_debug_info) { - _getAllocImpl().enable_debug_info(); - if (!m_use_shared_storage || m_has_unified_memory) { + if (_getAllocImpl().getDebugVerbosity()) { + if (!(m_usage & HeapAllocator::UsageFlags::SHARED) || m_has_unified_memory) { + const size_t max_total_allowed_size = _getAllocImpl().getMaxTotalAllowedSize(); + const size_t low_watermark_limit = _getAllocImpl().getLowWatermarkLimit(); std::cerr << "Initializing " - << (useSharedStorage ? "shared" : "private") + << ((m_usage & HeapAllocator::UsageFlags::SHARED) ? "shared" : "private") << " heap allocator on " << (m_has_unified_memory ? "unified" : "discrete") << " device memory of size " - << _getAllocImpl().Device().recommendedMaxWorkingSetSize / 1048576UL << " MB\n"; + << _getAllocImpl().Device().recommendedMaxWorkingSetSize / 1048576UL << " MB" + << " (max allowed: " + << (max_total_allowed_size == std::numeric_limits::max() ? "unlimited" : + (to_string(max_total_allowed_size / 1048576UL) + " MB")) + << ", low watermark: " + << (low_watermark_limit == std::numeric_limits::max() ? "unlimited" : + (to_string(low_watermark_limit / 1048576UL) + " MB")) << ")\n"; } } } ~MPSAllocator() override { - _getAllocImpl().EmptyCache(); + _getAllocImpl().emptyCache(); } DataPtr allocate(const size_t nbytes) const override { - __block id buf = nbytes > 0 ? _getAllocImpl().Malloc(nbytes, m_use_shared_storage) : nullptr; + __block id buf = nbytes > 0 ?
_getAllocImpl().malloc(nbytes, m_usage) : nullptr; + return { buf, buf, &Delete, at::Device(at::DeviceType::MPS, 0)}; + } + + DataPtr allocate_scalar_buffer(void *value, size_t size) const { + id buf = _getAllocImpl().allocScalarBufferWithValue(value, size); return { buf, buf, &Delete, at::Device(at::DeviceType::MPS, 0)}; } DeleterFnPtr raw_deleter() const override { return &Delete; } bool is_shared(void* ptr) const { return _getAllocImpl().isSharedBuffer(ptr); } - bool is_shared_storge_supported() const { return m_has_unified_memory; } + bool is_shared_storage_supported() const { return m_has_unified_memory; } private: bool m_has_unified_memory; - // use shared buffers on unified memory - bool m_use_shared_storage; + uint32_t m_usage; static void Delete(void* ptr) { if (ptr) { - _getAllocImpl().Free(ptr); - } - } - - static bool isEnvVarEnabled(const char *envvar) { - const char *e = getenv(envvar); - if (e) { - char *t = (char*) e; - long val = strtol(e, &t, 0); - return (t != e && val != 0); + _getAllocImpl().free(ptr); } - return false; } }; namespace { MPSAllocator& _getSharedAllocator() { - static MPSAllocator s_mps_shared_alloc(true); + static MPSAllocator s_mps_shared_alloc(HeapAllocator::UsageFlags::SHARED); return s_mps_shared_alloc; } MPSAllocator& _getPrivateAllocator() { - static mps::MPSAllocator s_mps_private_alloc(false); + static MPSAllocator s_mps_private_alloc(HeapAllocator::UsageFlags::PRIVATE); return s_mps_private_alloc; } } // anonymous namespace @@ -375,14 +619,14 @@ static bool isEnvVarEnabled(const char *envvar) { at::Allocator* getMPSSharedAllocator() { auto& sa = _getSharedAllocator(); - if (sa.is_shared_storge_supported()) { + if (sa.is_shared_storage_supported()) { return &sa; } return nullptr; } -at::Allocator* getMPSStaticAllocator() { +at::Allocator* getMPSPrivateAllocator() { return &_getPrivateAllocator(); } @@ -397,7 +641,15 @@ void set_buffer_shape(void* ptr, const IntArrayRef& shape) { IntArrayRef get_buffer_shape(void* ptr) { return _getAllocImpl().getBufferShape(ptr); -}; +} + +DataPtr allocate_scalar_buffer(void *value, size_t size) { + return _getPrivateAllocator().allocate_scalar_buffer(value, size); +} + +uint32_t get_adaptive_commit_threshold() { + return _getAllocImpl().getLowWatermarkValue(); +} } // namespace mps diff --git a/aten/src/ATen/mps/MPSDevice.h b/aten/src/ATen/mps/MPSDevice.h index d957c5440a06..48e1904346c1 100644 --- a/aten/src/ATen/mps/MPSDevice.h +++ b/aten/src/ATen/mps/MPSDevice.h @@ -11,9 +11,15 @@ #include #include typedef id MTLDevice_t; +typedef id MTLLibrary_t; +typedef id MTLFunction_t; +typedef MTLFunctionConstantValues* MTLFunctionConstantValues_t; #else typedef void* MTLDevice; typedef void* MTLDevice_t; +typedef void* MTLLibrary_t; +typedef void* MTLFunction_t; +typedef void* MTLFunctionConstantValues_t; #endif using namespace std; @@ -47,16 +53,25 @@ class TORCH_API MPSDevice { MTLDevice_t device() { return _mtl_device; } + /** + * Returns whether running on Ventura or newer + */ + bool isMacOS13Plus() const; + + MTLFunction_t metalIndexingFunction(const std::string &kernel, MTLFunctionConstantValues_t constantValues); ~MPSDevice(); private: static MPSDevice* _device; MTLDevice_t _mtl_device; + bool _macos13plus; + MTLLibrary_t _mtl_indexing_library; MPSDevice(); }; TORCH_API bool is_available(); +TORCH_API bool is_macos_13_or_newer(); TORCH_API at::Allocator* GetMPSAllocator(bool useSharedAllocator = false); diff --git a/aten/src/ATen/mps/MPSDevice.mm b/aten/src/ATen/mps/MPSDevice.mm index 
277510066649..c11621b3f354 100644 --- a/aten/src/ATen/mps/MPSDevice.mm +++ b/aten/src/ATen/mps/MPSDevice.mm @@ -3,6 +3,7 @@ #include #include +#include namespace at { namespace mps { @@ -10,6 +11,15 @@ static std::unique_ptr mps_device; static c10::once_flag mpsdev_init; +static inline MTLLanguageVersion getMetalLanguageVersion(const id& device) { + // MPS Advanced Indexing needs at least Metal 2.0 (support for Argument Buffers and function constants) + // host_name attribute needs at least Metal 2.2 + MTLLanguageVersion languageVersion = MTLLanguageVersion2_2; + + TORCH_CHECK([device supportsFamily:MTLGPUFamilyMac2], "Missing Metal support for MTLGPUFamilyMac2"); + return languageVersion; +} + MPSDevice* MPSDevice::getInstance() { c10::call_once(mpsdev_init, [] { mps_device = std::unique_ptr(new MPSDevice()); @@ -17,16 +27,46 @@ return mps_device.get(); } +id MPSDevice::metalIndexingFunction(const std::string& kernel, MTLFunctionConstantValues* constantValues) { + TORCH_INTERNAL_ASSERT_DEBUG_ONLY(_mtl_device); + NSError* error = nil; + if (!_mtl_indexing_library) { + MTLCompileOptions *options = [MTLCompileOptions new]; + [options setLanguageVersion: getMetalLanguageVersion(_mtl_device)]; + [options setFastMathEnabled: YES]; + _mtl_indexing_library = [_mtl_device newLibraryWithSource: [NSString stringWithCString: mps::indexing_metal_shaders encoding:NSASCIIStringEncoding] + options: options + error: &error]; + TORCH_CHECK(_mtl_indexing_library, "Failed to create indexing library, error: ", [[error description] UTF8String]); + } + + id indexFunction = nil; + if (constantValues) { + indexFunction = [[_mtl_indexing_library newFunctionWithName: [NSString stringWithUTF8String: kernel.c_str()] + constantValues: constantValues + error: &error] autorelease]; + } else { + indexFunction = [[_mtl_indexing_library newFunctionWithName: [NSString stringWithUTF8String: kernel.c_str()]] autorelease]; + } + + TORCH_CHECK(indexFunction, "Failed to create specialized function state object: ", kernel, ", error: ", [[error description] UTF8String]); + + return indexFunction; +} + MPSDevice::~MPSDevice() { [_mtl_device release]; + [_mtl_indexing_library release]; _mtl_device = nil; + _mtl_indexing_library = nil; } -MPSDevice::MPSDevice(): _mtl_device(nil) { +MPSDevice::MPSDevice(): _mtl_device(nil), _mtl_indexing_library(nil) { // Check that MacOS 12.3+ version of MPS framework is available // Create the MPSGraph and check method introduced in 12.3+ // which is used by MPS backend. id mpsCD = NSClassFromString(@"MPSGraph"); + _macos13plus = [mpsCD instancesRespondToSelector:@selector(cumulativeSumWithTensor:axis:name:)] == YES; if ([mpsCD instancesRespondToSelector:@selector(LSTMWithSourceTensor: recurrentWeight: inputWeight: @@ -37,6 +77,7 @@ name:)] == NO) { return; } + NSArray* devices = [MTLCopyAllDevices() autorelease]; for (unsigned long i = 0 ; i < [devices count] ; i++) { id device = devices[i]; @@ -45,18 +86,27 @@ break; } } - assert(_mtl_device); + TORCH_INTERNAL_ASSERT_DEBUG_ONLY(_mtl_device); + +} + +bool MPSDevice::isMacOS13Plus() const { + return _macos13plus; } at::Allocator* getMPSSharedAllocator(); -at::Allocator* getMPSStaticAllocator(); +at::Allocator* getMPSPrivateAllocator(); at::Allocator* GetMPSAllocator(bool useSharedAllocator) { - return useSharedAllocator ? getMPSSharedAllocator() : getMPSStaticAllocator(); + return useSharedAllocator ? 
getMPSSharedAllocator() : getMPSPrivateAllocator(); } bool is_available() { return MPSDevice::getInstance()->device() != nil; } +bool is_macos_13_or_newer() { + return MPSDevice::getInstance()->isMacOS13Plus(); +} + } // namespace mps } // namespace at diff --git a/aten/src/ATen/mps/MPSFallback.mm b/aten/src/ATen/mps/MPSFallback.mm index 7e6be9c772b9..f1c0dbbacdca 100644 --- a/aten/src/ATen/mps/MPSFallback.mm +++ b/aten/src/ATen/mps/MPSFallback.mm @@ -14,7 +14,7 @@ void mps_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) void mps_error_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { - TORCH_CHECK_NOT_IMPLEMENTED(false, "The operator '", op.schema().operator_name(), "' is not current implemented ", + TORCH_CHECK_NOT_IMPLEMENTED(false, "The operator '", op.schema().operator_name(), "' is not currently implemented ", "for the MPS device. If you want this op to be added in priority during the prototype ", "phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. ", "As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` ", @@ -22,6 +22,20 @@ void mps_error_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) "on MPS.") } + +// This dispatch should never be called for tensors on MPS, but it is frequently called +// if one of the tensors is on the CPU +Tensor slow_conv2d_forward_mps( + const Tensor &self, + const Tensor &weight, + IntArrayRef kernel_size, + const c10::optional &bias, + IntArrayRef stride, + IntArrayRef padding) { + TORCH_CHECK(self.device() == weight.device(), __func__, ": input(device='", self.device(), "') and weight(device='", weight.device(), "') must be on the same device"); + TORCH_INTERNAL_ASSERT(false, __func__, " should not be called for both tensors on MPS device"); +} + TORCH_LIBRARY_IMPL(_, MPS, m) { static const char *enable_mps_fallback = getenv("PYTORCH_ENABLE_MPS_FALLBACK"); if (!enable_mps_fallback || std::stoi(enable_mps_fallback) == 0) { @@ -35,7 +49,6 @@ void mps_error_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) // These ops are not supported via MPS backend currently, and we fallback to run on CPU. // For the rest of unsupported ops the user needs to pass 'PYTORCH_ENABLE_MPS_FALLBACK=1' // to fallback on CPU, otherwise we will error out.
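The runtime check added in MPSDevice.h/.mm above (is_macos_13_or_newer) is meant to let kernels branch on OS capability. A minimal illustrative call site, not taken from this patch, might look like:

    // Illustrative only: branching on the new at::mps::is_macos_13_or_newer() helper
    if (at::mps::is_macos_13_or_newer()) {
      // take a native MPSGraph path that is only available on macOS 13 (Ventura) and newer
    } else {
      // fall back to a composite implementation or the CPU fallback path
    }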
- m.impl("bitwise_not.out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); m.impl("bitwise_left_shift.Tensor_out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); m.impl("bitwise_right_shift.Tensor_out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); m.impl("embedding_renorm_", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); @@ -49,7 +62,7 @@ void mps_error_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) m.impl("linalg_vector_norm", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); m.impl("sgn.out", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); m.impl("nonzero", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); - m.impl("masked_select", torch::CppFunction::makeFromBoxedFunction<&mps_fallback>()); + m.impl("_slow_conv2d_forward", slow_conv2d_forward_mps); } } // namespace at diff --git a/aten/src/ATen/mps/MPSGuardImpl.h b/aten/src/ATen/mps/MPSGuardImpl.h index 27d32bf652e7..b6002497d223 100644 --- a/aten/src/ATen/mps/MPSGuardImpl.h +++ b/aten/src/ATen/mps/MPSGuardImpl.h @@ -109,12 +109,12 @@ struct TORCH_API MPSGuardImpl final : public c10::impl::DeviceGuardImplInterface struct OptionalMPSGuard { explicit OptionalMPSGuard() : guard_() {} - explicit OptionalMPSGuard(optional device_opt) + explicit OptionalMPSGuard(c10::optional device_opt) : guard_(device_opt) {} /// Set the current MPS device to the passed device index, if it is not /// nullopt - explicit OptionalMPSGuard(optional device_index_opt) + explicit OptionalMPSGuard(c10::optional device_index_opt) : guard_(device_index_opt) {} // Copy is not allowed @@ -144,14 +144,14 @@ struct OptionalMPSGuard { /// Returns the device that was set immediately prior to initialization of the /// guard, or nullopt if the guard is uninitialized. - optional original_device() const { + c10::optional original_device() const { return guard_.original_device(); } /// Returns the most recent device that was set using this device guard, /// either from construction, or via set_device, if the guard is initialized, /// or nullopt if the guard is uninitialized. 
- optional current_device() const { + c10::optional current_device() const { return guard_.current_device(); } diff --git a/aten/src/ATen/mps/MPSGuardImpl.mm b/aten/src/ATen/mps/MPSGuardImpl.mm index 2aedeccf82cb..787ef4cae7cd 100644 --- a/aten/src/ATen/mps/MPSGuardImpl.mm +++ b/aten/src/ATen/mps/MPSGuardImpl.mm @@ -35,7 +35,7 @@ auto mps_event = static_cast(*event); MPSStream mps_stream{stream}; - mps_event->recordEvent(&mps_stream); + mps_event->recordEvent(true); } void MPSGuardImpl::block( @@ -45,7 +45,7 @@ auto mps_event = static_cast(event); MPSStream mps_stream{stream}; - mps_event->waitForEvent(&mps_stream); + mps_event->waitForEvent(true); } bool MPSGuardImpl::queryEvent(void* event) const { diff --git a/aten/src/ATen/mps/MPSStream.h b/aten/src/ATen/mps/MPSStream.h index d4e6172954da..afd4d53e1cdd 100644 --- a/aten/src/ATen/mps/MPSStream.h +++ b/aten/src/ATen/mps/MPSStream.h @@ -43,6 +43,7 @@ enum class SyncType { COMMIT, // commit and flush the command buffer COMMIT_AND_WAIT, // flush and wait for command buffer execution to finish COMMIT_AND_CONTINUE,// commit and continue with a new underlying command buffer + COMMIT_ADAPTIVE, // commit adaptively based on available memory }; class TORCH_API MPSStream @@ -70,6 +71,7 @@ class TORCH_API MPSStream size_t length, size_t srcOffset, size_t dstOffset, bool non_blocking); void flush(); void executeMPSGraph(MPSGraph* mpsGraph, NSDictionary* feeds, NSDictionary* results, SyncType syncType = SyncType::NONE); + void addCompletedHandler(MTLCommandBufferHandler block); /// Get the MPS device index that this stream is associated with. c10::DeviceIndex device_index() const { return _stream.device_index(); } @@ -125,21 +127,27 @@ class TORCH_API MPSStreamImpl struct TORCH_API MPSEvent { - MPSEvent(); - // MPSEvent(id device); - + // for a new instance of MPSEvent, sometimes we want an empty shell and don't + // necessarily want to create events or listeners. So we defer initialization + // until we actually use the event (e.g., record, notify, etc.) 
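To make the deferred-initialization idea above concrete, here is a usage sketch of the reworked event API; it is illustrative only and assumes nothing beyond the member functions declared in the lines that follow.

    // Illustrative usage only (not part of the header):
    MPSEvent ev;            // deferInitialization defaults to true: no Metal objects created yet
    ev.recordEvent(true);   // lazily initializes, encodes a signal, and commits the stream
    ev.waitForEvent();      // encodes a wait for the last recorded signal value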
+ MPSEvent(bool deferInitialization = true); ~MPSEvent(); MTLSharedEvent_t event() const {return _event; } - void recordEvent(MPSStream *stream); - void waitForEvent(MPSStream *queue); // waits on the cpu - bool queryEvent(); - uint64_t getCurrentValue() { return _currentValue; } - void setCurrentValue(uint64_t currValue) { _currentValue = currValue; } + void recordEvent(bool syncEvent = false); + void waitForEvent(bool syncEvent = false); // waits on the cpu + void notifyEvent(MTLSharedEventNotificationBlock block); + bool queryEvent() const; + uint64_t getCurrentValue() const { return _signalCounter; } + void setCurrentValue(uint64_t currValue) { _signalCounter = currValue; } private: - bool _isRecorded = false; - uint64_t _currentValue = 0; + bool is_initialized; + uint64_t _signalCounter; + MPSStream* _stream; MTLSharedEvent_t _event; + MTLSharedEventListener* _listener; + + void initialize(); }; typedef MPSEvent* mpsEvent_t; diff --git a/aten/src/ATen/mps/MPSStream.mm b/aten/src/ATen/mps/MPSStream.mm index 948d5723cad9..04115fc268c7 100644 --- a/aten/src/ATen/mps/MPSStream.mm +++ b/aten/src/ATen/mps/MPSStream.mm @@ -5,7 +5,10 @@ namespace at { namespace mps { -#define USE_MPSCOMMANDBUFFER 1 +#define USE_COMMIT_AND_CONTINUE 1 + +// the frequency that we commit the command buffer calculated based on low watermark ratio in MPSAllocator +uint32_t get_adaptive_commit_threshold(); //----------------------------------------------------------------- // MPSStream @@ -47,6 +50,16 @@ case SyncType::COMMIT: flush(); break; + case SyncType::COMMIT_ADAPTIVE: + // the adaptive commit only commits if we hit the low watermark memory threshold + if (get_adaptive_commit_threshold() <= 1) { +#if USE_COMMIT_AND_CONTINUE + commitAndContinue(); +#else + flush(); +#endif + } + break; case SyncType::COMMIT_AND_WAIT: commitAndWait(); break; @@ -57,7 +70,7 @@ } void MPSStream::commit(bool doFlush) { -#if USE_MPSCOMMANDBUFFER +#if USE_COMMIT_AND_CONTINUE [commandBuffer() commitAndContinue]; #else if (doFlush) { @@ -96,6 +109,14 @@ [_commandBuffer release]; } +void MPSStream::addCompletedHandler(MTLCommandBufferHandler block) { + dispatch_sync(_serialQueue, ^() { + @autoreleasepool { + [commandBuffer() addCompletedHandler:block]; + } + }); +} + void MPSStream::fill(id buffer, uint8_t value, size_t length, size_t offset, SyncType syncType) { TORCH_INTERNAL_ASSERT(length >= offset); @@ -138,7 +159,7 @@ void MPSStream::executeMPSGraph(MPSGraph* mpsGraph, NSDictionary* feeds, NSDictionary* results, SyncType syncType) { dispatch_sync(_serialQueue, ^() { -#if USE_MPSCOMMANDBUFFER +#if USE_COMMIT_AND_CONTINUE [mpsGraph encodeToCommandBuffer:commandBuffer() feeds:feeds targetOperations:nil @@ -185,40 +206,73 @@ // MPSEvent //----------------------------------------------------------------- -MPSEvent::MPSEvent() { - _event = [MPSDevice::getInstance()->device() newSharedEvent]; +MPSEvent::MPSEvent(bool deferInitialization) : + is_initialized(false), _signalCounter(0), _stream(nil), _event(nil), _listener(nil) { + if (!deferInitialization) { + initialize(); + } } MPSEvent::~MPSEvent() { - [_event release]; - _event = nil; -} - -void MPSEvent::recordEvent(MPSStream* stream) { - @autoreleasepool { - _isRecorded = true; - dispatch_sync(stream->queue(), ^() { - @autoreleasepool { - id commandBuffer = stream->commandBuffer(); - [commandBuffer encodeSignalEvent:_event value:_currentValue]; - stream->commit(true); - } - }); + if (_event) { + [_event release]; + _event = nil; + } + if (_listener) { + [_listener release]; + 
_listener = nil; } } -void MPSEvent::waitForEvent(MPSStream* stream) { - dispatch_sync(stream->queue(), ^() { +void MPSEvent::initialize() { + _stream = getDefaultMPSStream(); + _event = [_stream->device() newSharedEvent]; + _listener = [[MTLSharedEventListener alloc] init]; + is_initialized = true; +} + +void MPSEvent::recordEvent(bool syncEvent) { + if (!is_initialized) + initialize(); + + dispatch_sync(_stream->queue(), ^() { + @autoreleasepool { + ++_signalCounter; + id commandBuffer = _stream->commandBuffer(); + [commandBuffer encodeSignalEvent:_event value:_signalCounter]; + if (syncEvent) + _stream->synchronize(SyncType::COMMIT); + } + }); +} + +void MPSEvent::waitForEvent(bool syncEvent) { + TORCH_INTERNAL_ASSERT(is_initialized); + dispatch_sync(_stream->queue(), ^() { + @autoreleasepool { + id commandBuffer = _stream->commandBuffer(); + [commandBuffer encodeWaitForEvent:_event value:_signalCounter]; + if (syncEvent) + _stream->synchronize(SyncType::COMMIT); + } + }); +} + +void MPSEvent::notifyEvent(MTLSharedEventNotificationBlock block) +{ + if (!is_initialized) + initialize(); + dispatch_sync(_stream->queue(), ^() { @autoreleasepool { - id commandBuffer = stream->commandBuffer(); - [commandBuffer encodeWaitForEvent:_event value:_currentValue]; - stream->commit(false); + ++_signalCounter; + [_event notifyListener:_listener atValue:_signalCounter block:block]; } }); } -bool MPSEvent::queryEvent() { - return !_isRecorded || (_event.signaledValue >= _currentValue); +bool MPSEvent::queryEvent() const { + // return false if not recorded or signaled yet + return _signalCounter && (_event.signaledValue >= _signalCounter); } } // namespace mps diff --git a/aten/src/ATen/native/Activation.cpp b/aten/src/ATen/native/Activation.cpp index 97f504b85dd1..bef09e81a5ea 100644 --- a/aten/src/ATen/native/Activation.cpp +++ b/aten/src/ATen/native/Activation.cpp @@ -1,11 +1,12 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include -#include +#include #include -#include -#include +#include +#include #include +#include #if defined(C10_MOBILE) && defined(USE_XNNPACK) #include #endif @@ -17,6 +18,63 @@ #include #endif +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace meta { // computes `result = self <= threshold ? 
value : other` @@ -314,7 +372,7 @@ bool use_mkldnn(const Tensor& input) { if (!at::globalContext().userEnabledMkldnn()) { return false; } - if (!input.is_contiguous() || input.numel() == 1) { + if (!input.is_contiguous() || input.numel() <= 1) { return false; } return (input.is_mkldnn()) || // input is mkldnn Tensor @@ -599,10 +657,12 @@ Tensor rrelu_with_noise_backward( } Tensor rrelu(const Tensor & self, const Scalar& lower, const Scalar& upper, bool training, c10::optional generator) { + TORCH_CHECK(lower.to() <= upper.to(), "Lower bound should be less than or equal to the upper bound") return at::rrelu_with_noise(self, at::empty_like(self, LEGACY_CONTIGUOUS_MEMORY_FORMAT), lower, upper, training, generator); } Tensor & rrelu_(Tensor & self, const Scalar& lower, const Scalar& upper, bool training, c10::optional generator) { + TORCH_CHECK(lower.to() <= upper.to(), "Lower bound should be less than or equal to the upper bound") return at::rrelu_with_noise_(self, at::empty_like(self, LEGACY_CONTIGUOUS_MEMORY_FORMAT), lower, upper, training, generator); } @@ -639,7 +699,7 @@ Tensor prelu_cpu(const Tensor& self, const Tensor& weight_) { auto as_nd = [&](const Tensor& t) { TORCH_CHECK( t.dim() == 1 || t.dim() == 0, - "prelu: Expected `weight` to be a scalar or 1D tensor, but got ndim = ", t.dim()); + "prelu: Expected `weight` to be a scalar or 1D tensor, but got: ndim = ", t.dim()); if (ndim >= 2) { sizes[1] = t.dim() == 1 ? t.size(0) : 1; strides[1] = t.dim() == 1 ? t.stride(0) : 0; diff --git a/aten/src/ATen/native/Activation.h b/aten/src/ATen/native/Activation.h index ba2dbc0768e8..64f6c6a6dceb 100644 --- a/aten/src/ATen/native/Activation.h +++ b/aten/src/ATen/native/Activation.h @@ -1,6 +1,8 @@ #pragma once #include +#include +#include namespace c10 { class Scalar; diff --git a/aten/src/ATen/native/AdaptiveAveragePooling.cpp b/aten/src/ATen/native/AdaptiveAveragePooling.cpp index cf4321a1d2d6..b612ef009b65 100644 --- a/aten/src/ATen/native/AdaptiveAveragePooling.cpp +++ b/aten/src/ATen/native/AdaptiveAveragePooling.cpp @@ -1,9 +1,21 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { @@ -16,16 +28,16 @@ namespace { IntArrayRef output_size) { TORCH_CHECK(output_size.size() == 2, "adaptive_avg_pool2d: output_size must be 2"); - int64_t ndim = input.ndimension(); - for (const auto i : c10::irange(1, ndim)) { + int64_t ndim = input.dim(); + TORCH_CHECK((ndim == 3 || ndim == 4), + "adaptive_avg_pool2d(): Expected 3D or 4D tensor, but got ", input.sizes()); + for (const auto i : {-2, -1}) { TORCH_CHECK(input.size(i) > 0, "adaptive_avg_pool2d(): Expected input to have non-zero size for non-batch dimensions, " - "but input has sizes ", input.sizes(), " with dimension ", i, " being " + "but input has sizes ", input.sizes(), " with dimension ", i + ndim, " being " "empty"); } - TORCH_CHECK((ndim == 3 || ndim == 4), - "adaptive_avg_pool2d(): Expected 3D or 4D tensor, but got ", input.sizes()); TORCH_CHECK(input.dtype() == output.dtype(), "expected dtype ", input.dtype(), " for `output` but got dtype ", output.dtype()); @@ -95,7 +107,7 @@ namespace { return output; } - Tensor adaptive_avg_pool2d(at::Tensor const& input, IntArrayRef output_size) { + Tensor adaptive_avg_pool2d_symint(at::Tensor const& input, SymIntArrayRef output_size) { TORCH_CHECK(output_size.size() == 2, 
"adaptive_avg_pool2d: output_size must be 2"); TORCH_CHECK( (output_size[0] >= 0 && output_size[1] >= 0), @@ -103,10 +115,10 @@ namespace { "but received {", output_size[0], ", ", output_size[1], "}"); if (input.is_mkldnn()) { - return at::mkldnn_adaptive_avg_pool2d(input, output_size); + return at::mkldnn_adaptive_avg_pool2d(input, c10::asIntArrayRefSlow(output_size)); } - if (!input.is_quantized() && output_size[0] == 1 && output_size[1] == 1) { + if (!input.is_quantized() && output_size[0] == 1 && output_size[1] == 1 && !input.is_xpu()) { // in this case, adaptive pooling is just computing mean over hw // dimensions, which can be done more efficiently #if defined(C10_MOBILE) && defined(USE_XNNPACK) @@ -118,13 +130,13 @@ namespace { Tensor out = input.mean({-1, -2}, /* keepdim = */ true); if (input.suggest_memory_format() == at::MemoryFormat::ChannelsLast) { // assert ndim == 4, since ndim = 3 doesn't give channels_last - const int n = input.size(0); - const int c = input.size(1); - out.as_strided_({n, c, 1, 1}, {c, 1, c, c}); + const auto n = input.sym_size(0); + const auto c = input.sym_size(1); + out.as_strided__symint({n, c, 1, 1}, {c, 1, c, c}); } return out; } else { - return _adaptive_avg_pool2d(input, output_size); + return _adaptive_avg_pool2d_symint(input, output_size); } } diff --git a/aten/src/ATen/native/AdaptiveAveragePooling3d.cpp b/aten/src/ATen/native/AdaptiveAveragePooling3d.cpp index 64b7014b1def..a0a02ca53160 100644 --- a/aten/src/ATen/native/AdaptiveAveragePooling3d.cpp +++ b/aten/src/ATen/native/AdaptiveAveragePooling3d.cpp @@ -1,21 +1,35 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { namespace { -inline int start_index(int a, int b, int c) { +inline int64_t start_index(int64_t a, int64_t b, int64_t c) { // NOLINTNEXTLINE(cppcoreguidelines-narrowing-conversions,bugprone-narrowing-conversions) - return (int)std::floor((float)(a * c) / b); + return (a / b) * c + ((a % b) * c) / b; } -inline int end_index(int a, int b, int c) { +inline int64_t end_index(int64_t a, int64_t b, int64_t c) { // NOLINTNEXTLINE(cppcoreguidelines-narrowing-conversions,bugprone-narrowing-conversions) - return (int)std::ceil((float)((a + 1) * c) / b); + return 1 + ((a + 1) * c - 1) / b; } template @@ -299,20 +313,20 @@ Tensor adaptive_avg_pool3d_cpu(Tensor const& input, IntArrayRef output_size) { return output; } -Tensor adaptive_avg_pool3d(Tensor const& input, IntArrayRef output_size) { +Tensor adaptive_avg_pool3d_symint(Tensor const& input, SymIntArrayRef output_size) { TORCH_CHECK(output_size.size() == 3, "adaptive_avg_pool3d: output_size must be 3"); TORCH_CHECK( (output_size[0] >= 0 && output_size[1] >= 0 && output_size[2] >= 0), "adaptive_avg_pool2d: elements of output_size must be greater than or equal to 0 ", "but received {", output_size[0], ", ", output_size[1], ",", output_size[2], "}"); - if (output_size[0] == 1 && output_size[1] == 1 && output_size[2] == 1) { + if (output_size[0] == 1 && output_size[1] == 1 && output_size[2] == 1 && !input.is_xpu()) { // in this case, adaptive pooling is just computing mean over hw // dimensions, which can be done more efficiently Tensor out = input.mean({-1, -2, -3}, /* keepdim = */ true); return out; } else { - return _adaptive_avg_pool3d(input, output_size); + return _adaptive_avg_pool3d_symint(input, output_size); } 
} diff --git a/aten/src/ATen/native/AdaptiveMaxPooling2d.cpp b/aten/src/ATen/native/AdaptiveMaxPooling2d.cpp index 61e53f52f7b1..8f9c7ce274eb 100644 --- a/aten/src/ATen/native/AdaptiveMaxPooling2d.cpp +++ b/aten/src/ATen/native/AdaptiveMaxPooling2d.cpp @@ -1,8 +1,15 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif namespace at { namespace meta { diff --git a/aten/src/ATen/native/AdaptiveMaxPooling3d.cpp b/aten/src/ATen/native/AdaptiveMaxPooling3d.cpp index 5b9904e02249..ecfc151f0710 100644 --- a/aten/src/ATen/native/AdaptiveMaxPooling3d.cpp +++ b/aten/src/ATen/native/AdaptiveMaxPooling3d.cpp @@ -1,9 +1,16 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif namespace at { namespace meta { @@ -66,12 +73,12 @@ namespace native { namespace { -inline int start_index(int a, int b, int c) { - return (int)std::floor((float)(a * c) / b); +inline int64_t start_index(int64_t a, int64_t b, int64_t c) { + return (a / b) * c + ((a % b) * c) / b; } -inline int end_index(int a, int b, int c) { - return (int)std::ceil((float)((a + 1) * c) / b); +inline int64_t end_index(int64_t a, int64_t b, int64_t c) { + return 1 + ((a + 1) * c - 1) / b; } // #define START_IND(a,b,c) a * c / b diff --git a/aten/src/ATen/native/AdaptivePooling.h b/aten/src/ATen/native/AdaptivePooling.h index 68fb08a5f397..6f6e49e195f4 100644 --- a/aten/src/ATen/native/AdaptivePooling.h +++ b/aten/src/ATen/native/AdaptivePooling.h @@ -1,6 +1,7 @@ #pragma once #include +#include #include namespace at { @@ -19,11 +20,11 @@ DECLARE_DISPATCH(adaptive_max_pooling_fn, adaptive_max_pool2d_kernel); DECLARE_DISPATCH(adaptive_max_pooling_backward_fn, adaptive_max_pool2d_backward_kernel); static inline int64_t start_index(int64_t a, int64_t b, int64_t c) { - return (int64_t)std::floor((float)(a * c) / b); + return (a / b) * c + ((a % b) * c) / b; } static inline int64_t end_index(int64_t a, int64_t b, int64_t c) { - return (int64_t)std::ceil((float)((a + 1) * c) / b); + return 1 + ((a + 1) * c - 1) / b; } }} // namespace at::native diff --git a/aten/src/ATen/native/AffineGridGenerator.cpp b/aten/src/ATen/native/AffineGridGenerator.cpp index fc5b22324eaa..fe2c2d4aaa2b 100644 --- a/aten/src/ATen/native/AffineGridGenerator.cpp +++ b/aten/src/ATen/native/AffineGridGenerator.cpp @@ -1,5 +1,17 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include #include +#else +#include +#include +#include +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/AutogradComposite.cpp b/aten/src/ATen/native/AutogradComposite.cpp index 08f38ce249bb..c4573d5be918 100644 --- a/aten/src/ATen/native/AutogradComposite.cpp +++ b/aten/src/ATen/native/AutogradComposite.cpp @@ -1,6 +1,19 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/AveragePool2d.cpp b/aten/src/ATen/native/AveragePool2d.cpp index 1a3b88a62e12..441a320b7df2 100644 --- a/aten/src/ATen/native/AveragePool2d.cpp +++ b/aten/src/ATen/native/AveragePool2d.cpp @@ -1,7 +1,15 @@ -#include -#include +#define 
TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif namespace at { diff --git a/aten/src/ATen/native/AveragePool3d.cpp b/aten/src/ATen/native/AveragePool3d.cpp index 1c4724eb038d..a31292ea2167 100644 --- a/aten/src/ATen/native/AveragePool3d.cpp +++ b/aten/src/ATen/native/AveragePool3d.cpp @@ -1,10 +1,19 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif namespace at { diff --git a/aten/src/ATen/native/BatchLinearAlgebra.cpp b/aten/src/ATen/native/BatchLinearAlgebra.cpp index b2dc974f5a3b..9800ab3a5a57 100644 --- a/aten/src/ATen/native/BatchLinearAlgebra.cpp +++ b/aten/src/ATen/native/BatchLinearAlgebra.cpp @@ -1,21 +1,124 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include -#include -#include +#include +#include #include #include #include #include -#include -#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + // First the required LAPACK implementations are registered here. 
// A comment above the registered LAPACK routine suggest which batched // linear algebra function uses that routine @@ -27,12 +130,6 @@ extern "C" void cgetrf_(int *m, int *n, std::complex *a, int *lda, int *i extern "C" void dgetrf_(int *m, int *n, double *a, int *lda, int *ipiv, int *info); extern "C" void sgetrf_(int *m, int *n, float *a, int *lda, int *ipiv, int *info); -// getri -extern "C" void zgetri_(int *n, std::complex *a, int *lda, int *ipiv, std::complex *work, int *lwork, int *info); -extern "C" void cgetri_(int *n, std::complex *a, int *lda, int *ipiv, std::complex *work, int *lwork, int *info); -extern "C" void dgetri_(int *n, double *a, int *lda, int *ipiv, double *work, int *lwork, int *info); -extern "C" void sgetri_(int *n, float *a, int *lda, int *ipiv, float *work, int *lwork, int *info); - // potrs extern "C" void zpotrs_(char *uplo, int *n, int *nrhs, std::complex *a, int *lda, std::complex *b, int *ldb, int *info); extern "C" void cpotrs_(char *uplo, int *n, int *nrhs, std::complex *a, int *lda, std::complex *b, int *ldb, int *info); @@ -454,6 +551,18 @@ TORCH_META_FUNC(_linalg_solve_ex)(const Tensor& A, set_output_contiguous(3, shape.slice(0, ndim - 2), A.options().dtype(kInt)); } +TORCH_META_FUNC(linalg_inv_ex)(const Tensor& A, bool check_errors) { + at::native::squareCheckInputs(A, "linalg.inv"); + at::native::checkFloatingOrComplex(A, "linalg.inv", /*allow_low_precision_dtypes*/false); + + auto shape = A.sizes(); + + auto result_strides = at::native::batched_matrix_contiguous_strides(shape, /*f-contig*=*/true); + set_output_strided(0, shape, result_strides, A.options(), {}); + set_output_contiguous( + 1, shape.slice(0, shape.size() - 2), A.options().dtype(ScalarType::Int)); // info +} + TORCH_META_FUNC(linalg_lu_factor_ex)(const Tensor& A, bool pivot, bool check_errors) { TORCH_CHECK(A.dim() >= 2, "torch.lu_factor: Expected tensor with 2 or more dimensions. 
Got size: ", A.sizes(), " instead"); @@ -682,31 +791,12 @@ namespace native { // Define the per-batch functions to be used in the main implementation of the batched // linear algebra operations -template -void lapackGetri(int n, scalar_t *a, int lda, int *ipiv, scalar_t *work, int lwork, int *info); - template void lapackCholeskySolve(char uplo, int n, int nrhs, scalar_t *a, int lda, scalar_t *b, int ldb, int *info); template void lapackSymeig(char jobz, char uplo, int n, scalar_t *a, int lda, value_t *w, scalar_t *work, int lwork, value_t *rwork, int *info); -template<> void lapackGetri>(int n, c10::complex *a, int lda, int *ipiv, c10::complex *work, int lwork, int *info) { - zgetri_(&n, reinterpret_cast*>(a), &lda, ipiv, reinterpret_cast*>(work), &lwork, info); -} - -template<> void lapackGetri>(int n, c10::complex *a, int lda, int *ipiv, c10::complex *work, int lwork, int *info) { - cgetri_(&n, reinterpret_cast*>(a), &lda, ipiv, reinterpret_cast*>(work), &lwork, info); -} - -template<> void lapackGetri(int n, double *a, int lda, int *ipiv, double *work, int lwork, int *info) { - dgetri_(&n, a, &lda, ipiv, work, &lwork, info); -} - -template<> void lapackGetri(int n, float *a, int lda, int *ipiv, float *work, int lwork, int *info) { - sgetri_(&n, a, &lda, ipiv, work, &lwork, info); -} - template<> void lapackLu>(int m, int n, c10::complex *a, int lda, int *ipiv, int *info) { zgetrf_(&m, &n, reinterpret_cast*>(a), &lda, ipiv, info); } @@ -1508,228 +1598,51 @@ void _linalg_check_errors( TORCH_INTERNAL_ASSERT(false); } -bool _requires_fw_or_bw_grad(const Tensor& input) { +// If an input requires fw or bw grad then we need to go down a different +// (slower) path to ensure that the gradients are computable. +// That is what `_may_require_fw_or_bw_grad` is helpful for. +// +// Why is there a isTensorSubclassLike check here? +// Without it, this function can lead to composite compliance problems, which +// may lead to bugs in functorch, where a Tensor Subclass that doesn't +// require grad may wrap a Tensor subclass that requires grad. +bool _may_require_fw_or_bw_grad(const Tensor& input) { return ((at::GradMode::is_enabled() && input.requires_grad()) - || input._fw_grad(/*level */ 0).defined()); -} - -// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ inverse ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -/* -Computes the inverse of n-by-n matrix 'self' -This is an in-place routine, it overwrites the content of 'self'. -'infos_lu' and 'infos_getri' are int Tensors containing error codes for each matrix in the batched input. -'infos_lu' is for holding lapackLU errors, and 'infos_getri' is for holding lapackGetri errors. -For more information see LAPACK's documentation for GETRI and GETRF routines. 
-*/ -template -static void apply_inverse(Tensor& self, Tensor& infos_lu, Tensor& infos_getri) { -#if !AT_BUILD_WITH_LAPACK() - AT_ERROR("inverse: LAPACK library not found in compilation"); -#else - using value_t = typename c10::scalar_value_type::type; - auto self_data = self.data_ptr(); - auto self_matrix_stride = matrixStride(self); - auto batch_size = batchCount(self); - auto n = self.size(-2); - auto lda = std::max(1, n); - - auto ipiv = at::empty({lda}, self.options().dtype(kInt)); - auto ipiv_data = ipiv.data_ptr(); - auto infos_lu_data = infos_lu.data_ptr(); - auto infos_getri_data = infos_getri.data_ptr(); - - // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - int info; - // Run once, first to get the optimum work size - // Since we deal with batches of matrices with the same dimensions, doing this outside - // the loop saves (batch_size - 1) workspace queries which would provide the same result - // and (batch_size - 1) calls to allocate and deallocate workspace using at::empty() - int lwork = -1; - scalar_t wkopt; - lapackGetri(n, self_data, lda, ipiv_data, &wkopt, lwork, &info); - lwork = std::max(1, real_impl(wkopt)); - Tensor work = at::empty({lwork}, self.options()); - auto work_data = work.data_ptr(); - - for (const auto i : c10::irange(batch_size)) { - scalar_t* self_working_ptr = &self_data[i * self_matrix_stride]; - int* info_lu_working_ptr = &infos_lu_data[i]; - lapackLu(n, n, self_working_ptr, lda, ipiv_data, info_lu_working_ptr); - - // now compute the actual inverse - int* info_getri_working_ptr = &infos_getri_data[i]; - lapackGetri(n, self_working_ptr, lda, ipiv_data, work_data, lwork, info_getri_working_ptr); - } -#endif -} - -Tensor inverse(const Tensor &self) { - if (self.numel() == 0) { - return at::empty_like(self); - } - return at::linalg_inv(self); -} - -Tensor& inverse_out(const Tensor &self, Tensor &result) { - at::linalg_inv_out(result, self); - return result; -} - -// This is a type dispatching helper function for 'apply_inverse' -Tensor& _linalg_inv_out_helper_cpu(Tensor &result, Tensor& infos_lu, Tensor& infos_getri) { - // This function calculates the inverse matrix in-place - // result should be in column major order and contain matrices to invert - // the content of result is overwritten by 'apply_inverse' - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cpu", [&]{ - apply_inverse(result, infos_lu, infos_getri); - }); - return result; + || input._fw_grad(/*level */ 0).defined() + || isTensorSubclassLike(input)); } -// Computes the inverse matrix of 'input', it is saved to 'result' in-place -// LAPACK/MAGMA/cuSOLVER error codes are saved in 'infos' tensors, they are not checked here -static Tensor& linalg_inv_out_info(Tensor& result, Tensor& infos_lu, Tensor& infos_getri, const Tensor& input) { - squareCheckInputs(input, "linalg.inv"); - checkSameDevice("linalg.inv", result, input); - checkLinalgCompatibleDtype("linalg.inv", result, input); - - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_lu.scalar_type() == kInt); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_getri.scalar_type() == kInt); - - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_lu.device() == input.device()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_getri.device() == input.device()); - - bool result_input_same_type = (result.scalar_type() == input.scalar_type()); - bool result_equal_expected_shape = result.sizes().equals(input.sizes()); - bool is_batched_column_major = false; - if (result.dim() >= 2) { - is_batched_column_major = result.mT().is_contiguous(); - } - - // if 
result is not empty and not in batched column major format - bool copy_needed = (result.numel() != 0 && !is_batched_column_major); - copy_needed |= !result_input_same_type; // or result does not have the same dtype as input - copy_needed |= (result.numel() != 0 && !result_equal_expected_shape); // or result does not have the expected shape - // we have to allocate a temporary tensor - - // similar conditions for infos_lu and infos_getri tensors - auto expected_info_shape = IntArrayRef(input.sizes().cbegin(), input.sizes().cend() - 2); // input.shape[:-2] - copy_needed |= (infos_lu.numel() != 0 && !infos_lu.is_contiguous()); - copy_needed |= (infos_lu.numel() != 0 && !(infos_lu.sizes().equals(expected_info_shape))); - - copy_needed |= (infos_getri.numel() != 0 && !infos_getri.is_contiguous()); - copy_needed |= (infos_getri.numel() != 0 && !(infos_getri.sizes().equals(expected_info_shape))); - - if (copy_needed) { - Tensor result_tmp = at::empty(input.sizes(), input.options()); - result_tmp.transpose_(-2, -1); - Tensor infos_lu_tmp = at::zeros({expected_info_shape}, input.options().dtype(kInt)); - Tensor infos_getri_tmp = at::zeros({expected_info_shape}, input.options().dtype(kInt)); - - result_tmp = linalg_inv_out_info(result_tmp, infos_lu_tmp, infos_getri_tmp, input); - - at::native::resize_output(result, result_tmp.sizes()); - result.copy_(result_tmp); - at::native::resize_output(infos_lu, infos_lu_tmp.sizes()); - infos_lu.copy_(infos_lu_tmp); - at::native::resize_output(infos_getri, infos_getri_tmp.sizes()); - infos_getri.copy_(infos_getri_tmp); - return result; - } - // else use result's storage directly - - // if result has no elements we can modify it - if (result.numel() == 0) { - at::native::resize_as_(result, input.mT(), MemoryFormat::Contiguous); - result.transpose_(-2, -1); - } - - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.sizes().equals(input.sizes())); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.scalar_type() == input.scalar_type()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.device() == input.device()); - - // result tensor must be in batched column major order (Fortran contiguous) - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.mT().is_contiguous()); - - // if info has no elements we can modify it - if (infos_lu.numel() == 0) { - infos_lu.resize_(expected_info_shape); - infos_lu.fill_(0); - } - if (infos_getri.numel() == 0) { - infos_getri.resize_(expected_info_shape); - infos_getri.fill_(0); +// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ linalg.inv ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +TORCH_IMPL_FUNC(linalg_inv_ex_out)(const Tensor& A, bool check_errors, const Tensor& result, const Tensor& info) { + // Fill result with the identity + result.zero_(); + result.diagonal(0, -2, -1).fill_(1.); + at::linalg_solve_ex_out(const_cast(result), const_cast(info), A, result, /*left*/true); + if (check_errors) { + at::_linalg_check_errors(info, "linalg.inv_ex", A.dim() == 2); } - - // info tensors must be contiguous - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_lu.is_contiguous()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_lu.sizes().equals(expected_info_shape)); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_getri.is_contiguous()); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos_getri.sizes().equals(expected_info_shape)); - - // _linalg_inv_out_helper_ (apply_inverse) performs calculations in-place and result must be a copy of input - result.copy_(input); - - // TODO: Replace this helper with DECLARE/DEFINE_DISPATCH - result = at::_linalg_inv_out_helper_(result, infos_lu, infos_getri); - return result; } -// Computes the 
inverse matrix of 'input', it is saved to 'result' in-place -Tensor& linalg_inv_out(const Tensor &input, Tensor &result) { - auto info_shape = IntArrayRef(input.sizes().cbegin(), input.sizes().cend() - 2); // input.shape[:-2] - auto infos_lu = at::zeros({info_shape}, input.options().dtype(kInt)); - auto infos_getri = at::zeros({info_shape}, input.options().dtype(kInt)); - result = linalg_inv_out_info(result, infos_lu, infos_getri, input); - - // Now check LAPACK/MAGMA/cuSOLVER error codes - at::_linalg_check_errors(infos_lu, "linalg.inv", result.dim() == 2); - at::_linalg_check_errors(infos_getri, "linalg.inv", result.dim() == 2); +Tensor& linalg_inv_out(const Tensor& A, Tensor& result) { + auto info = at::empty({0}, A.options().dtype(kInt)); + at::linalg_inv_ex_out(result, info, A); + at::_linalg_check_errors(info, "linalg.inv", A.dim() == 2); return result; } -// Computes the inverse matrix of 'input' -Tensor linalg_inv(const Tensor &input) { +Tensor linalg_inv(const Tensor& A) { Tensor result, info; - std::tie(result, info) = at::linalg_inv_ex(input, /*check_errors=*/false); - - // we pass check_errors=false above and do the check here - // so that the name of the function is correct in the error message - at::_linalg_check_errors(info, "torch.linalg.inv", input.dim() == 2); + std::tie(result, info) = at::linalg_inv_ex(A); + at::_linalg_check_errors(info, "linalg.inv", A.dim() == 2); return result; } -std::tuple linalg_inv_ex_out(const Tensor& input, bool check_errors, Tensor& inverse, Tensor& info) { - squareCheckInputs(input, "linalg.inv_ex"); - ScalarType info_output_type = ScalarType::Int; - TORCH_CHECK( - info.scalar_type() == info_output_type, - "torch.linalg.inv_ex: ", - "Expected info to have ", info_output_type, " dtype, but got info with dtype ", info.scalar_type()); - - // provided `info` tensor is used to save the information about the LU decomposition of `input` - // in addition current implementation requires a separate tensor - // for saving the information about the inversion process after the LU decomposition - auto expected_info_shape = IntArrayRef(input.sizes().cbegin(), input.sizes().cend() - 2); // input.shape[:-2] - auto info_inversion = at::zeros({expected_info_shape}, input.options().dtype(kInt)); - - linalg_inv_out_info(inverse, info, info_inversion, input); - - if (check_errors) { - at::_linalg_check_errors(info, "torch.linalg.inv_ex", input.dim() == 2); - } - - return std::tuple(inverse, info); +Tensor& inverse_out(const Tensor& A, Tensor& result) { + return at::linalg_inv_out(result, A); } -std::tuple linalg_inv_ex(const Tensor& input, bool check_errors) { - squareCheckInputs(input, "linalg.inv_ex"); - Tensor inverse = at::empty(input.sizes(), input.options(), MemoryFormat::Contiguous); - inverse.transpose_(-2, -1); // make `inverse` tensor with batched column major format - auto info_shape = IntArrayRef(input.sizes().cbegin(), input.sizes().cend() - 2); // input.shape[:-2] - Tensor info = at::zeros({info_shape}, input.options().dtype(kInt)); - std::tie(inverse, info) = at::native::linalg_inv_ex_out(input, check_errors, inverse, info); - return std::make_tuple(inverse, info); +Tensor inverse(const Tensor& A) { + return at::linalg_inv(A); } // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cholesky_solve ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -2001,6 +1914,7 @@ TORCH_IMPL_FUNC(_linalg_solve_ex_out)(const Tensor& A, // Possible optimization: Compute the LU factorization of A^T if A is contiguous // Then we solve A^T X = B with adjoint=True // This saves a copy as A doesn't need 
to be copied into an F-contig matrix in lu_factor + // This optimization makes functorch's batching rule difficult. See NOTE [ solve_ex Batch Rule Contiguity ] const bool use_A_T = A.is_contiguous() && !A.is_complex(); at::linalg_lu_factor_ex_out(const_cast(LU), const_cast(pivots), @@ -2204,7 +2118,7 @@ TORCH_IMPL_FUNC(lu_unpack_out)(const Tensor& LU, .add_owned_input(pivots.contiguous()) .build(); - unpack_pivots_stub(pivots.device().type(), iter, std::min(m, n)); + unpack_pivots_stub(pivots.device().type(), iter, std::min(m, n), m); // Transform the permutation into a permutation matrix P.zero_(); @@ -2756,6 +2670,10 @@ Tensor& ormqr_out(const Tensor& input, const Tensor& tau, const Tensor& other, b left_size_condition, "] must be equal to input.shape[-2]"); + TORCH_CHECK( + tau.size(-1) <= input.size(-1), + "torch.ormqr: tau.shape[-1] must be less than or equal to input.shape[-1]"); + TORCH_CHECK( input.dim() - tau.dim() == 1, "torch.ormqr: ", @@ -2886,9 +2804,8 @@ std::tuple linalg_eigh_out(const Tensor& A, c10::string_view u Tensor linalg_eigvalsh(const Tensor& A, c10::string_view uplo) { - // See [Note: svdvals_compute_uv] for the condition in compute_v return std::get<0>(at::_linalg_eigh(A, uplo, - /*comptue_v=*/_requires_fw_or_bw_grad(A) || isTensorSubclassLike(A))); + /*compute_v=*/_may_require_fw_or_bw_grad(A))); } Tensor& linalg_eigvalsh_out(const Tensor& A, c10::string_view uplo, Tensor& L) { @@ -3346,7 +3263,7 @@ Tensor& linalg_eigvals_out(const Tensor& input, Tensor& values) { Tensor linalg_eigvals(const Tensor& input) { // if input requires grad we must compute the eigenvectors to make this function differentiable // the eigenvectors are not exposed to the user - if (_requires_fw_or_bw_grad(input)) { + if (_may_require_fw_or_bw_grad(input)) { return std::get<0>(at::linalg_eig(input)); } @@ -3358,66 +3275,6 @@ Tensor linalg_eigvals(const Tensor& input) { return values; } -// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ eig ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -DEFINE_DISPATCH(eig_stub); - -std::tuple eig_out(const Tensor& self, bool eigenvectors, Tensor& e, Tensor& v) { - TORCH_WARN_ONCE( - "torch.eig is deprecated in favor of torch.linalg.eig and will be removed in a future ", - "PyTorch release.\n", - "torch.linalg.eig returns complex tensors of dtype cfloat or cdouble rather than real tensors ", - "mimicking complex tensors.\n", - "L, _ = torch.eig(A)\n", - "should be replaced with\n", - "L_complex = torch.linalg.eigvals(A)\n", - "and\n", - "L, V = torch.eig(A, eigenvectors=True)\n", - "should be replaced with\n", - "L_complex, V_complex = torch.linalg.eig(A)" - ); - TORCH_CHECK(self.dim() == 2, "input should be 2 dimensional"); - TORCH_CHECK(self.size(0) == self.size(1), "input should be square"); - TORCH_CHECK(self.isfinite().all().item(), "input should not contain infs or NaNs"); - checkSameDevice("torch.eig", e, self, "eigenvalues"); - checkLinalgCompatibleDtype("torch.eig", e, self, "eigenvalues"); - if (eigenvectors) { - checkSameDevice("torch.eig", v, self, "eigenvectors"); - checkLinalgCompatibleDtype("torch.eig", v, self, "eigenvectors"); - } - int64_t n = self.size(-1); - - if (isComplexType(at::typeMetaToScalarType(self.dtype()))) { - at::native::resize_output(e, {n}); - } else { - at::native::resize_output(e, {n, 2}); - } - if (eigenvectors) { - at::native::resize_output(v, self.sizes()); - } - - // optimization: if self is empty, we can immediately return the empty - // tensors, instead of getting empty tensors from eig_helper - if (self.numel() == 0) { - return 
std::tuple(e, v); - } - - Tensor vals_, vecs_; - std::tie(vals_, vecs_) = eig_stub(self.device().type(), self, eigenvectors); - e.copy_(vals_); - if (eigenvectors) { - v.copy_(vecs_); - } - return std::tuple(e, v); -} - -std::tuple eig(const Tensor& self, bool eigenvectors) { - Tensor e = at::empty({0}, self.options()); - Tensor v = at::empty({0}, self.options()); - at::eig_out(e, v, self, eigenvectors); - return std::tuple(e, v); -} - // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ linalg_svd ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /* torch.svd, implemented in terms of torch.linalg.svd. There are two main @@ -3516,12 +3373,8 @@ Tensor& linalg_svdvals_out(const Tensor& A, c10::optional driv } Tensor linalg_svdvals(const Tensor& A, c10::optional driver) { - // [Note: svdvals_compute_uv] - // NB: Why do we need isTensorSubclassLike check for linalg_svdvals but not linalg_eigvals? - // svdvals is decomposed at the vmap level in functorch so A can be a BatchedTensor wrapping - // a TensorWrapper requiring fw or bw grad. return std::get<1>(at::_linalg_svd(A, /*full_matrices=*/false, - /*comptue_uv=*/_requires_fw_or_bw_grad(A) || isTensorSubclassLike(A), + /*compute_uv=*/_may_require_fw_or_bw_grad(A), /*driver=*/driver)); } @@ -3766,14 +3619,16 @@ static void linalg_lstsq_out_info( at::sum_out(residuals, raw_residuals, /*dim=*/-2, /*keepdim=*/false, /*dtype*/real_dtype); } } - solution = solution.narrow(/*dim=*/-2, /*start=*/0, /*length*/n); + auto solution_view = solution.narrow(/*dim=*/-2, /*start=*/0, /*length*/n); + // manually restride original + solution.set_(solution.storage(), solution_view.storage_offset(), solution_view.sizes(), solution_view.strides()); if (m == 0) { solution.zero_(); } // for 1-dimensional 'other', we need to squeeze the solution after "apply_lstsq" if (vector_case) { - solution = solution.squeeze_(-1); + solution.squeeze_(-1); } } @@ -3987,106 +3842,6 @@ std::tuple linalg_lstsq( return std::make_tuple(solution, residuals, rank, singular_values); } -// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ legacy_lstsq ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -// This wraps Lapack's gels routine, which uses a QR or LQ factorization to -// solve any linear system, minimizing ||A.X - B|| -// A & B must be fortran-contiguous matrixes. -// On exit, A is overwritten with the QR/LQ factorization of input A -// B is overwritten with the solution vectors -template -static void apply_lstsq(const Tensor& B, const Tensor& A) { -#if !AT_BUILD_WITH_LAPACK() - TORCH_INTERNAL_ASSERT(false, "lstsq: LAPACK library not found in compilation"); -#else - - int m, n, nrhs, lda, ldb, info, lwork; - scalar_t wkopt = 0.0; - lwork = -1; // work length - m = A.size(0); - n = A.size(1); - nrhs = B.size(1); - info = 0; - lda = m; - ldb = (m > n) ? 
m : n; - - auto B_data = B.data_ptr(); - auto A_data = A.data_ptr(); - - // get info how much space is needed - lapackGels('N', m, n, nrhs, A_data, lda, B_data, ldb, &wkopt, lwork, &info); - - lwork = static_cast(wkopt); - Tensor work_tensor = at::empty({lwork}, A.scalar_type()); - auto work = work_tensor.data_ptr(); - - lapackGels('N', m, n, nrhs, A_data, lda, B_data, ldb, work, lwork, &info); - - TORCH_CHECK( - info >= 0, - "Lapack Error in gels : Illegal argument ", -info); - TORCH_CHECK( - info == 0, - "Lapack Error in gels: The ", info, "-th diagonal element of the ", - "triangular factor of A is zero"); -#endif -} - -std::tuple legacy_lstsq(const Tensor& B, const Tensor& A) { - TORCH_WARN_ONCE( - "torch.lstsq is deprecated in favor of torch.linalg.lstsq and will be removed in a future PyTorch release.\n", - "torch.linalg.lstsq has reversed arguments and does not return the QR decomposition in " - "the returned tuple (although it returns other information about the problem).\n", - "To get the qr decomposition consider using torch.linalg.qr.\n", - "The returned solution in torch.lstsq stored the residuals of the solution in the ", - "last m - n columns of the returned value whenever m > n. In torch.linalg.lstsq, the ", - "residuals in the field 'residuals' of the returned named tuple.\n", - "The unpacking of the solution, as in\n", - "X, _ = torch.lstsq(B, A).solution[:A.size(1)]\n", - "should be replaced with\n", - "X = torch.linalg.lstsq(A, B).solution"); - - TORCH_CHECK(A.scalar_type() == B.scalar_type(), "Exepected A and B dtypes to match but found ", - A.scalar_type(), " and ", B.scalar_type()); - TORCH_CHECK(A.dim() == 2, "Expected A to have 2 dimensions, but got ", A.dim()); - TORCH_CHECK(A.numel() != 0, "A should not be empty"); - TORCH_CHECK(B.dim() == 1 || B.dim() == 2, "Expected B to have 1 or 2 " - "dimensions, but got ", B.dim()); - TORCH_CHECK(B.numel() != 0, "B should not be empty"); - TORCH_CHECK(A.size(0) == B.size(0), "Expected A and B to have same size " - "at dim 0, but A has ", A.size(0), " rows and B has ", B.size(0), " rows"); - - const auto a_sizes = A.sizes(); - const auto ldb = std::max(a_sizes[0], a_sizes[1]); - - auto A_working = cloneBatchedColumnMajor(A); - auto B_working = copyBatchedColumnMajor(B.dim() == 1 ? 
B.unsqueeze(1) : B, ldb); - - AT_DISPATCH_FLOATING_TYPES(B.scalar_type(), "lstsq_cpu", [&] { - apply_lstsq(B_working, A_working); - }); - - return std::tuple(B_working, A_working); -} - -std::tuple legacy_lstsq_out( - const Tensor& B, const Tensor& A, Tensor& B_out, Tensor& A_out) { - const auto dtype = A.scalar_type(); - TORCH_CHECK(B.scalar_type() == dtype, "exepected A and B dtypes to match but found ", - A.scalar_type(), " and ", B.scalar_type()); - TORCH_CHECK(A_out.scalar_type() == dtype, "A_out to have scalar type ", dtype, - " but found", A_out.scalar_type()); - TORCH_CHECK(B_out.scalar_type() == dtype, "A_out to have scalar type ", dtype, - " but found", B_out.scalar_type()); - Tensor A_tmp, B_tmp; - std::tie(B_tmp, A_tmp) = native::legacy_lstsq(B, A); - resize_output(A_out, A_tmp.sizes()); - A_out.copy_(A_tmp); - resize_output(B_out, B_tmp.sizes()); - B_out.copy_(B_tmp); - return std::tuple(B_out, A_out); -} - DEFINE_DISPATCH(ldl_factor_stub); TORCH_IMPL_FUNC(linalg_ldl_factor_ex_out) diff --git a/aten/src/ATen/native/BatchLinearAlgebra.h b/aten/src/ATen/native/BatchLinearAlgebra.h index 531595f3544e..955b83b3855a 100644 --- a/aten/src/ATen/native/BatchLinearAlgebra.h +++ b/aten/src/ATen/native/BatchLinearAlgebra.h @@ -231,10 +231,6 @@ using cholesky_inverse_fn = Tensor& (*)(Tensor& /*result*/, Tensor& /*infos*/, b DECLARE_DISPATCH(cholesky_inverse_fn, cholesky_inverse_stub); -using eig_fn = std::tuple (*)(const Tensor&, bool&); - -DECLARE_DISPATCH(eig_fn, eig_stub); - using linalg_eig_fn = void (*)(Tensor& /*eigenvalues*/, Tensor& /*eigenvectors*/, Tensor& /*infos*/, const Tensor& /*input*/, bool /*compute_eigenvectors*/); DECLARE_DISPATCH(linalg_eig_fn, linalg_eig_stub); @@ -284,7 +280,8 @@ DECLARE_DISPATCH(lu_factor_fn, lu_factor_stub); using unpack_pivots_fn = void(*)( TensorIterator& iter, - const int64_t dim_size); + const int64_t dim_size, + const int64_t max_pivot); DECLARE_DISPATCH(unpack_pivots_fn, unpack_pivots_stub); using lu_solve_fn = void (*)( diff --git a/aten/src/ATen/native/BatchLinearAlgebraKernel.cpp b/aten/src/ATen/native/BatchLinearAlgebraKernel.cpp index 5b18dbe2d5fa..e53d8cd2d38f 100644 --- a/aten/src/ATen/native/BatchLinearAlgebraKernel.cpp +++ b/aten/src/ATen/native/BatchLinearAlgebraKernel.cpp @@ -1,4 +1,6 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include @@ -7,6 +9,14 @@ #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + namespace at { namespace native { namespace { @@ -127,87 +137,6 @@ Tensor& cholesky_inverse_kernel_impl(Tensor& result, Tensor& infos, bool upper) return result; } -template -void apply_eig(const Tensor& self, bool eigenvectors, Tensor& vals_, Tensor& vecs_, int* info_ptr) { -#if !AT_BUILD_WITH_LAPACK() - TORCH_CHECK(false, "Calling torch.eig on a CPU tensor requires compiling ", - "PyTorch with LAPACK. Please use PyTorch built with LAPACK support."); -#else - using value_t = typename c10::scalar_value_type::type; - - char jobvr = eigenvectors ? 'V' : 'N'; - int64_t n = self.size(-1); - auto self_data = self.data_ptr(); - - auto vals_data = vals_.data_ptr(); - scalar_t* wr = vals_data; - - scalar_t* vecs_data = eigenvectors ? vecs_.data_ptr() : nullptr; - // NOLINTNEXTLINE(cppcoreguidelines-narrowing-conversions,bugprone-narrowing-conversions) - int ldvr = eigenvectors ? 
n : 1; - - Tensor rwork; - value_t* rwork_data = nullptr; - if (self.is_complex()) { - ScalarType real_dtype = toRealValueType(typeMetaToScalarType(self.dtype())); - rwork = at::empty({n*2}, self.options().dtype(real_dtype)); - rwork_data = rwork.data_ptr(); - } - - if (n > 0) { - // call lapackEig once to get the optimal size for work data - scalar_t wkopt; - // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - lapackEig('N', jobvr, n, self_data, n, wr, - nullptr, 1, vecs_data, ldvr, &wkopt, -1, rwork_data, info_ptr); - int lwork = std::max(1, real_impl(wkopt)); - - // call again to do the actual work - Tensor work = at::empty({lwork}, self.dtype()); - lapackEig('N', jobvr, n, self_data, n, wr, - nullptr, 1, vecs_data, ldvr, work.data_ptr(), lwork, rwork_data, info_ptr); - } -#endif -} - -std::tuple eig_kernel_impl(const Tensor& self, bool& eigenvectors) { - int64_t n = self.size(-1); - // lapackEig function expects the input to be column major, or stride {1, n}, - // so we must set the stride manually since the default stride for tensors is - // row major, {n, 1} - Tensor self_ = at::empty_strided( - {n, n}, - {1, n}, - at::TensorOptions(self.dtype())); - self_.copy_(self); - - auto options = self.options().memory_format(LEGACY_CONTIGUOUS_MEMORY_FORMAT); - - // the API is slightly different for the complex vs real case: if the input - // is complex, eigenvals will be a vector of complex. If the input is real, - // eigenvals will be a (n, 2) matrix containing the real and imaginary parts - // in each column - Tensor vals_; - if (self.is_complex()) { - vals_ = at::empty({n}, options); - } else { - vals_ = at::empty_strided({n, 2}, {1, n}, options); - } - Tensor vecs_ = eigenvectors - ? at::empty_strided({n, n}, {1, n}, options) - : Tensor(); - - // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - auto infos = at::zeros({}, self.options().dtype(kInt)); - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(self.scalar_type(), "eig_cpu", [&]{ - apply_eig(self_, eigenvectors, vals_, vecs_, infos.data_ptr()); - }); - // NOLINTNEXTLINE(clang-analyzer-core.CallAndMessage) - at::_linalg_check_errors(infos, "eig", /*is_matrix*/true); - - return std::tuple(vals_, vecs_); -} - /* Computes the eigenvalues and eigenvectors of n-by-n matrix 'input'. This is an in-place routine, content of 'input', 'values', 'vectors' is overwritten. @@ -522,15 +451,6 @@ Tensor& orgqr_kernel_impl(Tensor& result, const Tensor& tau) { return result; } -// we use `enum class LapackLstsqDriverType` as keys in an unordered_map. -// Clang5 and Gcc5 do not support std::hash for enum classes, hence -// we provide our own hash function. -struct LapackLstsqDriverTypeHash { - std::size_t operator()(const LapackLstsqDriverType& driver_type) const { - return static_cast(driver_type); - } -}; - /* Solves a least squares problem. That is minimizing ||B - A X||. 
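[Illustrative note, not part of the patch: the apply_lstsq kernels touched in the next hunk back torch.linalg.lstsq, the problem described in the comment just above, i.e. minimizing ||B - A X||. A minimal Python sketch of the user-facing call, assuming a build where torch.linalg is available:]

import torch

A = torch.randn(5, 3)              # m x n, overdetermined (m >= n)
B = torch.randn(5, 2)              # m x k right-hand sides

out = torch.linalg.lstsq(A, B)     # named tuple: solution, residuals, rank, singular_values
X = out.solution                   # n x k minimizer of ||A @ X - B||

# at the minimizer the residual is orthogonal to the column space of A
print(torch.allclose(A.T @ (A @ X - B), torch.zeros(3, 2), atol=1e-4))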
@@ -561,7 +481,7 @@ void apply_lstsq(const Tensor& A, Tensor& B, Tensor& rank, Tensor& singular_valu auto lapack_func = lapackLstsq; static auto driver_type_to_func - = std::unordered_map({ + = std::unordered_map({ {driver_t::Gels, lapackLstsq}, {driver_t::Gelsy, lapackLstsq}, {driver_t::Gelsd, lapackLstsq}, @@ -1072,6 +992,15 @@ void apply_lu_solve(const Tensor& LU, const Tensor& pivots, const Tensor& B, Tra // This is a type dispatching helper function for 'apply_lu_solve' void lu_solve_kernel(const Tensor& LU, const Tensor& pivots, const Tensor& B, TransposeType trans) { + // Lapack will write into unrelated memory if pivots are not in the right range so we do + // some simple sanity checks here for the CPU version + TORCH_CHECK(pivots.gt(0).all().item(), + "Pivots given to lu_solve must all be greater or equal to 1. " + "Did you properly pass the result of lu_factor?"); + TORCH_CHECK(pivots.le(LU.size(-2)).all().item(), + "Pivots given to lu_solve must all be smaller or equal to LU.size(-2). " + "Did you properly pass the result of lu_factor?"); + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(LU.scalar_type(), "linalg.lu_solve_cpu", [&]{ apply_lu_solve(LU, pivots, B, trans); }); @@ -1157,7 +1086,7 @@ void svd_kernel(const Tensor& A, }); } -void unpack_pivots_cpu_kernel(TensorIterator& iter, const int64_t dim_size) { +void unpack_pivots_cpu_kernel(TensorIterator& iter, const int64_t dim_size, const int64_t max_pivot) { if (iter.numel() == 0) { return; } @@ -1173,9 +1102,13 @@ void unpack_pivots_cpu_kernel(TensorIterator& iter, const int64_t dim_size) { const auto pivots_data = reinterpret_cast(pivots_ptr); for (const auto i : c10::irange(dim_size)) { + auto new_idx = pivots_data[i] - 1; + TORCH_CHECK(new_idx >= 0 && new_idx < max_pivot, + "pivots passed to lu_unpack must be between 1 and LU.size(-2) inclusive." 
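[Aside, not part of the diff: the TORCH_CHECKs added in this file guard lu_solve and lu_unpack against pivots outside the valid 1-based range, which previously could let LAPACK touch unrelated memory. A short Python sketch of the intended calling pattern, where pivots always come from lu_factor:]

import torch

A = torch.randn(4, 4)
LU, pivots = torch.linalg.lu_factor(A)    # pivots are 1-based, each in [1, A.size(-2)]

b = torch.randn(4, 1)
x = torch.linalg.lu_solve(LU, pivots, b)  # reuse the factorization to solve A x = b
print(torch.allclose(A @ x, b, atol=1e-4))

P, L, U = torch.lu_unpack(LU, pivots)     # hand-rolled or corrupted pivots now raise instead
print(torch.allclose(P @ L @ U, A, atol=1e-4))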
+ "Did you properly pass the result of lu_factor?"); std::swap( perm_data[i], - perm_data[pivots_data[i] - 1] + perm_data[new_idx] ); } @@ -1200,12 +1133,6 @@ REGISTER_AVX2_DISPATCH(cholesky_inverse_stub, &cholesky_inverse_kernel_impl); REGISTER_VSX_DISPATCH(cholesky_inverse_stub, &cholesky_inverse_kernel_impl); REGISTER_ZVECTOR_DISPATCH(cholesky_inverse_stub, &cholesky_inverse_kernel_impl); -REGISTER_ARCH_DISPATCH(eig_stub, DEFAULT, &eig_kernel_impl); -REGISTER_AVX512_DISPATCH(eig_stub, &eig_kernel_impl); -REGISTER_AVX2_DISPATCH(eig_stub, &eig_kernel_impl); -REGISTER_VSX_DISPATCH(eig_stub, &eig_kernel_impl); -REGISTER_ZVECTOR_DISPATCH(eig_stub, &eig_kernel_impl); - REGISTER_ARCH_DISPATCH(linalg_eig_stub, DEFAULT, &linalg_eig_kernel); REGISTER_AVX512_DISPATCH(linalg_eig_stub, &linalg_eig_kernel); REGISTER_AVX2_DISPATCH(linalg_eig_stub, &linalg_eig_kernel); diff --git a/aten/src/ATen/native/Batching.cpp b/aten/src/ATen/native/Batching.cpp index 109499f9cb17..b50b6201b7a2 100644 --- a/aten/src/ATen/native/Batching.cpp +++ b/aten/src/ATen/native/Batching.cpp @@ -1,3 +1,4 @@ +#include #include #include #include diff --git a/aten/src/ATen/native/BinaryOps.cpp b/aten/src/ATen/native/BinaryOps.cpp index 807170026a21..e0815b786d17 100644 --- a/aten/src/ATen/native/BinaryOps.cpp +++ b/aten/src/ATen/native/BinaryOps.cpp @@ -1,15 +1,149 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include -#include -#include -#include +#include +#include +#include +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include #include -#include -#include -#include -#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif namespace at { @@ -880,7 +1014,8 @@ Tensor mul_zerotensor(const Tensor& self, const Tensor& other) { auto out_device = correct_out_device(self, other); // hack to use the TensorIterator to get the correct broadcasting and type promotion logic auto device_ = Device(DeviceType::Meta); - auto meta_out = at::redispatch::mul(c10::DispatchKeySet(at::DispatchKey::Meta), self.to(device_), other.to(device_)); + constexpr c10::DispatchKeySet meta_dks(at::DispatchKey::Meta); + auto meta_out = at::_ops::mul_Tensor::redispatch(meta_dks, self.to(device_), other.to(device_)); return at::_efficientzerotensor(meta_out.sizes(), 
meta_out.options().device(out_device)); } @@ -888,7 +1023,8 @@ Tensor div_zerotensor(const Tensor& self, const Tensor& other) { auto out_device = correct_out_device(self, other); // hack to use the TensorIterator to get the correct broadcasting and type promotion logic auto device_ = Device(DeviceType::Meta); - auto meta_out = at::redispatch::div(c10::DispatchKeySet(at::DispatchKey::Meta), self.to(device_), other.to(device_)); + constexpr c10::DispatchKeySet meta_dks(at::DispatchKey::Meta); + auto meta_out = at::_ops::div_Tensor::redispatch(meta_dks, self.to(device_), other.to(device_)); if (self._is_zerotensor()) { if (other._is_zerotensor()) { @@ -916,7 +1052,9 @@ Tensor maybe_add_maybe_sub(const Tensor& self, const Tensor& other, const Scalar auto out_device = correct_out_device(self, other); // hack to use the TensorIterator to get the correct broadcasting and type promotion logic auto device_ = Device(DeviceType::Meta); - auto meta_out = at::redispatch::add(c10::DispatchKeySet(at::DispatchKey::Meta), self.to(device_), other.to(device_)); + constexpr c10::DispatchKeySet meta_dks(at::DispatchKey::Meta); + auto meta_out = at::_ops::add_Tensor::redispatch( + meta_dks, self.to(device_), other.to(device_), alpha); auto get_out_like = [&] (const Tensor& tensor) { @@ -951,7 +1089,7 @@ Tensor linalg_cross_zerotensor( // hack to use the TensorIterator to get the correct broadcasting and type // promotion logic (see add_zerotensor) auto device = Device(DeviceType::Meta); - auto meta_out = at::redispatch::linalg_cross( + auto meta_out = at::_ops::linalg_cross::redispatch( c10::DispatchKeySet(at::DispatchKey::Meta), input.to(device), other.to(device), diff --git a/aten/src/ATen/native/Blas.cpp b/aten/src/ATen/native/Blas.cpp index 0e9f62d9a3f1..deda705d0887 100644 --- a/aten/src/ATen/native/Blas.cpp +++ b/aten/src/ATen/native/Blas.cpp @@ -1,12 +1,31 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include +#include #include -#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace meta { TORCH_META_FUNC(addmv)(const Tensor &self, const Tensor &mat, const Tensor &vec, const Scalar& beta, const Scalar& alpha) { diff --git a/aten/src/ATen/native/BlasKernel.cpp b/aten/src/ATen/native/BlasKernel.cpp index 9cf1f995f3ca..87182b3514df 100644 --- a/aten/src/ATen/native/BlasKernel.cpp +++ b/aten/src/ATen/native/BlasKernel.cpp @@ -1,8 +1,12 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include -#include +#include #include +#include #include +#include +#include #if AT_BUILD_WITH_BLAS() extern "C" double ddot_(int *n, double *x, int *incx, double *y, int *incy); diff --git a/aten/src/ATen/native/Bucketization.cpp b/aten/src/ATen/native/Bucketization.cpp index 15d30c137d5b..7b53a31c5be7 100644 --- a/aten/src/ATen/native/Bucketization.cpp +++ b/aten/src/ATen/native/Bucketization.cpp @@ -1,10 +1,17 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + /* Implement a numpy like searchsorted and a TF like bucketize function running on cpu * * - torch.searchsorted(sorted_sequence, values, right=False, side='left', out_int32=False, sorter=None) diff --git a/aten/src/ATen/native/CPUBlas.cpp b/aten/src/ATen/native/CPUBlas.cpp index 13593a337949..b78e57fc63d6 
100644 --- a/aten/src/ATen/native/CPUBlas.cpp +++ b/aten/src/ATen/native/CPUBlas.cpp @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include #include diff --git a/aten/src/ATen/native/CPUFallback.cpp b/aten/src/ATen/native/CPUFallback.cpp index 5199fb8acc78..985ee15a5a99 100644 --- a/aten/src/ATen/native/CPUFallback.cpp +++ b/aten/src/ATen/native/CPUFallback.cpp @@ -1,13 +1,19 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include #include #include -#include #include -#include + +#ifndef AT_PER_OPERATOR_HEADERS #include +#else +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/CPUFallback.h b/aten/src/ATen/native/CPUFallback.h index 91f1f08c1184..2d4dfc98aa06 100644 --- a/aten/src/ATen/native/CPUFallback.h +++ b/aten/src/ATen/native/CPUFallback.h @@ -15,27 +15,21 @@ TORCH_API void cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* st // This is a helper function that backends can use to directly call their boxed CPU fallback // TODO: update and add a usage example after https://github.com/pytorch/pytorch/pull/58092 lands. -template +template struct _call_fallback_fn final {}; -template -struct _call_fallback_fn final { - static_assert(std::is_same::return_type>::value, - "Return type mismatch"); - static_assert(std::is_same, typename guts::infer_function_traits_t::parameter_types>::value, - "Parameter types mismatch"); - - static ReturnType call(ParameterTypes... args) { +template +struct _call_fallback_fn final { + static ReturnType call(typename c10::maybe_keep_symint::type... args) { auto op = c10::Dispatcher::singleton() // TODO: figure out how to make compiler happy without dynamic casts .findSchemaOrThrow((const char*) Op::name, (const char*) Op::overload_name) //.findSchemaOrThrow("a", "b") - .typed(); - return c10::impl::BoxedKernelWrapper::call( + .typed::type...)>(); + return c10::impl::BoxedKernelWrapper::type...)>::call( c10::BoxedKernel::makeFromFunction(), op, c10::DispatchKeySet(), // we know that the cpu_fallback doesn't use the dispatch keyset. - //std::forward(args...) // TODO: get std::forward<> to work args... 
); @@ -43,7 +37,10 @@ struct _call_fallback_fn final { }; template -using call_fallback_fn = _call_fallback_fn; +using call_fallback_fn_symint = _call_fallback_fn; + +template +using call_fallback_fn = _call_fallback_fn; } // namespace native } // namespace at diff --git a/aten/src/ATen/native/ChanelShuffle.cpp b/aten/src/ATen/native/ChanelShuffle.cpp index 7def359e7056..a4f9f2bfe864 100644 --- a/aten/src/ATen/native/ChanelShuffle.cpp +++ b/aten/src/ATen/native/ChanelShuffle.cpp @@ -1,14 +1,23 @@ -#include - +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include #if defined(C10_MOBILE) && defined(USE_XNNPACK) #include #endif #include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/Col2Im.cpp b/aten/src/ATen/native/Col2Im.cpp index f1e08a887c84..5ce747e9c7a7 100644 --- a/aten/src/ATen/native/Col2Im.cpp +++ b/aten/src/ATen/native/Col2Im.cpp @@ -1,12 +1,21 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include -#include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + // Note [im2col/col2im output padding] // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ // Our implementations of im2col and col2im take both the input height/width as @@ -135,7 +144,6 @@ static void col2im_out_cpu_template( int64_t n_output_plane = n_input_plane / (kernel_width * kernel_height); output.resize_({batch_size, n_output_plane, output_height, output_width}); - output.zero_(); AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kBFloat16, kHalf, input.scalar_type(), "col2im_out_cpu", [&] { @@ -179,18 +187,6 @@ static void col2im_out_cpu_template( }); } -void col2im_backward_out_cpu_template( - Tensor& grad_input, - const Tensor& grad_output, - IntArrayRef kernel_size, - IntArrayRef dilation, - IntArrayRef padding, - IntArrayRef stride) { - // im2col_out_cpu checks size of kernel_size, dilation, padding and stride - at::native::im2col_out_cpu( - grad_output, kernel_size, dilation, padding, stride, grad_input); -} - } // namespace Tensor& col2im_out_cpu(const Tensor& input, @@ -219,29 +215,5 @@ Tensor col2im_cpu( return output; } -Tensor& col2im_backward_out_cpu(const Tensor& grad_output, - IntArrayRef kernel_size, - IntArrayRef dilation, - IntArrayRef padding, - IntArrayRef stride, - Tensor& grad_input) { - col2im_backward_out_cpu_template( - grad_input, grad_output, kernel_size, dilation, padding, stride); - return grad_input; -} - -Tensor col2im_backward_cpu( - const Tensor& grad_output, - IntArrayRef kernel_size, - IntArrayRef dilation, - IntArrayRef padding, - IntArrayRef stride) { - Tensor grad_input = at::empty_like(grad_output, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - - col2im_backward_out_cpu_template( - grad_input, grad_output, kernel_size, dilation, padding, stride); - return grad_input; -} - } // namespace native } // namespace at diff --git a/aten/src/ATen/native/ComparisonUtils.cpp b/aten/src/ATen/native/ComparisonUtils.cpp new file mode 100644 index 000000000000..c16c361c3442 --- /dev/null +++ b/aten/src/ATen/native/ComparisonUtils.cpp @@ -0,0 +1,32 @@ +#include +#include +#include +#include +#include + +namespace at { + +class Tensor; + +namespace native { + +template +void _assert_match(const O& original, const C& compared, const std::string& name) { + if (compared) { + bool equal = (original == compared.value()); + if (!equal) { + std::stringstream msg; 
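[Aside on the Col2Im.cpp removal above, illustration only: col2im_backward_out_cpu was a thin wrapper that simply called im2col, reflecting that the two transforms are each other's backward. The public Python wrappers are fold/unfold; a minimal sketch:]

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8, requires_grad=True)

cols = F.unfold(x, kernel_size=3)                     # im2col: (1, 3*3*3, 6*6) patch columns
y = F.fold(cols, output_size=(8, 8), kernel_size=3)   # col2im: scatter-add patches back

# fold(unfold(x)) weights each pixel by the number of patches covering it,
# and autograd routes fold's backward through unfold (and vice versa)
y.sum().backward()
print(x.grad.shape)                                   # torch.Size([1, 3, 8, 8])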
+ msg << "Tensor " << name << " mismatch!"; + AT_ASSERT(equal, msg.str()); + } + } +} + +void _assert_tensor_metadata(at::Tensor const& tensor, at::OptionalIntArrayRef sizes, at::OptionalIntArrayRef strides, c10::optional dtype) { + _assert_match(tensor.sizes(), sizes, "sizes"); + _assert_match(tensor.strides(), strides, "strides"); + _assert_match(tensor.dtype(), dtype, "dtype"); +} + +} +} // namespace at::native diff --git a/aten/src/ATen/native/ComplexHelper.h b/aten/src/ATen/native/ComplexHelper.h index 88668d13145c..9533115a7066 100644 --- a/aten/src/ATen/native/ComplexHelper.h +++ b/aten/src/ATen/native/ComplexHelper.h @@ -1,8 +1,15 @@ #pragma once -#include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif + // WARNING: this header contains non-inline functions and should be only // included from ONE cpp file @@ -11,19 +18,18 @@ namespace at { namespace native { // View tensor with new dtype, storage offset, sizes and strides inline Tensor view_tensor( const Tensor &tensor, ScalarType dtype, - int64_t offset, IntArrayRef sizes, IntArrayRef strides) { + c10::SymInt offset, SymIntArrayRef sizes, SymIntArrayRef strides) { Storage storage = tensor.storage(); auto key_set = tensor.key_set().remove(DispatchKey::Conjugate); auto new_tensor = detail::make_tensor( c10::TensorImpl::VIEW, std::move(storage), key_set, scalarTypeToTypeMeta(dtype)); auto * impl = new_tensor.unsafeGetTensorImpl(); - impl->set_storage_offset(offset); - impl->set_sizes_and_strides(sizes, strides); + impl->set_sizes_and_strides(sizes, strides, offset); return new_tensor; } -inline DimVector computeStrideForViewAsReal(IntArrayRef oldstride) { - DimVector res(oldstride.size() + 1); +inline SymDimVector computeStrideForViewAsReal(SymIntArrayRef oldstride) { + SymDimVector res(oldstride.size() + 1); for (const auto i : c10::irange(oldstride.size())) { res[i] = oldstride[i] * 2; } @@ -33,13 +39,13 @@ inline DimVector computeStrideForViewAsReal(IntArrayRef oldstride) { Tensor _view_as_real_physical(const Tensor& self) { TORCH_CHECK(self.is_complex(), "view_as_real is only supported for complex tensors"); - auto old_sizes = self.sizes(); - DimVector new_sizes(old_sizes.size() + 1); + auto old_sizes = self.sym_sizes(); + SymDimVector new_sizes(old_sizes.size() + 1); std::copy(old_sizes.begin(), old_sizes.end(), new_sizes.begin()); // last dimension will always have two elements containing the real and imag vals new_sizes.back() = 2; - auto new_strides = computeStrideForViewAsReal(self.strides()); - auto new_storage_offset = 2 * self.storage_offset(); + auto new_strides = computeStrideForViewAsReal(self.sym_strides()); + auto new_storage_offset = self.sym_storage_offset() * 2; const auto float_type = c10::toRealValueType(self.scalar_type()); auto real_tensor = view_tensor(self, float_type, new_storage_offset, new_sizes, new_strides); return real_tensor; @@ -53,11 +59,11 @@ Tensor view_as_real(const Tensor& self) { return _view_as_real_physical(self); } -inline DimVector computeStrideForViewAsComplex(IntArrayRef oldstride) { +inline SymDimVector computeStrideForViewAsComplex(SymIntArrayRef oldstride) { const int64_t dim = oldstride.size(); TORCH_CHECK(oldstride[dim-1] == 1, "Tensor must have a last dimension with stride 1"); - DimVector res(dim - 1); + SymDimVector res(dim - 1); for (const auto i : c10::irange(res.size())) { TORCH_CHECK(oldstride[i] % 2 == 0, "Tensor must have a stride divisible by 2 for all but last dimension"); res[i] = oldstride[i] / 2; @@ -72,16 +78,16 @@ 
Tensor view_as_complex(const Tensor& self) { self.scalar_type() == kFloat || self.scalar_type() == kDouble || self.scalar_type() == kHalf, "view_as_complex is only supported for half, float and double tensors, but got a tensor of scalar type: ", self.scalar_type()); - auto old_sizes = self.sizes(); + auto old_sizes = self.sym_sizes(); TORCH_CHECK(old_sizes.size() != 0, "Input tensor must have one or more dimensions"); TORCH_CHECK(old_sizes[old_sizes.size()-1] == 2, "Tensor must have a last dimension of size 2"); - DimVector new_sizes(old_sizes.begin(), old_sizes.end() - 1); + SymDimVector new_sizes(old_sizes.begin(), old_sizes.end() - 1); - const auto new_strides = computeStrideForViewAsComplex(self.strides()); + const auto new_strides = computeStrideForViewAsComplex(self.sym_strides()); const auto complex_type = c10::toComplexType(self.scalar_type()); - TORCH_CHECK(self.storage_offset() % 2 == 0, "Tensor must have a storage_offset divisible by 2"); - const auto new_storage_offset = self.storage_offset() / 2; + TORCH_CHECK(self.sym_storage_offset() % 2 == 0, "Tensor must have a storage_offset divisible by 2"); + const auto new_storage_offset = self.sym_storage_offset() / 2; return view_tensor(self, complex_type, new_storage_offset, new_sizes, new_strides); } diff --git a/aten/src/ATen/native/ConvUtils.h b/aten/src/ATen/native/ConvUtils.h index 8493deba7b33..880ce0c2af54 100644 --- a/aten/src/ATen/native/ConvUtils.h +++ b/aten/src/ATen/native/ConvUtils.h @@ -80,40 +80,7 @@ static inline bool cudnnv8_use_heur_mode_b() { return cudnnv8_heuristic_mode_b; } -// NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init) -struct ConvParams { - std::vector stride; - std::vector padding; - std::vector dilation; - bool transposed; - std::vector output_padding; - int groups; - bool benchmark; - bool deterministic; - bool cudnn_enabled; - bool allow_tf32; - - bool is_strided() const; - bool is_dilated() const; - bool is_padded() const; - bool is_output_padding_neg() const; - bool is_output_padding_big() const; - bool is_padding_neg() const; - bool is_stride_nonpos() const; - void view1d_as_2d(); - bool use_cpu_depthwise3x3_winograd(const at::Tensor& input, const at::Tensor& weight) const; - bool needs_64bit_indexing_no_split(const at::Tensor& input, const at::Tensor& weight) const; - bool use_cudnn(const at::Tensor& input, const at::Tensor& weight) const; - bool use_cudnn_depthwise(const at::Tensor& input, const at::Tensor& weight) const; - bool use_miopen(const at::Tensor& input, const at::Tensor& weight, bool bias_defined) const; - bool use_mkldnn(const at::Tensor& input, const at::Tensor& weight) const; - bool use_nnpack(const at::Tensor& input, const at::Tensor& weight) const; - bool use_xnnpack(const at::Tensor& input, const at::Tensor& weight, - const at::OptionalIntArrayRef bias_sizes_opt) const; - bool use_mps(const at::Tensor& input, const at::Tensor& weight) const; - bool is_depthwise(const at::Tensor& input, const at::Tensor& weight) const; -}; - +// Keep in sync with py::enum_ in Module.cpp enum class ConvBackend { CudaDepthwise2d, CudaDepthwise3d, @@ -139,24 +106,16 @@ enum class ConvBackend { MpsTranspose, }; -// Function to select the convolution backend based on the inputs and params. -// This overload is used within the convolution internals but not exposed to python. -// NB: The forward pass provides a bias tensor while the backward pass provides -// a bool indicating whether the bias is defined. This is done to save memory by -// avoiding saving the full bias tensor for backward. 
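[Aside on the ComplexHelper.h hunk above, illustration only: view_as_real/view_as_complex now carry SymInt sizes, strides and storage offset, but the invariants they enforce are unchanged; a last dimension of size 2 with stride 1, strides divisible by 2 elsewhere, and an even storage offset. A small Python sketch of the user-visible behaviour:]

import torch

x = torch.randn(4, 2)                      # last dim of size 2, last stride 1
c = torch.view_as_complex(x)               # complex64 view, shape (4,), no copy
r = torch.view_as_real(c)                  # back to shape (4, 2)
print(c.dtype, r.shape, x.data_ptr() == r.data_ptr())

try:
    torch.view_as_complex(torch.randn(2, 4).t())   # transposed: last-dim stride != 1
except RuntimeError as err:
    print("rejected:", err)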
-TORCH_API ConvBackend select_conv_backend( - const Tensor& input, - const Tensor& weight, - const at::OptionalIntArrayRef bias_sizes_opt, - const bool need_backward, - const ConvParams& params); - // Overload for selecting the convolution backend from the full set of convolution inputs. // This overload is exposed to python for testing, etc. TORCH_API ConvBackend select_conv_backend( const Tensor& input, const Tensor& weight, const c10::optional& bias_opt, - IntArrayRef stride, IntArrayRef padding, IntArrayRef dilation, - bool transposed, IntArrayRef output_padding, int64_t groups); + IntArrayRef stride, SymIntArrayRef padding, IntArrayRef dilation, + bool transposed, SymIntArrayRef output_padding, int64_t groups, const at::OptionalSymIntArrayRef bias_sizes_opt); + +TORCH_API at::MemoryFormat _determine_backend_memory_format(const Tensor& input, + const Tensor& weight, + const ConvBackend backend); // --------------------------------------------------------------------- // @@ -227,7 +186,7 @@ static void convolution_shape_check( // Input checkDimRange(c, input, 3, 6 /* exclusive */); - checkSize(c, input, input_channels_dim, weight->size(1) * groups); + checkSize_symint(c, input, input_channels_dim, weight->size(1) * groups); // Weight checkSameDim(c, input, weight); @@ -241,15 +200,16 @@ static void convolution_shape_check( // as conv_output_size loses information; this is why conv_input_size // takes an extra output_padding argument to resolve the ambiguity. -static inline std::vector conv_output_size( - IntArrayRef input_size, IntArrayRef weight_size, - IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation = IntArrayRef() +template +static inline std::vector _conv_output_size( + ArrayRef input_size, ArrayRef weight_size, + ArrayRef padding, IntArrayRef stride, IntArrayRef dilation = IntArrayRef() ) { // ASSERT(input_size.size() > 2) // ASSERT(input_size.size() == weight_size.size()) bool has_dilation = dilation.size() > 0; auto dim = input_size.size(); - std::vector output_size(dim); + std::vector output_size(dim); output_size[0] = input_size[input_batch_size_dim]; output_size[1] = weight_size[weight_output_channels_dim]; for (const auto d : c10::irange(2, dim)) { @@ -260,40 +220,84 @@ static inline std::vector conv_output_size( return output_size; } -static inline std::vector conv_input_size( - IntArrayRef output_size, IntArrayRef weight_size, - IntArrayRef padding, IntArrayRef output_padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups +static inline std::vector conv_output_size( + IntArrayRef input_size, IntArrayRef weight_size, + IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation = IntArrayRef() +) { + return _conv_output_size(input_size, weight_size, padding, stride, dilation); +} + +static inline std::vector conv_output_size( + SymIntArrayRef input_size, SymIntArrayRef weight_size, + SymIntArrayRef padding, IntArrayRef stride, IntArrayRef dilation = IntArrayRef() +) { + return _conv_output_size(input_size, weight_size, padding, stride, dilation); +} + +template +std::vector _conv_input_size( + ArrayRef output_size, ArrayRef weight_size, + ArrayRef padding, ArrayRef output_padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups ) { // ASSERT(output_size.size() > 2) // ASSERT(output_size.size() == weight_size.size()) auto dim = output_size.size(); - std::vector input_size(dim); + std::vector input_size(dim); input_size[0] = output_size[output_batch_size_dim]; input_size[1] = weight_size[weight_input_channels_dim] * groups; for (const auto d 
: c10::irange(2, dim)) { - int kernel = dilation[d - 2] * (weight_size[d] - 1) + 1; - input_size[d] = (output_size[d] - 1) * stride[d - 2] - (2 * padding[d - 2]) + + auto kernel = (weight_size[d] - 1) * dilation[d - 2] + 1; + input_size[d] = (output_size[d] - 1) * stride[d - 2] - (padding[d - 2] * 2) + kernel + output_padding[d - 2]; } return input_size; } -static inline std::vector conv_weight_size( - IntArrayRef input_size, IntArrayRef output_size, +static inline std::vector conv_input_size( + SymIntArrayRef output_size, SymIntArrayRef weight_size, + SymIntArrayRef padding, SymIntArrayRef output_padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups +) { + return _conv_input_size(output_size, weight_size, padding, output_padding, stride, dilation, groups); +} + +static inline std::vector conv_input_size( + IntArrayRef output_size, IntArrayRef weight_size, IntArrayRef padding, IntArrayRef output_padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups +) { + return _conv_input_size(output_size, weight_size, padding, output_padding, stride, dilation, groups); +} + +template +std::vector _conv_weight_size( + ArrayRef input_size, ArrayRef output_size, + ArrayRef padding, ArrayRef output_padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups ) { auto dim = input_size.size(); - std::vector weight_size(dim); + std::vector weight_size(dim); weight_size[0] = output_size[1]; weight_size[1] = input_size[1] / groups; for (const auto d : c10::irange(2, dim)) { - int kernel = input_size[d] - (output_size[d] - 1) * stride[d - 2] - + 2 * padding[d - 2] - output_padding[d - 2]; + auto kernel = input_size[d] - (output_size[d] - 1) * stride[d - 2] + + padding[d - 2] * 2 - output_padding[d - 2]; weight_size[d] = (kernel - 1) / dilation[d - 2] + 1; } return weight_size; } +static inline std::vector conv_weight_size( + SymIntArrayRef input_size, SymIntArrayRef output_size, + SymIntArrayRef padding, SymIntArrayRef output_padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups +) { + return _conv_weight_size(input_size, output_size, padding, output_padding, stride, dilation, groups); +} + +static inline std::vector conv_weight_size( + IntArrayRef input_size, IntArrayRef output_size, + IntArrayRef padding, IntArrayRef output_padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups +) { + return _conv_weight_size(input_size, output_size, padding, output_padding, stride, dilation, groups); +} + static inline Tensor reshape_bias(int64_t dim, const Tensor& bias) { std::vector shape(dim, 1); shape[1] = -1; diff --git a/aten/src/ATen/native/Convolution.cpp b/aten/src/ATen/native/Convolution.cpp index 9f2d8efbd618..edb51a5c837d 100644 --- a/aten/src/ATen/native/Convolution.cpp +++ b/aten/src/ATen/native/Convolution.cpp @@ -1,20 +1,25 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include +#include #include #include #include #include #include #include -#include #include #include - -#include #include - #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + #if AT_NNPACK_ENABLED() #include #endif @@ -23,311 +28,70 @@ #include #endif -constexpr int MIOPEN_DIM_MAX = 5; - -namespace at { namespace native { - -DEFINE_DISPATCH(conv_depthwise2d_backward_stub); -DEFINE_DISPATCH(conv_depthwise3d_backward_stub); -DEFINE_DISPATCH(cudnn_convolution_backward_stub); -DEFINE_DISPATCH(cudnn_convolution_transpose_backward_stub); -DEFINE_DISPATCH(slow_conv_transpose3d_backward_stub); 
-DEFINE_DISPATCH(convolution_depthwise3x3_winograd_stub); -DEFINE_DISPATCH(miopen_convolution_backward_stub); -DEFINE_DISPATCH(miopen_convolution_transpose_backward_stub); -DEFINE_DISPATCH(miopen_depthwise_convolution_backward_stub); -DEFINE_DISPATCH(mkldnn_convolution_backward_stub); -DEFINE_DISPATCH(slow_conv_dilated2d_backward_stub); -DEFINE_DISPATCH(slow_conv_dilated3d_backward_stub); -DEFINE_DISPATCH(slow_conv_transpose2d_backward_stub); -REGISTER_NO_CPU_DISPATCH(conv_depthwise2d_backward_stub); -REGISTER_NO_CPU_DISPATCH(conv_depthwise3d_backward_stub); -REGISTER_NO_CPU_DISPATCH(cudnn_convolution_backward_stub); -REGISTER_NO_CPU_DISPATCH(cudnn_convolution_transpose_backward_stub); -REGISTER_NO_CPU_DISPATCH(miopen_convolution_backward_stub); -REGISTER_NO_CPU_DISPATCH(miopen_convolution_transpose_backward_stub); -REGISTER_NO_CPU_DISPATCH(miopen_depthwise_convolution_backward_stub); - -std::ostream& operator<<(std::ostream & out, const ConvParams& params) { - out << "ConvParams {" - << " stride = " << IntArrayRef{params.stride} - << " padding = " << IntArrayRef{params.padding} - << " dilation = " << IntArrayRef{params.dilation} - << " transposed = " << params.transposed - << " output_padding = " << IntArrayRef{params.output_padding} - << " groups = " << params.groups - << " benchmark = " << params.benchmark - << " deterministic = " << params.deterministic - << " cudnn_enabled = " << params.cudnn_enabled - << " allow_tf32 = " << params.allow_tf32 - << "}"; - return out; -} - -auto ConvParams::is_strided() const -> bool { - bool is_strided = false; - for (auto s : stride) { - is_strided |= (s != 1); - } - return is_strided; -} - -auto ConvParams::is_dilated() const -> bool { - bool is_dilated = false; - for (auto d : dilation) { - is_dilated |= (d != 1); - } - return is_dilated; -} - -auto ConvParams::is_padded() const -> bool { - bool is_padded = false; - for (auto p : padding) { - is_padded |= (p != 0); - } - return is_padded; -} - -auto ConvParams::is_output_padding_neg() const -> bool { - bool is_non_neg = false; - for (auto p : output_padding) { - is_non_neg |= (p < 0); - } - return is_non_neg; -} - -auto ConvParams::is_output_padding_big() const -> bool { - bool is_big = false; - for (auto i: c10::irange(output_padding.size())) { - is_big |= (output_padding[i] >= stride[i]); - } - return is_big; -} - -auto ConvParams::is_padding_neg() const -> bool { - bool is_non_neg = false; - for (auto p : padding) { - is_non_neg |= (p < 0); - } - return is_non_neg; -} - -auto ConvParams::is_stride_nonpos() const -> bool { - bool is_nonpos = false; - for (auto s : stride) { - is_nonpos |= (s <= 0); - } - return is_nonpos; -} - -auto ConvParams::view1d_as_2d() -> void { - if (stride.size() == 1) { - stride.insert(stride.begin(), 1); - padding.insert(padding.begin(), 0); - dilation.insert(dilation.begin(), 1); - output_padding.insert(output_padding.begin(), 0); - } -} - -auto ConvParams::use_cpu_depthwise3x3_winograd( - const at::Tensor& input, - const at::Tensor& weight) const -> bool { -#if defined(__ARM_NEON__) - // Currently only 3x3 depthwise convolutions on tensors of float are supported. 
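[For orientation, not part of the diff: the fast path whose conditions are listed here (guarded by __ARM_NEON__) targets depthwise convolutions with a 3x3 float kernel, i.e. groups equal to the number of input channels so the weight's second dimension is 1, with no stride, dilation or transposition. A minimal Python example of a convolution that satisfies these conditions:]

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=8, out_channels=8, kernel_size=3,
                 padding=1, groups=8, bias=False)      # depthwise 3x3

x = torch.randn(1, 8, 32, 32)                          # contiguous CPU float input
y = conv(x)
print(y.shape)                # torch.Size([1, 8, 32, 32])
print(conv.weight.shape)      # torch.Size([8, 1, 3, 3]): weight.size(1) == 1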
- return (input.ndimension() == 4) && - (input.size(1) == groups) && - (weight.ndimension() == 4 ) && - (weight.size(0) % input.size(1) == 0) && - (weight.size(1) == 1) && - (weight.size(2) == 3) && - (weight.size(3) == 3) && - (input.device().is_cpu()) && - (input.scalar_type() == at::kFloat) && - input.is_contiguous() && - (weight.device().is_cpu()) && - (weight.scalar_type() == at::kFloat) && - weight.is_contiguous() && - !is_strided() && - !is_dilated() && - !transposed; -#else - return false; -#endif -} - -auto ConvParams::needs_64bit_indexing_no_split(const at::Tensor& input, const at::Tensor& weight) const -> bool { - constexpr int64_t int_max = std::numeric_limits::max(); - int64_t numel_input = input.numel(); - // empty input - if (numel_input == 0) { - return false; - } - // input size can not be reduced to the range of int by splitting the batch dim - int64_t n = input.size(0); - if (numel_input / n > int_max) { - return true; - } - // output size can not be reduced to the range of int by splitting the batch dim - int64_t outsize = 1; - if (transposed) { - std::vector o = conv_input_size(input.sizes(), weight.sizes(), padding, output_padding, stride, dilation, groups); - outsize = c10::multiply_integers(o.begin() + 1, o.end()); - } else { - std::vector o = conv_output_size(input.sizes(), weight.sizes(), padding, stride, dilation); - outsize = c10::multiply_integers(o.begin() + 1, o.end()); - } - return outsize > int_max; -} - -auto ConvParams::use_cudnn(const at::Tensor& input, const at::Tensor& weight) const -> bool { - -// Note [Mobile check segfaults] -// cudnn and miopen are guaranteed not to be on mobile, and T102591915 / T110194934 suggest -// that maybe the compiledWithCuDNN() check sometimes segfaults (though I can't imagine how) -#if !defined(C10_MOBILE) - if (needs_64bit_indexing_no_split(input, weight)) { - return false; - } - if (!detail::getCUDAHooks().compiledWithCuDNN()) { - return false; - } - if (!input.is_cuda() || !cudnn_enabled) { - return false; - } - if (input.scalar_type() == at::kBFloat16 || weight.scalar_type() == at::kBFloat16) { - if (!(detail::getCUDAHooks().supportsBFloat16ConvolutionWithCuDNNv8() && at::native::cudnnv8_enabled_check_debug())) { - return false; - } - } - if (cudnn_conv_suggest_memory_format(input, weight) == at::MemoryFormat::Contiguous) { - // bypass dilation checks for channels_last convolution - if (deterministic && is_dilated()) { - // cudnn doesn't support deterministic dilated convolution fully yet - return false; - } - if (is_dilated()) { - return detail::getCUDAHooks().supportsDilatedConvolutionWithCuDNN() && !is_output_padding_big(); - } - } - return !is_output_padding_big(); -#else - return false; -#endif -} - -auto ConvParams::use_mps( const at::Tensor& input, const at::Tensor& weight) const -> bool { - // These checks need to be expanded. Currently we have very limited set of - // checks for MPS. 
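[Context note, not part of the diff: several of the backend heuristics in this region consume user-visible global switches; the ConvParams fields benchmark, deterministic, cudnn_enabled and allow_tf32 correspond to the torch.backends.cudnn flags, and use_mkldnn additionally respects the MKLDNN toggle. The Python-level knobs, for reference:]

import torch

torch.backends.cudnn.enabled = True          # -> cudnn_enabled
torch.backends.cudnn.benchmark = True        # -> benchmark (autotune per input shape)
torch.backends.cudnn.deterministic = False   # -> deterministic
torch.backends.cudnn.allow_tf32 = True       # -> allow_tf32 (TF32 on Ampere and newer)
torch.backends.mkldnn.enabled = True         # gates the use_mkldnn() CPU path

print(torch.backends.cudnn.is_available())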
-#ifdef USE_MPS - if (needs_64bit_indexing_no_split(input, weight)) { - return false; - } - if (!input.is_mps()) { - return false; - } - return true; +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include #else - return false; +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include #endif -} - -auto ConvParams::use_miopen(const at::Tensor& input, const at::Tensor& weight, bool bias_defined) const -> bool { - if (needs_64bit_indexing_no_split(input, weight)) { - return false; - } - return ((input.scalar_type() == at::kFloat) || (input.scalar_type() == at::kHalf) || (input.scalar_type() == at::kBFloat16)) - && detail::getCUDAHooks().compiledWithMIOpen() - && input.is_cuda() - && input.dim() <= MIOPEN_DIM_MAX - && !(groups > 1 && is_dilated()) // MIOpen currently does not support dilation with groups of size > 1 - && !(input.scalar_type() == at::kBFloat16 && bias_defined) // MIOpen currently doesn't support bias with bfloat16 - && cudnn_enabled - ; -} - -auto ConvParams::use_mkldnn(const at::Tensor& input, const at::Tensor& weight) const -> bool { -#if AT_MKLDNN_ENABLED() - if (!at::globalContext().userEnabledMkldnn()) { - return false; - } - if (input.device().is_cpu() && input.scalar_type() == kBFloat16 && mkldnn_bf16_device_check()) { - return true; - } - return (input.is_mkldnn()) || // input is mkldnn Tensor - (input.device().is_cpu() && - input.scalar_type() == kFloat && // only on CPU Float Tensors - !transposed && // or transposed tensors - // For 1x1 filters, MKLDNN is faster than THNN when multi-threaded, - // but THNN is faster when single-threaded. 
- (is_strided() || is_dilated() || input.size(0) >= 16 || - weight.size(-1) != 1 || weight.size(-2) != 1 || at::get_num_threads() > 1) && - (groups > 1 - || (weight.size(-1) > 3 && weight.size(-2) > 3) - || input.size(0) > 1 - || input.size(0)*input.size(1)*input.size(2)*input.size(3) > 20480) // for some case, native is faster - ); -#endif - return false; -} - -auto ConvParams::use_nnpack(const at::Tensor& input, const at::Tensor& weight) const -> bool { -#if AT_NNPACK_ENABLED() - return at::_nnpack_available() && - input.device().is_cpu() && - input.scalar_type() == kFloat && // only on CPU Float Tensors - !is_dilated() && // or dilation - !transposed && // or transposed tensors - input.ndimension() == 4 && // must be in NCHW format - weight.ndimension() == 4 && - (weight.size(2) < 17) && (weight.size(3) < 17) // NNPACK only supports kernels up to 16x16 -#if !defined(C10_MOBILE) - && input.size(0) >= 16 // ensure large enough batch size to ensure perf, tuneable -#endif - ; -#endif - return false; -} - -auto ConvParams::use_xnnpack( - const at::Tensor& input, - const at::Tensor& weight, - const at::OptionalIntArrayRef bias_sizes_opt) const -> bool { -#if defined(C10_MOBILE) - if (!transposed) { - return (input.size(1) == groups) && - xnnpack::use_convolution2d( - input, - weight, - bias_sizes_opt, - padding, - stride, - dilation, - groups, - transposed); - } -#endif - return false; -} +constexpr int MIOPEN_DIM_MAX = 5; -// We currently only have depthwise support for the case where groups == -// nInputPlane and nInputPlane == nOutputPlane (the latter due to the lack of -// a depthwise multiplier) -auto ConvParams::is_depthwise( - const at::Tensor& input, const at::Tensor& weight) const -> bool { - return input.is_cuda() && - !transposed && - (input.ndimension() == 4 || input.ndimension() == 5) && - input.size(1) == groups && - groups > 1 && // no point if there is only a single group - weight.size(0) % input.size(1) == 0; // output channels must be a multiple of input channels -} +namespace at { namespace native { // Check workload to activate fast depthwise FP16 cudnn conv kernels +template bool check_cudnn_depthwise_workload(const at::Tensor& input, int stride) { - int w = input.size(3); // same as h - int ch = input.size(1); - int bs = input.size(0); + auto w = at::symint::size(input, 3); // same as h + auto ch = at::symint::size(input, 1); + auto bs = at::symint::size(input, 0); if (stride==1) { if (w >= 7) { // All batch sizes and nb_channels @@ -446,27 +210,28 @@ bool check_cudnn_depthwise_workload(const at::Tensor& input, int stride) { } // simplified version for cudnn 8.2 and above +template bool check_cudnn_depthwise_workload_with_filter(const at::Tensor& input, int stride, const at::Tensor& weight) { // 1D conv - if(input.size(2) == 1 && stride == 1){ + if(at::symint::size(input, 2) == 1 && stride == 1){ return true; } // 2d conv // only square filters - if (weight.size(2) != weight.size(3)) return false; - int filter = weight.size(3); + if (at::symint::size(weight, 2) != at::symint::size(weight, 3)) return false; + auto filter = at::symint::size(weight, 3); // only 1/3/5 filter if (filter != 1 && filter != 3 && filter != 5) return false; // we don't enforce square input but only check width to reduce heuristic space - if (input.size(3) < 7) return false; // min width 7 - int w = input.size(3); + if (at::symint::size(input, 3) < 7) return false; // min width 7 + auto w = at::symint::size(input, 3); // only 1/2 stride, use cudnn for all stride 1 if (stride == 1) return true; if 
(stride != 2) return false; - int ch = input.size(1); - int bs = input.size(0); + auto ch = at::symint::size(input, 1); + auto bs = at::symint::size(input, 0); // special case since bs1 show good perf in lots of cases if (bs == 1) { if (filter == 1 && w <= 28) return true; @@ -480,54 +245,390 @@ bool check_cudnn_depthwise_workload_with_filter(const at::Tensor& input, int str return false; } -// Use cudnn for FP16 depthwise convolutions -auto ConvParams::use_cudnn_depthwise( - const at::Tensor& input, const at::Tensor& weight) const -> bool { - if (cudnn_conv_suggest_memory_format(input, weight) != at::MemoryFormat::Contiguous && use_cudnn(input, weight)) { - // always use cudnn_depthwise for channels_last format - return true; + +bool xnnpack_use_convolution2d( + const Tensor& input, + const Tensor& weight, + const at::OptionalIntArrayRef bias_sizes_opt, + const IntArrayRef padding, + const IntArrayRef stride, + const IntArrayRef dilation, + const int64_t groups, + const bool transposed) { + return xnnpack::use_convolution2d(input, weight, bias_sizes_opt, padding, stride, dilation, groups, transposed); +} + +bool xnnpack_use_convolution2d( + const Tensor& input, + const Tensor& weight, + const at::OptionalSymIntArrayRef bias_sizes_opt, + const SymIntArrayRef padding, + const IntArrayRef stride, + const IntArrayRef dilation, + const int64_t groups, + const bool transposed) { + // Never use xnnpack for symbolic tracing + return false; +} + +// NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init) +// This struct is templated so that we can run backend selection in a dynamic +// shapes context; all of the real kernel selection in eager mode runs with +// int64_t +template +struct ConvParams { + std::vector stride; + std::vector padding; + std::vector dilation; + bool transposed; + std::vector output_padding; + int groups; + bool benchmark; + bool deterministic; + bool cudnn_enabled; + bool allow_tf32; + + bool is_strided() const { + bool is_strided = false; + for (auto s : stride) { + is_strided |= (s != 1); + } + return is_strided; } - if (detail::getCUDAHooks().supportsDepthwiseConvolutionWithCuDNN()) { - long cudnn_version = detail::getCUDAHooks().versionCuDNN(); - if (cudnn_version >= 8200) { - bool kernel_cond = (use_cudnn(input, weight) && + + bool is_dilated() const { + bool is_dilated = false; + for (auto d : dilation) { + is_dilated |= (d != 1); + } + return is_dilated; + } + + bool is_padded() const { + bool is_padded = false; + for (auto p : padding) { + is_padded |= (p != 0); + } + return is_padded; + } + + bool is_output_padding_neg() const { + bool is_non_neg = false; + for (auto p : output_padding) { + is_non_neg |= (p < 0); + } + return is_non_neg; + } + + bool is_output_padding_big() const { + bool is_big = false; + for (auto i: c10::irange(output_padding.size())) { + is_big |= (output_padding[i] >= stride[i]); + } + return is_big; + } + + bool is_padding_neg() const { + bool is_non_neg = false; + for (auto p : padding) { + is_non_neg |= (p < 0); + } + return is_non_neg; + } + + bool is_stride_nonpos() const { + bool is_nonpos = false; + for (auto s : stride) { + is_nonpos |= (s <= 0); + } + return is_nonpos; + } + + void view1d_as_2d() { + if (stride.size() == 1) { + stride.insert(stride.begin(), 1); + padding.insert(padding.begin(), 0); + dilation.insert(dilation.begin(), 1); + output_padding.insert(output_padding.begin(), 0); + } + } + + bool use_cpu_depthwise3x3_winograd(const at::Tensor& input, const at::Tensor& weight, const c10::optional& bias) const { +#if 
defined(__ARM_NEON__) + // Currently only 3x3 depthwise convolutions on tensors of float are supported. + return (input.ndimension() == 4) && + (at::symint::size(input, 1) == groups) && + (weight.ndimension() == 4 ) && + (at::symint::size(weight, 0) % at::symint::size(input, 1) == 0) && + (at::symint::size(weight, 1) == 1) && + (at::symint::size(weight, 2) == 3) && + (at::symint::size(weight, 3) == 3) && + (input.device().is_cpu()) && + (input.scalar_type() == at::kFloat) && + input.is_contiguous() && + (weight.device().is_cpu()) && + (weight.scalar_type() == at::kFloat) && + weight.is_contiguous() && + (!bias.has_value() || bias->is_contiguous()) && + !is_strided() && + !is_dilated() && + !transposed; +#else + return false; +#endif + } + + bool needs_64bit_indexing_no_split(const at::Tensor& input, const at::Tensor& weight) const { + constexpr int64_t int_max = std::numeric_limits::max(); + auto numel_input = at::symint::numel(input); + // empty input + if (numel_input == 0) { + return false; + } + // input size can not be reduced to the range of int by splitting the batch dim + auto n = at::symint::size(input, 0); + if (numel_input / n > int_max) { + return true; + } + // output size can not be reduced to the range of int by splitting the batch dim + T outsize = 1; + if (transposed) { + auto o = conv_input_size(at::symint::sizes(input), at::symint::sizes(weight), padding, output_padding, stride, dilation, groups); + outsize = c10::multiply_integers(o.begin() + 1, o.end()); + } else { + auto o = conv_output_size(at::symint::sizes(input), at::symint::sizes(weight), padding, stride, dilation); + outsize = c10::multiply_integers(o.begin() + 1, o.end()); + } + return outsize > int_max; + } + + bool use_cudnn(const at::Tensor& input, const at::Tensor& weight) const { + // Note [Mobile check segfaults] + // cudnn and miopen are guaranteed not to be on mobile, and T102591915 / T110194934 suggest + // that maybe the compiledWithCuDNN() check sometimes segfaults (though I can't imagine how) +#if !defined(C10_MOBILE) + if (needs_64bit_indexing_no_split(input, weight)) { + return false; + } + if (!detail::getCUDAHooks().compiledWithCuDNN()) { + return false; + } + if (!input.is_cuda() || !cudnn_enabled) { + return false; + } + if (input.scalar_type() == at::kBFloat16 || weight.scalar_type() == at::kBFloat16) { + if (!(detail::getCUDAHooks().supportsBFloat16ConvolutionWithCuDNNv8() && at::native::cudnnv8_enabled_check_debug())) { + return false; + } + } + if (cudnn_conv_suggest_memory_format(input, weight) == at::MemoryFormat::Contiguous) { + // bypass dilation checks for channels_last convolution + if (deterministic && is_dilated()) { + // cudnn doesn't support deterministic dilated convolution fully yet + return false; + } + if (is_dilated()) { + return detail::getCUDAHooks().supportsDilatedConvolutionWithCuDNN() && !is_output_padding_big(); + } + } + return !is_output_padding_big(); +#else + return false; +#endif + } + + // Use cudnn for FP16 depthwise convolutions + bool use_cudnn_depthwise(const at::Tensor& input, const at::Tensor& weight) const { + if (cudnn_conv_suggest_memory_format(input, weight) != at::MemoryFormat::Contiguous && use_cudnn(input, weight)) { + // always use cudnn_depthwise for channels_last format + return true; + } + if (detail::getCUDAHooks().supportsDepthwiseConvolutionWithCuDNN()) { + long cudnn_version = detail::getCUDAHooks().versionCuDNN(); + if (cudnn_version >= 8200) { + bool kernel_cond = (use_cudnn(input, weight) && + input.scalar_type() == kHalf && // only for 
FP16 + weight.scalar_type() == kHalf && + is_depthwise(input, weight) && + input.ndimension() == 4 && // TODO: 5-D contiguous depthwise is not supported yet, need benchmarks + !is_dilated() && // no dilation supported + (stride[0] == stride[1] || at::symint::size(input, 2) == 1) && // square or 1d + at::symint::size(input, 1) >= 32); // min 32 channels supported) + if (kernel_cond) { + return check_cudnn_depthwise_workload_with_filter(input, stride[1], weight); + } + } + // keep (7600 <= cudnn < 8200) code unchanged + bool kernel_cond = (cudnn_version >= 7600 && + use_cudnn(input, weight) && input.scalar_type() == kHalf && // only for FP16 weight.scalar_type() == kHalf && is_depthwise(input, weight) && input.ndimension() == 4 && // TODO: 5-D contiguous depthwise is not supported yet, need benchmarks + at::symint::size(weight, 2) == at::symint::size(weight, 3) && // only square kernels + at::symint::size(input, 2) >= 7 && // min width/height 7 !is_dilated() && // no dilation supported - (stride[0] == stride[1] || input.size(2) == 1) && // square or 1d - input.size(1) >= 32); // min 32 channels supported) + stride[0] == stride[1] && // equal strides + ((at::symint::size(weight, 3) == 3) || (at::symint::size(weight, 3) == 1)) && + at::symint::size(input, 1) >= 32); // min 32 channels supported) if (kernel_cond) { - return check_cudnn_depthwise_workload_with_filter(input, stride[1], weight); + return check_cudnn_depthwise_workload(input, stride[0]); + } else { + return false; } - } - // keep (7600 <= cudnn < 8200) code unchanged - bool kernel_cond = (cudnn_version >= 7600 && - use_cudnn(input, weight) && - input.scalar_type() == kHalf && // only for FP16 - weight.scalar_type() == kHalf && - is_depthwise(input, weight) && - input.ndimension() == 4 && // TODO: 5-D contiguous depthwise is not supported yet, need benchmarks - weight.size(2) == weight.size(3) && // only square kernels - input.size(2) >= 7 && // min width/height 7 - !is_dilated() && // no dilation supported - stride[0] == stride[1] && // equal strides - ((weight.size(3) == 3) || (weight.size(3) == 1)) && - input.size(1) >= 32); // min 32 channels supported) - if (kernel_cond) { - return check_cudnn_depthwise_workload(input, stride[0]); } else { return false; } - } else { + } + + bool use_miopen(const at::Tensor& input, const at::Tensor& weight, bool bias_defined) const { + if (needs_64bit_indexing_no_split(input, weight)) { + return false; + } + return ((input.scalar_type() == at::kFloat) || (input.scalar_type() == at::kHalf) || (input.scalar_type() == at::kBFloat16)) + && detail::getCUDAHooks().compiledWithMIOpen() + && input.is_cuda() + && input.dim() <= MIOPEN_DIM_MAX + && !(groups > 1 && is_dilated()) // MIOpen currently does not support dilation with groups of size > 1 + && !(input.scalar_type() == at::kBFloat16 && bias_defined) // MIOpen currently doesn't support bias with bfloat16 + && cudnn_enabled + ; + } + bool use_mkldnn(const at::Tensor& input, const at::Tensor& weight) const { +#if AT_MKLDNN_ENABLED() + if (!at::globalContext().userEnabledMkldnn()) { + return false; + } + if (input.device().is_cpu() && input.scalar_type() == kBFloat16 && mkldnn_bf16_device_check()) { + return true; + } + return (input.is_mkldnn()) || // input is mkldnn Tensor + (input.device().is_cpu() && + input.scalar_type() == kFloat && // only on CPU Float Tensors + !transposed && // or transposed tensors + // For 1x1 filters, MKLDNN is faster than THNN when multi-threaded, + // but THNN is faster when single-threaded. 
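
// Aside: the needs_64bit_indexing_no_split() helper above asks whether the convolution's
// input or output could overflow 32-bit indexing even after splitting off the batch
// dimension. A rough standalone sketch of that arithmetic follows; it is illustrative only
// (plain C++ with made-up names, non-transposed case only; the transposed case would use the
// inverse size formula), not the actual ATen conv_output_size / multiply_integers helpers.
#include <cstdint>
#include <limits>
#include <vector>

// Standard convolution output-size formula for one spatial dimension.
int64_t conv_out_dim(int64_t in, int64_t kernel, int64_t pad, int64_t stride, int64_t dilation) {
  return (in + 2 * pad - dilation * (kernel - 1) - 1) / stride + 1;
}

// True if either the input or the output, with the batch dimension stripped, has more
// elements than fit in a signed 32-bit int (so 64-bit indexing would be required).
bool needs_64bit_indexing_sketch(const std::vector<int64_t>& input_sizes,   // N, C, d1, d2, ...
                                 const std::vector<int64_t>& weight_sizes,  // OC, C/g, k1, k2, ...
                                 const std::vector<int64_t>& padding,
                                 const std::vector<int64_t>& stride,
                                 const std::vector<int64_t>& dilation) {
  constexpr int64_t int_max = std::numeric_limits<int32_t>::max();
  int64_t in_per_sample = 1;
  for (size_t i = 1; i < input_sizes.size(); ++i) in_per_sample *= input_sizes[i];
  if (in_per_sample > int_max) return true;
  int64_t out_per_sample = weight_sizes[0];  // output channels
  for (size_t i = 2; i < input_sizes.size(); ++i) {
    out_per_sample *= conv_out_dim(input_sizes[i], weight_sizes[i],
                                   padding[i - 2], stride[i - 2], dilation[i - 2]);
  }
  return out_per_sample > int_max;
}
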
+ (is_strided() || is_dilated() || at::symint::size(input, 0) >= 16 || + at::symint::size(weight, -1) != 1 || at::symint::size(weight, -2) != 1 || at::get_num_threads() > 1) && + (groups > 1 + || (at::symint::size(weight, -1) > 3 && at::symint::size(weight, -2) > 3) + || at::symint::size(input, 0) > 1 + || at::symint::size(input, 0)*at::symint::size(input, 1)*at::symint::size(input, 2)*at::symint::size(input, 3) > 20480) // for some case, native is faster + ); + +#endif + return false; + } + bool use_nnpack(const at::Tensor& input, const at::Tensor& weight) const { +#if AT_NNPACK_ENABLED() + return at::_nnpack_available() && + input.device().is_cpu() && + input.scalar_type() == kFloat && // only on CPU Float Tensors + !is_dilated() && // or dilation + !transposed && // or transposed tensors + input.ndimension() == 4 && // must be in NCHW format + weight.ndimension() == 4 && + (at::symint::size(weight, 2) < 17) && (at::symint::size(weight, 3) < 17) // NNPACK only supports kernels up to 16x16 +#if !defined(C10_MOBILE) + && at::symint::size(input, 0) >= 16 // ensure large enough batch size to ensure perf, tuneable +#endif + ; +#endif + return false; + } + bool use_xnnpack(const at::Tensor& input, const at::Tensor& weight, + const at::OptionalArrayRef bias_sizes_opt) const { +#if defined(C10_MOBILE) + if (!transposed) { + // NB: for the call here, it MATTERS that we are templated. If you + // untemplate this to always use SymInt, the function + // xnnpack_use_convolution2d will always return false + return (at::symint::size(input, 1) == groups) && + xnnpack_use_convolution2d( + input, + weight, + bias_sizes_opt, + padding, + stride, + dilation, + groups, + transposed); + } +#endif + return false; + } + + bool use_mps(const at::Tensor& input, const at::Tensor& weight) const { + // These checks need to be expanded. Currently we have very limited set of + // checks for MPS. 
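
// Aside: the use_xnnpack() member above relies on ordinary C++ overload resolution. When
// ConvParams is instantiated with int64_t, the concrete-shape overload of
// xnnpack_use_convolution2d is chosen; when it is instantiated with SymInt, the SymInt
// overload is chosen and unconditionally returns false, so XNNPACK is never selected during
// symbolic tracing. A minimal sketch of that pattern with purely illustrative types
// (FakeSymInt and Params are stand-ins, not the real ATen classes):
#include <iostream>
#include <vector>

struct FakeSymInt {};  // stand-in for a symbolic integer type

bool backend_available(const std::vector<long>& /*padding*/) {
  return true;   // concrete shapes: ask the real backend
}
bool backend_available(const std::vector<FakeSymInt>& /*padding*/) {
  return false;  // symbolic shapes: opt out at compile time via overloading
}

template <typename T>
struct Params {
  std::vector<T> padding;
  bool use_backend() const { return backend_available(padding); }
};

int main() {
  std::cout << Params<long>{{0, 0}}.use_backend() << "\n";         // prints 1
  std::cout << Params<FakeSymInt>{{{}, {}}}.use_backend() << "\n"; // prints 0
}
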
+#ifdef USE_MPS + if (needs_64bit_indexing_no_split(input, weight)) { + return false; + } + if (!input.is_mps()) { + return false; + } + return true; +#else return false; +#endif } + + // We currently only have depthwise support for the case where groups == + // nInputPlane and nInputPlane == nOutputPlane (the latter due to the lack of + // a depthwise multiplier) + bool is_depthwise(const at::Tensor& input, const at::Tensor& weight) const { + return input.is_cuda() && + !transposed && + (input.ndimension() == 4 || input.ndimension() == 5) && + at::symint::size(input, 1) == groups && + groups > 1 && // no point if there is only a single group + at::symint::size(weight, 0) % at::symint::size(input, 1) == 0; // output channels must be a multiple of input channels + } +}; + +DEFINE_DISPATCH(conv_depthwise2d_backward_stub); +DEFINE_DISPATCH(conv_depthwise3d_backward_stub); +DEFINE_DISPATCH(cudnn_convolution_backward_stub); +DEFINE_DISPATCH(cudnn_convolution_transpose_backward_stub); +DEFINE_DISPATCH(slow_conv_transpose3d_backward_stub); +DEFINE_DISPATCH(convolution_depthwise3x3_winograd_stub); +DEFINE_DISPATCH(miopen_convolution_backward_stub); +DEFINE_DISPATCH(miopen_convolution_transpose_backward_stub); +DEFINE_DISPATCH(miopen_depthwise_convolution_backward_stub); +DEFINE_DISPATCH(mkldnn_convolution_backward_stub); +DEFINE_DISPATCH(slow_conv_dilated2d_backward_stub); +DEFINE_DISPATCH(slow_conv_dilated3d_backward_stub); +DEFINE_DISPATCH(slow_conv_transpose2d_backward_stub); +REGISTER_NO_CPU_DISPATCH(conv_depthwise2d_backward_stub); +REGISTER_NO_CPU_DISPATCH(conv_depthwise3d_backward_stub); +REGISTER_NO_CPU_DISPATCH(cudnn_convolution_backward_stub); +REGISTER_NO_CPU_DISPATCH(cudnn_convolution_transpose_backward_stub); +REGISTER_NO_CPU_DISPATCH(miopen_convolution_backward_stub); +REGISTER_NO_CPU_DISPATCH(miopen_convolution_transpose_backward_stub); +REGISTER_NO_CPU_DISPATCH(miopen_depthwise_convolution_backward_stub); + +template +std::ostream& operator<<(std::ostream & out, const ConvParams& params) { + out << "ConvParams {" + << " stride = " << IntArrayRef{params.stride} + << " padding = " << ArrayRef{params.padding} + << " dilation = " << IntArrayRef{params.dilation} + << " transposed = " << params.transposed + << " output_padding = " << ArrayRef{params.output_padding} + << " groups = " << params.groups + << " benchmark = " << params.benchmark + << " deterministic = " << params.deterministic + << " cudnn_enabled = " << params.cudnn_enabled + << " allow_tf32 = " << params.allow_tf32 + << "}"; + return out; } +template static void check_shape_forward(const at::Tensor& input, - const c10::IntArrayRef& weight_sizes, const at::Tensor& bias, - const ConvParams& params) { + const c10::ArrayRef& weight_sizes, const at::Tensor& bias, + const ConvParams& params) { int64_t k = input.ndimension(); int64_t weight_dim = weight_sizes.size(); int64_t groups = params.groups; @@ -542,7 +643,7 @@ static void check_shape_forward(const at::Tensor& input, TORCH_CHECK(weight_dim == k, "Expected ", weight_dim, "-dimensional input for ", weight_dim, "-dimensional weight ", weight_sizes, ", but got ", k, "-dimensional input of size ", - input.sizes(), " instead"); + at::symint::sizes(input), " instead"); TORCH_CHECK(weight_sizes[0] >= groups, "Given groups=", groups, ", expected weight to be at least ", groups, " at dimension 0, but got weight of size ", weight_sizes, " instead"); @@ -552,23 +653,23 @@ static void check_shape_forward(const at::Tensor& input, "] instead"); if (!transposed) { - std::vector input_shape; 
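
// Aside: is_depthwise() above recognizes the case groups == input channels, with the number
// of output channels a multiple of the input channels (the "depth multiplier"). For
// reference, a naive CPU sketch of that computation (stride 1, no padding, NCHW layout,
// float) is below; it is purely illustrative and not the kernel ATen actually dispatches to.
#include <cstddef>
#include <iostream>
#include <vector>

// input: [C, H, W] flattened, weight: [C*M, 1, K, K] flattened, output: [C*M, H-K+1, W-K+1].
std::vector<float> depthwise_conv2d_naive(const std::vector<float>& input,
                                          const std::vector<float>& weight,
                                          int C, int H, int W, int M, int K) {
  const int OH = H - K + 1, OW = W - K + 1;
  std::vector<float> out(static_cast<size_t>(C) * M * OH * OW, 0.f);
  for (int c = 0; c < C; ++c)            // each input channel...
    for (int m = 0; m < M; ++m)          // ...feeds only its own M output channels
      for (int oh = 0; oh < OH; ++oh)
        for (int ow = 0; ow < OW; ++ow) {
          float acc = 0.f;
          for (int kh = 0; kh < K; ++kh)
            for (int kw = 0; kw < K; ++kw)
              acc += input[(c * H + oh + kh) * W + ow + kw] *
                     weight[((c * M + m) * K + kh) * K + kw];
          out[((c * M + m) * OH + oh) * OW + ow] = acc;
        }
  return out;
}

int main() {
  std::vector<float> in(1 * 5 * 5, 1.f), w(1 * 1 * 3 * 3, 1.f);
  auto out = depthwise_conv2d_naive(in, w, /*C=*/1, /*H=*/5, /*W=*/5, /*M=*/1, /*K=*/3);
  std::cout << out[0] << "\n";  // each output sums a 3x3 window of ones -> 9
}
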
- std::vector kernel_shape; + std::vector input_shape; + std::vector kernel_shape; bool kernel_size_correct = true; - TORCH_CHECK(input.size(1) == (weight_sizes[1] * groups), + TORCH_CHECK(at::symint::size(input, 1) == (weight_sizes[1] * groups), "Given groups=", groups, ", weight of size ", weight_sizes, - ", expected input", input.sizes(), " to have ", - (weight_sizes[1] * groups), " channels, but got ", input.size(1), + ", expected input", at::symint::sizes(input), " to have ", + (weight_sizes[1] * groups), " channels, but got ", at::symint::size(input, 1), " channels instead"); - TORCH_CHECK(!bias.defined() || (bias.ndimension() == 1 && bias.size(0) == weight_sizes[0]), + TORCH_CHECK(!bias.defined() || (bias.ndimension() == 1 && at::symint::size(bias, 0) == weight_sizes[0]), "Given weight of size ", weight_sizes, ", expected bias to be 1-dimensional with ", weight_sizes[0], " elements", - ", but got bias of size ", bias.sizes(), " instead"); + ", but got bias of size ", at::symint::sizes(bias), " instead"); for (const auto i : c10::irange(2, k)) { - input_shape.push_back(input.size(i) + 2 * padding[i-2]); + input_shape.push_back(at::symint::size(input, i) + 2 * padding[i-2]); // log new kernel size considering dilation kernel_shape.push_back(dilation[i-2] * (weight_sizes[i]-1) + 1); if (input_shape.back() < kernel_shape.back()) { @@ -594,22 +695,23 @@ static void check_shape_forward(const at::Tensor& input, "Kernel size: (", kernel_ss.str(), "). Kernel size can't be greater than actual input size"); } } else { // transposed - TORCH_CHECK(input.size(1) == weight_sizes[0], + TORCH_CHECK(at::symint::size(input, 1) == weight_sizes[0], "Given transposed=", transposed, ", weight of size ", weight_sizes, - ", expected input", input.sizes(), " to have ", weight_sizes[0], - " channels, but got ", input.size(1), " channels instead"); - TORCH_CHECK(!bias.defined() || (bias.ndimension() == 1 && bias.size(0) == weight_sizes[1] * groups), + ", expected input", at::symint::sizes(input), " to have ", weight_sizes[0], + " channels, but got ", at::symint::size(input, 1), " channels instead"); + TORCH_CHECK(!bias.defined() || (bias.ndimension() == 1 && at::symint::size(bias, 0) == weight_sizes[1] * groups), "Given transposed=", transposed, ", weight of size ", weight_sizes, ", expected bias to be 1-dimensional with ", weight_sizes[1] * groups, " elements", - ", but got bias of size ", bias.sizes(), " instead"); + ", but got bias of size ", at::symint::sizes(bias), " instead"); } } +template static void check_shape_backward( const at::Tensor& input, - const c10::IntArrayRef& weight_sizes, - const ConvParams& params) { - check_shape_forward(input, weight_sizes, /*bias=*/ Tensor(), params); + const c10::ArrayRef& weight_sizes, + const ConvParams& params) { + check_shape_forward(input, weight_sizes, /*bias=*/ Tensor(), params); } // Given an input tensor and an expected number of spatial dimensions, checks that the @@ -713,6 +815,7 @@ at::Tensor complex_convolution( IntArrayRef stride, IntArrayRef padding, IntArrayRef dilation, + bool transposed, IntArrayRef output_padding, int64_t groups) { check_input_same_type_as_parameters(input, weight, bias); @@ -730,15 +833,15 @@ at::Tensor complex_convolution( // conv(W, x, b) = a - b + i(c - a - b) Tensor a, b, c; if (!bias.defined()) { - a = at::convolution(i_r, w_r, bias, stride, padding, dilation, false, output_padding, groups); - b = at::convolution(i_i, w_i, bias, stride, padding, dilation, false, output_padding, groups); - c = at::convolution(i_r + i_i, w_r + 
w_i, bias, stride, padding, dilation, false, output_padding, groups); + a = at::convolution(i_r, w_r, bias, stride, padding, dilation, transposed, output_padding, groups); + b = at::convolution(i_i, w_i, bias, stride, padding, dilation, transposed, output_padding, groups); + c = at::convolution(i_r + i_i, w_r + w_i, bias, stride, padding, dilation, transposed, output_padding, groups); } else { Tensor b_r, b_i; std::tie(b_r, b_i) = complex_to_real(bias.resolve_conj()); - a = at::convolution(i_r, w_r, b_r, stride, padding, dilation, false, output_padding, groups); - b = at::convolution(i_i, w_i, Tensor(), stride, padding, dilation, false, output_padding, groups); - c = at::convolution(i_r + i_i, w_r + w_i, b_r + b_i, stride, padding, dilation, false, output_padding, groups); + a = at::convolution(i_r, w_r, b_r, stride, padding, dilation, transposed, output_padding, groups); + b = at::convolution(i_i, w_i, Tensor(), stride, padding, dilation, transposed, output_padding, groups); + c = at::convolution(i_r + i_i, w_r + w_i, b_r + b_i, stride, padding, dilation, transposed, output_padding, groups); } auto i = c10::Scalar(c10::complex(0, 1)); @@ -791,7 +894,7 @@ at::Tensor conv1d( std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 1, "conv1d"); Tensor output; if (at::isComplexType(input_.scalar_type())) { - output = complex_convolution(input, weight, bias, stride, padding, dilation, {0}, groups); + output = complex_convolution(input, weight, bias, stride, padding, dilation, false, {0}, groups); } else { output = at::convolution(input, weight, bias, stride, padding, dilation, false, {0}, groups); } @@ -805,12 +908,20 @@ at::Tensor conv2d( c10::MaybeOwned bias_maybe_owned = at::borrow_from_optional_tensor(bias_opt); const Tensor& bias = *bias_maybe_owned; + TORCH_CHECK( + !bias.defined() || bias.dtype() == input_.dtype(), + "Input type (", + input_.dtype().name(), + ") and bias type (", + bias.dtype().name(), + ") should be the same"); + Tensor input; bool is_batched; std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 2, "conv2d"); Tensor output; if (at::isComplexType(input_.scalar_type())) { - output = complex_convolution(input, weight, bias, stride, padding, dilation, {{0, 0}}, groups); + output = complex_convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0}}, groups); } else { output = at::convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0}}, groups); } @@ -829,7 +940,7 @@ at::Tensor conv3d( std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 3, "conv3d"); Tensor output; if (at::isComplexType(input_.scalar_type())) { - output = complex_convolution(input, weight, bias, stride, padding, dilation, {{0, 0, 0}}, groups); + output = complex_convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0, 0}}, groups); } else { output = at::convolution(input, weight, bias, stride, padding, dilation, false, {{0, 0, 0}}, groups); } @@ -844,8 +955,8 @@ static Tensor convolution_same( auto k = weight.dim(); TORCH_CHECK(k > 2, "weight should have at least three dimensions"); auto dim = static_cast(k - 2); - auto weight_sizes = weight.sizes(); - auto input_sizes = input.sizes(); + auto weight_sizes = weight.sym_sizes(); + auto input_sizes = input.sym_sizes(); TORCH_CHECK(k == input.dim(), "Expected ", k, "-dimensional input for ", k, "-dimensional weight", weight_sizes, ", but got ", @@ -860,7 +971,7 @@ static Tensor convolution_same( } // Calculate the correct padding - DimVector padding_l, padding_r; 
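
// Aside: complex_convolution() above uses the three-multiplication (Gauss/Karatsuba) trick
// noted in its comment, conv(W, x) = (a - b) + i(c - a - b) with a = conv(x_r, W_r),
// b = conv(x_i, W_i), c = conv(x_r + x_i, W_r + W_i). Because convolution is bilinear, the
// identity can be checked on scalars; the small program below does exactly that and is only
// an illustration of the algebra, not of the ATen code path.
#include <complex>
#include <iostream>

int main() {
  double xr = 1.5, xi = -2.0, wr = 0.25, wi = 3.0;
  double a = xr * wr;               // plays the role of conv(x_real, W_real)
  double b = xi * wi;               // plays the role of conv(x_imag, W_imag)
  double c = (xr + xi) * (wr + wi); // plays the role of conv(x_real + x_imag, W_real + W_imag)
  std::complex<double> via_trick(a - b, c - a - b);
  std::complex<double> direct = std::complex<double>(xr, xi) * std::complex<double>(wr, wi);
  std::cout << via_trick << " == " << direct << "\n";  // the two agree
}
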
+ SymDimVector padding_l, padding_r; bool symmetric_padding = true; for (auto i: c10::irange(dim)) { auto s = stride.size() == 1 ? stride[0] : stride[i]; @@ -876,14 +987,14 @@ static Tensor convolution_same( if (symmetric_padding) { // All backends handle symmetric padding natively - DimVector output_padding(static_cast(dim)); - return at::convolution(input, weight, bias, stride, padding_l, dilation, + SymDimVector output_padding(static_cast(dim)); + return at::convolution_symint(input, weight, bias, stride, padding_l, dilation, false, output_padding, groups); } TORCH_WARN_ONCE("Using padding='same' with even kernel lengths and odd dilation may" " require a zero-padded copy of the input be created"); - SmallVector pad_nd(static_cast(2 * dim)); + SmallVector pad_nd(static_cast(2 * dim)); for (auto i: c10::irange(dim)) { // Apply padding by the difference, leaving only a symmetric padding auto delta_pad = padding_r[i] - padding_l[i]; @@ -895,10 +1006,10 @@ static Tensor convolution_same( padding_l[i] = padding_r[i]; } } - auto padded_input = at::constant_pad_nd(input, pad_nd, 0); - DimVector output_padding(static_cast(dim)); - return at::convolution(padded_input, weight, bias, stride, padding_l, - dilation, false, output_padding, groups); + auto padded_input = at::constant_pad_nd_symint(input, pad_nd, 0); + SymDimVector output_padding(static_cast(dim)); + return at::convolution_symint(padded_input, weight, bias, stride, padding_l, + dilation, false, output_padding, groups); } Tensor _convolution_mode( @@ -979,8 +1090,14 @@ at::Tensor conv_transpose1d( Tensor input; bool is_batched; std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 1, "conv_transpose1d"); - auto output = at::convolution( + Tensor output; + if (at::isComplexType(input_.scalar_type())) { + output = complex_convolution( + input, weight, bias, stride, padding, dilation, true, output_padding, groups); + } else { + output = at::convolution( input, weight, bias, stride, padding, dilation, true, output_padding, groups); + } return is_batched ? output : output.squeeze(0); } @@ -994,8 +1111,14 @@ at::Tensor conv_transpose2d( Tensor input; bool is_batched; std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 2, "conv_transpose2d"); - auto output = at::convolution( + Tensor output; + if (at::isComplexType(input_.scalar_type())) { + output = complex_convolution( input, weight, bias, stride, padding, dilation, true, output_padding, groups); + } else { + output = at::convolution( + input, weight, bias, stride, padding, dilation, true, output_padding, groups); + } return is_batched ? output : output.squeeze(0); } @@ -1009,8 +1132,14 @@ at::Tensor conv_transpose3d( Tensor input; bool is_batched; std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 3, "conv_transpose3d"); - auto output = at::convolution( + Tensor output; + if (at::isComplexType(input_.scalar_type())) { + output = complex_convolution( + input, weight, bias, stride, padding, dilation, true, output_padding, groups); + } else { + output = at::convolution( input, weight, bias, stride, padding, dilation, true, output_padding, groups); + } return is_batched ? output : output.squeeze(0); } @@ -1040,61 +1169,25 @@ at::Tensor convolution_overrideable( TORCH_CHECK_NOT_IMPLEMENTED(false, "convolution_overrideable not implemented. 
You are likely triggering this with tensor backend other than CPU/CUDA/MKLDNN, if this is intended, please use TORCH_LIBRARY_IMPL to override this function "); } -// Selects a backend for convolution based on the inputs and params. -ConvBackend select_conv_backend( - const Tensor& input_r, const Tensor& weight_r, const c10::optional& bias_opt, - IntArrayRef stride_, IntArrayRef padding_, IntArrayRef dilation_, - bool transposed_, IntArrayRef output_padding_, int64_t groups_) { - c10::MaybeOwned bias_maybe_owned = at::borrow_from_optional_tensor(bias_opt); - const Tensor& bias = *bias_maybe_owned; - - auto& ctx = at::globalContext(); - auto k = weight_r.ndimension(); - int64_t dim = k - 2; - ConvParams params; - params.stride = expand_param_if_needed(stride_, "stride", dim); - params.padding = expand_param_if_needed(padding_, "padding", dim); - params.dilation = expand_param_if_needed(dilation_, "dilation", dim); - params.transposed = transposed_; - params.output_padding = expand_param_if_needed(output_padding_, "output_padding", dim); - params.groups = groups_; - params.benchmark = ctx.benchmarkCuDNN(); - params.deterministic = ctx.deterministicCuDNN() || ctx.deterministicAlgorithms(); - params.cudnn_enabled = ctx.userEnabledCuDNN(); - params.allow_tf32 = ctx.allowTF32CuDNN(); - - auto input = input_r; - auto weight = weight_r; - check_shape_forward(input, weight.sizes(), bias, params); - - // Expand 1d -> 2d. - // This is only done for backends that don't natively support 1d spatial input. - if (k == 3 && !input.is_mkldnn() && !input.is_xpu()) { - // avoid accidentally going through NHWC for permuted 3d input. - input = input.contiguous(); - params.view1d_as_2d(); - input = view4d(input); - weight = view4d(weight); - } - - auto bias_sizes_opt = bias.defined() ? c10::optional(bias.sizes()) : c10::nullopt; - bool need_backward = GradMode::is_enabled() && - (input.requires_grad() || weight.requires_grad() || (bias.defined() && bias.requires_grad())); - return select_conv_backend(input, weight, bias_sizes_opt, need_backward, params); -} - -ConvBackend select_conv_backend( +// Function to select the convolution backend based on the inputs and params. +// This overload is used within the convolution internals but not exposed to python. +// NB: The forward pass provides a bias tensor while the backward pass provides +// a bool indicating whether the bias is defined. This is done to save memory by +// avoiding saving the full bias tensor for backward. +template +ConvBackend _select_conv_backend( const Tensor& input, const Tensor& weight, - const at::OptionalIntArrayRef bias_sizes_opt, + const c10::optional& bias, + const at::OptionalArrayRef bias_sizes_opt, const bool need_backward, - const ConvParams& params) { + const ConvParams& params) { // don't send empty inputs through backends - if (input.size(0) == 0 || input.size(1) == 0) { + if (at::symint::size(input, 0) == 0 || at::symint::size(input, 1) == 0) { return input.is_mkldnn() ? ConvBackend::MkldnnEmpty : ConvBackend::Empty; - } else if (input.numel() == 0) { - TORCH_CHECK(false, "Only zero batch or zero channel inputs are supported, but got input shape: ", input.sizes()); + } else if (at::symint::numel(input) == 0) { + TORCH_CHECK(false, "Only zero batch or zero channel inputs are supported, but got input shape: ", at::symint::sizes(input)); } if (params.is_depthwise(input, weight)) { @@ -1130,7 +1223,7 @@ ConvBackend select_conv_backend( // option for NHWC. 
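
// Aside: the convolution_same() changes above move the padding computation to SymInt. As a
// reminder of the arithmetic involved (the exact helper lines are in unchanged context not
// shown in this hunk): 'same' padding for a unit-stride convolution distributes a total of
// dilation * (kernel - 1) zeros across the two sides of each spatial dimension; when that
// total is odd the split is asymmetric, and the input is first zero-padded by the difference
// so the backend only sees symmetric padding. A small sketch with illustrative names:
#include <cstdint>
#include <utility>

// Returns {padding_left, padding_right} for one spatial dimension, stride assumed to be 1
// (PyTorch rejects padding='same' for strided convolutions).
std::pair<int64_t, int64_t> same_padding_1d(int64_t kernel, int64_t dilation) {
  const int64_t total = dilation * (kernel - 1);
  const int64_t left = total / 2;       // what the convolution call receives
  const int64_t right = total - left;   // if != left, the extra zeros are padded explicitly
  return {left, right};
}
// e.g. kernel=3, dilation=1 -> {1, 1} (symmetric); kernel=4, dilation=1 -> {1, 2} (asymmetric).
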
return ConvBackend::Xnnpack2d; // 3x3 depthwith convolutions implementation is inference only - } else if (!need_backward && params.use_cpu_depthwise3x3_winograd(input, weight)) { + } else if (!need_backward && params.use_cpu_depthwise3x3_winograd(input, weight, bias)) { return ConvBackend::Winograd3x3Depthwise; } else if ( !params.transposed && (input.ndimension() == 5) && @@ -1186,12 +1279,65 @@ ConvBackend select_conv_backend( AT_ERROR("unsupported ConvNd parameters"); } +// Selects a backend for convolution based on the inputs and params. +ConvBackend select_conv_backend( + const Tensor& input_r, const Tensor& weight_r, const c10::optional& bias_opt, + IntArrayRef stride_, SymIntArrayRef padding_, IntArrayRef dilation_, + bool transposed_, SymIntArrayRef output_padding_, int64_t groups_, const at::OptionalSymIntArrayRef bias_sizes_opt) { + c10::MaybeOwned bias_maybe_owned = at::borrow_from_optional_tensor(bias_opt); + const Tensor& bias = *bias_maybe_owned; + + auto& ctx = at::globalContext(); + auto k = weight_r.ndimension(); + int64_t dim = k - 2; + ConvParams params; + params.stride = expand_param_if_needed(stride_, "stride", dim); + params.padding = expand_param_if_needed(padding_, "padding", dim); + params.dilation = expand_param_if_needed(dilation_, "dilation", dim); + params.transposed = transposed_; + params.output_padding = expand_param_if_needed(output_padding_, "output_padding", dim); + params.groups = groups_; + params.benchmark = ctx.benchmarkCuDNN(); + params.deterministic = ctx.deterministicCuDNN() || ctx.deterministicAlgorithms(); + params.cudnn_enabled = ctx.userEnabledCuDNN(); + params.allow_tf32 = ctx.allowTF32CuDNN(); + + auto input = input_r; + auto weight = weight_r; + check_shape_forward(input, weight.sym_sizes(), bias, params); + + // Expand 1d -> 2d. + // This is only done for backends that don't natively support 1d spatial input. + if (k == 3 && !input.is_mkldnn() && !input.is_xpu()) { + // avoid accidentally going through NHWC for permuted 3d input. + input = input.contiguous(); + params.view1d_as_2d(); + input = view4d(input); + weight = view4d(weight); + } + + auto bias_sizes = bias.defined() ? c10::optional(bias.sym_sizes()) : bias_sizes_opt; + bool need_backward = GradMode::is_enabled() && + (input.requires_grad() || weight.requires_grad() || (bias.defined() && bias.requires_grad())); + return _select_conv_backend(input, weight, bias, bias_sizes, need_backward, params); +} + +// For BC reasons, have a copy that does not require bias_opt +ConvBackend select_conv_backend( + const Tensor& input, + const Tensor& weight, + const at::OptionalIntArrayRef bias_sizes_opt, + const bool need_backward, + const ConvParams& params) { + return _select_conv_backend(input, weight, {}, bias_sizes_opt, need_backward, params); +} + at::Tensor _convolution_nogroup_backend( const Tensor& input, const Tensor& weight, const Tensor& bias, const ConvBackend backend, - const ConvParams& params) { + const ConvParams& params) { auto kernel_size = weight.sizes().slice(2); switch(backend) { case ConvBackend::NnpackSpatial: @@ -1222,7 +1368,7 @@ at::Tensor _convolution_nogroup_backend( static inline std::vector calc_output_size( const Tensor& input, const Tensor& weight, - const ConvParams& params) { + const ConvParams& params) { std::vector output_size = params.transposed ? 
conv_input_size(input.sizes(), weight.sizes(), params.padding, params.output_padding, params.stride, params.dilation, params.groups) : @@ -1277,6 +1423,13 @@ static inline at::MemoryFormat determine_backend_memory_format( return backend_memory_format; } +at::MemoryFormat _determine_backend_memory_format( + const Tensor& input, + const Tensor& weight, + const ConvBackend backend) { + return determine_backend_memory_format(input, weight, backend); +} + at::Tensor _convolution( const Tensor& input_r, const Tensor& weight_r, const c10::optional& bias_r_opt, IntArrayRef stride_, IntArrayRef padding_, IntArrayRef dilation_, @@ -1294,8 +1447,9 @@ at::Tensor _convolution( int64_t dim = k - 2; TORCH_CHECK(dim > 0, "weight should have at least three dimensions"); + TORCH_CHECK(groups_ > 0, "non-positive groups is not supported"); - ConvParams params; + ConvParams params; params.stride = expand_param_if_needed(stride_, "stride", dim); params.padding = expand_param_if_needed(padding_, "padding", dim); params.dilation = expand_param_if_needed(dilation_, "dilation", dim); @@ -1323,7 +1477,7 @@ at::Tensor _convolution( auto bias_sizes_opt = bias.defined() ? c10::optional(bias.sizes()) : c10::nullopt; bool need_backward = GradMode::is_enabled() && (input.requires_grad() || weight.requires_grad() || (bias.defined() && bias.requires_grad())); - ConvBackend backend = select_conv_backend(input, weight, bias_sizes_opt, need_backward, params); + ConvBackend backend = _select_conv_backend(input, weight, bias, c10::OptionalIntArrayRef(bias_sizes_opt), need_backward, params); at::MemoryFormat backend_memory_format = determine_backend_memory_format(input, weight, backend); // Call the backend. @@ -1358,7 +1512,19 @@ at::Tensor _convolution( break; case ConvBackend::Empty: { - auto weight_view = at::_unsafe_view(weight, -1); + Tensor weight_view; + // Use permute and clone to avoid at::_unsafe_view(weight, -1) failure for non-contiguous cases where + // view size is not compatible with input tensor's size and stride. + if(weight.is_contiguous()) { + weight_view = at::_unsafe_view(weight, -1); + } else if (weight.is_contiguous(at::MemoryFormat::ChannelsLast)) { + weight_view = at::_unsafe_view(at::permute(weight, {0, 2, 3, 1}), -1); + } else if (weight.is_contiguous(at::MemoryFormat::ChannelsLast3d)) { + weight_view = at::_unsafe_view(at::permute(weight, {0, 2, 3, 4, 1}), -1); + } else { + weight_view = at::_unsafe_view(weight.clone(at::MemoryFormat::Contiguous), -1); + } + output = (input.size(1) == 0) ? 
(input.view(-1) * weight_view) : (input * weight_view[0]); if (bias.defined()) { output.add_(bias[0]); @@ -1536,7 +1702,7 @@ std::tuple _convolution_double_backward( const c10::option auto weight = weight_r; int64_t dim = weight.ndimension() - 2; - ConvParams params; + ConvParams params; params.stride = expand_param_if_needed(stride_, "stride", dim); params.padding = expand_param_if_needed(padding_, "padding", dim); params.dilation = expand_param_if_needed(dilation_, "dilation", dim); @@ -1599,7 +1765,7 @@ std::tuple _convolution_double_backward( const c10::option if (ggI.defined()) { // Modified params with correct padding - ConvParams gw_conv_params(params); + ConvParams gw_conv_params(params); // Disable groups as they are handled separately auto groups = gw_conv_params.groups; @@ -1668,7 +1834,7 @@ std::tuple _convolution_double_backward( const c10::option Tensor gI; if (input.numel() != 0) { if (ggW.defined()) { - ConvParams gi_conv_params(params); + ConvParams gi_conv_params(params); gi_conv_params.transposed = !params.transposed; if (params.transposed) { @@ -1724,7 +1890,7 @@ std::tuple _convolution_backward_nogroup_bac const Tensor& weight, const std::array output_mask, const ConvBackend backend, - const ConvParams& params) { + const ConvParams& params) { auto kernel_size = weight.sizes().slice(2); switch(backend) { case ConvBackend::Slow2d: @@ -1789,7 +1955,7 @@ std::tuple convolution_backward( TORCH_CHECK(dim > 0, "weight should have at least three dimensions"); auto& ctx = at::globalContext(); - ConvParams params; + ConvParams params; params.stride = expand_param_if_needed(stride, "stride", dim); params.padding = expand_param_if_needed(padding, "padding", dim); params.dilation = expand_param_if_needed(dilation, "dilation", dim); diff --git a/aten/src/ATen/native/ConvolutionMM2d.cpp b/aten/src/ATen/native/ConvolutionMM2d.cpp index d93166a1e343..eb4deee26945 100644 --- a/aten/src/ATen/native/ConvolutionMM2d.cpp +++ b/aten/src/ATen/native/ConvolutionMM2d.cpp @@ -1,15 +1,26 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include -#include #include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/ConvolutionMM3d.cpp b/aten/src/ATen/native/ConvolutionMM3d.cpp index 98dce11f48d4..3569a9a55d8e 100644 --- a/aten/src/ATen/native/ConvolutionMM3d.cpp +++ b/aten/src/ATen/native/ConvolutionMM3d.cpp @@ -1,14 +1,26 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include -#include #include #include #include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + constexpr int64_t CONV3D_GRAIN_SALT = 20; namespace at { diff --git a/aten/src/ATen/native/ConvolutionMM3d.h b/aten/src/ATen/native/ConvolutionMM3d.h index 9567b5d928c1..b87674672d1d 100644 --- a/aten/src/ATen/native/ConvolutionMM3d.h +++ b/aten/src/ATen/native/ConvolutionMM3d.h @@ -1,4 +1,4 @@ -#include +#include namespace at { namespace native { diff --git a/aten/src/ATen/native/ConvolutionTBC.cpp b/aten/src/ATen/native/ConvolutionTBC.cpp index c90577822218..38aa7b85ca5f 100644 --- a/aten/src/ATen/native/ConvolutionTBC.cpp +++ b/aten/src/ATen/native/ConvolutionTBC.cpp @@ -1,8 +1,18 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include 
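
// Aside: the ConvBackend::Empty fix above avoids _unsafe_view(weight, -1) on non-contiguous
// weights because a flat view is only a reinterpretation of memory when the tensor is
// contiguous in its logical dimension order; channels-last storage of an NCHW weight is not,
// so the code permutes to NHWC (whose logical order matches memory order) before flattening,
// or clones to contiguous otherwise. A plain-C++ illustration of that contiguity condition
// (sizes/strides chosen by hand, not real tensors):
#include <cstdint>
#include <iostream>
#include <vector>

bool is_contiguous(const std::vector<int64_t>& sizes, const std::vector<int64_t>& strides) {
  int64_t expected = 1;
  for (int i = static_cast<int>(sizes.size()) - 1; i >= 0; --i) {
    if (sizes[i] != 1 && strides[i] != expected) return false;
    expected *= sizes[i];
  }
  return true;
}

int main() {
  // N=1, C=2, H=3, W=4 stored channels-last: strides are (H*W*C, 1, W*C, C).
  std::vector<int64_t> sizes{1, 2, 3, 4}, cl_strides{24, 1, 8, 2};
  std::cout << is_contiguous(sizes, cl_strides) << "\n";        // 0: a flat view would reorder data
  // The same storage seen through the permuted NHWC logical order is contiguous.
  std::vector<int64_t> nhwc_sizes{1, 3, 4, 2}, nhwc_strides{24, 8, 2, 1};
  std::cout << is_contiguous(nhwc_sizes, nhwc_strides) << "\n"; // 1: safe to flatten after permute
}
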
+#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/Copy.cpp b/aten/src/ATen/native/Copy.cpp index d4b5c74c3bf3..0c99943eb0cb 100644 --- a/aten/src/ATen/native/Copy.cpp +++ b/aten/src/ATen/native/Copy.cpp @@ -1,21 +1,29 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include +#include #include #include -#include -#include +#include #include #include #include #include #include #include -#include #include #include #include -#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif #ifdef USE_FBGEMM #include @@ -116,12 +124,17 @@ static Tensor & copy_impl(Tensor & self, const Tensor & src, bool non_blocking) // 1. Memory Format for source and destination tensors is contiguous. // 2. Device for both the source and destination tensor is CPU. // 3. dtype conversion between FP32->FP16 and FP16->FP32. + // This checks that self.sizes() == src.sizes() because this code path doesn't + // support broadcasting. This also guards against out of bounds memory access + // when copying, see fbgemm::Float16ToFloat_ref. + // https://github.com/pytorch/pytorch/issues/88543 #ifdef USE_FBGEMM if (((self.dtype() == at::kFloat && src.dtype() == at::kHalf) || (self.dtype() == at::kHalf && src.dtype() == at::kFloat)) && (self.device().is_cpu() && src.device().is_cpu()) && ((self.is_contiguous() && src.is_contiguous()) || - (self.is_non_overlapping_and_dense() && self.strides() == src.strides()))) { + (self.is_non_overlapping_and_dense() && self.strides() == src.strides())) && + (self.sizes() == src.sizes())) { if (src.dtype() == at::kFloat && self.dtype() == at::kHalf) { auto* output_ptr = reinterpret_cast(self.data_ptr()); @@ -212,6 +225,18 @@ static Tensor & copy_impl(Tensor & self, const Tensor & src, bool non_blocking) return at::metal::metal_copy_(self, src); } + // Exit early if self and src are views of the same data + const bool is_same_data = ( + self.is_alias_of(src) && + self.storage_offset() == src.storage_offset() && + self.strides().equals(src.strides()) && + self.sizes().equals(src.sizes()) && + self.scalar_type() == src.scalar_type() + ); + if (is_same_data) { + return self; + } + auto iter = TensorIteratorConfig() .add_output(self) @@ -253,27 +278,39 @@ static Tensor & copy_impl(Tensor & self, const Tensor & src, bool non_blocking) return self; } +// NB: cribbed from https://github.com/pytorch/pytorch/pull/88198 +at::Tensor clone_preserve_strides(const at::Tensor& self) { + TORCH_INTERNAL_ASSERT(self.has_storage()); + // In cases where the input tensor has internal memory overlap, we cannot actually + // preserve the strides/storage_offset of the input tensor, because + // *_scatter ops will try to copy_() into the cloned tensor. + // However, this should **never** show up in functionalized user code; + // most aten ops that try to mutate a tensor with internal memory overlap would error anyway. + // + // The one place that this does come up is in autograd - if there's a select_scatter + // in the forward, then autograd will generate one for the backward. + // If the input to the select_scatter is grad_output, then this could be an expanded tensor + // with internal overlap. 
+ //if (at::has_internal_overlap(self) == at::MemOverlap::Yes) { + // return self.clone(); + //} + auto dtype_size = self.dtype().itemsize(); + auto nbytes = self.storage().sym_nbytes(); + TORCH_INTERNAL_ASSERT(nbytes % dtype_size == 0); + auto numel = nbytes / dtype_size; + auto self_full_size = self.as_strided_symint({numel}, {1}, 0); + auto clone = self_full_size.clone(); + auto out = clone.as_strided_symint(self.sym_sizes(), self.sym_strides(), self.sym_storage_offset()); + return out; +} + Tensor copy(const Tensor& self, const Tensor& src, bool non_blocking) { // copy() is the "functional" form of copy_(). It exists so we can properly functionalize copy_(), but: // (1) It isn't exposed to the frontend (no python bindings) // (2) It isn't exposed to the backend (it's a composite, that decomposes into to() and expand_as() calls. - // Note: This implementation doesn't currently preserve the strides of `self`. - // That might be fine for functorch (which already doesn't preserve strides in vmap), - // but it's worth looking into whether or not this implementation will be problematic for LazyTensor/XLA. - auto intermediate = src.to(self, non_blocking); - // We can't use expand() here. Why? - // The contract for copy_() is that the output tensor has the same amount of storage as the original tensor. - // e.g. This should work: - // a = torch.ones(4, 4) - // b = torch.ones(1, 4) - // c = torch.ones(4, 4) - // torch.ops.aten.copy(a, b).add_(c) - // We don't want to emit an extra copy every time though, so we only do it if the shapes are different. - if (self.sizes() != intermediate.sizes()) { - return at::expand_copy(intermediate, self.sizes()); - } else { - return intermediate; - } + auto r = clone_preserve_strides(self); + r.copy_(src, non_blocking); + return r; } Tensor& copy_(Tensor& self, const Tensor& src, bool non_blocking) { diff --git a/aten/src/ATen/native/Correlation.cpp b/aten/src/ATen/native/Correlation.cpp index 0bd27195df76..9aca753c78ca 100644 --- a/aten/src/ATen/native/Correlation.cpp +++ b/aten/src/ATen/native/Correlation.cpp @@ -1,5 +1,23 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include #include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif namespace at { namespace native { @@ -47,7 +65,7 @@ Tensor cov( " != ", num_observations); TORCH_CHECK( - num_observations == 0 || w.min().ge(0).item(), + num_observations == 0 || at::is_scalar_tensor_true(w.min().ge(0)), "cov(): fweights cannot be negative"); } @@ -70,7 +88,7 @@ Tensor cov( " != ", num_observations); TORCH_CHECK( - num_observations == 0 || aw.min().ge(0).item(), + num_observations == 0 || at::is_scalar_tensor_true(aw.min().ge(0)), "cov(): aweights cannot be negative"); w = w.defined() ? w * aw : aw; } @@ -81,7 +99,7 @@ Tensor cov( : at::scalar_tensor(num_observations, in.options().dtype(kLong)); TORCH_CHECK( - !w.defined() || w_sum.ne(0).item(), + !w.defined() || at::is_scalar_tensor_true(w_sum.ne(0)), "cov(): weights sum to zero, can't be normalized"); const auto avg = (w.defined() ? 
in * w : in).sum(OBSERVATIONS_DIM) / w_sum; @@ -95,7 +113,7 @@ Tensor cov( norm_factor = w_sum - correction; } - if (norm_factor.le(0).item()) { + if (at::is_scalar_tensor_true(norm_factor.le(0))) { TORCH_WARN("cov(): degrees of freedom is <= 0"); norm_factor.zero_(); } @@ -121,7 +139,7 @@ Tensor corrcoef(const Tensor& self) { } // normalize covariance - const auto d = c.diag(); + const auto d = c.diagonal(); const auto stddev = at::sqrt(d.is_complex() ? at::real(d) : d); c = c / stddev.view({-1, 1}); c = c / stddev.view({1, -1}); diff --git a/aten/src/ATen/native/Cross.cpp b/aten/src/ATen/native/Cross.cpp index 4b3e43da1147..6c40001703c8 100644 --- a/aten/src/ATen/native/Cross.cpp +++ b/aten/src/ATen/native/Cross.cpp @@ -1,25 +1,39 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include -#include -#include +#include +#include #include +#include -#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif namespace at { namespace meta { -TORCH_PRECOMPUTE_META_FUNC(linalg_cross) -(const Tensor & input, const Tensor & other, const int64_t dimension) { - auto out_size = infer_size(input.sizes(), other.sizes()); - Tensor input_broadcasted = input.expand(out_size); - Tensor other_broadcasted = other.expand(out_size); +TORCH_META_FUNC(linalg_cross) +(const Tensor & input, const Tensor & other, int64_t dim) { + auto x_d = input.dim(); + auto y_d = other.dim(); + // This is to avoid things like + // linalg.cross(torch.randn(2, 3), torch.randn(5, 2, 3), dim=2) + TORCH_CHECK(x_d == y_d, "linalg.cross: inputs must have the same number of dimensions."); + TORCH_CHECK(input.size(dim) == 3 && other.size(dim) == 3, "linalg.cross: inputs dimension ", dim, " must have length 3. Got ", input.size(dim), " and ", other.size(dim)); - int64_t dim = maybe_wrap_dim(dimension, input.dim()); // default dim = -1 - TORCH_CHECK(input_broadcasted.size(dim) == 3, "dimension ", dimension, " does not have size 3"); + // Broadcast the batch dimension of input and other. 
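
// Aside: clone_preserve_strides() above copies a tensor while keeping its original
// sizes/strides/storage offset by cloning the whole underlying storage as a flat tensor and
// re-striding the clone. A rough sketch of the same idea using the plain (non-SymInt) ATen
// calls; it is illustrative only and omits the internal-overlap caveats discussed in the
// comments above.
#include <ATen/ATen.h>

at::Tensor clone_preserve_strides_sketch(const at::Tensor& self) {
  const auto itemsize = self.dtype().itemsize();
  const int64_t numel = static_cast<int64_t>(self.storage().nbytes() / itemsize);
  // View the full storage as a flat 1-D tensor, clone it, then restride the clone with the
  // original geometry so copy_() into it behaves like copy_() into the original.
  auto full = self.as_strided({numel}, {1}, 0);
  auto cloned = full.clone();
  return cloned.as_strided(self.sizes(), self.strides(), self.storage_offset());
}
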
+ // Since the non-batch dimensions agree, this is the same as broadcast all the inputs + auto out_size = infer_size(input.sizes(), other.sizes()); set_output_raw_strided(0, out_size, {}, input.options()); - return TORCH_PRECOMPUTE_STRUCT(linalg_cross)().set_dim(dim); } } @@ -56,8 +70,9 @@ Tensor & cross_out(const Tensor & input, const Tensor & other, const c10::option TORCH_IMPL_FUNC(linalg_cross_out) -(const Tensor & input, const Tensor & other, const int64_t dim, const Tensor & out) { - auto out_size = infer_size(input.sizes(), other.sizes()); +(const Tensor & input, const Tensor & other, int64_t dim, const Tensor & out) { + dim = maybe_wrap_dim(dim, input.dim()); + auto out_size = out.sizes(); Tensor input_broadcasted = input.expand(out_size); Tensor other_broadcasted = other.expand(out_size); diff --git a/aten/src/ATen/native/DilatedMaxPool2d.cpp b/aten/src/ATen/native/DilatedMaxPool2d.cpp index c9e980e44ab7..576e28866cbc 100644 --- a/aten/src/ATen/native/DilatedMaxPool2d.cpp +++ b/aten/src/ATen/native/DilatedMaxPool2d.cpp @@ -1,8 +1,18 @@ -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif namespace at { namespace meta { diff --git a/aten/src/ATen/native/DilatedMaxPool3d.cpp b/aten/src/ATen/native/DilatedMaxPool3d.cpp index 57fa6f9ea691..643943160556 100644 --- a/aten/src/ATen/native/DilatedMaxPool3d.cpp +++ b/aten/src/ATen/native/DilatedMaxPool3d.cpp @@ -1,11 +1,21 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/DispatchStub.cpp b/aten/src/ATen/native/DispatchStub.cpp index a91448c3da72..52f73cfce43a 100644 --- a/aten/src/ATen/native/DispatchStub.cpp +++ b/aten/src/ATen/native/DispatchStub.cpp @@ -1,6 +1,8 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include +#include #include #include diff --git a/aten/src/ATen/native/DispatchStub.h b/aten/src/ATen/native/DispatchStub.h index bcbf41fd9d0f..9394442fe754 100644 --- a/aten/src/ATen/native/DispatchStub.h +++ b/aten/src/ATen/native/DispatchStub.h @@ -1,11 +1,10 @@ #pragma once -#include -#include -#include +#include +#include -#include #include +#include // Implements instruction set specific function dispatch. 
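
// Aside: the linalg_cross meta function above now requires both inputs to have the same
// number of dimensions and the selected dim to have length 3, then broadcasts the remaining
// batch dimensions. The per-slice computation is the ordinary 3-vector cross product,
// sketched here for a single slice (standalone illustration, not the ATen kernel):
#include <array>
#include <iostream>

std::array<double, 3> cross3(const std::array<double, 3>& a, const std::array<double, 3>& b) {
  return {a[1] * b[2] - a[2] * b[1],
          a[2] * b[0] - a[0] * b[2],
          a[0] * b[1] - a[1] * b[0]};
}

int main() {
  auto c = cross3({1, 0, 0}, {0, 1, 0});
  std::cout << c[0] << " " << c[1] << " " << c[2] << "\n";  // 0 0 1
}
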
// diff --git a/aten/src/ATen/native/Distance.cpp b/aten/src/ATen/native/Distance.cpp index 8d23e10b1719..17be4a468751 100644 --- a/aten/src/ATen/native/Distance.cpp +++ b/aten/src/ATen/native/Distance.cpp @@ -1,11 +1,39 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include +#include #include -#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { DEFINE_DISPATCH(pdist_forward_stub); diff --git a/aten/src/ATen/native/DistributionTemplates.h b/aten/src/ATen/native/DistributionTemplates.h index 15e2be8c8f27..2132407df80f 100644 --- a/aten/src/ATen/native/DistributionTemplates.h +++ b/aten/src/ATen/native/DistributionTemplates.h @@ -1,8 +1,9 @@ #pragma once -#include +#include #include #include +#include #include #include #include @@ -12,6 +13,15 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { namespace templates { diff --git a/aten/src/ATen/native/Distributions.cpp b/aten/src/ATen/native/Distributions.cpp index 962c01061442..43305efdd885 100644 --- a/aten/src/ATen/native/Distributions.cpp +++ b/aten/src/ATen/native/Distributions.cpp @@ -1,24 +1,48 @@ -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include -#include +#include +#include #include #include #include -#include #include #include #include #include #include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include // NOLINTNEXTLINE(modernize-deprecated-headers) @@ -581,14 +605,14 @@ Tensor& multinomial_out(const Tensor& self, return result; } - // Fast-path for no replacement. + // Fast-path for no replacement or if only one sample is drawn. // Reference: // https://github.com/pytorch/pytorch/issues/11931#issuecomment-625882503 // Half is not supported on CPU. TORCH_CHECK( !(self.device().is_cpu() && self.scalar_type() == ScalarType::Half), "multinomial is not implemented for half on CPU"); - if (!with_replacement) { + if (!with_replacement || n_sample == 1) { // Sanity checks on `self`. 
auto is_valid = ((self.max() < INFINITY) & (self.min() >= 0)).item(); TORCH_CHECK( diff --git a/aten/src/ATen/native/Dropout.cpp b/aten/src/ATen/native/Dropout.cpp index 36e1b92ad1bd..2903fac4f504 100644 --- a/aten/src/ATen/native/Dropout.cpp +++ b/aten/src/ATen/native/Dropout.cpp @@ -1,8 +1,25 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { @@ -12,9 +29,9 @@ template using Ctype = typename std::conditional::type; Tensor make_feature_noise(const Tensor& input) { - auto input_sizes = input.sizes(); + auto input_sizes = input.sym_sizes(); TORCH_CHECK(input.dim() >= 2, "Feature dropout requires at least 2 dimensions in the input"); - std::vector sizes; + c10::SymDimVector sizes; sizes.reserve(input.dim()); sizes.push_back(input_sizes[0]); sizes.push_back(input_sizes[1]); @@ -22,11 +39,11 @@ Tensor make_feature_noise(const Tensor& input) { (void)i; //Suppress unused variable warning sizes.push_back(1); } - return input.new_empty(sizes); + return input.new_empty_symint(sizes); } bool is_fused_kernel_acceptable(const Tensor& input, double p) { - return (input.is_cuda() || input.is_xpu() || input.is_lazy()) && p > 0 && p < 1 && input.numel() > 0; + return (input.is_cuda() || input.is_xpu() || input.is_lazy()) && p > 0 && p < 1 && input.sym_numel() > 0; } // NB: sure, we could have used different overloads here, but I would feel insecure @@ -46,7 +63,7 @@ Tensor multiply(const Tensor& input, const Tensor& noise) { template Ctype _dropout_impl(T& input, double p, bool train) { TORCH_CHECK(p >= 0 && p <= 1, "dropout probability has to be between 0 and 1, but got ", p); - if (p == 0 || !train || input.numel() == 0) { + if (p == 0 || !train || input.sym_numel() == 0) { return input; } @@ -109,7 +126,7 @@ native_dropout_cpu(const Tensor& input, double p, c10::optional train) { return std::make_tuple(output, mask); } -Tensor native_dropout_backward_cpu(const Tensor& grad, const Tensor& mask, double scale) { +Tensor native_dropout_backward(const Tensor& grad, const Tensor& mask, double scale) { Tensor result = grad * mask * scale; return result; } @@ -117,7 +134,10 @@ Tensor native_dropout_backward_cpu(const Tensor& grad, const Tensor& mask, doubl Tensor dropout(const Tensor& input, double p, bool train) { auto result = [&]() { NoNamesGuard guard; - if (train && is_fused_kernel_acceptable(input, p)) { + // TODO: we can remove this is_nested() code smell in the future + // if we find a way to support _dropout for nested tensor + // e.g. make it an op (at::_dropout) to use dispatcher? 
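
// Aside: the multinomial fast path referenced above (issue #11931) is based on the
// exponential-race / Gumbel-top-k idea: draw one independent Exponential(1) variable per
// category, divide each weight by it, and keep the indices of the k largest keys, which
// yields weighted samples without replacement. The sketch below is a plain-C++ illustration
// of that trick, not the ATen kernel itself.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

std::vector<size_t> sample_without_replacement(const std::vector<double>& weights,
                                               size_t k, std::mt19937& gen) {
  std::exponential_distribution<double> exp_dist(1.0);
  std::vector<double> keys(weights.size());
  for (size_t i = 0; i < weights.size(); ++i)
    keys[i] = weights[i] / exp_dist(gen);  // larger weight -> larger key on average
  std::vector<size_t> idx(weights.size());
  std::iota(idx.begin(), idx.end(), size_t{0});
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](size_t a, size_t b) { return keys[a] > keys[b]; });
  idx.resize(k);
  return idx;
}

int main() {
  std::mt19937 gen(0);
  for (size_t i : sample_without_replacement({0.1, 0.2, 0.3, 0.4}, 2, gen))
    std::cout << i << " ";
  std::cout << "\n";
}
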
+ if (input.is_nested() || (train && is_fused_kernel_acceptable(input, p))) { return std::get<0>(at::native_dropout(input, p, train)); } return _dropout(input, p, train); diff --git a/aten/src/ATen/native/Embedding.cpp b/aten/src/ATen/native/Embedding.cpp index cac0cbe7130f..4c37325c4817 100644 --- a/aten/src/ATen/native/Embedding.cpp +++ b/aten/src/ATen/native/Embedding.cpp @@ -1,20 +1,40 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include +#include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include #include -#include #include namespace at { namespace native { -Tensor embedding(const Tensor & weight, const Tensor & indices, - int64_t padding_idx, bool scale_grad_by_freq, bool sparse) { +Tensor embedding_symint(const Tensor & weight, const Tensor & indices, + c10::SymInt padding_idx, bool scale_grad_by_freq, bool sparse) { TORCH_CHECK(weight.dim() == 2, "'weight' must be 2-D"); auto indices_arg = TensorArg(indices, "indices", 1); checkScalarTypes("embedding", indices_arg, {kLong, kInt}); @@ -24,23 +44,30 @@ Tensor embedding(const Tensor & weight, const Tensor & indices, return weight.index_select(0, indices); } - auto size = indices.sizes().vec(); - for (auto d : weight.sizes().slice(1)) { + auto size = indices.sym_sizes().vec(); + for (auto d : weight.sym_sizes().slice(1)) { size.push_back(d); } - return weight.index_select(0, indices.reshape(-1)).view(size); + return weight.index_select(0, indices.reshape(-1)).view_symint(size); } -Tensor embedding_backward( - const Tensor & grad, const Tensor & indices, int64_t num_weights, - int64_t padding_idx, bool scale_grad_by_freq, bool sparse) { +Tensor embedding_backward_symint( + const Tensor & grad, const Tensor & indices, c10::SymInt num_weights, + c10::SymInt padding_idx, bool scale_grad_by_freq, bool sparse) { if (sparse) { + // TODO: if we teach sparse tensor how to propagate symints, the guard + // here is not strictly necessary. However, we think it is fine as is + // because num weights is derived from a parameter and therefore + // typically not varying. 
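
// Aside: embedding_symint() above reduces an N-dimensional index tensor to the 1-D case by
// flattening the indices, gathering the corresponding rows of the weight matrix with
// index_select, and viewing the result back to indices.sizes() + [embedding_dim]. A
// plain-C++ sketch of that row gather (no padding_idx or sparse handling; names are mine):
#include <cstdint>
#include <vector>

// weight: [num_embeddings, dim] flattened row-major; returns [indices.size(), dim].
std::vector<float> embedding_lookup(const std::vector<float>& weight, int64_t dim,
                                    const std::vector<int64_t>& indices) {
  std::vector<float> out;
  out.reserve(indices.size() * dim);
  for (int64_t idx : indices) {
    const float* row = weight.data() + idx * dim;
    out.insert(out.end(), row, row + dim);  // copy one embedding row per index
  }
  return out;
}
// e.g. embedding_lookup(weight, 8, {3, 0, 3}) returns rows 3, 0, 3 of an [N, 8] table.
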
return at::embedding_sparse_backward( - grad, indices, num_weights, padding_idx, scale_grad_by_freq); + grad, indices, + num_weights.guard_int(__FILE__, __LINE__), + padding_idx.guard_int(__FILE__, __LINE__), + scale_grad_by_freq); } else { - return at::embedding_dense_backward( - grad, indices, num_weights, padding_idx, scale_grad_by_freq); + return at::embedding_dense_backward_symint( + grad, indices, num_weights, padding_idx, scale_grad_by_freq); } } @@ -60,25 +87,25 @@ Tensor embedding_sparse_backward( Tensor indices = indices_; Tensor grad = grad_; if (padding_idx != -1) { - torch::List> c({indices != padding_idx}); + c10::List> c({indices != padding_idx}); indices = indices.index(c); grad = grad.index(c); } - int64_t num_features = grad_.size(-1); - auto weight_size = std::array{{ num_weights, num_features }}; + auto num_features = grad_.sym_size(-1); + auto weight_size = std::array{{ num_weights, num_features }}; auto dense_options = grad.options(); // check if all our grad come from padding_idx - if (grad.numel() == 0) { - return at::_sparse_coo_tensor_unsafe(at::empty({1, 0}, indices_.options().dtype(kLong)), - at::empty({0, num_features}, dense_options), + if (grad.sym_numel() == 0) { + return at::_sparse_coo_tensor_unsafe_symint(at::empty({1, 0}, indices_.options().dtype(kLong)), + at::empty_symint({c10::SymInt(0), num_features}, dense_options), weight_size); } auto index = indices.reshape({1, -1}); - auto values = grad.reshape({-1, num_features}); - return at::_sparse_coo_tensor_unsafe(index.to(kLong), values, weight_size); + auto values = grad.reshape_symint({c10::SymInt(-1), num_features}); + return at::_sparse_coo_tensor_unsafe_symint(index.to(kLong), values, weight_size); } Tensor embedding_dense_backward_cpu( diff --git a/aten/src/ATen/native/EmbeddingBag.cpp b/aten/src/ATen/native/EmbeddingBag.cpp index 17094bf9082d..21404947b3db 100644 --- a/aten/src/ATen/native/EmbeddingBag.cpp +++ b/aten/src/ATen/native/EmbeddingBag.cpp @@ -1,13 +1,16 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include -#include +#include #include +#include #include #include #include +#include #include +#include #ifdef USE_FBGEMM #include @@ -18,12 +21,32 @@ #include #include -#include -#include -#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif namespace { const int MODE_SUM = 0; @@ -1245,8 +1268,6 @@ void _embedding_bag_cpu_out( fbgemm_kernel_cache); } -// Assumes all input tensors are contiguous. -// See NOTE [ embedding_bag Native Functions ] in native_functions.yaml for details Tensor _embedding_bag_backward(const Tensor &grad, const Tensor &indices_, const Tensor &offsets_, const Tensor &offset2bag, @@ -1256,6 +1277,21 @@ Tensor _embedding_bag_backward(const Tensor &grad, const Tensor &indices_, bool scale_grad_by_freq, int64_t mode, bool sparse, const c10::optional& per_sample_weights_opt, int64_t padding_idx) { + return at::native::_embedding_bag_backward_symint( + grad, indices_, offsets_, offset2bag, bag_size_, max_indices_, num_weights, scale_grad_by_freq, mode, sparse, per_sample_weights_opt, padding_idx); +} + +// Assumes all input tensors are contiguous. 
+// See NOTE [ embedding_bag Native Functions ] in native_functions.yaml for details +Tensor _embedding_bag_backward_symint(const Tensor &grad, const Tensor &indices_, + const Tensor &offsets_, + const Tensor &offset2bag, + const Tensor &bag_size_, + const Tensor &max_indices_, + c10::SymInt num_weights, + bool scale_grad_by_freq, int64_t mode, + bool sparse, const c10::optional& per_sample_weights_opt, + int64_t padding_idx) { // See [Note: hacky wrapper removal for optional tensor] c10::MaybeOwned per_sample_weights_maybe_owned = at::borrow_from_optional_tensor(per_sample_weights_opt); const Tensor& per_sample_weights = *per_sample_weights_maybe_owned; @@ -1271,7 +1307,7 @@ Tensor _embedding_bag_backward(const Tensor &grad, const Tensor &indices_, checkContiguous("embedding_bag", offsets_arg); Tensor offset2bag_; - if (indices.numel() != 0 && offset2bag.numel() == 0) { + if (indices.sym_numel() != 0 && offset2bag.sym_numel() == 0) { offset2bag_ = offsets.new_zeros( {indices.size(0) + 1}, offsets.options()); // offset2bag = [0 0 0 0 0] @@ -1292,11 +1328,11 @@ Tensor _embedding_bag_backward(const Tensor &grad, const Tensor &indices_, } if (sparse) { - return at::_embedding_bag_sparse_backward( + return at::_embedding_bag_sparse_backward_symint( grad, indices, offsets, offset2bag_, bag_size_, num_weights, scale_grad_by_freq, mode, per_sample_weights, padding_idx); } else { - return at::_embedding_bag_dense_backward( + return at::_embedding_bag_dense_backward_symint( grad, indices, offset2bag_, bag_size_, max_indices_, num_weights, scale_grad_by_freq, mode, per_sample_weights, padding_idx); } @@ -1606,7 +1642,16 @@ Tensor _embedding_bag_per_sample_weights_backward_cpu( Tensor _embedding_bag_sparse_backward( const Tensor &grad_, const Tensor &indices, const Tensor &offsets, - const Tensor &offset2bag, const Tensor &bag_size_, int64_t num_weights, + const Tensor &offset2bag, const Tensor &bag_size_, SymInt num_weights, + bool scale_grad_by_freq, int64_t mode, const c10::optional& per_sample_weights_opt, + int64_t padding_idx) { + return at::native::_embedding_bag_sparse_backward_symint(grad_, indices, offsets, offset2bag, bag_size_, num_weights, + scale_grad_by_freq, mode, per_sample_weights_opt, padding_idx); +} + +Tensor _embedding_bag_sparse_backward_symint( + const Tensor &grad_, const Tensor &indices, const Tensor &offsets, + const Tensor &offset2bag, const Tensor &bag_size_, SymInt num_weights, bool scale_grad_by_freq, int64_t mode, const c10::optional& per_sample_weights_opt, int64_t padding_idx) { // See [Note: hacky wrapper removal for optional tensor] @@ -1628,7 +1673,7 @@ Tensor _embedding_bag_sparse_backward( AT_ASSERT(mode == MODE_SUM); index_grad.mul_(per_sample_weights.unsqueeze(1)); } - return native::embedding_backward(index_grad, indices, num_weights, padding_idx, + return native::embedding_backward_symint(index_grad, indices, num_weights, padding_idx, scale_grad_by_freq, true); } } diff --git a/aten/src/ATen/native/EmbeddingBag.h b/aten/src/ATen/native/EmbeddingBag.h index 6600c661d46a..9d44fa688b2b 100644 --- a/aten/src/ATen/native/EmbeddingBag.h +++ b/aten/src/ATen/native/EmbeddingBag.h @@ -1,4 +1,5 @@ -#include +#include +#include #include #ifdef USE_FBGEMM diff --git a/aten/src/ATen/native/Fill.cpp b/aten/src/ATen/native/Fill.cpp index 4952aaa91a05..ac3e2bb2cbd6 100644 --- a/aten/src/ATen/native/Fill.cpp +++ b/aten/src/ATen/native/Fill.cpp @@ -1,13 +1,24 @@ // Functions that fill Tensors with constants. 
+#define TORCH_ASSERT_ONLY_METHOD_OPERATORS -#include -#include #include -#include -#include +#include +#include +#include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/ForeachOpsKernels.cpp b/aten/src/ATen/native/ForeachOpsKernels.cpp index f5665be248e4..4b6ef9196f99 100644 --- a/aten/src/ATen/native/ForeachOpsKernels.cpp +++ b/aten/src/ATen/native/ForeachOpsKernels.cpp @@ -1,7 +1,57 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { #define FOREACH_BINARY_OP_SCALAR(OP) \ @@ -146,7 +196,30 @@ void foreach_tensor_##OP##_scalarlist_slow_(TensorList input, TensorList tensors for(const auto i : c10::irange(input.size())) { \ input[i].OP##_(tensors1[i], tensors2[i], scalars[i]); \ } \ -} \ +} + +#define FOREACH_POINTWISE_OP_TENSOR(OP) \ + std::vector foreach_tensor_##OP##_tensor_slow( \ + TensorList input, \ + TensorList tensors1, \ + TensorList tensors2, \ + const Tensor& scalars_) { \ + auto scalars = convert_tensor_to_scalar_list(scalars_, input.size()); \ + check_foreach_api_restrictions(input, tensors1, tensors2, scalars); \ + return foreach_tensor_##OP##_scalarlist_slow( \ + input, tensors1, tensors2, scalars); \ + } \ + \ + void foreach_tensor_##OP##_tensor_slow_( \ + TensorList input, \ + TensorList tensors1, \ + TensorList tensors2, \ + const Tensor& scalars_) { \ + auto scalars = convert_tensor_to_scalar_list(scalars_, input.size()); \ + check_foreach_api_restrictions(input, tensors1, tensors2, scalars); \ + foreach_tensor_##OP##_scalarlist_slow_( \ + input, tensors1, tensors2, scalars); \ + } FOREACH_BINARY_OP_LIST_ALPHA(add); FOREACH_BINARY_OP_LIST_ALPHA(sub); @@ -199,6 +272,9 @@ FOREACH_POINTWISE_OP_SCALAR(addcmul); FOREACH_POINTWISE_OP_SCALARLIST(addcdiv); FOREACH_POINTWISE_OP_SCALARLIST(addcmul); +FOREACH_POINTWISE_OP_TENSOR(addcdiv); +FOREACH_POINTWISE_OP_TENSOR(addcmul); + // NOTE(crcrpar): It didn't seem feasible to use `self[i]` as both the first and the last // arguments of `maximum_out` and `minimum_out` so I tentatively embarrassingly get and copy // the result to `self[i]`. 
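The ForeachOpsKernels.cpp hunk above adds _foreach_addcdiv/_foreach_addcmul overloads that accept the per-tensor scalars as a single 1-D tensor: the new FOREACH_POINTWISE_OP_TENSOR macro converts that tensor into a list of c10::Scalar and falls through to the existing scalarlist slow path. Below is a minimal illustrative sketch of that conversion step, not part of the patch; the name unpack_scalars is hypothetical, and the real convert_tensor_to_scalar_list (added to ForeachUtils.h in the next hunk) dispatches on dtype and reads the raw data pointer rather than calling item() per element.

#include <torch/torch.h>
#include <vector>

// Hypothetical helper: turn a 1-D CPU tensor holding one scalar per input
// tensor into a std::vector<c10::Scalar> so the existing scalar-list code
// paths can be reused unchanged.
std::vector<c10::Scalar> unpack_scalars(const torch::Tensor& scalars,
                                        int64_t expect_length) {
  TORCH_CHECK(scalars.device().is_cpu(), "expected scalars on CPU");
  TORCH_CHECK(scalars.dim() == 1 && scalars.size(0) == expect_length,
              "expected a 1-D tensor with one scalar per input tensor");
  std::vector<c10::Scalar> out;
  out.reserve(expect_length);
  for (int64_t i = 0; i < expect_length; ++i) {
    // item() materializes element i as a host-side c10::Scalar.
    out.emplace_back(scalars[i].item());
  }
  return out;
}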
diff --git a/aten/src/ATen/native/ForeachUtils.h b/aten/src/ATen/native/ForeachUtils.h index 033052f401f6..0166d040863c 100644 --- a/aten/src/ATen/native/ForeachUtils.h +++ b/aten/src/ATen/native/ForeachUtils.h @@ -2,6 +2,7 @@ #include #include +#include #ifndef AT_PER_OPERATOR_HEADERS #include @@ -123,6 +124,45 @@ bool check_fast_path_restrictions( return true; } +std::vector convert_tensor_to_scalar_list( + const Tensor& scalarList_, + int64_t expect_length) { + std::vector scalarList; + TORCH_CHECK( + scalarList_.device() == c10::kCPU, + "Expected scalars to be on CPU, got ", + scalarList_.device(), + " instead."); + TORCH_CHECK( + scalarList_.is_contiguous(), "Expected scalars to be contiguous."); + TORCH_CHECK( + scalarList_.dim() == 1, + "Expected packed scalar Tensor to be of dimension 1. Got ", + scalarList_.dim(), + " instead."); + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( + kComplexHalf, + kHalf, + kBool, + kBFloat16, + scalarList_.scalar_type(), + "convert_tensor_to_scalar_list", + [&]() { + const scalar_t* scalar_data = scalarList_.data_ptr(); + TORCH_CHECK( + (expect_length == scalarList_.size(0)), + "Expected length of scalars to match input of length ", + expect_length, + " but got ", + scalarList_.size(0), + " instead."); + for (int64_t i = 0; i < scalarList_.size(0); i++) { + scalarList.push_back(c10::Scalar(scalar_data[i])); + } + }); + return scalarList; +} + bool can_use_fast_route(ArrayRef tensorLists, ArrayRef scalarList = {}, bool does_op_promote_integer_inputs_to_float = false) { diff --git a/aten/src/ATen/native/FractionalMaxPool2d.cpp b/aten/src/ATen/native/FractionalMaxPool2d.cpp index b4f8207af042..82512c83f433 100644 --- a/aten/src/ATen/native/FractionalMaxPool2d.cpp +++ b/aten/src/ATen/native/FractionalMaxPool2d.cpp @@ -1,8 +1,18 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + #include #include diff --git a/aten/src/ATen/native/FractionalMaxPool3d.cpp b/aten/src/ATen/native/FractionalMaxPool3d.cpp index 11769545090f..5890026872a8 100644 --- a/aten/src/ATen/native/FractionalMaxPool3d.cpp +++ b/aten/src/ATen/native/FractionalMaxPool3d.cpp @@ -1,10 +1,20 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include +#include #include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + #include namespace at { diff --git a/aten/src/ATen/native/GatedLinearUnit.cpp b/aten/src/ATen/native/GatedLinearUnit.cpp index b7b20e1c32f1..0bbfc74f99a7 100644 --- a/aten/src/ATen/native/GatedLinearUnit.cpp +++ b/aten/src/ATen/native/GatedLinearUnit.cpp @@ -1,7 +1,22 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace meta { diff --git a/aten/src/ATen/native/GridSampler.cpp b/aten/src/ATen/native/GridSampler.cpp index 8b0440610226..586c1cab40d1 100644 --- a/aten/src/ATen/native/GridSampler.cpp +++ b/aten/src/ATen/native/GridSampler.cpp @@ -1,17 +1,35 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include -#include -#include -#include +#include +#include #include -#include -#include -#include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include 
+#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { using at::native::detail::GridSamplerInterpolation; diff --git a/aten/src/ATen/native/GridSamplerUtils.h b/aten/src/ATen/native/GridSamplerUtils.h index 0b6f29de8c42..7c22fedfe94e 100644 --- a/aten/src/ATen/native/GridSamplerUtils.h +++ b/aten/src/ATen/native/GridSamplerUtils.h @@ -101,7 +101,7 @@ bool cond_cudnn_grid_sampler( at::native::canUse32BitIndexMath(input) && at::native::canUse32BitIndexMath(grid) && input.dim() == 4 && - input.size(1) <= 1024); + input.sym_size(1) <= 1024); } } // anonymous namespace diff --git a/aten/src/ATen/native/Histogram.cpp b/aten/src/ATen/native/Histogram.cpp index c3a007f2c2dc..89ede6bea35c 100644 --- a/aten/src/ATen/native/Histogram.cpp +++ b/aten/src/ATen/native/Histogram.cpp @@ -1,10 +1,28 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include #include diff --git a/aten/src/ATen/native/Histogram.h b/aten/src/ATen/native/Histogram.h index 9df0aafafc18..3305cc5e315f 100644 --- a/aten/src/ATen/native/Histogram.h +++ b/aten/src/ATen/native/Histogram.h @@ -3,8 +3,6 @@ #include #include -#include - namespace at { namespace native { using histogramdd_fn = void(*)(const Tensor&, const c10::optional&, bool, Tensor&, const TensorList&); diff --git a/aten/src/ATen/native/Im2Col.cpp b/aten/src/ATen/native/Im2Col.cpp index c4b05bc18b56..416e77e9ff19 100644 --- a/aten/src/ATen/native/Im2Col.cpp +++ b/aten/src/ATen/native/Im2Col.cpp @@ -1,12 +1,21 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include -#include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + namespace at { namespace native { namespace { @@ -85,7 +94,6 @@ static void im2col_out_cpu_template( int64_t output_length = output_height * output_width; output.resize_({batch_size, n_output_plane, output_length}); - output.zero_(); AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kBFloat16, kHalf, input.scalar_type(), "im2col_out_cpu", [&] { @@ -120,29 +128,6 @@ static void im2col_out_cpu_template( }); } -static void im2col_backward_out_cpu_template( - Tensor& grad_input, - const Tensor& grad_output, - IntArrayRef input_size, - IntArrayRef kernel_size, - IntArrayRef dilation, - IntArrayRef padding, - IntArrayRef stride) { - TORCH_CHECK( - input_size.size() == 2, - "It is expected input_size equals to 2, but got size ", - input_size.size()); - // col2im_out_cpu checks size of kernel_size, dilation, padding and stride - at::native::col2im_out_cpu( - grad_output, - input_size, - kernel_size, - dilation, - padding, - stride, - grad_input); -} - } // namespace Tensor& im2col_out_cpu(const Tensor& input, @@ -169,43 +154,5 @@ Tensor im2col_cpu( return output; } -Tensor& im2col_backward_out_cpu(const Tensor& grad_output, - IntArrayRef input_size, - IntArrayRef kernel_size, - IntArrayRef dilation, - IntArrayRef padding, - IntArrayRef stride, - Tensor& grad_input) { - im2col_backward_out_cpu_template( - grad_input, - grad_output, - input_size, - kernel_size, - dilation, - padding, - stride); - return grad_input; -} - -Tensor im2col_backward_cpu( - const Tensor& grad_output, - IntArrayRef input_size, - 
IntArrayRef kernel_size, - IntArrayRef dilation, - IntArrayRef padding, - IntArrayRef stride) { - Tensor grad_input = at::empty_like(grad_output, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - - im2col_backward_out_cpu_template( - grad_input, - grad_output, - input_size, - kernel_size, - dilation, - padding, - stride); - return grad_input; -} - } // namespace native } // namespace at diff --git a/aten/src/ATen/native/IndexKernel.h b/aten/src/ATen/native/IndexKernel.h index 41b4efc5f441..a54343d510a8 100644 --- a/aten/src/ATen/native/IndexKernel.h +++ b/aten/src/ATen/native/IndexKernel.h @@ -1,5 +1,6 @@ #pragma once #include +#include namespace at { class Tensor; diff --git a/aten/src/ATen/native/IndexingUtils.cpp b/aten/src/ATen/native/IndexingUtils.cpp index e91eff03ab85..2dba1972ce57 100644 --- a/aten/src/ATen/native/IndexingUtils.cpp +++ b/aten/src/ATen/native/IndexingUtils.cpp @@ -1,9 +1,10 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include namespace at { namespace native { bool canUse32BitIndexMath(const TensorBase& t, int64_t max_elem) { - int64_t elements = t.numel(); + auto elements = t.sym_numel(); if (elements >= max_elem) { return false; } @@ -11,16 +12,16 @@ bool canUse32BitIndexMath(const TensorBase& t, int64_t max_elem) { return max_elem > 0; } - int64_t offset = 0; - int64_t linearId = elements - 1; + c10::SymInt offset = 0; + auto linearId = elements - 1; // NOTE: Assumes all strides are positive, which is true for now // NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions) for (int i = t.dim() - 1; i >= 0; --i) { - int64_t curDimIndex = linearId % t.size(i); - int64_t curDimOffset = curDimIndex * t.stride(i); + auto curDimIndex = linearId % t.sym_size(i); + auto curDimOffset = curDimIndex * t.sym_stride(i); offset += curDimOffset; - linearId /= t.size(i); + linearId /= t.sym_size(i); } if (offset >= max_elem) { diff --git a/aten/src/ATen/native/IndexingUtils.h b/aten/src/ATen/native/IndexingUtils.h index 500df7966d8e..a99b3817c275 100644 --- a/aten/src/ATen/native/IndexingUtils.h +++ b/aten/src/ATen/native/IndexingUtils.h @@ -48,12 +48,18 @@ static C10_UNUSED std::vector expandTensors(const Tensor & self, IOptTen return result; } -static C10_UNUSED void checkIndexTensorTypes(IOptTensorListRef indices) { +static C10_UNUSED void checkIndexTensorTypes(IOptTensorListRef indices, bool allow_int=false) { for (const auto& tensor : indices) { if (tensor.has_value() && tensor->defined()) { auto scalarType = tensor->scalar_type(); - if (scalarType != kLong && scalarType != kByte && scalarType != kBool) { - TORCH_CHECK_INDEX(false, "tensors used as indices must be long, byte or bool tensors"); + if (allow_int) { + if (scalarType != kLong && scalarType != kByte && scalarType != kBool && scalarType != kInt) { + TORCH_CHECK_INDEX(false, "tensors used as indices must be long, int, byte or bool tensors"); + } + } else { + if (scalarType != kLong && scalarType != kByte && scalarType != kBool) { + TORCH_CHECK_INDEX(false, "tensors used as indices must be long, byte or bool tensors"); + } } } } diff --git a/aten/src/ATen/native/Integration.cpp b/aten/src/ATen/native/Integration.cpp index 7ca01bae18a5..09e444476d1f 100644 --- a/aten/src/ATen/native/Integration.cpp +++ b/aten/src/ATen/native/Integration.cpp @@ -1,12 +1,23 @@ -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include 
+#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/Itertools.cpp b/aten/src/ATen/native/Itertools.cpp index 265b05054b0a..8d6ff506a43f 100644 --- a/aten/src/ATen/native/Itertools.cpp +++ b/aten/src/ATen/native/Itertools.cpp @@ -1,5 +1,20 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#endif #include diff --git a/aten/src/ATen/native/Lerp.cpp b/aten/src/ATen/native/Lerp.cpp index bfac91a881ae..2e67dec35033 100644 --- a/aten/src/ATen/native/Lerp.cpp +++ b/aten/src/ATen/native/Lerp.cpp @@ -1,5 +1,14 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include +#include +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS #include +#else +#include +#endif namespace at { namespace meta { diff --git a/aten/src/ATen/native/Lerp.h b/aten/src/ATen/native/Lerp.h index f24032f5e38d..c1784ae16f31 100644 --- a/aten/src/ATen/native/Lerp.h +++ b/aten/src/ATen/native/Lerp.h @@ -1,12 +1,39 @@ #pragma once #include +#include #include #include namespace at { namespace native { +template +C10_HOST_DEVICE C10_ALWAYS_INLINE bool is_lerp_weight_small(scalar_t weight) { + return std::abs(weight) < scalar_t(0.5); +} +template +C10_HOST_DEVICE C10_ALWAYS_INLINE bool is_lerp_weight_small(c10::complex weight) { + // Avoid the sqrt in abs(weight) + return (weight.real() * weight.real() + weight.imag() * weight.imag()) < scalar_t(0.25); +} + +template +C10_HOST_DEVICE C10_ALWAYS_INLINE scalar_t lerp(scalar_t self_, scalar_t end_, weight_t weight_) { + using opmath_t = at::opmath_type; + using opmath_weight_t = at::opmath_type; + + opmath_t self = self_; + opmath_t end = end_; + opmath_weight_t weight = weight_; + + // Conditional for better numeric. This has been discussed in + // https://github.com/pytorch/pytorch/pull/18871 + return is_lerp_weight_small(weight) + ? self + weight * (end - self) + : end - (end - self) * (opmath_t(1) - weight); +} + using lerp_fn_scalar = void (*)( at::TensorIteratorBase& iter, const Scalar& weight); diff --git a/aten/src/ATen/native/Linear.cpp b/aten/src/ATen/native/Linear.cpp index a002369fc547..591289a726ac 100644 --- a/aten/src/ATen/native/Linear.cpp +++ b/aten/src/ATen/native/Linear.cpp @@ -1,16 +1,36 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include #include -#include +#include +#include #include #include #include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include -#include #include #include #include @@ -26,9 +46,6 @@ Tensor linear(const Tensor& input, const Tensor& weight, const c10::optionaldefined() && input.is_contiguous()) { - // Also hit the fused path for contiguous 3D input. - const auto input_sizes = input.sizes(); - const auto result = at::addmm(*bias, input.view({input_sizes[0] * input_sizes[1], input_sizes[2]}), weight.t()); - return result.view({input_sizes[0], input_sizes[1], result.size(1)}); + if (input.dim() == 3 && bias->defined() && input.is_contiguous() && + !input.is_xla()) { + // Also hit the fused path for contiguous 3D input, if not using xla + // backend. Reshaping/flattening has some performance implications on xla. 
+ const auto input_sizes = input.sym_sizes(); + const auto result = at::addmm(*bias, input.view_symint({input_sizes[0] * input_sizes[1], input_sizes[2]}), weight.t()); + return result.view_symint({input_sizes[0], input_sizes[1], result.sym_size(1)}); } auto output = at::matmul(input, weight.t()); if (bias->defined()) { @@ -86,51 +105,52 @@ static Tensor sumproduct_pair(const Tensor& left_, const Tensor& right_, IntArra return at::mul(left_, right_); int64_t dim = left_.dim(); auto sum_dims = at::dim_list_to_bitset(sum_dims_, dim); - // dimensions that will be part of the output (i.e. not summed over) in three vectors - // dims in lro appear in left, right and output, similarly lo: left and output, ro: right and output + // dimensions that will be part of the output (i.e. not summed over) in three vectors: + // dims in lro appear in left, right and output, similarly, lo: left and output, ro: right and output // also the sizes are kept track of for reshaping std::vector lro, lo, ro; - int64_t lro_size = 1, lo_size = 1, ro_size = 1, sum_size = 1; + SymInt lro_size = 1, lo_size = 1, ro_size = 1, sum_size = 1; Tensor left = left_; Tensor right = right_; for (const auto i : c10::irange(dim)) { - auto sl = left.size(i)>1; - auto sr = right.size(i)>1; + auto sl = left.sym_size(i)!=1; + auto sr = right.sym_size(i)!=1; if (sum_dims[i]) { // first dimensions that will be summed over after multiplication if (sl && sr) { // dimensions nontrivially in both left and right must be of the same size - TORCH_CHECK(left.size(i)==right.size(i), "non-broadcast dimensions must match"); - sum_size *= left.size(i); + TORCH_CHECK(left.sym_size(i)==right.sym_size(i), "non-broadcast dimensions must match"); + sum_size *= left.sym_size(i); } else if (sl) { // if it is only in one of left and right, we can sum right away left = left.sum(i, true); } else if (sr) { right = right.sum(i, true); } - } else if (sl && sr) { // now deal with dimensions dimensions that will be in the output + } else if (sl && sr) { // now deal with dimensions that will be in the output // dimensions nontrivially in both left and right must be of the same size - TORCH_CHECK(left.size(i)==right.size(i), "non-broadcast dimensions must match"); + TORCH_CHECK(left.sym_size(i)==right.sym_size(i), "non-broadcast dimensions must match"); lro.push_back(i); - lro_size *= left.size(i); + lro_size *= left.sym_size(i); } else if (sl) { // keep track of dimensions appearing only once lo.push_back(i); - lo_size *= left.size(i); + lo_size *= left.sym_size(i); } else { ro.push_back(i); - ro_size *= right.size(i); + ro_size *= right.sym_size(i); } } // we now work with the following permutations / shapes. 
// the pipeline is permute inputs -> reshape inputs -> batch matrix mul -> reshape(view) output -> permute output - // output: "lro, lo, 1-for-summed-dims, ro" with orgiginal shape dimensions + // output: "lro, lo, 1-for-summed-dims, ro" with original shape dimensions // left: "lro, lo, summed" permuted with lpermutation and the three flattened // right: "lro, summed, ro" permuted with rpermutation and the three flattened // then the permuted output is a view of bmm(left, right) // finally, opermutation reverts the permutation to the original order of dimensions - std::vector out_size; - // NOLINTNEXTLINE(performance-inefficient-vector-operation) - for (auto& d : lro) out_size.push_back(left.size(d)); - for (auto& d : lo) out_size.push_back(left.size(d)); - for (auto& d : sum_dims_) { out_size.push_back(1); (void)(d); }; // avoid warining about not using d - for (auto& d : ro) out_size.push_back(right.size(d)); + auto out_num_dim = lro.size() + lo.size() + sum_dims_.size() + ro.size(); + std::vector out_size; + out_size.reserve(out_num_dim); + for (auto& d : lro) out_size.push_back(left.sym_size(d)); + for (auto& d : lo) out_size.push_back(left.sym_size(d)); + for (auto& d : sum_dims_) { out_size.push_back(1); (void)(d); }; // avoid warning about not using d + for (auto& d : ro) out_size.push_back(right.sym_size(d)); std::vector lpermutation(lro); lpermutation.insert(lpermutation.end(), lo.begin(), lo.end()); @@ -142,7 +162,7 @@ static Tensor sumproduct_pair(const Tensor& left_, const Tensor& right_, IntArra rpermutation.insert(rpermutation.end(), ro.begin(), ro.end()); rpermutation.insert(rpermutation.end(), lo.begin(), lo.end()); - std::vector opermutation(lro.size()+lo.size()+sum_dims_.size()+ro.size(), -1); + std::vector opermutation(out_num_dim, -1); { int64_t i = 0; @@ -161,16 +181,15 @@ static Tensor sumproduct_pair(const Tensor& left_, const Tensor& right_, IntArra } // now we can execute the operations above - left = left.permute(lpermutation).reshape({lro_size, lo_size, sum_size}); - right = right.permute(rpermutation).reshape({lro_size, sum_size, ro_size}); + left = left.permute(lpermutation).reshape_symint({lro_size, lo_size, sum_size}); + right = right.permute(rpermutation).reshape_symint({lro_size, sum_size, ro_size}); Tensor result = at::bmm(left, right); - result = result.view(out_size).permute(opermutation); + result = result.view_symint(out_size).permute(opermutation); // finally squeeze summed dimensions if desired if (! keepdim) { auto sizes = result.sizes().vec(); - // NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions) - for (int i = dim-1; i>=0; i--) { + for (auto i = dim-1; i>=0; i--) { if (sum_dims[i]) { sizes.erase(sizes.begin() + i); } @@ -180,47 +199,55 @@ static Tensor sumproduct_pair(const Tensor& left_, const Tensor& right_, IntArra return result; } -namespace { - -bool einsum_check_label(unsigned char label) { - return std::isalpha(label); -} - -uint8_t einsum_label_to_index(unsigned char label) { - constexpr uint8_t NUM_OF_LETTERS = 'z' - 'a' + 1; - return std::isupper(label) ? label - 'A' : NUM_OF_LETTERS + (label - 'a'); -} - -unsigned char einsum_index_to_label(uint8_t index) { - constexpr uint8_t NUM_OF_LETTERS = 'z' - 'a' + 1; - return index < NUM_OF_LETTERS ? index + 'A' : index - NUM_OF_LETTERS + 'a'; -} - -} // namespace - -// There are roughly three parts to compute einsum: +// There are roughly three parts to computing einsum: // 1. Parse equation to extract the labels for each input operand and output // 2. 
Unsqueeze missing dimensions from input operands and permute to align them // 3. Compute result by multiplying input operands and summing contraction -// dimensions We do the last part by reducing to bmm. -Tensor einsum(c10::string_view equation, TensorList operands) { +// dimensions. We do the last part by reducing to bmm. +// If a path is specified, we reduce in the order specified by the path, else we +// default to going left => right. The path is a list of indices processed the same +// way as opt-einsum: https://optimized-einsum.readthedocs.io/en/stable/path_finding.html#format-of-the-path +Tensor einsum(c10::string_view equation, TensorList operands, at::OptionalIntArrayRef path) { TORCH_CHECK(!operands.empty(), "einsum(): must provide at least one operand"); + const auto num_ops = operands.size(); + + if (path.has_value()) { + const auto path_size = num_ops == 1 ? 1 : (num_ops - 1) * 2; + TORCH_CHECK( + path->size() == path_size, + "einsum(): expected contraction path given in path parameter to have size ", + path_size, + " but got ", + path->size()); + } + + // Labels must be in range [A-Za-z] + constexpr uint8_t NUM_OF_LETTERS = 'z' - 'a' + 1; + constexpr uint8_t TOTAL_LABELS = NUM_OF_LETTERS * 2; // Code used to identify ELLIPSIS ("...") - constexpr uint8_t ELLIPSIS = 52; + constexpr uint8_t ELLIPSIS = TOTAL_LABELS; + + // Convert label in [A-Za-z] to subscript in [0, TOTAL_LABELS) + auto label_to_subscript = [=](unsigned char label) -> uint8_t { + return std::isupper(label) ? label - 'A' : label - 'a' + NUM_OF_LETTERS; + }; + + // Convert subscript in [0, TOTAL_LABELS) to label in [A-Za-z] + auto subscript_to_label = [=](uint8_t s) -> unsigned char { + return s < NUM_OF_LETTERS ? s + 'A' : s + 'a' - NUM_OF_LETTERS; + }; // Find arrow (->) to split equation into lhs and rhs const auto arrow_pos = equation.find("->"); const auto lhs = equation.substr(0, arrow_pos); - const auto num_ops = operands.size(); - // Convert labels for input operands into an index in [0, 52) and store // them in op_labels for each operand along with ELLIPSIS if present. 
std::vector> op_labels(num_ops); - bool found_ell = false; + bool ell_in_input = false; std::size_t curr_op = 0; - for (auto i = decltype(lhs.length()){0}; i < lhs.length(); ++i) { + for (std::size_t i = 0; i < lhs.length(); ++i) { const unsigned char label = lhs[i]; switch (label) { case ' ': @@ -230,7 +257,7 @@ Tensor einsum(c10::string_view equation, TensorList operands) { case '.': TORCH_CHECK( // Only one ellipsis per operand can be given - !found_ell, + !ell_in_input, "einsum(): found \'.\' for operand ", curr_op, " for which an ellipsis was already found"); @@ -241,7 +268,7 @@ Tensor einsum(c10::string_view equation, TensorList operands) { curr_op, " that is not part of any ellipsis"); op_labels[curr_op].push_back(ELLIPSIS); - found_ell = true; + ell_in_input = true; break; case ',': @@ -250,17 +277,17 @@ Tensor einsum(c10::string_view equation, TensorList operands) { TORCH_CHECK( curr_op < num_ops, "einsum(): fewer operands were provided than specified in the equation"); - found_ell = false; + ell_in_input = false; break; default: // Parse label TORCH_CHECK( - einsum_check_label(label), + std::isalpha(label), "einsum(): invalid subscript given at index ", i, " in the equation string, subscripts must be in [a-zA-Z]"); - op_labels[curr_op].push_back(einsum_label_to_index(label)); + op_labels[curr_op].push_back(label_to_subscript(label)); } } @@ -268,8 +295,6 @@ Tensor einsum(c10::string_view equation, TensorList operands) { curr_op == num_ops - 1, "einsum(): more operands were provided than specified in the equation"); - // Labels must be within [a-zA-Z]. - constexpr uint8_t TOTAL_LABELS = 52; std::vector label_count(TOTAL_LABELS, 0); // The maximum number of dimensions covered by any ellipsis, needed when @@ -318,12 +343,13 @@ Tensor einsum(c10::string_view equation, TensorList operands) { // Start index of ellipsis dimensions in the permuted shape int64_t ell_index = 0; - found_ell = false; + bool ell_in_output = false; if (arrow_pos == std::string::npos) { // Implicit output is ellipsis (...) + labels seen only once perm_index = ell_num_dim; - found_ell = true; + // ell_in_output is used to stop us from reducing ellipses dims later + ell_in_output = true; for (const auto label : c10::irange(TOTAL_LABELS)) { if (label_count[label] == 1) { label_perm_index[label] = perm_index++; @@ -332,7 +358,7 @@ Tensor einsum(c10::string_view equation, TensorList operands) { } else { // Parse explicit output const auto rhs = equation.substr(arrow_pos + 2); - for (auto i = decltype(rhs.length()){0}; i < rhs.length(); ++i) { + for (std::size_t i = 0; i < rhs.length(); ++i) { const unsigned char label = rhs[i]; switch (label) { case ' ': @@ -342,7 +368,7 @@ Tensor einsum(c10::string_view equation, TensorList operands) { case '.': TORCH_CHECK( // There can only be one ellipsis in the output - !found_ell, + !ell_in_output, "einsum(): found \'.\' for output but an ellipsis (...) 
was already found"); TORCH_CHECK( // Ensure ellipsis is correct @@ -350,16 +376,16 @@ Tensor einsum(c10::string_view equation, TensorList operands) { "einsum(): found \'.\' for output that is not part of any ellipsis (...)"); ell_index = perm_index; perm_index += ell_num_dim; - found_ell = true; + ell_in_output = true; break; default: TORCH_CHECK( - einsum_check_label(label), + std::isalpha(label), "einsum(): invalid subscript given at index ", - lhs.size() + 2 + i, + lhs.size() + 2 + i, " in the equation string, subscripts must be in [a-zA-Z]"); - const auto index = einsum_label_to_index(label); + const auto index = label_to_subscript(label); TORCH_CHECK( // Ensure label appeared at least once for some input operand and at // most once for the output @@ -374,11 +400,11 @@ Tensor einsum(c10::string_view equation, TensorList operands) { } } - // Save output size before adding contraction dims (dims to sum out) - const int64_t out_size = perm_index; + // Save number of dimensions in output before adding contraction dims (dims to sum out) + const int64_t out_num_dim = perm_index; // If ellipsis is not part of the output, add to contraction dimensions - if (!found_ell) { + if (!ell_in_output) { ell_index = perm_index; perm_index += ell_num_dim; } @@ -390,144 +416,171 @@ Tensor einsum(c10::string_view equation, TensorList operands) { } } - // Here we unsqueeze missing dimensions to make all operands have the same - // number of dimensions. We take diagonals for repeated labels within the - // same operand. Finally we permute the operands to align dimensions as - // per the perm_out_index we computed above. - std::vector permuted_operands; - for (const auto i: c10::irange(num_ops)) { - std::vector perm_shape(perm_index, -1); - std::vector label_dim(TOTAL_LABELS, -1); - Tensor operand = operands[i]; - const auto labels = op_labels[i]; - const auto original_sizes = operand.sizes(); - - int64_t j = 0; - for (const auto& label : labels) { - if (label == ELLIPSIS) { - // Add missing dimensions covered by the ellipsis - const auto num_missing_dim = - ell_num_dim - (original_sizes.size() - labels.size() + 1); - for (const auto k : c10::irange(num_missing_dim)) { - (void)k; //Suppress unused warning - operand = operand.unsqueeze(j); + // Next: we check the sizes, take diagonals for repeated labels, unsqueeze + // missing dimensions so all operands have the same dimensions and permute + // the operands to align the dimensions following the indices computed above. + // We also count how many operands have dimension with size != 1 for each + // label used to identify which dimensions can be contracted. 
+ std::vector label_size(TOTAL_LABELS, 1); + std::vector ell_sizes(ell_num_dim, 1); + std::vector dim_counts(perm_index, 0); + std::deque ops; + for (const auto i : irange(num_ops)) { + auto op = operands[i]; + std::vector permutation(perm_index, -1); + std::int64_t dim = 0; + for (const auto s : op_labels[i]) { + if (s == ELLIPSIS) { + // Iterate over each dimension covered by ellipsis + const auto ndim = operands[i].ndimension() - (static_cast(op_labels[i].size()) - 1); + for (auto j = ell_num_dim - ndim; j < ell_num_dim; ++j) { + if (op.sym_size(dim) != 1) { + // Update ellipsis size + TORCH_CHECK( + ell_sizes[j] == 1 || ell_sizes[j] == op.sym_size(dim), + "einsum(): dimension ", + dim, + " covered by ellipsis in operand ", + i, + "has size ", + op.size(dim), + " which does not broadcast with previously seen ellipsis with size ", + ell_sizes[j], + " for the respective dimension"); + ell_sizes[j] = op.sym_size(dim); + ++dim_counts[ell_index + j]; + } + permutation[ell_index + j] = dim++; } - for (const auto k : c10::irange(ell_num_dim)) { - perm_shape[ell_index + k] = j++; + } else if (permutation[label_perm_index[s]] == -1) { + if (op.sym_size(dim) != 1) { + // Update subscript + TORCH_CHECK( + label_size[s] == 1 || label_size[s] == op.sym_size(dim), + "einsum(): subscript ", + subscript_to_label(s), + " has size ", + op.sym_size(dim), + " for operand ", + i, + " which does not broadcast with previously seen size ", + label_size[s]); + label_size[s] = op.sym_size(dim); + ++dim_counts[label_perm_index[s]]; } - } else if (label_dim[label] != -1) { + permutation[label_perm_index[s]] = dim++; + } else { // Repeated label, take diagonal - const auto dim = label_dim[label]; + const auto prev_dim = permutation[label_perm_index[s]]; TORCH_CHECK( - operand.size(j) == operand.size(dim), + op.sym_size(dim) == op.sym_size(prev_dim), "einsum(): subscript ", - einsum_index_to_label(label), + subscript_to_label(s), " is repeated for operand ", i, " but the sizes don't match, ", - operand.size(j), + op.sym_size(dim), " != ", - operand.size(dim)); - operand = operand.diagonal(0, dim, j).movedim(-1, dim); - } else { - // Lookup output index for label - label_dim[label] = j; - perm_shape[label_perm_index[label]] = j++; + op.sym_size(prev_dim)); + op = op.diagonal(0, prev_dim, dim).movedim(-1, prev_dim); } } // Add dimensions for missing labels - for (int64_t& index : perm_shape) { - if (index == -1) { - operand = operand.unsqueeze(-1); - index = j++; + for (auto& val : permutation) { + if (val == -1) { + op = op.unsqueeze(dim); + val = dim++; } } - - permuted_operands.push_back(operand.permute(perm_shape)); + ops.emplace_back(op.permute(permutation)); } - // Check if operands broadcast and keep track of last operand with - // dimension size != 1 for optimizing reductions - std::vector dim_last_op(perm_index, 0); - bool has_zero_size_dim = false; - for (const auto dim : c10::irange(perm_index)) { - auto broadcast_size = permuted_operands[0].size(dim); - for (const auto i: c10::irange(1, num_ops)) { - const auto dim_size = permuted_operands[i].size(dim); - if (broadcast_size != dim_size && broadcast_size != 1 && dim_size != 1) { - std::ostringstream msg; - msg << "einsum(): operands do not broadcast with remapped shapes [original->remapped]:"; - for (const auto j: c10::irange(num_ops)) { - msg << " " << operands[j].sizes() << "->" - << permuted_operands[j].sizes(); - } - TORCH_CHECK(false, msg.str()); - } - if (dim_size != 1) { - broadcast_size = dim_size; - dim_last_op[dim] = i; - } - } - has_zero_size_dim 
|= broadcast_size == 0; - } + const auto contract_path = path.value_or(std::vector{}); + auto it = contract_path.begin(); - // Compute result - Tensor result = permuted_operands[0]; - - // Fast path for when an operand has zero sized dim - if (has_zero_size_dim) { - std::vector out_shape(out_size); - for (const auto i : c10::irange(out_size)) { - out_shape[i] = permuted_operands[dim_last_op[i]].size(i); - } - return at::zeros(out_shape, result.options()); - } + // Contract + while (ops.size() > 1) { + int64_t i = 0; + int64_t j = 1; - // Sum out or squeeze dimensions that are size 1 for all later operands - int64_t dim = out_size; - for (int64_t i = dim; i < perm_index; ++i, ++dim) { - if (dim_last_op[i] == 0) { - if (result.size(dim) == 1) { - result = result.squeeze(dim--); - } else { - result = result.sum(dim--); + if (path.has_value()) { + i = *it++; + j = *it++; + if (j < i) { + std::swap(i, j); } + + TORCH_CHECK( + i != j && i >= 0 && j < static_cast(ops.size()), + "einsum(): invalid contraction (", + i, + ", ", + j, + i == j ? ") cannot contract an operand with itself" + : ") operand index is out of bounds"); } - } - for (const auto i: c10::irange(1, num_ops)) { - Tensor operand = permuted_operands[i]; - std::vector sum_dims; + auto a = ops[i]; + auto b = ops[j]; + ops.erase(ops.begin() + j); + ops.erase(ops.begin() + i); - // Sum out or squeeze dimensions that are size 1 for all later operands - dim = out_size; - for (int64_t j = dim; j < perm_index; ++j, ++dim) { - if (dim_last_op[j] < i) { - operand = operand.squeeze(dim); - --dim; - } else if (dim_last_op[j] == i) { - if (result.size(dim) == 1) { - operand = operand.sum(dim); - result = result.squeeze(dim); - --dim; - } else { + // Collect dimensions that can be summed now + std::vector sum_dims; + SmallVector a_dims_to_sum; + SmallVector b_dims_to_sum; + for (auto dim = out_num_dim; dim < perm_index; ++dim) { + if (a.sym_size(dim) != 1 && b.sym_size(dim) != 1) { + if (--dim_counts[dim] == 1) { sum_dims.push_back(dim); + dim_counts[dim] = 0; + } + } else if (dim_counts[dim] == 1) { + if (a.sym_size(dim) != 1) { + a_dims_to_sum.push_back(dim); + dim_counts[dim] = 0; + } else if (b.sym_size(dim) != 1) { + b_dims_to_sum.push_back(dim); + dim_counts[dim] = 0; } } } - // Multiply tensors and sum out dimensions in sum_dims - if (sum_dims.empty()) { - result = result.mul(operand); - } else if (sum_dims.size() == result.sizes().size()) { - result = result.flatten().dot(operand.flatten()); + // Sum multiple dims at a time to minimize the number of kernel calls to sum + if (!a_dims_to_sum.empty()) { + a = a.sum(a_dims_to_sum, true); + } + if (!b_dims_to_sum.empty()) { + b = b.sum(b_dims_to_sum, true); + } + + if (path.has_value()) { + ops.emplace_back(sumproduct_pair(a, b, sum_dims, true)); } else { - result = sumproduct_pair(result, operand, sum_dims, false); + ops.emplace_front(sumproduct_pair(a, b, sum_dims, true)); } } - return result; + // Sum out contraction dims + if (perm_index - out_num_dim > 0) { + // if there were ops to contract, we would have already done so + // in the previous loop and all the dims to sum are now 1 + // NB: use view instead of squeeze (or sum) for faster (mps) performance + if (num_ops > 1) { + auto sizes = ops[0].sym_sizes().vec(); + for (auto dim = perm_index - 1; dim >= out_num_dim; --dim) { + sizes.erase(sizes.begin() + dim); + } + return ops[0].view_symint(sizes); + } else { + std::vector sum_dims(perm_index - out_num_dim); + std::iota(sum_dims.begin(), sum_dims.end(), out_num_dim); + return 
ops[0].sum(sum_dims); + } + } + + return ops[0]; } // _trilinear computes a trilinear einstein sum with an unrolled dimension diff --git a/aten/src/ATen/native/LinearAlgebra.cpp b/aten/src/ATen/native/LinearAlgebra.cpp index 529c6b5ef9ca..7e47170cd72e 100644 --- a/aten/src/ATen/native/LinearAlgebra.cpp +++ b/aten/src/ATen/native/LinearAlgebra.cpp @@ -1,27 +1,132 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include #include #include #include -#include #include #include #include #include #include -#include -#include #include +#include +#include +#include #include -#include #include #include #include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include #include @@ -48,6 +153,8 @@ namespace detail { namespace meta { #define ADDMM_META() \ + TORCH_CHECK(self.scalar_type() == mat2.scalar_type(), "self and mat2 must have the same dtype"); \ + TORCH_CHECK(mat1.scalar_type() == mat2.scalar_type(), "mat1 and mat2 must have the same dtype"); \ TORCH_CHECK(mat1.dim() == 2, "mat1 must be a matrix, got ", mat1.dim(), "-D tensor"); \ TORCH_CHECK(mat2.dim() == 2, "mat2 must be a matrix, got ", mat2.dim(), "-D tensor"); \ TORCH_CHECK( \ @@ -55,7 +162,7 @@ namespace meta { mat1.sizes()[0], "x", mat1.sizes()[1], " and ", mat2.sizes()[0], "x", mat2.sizes()[1], ")"); \ \ auto names = at::namedinference::propagate_names_for_addmm(mat1, mat2, self); \ - set_output_raw_strided(0, {mat1.sizes()[0], mat2.sizes()[1]}, {}, self.options(), names); + set_output_raw_strided(0, {mat1.sizes()[0], mat2.sizes()[1]}, {}, mat1.options(), names); TORCH_META_FUNC(addmm)(const Tensor& self, const Tensor& mat1, const Tensor& mat2, const Scalar& beta, const Scalar& alpha) { ADDMM_META(); @@ -711,24 +818,6 @@ Tensor linalg_matrix_rank(const Tensor& input, double tol, bool hermitian) { return matrix_rank_impl(input, atol_tensor, rtol_tensor, hermitian, result); } -Tensor matrix_rank(const Tensor& self, double tol, bool symmetric) { - TORCH_WARN_ONCE( - "torch.matrix_rank is deprecated in favor of torch.linalg.matrix_rank", - "and will be removed in a future PyTorch release. The parameter 'symmetric' was ", - "renamed in torch.linalg.matrix_rank to 'hermitian'." - ); - return at::linalg_matrix_rank(self, tol, symmetric); -} - -Tensor matrix_rank(const Tensor& self, bool symmetric) { - TORCH_WARN_ONCE( - "torch.matrix_rank is deprecated in favor of torch.linalg.matrix_rank", - "and will be removed in a future PyTorch release. 
The parameter 'symmetric' was ", - "renamed in torch.linalg.matrix_rank to 'hermitian'." - ); - return at::linalg_matrix_rank(self, 0.0, c10::nullopt, symmetric); -} - // multi_dot helper functions namespace { @@ -788,7 +877,7 @@ std::vector> matrix_chain_order(TensorList tensors) { /** * @brief Recursively multiplies the tensors i...j using the given order * - * @param tensors matrices to multiply togther + * @param tensors matrices to multiply together * @param order optimal chain multiplication order from #matrix_chain_order * @param i index of first tensor to be multiplied * @param j index of last tensor to be multiplied @@ -2285,8 +2374,7 @@ void compute_T18_scale_square( for (const auto i : c10::irange(mexp_scaled.size(0))) { auto s_val = s_cpu.select(0, i).template item(); auto mexp = mexp_scaled.select(0, i); - for (const auto p : c10::irange(s_val)) { - (void)p; //Suppress unused variable warning + for (const auto p C10_UNUSED : c10::irange(s_val)) { mexp = at::matmul(mexp, mexp); } mexp_out.select(0, i).copy_(mexp); @@ -2682,7 +2770,7 @@ Tensor& linalg_norm_out(const Tensor& X, c10::string_view ord, OptionalIntArrayR //////////////////////////////////////////////////////////////////////////////// // Frobenius Norm // -// Just used in linalg.norm. It should not be removed. // +// Just used in torch..norm. It should not be removed. // //////////////////////////////////////////////////////////////////////////////// Tensor frobenius_norm(const Tensor& self) { @@ -2728,7 +2816,7 @@ Tensor &frobenius_norm_out(const Tensor& self, //////////////////////////////////////////////////////////////////////////////// // Nuclear Norm // -// Just used in linalg.norm. It should not be removed. // +// Just used in torch.norm. It should not be removed. // //////////////////////////////////////////////////////////////////////////////// Tensor nuclear_norm(const Tensor& self, bool keepdim) { diff --git a/aten/src/ATen/native/LinearAlgebraUtils.h b/aten/src/ATen/native/LinearAlgebraUtils.h index cbeb49fe81c6..351bc33f6590 100644 --- a/aten/src/ATen/native/LinearAlgebraUtils.h +++ b/aten/src/ATen/native/LinearAlgebraUtils.h @@ -241,8 +241,7 @@ void batch_iterator_with_broadcasting(const Tensor& a, const Tensor& b, const fu auto* b_batch_idx_ptr = data[0]; auto* a_batch_idx_ptr = data[1]; - for (const auto elem : c10::irange(nelems)) { - (void)elem; //Suppress unused variable warning + for (const auto elem C10_UNUSED : c10::irange(nelems)) { auto b_curr_linear_batch_idx = *reinterpret_cast(b_batch_idx_ptr); auto a_curr_linear_batch_idx = *reinterpret_cast(a_batch_idx_ptr); diff --git a/aten/src/ATen/native/Loss.cpp b/aten/src/ATen/native/Loss.cpp index b5b7acb8ede2..78b7d7023620 100644 --- a/aten/src/ATen/native/Loss.cpp +++ b/aten/src/ATen/native/Loss.cpp @@ -1,15 +1,62 @@ -#include -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include +#include +#include +#include #include #include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + constexpr float EPSILON = 1e-12; namespace { @@ -157,15 +204,17 @@ Tensor 
triplet_margin_loss(const Tensor& anchor, const Tensor& positive, const T auto n_dim = negative.dim(); TORCH_CHECK( a_dim == p_dim && p_dim == n_dim, - "All inputs should have same dimension but got ", - a_dim, - "D, ", - p_dim, - "D and ", - n_dim, - "D inputs.") + "The anchor, positive, and negative tensors are expected to have " + "the same number of dimensions, but got: anchor ", a_dim, "D, " + "positive ", p_dim, "D, and negative ", n_dim, "D inputs") + auto dist_pos = at::pairwise_distance(anchor, positive, p, eps); auto dist_neg = at::pairwise_distance(anchor, negative, p, eps); + // The distance swap is described in the paper "Learning shallow + // convolutional feature descriptors with triplet losses" by V. Balntas, E. + // Riba et al. If True, and if the positive example is closer to the + // negative example than the anchor is, swaps the positive example and the + // anchor in the loss computation. if (swap) { auto dist_swap = at::pairwise_distance(positive, negative, p, eps); dist_neg = at::min(dist_neg, dist_swap); @@ -189,9 +238,22 @@ Tensor margin_ranking_loss(const Tensor& input1, const Tensor& input2, const Ten Tensor kl_div(const Tensor& input, const Tensor& target, int64_t reduction, bool log_target) { TORCH_CHECK(!input.is_complex() && !target.is_complex(), - "kl_div: Complex inputs not supported.") - auto output = log_target ? at::exp(target) * (target - input) - : target * (at::log(target) - input); + "kl_div: Complex inputs not supported."); + TORCH_CHECK(!at::isIntegralType(input.scalar_type(), /*include_bool*/true) && + !at::isIntegralType(target.scalar_type(), /*include_bool*/true), + "kl_div: Integral inputs not supported."); + Tensor output; + if (log_target) { + output = at::exp(target) * (target - input); + } else { + if (input.is_mps() || target.is_mps()) { + // MPS fallback, as MPS does not currently implement xlogy. + // MPS will give the wrong results at `target[i] = 0` + output = target * (at::log(target) - input); + } else { + output = at::xlogy(target, target) - target * input; + } + } return apply_loss_reduction(output, reduction); } diff --git a/aten/src/ATen/native/LossCTC.cpp b/aten/src/ATen/native/LossCTC.cpp index 344c7269b0f2..dcfad968cad7 100644 --- a/aten/src/ATen/native/LossCTC.cpp +++ b/aten/src/ATen/native/LossCTC.cpp @@ -5,15 +5,36 @@ // 1. Graves et al: http://www.cs.toronto.edu/~graves/icml_2006.pdf // We use the equations from above link, but note that [1] has 1-based indexing and we (of course) use 0-based. 
// Graves et al call the probabilities y, we use log_probs (also calling them inputs) +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS -#include +#include #include #include -#include +#include +#include #include #include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif -#include #include namespace at { @@ -355,6 +376,18 @@ std::tuple ctc_loss_cpu(const Tensor& log_probs, const Tensor& t }); } +std::tuple ctc_loss_tensor(const Tensor& log_probs, const Tensor& targets, const Tensor& input_lengths, const Tensor& target_lengths, int64_t BLANK, bool zero_infinity) { + TORCH_CHECK(isIntegralType(input_lengths.scalar_type(), /*includeBool=*/false), "input_lengths must be integral"); + TORCH_CHECK(isIntegralType(target_lengths.scalar_type(), /*includeBool=*/false), "target_lengths must be integral"); + + Tensor ilc = input_lengths.to(Device(at::kCPU), at::kLong).contiguous(); + Tensor tlc = target_lengths.to(Device(at::kCPU), at::kLong).contiguous(); + IntArrayRef il(ilc.data_ptr(), ilc.numel()); + IntArrayRef tl(tlc.data_ptr(), tlc.numel()); + + return at::_ctc_loss(log_probs, targets, il, tl, BLANK, zero_infinity); +} + Tensor ctc_loss_backward_cpu(const Tensor& grad, const Tensor& log_probs, const Tensor& targets, IntArrayRef input_lengths, IntArrayRef target_lengths, const Tensor& neg_log_likelihood, const Tensor& log_alpha, int64_t BLANK, bool zero_infinity) { return AT_DISPATCH_FLOATING_TYPES(log_probs.scalar_type(), "ctc_loss_backward_cpu", [&] { @@ -366,10 +399,47 @@ Tensor ctc_loss_backward_cpu(const Tensor& grad, const Tensor& log_probs, const }); } +Tensor ctc_loss_backward_tensor( + const Tensor& grad, + const Tensor& log_probs, + const Tensor& targets, + const Tensor& input_lengths, + const Tensor& target_lengths, + const Tensor& neg_log_likelihood, + const Tensor& log_alpha, + int64_t BLANK, + bool zero_infinity) { + TORCH_CHECK( + isIntegralType(input_lengths.scalar_type(), /*includeBool=*/false), + "input_lengths must be integral"); + TORCH_CHECK(isIntegralType(target_lengths.scalar_type(), /*includeBool=*/false), "target_lengths must be integral"); + + Tensor ilc = input_lengths.to(Device(at::kCPU), at::kLong).contiguous(); + Tensor tlc = target_lengths.to(Device(at::kCPU), at::kLong).contiguous(); + IntArrayRef il(ilc.data_ptr(), ilc.numel()); + IntArrayRef tl(tlc.data_ptr(), tlc.numel()); + return at::_ctc_loss_backward(grad, log_probs, targets, il, tl, neg_log_likelihood, log_alpha, BLANK, zero_infinity); +} + +namespace { + +Tensor get_clamped_target_length( + IntArrayRef target_lengths, + const TensorOptions& options) { + return at::tensor(target_lengths, options).clamp_min(1); +} + +Tensor get_clamped_target_length( + Tensor target_lengths, + const TensorOptions& options) { + return target_lengths.clamp_min(1); +} + // this wrapper function dispatches to the native and cudnn implementations and hides the alpha/grad from the user (by just returning the loss) // the gradient is implemented for _cudnn_ctc_loss (just in derivatives.yaml) and _ctc_loss and this function has automatic gradients // it also handles the reduction if desired -Tensor ctc_loss(const Tensor& log_probs_, const Tensor& targets, IntArrayRef input_lengths, IntArrayRef target_lengths, int64_t BLANK, int64_t reduction, bool zero_infinity) { +template +Tensor ctc_loss_impl(const Tensor& log_probs_, const Tensor& targets, LengthsType input_lengths, 
LengthsType target_lengths, int64_t BLANK, int64_t reduction, bool zero_infinity) { auto is_batched = log_probs_.dim() == 3; Tensor log_probs = is_batched ? log_probs_ : log_probs_.unsqueeze(1); bool use_cudnn = @@ -397,8 +467,7 @@ Tensor ctc_loss(const Tensor& log_probs_, const Tensor& targets, IntArrayRef inp } } if (reduction == at::Reduction::Mean) { - auto target_lengths_t = - at::tensor(target_lengths, res.options()).clamp_min(1); + auto target_lengths_t = get_clamped_target_length(target_lengths, res.options()); return (res / target_lengths_t).mean(); } else if (reduction == at::Reduction::Sum) { return res.sum(); @@ -406,8 +475,22 @@ Tensor ctc_loss(const Tensor& log_probs_, const Tensor& targets, IntArrayRef inp return is_batched ? res : res.squeeze(0); } +} // namespace + +Tensor ctc_loss(const Tensor& log_probs_, const Tensor& targets, IntArrayRef input_lengths, IntArrayRef target_lengths, int64_t BLANK, int64_t reduction, bool zero_infinity) { + return ctc_loss_impl(log_probs_, targets, input_lengths, target_lengths, BLANK, reduction, zero_infinity); +} + // Convenience function accepting Tensors Tensor ctc_loss(const Tensor& log_probs, const Tensor& targets, const Tensor& input_lengths, const Tensor& target_lengths, int64_t BLANK, int64_t reduction, bool zero_infinity) { + if (at::areAnyTensorSubclassLike( + {log_probs, targets, input_lengths, target_lengths})) { + // Composite Compliant path for TensorSubclasses + return ctc_loss_impl(log_probs, targets, input_lengths, target_lengths, BLANK, reduction, zero_infinity); + } + + // Fast path (which accesses data_ptr) and less operator dispatches for + // regular tensors TORCH_CHECK(isIntegralType(input_lengths.scalar_type(), /*includeBool=*/false), "input_lengths must be integral"); TORCH_CHECK(isIntegralType(target_lengths.scalar_type(), /*includeBool=*/false), "target_lengths must be integral"); diff --git a/aten/src/ATen/native/LossMulti.h b/aten/src/ATen/native/LossMulti.h index 54736bcc123b..148615e7e14f 100644 --- a/aten/src/ATen/native/LossMulti.h +++ b/aten/src/ATen/native/LossMulti.h @@ -1,8 +1,8 @@ -#include -#include -#include - #pragma once +#include +#include +#include +#include namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/LossMultiLabelMargin.cpp b/aten/src/ATen/native/LossMultiLabelMargin.cpp index f59de5c8817a..26d7a748df8d 100644 --- a/aten/src/ATen/native/LossMultiLabelMargin.cpp +++ b/aten/src/ATen/native/LossMultiLabelMargin.cpp @@ -1,10 +1,23 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/LossMultiMargin.cpp b/aten/src/ATen/native/LossMultiMargin.cpp index c7ab53f1d211..110520cf8f95 100644 --- a/aten/src/ATen/native/LossMultiMargin.cpp +++ b/aten/src/ATen/native/LossMultiMargin.cpp @@ -1,9 +1,19 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/LossNLL.cpp b/aten/src/ATen/native/LossNLL.cpp index 1eb630538b80..28fc60508ab1 100644 --- a/aten/src/ATen/native/LossNLL.cpp +++ b/aten/src/ATen/native/LossNLL.cpp @@ -1,13 +1,32 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS 
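The `ctc_loss_impl` template above lets the `IntArrayRef` and `Tensor` length overloads share one body; the part worth calling out is the `Reduction::Mean` path, which divides each sample's loss by its target length (clamped to at least 1 via `get_clamped_target_length`) before averaging over the batch. A small sketch of just that arithmetic, with made-up values:

```cpp
// Mean reduction as in ctc_loss_impl: per-sample losses are normalized by
// clamp_min(target_length, 1) and then averaged. Values are illustrative only.
#include <torch/torch.h>
#include <iostream>

int main() {
  auto per_sample_loss = torch::tensor({12.0, 30.0, 0.0});
  auto target_lengths  = torch::tensor({4, 10, 0});          // a raw 0 would divide by zero

  auto clamped   = target_lengths.clamp_min(1);              // get_clamped_target_length(...)
  auto mean_loss = (per_sample_loss / clamped).mean();       // (3 + 3 + 0) / 3 = 2
  std::cout << mean_loss.item<double>() << "\n";
}
```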
+#include #include #include +#include #include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include @@ -530,11 +549,11 @@ Tensor cross_entropy_loss_label_smoothing( const Tensor& target, const Tensor& weight, int64_t reduction, - int64_t ignore_index, + c10::SymInt ignore_index, double label_smoothing) { auto class_dim = self.dim() == 1 ? 0 : 1; auto input = at::log_softmax(self, class_dim, self.scalar_type()); - auto nllloss = at::nll_loss_nd(input, target, weight, reduction, ignore_index); + auto nllloss = at::nll_loss_nd_symint(input, target, weight, reduction, ignore_index); auto n_classes = input.size(class_dim); @@ -577,15 +596,15 @@ Tensor cross_entropy_loss_label_smoothing( return (1 - label_smoothing) * nllloss + ret * (label_smoothing / n_classes); } -Tensor cross_entropy_loss( +Tensor cross_entropy_loss_symint( const Tensor& self, const Tensor& target, const c10::optional& weight, int64_t reduction, - int64_t ignore_index, + c10::SymInt ignore_index, double label_smoothing) { Tensor ret; - if (self.sizes() == target.sizes()) { + if (self.sym_sizes() == target.sym_sizes()) { // Assume soft targets when input and target shapes are the same TORCH_CHECK(at::isFloatingType(target.scalar_type()), "Expected floating point type for target with class probabilities, got ", target.scalar_type()); @@ -604,7 +623,7 @@ Tensor cross_entropy_loss( ret = cross_entropy_loss_label_smoothing(self, target, weight_, reduction, ignore_index, label_smoothing); } else { auto class_dim = self.dim() == 1 ? 0 : 1; - ret = at::nll_loss_nd( + ret = at::nll_loss_nd_symint( at::log_softmax(self, class_dim, self.scalar_type()), target, weight, @@ -623,32 +642,41 @@ Tensor & nll_loss_out(const Tensor & self, const Tensor & target, const c10::opt return std::get<0>(at::nll_loss_forward_out(output, total_weight, self, target, weight, reduction, ignore_index)); } +Tensor nll_loss_symint(const Tensor & self, const Tensor & target, const c10::optional& weight_opt, int64_t reduction, c10::SymInt ignore_index) { + // See [Note: hacky wrapper removal for optional tensor] + c10::MaybeOwned weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt); + const Tensor& weight = *weight_maybe_owned; + + return std::get<0>(at::nll_loss_forward_symint(self, target, weight, reduction, ignore_index)); +} + +// Duplicate of above code for non-symbolic ints. Kept for BC purposes and to minimize breakages. 
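For reference, the blend at the end of `cross_entropy_loss_label_smoothing` amounts to `(1 - eps) * NLL(target class) + eps * mean_over_classes(-log p)` when no class weight or ignore_index is involved. A hedged sketch restating that arithmetic on a random batch (mean reduction assumed; this does not reproduce the full op):

```cpp
// Label-smoothing blend: (1 - eps) * nll + eps * mean(-log p) over classes.
// Shapes and values are made up for illustration (mean reduction, no class weights).
#include <torch/torch.h>
#include <iostream>

int main() {
  double eps = 0.1;
  auto logits = torch::randn({8, 5});                           // batch = 8, classes = 5
  auto target = torch::randint(0, 5, {8}, torch::kLong);

  auto logp   = torch::log_softmax(logits, /*dim=*/1);
  auto nll    = torch::nll_loss(logp, target);                  // -log p[target], mean over batch
  auto smooth = (-logp).mean();                                 // mean of -log p over batch and classes

  auto blended = (1 - eps) * nll + eps * smooth;
  std::cout << blended.item<float>() << "\n";
}
```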
Tensor nll_loss(const Tensor & self, const Tensor & target, const c10::optional& weight_opt, int64_t reduction, int64_t ignore_index) { // See [Note: hacky wrapper removal for optional tensor] c10::MaybeOwned weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt); const Tensor& weight = *weight_maybe_owned; - return std::get<0>(at::nll_loss_forward(self, target, weight, reduction, ignore_index)); + return std::get<0>(at::nll_loss_forward_symint(self, target, weight, reduction, ignore_index)); } -Tensor nll_loss_nd( +Tensor nll_loss_nd_symint( const Tensor& self, const Tensor& target, const c10::optional& weight, int64_t reduction, - int64_t ignore_index) { + c10::SymInt ignore_index) { if (self.dim() < 1) { TORCH_CHECK_VALUE( false, "Expected 1 or more dimensions (got ", self.dim(), ")"); } - if (self.dim() != 1 && self.sizes()[0] != target.sizes()[0]) { + if (self.dim() != 1 && self.sym_sizes()[0] != target.sym_sizes()[0]) { TORCH_CHECK_VALUE( false, "Expected input batch_size (", - self.sizes()[0], + self.sym_sizes()[0], ") to match target batch_size (", - target.sizes()[0], + target.sym_sizes()[0], ")."); } @@ -656,42 +684,42 @@ Tensor nll_loss_nd( Tensor input_ = self; Tensor target_ = target; if (input_.dim() == 1 || input_.dim() == 2) { - ret = at::nll_loss(input_, target_, weight, reduction, ignore_index); + ret = at::nll_loss_symint(input_, target_, weight, reduction, ignore_index); } else if (input_.dim() == 4) { - ret = at::nll_loss2d(input_, target_, weight, reduction, ignore_index); + ret = at::nll_loss2d_symint(input_, target_, weight, reduction, ignore_index); } else { // dim == 3 or dim > 4 - auto n = input_.sizes()[0]; - auto c = input_.sizes()[1]; - auto out_size = input_.sizes().slice(2).vec(); + auto n = input_.sym_sizes()[0]; + auto c = input_.sym_sizes()[1]; + auto out_size = input_.sym_sizes().slice(2).vec(); out_size.insert(out_size.begin(), n); - if (target_.sizes().slice(1) != input_.sizes().slice(2)) { + if (target_.sym_sizes().slice(1) != input_.sym_sizes().slice(2)) { TORCH_CHECK( false, "Expected target size ", - IntArrayRef(out_size), + SymIntArrayRef(out_size), ", got ", - target_.sizes()); + target_.sym_sizes()); } input_ = input_.contiguous(); target_ = target_.contiguous(); // support empty batches, see #15870 if (input_.numel() > 0) { - input_ = input_.view({n, c, 1, -1}); + input_ = input_.view_symint({n, c, 1, -1}); } else { - input_ = input_.view({n, c, 0, 0}); + input_ = input_.view_symint({n, c, 0, 0}); } if (target_.numel() > 0) { - target_ = target_.view({n, 1, -1}); + target_ = target_.view_symint({n, 1, -1}); } else { - target_ = target_.view({n, 0, 0}); + target_ = target_.view_symint({n, 0, 0}); } if (reduction != Reduction::None) { - ret = at::nll_loss2d(input_, target_, weight, reduction, ignore_index); + ret = at::nll_loss2d_symint(input_, target_, weight, reduction, ignore_index); } else { auto out = - at::nll_loss2d(input_, target_, weight, reduction, ignore_index); - ret = out.view(out_size); + at::nll_loss2d_symint(input_, target_, weight, reduction, ignore_index); + ret = out.view_symint(out_size); } } return ret; diff --git a/aten/src/ATen/native/LossNLL2d.cpp b/aten/src/ATen/native/LossNLL2d.cpp index d7ebf65231f1..aee22ce3edeb 100644 --- a/aten/src/ATen/native/LossNLL2d.cpp +++ b/aten/src/ATen/native/LossNLL2d.cpp @@ -1,12 +1,23 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else 
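The dim == 3 / dim > 4 branch above reuses `nll_loss2d` by flattening all trailing spatial dimensions into one: `(N, C, d1, d2, ...)` becomes `(N, C, 1, prod(d_i))` and the target becomes `(N, 1, prod(d_i))`. A sketch of that reshape on a made-up 5-D input, cross-checked against an equivalent 2-D formulation:

```cpp
// nll_loss_nd's reshape for higher-dimensional inputs, illustrated on a 5-D tensor.
#include <torch/torch.h>
#include <iostream>

int main() {
  auto input  = torch::log_softmax(torch::randn({2, 3, 4, 5, 6}), /*dim=*/1);  // (N, C, d1, d2, d3)
  auto target = torch::randint(0, 3, {2, 4, 5, 6}, torch::kLong);

  auto n = input.size(0), c = input.size(1);
  auto input4d  = input.contiguous().view({n, c, 1, -1});     // (N, C, 1, d1*d2*d3)
  auto target3d = target.contiguous().view({n, 1, -1});       // (N, 1, d1*d2*d3)
  auto via_2d   = torch::nll_loss2d(input4d, target3d);

  // Same value as folding every position into the batch dimension.
  auto direct = torch::nll_loss(input.flatten(2).transpose(1, 2).reshape({-1, c}),
                                target.reshape({-1}));
  std::cout << via_2d.item<float>() << " vs " << direct.item<float>() << "\n";
}
```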
+#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { @@ -473,12 +484,21 @@ Tensor & nll_loss2d_out(const Tensor & self, const Tensor & target, const c10::o return std::get<0>(at::nll_loss2d_forward_out(output, total_weight, self, target, weight, reduction, ignore_index)); } +Tensor nll_loss2d_symint(const Tensor & self, const Tensor & target, const c10::optional& weight_opt, int64_t reduction, c10::SymInt ignore_index) { + // See [Note: hacky wrapper removal for optional tensor] + c10::MaybeOwned weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt); + const Tensor& weight = *weight_maybe_owned; + + return std::get<0>(at::nll_loss2d_forward_symint(self, target, weight, reduction, ignore_index)); +} + +// Duplicate of above code for non-symbolic ints. Kept for BC purposes and to minimize breakages. Tensor nll_loss2d(const Tensor & self, const Tensor & target, const c10::optional& weight_opt, int64_t reduction, int64_t ignore_index) { // See [Note: hacky wrapper removal for optional tensor] c10::MaybeOwned weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt); const Tensor& weight = *weight_maybe_owned; - return std::get<0>(at::nll_loss2d_forward(self, target, weight, reduction, ignore_index)); + return std::get<0>(at::nll_loss2d_forward_symint(self, target, weight, reduction, ignore_index)); } } // namespace native diff --git a/aten/src/ATen/native/MathBitFallThroughLists.h b/aten/src/ATen/native/MathBitFallThroughLists.h index 025c25bcbe7b..97b0854d82d0 100644 --- a/aten/src/ATen/native/MathBitFallThroughLists.h +++ b/aten/src/ATen/native/MathBitFallThroughLists.h @@ -54,7 +54,6 @@ namespace at { #define TENSOR_UTILITIES_AND_CONSTRUCTORS(m) \ m.impl("empty_like", torch::CppFunction::makeFallthrough()); \ m.impl("empty.memory_format", torch::CppFunction::makeFallthrough()); \ - m.impl("empty.SymInt", torch::CppFunction::makeFallthrough()); \ m.impl("empty.out", torch::CppFunction::makeFallthrough()); \ m.impl("empty_strided", torch::CppFunction::makeFallthrough()); \ m.impl("full_like", torch::CppFunction::makeFallthrough()); \ diff --git a/aten/src/ATen/native/MathBitsFallback.h b/aten/src/ATen/native/MathBitsFallback.h index 4e9c2d9e98b1..84e72aa724d0 100644 --- a/aten/src/ATen/native/MathBitsFallback.h +++ b/aten/src/ATen/native/MathBitsFallback.h @@ -1,12 +1,17 @@ -#include +#include #include #include #include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + namespace at { namespace native { // This fallback should only be used for operations that are self inverse and have a corresponding tensor diff --git a/aten/src/ATen/native/MaxPooling.cpp b/aten/src/ATen/native/MaxPooling.cpp index 3e615d7cf071..e809c75ba21d 100644 --- a/aten/src/ATen/native/MaxPooling.cpp +++ b/aten/src/ATen/native/MaxPooling.cpp @@ -1,10 +1,22 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { @@ -26,19 +38,19 @@ Tensor max_pool1d_impl( "max_pool1d() Expected 2D or 3D input tensor, but got ", self.sizes()); TORCH_CHECK( kernel_size.size() == 1, - "max_pool1d() kernel_size must be an int or int list of size 1 but got size ", + "max_pool1d() kernel_size must be an int, list of ints or tuple of ints of size 1 but got size ", kernel_size.size()); TORCH_CHECK( stride.size() == 0 
|| stride.size() == 1, - "max_pool1d() stride must be None, an int or int list of size 1 but got size ", + "max_pool1d() stride must be None, an int, list of ints, or tuple of ints of size 1 but got size ", stride.size()); TORCH_CHECK( padding.size() == 1, - "max_pool1d() padding must be an int or int list of size 1 but got size ", + "max_pool1d() padding must be an int, list of ints, or tuple of ints of size 1 but got size ", padding.size()); TORCH_CHECK( dilation.size() == 1, - "max_pool1d() dilation must be an int or int list of size 1 but got size ", + "max_pool1d() dilation must be an int, list of ints or tuple of ints of size 1 but got size ", dilation.size()); // If stride=None then set it to kernel_size @@ -97,13 +109,22 @@ Tensor max_pool1d( IntArrayRef padding, IntArrayRef dilation, bool ceil_mode) { + + auto ndim = self.ndimension(); + TORCH_CHECK( + (ndim == 2 && self.size(0) != 0 && self.size(1) != 0) || + (ndim == 3 && self.size(1) != 0 && self.size(2) != 0), + "max_pool1d: Expected 2D or 3D (batch mode) tensor with optional 0 dim batch size for input, but got:", + self.sizes()); + if (self.is_quantized()) { return at::quantized_max_pool1d( self, kernel_size, stride, padding, dilation, ceil_mode); } if ((self.requires_grad() && at::GradMode::is_enabled()) || self._fw_grad(/*level */ 0).defined() || - !self.device().is_cpu()) { + !self.device().is_cpu() || + isTensorSubclassLike(self)) { // Needs indices for grad and with_indices defines CUDA dispatch return std::get<0>(at::max_pool1d_with_indices( self, kernel_size, stride, padding, dilation, ceil_mode)); diff --git a/aten/src/ATen/native/MaxUnpooling.cpp b/aten/src/ATen/native/MaxUnpooling.cpp index 27d4e1a93c81..adab802d65cd 100644 --- a/aten/src/ATen/native/MaxUnpooling.cpp +++ b/aten/src/ATen/native/MaxUnpooling.cpp @@ -1,8 +1,17 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + namespace at { namespace native { @@ -11,6 +20,10 @@ Tensor& max_unpooling2d_forward_out_cpu( const Tensor& indices_, IntArrayRef output_size, Tensor& output) { + // See Note [Writing Nondeterministic Operations] + // Nondeterministic with duplicate indices + at::globalContext().alertNotDeterministic("max_unpooling2d_forward_out"); + auto oheight = output_size[0]; auto owidth = output_size[1]; TORCH_CHECK( @@ -149,6 +162,10 @@ Tensor& max_unpooling3d_forward_out_cpu(const Tensor& self_, IntArrayRef stride, IntArrayRef padding, Tensor& output) { + // See Note [Writing Nondeterministic Operations] + // Nondeterministic with duplicate indices + at::globalContext().alertNotDeterministic("max_unpooling3d_forward_out"); + TORCH_CHECK(output.is_contiguous(), "output must be contiguous"); int64_t oT = output_size[0]; int64_t oH = output_size[1]; diff --git a/aten/src/ATen/native/Memory.cpp b/aten/src/ATen/native/Memory.cpp index df6949b2d7d9..2b66f0893393 100644 --- a/aten/src/ATen/native/Memory.cpp +++ b/aten/src/ATen/native/Memory.cpp @@ -1,6 +1,17 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/MetaTensor.cpp b/aten/src/ATen/native/MetaTensor.cpp index 0b3bb3e04c7b..5ebe52ec4a81 100644 --- a/aten/src/ATen/native/MetaTensor.cpp +++ b/aten/src/ATen/native/MetaTensor.cpp @@ -12,19 +12,7 @@ namespace at { 
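Two user-visible effects of the `max_pool1d` hunk above: inputs whose non-batch dimensions are empty are now rejected up front, and anything that needs autograd (or is a tensor subclass or a non-CPU tensor) is routed through `max_pool1d_with_indices`, which returns the same values plus indices. A rough sketch of both, assuming the patched behavior:

```cpp
// max_pool1d vs. max_pool1d_with_indices, plus the new empty-dimension check.
#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::randn({2, 4, 10}, torch::requires_grad());     // (N, C, L)

  auto pooled = torch::max_pool1d(x, /*kernel_size=*/{3}, /*stride=*/{2});
  auto [values, indices] = torch::max_pool1d_with_indices(x, {3}, {2});
  std::cout << torch::allclose(pooled, values) << "\n";          // 1: same values either way

  try {
    auto bad = torch::empty({2, 0, 10});                         // empty channel dimension
    torch::max_pool1d(bad, {3});
  } catch (const c10::Error&) {
    std::cout << "rejected: empty non-batch dimension\n";        // new TORCH_CHECK above
  }
}
```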
namespace native { -Tensor empty_meta( - IntArrayRef size, - c10::optional dtype_opt, - c10::optional layout_opt, - c10::optional device_opt, - c10::optional pin_memory_opt, - c10::optional memory_format_opt -) { - return at::detail::empty_meta( - size, dtype_opt, layout_opt, device_opt, pin_memory_opt, memory_format_opt); -} - -Tensor empty_symint_meta( +Tensor empty_meta_symint( SymIntArrayRef size, c10::optional dtype_opt, c10::optional layout_opt, @@ -41,6 +29,7 @@ Tensor empty_symint_meta( size, dtype_opt, layout_opt, device_opt, pin_memory_opt, memory_format_opt); } +// Kept only for BC with XLA Tensor empty_strided_meta( IntArrayRef size, IntArrayRef stride, @@ -49,7 +38,18 @@ Tensor empty_strided_meta( c10::optional device_opt, c10::optional pin_memory_opt ) { - return at::detail::empty_strided_meta( + return empty_strided_meta_symint(c10::fromIntArrayRefSlow(size), c10::fromIntArrayRefSlow(stride), dtype_opt, layout_opt, device_opt, pin_memory_opt); +} + +Tensor empty_strided_meta_symint( + SymIntArrayRef size, + SymIntArrayRef stride, + c10::optional dtype_opt, + c10::optional layout_opt, + c10::optional device_opt, + c10::optional pin_memory_opt +) { + return at::detail::empty_strided_symint_meta( size, stride, dtype_opt, layout_opt, device_opt, pin_memory_opt); } diff --git a/aten/src/ATen/native/NNPACK.cpp b/aten/src/ATen/native/NNPACK.cpp index 3df0a0623e43..4fb40a17d026 100644 --- a/aten/src/ATen/native/NNPACK.cpp +++ b/aten/src/ATen/native/NNPACK.cpp @@ -1,10 +1,21 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + #if !AT_NNPACK_ENABLED() namespace at { @@ -198,8 +209,8 @@ Tensor _nnpack_spatial_convolution( .height = (size_t)output.size(2), }; const nnp_size output_subsample = { - .width = stride[1], - .height = stride[0], + .width = static_cast(stride[1]), + .height = static_cast(stride[0]), }; const auto input_ = input.contiguous(); diff --git a/aten/src/ATen/native/NaiveConvolutionTranspose2d.cpp b/aten/src/ATen/native/NaiveConvolutionTranspose2d.cpp index ea604c426c3b..a9cf36a004f4 100644 --- a/aten/src/ATen/native/NaiveConvolutionTranspose2d.cpp +++ b/aten/src/ATen/native/NaiveConvolutionTranspose2d.cpp @@ -1,5 +1,5 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include @@ -8,6 +8,17 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + #include #include diff --git a/aten/src/ATen/native/NaiveConvolutionTranspose3d.cpp b/aten/src/ATen/native/NaiveConvolutionTranspose3d.cpp index 3d34091fd036..cf60f56f9df4 100644 --- a/aten/src/ATen/native/NaiveConvolutionTranspose3d.cpp +++ b/aten/src/ATen/native/NaiveConvolutionTranspose3d.cpp @@ -1,11 +1,23 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/NaiveDilatedConvolution.cpp b/aten/src/ATen/native/NaiveDilatedConvolution.cpp index fa7b30f5977e..827bf204b093 100644 --- a/aten/src/ATen/native/NaiveDilatedConvolution.cpp +++ b/aten/src/ATen/native/NaiveDilatedConvolution.cpp @@ -1,14 +1,25 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include #include 
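The `empty_strided_meta` change above shows the general backward-compatibility shim for sym-intified ops: the old `IntArrayRef` entry point is kept (external backends such as XLA still call it) and simply converts its arguments with `c10::fromIntArrayRefSlow` before forwarding to the `*_symint` variant. A sketch of the pattern with hypothetical helper names (`my_empty`, `my_empty_symint`); only `fromIntArrayRefSlow` and `at::empty_symint` are taken from the hunk:

```cpp
// BC-shim pattern: keep the IntArrayRef signature, forward to the SymInt one.
#include <ATen/ATen.h>
#include <iostream>

// Hypothetical "new" implementation taking symbolic sizes.
static at::Tensor my_empty_symint(c10::SymIntArrayRef size) {
  return at::empty_symint(size, at::kFloat);
}

// Hypothetical old-style wrapper kept for backward compatibility.
static at::Tensor my_empty(at::IntArrayRef size) {
  return my_empty_symint(c10::fromIntArrayRefSlow(size));
}

int main() {
  std::cout << my_empty({2, 3}).sizes() << "\n";   // [2, 3]
}
```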
#include #include #include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/NamedTensor.cpp b/aten/src/ATen/native/NamedTensor.cpp index d725c26a1463..6ee2f095b6d0 100644 --- a/aten/src/ATen/native/NamedTensor.cpp +++ b/aten/src/ATen/native/NamedTensor.cpp @@ -1,8 +1,30 @@ -#include -#include - +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include diff --git a/aten/src/ATen/native/NegateFallback.cpp b/aten/src/ATen/native/NegateFallback.cpp index a2b134a91e40..0a34b4f4331d 100644 --- a/aten/src/ATen/native/NegateFallback.cpp +++ b/aten/src/ATen/native/NegateFallback.cpp @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include diff --git a/aten/src/ATen/native/NonSymbolicBC.h b/aten/src/ATen/native/NonSymbolicBC.h new file mode 100644 index 000000000000..0b942efb52c3 --- /dev/null +++ b/aten/src/ATen/native/NonSymbolicBC.h @@ -0,0 +1,27 @@ +#pragma once +#include +#include +#include + +namespace at { +namespace native { +// This file contains non-symbolic signatures for ops that we have sym-intified the signature of. +// However, in certain cases (such as static runtime), we call the native versions of the ops directly. +// In those cases, we will duplicate the signature here with non-symbolic ints, and also duplicate the C++ implementation. +TORCH_API at::Tensor reshape(const at::Tensor& self, at::IntArrayRef proposed_shape); +TORCH_API at::Tensor narrow(const at::Tensor& self, int64_t dim, int64_t start, int64_t length); +TORCH_API at::Tensor _sparse_coo_tensor_unsafe(const at::Tensor & indices, const at::Tensor & values, at::IntArrayRef size, c10::optional dtype=c10::nullopt, c10::optional layout=c10::nullopt, c10::optional device=c10::nullopt, c10::optional pin_memory=c10::nullopt); +TORCH_API at::Tensor nll_loss(const at::Tensor & self, const at::Tensor & target, const c10::optional& weight_opt, int64_t reduction, int64_t ignore_index); +TORCH_API at::Tensor nll_loss2d(const at::Tensor & self, const at::Tensor & target, const c10::optional& weight_opt, int64_t reduction, int64_t ignore_index); +// The below ops don't get a duplicated C++ implementation. +// They are backward ops, which make them very unlikely to be called directly +// by external code (at::native::trace_backward). +// They get their own declaration for BC purposes however. 
+TORCH_API at::Tensor _embedding_bag_backward(const at::Tensor & grad, const at::Tensor & indices, const at::Tensor & offsets, const at::Tensor & offset2bag, const at::Tensor & bag_size, const at::Tensor & maximum_indices, int64_t num_weights, bool scale_grad_by_freq, int64_t mode, bool sparse, const c10::optional & per_sample_weights, int64_t padding_idx=-1); +TORCH_API at::Tensor _embedding_bag_sparse_backward(const at::Tensor & grad, const at::Tensor & indices, const at::Tensor & offsets, const at::Tensor & offset2bag, const at::Tensor & bag_size, int64_t num_weights, bool scale_grad_by_freq, int64_t mode, const c10::optional & per_sample_weights, int64_t padding_idx=-1); +TORCH_API at::Tensor value_selecting_reduction_backward(const at::Tensor & grad, int64_t dim, const at::Tensor & indices, at::IntArrayRef sizes, bool keepdim); +TORCH_API at::Tensor trace_backward(const at::Tensor & grad, at::IntArrayRef sizes); +TORCH_API at::Tensor index_select_backward(const at::Tensor & grad, at::IntArrayRef self_sizes, int64_t dim, const at::Tensor & index); +TORCH_API at::Tensor select(const at::Tensor& self, int64_t dim, int64_t index); +TORCH_API std::vector tensor_split(const Tensor& self, IntArrayRef indices, int64_t dim); +}} diff --git a/aten/src/ATen/native/Normalization.cpp b/aten/src/ATen/native/Normalization.cpp index e5373cac4ad2..ab9094d9b598 100644 --- a/aten/src/ATen/native/Normalization.cpp +++ b/aten/src/ATen/native/Normalization.cpp @@ -1,19 +1,55 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include -#include #include +#include +#include +#include +#include +#include +#include +#include #include -#include #include #include #include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include +#include static const int MIOPEN_DIM_MAX = 5; @@ -41,14 +77,14 @@ DEFINE_DISPATCH(batch_norm_cpu_backward_stub); DEFINE_DISPATCH(renorm_scale_factor_stub); namespace { - void check_dims_match_num_input_features(const char* arg_name, int64_t expected, int64_t actual){ + void check_dims_match_num_input_features(const char* arg_name, SymInt expected, SymInt actual){ TORCH_CHECK(actual == expected, arg_name, " should contain ", expected, " elements not ", actual); } - static inline Tensor repeat_if_defined(const Tensor& t, int64_t repeat) { + static inline Tensor repeat_if_defined(const Tensor& t, SymInt repeat) { if (t.defined()) { - return t.repeat(repeat); + return t.repeat_symint(repeat); } return t; } @@ -88,17 +124,17 @@ std::tuple batch_norm_cpu_transform_input_template( const Tensor& input, const Tensor& weight, const Tensor& bias, const Tensor& save_mean /* optional */, const Tensor& save_invstd /* optional */, const Tensor& running_mean /* optional */, const Tensor& running_var /* optional */, - bool train, double eps) { + bool train, double eps, Tensor& output) { bool all_contiguous = is_contiguous(input) - && (!weight.defined() || weight.is_contiguous()) - && (!bias.defined() || bias.is_contiguous()) - && running_mean.is_contiguous() - && running_var.is_contiguous(); + && is_contiguous(output) + && (!weight.defined() || weight.is_contiguous()) + && (!bias.defined() || bias.is_contiguous()) + && running_mean.is_contiguous() + && running_var.is_contiguous(); // 
inference contiguous path if (all_contiguous) { - Tensor output = at::empty_like(input, suggest_memory_format_contig(input)); batch_norm_cpu_stub(kCPU, output, input, weight, bias, save_mean, save_invstd, running_mean, running_var, train, eps); return std::make_tuple(output, save_mean, save_invstd); @@ -130,7 +166,6 @@ std::tuple batch_norm_cpu_transform_input_template( auto b = bias.defined() ? as_nd(bias) : at::detail::scalar_tensor_static(0, dtype, kCPU); - Tensor output = at::empty_like(input, input.suggest_memory_format()); auto iter = TensorIteratorConfig() .add_output(output) .add_input(input) @@ -141,8 +176,7 @@ std::tuple batch_norm_cpu_transform_input_template( .check_all_same_dtype(false) .promote_inputs_to_common_dtype(false) .build(); - - cpu_kernel(iter, [=](scalar_t input, param_t mean, param_t invstd, param_t weight, param_t bias) { + cpu_kernel(iter, [=](scalar_t input, param_t mean, param_t invstd, param_t weight, param_t bias) -> scalar_t { return ((input - mean) * invstd) * weight + bias; }); return std::make_tuple(output, save_mean, save_invstd); @@ -151,30 +185,17 @@ std::tuple batch_norm_cpu_transform_input_template( template class VarTransform> std::tuple batch_norm_cpu_update_stats_template( const Tensor& input, const Tensor& running_mean, const Tensor& running_var, - double momentum, double eps) { + double momentum, double eps, Tensor& save_mean, Tensor& save_var_transform) { using accscalar_t = at::acc_type; int64_t n_input = input.size(1); int64_t n = input.numel() / n_input; - const int64_t ndim = input.dim(); - - // Reduce all dimensions except dim=1 - DimVector reduce_dims(ndim - 1); - reduce_dims[0] = 0; - for (const auto i : c10::irange(2, ndim)) { - reduce_dims[i - 1] = i; - } bool all_contiguous = is_contiguous(input); const bool mixed_type = !std::is_same::value; const auto dtype = mixed_type ? kFloat : input.scalar_type(); - // For contiguous case, leave 'mean' computation to kernel - Tensor save_mean = all_contiguous - ? 
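The running-statistics update above is the usual exponential moving average, with the variance folded in as the unbiased estimate `var_sum / (n - 1)`; `momentum_` is just `momentum` cast to the accumulation type. A plain-scalar sketch of one channel's update:

```cpp
// One channel's running-stat update, as in batch_norm_cpu_update_stats_template.
#include <iostream>

int main() {
  double momentum = 0.1;
  double running_mean = 0.0, running_var = 1.0;   // state carried between batches
  double batch_mean = 2.0;
  double var_sum = 27.0;                          // sum of squared deviations this batch
  long   n = 10;                                  // reduced elements per channel

  double unbiased_var = var_sum / (n - 1);        // 3.0
  running_mean = momentum * batch_mean + (1 - momentum) * running_mean;   // 0.2
  running_var  = momentum * unbiased_var + (1 - momentum) * running_var;  // 1.2

  std::cout << running_mean << " " << running_var << "\n";
}
```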
at::empty({n_input}, input.options().dtype(dtype)) - : at::mean(input, /*dim=*/reduce_dims, /*keepdim=*/false, dtype); - Tensor save_var_transform = at::empty({n_input}, input.options().dtype(dtype)); auto save_mean_a = save_mean.accessor(); auto save_var_transform_a = save_var_transform.accessor(); @@ -186,6 +207,7 @@ std::tuple batch_norm_cpu_update_stats_template( auto _var_sum = at::empty({n_input}, input.options().dtype(dtype)); auto _mean_a = _mean.accessor(); auto _var_sum_a = _var_sum.accessor(); + auto momentum_ = static_cast(momentum); batch_norm_cpu_collect_stats_stub(kCPU, _mean, _var_sum, input); @@ -195,11 +217,11 @@ std::tuple batch_norm_cpu_update_stats_template( save_var_transform_a[f] = VarTransform{}(_var_sum_a[f] / n, eps); if (running_mean.defined()) { - running_mean_a[f] = momentum * _mean_a[f] + (1 - momentum) * running_mean_a[f]; + running_mean_a[f] = momentum_ * _mean_a[f] + (1 - momentum_) * running_mean_a[f]; } if (running_var.defined()) { - accscalar_t unbiased_var = _var_sum_a[f] / (n - 1); - running_var_a[f] = momentum * unbiased_var + (1 - momentum) * running_var_a[f]; + accscalar_t unbiased_var = _var_sum_a[f] / (n - 1); + running_var_a[f] = momentum_ * unbiased_var + (1 - momentum_) * running_var_a[f]; } } }); @@ -243,6 +265,25 @@ std::tuple batch_norm_cpu_update_stats_template( return std::make_tuple(save_mean, save_var_transform); } +template class VarTransform> +std::tuple batch_norm_cpu_update_stats_template( + const Tensor& input, const Tensor& running_mean, const Tensor& running_var, + double momentum, double eps) { + int64_t n_input = input.size(1); + const int64_t ndim = input.dim(); + DimVector reduce_dims(ndim - 1); + reduce_dims[0] = 0; + for (const auto i : c10::irange(2, ndim)) { + reduce_dims[i - 1] = i; + } + + const bool mixed_type = !std::is_same::value; + const auto dtype = mixed_type ? kFloat : input.scalar_type(); + Tensor save_mean = is_contiguous(input) ? 
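The new allocating overload above builds `reduce_dims` as every dimension except dim 1, so that `at::mean` yields one value per channel when the input is not contiguous (the contiguous case leaves the mean to the kernel and only pre-allocates). A small sketch of that reduction:

```cpp
// Per-channel mean via reduce_dims = all dims except the channel dim (dim 1).
#include <torch/torch.h>
#include <iostream>

int main() {
  auto input = torch::randn({8, 3, 16, 16});       // (N, C, H, W)

  const int64_t ndim = input.dim();
  std::vector<int64_t> reduce_dims(ndim - 1);
  reduce_dims[0] = 0;
  for (int64_t i = 2; i < ndim; ++i) {
    reduce_dims[i - 1] = i;                        // {0, 2, 3}
  }

  auto per_channel_mean = at::mean(input, reduce_dims, /*keepdim=*/false);
  std::cout << per_channel_mean.sizes() << "\n";   // [3]
}
```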
at::empty({n_input}, input.options().dtype(dtype)) : at::mean(input, /*dim=*/reduce_dims, /*keepdim=*/false, dtype); + Tensor save_var_transform = at::empty({n_input}, input.options().dtype(dtype)); + return batch_norm_cpu_update_stats_template(input, running_mean, running_var, momentum, eps, save_mean, save_var_transform); +} + template std::tuple batch_norm_backward_cpu_template( const Tensor& grad_out_, const Tensor& input, const Tensor& weight, @@ -442,14 +483,14 @@ std::tuple _batch_norm_impl_index( const Tensor& running_mean = c10::value_or_else(running_mean_opt, [] {return Tensor();}); const Tensor& running_var = c10::value_or_else(running_var_opt, [] {return Tensor();}); - auto num_features = input.sizes()[1]; + auto num_features = input.sym_sizes()[1]; - if (input.numel() == 0) { + if (input.sym_numel() == 0) { Tensor reserve = at::empty({0}, input.options().dtype(kByte)); auto options = input.options().dtype( at::toAccumulateType(input.scalar_type(), /*is_cuda=*/input.is_cuda())); - auto save_mean = at::empty({num_features}, options); - auto save_invstd = at::empty({num_features}, options); + auto save_mean = at::empty_symint(c10::SymIntArrayRef({num_features}), options); + auto save_invstd = at::empty_symint(c10::SymIntArrayRef({num_features}), options); // don't return view of input, don't return empty tensor because it will break gradient chain auto out = input.clone(); @@ -460,20 +501,20 @@ std::tuple _batch_norm_impl_index( } if (running_mean.defined()) { - check_dims_match_num_input_features("running_mean", num_features, running_mean.numel()); + check_dims_match_num_input_features("running_mean", num_features, running_mean.sym_numel()); } else if (!training) { AT_ERROR("running_mean must be defined in evaluation mode"); } if (running_var.defined()) { - check_dims_match_num_input_features("running_var", num_features, running_var.numel()); + check_dims_match_num_input_features("running_var", num_features, running_var.sym_numel()); } else if (!training) { AT_ERROR("running_var must be defined in evaluation mode"); } if (weight.defined()) { - check_dims_match_num_input_features("weight", num_features, weight.numel()); + check_dims_match_num_input_features("weight", num_features, weight.sym_numel()); } if (bias.defined()) { - check_dims_match_num_input_features("bias", num_features, bias.numel()); + check_dims_match_num_input_features("bias", num_features, bias.sym_numel()); } const bool use_cudnn = ( @@ -485,12 +526,12 @@ std::tuple _batch_norm_impl_index( && ((running_mean.defined() && running_var.defined()) || (!running_mean.defined() && !running_var.defined() && training)) && (input.dim() >= 3) - && ((input.size(0) <= 880801 && training) // spatial, training - ||(input.size(0) <= 65535 && !training)) //spatial, eval + && ((input.sym_size(0) <= 880801 && training) // spatial, training + ||(input.sym_size(0) <= 65535 && !training)) //spatial, eval && detail::getCUDAHooks().compiledWithCuDNN() && eps >= detail::getCUDAHooks().batchnormMinEpsilonCuDNN() && cudnn_enabled && detail::getCUDAHooks().versionCuDNN() >= 5110L - && input.numel() < std::numeric_limits::max() // some cuDNN kernels have 32-bit indexing limitations + && input.sym_numel() < std::numeric_limits::max() // some cuDNN kernels have 32-bit indexing limitations ); if (use_cudnn) { @@ -523,7 +564,7 @@ std::tuple _batch_norm_impl_index( && cudnn_enabled ); - if (use_miopen) { + if (use_miopen && input.suggest_memory_format() != MemoryFormat::ChannelsLast && input.suggest_memory_format() != 
MemoryFormat::ChannelsLast3d) { return std::tuple_cat( at::miopen_batch_norm( input.contiguous(), weight.contiguous(), bias.contiguous(), @@ -609,32 +650,32 @@ Tensor instance_norm( const Tensor& running_mean = c10::value_or_else(running_mean_opt, [] {return Tensor();}); const Tensor& running_var = c10::value_or_else(running_var_opt, [] {return Tensor();}); - TORCH_CHECK(use_input_stats || (running_mean.defined() && running_var.defined()), + TORCH_CHECK(use_input_stats || (running_mean.defined() && running_var.defined()), "Expected running_mean and running_var to be defined when use_input_stats is false"); - std::vector shape = input.sizes().vec(); - int64_t b = input.size(0); - int64_t c = input.size(1); + std::vector shape = input.sym_sizes().vec(); + SymInt b = input.sym_size(0); + SymInt c = input.sym_size(1); shape[1] = b * c; - shape[0] = 1; + shape[0] = SymInt(1); Tensor weight_ = repeat_if_defined(weight, b); Tensor bias_ = repeat_if_defined(bias, b); Tensor running_mean_ = repeat_if_defined(running_mean, b); Tensor running_var_ = repeat_if_defined(running_var, b); - auto input_reshaped = input.contiguous().view(shape); + auto input_reshaped = input.contiguous().view_symint(shape); auto out = at::batch_norm(input_reshaped, weight_, bias_, running_mean_, running_var_, use_input_stats, momentum, eps, cudnn_enabled); // we alias running_mean and running_var because they are const but we want to modify their data if (running_mean.defined()) { - at::alias(running_mean).copy_(running_mean_.view({ b, c }).mean(0, false)); + at::alias(running_mean).copy_(running_mean_.view_symint({ b, c }).mean(0, false)); } if (running_var.defined()) { - at::alias(running_var).copy_(running_var_.view({ b, c }).mean(0, false)); + at::alias(running_var).copy_(running_var_.view_symint({ b, c }).mean(0, false)); } - return out.view(input.sizes()); + return out.view_symint(input.sym_sizes()); } std::tuple batch_norm_update_stats_cpu( @@ -655,8 +696,8 @@ std::tuple batch_norm_update_stats_cpu( }); } -std::tuple batch_norm_cpu(const Tensor& self, const c10::optional& weight_opt, const c10::optional& bias_opt, const c10::optional& running_mean_opt, const c10::optional& running_var_opt, - bool train, double momentum, double eps) { +std::tuple batch_norm_cpu_out(const Tensor& self, const c10::optional& weight_opt, const c10::optional& bias_opt, const c10::optional& running_mean_opt, const c10::optional& running_var_opt, + bool train, double momentum, double eps, Tensor& out, Tensor& save_mean, Tensor& save_var) { // See [Note: hacky wrapper removal for optional tensor] c10::MaybeOwned weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt); const Tensor& weight = *weight_maybe_owned; @@ -664,33 +705,112 @@ std::tuple batch_norm_cpu(const Tensor& self, const c10: const Tensor& running_mean = c10::value_or_else(running_mean_opt, [] {return Tensor();}); const Tensor& running_var = c10::value_or_else(running_var_opt, [] {return Tensor();}); - checkBackend("batch_norm_cpu", {self, weight, bias, running_mean, running_var}, Backend::CPU); + checkBackend("batch_norm_cpu_out", {self, weight, bias, running_mean, running_var}, Backend::CPU); + // Resize out + at::native::resize_output(out, self.sizes()); const bool mixed_type = is_mixed_type(self, weight, bias, running_mean, running_var); - return AT_DISPATCH_FLOATING_TYPES_AND(ScalarType::BFloat16, self.scalar_type(), "batch_norm", [&] { + AT_DISPATCH_FLOATING_TYPES_AND(ScalarType::BFloat16, self.scalar_type(), "batch_norm", [&] { if (mixed_type) { 
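The `instance_norm` hunk above (now sym-intified) is a useful reminder of how the op is computed: the batch dimension is folded into the channel dimension, `batch_norm` runs on the `(1, N*C, ...)` view, and the result is viewed back to the input shape. A sketch of that equivalence, assuming a libtorch build:

```cpp
// instance_norm as batch_norm over a (1, N*C, H, W) view of the input.
#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::randn({4, 3, 8, 8});             // (N, C, H, W)
  int64_t n = x.size(0), c = x.size(1);

  auto folded = x.contiguous().view({1, n * c, 8, 8});
  auto out = torch::batch_norm(folded, /*weight=*/{}, /*bias=*/{},
                               /*running_mean=*/{}, /*running_var=*/{},
                               /*training=*/true, /*momentum=*/0.1, /*eps=*/1e-5,
                               /*cudnn_enabled=*/false)
                 .view(x.sizes());

  auto reference = torch::instance_norm(x, {}, {}, {}, {},
                                        /*use_input_stats=*/true, 0.1, 1e-5,
                                        /*cudnn_enabled=*/false);
  std::cout << torch::allclose(out, reference, /*rtol=*/1e-4, /*atol=*/1e-5) << "\n";  // 1
}
```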
check_mixed_data_type(self, weight, bias, running_mean, running_var); if (!train) { - auto save_mean = at::empty({0}, self.options().dtype(kFloat)); - auto save_var = at::empty({0}, self.options().dtype(kFloat)); - return batch_norm_cpu_transform_input_template(self, weight, bias, save_mean, save_var, running_mean, running_var, train, eps); + return batch_norm_cpu_transform_input_template(self, weight, bias, save_mean, save_var, running_mean, running_var, train, eps, out); } else { - auto save_stats = batch_norm_cpu_update_stats_template(self, running_mean, running_var, momentum, eps); - return batch_norm_cpu_transform_input_template(self, weight, bias, std::get<0>(save_stats), std::get<1>(save_stats), running_mean, running_var, train, eps); + // Resize save_mean and save_var + at::native::resize_output(save_mean, {self.size(1)}); + at::native::resize_output(save_var, {self.size(1)}); + auto save_stats = batch_norm_cpu_update_stats_template(self, running_mean, running_var, momentum, eps, save_mean, save_var); + return batch_norm_cpu_transform_input_template(self, weight, bias, std::get<0>(save_stats), std::get<1>(save_stats), running_mean, running_var, train, eps, out); } } else { if (!train) { - auto save_mean = at::empty({0}, self.options()); - auto save_var = at::empty({0}, self.options()); - return batch_norm_cpu_transform_input_template(self, weight, bias, save_mean, save_var, running_mean, running_var, train, eps); + return batch_norm_cpu_transform_input_template(self, weight, bias, save_mean, save_var, running_mean, running_var, train, eps, out); } else { - auto save_stats = batch_norm_cpu_update_stats_template(self, running_mean, running_var, momentum, eps); - return batch_norm_cpu_transform_input_template(self, weight, bias, std::get<0>(save_stats), std::get<1>(save_stats), running_mean, running_var, train, eps); + // Resize save_mean and save_var + at::native::resize_output(save_mean, {self.size(1)}); + at::native::resize_output(save_var, {self.size(1)}); + auto save_stats = batch_norm_cpu_update_stats_template(self, running_mean, running_var, momentum, eps, save_mean, save_var); + return batch_norm_cpu_transform_input_template(self, weight, bias, std::get<0>(save_stats), std::get<1>(save_stats), running_mean, running_var, train, eps, out); } } }); + + return std::tuple(out, save_mean, save_var); } +std::tuple batch_norm_cpu(const Tensor& self, const c10::optional& weight_opt, const c10::optional& bias_opt, const c10::optional& running_mean_opt, const c10::optional& running_var_opt, + bool train, double momentum, double eps) { + // See [Note: hacky wrapper removal for optional tensor] + c10::MaybeOwned weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt); + const Tensor& weight = *weight_maybe_owned; + const Tensor& bias = c10::value_or_else(bias_opt, [] {return Tensor();}); + const Tensor& running_mean = c10::value_or_else(running_mean_opt, [] {return Tensor();}); + const Tensor& running_var = c10::value_or_else(running_var_opt, [] {return Tensor();}); + + checkBackend("batch_norm_cpu", {self, weight, bias, running_mean, running_var}, Backend::CPU); + + // Prepare output tensor + const bool all_contiguous = is_contiguous(self) + && (!weight.defined() || weight.is_contiguous()) + && (!bias.defined() || bias.is_contiguous()) + && running_mean.is_contiguous() + && running_var.is_contiguous(); + Tensor output = at::empty_like(self, all_contiguous ? 
suggest_memory_format_contig(self) : self.suggest_memory_format()); + + // Prepare save_mean and save_var + Tensor save_var; + Tensor save_mean; + const bool mixed_type = is_mixed_type(self, weight, bias, running_mean, running_var); + const int64_t ndim = self.dim(); + DimVector reduce_dims(ndim - 1); + reduce_dims[0] = 0; + for (const auto i : c10::irange(2, ndim)) { + reduce_dims[i - 1] = i; + } + if (mixed_type) { + if (!train) { + save_mean = at::empty({0}, self.options().dtype(kFloat)); + save_var = at::empty({0}, self.options().dtype(kFloat)); + } else { + save_mean = is_contiguous(self) ? at::empty({self.size(1)}, self.options().dtype(kFloat)) : at::mean(self, /*dim=*/reduce_dims, /*keepdim=*/false, kFloat); + save_var = at::empty({self.size(1)}, self.options().dtype(kFloat)); + } + } else { + if (!train) { + save_mean = at::empty({0}, self.options()); + save_var = at::empty({0}, self.options()); + } else { + save_mean = is_contiguous(self) ? at::empty({self.size(1)}, self.options()) : at::mean(self, /*dim=*/reduce_dims, /*keepdim=*/false); + save_var = at::empty({self.size(1)}, self.options()); + } + } + return batch_norm_cpu_out(self, weight_opt, bias_opt, running_mean_opt, running_var_opt, train, momentum, eps, output, save_mean, save_var); +} + + +std::tuple _batch_norm_legit_cpu( + const Tensor& self, const c10::optional& weight_opt, const c10::optional& bias_opt, + Tensor& running_mean, Tensor& running_var, bool train, double momentum, double eps) { + return batch_norm_cpu(self, weight_opt, bias_opt, running_mean, running_var, train, momentum, eps); +} + +std::tuple _batch_norm_legit_no_stats_cpu( + const Tensor& self, const c10::optional& weight_opt, const c10::optional& bias_opt, + bool train, double momentum, double eps) { + return batch_norm_cpu(self, weight_opt, bias_opt, Tensor(), Tensor(), train, momentum, eps); +} + + +std::tuple _batch_norm_legit_cpu_out(const Tensor& self, const c10::optional& weight_opt, const c10::optional& bias_opt, Tensor& running_mean, Tensor& running_var, bool train, double momentum, double eps, Tensor& out, Tensor& save_mean, Tensor& save_var) { + return batch_norm_cpu_out(self, weight_opt, bias_opt, running_mean, running_var, train, momentum, eps, out, save_mean, save_var); +} + + +std::tuple _batch_norm_legit_no_stats_cpu_out(const Tensor& self, const c10::optional& weight_opt, const c10::optional& bias_opt, bool train, double momentum, double eps, Tensor& out, Tensor& save_mean, Tensor& save_var) { + return batch_norm_cpu_out(self, weight_opt, bias_opt, Tensor(), Tensor(), train, momentum, eps, out, save_mean, save_var); +} + + std::tuple batch_norm_backward_cpu(const Tensor& grad_out, const Tensor& self, const c10::optional& weight_opt, const c10::optional& running_mean_opt, const c10::optional& running_var_opt, const c10::optional& save_mean_opt, const c10::optional& save_invstd_opt, bool train, double eps, std::array grad_input_mask) { // See [Note: hacky wrapper removal for optional tensor] diff --git a/aten/src/ATen/native/Onehot.cpp b/aten/src/ATen/native/Onehot.cpp index a0c061062174..41b7a6961863 100644 --- a/aten/src/ATen/native/Onehot.cpp +++ b/aten/src/ATen/native/Onehot.cpp @@ -1,4 +1,14 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/PackedSequence.cpp b/aten/src/ATen/native/PackedSequence.cpp index ec997d86aa1b..19b12b081960 100644 --- 
a/aten/src/ATen/native/PackedSequence.cpp +++ b/aten/src/ATen/native/PackedSequence.cpp @@ -1,5 +1,20 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include #include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif #include @@ -96,18 +111,20 @@ std::tuple _pack_padded_sequence(const Tensor& _input, const Ten // `grad` could be on arbitrary device and of arbitrary dtype, but `_batch_sizes` // is guaranteed to be a CPU int64 tensor. // See NOTE [ device and dtype of a PackedSequence ] -Tensor _pack_padded_sequence_backward(const Tensor& grad, at::IntArrayRef input_size, const Tensor& _batch_sizes, bool batch_first) { - std::vector input_size_after_t = input_size.vec(); +Tensor _pack_padded_sequence_backward_symint(const Tensor& grad, c10::SymIntArrayRef input_size, const Tensor& _batch_sizes, bool batch_first) { + std::vector input_size_after_t = input_size.vec(); if (batch_first) { TORCH_CHECK(input_size.size() >= 2); std::swap(input_size_after_t[0], input_size_after_t[1]); } - auto grad_input = at::zeros(input_size_after_t, grad.options()); + auto grad_input = at::zeros_symint(input_size_after_t, grad.options()); auto batch_sizes_t = _batch_sizes.contiguous(); checkLongTensor(batch_sizes_t); int64_t offset = 0; - int64_t max_seq_len = batch_sizes_t.size(0); + // NOTE: this op advertises as CompositeImplicitAutograd, but uses data_ptr(). + // we should fix this. + auto max_seq_len = batch_sizes_t.size(0); int64_t * batch_sizes = batch_sizes_t.data_ptr(); for (const auto i : c10::irange(max_seq_len)) { grad_input[i].slice(0, 0, batch_sizes[i]).copy_(grad.slice(0, offset, offset + batch_sizes[i])); diff --git a/aten/src/ATen/native/PadNd.cpp b/aten/src/ATen/native/PadNd.cpp index 9510b17de002..9421d537717c 100644 --- a/aten/src/ATen/native/PadNd.cpp +++ b/aten/src/ATen/native/PadNd.cpp @@ -1,8 +1,29 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { Tensor constant_pad_nd(const Tensor& self, IntArrayRef pad, const Scalar& value) { @@ -85,13 +106,13 @@ Tensor constant_pad_nd(const Tensor& self, IntArrayRef pad, const Scalar& value) return output; } -Tensor _pad_circular(const Tensor &self, IntArrayRef padding) { - const auto in_shape = self.sizes(); +Tensor _pad_circular_symint(const Tensor &self, c10::SymIntArrayRef padding) { + const auto in_shape = self.sym_sizes(); const auto ndim = static_cast(in_shape.size()) - 2; TORCH_CHECK(padding.size() + 4 == in_shape.size() * 2, "Invalid padding size, expected ", ndim * 2, " but got ", padding.size()); - DimVector out_shape(in_shape.size()); + c10::SymDimVector out_shape(in_shape.size()); out_shape[0] = in_shape[0]; out_shape[1] = in_shape[1]; @@ -110,18 +131,18 @@ Tensor _pad_circular(const Tensor &self, IntArrayRef padding) { "Negative padding value is resulting in an empty dimension"); } - auto out = self.new_empty(out_shape, self.options()); + auto out = self.new_empty_symint(out_shape, self.options()); // Put original array into the padded array Tensor out_slice = out; Tensor in_slice = self; - constexpr int64_t zero = 0; + const SymInt zero = 0; for (const auto i : c10::irange(ndim)) { const auto dim = ndim - i + 1; const auto pad_l = padding[2*i + 
0]; const auto pad_r = padding[2*i + 1]; - out_slice = out_slice.slice(dim, std::max(pad_l, zero), out_shape[dim] - std::max(pad_r, zero)); - in_slice = in_slice.slice(dim, std::max(-pad_l, zero), in_shape[dim] - std::max(-pad_r, zero)); + out_slice = out_slice.slice_symint(dim, std::max(pad_l, zero), out_shape[dim] - std::max(pad_r, zero)); + in_slice = in_slice.slice_symint(dim, std::max(-pad_l, zero), in_shape[dim] - std::max(-pad_r, zero)); } out_slice.copy_(in_slice); @@ -137,16 +158,16 @@ Tensor _pad_circular(const Tensor &self, IntArrayRef padding) { const auto pad_r = padding[2*i + 1]; if (pad_l > 0) { - out_slice = out.slice(dim, 0, pad_l); - in_slice = out.slice(dim, + out_slice = out.slice_symint(dim, 0, pad_l); + in_slice = out.slice_symint(dim, out_shape[dim] - pad_l - std::max(pad_r, zero), out_shape[dim] - std::max(pad_r, zero)); out_slice.copy_(in_slice); } if (pad_r > 0) { - out_slice = out.slice(dim, out_shape[dim] - pad_r, out_shape[dim]); - in_slice = out.slice(dim, std::max(pad_l, zero), std::max(pad_l, zero) + pad_r); + out_slice = out.slice_symint(dim, out_shape[dim] - pad_r, out_shape[dim]); + in_slice = out.slice_symint(dim, std::max(pad_l, zero), std::max(pad_l, zero) + pad_r); out_slice.copy_(in_slice); } } @@ -154,14 +175,14 @@ Tensor _pad_circular(const Tensor &self, IntArrayRef padding) { return out; } -Tensor _pad_enum(const Tensor &self, IntArrayRef pad, int64_t mode_int, c10::optional value) { +Tensor _pad_enum_symint(const Tensor &self, c10::SymIntArrayRef pad, int64_t mode_int, c10::optional value) { const auto input_dim = self.dim(); TORCH_CHECK(pad.size() % 2 == 0, "Padding length must be divisible by 2"); TORCH_CHECK(static_cast(pad.size()) <= input_dim * 2, "Padding length too large"); auto mode = static_cast(mode_int); if (mode == at::padding_mode::constant) { - return at::constant_pad_nd(self, pad, value.value_or(0.0)); + return at::constant_pad_nd_symint(self, pad, value.value_or(0.0)); } TORCH_CHECK(!value.has_value() || *value == 0, "Padding mode \"", padding_mode_string(mode), @@ -169,23 +190,23 @@ Tensor _pad_enum(const Tensor &self, IntArrayRef pad, int64_t mode_int, c10::opt if (pad.size() == 2 && (input_dim == 2 || input_dim == 3)) { switch (mode) { - case at::padding_mode::reflect: return at::reflection_pad1d(self, pad); - case at::padding_mode::replicate: return at::replication_pad1d(self, pad); - case at::padding_mode::circular: return at::_pad_circular(self, pad); + case at::padding_mode::reflect: return at::reflection_pad1d_symint(self, pad); + case at::padding_mode::replicate: return at::replication_pad1d_symint(self, pad); + case at::padding_mode::circular: return at::_pad_circular_symint(self, pad); default: {} } } else if(pad.size() == 4 && (input_dim == 3 || input_dim == 4)) { switch (mode) { - case at::padding_mode::reflect: return at::reflection_pad2d(self, pad); - case at::padding_mode::replicate: return at::replication_pad2d(self, pad); - case at::padding_mode::circular: return at::_pad_circular(self, pad); + case at::padding_mode::reflect: return at::reflection_pad2d_symint(self, pad); + case at::padding_mode::replicate: return at::replication_pad2d_symint(self, pad); + case at::padding_mode::circular: return at::_pad_circular_symint(self, pad); default: {} } } else if (pad.size() == 6 && (input_dim == 4 || input_dim == 5)) { switch (mode) { - case at::padding_mode::reflect: return at::reflection_pad3d(self, pad); - case at::padding_mode::replicate: return at::replication_pad3d(self, pad); - case at::padding_mode::circular: 
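For readers less familiar with the circular path being sym-intified here: each `(pad_l, pad_r)` pair applies to one dimension, starting from the last, and the padded values wrap around from the opposite end of that dimension. A tiny usage sketch, assuming the `at::pad` overload that takes a mode string (the op whose native implementation is renamed to `pad_symint` below):

```cpp
// Circular padding wraps values around; the last pad pair applies to the last dim.
#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::arange(1, 6, torch::kFloat).view({1, 1, 5});   // values 1..5, shape (N, C, L)

  // pad_l = 2, pad_r = 1 on the last dimension.
  auto y = torch::pad(x, {2, 1}, "circular");
  std::cout << y << "\n";   // expected values: 4 5 1 2 3 4 5 1
}
```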
return at::_pad_circular(self, pad); + case at::padding_mode::reflect: return at::reflection_pad3d_symint(self, pad); + case at::padding_mode::replicate: return at::replication_pad3d_symint(self, pad); + case at::padding_mode::circular: return at::_pad_circular_symint(self, pad); default: {} } } @@ -193,7 +214,7 @@ Tensor _pad_enum(const Tensor &self, IntArrayRef pad, int64_t mode_int, c10::opt "Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now"); } -Tensor pad(const Tensor &self, IntArrayRef pad, c10::string_view mode, c10::optional value) { +Tensor pad_symint(const Tensor &self, c10::SymIntArrayRef pad, c10::string_view mode, c10::optional value) { const auto mode_enum = [&] { if (mode == "reflect") { return at::padding_mode::reflect; @@ -207,7 +228,7 @@ Tensor pad(const Tensor &self, IntArrayRef pad, c10::string_view mode, c10::opti C10_THROW_ERROR(NotImplementedError, c10::str("Unrecognised padding mode ", mode)); }(); - return at::native::_pad_enum(self, pad, static_cast(mode_enum), value); + return at::native::_pad_enum_symint(self, pad, static_cast(mode_enum), value); } }} // namespace at::native diff --git a/aten/src/ATen/native/PadNd.h b/aten/src/ATen/native/PadNd.h deleted file mode 100644 index 37f59acb8a4c..000000000000 --- a/aten/src/ATen/native/PadNd.h +++ /dev/null @@ -1,22 +0,0 @@ -#pragma once - -namespace at { - -enum class padding_mode { - reflect, - replicate, - circular, - constant, -}; - -static inline c10::string_view padding_mode_string(padding_mode m) { - switch (m) { - case padding_mode::reflect: return "reflect"; - case padding_mode::replicate: return "replicate"; - case padding_mode::circular: return "circular"; - case padding_mode::constant: return "constant"; - } - TORCH_CHECK(false, "Invalid padding mode (", static_cast(m), ")"); -} - -} // namespace at diff --git a/aten/src/ATen/native/PixelShuffle.cpp b/aten/src/ATen/native/PixelShuffle.cpp index 41547a10f5fd..e535909a7342 100644 --- a/aten/src/ATen/native/PixelShuffle.cpp +++ b/aten/src/ATen/native/PixelShuffle.cpp @@ -1,9 +1,21 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include +#include -#include #include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + +#include +#include +#include namespace at { namespace native { @@ -52,6 +64,11 @@ Tensor pixel_shuffle_cpu(const Tensor& self, int64_t upscale_factor) { auto output = at::empty({0}, self.options()); auto memory_format = self.suggest_memory_format(); output.resize_(output_sizes, memory_format); + + if (output.numel() == 0) { + return output; + } + auto input = self.contiguous(memory_format); pixel_shuffle_kernel(kCPU, output, input, upscale_factor); @@ -61,6 +78,10 @@ Tensor pixel_shuffle_cpu(const Tensor& self, int64_t upscale_factor) { Tensor pixel_unshuffle_cpu(const Tensor& self, int64_t downscale_factor) { check_pixel_unshuffle_shapes(self, downscale_factor); + if (self.numel() == 0) { + return self.clone(); + } + // Format: (B1, ..., Bn), C, H, W std::vector output_sizes(self.sizes().begin(), self.sizes().end() - 3); output_sizes.insert(output_sizes.end(), @@ -71,6 +92,11 @@ Tensor pixel_unshuffle_cpu(const Tensor& self, int64_t downscale_factor) { auto output = at::empty({0}, self.options()); auto memory_format = self.suggest_memory_format(); output.resize_(output_sizes, memory_format); + + if (output.numel() == 0) { + return output; + } + auto input = self.contiguous(memory_format); pixel_unshuffle_kernel(kCPU, output, input, downscale_factor); @@ -114,7 
+140,8 @@ Tensor math_pixel_shuffle(const Tensor& self, int64_t upscale_factor) { std::vector final_shape(self.sizes().begin(), self_sizes_batch_end); final_shape.insert(final_shape.end(), {oc, oh, ow}); - return input_permuted.reshape(final_shape); + // pixel_shuffle expects to *never* return an alias of the input. + return input_permuted.clone(at::MemoryFormat::Contiguous).view(final_shape); } Tensor math_pixel_unshuffle(const Tensor& self, int64_t downscale_factor) { @@ -154,7 +181,8 @@ Tensor math_pixel_unshuffle(const Tensor& self, int64_t downscale_factor) { std::vector final_shape(self.sizes().begin(), self_sizes_batch_end); final_shape.insert(final_shape.end(), {oc, oh, ow}); - return input_permuted.reshape(final_shape); + // pixel_unshuffle expects to *never* return an alias of the input. + return input_permuted.clone(at::MemoryFormat::Contiguous).view(final_shape); } DEFINE_DISPATCH(pixel_shuffle_kernel); diff --git a/aten/src/ATen/native/PointwiseOps.cpp b/aten/src/ATen/native/PointwiseOps.cpp index a99bc959eb95..8259135ce14a 100644 --- a/aten/src/ATen/native/PointwiseOps.cpp +++ b/aten/src/ATen/native/PointwiseOps.cpp @@ -1,12 +1,17 @@ // Ternary and higher-order pointwise operations +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include -#include -#include -#include +#include +#include +#include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif namespace at { namespace meta { diff --git a/aten/src/ATen/native/Pool.h b/aten/src/ATen/native/Pool.h index 0f3885524a79..0ff4490086b7 100644 --- a/aten/src/ATen/native/Pool.h +++ b/aten/src/ATen/native/Pool.h @@ -58,21 +58,27 @@ template static inline T pooling_output_shape( T inputSize, T kernelSize, T pad, T stride, T dilation, bool ceil_mode) { TORCH_CHECK(stride != 0, "stride should not be zero"); + TORCH_CHECK(pad >= 0, + "pad must be non-negative, but got pad: ", pad); + TORCH_CHECK(pad <= kernelSize / 2, + "pad should be at most half of kernel size, but got pad=", + pad, " and kernel_size=", kernelSize) return pooling_output_shape_pad_lr( inputSize, kernelSize, pad, pad, stride, dilation, ceil_mode); } -inline std::pair pooling_same_mode_padding_lr( - int64_t inputSize, int64_t kernelSize, int64_t stride, int64_t dilation) { +template +std::pair _pooling_same_mode_padding_lr( + T inputSize, T kernelSize, int64_t stride, int64_t dilation) { // NOTE: with strides, the output shape is ceil(inputSize/stride) - auto total_padding = dilation * (kernelSize - 1); + auto total_padding = T(dilation) * (kernelSize - 1); // Prefer symmetric padding if possible if (stride > 2 && (total_padding % 2 == 1)) { // The floor in the output size calculation gives us a little wiggle room auto wiggle_room = inputSize % stride - 1; if (wiggle_room > 0) { - --total_padding; + total_padding = total_padding - 1; } } @@ -80,6 +86,15 @@ inline std::pair pooling_same_mode_padding_lr( return {left, total_padding - left}; } +inline std::pair pooling_same_mode_padding_lr( + int64_t inputSize, int64_t kernelSize, int64_t stride, int64_t dilation) { + return _pooling_same_mode_padding_lr(inputSize, kernelSize, stride, dilation); +} + +inline std::pair pooling_same_mode_padding_lr( + c10::SymInt inputSize, c10::SymInt kernelSize, int64_t stride, int64_t dilation) { + return _pooling_same_mode_padding_lr(inputSize, kernelSize, stride, dilation); +} // AveragePool2d/DilatedMaxPool2d (forward) static inline void @@ -211,10 +226,20 @@ pool3d_shape_check( TORCH_CHECK(ndim == 4 || ndim == 5, fn_name, ": Expected 4D or 5D 
tensor for input, but got: ", input.sizes()); - for (const auto i : c10::irange(1, ndim)) { - TORCH_CHECK(input.size(i) > 0, - fn_name, "Expected input to have non-zero size for non-batch dimensions, but got", - input.sizes(), " with dimension ", i, " being empty."); + for (const auto i : c10::irange(ndim)) { + if (ndim == 5 && i == 0) { + // size of batch-dim can be 0. + continue; + } + TORCH_CHECK( + input.size(i) > 0, + fn_name, + ": Expected input's non-batch dimensions to have positive length," + " but input has a shape of ", + input.sizes(), + " and non-batch dimension ", + input.size(i), + " has length zero!") } if (check_input_size) { // AveragePool3d diff --git a/aten/src/ATen/native/Pooling.cpp b/aten/src/ATen/native/Pooling.cpp index 724c53fdd0c0..fcbe741ab0ea 100644 --- a/aten/src/ATen/native/Pooling.cpp +++ b/aten/src/ATen/native/Pooling.cpp @@ -1,12 +1,31 @@ -#include - -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include -#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include namespace at { namespace native { diff --git a/aten/src/ATen/native/Pow.cpp b/aten/src/ATen/native/Pow.cpp index 4326853a8165..7050524acebf 100644 --- a/aten/src/ATen/native/Pow.cpp +++ b/aten/src/ATen/native/Pow.cpp @@ -1,11 +1,20 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include -#include -#include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace meta { diff --git a/aten/src/ATen/native/QuantizedLinear.cpp b/aten/src/ATen/native/QuantizedLinear.cpp index af7643ec18b6..002bb1adc438 100644 --- a/aten/src/ATen/native/QuantizedLinear.cpp +++ b/aten/src/ATen/native/QuantizedLinear.cpp @@ -1,20 +1,28 @@ -#include -#include -#include -#include -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include -#include +#include #include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include #ifdef USE_FBGEMM diff --git a/aten/src/ATen/native/README.md b/aten/src/ATen/native/README.md index 043e93e332a6..651b21ae0186 100644 --- a/aten/src/ATen/native/README.md +++ b/aten/src/ATen/native/README.md @@ -47,10 +47,9 @@ signature. if one argument is a `FloatTensor`, all other arguments are checked to be `FloatTensor`s). `Tensor` or `Tensor?` must sometimes be annotated to indicate aliasing and mutability. - In general annotations can be defined via the following four situations: - - `Tensor(a)` - `a` is a set of Tensors that may alias to the same data. + In general annotations can be defined via the following situations: + - `Tensor(a)` - `a` is a set of Tensors that may alias to the same data. The set could have a size of one. - `Tensor(a!)` - members of `a` may be written to thus mutating the underlying data. - - `Tensor!` - shorthand for Tensor(fresh\_identifier!) - `Tensor(a! -> a|b)` - Tensor is in set `a`, written to, and after the write is in set `a` AND `b`. For more details on when and why this needs to happen, please see the section on annotations. - `Tensor[]`. 
A `Tensor[]` argument translates into a C++ argument of type `ArrayRef` @@ -445,7 +444,7 @@ By default, ATen code generation will generate device check, which will ensure all the tensor parameters passed to kernel are on the same device. -However, in some cases, checking the device is unncessary, because, +However, in some cases, checking the device is unnecessary, because, e.g., you call a function allows to work on multiple devices. In that case, code generation of the device check can be disabled by adding `device_check: NoCheck` to your function definition. @@ -476,6 +475,28 @@ as `Tensor &`, which 1) allowed changing which `TensorImpl` the `Tensor` itself was not necessary to allow the underlying data to change. (This was like using `T * const` when we wanted `const T*`.) +### `autogen` + +``` +- func: my_op_(Tensor(a!) self) -> Tensor(a!) +... + autogen: my_op, my_op.out +``` + +`autogen` keyword is being used to specify which native function the codegen system should generate +implementations for. +* For an in-place variant of a native function (op name ends with an `_`), we will generate a functional +variant and an out= variant. +* If a functional variant is given, we generate an out= variant. +* We don't support `autogen` for view ops, ops that bypass the dispatcher as well as composite ops. + +We also generate kernels for generated ops, which merely copy and return the result from the base ops. +These generated kernels can be found in `/aten/src/ATen/CompositeViewCopyKernels.cpp`. + +Also notice that for new operators being added to `native_functions.yaml`, if they satisfy the requirements +mentioned above, they should include `autogen` keyword, since functionalization depends on it. We will +enforce this in codegen. + ## Writing an implementation in C++ @@ -534,7 +555,7 @@ Here're steps to follow to decide the right dispatch keyword: Note: to support training, you're required to write a formula in derivatives.yaml since your backend implementations don't support autograd. - - Yes: you're likely calling other `at::` ops in the implemetation. Go to step 2. + - Yes: you're likely calling other `at::` ops in the implementation. Go to step 2. 2. Think about training: does your kernel support autograd? [check autograd support](#will-your-function-be-automatically-differentiable) - Yes: in other words, you're providing a `CompositeImplicitAutograd` kernel which supports both inference and autograd. @@ -588,7 +609,7 @@ It shows for a certain operator, what the computed dispatch table looks like aft 4. TODO: AutogradCPUOrCUDA Note that in native_functions.yaml you can mix using backend keywords and alias keywords above for one op: - - direct registration to backend always has higher precendence than alias + - direct registration to backend always has higher precedence than alias - DO NOT provide multiple alias keywords to the same op: alias keywords have precedence `CompositeExplicitAutograd > CompositeImplicitAutograd`, e.g. adding both `CompositeImplicitAutograd` and `CompositeExplicitAutograd` kernels for one op will completely ignore `CompositeImplicitAutograd` kernel for both inference and training. Thus this will trigger an error when native_functions.yaml is parsed. @@ -606,7 +627,8 @@ the torch._C._nn (marked with `python_module: nn`), torch._C._fft (marked with `python_module: fft`), torch._C._linalg (marked with `python_module: linalg`) objects, torch._C._sparse (marked with `python_module: sparse`) objects, -or torch._C._special (marked with `python_module: special`) objects. 
+torch._C._special (marked with `python_module: special`) objects, +or torch._C._nested (marked with `python_module: nested`) objects. ### Undefined tensor conventions diff --git a/aten/src/ATen/native/RNN.cpp b/aten/src/ATen/native/RNN.cpp index e40caef80e3c..52efc6929f54 100644 --- a/aten/src/ATen/native/RNN.cpp +++ b/aten/src/ATen/native/RNN.cpp @@ -1,8 +1,10 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include -#include -#include +#include +#include +#include +#include #include #include #include @@ -10,6 +12,46 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + int register_linear_params(); namespace at { namespace native { @@ -624,20 +666,20 @@ tpair_of hidden_slice(const tpair_of& t, int64_t start, int64_t // It's a struct only because functional programming in C++ is a pain, and it's easier // to pass around "vtable pointers" than actual function pointers. -void check_rnn_cell_forward_input(const Tensor& input, int64_t input_size) { +void check_rnn_cell_forward_input(const Tensor& input, c10::SymInt input_size) { TORCH_CHECK( - input.size(1) == input_size, - "input has inconsistent input_size: got ", input.size(1), " expected ", input_size); + input.sym_size(1) == input_size, + "input has inconsistent input_size: got ", input.sym_size(1), " expected ", input_size); } -void check_rnn_cell_forward_hidden(const Tensor& input, const Tensor& hx, int64_t hidden_size, int64_t hidden_label) { +void check_rnn_cell_forward_hidden(const Tensor& input, const Tensor& hx, c10::SymInt hidden_size, c10::SymInt hidden_label) { TORCH_CHECK( - input.size(0) == hx.size(0), - "Input batch size ", input.size(0), " doesn't match hidden", hidden_label, " batch size ", hx.size(0)); + input.sym_size(0) == hx.sym_size(0), + "Input batch size ", input.sym_size(0), " doesn't match hidden", hidden_label, " batch size ", hx.sym_size(0)); TORCH_CHECK( - hx.size(1) == hidden_size, - "hidden", hidden_label, " has inconsistent hidden_size: got ", hx.size(1), ", expected ", hidden_size); + hx.sym_size(1) == hidden_size, + "hidden", hidden_label, " has inconsistent hidden_size: got ", hx.sym_size(1), ", expected ", hidden_size); } template @@ -717,7 +759,7 @@ struct GRUCell : Cell { const hidden_type& hidden, const cell_params& params, bool pre_compute_input = false) const override { - if (input.is_cuda()) { + if (input.is_cuda() || input.is_xpu()) { TORCH_CHECK(!pre_compute_input); auto igates = params.matmul_ih(input); auto hgates = params.matmul_hh(hidden); @@ -1465,8 +1507,8 @@ std::tuple lstm_cell( const Tensor& b_hh = c10::value_or_else(b_hh_opt, [] {return Tensor();}); TORCH_CHECK(hx.size() == 2, "lstm_cell expects two hidden states"); - check_rnn_cell_forward_input(input, w_ih.size(1)); - auto hidden_size = w_hh.size(1); + check_rnn_cell_forward_input(input, w_ih.sym_size(1)); + auto hidden_size = w_hh.sym_size(1); check_rnn_cell_forward_hidden(input, hx[0], hidden_size, 0); check_rnn_cell_forward_hidden(input, hx[1], hidden_size, 0); static at::Tensor undefined; diff --git a/aten/src/ATen/native/RNN.h b/aten/src/ATen/native/RNN.h index 2bdb9becf4fa..50aaa0a29c2b 100644 --- a/aten/src/ATen/native/RNN.h +++ 
b/aten/src/ATen/native/RNN.h @@ -1,6 +1,6 @@ #pragma once -#include +#include #include namespace at { namespace native { diff --git a/aten/src/ATen/native/RangeFactories.cpp b/aten/src/ATen/native/RangeFactories.cpp index 038da93456ed..408bf0a27e6f 100644 --- a/aten/src/ATen/native/RangeFactories.cpp +++ b/aten/src/ATen/native/RangeFactories.cpp @@ -1,13 +1,23 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include #include -#include #include -#include +#include +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/ReduceAllOps.cpp b/aten/src/ATen/native/ReduceAllOps.cpp index 31764734b67a..e1d51a1666af 100644 --- a/aten/src/ATen/native/ReduceAllOps.cpp +++ b/aten/src/ATen/native/ReduceAllOps.cpp @@ -1,8 +1,21 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include -#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include #include +#else +#include +#include +#include +#include +#include +#include +#include +#endif namespace at { namespace native { @@ -34,9 +47,16 @@ Tensor max(const Tensor &self) { } Tensor& max_unary_out(const Tensor &self, Tensor& out) { - Tensor tmp_output = at::max(self); - at::native::resize_output(out, tmp_output.sizes()); - out.copy_(tmp_output); + // First check if the devices match (CPU vs GPU) + TORCH_CHECK(self.device() == out.device()); + + TORCH_CHECK(canCast( + typeMetaToScalarType(self.dtype()), + typeMetaToScalarType(out.dtype()))); + + at::native::resize_output(out, {}); + + max_all_stub(self.device().type(), out, self.contiguous()); return out; } diff --git a/aten/src/ATen/native/ReduceOps.cpp b/aten/src/ATen/native/ReduceOps.cpp index 71fb7d94c4be..2fe5eee4a286 100644 --- a/aten/src/ATen/native/ReduceOps.cpp +++ b/aten/src/ATen/native/ReduceOps.cpp @@ -1,21 +1,114 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include +#include #include -#include -#include +#include #include #include #include +#include +#include +#include #include #include -#include -#include #include -#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include @@ -24,9 +117,7 @@ #include #include #include -#include #include -#include #include namespace at { @@ -390,7 +481,6 @@ template void impl_func_cum_ops( const Tensor& self, int64_t dim, - c10::optional dtype, const Tensor& result, Stub& stub) { NoNamesGuard guard; @@ -409,7 +499,7 @@ TORCH_IMPL_FUNC(cumsum_out) int64_t dim, c10::optional dtype, const Tensor& result) { - impl_func_cum_ops(self, dim, dtype, result, cumsum_stub); + 
impl_func_cum_ops(self, dim, result, cumsum_stub); } TORCH_IMPL_FUNC(cumprod_out) @@ -417,7 +507,7 @@ TORCH_IMPL_FUNC(cumprod_out) int64_t dim, c10::optional dtype, const Tensor& result) { - impl_func_cum_ops(self, dim, dtype, result, cumprod_stub); + impl_func_cum_ops(self, dim, result, cumprod_stub); } Tensor reversed_cumsum(const Tensor& w, int64_t dim) { @@ -527,18 +617,22 @@ Tensor cumprod_backward(const Tensor& grad, const Tensor& input, int64_t dim, co auto input_conj = input.conj(); auto output_conj = output.conj(); + // For Composite Compliance, we always choose the slower but composite compliant path. + bool are_inputs_tensors_sublcass = areAnyTensorSubclassLike({input, grad, output}); + const auto w = output_conj * grad; const auto is_zero = input == 0; - if (!(is_zero.any().item())) { - return reversed_cumsum(w, dim).div(input_conj); + if (!are_inputs_tensors_sublcass) { + if (is_zero.any().item() == 0) { + return reversed_cumsum(w, dim).div(input_conj); + } } // If we are not computing a second order gradient, we can use an // O(n) implementation. The derivative of this implementation is _not_ // the second derivative of cumprod. As such, we fallback to a less efficient // O(n^2) implementation when at::GradMode::is_enabled(). - Tensor grad_input = at::zeros(input.sizes(), grad.options()); - if (!at::GradMode::is_enabled()) { + if (!at::GradMode::is_enabled() && !are_inputs_tensors_sublcass) { // n.b. This could probably be implemented much faster with a kernel // From here on we need to use some mask gymnastics to @@ -556,6 +650,7 @@ Tensor cumprod_backward(const Tensor& grad, const Tensor& input, int64_t dim, co // zeros_like(indices).scatter_(dim, indices, 1.) & cumsum == 1 // Note that the logic_and with cumsum == 1 accounts // for the case when there is no first zero + Tensor grad_input = at::zeros(input.sizes(), grad.options()); const auto cumsum = is_zero.cumsum(dim); // case k < z1 @@ -592,6 +687,7 @@ Tensor cumprod_backward(const Tensor& grad, const Tensor& input, int64_t dim, co .mul_(at::gather(output_conj, dim, (first_zero_index - 1).relu_()) .masked_fill_(first_zero_index == 0, 1.)) .masked_select(first_zero_mask)); + return grad_input; } else { // GradMode::enabled() /* If the input is nonzero, we need to calculate the dy_j / dx_k @@ -614,6 +710,15 @@ Tensor cumprod_backward(const Tensor& grad, const Tensor& input, int64_t dim, co dy_j / dx_k = 0, which is done right after the assert. */ + Tensor grad_input; + // For Composite Compliance, we will use + // at::stack on the grad slices, hence the vector. + std::vector grad_inputs; + if (are_inputs_tensors_sublcass) { + grad_inputs.reserve(dim_size); + } else { + grad_input = at::zeros(input.sizes(), grad.options()); + } auto ones_size = input.sizes().vec(); ones_size[dim] = 1; const Tensor ones = at::ones({1}, grad.options()).expand(ones_size); @@ -638,11 +743,16 @@ Tensor cumprod_backward(const Tensor& grad, const Tensor& input, int64_t dim, co // dim_size - k TORCH_CHECK(omitted_products.size(dim) == dim_size - k); - grad_input.select(dim, k).copy_( - at::sum(grad.slice(dim, k) * omitted_products,dim)); + auto grad_slice = at::sum(grad.slice(dim, k) * omitted_products, dim); + if (are_inputs_tensors_sublcass) { + grad_inputs.push_back(grad_slice); + } else { + grad_input.select(dim, k).copy_(grad_slice); + } } + + return are_inputs_tensors_sublcass ? at::stack(grad_inputs, dim) : grad_input; } - return grad_input; } // Implement std::is_nan for MSVC. 
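The cumprod_backward hunks above switch to a Composite Compliance path: when any of input/grad/output is a Tensor subclass, the kernel avoids data-dependent `.item()` checks and in-place writes into a preallocated `grad_input`, and instead collects per-slice gradients and `at::stack`s them along `dim`. Below is a minimal editorial sketch of that accumulation pattern only; it is not part of the patch, and the per-slice computation is replaced by a trivial copy purely for illustration.

```
// Minimal sketch (not from the patch) of the subclass-safe accumulation
// pattern used by cumprod_backward above. The real per-slice gradient
// computation is replaced here by a plain copy of the grad slice.
#include <ATen/ATen.h>
#include <vector>

at::Tensor accumulate_grad(const at::Tensor& grad,
                           int64_t dim,
                           bool inputs_are_subclasses) {
  const int64_t dim_size = grad.size(dim);
  if (!inputs_are_subclasses) {
    // Fast path: preallocate the result and write each slice in place.
    at::Tensor grad_input = at::zeros(grad.sizes(), grad.options());
    for (int64_t k = 0; k < dim_size; ++k) {
      // stand-in for the real slice gradient
      grad_input.select(dim, k).copy_(grad.select(dim, k));
    }
    return grad_input;
  }
  // Composite-compliant path: no in-place mutation of a freshly created
  // tensor; gather the slices and stack them along `dim` at the end.
  std::vector<at::Tensor> slices;
  slices.reserve(dim_size);
  for (int64_t k = 0; k < dim_size; ++k) {
    slices.push_back(grad.select(dim, k));  // stand-in for the real slice gradient
  }
  return at::stack(slices, dim);
}
```

Stacking `dim_size` slices (each missing `dim`) along `dim` reproduces the original shape, which is why the out-of-place path can return `at::stack(grad_inputs, dim)` in place of the preallocated `grad_input`.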
@@ -1079,10 +1189,6 @@ Tensor sum(const Tensor& self, DimnameList dim, bool keepdim, c10::optional opt_dtype) { - return at::sum(input_t, c10::asIntArrayRefSlow(dim), keepdim, opt_dtype); -} - Tensor& sum_out(const Tensor& self, DimnameList dim, bool keepdim, optional opt_dtype, Tensor& result) { return at::sum_out(result, self, dimnames_to_positions(self, dim), keepdim, opt_dtype); @@ -1447,7 +1553,7 @@ inline void allany_impl( if (self.numel() == 0) { result.fill_(identity); } else if (self.numel() == 1) { - result.fill_(self.item().toBool()); + result.copy_(self.view_as(result).to(at::kBool)); } else { auto iter = get_allany_iter(self, result, dims, keepdim); stub(iter.device_type(), iter); @@ -1977,9 +2083,6 @@ bool cpu_equal(const Tensor& self, const Tensor& other) { at::NoNamesGuard guard; TORCH_CHECK(self.device() == other.device(), "Cannot compare two tensors on " "different devices. Got: ", self.device(), " and ", other.device()); - TORCH_CHECK(self.dtype() == other.dtype(), - "Expected object of scalar type ", self.dtype(), " but got scalar type ", - other.dtype(), " for argument 'other'"); if (!self.is_same_size(other)) { return false; } @@ -2012,14 +2115,19 @@ bool cpu_equal(const Tensor& self, const Tensor& other) { return result.load(); } +Tensor value_selecting_reduction_backward(const Tensor& grad, int64_t dim, const Tensor& indices, at::IntArrayRef sizes, bool keepdim) { + return at::native::value_selecting_reduction_backward_symint(grad, dim, indices, c10::fromIntArrayRefSlow(sizes), keepdim); +} + + // max(dim), min(dim), topk(dim), mode(dim), are examples of reduction // functions that select values. value_selecting_reduction_backward is the // backward function for those operators; it propagates the grad to the // specific value locations referred to at `indices`. -Tensor value_selecting_reduction_backward(const Tensor& grad, int64_t dim, const Tensor& indices, IntArrayRef sizes, bool keepdim) { +Tensor value_selecting_reduction_backward_symint(const Tensor& grad, int64_t dim, const Tensor& indices, c10::SymIntArrayRef sizes, bool keepdim) { auto inplace_scatter_if_not_tensor_subclass = [&](const Tensor& grad_out, const Tensor& indices_) { - auto grad_in = at::zeros(sizes, grad_out.options()); + auto grad_in = at::zeros_symint(sizes, grad_out.options()); if (areAnyTensorSubclassLike({grad, indices})) { return grad_in.scatter(dim, indices_, grad_out); } @@ -2038,5 +2146,9 @@ Tensor sum_csr(const Tensor &self, c10::optional dtype) { return self.values().sum(dtype); } +Tensor sum_coo(const Tensor &self, c10::optional dtype) { + return self._values().sum(dtype); +} + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/ReduceOpsUtils.h b/aten/src/ATen/native/ReduceOpsUtils.h index 9db9802ea788..2b46eb683f1c 100644 --- a/aten/src/ATen/native/ReduceOpsUtils.h +++ b/aten/src/ATen/native/ReduceOpsUtils.h @@ -102,7 +102,7 @@ static inline void check_scalar_type_device_layout_equal(const Tensor& out, cons OPTION_TYPE_EQUALITY_CHECK(layout, out.options(), self.options()); } -static inline Tensor integer_upcast(const Tensor& self, optional dtype) { +static inline Tensor integer_upcast(const Tensor& self, c10::optional dtype) { ScalarType scalarType = self.scalar_type(); ScalarType upcast_scalarType = dtype.value_or(at::isIntegralType(scalarType, /*includeBool=*/true) ? 
ScalarType::Long : scalarType); return self.toType(upcast_scalarType); diff --git a/aten/src/ATen/native/ReflectionPad.cpp b/aten/src/ATen/native/ReflectionPad.cpp index db744cc95eb0..3a6ad683d045 100644 --- a/aten/src/ATen/native/ReflectionPad.cpp +++ b/aten/src/ATen/native/ReflectionPad.cpp @@ -1,9 +1,26 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace meta { @@ -965,8 +982,8 @@ TORCH_IMPL_FUNC(reflection_pad3d_out_cpu) auto input = input_.contiguous(); if (batch_mode) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1( - kHalf, input.scalar_type(), "reflection_pad3d_cpu", [&] { + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2( + kHalf, kBFloat16, input.scalar_type(), "reflection_pad3d_cpu", [&] { auto input_data = input.data_ptr(); auto output_data = output.data_ptr(); auto nbatch = input.size(0); @@ -986,8 +1003,8 @@ TORCH_IMPL_FUNC(reflection_pad3d_out_cpu) pad_front); }); } else { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1( - kHalf, input.scalar_type(), "reflection_pad3d_cpu", [&] { + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2( + kHalf, kBFloat16, input.scalar_type(), "reflection_pad3d_cpu", [&] { auto input_data = input.data_ptr(); auto output_data = output.data_ptr(); reflection_pad3d_out_frame( @@ -1043,8 +1060,8 @@ TORCH_IMPL_FUNC(reflection_pad3d_backward_out_cpu)(const Tensor& grad_output, grad_input.zero_(); if (batch_mode) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1( - kHalf, input.scalar_type(), "reflection_pad3d_backward_cpu", [&] { + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2( + kHalf, kBFloat16, input.scalar_type(), "reflection_pad3d_backward_cpu", [&] { reflection_pad3d_backward_out_loop( grad_input.data_ptr(), grad_output_.data_ptr(), @@ -1061,8 +1078,8 @@ TORCH_IMPL_FUNC(reflection_pad3d_backward_out_cpu)(const Tensor& grad_output, pad_front); }); } else { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1( - kHalf, input.scalar_type(), "reflection_pad3d_backward_cpu", [&] { + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2( + kHalf, kBFloat16, input.scalar_type(), "reflection_pad3d_backward_cpu", [&] { reflection_pad3d_backward_out_frame( grad_input.data_ptr(), grad_output_.data_ptr(), diff --git a/aten/src/ATen/native/Repeat.cpp b/aten/src/ATen/native/Repeat.cpp index b6e5c04f7702..c8c4e134929f 100644 --- a/aten/src/ATen/native/Repeat.cpp +++ b/aten/src/ATen/native/Repeat.cpp @@ -1,8 +1,19 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + template static void compute_cpu( index_t* repeat_ptr, @@ -64,11 +75,11 @@ Tensor repeat_interleave( } Tensor repeats_ = repeats; - if (repeats.dim() == 0 || (repeats.dim() == 1 && repeats.size(0) == 1)) { - repeats_ = repeats.reshape({1}).expand({input.size(dim.value())}); + if (repeats.dim() == 0 || (repeats.dim() == 1 && repeats.sym_size(0) == 1)) { + repeats_ = repeats.reshape({1}).expand_symint({input.sym_size(dim.value())}); } else if (repeats.dim() == 1) { TORCH_CHECK( - repeats.size(0) == input.size(dim.value()), + repeats.sym_size(0) == input.sym_size(dim.value()), "repeats must have the same size as input along dim") } else { AT_ERROR("repeats must be 0-dim or 1-dim tensor"); @@ -91,10 +102,17 @@ Tensor repeat_interleave( 
int64_t repeats, c10::optional dim, c10::optional output_size) { - at::Tensor repeats_ = - at::empty(1, self.options().dtype(at::kLong)).fill_(repeats); + at::Tensor repeats_ = at::empty(1, self.options().dtype(at::kLong)).fill_(repeats); return at::native::repeat_interleave(self, repeats_, dim, output_size); } +Tensor repeat_interleave_symint( + const Tensor& self, + c10::SymInt repeats, + c10::optional dim, + c10::optional output_size) { + return at::native::repeat_interleave(self, repeats.guard_int(__FILE__, __LINE__), dim, output_size); + } + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/ReplicationPadding.cpp b/aten/src/ATen/native/ReplicationPadding.cpp index 40fdb788a4ff..d0a4ea919acb 100644 --- a/aten/src/ATen/native/ReplicationPadding.cpp +++ b/aten/src/ATen/native/ReplicationPadding.cpp @@ -1,9 +1,24 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace meta { diff --git a/aten/src/ATen/native/Resize.cpp b/aten/src/ATen/native/Resize.cpp index 08286f3983cc..bd47a25e6960 100644 --- a/aten/src/ATen/native/Resize.cpp +++ b/aten/src/ATen/native/Resize.cpp @@ -1,9 +1,16 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include +#include #include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/Resize.h b/aten/src/ATen/native/Resize.h index c6fe2b3d2146..0bed4232695a 100644 --- a/aten/src/ATen/native/Resize.h +++ b/aten/src/ATen/native/Resize.h @@ -83,20 +83,30 @@ inline TensorImpl* resize_impl_cpu_( return self; } +template +T maybe_convert_symint(c10::SymInt) = delete; + +template <> +inline c10::SymInt maybe_convert_symint(c10::SymInt x) { return x; } + +template <> +inline int64_t maybe_convert_symint(c10::SymInt x) { return x.expect_int(); } + +template static inline void checkInBoundsForStorage( - IntArrayRef size, - IntArrayRef stride, - int64_t storage_offset, + ArrayRef size, + ArrayRef stride, + T storage_offset, const caffe2::TypeMeta data_type, const Storage& new_storage) { - int64_t storage_size_bytes = + T storage_size_bytes = at::detail::computeStorageNbytes(size, stride, data_type.itemsize()); - int64_t storage_offset_bytes = storage_offset * data_type.itemsize(); + T storage_offset_bytes = storage_offset * data_type.itemsize(); if (storage_size_bytes == 0) { // NB: (a tensor with arbitrary 0 dims)'s storage can have any numel. 
return; } - int64_t new_storage_size_bytes = new_storage.nbytes(); + T new_storage_size_bytes = maybe_convert_symint(new_storage.sym_nbytes()); TORCH_CHECK( storage_size_bytes + storage_offset_bytes <= new_storage_size_bytes, "setStorage: sizes ", @@ -114,8 +124,9 @@ static inline void checkInBoundsForStorage( new_storage_size_bytes); } -static inline void checkSetStorage(Tensor& result, Storage storage, int64_t storage_offset, - IntArrayRef size, IntArrayRef stride) { +template +static inline void checkSetStorage(Tensor& result, Storage storage, T storage_offset, + ArrayRef size, ArrayRef stride) { // FIXME: stride should be optional if (stride.data()) { TORCH_CHECK(size.size() == stride.size(), "unequal size length (", size.size(), @@ -151,11 +162,12 @@ static inline void checkSetStorage(Tensor& result, Storage storage, int64_t stor * Set self's sizes, strides, and storage_offset. * (size, stride, storage_offset) must be in bounds for self's storage. */ +template inline void setStrided( const Tensor& self, - IntArrayRef size, - IntArrayRef stride, - int64_t storage_offset) { + ArrayRef size, + ArrayRef stride, + T storage_offset) { TORCH_CHECK(size.size() == stride.size(), "mismatch in length of strides and shape"); for (auto val : stride) { TORCH_CHECK(val >= 0, @@ -169,13 +181,7 @@ inline void setStrided( /* storage offset */ TORCH_CHECK(storage_offset >= 0, "Tensor: invalid storage offset ", storage_offset); - self_->set_storage_offset(storage_offset); - - /* size and stride */ - if (self_->sizes() == size && self_->strides() == stride) { - return; - } - self_->set_sizes_and_strides(size, stride); + self_->set_sizes_and_strides(size, stride, c10::make_optional(storage_offset)); } }} diff --git a/aten/src/ATen/native/ResizeCommon.h b/aten/src/ATen/native/ResizeCommon.h index e814a71c89a8..1de4d74b3af6 100644 --- a/aten/src/ATen/native/ResizeCommon.h +++ b/aten/src/ATen/native/ResizeCommon.h @@ -6,11 +6,12 @@ namespace at { namespace native { -inline int64_t storage_size_for(IntArrayRef size, IntArrayRef stride) { +template +inline T storage_size_for(ArrayRef size, ArrayRef stride) { TORCH_INTERNAL_ASSERT_DEBUG_ONLY(size.size() == stride.size(), "storage_size_for(size, stride) requires that size and stride ", "have the same size as a precondition."); - int64_t storage_size = 1; + T storage_size = 1; for (const auto dim : c10::irange(size.size())) { if (size[dim] == 0) { storage_size = 0; diff --git a/aten/src/ATen/native/RowwisePrune.cpp b/aten/src/ATen/native/RowwisePrune.cpp index 40ae2215cbcc..c27707c4d307 100644 --- a/aten/src/ATen/native/RowwisePrune.cpp +++ b/aten/src/ATen/native/RowwisePrune.cpp @@ -1,8 +1,17 @@ // Copyright 2004-present Facebook. All Rights Reserved. 
+#define TORCH_ASSERT_ONLY_METHOD_OPERATORS -#include +#include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/Scalar.cpp b/aten/src/ATen/native/Scalar.cpp index 7342c4806d44..f8932ea03bb2 100644 --- a/aten/src/ATen/native/Scalar.cpp +++ b/aten/src/ATen/native/Scalar.cpp @@ -1,5 +1,15 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include #include +#else +#include +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/SegmentReduce.cpp b/aten/src/ATen/native/SegmentReduce.cpp index 3e562b7cf859..1e5e28dab86b 100644 --- a/aten/src/ATen/native/SegmentReduce.cpp +++ b/aten/src/ATen/native/SegmentReduce.cpp @@ -1,10 +1,23 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include +#include #include #include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/SobolEngineOps.cpp b/aten/src/ATen/native/SobolEngineOps.cpp index 48366976a2e7..187faeba16a7 100644 --- a/aten/src/ATen/native/SobolEngineOps.cpp +++ b/aten/src/ATen/native/SobolEngineOps.cpp @@ -1,11 +1,21 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include #include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/SobolEngineOpsUtils.cpp b/aten/src/ATen/native/SobolEngineOpsUtils.cpp index ef7cbb1faae9..709d5c06d3c9 100644 --- a/aten/src/ATen/native/SobolEngineOpsUtils.cpp +++ b/aten/src/ATen/native/SobolEngineOpsUtils.cpp @@ -1,4 +1,5 @@ /// This file contains tensor-agnostic SoboleEngine constants +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include /* diff --git a/aten/src/ATen/native/SobolEngineOpsUtils.h b/aten/src/ATen/native/SobolEngineOpsUtils.h index d3d7a362f2e8..495a43ed8a7c 100644 --- a/aten/src/ATen/native/SobolEngineOpsUtils.h +++ b/aten/src/ATen/native/SobolEngineOpsUtils.h @@ -1,6 +1,14 @@ /// This file contains some tensor-agnostic operations to be used in the /// core functions of the `SobolEngine` -#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/SoftMax.cpp b/aten/src/ATen/native/SoftMax.cpp index 21a94d5ed923..0332f57e9e23 100644 --- a/aten/src/ATen/native/SoftMax.cpp +++ b/aten/src/ATen/native/SoftMax.cpp @@ -1,13 +1,36 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include +#include #include #include #include +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include #include @@ -137,10 +160,8 @@ void host_softmax( if (MaskedSoftMax) { TORCH_CHECK(mask_type_.has_value(), "Mask Type should be defined"); int64_t mask_type = mask_type_.value(); - TORCH_CHECK((mask_type == 0) || (mask_type == 1), "Mask Type should be 0 (src_mask) or 1 (src_key_padding_mask)"); - - // TODO: Add support for TxT src_mask - TORCH_CHECK(mask_type != 0, "src_mask not currently supported 
on CPU"); + // If mask_type == 2, then mask_.sizes() must equal input_.sizes() + TORCH_CHECK((mask_type == 0) || (mask_type == 1) || (mask_type == 2), "Mask Type should be 0 (src_mask) or 1 (src_key_padding_mask), or 2 (default_mask)"); } int64_t outer_size = 1; @@ -170,8 +191,22 @@ void host_softmax( output_data_base + outer_idx * outer_stride + inner_idx; bool* mask_data = nullptr; if (MaskedSoftMax) { - mask_data = mask_data_base + outer_idx * outer_stride + inner_idx; - } + // Process mask differently depending on the type: + // For a generic mask of mask_type == 2, mask shape is the same as the input shape, + // so indexing is the same. + auto mask_outer_idx = outer_idx; + if (mask_type_ == 0) { + // Optimized case: attention mask of shape LxL + // outer_idx goes over BxHxL, mask_outer_idx goes over L. + mask_outer_idx = outer_idx % input.size(2); + } else if (mask_type_ == 1) { + // Optimized case: padding mask of shape BxL + // outer_idx goes over BxHxL, mask_outer_idx goes over B. + mask_outer_idx = outer_idx / (input.size(1) * input.size(2)); + } + + mask_data = mask_data_base + mask_outer_idx * outer_stride + inner_idx; + }; // Calc max in softmax dim bool is_meaningful_max = false; @@ -553,15 +588,48 @@ Tensor log_softmax(const Tensor& self, Dimname dim, optional dtype) } Tensor masked_softmax_cpu(const Tensor& input_, const Tensor& mask_, const c10::optional dim_, const c10::optional mask_type_) { - TORCH_CHECK( - input_.sizes() == mask_.sizes(), "Mask shape should match input shape"); + + auto mask = mask_.contiguous(); + auto mask_type = mask_type_; // Mask type might get transformed below + TORCH_CHECK( mask_.scalar_type() == ScalarType::Bool, "Mask should be a boolean tensor"); + if ((mask.dim() != 2) || (input_.dim() != 4)) { + // Mask types 0 and 1 are only allowed for 2D masks and 4D inputs + mask_type = 2; + } + + if (mask_type == 2) { + TORCH_CHECK(input_.sizes() == mask.sizes(), + "For mask_type == 2 mask shape should match input shape") + } else if (mask_type == 1) { + // Padding mask of shape (B, L) + TORCH_CHECK((input_.sizes()[0] == mask.sizes()[0]) && (input_.sizes()[2] == mask.sizes()[1]), + "For mask_type == 1 mask shape should be (B, L)"); + if (dim_ != input_.dim() - 1) { + // We only process padding mask in the optimized way if softmax is applied along the last dimesion, + // otherwise we need to expand the mask into a generic 4D one + mask = mask_.view({input_.sizes()[0], 1, 1, input_.sizes()[2]}); + mask = mask.expand(input_.sizes()).contiguous(); + mask_type = 2; + } + } else if (mask_type == 0) { + // Attention mask of shape (L, L) + TORCH_CHECK((mask.dim() == 2) && (input_.sizes()[2] == mask.sizes()[0]) && (input_.sizes()[2] == mask.sizes()[1]), + "For mask_type == 0 mask shape should be (L, L)"); + if (dim_ != input_.dim() - 1) { + // We only process attention mask in a optimized way if softmax is applied along the last dimesion, + // otherwise we need to expand the mask into a generic 4D one + mask = mask.view({1, 1, input_.sizes()[2], input_.sizes()[2]}); + mask = mask.expand(input_.sizes()).contiguous(); + mask_type = 2; + } + } + Tensor output = at::empty_like(input_, input_.options()); auto input = input_.contiguous(); - auto mask = mask_.contiguous(); int64_t dim = dim_.has_value() ? 
dim_.value() : input.dim() - 1; dim = maybe_wrap_dim(dim, input_.dim()); @@ -575,7 +643,7 @@ Tensor masked_softmax_cpu(const Tensor& input_, const Tensor& mask_, const c10:: scalar_t, false /* LogSoftMax */, true /* MaskedSoftMax */>( - output, input, dim, mask.data_ptr(), mask_type_); + output, input, dim, mask.data_ptr(), mask_type); }); return output; } diff --git a/aten/src/ATen/native/Sorting.cpp b/aten/src/ATen/native/Sorting.cpp index fb4bdd87b7a7..3b50d7744aa2 100644 --- a/aten/src/ATen/native/Sorting.cpp +++ b/aten/src/ATen/native/Sorting.cpp @@ -1,8 +1,16 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include #include #include #include +#include +#include +#include +#include +#include #include #include #include @@ -11,6 +19,32 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include namespace at { @@ -227,7 +261,7 @@ Tensor quantile_compute( // synchronizing an accelerator with the CPU if (self.device().is_cpu()) { auto all_q_in_range = q.ge(0).logical_and_(q.le(1)).all(); - TORCH_CHECK(at::equal(all_q_in_range, all_q_in_range.new_ones({})), + TORCH_CHECK(at::is_scalar_tensor_true(all_q_in_range), "quantile() q values must be in the range [0, 1]"); } diff --git a/aten/src/ATen/native/SpectralOps.cpp b/aten/src/ATen/native/SpectralOps.cpp index d6389608a9e3..124c2d06d9e8 100644 --- a/aten/src/ATen/native/SpectralOps.cpp +++ b/aten/src/ATen/native/SpectralOps.cpp @@ -1,15 +1,67 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include #include -#include -#include +#include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include -#include -#include namespace at { namespace native { @@ -147,7 +199,7 @@ Tensor fft_c2r(c10::string_view function_name, " expects a floating point output tensor, but got ", out.scalar_type()); input = promote_tensor_fft(input, /*require_complex=*/true); const auto input_dim = input.dim(); - const auto dim = maybe_wrap_dim(unwrapped_dim, input_dim); + const auto dim = maybe_wrap_dim(unwrapped_dim, input_dim, /*wrap_scalar=*/false); const auto n = n_opt.value_or(2*(input.sizes()[dim] - 1)); TORCH_CHECK(n >= 1, "Invalid number of data points (", n, ") specified"); if (n_opt) { @@ -156,7 +208,7 @@ Tensor fft_c2r(c10::string_view function_name, const auto norm = norm_from_string(norm_str, forward); if (forward) { // FIXME: _fft does not support complex_output=false with inverse=false - input = at::conj(input); + input = input.conj(); } return fft_c2r_maybe_out( function_name, out, input, dim, static_cast(norm), n); @@ -173,7 +225,7 @@ Tensor fft_r2c(c10::string_view function_name, " expects a complex output tensor, but got ", out.scalar_type()); input = promote_tensor_fft(input); const auto input_dim = input.dim(); - const auto dim = 
maybe_wrap_dim(unwrapped_dim, input_dim); + const auto dim = maybe_wrap_dim(unwrapped_dim, input_dim, /*wrap_scalar=*/false); const auto n = n_opt.value_or(input.sizes()[dim]); TORCH_CHECK(n >= 1, "Invalid number of data points (", n, ") specified"); if (n_opt) { @@ -191,7 +243,7 @@ Tensor fft_r2c(c10::string_view function_name, if (!forward) { // FIXME: _fft_r2c doesn't support native r2c IFFT - return out.defined() ? at::conj_physical_out(out, ret) : at::conj(ret); + return out.defined() ? at::conj_physical_out(out, ret) : ret.conj(); } else { return ret; } @@ -205,7 +257,7 @@ Tensor fft_c2c(c10::string_view function_name, TORCH_CHECK(input.is_complex(), function_name, " expects a complex input tensor, but got ", input.scalar_type()); const auto input_dim = input.dim(); - const auto dim = maybe_wrap_dim(unwrapped_dim, input_dim); + const auto dim = maybe_wrap_dim(unwrapped_dim, input_dim, /*wrap_scalar=*/false); const auto n = n_opt.value_or(input.sizes()[dim]); TORCH_CHECK(n >= 1, "Invalid number of data points (", n, ") specified"); if (n_opt) { @@ -232,7 +284,7 @@ ShapeAndDims canonicalize_fft_shape_and_dim_args( if (dim) { ret.dim.resize(dim->size()); std::copy(dim->begin(), dim->end(), ret.dim.begin()); - maybe_wrap_dims(ret.dim, input_dim); + maybe_wrap_dims(ret.dim, input_dim, /*wrap_scalars=*/false); // Check dims are unique DimVector copy = ret.dim; @@ -520,7 +572,7 @@ static Tensor fft_hfftn_impl( } const auto last_dim = desc.dim.back(); - tmp = at::conj(tmp); + tmp = tmp.conj(); return fft_c2r_maybe_out(fname, out, tmp, last_dim, norm, last_dim_size); } @@ -558,7 +610,7 @@ static Tensor fft_ihfftn_impl( const auto last_dim = desc.dim.back(); auto tmp = at::_fft_r2c(x, last_dim, norm, /*onesided=*/true); if (desc.dim.size() == 1) { - return out.defined() ? at::conj_physical_out(tmp, out) : at::conj(tmp); + return out.defined() ? at::conj_physical_out(tmp, out) : tmp.conj(); } tmp = at::conj_physical(tmp); @@ -698,7 +750,7 @@ DimVector default_alldims(const Tensor& self, at::OptionalIntArrayRef dim_opt) { IntArrayRef dim_unwrapped = *dim_opt; dim.resize(dim_unwrapped.size()); for (const auto i : c10::irange(dim.size())) { - dim[i] = maybe_wrap_dim(dim_unwrapped[i], self.dim()); + dim[i] = maybe_wrap_dim(dim_unwrapped[i], self.dim(), /*wrap_scalars=*/false); } } else { dim.resize(self.dim()); @@ -796,20 +848,17 @@ Tensor stft(const Tensor& self, const int64_t n_fft, const optional hop const bool return_complex = return_complexOpt.value_or( self.is_complex() || (window.defined() && window.is_complex())); if (!return_complex) { - if (!return_complexOpt.has_value()) { - TORCH_WARN_ONCE( - "stft will soon require the return_complex parameter be given for real inputs, " - "and will further require that return_complex=True in a future PyTorch release." - ); - } + TORCH_CHECK(return_complexOpt.has_value(), + "stft requires the return_complex parameter be given for real inputs, " + "and will further require that return_complex=True in a future PyTorch release."); - // TORCH_WARN_ONCE( - // "stft with return_complex=False is deprecated. In a future pytorch " - // "release, stft will return complex tensors for all inputs, and " - // "return_complex=False will raise an error.\n" - // "Note: you can still call torch.view_as_real on the complex output to " - // "recover the old return format."); + TORCH_WARN_ONCE( + "stft with return_complex=False is deprecated. 
In a future pytorch " + "release, stft will return complex tensors for all inputs, and " + "return_complex=False will raise an error.\n" + "Note: you can still call torch.view_as_real on the complex output to " + "recover the old return format."); } if (!at::isFloatingType(self.scalar_type()) && !at::isComplexType(self.scalar_type())) { @@ -973,12 +1022,10 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const optional ho const auto hop_length = hop_lengthOpt.value_or(n_fft >> 2); const auto win_length = win_lengthOpt.value_or(n_fft); - if (!self.is_complex()) { - TORCH_WARN_ONCE( - "istft will require a complex-valued input tensor in a future PyTorch release. " - "Matching the output from stft with return_complex=True. "); - } - Tensor input = self.is_complex() ? self.is_conj() ? at::view_as_real(self.resolve_conj()) : at::view_as_real(self) : self; + TORCH_CHECK(self.is_complex(), + "istft requires a complex-valued input tensor matching the " + "output from stft with return_complex=True."); + Tensor input = at::view_as_real(self.resolve_conj()); const auto input_dim = input.dim(); const auto n_frames = input.size(-2); const auto fft_size = input.size(-3); @@ -1006,13 +1053,13 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const optional ho if (onesided) { if (n_fft / 2 + 1 != fft_size) { std::ostringstream ss; - REPR(ss) << ": expected the frequency dimension (3rd to the last) of the input tensor to match n_fft / 2 + 1 when onsided=True, but got " << fft_size; + REPR(ss) << ": expected the frequency dimension (3rd to the last) of the input tensor to match n_fft / 2 + 1 when onesided=True, but got " << fft_size; AT_ERROR(ss.str()); } } else { if (n_fft != fft_size) { std::ostringstream ss; - REPR(ss) << ": expected the frequency dimension (3rd to the last) of the input tensor to match n_fft when onsided=False, but got " << fft_size; + REPR(ss) << ": expected the frequency dimension (3rd to the last) of the input tensor to match n_fft when onesided=False, but got " << fft_size; AT_ERROR(ss.str()); } } @@ -1048,7 +1095,7 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const optional ho input = input.unsqueeze(0); } - input = as_complex(input.transpose(1, 2)); // size: (channel, n_frames, fft_size, 2) + input = as_complex(input.transpose(1, 2)); // size: (channel, n_frames, fft_size) const fft_norm_mode norm = normalized ? 
fft_norm_mode::by_root_n : fft_norm_mode::by_n; if (return_complex) { @@ -1065,26 +1112,23 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const optional ho TORCH_INTERNAL_ASSERT(input.size(2) == n_fft); Tensor y_tmp = input * window_tmp.view({1, 1, n_fft}); // size: (channel, n_frames, n_fft) - y_tmp = y_tmp.transpose(1, 2); // size: (channel, n_fft, frame) - - Tensor y = at::col2im(y_tmp, - /*output_size*/ {1, (n_frames - 1) * hop_length + n_fft}, - /*kernel_size*/ {1, n_fft}, - /*dilation*/ {1, 1}, - /*padding*/ {0, 0}, - /*stride*/ {1, hop_length} - ).squeeze(2); - window_tmp = window_tmp.pow(2).view({n_fft, 1}).repeat({1, n_frames}).unsqueeze(0); // size: (1, n_fft, n_frames) - Tensor window_envelop = at::col2im(window_tmp, - /*output_size*/ {1, (n_frames - 1) * hop_length + n_fft}, - /*kernel_size*/ {1, n_fft}, - /*dilation*/ {1, 1}, - /*padding*/ {0, 0}, - /*stride*/ {1, hop_length} - ).squeeze(2); // size: (1, 1, expected_output_signal_len) - - TORCH_INTERNAL_ASSERT(expected_output_signal_len == y.size(2)); - TORCH_INTERNAL_ASSERT(expected_output_signal_len == window_envelop.size(2)); + + Tensor y = at::unfold_backward( + y_tmp, + /*input_sizes=*/{y_tmp.size(0), expected_output_signal_len}, + /*dim=*/1, + /*size=*/n_fft, + /*step=*/hop_length); + window_tmp = window_tmp.pow(2).expand({1, n_frames, n_fft}); // size: (1, n_frames, n_fft) + Tensor window_envelop = at::unfold_backward( + window_tmp, + /*input_sizes=*/{1, expected_output_signal_len}, + /*dim=*/1, + /*size=*/n_fft, + /*step=*/hop_length); // size: (1, expected_output_signal_len) + + TORCH_INTERNAL_ASSERT(expected_output_signal_len == y.size(1)); + TORCH_INTERNAL_ASSERT(expected_output_signal_len == window_envelop.size(1)); // We need to trim the front padding away if centered const auto start = center ? 
n_fft / 2 : 0; @@ -1098,16 +1142,16 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const optional ho return expected_output_signal_len; }(); - y = y.slice(2, start, end, 1); - window_envelop = window_envelop.slice(2, start, end, 1); - const auto window_envelop_lowest = window_envelop.abs().min().item().toDouble(); - if (window_envelop_lowest < 1e-11) { + y = y.slice(1, start, end, 1); + window_envelop = window_envelop.slice(1, start, end, 1); + const auto window_envelop_lowest = window_envelop.abs().min().lt(1e-11); + if (at::is_scalar_tensor_true(window_envelop_lowest)) { std::ostringstream ss; REPR(ss) << "window overlap add min: " << window_envelop_lowest; AT_ERROR(ss.str()); } - y = (y / window_envelop).squeeze(1); // size: (channel, expected_output_signal_len) + y = (y / window_envelop); // size: (channel, expected_output_signal_len) if (input_dim == 3) { y = y.squeeze(0); } @@ -1121,7 +1165,7 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const optional ho } return y; - #undef REPR +#undef REPR } Tensor istft(const Tensor& self, const int64_t n_fft, const optional hop_lengthOpt, @@ -1138,7 +1182,7 @@ void _fft_fill_with_conjugate_symmetry_(const Tensor& input, IntArrayRef dim_) { const auto input_strides = input.strides(); TORCH_CHECK(dim_.size() > 0); DimVector dim(dim_.begin(), dim_.end()); - at::maybe_wrap_dims(dim, input_strides.size()); + at::maybe_wrap_dims(dim, input_strides.size(), /*wrap_scalars=*/false); if (input.numel() == 0 || input_sizes[dim.back()] <= 2) { return; // No elements need writing diff --git a/aten/src/ATen/native/SpmmReduce.cpp b/aten/src/ATen/native/SpmmReduce.cpp deleted file mode 100644 index cdbce3fe4b36..000000000000 --- a/aten/src/ATen/native/SpmmReduce.cpp +++ /dev/null @@ -1,32 +0,0 @@ -#include -#include -#include - -namespace at { namespace native { - -Tensor spmm_sum_cpu( - const Tensor& rowptr, - const Tensor& col, - const c10::optional& optional_value, - const Tensor& mat) { - TORCH_CHECK(rowptr.dim() == 1); - TORCH_CHECK(col.dim() == 1); - if (optional_value.has_value()) { - TORCH_CHECK(optional_value.value().dim() == 1); - TORCH_CHECK(optional_value.value().size(0) == col.size(0)); - } - TORCH_CHECK(mat.dim() >= 2); - - Tensor other = mat.contiguous(); - - auto sizes = other.sizes().vec(); - sizes[other.dim() - 2] = rowptr.numel() - 1; - Tensor result = at::empty(sizes, other.options()); - spmm_sum_stub(kCPU, result, rowptr, col, optional_value, other); - - return result; -} - -DEFINE_DISPATCH(spmm_sum_stub); - -}} // at::native diff --git a/aten/src/ATen/native/SpmmReduce.h b/aten/src/ATen/native/SpmmReduce.h deleted file mode 100644 index ac34bf0090de..000000000000 --- a/aten/src/ATen/native/SpmmReduce.h +++ /dev/null @@ -1,12 +0,0 @@ -#pragma once - -#include -#include - -namespace at { namespace native { - -using spmm_sum_fn = void(*)(const Tensor&, const Tensor&, const Tensor&, const c10::optional&, const Tensor&); -DECLARE_DISPATCH(spmm_sum_fn, spmm_sum_stub); - -}} // at::native - diff --git a/aten/src/ATen/native/SummaryOps.cpp b/aten/src/ATen/native/SummaryOps.cpp index cf86225460ea..ae0b38c96efa 100644 --- a/aten/src/ATen/native/SummaryOps.cpp +++ b/aten/src/ATen/native/SummaryOps.cpp @@ -1,10 +1,17 @@ // Returns the frequency of elements of input non-negative integer tensor. 
+#define TORCH_ASSERT_ONLY_METHOD_OPERATORS -#include +#include #include #include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif namespace at { namespace native { @@ -20,15 +27,15 @@ Tensor _bincount_cpu_template( AT_ERROR("minlength should be >= 0"); } if (self.dim() == 1 && self.numel() == 0) { - return native::zeros({minlength}, kLong); + return at::zeros({minlength}, kLong); } if (self.dim() != 1 || *self.min().data_ptr() < 0) { AT_ERROR("bincount only supports 1-d non-negative integral inputs."); } bool has_weights = weights.defined(); - if (has_weights && weights.size(0) != self.size(0)) { - AT_ERROR("input and weights should have the same length"); + if (has_weights && (weights.dim() != 1 || weights.size(0) != self.size(0))) { + AT_ERROR("weights should be 1-d and have the same length as input"); } Tensor output; @@ -38,7 +45,7 @@ Tensor _bincount_cpu_template( const input_t* self_p = self.data_ptr(); if (has_weights) { - output = native::zeros( + output = at::zeros( {nbins}, optTypeMetaToScalarType(weights.options().dtype_opt()), weights.options().layout_opt(), @@ -50,7 +57,7 @@ Tensor _bincount_cpu_template( output_p[self_p[i]] += weights_p[i]; } } else { - output = native::zeros({nbins}, kLong); + output = at::zeros({nbins}, kLong); int64_t* output_p = output.data_ptr(); for (const auto i : c10::irange(self_size)) { output_p[self_p[i]] += 1L; diff --git a/aten/src/ATen/native/TensorAdvancedIndexing.cpp b/aten/src/ATen/native/TensorAdvancedIndexing.cpp index 951d9eeb18fa..7d23413c6560 100644 --- a/aten/src/ATen/native/TensorAdvancedIndexing.cpp +++ b/aten/src/ATen/native/TensorAdvancedIndexing.cpp @@ -47,31 +47,93 @@ // ...) // // where & and * represent the C-style address-of and indirection operations. 
+// #define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include -#include -#include +#include +#include +#include +#include #include #include -#include -#include -#include +#include +#include +#include +#include +#include +#include +#include #include #include #include #include +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include #include -#include #include #include @@ -416,6 +478,7 @@ DEFINE_DISPATCH(put_stub); DEFINE_DISPATCH(take_stub); DEFINE_DISPATCH(masked_fill_stub); REGISTER_NO_CPU_DISPATCH(index_put_with_sort_stub); +REGISTER_NO_CPU_DISPATCH(index_put_with_sort_quantized_stub); DEFINE_DISPATCH(masked_select_serial_stub); DEFINE_DISPATCH(masked_select_stub); DEFINE_DISPATCH(masked_scatter_stub); @@ -428,6 +491,10 @@ DEFINE_DISPATCH(scatter_reduce_stub); DEFINE_DISPATCH(scatter_scalar_reduce_stub); DEFINE_DISPATCH(scatter_reduce_two_stub); +DEFINE_DISPATCH(scatter_add_expanded_index_stub); +DEFINE_DISPATCH(scatter_reduce_expanded_index_stub); +DEFINE_DISPATCH(gather_expanded_index_stub); + static bool all_strides_match(TensorList tensors) { TORCH_CHECK(tensors.size() >= 1); auto strides = tensors[0].strides(); @@ -521,9 +588,9 @@ AdvancedIndex::AdvancedIndex(const Tensor& src, TensorList indices_list) } } - // For CUDA tensors, force all index tensors to have the same striding to - // simplify the CUDA kernel. - if (indices.size() >= 2 && this->src.device().type() == kCUDA) { + // For CUDA/MPS tensors, force all index tensors to have the same striding to + // simplify the CUDA/MPS kernel. 
+ if (indices.size() >= 2 && (this->src.device().type() == kCUDA || this->src.device().type() == kMPS)) { if (!all_strides_match(indices)) { for (auto & indice : indices) { indice = indice.contiguous(); @@ -1095,8 +1162,6 @@ Tensor & index_select_out_cpu_(const Tensor & self, int64_t dim, const Tensor & TORCH_CHECK(index.scalar_type() == ScalarType::Long || index.scalar_type() == ScalarType::Int, "index_select(): Expected dtype int32 or int64 for index"); TORCH_CHECK(self.scalar_type() == result.scalar_type(), "index_select(): self and result must have the same scalar type"); - TORCH_CHECK(dim == 0 || dim < self.dim(), - "index_select(): Indexing dim ", dim, " is out of bounds of tensor"); at::assert_no_internal_overlap(result); at::assert_no_overlap(result, self); at::assert_no_overlap(result, index); @@ -1258,13 +1323,17 @@ Tensor index_select_quantized_cpu_(const Tensor & self, int64_t dim, const Tenso return at::native::index_select_out_cpu_(self, dim, index, result); } -Tensor index_select_backward(const Tensor& grad, IntArrayRef self_sizes, int64_t dim, const Tensor& index) { +Tensor index_select_backward(const Tensor& grad, at::IntArrayRef self_sizes, int64_t dim, const Tensor& index) { + return at::native::index_select_backward_symint(grad, c10::fromIntArrayRefSlow(self_sizes), dim, index); +} + +Tensor index_select_backward_symint(const Tensor& grad, c10::SymIntArrayRef self_sizes, int64_t dim, const Tensor& index) { // for composite compliance, use out-of-place variant of // `index_add` if index tensor is a Tensor Subclass. if (isTensorSubclassLike(index)) { - return grad.new_zeros(self_sizes, grad.options()).index_add(dim, index, grad); + return grad.new_zeros_symint(self_sizes, grad.options()).index_add(dim, index, grad); } - return grad.new_zeros(self_sizes, grad.options()).index_add_(dim, index, grad); + return grad.new_zeros_symint(self_sizes, grad.options()).index_add_(dim, index, grad); } Tensor & index_fill_(Tensor & self, int64_t dim, const Tensor & index, const Scalar& source) { @@ -1359,14 +1428,18 @@ TORCH_IMPL_FUNC(gather_out) (const Tensor& self, int64_t dim, const Tensor& index, bool sparse_grad, const Tensor& result) { if (index.numel() == 0) return; dim = at::maybe_wrap_dim(dim, self.dim()); - gather_stub(result.device().type(), result, self, dim, index); + if (can_use_expanded_index_path(result, dim, index, self)) { + gather_expanded_index_stub(result.device().type(), result, self, index); + } else { + gather_stub(result.device().type(), result, self, dim, index); + } } Tensor gather_backward(const Tensor& grad, const Tensor& self, int64_t dim, const Tensor& index, bool sparse_grad) { if (sparse_grad) { return at::_gather_sparse_backward(self, dim, index, grad); } - auto result = grad.new_zeros(self.sizes()); + auto result = grad.new_zeros_symint(self.sym_sizes()); // for composite compliance, use out-of-place variant of // `scatter_add` if index tensor is a Tensor Subclass. 
if (isTensorSubclassLike(index)) { @@ -1504,18 +1577,107 @@ TORCH_IMPL_FUNC(scatter_add) if (index.numel() == 0) return; - if (globalContext().deterministicAlgorithms() && self.device().type() == DeviceType::CUDA && self.dim() == 1) { - TORCH_CHECK(index.dim() == 1 && src.dim() == 1, "index and src should be 1D tensors when self is a 1D tensor, " - "but their dims are ", index.dim(), " and ", src.dim(), ", respectively"); - TORCH_CHECK(index.numel() == src.numel(), "index and src should have same number of elements for 1D tensors, " - "but got ", index.numel(), " versus ", src.numel()); - TORCH_CHECK(dim == 0, "dim should be zero for 1D self tensor, but got ", dim); - torch::List> indices; - indices.reserve(1); - indices.push_back(index); - mut_out.index_put_(indices, src, true); + // See Note [Enabling Deterministic Operations] + // Avoid gpuAtomicAdd for CUDA if deterministic mode is turned on + if (globalContext().deterministicAlgorithms() && self.device().type() == DeviceType::CUDA) { + if (self.dim() == 1) { + // TODO: Pretty sure these checks can be removed, since they're done in + // `scatter_meta_impl`, which I think is always called before this + TORCH_CHECK(index.dim() == 1 && src.dim() == 1, "index and src should be 1D tensors when self is a 1D tensor, " + "but their dims are ", index.dim(), " and ", src.dim(), ", respectively"); + TORCH_CHECK(index.numel() == src.numel(), "index and src should have same number of elements for 1D tensors, " + "but got ", index.numel(), " versus ", src.numel()); + TORCH_CHECK(dim == 0, "dim should be zero for 1D self tensor, but got ", dim); + torch::List> indices; + indices.reserve(1); + indices.push_back(index); + mut_out.index_put_(indices, src, true); + } else { + Tensor mut_out_contig = mut_out.contiguous(); + + auto index_coords_sizes = index.sizes().vec(); + index_coords_sizes.push_back(self.dim()); + auto index_coords = at::empty( + index_coords_sizes, + at::TensorOptions().dtype(at::ScalarType::Long).device(self.device())); + + for (int64_t dim_other = 0; dim_other < self.dim(); dim_other++) { + if (dim_other == dim) { + continue; + } + auto dim_coord_vals = at::arange( + index.size(dim_other), + at::TensorOptions().device(self.device())); + + for (int64_t dim_unsqueeze = 0; dim_unsqueeze < self.dim() - 1; dim_unsqueeze++) { + dim_coord_vals = dim_coord_vals.unsqueeze((dim_unsqueeze >= dim_other) ? -1 : 0); + } + + auto view_sizes = index.sizes().vec(); + view_sizes.push_back(1); + auto view_strides = index_coords.strides().vec(); + view_strides[self.dim()] = self.dim(); + + at::as_strided( + index_coords, + view_sizes, + view_strides, + dim_other + ).copy_(dim_coord_vals.unsqueeze(-1)); + } + + auto view_sizes = index.sizes().vec(); + view_sizes.push_back(1); + auto view_strides = index_coords.strides().vec(); + view_strides[self.dim()] = self.dim(); + + at::as_strided( + index_coords, + view_sizes, + view_strides, + dim + ).copy_(index.unsqueeze(-1)); + + Tensor index_coords_flat = index_coords.flatten(0, -2); + + // Copy mut_out_contig's strides into a tensor + // TODO: Is there a utility function that already does this? 
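The 1-D branch above rewrites scatter_add as an accumulating index_put_; a small self-contained sketch of that equivalence (illustrative values, public API only, run on CPU for simplicity):

#include <iostream>
#include <torch/torch.h>

int main() {
  auto self  = torch::zeros({5});
  auto index = torch::tensor({0, 1, 1, 4}, torch::kLong);
  auto src   = torch::tensor({1.f, 2.f, 3.f, 4.f});
  // Reference result via scatter_add_.
  auto a = self.clone().scatter_add_(0, index, src);
  // Same accumulation expressed as index_put_ with accumulate=true,
  // which is what the deterministic CUDA path above relies on.
  c10::List<c10::optional<torch::Tensor>> indices;
  indices.push_back(index);
  auto b = self.clone().index_put_(indices, src, /*accumulate=*/true);
  std::cout << a << "\n" << b << "\n";  // both: [1, 5, 0, 0, 4]
}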
+ IntArrayRef mut_out_contig_strides = mut_out_contig.strides(); + Tensor coord_strides = at::empty( + {mut_out_contig.dim()}, + TensorOptions().dtype(at::ScalarType::Long).device(at::kCPU)); + std::memcpy( + coord_strides.data_ptr(), + mut_out_contig_strides.data(), + coord_strides.nbytes()); + coord_strides = coord_strides.to(mut_out_contig.device()); + + // `index_flat` contains the 1-D indices corresponding with the + // flattened `mut_out` + Tensor index_flat = (index_coords_flat * coord_strides).sum({-1}); + Tensor mut_out_flat = mut_out_contig.flatten(); + Tensor src_flat = at::as_strided( + src, + index.sizes(), + src.strides() + ).flatten(); + + torch::List> indices; + indices.reserve(1); + indices.push_back(index_flat); + + mut_out_flat.index_put_(indices, src_flat, true); + + if (!mut_out.is_contiguous()) { + mut_out.copy_(mut_out_flat.reshape(mut_out.sizes())); + } + } } else { - scatter_add_stub(self.device().type(), mut_out, dim, index, src); + if (can_use_expanded_index_path(mut_out, dim, index, src)) { + scatter_add_expanded_index_stub(self.device().type(), mut_out, index, src); + } else { + scatter_add_stub(self.device().type(), mut_out, dim, index, src); + } } } @@ -1530,13 +1692,27 @@ TORCH_IMPL_FUNC(scatter_reduce_two) // See issue https://github.com/pytorch/pytorch/issues/74770 TORCH_WARN_ONCE("scatter_reduce() is in beta and the API may change at any time."); + dim = at::maybe_wrap_dim(dim, self.dim()); + auto mut_out = const_cast(out); + + if (!self.is_same(mut_out)) { + mut_out.copy_(self); + } + + const auto op = meta::get_operator_enum(reduce, true); + + if (can_use_expanded_index_path(mut_out, dim, index, src)) { + scatter_reduce_expanded_index_stub(self.device().type(), mut_out, index, src, op, include_self); + return; + } + scatter_impl(self, dim, index, src, out, scatter_reduce_two_stub, scatter_stub, reduce, include_self); - if (meta::get_operator_enum(reduce, true) == SCATTER_GATHER_OP::REDUCE_MEAN) { + if (op == SCATTER_GATHER_OP::REDUCE_MEAN) { auto ones = at::ones_like(src); auto count = include_self ? 
at::ones_like(out) : at::zeros_like(out); count.scatter_add_(dim, index, ones); diff --git a/aten/src/ATen/native/TensorAdvancedIndexing.h b/aten/src/ATen/native/TensorAdvancedIndexing.h index a0c282d550e4..01ae7edf036a 100644 --- a/aten/src/ATen/native/TensorAdvancedIndexing.h +++ b/aten/src/ATen/native/TensorAdvancedIndexing.h @@ -5,6 +5,7 @@ #include #include #include +#include namespace at { struct TensorIterator; @@ -15,7 +16,7 @@ namespace at { namespace native { enum class SCATTER_GATHER_OP: uint8_t {REDUCE_ADD, REDUCE_MULTIPLY, REDUCE_MAXIMUM, REDUCE_MINIMUM, REDUCE_MEAN}; using index_put_with_sort_fn = void(*)(Tensor &, const c10::List> &, const Tensor &, bool accumulate, bool unsafe); - +using index_put_with_sort_quantized_fn = void(*)(Tensor& self, const c10::List>& indices, const Tensor& value, double scale, int zero_point, bool unsafe); using gather_fn = void (*)(const Tensor & result, const Tensor & self, int64_t dim, const Tensor & index); using scatter_fn = void(*)(const Tensor& self, int64_t dim, const Tensor& index, const Tensor& src); using scatter_fill_fn = void(*)(const Tensor& self, int64_t dim, const Tensor& index, const Scalar& src); @@ -28,7 +29,7 @@ using scatter_reduce_two_fn = void(*)(const Tensor& self, const int64_t dim, con const Tensor& src, const SCATTER_GATHER_OP& reduce); DECLARE_DISPATCH(index_put_with_sort_fn, index_put_with_sort_stub); - +DECLARE_DISPATCH(index_put_with_sort_quantized_fn, index_put_with_sort_quantized_stub); DECLARE_DISPATCH(gather_fn, gather_stub); DECLARE_DISPATCH(scatter_fn, scatter_stub); DECLARE_DISPATCH(scatter_fill_fn, scatter_fill_stub); @@ -39,4 +40,50 @@ DECLARE_DISPATCH(scatter_reduce_two_fn, scatter_reduce_two_stub); TORCH_API Tensor& index_out(Tensor& result, const Tensor & self, const c10::List>& indices); +// fast paths for GNN usage +template +bool can_use_expanded_index_path(const Tensor& self, int64_t dim, const Tensor& index, const Tensor& src) { + if (!self.device().is_cpu()) { return false; } + + const auto st = self.scalar_type(); + if (!(st == ScalarType::Float || st == ScalarType::Double || st == ScalarType::BFloat16)) { return false; } + + if (!is_radix_sort_available()) { return false; } + + // skip when having empty tensor + if (self.numel() == 0 || index.numel() == 0 || src.numel() == 0) { return false; } + + // skip when having scalar tensor + if (self.ndimension() == 0 || index.ndimension() == 0 || src.ndimension() == 0) { return false; } + + if (is_scatter_like) { + // using `spmm` for scatter would require sorting on index, + // this is only perf beneficial when the inner dimension, aka, `channels` + // is big enough. + constexpr int64_t threshold = 16; + if (index.numel() / index.size(0) < threshold) { return false; } + } + + // usually the expanded index has stride on the first dimension to be 1, + // and strides on other dims to be 0 or 1, e.g. 
+ // shape [108365, 16]; strides [1, 0] + // shape [13264, 1, 7]; strides [1, 1, 0] + auto index_strides = index.strides().vec(); + bool is_index_expanded = index_strides[0] == 1; + for (const auto dim : c10::irange(1, index_strides.size())) { + if (index_strides[dim] > 1) { is_index_expanded = false; } + } + + // index is expanded + return dim == 0 && is_index_expanded && src.is_contiguous() && self.is_contiguous(); +} + +using scatter_add_expanded_index_fn = void(*)(const Tensor&, const Tensor&, const Tensor&); +using scatter_reduce_expanded_index_fn = void(*)(const Tensor&, const Tensor&, const Tensor&, const SCATTER_GATHER_OP& reduce, bool); +using gather_expanded_index_fn = void (*)(const Tensor&, const Tensor&, const Tensor&); + +DECLARE_DISPATCH(scatter_add_expanded_index_fn, scatter_add_expanded_index_stub); +DECLARE_DISPATCH(scatter_reduce_expanded_index_fn, scatter_reduce_expanded_index_stub); +DECLARE_DISPATCH(gather_expanded_index_fn, gather_expanded_index_stub); + }} // namespace at::native diff --git a/aten/src/ATen/native/TensorAdvancedIndexingUtils.h b/aten/src/ATen/native/TensorAdvancedIndexingUtils.h index 8ffff8b6e912..0c0db4b83f35 100644 --- a/aten/src/ATen/native/TensorAdvancedIndexingUtils.h +++ b/aten/src/ATen/native/TensorAdvancedIndexingUtils.h @@ -1,5 +1,5 @@ #pragma once -#include +#include #include #include @@ -57,7 +57,7 @@ const Tensor& value){ } static AdvancedIndex make_info(Tensor self, IOptTensorListRef orig) { - checkIndexTensorTypes(orig); + checkIndexTensorTypes(orig, /*allow_int*/ true); // first expand BoolTensor (masks) or ByteTensor (masks) into 1 or more LongTensors auto indices = expandTensors(self, orig); // next broadcast all index tensors together @@ -82,6 +82,12 @@ static AdvancedIndex make_info(Tensor self, IOptTensorListRef orig) { indice = indice.to(self.device()); } } + for (auto & indice : indices) { + if (indice.defined() && indice.dtype() == at::kInt) { + indice = indice.to(at::kLong); + } + } + return AdvancedIndex(self, indices); } diff --git a/aten/src/ATen/native/TensorCompare.cpp b/aten/src/ATen/native/TensorCompare.cpp index 1ce3e32377d8..5d3ee7d98d80 100644 --- a/aten/src/ATen/native/TensorCompare.cpp +++ b/aten/src/ATen/native/TensorCompare.cpp @@ -1,19 +1,73 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include -#include -#include -#include +#include +#include +#include +#include +#include +#include #include +#include #include #include -#include -#include -#include #include -#include #include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif namespace at { namespace meta { @@ -399,8 +453,19 @@ static void isin_sorting( } } +template +Device out_device(Args&... 
inps){ + for (const auto& i : {inps...}){ + if (i.device() != at::kCPU) { + return i.device(); + } + } + return at::kCPU; +} + + Tensor& where_self_out(const Tensor& condition, const Tensor& self, const Tensor& other, Tensor& out) { - Tensor self_, other_; + Tensor self_, other_, condition_; if (self.dtype() != other.dtype()) { auto result_type = at::native::result_type(self, other); self_ = self.to(result_type); @@ -409,16 +474,30 @@ Tensor& where_self_out(const Tensor& condition, const Tensor& self, const Tensor self_ = self; other_ = other; } + auto device = out_device(condition, self_, other_); + condition_ = condition; + if (device != at::kCPU) { // allow CPU scalars on non-cpu device + if (condition.device() != device && condition.ndimension() == 0) { + condition_ = condition.to(device); + } + if (self_.device() != device && self_.ndimension() == 0) { + self_ = self_.to(device); + } + if (other_.device() != device && other_.ndimension() == 0) { + other_ = other_.to(device); + } + } if (condition.scalar_type() == ScalarType::Byte) { TORCH_WARN_ONCE("where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead."); } else { TORCH_CHECK(condition.scalar_type() == ScalarType::Bool, "where expected condition to be a boolean tensor, but got a tensor with dtype ", condition.scalar_type()); } - Tensor cond_bool = condition.scalar_type() == ScalarType::Byte ? condition.to(ScalarType::Bool) : condition; + condition_ = condition_.scalar_type() == ScalarType::Byte ? condition_.to(ScalarType::Bool) : condition_; + // if there's still a device mismatch, let tensoriterator error out with it auto iter = at::TensorIteratorConfig() .check_all_same_dtype(false) .add_output(out) - .add_input(cond_bool) + .add_input(condition_) .add_input(self_) .add_input(other_) .build(); @@ -426,9 +505,11 @@ Tensor& where_self_out(const Tensor& condition, const Tensor& self, const Tensor return out; } + Tensor where(const Tensor& condition, const Tensor& self, const Tensor& other) { + auto device = out_device(condition, self, other); auto result_type = at::native::result_type(self, other); - Tensor ret = at::empty({0}, self.options().dtype(result_type)); + Tensor ret = at::empty({0}, self.options().dtype(result_type).device(device)); at::native::where_self_out(condition, self, other, ret); return ret; } diff --git a/aten/src/ATen/native/TensorConversions.cpp b/aten/src/ATen/native/TensorConversions.cpp index 819516f67397..96275bde8299 100644 --- a/aten/src/ATen/native/TensorConversions.cpp +++ b/aten/src/ATen/native/TensorConversions.cpp @@ -1,16 +1,206 @@ +// #define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include +#include #include #include +#include #include +#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + +#include #include +#include #include +#include #include +#include namespace at { namespace native { +namespace { +// dense_to_sparse_{csr,bsr,csc,bsc} common helpers + +// Preparation fo the N-D dense -> sparse compressed conversion. 
+// The N-D input is converted to 3-D (single batch dim) where we check that the +// product of batch dims is nonzero and for each batch the sparse matrix +// contained within has the same number of non-zero elements. +// The batches are joined along the compressed axis. The generation of indices +// for this matrix can be performed in a single step followed by a single step +// conversion to restore the batch dimension. +void dense_to_sparse_compressed_prepare_check_mask_values_batched( + const Layout& target_layout, + Tensor& values, + Tensor& mask, + const int64_t& n_batch_dim) { + if (n_batch_dim > 1) { + // For inputs with more than 1 batch dim we flatten them out. + // Input shape (b0, b1 ..., bn, r, c) -> (b0 * b1 * ... * bn, r ,c) + values = values.flatten(0, n_batch_dim - 1); + mask = mask.flatten(0, n_batch_dim - 1); + } + + // For informative messaging form the name of the function + // to_sparse_{csr,csc,bsr,bsc}. + TORCH_CHECK( + mask.size(0) > 0, + "to_sparse_", + // We want the message to match the function name so generate the + // lowercase acronym for the layout + sparse_csr::layoutToString(target_layout, false, true), + ": Expected product of batch dimensions to be non-zero."); + + // Compute the number of non-zero elements in the first batch, expand to full + // size + auto nse_per_batch = mask.select(0, 0).sum().expand(mask.size(0)); + TORCH_CHECK( + mask.sum({-2, -1}).equal(nse_per_batch), + "Expect the same number of specified elements per batch."); + + // We need to join batches into a matrix increasing the length of the + // compressed axis. This allows us to create indices for a compressed matrix + // and de-batch them later (two kernels). Otherwise we would have to create + // indices for each batch individually requiring n_batch kernels. For csr/bsr, + // we already have the batch dim adjacent to the compressed axis and can + // flatten them together. For csc/bsc, we need to transpose first. + // For BSR/CSR (b, r, c) -> (b*r, c) + // For BSC/CSC (b, c, r) -> (r, b*c) + AT_DISPATCH_ROW_SPARSE_COMPRESSED_LAYOUTS( + target_layout, + "dense_to_sparse_compressed", + [&]() { + values = values.flatten(0, 1); + mask = mask.flatten(0, 1); + }, + [&]() { + values = values.transpose(0, 1).flatten(1, 2); + mask = mask.transpose(0, 1).flatten(1, 2); + }); +} + +// This function unfolds the compressed indices of a compressed sparse matrix +// into a batched compressed sparse tensor. +// This is analogous to an unflatten-like operation: +// unflatten(0, {b, r}) for csr/bsr with input shape (r*b, c) +// (output shape (b, r, c)) +// unflatten(1, {b, c}).transpose(0,1) for csc/bsc with input shape (r, c*b) +// (output shape (r, b, c) unflatten, (b, r, c) unflatten + transpose) +// This only operates on the compressed indices as the plain indices and values +// can be manipulated as described above without special handling. +// It is a prerequisite for the conversion that the sparsity pattern is sane for +// the batched shape. That is each batch has the same number of nonzero +// elements. +Tensor compressed_to_batched_compressed_indices( + const Tensor& compressed_in, + const int64_t& n_batch, + bool out_int32) { + auto n_compressed_per_batch = (compressed_in.size(0) - 1) / n_batch; + ScalarType out_type = out_int32 ? 
ScalarType::Int : ScalarType::Long; + auto batched_out = at::zeros( + {n_batch, n_compressed_per_batch + 1}, + compressed_in.options().dtype(out_type)); + + // If the compressed dimension has length zero there is 1 element in each + // batch and it is zero we already have this result formed + if (n_compressed_per_batch > 0) { + // Slice the compressed indices ignoring the leading 0 element and reshape + // to n-batch rows + auto trailing_slice = + compressed_in.slice(0, 1, c10::nullopt, 1).reshape({n_batch, -1}); + // Slice the compressed indices again selecting the elements corresponding + // to the batch boundary. The values here will be increasing multiples of + // nnz per batch. Reshape to n-batch rows (1 col) for broadcasting. + // This is equivalent to arange(n_batch) * nnz_per_batch with the same + // reshape + auto offsets = compressed_in.slice(0, 0, -1, n_compressed_per_batch) + .reshape({n_batch, -1}); + // Subtracting the offsets from each row of the reshaped compressed indices + // gives us the compressed indices within the batch. The leading element of + // each row is not computed as it is always zero. We copy into the view on + // the output buffer. + batched_out.narrow(-1, 1, n_compressed_per_batch) + .copy_(trailing_slice - offsets); + } + return batched_out; +} + +// After generating member tensors for sparse_compressed matrix, if the target +// shape is N-D we must reform the batch dimensions. +// Single kernel is used to restore one batch dimension in the compressed +// indices. From there full batch shape is restored by reshape. No special +// handling is needed for restoring batch dimensions of the values or +// plain_indices it can be done with reshape/unflatten. +void reshape_2d_sparse_compressed_members_to_nd_batched( + const IntArrayRef full_sizes, + const int64_t& n_batch_dim, + Tensor& compressed_indices, + Tensor& plain_indices, + Tensor& values) { + auto batch_shape = full_sizes.slice(0, n_batch_dim); + auto n_batch = std::accumulate( + batch_shape.begin(), batch_shape.end(), 1, std::multiplies()); + // NOTE: using this conversion requires the nnz per batch is the same for all + // batches that will be formed. We ensured this was the case on the way in so + // it is safe to use this conversion. + compressed_indices = compressed_to_batched_compressed_indices( + compressed_indices, n_batch, /*out_int32*/ false); + + // We can infer the last dim of the reshape targets, it will be nnz or + // nrow/ncol+1 depending on the layout and member tensor targeted. + auto batchsize_infer_last = DimVector(batch_shape); + batchsize_infer_last.push_back(-1); + + // -1 will be nnz per batch + plain_indices = plain_indices.reshape(batchsize_infer_last); + // -1 will be ncols (bsc,csc) or nrows (bsr,csr) + 1 + compressed_indices = compressed_indices.reshape(batchsize_infer_last); + // -1 will be nnz (per batch). + // Note: Unflatten rather than reshape as it will work + // for both blocked and unblocked layouts. reshape works for unblocked layouts + // only + values = values.unflatten(0, batchsize_infer_last); +} +} // namespace + // Take a Device that may not have device_index set (i.e., having it as -1 // representing the current device) and return the corresponding Device // according to the actual device at the time of this function call. 
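A standalone numeric sketch of the de-batching performed by compressed_to_batched_compressed_indices above, re-derived with public ops on illustrative data (two batches of two rows, four nonzeros each); this is a hypothetical helper, not the internal implementation:

#include <iostream>
#include <torch/torch.h>

int main() {
  int64_t n_batch = 2;
  auto joined = torch::tensor({0, 2, 4, 6, 8}, torch::kLong);  // crow of the joined (b*r, c) matrix
  auto per_batch = (joined.size(0) - 1) / n_batch;             // compressed entries per batch
  auto out = torch::zeros({n_batch, per_batch + 1}, torch::kLong);
  auto trailing = joined.slice(0, 1).reshape({n_batch, -1});            // drop the leading 0
  auto offsets  = joined.slice(0, 0, -1, per_batch).reshape({n_batch, -1});  // nnz offset per batch
  // Subtracting the per-batch offset recovers each batch's own crow index.
  out.narrow(-1, 1, per_batch).copy_(trailing - offsets);
  std::cout << out << "\n";  // [[0, 2, 4], [0, 2, 4]]
}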
No-op @@ -54,48 +244,52 @@ Tensor _to_copy( // memory_format is handled separately due to MemoryFormat::Preserve logic options = self.options().merge_in(options).memory_format(c10::nullopt); auto memory_format = optional_memory_format.value_or(MemoryFormat::Preserve); + // TODO: Use the dispatcher for this. // Currently there are unenumerated extensibility issues preventing this. - if (self.is_sparse_csr()) { - TORCH_CHECK( - memory_format == MemoryFormat::Preserve, - "sparse_csr only supports memory format Preserve, but got ", - memory_format, - " instead."); - - auto new_values = at::native::to( - self.values(), - dtype, - c10::kStrided, // values are strided - device, - pin_memory, - non_blocking, - true, // force copy since we're in _to_copy - memory_format); - - auto new_crow_indices = at::native::to( - self.crow_indices(), - self.crow_indices().scalar_type(), // indices are integral - c10::kStrided, // indices are strided - device, - pin_memory, - non_blocking, - true, // force copy since we're in _to_copy - memory_format); - - auto new_col_indices = at::native::to( - self.col_indices(), - self.col_indices().scalar_type(), // indices are integral - c10::kStrided, // indices are strided - device, - pin_memory, - non_blocking, - true, // force copy since we're in _to_copy - memory_format); - - return at::native::_sparse_csr_tensor_unsafe( - new_crow_indices, - new_col_indices, + if (at::sparse_csr::is_sparse_compressed(self)) { + TORCH_CHECK( + memory_format == MemoryFormat::Preserve, + "to(options): ", at::sparse_csr::layoutToString(self.layout()), + " only supports memory format Preserve, but got ", memory_format, + " instead."); + + Tensor compressed_indices, plain_indices; + std::tie(compressed_indices, plain_indices) = at::sparse_csr::getCompressedPlainIndices(self); + + const auto new_values = at::native::to( + self.values(), + dtype, + c10::kStrided, + device, + pin_memory, + non_blocking, + true, // force copy since we are in _to_copy + memory_format); + + const auto new_compressed_indices = at::native::to( + compressed_indices, + compressed_indices.scalar_type(), + c10::kStrided, + device, + pin_memory, + non_blocking, + true, // force copy since we are in _to_copy + memory_format); + + const auto new_plain_indices = at::native::to( + plain_indices, + plain_indices.scalar_type(), + c10::kStrided, + device, + pin_memory, + non_blocking, + true, // force copy since we are in _to_copy + memory_format); + + return at::native::_sparse_compressed_tensor_unsafe( + new_compressed_indices, + new_plain_indices, new_values, self.sizes(), new_values.scalar_type(), @@ -309,6 +503,15 @@ Tensor to_dense_backward(const Tensor& grad, const Tensor& input_) { auto input = input_.coalesce(); return grad.sparse_mask(input); } + if (at::sparse_csr::is_sparse_compressed(input_)) { + // TODO: implement sparse_compressed_mask + switch(input_.layout()) { + case kSparseCsr: return grad.sparse_mask(input_.to_sparse()).to_sparse_csr(); + case kSparseCsc: return grad.sparse_mask(input_.to_sparse()).to_sparse_csc(); + // BSR and BSC should be handled via implement sparse_compressed_mask + default: ; // fall back to unsupported input layout error + } + } if (input_.layout() == c10::kMkldnn) { return grad.to_mkldnn(input_.scalar_type()); } @@ -329,7 +532,8 @@ Tensor to_dense(const Tensor& tensor, c10::optional dtype) { } if (tensor.layout() == c10::kSparseCsr || tensor.layout() == c10::kSparseCsc || - tensor.layout() == c10::kSparseBsr) { + tensor.layout() == c10::kSparseBsr || + tensor.layout() == 
c10::kSparseBsc) { return tensor._to_dense(dtype); } if (tensor.layout() == c10::kMkldnn) { @@ -358,6 +562,14 @@ Tensor sparse_compressed_to_dense( TORCH_CHECK( !dtype.has_value(), "dtype argument is not supported by sparse_csr_to_dense"); + + // Guard upfront against hybrid tensors (causes segfault) + auto batch_ndim = sparse_csr::numBatchDimensions(self); + + TORCH_CHECK( + (self.dim() - batch_ndim) == 2, + "sparse_compressed_to_dense: Hybrid tensors are not supported"); + if (self.layout() == kSparseCsr) { Tensor dst = at::zeros(self.sizes(), self.options().layout(kStrided)); return dst.add_(self); @@ -384,26 +596,28 @@ Tensor sparse_compressed_to_dense( dst_transposed.add_(to_transposed_csr); return dst_transposed.transpose(batch_ndim, batch_ndim + 1); } - if (self.layout() == kSparseBsr) { - auto crow_indices = self.crow_indices(); - auto col_indices = self.col_indices(); + if (self.layout() == kSparseBsr || self.layout() == kSparseBsc) { + Tensor compressed_indices; + Tensor plain_indices; + std::tie(compressed_indices, plain_indices) = + sparse_csr::getCompressedPlainIndices(self); + auto values = self.values(); Tensor dense = at::zeros(self.sizes(), self.options().layout(kStrided)); if (self.dim() == 2) { // Pad shape so we can treat 2-d like batched, we will squeeze out the // phantom batch dim at the end - crow_indices = crow_indices.unsqueeze(0); - col_indices = col_indices.unsqueeze(0); - values = values.unsqueeze(0); - dense = dense.unsqueeze(0); + compressed_indices.unsqueeze_(0); + plain_indices.unsqueeze_(0); + values = values.unsqueeze_(0); + dense = dense.unsqueeze_(0); } if (self.dim() > 3) { // Flatten batch dims - auto n_batch_dim = self.dim() - 2; - crow_indices = crow_indices.flatten(0, n_batch_dim - 1); - col_indices = col_indices.flatten(0, n_batch_dim - 1); - values = values.flatten(0, n_batch_dim - 1); - dense = dense.flatten(0, n_batch_dim - 1); + compressed_indices = compressed_indices.flatten(0, batch_ndim - 1); + plain_indices = plain_indices.flatten(0, batch_ndim - 1); + values = values.flatten(0, batch_ndim - 1); + dense = dense.flatten(0, batch_ndim - 1); } // At this point everything has 3d shape either the batch dim was inserted, @@ -419,7 +633,10 @@ Tensor sparse_compressed_to_dense( dense = dense.reshape({n_batch, -1, values.size(-2), values.size(-1)}); for (auto batch : c10::irange(n_batch)) { Tensor batch_indices = at::_convert_indices_from_csr_to_coo( - crow_indices[batch], col_indices[batch], false, false); + compressed_indices[batch], + plain_indices[batch], + false, + self.layout() == kSparseBsc); auto batch_row_indices = batch_indices.select(0, 0); auto batch_col_indices = batch_indices.select(0, 1); auto offsets = batch_col_indices + @@ -557,16 +774,6 @@ Tensor view_dtype(const Tensor& self, ScalarType dtype) { return new_tensor; } -// Sparse layout conversions Start - -Tensor dense_to_sparse_csr(const Tensor& self) { - return self.to_sparse().to_sparse_csr(); -} - -Tensor dense_to_sparse_csc(const Tensor& self) { - return self.to_sparse().to_sparse_csc(); -} - Tensor _tile_tensor(const Tensor& self, IntArrayRef blocksize) { // This code turns a matrix into a sequence of blocks // @@ -641,6 +848,83 @@ std::pair _not_zero_mask_to_col_row_indices( return std::pair(col_indices, row_indices); } +// Sparse layout conversions Start + +Tensor dense_to_sparse_csr(const Tensor& self) { + auto n_batch_dim = self.dim() - 2; + auto values = self; + auto not_zero_mask = self != 0; + + if (n_batch_dim > 0) { + 
dense_to_sparse_compressed_prepare_check_mask_values_batched( + Layout::SparseCsr, values, not_zero_mask, n_batch_dim); + } + + Tensor col_indices; + Tensor row_indices; + std::tie(col_indices, row_indices) = _not_zero_mask_to_col_row_indices( + not_zero_mask, at::kLong, not_zero_mask.device()); + Tensor crow_indices = at::_convert_indices_from_coo_to_csr( + row_indices, not_zero_mask.size(0), false /*out_int32*/); + { + auto mask_indices = _mask_to_indices(not_zero_mask.flatten()); + values = values.flatten().index_select(0, mask_indices); + } + + if (n_batch_dim > 0) { + reshape_2d_sparse_compressed_members_to_nd_batched( + self.sizes(), n_batch_dim, crow_indices, col_indices, values); + } + return at::native::_sparse_csr_tensor_unsafe( + crow_indices, + col_indices, + values, + self.sizes(), + values.scalar_type(), + c10::kSparseCsr, + values.device()); +} + +Tensor dense_to_sparse_csc(const Tensor& self) { + auto n_batch_dim = self.dim() - 2; + auto values = self; + auto not_zero_mask = self != 0; + + if (n_batch_dim > 0) { + dense_to_sparse_compressed_prepare_check_mask_values_batched( + Layout::SparseCsc, values, not_zero_mask, n_batch_dim); + } + + Tensor col_indices; + Tensor row_indices; + // Compressed col indices are the same as the row indices of the transpose! + std::tie(row_indices, col_indices) = _not_zero_mask_to_col_row_indices( + not_zero_mask.transpose(1, 0), at::kLong, not_zero_mask.device()); + Tensor ccol_indices = at::_convert_indices_from_coo_to_csr( + col_indices, not_zero_mask.size(-1), false /*out_int32*/); + { + // We need to transpose the mask and values before flattening so the nnz dim + // will run in col-major order. + values = values.transpose(0, 1).flatten(); + auto mask_indices = + _mask_to_indices(not_zero_mask.transpose(0, 1).flatten()); + values = values.index_select(0, mask_indices); + } + + if (n_batch_dim > 0) { + reshape_2d_sparse_compressed_members_to_nd_batched( + self.sizes(), n_batch_dim, ccol_indices, row_indices, values); + } + return at::native::_sparse_csc_tensor_unsafe( + ccol_indices, + row_indices, + values, + self.sizes(), + values.scalar_type(), + c10::kSparseCsc, + values.device()); +} + Tensor dense_to_sparse_bsr(const Tensor& self, IntArrayRef blocksize) { TORCH_CHECK( blocksize[0] > 0 && blocksize[1] > 0, @@ -659,92 +943,37 @@ Tensor dense_to_sparse_bsr(const Tensor& self, IntArrayRef blocksize) { " needs to be divisible by blocksize[1] ", blocksize[1]); - auto block_size_0 = self.size(-2) / blocksize[0]; auto n_batch_dim = self.dim() - 2; auto values = _batch_tile_tensor(self, blocksize); auto not_zero_mask = _batch_tile_tensor((self != 0), blocksize); - // Find tiles that have at least 1 non-zero value in them. - not_zero_mask = not_zero_mask.any(-1).any(-1); + auto mask_shape = DimVector(not_zero_mask.sizes().slice(0, n_batch_dim + 2)); + // Can't use -1 here one of sparse/batch dims may be zero + mask_shape.push_back(blocksize[0] * blocksize[1]); + not_zero_mask = not_zero_mask.view(mask_shape).any(-1); if (n_batch_dim > 0) { - // for 3D input the mask is already flat along the batch dims, avoid - // creating unnessesary view - if (n_batch_dim > 1) { - // flatten out the batch dims for N-D input - not_zero_mask = not_zero_mask.flatten(0, n_batch_dim - 1); - } - TORCH_CHECK( - not_zero_mask.size(0) > 0, - "to_sparse_bsr: Expected product of batch dimensions to be non-zero."); - - // If the input is ND we assert that the same sparsity pattern - // is used across matrices. 
That means the same number of materialized - // values and *at the same location*. - // This requirement is not included in Pearu's blog post on BSR invariants. - // He specifically states that different batches may have different sparsity - // patterns as long as the number of specified elements is the same for all - // batches. - - auto not_zero_mask_0 = not_zero_mask.select(0, 0); - auto nse_per_batch = not_zero_mask_0.sum().repeat(not_zero_mask.size(0)); - TORCH_CHECK( - not_zero_mask.sum({-2, -1}).equal(nse_per_batch), - "Expect the same number of specified elements per batch."); + dense_to_sparse_compressed_prepare_check_mask_values_batched( + Layout::SparseBsr, values, not_zero_mask, n_batch_dim); } Tensor col_indices; Tensor row_indices; std::tie(col_indices, row_indices) = _not_zero_mask_to_col_row_indices( not_zero_mask, at::kLong, not_zero_mask.device()); - Tensor crow_indices; - if (n_batch_dim > 0) { - // reshape to put the (flattened) batch dims back in - col_indices = col_indices.reshape({not_zero_mask.size(0), -1}); - row_indices = row_indices.reshape({not_zero_mask.size(0), -1}); - crow_indices = at::empty( - {not_zero_mask.size(0), block_size_0 + 1}, col_indices.options()); - // For each batch compute crow_indices - for (auto batch : c10::irange(not_zero_mask.size(0))) { - Tensor batch_crow_indices = crow_indices[batch]; - at::_convert_indices_from_coo_to_csr_out( - batch_crow_indices, - row_indices[batch], - block_size_0, - false /* out_int32 */); - } - // At this point, we have constructed col_indices and crow_indices - // such that they are 2d with dim0 of length B = product(batchdims). We can - // now reshape them to the correct shapes. - auto batch_shape = self.sizes().slice(0, n_batch_dim); - crow_indices = crow_indices.unflatten(0, batch_shape); - col_indices = col_indices.unflatten(0, batch_shape); - - // Mask is also leading dim B, but we can't masked select wit it (see below) - // unless it is flat, then we can partially faltten values, index it along - // and unfold the result to batchdims + (nnz(per batch), ) - auto batch_sizes_nnz = DimVector(batch_shape); - batch_sizes_nnz.push_back(-1); // we can infer nnz - not_zero_mask = not_zero_mask.flatten(); - // TODO: masked_select does not support some form of broadcasting, so we're - // using the mask to construct indices that are then passed into - // index_select. This isn't ideal. - values = values.flatten(0, -3) - .index_select(0, _mask_to_indices(not_zero_mask)) - .unflatten(0, batch_sizes_nnz); + Tensor crow_indices = at::_convert_indices_from_coo_to_csr( + row_indices, not_zero_mask.size(0), false /*out_int32*/); - } else { - crow_indices = at::_convert_indices_from_coo_to_csr( - row_indices.view({-1}), block_size_0, false /* out_int32 */); - not_zero_mask = not_zero_mask.reshape({-1}); - // TODO: masked_select does not support some form of broadcasting, so we're - // using the mask to construct indices that are then passed into - // index_select. This isn't ideal. 
- values = values.reshape({-1, values.size(-2), values.size(-1)}) - .index_select(0, _mask_to_indices(not_zero_mask)); + { + auto mask_indices = _mask_to_indices(not_zero_mask.flatten()); + values = values.flatten(0, -3).index_select(0, mask_indices); } + if (n_batch_dim > 0) { + reshape_2d_sparse_compressed_members_to_nd_batched( + self.sizes(), n_batch_dim, crow_indices, col_indices, values); + } return at::native::_sparse_bsr_tensor_unsafe( crow_indices, col_indices, @@ -756,64 +985,319 @@ Tensor dense_to_sparse_bsr(const Tensor& self, IntArrayRef blocksize) { } Tensor dense_to_sparse_bsc(const Tensor& self, IntArrayRef blocksize) { - AT_ERROR( - "Conversion from ", self.layout(), " to SparseBsc is currently not supported."); - return self; + TORCH_CHECK( + blocksize[0] > 0 && blocksize[1] > 0, + "blocksize needs to be non zero, but got ", + blocksize); + TORCH_CHECK( + self.size(-2) % blocksize[0] == 0, + "Tensor size(-2) ", + self.size(-2), + " needs to be divisible by blocksize[0] ", + blocksize[0]); + TORCH_CHECK( + self.size(-1) % blocksize[1] == 0, + "Tensor size(-1) ", + self.size(-1), + " needs to be divisible by blocksize[1] ", + blocksize[1]); + auto n_batch_dim = self.dim() - 2; + auto is_batched = n_batch_dim > 0; + auto values = _batch_tile_tensor(self, blocksize); + auto not_zero_mask = _batch_tile_tensor((self != 0), blocksize); + auto mask_shape = DimVector(not_zero_mask.sizes().slice(0, n_batch_dim + 2)); + // Can't use -1 here one of sparse/batch dims may be zero + mask_shape.push_back(blocksize[0] * blocksize[1]); + not_zero_mask = not_zero_mask.view(mask_shape).any(-1); + + if (is_batched) { + dense_to_sparse_compressed_prepare_check_mask_values_batched( + Layout::SparseBsc, values, not_zero_mask, n_batch_dim); + } + + Tensor col_indices; + Tensor row_indices; + // Compressed col indices are the same as the row indices of the transpose! + std::tie(row_indices, col_indices) = _not_zero_mask_to_col_row_indices( + not_zero_mask.transpose(1, 0), at::kLong, not_zero_mask.device()); + // This only works if the col_indices vector is in ascending order. + Tensor ccol_indices = at::_convert_indices_from_coo_to_csr( + col_indices, not_zero_mask.size(-1), false /*out_int32*/); + { + // We need the block-values in col major order, but blocks themselves to + // remain in row-major order, so we transpose the leading two dims, leaving + // the trailing two dims as is. + values = values.transpose(0, 1).flatten(0, -3); + // The mask must transpose as well to index it correctly. 
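Assuming this patch is applied (dense -> BSC previously raised an error), the conversion being implemented here can be exercised from user code roughly as follows; the blocksize and values are illustrative only:

#include <iostream>
#include <torch/torch.h>

int main() {
  auto dense = torch::tensor({{1.0, 0.0, 0.0, 0.0},
                              {2.0, 3.0, 0.0, 0.0},
                              {0.0, 0.0, 0.0, 4.0},
                              {0.0, 0.0, 5.0, 6.0}});
  auto bsr = dense.to_sparse_bsr({2, 2});
  auto bsc = dense.to_sparse_bsc({2, 2});  // enabled by this change
  // Both keep the same two non-zero 2x2 blocks; only the compressed axis differs.
  std::cout << bsr.values().sizes() << " " << bsc.values().sizes() << "\n";
}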
+ auto mask_indices = + _mask_to_indices(not_zero_mask.transpose(0, 1).flatten()); + values = values.index_select(0, mask_indices); + } + if (is_batched) { + reshape_2d_sparse_compressed_members_to_nd_batched( + self.sizes(), n_batch_dim, ccol_indices, row_indices, values); + } + + return at::native::_sparse_bsc_tensor_unsafe( + ccol_indices, + row_indices, + values, + self.sizes(), + values.scalar_type(), + c10::kSparseBsc, + values.device()); +} + +void _check_blocksize_matches( + const Tensor& self, + c10::optional blocksize_opt, + const std::string& name) { + if (blocksize_opt.has_value()) { + const auto blocksize = *blocksize_opt; + const auto self_values = self.values(); + const auto self_blocksize = at::DimVector({self_values.size(-2), self_values.size(-1)}); + TORCH_CHECK(self_blocksize == blocksize, + name, "(): the provided blocksize does not match the blocksize of the to be converted tensor, ", + "got (", blocksize[0], ", ", blocksize[1], ") ", + "but expected (", self_blocksize[0], ", ", self_blocksize[1], ")."); + } +} + +Tensor sparse_compressed_clone( + const Tensor& self, + c10::optional blocksize, + const std::string& name) { + _check_blocksize_matches(self, blocksize, name); + // Just returning self doesn't work + // RuntimeError: t.use_count() <= 1 INTERNAL ASSERT FAILED at + // "../torch/csrc/autograd/autograd_not_implemented_fallback.cpp":152, + // please report a bug to PyTorch. + const auto layout = self.layout(); + Tensor compressed_indices, plain_indices; + std::tie(compressed_indices, plain_indices) = at::sparse_csr::getCompressedPlainIndices(self); + auto values = self.values(); + return _sparse_compressed_tensor_unsafe( + compressed_indices, + plain_indices, + values, + self.sizes(), + values.scalar_type(), + layout, + values.device()); +} + +Tensor sparse_compressed_to_flipped( + const Tensor& self, + c10::optional blocksize, + const std::string& name) { + _check_blocksize_matches(self, blocksize, name); + + const auto layout = self.layout(); + // NOTE: errors on non-compressed sparse layouts. + const auto flipped_layout = at::sparse_csr::flip_compressed_layout(layout); + + // Suppose compressed_indices represent rows of an input in either + // CSR or BSR sparse compressed format. + // In order to convert a batched CSR/BSR index into a batched CSC/BSC index + // we perform the following steps: + // 1. Convert a sparse compressed index representing batches of matrices of + // shape (b, r, c) to a sparse compressed index that represents a single + // matrix of shape (b * r, c). + // 2. Turn the compressed indices of the matrix of shape (b * r, c) into + // COO indices. + // 3. Map these COO indices into the COO indices of a matrix of shape (r, b * c) + // such that if A is a matrix of shape (b * r, c) and B is a matrix of shape + // (r, b * c) such that + // A[(k * r):(k * r + r), :] = B[:, (k * c):(k * c + c)] for all k in arange(b), + // then A[i, j] = B[i', j']. + // This is equivalent to finding indices that match values of matrices + // tiled vertically to values of the same matrices tiled horizontally. + // 4. Convert the COO indices to the CSC/BSC indices and form the output. + // + // NOTE: the reason behind vertical/horizontal tiling is to be able to transform + // indices over all matrices in the batch in a single kernel call, since + // all the existing coo <-> compressed indices conversion methods assume + // a single matrix. + // + // CSC/BSC inputs are handled in a similar fashion with a "transposed" argument. 
+ // See the comments below for detailed explanations on how exactly each step + // is performed. + + Tensor compressed_indices, plain_indices; + std::tie(compressed_indices, plain_indices) = at::sparse_csr::getCompressedPlainIndices(self); + auto values = self.values(); + const auto nnz = plain_indices.size(-1); + + const auto n_batches = compressed_indices.dim() - 1; + auto n_batches_nonzero = n_batches; + // Insert fake batch dim for simplicity + if (!n_batches) { + n_batches_nonzero = 1; + compressed_indices.unsqueeze_(0); + plain_indices.unsqueeze_(0); + values.unsqueeze_(0); + } + + // NOTE: these sparse_dims are true sparse dims only for CSR/CSC inputs. + // And for BSR/BSC these are / . + // In other words, sparse_dims stores ranges of valid indices in the row/col dims. + const auto sparse_dims = [&]() -> at::DimVector { + auto sparse_dims = at::DimVector(self.sizes().slice(n_batches, 2)); + if (layout == at::kSparseBsr || layout == at::kSparseBsc) { + std::array blocksize = {values.size(-2), values.size(-1)}; + sparse_dims[0] /= blocksize[0]; + sparse_dims[1] /= blocksize[1]; + } + return sparse_dims; + }(); + + // batch_sizes_nonempty stores at least one, potentially fake, batch dimension. + // rebatch_sizes_nonempty is equivalent to batch_sizes_nonempty.push_back(-1), + // and is used to unflatten batch dimensions from a dimension of size + // (batch_numel * dim_size,) for some dim_size. + const auto batch_sizes_nonempty = at::DimVector(plain_indices.sizes().slice(0, n_batches_nonzero)); + auto rebatch_sizes_nonempty = at::DimVector(batch_sizes_nonempty); + rebatch_sizes_nonempty.push_back(-1); + const auto batch_numel_nonzero = std::accumulate( + batch_sizes_nonempty.begin(), + batch_sizes_nonempty.begin() + n_batches_nonzero, + 1, + std::multiplies()); + + // Equivalent to (arange(batch_numel_nonzero).mul_(nnz)).reshape(batch_sizes_nonempty). + // We just compute it differently to use `add` kernel in place of `mul` for better + // performance. + const auto batch_nnz_offset = [&]() -> Tensor { + const auto wrapped_nnz = at::tensor({nnz}, compressed_indices.options()); + const auto offset = wrapped_nnz + .expand({batch_numel_nonzero}) + .cumsum(-1).sub_(wrapped_nnz) + .reshape(batch_sizes_nonempty); + return offset; + }(); + + // Step 1 for CSR/BSR inputs: + // Convert a sparse compressed index representing batches of matrices of + // shape (b, r, c) to a sparse compressed index that represents a single + // matrix of shape (b * r, c). + // The algorithm is identical for CSC/BSC inputs, with the batch dimensions + // flattened in the "transposed" dimension. + const auto compressed_indices_2d = [&]() -> Tensor { + // Extract offsets only relevant for the first :-1 elements in a row/col. + const auto compressed_offsets = compressed_indices.slice(-1, 0, -1); + // batch_offsets offsets each individual matrix row/col offsets by the total + // sum of nnz's of all the matrices with the smaller batch index. + const auto batch_offsets = batch_nnz_offset + .unsqueeze(-1).expand_as(compressed_offsets); + // compressed_offsets + batch_offsets creates an offset vector for a 2d matrix + // that is stored in a compressed sparse format. 
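A small numeric sketch of the step-3 remapping described in the overview above, i.e. mapping COO indices of a (b*r, c) matrix onto the equivalent (r, b*c) matrix so all batches can be handled in one kernel call (standalone arithmetic with illustrative sizes, not the internal kernel):

#include <iostream>
#include <torch/torch.h>

int main() {
  int64_t r = 2, c = 3;                                // per-batch sparse dims
  auto i = torch::tensor({0, 1, 2, 3}, torch::kLong);  // rows in (b*r, c)
  auto j = torch::tensor({0, 2, 1, 0}, torch::kLong);  // cols in (b*r, c)
  auto b  = i.div(r, "trunc");                         // batch index of each entry
  auto i2 = i.fmod(r);                                 // row within the batch
  auto j2 = j + b * c;                                 // column shifted by batch offset
  std::cout << i2 << "\n" << j2 << "\n";               // [0,1,0,1] and [0,2,4,3]
}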
+ const auto compressed_offsets_2d = compressed_offsets.add(batch_offsets).reshape({-1}); + const auto offsets_len = compressed_offsets_2d.numel(); + auto res = at::empty({offsets_len + 1}, compressed_indices.options()); + res.slice(-1, 0, -1).copy_(compressed_offsets_2d); + // By appending nnz * batch_numel_nonzero to (compressed_offsets + batch_offsets) + // a compressed index of a 2d matrix is formed. + res.slice(-1, -1).fill_(nnz * batch_numel_nonzero); + return res; + }(); + // More involved for compressed indices, but pretty easy for plain_indices and values: + // just squash batch dimensions. + const auto plain_indices_2d = plain_indices.flatten(0, n_batches_nonzero); + // NOTE: values are not 2d! They just represent values of a sparse compressed 2d matrix. + const auto values_2d = values.flatten(0, n_batches_nonzero); + + const auto is_out_int32 = compressed_indices.scalar_type() == ScalarType::Int; + + // Step 2 & 3: + // + // Turn the compressed indices of the matrix of shape (b * r, c) into COO indices. + // + // Map these COO indices into the COO indices of a matrix of shape (r, b * c) + // such that if A is a matrix of shape (b * r, c) and B is a matrix of shape + // (r, b * c) such that + // A[(k * r):(k * r + r), :] = B[:, (k * c):(k * c + c)] for all k in arange(b), + // then A[i, j] = B[i', j']. + // This is equivalent to finding indices that match values of matrices + // tiled vertically to values of the same matrices tiled horizontally. + + // coo <-> sparse index conversions assume CSR/BSR inputs. + // To CSC/BSC inputs these indices will appear "transposed". + const auto is_transposed_indices = layout == at::kSparseCsc || layout == at::kSparseBsc; + const auto coo_indices_2d_transposed = [&]() -> Tensor { + const auto coo_indices_2d = _convert_indices_from_csr_to_coo( + compressed_indices_2d, + plain_indices_2d, + is_out_int32, + /*transpose=*/true); // Flip rows/cols for convenience. + // Convert COO indices of (b * r, c) to (r, b * c). + // It is a map (i, j) -> { + // b = i // r + // i' = i % r + // j' = j + b * c + // return (i', j') + // } + // NOTE: we used transposed=true above! + auto i = coo_indices_2d.select(0, 1); + auto j = coo_indices_2d.select(0, 0); + auto b = i.div(is_transposed_indices ? sparse_dims[1] : sparse_dims[0], "trunc"); + // Modify i, j in-place. + i.fmod_(is_transposed_indices ? sparse_dims[1] : sparse_dims[0]); + j.add_(b * (is_transposed_indices ? sparse_dims[0] : sparse_dims[1])); + return coo_indices_2d; + }(); + + // Step 4: + // Convert the COO indices to the CSC/BSC indices and form the output. + // We need to sort COO indices along the "tranposed" dim to satisfy the + // invariant of sorted plain indices. + // Hash coo indices by converting 2d indices to linear offsets with + // more "weight" (aka stride) placed on the "transposed" dimension. + const auto coo_indices_2d_transposed_hashed = at::sparse::flatten_indices( + coo_indices_2d_transposed, + is_transposed_indices ? 
at::DimVector({sparse_dims[0], sparse_dims[1] * batch_numel_nonzero}) + : at::DimVector({sparse_dims[1], sparse_dims[0] * batch_numel_nonzero})); + const auto hash_argsort = std::get<1>(coo_indices_2d_transposed_hashed.sort()); + const auto coo_indices_2d_transposed_sorted = coo_indices_2d_transposed.index_select(1, hash_argsort); + + const auto new_compressed_indices_coo_2d = coo_indices_2d_transposed_sorted.select(0, 0); + const auto new_plain_indices_2d = coo_indices_2d_transposed_sorted.select(0, 1); + const auto new_values_2d = values_2d.index_select(0, hash_argsort); + + auto new_compressed_indices = compressed_to_batched_compressed_indices( + _convert_indices_from_coo_to_csr( + new_compressed_indices_coo_2d, + is_transposed_indices + ? batch_numel_nonzero * sparse_dims[0] + : batch_numel_nonzero * sparse_dims[1], + is_out_int32), + batch_numel_nonzero, + is_out_int32) + .unflatten(0, batch_sizes_nonempty); + auto new_plain_indices = new_plain_indices_2d.unflatten(0, rebatch_sizes_nonempty); + auto new_values = new_values_2d.unflatten(0, rebatch_sizes_nonempty); + // Kill fake batch dim if it was inserted. + if (!n_batches) { + new_compressed_indices.squeeze_(0); + new_plain_indices.squeeze_(0); + new_values.squeeze_(0); + } + + return _sparse_compressed_tensor_unsafe( + new_compressed_indices, + new_plain_indices, + new_values, + self.sizes(), + new_values.scalar_type(), + flipped_layout, + new_values.device()); } Tensor sparse_compressed_to_sparse_csr(const Tensor& self) { if (self.layout() == kSparseCsc) { - TORCH_CHECK( - self.dim() == 2, - "Expected self to be of dimension 2, but got ", - self.dim(), - "."); - auto sizes = self.sizes(); - auto ccol_indices = self.ccol_indices(); - auto row_indices = self.row_indices(); - auto values = self.values(); - - // convert CSC indices to COO indices and swap its rows - const bool out_int32 = ccol_indices.scalar_type() == ScalarType::Int; - Tensor indices_transposed = _convert_indices_from_csr_to_coo( - ccol_indices, row_indices, out_int32, true); - - // sort transposed indices - auto indices_scalar = - at::sparse::flatten_indices(indices_transposed, {sizes[0], sizes[1]}); - auto indicesPermutation = std::get<1>(indices_scalar.sort(0)); - auto indices_transposed_sorted = - indices_transposed.index_select(1, indicesPermutation); - - // construct a CSR tensor - auto new_row_indices = indices_transposed_sorted.select(0, 0); - auto new_col_indices = indices_transposed_sorted.select(0, 1); - auto new_values = values.index_select(0, indicesPermutation); - Tensor new_crow_indices = - _convert_indices_from_coo_to_csr(new_row_indices, sizes[0], out_int32); - - return _sparse_csr_tensor_unsafe( - new_crow_indices, - new_col_indices, - new_values, - {sizes[0], sizes[1]}, - new_values.scalar_type(), - c10::kSparseCsr, - new_values.device()); + return sparse_compressed_to_flipped(self, c10::nullopt, "to_sparse_csr"); } if (self.layout() == kSparseCsr) { - // Just returning self doesn't work - // RuntimeError: t.use_count() <= 1 INTERNAL ASSERT FAILED at - // "../torch/csrc/autograd/autograd_not_implemented_fallback.cpp":152, - // please report a bug to PyTorch. 
aten::to_sparse_csr - return at::native::_sparse_csr_tensor_unsafe( - self.crow_indices(), - self.col_indices(), - self.values(), - self.sizes(), - self.scalar_type(), - c10::kSparseCsr, - self.device()); + return sparse_compressed_clone(self, c10::nullopt, "to_sparse_csr"); } AT_ERROR( "sparse_compressed_to_sparse_csr expected SparseCsr or SparseCsc layout but got ", @@ -1150,59 +1634,73 @@ Tensor _csr_to_block_csr_cpu(const Tensor& self, IntArrayRef blocksize) { } Tensor sparse_compressed_to_sparse_bsr(const Tensor& self, IntArrayRef blocksize) { - TORCH_CHECK( - self.is_sparse_csr(), - "Can only convert CSR to SparseBsr, but got ", - self.layout(), - " instead."); - Tensor self_values = self.values(); - Tensor self_crow_indices = self.crow_indices(); - Tensor self_col_indices = self.col_indices(); - Tensor cpu_result = _csr_to_block_csr_cpu( - _sparse_csr_tensor_unsafe( - self_crow_indices.cpu(), - self_col_indices.cpu(), - self_values.cpu(), - self.sizes(), - self_values.scalar_type(), - self.layout(), - self_values.device()), - blocksize); - Tensor result_values = cpu_result.values().to(self_values.options()); - Tensor result_crow_indices = - cpu_result.crow_indices().to(self_crow_indices.options()); - Tensor result_col_indices = - cpu_result.col_indices().to(self_col_indices.options()); - return at::native::_sparse_bsr_tensor_unsafe( - result_crow_indices, - result_col_indices, - result_values, - self.sizes(), - result_values.scalar_type(), - c10::kSparseBsr, - result_values.device()); + if (self.layout() == kSparseBsc) { + return sparse_compressed_to_flipped(self, blocksize, "to_sparse_bsr"); + } + if (self.layout() == kSparseBsr) { + return sparse_compressed_clone(self, blocksize, "to_sparse_bsr"); + } + if (self.layout() == kSparseCsr) { + TORCH_CHECK(self.dim() == 2, + "to_sparse_bsr(): conversion from Csr to Bsr is only possible for 2d inputs, ", + "but got input of dimension ", self.dim(), " instead."); + Tensor self_values = self.values(); + Tensor self_crow_indices = self.crow_indices(); + Tensor self_col_indices = self.col_indices(); + Tensor cpu_result = _csr_to_block_csr_cpu( + _sparse_csr_tensor_unsafe( + self_crow_indices.cpu(), + self_col_indices.cpu(), + self_values.cpu(), + self.sizes(), + self_values.scalar_type(), + self.layout(), + at::kCPU), + blocksize); + Tensor result_values = cpu_result.values().to(self_values.options()); + Tensor result_crow_indices = + cpu_result.crow_indices().to(self_crow_indices.options()); + Tensor result_col_indices = + cpu_result.col_indices().to(self_col_indices.options()); + return at::native::_sparse_bsr_tensor_unsafe( + result_crow_indices, + result_col_indices, + result_values, + self.sizes(), + result_values.scalar_type(), + c10::kSparseBsr, + result_values.device()); + } + AT_ERROR( + "sparse_compressed_to_sparse_bsr expected SparseCsr, SparseBsr or SparseBsc layout but got ", + self.layout()); + return self; } Tensor sparse_compressed_to_sparse_bsc(const Tensor& self, IntArrayRef blocksize) { + if (self.layout() == kSparseBsr) { + return sparse_compressed_to_flipped(self, blocksize, "to_sparse_bsr"); + } + if (self.layout() == kSparseBsc) { + return sparse_compressed_clone(self, blocksize, "to_sparse_bsr"); + } AT_ERROR( - "Conversion from ", self.layout(), " to SparseBsc is currently not supported."); + "sparse_compressed_to_sparse_bsc expected SparseBsr or SparseBsc layout but got ", + self.layout()); return self; } Tensor sparse_compressed_to_sparse_csc(const Tensor& self) { + if (self.layout() == kSparseCsr) { + return 
sparse_compressed_to_flipped(self, c10::nullopt, "to_sparse_csc"); + } if (self.layout() == kSparseCsc) { - // Based on to_sparse_csr just returning self doesn't work - return _sparse_csc_tensor_unsafe( - self.ccol_indices(), - self.row_indices(), - self.values(), - self.sizes(), - self.scalar_type(), - c10::kSparseCsc, - self.device()); + return sparse_compressed_clone(self, c10::nullopt, "to_sparse_csc"); } AT_ERROR( - "Conversion from ", self.layout(), " to SparseCsc is currently not supported."); + "sparse_compressed_to_sparse_csc expected SparseCsr or SparseCsc layout but got ", + self.layout()); + return self; } Tensor sparse_compressed_to_sparse(const Tensor& self, int64_t sparse_dim) { @@ -1239,7 +1737,7 @@ Tensor sparse_compressed_to_sparse(const Tensor& self) { // Sparse layout conversions End Tensor to_meta(const Tensor& tensor) { - auto out = at::native::empty_strided_meta(tensor.sizes(), tensor.strides(), \ + auto out = at::native::empty_strided_meta_symint(tensor.sym_sizes(), tensor.sym_strides(), \ /*dtype=*/c10::make_optional(tensor.scalar_type()), /*layout=*/c10::make_optional(tensor.layout()), \ /*device=*/c10::make_optional(c10::Device(c10::kMeta)), /*pin_memory=*/c10::nullopt); // needs to handle wrapped numbers, so dtype promotion works properly. @@ -1255,13 +1753,13 @@ c10::optional to_meta(const c10::optional& tensor) { return c10::nullopt; } -std::vector to_meta(const at::TensorList& t_list) { +std::vector to_meta(at::ITensorListRef t_list) { std::vector outs; outs.reserve(t_list.size()); - for (const auto& i : c10::irange(t_list.size())) { - outs.push_back(to_meta(t_list[i])); + for (const auto& tensor : t_list) { + outs.push_back(to_meta(tensor)); } return outs; } - -}} // namespace at::native +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/TensorConversions.h b/aten/src/ATen/native/TensorConversions.h index 75a01ea0e755..8ec21a75dcac 100644 --- a/aten/src/ATen/native/TensorConversions.h +++ b/aten/src/ATen/native/TensorConversions.h @@ -19,7 +19,7 @@ bool to_will_alias( Tensor to_meta(const Tensor& tensor); c10::optional to_meta(const c10::optional& tensor); -std::vector to_meta(const at::TensorList& t_list); +std::vector to_meta(at::ITensorListRef t_list); } // namespace native } // namespace at diff --git a/aten/src/ATen/native/TensorDimApply.h b/aten/src/ATen/native/TensorDimApply.h index ad9ca857eeab..e75cd40caf48 100644 --- a/aten/src/ATen/native/TensorDimApply.h +++ b/aten/src/ATen/native/TensorDimApply.h @@ -1,4 +1,5 @@ -#include +#pragma once +#include #include namespace at { diff --git a/aten/src/ATen/native/TensorFactories.cpp b/aten/src/ATen/native/TensorFactories.cpp index 230f7964658d..7245cb77b1c5 100644 --- a/aten/src/ATen/native/TensorFactories.cpp +++ b/aten/src/ATen/native/TensorFactories.cpp @@ -1,31 +1,99 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include + +#include #include #include #include +#include #include #include -#include #include +#include +#include +#include #include -#include -#include -#include -#include #include -#include #include #include -#include -#include +#include + #ifndef AT_PER_OPERATOR_HEADERS #include +#include #else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include #include +#include +#include +#include +#include +#include +#include +#include +#include +#include 
+#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include #endif #include -#include -#include #include #include #include @@ -186,12 +254,7 @@ Tensor empty_cpu(IntArrayRef size, c10::optional dtype_opt, c10::opt return at::detail::empty_cpu(size, dtype_opt, layout_opt, device_opt, pin_memory_opt, memory_format_opt); } -Tensor empty_symint_cpu(c10::SymIntArrayRef size, c10::optional dtype_opt, c10::optional layout_opt, - c10::optional device_opt, c10::optional pin_memory_opt, c10::optional memory_format_opt) { - return at::native::empty_cpu(c10::asIntArrayRefSlow(size), dtype_opt, layout_opt, device_opt, pin_memory_opt, memory_format_opt); -} - -Tensor empty( +Tensor empty_names( IntArrayRef size, c10::optional names, c10::optional dtype, @@ -262,12 +325,6 @@ Tensor empty_like( // See [Note: hacky wrapper removal for TensorOptions] TensorOptions options_ = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); - - TORCH_CHECK( - !(options_.has_memory_format() && optional_memory_format.has_value()), - "Cannot set memory_format both in TensorOptions and explicit argument; please delete " - "the redundant setter."); - TensorOptions options = self.options() .merge_in(options_) @@ -388,17 +445,6 @@ Tensor empty_like_quantized( } } -Tensor new_empty( - const Tensor& self, - IntArrayRef size, - c10::optional dtype_opt, - c10::optional layout_opt, - c10::optional device_opt, - c10::optional pin_memory_opt - ) { - return self.new_empty_symint(c10::SymIntArrayRef::fromIntArrayRef(size), dtype_opt, layout_opt, device_opt, pin_memory_opt); -} - Tensor new_empty_symint( const Tensor& self, SymIntArrayRef size, @@ -414,10 +460,10 @@ Tensor new_empty_symint( return at::empty_symint(size, dtype, layout, device, pin_memory, c10::nullopt); } -Tensor new_empty_strided( +Tensor new_empty_strided_symint( const Tensor& self, - IntArrayRef size, - IntArrayRef stride, + c10::SymIntArrayRef size, + c10::SymIntArrayRef stride, c10::optional dtype, c10::optional layout, c10::optional device, @@ -426,7 +472,7 @@ Tensor new_empty_strided( // See [Note: hacky wrapper removal for TensorOptions] TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); - return at::empty_strided(size, stride, self.options().merge_in(options)); + return at::empty_strided_symint(size, stride, self.options().merge_in(options)); } // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ eye ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -1023,12 +1069,12 @@ Tensor tril_indices_cpu( // // 3. sequential RAM + transpose: create an n X 2 Tensor, fill the Tensor // sequentially, and then transpose it. 
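// A small worked example of the fill loop below (sizes are illustrative):
//   rows == 3, cols == 4, offset == 0  ->  tril_size == 6
//   result_data[0..6)  (row indices): 0 1 1 2 2 2
//   result_data[6..12) (col indices): 0 0 1 0 1 2
// i.e. the {2, tril_size} buffer stores all row indices first and all column
// indices second, so interpreting it as a 2 x tril_size tensor needs no copy.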
- AT_DISPATCH_ALL_TYPES_AND(kBFloat16, result.scalar_type(), "tril_indices", [&]() -> void { + AT_DISPATCH_INDEX_TYPES(result.scalar_type(), "tril_indices", [&]() -> void { // fill the Tensor with correct values - scalar_t* result_data = result.data_ptr(); + index_t* result_data = result.data_ptr(); int64_t i = 0; - scalar_t r = std::max(0, -offset), c = 0; + index_t r = std::max(0, -offset), c = 0; while (i < tril_size) { result_data[i] = r; result_data[tril_size + i++] = c; @@ -1061,14 +1107,14 @@ Tensor triu_indices_cpu( // create an empty Tensor with correct size auto result = at::native::empty_cpu({2, triu_size}, dtype_opt, layout_opt, device_opt, pin_memory_opt); - AT_DISPATCH_ALL_TYPES_AND(kBFloat16, result.scalar_type(), "triu_indices", [&]() -> void { + AT_DISPATCH_INDEX_TYPES(result.scalar_type(), "triu_indices", [&]() -> void { // fill the Tensor with correct values - scalar_t* result_data = result.data_ptr(); + index_t* result_data = result.data_ptr(); int64_t i = 0; // not typing std::max with scalar_t as it could be an unsigned type // NOTE: no need to check if the returned value of std::max overflows - // scalar_t, as i and triu_size act as a guard. - scalar_t c = std::max(0, offset), r = 0; + // index_t, as i and triu_size act as a guard. + index_t c = std::max(0, offset), r = 0; while (i < triu_size) { result_data[i] = r; result_data[triu_size + i++] = c; @@ -1090,14 +1136,6 @@ Tensor triu_indices_cpu( // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ zeros ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Tensor zeros(IntArrayRef size, - c10::optional dtype, - c10::optional layout, - c10::optional device, - c10::optional pin_memory) { - return at::zeros_symint(c10::SymIntArrayRef::fromIntArrayRef(size), dtype, layout, device, pin_memory); -} - Tensor zeros_symint(SymIntArrayRef size, c10::optional dtype, c10::optional layout, @@ -1123,8 +1161,16 @@ Tensor _efficientzerotensor(IntArrayRef size, return out; } +Tensor& zeros_sparse_out(IntArrayRef size, Tensor& result) { + result.sparse_resize_and_clear_(size, size.size(), 0.); + return result; +} + Tensor& zeros_out(IntArrayRef size, Tensor& result) { if (result.is_sparse()) { + // TODO: I think this branch should be dead, but we don't have an easy + // way to cover all sparse kernels with zeros_sparse_out, so retain this + // for now result.sparse_resize_and_clear_(size, size.size(), 0.); return result; } else { @@ -1495,7 +1541,7 @@ Tensor clone(const Tensor& src, c10::optional optional_memory if (memory_format == MemoryFormat::Preserve) { if (src.is_non_overlapping_and_dense()) { // Copy all strides, this is marginally faster than calling empty_like - self = at::empty_strided(src.sizes(), src.strides(), src.options()); + self = at::empty_strided_symint(src.sym_sizes(), src.sym_strides(), src.options()); } else { self = at::empty_like(src); } diff --git a/aten/src/ATen/native/TensorFactories.h b/aten/src/ATen/native/TensorFactories.h index 35e058df4b3a..2c0665518a9e 100644 --- a/aten/src/ATen/native/TensorFactories.h +++ b/aten/src/ATen/native/TensorFactories.h @@ -1,10 +1,9 @@ #pragma once #include -#include +#include +#include #include -#include -#include #ifndef AT_PER_OPERATOR_HEADERS #include diff --git a/aten/src/ATen/native/TensorIteratorReduce.cpp b/aten/src/ATen/native/TensorIteratorReduce.cpp index ea772bfe7e64..606a44222687 100644 --- a/aten/src/ATen/native/TensorIteratorReduce.cpp +++ b/aten/src/ATen/native/TensorIteratorReduce.cpp @@ -1,11 +1,14 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include -#include -#include 
-#include -#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + #include /// Contains the implementation of parallel reductions in TensorIterator. diff --git a/aten/src/ATen/native/TensorProperties.cpp b/aten/src/ATen/native/TensorProperties.cpp index f509f5982d96..e37dbf56cc81 100644 --- a/aten/src/ATen/native/TensorProperties.cpp +++ b/aten/src/ATen/native/TensorProperties.cpp @@ -1,12 +1,27 @@ -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include -#include -#include +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif -#include #include + namespace at { namespace native { @@ -22,8 +37,8 @@ bool nested_is_same_size(const Tensor& self, const Tensor& other) { "nested. While Other ", other.is_nested()? "is " : "is not ", "nested.") - const auto self_nt_size = get_nested_size_tensor(self); - const auto other_nt_size = get_nested_size_tensor(other); + const auto self_nt_size = _nested_tensor_size(self); + const auto other_nt_size = _nested_tensor_size(other); return at::equal(self_nt_size, other_nt_size); } int64_t size(const Tensor& self, int64_t dim) { @@ -54,7 +69,7 @@ bool cudnn_is_acceptable(const TensorBase& self) { // tensors. Maybe some cuDNN functions actually support empty tensors, but // native/THNN kernels shouldn't be much slower because the output is also // likely empty. - if (self.numel() == 0) return false; + if (self.sym_numel() == 0) return false; // NB: In the old Python code, there was also a test to see if the // cuDNN library was actually dynamically linked or not. I'm not // sure if we can actually test this. diff --git a/aten/src/ATen/native/TensorShape.cpp b/aten/src/ATen/native/TensorShape.cpp index 6eab75417476..f2ee31fe0bcd 100644 --- a/aten/src/ATen/native/TensorShape.cpp +++ b/aten/src/ATen/native/TensorShape.cpp @@ -1,33 +1,215 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include +#include +#include #include -#include +#include #include #include #include #include +#include +#include +#include +#include #include #include #include +#include #include #include -#include #include +#include #include -#include #include -#include +#include #include -#include -#include -#include #include -#include #include #include +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include 
+#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif +#include #include #include +#include #include -#include namespace at { namespace meta { @@ -56,7 +238,7 @@ inline c10::MemoryFormat cat_compute_output_memory_format(const MaterializedITen return format.value(); } -TORCH_PRECOMPUTE_META_FUNC(cat)(ITensorListRef tensors, int64_t dim) { +TORCH_PRECOMPUTE_META_FUNC(cat)(const ITensorListRef& tensors, int64_t dim) { // previously, size [0] tensors were the only possible empty tensors; thus, it wasn't possible // to cat empty tensors unless all the other tensors were 1-dimensional, so we allowed these tensors // to be "skipped". We maintain this behavior for backwards compatibility, but only for this specific @@ -64,10 +246,10 @@ TORCH_PRECOMPUTE_META_FUNC(cat)(ITensorListRef tensors, int64_t dim) { auto materialized = tensors.materialize(); cat_check_no_zero_dim(materialized); - dim = at::legacy_cat_wrap_dim(dim, tensors); + dim = at::legacy_cat_wrap_dim(dim, materialized); // Checking names before the actual dimensions. - auto maybe_outnames = namedinference::compute_cat_outnames(tensors); + auto maybe_outnames = namedinference::compute_cat_outnames(materialized); TORCH_CHECK( materialized.size() > 0, "torch.cat(): expected a non-empty list of Tensors"); @@ -123,11 +305,11 @@ TORCH_PRECOMPUTE_META_FUNC(cat)(ITensorListRef tensors, int64_t dim) { size_t size_at_dim = 0; for (const auto i : c10::irange(materialized.size())) { const Tensor& t = materialized[i]; + all_same_dtype = all_same_dtype && out_dtype == t.scalar_type(); if (!at::native::cat_should_skip_tensor(t)) { at::native::check_cat_shape_except_dim(materialized[valid], t, dim, i); size_at_dim += t.size(dim); all_contiguous = all_contiguous && t.is_contiguous(memory_format); - all_same_dtype = all_same_dtype && out_dtype == t.scalar_type(); all_same_sizes_and_stride = all_same_sizes_and_stride && t.sizes() == materialized[valid].get().sizes() && t.strides() == materialized[valid].get().strides(); @@ -202,9 +384,48 @@ Tensor& set_storage_cpu_(Tensor& result, Storage storage, int64_t storage_offset return result; } -Tensor& set_(Tensor& result, const Tensor& storage, int64_t storage_offset, IntArrayRef size, IntArrayRef stride) { +Tensor& set_storage_meta__symint(Tensor& result, Storage storage, c10::SymInt storage_offset, c10::SymIntArrayRef size, c10::SymIntArrayRef stride) { + checkSetStorage(result, storage, storage_offset, size, stride); + + c10::SymDimVector contiguous_strides; + if (stride.data() == nullptr) { + // TODO: dedupe this with empty() symbolic logic + int64_t dim = size.size(); + contiguous_strides.resize(dim); + if (dim > 0) { + const auto last_idx = dim - 1; + contiguous_strides.at(last_idx) = 1; + for (auto i = last_idx - 1; i >= 0; --i) { + // TODO: max with 1 + contiguous_strides.at(i) = contiguous_strides.at(i+1) * size.at(i+1); + } + } + stride = contiguous_strides; + } + + // Run this before storage setting so we can access numel + 
result.unsafeGetTensorImpl()->set_sizes_and_strides(size, stride, storage_offset); + + // Matches maybe_resize_storage_cpu no-numel behavior + if (result.sym_numel() != 0) { + // maybe_resize_storage_cpu can handle no storage exists at all but + // that should never be the case here + TORCH_INTERNAL_ASSERT(storage); + TORCH_CHECK(storage.resizable(), "Trying to resize storage that is not resizable"); + // All meta data pointers are the same, so we don't have to "re" allocate + // it. TODO: Actually this might not quite be correct if we use special + // pointers to track whether or not fake cuda tensors are pinned or not + const auto itemsize = result.dtype().itemsize(); + c10::SymInt size_bytes = at::detail::computeStorageNbytes( + size, stride, itemsize, storage_offset); + storage.set_nbytes(std::move(size_bytes)); + } + return result; +} + +Tensor& set__symint(Tensor& result, const Tensor& storage, c10::SymInt storage_offset, c10::SymIntArrayRef size, c10::SymIntArrayRef stride) { TORCH_CHECK(storage.is_contiguous(), "passed in tensor to be used as storage must be contiguous"); - return result.set_(storage.storage(), storage_offset + storage.storage_offset(), size, stride); + return result.set__symint(storage.storage(), storage_offset + storage.sym_storage_offset(), size, stride); } Tensor& set_tensor_(Tensor& result, const Tensor& source) { @@ -300,7 +521,7 @@ Tensor sparse_broadcast_to(const Tensor& self, IntArrayRef size) { new_values_size[0] = new_indices_size[1]; Tensor new_values = values.expand(broadcast_dense_sizes).repeat_interleave(nnz_factor, 0); - Tensor new_indices = at::native::new_empty(indices, new_indices_size); + Tensor new_indices = indices.new_empty(new_indices_size); if (broadcast_sizes.size()>0) { // ones(broadcast_sizes).nonzero() is equivalent to // product(map(arange, broadcast_sizes)) but avoids creating @@ -318,8 +539,8 @@ Tensor sparse_broadcast_to(const Tensor& self, IntArrayRef size) { return at::sparse_coo_tensor(new_indices, new_values, size)._coalesced_(is_coalesced); } -Tensor broadcast_to(const Tensor& self, IntArrayRef size) { - return self.expand(size); +Tensor broadcast_to_symint(const Tensor& self, SymIntArrayRef size) { + return self.expand_symint(size); } std::vector broadcast_tensors(TensorList tensors) { @@ -327,7 +548,7 @@ std::vector broadcast_tensors(TensorList tensors) { } TORCH_IMPL_FUNC(cat_out_cpu) -(ITensorListRef tensors, +(const ITensorListRef& tensors, int64_t dim, int64_t valid, bool all_contiguous, @@ -428,6 +649,23 @@ Tensor concat(TensorList tensors, int64_t dim) { return at::cat(tensors, dim); } +// torch.concatenate, alias for torch.cat +Tensor& concatenate_out(TensorList tensors, Dimname dim, Tensor& result) { + return at::cat_out(result, tensors, dimname_to_position(tensors[0], dim)); +} + +Tensor concatenate(TensorList tensors, Dimname dim) { + return at::cat(tensors, dimname_to_position(tensors[0], dim)); +} + +Tensor& concatenate_out(TensorList tensors, int64_t dim, Tensor & result) { + return at::cat_out(result, tensors, dim); +} + +Tensor concatenate(TensorList tensors, int64_t dim) { + return at::cat(tensors, dim); +} + static bool sizes_match_except(IntArrayRef s1, IntArrayRef s2, int64_t dim_except /* should already be wrapped */) { if (s1.size() != s2.size()) { return false; @@ -458,16 +696,16 @@ static void check_cat_sparse_dims(Tensor const &t, ", but tensor at position ", pos, " has ", t.sparse_dim(), ", ", t.dense_dim(), "."); } -static Tensor cat_sparse_impl(TensorList tensors, int64_t dim) { +static Tensor 
cat_sparse_impl(const MaterializedITensorListRef& tensors, int64_t dim) { std::vector indices; std::vector values; - int64_t wrapped = maybe_wrap_dim(dim, tensors[0].dim()); - int64_t sparse_dim = tensors[0].sparse_dim(); - int64_t dense_dim = tensors[0].dense_dim(); - IntArrayRef sizes = tensors[0].sizes(); + int64_t wrapped = maybe_wrap_dim(dim, tensors[0].get().dim()); + int64_t sparse_dim = tensors[0].get().sparse_dim(); + int64_t dense_dim = tensors[0].get().dense_dim(); + IntArrayRef sizes = tensors[0].get().sizes(); if (wrapped < sparse_dim) { for (const auto i : c10::irange(tensors.size())) { - auto const &t = tensors[i]; + const Tensor& t = tensors[i]; check_cat_sparse_dims(t, i, sizes, wrapped, sparse_dim, dense_dim); indices.push_back(t._indices()); values.push_back(t._values()); @@ -486,7 +724,7 @@ static Tensor cat_sparse_impl(TensorList tensors, int64_t dim) { int64_t col = 0; int64_t cumulative_offset = 0; for (const auto i : c10::irange(tensors.size())) { - auto const &t = tensors[i]; + const Tensor& t = tensors[i]; int64_t this_piece_size = t._nnz(); // cumulative_offset is zero for the first piece, so // don't waste time doing this operation unless i > 0. @@ -502,10 +740,10 @@ static Tensor cat_sparse_impl(TensorList tensors, int64_t dim) { idxs, vals, sizes_copy, - optTypeMetaToScalarType(tensors[0].options().dtype_opt()), - tensors[0].options().layout_opt(), - tensors[0].options().device_opt(), - tensors[0].options().pinned_memory_opt()); + optTypeMetaToScalarType(tensors[0].get().options().dtype_opt()), + tensors[0].get().options().layout_opt(), + tensors[0].get().options().device_opt(), + tensors[0].get().options().pinned_memory_opt()); } else { // Catting along a dense dimension requires us to create new values. @@ -527,29 +765,33 @@ static Tensor cat_sparse_impl(TensorList tensors, int64_t dim) { // The dimension in each tensor's values object that corresponds to the overall dimension along which we're catting. int64_t values_dim = wrapped - sparse_dim + 1; // The final size along the catted dimension. - const int64_t total_size = std::accumulate(tensors.begin(), tensors.end(), static_cast(0), [values_dim](int64_t l, Tensor const &r) { - return l + r._values().size(values_dim); - }); - auto zeros_sizes = tensors[0]._values().sizes().vec(); + const int64_t total_size = std::accumulate( + tensors.begin(), + tensors.end(), + static_cast(0), + [values_dim](int64_t l, const Tensor& r) { + return l + r._values().size(values_dim); + }); + auto zeros_sizes = tensors[0].get()._values().sizes().vec(); int64_t cumulative_size = 0; std::vector vals_pieces; std::vector idxs_pieces; for (const auto i : c10::irange(tensors.size())) { - auto const &t = tensors[i]; + const Tensor& t = tensors[i]; check_cat_sparse_dims(t, i, sizes, wrapped, sparse_dim, dense_dim); // dimension 0 of values corresponds to the number of values, // rather than to any logical dimension of the sparse tensor. 
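// Illustration of the padding scheme below (shapes are hypothetical):
//   catting along a dense dim with sparse_dim == 1, dense_dim == 1, so
//   values_dim == wrapped - sparse_dim + 1 == 1;
//   t0._values(): (nnz0, 2), t1._values(): (nnz1, 3)  ->  total_size == 5
//   t0's piece becomes cat({zeros(nnz0, 0), vals0, zeros(nnz0, 3)}, 1)
//   t1's piece becomes cat({zeros(nnz1, 2), vals1, zeros(nnz1, 0)}, 1)
// so every piece is total_size wide and only its own column range is nonzero.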
zeros_sizes[0] = t._values().size(0); zeros_sizes[values_dim] = cumulative_size; cumulative_size += t._values().size(values_dim); - auto z1 = native::zeros( + auto z1 = at::zeros( zeros_sizes, optTypeMetaToScalarType(t._values().options().dtype_opt()), t._values().options().layout_opt(), t._values().options().device_opt(), t._values().options().pinned_memory_opt()); zeros_sizes[values_dim] = total_size - cumulative_size; - auto z2 = native::zeros( + auto z2 = at::zeros( zeros_sizes, optTypeMetaToScalarType(t._values().options().dtype_opt()), t._values().options().layout_opt(), @@ -565,16 +807,17 @@ static Tensor cat_sparse_impl(TensorList tensors, int64_t dim) { at::cat(idxs_pieces, 1), at::cat(vals_pieces), sizes_copy, - optTypeMetaToScalarType(tensors[0].options().dtype_opt()), - tensors[0].options().layout_opt(), - tensors[0].options().device_opt(), - tensors[0].options().pinned_memory_opt()); + optTypeMetaToScalarType(tensors[0].get().options().dtype_opt()), + tensors[0].get().options().layout_opt(), + tensors[0].get().options().device_opt(), + tensors[0].get().options().pinned_memory_opt()); } } -Tensor cat_sparse(TensorList tensors, int64_t dim) { - auto maybe_outnames = namedinference::compute_cat_outnames(tensors); - auto result = cat_sparse_impl(tensors, at::legacy_cat_wrap_dim(dim, tensors)); +Tensor cat_sparse(const ITensorListRef& tensors, int64_t dim) { + auto materialized = tensors.materialize(); + auto maybe_outnames = namedinference::compute_cat_outnames(materialized); + auto result = cat_sparse_impl(materialized, at::legacy_cat_wrap_dim(dim, materialized)); namedinference::propagate_names_if_nonempty(result, maybe_outnames); return result; } @@ -660,54 +903,66 @@ std::vector chunk(const Tensor& self, int64_t chunks, int64_t dim) { TORCH_CHECK(chunks > 0, "chunk expects `chunks` to be greater than 0, got: ", chunks); - const auto dim_size = self.size(dim); - int64_t split_size = (dim_size + chunks - 1) / chunks; + const auto dim_size = self.sym_size(dim); + auto split_size = (dim_size + chunks - 1) / chunks; // We need to call split_with_sizes in the case where split_size and dimension size are 0, because // a call to split would discard the number of chunks (because we can have an arbitrary number of // 0-sized chunks adding up to 0). So, call split_with_sizes with the correct number of chunks, // eventually we will do this for all cases. 
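// A worked case of the branch below (numbers are illustrative):
//   self.sym_size(dim) == 0 and chunks == 3 give split_size == 0, so a plain
//   split(0, dim) could not preserve the requested chunk count; calling
//   split_with_sizes with {0, 0, 0} instead returns exactly 3 empty tensors.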
if (split_size == 0 && dim_size == 0) { - std::vector split_sizes(chunks, split_size); + std::vector split_sizes(chunks, split_size); split_sizes[chunks - 1] = split_size - (split_size * chunks - dim_size); - return self.split_with_sizes(split_sizes, dim); + return self.split_with_sizes_symint(split_sizes, dim); } else { - return self.split(split_size, dim); + return self.split_symint(split_size, dim); } } -std::vector tensor_split(const Tensor& self, int64_t sections, int64_t dim) { +std::vector tensor_split_sections_symint(const Tensor& self, c10::SymInt sym_sections, int64_t dim) { TORCH_CHECK(self.dim() > 0, "tensor_split expected at least a 1-dimensional tensor, but got a tensor with ", self.dim()," dims"); int64_t dim_ = maybe_wrap_dim(dim, self.dim()); + // NB: intentional, sections specifies number of output tensors, which + // cannot be polymorphic + int64_t sections = sym_sections.guard_int(__FILE__, __LINE__); TORCH_CHECK(sections > 0, "number of sections must be larger than 0, got ", sections); - const auto dim_size = self.size(dim_); + const auto dim_size = self.sym_size(dim_); std::vector splits(sections); - int64_t min_split_size = dim_size / sections; - int64_t num_splits_one_extra = dim_size % sections; - int64_t start_idx = 0; + auto min_split_size = dim_size / sections; + auto num_splits_one_extra = dim_size % sections; + c10::SymInt start_idx = 0; for (const auto split_idx : c10::irange(sections)) { - int64_t split_size = (split_idx < num_splits_one_extra) ? (min_split_size + 1) : min_split_size; - splits[split_idx] = at::slice(self, dim_, start_idx, start_idx + split_size); + auto split_size = (num_splits_one_extra > split_idx) ? (min_split_size + 1) : min_split_size; + splits[split_idx] = at::slice_symint(self, dim_, start_idx, start_idx + split_size); start_idx += split_size; } return splits; } -std::vector tensor_split(const Tensor& self, IntArrayRef indices, int64_t dim) { +template +std::vector _tensor_split_indices(const Tensor& self, ArrayRef indices, int64_t dim) { TORCH_CHECK(self.dim() > 0, "tensor_split expected at least a 1-dimensional tensor, but got a tensor with ", self.dim()," dims"); int64_t dim_ = maybe_wrap_dim(dim, self.dim()); int64_t num_indices = indices.size(); std::vector splits(num_indices + 1); - int64_t start_idx = 0; + T start_idx(0); for (const auto split_idx : c10::irange(num_indices)) { - int64_t end_idx = indices[split_idx]; - splits[split_idx] = at::slice(self, dim_, start_idx, end_idx); + auto end_idx = indices[split_idx]; + splits[split_idx] = at::symint::slice(self, dim_, start_idx, end_idx); start_idx = end_idx; } - splits[num_indices] = at::slice(self, dim_, start_idx, self.size(dim_)); + splits[num_indices] = at::symint::slice(self, dim_, start_idx, at::symint::size(self, dim_)); return splits; } +std::vector tensor_split(const Tensor& self, IntArrayRef indices, int64_t dim) { + return _tensor_split_indices(self, indices, dim); +} + +std::vector tensor_split_indices_symint(const Tensor& self, SymIntArrayRef indices, int64_t dim) { + return _tensor_split_indices(self, indices, dim); +} + std::vector tensor_split(const Tensor& self, const Tensor& tensor_indices_or_sections, int64_t dim) { TORCH_CHECK(self.dim() > 0, "tensor_split expected at least a 1-dimensional tensor, but got a tensor with ", self.dim()," dims"); auto split_device = tensor_indices_or_sections.device(); @@ -843,12 +1098,7 @@ Tensor diag_embed(const Tensor& self, int64_t offset, int64_t dim1_, int64_t dim return result; } -Tensor expand_symint(const Tensor& self, 
c10::SymIntArrayRef packed_size, bool implicit) { - auto size = asIntArrayRefSlow(packed_size); - return self.expand(size, implicit); -} - -Tensor expand(const Tensor& self, IntArrayRef size, bool /*unused*/) { +Tensor expand(const Tensor& self, c10::IntArrayRef size, bool /*unused*/) { TORCH_CHECK(size.size() >= (size_t)self.dim(), "expand(", self.toString(), "{", self.sizes(), "}, size=", size, "): the number of sizes provided (", size.size(), ") ", @@ -864,7 +1114,7 @@ Tensor expand(const Tensor& self, IntArrayRef size, bool /*unused*/) { } Tensor expand_as(const Tensor& self, const Tensor& other) { - return self.expand(other.sizes()); + return self.expand_symint(other.sym_sizes()); } Tensor sum_to_size(const Tensor& self, IntArrayRef size) { @@ -884,6 +1134,7 @@ Tensor make_qtensor(const Tensor& self, IntArrayRef size, IntArrayRef stride, Qu } Tensor as_strided_tensorimpl(const Tensor& self, IntArrayRef size, IntArrayRef stride, optional storage_offset_) { + TORCH_INTERNAL_ASSERT(!self.is_mps(), "as_strided_tensorimpl does not work with MPS; call self.as_strided(...) instead"); auto storage_offset = storage_offset_.value_or(self.storage_offset()); auto result = at::detail::make_tensor( c10::TensorImpl::VIEW, Storage(self.storage()), self.key_set(), self.dtype()); @@ -891,6 +1142,22 @@ Tensor as_strided_tensorimpl(const Tensor& self, IntArrayRef size, IntArrayRef s return result; } +Tensor as_strided_tensorimpl_meta(const Tensor& self, IntArrayRef size, IntArrayRef stride, optional storage_offset_) { + auto storage_offset = storage_offset_.value_or(self.storage_offset()); + auto result = at::detail::make_tensor( + c10::TensorImpl::VIEW, Storage(self.storage()), self.key_set(), self.dtype()); + setStrided(result, size, stride, storage_offset); + return result; +} + +Tensor as_strided_tensorimpl_meta_symint(const Tensor& self, SymIntArrayRef sym_size, SymIntArrayRef sym_stride, optional sym_storage_offset_) { + auto sym_storage_offset = sym_storage_offset_.value_or(self.sym_storage_offset()); + auto result = at::detail::make_tensor( + c10::TensorImpl::VIEW, Storage(self.storage()), self.key_set(), self.dtype()); + setStrided(result, sym_size, sym_stride, sym_storage_offset); + return result; +} + Tensor as_strided_qtensorimpl(const Tensor& self, IntArrayRef size, IntArrayRef stride, optional storage_offset_) { auto storage_offset = storage_offset_.value_or(self.storage_offset()); auto quantizer = get_qtensorimpl(self)->quantizer(); @@ -921,20 +1188,18 @@ Tensor as_strided_qtensorimpl(const Tensor& self, IntArrayRef size, IntArrayRef return result; } -const Tensor &as_strided_(const Tensor& self, IntArrayRef size, IntArrayRef stride, optional storage_offset_) { - auto storage_offset = storage_offset_.value_or(self.storage_offset()); +const Tensor &as_strided__symint(const Tensor& self, SymIntArrayRef size, SymIntArrayRef stride, optional storage_offset_) { + auto storage_offset = storage_offset_.value_or(self.sym_storage_offset()); setStrided(self, size, stride, storage_offset); return self; } -Tensor narrow_copy_symint(const Tensor& self, int64_t dim, int64_t start, SymInt sym_length) { - return self.narrow_copy(dim, start, sym_length.expect_int()); -} - Tensor narrow_copy_dense(const Tensor& self, int64_t dim, int64_t start, int64_t length) { return self.narrow(dim, start, length).clone(at::MemoryFormat::Contiguous); } +// Should just use narrow_copy_out, but this API is used internally at Meta: +// https://github.com/pytorch/pytorch/pull/87045#issuecomment-1309353561 Tensor 
narrow_copy_dense_cpu(const Tensor& self, int64_t dim, int64_t start, int64_t length){ auto output = at::empty_like(self); return narrow_copy_dense_cpu_out(self, dim, start, length, output); @@ -944,9 +1209,10 @@ Tensor narrow_copy_sparse(const Tensor& self, int64_t dim, int64_t start, int64_ int64_t allDim = self.dim(); int64_t end = start+length; TORCH_CHECK(allDim > 0, "narrow() cannot be applied to a 0-dim tensor."); + TORCH_CHECK(length >= 0, "narrow(): length must be non-negative."); TORCH_CHECK(dim >= 0 && dim < allDim, "Dimension ", dim, " out of range. Expecting 0 <= dim < ", allDim, "."); - TORCH_CHECK(start >= 0 && length >= 0 && end <= self.size(dim), + TORCH_CHECK(start >= 0 && end <= self.size(dim), "Invalid range to narrow. range(start, start+length) must be a subset of range(0, ", self.size(dim), ").") Tensor indices = self._indices(); int64_t sparse_dim = self.sparse_dim(); @@ -974,6 +1240,8 @@ Tensor narrow_copy_sparse(const Tensor& self, int64_t dim, int64_t start, int64_ return newTensor._coalesced_(self.is_coalesced()); } +// Should just use narrow_copy_out, but this API is used internally at Meta: +// https://github.com/pytorch/pytorch/pull/87045#issuecomment-1309353561 Tensor& narrow_copy_dense_cpu_out( const Tensor& self, int64_t dim, int64_t start, int64_t length, Tensor& output ) { @@ -1057,20 +1325,35 @@ Tensor& narrow_copy_dense_cpu_out( Tensor narrow(const Tensor& self, int64_t dim, int64_t start, int64_t length) { TORCH_CHECK(self.dim() > 0, "narrow() cannot be applied to a 0-dim tensor."); + TORCH_CHECK(length >= 0, "narrow(): length must be non-negative."); auto cur_size = self.size(dim); if (start != cur_size) { // start being the end is valid, but not a valid dim specification. start = maybe_wrap_dim(start, cur_size); } - TORCH_CHECK(length >= 0 && start <= cur_size - length, + TORCH_CHECK(start <= cur_size - length, "start (", start, ") + length (", length, ") exceeds dimension size (", cur_size, ")."); return at::slice(self, dim, start, start + length, 1); } -Tensor narrow(const Tensor& self, int64_t dim, const Tensor& start, int64_t length) { +Tensor narrow_symint(const Tensor& self, int64_t dim, SymInt start, SymInt length) { + TORCH_CHECK(self.dim() > 0, "narrow() cannot be applied to a 0-dim tensor."); + TORCH_CHECK(length >= 0, "narrow(): length must be non-negative."); + auto cur_size = self.sym_size(dim); + if (start != cur_size) { // start being the end is valid, but not a valid dim specification. + start = maybe_wrap_dim(start, cur_size); + } + TORCH_CHECK(start <= cur_size - length, + "start (", start, ") + length (", length, ") exceeds dimension size (", cur_size, ")."); + return at::slice_symint(self, dim, start, start + length, 1); +} + +// This overload exists purely for XLA, because they wanted to pass in "symbolic" +// start via Tensor. 
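// Usage sketch (values are hypothetical): eager backends just read the scalar
// out of the 0-dim tensor, e.g.
//   auto start = at::scalar_tensor(2, at::kLong);
//   auto y = at::narrow(x, /*dim=*/0, start, /*length=*/3);  // like x.narrow(0, 2, 3)
// while a tracing backend such as XLA can keep `start` as a data-dependent value.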
+Tensor narrow_tensor_symint(const Tensor& self, int64_t dim, const Tensor& start, SymInt length) { TORCH_CHECK(start.dim() == 0 && isIntegralType(start.scalar_type(), /*includeBool=*/false), "start must be an 0-dim integral Tensor."); int64_t st = start.item(); - return at::narrow(self, dim, st, length); + return at::narrow_symint(self, dim, c10::SymInt(st), length); } std::tuple> @@ -1261,18 +1544,65 @@ Tensor alias_with_sizes_and_strides( return self_; } -Tensor reshape(const Tensor& self, IntArrayRef proposed_shape) { - // reshape has special autograd logic since it sometimes returns a view but sometimes does not - // we have to intercept here instead of using dispatcher - // otherwise we will see "autograd still running" kind of error in inference mode: - // * if we create a tensor in inference mode scope, - // then pass it to a inference mode decorated function, - // everything is fine - // * but if we create the input tensor not with inference mode, - // then errors like "Cannot set version_counter for inference tensor" arise - if (self.is_nested()) { - return at::_reshape_nested(self, proposed_shape); +Tensor reshape_symint(const Tensor& self, c10::SymIntArrayRef proposed_shape) { + if (self.is_sparse()) { + AT_ERROR("reshape is not implemented for sparse tensors"); + } + c10::SymDimVector shape = infer_size_dv(proposed_shape, self.sym_numel()); + + if (self.is_mkldnn()) { + return at::_mkldnn_reshape(self, c10::asIntArrayRefSlow(shape)); + } + + // `computeStride` returns the proper strides to use if this + // `reshape` can be just a view. + auto stride = at::detail::computeStride(self.sym_sizes(), self.sym_strides(), shape); + + // NB: Even though we have viewable geometry and the target strides here, + // we do not just call `as_strided` on `self` because the backward + // for `as_strided` is not as efficient as that of `view` (since the + // former is meant to handle general cases). + // + // Similarly we don't call `view` because it duplicates some of the work + // we've already done, and instead call our internal/private operator + // `_reshape_alias` that essentially does the same thing as `view` and + // `as_strided` without any of the extra overhead. + if (stride.has_value()) { + // Temporary check to revert to the old behavior/view in cases where the + // device is not supported (e.g. for XLA the operation is not supported + // so we use `view` instead). + // + // We need to do the checks here instead of in `native_functions.yaml` + // to preserve backwards compatibility. + if (!self.is_xla() && !self.is_lazy() && !self.is_ipu() && !at::isTensorSubclassLike(self)) { + return self._reshape_alias_symint(shape, stride.value()); + } else { + return self.view_symint(shape); + } + } + return at::_unsafe_view_symint(self.clone(at::MemoryFormat::Contiguous), shape); +} + +Tensor _reshape_copy_symint(const Tensor& self, c10::SymIntArrayRef proposed_shape) { + if (self.is_sparse()) { + TORCH_CHECK(0, "_reshape_copy is not implemented for sparse tensors"); + } + c10::SymDimVector shape = infer_size_dv(proposed_shape, self.sym_numel()); + + if (self.is_mkldnn()) { + TORCH_CHECK(0, "_reshape_copy not implemented for mkldnn tensors"); + } + + if (self.is_contiguous()) { + return self.view_symint(shape).clone(at::MemoryFormat::Contiguous); + } else { + return at::_unsafe_view_symint(self.clone(at::MemoryFormat::Contiguous), shape); } +} + +// Duplicate of above code for non-symbolic ints. Kept for BC purposes and to +// minimize breakages. 
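// Illustration of the view-vs-copy decision made above (shapes are illustrative):
//   a contiguous (2, 3) tensor reshaped to (3, 2): computeStride yields {2, 1},
//     so the result is a cheap _reshape_alias over the same storage;
//   a (3, 2) tensor obtained by transposing a contiguous (2, 3) one, reshaped
//     to (6): its strides {1, 3} admit no flat view, computeStride returns
//     nullopt, and the input is cloned contiguously and _unsafe_view'd instead.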
+Tensor reshape(const Tensor& self, IntArrayRef proposed_shape) { if (self.is_sparse()) { AT_ERROR("reshape is not implemented for sparse tensors"); } @@ -1399,13 +1729,21 @@ QuantizerPtr create_subtensor_quantizer(const Tensor& self, bool is_select, int6 } Tensor select(const Tensor& self, int64_t dim, int64_t index) { + return at::select_symint(self, dim, c10::SymInt{index}); +} + +Tensor select(const Tensor& self, Dimname dim, int64_t index) { + return at::select_symint(self, dimname_to_position(self, dim), c10::SymInt{index}); +} + +Tensor select_symint(const Tensor& self, int64_t dim, c10::SymInt index) { int64_t ndim = self.dim(); if (ndim == 0) { TORCH_CHECK_INDEX(false, "select() cannot be applied to a 0-dim tensor."); } dim = maybe_wrap_dim(dim, ndim); - auto size = self.size(dim); - if (index < -size || index >= size) { + auto size = self.sym_sizes()[dim]; + if (size < -index || size <= index) { if (self.has_names() && self.names()[dim] != Dimname::wildcard()) { TORCH_CHECK_INDEX(false, "select(): index ", index, " out of range for tensor of size ", self.sizes(), " at dimension ", self.names()[dim]); @@ -1417,32 +1755,37 @@ Tensor select(const Tensor& self, int64_t dim, int64_t index) { index += size; } if (self.is_sparse()) { - return select_sparse(self, dim, index); + return select_sparse(self, dim, index.guard_int(__FILE__, __LINE__)); } - DimVector sizes(self.sizes().begin(), self.sizes().end()); - DimVector strides(self.strides().begin(), self.strides().end()); - auto storage_offset = self.storage_offset() + index * strides[dim]; - sizes.erase(sizes.begin() + dim); - strides.erase(strides.begin() + dim); Tensor result; if (self.is_quantized()) { - auto quantizer = create_subtensor_quantizer(self, true, index, index + 1, dim, 1); + auto local_index = index.guard_int(__FILE__, __LINE__); + + DimVector sizes(self.sizes().begin(), self.sizes().end()); + DimVector strides(self.strides().begin(), self.strides().end()); + auto storage_offset = self.storage_offset() + local_index * strides[dim]; + sizes.erase(sizes.begin() + dim); + strides.erase(strides.begin() + dim); + + auto quantizer = create_subtensor_quantizer(self, true, local_index, local_index + 1, dim, 1); result = as_strided_qtensorimpl(self, sizes, strides, storage_offset, quantizer); } else { - result = self.as_strided(sizes, strides, storage_offset); + std::vector sizes(self.sym_sizes().begin(), self.sym_sizes().end()); + std::vector strides(self.sym_strides().begin(), self.sym_strides().end()); + auto storage_offset = self.sym_storage_offset() + index * strides[dim]; + sizes.erase(sizes.begin() + dim); + strides.erase(strides.begin() + dim); + + result = self.as_strided_symint(sizes, strides, storage_offset); } namedinference::propagate_names_except(result, self, {dim}); return result; } -Tensor select(const Tensor& self, Dimname dim, int64_t index) { - return at::select(self, dimname_to_position(self, dim), index); -} - -Tensor select_backward(const Tensor& grad, IntArrayRef input_sizes, int64_t dim, int64_t index) { - auto grad_input = at::zeros(input_sizes, grad.options()); - grad_input.select(dim, index).copy_(grad); +Tensor select_backward_symint(const Tensor& grad, c10::SymIntArrayRef input_sizes, int64_t dim, c10::SymInt index) { + auto grad_input = at::zeros_symint(input_sizes, grad.options()); + grad_input.select_symint(dim, index).copy_(grad); return grad_input; } @@ -2095,10 +2438,6 @@ Tensor slice( // TODO: support negative strides TORCH_CHECK(step > 0, "slice step must be positive"); - // INT64_MAX 
stands for default value. - if (start_val == INT64_MAX) { - start_val = 0; - } if (start_val < 0) { start_val += sizes[dim]; } @@ -2125,6 +2464,10 @@ Tensor slice( auto quantizer = create_subtensor_quantizer(self, false, start_val, end_val, dim, step); result = as_strided_qtensorimpl(self, sizes, strides, storage_offset, quantizer); } else { + // NB: it is extremely important to perform a redispatch here for + // the MPS backend; if you call directly to as_strided_tensorimpl, + // the necessary metadata for MPS will not get setup and you will + // get silently wrong results result = self.as_strided(sizes, strides, storage_offset); } namedinference::propagate_names(result, self); @@ -2149,8 +2492,8 @@ std::vector split(const Tensor& self, int64_t split_size, int64_t dim) { return splits; } -std::vector split(const Tensor& self, IntArrayRef sizes, int64_t dim) { - return at::split_with_sizes(self, sizes, dim); +std::vector split_symint(const Tensor& self, c10::SymIntArrayRef sizes, int64_t dim) { + return at::split_with_sizes_symint(self, sizes, dim); } std::vector unsafe_split(const Tensor& self, int64_t split_size, int64_t dim) { @@ -2199,7 +2542,7 @@ std::vector split_with_sizes(const Tensor& self, IntArrayRef split_sizes TORCH_CHECK(length >= 0, "split_with_sizes expects split_sizes have only non-negative ", "entries, but got split_sizes=", split_sizes); - splits.push_back(self.narrow(dim, start_idx, length)); + splits.push_back(at::native::slice(self, dim, start_idx, start_idx + length, 1)); start_idx += length; } TORCH_CHECK(start_idx == dim_size, @@ -2521,6 +2864,132 @@ Tensor & transpose_(Tensor & self, int64_t dim0, int64_t dim1) { return self; } +namespace { +// Transpose implementation for sparse compressed layouts +// NB: We assume that dim1,dim0 have already been wrapped +static inline Tensor sparse_compressed_transpose( + const Tensor& self, + int64_t dim0, + int64_t dim1) { + auto compressed_inds = AT_DISPATCH_ROW_SPARSE_COMPRESSED_LAYOUTS( + self.layout(), + "compressed_inds", + [&self]() { return self.crow_indices(); }, + [&self]() { return self.ccol_indices(); }); + + auto plain_inds = AT_DISPATCH_ROW_SPARSE_COMPRESSED_LAYOUTS( + self.layout(), + "plain_inds", + [&self]() { return self.col_indices(); }, + [&self]() { return self.row_indices(); }); + + const auto n_batch_dim = compressed_inds.dim() - 1; + const auto n_dense_dim = self.dim() - n_batch_dim - 2; + + // In theory it works, but missing to_dense coverage to test + TORCH_CHECK( + n_dense_dim == 0, + "transpose(): hybrid sparse compressed tensors with dense dimensions are not supported"); + + // Classify transpose "type" + enum class TransposeDim : uint8_t { Batch, Sparse, Dense }; + auto classify_dim = [&n_batch_dim](const int64_t dim) { + if (dim < n_batch_dim) { + return TransposeDim::Batch; + } else if (dim > n_batch_dim + 1) { + return TransposeDim::Dense; + } else { + return TransposeDim::Sparse; + } + }; + + const auto transpose_type = classify_dim(dim0); + { + auto dim_type_name = [](const TransposeDim dim) { + switch (dim) { + case TransposeDim::Batch: + return "Batch"; + case TransposeDim::Dense: + return "Dense"; + case TransposeDim::Sparse: + return "Sparse"; + default: + TORCH_INTERNAL_ASSERT( + false, + "Impossible TransposeDim value: ", + static_cast>(dim)); + } + }; + const auto dim1_type = classify_dim(dim1); + TORCH_CHECK( + dim1_type == transpose_type, + "transpose(): can only transpose dimensions of the same type (Batch, Sparse, Dense), got ", + dim0, + "(", + dim_type_name(transpose_type), + 
")", + " and ", + dim1, + "(", + dim_type_name(dim1_type), + ")"); + } + + // We have validated everything, early exit for equal dims (no effect) + if (dim0 == dim1) { + return self.clone(); + } + + auto result_sizes = DimVector(self.sizes()); + std::swap(result_sizes[dim0], result_sizes[dim1]); + Tensor result_vals; + auto result_layout = self.layout(); + + if (transpose_type == TransposeDim::Batch) { + compressed_inds = compressed_inds.transpose(dim0, dim1).contiguous(); + plain_inds = plain_inds.transpose(dim0, dim1).contiguous(); + result_vals = self.values().transpose(dim0, dim1).contiguous(); + + } else if (transpose_type == TransposeDim::Dense) { + // NB: This code should work, but is untestable due to lack of support for + // dense dimensions in to_dense. The Debug assert is present to emphasize + // the fact that the block should not be possible to hit this code block + TORCH_INTERNAL_ASSERT( + false, "transpose(): Shouldn't have reached this point"); + result_vals = AT_DISPATCH_PLAIN_SPARSE_COMPRESSED_LAYOUTS( + self.layout(), + "sparse_transpose", + // un-blocked: 2 sparse dims map to single nnz dim, so dense dim0/1 are + // one position left + [&]() { return self.values().transpose(dim0 - 1, dim1 - 1); }, + // blocked: 2 sparse dims map to 3 (nnz, ) + blocksize dims, so dense + // dim0/1 are one position right + [&]() { return self.values().transpose(dim0 + 1, dim1 + 1); }); + } else /*if (transpose_type == TransposeDim::Sparse) */ { + // Flip the layout + result_layout = sparse_csr::flip_compressed_layout(self.layout()); + result_vals = AT_DISPATCH_PLAIN_SPARSE_COMPRESSED_LAYOUTS( + self.layout(), + "sparse_transpose", + // un-blocked: no change to values, layout is flipped. + [&]() { return self.values(); }, + // blocked: the blocks are nested under the sparse dims so they must be + // transposed as well. + [&]() { + return self.values().transpose(-2 - n_dense_dim, -1 - n_dense_dim); + }); + } + return at::native::_sparse_compressed_tensor_unsafe( + compressed_inds, + plain_inds, + result_vals, + result_sizes, + self.scalar_type(), + result_layout, + self.device()); +} +} // namespace + Tensor transpose(const Tensor & self, int64_t dim0, int64_t dim1) { auto ndims = self.dim(); dim0 = maybe_wrap_dim(dim0, ndims); @@ -2533,45 +3002,25 @@ Tensor transpose(const Tensor & self, int64_t dim0, int64_t dim1) { Tensor self_clone = self.clone(); return sparse_transpose_(self_clone, dim0, dim1); } - TORCH_CHECK(!(self.layout() == kSparseBsr || self.layout() == kSparseBsc), - "Transposition of tensors with ", self.layout(), " layout is currently not supported."); - - // Transpose of a tensor is a view operation. 
- if (dim0 == dim1) { - return self; + if (self.layout() == kSparseBsr || self.layout() == kSparseCsr || + self.layout() == kSparseBsc || self.layout() == kSparseCsc) { + return sparse_compressed_transpose(self, dim0, dim1); } if (self.is_mkldnn()) { return at::_mkldnn_transpose(self, dim0, dim1); } - DimVector sizes(self.sizes().begin(), self.sizes().end()); - std::swap(sizes[dim0], sizes[dim1]); - - if (self.layout() == kSparseCsr) { - TORCH_CHECK(self.dim() == 2, "Transposition for layout ", self.layout(), " is only supported for 2D inputs.") - return at::native::_sparse_csc_tensor_unsafe( - self.crow_indices(), - self.col_indices(), - self.values(), - sizes, - self.scalar_type(), - c10::kSparseCsc, - self.device()); - } - if (self.layout() == kSparseCsc) { - return at::native::_sparse_csr_tensor_unsafe( - self.ccol_indices(), - self.row_indices(), - self.values(), - sizes, - self.scalar_type(), - c10::kSparseCsr, - self.device()); + // Transpose of a tensor is a view operation. + if (dim0 == dim1) { + return self.alias(); } - DimVector strides(self.strides().begin(), self.strides().end()); + + SymDimVector sizes(self.sym_sizes().begin(), self.sym_sizes().end()); + std::swap(sizes[dim0], sizes[dim1]); + SymDimVector strides(self.sym_strides().begin(), self.sym_strides().end()); std::swap(strides[dim0], strides[dim1]); - auto result = self.as_strided(sizes, strides); + auto result = self.as_strided_symint(sizes, strides); propagate_transposed_names(result, self, dim0, dim1); return result; } @@ -2599,30 +3048,30 @@ Tensor & t_(Tensor & self) { return self.transpose_(0, self.dim() < 2 ? 0 : 1); } -std::tuple +std::tuple inferSqueezeGeometry(const Tensor &tensor) { - DimVector sizes; - DimVector strides; + SymDimVector sizes; + SymDimVector strides; for(const auto d : c10::irange(tensor.dim())) { - if(tensor.sizes()[d] != 1) { - sizes.push_back(tensor.sizes()[d]); - strides.push_back(tensor.strides()[d]); + if(tensor.sym_sizes()[d] != 1) { + sizes.push_back(tensor.sym_sizes()[d]); + strides.push_back(tensor.sym_strides()[d]); } } return std::make_tuple(std::move(sizes), std::move(strides)); } -std::tuple +std::tuple inferSqueezeGeometry(const Tensor& tensor, int64_t dim) { - DimVector sizes; - DimVector strides; + SymDimVector sizes; + SymDimVector strides; for(const auto d : c10::irange(tensor.dim())) { - if(d != dim || tensor.sizes()[dim] != 1) { - sizes.push_back(tensor.sizes()[d]); - strides.push_back(tensor.strides()[d]); + if(d != dim || tensor.sym_sizes()[dim] != 1) { + sizes.push_back(tensor.sym_sizes()[d]); + strides.push_back(tensor.sym_strides()[d]); } } return std::make_tuple(std::move(sizes), std::move(strides)); @@ -2652,14 +3101,14 @@ inferUnsqueezeGeometry(const Tensor& tensor, int64_t dim) { // dim is present if squeezing a single dimension and absent if squeezing all dimensions Tensor squeeze_qtensor(const Tensor& self, c10::optional dim) { auto quantizer = get_qtensorimpl(self)->quantizer(); - DimVector sizes; - DimVector strides; + SymDimVector sizes; + SymDimVector strides; std::tie(sizes, strides) = dim.has_value() ? inferSqueezeGeometry(self, dim.value()) : inferSqueezeGeometry(self); if (quantizer->qscheme() == QScheme::PER_CHANNEL_AFFINE) { const auto* per_channel_quantizer = static_cast(quantizer.get()); auto axis = per_channel_quantizer->axis(); int64_t shift = 0; - integer_range dims = dim.has_value() ? integer_range{dim.value(), dim.value() + 1} : c10::irange(self.dim()); + integer_range dims = dim.has_value() ? 
integer_range{dim.value(), dim.value() + 1} : c10::irange(0, self.dim()); for (const auto d : dims) { if (self.sizes()[d] == 1) { TORCH_CHECK(axis != d, "Squeeze is only possible on non-axis dimension for Per-Channel Quantized Tensors."); @@ -2674,7 +3123,9 @@ Tensor squeeze_qtensor(const Tensor& self, c10::optional dim) { axis, quantizer->scalar_type()); } - auto result = make_qtensor(self, sizes, strides, quantizer); + // TODO: quantized Tensor support for SymInt needs to be added but basic building blocs + // are missing for now. + auto result = make_qtensor(self, c10::asIntArrayRefSlow(sizes), c10::asIntArrayRefSlow(strides), quantizer); if (dim.has_value()) { namedinference::propagate_names_except(result, self, {dim.value()}); } else { @@ -2687,7 +3138,7 @@ Tensor squeeze_qtensor(const Tensor& self, c10::optional dim) { Tensor squeeze(const Tensor& self) { auto g = inferSqueezeGeometry(self); - at::Tensor result = self.as_strided(std::get<0>(g), std::get<1>(g)); + at::Tensor result = self.as_strided_symint(std::get<0>(g), std::get<1>(g)); auto maybe_outnames = namedinference::compute_squeeze_outnames(self); namedinference::propagate_names_if_nonempty(result, maybe_outnames); return result; @@ -2703,11 +3154,11 @@ Tensor squeeze_quantized(const Tensor& self) { Tensor squeeze(const Tensor& self, int64_t dim) { int64_t dims = self.dim(); dim = maybe_wrap_dim(dim, dims); - if (dims == 0 || self.sizes()[dim] != 1) { - return self.as_strided(self.sizes(), self.strides()); + if (dims == 0 || self.sym_sizes()[dim] != 1) { + return self.as_strided_symint(self.sym_sizes(), self.sym_strides()); } auto g = inferSqueezeGeometry(self, dim); - auto result = self.as_strided(std::get<0>(g), std::get<1>(g)); + auto result = self.as_strided_symint(std::get<0>(g), std::get<1>(g)); namedinference::propagate_names_except(result, self, {dim}); return result; } @@ -2720,7 +3171,7 @@ Tensor squeeze_quantized(const Tensor& self, int64_t dim) { Tensor & squeeze_(Tensor& self) { auto g = inferSqueezeGeometry(self); - self.as_strided_(std::get<0>(g), std::get<1>(g)); + self.as_strided__symint(std::get<0>(g), std::get<1>(g)); return self; } @@ -2728,12 +3179,12 @@ Tensor & squeeze_(Tensor& self, int64_t dim) { int64_t dims = self.dim(); dim = maybe_wrap_dim(dim, self.dim()); - if (dims == 0 || self.sizes()[dim] != 1) { - self.as_strided_(self.sizes(), self.strides()); + if (dims == 0 || self.sym_sizes()[dim] != 1) { + self.as_strided__symint(self.sym_sizes(), self.sym_strides()); return self; } auto g = inferSqueezeGeometry(self, dim); - self.as_strided_(std::get<0>(g), std::get<1>(g)); + self.as_strided__symint(std::get<0>(g), std::get<1>(g)); return self; } @@ -2782,7 +3233,7 @@ Tensor unsqueeze_sparse(Tensor const &self, int64_t dim) { if (dim <= sparse_dim) { auto new_indices = at::cat( {indices.narrow(0, 0, dim), - native::zeros( + at::zeros( {1, indices.size(1)}, kLong, indices.options().layout_opt(), @@ -2839,18 +3290,18 @@ Tensor flatten(const Tensor& self, int64_t start_dim, int64_t end_dim) { // of freedom we don't want; for example, consider shape [0, 1, 3, 0], with start_dim=1, end_dim=2. // It's clear we want result shape [0, 3, 0] but passing [0, -1, 0] to infer_size means the -1 // can take on any value and satisfy the constraints. 
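// A worked instance of the example above:
//   self.sizes() == [0, 1, 3, 0], start_dim == 1, end_dim == 2
//   slice_numel == 1 * 3 == 3 (product over dims 1..2), so the target shape is
//   [0, 3, 0]; using -1 for the flattened dim would let infer_size pick any
//   value there, because the total number of elements is 0.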
- auto slice_numel = c10::multiply_integers(self.sizes().slice(start_dim, end_dim - start_dim + 1)); - std::vector shape; + auto slice_numel = c10::multiply_integers(self.sym_sizes().slice(start_dim, end_dim - start_dim + 1)); + std::vector shape; shape.reserve(self.dim() - end_dim + start_dim); for (const auto i : c10::irange(start_dim)) { - shape.push_back(self.sizes()[i]); + shape.push_back(self.sym_sizes()[i]); } shape.push_back(slice_numel); for (const auto i : c10::irange(end_dim + 1, self.dim())) { - shape.push_back(self.sizes()[i]); + shape.push_back(self.sym_sizes()[i]); } - return native::reshape(self, shape); + return native::reshape_symint(self, shape); } Tensor flatten(const Tensor& self, int64_t start_dim, int64_t end_dim, Dimname out_dim) { @@ -3119,17 +3570,12 @@ Tensor adjoint(const Tensor &self) { } Tensor view(const Tensor& self, - IntArrayRef size) { + at::IntArrayRef size) { return view_impl(self, size); } -Tensor view_symint(const Tensor& self, - c10::SymIntArrayRef size) { - return self.view(c10::asIntArrayRefSlow(size)); -} - Tensor alias(const Tensor& self) { - return alias_with_sizes_and_strides(self, self.sizes(), self.strides()); + return alias_with_sizes_and_strides(self, self.sizes(), self.strides()); } Tensor detach(const Tensor& self) { @@ -3142,107 +3588,54 @@ Tensor detach(const Tensor& self) { /*allow_tensor_metadata_change=*/false)); } -Tensor unfold(const Tensor& self, int64_t dimension, int64_t size, int64_t step) { - // some special handling to deal with allow dimension == 0 when self.dim() == 0 - dimension = at::maybe_wrap_dim(dimension, self.dim(), /*wrap_scalar=*/true); +Tensor unfold(const Tensor& self, int64_t d, int64_t size, int64_t step) { + // some special handling to deal with allow d == 0 when self.dim() == 0 + auto ndim = self.dim(); + d = at::maybe_wrap_dim(d, ndim, /*wrap_scalar=*/true); - const auto sizes = self.sizes(); - const auto strides = self.strides(); - int64_t max_size = self.dim() == 0 ? 1 : sizes[dimension]; - TORCH_CHECK(size <= max_size, "maximum size for tensor at dimension ", dimension, + auto sizes = self.sizes().vec(); + auto strides = self.strides().vec(); + int64_t max_size = self.dim() == 0 ? 1 : sizes[d]; + TORCH_CHECK(size <= max_size, "maximum size for tensor at dimension ", d, " is ", max_size, " but size is ", size); TORCH_CHECK(step > 0, "step is ", step, " but must be > 0"); - - DimVector new_size(self.dim() + 1); - DimVector new_stride(self.dim() + 1); - - new_size[self.dim()] = size; - new_stride[self.dim()] = self.dim() == 0 ? 1 : strides[dimension]; - for(const auto d : c10::irange(self.dim())) { - const auto self_size = sizes[d]; - const auto self_stride = strides[d]; - if(d == dimension) { - new_size[d] = (self_size - size) / step + 1; - new_stride[d] = step*self_stride; - } else { - new_size[d] = self_size; - new_stride[d] = self_stride; - } + sizes.push_back(size); + strides.push_back(self.dim() == 0 ? 
1 : strides[d]); + // The if handles the self.dim() == 0 case + if (d < ndim) { + sizes[d] = (sizes[d] - size) / step + 1; + strides[d] *= step; } - - return self.as_strided(new_size, new_stride); + return self.as_strided(sizes, strides); } -template -void apply_diag(Tensor& result, const Tensor& self, int64_t dimension) { - TORCH_CHECK(self.dim() == 1 || self.dim() == 2, "matrix or a vector expected"); - - auto self_data = self.data_ptr(); - if (self.dim() == 1) { - auto self_size = self.size(0); - auto self_stride = self.stride(0); - int64_t sz = self_size + std::abs(dimension); - - at::native::resize_output(result, {sz, sz}); - result.zero_(); - auto r_data = result.data_ptr(); - auto r_stride_0 = result.stride(0); - auto r_stride_1 = result.stride(1); - r_data += (dimension >= 0 ? dimension*r_stride_1 : -dimension*r_stride_0); - - for (const auto i : c10::irange(self_size)) { - r_data[i * (r_stride_0 + r_stride_1)] = self_data[i * self_stride]; - } +Tensor diag(const Tensor& self, int64_t offset) { + auto ndim = self.dim(); + TORCH_CHECK(ndim == 1 || ndim == 2, "diag(): Supports 1D or 2D tensors. Got ", self.dim(), "D"); + if (ndim == 1) { + return at::diag_embed(self, offset); } else { - auto self_stride_0 = self.stride(0); - auto self_stride_1 = self.stride(1); - - // NOLINTNEXTLINE(cppcoreguidelines-init-variables) - int64_t sz; - if (dimension >= 0) { - sz = std::min(self.size(0), self.size(1) - dimension); - } else { - sz = std::min(self.size(0) + dimension, self.size(1)); - } - - at::native::resize_output(result, {sz}); - result.zero_(); - auto r_data = result.data_ptr(); - auto r_stride_0 = result.stride(0); - self_data += (dimension >= 0 ? dimension * self_stride_1 : -dimension * self_stride_0); - for (const auto i : c10::irange(sz)) { - r_data[i * r_stride_0] = self_data[i * (self_stride_0 + self_stride_1)]; - } + // We return a copy of the diagonal + return at::diagonal_copy(self, offset); } } -Tensor diag(const Tensor& self, int64_t dimension) { - Tensor result = at::empty({0}, self.options()); - at::diag_out(result, self, dimension); - return result; -} - -Tensor& diag_cpu_out(const Tensor& self, int64_t dimension, Tensor &result) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(kBFloat16, kBool, self.scalar_type(), "diag", [&] { - apply_diag(result, self, dimension); - }); - return result; -} - -Tensor diag_backward(const Tensor& grad, IntArrayRef input_sizes, int64_t diagonal) { - auto ndimension = input_sizes.size(); - AT_ASSERT(ndimension == 1 || ndimension == 2); - - if (ndimension == 1 || input_sizes[0] == input_sizes[1]) { - return grad.diag(diagonal); +Tensor& diag_out(const Tensor& self, int64_t offset, Tensor& out) { + auto ndim = self.dim(); + TORCH_CHECK(ndim == 1 || ndim == 2, "Supports 1D or 2D tensors. 
Got ", self.dim(), "D"); + if (ndim == 1) { + TORCH_CHECK( + canCast(self.scalar_type(), out.scalar_type()), + "diag: result type ", self.scalar_type(), " can't be cast to the desired out= type ", + out.scalar_type()); + return at::diag_embed_out(out, self, offset); + } else { + return at::diagonal_copy_out(out, self, offset); } - - // Input was a matrix but was not square - return at::diagonal_backward(grad, input_sizes, diagonal, 0, 1); } -Tensor diagonal_backward(const Tensor & grad, IntArrayRef input_sizes, int64_t offset, int64_t dim1, int64_t dim2) { - auto grad_input = at::zeros(input_sizes, grad.options()); +Tensor diagonal_backward_symint(const Tensor & grad, SymIntArrayRef input_sizes, int64_t offset, int64_t dim1, int64_t dim2) { + auto grad_input = at::zeros_symint(input_sizes, grad.options()); auto diag = grad_input.diagonal(offset, dim1, dim2); diag.copy_(grad); return grad_input; @@ -3250,7 +3643,7 @@ Tensor diagonal_backward(const Tensor & grad, IntArrayRef input_sizes, int64_t o Tensor movedim(const Tensor& self, IntArrayRef src, IntArrayRef dst) { TORCH_CHECK(src.size() == dst.size(), "movedim: Invalid source or destination dims: source (", - src, " dims ) should contain the same number of dims as destination (", dst, " dims)"); + src, " dims) should contain the same number of dims as destination (", dst, " dims)"); size_t self_dim = self.dim(); DimVector normalized_src(src.size()); @@ -3399,9 +3792,9 @@ at::Tensor slice_scatter(const at::Tensor& self, const at::Tensor& src, int64_t slice.copy_(src); return output; } -at::Tensor select_scatter(const at::Tensor& self, const at::Tensor& src, int64_t dim, int64_t index) { +at::Tensor select_scatter_symint(const at::Tensor& self, const at::Tensor& src, int64_t dim, c10::SymInt index) { auto output = self.clone(); - auto slice = output.select(dim, index); + auto slice = output.select_symint(dim, index); TORCH_CHECK(slice.sizes() == src.sizes(), "expected src to have a size equal to the slice of self. src size = ", src.sizes(), ", slice size = ", slice.sizes()); slice.copy_(src); return output; @@ -3413,12 +3806,12 @@ at::Tensor diagonal_scatter(const at::Tensor& self, const at::Tensor& src, int64 slice.copy_(src); return output; } -at::Tensor as_strided_scatter(const at::Tensor& self, const at::Tensor& src, at::IntArrayRef size, at::IntArrayRef stride, c10::optional storage_offset) { +at::Tensor as_strided_scatter_symint(const at::Tensor& self, const at::Tensor& src, at::SymIntArrayRef size, at::SymIntArrayRef stride, c10::optional storage_offset) { // See Note [as_strided_scatter backward support] TORCH_INTERNAL_ASSERT(!self.requires_grad() || self.is_contiguous(), "as_strided_scatter is currently only supported for contiguous inputs"); auto output = self.clone(); - auto slice = output.as_strided(size, stride, storage_offset); - TORCH_CHECK(slice.sizes() == src.sizes(), "expected src to have a size equal to the slice of self. src size = ", src.sizes(), ", slice size = ", slice.sizes()); + auto slice = output.as_strided_symint(size, stride, storage_offset); + TORCH_CHECK(slice.sym_sizes() == src.sym_sizes(), "expected src to have a size equal to the slice of self. 
src size = ", src.sym_sizes(), ", slice size = ", slice.sym_sizes()); slice.copy_(src); return output; } @@ -3477,8 +3870,8 @@ at::Tensor& _neg_view_copy_out(const at::Tensor & self, at::Tensor & out) { } -at::Tensor& as_strided_copy_out(const at::Tensor & self, at::IntArrayRef size, at::IntArrayRef stride, c10::optional storage_offset, at::Tensor & out) { - auto tmp = self.as_strided(size, stride, storage_offset); +at::Tensor& as_strided_copy_out_symint(const at::Tensor & self, at::SymIntArrayRef size, at::SymIntArrayRef stride, c10::optional storage_offset, at::Tensor & out) { + auto tmp = self.as_strided_symint(size, stride, storage_offset); out.copy_(tmp); return out; } @@ -3492,8 +3885,16 @@ at::Tensor& _sparse_broadcast_to_copy_out(const at::Tensor & self, at::IntArrayR at::Tensor& diagonal_copy_out(const at::Tensor & self, int64_t offset, int64_t dim1, int64_t dim2, at::Tensor & out) { - auto tmp = self.diagonal(offset, dim1, dim2); - out.copy_(tmp); + TORCH_CHECK( + out.device() == self.device(), + "diagonal_copy: Expected out and self tensors to be on the same device, but got ", + "out on ", out.device(), " and self on ", self.device()); + auto result = self.diagonal(offset, dim1, dim2); + at::native::resize_output(out, result.sizes()); + TORCH_CHECK( + canCast(result.scalar_type(), out.scalar_type()), + "diagonal_copy: result type ", result.scalar_type(), " can't be cast to the desired out= type ", out.scalar_type()); + out.copy_(result); return out; } @@ -3505,8 +3906,8 @@ at::Tensor& expand_copy_SymInt_out(const at::Tensor & self, c10::SymIntArrayRef } -at::Tensor& expand_copy_out(const at::Tensor & self, at::IntArrayRef size, bool implicit, at::Tensor & out) { - auto tmp = self.expand(size, implicit); +at::Tensor& expand_copy_out_symint(const at::Tensor & self, at::SymIntArrayRef size, bool implicit, at::Tensor & out) { + auto tmp = self.expand_symint(size, implicit); out.copy_(tmp); return out; } @@ -3533,8 +3934,8 @@ at::Tensor& _reshape_alias_copy_out(const at::Tensor & self, at::IntArrayRef siz } -at::Tensor& select_copy_int_out(const at::Tensor & self, int64_t dim, int64_t index, at::Tensor & out) { - auto tmp = self.select(dim, index); +at::Tensor& select_copy_symint_out(const at::Tensor & self, int64_t dim, c10::SymInt index, at::Tensor & out) { + auto tmp = self.select_symint(dim, index); out.copy_(tmp); return out; } @@ -3661,8 +4062,8 @@ void unbind_copy_int_out(const at::Tensor & self, int64_t dim, at::TensorList o } -at::Tensor& view_copy_out(const at::Tensor & self, at::IntArrayRef size, at::Tensor & out) { - auto tmp = self.view(size); +at::Tensor& view_copy_out_symint(const at::Tensor & self, at::SymIntArrayRef size, at::Tensor & out) { + auto tmp = self.view_symint(size); out.copy_(tmp); return out; } @@ -3688,5 +4089,13 @@ at::Tensor& alias_copy_out(const at::Tensor & self, at::Tensor & out) { return out; } +int64_t sparse_dim_strided(const at::Tensor& self) { + return 0; +} + +int64_t dense_dim_strided(const at::Tensor& self) { + return self.dim(); +} + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/TensorShape.h b/aten/src/ATen/native/TensorShape.h index bb296b5ae5bc..60e2533e9b53 100644 --- a/aten/src/ATen/native/TensorShape.h +++ b/aten/src/ATen/native/TensorShape.h @@ -1,6 +1,7 @@ #pragma once #include #include +#include namespace at { namespace native { @@ -26,11 +27,12 @@ inline void check_cat_shape_except_dim(const Tensor & first, const Tensor & seco } } -inline void check_cat_no_zero_dim(at::ArrayRef tensors) { - for(const 
auto i : c10::irange(tensors.size())) { - auto& t = tensors[i]; +inline void check_cat_no_zero_dim(const MaterializedITensorListRef& tensors) { + int64_t i = 0; + for(const Tensor& t : tensors) { TORCH_CHECK(t.dim() > 0, "zero-dimensional tensor (at position ", i, ") cannot be concatenated"); + i++; } } @@ -51,11 +53,4 @@ inline int64_t get_num_splits(const Tensor& self, int64_t split_size, int64_t di return num_splits; } -/// -/// For more information, see -/// https://pytorch.org/docs/master/generated/torch.Tensor.unfold.html#torch.Tensor.unfold -/// - -Tensor unfold(const Tensor& self, int64_t dimension, int64_t size, int64_t step); - }} // namespace at::native diff --git a/aten/src/ATen/native/TensorTransformations.cpp b/aten/src/ATen/native/TensorTransformations.cpp index f0e2c0f02caa..028b05e66930 100644 --- a/aten/src/ATen/native/TensorTransformations.cpp +++ b/aten/src/ATen/native/TensorTransformations.cpp @@ -1,14 +1,31 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include // for flip_stub -#include -#include #include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include diff --git a/aten/src/ATen/native/TestOps.cpp b/aten/src/ATen/native/TestOps.cpp index a8c30f5c3ba6..f36765436991 100644 --- a/aten/src/ATen/native/TestOps.cpp +++ b/aten/src/ATen/native/TestOps.cpp @@ -1,10 +1,25 @@ // Copyright 2004-present Facebook. All Rights Reserved. +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS -#include +#include #include -#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #include namespace at { diff --git a/aten/src/ATen/native/TriangularOps.cpp b/aten/src/ATen/native/TriangularOps.cpp index d5f408a74f1b..59d2b8a0d224 100644 --- a/aten/src/ATen/native/TriangularOps.cpp +++ b/aten/src/ATen/native/TriangularOps.cpp @@ -1,22 +1,34 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include #include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace meta { TORCH_META_FUNC(tril)(const Tensor& self, int64_t k) { + TORCH_CHECK(self.dim() >= 2, "tril: input tensor must have at least 2 dimensions") set_output_raw_strided(0, self.sizes(), {}, self.options()); } TORCH_META_FUNC(triu)(const Tensor& self, int64_t k) { + TORCH_CHECK(self.dim() >= 2, "triu: input tensor must have at least 2 dimensions") set_output_raw_strided(0, self.sizes(), {}, self.options()); } @@ -168,12 +180,16 @@ TORCH_IMPL_FUNC(triu_cpu)(const Tensor& self, int64_t k, const Tensor &result) { compute_triu_tril(self, k, result); } -Tensor trace_backward(const Tensor& grad, IntArrayRef sizes) { +Tensor trace_backward(const Tensor& grad, at::IntArrayRef sizes) { + return at::native::trace_backward_symint(grad, c10::fromIntArrayRefSlow(sizes)); +} + +Tensor trace_backward_symint(const Tensor& grad, c10::SymIntArrayRef sizes) { if (sizes.size() != 2) { throw std::runtime_error("expected matrix input"); } - auto grad_input = at::zeros(sizes[0] * sizes[1], grad.options()); + auto grad_input = at::zeros_symint(sizes[0] * sizes[1], grad.options()); auto indices = at::arange(0, grad_input.numel(), sizes[1] + 1, 
grad.options().dtype(at::kLong)); // for composite compliance, use out-of-place variant of // `index_fill` if grad tensor is a Tensor Subclass. @@ -182,7 +198,7 @@ Tensor trace_backward(const Tensor& grad, IntArrayRef sizes) { } else { grad_input.index_fill_(0, indices, grad); } - return grad_input.view(sizes); + return grad_input.view_symint(sizes); } } // namespace native diff --git a/aten/src/ATen/native/TriangularOpsUtils.h b/aten/src/ATen/native/TriangularOpsUtils.h index c5bce42ed3fd..e380a510bdde 100644 --- a/aten/src/ATen/native/TriangularOpsUtils.h +++ b/aten/src/ATen/native/TriangularOpsUtils.h @@ -1,4 +1,4 @@ -#include +#include #include namespace at { diff --git a/aten/src/ATen/native/TypeProperties.cpp b/aten/src/ATen/native/TypeProperties.cpp index feceb75631ce..36354c133a98 100644 --- a/aten/src/ATen/native/TypeProperties.cpp +++ b/aten/src/ATen/native/TypeProperties.cpp @@ -1,8 +1,26 @@ -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/UnaryOps.cpp b/aten/src/ATen/native/UnaryOps.cpp index 160955a01350..845610ce373e 100644 --- a/aten/src/ATen/native/UnaryOps.cpp +++ b/aten/src/ATen/native/UnaryOps.cpp @@ -1,26 +1,174 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include +#include +#include +#include +#include +#include #include -#include -#include -#include #include #include -#include -#include #include -#include -#include -#include -#include -#include +#include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + +#include namespace at { @@ -157,6 +305,21 @@ TORCH_IMPL_FUNC(func_out) (const Tensor& self, const Tensor& result) { \ func_stub(device_type(), *this); \ } +// This macro is as optional as the one above. 
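// (Concretely: for integral inputs these ops are the identity, so the
// structured out-kernel can copy instead of calling the stub. Illustrative
// standalone sketch assuming ATen headers, not the macro expansion itself:)
#include <ATen/ATen.h>

at::Tensor ceil_for_any_dtype(const at::Tensor& self) {
  if (c10::isIntegralType(self.scalar_type(), /*includeBool=*/false)) {
    return self.clone();   // identity for integer dtypes, see gh-70918
  }
  return at::ceil(self);   // floating-point path still rounds up
}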
torch.(ceil|floor|round|trunc) are no-ops for integers +// See gh-70918 +#define CREATE_UNARY_TORCH_IMPL_INTEGER_NO_OP_FUNC(func_out, func_stub) \ +TORCH_IMPL_FUNC(func_out) (const Tensor& self, const Tensor& result) { \ + if (c10::isIntegralType(self.scalar_type(), /*includeBool=*/false)) { \ + result.copy_(self); \ + } else { \ + func_stub(device_type(), *this); \ + } \ +} +CREATE_UNARY_TORCH_IMPL_INTEGER_NO_OP_FUNC(ceil_out, ceil_stub) +CREATE_UNARY_TORCH_IMPL_INTEGER_NO_OP_FUNC(floor_out, floor_stub) +CREATE_UNARY_TORCH_IMPL_INTEGER_NO_OP_FUNC(round_out, round_stub) +CREATE_UNARY_TORCH_IMPL_INTEGER_NO_OP_FUNC(trunc_out, trunc_stub) + CREATE_UNARY_TORCH_IMPL_FUNC(acos_out, acos_stub) CREATE_UNARY_TORCH_IMPL_FUNC(acosh_out, acosh_stub) CREATE_UNARY_TORCH_IMPL_FUNC(asin_out, asin_stub) @@ -164,7 +327,6 @@ CREATE_UNARY_TORCH_IMPL_FUNC(asinh_out, asinh_stub) CREATE_UNARY_TORCH_IMPL_FUNC(atan_out, atan_stub) CREATE_UNARY_TORCH_IMPL_FUNC(atanh_out, atanh_stub) CREATE_UNARY_TORCH_IMPL_FUNC(bitwise_not_out, bitwise_not_stub) -CREATE_UNARY_TORCH_IMPL_FUNC(ceil_out, ceil_stub) CREATE_UNARY_TORCH_IMPL_FUNC(cos_out, cos_stub) CREATE_UNARY_TORCH_IMPL_FUNC(cosh_out, cosh_stub) CREATE_UNARY_TORCH_IMPL_FUNC(digamma_out, digamma_stub) @@ -174,7 +336,6 @@ CREATE_UNARY_TORCH_IMPL_FUNC(erfinv_out, erfinv_stub) CREATE_UNARY_TORCH_IMPL_FUNC(exp_out, exp_stub) CREATE_UNARY_TORCH_IMPL_FUNC(exp2_out, exp2_stub) CREATE_UNARY_TORCH_IMPL_FUNC(expm1_out, expm1_stub) -CREATE_UNARY_TORCH_IMPL_FUNC(floor_out, floor_stub) CREATE_UNARY_TORCH_IMPL_FUNC(frac_out, frac_stub) CREATE_UNARY_TORCH_IMPL_FUNC(i0_out, i0_stub) CREATE_UNARY_TORCH_IMPL_FUNC(lgamma_out, lgamma_stub) @@ -184,7 +345,6 @@ CREATE_UNARY_TORCH_IMPL_FUNC(log1p_out, log1p_stub) CREATE_UNARY_TORCH_IMPL_FUNC(log2_out, log2_stub) CREATE_UNARY_TORCH_IMPL_FUNC(neg_out, neg_stub) CREATE_UNARY_TORCH_IMPL_FUNC(reciprocal_out, reciprocal_stub) -CREATE_UNARY_TORCH_IMPL_FUNC(round_out, round_stub) CREATE_UNARY_TORCH_IMPL_FUNC(rsqrt_out, rsqrt_stub) CREATE_UNARY_TORCH_IMPL_FUNC(sigmoid_out, sigmoid_stub) CREATE_UNARY_TORCH_IMPL_FUNC(sign_out, sign_stub) @@ -201,7 +361,6 @@ CREATE_UNARY_TORCH_IMPL_FUNC(special_log_ndtr_out, special_log_ndtr_stub) CREATE_UNARY_TORCH_IMPL_FUNC(sqrt_out, sqrt_stub) CREATE_UNARY_TORCH_IMPL_FUNC(tan_out, tan_stub) CREATE_UNARY_TORCH_IMPL_FUNC(tanh_out, tanh_stub) -CREATE_UNARY_TORCH_IMPL_FUNC(trunc_out, trunc_stub) CREATE_UNARY_TORCH_IMPL_FUNC(special_airy_ai_out, special_airy_ai_stub) CREATE_UNARY_TORCH_IMPL_FUNC(special_bessel_j0_out, special_bessel_j0_stub) CREATE_UNARY_TORCH_IMPL_FUNC(special_bessel_j1_out, special_bessel_j1_stub) @@ -723,8 +882,7 @@ constexpr double QUARTER = 0.25; } static inline void mvlgamma_check(const Tensor& self, int64_t p) { - TORCH_CHECK((self > HALF * (p - 1)).all().item(), - "All elements must be greater than (p-1)/2"); + TORCH_CHECK(self.scalar_type() != kBool, "The input tensor may not be a boolean tensor."); TORCH_CHECK(p >= 1, "p has to be greater than or equal to 1"); } diff --git a/aten/src/ATen/native/Unfold2d.cpp b/aten/src/ATen/native/Unfold2d.cpp index 0a3b760a33fd..60bbc8a77712 100644 --- a/aten/src/ATen/native/Unfold2d.cpp +++ b/aten/src/ATen/native/Unfold2d.cpp @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include namespace at { namespace native { diff --git a/aten/src/ATen/native/Unfold3d.cpp b/aten/src/ATen/native/Unfold3d.cpp index 3495f92dc3ce..1a2d0ea2ae1f 100644 --- a/aten/src/ATen/native/Unfold3d.cpp +++ b/aten/src/ATen/native/Unfold3d.cpp @@ -1,5 +1,7 @@ -#include +#define 
TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include #include diff --git a/aten/src/ATen/native/Unfold3d.h b/aten/src/ATen/native/Unfold3d.h index 51eb89f9b810..e9b5a34a8d10 100644 --- a/aten/src/ATen/native/Unfold3d.h +++ b/aten/src/ATen/native/Unfold3d.h @@ -1,6 +1,6 @@ #pragma once -#include +#include namespace at { namespace native { diff --git a/aten/src/ATen/native/UnfoldBackward.cpp b/aten/src/ATen/native/UnfoldBackward.cpp index 10bee80cea23..494143232116 100644 --- a/aten/src/ATen/native/UnfoldBackward.cpp +++ b/aten/src/ATen/native/UnfoldBackward.cpp @@ -5,6 +5,7 @@ #include #include #else +#include #include #include #endif @@ -21,6 +22,11 @@ Tensor unfold_backward( int64_t step ) { auto grad_input = at::zeros(input_sizes, grad.options()); + if (step >= size) { + auto gI_unfolded = grad_input.unfold(dim, size, step); + gI_unfolded.copy_(grad); + return grad_input; + } unfold_backward_stub( grad.device().type(), diff --git a/aten/src/ATen/native/UnfoldBackward.h b/aten/src/ATen/native/UnfoldBackward.h index 1f6c8fa1b289..f8099167361c 100644 --- a/aten/src/ATen/native/UnfoldBackward.h +++ b/aten/src/ATen/native/UnfoldBackward.h @@ -1,10 +1,9 @@ #pragma once #include -#include +#include #include -#include -#include +#include #ifndef AT_PER_OPERATOR_HEADERS #include @@ -108,79 +107,6 @@ static C10_UNUSED TensorIterator _make_unfold_backward_iter_over_grad_out( return iter; } -static C10_UNUSED TensorIterator _make_unfold_backward_iter_over_grad_in( - Tensor& grad_out, - const Tensor& grad_in, - int64_t dim, - int64_t /*size*/, - int64_t /*step*/ -) { - dim = maybe_wrap_dim(dim, grad_out.dim()); - // last dim stores the folds - auto last_dim = maybe_wrap_dim(-1, grad_in.dim()); - - auto grad_in_dim = ensure_nonempty_dim(grad_in.dim()); - auto grad_in_dim_size = ensure_nonempty_size(grad_in, dim); - auto grad_in_last_dim_size = ensure_nonempty_size(grad_in, last_dim); - - /* prepare grad_out for TensorIterator { */ - auto grad_out_restrided = grad_out.unsqueeze(-1); - - auto grad_out_strides = ensure_nonempty_vec(grad_out_restrided.strides().vec()); - auto grad_out_sizes = ensure_nonempty_vec(grad_out_restrided.sizes().vec()); - - grad_out_strides[dim] = 0; - grad_out_strides[last_dim] = 0; - - grad_out_sizes[dim] = grad_in_dim_size; - grad_out_sizes[last_dim] = grad_in_last_dim_size; - - grad_out_restrided = grad_out_restrided.as_strided(grad_out_sizes, grad_out_strides); - /* } */ - - // for each element grad_out[i_1,...,i_dim,...,i_last_dim] - // we have to know i_dim and i_last_dim. 
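// Sketch (assumes ATen headers) of why the step >= size fast path added to
// unfold_backward above is valid: non-overlapping windows mean every input
// element lands in at most one window, so no accumulation is needed and the
// unfolded view of a zero-filled grad_input can simply receive a copy of grad.
#include <ATen/ATen.h>

at::Tensor unfold_backward_nonoverlapping(
    const at::Tensor& grad, at::IntArrayRef input_sizes,
    int64_t dim, int64_t size, int64_t step) {
  TORCH_CHECK(step >= size, "shortcut only valid for non-overlapping windows");
  auto grad_input = at::zeros(input_sizes, grad.options());
  grad_input.unfold(dim, size, step).copy_(grad);  // disjoint views, plain scatter
  return grad_input;
}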
- // This information is stored in Tensors - // idx_dim and idx_last_dim - /* prepare idx_dim and idx_last_dim for TensorIterator { */ - auto idx_dim = at::arange( - 0, grad_in_dim_size, grad_in.options().dtype(at::kLong) - ); - - auto idx_dim_strides = std::vector(grad_in_dim, 0); - auto idx_dim_sizes = std::vector(grad_in_dim, 1); - - idx_dim_strides[dim] = 1; - idx_dim_sizes[dim] = grad_in_dim_size; - - auto idx_dim_restrided = idx_dim.as_strided(idx_dim_sizes, idx_dim_strides); - - auto idx_last_dim = at::arange( - 0, grad_in_last_dim_size, grad_in.options().dtype(at::kLong) - ); - - auto idx_last_dim_strides = std::vector(grad_in_dim, 0); - auto idx_last_dim_sizes = std::vector(grad_in_dim, 1); - - idx_last_dim_strides[last_dim] = 1; - idx_last_dim_sizes[last_dim] = grad_in_last_dim_size; - - auto idx_last_dim_restrided = idx_last_dim.as_strided(idx_last_dim_sizes, idx_last_dim_strides); - /* } */ - - auto iter = TensorIteratorConfig() - .set_check_mem_overlap(false) - .check_all_same_dtype(false) - .resize_outputs(false) - .add_owned_output(grad_out_restrided) - .add_owned_input(grad_in) - .add_owned_input(idx_dim_restrided) - .add_owned_input(idx_last_dim_restrided) - .build(); - - return iter; -} - } }} // namespace at::native diff --git a/aten/src/ATen/native/Unique.cpp b/aten/src/ATen/native/Unique.cpp index f418611e0864..92b48c9f388c 100644 --- a/aten/src/ATen/native/Unique.cpp +++ b/aten/src/ATen/native/Unique.cpp @@ -1,8 +1,27 @@ // Returns unique elements of input tensor. +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS -#include +#include #include #include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif #include #include diff --git a/aten/src/ATen/native/UpSample.cpp b/aten/src/ATen/native/UpSample.cpp index db75b7e99fdb..1a6af7526030 100644 --- a/aten/src/ATen/native/UpSample.cpp +++ b/aten/src/ATen/native/UpSample.cpp @@ -1,4 +1,5 @@ // Copyright 2004-present Facebook. All Rights Reserved. +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include diff --git a/aten/src/ATen/native/UpSample.h b/aten/src/ATen/native/UpSample.h index 6b248352de6a..144b5921eed3 100644 --- a/aten/src/ATen/native/UpSample.h +++ b/aten/src/ATen/native/UpSample.h @@ -2,11 +2,11 @@ #include -#include +#include #include +#include #include - /** * Note [compute_scales_value] * Note [area_pixel_compute_scale] @@ -266,15 +266,13 @@ static inline scalar_t area_pixel_compute_scale( bool align_corners, const c10::optional scale) { // see Note [area_pixel_compute_scale] - if(align_corners){ + if(align_corners) { if(output_size > 1) { return static_cast(input_size - 1) / (output_size - 1); - } - else { + } else { return static_cast(0); } - } - else{ + } else { return compute_scales_value(scale, input_size, output_size); } } @@ -288,7 +286,8 @@ static inline scalar_t area_pixel_compute_source_index( if (align_corners) { return scale * dst_index; } else { - scalar_t src_idx = scale * (dst_index + 0.5) - 0.5; + scalar_t src_idx = scale * (dst_index + static_cast(0.5)) - + static_cast(0.5); // [Note] Follow Opencv resize logic: // We allow negative src_idx here and later will use // dx = src_idx - floorf(src_idx) @@ -301,7 +300,8 @@ static inline scalar_t area_pixel_compute_source_index( // where we should and then remove this cubic flag. // This matters in cubic mode, as we might need [-1, 0, 1, 2] // to interpolate and the weights can be affected. 
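// Plain-C++ restatement (a sketch, not the templated ATen helper) of the
// source-index mapping used here: with align_corners the endpoints map
// exactly, otherwise pixel centres are aligned via the half-pixel offset.
#include <cstdint>

float area_pixel_source_index_sketch(float scale, int64_t dst_index,
                                     bool align_corners, bool cubic) {
  if (align_corners) {
    return scale * static_cast<float>(dst_index);
  }
  const float src_idx = scale * (static_cast<float>(dst_index) + 0.5f) - 0.5f;
  // negative indices are clamped to 0 unless cubic interpolation needs them
  return (!cubic && src_idx < 0.f) ? 0.f : src_idx;
}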
- return (!cubic && src_idx < 0) ? scalar_t(0) : src_idx; + return (!cubic && src_idx < static_cast(0)) ? scalar_t(0) + : src_idx; } } @@ -445,8 +445,10 @@ static inline void compute_source_index_and_lambda( lambda0 = static_cast(1); lambda1 = static_cast(0); } else { - const scalar_t real_input_index = area_pixel_compute_source_index( - ratio, output_index, align_corners, /*cubic=*/false); + using opmath_t = at::opmath_type; + const auto real_input_index = + area_pixel_compute_source_index( + ratio, output_index, align_corners, /*cubic=*/false); input_index0 = static_cast(real_input_index); int64_t offset = (input_index0 < input_size - 1) ? 1 : 0; input_index1 = input_index0 + offset; diff --git a/aten/src/ATen/native/UpSampleBicubic2d.cpp b/aten/src/ATen/native/UpSampleBicubic2d.cpp index 5dd1b370b217..035bea562954 100644 --- a/aten/src/ATen/native/UpSampleBicubic2d.cpp +++ b/aten/src/ATen/native/UpSampleBicubic2d.cpp @@ -1,8 +1,24 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace meta { @@ -109,8 +125,7 @@ static void upsample_bicubic2d_backward_out_frame( for (const auto output_x : c10::irange(output_width)) { scalar_t* in = &idata[output_y * input_width + output_x]; scalar_t* out = &odata[output_y * output_width + output_x]; - for (const auto c : c10::irange(channels)) { - (void)c; //Suppress unused variable warning + for (const auto c C10_UNUSED : c10::irange(channels)) { in[0] = out[0]; in += input_width * input_height; out += output_width * output_height; @@ -146,8 +161,7 @@ static void upsample_bicubic2d_backward_out_frame( get_cubic_upsample_coefficients(x_coeffs, t_x); get_cubic_upsample_coefficients(y_coeffs, t_y); - for (const auto c : c10::irange(channels)) { - (void)c; //Suppress unused variable warning + for (const auto c C10_UNUSED : c10::irange(channels)) { scalar_t out_value = out[output_y * output_width + output_x]; for (const auto i : c10::irange(4)) { @@ -273,18 +287,6 @@ Tensor upsample_bicubic2d( return at::upsample_bicubic2d(input, osize, align_corners, scale_h, scale_w); } -Tensor upsample_bicubic2d_backward( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - bool align_corners, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_h = get_scale_value(scale_factors, 0); - auto scale_w = get_scale_value(scale_factors, 1); - return at::upsample_bicubic2d_backward(grad_output, osize, input_size, align_corners, scale_h, scale_w); -} - Tensor _upsample_bicubic2d_aa( const Tensor& input, at::OptionalIntArrayRef output_size, @@ -296,18 +298,6 @@ Tensor _upsample_bicubic2d_aa( return at::_upsample_bicubic2d_aa(input, osize, align_corners, scale_h, scale_w); } -Tensor _upsample_bicubic2d_aa_backward( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - bool align_corners, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_h = get_scale_value(scale_factors, 0); - auto scale_w = get_scale_value(scale_factors, 1); - return at::_upsample_bicubic2d_aa_backward(grad_output, osize, input_size, align_corners, scale_h, scale_w); -} - DEFINE_DISPATCH(upsample_bicubic2d_kernel); 
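// Hedged sketch of the linear-interpolation index/weight computation that
// compute_source_index_and_lambda (shown above) performs once the source
// index has been produced. The real helper is templated on scalar_t and
// accumulates in at::opmath_type; the weight naming here is assumed.
#include <cstdint>

void source_index_and_lambda_sketch(float real_input_index, int64_t input_size,
                                    int64_t& index0, int64_t& index1,
                                    float& lambda0, float& lambda1) {
  index0 = static_cast<int64_t>(real_input_index);
  const int64_t offset = (index0 < input_size - 1) ? 1 : 0;  // stay in bounds
  index1 = index0 + offset;
  lambda1 = real_input_index - static_cast<float>(index0);   // weight of index1
  lambda0 = 1.0f - lambda1;                                   // weight of index0
}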
DEFINE_DISPATCH(_upsample_bicubic2d_aa_kernel); DEFINE_DISPATCH(_upsample_bicubic2d_aa_backward_kernel); diff --git a/aten/src/ATen/native/UpSampleBilinear2d.cpp b/aten/src/ATen/native/UpSampleBilinear2d.cpp index 527555a066ab..5d91e93e016d 100644 --- a/aten/src/ATen/native/UpSampleBilinear2d.cpp +++ b/aten/src/ATen/native/UpSampleBilinear2d.cpp @@ -1,11 +1,26 @@ // Adapted from interp.cpp from Caffe util by Pauline Luc // Originally developed by George Papandreou +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS -#include -#include +#include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace meta { @@ -154,18 +169,6 @@ Tensor upsample_bilinear2d( return at::upsample_bilinear2d(input, osize, align_corners, scale_h, scale_w); } -Tensor upsample_bilinear2d_backward( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - bool align_corners, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_h = get_scale_value(scale_factors, 0); - auto scale_w = get_scale_value(scale_factors, 1); - return at::upsample_bilinear2d_backward(grad_output, osize, input_size, align_corners, scale_h, scale_w); -} - Tensor _upsample_bilinear2d_aa( const Tensor& input, at::OptionalIntArrayRef output_size, @@ -177,18 +180,6 @@ Tensor _upsample_bilinear2d_aa( return at::_upsample_bilinear2d_aa(input, osize, align_corners, scale_h, scale_w); } -Tensor _upsample_bilinear2d_aa_backward( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - bool align_corners, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_h = get_scale_value(scale_factors, 0); - auto scale_w = get_scale_value(scale_factors, 1); - return at::_upsample_bilinear2d_aa_backward(grad_output, osize, input_size, align_corners, scale_h, scale_w); -} - DEFINE_DISPATCH(upsample_bilinear2d_kernel); DEFINE_DISPATCH(upsample_bilinear2d_backward_kernel); DEFINE_DISPATCH(_upsample_bilinear2d_aa_kernel); diff --git a/aten/src/ATen/native/UpSampleLinear1d.cpp b/aten/src/ATen/native/UpSampleLinear1d.cpp index b100450c2b6a..aed082b68563 100644 --- a/aten/src/ATen/native/UpSampleLinear1d.cpp +++ b/aten/src/ATen/native/UpSampleLinear1d.cpp @@ -1,10 +1,22 @@ // Adapted from interp.cpp from Caffe util by Pauline Luc // Originally developed by George Papandreou +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS -#include -#include +#include +#include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace meta { @@ -87,17 +99,6 @@ Tensor upsample_linear1d( return at::upsample_linear1d(input, osize, align_corners, scale_w); } -Tensor upsample_linear1d_backward( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - bool align_corners, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_w = get_scale_value(scale_factors, 0); - return at::upsample_linear1d_backward(grad_output, osize, input_size, align_corners, scale_w); -} - DEFINE_DISPATCH(upsample_linear1d_kernel); DEFINE_DISPATCH(upsample_linear1d_backward_kernel); diff --git a/aten/src/ATen/native/UpSampleNearest1d.cpp 
b/aten/src/ATen/native/UpSampleNearest1d.cpp index 83121ed3be45..1bdbda8f66c4 100644 --- a/aten/src/ATen/native/UpSampleNearest1d.cpp +++ b/aten/src/ATen/native/UpSampleNearest1d.cpp @@ -1,7 +1,23 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace meta { @@ -125,26 +141,6 @@ Tensor _upsample_nearest_exact1d( return at::_upsample_nearest_exact1d(input, osize, scale_w); } -Tensor upsample_nearest1d_backward( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_w = get_scale_value(scale_factors, 0); - return at::upsample_nearest1d_backward(grad_output, osize, input_size, scale_w); -} - -Tensor _upsample_nearest_exact1d_backward( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_w = get_scale_value(scale_factors, 0); - return at::_upsample_nearest_exact1d_backward(grad_output, osize, input_size, scale_w); -} - DEFINE_DISPATCH(upsample_nearest1d_kernel); DEFINE_DISPATCH(_upsample_nearest_exact1d_kernel); DEFINE_DISPATCH(upsample_nearest1d_backward_kernel); diff --git a/aten/src/ATen/native/UpSampleNearest2d.cpp b/aten/src/ATen/native/UpSampleNearest2d.cpp index ee5dce4a02ef..65e20b78f868 100644 --- a/aten/src/ATen/native/UpSampleNearest2d.cpp +++ b/aten/src/ATen/native/UpSampleNearest2d.cpp @@ -1,9 +1,24 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace meta { @@ -152,28 +167,6 @@ Tensor _upsample_nearest_exact2d( return at::_upsample_nearest_exact2d(input, osize, scale_h, scale_w); } -Tensor upsample_nearest2d_backward( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_h = get_scale_value(scale_factors, 0); - auto scale_w = get_scale_value(scale_factors, 1); - return at::upsample_nearest2d_backward(grad_output, osize, input_size, scale_h, scale_w); -} - -Tensor _upsample_nearest_exact2d_backward( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_h = get_scale_value(scale_factors, 0); - auto scale_w = get_scale_value(scale_factors, 1); - return at::_upsample_nearest_exact2d_backward(grad_output, osize, input_size, scale_h, scale_w); -} - DEFINE_DISPATCH(upsample_nearest2d_kernel); DEFINE_DISPATCH(_upsample_nearest_exact2d_kernel); DEFINE_DISPATCH(upsample_nearest2d_backward_kernel); diff --git a/aten/src/ATen/native/UpSampleNearest3d.cpp b/aten/src/ATen/native/UpSampleNearest3d.cpp index 0e4040980ae2..27ca6745655c 100644 --- a/aten/src/ATen/native/UpSampleNearest3d.cpp +++ b/aten/src/ATen/native/UpSampleNearest3d.cpp @@ -1,8 +1,23 @@ -#include -#include +#define 
TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace meta { @@ -147,7 +162,7 @@ TORCH_IMPL_FUNC(_upsample_nearest_exact3d_backward_out_cpu) ( using at::native::upsample::compute_output_size; using at::native::upsample::get_scale_value; -Tensor upsample_nearest3d_cpu( +Tensor upsample_nearest3d( const Tensor& input, at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { @@ -158,7 +173,7 @@ Tensor upsample_nearest3d_cpu( return at::upsample_nearest3d(input, osize, scale_d, scale_h, scale_w); } -Tensor _upsample_nearest_exact3d_cpu( +Tensor _upsample_nearest_exact3d( const Tensor& input, at::OptionalIntArrayRef output_size, c10::optional> scale_factors) { @@ -169,31 +184,6 @@ Tensor _upsample_nearest_exact3d_cpu( return at::_upsample_nearest_exact3d(input, osize, scale_d, scale_h, scale_w); } -// when structured kernels can handle QuantizedCPU, update these overloads to be CompositeExplicitAutograd -Tensor upsample_nearest3d_backward_cpu( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_d = get_scale_value(scale_factors, 0); - auto scale_h = get_scale_value(scale_factors, 1); - auto scale_w = get_scale_value(scale_factors, 2); - return at::upsample_nearest3d_backward(grad_output, osize, input_size, scale_d, scale_h, scale_w); -} - -Tensor _upsample_nearest_exact3d_backward_cpu( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_d = get_scale_value(scale_factors, 0); - auto scale_h = get_scale_value(scale_factors, 1); - auto scale_w = get_scale_value(scale_factors, 2); - return at::_upsample_nearest_exact3d_backward(grad_output, osize, input_size, scale_d, scale_h, scale_w); -} - DEFINE_DISPATCH(upsample_nearest3d_kernel); DEFINE_DISPATCH(_upsample_nearest_exact3d_kernel); DEFINE_DISPATCH(upsample_nearest3d_backward_kernel); diff --git a/aten/src/ATen/native/UpSampleTrilinear3d.cpp b/aten/src/ATen/native/UpSampleTrilinear3d.cpp index 73fffbe5afe7..1bf9c8f6cb4e 100644 --- a/aten/src/ATen/native/UpSampleTrilinear3d.cpp +++ b/aten/src/ATen/native/UpSampleTrilinear3d.cpp @@ -1,11 +1,22 @@ // Adapted from interp.cpp from Caffe util by Pauline Luc // Originally developed by George Papandreou +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS -#include -#include +#include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace meta { @@ -100,19 +111,6 @@ Tensor upsample_trilinear3d( return at::upsample_trilinear3d(input, osize, align_corners, scale_d, scale_h, scale_w); } -Tensor upsample_trilinear3d_backward( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - bool align_corners, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_d = get_scale_value(scale_factors, 0); - auto scale_h = get_scale_value(scale_factors, 1); - auto scale_w = get_scale_value(scale_factors, 2); - return at::upsample_trilinear3d_backward(grad_output, osize, input_size, 
align_corners, scale_d, scale_h, scale_w); -} - DEFINE_DISPATCH(upsample_trilinear3d_kernel); DEFINE_DISPATCH(upsample_trilinear3d_backward_kernel); diff --git a/aten/src/ATen/native/VariableMethodStubs.cpp b/aten/src/ATen/native/VariableMethodStubs.cpp index ce5432e677af..6191717930ae 100644 --- a/aten/src/ATen/native/VariableMethodStubs.cpp +++ b/aten/src/ATen/native/VariableMethodStubs.cpp @@ -1,5 +1,23 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include #include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif // The stubs in here are used by dynamic dispatch. It just redirects everything // to the Tensor method we manually bind in TensorBody.h. diff --git a/aten/src/ATen/native/WeightNorm.cpp b/aten/src/ATen/native/WeightNorm.cpp index bf258d80a0fb..8291120f1960 100644 --- a/aten/src/ATen/native/WeightNorm.cpp +++ b/aten/src/ATen/native/WeightNorm.cpp @@ -1,11 +1,21 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include +#include #include -#include -#include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + #include namespace at { diff --git a/aten/src/ATen/native/ao_sparse/library.cpp b/aten/src/ATen/native/ao_sparse/library.cpp index 0c0042c6b143..1a284726e93f 100644 --- a/aten/src/ATen/native/ao_sparse/library.cpp +++ b/aten/src/ATen/native/ao_sparse/library.cpp @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include diff --git a/aten/src/ATen/native/ao_sparse/quantized/cpu/fbgemm_utils.cpp b/aten/src/ATen/native/ao_sparse/quantized/cpu/fbgemm_utils.cpp index 2f1d8a3e7be9..cdbfda3c71bb 100644 --- a/aten/src/ATen/native/ao_sparse/quantized/cpu/fbgemm_utils.cpp +++ b/aten/src/ATen/native/ao_sparse/quantized/cpu/fbgemm_utils.cpp @@ -1,4 +1,6 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include diff --git a/aten/src/ATen/native/ao_sparse/quantized/cpu/packed_params.h b/aten/src/ATen/native/ao_sparse/quantized/cpu/packed_params.h index 57ebba85a063..1ca66bf536a7 100644 --- a/aten/src/ATen/native/ao_sparse/quantized/cpu/packed_params.h +++ b/aten/src/ATen/native/ao_sparse/quantized/cpu/packed_params.h @@ -11,7 +11,7 @@ namespace sparse { using LinearPackedSerializationType = std::tuple, std::vector>; -#define SPARSE_LINEAR_PACKED_PARAM_SERIALIZATION_VERSION 1 +#define SPARSE_LINEAR_PACKED_PARAM_SERIALIZATION_VERSION 2 using BCSRSerializationType = std::tuple< @@ -22,8 +22,8 @@ using BCSRSerializationType = at::Tensor, // Weight Scales (single element vector if per-tensor) (float) at::Tensor, // Wrapper for Weight Zero Points (single element vector if per-tensor) (int8_t) bool, // Quantization Scheme (true: per tensor, false: per channel) - at::Tensor, // Wrapper for Row Block Indices (int32_t) - at::Tensor, // Wrapper for Column Block Indices (int32_t) + at::Tensor, // Wrapper for Row Block Indices (int8_t, int16_t, or int32_t) + at::Tensor, // Wrapper for Column Block Indices (int8_t, int16_t, or int32_t) at::Tensor, // Wrapper for Non-Zero Weight Values, each +128 (uint8_t) int64_t, // Number of Output Channels int64_t // Number of Input Channels diff --git a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear.cpp b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear.cpp index 12046dde22f9..de053b353758 100644 --- 
a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear.cpp +++ b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear.cpp @@ -1,4 +1,5 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include @@ -7,6 +8,13 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif + namespace ao { namespace sparse { diff --git a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_deserialize.cpp b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_deserialize.cpp index 24d24eee66ec..d367dbe01103 100644 --- a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_deserialize.cpp +++ b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_deserialize.cpp @@ -11,6 +11,7 @@ namespace ao { namespace sparse { namespace { +const int64_t serialization_version_index = 0; const int64_t bias_index = 1; const int64_t out_features_block_size_index = 2; const int64_t in_features_block_size_index = 3; @@ -127,16 +128,25 @@ c10::intrusive_ptr PackedLinearWeight::deserialize( return static_cast(static_cast(v) - 128); }); + const at::Tensor row_block_indices = + std::get(serialized); + const at::Tensor col_block_indices = + std::get(serialized); // Unpack as non backend specific untiled BCSR then pack as Fbgemm tiled BCSR // because untiled Fbgemm BCSR currently doesn't exist unpack_bcsr( reinterpret_cast(weight_origin.data_ptr()), - ao::sparse::BCSR( - std::move(weight_values), - unwrap_vector( - std::get(serialized)), // Row Indices - unwrap_vector( - std::get(serialized))), // Col Indices + AT_DISPATCH_INTEGRAL_TYPES( + row_block_indices.scalar_type(), + "packed_linear_weight_fbgemm_setup_bcsr", + [&] { + return ao::sparse::BCSR( + std::move(weight_values), + unwrap_vector( + std::get(serialized)), + unwrap_vector( + std::get(serialized))); + }), output_channels, input_channels, out_features_block_size, @@ -160,6 +170,28 @@ c10::intrusive_ptr PackedLinearWeightQnnp::deserialize( return c10::make_intrusive(serialized); } +template +struct UnsignedIndicesTypeTrait { + static_assert( + sizeof(INDICES_DTYPE) == 0, + "Invalid dtype for UnsignedIndicesTypeTrait"); +}; + +template <> +struct UnsignedIndicesTypeTrait { + using t = uint32_t; +}; + +template <> +struct UnsignedIndicesTypeTrait { + using t = uint16_t; +}; + +template <> +struct UnsignedIndicesTypeTrait { + using t = uint8_t; +}; + // NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init) PackedLinearWeightQnnp::PackedLinearWeightQnnp( const BCSRSerializationType& serialized) @@ -173,6 +205,17 @@ PackedLinearWeightQnnp::PackedLinearWeightQnnp( : c10::kPerChannelAffine), output_channels_(std::get(serialized)), input_channels_(std::get(serialized)) { + const int64_t serialization_version = + std::get(serialized); + TORCH_CHECK( + serialization_version <= SPARSE_LINEAR_PACKED_PARAM_SERIALIZATION_VERSION, + "Attempted to deserialize sparse qlinear packed params with an ", + "incompatible serialization version (", + serialization_version, + " > ", + SPARSE_LINEAR_PACKED_PARAM_SERIALIZATION_VERSION, + ")"); + if (orig_bias_.has_value()) { bias_ = orig_bias_.value(); @@ -242,15 +285,35 @@ PackedLinearWeightQnnp::PackedLinearWeightQnnp( std::get(serialized); deserialized_bcsr_weight_values_ = std::get(serialized); - bcsr_matrix_ = qnnpack::generateBlockCSRMatrix( - (uint32_t*)deserialized_bcsr_col_block_indices_.data_ptr(), - (uint32_t*)deserialized_bcsr_row_block_indices_.data_ptr(), - deserialized_bcsr_weight_values_.data_ptr(), - deserialized_bcsr_col_block_indices_.numel(), - 
deserialized_bcsr_row_block_indices_.numel(), - deserialized_bcsr_weight_values_.numel(), - out_features_block_size_, - in_features_block_size_); +#define AT_DISPATCH_CASE_BCSR_INDICES_TYPES(...) \ + AT_DISPATCH_CASE(at::ScalarType::Char, __VA_ARGS__) \ + AT_DISPATCH_CASE(at::ScalarType::Int, __VA_ARGS__) \ + AT_DISPATCH_CASE(at::ScalarType::Short, __VA_ARGS__) + +#define AT_DISPATCH_BCSR_INDICES_TYPES(TYPE, NAME, ...) \ + AT_DISPATCH_SWITCH( \ + TYPE, NAME, AT_DISPATCH_CASE_BCSR_INDICES_TYPES(__VA_ARGS__)) + + bcsr_matrix_ = AT_DISPATCH_BCSR_INDICES_TYPES( + deserialized_bcsr_row_block_indices_.scalar_type(), + "packed_linear_weight_qnnp_setup_bcsr", + [&] { + using unsigned_t = UnsignedIndicesTypeTrait::t; + return qnnpack::generateBlockCSRMatrix( + reinterpret_cast( + deserialized_bcsr_col_block_indices_.data_ptr()), + reinterpret_cast( + deserialized_bcsr_row_block_indices_.data_ptr()), + deserialized_bcsr_weight_values_.data_ptr(), + deserialized_bcsr_col_block_indices_.numel(), + deserialized_bcsr_row_block_indices_.numel(), + deserialized_bcsr_weight_values_.numel(), + out_features_block_size_, + in_features_block_size_); + }); + +#undef AT_DISPATCH_CASE_BCSR_INDICES_TYPES +#undef AT_DISPATCH_BCSR_INDICES_TYPES } #endif // USE_PYTORCH_QNNPACK diff --git a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_dynamic.cpp b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_dynamic.cpp index bd6f92c97c5e..64cab80790a9 100644 --- a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_dynamic.cpp +++ b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_dynamic.cpp @@ -1,4 +1,5 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include @@ -10,6 +11,13 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif + namespace ao { namespace sparse { @@ -37,7 +45,7 @@ at::Tensor PackedLinearWeightQnnp::apply_dynamic_impl( const auto cols_input = static_cast(input.size(input.dim() - 1)); TORCH_CHECK( cols_input == input_channels_, - "quantized_sparse_lienar: Input tensor's last and weight tensor's" + "quantized_sparse_linear: Input tensor's last and weight tensor's" " second dimension must match."); // On empty input, no output data will be generated, @@ -75,11 +83,12 @@ at::Tensor PackedLinearWeightQnnp::apply_dynamic_impl( output_channels_, q_input_contig.q_zero_point(), w_zero_points_.data(), - bcsr_matrix_->col_indices.data(), - bcsr_matrix_->row_values.data(), + bcsr_matrix_->col_indices_data_ptr(), + bcsr_matrix_->row_values_data_ptr(), bcsr_matrix_->values.data(), bcsr_matrix_->row_block_size, /* out_features_block_size */ bcsr_matrix_->col_block_size, /* in_features_block_size */ + bcsr_matrix_->indices_dtype, 0, /* output zero point: not used */ std::numeric_limits::min(), std::numeric_limits::max(), diff --git a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_prepack.cpp b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_prepack.cpp index 616ed9011e0c..83aaf810edd7 100644 --- a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_prepack.cpp +++ b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_prepack.cpp @@ -1,4 +1,6 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include @@ -7,6 +9,13 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif + #include namespace ao { @@ -190,7 +199,7 @@ PackedLinearWeightQnnp::PackedLinearWeightQnnp( for (const auto i : c10::irange(wt_numel)) { qnnp_w_data[i] = 
static_cast(w_data[i] + 128); } - bcsr_matrix_ = qnnpack::generateBlockCSRMatrix( + bcsr_matrix_ = qnnpack::generateBlockCSRMatrix( reinterpret_cast(qnnp_w_data), output_channels_, input_channels_, diff --git a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_serialize.cpp b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_serialize.cpp index cacb2815a2a3..7fd0cb25ff20 100644 --- a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_serialize.cpp +++ b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_serialize.cpp @@ -195,6 +195,37 @@ BCSRSerializationType PackedLinearWeightQnnp::serialize() { TORCH_CHECK(false, "Unsupported quantization scheme."); } + at::Tensor wrapped_row_values; + at::Tensor wrapped_col_indices; + + const uint32_t max_index = bcsr_matrix_->max_index(); + + if (max_index <= std::numeric_limits::max()) { + // Cast from uint8_t range to int8_t + wrapped_row_values = QNNPACK_BCSRMATRIX_DISPATCH_INDICES_DTYPE( + bcsr_matrix_, + { return wrap_vector(typed_bcsr->row_values, c10::kChar); }); + wrapped_col_indices = QNNPACK_BCSRMATRIX_DISPATCH_INDICES_DTYPE( + bcsr_matrix_, + { return wrap_vector(typed_bcsr->col_indices, c10::kChar); }); + } else if (max_index <= std::numeric_limits::max()) { + // Cast from uint16_t range to int16_t + wrapped_row_values = QNNPACK_BCSRMATRIX_DISPATCH_INDICES_DTYPE( + bcsr_matrix_, + { return wrap_vector(typed_bcsr->row_values, c10::kShort); }); + wrapped_col_indices = QNNPACK_BCSRMATRIX_DISPATCH_INDICES_DTYPE( + bcsr_matrix_, + { return wrap_vector(typed_bcsr->col_indices, c10::kShort); }); + } else { + // Cast from uint32_t range to int32_t + wrapped_row_values = QNNPACK_BCSRMATRIX_DISPATCH_INDICES_DTYPE( + bcsr_matrix_, + { return wrap_vector(typed_bcsr->row_values, c10::kInt); }); + wrapped_col_indices = QNNPACK_BCSRMATRIX_DISPATCH_INDICES_DTYPE( + bcsr_matrix_, + { return wrap_vector(typed_bcsr->col_indices, c10::kInt); }); + } + return BCSRSerializationType( SPARSE_LINEAR_PACKED_PARAM_SERIALIZATION_VERSION, orig_bias_, @@ -203,10 +234,8 @@ BCSRSerializationType PackedLinearWeightQnnp::serialize() { std::move(w_scales_compact), std::move(w_zero_points_compact), (q_scheme_ == c10::kPerTensorAffine), - wrap_vector( - bcsr_matrix_->row_values, c10::kInt), // Casting from uint32_t to int - wrap_vector( - bcsr_matrix_->col_indices, c10::kInt), // Casting from uint32_t to int + wrapped_row_values, + wrapped_col_indices, wrap_vector(bcsr_matrix_->values, c10::kByte), output_channels_, input_channels_); diff --git a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_unpack.cpp b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_unpack.cpp index c10cc40af4a2..14cf9521a4cd 100644 --- a/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_unpack.cpp +++ b/aten/src/ATen/native/ao_sparse/quantized/cpu/qlinear_unpack.cpp @@ -1,10 +1,20 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#include +#include +#endif + namespace ao { namespace sparse { int register_linear_params(); diff --git a/aten/src/ATen/native/cpu/Activation.cpp b/aten/src/ATen/native/cpu/Activation.cpp index 6f3eac783ccd..728ea62f1898 100644 --- a/aten/src/ATen/native/cpu/Activation.cpp +++ b/aten/src/ATen/native/cpu/Activation.cpp @@ -623,7 +623,25 @@ void shrink_backward_kernel(TensorIteratorBase& iter, const Scalar& lambd) { } void hardtanh_backward_kernel(TensorIterator& iter, const Scalar& min, const Scalar& max) { - 
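// Sketch (plain C++, hypothetical helper) of the index-width selection the
// BCSR serialization code above performs: pick the narrowest signed dtype
// that can hold the largest block index so serialized row/column indices
// stay compact (kChar, kShort, or kInt in the real code).
#include <cstdint>
#include <limits>

enum class BcsrIndexDType { Int8, Int16, Int32 };

BcsrIndexDType pick_index_dtype(uint32_t max_index) {
  if (max_index <= std::numeric_limits<uint8_t>::max()) {
    return BcsrIndexDType::Int8;   // serialized as kChar
  }
  if (max_index <= std::numeric_limits<uint16_t>::max()) {
    return BcsrIndexDType::Int16;  // serialized as kShort
  }
  return BcsrIndexDType::Int32;    // serialized as kInt
}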
AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "hardshrink_backward_cpu", [&] { + if (iter.dtype() == kBFloat16) { + auto min_val = min.to(); + auto max_val = max.to(); + cpu_kernel_vec( + iter, + [=](BFloat16 grad_val, BFloat16 self_val) -> BFloat16 { + return (float(self_val) <= min_val || float(self_val) >= max_val) ? BFloat16(0) : grad_val; + }, + [=](Vectorized grad_val, Vectorized self_val) -> Vectorized { + Vectorized grad_val0, grad_val1, self_val0, self_val1; + std::tie(grad_val0, grad_val1) = convert_bfloat16_float(grad_val); + std::tie(self_val0, self_val1) = convert_bfloat16_float(self_val); + return convert_float_bfloat16( + ((self_val0 > min_val) & (self_val0 < max_val)) & grad_val0, + ((self_val1 > min_val) & (self_val1 < max_val)) & grad_val1 + ); + }); + } else { + AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "hardshrink_backward_cpu", [&] { auto min_val = min.to(); auto max_val = max.to(); cpu_kernel_vec( @@ -635,6 +653,7 @@ void hardtanh_backward_kernel(TensorIterator& iter, const Scalar& min, const Sca return ((self_val > min_val) & (self_val < max_val)) & grad_val; }); }); + } } void hardswish_kernel(TensorIterator& iter) { @@ -1035,8 +1054,23 @@ void glu_backward_kernel(TensorIterator& iter) { } void silu_kernel(TensorIteratorBase& iter) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1( - kBFloat16, iter.dtype(), "silu_cpu", [&]() { + if (iter.dtype() == kBFloat16) { + const Vectorized kOneVec(1.0f); + cpu_kernel_vec( + iter, + [](BFloat16 x) -> BFloat16 { + return float(x) / (1.0f + std::exp(-float(x))); + }, + [kOneVec](Vectorized x_vec) -> Vectorized { + Vectorized x_vec0, x_vec1; + std::tie(x_vec0, x_vec1) = convert_bfloat16_float(x_vec); + return convert_float_bfloat16( + x_vec0 / (kOneVec + x_vec0.neg().exp()), + x_vec1 / (kOneVec + x_vec1.neg().exp())); + }); + } else { + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES( + iter.dtype(), "silu_cpu", [&]() { const Vectorized kOneVec(scalar_t(1)); cpu_kernel_vec( iter, @@ -1047,11 +1081,34 @@ void silu_kernel(TensorIteratorBase& iter) { return x_vec / (kOneVec + x_vec.neg().exp()); }); }); + } } void silu_backward_kernel(TensorIteratorBase& iter) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1( - kBFloat16, iter.dtype(), "silu_backward_cpu", [&]() { + if (iter.dtype() == kBFloat16) { + const Vectorized kOneVec(1.0f); + cpu_kernel_vec( + iter, + [](BFloat16 dy, BFloat16 x) -> BFloat16 { + const float sigmoid = + 1.0f / (1.0f + std::exp(-float(x))); + return dy * sigmoid * (1.0f + x * (1.0f - sigmoid)); + }, + [kOneVec](Vectorized dy_vec, Vectorized x_vec) -> Vectorized { + Vectorized x_vec0, x_vec1, dy_vec0, dy_vec1; + std::tie(x_vec0, x_vec1) = convert_bfloat16_float(x_vec); + std::tie(dy_vec0, dy_vec1) = convert_bfloat16_float(dy_vec); + const Vectorized sigmoid0 = + kOneVec / (kOneVec + x_vec0.neg().exp()); + const Vectorized sigmoid1 = + kOneVec / (kOneVec + x_vec1.neg().exp()); + return convert_float_bfloat16( + dy_vec0 * sigmoid0 * (kOneVec + x_vec0 * (kOneVec - sigmoid0)), + dy_vec1 * sigmoid1 * (kOneVec + x_vec1 * (kOneVec - sigmoid1))); + }); + } else { + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES( + iter.dtype(), "silu_backward_cpu", [&]() { const Vectorized kOneVec(scalar_t(1)); cpu_kernel_vec( iter, @@ -1066,10 +1123,26 @@ void silu_backward_kernel(TensorIteratorBase& iter) { return dy_vec * sigmoid * (kOneVec + x_vec * (kOneVec - sigmoid)); }); }); + } } void mish_kernel(TensorIteratorBase& iter) { - AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "mish_cpu", [&]() { + if (iter.dtype() == kBFloat16) { + cpu_kernel_vec( + iter, + 
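// The bfloat16 branches added in this hunk (hardtanh_backward, silu,
// silu_backward, mish) share one pattern: widen to float, do the arithmetic
// in float, and narrow back to bfloat16 only when storing, since bf16 lacks
// the precision for the intermediates. Scalar-only sketch; the real kernels
// also supply a Vectorized<float> path via convert_bfloat16_float /
// convert_float_bfloat16.
#include <c10/util/BFloat16.h>
#include <cmath>

c10::BFloat16 silu_bf16_sketch(c10::BFloat16 x) {
  const float xf = static_cast<float>(x);        // widen
  const float yf = xf / (1.0f + std::exp(-xf));  // compute in float
  return c10::BFloat16(yf);                      // narrow on store
}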
[](BFloat16 x) -> BFloat16{ + return static_cast(float(x) * std::tanh(std::log1p(std::exp(float(x))))); + }, + [](Vectorized x_vec) -> Vectorized { + Vectorized x_vec0, x_vec1; + std::tie(x_vec0, x_vec1) = convert_bfloat16_float(x_vec); + return convert_float_bfloat16( + x_vec0 * x_vec0.exp().log1p().tanh(), + x_vec1 * x_vec1.exp().log1p().tanh() + ); + }); + } else { + AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "mish_cpu", [&]() { using Vec = Vectorized; cpu_kernel_vec( iter, @@ -1080,10 +1153,36 @@ void mish_kernel(TensorIteratorBase& iter) { return x_vec * x_vec.exp().log1p().tanh(); }); }); + } } void mish_backward_kernel(TensorIterator& iter) { - AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "mish_backward_cpu", [&]() { + if (iter.dtype() == kBFloat16) { + using Vec = Vectorized; + const Vec kOneVec(1.0f); + cpu_kernel_vec( + iter, + [](BFloat16 dy, BFloat16 x) -> BFloat16 { + const float sigmoid = + 1.0f / (1.0f + std::exp(-float(x))); + const float tanh_softplus = std::tanh(std::log1p(std::exp(float(x)))); + return dy * (tanh_softplus + x * sigmoid * (1.0f - tanh_softplus * tanh_softplus)); + }, + [kOneVec](Vectorized dy_vec, Vectorized x_vec) -> Vectorized { + Vectorized x_vec0, x_vec1, dy_vec0, dy_vec1; + std::tie(x_vec0, x_vec1) = convert_bfloat16_float(x_vec); + std::tie(dy_vec0, dy_vec1) = convert_bfloat16_float(dy_vec); + const Vec sigmoid0 = kOneVec / (kOneVec + x_vec0.neg().exp()); + const Vec sigmoid1 = kOneVec / (kOneVec + x_vec1.neg().exp()); + const Vec tanh_softplus0 = x_vec0.exp().log1p().tanh(); + const Vec tanh_softplus1 = x_vec1.exp().log1p().tanh(); + return convert_float_bfloat16( + dy_vec0 * (tanh_softplus0 + x_vec0 * sigmoid0 * (kOneVec - tanh_softplus0 * tanh_softplus0)), + dy_vec1 * (tanh_softplus1 + x_vec1 * sigmoid1 * (kOneVec - tanh_softplus1 * tanh_softplus1)) + ); + }); + } else { + AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "mish_backward_cpu", [&]() { using Vec = Vectorized; const Vec kOneVec(scalar_t(1)); cpu_kernel_vec( @@ -1100,6 +1199,7 @@ void mish_backward_kernel(TensorIterator& iter) { return dy_vec * (tanh_softplus + x_vec * sigmoid * (kOneVec - tanh_softplus * tanh_softplus)); }); }); + } } void prelu_cpu_kernel(TensorIterator& iter) { diff --git a/aten/src/ATen/native/cpu/AtomicAddFloat.h b/aten/src/ATen/native/cpu/AtomicAddFloat.h index db96e1760de5..5b24ee4821c4 100644 --- a/aten/src/ATen/native/cpu/AtomicAddFloat.h +++ b/aten/src/ATen/native/cpu/AtomicAddFloat.h @@ -1,7 +1,7 @@ #ifndef ATOMIC_ADD_FLOAT #define ATOMIC_ADD_FLOAT -#if (defined(__x86_64__) || defined(__i386__)) +#if (defined(__x86_64__) || defined(__i386__) || defined(__aarch64__)) #include #else #define _mm_pause() @@ -24,7 +24,11 @@ static inline void cpu_atomic_add_float(float* dst, float fvalue) unsigned* old_intV = (unsigned*)(&old_value.intV); while (!std::atomic_compare_exchange_strong(dst_intV, old_intV, new_value.intV)) { +#ifdef __aarch64__ + __asm__ __volatile__("yield;" : : : "memory"); +#else _mm_pause(); +#endif old_value.floatV = *dst; new_value.floatV = old_value.floatV + fvalue; } diff --git a/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp b/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp index 2c9ac5ac15b6..9b5f442ef02c 100644 --- a/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp +++ b/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp @@ -68,8 +68,8 @@ void mul_kernel(TensorIteratorBase& iter) { } else { AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(kBFloat16, kHalf, iter.dtype(), "mul_cpu", [&]() { cpu_kernel_vec(iter, - [=](scalar_t a, scalar_t b) -> scalar_t { return a * b; }, - 
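// Sketch of the compare-and-swap retry loop that cpu_atomic_add_float above
// relies on, simplified to operate on a std::atomic<unsigned> holding the
// float's bit pattern rather than casting a raw float*. The aarch64 change
// above only swaps the _mm_pause() spin hint for a `yield` instruction.
#include <atomic>
#include <cstring>

void atomic_add_float_bits(std::atomic<unsigned>& dst_bits, float fvalue) {
  unsigned old_bits = dst_bits.load(std::memory_order_relaxed);
  for (;;) {
    float old_val;
    std::memcpy(&old_val, &old_bits, sizeof(old_val));
    const float new_val = old_val + fvalue;
    unsigned new_bits;
    std::memcpy(&new_bits, &new_val, sizeof(new_bits));
    // On failure, old_bits is refreshed with the current stored value and we retry.
    if (dst_bits.compare_exchange_weak(old_bits, new_bits)) {
      return;
    }
  }
}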
[=](Vectorized a, Vectorized b) { + [=](scalar_t a, scalar_t b) __ubsan_ignore_undefined__ -> scalar_t { return a * b; }, + [=](Vectorized a, Vectorized b) __ubsan_ignore_undefined__ { return a * b; }); }); @@ -314,10 +314,13 @@ void bitwise_xor_kernel(TensorIteratorBase& iter) { void lshift_kernel(TensorIteratorBase& iter) { AT_DISPATCH_INTEGRAL_TYPES(iter.dtype(), "lshift_cpu", [&]() { - cpu_kernel(iter, - [](scalar_t a, scalar_t b) -> scalar_t { - return static_cast>(a) << b; - }); + cpu_kernel_vec(iter, + [](scalar_t a, scalar_t b) -> scalar_t { + return static_cast>(a) << b; + }, + [](Vectorized a, Vectorized b) { + return a << b; + }); }); } @@ -380,10 +383,13 @@ void logical_xor_kernel(TensorIterator& iter) { void rshift_kernel(TensorIteratorBase& iter) { AT_DISPATCH_INTEGRAL_TYPES(iter.dtype(), "rshift_cpu", [&]() { - cpu_kernel(iter, - [](scalar_t a, scalar_t b) -> scalar_t { - return a >> b; - }); + cpu_kernel_vec(iter, + [](scalar_t a, scalar_t b) -> scalar_t { + return a >> b; + }, + [](Vectorized a, Vectorized b) { + return a >> b; + }); }); } diff --git a/aten/src/ATen/native/cpu/BlasKernel.cpp b/aten/src/ATen/native/cpu/BlasKernel.cpp index cf12c392f868..7a27b152edf7 100644 --- a/aten/src/ATen/native/cpu/BlasKernel.cpp +++ b/aten/src/ATen/native/cpu/BlasKernel.cpp @@ -2,6 +2,7 @@ #include #include #include +#include namespace at { namespace native { @@ -30,6 +31,29 @@ void scale_(int64_t m, int64_t n, opmath_t alpha, scalar_t *a, int64_t lda) { } } +template +auto sum(int64_t N, Func f) { + constexpr int ilp_factor = 4; + using acc_t = decltype(f(0)); + + // Calculate independent partial sums then add together at the end + std::array partial_sums{}; + + int64_t i = 0; + for (; i + ilp_factor <= N; i += ilp_factor) { + c10::ForcedUnroll{}([&](int k) { + partial_sums[k] += f(i + k); + }); + } + for (; i < N; ++i) { + partial_sums[0] += f(i); + } + for (int k = 1; k < ilp_factor; ++k) { + partial_sums[0] += partial_sums[k]; + } + return partial_sums[0]; +} + template void gemm_notrans_( @@ -73,15 +97,15 @@ void gemm_transa_( for (const auto i : c10::irange(m)) { const scalar_t *b_ = b; for (const auto j : c10::irange(n)) { - opmath_t sum = 0; - for (const auto l : c10::irange(k)) { - sum += static_cast(a_[l]) * static_cast(b_[l]); - } + const auto dot = sum(k, [&](int64_t l) -> opmath_t { + return static_cast(a_[l]) * static_cast(b_[l]); + }); b_ += ldb; - if (beta == scalar_t(0)) - c[j*ldc+i] = alpha*sum; - else - c[j*ldc+i] = beta*c[j*ldc+i]+alpha*sum; + if (beta == opmath_t(0)) { + c[j*ldc+i] = alpha*dot; + } else { + c[j*ldc+i] = beta*c[j*ldc+i]+alpha*dot; + } } a_ += lda; } @@ -124,26 +148,19 @@ void gemm_transab_( const scalar_t *b, int64_t ldb, opmath_t beta, scalar_t *c, int64_t ldc) { - // c *= beta - scale_(m, n, beta, c, ldc); - - // c += alpha * (a.T @ b.T) + // c = beta * c + alpha * (a.T @ b.T) for (const auto i : c10::irange(m)) { for (const auto j : c10::irange(n)) { - int64_t l_k = k / 4; - for (const auto l_l : c10::irange(l_k)) { - c[j * ldc + i] += a[i * lda + l_l * 4 + 0] // - * (b[(l_l * 4 + 0) * ldb + j] * alpha); - c[j * ldc + i] += a[i * lda + l_l * 4 + 1] // - * (b[(l_l * 4 + 1) * ldb + j] * alpha); - c[j * ldc + i] += a[i * lda + l_l * 4 + 2] // - * (b[(l_l * 4 + 2) * ldb + j] * alpha); - c[j * ldc + i] += a[i * lda + l_l * 4 + 3] // - * (b[(l_l * 4 + 3) * ldb + j] * alpha); + const auto dot = sum(k, [&](int64_t l) -> opmath_t { + return static_cast(a[i * lda + l]) * + static_cast(b[l * ldb + j]); + }); + + if (beta == opmath_t(0)) { + c[j * ldc + i] 
= alpha * dot; + } else { + c[j * ldc + i] = beta * c[j * ldc + i] + alpha * dot; } - int64_t l = l_k * 4; - for (; l < k; l++) - c[j * ldc + i] += a[i * lda + l] * (b[l * ldb + j] * alpha); } } } diff --git a/aten/src/ATen/native/cpu/ChannelShuffleKernel.cpp b/aten/src/ATen/native/cpu/ChannelShuffleKernel.cpp index 769c9028e7b0..57bd4f3badc0 100644 --- a/aten/src/ATen/native/cpu/ChannelShuffleKernel.cpp +++ b/aten/src/ATen/native/cpu/ChannelShuffleKernel.cpp @@ -1,8 +1,10 @@ -#include +#define TORCH_ASSERT_NO_OPERATORS +#include + +#include #include #include #include -#include #include #include @@ -12,8 +14,8 @@ namespace { template void cpu_channel_shuffle( - Tensor& output, - const Tensor& input, + TensorBase& output, + const TensorBase& input, int64_t groups) { auto input_data = input.data_ptr(); auto output_data = output.data_ptr(); @@ -57,8 +59,8 @@ void cpu_channel_shuffle( template void cpu_channel_shuffle_cl( - Tensor& output, - const Tensor& input, + TensorBase& output, + const TensorBase& input, int64_t groups) { auto input_data = input.data_ptr(); auto output_data = output.data_ptr(); @@ -83,8 +85,8 @@ void cpu_channel_shuffle_cl( } void channel_shuffle_kernel_impl( - Tensor& output, - const Tensor& input, + TensorBase& output, + const TensorBase& input, int64_t groups) { switch (input.suggest_memory_format()) { case at::MemoryFormat::Contiguous: { diff --git a/aten/src/ATen/native/cpu/ChannelShuffleKernel.h b/aten/src/ATen/native/cpu/ChannelShuffleKernel.h index 939a6c4b172d..10e592cf59eb 100644 --- a/aten/src/ATen/native/cpu/ChannelShuffleKernel.h +++ b/aten/src/ATen/native/cpu/ChannelShuffleKernel.h @@ -1,12 +1,14 @@ -#include -#include +#pragma once #include +#include -#pragma once +namespace at { +class TensorBase; +} namespace at { namespace native { -using channel_shuffle_fn = void(*)(Tensor&, const Tensor&, int64_t); +using channel_shuffle_fn = void(*)(TensorBase&, const TensorBase&, int64_t); DECLARE_DISPATCH(channel_shuffle_fn, channel_shuffle_kernel); }} // at::native diff --git a/aten/src/ATen/native/cpu/CopyKernel.cpp b/aten/src/ATen/native/cpu/CopyKernel.cpp index de1841d989c3..c6411efd77cd 100644 --- a/aten/src/ATen/native/cpu/CopyKernel.cpp +++ b/aten/src/ATen/native/cpu/CopyKernel.cpp @@ -13,9 +13,6 @@ namespace native { inline namespace CPU_CAPABILITY { void neg_kernel(TensorIteratorBase &iter); void conj_kernel(TensorIteratorBase &iter); -} // namespace CPU_CAPABILITY - -namespace { void float_bfloat16_copy_kernel(TensorIteratorBase &iter, bool requires_neg) { auto strides_out = iter.strides(0); @@ -52,8 +49,7 @@ void float_bfloat16_copy_kernel(TensorIteratorBase &iter, bool requires_neg) { std::copy_n(base, 2, data.data()); const int64_t *outer_strides = &strides[2]; - for (const auto it : c10::irange(size1)) { - (void)it; + for (const auto it C10_UNUSED : c10::irange(size1)) { Vecd dst_s; if (strides_in[0] == 0) { dst_s = Vecd(dest_t(*((scalar_t*)data[1]))); @@ -122,8 +118,7 @@ void float_bfloat16_copy_kernel(TensorIteratorBase &iter, bool requires_neg) { std::copy_n(base, 2, data.data()); const int64_t *outer_strides = &strides[2]; - for (const auto it : c10::irange(size1)) { - (void)it; + for (const auto it C10_UNUSED : c10::irange(size1)) { Vecd dst_s; if (strides_in[0] == 0) { dst_s = Vecd(dest_t(*((scalar_t*)data[1]))); @@ -246,22 +241,20 @@ void copy_kernel(TensorIterator& iter, bool /*non_blocking*/) { AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4(ScalarType::ComplexHalf, ScalarType::Half, ScalarType::Bool, ScalarType::BFloat16, dtype, "copy_", [&] { using 
dest_t = scalar_t; AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4(ScalarType::ComplexHalf, ScalarType::Half, ScalarType::Bool, ScalarType::BFloat16, iter.dtype(1), "copy_", [&] { - // Note (@zasdfgbnm): - // - // The code below can not be simplified as - // cpu_kernel(iter, c10::static_cast_with_inter_type::apply); - // - // because this would force the compiler to instantiate the inline function and generate a function call in the loop - // instead of inlining it, making all the optimizations like vectorization impossible. - // You can verify this by looking the the symbols of `libtorch_cpu.so`: - // - // readelf -Ws libtorch_cpu.so | grep static_cast_with_inter_type - // - // If done correctly, the above command should have no output. - // - // See: https://github.com/pytorch/pytorch/issues/31271 - cpu_kernel(iter, [](scalar_t src) -> dest_t { - return c10::static_cast_with_inter_type::apply(src); }); + if (iter.has_contiguous_first_dim()) { + TORCH_INTERNAL_ASSERT(iter.ninputs() == 1); + TORCH_INTERNAL_ASSERT(iter.noutputs() == 1); + + iter.for_each([](char **data, const int64_t *strides, int64_t size) { + auto src = reinterpret_cast(data[1]); + auto dst = reinterpret_cast(data[0]); + at::vec::convert(src, dst, size); + }); + } else { + cpu_kernel(iter, [](scalar_t x) -> dest_t { + return c10::convert(x); + }); + } }); }); @@ -274,7 +267,7 @@ void copy_kernel(TensorIterator& iter, bool /*non_blocking*/) { } } -} // anonymous namespace +} // namespace CPU_CAPABILITY REGISTER_DISPATCH(copy_stub, ©_kernel); diff --git a/aten/src/ATen/native/cpu/CopyKernel.h b/aten/src/ATen/native/cpu/CopyKernel.h new file mode 100644 index 000000000000..9d2affd6101a --- /dev/null +++ b/aten/src/ATen/native/cpu/CopyKernel.h @@ -0,0 +1,12 @@ +#pragma once + +namespace at { +struct TensorIteratorBase; + +namespace native { +inline namespace CPU_CAPABILITY { + +void direct_copy_kernel(TensorIteratorBase &iter); +void copy_kernel(TensorIterator& iter, bool /*non_blocking*/); + +}}} // namespace at::native::CPU_CAPABILITY diff --git a/aten/src/ATen/native/cpu/DepthwiseConvKernel.h b/aten/src/ATen/native/cpu/DepthwiseConvKernel.h index 56956b443386..80970074b8e6 100644 --- a/aten/src/ATen/native/cpu/DepthwiseConvKernel.h +++ b/aten/src/ATen/native/cpu/DepthwiseConvKernel.h @@ -1,6 +1,7 @@ #pragma once #include +#include /* Depthwise 3x3 Winograd convolution operator @@ -12,7 +13,7 @@ class Tensor; namespace native { using convolution_depthwise3x3_winograd_fn = - Tensor (*)(const Tensor &, const Tensor &, const Tensor &,IntArrayRef, IntArrayRef, int64_t); + Tensor (*)(const Tensor &, const Tensor &, const Tensor &, IntArrayRef, IntArrayRef, int64_t); DECLARE_DISPATCH(convolution_depthwise3x3_winograd_fn, convolution_depthwise3x3_winograd_stub); diff --git a/aten/src/ATen/native/cpu/DistanceOpsKernel.cpp b/aten/src/ATen/native/cpu/DistanceOpsKernel.cpp index 98404005c551..9f88a23c8e36 100644 --- a/aten/src/ATen/native/cpu/DistanceOpsKernel.cpp +++ b/aten/src/ATen/native/cpu/DistanceOpsKernel.cpp @@ -394,8 +394,7 @@ struct Dist { const scalar_t * t1_end = t1 + l1_size; const scalar_t * t2_end = t2 + l2_size; - for (const auto l : c10::irange(d)) { - (void)l; //Suppress unused variable warning + for (const auto l C10_UNUSED : c10::irange(d)) { for (; t1 != t1_end; t1 += m, res += m) { const Vec vec_t1 = Vec::loadu(t1, count); Vec res_vec = Vec::loadu(res, count); diff --git a/aten/src/ATen/native/cpu/FunctionOfAMatrixUtilsKernel.cpp b/aten/src/ATen/native/cpu/FunctionOfAMatrixUtilsKernel.cpp index 0f4d4b607717..de3be1587e56 
100644 --- a/aten/src/ATen/native/cpu/FunctionOfAMatrixUtilsKernel.cpp +++ b/aten/src/ATen/native/cpu/FunctionOfAMatrixUtilsKernel.cpp @@ -30,8 +30,7 @@ void _compute_linear_combination_cpu_kernel( auto* RESTRICT in_ptr = data[1]; auto* RESTRICT coeff_ptr = data[2]; - for (const auto elem : c10::irange(n)) { - (void)elem; //Suppress unused variable warning + for (const auto elem C10_UNUSED : c10::irange(n)) { auto* RESTRICT out_data = reinterpret_cast(out_ptr); auto* RESTRICT in_data = reinterpret_cast(in_ptr); using primitive_t = typename scalar_value_type::type; diff --git a/aten/src/ATen/native/cpu/HistogramKernel.cpp b/aten/src/ATen/native/cpu/HistogramKernel.cpp index 6d6b4a749fb2..83011aa2e9a7 100644 --- a/aten/src/ATen/native/cpu/HistogramKernel.cpp +++ b/aten/src/ATen/native/cpu/HistogramKernel.cpp @@ -148,8 +148,8 @@ void histogramdd_cpu_contiguous(Tensor& hist, const TensorList& bin_edges, for (const auto dim : c10::irange(D)) { const input_t elt = accessor_in[i][dim]; - // Skips elements which fall outside the specified bins - if (elt < leftmost_edge[dim] || rightmost_edge[dim] < elt) { + // Skips elements which fall outside the specified bins and NaN elements + if (!(elt >= leftmost_edge[dim] && elt <= rightmost_edge[dim])) { skip_elt = true; break; } @@ -166,8 +166,8 @@ void histogramdd_cpu_contiguous(Tensor& hist, const TensorList& bin_edges, * the appropriate bin via simple division. */ pos = static_cast((elt - leftmost_edge[dim]) - / (rightmost_edge[dim] - leftmost_edge[dim]) - * (num_bin_edges[dim] - 1)); + * (num_bin_edges[dim] - 1) + / (rightmost_edge[dim] - leftmost_edge[dim])); /* Ensures consistency with bin_edges by checking the bins to the left and right * of the selected position. Necessary for cases in which an element very close diff --git a/aten/src/ATen/native/cpu/IndexKernel.cpp b/aten/src/ATen/native/cpu/IndexKernel.cpp index be8e1a0a7315..81e135d1e749 100644 --- a/aten/src/ATen/native/cpu/IndexKernel.cpp +++ b/aten/src/ATen/native/cpu/IndexKernel.cpp @@ -74,8 +74,7 @@ void cpu_take_put_kernel( auto loop = [&](char** data, const int64_t* strides, int64_t n) { auto* iterated_data_bytes = data[0]; auto* index_data_bytes = data[1]; - for (const auto elem : c10::irange(n)) { - (void)elem; //Suppress unused variable warning + for (const auto elem C10_UNUSED : c10::irange(n)) { auto idx = *reinterpret_cast(index_data_bytes); auto& iterated = *reinterpret_cast(iterated_data_bytes); @@ -192,8 +191,7 @@ void index_fill_kernel( auto handle_nonzero_idx_stride = [&](char** data, const int64_t* strides, int64_t n) { auto* self_data_bytes = data[0]; auto* index_data_bytes = data[1]; - for (const auto elem : c10::irange(n)) { - (void)elem; //Suppress unused variable warning + for (const auto elem C10_UNUSED : c10::irange(n)) { auto* self_data = reinterpret_cast(self_data_bytes); auto idx = *reinterpret_cast(index_data_bytes); TORCH_CHECK_INDEX(idx >= -self_dim_size && idx < self_dim_size, @@ -219,8 +217,7 @@ void index_fill_kernel( if (idx < 0) { idx += self_dim_size; } - for (const auto elem : c10::irange(n)) { - (void)elem; //Suppress unused variable warning + for (const auto elem C10_UNUSED: c10::irange(n)) { auto* self_data = reinterpret_cast(self_data_bytes); self_data[idx * self_dim_stride] = fill_val; @@ -253,8 +250,7 @@ void index_copy_kernel( auto* self_data_bytes = data[0]; auto* index_data_bytes = data[1]; auto* source_data_bytes = data[2]; - for (const auto elem : c10::irange(n)) { - (void)elem; //Suppress unused variable warning + for (const auto elem 
C10_UNUSED : c10::irange(n)) { auto* self_data = reinterpret_cast(self_data_bytes); auto idx = *reinterpret_cast(index_data_bytes); auto* source_data = reinterpret_cast(source_data_bytes); @@ -277,8 +273,7 @@ void index_copy_kernel( TORCH_CHECK_INDEX(idx >= 0 && idx < self_dim_size, "index_copy_(): index ", idx, " is out of bounds for dimension ", dim, " with size ", self_dim_size); - for (const auto elem : c10::irange(n)) { - (void)elem; //Suppress unused variable warning + for (const auto elem C10_UNUSED : c10::irange(n)) { auto* self_data = reinterpret_cast(self_data_bytes); auto* source_data = reinterpret_cast(source_data_bytes); @@ -462,6 +457,75 @@ void masked_select_kernel(TensorIterator& iter, int64_t result_stride) { }); } + +template +void cpu_hflip_vec(at::TensorIterator& iter) { + + auto loop2d = [&](char** base, const int64_t *strides, int64_t size0, int64_t size1) { + + static constexpr int ntensors = 3; + std::array data_arr; + std::copy_n(base, ntensors, data_arr.data()); + const int64_t *outer_strides = &strides[ntensors]; + + using Vec = Vectorized; + + constexpr auto stride = sizeof(scalar_t); + TORCH_INTERNAL_ASSERT(stride == -strides[0] && stride == strides[1]); + + for (const auto j C10_UNUSED : c10::irange(size1)) { + + // vectorized loop with negative stride for output + char** C10_RESTRICT data_ = data_arr.data(); + int64_t n = size0; + + char* C10_RESTRICT data[ntensors]; + for (const auto arg : c10::irange(ntensors)) { + data[arg] = data_[arg]; + } + + int64_t i = 0; + + // data[0] unaligned pre-pass + int64_t offset = (j * n + (n - i - Vec::size())) % 32; + offset = (offset >= n) ? n : offset; + for (; i < offset; i++) { + scalar_t* out_ptr = (scalar_t*)(data[0] - i * stride); + *out_ptr = *(scalar_t *)(data[1] + i * stride); + } + // Empirically found that it is faster to process 3 data items together vs 2 or 4 + for (; i <= n - 3 * Vec::size(); i += 3 * Vec::size()) { + auto out1 = Vec::loadu(data[1] + i * stride); + auto out2 = Vec::loadu(data[1] + (i + Vec::size()) * stride); + auto out3 = Vec::loadu(data[1] + (i + 2 * Vec::size()) * stride); + // flip the vector: 1234 -> 4321 + out1 = flip(out1); + out2 = flip(out2); + out3 = flip(out3); + out1.store(data[0] - (i + Vec::size() - 1) * stride); + out2.store(data[0] - (i + 2 * Vec::size() - 1) * stride); + out3.store(data[0] - (i + 3 * Vec::size() - 1) * stride); + } + if (i < n) { + for (; i < n; i++) { + scalar_t* out_ptr = (scalar_t*)(data[0] - i * stride); + *out_ptr = *(scalar_t *)(data[1] + i * stride); + } + } + + // advance: + for (const auto arg : c10::irange(data_arr.size())) { + data_arr[arg] += outer_strides[arg]; + } + } + }; + + int64_t grain_size = at::internal::GRAIN_SIZE; + iter.for_each(loop2d, grain_size); + iter.cast_outputs(); +} + + void flip_kernel(TensorIterator& iter, const bool quantized) { if (quantized) { AT_DISPATCH_QINT_AND_SUB_BYTE_TYPES(iter.dtype(), "flip_quantized_cpu", @@ -471,6 +535,29 @@ void flip_kernel(TensorIterator& iter, const bool quantized) { }); }); } else { + // Special case: horizontal flip with vectorization and input is contiguous + // Context: horizontal flip leads to strides[0] < 0 and + // thus is_contiguous condition is not satisfied and non-vectorized code path is taken. 
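For reference, the scalar operation that the cpu_hflip_vec fast path above vectorizes can be sketched in isolation as follows (a standalone illustration with hypothetical names, not code from this patch): the output row is written back-to-front through a negative stride while the input row is read contiguously, which is exactly what the non-vectorized pre/post loops in the hunk do element by element.

    #include <cstdint>

    // Hypothetical reference helper: flip one row horizontally.
    // out_last points at the last element of the output row, so writing
    // out_last[-i] walks backwards (negative stride) while the input is
    // read forwards from in_first.
    template <typename scalar_t>
    void hflip_row_reference(scalar_t* out_last, const scalar_t* in_first, int64_t n) {
      for (int64_t i = 0; i < n; ++i) {
        out_last[-i] = in_first[i];
      }
    }
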
+ auto output_strides = iter.strides(0); + auto input_strides = iter.strides(1); + if (iter.ndim() > 0 && output_strides[0] < 0 && input_strides[0] == iter.element_size(1)) { + auto iter_dtype = iter.dtype(); + if (iter_dtype == kByte) { + return cpu_hflip_vec(iter); + } else if (iter_dtype == kFloat) { + return cpu_hflip_vec(iter); + } else if (iter_dtype == kInt) { + return cpu_hflip_vec(iter); + } else if (iter_dtype == kShort) { + return cpu_hflip_vec(iter); + } else if (iter_dtype == kLong) { + return cpu_hflip_vec(iter); + } else if (iter_dtype == kDouble) { + return cpu_hflip_vec(iter); + } + // other dtypes are handled below with cpu_kernel_vec + } + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(kBool, kHalf, kBFloat16, iter.dtype(), "flip_cpu", [&iter] { cpu_kernel_vec(iter, [](scalar_t a, scalar_t /*dummy input*/) -> scalar_t { diff --git a/aten/src/ATen/native/cpu/LerpKernel.cpp b/aten/src/ATen/native/cpu/LerpKernel.cpp index 28b2cde664ab..afff85370acd 100644 --- a/aten/src/ATen/native/cpu/LerpKernel.cpp +++ b/aten/src/ATen/native/cpu/LerpKernel.cpp @@ -4,35 +4,127 @@ #include #include +#include + namespace at { namespace native { namespace { +template +Vectorized is_lerp_weight_small(Vectorized weight) { + static_assert(!c10::is_complex::value, ""); + return weight.abs() < Vectorized(0.5); +} + +// is_lerp_weight_small doesn't work for complex because z.abs() returns a +// complex vector which can't be compared. Either implement it with z.abs_2_(), +// or fallback to the scalar function. +#if !(defined(CPU_CAPABILITY_DEFAULT) || defined(_MSC_VER)) +template +Vectorized> is_lerp_weight_small(Vectorized> weight) { + using vec_reg_t = decltype(weight.abs_2_()); + vec_reg_t mask = Vectorized(weight.abs_2_()) < Vectorized(0.25); + return Vectorized>(mask); +} +#else +template +Vectorized lerp_vec_map(Vectorized start, Vectorized end, Vectorized weight) { + using vec_t = Vectorized; + __at_align__ scalar_t start_arr[vec_t::size()]; + __at_align__ scalar_t end_arr[vec_t::size()]; + __at_align__ scalar_t weight_arr[vec_t::size()]; + __at_align__ scalar_t result_arr[vec_t::size()]; + + start.store(start_arr); + end.store(end_arr); + weight.store(weight_arr); + + for (auto i : c10::irange(vec_t::size())) { + result_arr[i] = lerp(start_arr[i], end_arr[i], weight_arr[i]); + } + return vec_t::loadu(result_arr); +} + +template +Vectorized> lerp_vec(Vectorized> start, Vectorized> end, Vectorized> weight) { + return lerp_vec_map(start, end, weight); +} +#endif + +template +Vectorized lerp_vec(Vectorized start, Vectorized end, Vectorized weight) { + using vec_t = Vectorized; + auto mask = is_lerp_weight_small(weight); + auto coeff = vec_t::blendv(weight - vec_t(1), weight, mask); + auto base = vec_t::blendv(end, start, mask); + return vec::fmadd(coeff, end - start, base); +} + void lerp_scalar_kernel(at::TensorIteratorBase& iter, const Scalar& weight) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(iter.common_dtype(), "lerp_kernel_scalar", [&] { - using value_t = typename c10::scalar_value_type::type; - scalar_t weight_val = weight.to(); - at::native::cpu_kernel( - iter, - [weight_val](scalar_t self_val, scalar_t end_val) { - return (zabs(weight_val) < 0.5) - ? 
self_val + weight_val * (end_val - self_val) - : end_val - (end_val - self_val) * (scalar_t(1) - weight_val); - }); - }); + if (iter.common_dtype() == kBFloat16) { + using bVec = Vectorized; + using fVec = Vectorized; + float weight_val = weight.to(); + auto weight_vec = fVec(weight_val); + at::native::cpu_kernel_vec( + iter, + [weight_val](BFloat16 self_val, BFloat16 end_val) -> BFloat16 { + return lerp(self_val, end_val, weight_val); + }, + [=](bVec self_vec, bVec end_vec) -> bVec { + fVec self_vec0, self_vec1, end_vec0, end_vec1; + std::tie(self_vec0, self_vec1) = convert_bfloat16_float(self_vec); + std::tie(end_vec0, end_vec1) = convert_bfloat16_float(end_vec); + auto result0 = lerp_vec(self_vec0, end_vec0, weight_vec); + auto result1 = lerp_vec(self_vec1, end_vec1, weight_vec); + return convert_float_bfloat16(result0, result1); + }); + } else { + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(iter.common_dtype(), "lerp_kernel_scalar", [&] { + auto weight_val = weight.to(); + at::native::cpu_kernel_vec( + iter, + [weight_val](scalar_t self_val, scalar_t end_val) { + return lerp(self_val, end_val, weight_val); + }, + [weight_val](Vectorized self, Vectorized end) { + const Vectorized weight(weight_val); + return lerp_vec(self, end, weight); + }); + }); + } } void lerp_tensor_kernel(at::TensorIteratorBase& iter) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(iter.common_dtype(), "lerp_kernel_tensor", [&] { - using value_t = typename c10::scalar_value_type::type; - at::native::cpu_kernel( - iter, - [](scalar_t self_val, scalar_t end_val, scalar_t weight_val) { - return (zabs(weight_val) < 0.5) - ? self_val + weight_val * (end_val - self_val) - : end_val - (end_val - self_val) * (scalar_t(1) - weight_val); - }); - }); + if (iter.common_dtype() == kBFloat16) { + using bVec = Vectorized; + using fVec = Vectorized; + at::native::cpu_kernel_vec( + iter, + [=](BFloat16 self_val, BFloat16 end_val, BFloat16 weight_val) -> BFloat16 { + return lerp(self_val, end_val, weight_val); + }, + [=](bVec self_vec, bVec end_vec, bVec weight_vec) -> bVec { + fVec self_vec0, self_vec1, end_vec0, end_vec1, weight_vec0, weight_vec1; + std::tie(self_vec0, self_vec1) = convert_bfloat16_float(self_vec); + std::tie(end_vec0, end_vec1) = convert_bfloat16_float(end_vec); + std::tie(weight_vec0, weight_vec1) = convert_bfloat16_float(weight_vec); + auto result0 = lerp_vec(self_vec0, end_vec0, weight_vec0); + auto result1 = lerp_vec(self_vec1, end_vec1, weight_vec1); + return convert_float_bfloat16(result0, result1); + }); + } else { + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(iter.common_dtype(), "lerp_kernel_tensor", [&] { + at::native::cpu_kernel_vec( + iter, + [](scalar_t self_val, scalar_t end_val, scalar_t weight_val) { + return lerp(self_val, end_val, weight_val); + }, + [](Vectorized self_val, Vectorized end_val, Vectorized weight_val) { + return lerp_vec(self_val, end_val, weight_val); + }); + }); + } } } // anonymous namespace diff --git a/aten/src/ATen/native/cpu/Loops.h b/aten/src/ATen/native/cpu/Loops.h index 2558736ddc0f..8e76cca50f01 100644 --- a/aten/src/ATen/native/cpu/Loops.h +++ b/aten/src/ATen/native/cpu/Loops.h @@ -269,8 +269,7 @@ struct VectorizedLoop2d { const int64_t *outer_strides = &strides[ntensors]; if (is_contiguous(strides)) { - for (const auto i : c10::irange(size1)) { - (void)i; + for (const auto i C10_UNUSED : c10::irange(size1)) { vectorized_loop(data.data(), size0, 0, op, vop); advance(data, outer_strides); } @@ -278,14 +277,12 @@ struct VectorizedLoop2d { using Indices = std::make_index_sequence; 
unroll_contiguous_scalar_checks(strides, Indices{}, [&](size_t idx) { if (idx) { - for (const auto i : c10::irange(size1)) { - (void)i; + for (const auto i C10_UNUSED : c10::irange(size1)) { vectorized_loop(data.data(), size0, idx, op, vop); advance(data, outer_strides); } } else { - for (const auto i : c10::irange(size1)) { - (void)i; + for (const auto i C10_UNUSED : c10::irange(size1)) { basic_loop(data.data(), strides, 0, size0, op); advance(data, outer_strides); } diff --git a/aten/src/ATen/native/cpu/PixelShuffleKernel.cpp b/aten/src/ATen/native/cpu/PixelShuffleKernel.cpp index aedd845fee89..0045edd2feaf 100644 --- a/aten/src/ATen/native/cpu/PixelShuffleKernel.cpp +++ b/aten/src/ATen/native/cpu/PixelShuffleKernel.cpp @@ -1,8 +1,10 @@ -#include +#define TORCH_ASSERT_NO_OPERATORS +#include + +#include #include #include #include -#include #include #include @@ -12,8 +14,8 @@ namespace { template void cpu_pixel_shuffle( - Tensor& output, - const Tensor& input, + TensorBase& output, + const TensorBase& input, int64_t upscale_factor) { auto input_data = input.data_ptr(); auto output_data = output.data_ptr(); @@ -52,8 +54,8 @@ void cpu_pixel_shuffle( template void cpu_pixel_shuffle_channels_last( - Tensor& output, - const Tensor& input, + TensorBase& output, + const TensorBase& input, int64_t upscale_factor) { TORCH_CHECK(input.ndimension() == 4, "pixel shuffle with channels last format supports tensors with 4 dims"); @@ -110,8 +112,8 @@ void cpu_pixel_shuffle_channels_last( template void cpu_pixel_unshuffle( - Tensor& output, - const Tensor& input, + TensorBase& output, + const TensorBase& input, int64_t downscale_factor) { auto input_data = input.data_ptr(); auto output_data = output.data_ptr(); @@ -151,8 +153,8 @@ void cpu_pixel_unshuffle( template void cpu_pixel_unshuffle_channels_last( - Tensor& output, - const Tensor& input, + TensorBase& output, + const TensorBase& input, int64_t downscale_factor) { TORCH_CHECK(input.ndimension() == 4, "pixel unshuffle with channels last format supports tensors with 4 dims"); @@ -192,8 +194,8 @@ void cpu_pixel_unshuffle_channels_last( } void pixel_shuffle_kernel_impl( - Tensor& output, - const Tensor& input, + TensorBase& output, + const TensorBase& input, int64_t upscale_factor) { switch (input.suggest_memory_format()) { case at::MemoryFormat::Contiguous: { @@ -216,8 +218,8 @@ void pixel_shuffle_kernel_impl( } void pixel_unshuffle_kernel_impl( - Tensor& output, - const Tensor& input, + TensorBase& output, + const TensorBase& input, int64_t downscale_factor) { switch (input.suggest_memory_format()) { case at::MemoryFormat::Contiguous: { diff --git a/aten/src/ATen/native/cpu/PixelShuffleKernel.h b/aten/src/ATen/native/cpu/PixelShuffleKernel.h index f7234edf0e60..c015e674a24c 100644 --- a/aten/src/ATen/native/cpu/PixelShuffleKernel.h +++ b/aten/src/ATen/native/cpu/PixelShuffleKernel.h @@ -1,12 +1,13 @@ -#include -#include +#pragma once #include -#pragma once +namespace at { +class TensorBase; +} namespace at { namespace native { -using pixel_shuffle_fn = void(*)(Tensor&, const Tensor&, int64_t); +using pixel_shuffle_fn = void(*)(TensorBase&, const TensorBase&, int64_t); DECLARE_DISPATCH(pixel_shuffle_fn, pixel_shuffle_kernel); DECLARE_DISPATCH(pixel_shuffle_fn, pixel_unshuffle_kernel); diff --git a/aten/src/ATen/native/cpu/README.md b/aten/src/ATen/native/cpu/README.md index ab2f9d3d0260..2cf6fa0a1332 100644 --- a/aten/src/ATen/native/cpu/README.md +++ b/aten/src/ATen/native/cpu/README.md @@ -64,7 +64,7 @@ within 256bit & 512bits registers. 
vec defines various operators such as As an example `ReduceOpsKernel.cpp` implements a generic `kernel_` that reduces an entire array using a given associative binary operation such as +. -More explicity, calling `kernel_` with template argument `std::plus` will cause +More explicitly, calling `kernel_` with template argument `std::plus` will cause it to sum up the entire array into a single value. `ReduceOpsKernel.cpp` uses the `CPU_CAPABILITY_*` macros to "know" under which @@ -73,7 +73,7 @@ generic code, which will be compiled under multipled compilation settings. `../ReduceOps.cpp` now includes the header `ReduceOpsKernel.h`, which contains a generic definition of `sumImplAll`. This function allows the user to reduce -over a dimension or all dimensions. The appropiate capability is chosen at +over a dimension or all dimensions. The appropriate capability is chosen at runtime using cpuinfo. If the current platform has AVX2, `sumImpl` will be set to `sumImplAll`. diff --git a/aten/src/ATen/native/cpu/Reduce.h b/aten/src/ATen/native/cpu/Reduce.h index 8fe94699503b..fdb1c0d1a0fc 100644 --- a/aten/src/ATen/native/cpu/Reduce.h +++ b/aten/src/ATen/native/cpu/Reduce.h @@ -69,8 +69,7 @@ static inline void vectorized_reduction(char** data, int64_t n, int64_t stride, template static inline void UNARY_OUTER_LOOP(char* data[2], const int64_t strides[2], int64_t n, F f) { - for (const auto j : c10::irange(n)) { - (void)j; //Suppress unused variable warning + for (const auto j C10_UNUSED : c10::irange(n)) { f(); data[0] += strides[0]; data[1] += strides[1]; diff --git a/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp b/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp index 52e18faf737d..bbf45ba2ecd0 100644 --- a/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp +++ b/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp @@ -61,8 +61,7 @@ static inline void cpu_cum_base_kernel(const Tensor& result, auto* result_data_bytes = data[0]; const auto* self_data_bytes = data[1]; - for (const auto i : c10::irange(n)) { - (void)i; //Suppress unused variable warning + for (const auto i C10_UNUSED : c10::irange(n)) { f( (scalar_t*)result_data_bytes, result_dim_stride, (scalar_t*)self_data_bytes, self_dim_stride, init_val @@ -185,7 +184,7 @@ static void prod_kernel_impl(TensorIterator& iter) { // NOLINTNEXTLINE(bugprone-argument-comment) /*identity=*/1); } else { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX(iter.dtype(), "prod_cpu", [&] { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND(kBFloat16, iter.dtype(), "prod_out_cpu", [&] { binary_kernel_reduce_vec( iter, [=](scalar_t a, scalar_t b) @@ -334,20 +333,9 @@ static void and_kernel_impl(TensorIterator& iter) { binary_kernel_reduce_vec( iter, [=](uint8_t a, uint8_t b) -> uint8_t { return (a && b) ? 1 : 0; }, -#if defined(CPU_CAPABILITY_ZVECTOR) [=](Vectorized a, Vectorized b) { return a & b; }, -#else - [=](Vectorized a, Vectorized b) { - Vectorized c = Vectorized(); - - for (decltype(c.size()) i = 0; i != Vectorized::size(); i++) { - c[i] = (a[i] && b[i]) ? 1 : 0; - } - return c; - }, -#endif /*ident=*/true); } else { binary_kernel_reduce_vec( @@ -381,20 +369,9 @@ static void or_kernel_impl(TensorIterator& iter) { binary_kernel_reduce_vec( iter, [=](uint8_t a, uint8_t b) -> uint8_t { return (a || b) ? 1 : 0; }, -#if defined(CPU_CAPABILITY_ZVECTOR) [=](Vectorized a, Vectorized b) { return a | b; }, -#else - [=](Vectorized a, Vectorized b) { - Vectorized c = Vectorized(); - - for (decltype(c.size()) i = 0; i != Vectorized::size(); i++) { - c[i] = (a[i] || b[i]) ? 
1 : 0; - } - return c; - }, -#endif /*ident=*/false); } else { binary_kernel_reduce_vec( diff --git a/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp b/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp index 8a157cee7522..6321fb6349e5 100644 --- a/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp +++ b/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp @@ -10,7 +10,6 @@ #include #include #include -#include #include namespace at { namespace native { @@ -184,8 +183,7 @@ struct cpu_scatter_gather_base_kernel { // vs dim-TensorIterator loop order depending on // whether dim is the last dimension if (dim== self.dim() - 1) { - for (const auto nelem : c10::irange(n)) { - (void)nelem; //Suppress unused variable warning + for (const auto nelem C10_UNUSED : c10::irange(n)) { // dim loop is a separate code block // for better performance _cpu_scatter_gather_dim_loop()( @@ -202,8 +200,7 @@ struct cpu_scatter_gather_base_kernel { for (const auto i : c10::irange(index_dim_size)) { auto* self_data = self_data_bytes; auto* index_data = (char*)((int64_t*)index_data_bytes + i * index_dim_stride); - for (const auto nelem : c10::irange(n)) { - (void)nelem; //Suppress unused variable warning + for (const auto nelem C10_UNUSED : c10::irange(n)) { int64_t idx_dim = *(int64_t*)index_data; // we are not putting idx_dim in the error message because it disables // loop optimization in clang-7 @@ -268,8 +265,7 @@ struct cpu_scatter_gather_base_kernel { // vs dim-TensorIterator loop order depending on // whether dim is the last dimension if (dim== self.dim() - 1) { - for (const auto nelem : c10::irange(n)) { - (void)nelem; //Suppress unused variable warning + for (const auto nelem C10_UNUSED : c10::irange(n)) { // dim loop is a separate code block // for better performance _cpu_scatter_gather_dim_loop()( @@ -290,8 +286,7 @@ struct cpu_scatter_gather_base_kernel { auto* self_data = self_data_bytes; auto* index_data = (char*)((int64_t*)index_data_bytes + i * index_dim_stride); auto* src_data = src_data_bytes; - for (const auto nelem : c10::irange(n)) { - (void)nelem; //Suppress unused variable warning + for (const auto nelem C10_UNUSED : c10::irange(n)) { int64_t idx_dim = *(int64_t*)index_data; // we are not putting idx_dim in the error message because it disables // loop optimization in clang-7 @@ -357,8 +352,7 @@ struct cpu_scatter_gather_base_kernel { // vs dim-TensorIterator loop order depending on // whether dim is the last dimension if (dim== self.dim() - 1) { - for (const auto nelem : c10::irange(n)) { - (void)nelem; //Suppress unused variable warning + for (const auto nelem C10_UNUSED : c10::irange(n)) { // dim loop is a separate code block // for better performance _cpu_scatter_gather_dim_loop()( @@ -379,8 +373,7 @@ struct cpu_scatter_gather_base_kernel { auto* self_data = self_data_bytes; auto* index_data = (char*)((int64_t*)index_data_bytes + i * index_dim_stride); auto* src_data = src_data_bytes; - for (const auto nelem : c10::irange(n)) { - (void)nelem; //Suppress unused variable warning + for (const auto nelem C10_UNUSED : c10::irange(n)) { int64_t idx_dim = *(int64_t*)index_data; // we are not putting idx_dim in the error message because it disables // loop optimization in clang-7 @@ -446,8 +439,7 @@ struct cpu_scatter_gather_base_kernel { // vs dim-TensorIterator loop order depending on // whether dim is the last dimension if (dim== self.dim() - 1) { - for (const auto nelem : c10::irange(n)) { - (void)nelem; //Suppress unused variable warning + for (const auto nelem C10_UNUSED : c10::irange(n)) { // dim 
loop is a separate code block // for better performance _cpu_scatter_gather_dim_loop()( @@ -468,8 +460,7 @@ struct cpu_scatter_gather_base_kernel { auto* self_data = self_data_bytes; auto* index_data = (char*)((int64_t*)index_data_bytes + i * index_dim_stride); auto* src_data = src_data_bytes; - for (const auto nelem : c10::irange(n)) { - (void)nelem; //Suppress unused variable warning + for (const auto nelem C10_UNUSED : c10::irange(n)) { int64_t idx_dim = *(int64_t*)index_data; // we are not putting idx_dim in the error message because it disables // loop optimization in clang-7 @@ -535,8 +526,7 @@ struct cpu_scatter_gather_base_kernel { // vs dim-TensorIterator loop order depending on // whether dim is the last dimension if (dim== self.dim() - 1) { - for (const auto nelem : c10::irange(n)) { - (void)nelem; //Suppress unused variable warning + for (const auto nelem C10_UNUSED : c10::irange(n)) { // dim loop is a separate code block // for better performance _cpu_scatter_gather_dim_loop()( @@ -557,8 +547,7 @@ struct cpu_scatter_gather_base_kernel { auto* self_data = self_data_bytes; auto* index_data = (char*)((int64_t*)index_data_bytes + i * index_dim_stride); auto* src_data = src_data_bytes; - for (const auto nelem : c10::irange(n)) { - (void)nelem; //Suppress unused variable warning + for (const auto nelem C10_UNUSED : c10::irange(n)) { int64_t idx_dim = *(int64_t*)index_data; // we are not putting idx_dim in the error message because it disables // loop optimization in clang-7 @@ -584,13 +573,55 @@ struct cpu_scatter_gather_base_kernel { } }; +template +inline void init(scalar_t* ptr, int64_t size, bool include_self) { + if (!include_self) { + using acc_t = vec::vec_scalar_t; + using Vec = vec::Vectorized; + + acc_t val; + if (reduce == SCATTER_GATHER_OP::REDUCE_ADD || + reduce == SCATTER_GATHER_OP::REDUCE_MEAN) { + val = static_cast(0); + } else if (reduce == SCATTER_GATHER_OP::REDUCE_MULTIPLY) { + val = static_cast(1); + } else if (reduce == SCATTER_GATHER_OP::REDUCE_MAXIMUM) { + val = std::numeric_limits::lowest(); + } else { + val = std::numeric_limits::max(); + } + vec::map( + [val](Vec x) { return Vec(val); }, + ptr, + ptr, + size); + } +} + +template +inline vec_t update(const vec_t& x, const vec_t& y) { + if (reduce == SCATTER_GATHER_OP::REDUCE_ADD || + reduce == SCATTER_GATHER_OP::REDUCE_MEAN) { + return x + y; + } else if (reduce == SCATTER_GATHER_OP::REDUCE_MULTIPLY) { + return x * y; + } else if (reduce == SCATTER_GATHER_OP::REDUCE_MAXIMUM) { + return vec::maximum(x, y); + } else { + return vec::minimum(x, y); + } +} + // Note [scatter reduce optimization] // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ // -// 1. initiative: optimize scatter_reduce optimization on PyG -// `scatter_add` is extensively used on 'message passing' when -// aggregating info. The `index` tensor is extended which means -// the aggregation is on rowwise. +// 1. initiative: optimize `scatter_reduce` on classic PyG use-case: +// `scatter_reduce` is extensively used on 'message passing' when +// aggregating info. +// +// Typically, `self` will 2D tensor and `index` is a 1D extended/broadcasted +// tensor, which means that the aggregation is on rowwise and we can vectorize +// on the inner dimensions. // // 2. 
implementation: map `scatter_reduce` to `spmm` reduce // in the shape of `[M, N]` * `[N, K]`, where: @@ -604,8 +635,8 @@ struct cpu_scatter_gather_base_kernel { // // step 2: spmm reduce, parallel on M and vectorize on K // -template -void cpu_scatter_add_contig_kernel(const Tensor& self, const Tensor& index, const Tensor& src) { +template +void cpu_scatter_reduce_expanded_index(const Tensor& self, const Tensor& index, const Tensor& src, bool include_self) { int64_t* index_data = index.data_ptr(); scalar_t* self_data = self.data_ptr(); scalar_t* src_data = src.data_ptr(); @@ -624,9 +655,9 @@ void cpu_scatter_add_contig_kernel(const Tensor& self, const Tensor& index, cons for (const auto i : c10::irange(begin, end)) { int64_t index = index_data[i]; TORCH_CHECK(index >= 0 && index < index_upper_bound, - "index ", index, - " is out of bounds for dimension ", 0, - " with size ", index_upper_bound); + "index ", index, + " is out of bounds for dimension ", 0, + " with size ", index_upper_bound); keys[i] = index; values[i] = i; } @@ -689,25 +720,110 @@ void cpu_scatter_add_contig_kernel(const Tensor& self, const Tensor& index, cons int64_t off_start = row_index_offset[m]; int64_t off_end = row_index_offset[m + 1]; scalar_t* self_ptr = self_data + row * K; + + // reinit rows in `self` if needed + init(self_ptr, K, include_self); + for (const auto n : c10::irange(off_start, off_end)) { int64_t col = sorted_col_index_values[n]; scalar_t* src_ptr = src_data + col * K; vec::map2( - [](Vec x, Vec y) { return x + y; }, + [](Vec x, Vec y) { return update(x, y); }, self_ptr, self_ptr, src_ptr, K); } + + if (reduce == SCATTER_GATHER_OP::REDUCE_MEAN) { + int64_t count = include_self ? 1 : 0; + count += off_end - off_start; + if (count != 0) { + vec::map( + [count](Vec x) { return x / Vec(count); }, + self_ptr, + self_ptr, + K); + } + } } }); } -void scatter_add_config(const Tensor& self, const Tensor& index, const Tensor& src) { - AT_DISPATCH_ALL_TYPES_AND3( - ScalarType::Bool, ScalarType::Half, ScalarType::BFloat16, self.scalar_type(), - "scatter_add_contig", [&] { - cpu_scatter_add_contig_kernel(self, index, src); +template +void cpu_gather_expanded_index_kernel(const Tensor& result, const Tensor& index, const Tensor& self) { + int64_t* index_data = index.data_ptr(); + scalar_t* result_data = result.data_ptr(); + scalar_t* self_data = self.data_ptr(); + + const int64_t M = ensure_nonempty_size(result, 0); + const int64_t N = ensure_nonempty_size(self, 0); + const int64_t K = index.numel() / M; + + const int64_t index_upper_bound = N; + + using Vec = vec::Vectorized; + int64_t grain_size = std::max((int64_t) 1, at::internal::GRAIN_SIZE / K); + at::parallel_for(0, M, grain_size, [&](int64_t begin, int64_t end) { + for (const auto m : c10::irange(begin, end)) { + scalar_t* result_ptr = result_data + m * K; + int64_t index = index_data[m]; + TORCH_CHECK(index >= 0 && index < index_upper_bound, + "index ", index, + " is out of bounds for dimension ", 0, + " with size ", index_upper_bound); + scalar_t* self_ptr = self_data + index * K; + int64_t d = 0; + for (; d < K - (K % Vec::size()); d += Vec::size()) { + Vec out_vec = Vec::loadu(self_ptr + d); + out_vec.store(result_ptr + d); + } + #if !defined(_MSC_VER) && !defined(COMPILING_FOR_MIN_SIZE) + # pragma unroll + #endif + for (; d < K; d++) { + result_ptr[d] = self_ptr[d]; + } + } + }); +} + +void scatter_add_expanded_index_kernel(const Tensor& self, const Tensor& index, const Tensor& src) { + AT_DISPATCH_FLOATING_TYPES_AND( + ScalarType::BFloat16, 
self.scalar_type(), "scatter_add_expanded_index", [&] { + cpu_scatter_reduce_expanded_index(self, index, src, /*include_self*/true); + }); +} + +void scatter_reduce_expanded_index_kernel( + const Tensor& self, const Tensor& index, const Tensor& src, + const SCATTER_GATHER_OP& reduce, bool include_self) { + AT_DISPATCH_FLOATING_TYPES_AND( + ScalarType::BFloat16, self.scalar_type(), "scatter_reduce_expanded_index", [&] { + switch (reduce) { + case SCATTER_GATHER_OP::REDUCE_ADD : + cpu_scatter_reduce_expanded_index(self, index, src, include_self); + break; + case SCATTER_GATHER_OP::REDUCE_MULTIPLY : + cpu_scatter_reduce_expanded_index(self, index, src, include_self); + break; + case SCATTER_GATHER_OP::REDUCE_MAXIMUM : + cpu_scatter_reduce_expanded_index(self, index, src, include_self); + break; + case SCATTER_GATHER_OP::REDUCE_MINIMUM : + cpu_scatter_reduce_expanded_index(self, index, src, include_self); + break; + case SCATTER_GATHER_OP::REDUCE_MEAN : + cpu_scatter_reduce_expanded_index(self, index, src, include_self); + break; + } + }); +} + +void gather_expanded_index_kernel(const Tensor& result, const Tensor& self, const Tensor& index) { + AT_DISPATCH_FLOATING_TYPES_AND( + ScalarType::BFloat16, self.scalar_type(), "gather_expanded_index", [&] { + cpu_gather_expanded_index_kernel(result, index, self); }); } @@ -727,25 +843,10 @@ void scatter_fill_cpu_kernel(const Tensor& self, int64_t dim, const Tensor& inde self, dim, index, value, "scatter_fill_cpu_", tensor_assign); } -inline bool is_fast_path_scatter(const Tensor& self, int64_t dim, const Tensor& index, const Tensor& src) { -#if AT_PARALLEL_OPENMP - //TODO: add optimization when inner_size is 1 - // currently inner_size == 1 will go sequetial - if (index.numel() == index.size(0)) { return false; } - return dim == 0 && index.stride(dim) == 1 && src.is_contiguous() && self.is_contiguous(); -#else - return false; -#endif -} - void scatter_add_cpu_kernel(const Tensor& self, int64_t dim, const Tensor& index, const Tensor& src) { - if (is_fast_path_scatter(self, dim, index, src)) { - scatter_add_config(self, index, src); - } else { - cpu_scatter_gather_base_kernel<>()( - self, dim, index, src, - "scatter_add_", reduce_add); - } + cpu_scatter_gather_base_kernel<>()( + self, dim, index, src, + "scatter_add_", reduce_add); } void scatter_reduce_cpu_kernel(const Tensor& self, const int64_t dim, const Tensor& index, @@ -816,4 +917,9 @@ REGISTER_DISPATCH(scatter_reduce_stub, &scatter_reduce_cpu_kernel); REGISTER_DISPATCH(scatter_scalar_reduce_stub, &scatter_scalar_reduce_cpu_kernel); REGISTER_DISPATCH(scatter_reduce_two_stub, &scatter_reduce_two_cpu_kernel); +// fast paths for GNN usage +REGISTER_DISPATCH(scatter_add_expanded_index_stub, &scatter_add_expanded_index_kernel); +REGISTER_DISPATCH(scatter_reduce_expanded_index_stub, &scatter_reduce_expanded_index_kernel); +REGISTER_DISPATCH(gather_expanded_index_stub, &gather_expanded_index_kernel); + }} // namespace at::native diff --git a/aten/src/ATen/native/cpu/SortingKernel.cpp b/aten/src/ATen/native/cpu/SortingKernel.cpp index fdbecbb65cdf..66c9c3b68c8a 100644 --- a/aten/src/ATen/native/cpu/SortingKernel.cpp +++ b/aten/src/ATen/native/cpu/SortingKernel.cpp @@ -45,8 +45,7 @@ void _dim_apply( return; } - for (const auto i : c10::irange(n)) { - (void)i; //Suppress unused variable warning + for (const auto i C10_UNUSED : c10::irange(n)) { f( reinterpret_cast(values_data_bytes), values_dim_stride, diff --git a/aten/src/ATen/native/cpu/SparseFactories.cpp b/aten/src/ATen/native/cpu/SparseFactories.cpp 
index 0b0f73e1844c..1fb33c7e3713 100644 --- a/aten/src/ATen/native/cpu/SparseFactories.cpp +++ b/aten/src/ATen/native/cpu/SparseFactories.cpp @@ -1,35 +1,25 @@ +#define TORCH_ASSERT_NO_OPERATORS +#include + #include -#include -#include -#include #include -#include -#include +#include #include -#include -#include -#include +#include #include -#ifndef AT_PER_OPERATOR_HEADERS -#include -#include -#else -#include -#endif - namespace at { namespace native { -using namespace at::sparse; namespace { void _spdiags_kernel_cpu( TensorIterator& iter, - const Tensor& diagonals, - Tensor& values, - Tensor& indices) { - auto* row_index_write_ptr = indices[0].data_ptr(); - auto* col_index_write_ptr = indices[1].data_ptr(); + const TensorBase& diagonals, + TensorBase& values, + TensorBase& indices) { + auto* row_index_write_ptr = indices.data_ptr(); + auto* col_index_write_ptr = row_index_write_ptr + indices.stride(0); + const int64_t diagonals_index_stride = diagonals.stride(0); const int64_t diagonals_read_stride = diagonals.stride(1); AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( at::ScalarType::BFloat16, @@ -39,7 +29,9 @@ void _spdiags_kernel_cpu( diagonals.scalar_type(), "spdiags_cpu", [&] { - auto* values_write_ptr = values.data_ptr(); + auto* const values_write_ptr = values.data_ptr(); + const auto* const diagonals_ptr = diagonals.data_ptr(); + cpu_kernel( iter, [&](int64_t diag_index, @@ -52,8 +44,9 @@ void _spdiags_kernel_cpu( auto* vals_start = values_write_ptr + out_offset; const int64_t first_col = std::max(diag_offset, 0); const int64_t first_row = first_col - diag_offset; - auto* data_read = diagonals[diag_index].data_ptr() + - first_col * diagonals_read_stride; + auto* data_read = (diagonals_ptr + + diagonals_index_stride * diag_index + + first_col * diagonals_read_stride); for (int64_t i = 0; i < n_out; ++i) { rows_start[i] = first_row + i; cols_start[i] = first_col + i; diff --git a/aten/src/ATen/native/cpu/SpmmReduceKernel.cpp b/aten/src/ATen/native/cpu/SpmmReduceKernel.cpp index 74854855ff83..cba47abcd3e4 100644 --- a/aten/src/ATen/native/cpu/SpmmReduceKernel.cpp +++ b/aten/src/ATen/native/cpu/SpmmReduceKernel.cpp @@ -1,150 +1,601 @@ #define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include - +#include #include -#include #include #include #include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif + namespace at { namespace native { namespace { -template -void spmm_sum_kernel_impl( - const Tensor& result, - const Tensor& rowptr, - const Tensor& col, - const c10::optional& optional_value, - const Tensor& mat) { - - scalar_t* result_data = result.data_ptr(); - int64_t* rowptr_data = rowptr.data_ptr(); - int64_t* col_data = col.data_ptr(); - scalar_t* value_data = has_optional_value ? 
optional_value.value().data_ptr() : nullptr; - scalar_t* mat_data = mat.data_ptr(); - - int64_t M = rowptr.numel() - 1; - int64_t N = mat.size(-2); - int64_t K = mat.size(-1); - int64_t B = mat.numel() / (N * K); - - // directly parallel on `B * M` may lead to load imbalance, +template +struct Reducer { + static inline void init(scalar_t* ptr, int64_t size) { + using acc_t = vec::vec_scalar_t; + using Vec = vec::Vectorized; + + acc_t val; + if (reduce == SPMM_MAX) { + val = std::numeric_limits::lowest(); + } else if (reduce == SPMM_MIN) { + val = std::numeric_limits::max(); + } else { + return; + } + + vec::map( + [val](Vec x) { return Vec(val); }, + ptr, + ptr, + size); + } + + static inline void update(scalar_t& out, const scalar_t data) { + if (reduce == SPMM_SUM || reduce == SPMM_MEAN) { + out += data; + } else if (reduce == SPMM_MAX) { + out = std::max(out, data); + } else { + out = std::min(out, data); + } + } + + static inline void update( + vec::Vectorized& out_vec, + const vec::Vectorized& data_vec) { + if (reduce == SPMM_SUM || reduce == SPMM_MEAN) { + out_vec += data_vec; + } else if (reduce == SPMM_MAX) { + out_vec = vec::maximum(out_vec, data_vec); + } else { + out_vec = vec::minimum(out_vec, data_vec); + } + } +}; + +template +void spmm_reduce_kernel_impl( + const Tensor& out, + const Tensor& crow_indices_, + const Tensor& col_indices_, + const Tensor& values_, + const Tensor& weight_) { + + int64_t nnz = values_.numel(); + if (nnz == 0) { + return; + } + + auto crow_indices = crow_indices_.contiguous(); + auto col_indices = col_indices_.contiguous(); + auto values = values_.contiguous(); + auto weight = weight_.contiguous(); + + scalar_t* out_data = out.data_ptr(); + index_t* csr_data = crow_indices.data_ptr(); + index_t* col_data = col_indices.data_ptr(); + scalar_t* val_data = values.data_ptr(); + scalar_t* weight_data = weight.data_ptr(); + + int64_t M = crow_indices.numel() - 1; + int64_t K = weight.size(-1); + + // directly parallel on `M` may lead to load imbalance, // statically determine thread partition here to average payload // for each thread. int num_threads = at::get_num_threads(); - std::vector thread_splits(num_threads + 1, B * M); - int64_t thread_averge_payload = (rowptr_data[M] - rowptr_data[0]) / num_threads; + std::vector thread_splits(num_threads + 1, M); + + int64_t thread_averge_payload = std::max((int64_t)1, divup(nnz, num_threads)); thread_splits[0] = 0; int64_t sum = 0; int64_t t = 1; for (const auto m : c10::irange(M)) { - int64_t row_start = rowptr_data[m]; - int64_t row_end = rowptr_data[m + 1]; + int64_t row_start = csr_data[m]; + int64_t row_end = csr_data[m + 1]; sum += row_end - row_start; if (sum > t * thread_averge_payload) { - thread_splits[t] = B * m; + thread_splits[t] = m; t++; } } // need to restore the last index, // due to rounding error when calculating `thread_averge_payload`. 
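For reference, the payload-balanced thread partitioning used above can be sketched in isolation as follows (hypothetical standalone helper, not code from this patch): rows are split so that each thread receives roughly nnz / num_threads stored non-zeros rather than M / num_threads rows, which evens out the work when row lengths vary widely.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Hypothetical sketch of the static row partitioning: given CSR row
    // pointers crow[0..M], return num_threads + 1 split points such that
    // thread t processes rows [splits[t], splits[t + 1]).
    std::vector<int64_t> split_rows_by_payload(const int64_t* crow, int64_t M, int num_threads) {
      std::vector<int64_t> splits(num_threads + 1, M);
      const int64_t nnz = crow[M] - crow[0];
      // average payload per thread, rounded up (ceil division)
      const int64_t target = std::max<int64_t>(1, (nnz + num_threads - 1) / num_threads);
      splits[0] = 0;
      int64_t seen = 0;
      int t = 1;
      for (int64_t m = 0; m < M; ++m) {
        seen += crow[m + 1] - crow[m];
        if (t < num_threads && seen > t * target) {
          splits[t++] = m;
        }
      }
      // restore the last boundary; rounding in `target` can leave it unset
      splits[num_threads] = M;
      return splits;
    }
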
- thread_splits[num_threads] = B * M; + thread_splits[num_threads] = M; - // TODO: add bfloat16 support here using Vec = vec::Vectorized; at::parallel_for(0, num_threads, 1, [&](int64_t cbegin, int64_t cend) { int tid = at::get_thread_num(); int64_t begin = thread_splits[tid]; int64_t end = thread_splits[tid + 1]; - int64_t row_start, row_end, b, m, c; - for (const auto i : c10::irange(begin, end)) { - b = i / M; - m = i % M; - row_start = rowptr_data[m]; - row_end = rowptr_data[m + 1]; + int64_t row_start, row_end, c; + for (const auto m : c10::irange(begin, end)) { + row_start = csr_data[m]; + row_end = csr_data[m + 1]; - scalar_t* result_ptr = result_data + i * K; + scalar_t* out_ptr = out_data + m * K; constexpr int64_t kVecSize = Vec::size(); constexpr int64_t kVLEN = kVecSize * 4; constexpr int64_t CHUNK_SIZE = 16; - // init the output lane - vec::map([](Vec x) { return Vec(0); }, result_ptr, result_ptr, K); + // reinit the output row for reduce type 'max' and 'min' + int64_t count = row_end - row_start; + if (count != 0) { + Reducer::init(out_ptr, K); + } // blocking on rowwise to reduce write memory bandwidth for (int64_t e0 = row_start; e0 < row_end; e0 += CHUNK_SIZE) { int64_t e1 = std::min(e0 + CHUNK_SIZE, row_end); - // unrolling by 4 int64_t k = 0; for (; k < K - (K % kVLEN); k += kVLEN) { - Vec out_vec0 = Vec::loadu(result_ptr + k); - Vec out_vec1 = Vec::loadu(result_ptr + k + kVecSize); - Vec out_vec2 = Vec::loadu(result_ptr + k + kVecSize * 2); - Vec out_vec3 = Vec::loadu(result_ptr + k + kVecSize * 3); + Vec out_vec0 = Vec::loadu(out_ptr + k); + Vec out_vec1 = Vec::loadu(out_ptr + k + kVecSize); + Vec out_vec2 = Vec::loadu(out_ptr + k + kVecSize * 2); + Vec out_vec3 = Vec::loadu(out_ptr + k + kVecSize * 3); for (const auto e : c10::irange(e0, e1)) { c = col_data[e]; - scalar_t val = has_optional_value ? value_data[e] : scalar_t(1); - scalar_t* mat_ptr = mat_data + b * N * K + c * K + k; + scalar_t val = val_data[e]; + scalar_t* weight_ptr = weight_data + c * K + k; - out_vec0 += Vec::loadu(mat_ptr) * Vec(val); - out_vec1 += Vec::loadu(mat_ptr + kVecSize) * Vec(val); - out_vec2 += Vec::loadu(mat_ptr + kVecSize * 2) * Vec(val); - out_vec3 += Vec::loadu(mat_ptr + kVecSize * 3) * Vec(val); + Reducer::update(out_vec0, Vec::loadu(weight_ptr) * Vec(val)); + Reducer::update(out_vec1, Vec::loadu(weight_ptr + kVecSize) * Vec(val)); + Reducer::update(out_vec2, Vec::loadu(weight_ptr + kVecSize * 2) * Vec(val)); + Reducer::update(out_vec3, Vec::loadu(weight_ptr + kVecSize * 3) * Vec(val)); } - out_vec0.store(result_ptr + k); - out_vec1.store(result_ptr + k + kVecSize); - out_vec2.store(result_ptr + k + kVecSize * 2); - out_vec3.store(result_ptr + k + kVecSize * 3); + out_vec0.store(out_ptr + k); + out_vec1.store(out_ptr + k + kVecSize); + out_vec2.store(out_ptr + k + kVecSize * 2); + out_vec3.store(out_ptr + k + kVecSize * 3); } for (; k < K - (K % Vec::size()); k += Vec::size()) { - Vec out_vec = Vec::loadu(result_ptr + k); + Vec out_vec = Vec::loadu(out_ptr + k); for (const auto e : c10::irange(e0, e1)) { c = col_data[e]; - scalar_t val = has_optional_value ? 
value_data[e] : scalar_t(1); - scalar_t* mat_ptr = mat_data + b * N * K + c * K; - out_vec += Vec::loadu(mat_ptr + k) * Vec(val); + scalar_t val = val_data[e]; + scalar_t* weight_ptr = weight_data + c * K; + Reducer::update(out_vec, Vec::loadu(weight_ptr + k) * Vec(val)); } - out_vec.store(result_ptr + k); + out_vec.store(out_ptr + k); } for (; k < K; k++) { - scalar_t out_val = result_ptr[k]; + scalar_t out_val = out_ptr[k]; for (const auto e : c10::irange(e0, e1)) { c = col_data[e]; - scalar_t val = has_optional_value ? value_data[e] : scalar_t(1); - scalar_t* mat_ptr = mat_data + b * N * K + c * K; - out_val += mat_ptr[k] * val; + scalar_t val = val_data[e]; + scalar_t* weight_ptr = weight_data + c * K; + Reducer::update(out_val, weight_ptr[k] * val); } - result_ptr[k] = out_val; + out_ptr[k] = out_val; + } + } + + if (reduce == SPMM_MEAN && count != 0) { + int64_t k = 0; + for (; k < K - (K % Vec::size()); k += Vec::size()) { + Vec out_vec = Vec::loadu(out_ptr + k); + out_vec /= Vec(count); + out_vec.store(out_ptr + k); + } + for (; k < K; k++) { + out_ptr[k] /= count; } } } }); } -void spmm_sum_kernel( - const Tensor& result, - const Tensor& rowptr, - const Tensor& col, - const c10::optional& optional_value, - const Tensor& mat) { - AT_DISPATCH_FLOATING_TYPES(result.scalar_type(), "spmm_sum_kernel", [&]() { - if (optional_value.has_value()) { - spmm_sum_kernel_impl(result, rowptr, col, optional_value, mat); - } else { - spmm_sum_kernel_impl(result, rowptr, col, optional_value, mat); +template +inline void update(scalar_t *val, scalar_t new_val, index_t *arg, index_t new_arg) { + if ((reduce == SPMM_MIN && new_val < *val) || + (reduce == SPMM_MAX && new_val > *val)) { + *val = new_val; + *arg = new_arg; + } +} + +template +void spmm_reduce_arg_kernel_impl( + const Tensor& out, + const Tensor& arg_out, + const Tensor& crow_indices_, + const Tensor& col_indices_, + const Tensor& values_, + const Tensor& weight_) { + + TORCH_CHECK(reduce == SPMM_MAX || reduce == SPMM_MIN); + int64_t nnz = values_.numel(); + if (nnz == 0) { + return; + } + + auto crow_indices = crow_indices_.contiguous(); + auto col_indices = col_indices_.contiguous(); + auto values = values_.contiguous(); + auto weight = weight_.contiguous(); + + scalar_t* out_data = out.data_ptr(); + index_t* arg_out_data = arg_out.data_ptr(); + index_t* csr_data = crow_indices.data_ptr(); + index_t* col_data = col_indices.data_ptr(); + scalar_t* val_data = values.data_ptr(); + scalar_t* weight_data = weight.data_ptr(); + + int64_t M = crow_indices.numel() - 1; + int64_t K = weight.size(-1); + + at::parallel_for(0, M, 1, [&](int64_t begin, int64_t end) { + int64_t row_start, row_end, c; + for (const auto m : c10::irange(begin, end)) { + row_start = csr_data[m]; + row_end = csr_data[m + 1]; + + scalar_t* out_ptr = out_data + m * K; + index_t* arg_out_ptr = arg_out_data + m * K; + + int64_t count = row_end - row_start; + if (count != 0) { + Reducer::init(out_ptr, K); + for (const auto e : c10::irange(row_start, row_end)) { + c = col_data[e]; + scalar_t val = val_data[e]; + + scalar_t* weight_ptr = weight_data + c * K; + for (const auto k : c10::irange(K)) { + update( + &out_ptr[k], val * weight_ptr[k], &arg_out_ptr[k], index_t(e)); + }; + } + } + } + }); +} + +template +void spmm_reduce_backward_input_kernel_impl( + const Tensor& grad_input, + const Tensor& grad_out_, + const Tensor& crow_indices_, + const Tensor& col_indices_, + const Tensor& weight_, + const Tensor& row_indices_) { + + int64_t nnz = grad_input._nnz(); + if (nnz == 0) { + 
return; + } + + auto grad_out = grad_out_.contiguous(); + auto crow_indices = crow_indices_.contiguous(); + auto col_indices = col_indices_.contiguous(); + auto weight = weight_.contiguous(); + auto row_indices = row_indices_.contiguous(); + + scalar_t* grad_values_data = grad_input.values().data_ptr(); + scalar_t* grad_out_data = grad_out.data_ptr(); + index_t* crow_data = crow_indices.data_ptr(); + index_t* col_data = col_indices.data_ptr(); + scalar_t* weight_data = weight.data_ptr(); + index_t* row_data = row_indices.data_ptr(); + + int64_t K = grad_out.size(1); + + using Vec = vec::Vectorized>; + at::parallel_for(0, nnz, 1, [&](int64_t begin, int64_t end) { + for (const auto i : c10::irange(begin, end)) { + index_t row = row_data[i], col = col_data[i]; + + scalar_t val = vec::map2_reduce_all( + [](Vec x, Vec y) { return x * y; }, + [](Vec x, Vec y) { return x + y; }, + weight_data + col * K, + grad_out_data + row * K, + K); + + if (reduce == SPMM_MEAN) { + index_t row_start = crow_data[row], row_end = crow_data[row + 1]; + val /= std::max((index_t)1, row_end - row_start); + } + + grad_values_data[i] = val; + } + }); +} + +// backward for reduce type 'max' or 'min' +template +void spmm_reduce_backward_input_arg_kernel_impl( + const Tensor& grad_input, + const Tensor& grad_out_, + const Tensor& col_indices_, + const Tensor& weight_, + const Tensor& arg_out_) { + + int64_t nnz = grad_input._nnz(); + if (nnz == 0) { + return; + } + + auto grad_out = grad_out_.contiguous(); + auto col_indices = col_indices_.contiguous(); + auto weight = weight_.contiguous(); + auto arg_out = arg_out_.contiguous(); + + scalar_t* grad_values_data = grad_input.values().data_ptr(); + scalar_t* grad_out_data = grad_out.data_ptr(); + index_t* col_data = col_indices.data_ptr(); + scalar_t* weight_data = weight.data_ptr(); + index_t* arg_out_data = arg_out.data_ptr(); + + int64_t M = grad_out.size(0); + int64_t K = grad_out.size(1); + auto grad = at::empty({M, K}, grad_out.options()); + scalar_t* grad_data = grad.data_ptr(); + + at::parallel_for(0, M, 1, [&](int64_t begin, int64_t end) { + for (const auto m : c10::irange(begin, end)) { + scalar_t* grad_out_ptr = grad_out_data + m * K; + scalar_t* grad_ptr = grad_data + m * K; + index_t* arg_out_ptr = arg_out_data + m * K; + + for (const auto k : c10::irange(K)) { + if (arg_out_ptr[k] == index_t(nnz)) { + grad_ptr[k] = scalar_t(0); + } else { + // collect weight at max/min indices + index_t col = col_data[arg_out_data[m * K + k]]; + grad_ptr[k] = weight_data[col * K + k] * grad_out_ptr[k]; + } + } + } + }); + + // scatter_add, consider to parallel this with atomic + for (const auto i : c10::irange(M * K)) { + index_t ind = arg_out_data[i]; + if (ind != index_t(nnz)) { + grad_values_data[ind] += grad_data[i]; } + } +} + +template +void spmm_reduce_update_values_kernel_impl( + const Tensor& updated_values, + const Tensor& values_, + const Tensor& crow_indices_, + const Tensor& row_indices_) { + + int64_t nnz = values_.numel(); + if (nnz == 0) { + return; + } + + auto values = values_.contiguous(); + auto crow_indices = crow_indices_.contiguous(); + auto row_indices = row_indices_.contiguous(); + + scalar_t* updated_values_data = updated_values.data_ptr(); + scalar_t* values_data = values.data_ptr(); + index_t* crow_data = crow_indices.data_ptr(); + index_t* row_data = row_indices.data_ptr(); + + at::parallel_for(0, nnz, 1, [&](int64_t begin, int64_t end) { + for (const auto i : c10::irange(begin, end)) { + index_t row = row_data[i]; + index_t row_start = 
crow_data[row], row_end = crow_data[row + 1]; + updated_values_data[i] = values_data[i] / std::max((index_t)1, row_end - row_start); + } + }); +} + +template +void spmm_reduce_backward_weight_arg_kernel_impl( + const Tensor& grad_weight, + const Tensor& grad_out_, + const Tensor& col_indices_, + const Tensor& values_, + const Tensor& arg_out_) { + + int64_t nnz = values_.numel(); + if (nnz == 0) { + return; + } + + auto grad_out = grad_out_.contiguous(); + auto col_indices = col_indices_.contiguous(); + auto values = values_.contiguous(); + auto arg_out = arg_out_.contiguous(); + + scalar_t* grad_weight_data = grad_weight.data_ptr(); + scalar_t* grad_out_data = grad_out.data_ptr(); + index_t* col_data = col_indices.data_ptr(); + scalar_t* values_data = values.data_ptr(); + index_t* arg_out_data = arg_out.data_ptr(); + + int64_t M = grad_out.size(0); + int64_t K = grad_out.size(1); + auto grad = at::empty({M, K}, grad_out.options()); + scalar_t* grad_data = grad.data_ptr(); + + at::parallel_for(0, M, 1, [&](int64_t begin, int64_t end) { + for (const auto m : c10::irange(begin, end)) { + scalar_t* grad_out_ptr = grad_out_data + m * K; + scalar_t* grad_ptr = grad_data + m * K; + index_t* arg_out_ptr = arg_out_data + m * K; + + for (const auto k : c10::irange(K)) { + if (arg_out_ptr[k] == index_t(nnz)) { + grad_ptr[k] = scalar_t(0); + } else { + grad_ptr[k] = values_data[arg_out_ptr[k]] * grad_out_ptr[k]; + } + } + } + }); + + // scatter_add, consider to parallel this with atomic + for (const auto m : c10::irange(M)) { + for (const auto k : c10::irange(K)) { + index_t ind = arg_out_data[m * K + k]; + if (ind != index_t(nnz)) { + index_t col = col_data[ind]; + grad_weight_data[col * K + k] += grad_data[m * K + k]; + } + } + } +} + +void spmm_reduce_kernel( + const Tensor& out, + const Tensor& crow_indices, + const Tensor& col_indices, + const Tensor& values, + const Tensor& weight, + SPMM_REDUCE_OP reduce_op) { + AT_DISPATCH_FLOATING_TYPES_AND(ScalarType::BFloat16, values.scalar_type(), "spmm_reduce_kernel", [&]() { + AT_DISPATCH_INDEX_TYPES(col_indices.scalar_type(), "spmm_reduce_indices", [&]() { + AT_DISPATCH_REDUCTION_TYPES(reduce_op, [&]() { + spmm_reduce_kernel_impl( + out, crow_indices, col_indices, values, weight); + }); + }); + }); +} + +void spmm_reduce_arg_kernel( + const Tensor& out, + const Tensor& arg_out, + const Tensor& crow_indices, + const Tensor& col_indices, + const Tensor& values, + const Tensor& weight, + SPMM_REDUCE_OP reduce_op) { + AT_DISPATCH_FLOATING_TYPES_AND(ScalarType::BFloat16, values.scalar_type(), "spmm_reduce_kernel", [&]() { + AT_DISPATCH_INDEX_TYPES(col_indices.scalar_type(), "spmm_reduce_indices", [&]() { + AT_DISPATCH_REDUCTION_TYPES(reduce_op, [&]() { + spmm_reduce_arg_kernel_impl( + out, arg_out, crow_indices, col_indices, values, weight); + }); + }); + }); +} + +void spmm_reduce_backward_input_kernel( + const Tensor& grad_input, + const Tensor& grad_out, + const Tensor& crow_indices, + const Tensor& col_indices, + const Tensor& weight, + const Tensor& row_indices, + SPMM_REDUCE_OP reduce_op) { + TORCH_CHECK(reduce_op == SPMM_SUM || reduce_op == SPMM_MEAN); + AT_DISPATCH_FLOATING_TYPES_AND(ScalarType::BFloat16, weight.scalar_type(), "spmm_reduce_backward_input_kernel", [&]() { + AT_DISPATCH_INDEX_TYPES(col_indices.scalar_type(), "spmm_reduce_backward_input_indices", [&]() { + AT_DISPATCH_REDUCTION_TYPES(reduce_op, [&]() { + spmm_reduce_backward_input_kernel_impl( + grad_input, grad_out, crow_indices, col_indices, weight, row_indices); + }); + }); + }); +} 
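For reference, the reduction semantics implemented by the forward spmm_reduce kernels above (and differentiated by the backward kernels) can be sketched as a small standalone function. This is an illustrative sketch only: csr_spmm_reduce_reference is a made-up name, and the empty-row behaviour shown here is an assumption, not taken from the patch.

// Minimal reference sketch (not part of this patch).
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

enum class Reduce { Sum, Mean, Max, Min };

// (crow, col, val) describe an M x N CSR matrix, weight is N x K row-major,
// out is M x K row-major: out[m][k] = reduce over nonzeros e of row m of
// val[e] * weight[col[e]][k].
void csr_spmm_reduce_reference(
    std::vector<float>& out, Reduce op,
    const std::vector<int64_t>& crow, const std::vector<int64_t>& col,
    const std::vector<float>& val, const std::vector<float>& weight,
    int64_t M, int64_t K) {
  for (int64_t m = 0; m < M; ++m) {
    const int64_t start = crow[m], end = crow[m + 1];
    for (int64_t k = 0; k < K; ++k) {
      // Identity element for the chosen reduction.
      float acc = (op == Reduce::Max) ? -std::numeric_limits<float>::infinity()
                : (op == Reduce::Min) ? std::numeric_limits<float>::infinity()
                                      : 0.f;
      for (int64_t e = start; e < end; ++e) {
        const float v = val[e] * weight[col[e] * K + k];
        acc = (op == Reduce::Max) ? std::max(acc, v)
            : (op == Reduce::Min) ? std::min(acc, v)
                                  : acc + v;
      }
      if (op == Reduce::Mean && end > start) acc /= float(end - start);
      if (end == start) acc = 0.f;  // assumption: empty rows yield 0
      out[m * K + k] = acc;
    }
  }
}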
+ +void spmm_reduce_backward_input_arg_kernel( + const Tensor& grad_input, + const Tensor& grad_out, + const Tensor& col_indices, + const Tensor& weight, + const Tensor& arg_out, + SPMM_REDUCE_OP reduce_op) { + TORCH_CHECK(reduce_op == SPMM_MAX || reduce_op == SPMM_MIN); + AT_DISPATCH_FLOATING_TYPES_AND(ScalarType::BFloat16, weight.scalar_type(), "spmm_reduce_backward_input_arg_kernel", [&]() { + AT_DISPATCH_INDEX_TYPES(col_indices.scalar_type(), "spmm_reduce_backward_input_arg_indices", [&]() { + spmm_reduce_backward_input_arg_kernel_impl( + grad_input, grad_out, col_indices, weight, arg_out); + }); + }); +} + +void spmm_reduce_update_values_kernel( + const Tensor& updated_values, + const Tensor& values, + const Tensor& crow_indices, + const Tensor& row_indices) { + AT_DISPATCH_FLOATING_TYPES_AND(ScalarType::BFloat16, values.scalar_type(), "spmm_reduce_update_values_kernel", [&]() { + AT_DISPATCH_INDEX_TYPES(crow_indices.scalar_type(), "spmm_reduce_update_values_indices", [&]() { + spmm_reduce_update_values_kernel_impl( + updated_values, values, crow_indices, row_indices); + }); + }); +} + +void spmm_reduce_backward_weight_kernel( + const Tensor& grad_weight, + const Tensor& grad_out, + const Tensor& crow_indices, + const Tensor& values, + const Tensor& row_indices, + const Tensor& ccol_indices, + const Tensor& csr2csc, + SPMM_REDUCE_OP reduce_op) { + TORCH_CHECK(reduce_op == SPMM_SUM || reduce_op == SPMM_MEAN); + // need to permute row_indices to CSC order + auto row = row_indices.index_select(0, csr2csc); + + Tensor val; + if (reduce_op == SPMM_MEAN) { + // for reduce type "mean", need to update the values + // with rowcount for each of the nonzero element. + Tensor updated_values = at::empty(values.sizes(), values.options()); + spmm_reduce_update_values_kernel(updated_values, values, crow_indices, row_indices); + val = updated_values.index_select(0, csr2csc); + } else { + val = values.index_select(0, csr2csc); + } + + if (reduce_op == SPMM_SUM || reduce_op == SPMM_MEAN) { + spmm_reduce_kernel(grad_weight, ccol_indices, row, val, grad_out, SPMM_SUM); + } +} + +void spmm_reduce_backward_weight_arg_kernel( + const Tensor& grad_weight, + const Tensor& grad_out, + const Tensor& col_indices, + const Tensor& values, + const Tensor& arg_out, + SPMM_REDUCE_OP reduce_op) { + TORCH_CHECK(reduce_op == SPMM_MAX || reduce_op == SPMM_MIN); + AT_DISPATCH_FLOATING_TYPES_AND(ScalarType::BFloat16, values.scalar_type(), "spmm_reduce_backward_weight_arg_kernel", [&]() { + AT_DISPATCH_INDEX_TYPES(col_indices.scalar_type(), "spmm_reduce_backward_weight_arg_indices", [&]() { + spmm_reduce_backward_weight_arg_kernel_impl( + grad_weight, grad_out, col_indices, values, arg_out); + }); }); } } // anonymous namespace -REGISTER_DISPATCH(spmm_sum_stub, &spmm_sum_kernel); +REGISTER_DISPATCH(spmm_reduce_stub, &spmm_reduce_kernel); +REGISTER_DISPATCH(spmm_reduce_arg_stub, &spmm_reduce_arg_kernel); +REGISTER_DISPATCH(spmm_reduce_backward_input_stub, &spmm_reduce_backward_input_kernel); +REGISTER_DISPATCH(spmm_reduce_backward_input_arg_stub, &spmm_reduce_backward_input_arg_kernel); +REGISTER_DISPATCH(spmm_reduce_backward_weight_stub, &spmm_reduce_backward_weight_kernel); +REGISTER_DISPATCH(spmm_reduce_backward_weight_arg_stub, &spmm_reduce_backward_weight_arg_kernel); }} // at::native diff --git a/aten/src/ATen/native/cpu/SpmmReduceKernel.h b/aten/src/ATen/native/cpu/SpmmReduceKernel.h new file mode 100644 index 000000000000..cbd26cfbf4ba --- /dev/null +++ b/aten/src/ATen/native/cpu/SpmmReduceKernel.h @@ -0,0 +1,45 @@ 
+#pragma once + +#include +#include + +namespace at { namespace native { + +enum SPMM_REDUCE_OP {SPMM_SUM, SPMM_MAX, SPMM_MIN, SPMM_MEAN}; + +using spmm_reduce_fn = void(*)(const Tensor&, const Tensor&, const Tensor&, const Tensor&, const Tensor&, SPMM_REDUCE_OP op); +using spmm_reduce_arg_fn = void(*)(const Tensor&, const Tensor&, const Tensor&, const Tensor&, const Tensor&, const Tensor&, SPMM_REDUCE_OP op); +using spmm_reduce_backward_input_fn = void(*)(const Tensor&, const Tensor&, const Tensor&, const Tensor&, const Tensor&, const Tensor&, SPMM_REDUCE_OP op); +using spmm_reduce_backward_input_arg_fn = void(*)(const Tensor&, const Tensor&, const Tensor&, const Tensor&, const Tensor&, SPMM_REDUCE_OP op); +using spmm_reduce_backward_weight_fn = void(*)(const Tensor&, const Tensor&, const Tensor&, const Tensor&, const Tensor&, const Tensor&, const Tensor&, SPMM_REDUCE_OP op); + +DECLARE_DISPATCH(spmm_reduce_fn, spmm_reduce_stub); +DECLARE_DISPATCH(spmm_reduce_arg_fn, spmm_reduce_arg_stub); +DECLARE_DISPATCH(spmm_reduce_backward_input_fn, spmm_reduce_backward_input_stub); +DECLARE_DISPATCH(spmm_reduce_backward_input_arg_fn, spmm_reduce_backward_input_arg_stub); +DECLARE_DISPATCH(spmm_reduce_backward_weight_fn, spmm_reduce_backward_weight_stub); +DECLARE_DISPATCH(spmm_reduce_backward_input_arg_fn, spmm_reduce_backward_weight_arg_stub); + +#define AT_DISPATCH_REDUCTION_TYPES(op, ...) \ + [&] { \ + switch (op) { \ + case SPMM_SUM: { \ + static constexpr SPMM_REDUCE_OP reduce = SPMM_SUM; \ + return __VA_ARGS__(); \ + } \ + case SPMM_MEAN: { \ + static constexpr SPMM_REDUCE_OP reduce = SPMM_MEAN; \ + return __VA_ARGS__(); \ + } \ + case SPMM_MIN: { \ + static constexpr SPMM_REDUCE_OP reduce = SPMM_MIN; \ + return __VA_ARGS__(); \ + } \ + case SPMM_MAX: { \ + static constexpr SPMM_REDUCE_OP reduce = SPMM_MAX; \ + return __VA_ARGS__(); \ + } \ + } \ + }() + +}} // at::native diff --git a/aten/src/ATen/native/cpu/TensorCompareKernel.cpp b/aten/src/ATen/native/cpu/TensorCompareKernel.cpp index 903fef2f0331..1547249b7018 100644 --- a/aten/src/ATen/native/cpu/TensorCompareKernel.cpp +++ b/aten/src/ATen/native/cpu/TensorCompareKernel.cpp @@ -83,8 +83,7 @@ static inline void compare_base_kernel(const Tensor& result1, const Tensor& resu auto* result1_data_bytes = data[0]; auto* result2_data_bytes = data[1]; const auto* self_data_bytes = data[2]; - for (const auto i : c10::irange(n)) { - (void)i; //Suppress unused variable warning + for (const auto i C10_UNUSED : c10::irange(n)) { f((scalar_t*)result1_data_bytes, (scalar_t_2*)result2_data_bytes, (scalar_t*)self_data_bytes, @@ -245,8 +244,7 @@ static void mode_kernel_impl( std::vector> elements(self_dim_size); - for (const auto k : c10::irange(n)) { - (void)k; //Suppress unused variable warning + for (const auto k C10_UNUSED : c10::irange(n)) { scalar_t* values_data = (scalar_t*)values_data_bytes; int64_t* indices_data = (int64_t*)indices_data_bytes; const scalar_t* self_data = (scalar_t*)self_data_bytes; diff --git a/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp b/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp index a53587e56da4..8a0534fd3da5 100644 --- a/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp +++ b/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -203,13 +204,18 @@ static void angle_kernel(TensorIteratorBase& iter) { // NB: Ignores the negative bit on tensors void conj_kernel(TensorIteratorBase& iter) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( - kBool, kBFloat16, kHalf, 
kComplexHalf, iter.common_dtype(), "conj_cpu", [&]() { - cpu_kernel_vec( - iter, - [=](scalar_t a) -> scalar_t { return conj_impl(a); }, - [=](Vectorized a) { return a.conj(); }); - }); + AT_DISPATCH_SWITCH(iter.common_dtype(), "conj_cpu", + AT_DISPATCH_CASE_ALL_TYPES_AND3(kBool, kBFloat16, kHalf, [&] { + // conj is a no-op for non-complex types + direct_copy_kernel(iter); + }) + AT_DISPATCH_CASE_COMPLEX_TYPES_AND(kComplexHalf, [&] { + cpu_kernel_vec( + iter, + [=](scalar_t a) -> scalar_t { return conj_impl(a); }, + [=](Vectorized a) { return a.conj(); }); + }) + ); } static void bitwise_not_kernel(TensorIteratorBase& iter) { diff --git a/aten/src/ATen/native/cpu/Unfold2d.cpp b/aten/src/ATen/native/cpu/Unfold2d.cpp index 9bfa9ac8c6ab..fae56c7ebc2b 100644 --- a/aten/src/ATen/native/cpu/Unfold2d.cpp +++ b/aten/src/ATen/native/cpu/Unfold2d.cpp @@ -354,8 +354,7 @@ static void unfolded2d_copy_channels_last( int64_t x = 0; data_index_init(start, y, output_height, x, output_width); - for (const auto k : c10::irange(start, end)) { - (void)k; // Suppress unused variable warning + for (const auto k C10_UNUSED: c10::irange(start, end)) { scalar_t* dst = finput_data + y * output_width * kH * kW * n_input_plane + x * kH * kW * n_input_plane; scalar_t* src = input_data; diff --git a/aten/src/ATen/native/cpu/UnfoldBackwardKernel.cpp b/aten/src/ATen/native/cpu/UnfoldBackwardKernel.cpp index 8cfe6674906e..aa5dfb014380 100644 --- a/aten/src/ATen/native/cpu/UnfoldBackwardKernel.cpp +++ b/aten/src/ATen/native/cpu/UnfoldBackwardKernel.cpp @@ -1,5 +1,6 @@ #define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include +#include #include #include #include @@ -65,8 +66,7 @@ void _unfold_backward_internal_kernel( int64_t grad_in_dim_stride, int64_t grad_in_last_dim_stride, int64_t grad_in_dim_size, - int64_t grad_out_dim_stride, - bool is_step_ge_size + int64_t grad_out_dim_stride ) { if (iter.numel() == 0) { return; @@ -77,55 +77,32 @@ void _unfold_backward_internal_kernel( auto* RESTRICT grad_in_ptr = data[1]; auto* RESTRICT idx_dim_ptr = data[2]; - if (is_step_ge_size) { - auto* RESTRICT idx_last_dim_ptr = data[3]; + for (const auto elem C10_UNUSED : c10::irange(nelems)) { + auto* RESTRICT grad_out_data = reinterpret_cast(grad_out_ptr); + auto* RESTRICT grad_in_data = reinterpret_cast(grad_in_ptr); - for (const auto elem : c10::irange(nelems)) { - (void)elem; //Suppress unused variable warning - auto* RESTRICT grad_out_data = reinterpret_cast(grad_out_ptr); - auto* RESTRICT grad_in_data = reinterpret_cast(grad_in_ptr); + auto idx_dim = *reinterpret_cast(idx_dim_ptr); - auto idx_dim = *reinterpret_cast(idx_dim_ptr); - auto idx_last_dim = *reinterpret_cast(idx_last_dim_ptr); + // left_fold potentially intersecting with idx_dim + // is either (idx_dim - size) / step or the next integer. + int64_t left_fold_idx = (idx_dim > size) ? (idx_dim - size) / step : 0; + if (!(left_fold_idx * step <= idx_dim && idx_dim < left_fold_idx * step + size)) { + ++left_fold_idx; + } - auto grad_out_idx_dim = idx_dim * step + idx_last_dim; - grad_out_data[grad_out_idx_dim * grad_out_dim_stride] = *grad_in_data; + auto right_fold_idx = idx_dim / step; + right_fold_idx = (right_fold_idx >= grad_in_dim_size) + ? 
(grad_in_dim_size - 1) : right_fold_idx; - grad_out_ptr += strides[0]; - grad_in_ptr += strides[1]; - idx_dim_ptr += strides[2]; - idx_last_dim_ptr += strides[3]; - } - } - else { - for (const auto elem : c10::irange(nelems)) { - (void)elem; //Suppress unused variable warning - auto* RESTRICT grad_out_data = reinterpret_cast(grad_out_ptr); - auto* RESTRICT grad_in_data = reinterpret_cast(grad_in_ptr); - - auto idx_dim = *reinterpret_cast(idx_dim_ptr); - - // left_fold potentially intersecting with idx_dim - // is either (idx_dim - size) / step or the next integer. - int64_t left_fold_idx = (idx_dim > size) ? (idx_dim - size) / step : 0; - if (!(left_fold_idx * step <= idx_dim && idx_dim < left_fold_idx * step + size)) { - ++left_fold_idx; - } - - auto right_fold_idx = idx_dim / step; - right_fold_idx = (right_fold_idx >= grad_in_dim_size) - ? (grad_in_dim_size - 1) : right_fold_idx; - - for (auto fold_idx = left_fold_idx; fold_idx <= right_fold_idx; ++fold_idx) { - auto idx_last_dim = idx_dim - fold_idx * step; - *grad_out_data += grad_in_data[fold_idx * grad_in_dim_stride - + idx_last_dim * grad_in_last_dim_stride]; - } - - grad_out_ptr += strides[0]; - grad_in_ptr += strides[1]; - idx_dim_ptr += strides[2]; + for (auto fold_idx = left_fold_idx; fold_idx <= right_fold_idx; ++fold_idx) { + auto idx_last_dim = idx_dim - fold_idx * step; + *grad_out_data += grad_in_data[fold_idx * grad_in_dim_stride + + idx_last_dim * grad_in_last_dim_stride]; } + + grad_out_ptr += strides[0]; + grad_in_ptr += strides[1]; + idx_dim_ptr += strides[2]; } }; @@ -149,16 +126,8 @@ void unfold_backward_cpu_kernel( auto grad_out_dim_stride = ensure_nonempty_stride(grad_out, dim); - auto is_step_ge_size = (step >= size); - - TensorIterator iter = - is_step_ge_size ? - _make_unfold_backward_iter_over_grad_in( - grad_out, grad_in, dim, size, step - ) : - _make_unfold_backward_iter_over_grad_out( - grad_out, grad_in, dim, size, step - ); + TensorIterator iter = _make_unfold_backward_iter_over_grad_out( + grad_out, grad_in, dim, size, step); AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( at::ScalarType::Half, at::ScalarType::Bool, at::ScalarType::BFloat16, @@ -171,8 +140,7 @@ void unfold_backward_cpu_kernel( grad_in_dim_stride, grad_in_last_dim_stride, grad_in_dim_size, - grad_out_dim_stride, - is_step_ge_size + grad_out_dim_stride ); } ); diff --git a/aten/src/ATen/native/cpu/UpSampleKernel.cpp b/aten/src/ATen/native/cpu/UpSampleKernel.cpp index cfc931862372..8d418c264504 100644 --- a/aten/src/ATen/native/cpu/UpSampleKernel.cpp +++ b/aten/src/ATen/native/cpu/UpSampleKernel.cpp @@ -471,12 +471,12 @@ void cpu_upsample_linear_channels_last( TORCH_CHECK(channels > 0, "expected input and output channels greater than 0 but got ", channels); int64_t output_slice_size = output_depth * output_height * output_width * channels; - using accscalar_t = at::acc_type; + using opmath_t = at::opmath_type; using Vec = vec::Vectorized; auto loop2d = [&](int64_t begin, int64_t end) { - const scalar_t height_scale = area_pixel_compute_scale( + const auto height_scale = area_pixel_compute_scale( input_height, output_height, align_corners, scales[0]); - const scalar_t width_scale = area_pixel_compute_scale( + const auto width_scale = area_pixel_compute_scale( input_width, output_width, align_corners, scales[1]); auto input_indexr = [=](int64_t n, int64_t h, int64_t w) { @@ -486,7 +486,7 @@ void cpu_upsample_linear_channels_last( // NOLINTNEXTLINE(cppcoreguidelines-init-variables) int64_t ih0, ih1, iw0, iw1; - scalar_t h0lambda, h1lambda, w0lambda, 
w1lambda; + opmath_t h0lambda, h1lambda, w0lambda, w1lambda; for (const auto n : c10::irange(begin, end)) { for (const auto oh : c10::irange(output_height)) { compute_source_index_and_lambda( @@ -501,10 +501,10 @@ void cpu_upsample_linear_channels_last( scalar_t* i01 = input_indexr(n, ih0, iw1); scalar_t* i10 = input_indexr(n, ih1, iw0); scalar_t* i11 = input_indexr(n, ih1, iw1); - accscalar_t w00 = h0lambda * w0lambda; - accscalar_t w01 = h0lambda * w1lambda; - accscalar_t w10 = h1lambda * w0lambda; - accscalar_t w11 = h1lambda * w1lambda; + opmath_t w00 = h0lambda * w0lambda; + opmath_t w01 = h0lambda * w1lambda; + opmath_t w10 = h1lambda * w0lambda; + opmath_t w11 = h1lambda * w1lambda; int64_t size = channels; int64_t d = 0; @@ -521,11 +521,11 @@ void cpu_upsample_linear_channels_last( }; auto loop3d = [&](int64_t begin, int64_t end) { - const scalar_t depth_scale = area_pixel_compute_scale( + const auto depth_scale = area_pixel_compute_scale( input_depth, output_depth, align_corners, scales[0]); - const scalar_t height_scale = area_pixel_compute_scale( + const auto height_scale = area_pixel_compute_scale( input_height, output_height, align_corners, scales[1]); - const scalar_t width_scale = area_pixel_compute_scale( + const auto width_scale = area_pixel_compute_scale( input_width, output_width, align_corners, scales[2]); auto input_indexr = [=](int64_t n, int64_t d, int64_t h, int64_t w) { @@ -536,7 +536,7 @@ void cpu_upsample_linear_channels_last( // NOLINTNEXTLINE(cppcoreguidelines-init-variables) int64_t id0, id1, ih0, ih1, iw0, iw1; - scalar_t d0lambda, d1lambda, h0lambda, h1lambda, w0lambda, w1lambda; + opmath_t d0lambda, d1lambda, h0lambda, h1lambda, w0lambda, w1lambda; for (const auto n : c10::irange(begin, end)) { for (const auto od : c10::irange(output_depth)) { compute_source_index_and_lambda( @@ -559,14 +559,14 @@ void cpu_upsample_linear_channels_last( scalar_t* i101 = input_indexr(n, id1, ih0, iw1); scalar_t* i110 = input_indexr(n, id1, ih1, iw0); scalar_t* i111 = input_indexr(n, id1, ih1, iw1); - accscalar_t w000 = d0lambda * h0lambda * w0lambda; - accscalar_t w001 = d0lambda * h0lambda * w1lambda; - accscalar_t w010 = d0lambda * h1lambda * w0lambda; - accscalar_t w011 = d0lambda * h1lambda * w1lambda; - accscalar_t w100 = d1lambda * h0lambda * w0lambda; - accscalar_t w101 = d1lambda * h0lambda * w1lambda; - accscalar_t w110 = d1lambda * h1lambda * w0lambda; - accscalar_t w111 = d1lambda * h1lambda * w1lambda; + opmath_t w000 = d0lambda * h0lambda * w0lambda; + opmath_t w001 = d0lambda * h0lambda * w1lambda; + opmath_t w010 = d0lambda * h1lambda * w0lambda; + opmath_t w011 = d0lambda * h1lambda * w1lambda; + opmath_t w100 = d1lambda * h0lambda * w0lambda; + opmath_t w101 = d1lambda * h0lambda * w1lambda; + opmath_t w110 = d1lambda * h1lambda * w0lambda; + opmath_t w111 = d1lambda * h1lambda * w1lambda; int64_t size = channels; int64_t d = 0; @@ -613,8 +613,7 @@ struct HelperInterpBase { auto new_shape = std::vector(ndims, 1); new_shape[reshape_dim] = output_size; - for (const auto j : c10::irange(interp_size)) { - (void)j; //Suppress unused variable warning + for (const auto j C10_UNUSED : c10::irange(interp_size)) { output.emplace_back(empty(new_shape, CPU(c10::CppTypeToScalarType()))); output.emplace_back(empty(new_shape, CPU(output_type))); } @@ -735,8 +734,7 @@ struct HelperInterpNearest : public HelperInterpBase { auto new_shape = std::vector(ndims, 1); new_shape[reshape_dim] = output_size; - for (const auto j : c10::irange(interp_size)) { - (void)j; //Suppress 
unused variable warning + for (const auto j C10_UNUSED : c10::irange(interp_size)) { output.emplace_back(empty(new_shape, CPU(c10::CppTypeToScalarType()))); // Defines weights for consistency, but not used output.emplace_back(at::ones(new_shape, CPU(output_type))); @@ -767,7 +765,6 @@ struct HelperInterpNearest : public HelperInterpBase { AT_DISPATCH_FLOATING_TYPES_AND( ScalarType::BFloat16, scalar_type, "compute_indices_weights_nearest", [&] { - scalar_t scale = area_pixel_compute_scale(input_size, output_size, align_corners, opt_scale); auto input_index_ptr = output[0].data_ptr(); @@ -778,10 +775,11 @@ struct HelperInterpNearest : public HelperInterpBase { // index_f32 = (output_index) * scale // input_index = floor(index_f32) // Same as OpenCV INTER_NEAREST - + using opmath_t = at::opmath_type; for (const auto i : c10::irange(output_size)) { - const scalar_t real_input_index = area_pixel_compute_source_index( - scale, i, /*align_corners=*/true, /*cubic=*/false); + const auto real_input_index = + area_pixel_compute_source_index( + scale, i, /*align_corners=*/true, /*cubic=*/false); input_index = static_cast(floorf(real_input_index)); input_index_ptr[i] = static_cast(std::min(input_index, input_size - 1)) * stride; } @@ -818,7 +816,6 @@ struct HelperInterpNearestExact : public HelperInterpNearest { AT_DISPATCH_FLOATING_TYPES( scalar_type, "compute_indices_weights_nearest", [&] { - scalar_t scale = area_pixel_compute_scale(input_size, output_size, align_corners, opt_scale); auto input_index_ptr = output[0].data_ptr(); @@ -829,10 +826,11 @@ struct HelperInterpNearestExact : public HelperInterpNearest { // index_f32 = (output_index + 0.5) * scale - 0.5 // input_index = round(index_f32) // Same as Pillow and Scikit-Image/Scipy ndi.zoom - + using opmath_t = at::opmath_type; for (const auto i : c10::irange(output_size)) { - const scalar_t real_input_index = area_pixel_compute_source_index( - scale, i, /*align_corners=*/align_corners, /*cubic=*/false); + const auto real_input_index = + area_pixel_compute_source_index( + scale, i, /*align_corners=*/align_corners, /*cubic=*/false); input_index = static_cast(floorf(real_input_index + 0.5)); input_index_ptr[i] = static_cast(std::min(input_index, input_size - 1)) * stride; } @@ -865,10 +863,8 @@ struct HelperInterpLinear : public HelperInterpBase { std::vector output; HelperInterpLinear::init_indices_weights( scalar_type, output, output_size, ndims, reshape_dim, HelperInterpLinear::interp_size); - AT_DISPATCH_FLOATING_TYPES_AND( ScalarType::BFloat16, scalar_type, "compute_indices_weights_linear", [&] { - scalar_t scale = area_pixel_compute_scale(input_size, output_size, align_corners, opt_scale); auto input_index0_ptr = output[0].data_ptr(); @@ -970,7 +966,6 @@ struct HelperInterpCubic : public HelperInterpBase { AT_DISPATCH_FLOATING_TYPES_AND( ScalarType::BFloat16, scalar_type, "compute_indices_weights_cubic", [&] { - scalar_t scale = area_pixel_compute_scale(input_size, output_size, align_corners, opt_scale); int64_t input_index; @@ -980,11 +975,11 @@ struct HelperInterpCubic : public HelperInterpBase { int64_t * idx_ptr; scalar_t * wt_ptr; - + using opmath_t = at::opmath_type; for (const auto i : c10::irange(output_size)) { - - const scalar_t real_input_index = area_pixel_compute_source_index( - scale, i, align_corners, /*cubic=*/true); + const auto real_input_index = + area_pixel_compute_source_index( + scale, i, align_corners, /*cubic=*/true); input_index = static_cast(floorf(real_input_index)); get_cubic_upsample_coefficients(coeffs, 
real_input_index - input_index); @@ -1184,7 +1179,6 @@ void _separable_upsample_generic_Nd_kernel_impl_single_dim( int interp_size = F::interp_size; auto input_scalar_type = input.scalar_type(); - if (interp_size == 1 && input_scalar_type == at::ScalarType::Byte) { // nearest also supports uint8 tensor, but we have to use float // with compute_indices_weights @@ -1266,12 +1260,26 @@ void _upsample_nearest_exact1d_kernel_impl( output, input, false, {scales_w}); } +int _use_vectorized_kernel_cond( + const Tensor& output, + const Tensor& input) { + // This condition is used to know whether we should dispatch to a vectorized + // kernel, or to the more general upsample_generic_Nd_kernel_impl(). For now, + // the vectorized kernels are only optimized for channels_last and when C >= 4 + // (shape = NCHW). For a very wide range of use-cases (typically image or mask + // resizing where we have C < 4), using upsample_generic_Nd_kernel_impl() is + // actually faster. On top of that, benchmarks showed that this also depends on + // the *output* size (output_H + output_W), for both upsampling and + // downsampling. The current 128 threshold was determined through benchmarks. + return ((input.is_contiguous(at::MemoryFormat::ChannelsLast)) && (input.size(-3) > 3)) || ((output.size(-2) + output.size(-1)) <= 128); +} + void upsample_nearest2d_kernel_impl( const Tensor& output, const Tensor& input, c10::optional scales_h, c10::optional scales_w) { - if (input.is_contiguous(at::MemoryFormat::ChannelsLast)) { + if (_use_vectorized_kernel_cond(output, input)) { AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Byte, at::ScalarType::BFloat16, input.scalar_type(), "upsample_nearest2d_channels_last", [&] { cpu_upsample_nearest_channels_last(output, input, {scales_h, scales_w}); @@ -1287,7 +1295,7 @@ void _upsample_nearest_exact2d_kernel_impl( const Tensor& input, c10::optional scales_h, c10::optional scales_w) { - if (input.is_contiguous(at::MemoryFormat::ChannelsLast)) { + if (_use_vectorized_kernel_cond(output, input)) { AT_DISPATCH_FLOATING_TYPES_AND(at::ScalarType::Byte, input.scalar_type(), "upsample_nearest2d_channels_last", [&] { cpu_upsample_nearest_channels_last(output, input, {scales_h, scales_w}); }); @@ -1346,8 +1354,12 @@ void upsample_bilinear2d_kernel_impl( c10::optional scales_h, c10::optional scales_w) { - // Temporarily dispatch to original channels last implementation - if (input.is_contiguous(at::MemoryFormat::ChannelsLast)) { + // See note above about _use_vectorized_kernel_cond(output, input). The extra cond is present + // because benchmarks showed that with only 1 thread, images (C == 3) were + // slightly faster with the vectorized kernel than with the generic one. + // That's not the case for masks though (C == 1), which strongly benefit from + // using the generic kernel.
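+ // Quick recap of the dispatch rule (descriptive only, no extra logic): the
+ // vectorized channels_last kernel is taken when (a) the input is channels_last
+ // contiguous with C > 3, or (b) output_H + output_W <= 128, or (c) for
+ // bilinear only, when running single-threaded with C == 3. For example, a
+ // channels_last (N, 64, 224, 224) input satisfies (a), while a C == 1 mask
+ // upsampled to 1024 x 1024 satisfies none of them and uses the generic kernel.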
+ if ((_use_vectorized_kernel_cond(output, input)) || (at::get_num_threads() == 1 && input.size(-3) == 3)) { AT_DISPATCH_FLOATING_TYPES_AND(at::ScalarType::BFloat16, input.scalar_type(), "upsample_bilinear2d_channels_last", [&] { cpu_upsample_linear_channels_last(output, input, align_corners, {scales_h, scales_w}); }); diff --git a/aten/src/ATen/native/cpu/UpSampleMoreKernel.cpp b/aten/src/ATen/native/cpu/UpSampleMoreKernel.cpp index a26cef72bb10..c73e0249dee8 100644 --- a/aten/src/ATen/native/cpu/UpSampleMoreKernel.cpp +++ b/aten/src/ATen/native/cpu/UpSampleMoreKernel.cpp @@ -441,9 +441,9 @@ void cpu_upsample_linear_backward_channels_last( int64_t input_width = input_sizes[ndim - 1]; int64_t output_width = output_sizes[ndim - 1]; - using accscalar_t = at::acc_type; + using opmath_t = at::opmath_type; using Vec = vec::Vectorized; - auto acc = [](scalar_t* gin, scalar_t* gout, accscalar_t w, int64_t size) { + auto acc = [](scalar_t* gin, scalar_t* gout, opmath_t w, int64_t size) { int64_t d = 0; for (; d < size - (size % Vec::size()); d += Vec::size()) { Vec gin_vec = Vec::loadu(gin + d) + Vec(w) * Vec::loadu(gout + d); diff --git a/aten/src/ATen/native/cpu/WeightNormKernel.cpp b/aten/src/ATen/native/cpu/WeightNormKernel.cpp index 9dc6b5285805..8ab7226d2127 100644 --- a/aten/src/ATen/native/cpu/WeightNormKernel.cpp +++ b/aten/src/ATen/native/cpu/WeightNormKernel.cpp @@ -1,6 +1,8 @@ -#include +#define TORCH_ASSERT_NO_OPERATORS +#include #include +#include #include #include #include @@ -13,10 +15,10 @@ namespace { template void weight_norm_first_dim_kernel( - Tensor& w, - Tensor& norm, - const Tensor& v, - const Tensor& g, + TensorBase& w, + TensorBase& norm, + const TensorBase& v, + const TensorBase& g, int64_t M, int64_t N) { const auto v_data = v.data_ptr(); const auto g_data = g.data_ptr(); @@ -121,10 +123,10 @@ inline void apply_norm_per_row( template void weight_norm_last_dim_kernel( - Tensor& w, - Tensor& norm, - const Tensor& v, - const Tensor& g, + TensorBase& w, + TensorBase& norm, + const TensorBase& v, + const TensorBase& g, int64_t M, int64_t N) { const auto v_data = v.data_ptr(); const auto g_data = g.data_ptr(); @@ -132,7 +134,7 @@ void weight_norm_last_dim_kernel( auto norm_data = norm.data_ptr(); int num_threads = at::get_num_threads(); - Tensor buffer = at::empty({num_threads, N}, norm.options()).zero_(); + TensorBase buffer = at::detail::empty_cpu({num_threads, N}, norm.options()).zero_(); auto buffer_data = buffer.data_ptr(); // vertical parallel reduction @@ -173,12 +175,12 @@ void weight_norm_last_dim_kernel( template void weight_norm_backward_first_dim_kernel( - Tensor& grad_v, - Tensor& grad_g, - const Tensor& grad_w, - const Tensor& saved_v, - const Tensor& saved_g, - const Tensor& saved_norm, + TensorBase& grad_v, + TensorBase& grad_g, + const TensorBase& grad_w, + const TensorBase& saved_v, + const TensorBase& saved_g, + const TensorBase& saved_norm, int64_t M, int64_t N) { const auto grad_w_data = grad_w.data_ptr(); const auto saved_v_data = saved_v.data_ptr(); @@ -314,12 +316,12 @@ inline void apply_per_row_backward( template void weight_norm_backward_last_dim_kernel( - Tensor& grad_v, - Tensor& grad_g, - const Tensor& grad_w, - const Tensor& saved_v, - const Tensor& saved_g, - const Tensor& saved_norm, + TensorBase& grad_v, + TensorBase& grad_g, + const TensorBase& grad_w, + const TensorBase& saved_v, + const TensorBase& saved_g, + const TensorBase& saved_norm, int64_t M, int64_t N) { const auto grad_w_data = grad_w.data_ptr(); const auto saved_v_data = 
saved_v.data_ptr(); @@ -335,7 +337,7 @@ void weight_norm_backward_last_dim_kernel( // int num_threads = at::get_num_threads(); int K = std::max(3, num_threads); - Tensor buffer = at::empty({K, N}, saved_norm.options()).zero_(); + TensorBase buffer = at::detail::empty_cpu({K, N}, saved_norm.options()).zero_(); auto buffer_data = buffer.data_ptr(); // vertical parallel reduction @@ -391,10 +393,10 @@ void weight_norm_backward_last_dim_kernel( } void weight_norm_kernel( - Tensor& w, - Tensor& norm, - const Tensor& v, - const Tensor& g, + TensorBase& w, + TensorBase& norm, + const TensorBase& v, + const TensorBase& g, int64_t dim) { TORCH_INTERNAL_ASSERT(dim == 0 || dim == v.dim() - 1, "fused kernels can only be applied for first or last dim"); @@ -414,12 +416,12 @@ void weight_norm_kernel( } void weight_norm_backward_kernel( - Tensor& grad_v, - Tensor& grad_g, - const Tensor& grad_w, - const Tensor& saved_v, - const Tensor& saved_g, - const Tensor& saved_norm, + TensorBase& grad_v, + TensorBase& grad_g, + const TensorBase& grad_w, + const TensorBase& saved_v, + const TensorBase& saved_g, + const TensorBase& saved_norm, int64_t dim) { TORCH_INTERNAL_ASSERT(dim == 0 || dim == saved_v.dim() - 1, "fused kernels can only be applied for first or last dim"); diff --git a/aten/src/ATen/native/cpu/WeightNormKernel.h b/aten/src/ATen/native/cpu/WeightNormKernel.h index 1f5ad65b52d9..6e1f3ec3b029 100644 --- a/aten/src/ATen/native/cpu/WeightNormKernel.h +++ b/aten/src/ATen/native/cpu/WeightNormKernel.h @@ -1,13 +1,18 @@ #pragma once - -#include #include +#include + +namespace at { +class TensorBase; +} namespace at { namespace native { -using weight_norm_fn = void(*)(Tensor&, Tensor&, const Tensor&, const Tensor&, int64_t); +using weight_norm_fn = void(*)( + TensorBase&, TensorBase&, const TensorBase&, const TensorBase&, int64_t); using weight_norm_backward_fn = void(*)( - Tensor&, Tensor&, const Tensor&, const Tensor&, const Tensor&, const Tensor&, int64_t); + TensorBase&, TensorBase&, const TensorBase&, const TensorBase&, + const TensorBase&, const TensorBase&, int64_t); DECLARE_DISPATCH(weight_norm_fn, weight_norm_stub); DECLARE_DISPATCH(weight_norm_backward_fn, weight_norm_backward_stub); diff --git a/aten/src/ATen/native/cpu/radix_sort.h b/aten/src/ATen/native/cpu/radix_sort.h index ad94f2e06e91..2b0657ee6986 100644 --- a/aten/src/ATen/native/cpu/radix_sort.h +++ b/aten/src/ATen/native/cpu/radix_sort.h @@ -5,6 +5,8 @@ namespace at { namespace native { +bool inline is_radix_sort_available() { return false; } + template std::pair radix_sort_parallel( K* inp_key_buf, @@ -21,6 +23,7 @@ std::pair radix_sort_parallel( #else #include +#include namespace at { namespace native { @@ -31,7 +34,7 @@ namespace { // // Copied from fbgemm implementation here: // https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/src/cpu_utils.cpp -// +// // `radix_sort_parallel` is only available when ATen is compiled with OpenMP, // since the algorithm requires sync between omp threads, which can not be perfectly // mapped to `at::parallel_for` at the current stage. 
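As context for the pass-count computation changed in the next hunk (where __builtin_clz is replaced by llvm::countLeadingZeros), here is a minimal sketch of how the number of 8-bit radix passes follows from the largest key value; count_bits_portable and radix_passes_for are illustrative names, not part of the patch.

#include <cstdint>

// Illustrative helper (not part of the patch): number of significant bits in v.
inline int count_bits_portable(uint64_t v) {
  int bits = 0;
  while (v != 0) {
    ++bits;
    v >>= 1;
  }
  return bits;
}

// Each radix pass consumes one byte of the key (256 histogram buckets), so the
// pass count is ceil(num_bits / 8); a max_value of 0 needs no sorting at all.
inline unsigned int radix_passes_for(uint64_t max_value) {
  const int num_bits = count_bits_portable(max_value);
  return static_cast<unsigned int>((num_bits + 7) / 8);
}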
@@ -132,8 +135,11 @@ void radix_sort_kernel( } } } + } // namespace +bool inline is_radix_sort_available() { return true; } + template std::pair radix_sort_parallel( K* inp_key_buf, @@ -143,12 +149,16 @@ std::pair radix_sort_parallel( int64_t elements_count, int64_t max_value) { int maxthreads = omp_get_max_threads(); - alignas(64) int histogram[RDX_HIST_SIZE * maxthreads]; - alignas(64) int histogram_ps[RDX_HIST_SIZE * maxthreads + 1]; + std::unique_ptr histogram_tmp(new int[RDX_HIST_SIZE * maxthreads]); + std::unique_ptr histogram_ps_tmp(new int[RDX_HIST_SIZE * maxthreads + 1]); + int* histogram = histogram_tmp.get(); + int* histogram_ps = histogram_ps_tmp.get(); if (max_value == 0) { return std::make_pair(inp_key_buf, inp_value_buf); } - int num_bits = sizeof(K) * 8 - __builtin_clz(max_value); + + // __builtin_clz is not portable + int num_bits = sizeof(K) * 8 - llvm::countLeadingZeros(static_cast>(max_value)); unsigned int num_passes = (num_bits + 7) / 8; #pragma omp parallel diff --git a/aten/src/ATen/native/cuda/Activation.cpp b/aten/src/ATen/native/cuda/Activation.cpp index 4360f8b5c3ef..31926b353b4a 100644 --- a/aten/src/ATen/native/cuda/Activation.cpp +++ b/aten/src/ATen/native/cuda/Activation.cpp @@ -114,7 +114,7 @@ Tensor prelu_cuda(const Tensor& self, const Tensor& weight_) { Tensor result = at::empty_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); TORCH_CHECK(weight_dim == 0 || weight_dim == 1, - "prelu: Expected `weight` to be a scalar or 1D tensor, but got ndim = ", + "prelu: Expected `weight` to be a scalar or 1D tensor, but got: ndim = ", weight_dim); // case1: shared weight for all channels diff --git a/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu b/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu index 55b0d3322e04..42c10fb6eb29 100644 --- a/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu +++ b/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu @@ -23,8 +23,8 @@ #include #include -#define START_IND(a,b,c) (int)std::floor((float)(a * c) / b) -#define END_IND(a,b,c) (int)std::ceil((float)((a + 1) * c) / b) +#define START_IND(a,b,c) ((int64_t)((a / b) * c + ((a % b) * c) / b)) +#define END_IND(a,b,c) (1 + ((int64_t)(a + 1) * c - 1) / b) #define START_IND_INT(a,b,c) ((a * c) / b) #define END_IND_INT(a,b,c) (((a + 1) * c + b - 1) / b) @@ -442,10 +442,14 @@ namespace { output_arg{ output, "output", 2 }; checkAllSameGPU(__func__, {input_arg, output_arg}); - for (int64_t i = 1; i < input.ndimension(); i++) { + TORCH_CHECK(output_size.size() == 2, "adaptive_avg_pool2d: output_size must be 2"); + int64_t ndim = input.dim(); + TORCH_CHECK((ndim == 3 || ndim == 4), + "adaptive_avg_pool2d(): Expected 3D or 4D tensor, but got ", input.sizes()); + for (const auto i : {-2, -1}) { TORCH_CHECK(input.size(i) > 0, "adaptive_avg_pool2d(): Expected input to have non-zero size for non-batch dimensions, " - "but input has sizes ", input.sizes(), " with dimension ", i, " being " + "but input has sizes ", input.sizes(), " with dimension ", i + ndim, " being " "empty"); } @@ -538,9 +542,6 @@ namespace { break; } case at::MemoryFormat::Contiguous: { - TORCH_CHECK((input.ndimension() == 3 || input.ndimension() == 4), - "adaptive_avg_pool2d(): Expected 3D or 4D tensor, but got ", - input.sizes()); int64_t grid_x = input.size(-3); if (input.ndimension() == 4) { input_ = input.contiguous(); diff --git a/aten/src/ATen/native/cuda/AdaptiveAveragePooling3d.cu b/aten/src/ATen/native/cuda/AdaptiveAveragePooling3d.cu index ec71b37015fb..6e43e382ddfc 100644 --- 
a/aten/src/ATen/native/cuda/AdaptiveAveragePooling3d.cu +++ b/aten/src/ATen/native/cuda/AdaptiveAveragePooling3d.cu @@ -28,12 +28,12 @@ namespace native { namespace { -__device__ inline int start_index(int a, int b, int c) { - return (int)std::floor((float)(a * c) / b); +__device__ inline int64_t start_index(int64_t a, int64_t b, int64_t c) { + return (a / b) * c + ((a % b) * c) / b; } -__device__ inline int end_index(int a, int b, int c) { - return (int)std::ceil((float)((a + 1) * c) / b); +__device__ inline int64_t end_index(int64_t a, int64_t b, int64_t c) { + return 1 + ((a + 1) * c - 1) / b; } // 5d tensor B x D x T x H x W diff --git a/aten/src/ATen/native/cuda/AdaptiveMaxPooling2d.cu b/aten/src/ATen/native/cuda/AdaptiveMaxPooling2d.cu index 5b46fb9c34a5..4903fdacc8cb 100644 --- a/aten/src/ATen/native/cuda/AdaptiveMaxPooling2d.cu +++ b/aten/src/ATen/native/cuda/AdaptiveMaxPooling2d.cu @@ -28,12 +28,12 @@ namespace native { namespace { -__device__ inline int start_index(int a, int b, int c) { - return (int)std::floor((float)(a * c) / b); +__device__ inline int64_t start_index(int64_t a, int64_t b, int64_t c) { + return (a / b) * c + ((a % b) * c) / b; } -__device__ inline int end_index(int a, int b, int c) { - return (int)std::ceil((float)((a + 1) * c) / b); +__device__ inline int64_t end_index(int64_t a, int64_t b, int64_t c) { + return 1 + ((a + 1) * c - 1) / b; } // 4d tensor B x D x H x W diff --git a/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu b/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu index baafc6c56d46..4694d73b3a02 100644 --- a/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu +++ b/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu @@ -28,12 +28,12 @@ namespace native { namespace { -__device__ inline int start_index(int a, int b, int c) { - return (int)std::floor((float)(a * c) / b); +__device__ inline int64_t start_index(int64_t a, int64_t b, int64_t c) { + return (a / b) * c + ((a % b) * c) / b; } -__device__ inline int end_index(int a, int b, int c) { - return (int)std::ceil((float)((a + 1) * c) / b); +__device__ inline int64_t end_index(int64_t a, int64_t b, int64_t c) { + return 1 + ((a + 1) * c - 1) / b; } // 5d tensor B x D x T x H x W diff --git a/aten/src/ATen/native/cuda/AveragePool2d.cu b/aten/src/ATen/native/cuda/AveragePool2d.cu index 55632014a0de..46e96e902981 100644 --- a/aten/src/ATen/native/cuda/AveragePool2d.cu +++ b/aten/src/ATen/native/cuda/AveragePool2d.cu @@ -32,8 +32,8 @@ __device__ inline int max(int a, int b) { template __global__ void avg_pool2d_out_cuda_frame(const int nthreads, - const scalar_t* const bottom_data, const int channels, - const int height, const int width, const int pooled_height, + const scalar_t* const bottom_data, const int64_t channels, + const int64_t height, const int64_t width, const int64_t pooled_height, const int pooled_width, const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, const int pad_h, const int pad_w, scalar_t* const top_data, const int divisor_override, @@ -81,8 +81,8 @@ __global__ void avg_pool2d_out_cuda_frame(const int nthreads, template __global__ void avg_pool2d_out_cuda_frame_nhwc(const int nthreads, - const scalar_t* const bottom_data, const int channels, - const int height, const int width, const int pooled_height, + const scalar_t* const bottom_data, const int64_t channels, + const int64_t height, const int64_t width, const int pooled_height, const int pooled_width, const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, const int pad_h, 
const int pad_w, scalar_t* const top_data, const int divisor_override, @@ -130,8 +130,8 @@ __global__ void avg_pool2d_out_cuda_frame_nhwc(const int nthreads, template __global__ void avg_pool2d_backward_out_cuda_frame(const int nthreads, const scalar_t* const top_diff, - const int channels, const int height, - const int width, const int pooled_height, const int pooled_width, + const int64_t channels, const int64_t height, + const int64_t width, const int64_t pooled_height, const int64_t pooled_width, const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, const int pad_h, const int pad_w, scalar_t* const bottom_diff, const int divisor_override, @@ -187,8 +187,8 @@ __global__ void avg_pool2d_backward_out_cuda_frame(const int nthreads, const sca template __global__ void avg_pool2d_backward_out_cuda_frame_nhwc(const int nthreads, const scalar_t* const top_diff, - const int channels, const int height, - const int width, const int pooled_height, const int pooled_width, + const int64_t channels, const int64_t height, + const int64_t width, const int pooled_height, const int pooled_width, const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, const int pad_h, const int pad_w, scalar_t* const bottom_diff, const int divisor_override, diff --git a/aten/src/ATen/native/cuda/BinaryLogicalOpsKernels.cu b/aten/src/ATen/native/cuda/BinaryLogicalOpsKernels.cu index e69674412c79..cc6046c003e4 100644 --- a/aten/src/ATen/native/cuda/BinaryLogicalOpsKernels.cu +++ b/aten/src/ATen/native/cuda/BinaryLogicalOpsKernels.cu @@ -18,7 +18,7 @@ void logical_and_kernel_cuda(TensorIterator& iter) { #if AT_USE_JITERATOR() static const auto logical_and_string = jiterator_stringify( template - T logical_and_kernel(T a, T b) { + bool logical_and_kernel(T a, T b) { return a && b; } ); // logical_and_string @@ -48,24 +48,76 @@ void logical_and_kernel_cuda(TensorIterator& iter) { } } +const char logical_or_name[] = "logical_or_kernel"; void logical_or_kernel_cuda(TensorIterator& iter) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(kHalf, kBool, ScalarType::BFloat16, - iter.common_dtype(), "logical_or_cuda", [&]() { + auto dtype = iter.common_dtype(); + if (at::isComplexType(dtype)) { +#if AT_USE_JITERATOR() + static const auto logical_or_string = jiterator_stringify( + template + bool logical_or_kernel(T a, T b) { + return a || b; + } + ); // logical_or_string + AT_DISPATCH_COMPLEX_TYPES(dtype, "logical_or_cuda", [&]() { + jitted_gpu_kernel< + /*name=*/ logical_or_name, + /*return_dtype=*/ scalar_t, + /*common_dtype=*/ scalar_t, + /*arity=*/ 2>(iter, logical_or_string); + }); +#else + AT_DISPATCH_COMPLEX_TYPES(dtype, "logical_or_cuda", [&]() { + gpu_kernel_with_scalars(iter, []GPU_LAMBDA(scalar_t a, scalar_t b) -> bool { + return a || b; + }); + }); +#endif + } else { + AT_DISPATCH_ALL_TYPES_AND3(kHalf, kBool, ScalarType::BFloat16, + dtype, "logical_or_cuda", [&]() { opmath_symmetric_gpu_kernel_with_scalars( iter, []GPU_LAMBDA(scalar_t a, scalar_t b) -> bool { return a || b; }); }); + } } +const char logical_xor_name[] = "logical_xor_kernel"; void logical_xor_kernel_cuda(TensorIterator& iter) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(kHalf, kBool, ScalarType::BFloat16, - iter.common_dtype(), "logical_xor_cuda", [&]() { + auto dtype = iter.common_dtype(); + if (at::isComplexType(dtype)) { +#if AT_USE_JITERATOR() + static const auto logical_xor_string = jiterator_stringify( + template + bool logical_xor_kernel(T a, T b) { + return bool(a) != bool(b); + } + ); + 
AT_DISPATCH_COMPLEX_TYPES(dtype, "logical_xor_cuda", [&]() { + jitted_gpu_kernel< + /*name=*/ logical_xor_name, + /*return_dtype=*/ scalar_t, + /*common_dtype=*/ scalar_t, + /*arity=*/ 2>(iter, logical_xor_string); + }); // logical_xor_string +#else + AT_DISPATCH_COMPLEX_TYPES(dtype, "logical_xor_cuda", [&]() { + gpu_kernel_with_scalars(iter, []GPU_LAMBDA(scalar_t a, scalar_t b) -> bool { + return bool(a) != bool(b); + }); + }); +#endif + } else { + AT_DISPATCH_ALL_TYPES_AND3(kHalf, kBool, ScalarType::BFloat16, + dtype, "logical_xor_cuda", [&]() { opmath_symmetric_gpu_kernel_with_scalars( iter, []GPU_LAMBDA(scalar_t a, scalar_t b) -> bool { return bool(a) != bool(b); }); }); + } } REGISTER_DISPATCH(logical_and_stub, &logical_and_kernel_cuda); diff --git a/aten/src/ATen/native/cuda/Bucketization.cu b/aten/src/ATen/native/cuda/Bucketization.cu index 2a3d5730d786..21c582216628 100644 --- a/aten/src/ATen/native/cuda/Bucketization.cu +++ b/aten/src/ATen/native/cuda/Bucketization.cu @@ -10,7 +10,6 @@ #include #include #else -#include #include #include #include @@ -191,11 +190,6 @@ Tensor searchsorted_cuda( return result; } -// See [Note about _torch_cuda_cu_linker_symbol_op and torch_cuda_cu] in native_functions.yaml -Tensor _torch_cuda_cu_linker_symbol_op_cuda(const Tensor& self) { - return self; -} - Tensor searchsorted_cuda( const Tensor& sorted_sequence, const Scalar& self, diff --git a/aten/src/ATen/native/cuda/Col2Im.cu b/aten/src/ATen/native/cuda/Col2Im.cu index 5cb825a2e70b..53eb2df3013e 100644 --- a/aten/src/ATen/native/cuda/Col2Im.cu +++ b/aten/src/ATen/native/cuda/Col2Im.cu @@ -16,7 +16,6 @@ #include #else #include -#include #include #include #endif @@ -99,17 +98,13 @@ void col2im_out_cuda_template( int64_t batch_size = input.size(0); int64_t n_input_plane = input.size(1); int64_t n_output_plane = n_input_plane / (kernel_width * kernel_height); + int64_t input_batch_stride = input.stride(0); output.resize_({batch_size, n_output_plane, output_height, output_width}); - output.zero_(); + int64_t output_batch_stride = output.stride(0); - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1(kHalf, + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16, input.scalar_type(), "col2im_out_cuda", [&] { - using accscalar_t = at::acc_type; - - Tensor input_n; - Tensor output_n; - int64_t height_col = (output_height + 2 * pad_height - (dilation_height * (kernel_height - 1) + 1)) / stride_height + @@ -119,28 +114,26 @@ void col2im_out_cuda_template( stride_width + 1; - for (int64_t elt = 0; elt < batch_size; elt++) { - input_n = input.select(0, elt); - output_n = output.select(0, elt); - - col2im( - at::cuda::getCurrentCUDAStream(), - input_n.data_ptr(), - n_output_plane, - output_height, - output_width, - height_col, - width_col, - kernel_height, - kernel_width, - pad_height, - pad_width, - stride_height, - stride_width, - dilation_height, - dilation_width, - output_n.data_ptr()); - } + col2im_batched( + at::cuda::getCurrentCUDAStream(), + input.data_ptr(), + input_batch_stride, + batch_size, + n_output_plane, + output_height, + output_width, + height_col, + width_col, + kernel_height, + kernel_width, + pad_height, + pad_width, + stride_height, + stride_width, + dilation_height, + dilation_width, + output.data_ptr(), + output_batch_stride); if (!batched_input) { output.resize_({n_output_plane, output_height, output_width}); @@ -148,18 +141,6 @@ void col2im_out_cuda_template( }); } -void col2im_backward_out_cuda_template( - Tensor& grad_input, - const Tensor& grad_output, - IntArrayRef kernel_size, 
- IntArrayRef dilation, - IntArrayRef padding, - IntArrayRef stride) { - // im2col_out_cuda checks size of kernel_size, dilation, padding and stride - at::native::im2col_out_cuda( - grad_output, kernel_size, dilation, padding, stride, grad_input); -} - } // namespace Tensor& col2im_out_cuda(const Tensor& input, @@ -188,29 +169,5 @@ Tensor col2im_cuda( return output; } -Tensor& col2im_backward_out_cuda(const Tensor& grad_output, - IntArrayRef kernel_size, - IntArrayRef dilation, - IntArrayRef padding, - IntArrayRef stride, - Tensor& grad_input) { - col2im_backward_out_cuda_template( - grad_input, grad_output, kernel_size, dilation, padding, stride); - return grad_input; -} - -Tensor col2im_backward_cuda( - const Tensor& grad_output, - IntArrayRef kernel_size, - IntArrayRef dilation, - IntArrayRef padding, - IntArrayRef stride) { - Tensor grad_input = at::empty_like(grad_output, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - - col2im_backward_out_cuda_template( - grad_input, grad_output, kernel_size, dilation, padding, stride); - return grad_input; -} - } // namespace native } // namespace at diff --git a/aten/src/ATen/native/cuda/Copy.cu b/aten/src/ATen/native/cuda/Copy.cu index 4fb647e329d3..564ecf1c1291 100644 --- a/aten/src/ATen/native/cuda/Copy.cu +++ b/aten/src/ATen/native/cuda/Copy.cu @@ -6,7 +6,6 @@ #include #include #include -#include #include #include #include @@ -17,13 +16,15 @@ #include #endif +#include +#include + namespace at { namespace native { void neg_kernel_cuda(TensorIteratorBase &iter); void conj_kernel_cuda(TensorIteratorBase &iter); -namespace { void direct_copy_kernel_cuda(TensorIteratorBase &iter) { ScalarType dtype = iter.dtype(0); if (isQIntType(dtype)) { @@ -43,12 +44,13 @@ void neg_conj_kernel_cuda(TensorIteratorBase &iter) { gpu_kernel(iter, [] GPU_LAMBDA(scalar_t x) { return -std::conj(x); }); }); } -} // namespace (anonymous) using namespace at::cuda; // device-to-device copy, does type conversion -void copy_device_to_device(TensorIterator& iter, bool non_blocking) { +void copy_device_to_device(TensorIterator& iter, + bool non_blocking, + bool p2p_enabled) { int64_t numel = iter.numel(); // We can memcpy the memory if both tensors have the same type AND both @@ -89,11 +91,28 @@ void copy_device_to_device(TensorIterator& iter, bool non_blocking) { void *src = iter.data_ptr(1); size_t size = numel * iter.element_size(0); if (src != dst || src_device != dst_device) { - // Perform the copy - AT_CUDA_CHECK(cudaMemcpyAsync( - dst, src, size, - cudaMemcpyDeviceToDevice, - copy_stream)); + // Due to bizarre cuda driver intricacies, copies of + // cudaMallocAsynced memory between devices that aren't + // peer-to-peer-capable need "cudaMemcpyPeerAsync". 
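+ // In short (descriptive summary of the branch below): cudaMemcpyPeerAsync is
+ // only chosen for cross-device copies when the caching allocator reports that
+ // pool-specific peer access is required and peer-to-peer access has not
+ // already been enabled; all other device-to-device copies keep using the
+ // plain cudaMemcpyAsync path.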
+#ifdef USE_ROCM + bool needs_pool_specific_peer_access = false; +#else + bool needs_pool_specific_peer_access = CUDACachingAllocator::get()->needsPoolSpecificPeerAccess(); +#endif + bool needs_MemcpyPeer = (src_device != dst_device && + needs_pool_specific_peer_access && + !p2p_enabled); + if (needs_MemcpyPeer) { + AT_CUDA_CHECK(cudaMemcpyPeerAsync( + dst, dst_device.index(), + src, src_device.index(), + size, copy_stream)); + } else { + AT_CUDA_CHECK(cudaMemcpyAsync( + dst, src, size, + cudaMemcpyDeviceToDevice, + copy_stream)); + } } } else { if (same_neg) { @@ -207,7 +226,7 @@ static void copy_kernel_cuda(TensorIterator& iter, bool non_blocking) { // Copy on GPU (or between GPUs) if (dst_device.is_cuda() && src_device.is_cuda()) { - copy_device_to_device(iter, non_blocking); + copy_device_to_device(iter, non_blocking, p2p_enabled); return; } diff --git a/aten/src/ATen/native/cuda/Copy.h b/aten/src/ATen/native/cuda/Copy.h new file mode 100644 index 000000000000..5639567d6666 --- /dev/null +++ b/aten/src/ATen/native/cuda/Copy.h @@ -0,0 +1,10 @@ +#pragma once + +namespace at { +struct TensorIteratorBase; + +namespace native { + +void direct_copy_kernel_cuda(TensorIteratorBase &iter); + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/CumminmaxKernel.cu b/aten/src/ATen/native/cuda/CumminmaxKernel.cu new file mode 100644 index 000000000000..ea73273e2d4b --- /dev/null +++ b/aten/src/ATen/native/cuda/CumminmaxKernel.cu @@ -0,0 +1,29 @@ +#define TORCH_ASSERT_NO_OPERATORS +#include +#include + +#include +#include + +#include +#include + +namespace at { namespace native { + +void launch_cummax_cuda_kernel(const TensorBase& self, const TensorBase& values, const TensorBase& indices, int64_t dim) { + AT_DISPATCH_ALL_TYPES_AND3(at::ScalarType::Bool, at::ScalarType::Half, at::ScalarType::BFloat16, + self.scalar_type(), "cummax_cuda", [&]() { + scalar_t init = self.is_floating_point() ? (-1*std::numeric_limits::infinity()) : std::numeric_limits::lowest(); + scan_dim_with_indices(self, values, indices, dim, init, std::greater_equal()); + }); +} + +void launch_cummin_cuda_kernel(const TensorBase& self, const TensorBase& values, const TensorBase& indices, int64_t dim) { + AT_DISPATCH_ALL_TYPES_AND3(at::ScalarType::Bool, at::ScalarType::Half, at::ScalarType::BFloat16, + self.scalar_type(), "cummin_cuda", [&]() { + scalar_t init = self.is_floating_point() ? 
std::numeric_limits::infinity() : std::numeric_limits::max(); + scan_dim_with_indices(self, values, indices, dim, init, std::less_equal()); + }); +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/CumprodKernel.cu b/aten/src/ATen/native/cuda/CumprodKernel.cu new file mode 100644 index 000000000000..d1f3233abb13 --- /dev/null +++ b/aten/src/ATen/native/cuda/CumprodKernel.cu @@ -0,0 +1,23 @@ +#define TORCH_ASSERT_NO_OPERATORS +#include +#include + +#include +#include + +namespace at { namespace native { + +void launch_cumprod_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( + ScalarType::Half, ScalarType::BFloat16, self.scalar_type(), "cumprod_cuda", [&]() { + scalar_t init = 1; + scan_dim( + self, + result, + dim, + init, + std::multiplies()); + }); +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/CumsumKernel.cu b/aten/src/ATen/native/cuda/CumsumKernel.cu new file mode 100644 index 000000000000..85866b3f0f32 --- /dev/null +++ b/aten/src/ATen/native/cuda/CumsumKernel.cu @@ -0,0 +1,25 @@ +#define TORCH_ASSERT_NO_OPERATORS +#include +#include + +#include +#include + +namespace at { namespace native { + +void launch_cumsum_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( + ScalarType::Half, ScalarType::BFloat16, + self.scalar_type(), "cumsum_cuda", + [&]() { + scalar_t init = 0; + scan_dim( + self, + result, + dim, + init, + std::plus()); + }); +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/DepthwiseConv2d.cu b/aten/src/ATen/native/cuda/DepthwiseConv2d.cu index 8f0f9b99903a..20748837bbaf 100644 --- a/aten/src/ATen/native/cuda/DepthwiseConv2d.cu +++ b/aten/src/ATen/native/cuda/DepthwiseConv2d.cu @@ -236,7 +236,6 @@ __global__ void conv_depthwise2d_grad_weight_kernel( } } } - __syncthreads(); // At this point each thread in the block has a local gradient, which we need to // accumulate prior to writing the global value diff --git a/aten/src/ATen/native/cuda/DilatedMaxPool2d.cu b/aten/src/ATen/native/cuda/DilatedMaxPool2d.cu index 05a201147241..728c3144f083 100644 --- a/aten/src/ATen/native/cuda/DilatedMaxPool2d.cu +++ b/aten/src/ATen/native/cuda/DilatedMaxPool2d.cu @@ -44,8 +44,8 @@ static __device__ inline int p_end(int size, int pad, int pooled_size, int strid // kernels borrowed from Caffe template __global__ void max_pool_forward_nchw(const int nthreads, const scalar_t* bottom_data, - const int channels, const int height, - const int width, const int pooled_height, const int pooled_width, + const int64_t channels, const int64_t height, + const int64_t width, const int pooled_height, const int pooled_width, const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, const int pad_h, const int pad_w, const int dilation_h, const int dilation_w, scalar_t* top_data, @@ -83,8 +83,8 @@ __global__ void max_pool_forward_nchw(const int nthreads, const scalar_t* bottom template C10_LAUNCH_BOUNDS_1(CUDA_MAX_THREADS) __global__ void max_pool_forward_nhwc(const scalar_t* bottom_data, const int nbatch, - const int channels, const int height, - const int width, const int pooled_height, const int pooled_width, + const int64_t channels, const int64_t height, + const int64_t width, const int pooled_height, const int pooled_width, const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, const int pad_h, const int pad_w, const int dilation_h, const int dilation_w, 
@@ -176,8 +176,8 @@ C10_LAUNCH_BOUNDS_2(BLOCK_THREADS, 4) C10_LAUNCH_BOUNDS_2(BLOCK_THREADS, 8) #endif __global__ void max_pool_backward_nchw(const scalar_t* top_diff, - const int64_t* top_mask, const int num, const int channels, - const int height, const int width, const int pooled_height, + const int64_t* top_mask, const int num, const int64_t channels, + const int64_t height, const int64_t width, const int pooled_height, const int pooled_width, const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, const int pad_h, const int pad_w, const int dilation_h, const int dilation_w, @@ -209,8 +209,8 @@ __global__ void max_pool_backward_nchw(const scalar_t* top_diff, template C10_LAUNCH_BOUNDS_1(CUDA_MAX_THREADS) __global__ void max_pool_backward_nhwc(const scalar_t* top_diff, - const int64_t* top_mask, const int nbatch, const int channels, - const int height, const int width, const int pooled_height, + const int64_t* top_mask, const int nbatch, const int64_t channels, + const int64_t height, const int64_t width, const int pooled_height, const int pooled_width, const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, const int pad_h, const int pad_w, const int dilation_h, const int dilation_w, @@ -242,9 +242,9 @@ __global__ void max_pool_backward_nhwc(const scalar_t* top_diff, int iH = (height + gridDim.z-1) / gridDim.z; int iW = (width + gridDim.y-1) / gridDim.y; int istartH = threadIdx.z + blockIdx.z*iH; - int iendH = ::min(istartH+iH, height); + int iendH = ::min(static_cast(istartH)+iH, height); int istartW = threadIdx.y + blockIdx.y*iW; - int iendW = ::min(istartW+iW, width); + int iendW = ::min(static_cast(istartW)+iW, width); for (int ih = istartH; ih < iendH; ih+=blockDim.z) { int phstart = p_start(ih, pad_h, kernel_h, dilation_h, stride_h); @@ -423,14 +423,14 @@ IntArrayRef stride, IntArrayRef padding, IntArrayRef dilation, bool ceil_mode, -const Tensor& indices, +const Tensor& indices_, const Tensor& gradInput) { NoNamesGuard guard; TensorArg gradInput_arg{ gradInput, "gradInput", 1 }; TensorArg gradOutput_arg{ gradOutput_, "gradOutput_", 2 }; TensorArg input_arg{ input_, "input_", 3 }; - TensorArg indices_arg{ indices, "indices", 4 }; + TensorArg indices_arg{ indices_, "indices", 4 }; checkAllSameGPU(__func__, {gradInput_arg, gradOutput_arg, input_arg, indices_arg}); @@ -474,6 +474,8 @@ const Tensor& gradInput) { const int64_t out_stride_h = gradOutput.stride(-2); const int64_t out_stride_w = gradOutput.stride(-1); + const Tensor indices = indices_.contiguous(memory_format); + gradInput.zero_(); int64_t count = input.numel(); diff --git a/aten/src/ATen/native/cuda/DistanceKernel.cu b/aten/src/ATen/native/cuda/DistanceKernel.cu index a9130bd3e808..2ae4cd592e6b 100644 --- a/aten/src/ATen/native/cuda/DistanceKernel.cu +++ b/aten/src/ATen/native/cuda/DistanceKernel.cu @@ -6,6 +6,8 @@ #include #include +#include +#include #include #ifndef AT_PER_OPERATOR_HEADERS @@ -21,20 +23,7 @@ namespace at { namespace native { namespace { -static const int forward_threads = 256; - -template -static __forceinline__ __device__ scalar_t device_sqrt(scalar_t val); - -template <> -__forceinline__ __device__ float device_sqrt(float val) { - return ::sqrtf(val); -} - -template <> -__forceinline__ __device__ double device_sqrt(double val) { - return ::sqrt(val); -} +constexpr int kCUDANumThreads = 256; template struct dists { @@ -92,27 +81,16 @@ struct dists { }; template -__device__ static inline scalar_t reduce_agg(scalar_t agg) { - for (int offset = warpSize / 
2; offset > 0; offset /= 2) { - F::agg(agg, WARP_SHFL_DOWN(agg, offset)); - } - - __shared__ scalar_t shared[forward_threads]; - int lane = threadIdx.x % warpSize; - int warp_id = threadIdx.x / warpSize; - if (lane == 0) { - shared[warp_id] = agg; - } +struct DistReduceOp { + __forceinline__ __device__ scalar_t combine(scalar_t a, scalar_t b) const { + F::agg(a, b); + return a; + } - __syncthreads(); - agg = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0.0; - if (warp_id == 0) { - for (int offset = blockDim.x / warpSize / 2; offset > 0; offset /= 2) { - F::agg(agg, WARP_SHFL_DOWN(agg, offset)); + __forceinline__ __device__ scalar_t warp_shfl_down(scalar_t data, int offset) const { + return WARP_SHFL_DOWN(data, offset); } - } - return agg; -} +}; template __global__ static void pdist_kernel_cuda_impl(scalar_t * result, const scalar_t * self, const int64_t n, const int64_t m, const scalar_t p, @@ -133,7 +111,9 @@ __global__ static void pdist_kernel_cuda_impl(scalar_t * result, const scalar_t F::inc(agg, std::abs(*a - *b), p); } - agg = reduce_agg(agg); + __shared__ scalar_t agg_smem[kCUDANumThreads]; + scalar_t agg_init{0.0}; + agg = cuda_utils::BlockReduce(agg, DistReduceOp{}, agg_init, agg_smem); if (threadIdx.x == 0) { result[k] = F::finish(agg, p); } @@ -222,7 +202,9 @@ __global__ static void cdist_kernel_cuda_impl(scalar_t * result, const scalar_t for (; a < end; a += stride, b += stride) { F::inc(agg, std::abs(*a - *b), p); } - agg = reduce_agg(agg); + __shared__ scalar_t agg_smem[kCUDANumThreads]; + scalar_t agg_init{0.0}; + agg = cuda_utils::BlockReduce(agg, DistReduceOp{}, agg_init, agg_smem); if (threadIdx.x == 0) { result[blockIdx.x] = F::finish(agg, p); } @@ -236,31 +218,27 @@ void cdist_kernel_impl(Tensor& result, const Tensor& x1, const Tensor& x2, doubl const int64_t l1_size = r1 * m; const int64_t l2_size = r2 * m; const dim3 grid(result.numel()); - const dim3 block(forward_threads); + const dim3 block(kCUDANumThreads); AT_DISPATCH_FLOATING_TYPES(x1.scalar_type(), "cdist_cuda", [&] { + auto impl_fptr = cdist_kernel_cuda_impl::p>; if (p == 0.0) { - cdist_kernel_cuda_impl::zero><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_kernel_cuda_impl::zero>; } else if (p == 1.0) { - cdist_kernel_cuda_impl::one><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_kernel_cuda_impl::one>; } else if (p == 2.0) { - cdist_kernel_cuda_impl::two><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_kernel_cuda_impl::two>; } else if (std::isinf(p)) { - cdist_kernel_cuda_impl::inf><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } else { - cdist_kernel_cuda_impl::p><<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_kernel_cuda_impl::inf>; } + impl_fptr<<>>(result.data_ptr(), x1.data_ptr(), x2.data_ptr(), p, r2, m, r_size, l1_size, l2_size); + C10_CUDA_KERNEL_LAUNCH_CHECK(); }); } void pdist_forward_kernel_impl(Tensor& result, const Tensor& self, double p) { const dim3 grid(result.numel()); - const dim3 block(forward_threads); + const dim3 block(kCUDANumThreads); int64_t n = self.size(0); int64_t m = self.size(1); // 
https://github.com/pytorch/pytorch/issues/15511 demonstrated we need to do @@ -269,22 +247,18 @@ void pdist_forward_kernel_impl(Tensor& result, const Tensor& self, double p) { const double n2_squared_minus_1 = n2 * n2 - 1; AT_DISPATCH_FLOATING_TYPES(self.scalar_type(), "pdist_cuda", [&] { + auto impl_fptr = pdist_kernel_cuda_impl::p>; if (p == 0.0) { - pdist_kernel_cuda_impl::zero><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_kernel_cuda_impl::zero>; } else if (p == 1.0) { - pdist_kernel_cuda_impl::one><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_kernel_cuda_impl::one>; } else if (p == 2.0) { - pdist_kernel_cuda_impl::two><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_kernel_cuda_impl::two>; } else if (std::isinf(p)) { - pdist_kernel_cuda_impl::inf><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } else { - pdist_kernel_cuda_impl::p><<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_kernel_cuda_impl::inf>; } + impl_fptr<<>>(result.data_ptr(), self.data_ptr(), n, m, p, n2, n2_squared_minus_1); + C10_CUDA_KERNEL_LAUNCH_CHECK(); }); } @@ -311,22 +285,18 @@ void pdist_backward_kernel_impl(Tensor& result, const Tensor& grad, const Tensor Tensor buffer = at::empty({n - 1, result.size(0), result.size(1)}, result.options()); AT_DISPATCH_FLOATING_TYPES(self.scalar_type(), "pdist_cuda_backward", [&] { + auto impl_fptr = pdist_backward_kernel_cuda_impl::p>; if (p == 1.0) { - pdist_backward_kernel_cuda_impl::one><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_backward_kernel_cuda_impl::one>; } else if (p < 2.0) { - pdist_backward_kernel_cuda_impl::lt_two><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_backward_kernel_cuda_impl::lt_two>; } else if (p == 2.0) { - pdist_backward_kernel_cuda_impl::two><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_backward_kernel_cuda_impl::two>; } else if (std::isinf(p)) { - pdist_backward_kernel_cuda_impl::inf><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } else { - pdist_backward_kernel_cuda_impl::p><<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = pdist_backward_kernel_cuda_impl::inf>; } + impl_fptr<<>>(buffer.data_ptr(), grad.data_ptr(), self.data_ptr(), dist.data_ptr(), grad.stride(0), n, m, dist.numel(), p, n2, n2_squared_minus_1); + C10_CUDA_KERNEL_LAUNCH_CHECK(); }); at::sum_out(result, buffer, 0); @@ -364,32 +334,20 @@ void cdist_backward_kernel_impl(Tensor& result, const Tensor& grad, const Tensor Tensor buffer = at::empty({batch, r2, r1, m}, result.options()); 
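The pdist/cdist changes above (and the cdist backward block continuing just below) all follow the same refactor: select a kernel function pointer from the runtime value of p, then launch once, instead of repeating the launch and C10_CUDA_KERNEL_LAUNCH_CHECK in every branch. A small host-side sketch of the pattern, with plain functions standing in for the __global__ kernels:

#include <cmath>

// Each policy provides the per-element accumulation rule, mirroring the dists<> structs.
struct one_norm { static double inc(double agg, double diff, double)   { return agg + diff; } };
struct two_norm { static double inc(double agg, double diff, double)   { return agg + diff * diff; } };
struct p_norm   { static double inc(double agg, double diff, double p) { return agg + std::pow(diff, p); } };

template <typename F>
void accumulate_impl(const double* diffs, int n, double p, double* out) {
  double agg = 0.0;
  for (int i = 0; i < n; ++i) {
    agg = F::inc(agg, std::abs(diffs[i]), p);
  }
  *out = agg;
}

void accumulate(const double* diffs, int n, double p, double* out) {
  // Default to the general implementation, specialize for common p, call exactly once.
  auto impl_fptr = accumulate_impl<p_norm>;
  if (p == 1.0) {
    impl_fptr = accumulate_impl<one_norm>;
  } else if (p == 2.0) {
    impl_fptr = accumulate_impl<two_norm>;
  }
  impl_fptr(diffs, n, p, out);
}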
AT_DISPATCH_FLOATING_TYPES(result.scalar_type(), "cdist_cuda_backward", [&] { + auto impl_fptr = cdist_backward_kernel_cuda_impl::p>; if (p == 1.0) { - cdist_backward_kernel_cuda_impl::one><<>>(buffer.data_ptr(), - grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), - p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_backward_kernel_cuda_impl::one>; } else if (p < 2.0) { - cdist_backward_kernel_cuda_impl::lt_two><<>>(buffer.data_ptr(), - grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), - p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_backward_kernel_cuda_impl::lt_two>; } else if (p == 2.0) { - cdist_backward_kernel_cuda_impl::two><<>>(buffer.data_ptr(), - grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), - p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); + impl_fptr = cdist_backward_kernel_cuda_impl::two>; } else if (std::isinf(p)) { - cdist_backward_kernel_cuda_impl::inf><<>>(buffer.data_ptr(), - grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), - p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } else { - cdist_backward_kernel_cuda_impl::p><<>>(buffer.data_ptr(), + impl_fptr = cdist_backward_kernel_cuda_impl::inf>; + } + impl_fptr<<>>(buffer.data_ptr(), grad.data_ptr(), x1.data_ptr(), x2.data_ptr(), dist.data_ptr(), p, r1, r2, m, count, r_size, l1_size, l2_size); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } + C10_CUDA_KERNEL_LAUNCH_CHECK(); }); at::sum_out(result, buffer, 1); diff --git a/aten/src/ATen/native/cuda/Distributions.cu b/aten/src/ATen/native/cuda/Distributions.cu index 717ad4d985d4..f45d745eb418 100644 --- a/aten/src/ATen/native/cuda/Distributions.cu +++ b/aten/src/ATen/native/cuda/Distributions.cu @@ -47,6 +47,7 @@ void poisson_cuda_kernel( at::PhiloxCudaState philox_args) { auto functor = [philox_args] __device__( scalar_t & ret_val, const scalar_t& lambda) { + CUDA_KERNEL_ASSERT(lambda >= 0 && "invalid Poisson rate, expected rate to be non-negative"); auto seeds = at::cuda::philox::unpack(philox_args); curandStatePhilox4_32_10_t state; curand_init(std::get<0>(seeds), diff --git a/aten/src/ATen/native/cuda/EmbeddingBag.cu b/aten/src/ATen/native/cuda/EmbeddingBag.cu index 7ac3a7151b79..2cd76cbe34d1 100644 --- a/aten/src/ATen/native/cuda/EmbeddingBag.cu +++ b/aten/src/ATen/native/cuda/EmbeddingBag.cu @@ -26,6 +26,7 @@ #include #include #include +#include #include @@ -457,14 +458,6 @@ Tensor _embedding_bag_dense_backward_cuda(const Tensor &grad_, const Tensor &ind } } -template -__inline__ __device__ -static scalar_t warpReduceSum(scalar_t val) { - for (int offset = C10_WARP_SIZE/2; offset > 0; offset /= 2) - val += WARP_SHFL_DOWN(val, offset); - return val; -} - template __global__ static void _embedding_bag_per_sample_weights_backward_kernel( const scalar_t* grad, int64_t grad_stride0, int64_t grad_stride1, @@ -495,7 +488,7 @@ __global__ static void _embedding_bag_per_sample_weights_backward_kernel( weight[weight_stride0 * embedding_idx + weight_stride1 * feature_idx]; } } - result = warpReduceSum(result); + result = cuda_utils::WarpReduceSum(result); if (thread_in_warp == 0) { output[sample_idx] = result; } diff --git a/aten/src/ATen/native/cuda/ForeachFunctors.cuh b/aten/src/ATen/native/cuda/ForeachFunctors.cuh index 8a16534cec3f..a72c33ac6960 100644 --- a/aten/src/ATen/native/cuda/ForeachFunctors.cuh +++ b/aten/src/ATen/native/cuda/ForeachFunctors.cuh 
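Both DistanceKernel.cu and EmbeddingBag.cu above drop their hand-rolled reductions (reduce_agg, warpReduceSum) in favor of the shared cuda_utils::BlockReduce / cuda_utils::WarpReduceSum helpers. The core idea those helpers build on is a warp-level shuffle reduction; a sketch of just that step, assuming a full warp of active threads (the real utilities additionally combine per-warp results through shared memory):

// Device-side sketch only: fold values across a warp with shuffles, no shared memory needed.
__device__ float warp_sum_sketch(float val) {
  // After each step, lane i also holds the contribution of lane i + offset;
  // after log2(warpSize) steps, lane 0 holds the sum of the whole warp.
  for (int offset = warpSize / 2; offset > 0; offset /= 2) {
    val += __shfl_down_sync(0xffffffff, val, offset);
  }
  return val;
}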
@@ -47,6 +47,25 @@ __device__ bool init_args( return all_aligned; } +template +__device__ bool init_args( + T** args, + FusedOptimizerTensorListMetadata& tl, + int chunk_idx, + int chunk_size, + int tensor_loc) { + bool all_aligned = true; + for (int i = 0; i < depth; i++) { + args[i] = (T*)tl.addresses[i][tensor_loc]; + args[i] += chunk_idx * chunk_size; + + if (!is_aligned(args[i])) { + all_aligned = false; + } + } + return all_aligned; +} + template __device__ void load_args(T r_args[][kILP], T** args, int i_start, int chunk_size, int n) { #pragma unroll diff --git a/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu b/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu index 3b04b68b0f39..27b3d77ad4d6 100644 --- a/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu +++ b/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu @@ -160,10 +160,45 @@ void foreach_tensor_##NAME##_scalarlist_cuda_(TensorList input, TensorList tenso foreach_pointwise_op_(input, tensors1, tensors2, scalars); \ } +#define FOREACH_POINTWISE_OP_TENSOR(NAME, OP) \ + std::vector foreach_tensor_##NAME##_tensor_cuda( \ + TensorList input, \ + TensorList tensors1, \ + TensorList tensors2, \ + const Tensor& scalars_) { \ + auto scalars = convert_tensor_to_scalar_list(scalars_, input.size()); \ + check_foreach_api_restrictions(input, tensors1, tensors2, scalars); \ + if (!can_use_fast_route({input, tensors1, tensors2}) || \ + has_integral_tensor(input, /* includeBool */ true)) { \ + return at::native::foreach_tensor_##NAME##_scalarlist_slow( \ + input, tensors1, tensors2, scalars); \ + } \ + \ + return foreach_pointwise_op(input, tensors1, tensors2, scalars); \ + } \ + \ + void foreach_tensor_##NAME##_tensor_cuda_( \ + TensorList input, \ + TensorList tensors1, \ + TensorList tensors2, \ + const Tensor& scalars_) { \ + auto scalars = convert_tensor_to_scalar_list(scalars_, input.size()); \ + check_foreach_api_restrictions(input, tensors1, tensors2, scalars); \ + if (!can_use_fast_route({input, tensors1, tensors2}, scalars) || \ + has_integral_tensor(input, /* includeBool */ true)) { \ + return at::native::foreach_tensor_##NAME##_scalarlist_slow_( \ + input, tensors1, tensors2, scalars); \ + } \ + \ + foreach_pointwise_op_(input, tensors1, tensors2, scalars); \ + } + FOREACH_POINTWISE_OP_SCALAR(addcmul, std::multiplies); FOREACH_POINTWISE_OP_SCALAR(addcdiv, std::divides); FOREACH_POINTWISE_OP_SCALARLIST(addcmul, std::multiplies); FOREACH_POINTWISE_OP_SCALARLIST(addcdiv, std::divides); +FOREACH_POINTWISE_OP_TENSOR(addcdiv, std::divides); +FOREACH_POINTWISE_OP_TENSOR(addcmul, std::multiplies); // Why bool tensors are pushed to slowpath? 
diff --git a/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu b/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu index 46ea4eadf1fe..24db8776cd49 100644 --- a/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu +++ b/aten/src/ATen/native/cuda/FractionalMaxPool2d.cu @@ -185,10 +185,10 @@ TORCH_IMPL_FUNC(fractional_max_pool2d_out_cuda) ( AT_DISPATCH_FLOATING_TYPES_AND_HALF(input.scalar_type(), "fractional_max_pool2d_out_cuda_frame", [&] { - auto devInput = input_.packed_accessor(); - auto devOutput = output_.packed_accessor(); - auto devIndices = indices_.packed_accessor(); - auto devSamples = randomSamples.packed_accessor(); + auto devInput = input_.packed_accessor64(); + auto devOutput = output_.packed_accessor64(); + auto devIndices = indices_.packed_accessor64(); + auto devSamples = randomSamples.packed_accessor64(); fractional_max_pool2d_out_cuda_frame <<>>( devOutput, devIndices, devInput, devSamples, @@ -253,12 +253,12 @@ TORCH_IMPL_FUNC(fractional_max_pool2d_backward_cuda)( gradInput_.size(0)); dim3 block(outputPlaneSize > 128 ? 128 : outputPlaneSize); - auto devIndices = indices_.packed_accessor(); + auto devIndices = indices_.packed_accessor64(); AT_DISPATCH_FLOATING_TYPES_AND_HALF(gradOutput.scalar_type(), "fractional_max_pool2d_backward_out_cuda_frame", [&] { - auto devGradInput = gradInput_.packed_accessor(); - auto devGradOutput = gradOutput_.packed_accessor(); + auto devGradInput = gradInput_.packed_accessor64(); + auto devGradOutput = gradOutput_.packed_accessor64(); fractional_max_pool2d_backward_out_cuda_frame <<>>( devGradInput, devGradOutput, devIndices); diff --git a/aten/src/ATen/native/cuda/FusedAdamKernel.cu b/aten/src/ATen/native/cuda/FusedAdamKernel.cu new file mode 100644 index 000000000000..d35c44df219c --- /dev/null +++ b/aten/src/ATen/native/cuda/FusedAdamKernel.cu @@ -0,0 +1,45 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include +#include +#include + + +namespace at { namespace native { + +// note(crcrpar): To observe the CI rules, i.e. 20 minutes per file to compile, defensively split instantiations into _impl files. +// this is only for CUDA 11.3 for which it took about 20 minutes and 28 minutes in my workstation and CI, respectively. +// As a data point, it took about 20 seconds for CUDA 11.7 installed in my environment. +// See https://github.com/pytorch/pytorch/pull/81705 for details. 
+void _fused_adam_kernel_cuda_( + at::TensorList params, + at::TensorList grads, + at::TensorList exp_avgs, + at::TensorList exp_avg_sqs, + at::TensorList max_exp_avg_sqs, + at::TensorList state_steps, + const double lr, + const double beta1, + const double beta2, + const double weight_decay, + const double eps, + const bool amsgrad, + const bool maximize, + const c10::optional& grad_scale, + const c10::optional& found_inf +) { + if (amsgrad) { + TORCH_CHECK( + at::native::check_fast_path_restrictions({params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs}), + "params, grads, exp_avgs, exp_avg_sqs, and max_exp_avg_sqs must have same dtype, device, and layout"); + _fused_adam_cuda_impl_(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, lr, beta1, beta2, weight_decay, eps, amsgrad, maximize, grad_scale, found_inf); + } else { + TORCH_CHECK( + at::native::check_fast_path_restrictions({params, grads, exp_avgs, exp_avg_sqs}), + "params, grads, exp_avgs, and exp_avg_sqs must have same dtype, device, and layout"); + _fused_adam_cuda_impl_(params, grads, exp_avgs, exp_avg_sqs, state_steps, lr, beta1, beta2, weight_decay, eps, amsgrad, maximize, grad_scale, found_inf); + } +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/GridSampler.cu b/aten/src/ATen/native/cuda/GridSampler.cu index bfc3d86b8ab9..8aae51499e03 100644 --- a/aten/src/ATen/native/cuda/GridSampler.cu +++ b/aten/src/ATen/native/cuda/GridSampler.cu @@ -96,8 +96,8 @@ namespace { } } } else if (interpolation_mode == GridSamplerInterpolation::Nearest) { - index_t ix_nearest = static_cast(::round(ix)); - index_t iy_nearest = static_cast(::round(iy)); + index_t ix_nearest = static_cast(::nearbyint(ix)); + index_t iy_nearest = static_cast(::nearbyint(iy)); // assign nearest neighor pixel value to output pixel auto inp_ptr_NC = input.data + n * inp_sN; diff --git a/aten/src/ATen/native/cuda/Im2Col.cu b/aten/src/ATen/native/cuda/Im2Col.cu index 89b2a1879b4b..a209aa276463 100644 --- a/aten/src/ATen/native/cuda/Im2Col.cu +++ b/aten/src/ATen/native/cuda/Im2Col.cu @@ -18,7 +18,6 @@ #include #include #include -#include #endif namespace at { @@ -103,10 +102,9 @@ static void im2col_out_cuda_template( int64_t output_length = output_height * output_width; output.resize_({batch_size, n_output_plane, output_length}); - output.zero_(); // Launch kernel - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1(kHalf, + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16, input.scalar_type(), "im2col_out_cuda", [&] { Tensor input_n; Tensor output_n; @@ -140,29 +138,6 @@ static void im2col_out_cuda_template( }); } -static void im2col_backward_out_cuda_template( - Tensor& grad_input, - const Tensor& grad_output, - IntArrayRef input_size, - IntArrayRef kernel_size, - IntArrayRef dilation, - IntArrayRef padding, - IntArrayRef stride) { - TORCH_CHECK( - input_size.size() == 2, - "It is expected input_size equals to 2, but got size ", - input_size.size()); - // col2im_out_cuda checks size of kernel_size, dilation, padding and stride - at::native::col2im_out_cuda( - grad_output, - input_size, - kernel_size, - dilation, - padding, - stride, - grad_input); -} - } // namespace Tensor& im2col_out_cuda(const Tensor& input, @@ -188,42 +163,5 @@ Tensor im2col_cuda( return output; } -Tensor& im2col_backward_out_cuda(const Tensor& grad_output, - IntArrayRef input_size, - IntArrayRef kernel_size, - IntArrayRef dilation, - IntArrayRef padding, - IntArrayRef stride, - Tensor& grad_input) { - im2col_backward_out_cuda_template( - grad_input, 
- grad_output, - input_size, - kernel_size, - dilation, - padding, - stride); - return grad_input; -} - -Tensor im2col_backward_cuda( - const Tensor& grad_output, - IntArrayRef input_size, - IntArrayRef kernel_size, - IntArrayRef dilation, - IntArrayRef padding, - IntArrayRef stride) { - Tensor grad_input = at::empty_like(grad_output, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - im2col_backward_out_cuda_template( - grad_input, - grad_output, - input_size, - kernel_size, - dilation, - padding, - stride); - return grad_input; -} - } // namespace native } // namespace at diff --git a/aten/src/ATen/native/cuda/IndexKernel.cu b/aten/src/ATen/native/cuda/IndexKernel.cu index dee39b40e91e..f23c2dc3b387 100644 --- a/aten/src/ATen/native/cuda/IndexKernel.cu +++ b/aten/src/ATen/native/cuda/IndexKernel.cu @@ -12,6 +12,7 @@ #include #include #include +#include #include @@ -239,6 +240,21 @@ static void index_put_kernel(TensorIterator& iter, IntArrayRef index_size, IntAr }); } +void index_put_kernel_quantized_cuda(TensorIterator& iter, IntArrayRef index_size, IntArrayRef index_stride, bool accumulate, double scale, int zero_point) { + TORCH_CHECK(!accumulate, "index_put does not support accumulate=true"); + AT_DISPATCH_QINT_AND_SUB_BYTE_TYPES(iter.dtype(), "index_put", [&] { + constexpr int64_t qmin = std::numeric_limits::min(); + constexpr int64_t qmax = std::numeric_limits::max(); + float inv_scale = 1.0f / static_cast(scale); + + gpu_index_kernel(iter, index_size, index_stride, [inv_scale, zero_point, qmin, qmax]C10_DEVICE(char* out_data, char* in_data, int64_t offset) { + int64_t qvalue = static_cast(zero_point + nearbyintf(*(float*)in_data * inv_scale)); + qvalue = min(max(qvalue, qmin), qmax); + *(scalar_t*)(out_data + offset) = static_cast(qvalue); + }); + }); +} + template void cuda_take_put_kernel( TensorIterator& iter, @@ -451,4 +467,6 @@ REGISTER_DISPATCH(put_stub, &put_kernel); REGISTER_DISPATCH(take_stub, &take_kernel); REGISTER_DISPATCH(flip_stub, &flip_kernel); +REGISTER_CUDA_DISPATCH(index_put_kernel_quantized_stub, &index_put_kernel_quantized_cuda); + }} // namespace at::native diff --git a/aten/src/ATen/native/cuda/Indexing.cu b/aten/src/ATen/native/cuda/Indexing.cu index 6ea88069ca2e..9140e2ada8a3 100644 --- a/aten/src/ATen/native/cuda/Indexing.cu +++ b/aten/src/ATen/native/cuda/Indexing.cu @@ -1,6 +1,7 @@ #define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include +#include #include #include @@ -35,13 +36,13 @@ #include #include #include +#include #include #include namespace { - template __global__ void indexing_backward_kernel( int64_t* sorted_indices, int64_t* indices, scalar_t* grad_output, scalar_t* grad_weight, @@ -120,6 +121,65 @@ __global__ void indexing_backward_kernel( } } +template +__global__ void indexing_backward_kernel_quantized( + int64_t* sorted_indices, int64_t* indices, float* grad_output, scalar_t* grad_weight, + int64_t numel, int64_t stride, int64_t stride_before, int64_t outer_dim, + float inv_scale, int zero_point, int64_t qmin, int64_t qmax) { + + // This implementation is adopted from indexing_backward_kernel above. 
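index_put_kernel_quantized_cuda above and indexing_backward_kernel_quantized (whose body continues below) both write quantized values with the same affine step: q = clamp(round(x / scale) + zero_point, qmin, qmax), with the division precomputed as a multiplication by inv_scale. A host-side sketch of that single step (illustrative only):

#include <algorithm>
#include <cmath>
#include <cstdint>

int64_t quantize_value_sketch(float x, float scale, int zero_point, int64_t qmin, int64_t qmax) {
  const float inv_scale = 1.0f / scale;                       // multiply instead of divide in the hot loop
  const int64_t q = zero_point + static_cast<int64_t>(std::nearbyint(x * inv_scale));
  return std::min(std::max(q, qmin), qmax);                   // clamp to the representable range
}

// Example: scale = 0.5, zero_point = 10, qmin = -128, qmax = 127
//   quantize_value_sketch(3.2f, 0.5f, 10, -128, 127) == 16   (3.2 / 0.5 rounds to 6, plus 10)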
+ using opmath_t = at::opmath_type; + for (int64_t z = blockIdx.z; z < outer_dim; z += gridDim.z){ + int64_t idx = blockIdx.x * blockDim.y + threadIdx.y; + if (idx < numel + && (idx == 0 || sorted_indices[idx] != sorted_indices[idx - 1])){ + do { + int64_t start_feature = threadIdx.x + blockIdx.y * blockDim.x * SZ; + // we only keep the last duplicate index so skip those before it + if ((idx < numel - 1) && sorted_indices[idx] == sorted_indices[idx + 1]) { + idx++; + continue; + } + const int64_t weight_row = ((int64_t) sorted_indices[idx]) * stride + z * stride_before; + const int64_t grad_row = ((int64_t) indices[idx]) * stride + z * numel * stride; + const opmath_t scale = (opmath_t)1.0; + + opmath_t gradient[SZ]; + opmath_t weight[SZ]; + + while (start_feature < stride) { + #pragma unroll + for (int ii = 0; ii < SZ; ii++) { + int64_t feature_dim = start_feature + ii * C10_WARP_SIZE; + if (feature_dim < stride) { + gradient[ii] = static_cast(grad_output[grad_row + feature_dim]); + } + } + + #pragma unroll + for (int ii = 0; ii < SZ; ii++) { + weight[ii] = gradient[ii] * scale; + } + + #pragma unroll + for (int ii = 0; ii < SZ; ii++) { + int64_t feature_dim = start_feature + ii * C10_WARP_SIZE; + if (feature_dim < stride) { + // we do quantization here + int64_t qvalue = static_cast(zero_point + nearbyintf(weight[ii]* inv_scale)); + qvalue = min(max(qvalue, qmin), qmax); + grad_weight[weight_row + feature_dim] = static_cast(qvalue); + } + } + start_feature += gridDim.y * blockDim.x * SZ; + } + + idx++; + } while (idx < numel && sorted_indices[idx] == sorted_indices[idx - 1]); + } + } +} + } @@ -231,9 +291,14 @@ computeLinearIndex(const Tensor & src, TensorList indices, bool check_range) { static std::tuple> makeLinearIndex(Tensor self, IOptTensorListRef orig, bool check_range) { - checkIndexTensorTypes(orig); + checkIndexTensorTypes(orig, /*allow_int*/true); // first expand BoolTensor (masks) or ByteTensor (masks) into 1 or more LongTensors auto indices = expandTensors(self, orig); + for (auto & i : indices) { + if (i.defined() && i.dtype() == at::kInt) { + i = i.to(at::kLong); + } + } // next broadcast all index tensors together indices = expand_outplace(indices); // add missing null Tensors so that it matches self.dim() @@ -357,6 +422,106 @@ void index_put_with_sort_kernel(Tensor & self, const c10::List>& indices, const Tensor & value, double scale, int zero_point, bool unsafe) { + if (indices.size() > (size_t)self.dim()) { + TORCH_CHECK_INDEX(false, "too many indices for tensor of dimension ", self.dim(), " (got ", indices.size(), ")"); + } + bool self_contiguous = self.is_contiguous(); + auto self_ = self_contiguous ? 
self : self.contiguous(); + Tensor linearIndex, src, expandedValue = value; + int64_t nElemBefore, strideBefore, sliceSize; + std::vector inversePerm; + std::tie(linearIndex, src, nElemBefore, strideBefore, sliceSize, inversePerm) = makeLinearIndex(self_, indices, !unsafe); + int64_t num_indices = linearIndex.numel(); + + if (expandedValue.numel() < num_indices * nElemBefore * sliceSize) { + auto expanded_size = at::DimVector(expandedValue.sizes()); + auto size1 = expandedValue.sizes(); + auto size2 = linearIndex.sizes(); + if (are_expandable(size1, size2)) { + expanded_size = infer_size_dimvector(size1, size2); + } + if (nElemBefore > 1) { + expanded_size.insert(expanded_size.begin(), nElemBefore); + } + expandedValue = expandedValue.expand(expanded_size); + } + expandedValue = expandedValue.contiguous(); + + if (num_indices > 0 && sliceSize > 0) { + const bool permuted = !src.is_contiguous(); + auto src_ = permuted ? src.contiguous() : src; + linearIndex = linearIndex.reshape(-1); + auto sorted_indices = at::empty_like(linearIndex, LEGACY_CONTIGUOUS_MEMORY_FORMAT); + auto orig_indices = at::empty_like(linearIndex, LEGACY_CONTIGUOUS_MEMORY_FORMAT); + const cudaStream_t stream = at::cuda::getCurrentCUDAStream(); + + linearIndex.divide_(sliceSize, "trunc"); + + // cub on CUDA <= 11.2 have a bug that for small sizes + // cub's sort can be much slower than thrust's merge sort + // this bug is fixed in CUDA 11.3 +#if (defined(CUDA_VERSION) && CUDA_VERSION < 11030) || defined(USE_ROCM) + if (num_indices < 50000) { + index_put_with_sort_kernel_thrust_helper(linearIndex, orig_indices, sorted_indices, num_indices); + } else +#endif + { + // Sort the inputs into sorted with the corresponding indices + auto range = at::arange(num_indices, linearIndex.options()); + // linearIndex can not be negative, and we take advantage of this + // fact to sort on less bits for better performance. 
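The comment above is the reason the sort can be cheaper: the keys are non-negative and bounded by the largest flattened index divided by sliceSize, so the radix sort just below is told to process only the bits that can actually be set (the nbits argument obtained from cuda::cub::get_num_bits). A sketch of that bit-count computation:

#include <cstdint>

// Smallest number of bits needed to represent every key in [0, max_key].
int num_significant_bits_sketch(uint64_t max_key) {
  int nbits = 1;                          // even max_key == 0 still needs one bit
  while ((max_key >> nbits) != 0) {
    ++nbits;
  }
  return nbits;                           // e.g. max_key = 1000 -> 10 bits instead of 64
}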
+ int64_t nbits = cuda::cub::get_num_bits(largestIndex(self_) / sliceSize); + cuda::cub::radix_sort_pairs( + linearIndex.data_ptr(), sorted_indices.data_ptr(), + range.data_ptr(), orig_indices.data_ptr(), + num_indices, false, 0, nbits); + } + + TORCH_INTERNAL_ASSERT( + linearIndex.numel()*sliceSize*nElemBefore == expandedValue.numel(), + "number of flattened indices did not match number of elements in the value tensor: ", + linearIndex.numel()*sliceSize*nElemBefore, " vs ", expandedValue.numel()); + const int UNROLL = 4; + const int indices_per_block = 4; + const int warp_size = at::cuda::warp_size(); + dim3 grid(ceil_div(num_indices, (int64_t) indices_per_block), + std::min(at::cuda::getCurrentDeviceProperties()->maxGridSize[1], ceil_div(sliceSize, (int64_t) (warp_size*UNROLL))), + std::min(std::max(1,nElemBefore), at::cuda::getCurrentDeviceProperties()->maxGridSize[2])); + dim3 block(warp_size, indices_per_block); + + AT_DISPATCH_QINT_TYPES( + src.scalar_type(), "indexing_backward_quantized", [&] { + constexpr int64_t qmin = std::numeric_limits::min(); + constexpr int64_t qmax = std::numeric_limits::max(); + float inv_scale = 1.0f / static_cast(scale); + + indexing_backward_kernel_quantized<<>>( + sorted_indices.data_ptr(), + orig_indices.data_ptr(), + expandedValue.data_ptr(), + src_.data_ptr(), + num_indices, + sliceSize, + strideBefore, + nElemBefore, + inv_scale, + zero_point, + qmin, + qmax); + C10_CUDA_KERNEL_LAUNCH_CHECK(); + }); + + if (permuted) { + self.copy_(src_.permute(inversePerm)); + } else if (!self_contiguous) { + self.copy_(self_); + } + } +} + +REGISTER_CUDA_DISPATCH(index_put_with_sort_quantized_stub, &index_put_with_sort_quantized); } //anonymous @@ -1215,6 +1380,35 @@ void masked_fill_kernel(TensorIterator& iter, const Scalar& value) { }); } +template +void cuda_masked_fill_kernel_quantized(TensorIterator& iter, scalar_t quantized_val) { + gpu_kernel( + iter, [quantized_val] GPU_LAMBDA(scalar_t self, mask_t mask) -> scalar_t { + if (mask) { + return quantized_val; + } + return self; + }); +} + +void masked_fill_kernel_quantized(TensorIterator& iter, const Scalar& value, double scale, int zero_point) { + AT_DISPATCH_QINT_TYPES( + iter.common_dtype(), "masked_fill_", [&]() { + float float_val = value.to(); + const auto quantized_val = quantize_val(scale, zero_point, float_val); + auto mask_dtype = iter.input_dtype(0); + + if (mask_dtype == at::ScalarType::Bool) { + cuda_masked_fill_kernel_quantized(iter, quantized_val); + } + else { + cuda_masked_fill_kernel_quantized(iter, quantized_val); + } + }); +} + +REGISTER_CUDA_DISPATCH(masked_fill_kernel_quantized_stub, &masked_fill_kernel_quantized); + } // anonymous namespace Tensor & masked_fill__cuda(Tensor& self, const Tensor & mask, const Scalar& value) { diff --git a/aten/src/ATen/native/cuda/JitLoops.cuh b/aten/src/ATen/native/cuda/JitLoops.cuh index bb37a6acc2e1..6f350c550ce9 100644 --- a/aten/src/ATen/native/cuda/JitLoops.cuh +++ b/aten/src/ATen/native/cuda/JitLoops.cuh @@ -12,11 +12,7 @@ #include -#if !AT_ROCM_ENABLED() #include -#else -#error Jiterator not supported on ROCm -#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/cuda/KernelUtils.cuh b/aten/src/ATen/native/cuda/KernelUtils.cuh index 1e36e2db74d5..d2e956d1a3e4 100644 --- a/aten/src/ATen/native/cuda/KernelUtils.cuh +++ b/aten/src/ATen/native/cuda/KernelUtils.cuh @@ -1,6 +1,10 @@ #pragma once #include +#if !(defined(USE_ROCM) || ((defined(CUDA_VERSION) && CUDA_VERSION < 11000) || (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 800)))) 
+#include +#endif + namespace at { namespace native { @@ -66,7 +70,49 @@ __device__ __forceinline__ void fastSpecializedAtomicAdd( template < typename scalar_t, typename index_t, - typename std::enable_if::value>::type* = + typename std::enable_if::value>::type* = + nullptr> +__device__ __forceinline__ void fastSpecializedAtomicAdd( + scalar_t* tensor, + index_t index, + const index_t numel, + scalar_t value) { +#if ( \ + (defined(USE_ROCM)) || \ + (defined(CUDA_VERSION) && (CUDA_VERSION < 11000)) || \ + (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 800))) + gpuAtomicAddNoReturn( + reinterpret_cast(tensor) + index, + static_cast(value)); +#else + // Accounts for the chance tensor falls on an odd 16 bit alignment (ie, not 32 bit aligned) + __nv_bfloat16* target_addr = reinterpret_cast<__nv_bfloat16*>(tensor + index); + bool low_byte = (reinterpret_cast(target_addr) % sizeof(__nv_bfloat162) == 0); + + if (low_byte && index < (numel - 1)) { + __nv_bfloat162 value2; + value2.x = *reinterpret_cast<__nv_bfloat16*>(&value); + value2.y = __int2bfloat16_rz(0); + atomicAdd(reinterpret_cast<__nv_bfloat162*>(target_addr), value2); + + } else if (!low_byte && index > 0) { + __nv_bfloat162 value2; + value2.x = __int2bfloat16_rz(0); + value2.y = *reinterpret_cast<__nv_bfloat16*>(&value); + atomicAdd(reinterpret_cast<__nv_bfloat162*>(target_addr - 1), value2); + + } else { + atomicAdd( + reinterpret_cast<__nv_bfloat16*>(tensor) + index, *reinterpret_cast<__nv_bfloat16*>(&value)); + } +#endif +} + + +template < + typename scalar_t, + typename index_t, + typename std::enable_if::value && !std::is_same::value >::type* = nullptr> __device__ __forceinline__ void fastSpecializedAtomicAdd( scalar_t* tensor, diff --git a/aten/src/ATen/native/cuda/Lerp.cu b/aten/src/ATen/native/cuda/Lerp.cu index ac1f2ba379b5..c1adb5b6fc03 100644 --- a/aten/src/ATen/native/cuda/Lerp.cu +++ b/aten/src/ATen/native/cuda/Lerp.cu @@ -14,23 +14,13 @@ void lerp_tensor_kernel(at::TensorIteratorBase& iter) { at::ScalarType::Half, at::ScalarType::BFloat16, iter.common_dtype(), "lerp_cuda", [&] { - using opmath_t = at::opmath_type; at::native::gpu_kernel( iter, [] GPU_LAMBDA( scalar_t self_val, scalar_t end_val, scalar_t weight_val) -> scalar_t { - opmath_t self_val_f = self_val; - opmath_t end_val_f = end_val; - opmath_t weight_val_f = weight_val; - // Conditional for better numeric. This has been discussed in - // https://github.com/pytorch/pytorch/pull/18871 - return (std::abs(weight_val_f) < 0.5) - ? self_val_f + weight_val_f * (end_val_f - self_val_f) - : end_val_f - - (end_val_f - self_val_f) * - (opmath_t{1} - weight_val_f); + return lerp(self_val, end_val, weight_val); }); }); } @@ -44,14 +34,7 @@ void lerp_scalar_kernel(at::TensorIteratorBase& iter, const c10::Scalar& weight) auto weight_val = weight.to(); at::native::gpu_kernel( iter, [=] GPU_LAMBDA(scalar_t self_val, scalar_t end_val) { - opmath_t self_val_f = self_val; - opmath_t end_val_f = end_val; - // Conditional for better numeric. This has been discussed in - // https://github.com/pytorch/pytorch/pull/18871 - return (std::abs(weight_val) < 0.5) - ? 
self_val_f + weight_val * (end_val_f - self_val_f) - : end_val_f - - (end_val_f - self_val_f) * (opmath_t{1} - weight_val); + return lerp(self_val, end_val, weight_val); }); }); } diff --git a/aten/src/ATen/native/cuda/LinearAlgebra.cu b/aten/src/ATen/native/cuda/LinearAlgebra.cu index 280a5046ef06..ae6901a361af 100644 --- a/aten/src/ATen/native/cuda/LinearAlgebra.cu +++ b/aten/src/ATen/native/cuda/LinearAlgebra.cu @@ -101,14 +101,14 @@ static void _launch_kernel(int total_n_elems, func_t f) { C10_CUDA_KERNEL_LAUNCH_CHECK(); } -void unpack_pivots_cuda_kernel(TensorIterator& iter, const int64_t dim_size) { +void unpack_pivots_cuda_kernel(TensorIterator& iter, const int64_t dim_size, const int64_t max_pivot) { if (iter.numel() == 0) { return; } if (!iter.can_use_32bit_indexing()) { for (auto& sub_iter : iter.with_32bit_indexing()) { - unpack_pivots_cuda_kernel(sub_iter, dim_size); + unpack_pivots_cuda_kernel(sub_iter, dim_size, max_pivot); } return; } diff --git a/aten/src/ATen/native/cuda/LinearAlgebraStubs.cpp b/aten/src/ATen/native/cuda/LinearAlgebraStubs.cpp index 913e30b77c0f..655090d28e63 100644 --- a/aten/src/ATen/native/cuda/LinearAlgebraStubs.cpp +++ b/aten/src/ATen/native/cuda/LinearAlgebraStubs.cpp @@ -34,9 +34,7 @@ namespace native { #if defined(BUILD_LAZY_CUDA_LINALG) namespace { cuda::detail::LinalgDispatch disp = {_symeig_helper_cuda, - _cholesky_solve_helper_cuda, - legacy_lstsq_cuda, - _linalg_inv_out_helper_cuda}; + _cholesky_solve_helper_cuda}; at::DynamicLibrary& getTorchLinalgLibrary() { static at::DynamicLibrary lib("libtorch_cuda_linalg.so", nullptr, true); @@ -94,11 +92,6 @@ void lazy_linalg_eigh_kernel(const Tensor& eigenvalues, const Tensor& eigenvecto linalg_eigh_stub(DeviceType::CUDA, eigenvalues, eigenvectors, infos, upper, compute_eigenvectors); } -std::tuple lazy_eig_kernel(const Tensor& self, bool& eigenvectors) { - loadLazyTorchLinalgLibrary(); - return eig_stub(DeviceType::CUDA, self, eigenvectors); -} - void lazy_linalg_eig_kernel(Tensor& eigenvalues, Tensor& eigenvectors, Tensor& infos, const Tensor& input, bool compute_eigenvectors) { getTorchLinalgLibrary(); linalg_eig_stub(DeviceType::CUDA, eigenvalues, eigenvectors, infos, input, compute_eigenvectors); @@ -156,7 +149,6 @@ REGISTER_CUDA_DISPATCH(orgqr_stub, &lazy_orgqr_kernel); REGISTER_CUDA_DISPATCH(ormqr_stub, &lazy_ormqr_kernel); REGISTER_CUDA_DISPATCH(geqrf_stub, &lazy_geqrf_kernel); REGISTER_CUDA_DISPATCH(linalg_eigh_stub, &lazy_linalg_eigh_kernel); -REGISTER_CUDA_DISPATCH(eig_stub, &lazy_eig_kernel); REGISTER_CUDA_DISPATCH(linalg_eig_stub, &lazy_linalg_eig_kernel); REGISTER_CUDA_DISPATCH(svd_stub, &lazy_svd_kernel) REGISTER_CUDA_DISPATCH(lu_solve_stub, &lazy_lu_solve); @@ -177,18 +169,6 @@ void registerLinalgDispatch(const LinalgDispatch& disp_) { } }} //namespace cuda::detail -Tensor& _linalg_inv_out_helper_cuda(Tensor &result, Tensor& infos_lu, Tensor& infos_getri) { - getTorchLinalgLibrary(); - TORCH_CHECK(disp.inv_out_helper != _linalg_inv_out_helper_cuda, "Can't find _linalg_inv_out_helper_cuda"); - return disp.inv_out_helper(result, infos_lu, infos_getri); -} - -std::tuple legacy_lstsq_cuda(const Tensor &B, const Tensor &A) { - getTorchLinalgLibrary(); - TORCH_CHECK(disp.legacy_lstsq != legacy_lstsq_cuda, "Can't find legacy_lstsq_cuda"); - return disp.legacy_lstsq(B, A); -} - Tensor _cholesky_solve_helper_cuda(const Tensor& self, const Tensor& A, bool upper) { getTorchLinalgLibrary(); TORCH_CHECK(disp.cholesky_solve_helper != _cholesky_solve_helper_cuda, "Can't find 
_cholesky_solve_helper_cuda"); @@ -203,22 +183,4 @@ std::tuple _symeig_helper_cuda(const Tensor& self, bool eigenvec #endif /*defined(BUILD_LAZY_CUDA_LINALG)*/ -std::tuple legacy_lstsq_out_cuda( - const Tensor& B, const Tensor& A, Tensor& B_out, Tensor& A_out) { - const auto dtype = A.scalar_type(); - TORCH_CHECK(B.scalar_type() == dtype, "exepected A and B dtypes to match but found ", - A.scalar_type(), " and ", B.scalar_type()); - TORCH_CHECK(A_out.scalar_type() == dtype, "A_out to have scalar type ", dtype, - " but found", A_out.scalar_type()); - TORCH_CHECK(B_out.scalar_type() == dtype, "A_out to have scalar type ", dtype, - " but found", B_out.scalar_type()); - Tensor A_tmp, B_tmp; - std::tie(B_tmp, A_tmp) = native::legacy_lstsq_cuda(B, A); - resize_output(A_out, A_tmp.sizes()); - A_out.copy_(A_tmp); - resize_output(B_out, B_tmp.sizes()); - B_out.copy_(B_tmp); - return std::tuple(B_out, A_out); -} - }} // namespace at::native diff --git a/aten/src/ATen/native/cuda/LogcumsumexpKernel.cu b/aten/src/ATen/native/cuda/LogcumsumexpKernel.cu new file mode 100644 index 000000000000..28b3236caa2d --- /dev/null +++ b/aten/src/ATen/native/cuda/LogcumsumexpKernel.cu @@ -0,0 +1,37 @@ +#define TORCH_ASSERT_NO_OPERATORS +#include +#include +#include + +#include +#include + +#include +#include + +namespace at { namespace native { + +void launch_logcumsumexp_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { + AT_DISPATCH_FLOATING_TYPES_AND2( + ScalarType::Half, ScalarType::BFloat16, + self.scalar_type(), "logcumsumexp_cuda", + [&]() { + using opmath_t = at::opmath_type; + scalar_t init = -std::numeric_limits::infinity(); + auto log_add_exp = [] C10_HOST_DEVICE (const scalar_t x_, const scalar_t y_) -> scalar_t { + const opmath_t x{x_}, y{y_}; + auto min = at::_isnan(y) ? y : std::min(x, y); //std::min returns first arg if one of the args is nan + auto max = at::_isnan(y) ? y : std::max(x, y); //std::max returns first arg if one of the args is nan + if (min != max || ::isfinite(min)) { + // nan will be propagated here + return ::log1p(std::exp(min - max)) + max; + } else { + // special case to correctly handle infinite inputs + return x; + } + }; + scan_dim(self, result, dim, init, log_add_exp); + }); +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/Loss.cu b/aten/src/ATen/native/cuda/Loss.cu index fcb3229198ab..f1cda14a16a2 100644 --- a/aten/src/ATen/native/cuda/Loss.cu +++ b/aten/src/ATen/native/cuda/Loss.cu @@ -152,6 +152,7 @@ namespace { constexpr int NLL_LOSS_THREADS = 32; +// NOTE(crcrpar): `Byte` support was added for https://github.com/pytorch/pytorch/issues/59765. #define AT_DISPATCH_NLL_LOSS_INDEX_TYPES(TYPE, NAME, ...) 
\ AT_DISPATCH_SWITCH(TYPE, NAME, \ AT_PRIVATE_CASE_TYPE_USING_HINT(at::ScalarType::Byte, index_t, __VA_ARGS__) \ @@ -164,10 +165,10 @@ __global__ void nll_loss_forward_no_reduce_cuda_kernel( index_t* target, scalar_t* output, scalar_t* weights, - int n_classes, - int ignore_index) { + int64_t n_classes, + int64_t ignore_index) { CUDA_KERNEL_LOOP(index, batch_size) { - int cur_target = target[index]; + index_t cur_target = target[index]; if (cur_target == ignore_index) { output[index] = static_cast(0); continue; @@ -187,12 +188,12 @@ __global__ void nll_loss_forward_reduce_cuda_kernel_1d( index_t* target, scalar_t* weights, bool size_average, - int n_classes, + int64_t n_classes, int64_t ignore_index) { CUDA_KERNEL_ASSERT(threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0); - int t = static_cast(*target); - if (t != static_cast(ignore_index)) { + const index_t t = *target; + if (t != ignore_index) { CUDA_KERNEL_ASSERT(t >= 0 && t < n_classes); const auto cur_weight = weights != nullptr ? weights[t] : scalar_t{1}; *total_weight = cur_weight; @@ -223,9 +224,9 @@ __global__ void nll_loss_forward_reduce_cuda_kernel_2d( index_t* target, scalar_t* weights, bool size_average, - int nframe, - int ndim, - int n_classes, + int64_t nframe, + int64_t ndim, + int64_t n_classes, int64_t ignore_index) { // NOLINTNEXTLINE(cppcoreguidelines-init-variables) __shared__ accscalar_t sh_inputs[NLL_LOSS_THREADS], @@ -234,8 +235,8 @@ __global__ void nll_loss_forward_reduce_cuda_kernel_2d( sh_inputs[threadIdx.x] = static_cast(0); acc_weight[threadIdx.x] = static_cast(0); for (int i = threadIdx.x; i < nframe; i += NLL_LOSS_THREADS) { - int t = target[i]; - if (t != static_cast(ignore_index)) { + index_t t = target[i]; + if (t != ignore_index) { CUDA_KERNEL_ASSERT(t >= 0 && t < n_classes); scalar_t cur_weight = weights != nullptr ? weights[t] : static_cast(1); @@ -400,11 +401,11 @@ __global__ void nll_loss_backward_no_reduce_cuda_kernel( PackedTensorAccessor64 grad_output, PackedTensorAccessor64 grad_input, scalar_t *weights, - int n_classes, - int ignore_index) { + int64_t n_classes, + int64_t ignore_index) { CUDA_KERNEL_LOOP(index, batch_size) { - int cur_target = target[index]; + index_t cur_target = target[index]; if (cur_target == ignore_index) { continue; } @@ -422,19 +423,21 @@ __global__ void nll_loss_backward_reduce_cuda_kernel_1d( index_t *target, scalar_t *total_weight, bool size_average, - int n_classes, + int64_t n_classes, int64_t ignore_index ) { - int t = static_cast(*target); - if (t != static_cast(ignore_index)) { + const index_t t = *target; + if (t != ignore_index) { CUDA_KERNEL_ASSERT(t >= 0 && t < n_classes); - const auto grad = -(size_average ? *grad_output / *total_weight - : *grad_output); - grad_input[t] = weights != nullptr ? weights[t] * grad - : grad; + const auto grad = -(size_average ? *grad_output / *total_weight : *grad_output); + grad_input[t] = weights != nullptr ? weights[t] * grad : grad; } } +template struct bwd_index_type { using type = T; }; +template<> struct bwd_index_type { using type = int; }; +template<> struct bwd_index_type { using type = uint64_t; }; + template __global__ void nll_loss_backward_reduce_cuda_kernel_2d( scalar_t* grad_input, @@ -445,17 +448,20 @@ __global__ void nll_loss_backward_reduce_cuda_kernel_2d( bool size_average, int nframe, int ndim, - int n_classes, + int64_t n_classes, int64_t ignore_index) { + using bwd_index_t = typename bwd_index_type::type; const auto grad = -(size_average ? 
*grad_output / *total_weight : *grad_output); for (int i = threadIdx.x; i < nframe; i += NLL_LOSS_THREADS) { - int t = target[i]; - if (t != static_cast(ignore_index)) { + const index_t t = target[i]; + if (t != ignore_index) { CUDA_KERNEL_ASSERT(t >= 0 && t < n_classes); - grad_input[i * ndim + t] = weights != nullptr ? weights[t] * grad - : grad; + // NOTE(crcrpar): this index could overflow in int64_t as `t` itself can be close to the max. + const bwd_index_t index = static_cast(i) * ndim + t; + CUDA_KERNEL_ASSERT(index >= 0); + grad_input[index] = weights != nullptr ? weights[t] * grad : grad; } } } @@ -504,8 +510,7 @@ void nll_loss_backward_out_cuda_template( target.data_ptr(), grad_output.packed_accessor64(), grad_input.packed_accessor64(), - weight.defined() ? weight_.data_ptr() - : nullptr, + weight.defined() ? weight_.data_ptr() : nullptr, n_classes, ignore_index); C10_CUDA_KERNEL_LAUNCH_CHECK(); diff --git a/aten/src/ATen/native/cuda/MaxUnpooling.cu b/aten/src/ATen/native/cuda/MaxUnpooling.cu index 9c24c4ea8edc..ba1a7eb1f5cb 100644 --- a/aten/src/ATen/native/cuda/MaxUnpooling.cu +++ b/aten/src/ATen/native/cuda/MaxUnpooling.cu @@ -118,6 +118,10 @@ Tensor& max_unpooling2d_forward_out_cuda(const Tensor& self_, const Tensor& indices_, IntArrayRef output_size, Tensor& output) { + // See Note [Writing Nondeterministic Operations] + // Nondeterministic with duplicate indices + at::globalContext().alertNotDeterministic("max_unpooling2d_forward_out"); + TORCH_CHECK(output.is_contiguous(), "output must be contiguous"); TORCH_CHECK( indices_.scalar_type() == at::ScalarType::Long, @@ -291,6 +295,10 @@ Tensor& max_unpooling3d_forward_out_cuda(const Tensor& self_, IntArrayRef stride, IntArrayRef padding, Tensor& output) { + // See Note [Writing Nondeterministic Operations] + // Nondeterministic with duplicate indices + at::globalContext().alertNotDeterministic("max_unpooling3d_forward_out"); + TORCH_CHECK(output.is_contiguous(), "output must be contiguous"); max_unpooling3d_shape_check( self_, Tensor(), indices_, output_size, stride, padding, "max_unpooling3d_forward_out_cuda()"); diff --git a/aten/src/ATen/native/cuda/MultiMarginLoss.cu b/aten/src/ATen/native/cuda/MultiMarginLoss.cu index 15e6d1e9dc0c..26f21cfa59a2 100644 --- a/aten/src/ATen/native/cuda/MultiMarginLoss.cu +++ b/aten/src/ATen/native/cuda/MultiMarginLoss.cu @@ -31,6 +31,7 @@ __global__ void MultiMarginLoss_forward_kernel( scalar_t *input_k = input + k*dim; scalar_t *output_k = output + k; int target_k = static_cast(target[k]); + CUDA_KERNEL_ASSERT(target_k >= 0 && target_k < dim && "target index is out of bounds"); scalar_t input_target_k = input_k[target_k]; int i_start = threadIdx.x; diff --git a/aten/src/ATen/native/cuda/MultiTensorApply.cuh b/aten/src/ATen/native/cuda/MultiTensorApply.cuh index 29675695e013..a74144974a48 100644 --- a/aten/src/ATen/native/cuda/MultiTensorApply.cuh +++ b/aten/src/ATen/native/cuda/MultiTensorApply.cuh @@ -24,6 +24,7 @@ __device__ __forceinline__ void load_store(T* dst, T* src, int dst_offset, int s ((LT*)dst)[dst_offset] = ((LT*)src)[src_offset]; } +// TODO(crcrpar): Add `n>5` for `low prec params & their higher prec copy` // TensorListMetadata has to be < 4KB - the limit for kernel launch argument static constexpr int depth_to_max_tensors[5] = {110, 64, 48, 36, 30}; static constexpr int depth_to_max_blocks[5] = {320, 320, 320, 320, 320}; @@ -38,6 +39,18 @@ template struct TensorListMetadata int start_tensor_this_launch; }; +// NOTE(crcrpar): This is a conservative resolution to handle 
`state_steps` +// whose each element is `at::Tensor` of 1 element representing the number of `step`s called so far. +template struct FusedOptimizerTensorListMetadata +{ + void* addresses[n][depth_to_max_tensors[n-1]]; + int numel_for_tensor[depth_to_max_tensors[n-1]]; + void* state_steps_addresses[depth_to_max_tensors_scalarlist[n-1]]; + unsigned char block_to_tensor[depth_to_max_blocks[n-1]]; + int block_to_chunk[depth_to_max_blocks[n-1]]; + int start_tensor_this_launch; +}; + template struct TensorListScalarListMetadata { void* addresses[n][depth_to_max_tensors_scalarlist[n-1]]; @@ -184,6 +197,61 @@ void multi_tensor_apply( } } } +} + +template +void multi_tensor_apply_for_fused_optimizer( + std::vector>& tensor_lists, + at::TensorList state_steps, + T callable, + ArgTypes... args) { + TORCH_CHECK(tensor_lists.size() == depth, "Number of tensor lists has to match the depth"); + const auto num_tensors = tensor_lists[0].size(); + FusedOptimizerTensorListMetadata tensorListMeta; + + int loc_block_info = 0; + int loc_tensor_info = 0; + for (const auto & tensor_index : c10::irange(num_tensors)) { + tensorListMeta.state_steps_addresses[loc_tensor_info] = state_steps[tensor_index].data_ptr(); + tensorListMeta.numel_for_tensor[loc_tensor_info] = tensor_lists[0][tensor_index].numel(); + for (const auto & d : c10::irange(depth)) { + tensorListMeta.addresses[d][loc_tensor_info] = tensor_lists[d][tensor_index].data_ptr(); } + loc_tensor_info++; + + const auto chunks = (tensor_lists[0][tensor_index].numel() + kChunkSize - 1) / kChunkSize; + for (const auto & chunk : c10::irange(chunks)) { + tensorListMeta.block_to_tensor[loc_block_info] = loc_tensor_info - 1; + tensorListMeta.block_to_chunk[loc_block_info] = chunk; + loc_block_info++; + + const auto tensor_full = (loc_tensor_info == depth_to_max_tensors[depth - 1] && chunk == chunks - 1); + const auto blocks_full = loc_block_info == depth_to_max_blocks[depth - 1]; + const auto last_chunk = (tensor_index == num_tensors - 1 && chunk == chunks - 1); + + if (tensor_full || blocks_full || last_chunk) { + multi_tensor_apply_kernel<<>>( + tensorListMeta, + callable, + args...); + C10_CUDA_KERNEL_LAUNCH_CHECK(); + + // Reset. 
+ loc_block_info = 0; + if (chunk == chunks - 1) { + loc_tensor_info = 0; + } else { + tensorListMeta.numel_for_tensor[0] = tensorListMeta.numel_for_tensor[loc_tensor_info - 1]; + tensorListMeta.state_steps_addresses[0] = tensorListMeta.state_steps_addresses[loc_tensor_info - 1]; + for (const auto & d : c10::irange(depth)) { + tensorListMeta.addresses[d][0] = tensorListMeta.addresses[d][loc_tensor_info - 1]; + } + loc_tensor_info = 1; + } + } + } + } +} + } // namespace }} // at::native diff --git a/aten/src/ATen/native/cuda/MultinomialKernel.cu b/aten/src/ATen/native/cuda/MultinomialKernel.cu index de8e8404ac2d..c8473245604c 100644 --- a/aten/src/ATen/native/cuda/MultinomialKernel.cu +++ b/aten/src/ATen/native/cuda/MultinomialKernel.cu @@ -80,7 +80,7 @@ void renormRows(Tensor& t) { int64_t cols = t.size(1); auto props = at::cuda::getCurrentDeviceProperties(); - CUDA_KERNEL_ASSERT(props != NULL); + TORCH_CHECK(props != nullptr); int numSM = props->multiProcessorCount; const int64_t maxThreads = std::min( props->maxThreadsPerBlock, cuda_utils::kCUDABlockReduceMaxThreads); @@ -342,7 +342,7 @@ void multinomial_with_replacement_kernel_impl( AT_DISPATCH_FLOATING_TYPES_AND_HALF(self_v.scalar_type(), "multinomial_kernel_cuda", [&] { using accscalar_t = at::acc_type; auto props = at::cuda::getCurrentDeviceProperties(); - CUDA_KERNEL_ASSERT(props != NULL); + TORCH_CHECK(props != nullptr); int numSM = props->multiProcessorCount; int maxThreads = props->maxThreadsPerBlock; int maxShared = props->sharedMemPerBlock; diff --git a/aten/src/ATen/native/cuda/NLLLoss2d.cu b/aten/src/ATen/native/cuda/NLLLoss2d.cu index 2246c836f3dc..d3f128462529 100644 --- a/aten/src/ATen/native/cuda/NLLLoss2d.cu +++ b/aten/src/ATen/native/cuda/NLLLoss2d.cu @@ -44,6 +44,7 @@ inline scalar_t* optional_data(const Tensor& source) { using at::cuda::detail::CUDA_NUM_THREADS; using at::cuda::detail::GET_BLOCKS; +// TODO(crcrpar): Think about introducing `canUse32BitIndexMath` and choose int or int64_t for `target`. template C10_LAUNCH_BOUNDS_1(CUDA_NUM_THREADS) __global__ void nll_loss2d_forward_no_reduce_kernel( @@ -98,11 +99,13 @@ __global__ void nll_loss2d_forward_kernel( for (int i = (blockIdx.x % blocks_per_sample) * blockDim.x + threadIdx.x; i < map_nelem; i += step) { - int t = target[toffset + i]; + int64_t t = target[toffset + i]; if (t != ignore_index) { CUDA_KERNEL_ASSERT(t >= 0 && t < n_classes); cur_weight = weight != nullptr ? weight[t] : static_cast(1); - input_sum -= input[ioffset + i + map_nelem * t] * cur_weight; + const auto input_index = ioffset + i + map_nelem * t; + CUDA_KERNEL_ASSERT(input_index >= 0); + input_sum -= input[input_index] * cur_weight; acc_weight += cur_weight; } } @@ -185,9 +188,11 @@ __global__ void nll_loss2d_backward_kernel( for (int i = (blockIdx.x % blocks_per_sample) * blockDim.x + threadIdx.x; i < map_nelem; i += step) { - int t = (int)target_thread[i]; + const int64_t t = target_thread[i]; if (t != ignore_index) { CUDA_KERNEL_ASSERT(t >= 0 && t < n_classes); + const auto grad_input_index = i + map_nelem * t; + CUDA_KERNEL_ASSERT(grad_input_index >= 0); grad_input_thread[i + map_nelem * t] = weights != nullptr ? 
weights[t] * grad : grad; } @@ -268,9 +273,9 @@ void nll_loss2d_forward_out_cuda_template( 0, at::cuda::getCurrentCUDAStream()>>>( count, - input.packed_accessor(), - target.packed_accessor(), - output.packed_accessor(), + input.packed_accessor64(), + target.packed_accessor64(), + output.packed_accessor64(), optional_data(weight_), ignore_index); C10_CUDA_KERNEL_LAUNCH_CHECK(); @@ -403,9 +408,9 @@ void nll_loss2d_backward_out_cuda_template( 0, at::cuda::getCurrentCUDAStream()>>>( count, - target.packed_accessor(), - grad_output.packed_accessor(), - grad_input.packed_accessor(), + target.packed_accessor64(), + grad_output.packed_accessor64(), + grad_input.packed_accessor64(), optional_data(weight_), ignore_index); C10_CUDA_KERNEL_LAUNCH_CHECK(); diff --git a/aten/src/ATen/native/cuda/NaiveConvolutionTranspose3d.cu b/aten/src/ATen/native/cuda/NaiveConvolutionTranspose3d.cu index d34de0f156bd..0ed107f2db19 100644 --- a/aten/src/ATen/native/cuda/NaiveConvolutionTranspose3d.cu +++ b/aten/src/ATen/native/cuda/NaiveConvolutionTranspose3d.cu @@ -176,7 +176,7 @@ void slow_conv_transpose3d_out_cuda_template( const Tensor& input_, const Tensor& weight_, IntArrayRef kernel_size, - const Tensor& bias, + const Tensor& bias_, IntArrayRef stride, IntArrayRef padding, IntArrayRef output_padding, @@ -226,7 +226,7 @@ void slow_conv_transpose3d_out_cuda_template( int n_output_plane = weight_.size(1); TensorArg input_arg{input_, "input", 1}, output_arg{output, "output", 2}, - weight_arg{weight_, "weight", 3}, bias_arg{bias, "bias", 4}; + weight_arg{weight_, "weight", 3}, bias_arg{bias_, "bias", 4}; checkAllSameGPU( "slow_conv_transpose3d_out_cuda", @@ -236,7 +236,7 @@ void slow_conv_transpose3d_out_cuda_template( input_, Tensor(), weight_, - bias, + bias_, kernel_depth, kernel_width, kernel_height, @@ -254,12 +254,9 @@ void slow_conv_transpose3d_out_cuda_template( output_padding_height, 0); - TORCH_CHECK( - !bias.defined() || bias.is_contiguous(), - "bias tensor has to be contiguous"); - Tensor input = input_.contiguous(); Tensor weight = weight_.contiguous(); + Tensor bias = bias_.defined() ? bias_.contiguous() : bias_; int is_batch = false; if (input.dim() == 4) { diff --git a/aten/src/ATen/native/cuda/Normalization.cu b/aten/src/ATen/native/cuda/Normalization.cu index 3b27ebfc7d92..a8eff154c350 100644 --- a/aten/src/ATen/native/cuda/Normalization.cu +++ b/aten/src/ATen/native/cuda/Normalization.cu @@ -48,8 +48,11 @@ bool is_mixed_type(const Tensor& input, const Args&... 
parameters) { } inline bool batch_norm_use_channels_last_kernels(const at::Tensor& self) { - return (self.is_contiguous(at::MemoryFormat::ChannelsLast) || - (self.is_contiguous() && self.strides()[1] == 1)); + return ( + self.is_contiguous(at::MemoryFormat::ChannelsLast) || + self.is_contiguous(at::MemoryFormat::ChannelsLast3d) || + (self.is_contiguous() && self.strides()[1] == 1) + ); } enum class Impl { @@ -470,6 +473,22 @@ std::tuple batch_norm_cuda(const Tensor& self, const c10 return std::make_tuple(output, save_mean, save_invstd); } +std::tuple _batch_norm_legit_cuda(const Tensor& self, const c10::optional& weight_opt, const c10::optional& bias_opt, Tensor& running_mean, Tensor& running_var, bool train, double momentum, double epsilon) { + return batch_norm_cuda(self, weight_opt, bias_opt, running_mean, running_var, train, momentum, epsilon); +} + +std::tuple _batch_norm_legit_no_stats_cuda(const Tensor& self, const c10::optional& weight_opt, const c10::optional& bias_opt, bool train, double momentum, double epsilon) { + return batch_norm_cuda(self, weight_opt, bias_opt, Tensor(), Tensor(), train, momentum, epsilon); +} + +std::tuple _batch_norm_legit_cuda_out(const Tensor& self, const c10::optional& weight_opt, const c10::optional& bias_opt, Tensor& running_mean, Tensor& running_var, bool train, double momentum, double epsilon, Tensor& output, Tensor& save_mean, Tensor& save_invstd) { + return batch_norm_cuda_out(self, weight_opt, bias_opt, running_mean, running_var, train, momentum, epsilon, output, save_mean, save_invstd); +} + +std::tuple _batch_norm_legit_no_stats_cuda_out(const Tensor& self, const c10::optional& weight_opt, const c10::optional& bias_opt, bool train, double momentum, double epsilon, Tensor& output, Tensor& save_mean, Tensor& save_invstd) { + return batch_norm_cuda_out(self, weight_opt, bias_opt, Tensor(), Tensor(), train, momentum, epsilon, output, save_mean, save_invstd); +} + std::tuple batch_norm_backward_cuda(const Tensor& grad_out, const Tensor& input, const c10::optional& weight_opt, const c10::optional& running_mean_opt, const c10::optional& running_var_opt, const c10::optional& save_mean_opt, const c10::optional& save_invstd_opt, bool train, double epsilon, std::array grad_input_mask) { // See [Note: hacky wrapper removal for optional tensor] c10::MaybeOwned weight = at::borrow_from_optional_tensor(weight_opt); diff --git a/aten/src/ATen/native/cuda/Normalization.cuh b/aten/src/ATen/native/cuda/Normalization.cuh index a9b11e76db68..cc79284fea4d 100644 --- a/aten/src/ATen/native/cuda/Normalization.cuh +++ b/aten/src/ATen/native/cuda/Normalization.cuh @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -60,26 +61,10 @@ struct Float2 { v2 += a.v2; return *this; } -}; - -template -struct SumOp { - __device__ SumOp(const PTA& t) : tensor(t) {} - __device__ __forceinline__ accscalar_t operator()(int batch, int plane, int n) { - return static_cast(tensor[batch][plane][n]); - } - const PTA& tensor; -}; - -template -struct VarOp { - __device__ VarOp(accscalar_t m, const PTA& t) : mean(m), tensor(t) {} - __device__ __forceinline__ accscalar_t operator()(int batch, int plane, int n) { - accscalar_t val = tensor[batch][plane][n]; - return (val - mean) * (val - mean); + __device__ friend Float2 operator+(Float2 a, const Float2& b) { + a += b; + return a; } - const accscalar_t mean; - const PTA& tensor; }; template @@ -96,21 +81,25 @@ struct GradOp { const PTA& grad_output; }; -// Sum across all threads within a warp -template -static 
__device__ __forceinline__ T warpSum(T val) { - for (int i = 0; i < getMSB(C10_WARP_SIZE); ++i) { - val += WARP_SHFL_XOR(val, 1 << i, C10_WARP_SIZE); - } - return val; -} +template +struct SumReduceOp { + __device__ __forceinline__ acc_t combine(acc_t a, acc_t b) const { return a + b; } + + __device__ __forceinline__ acc_t warp_shfl_down(acc_t data, int offset) const { + return WARP_SHFL_DOWN(data, offset); + } +}; template -static __device__ __forceinline__ Float2 warpSum(Float2 value) { - value.v1 = warpSum(value.v1); - value.v2 = warpSum(value.v2); - return value; -} +struct SumReduceOp> { + using acc_t = Float2; + + __device__ __forceinline__ acc_t combine(acc_t a, acc_t b) const { return a + b; } + + __device__ __forceinline__ acc_t warp_shfl_down(acc_t data, int offset) const { + return {WARP_SHFL_DOWN(data.v1, offset), WARP_SHFL_DOWN(data.v2, offset)}; + } +}; // Sum across (batch, x/y/z) applying Op() pointwise // this works by first having each thread sum it's part @@ -130,37 +119,13 @@ __device__ scalar_t reduce(Op op, PTA tensor, int plane) { sum += op(batch, plane, x); } } - - // first warpSum to get one value per thread to - // one value per warp - sum = warpSum(sum); - - // this writes each warps item into shared memory - // there are at most C10_WARP_SIZE items left because - // there are at most C10_WARP_SIZE**2 threads at the beginning __shared__ scalar_t shared[C10_WARP_SIZE]; - __syncthreads(); - int tid = threadIdx.x + threadIdx.y * blockDim.x; - if (tid % C10_WARP_SIZE == 0) { - shared[tid / C10_WARP_SIZE] = sum; - } - if (tid >= blockDim.x * blockDim.y / C10_WARP_SIZE && tid < C10_WARP_SIZE) { - // zero out the other entries in shared - shared[tid] = (scalar_t)0; - } - __syncthreads(); - // now have a second warpSum to reduce the intermediate values - // from shared memory to a single number. The very first - // thread writes it to shared memory. - - if (tid / C10_WARP_SIZE == 0) { - sum = warpSum(shared[tid]); - if (tid == 0) { + SumReduceOp reduce_op; + sum = cuda_utils::BlockReduce, cuda_utils::Block2D>(sum, reduce_op, 0, shared); + if (threadIdx.x == 0 && threadIdx.y == 0) { shared[0] = sum; - } } __syncthreads(); - // Everyone picks it up, should be broadcast into the whole grad_input return shared[0]; } diff --git a/aten/src/ATen/native/cuda/Pow.cuh b/aten/src/ATen/native/cuda/Pow.cuh new file mode 100644 index 000000000000..9530b0ede274 --- /dev/null +++ b/aten/src/ATen/native/cuda/Pow.cuh @@ -0,0 +1,58 @@ +#pragma once +#include +#include + +namespace at { namespace native { + +namespace { + + +// SFINAE doesn't work well with NVCC under Windows for math functions like pow and sqrt. +// So we need to define the functions with the explicit function signatures. 
+// As for pow, the following signatures are defined as the device function: +// pow(float, int) +// pow(double, int) +// pow(float, float) +// pow(double, double) +#ifdef _MSC_VER +// Functions for pow +// pow for at::Half +static inline __host__ __device__ at::Half pow_(at::Half base, at::Half exp) { + return static_cast(std::pow(static_cast(base), static_cast(exp))); +} +// pow for at::BFloat16 +static inline __host__ __device__ at::BFloat16 pow_(at::BFloat16 base, at::BFloat16 exp) { + return static_cast(std::pow(static_cast(base), static_cast(exp))); +} +// pow (floating, floating/int) +template +static inline __host__ __device__ typename std::enable_if::value && (std::is_same::value || std::is_same::value), Base_type>::type + pow_(Base_type base, Exp_type exp) { + return std::pow(base, exp); +} +// pow (Otherwise) +template +static inline __host__ __device__ typename std::enable_if::value && !std::is_same::value, Base_type>::type + pow_(Base_type base, Exp_type exp) { + return static_cast(std::pow(static_cast(base), static_cast(exp))); +} +#else +template +static inline __host__ __device__ Base_type pow_(Base_type base, Exp_type exp) { + return ::pow(base, exp); +} +#endif + +template +static inline __host__ __device__ std::enable_if_t::value, T> pow_( + T base, T exp) { + return at::native::powi(base, exp); +} + +template +static inline __host__ __device__ c10::complex pow_(c10::complex base, c10::complex exp) { + return c10_complex_math::pow(base, exp); +} + +} // namespace +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/PowKernel.cu b/aten/src/ATen/native/cuda/PowKernel.cu index a1e453455d1b..30868f27d609 100644 --- a/aten/src/ATen/native/cuda/PowKernel.cu +++ b/aten/src/ATen/native/cuda/PowKernel.cu @@ -3,6 +3,7 @@ #include #include #include +#include #include #include #include @@ -17,54 +18,6 @@ void reciprocal_kernel_cuda(TensorIteratorBase& iter); namespace { - -// SFINAE doesn't work well with NVCC under Windows for math functions like pow and sqrt. -// So we need to define the functions with the explicit function signatures. 
-// As for pow, the following signatures are defined as the device function: -// pow(float, int) -// pow(double, int) -// pow(float, float) -// pow(double, double) -#ifdef _MSC_VER -// Functions for pow -// pow for at::Half -static inline __host__ __device__ at::Half pow_(at::Half base, at::Half exp) { - return static_cast(std::pow(static_cast(base), static_cast(exp))); -} -// pow for at::BFloat16 -static inline __host__ __device__ at::BFloat16 pow_(at::BFloat16 base, at::BFloat16 exp) { - return static_cast(std::pow(static_cast(base), static_cast(exp))); -} -// pow (floating, floating/int) -template -static inline __host__ __device__ typename std::enable_if::value && (std::is_same::value || std::is_same::value), Base_type>::type - pow_(Base_type base, Exp_type exp) { - return std::pow(base, exp); -} -// pow (Otherwise) -template -static inline __host__ __device__ typename std::enable_if::value && !std::is_same::value, Base_type>::type - pow_(Base_type base, Exp_type exp) { - return static_cast(std::pow(static_cast(base), static_cast(exp))); -} -#else -template -static inline __host__ __device__ Base_type pow_(Base_type base, Exp_type exp) { - return ::pow(base, exp); -} -#endif - -template -static inline __host__ __device__ std::enable_if_t::value, T> pow_( - T base, T exp) { - return at::native::powi(base, exp); -} - -template -static inline __host__ __device__ c10::complex pow_(c10::complex base, c10::complex exp) { - return c10_complex_math::pow(base, exp); -} - void pow_tensor_scalar_kernel(TensorIteratorBase& iter, const Scalar& exp_scalar); template diff --git a/aten/src/ATen/native/cuda/Reduce.cuh b/aten/src/ATen/native/cuda/Reduce.cuh index 34e99ae57a59..0b3e4a622487 100644 --- a/aten/src/ATen/native/cuda/Reduce.cuh +++ b/aten/src/ATen/native/cuda/Reduce.cuh @@ -1135,8 +1135,23 @@ inline void gpu_reduce_kernel(TensorIterator& iter, const ops_t& ops, ident_t id using traits = function_traits; using arg_t = typename traits::template arg<0>::type; + // at::Half/at::ComplexHalf overflows easily as it's range is very small. + // So when scalar_t and out_scalar_t are at::Half/at::ComplexHalf, we + // set can_accumulate_in_output to False. + static constexpr bool is_inp_out_type_half_or_chalf = + (std::is_same::value && + std::is_same::value) || + (std::is_same, scalar_t>::value && + std::is_same, out_scalar_t>::value); + // at::BFloat16 has lower precision and can lead to rounding errors. + // So when scalar_t and out_scalar_t are at::BFloat16, we + // set can_accumulate_in_output to False. + static constexpr bool is_inp_out_type_bfloat16 = + (std::is_same::value && + std::is_same::value); static constexpr bool can_accumulate_in_output = - std::is_convertible::value; + std::is_convertible::value && + !(is_inp_out_type_half_or_chalf || is_inp_out_type_bfloat16); bool can_use_32bit_indexing = iter.can_use_32bit_indexing(); std::unique_ptr owned_buf_ptr; @@ -1227,9 +1242,23 @@ inline void jitted_gpu_reduce_kernel(TensorIterator& iter, const std::string& fu //TODO - this will be different for more complicated reductions, but for now reductions using //func_wrapper all have arg_t = opmath using arg_t = at::opmath_type; + // at::Half/at::ComplexHalf overflows easily as it's range is very small. + // So when scalar_t and out_scalar_t are at::Half/at::ComplexHalf, we + // set can_accumulate_in_output to False. 
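The two Reduce.cuh hunks here tighten can_accumulate_in_output: besides requiring that arg_t converts to the output type, accumulation in the output buffer is now disabled when input and output are both at::Half/at::ComplexHalf (tiny range, easy overflow) or both at::BFloat16 (coarse rounding), and the jitted path drops its static_assert accordingly. A minimal standalone sketch of the trait logic, with placeholder Half/BFloat16 types and the complex-half case omitted:

    #include <type_traits>

    struct Half {};      // stand-in for at::Half (assumed placeholder, not the ATen type)
    struct BFloat16 {};  // stand-in for at::BFloat16

    // Accumulating partial results directly in the output is only safe when the
    // accumulator type converts to the output type and the in/out pair is not a
    // narrow floating-point type that would overflow or round during accumulation.
    template <typename scalar_t, typename out_scalar_t, typename arg_t>
    constexpr bool can_accumulate_in_output() {
      constexpr bool narrow_pair =
          (std::is_same<scalar_t, Half>::value &&
           std::is_same<out_scalar_t, Half>::value) ||
          (std::is_same<scalar_t, BFloat16>::value &&
           std::is_same<out_scalar_t, BFloat16>::value);
      return std::is_convertible<arg_t, out_scalar_t>::value && !narrow_pair;
    }

    static_assert(can_accumulate_in_output<float, float, float>(),
                  "wide types may accumulate in the output buffer");
    static_assert(!can_accumulate_in_output<BFloat16, BFloat16, float>(),
                  "bf16 reductions keep accumulation in arg_t");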
+ static constexpr bool is_inp_out_type_half_or_chalf = + (std::is_same::value && + std::is_same::value) || + (std::is_same, scalar_t>::value && + std::is_same, out_scalar_t>::value); + // at::BFloat16 has lower precision and can lead to rounding errors. + // So when scalar_t and out_scalar_t are at::BFloat16, we + // set can_accumulate_in_output to False. + static constexpr bool is_inp_out_type_bfloat16 = + (std::is_same::value && + std::is_same::value); static constexpr bool can_accumulate_in_output = - std::is_convertible::value; - static_assert(can_accumulate_in_output == true, "unsupported arg_t for jitted reduction"); + std::is_convertible::value && + !(is_inp_out_type_half_or_chalf || is_inp_out_type_bfloat16); bool can_use_32bit_indexing = iter.can_use_32bit_indexing(); std::unique_ptr owned_buf_ptr; diff --git a/aten/src/ATen/native/cuda/ReflectionPad.cu b/aten/src/ATen/native/cuda/ReflectionPad.cu index 33f71368ca10..5380b0fef5f2 100644 --- a/aten/src/ATen/native/cuda/ReflectionPad.cu +++ b/aten/src/ATen/native/cuda/ReflectionPad.cu @@ -335,7 +335,7 @@ void reflection_pad2d_out_template( int64_t size_y = nplane; int64_t size_z = nbatch; - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1(kHalf, + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16, input.scalar_type(), "reflection_pad2d_out_template", [&] { for (int64_t block_y = 0; block_y < size_y; block_y += 65535) { @@ -407,7 +407,7 @@ void reflection_pad2d_backward_out_template( int64_t size_y = nplane; int64_t size_z = nbatch; - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1(kHalf, + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16, input.scalar_type(), "reflection_pad2d_backward_out_template", [&] { for (int64_t block_y = 0; block_y < size_y; block_y += 65535) { @@ -463,8 +463,8 @@ TORCH_IMPL_FUNC(reflection_pad1d_out_cuda) Tensor input = input_.contiguous(); - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1( - kHalf, input.scalar_type(), "reflection_pad1d_out_template", [&] { + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2( + kHalf, kBFloat16, input.scalar_type(), "reflection_pad1d_out_template", [&] { reflection_pad1d_out_kernel<<< grid_size, block_size, @@ -520,7 +520,7 @@ TORCH_IMPL_FUNC(reflection_pad1d_backward_out_cuda)(const Tensor& grad_output_, dim3 block_size(output_w > 256 ? 
256 : output_w); dim3 grid_size((int) ::ceil(output_w / 256.0), nplane, nbatch); - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1(kHalf, + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16, grad_input.scalar_type(), "reflection_pad1d_backward_out_cuda", [&] { reflection_pad1d_backward_out_kernel<<< grid_size, block_size, 0, at::cuda::getCurrentCUDAStream()>>>( @@ -589,7 +589,7 @@ TORCH_IMPL_FUNC(reflection_pad3d_out_cuda) ( auto input = input_.contiguous(); bool batch_mode = (input.dim() == 5); - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1(kHalf, + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16, input.scalar_type(), "reflection_pad3d_out_cuda", [&] { auto input_inner = input; auto output_inner = output; @@ -641,7 +641,7 @@ TORCH_IMPL_FUNC(reflection_pad3d_backward_out_cuda) ( int64_t pad_top = padding[2]; int64_t pad_front = padding[4]; - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1(kHalf, + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16, input.scalar_type(), "reflection_pad3d_backward_out_cuda", [&] { auto grad_input_ = grad_input; auto grad_output_ = grad_output; diff --git a/aten/src/ATen/native/cuda/RreluWithNoise.cu b/aten/src/ATen/native/cuda/RreluWithNoise.cu index 762098ab7770..c97cd15e1e85 100644 --- a/aten/src/ATen/native/cuda/RreluWithNoise.cu +++ b/aten/src/ATen/native/cuda/RreluWithNoise.cu @@ -17,7 +17,7 @@ namespace at { namespace native { template -#if __CUDA_ARCH__ >= 350 || defined __HIP_PLATFORM_HCC__ +#if __CUDA_ARCH__ >= 350 || defined USE_ROCM C10_LAUNCH_BOUNDS_2(256, 4) #endif __global__ void rrelu_with_noise_cuda_kernel( diff --git a/aten/src/ATen/native/cuda/ScanKernels.cu b/aten/src/ATen/native/cuda/ScanUtils.cuh similarity index 84% rename from aten/src/ATen/native/cuda/ScanKernels.cu rename to aten/src/ATen/native/cuda/ScanUtils.cuh index 44982208c086..ba27a245172b 100644 --- a/aten/src/ATen/native/cuda/ScanKernels.cu +++ b/aten/src/ATen/native/cuda/ScanUtils.cuh @@ -1,18 +1,15 @@ -#define TORCH_ASSERT_NO_OPERATORS -#include -#include -#include -#include -#include +#pragma once #include -#include -#include - +#include #include +#include -#include +#include +#include +#include -namespace at { namespace native { +namespace at { +namespace native { template constexpr inline integer ceil_div(integer n, integer m) { @@ -158,7 +155,7 @@ __global__ void tensor_kernel_scan_outer_dim_with_indices(scalar_t *self_, scala } } -void check_fits_in_unsigned(int64_t val, const char* name) { +inline void check_fits_in_unsigned(int64_t val, const char* name) { constexpr auto umax = std::numeric_limits::max(); TORCH_CHECK( val >= 0 && val <= umax, name, " must fit in a 32-bit uint32_t value"); @@ -224,22 +221,6 @@ void scan_dim_with_indices(const TensorBase& self, const TensorBase& values, con } } -void launch_cummax_cuda_kernel(const TensorBase& self, const TensorBase& values, const TensorBase& indices, int64_t dim) { - AT_DISPATCH_ALL_TYPES_AND3(at::ScalarType::Bool, at::ScalarType::Half, at::ScalarType::BFloat16, - self.scalar_type(), "cummax_cuda", [&]() { - scalar_t init = self.is_floating_point() ? 
(-1*std::numeric_limits::infinity()) : std::numeric_limits::lowest(); - scan_dim_with_indices(self, values, indices, dim, init, std::greater_equal()); - }); -} - -void launch_cummin_cuda_kernel(const TensorBase& self, const TensorBase& values, const TensorBase& indices, int64_t dim) { - AT_DISPATCH_ALL_TYPES_AND3(at::ScalarType::Bool, at::ScalarType::Half, at::ScalarType::BFloat16, - self.scalar_type(), "cummin_cuda", [&]() { - scalar_t init = self.is_floating_point() ? std::numeric_limits::infinity() : std::numeric_limits::max(); - scan_dim_with_indices(self, values, indices, dim, init, std::less_equal()); - }); -} - // TODO: The implementation of `tensor_kernel_scan_outer_dim` and // `tensor_kernel_scan_innermost_dim` is similar to // `tensor_kernel_scan_outer_dim_with_indices` @@ -468,54 +449,4 @@ void scan_dim(const TensorBase& self, const TensorBase& result, } } -void launch_logcumsumexp_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { - AT_DISPATCH_FLOATING_TYPES_AND2( - ScalarType::Half, ScalarType::BFloat16, - self.scalar_type(), "logcumsumexp_cuda", - [&]() { - using accscalar_t = acc_type; - scalar_t init = -std::numeric_limits::infinity(); - auto log_add_exp = [] C10_HOST_DEVICE (const scalar_t x, const scalar_t y) -> scalar_t { - scalar_t min = at::_isnan(y) ? y : std::min(x,y); //std::min returns first arg if one of the args is nan - scalar_t max = at::_isnan(y) ? y : std::max(x,y); //std::max returns first arg if one of the args is nan - if (min != max || ::isfinite(static_cast(min))) { - // nan will be propagated here - return ::log1p(std::exp(min - max)) + max; - } else { - // special case to correctly handle infinite inputs - return x; - } - }; - scan_dim(self, result, dim, init, log_add_exp); - }); -} - -void launch_cumsum_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( - ScalarType::Half, ScalarType::BFloat16, - self.scalar_type(), "cumsum_cuda", - [&]() { - scalar_t init = 0; - scan_dim( - self, - result, - dim, - init, - std::plus()); - }); -} - -void launch_cumprod_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2( - ScalarType::Half, ScalarType::BFloat16, self.scalar_type(), "cumprod_cuda", [&]() { - scalar_t init = 1; - scan_dim( - self, - result, - dim, - init, - std::multiplies()); - }); -} - -}} // namespace at::native +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/Shape.cu b/aten/src/ATen/native/cuda/Shape.cu index 08605cf4ed1b..389515eac1e6 100644 --- a/aten/src/ATen/native/cuda/Shape.cu +++ b/aten/src/ATen/native/cuda/Shape.cu @@ -252,7 +252,7 @@ void parallel_cat(const Tensor &out, const MaterializedITensorListRef& inputs, i } // namespace TORCH_IMPL_FUNC(cat_out_cuda) -(ITensorListRef tensors, +(const ITensorListRef& tensors, int64_t dim, int64_t valid, bool all_contiguous, diff --git a/aten/src/ATen/native/cuda/SoftMax.cu b/aten/src/ATen/native/cuda/SoftMax.cu index c53276e619be..6df916caaa85 100644 --- a/aten/src/ATen/native/cuda/SoftMax.cu +++ b/aten/src/ATen/native/cuda/SoftMax.cu @@ -636,8 +636,8 @@ cunn_SoftMaxForward(outscalar_t *output, scalar_t *input, int classes) // forward pointers to batch[blockIdx.x] // each block handles a sample in the mini-batch - input += blockIdx.x * classes; - output += blockIdx.x * classes; + input += static_cast(blockIdx.x) * classes; + output += static_cast(blockIdx.x) * classes; const int shift = ((uint64_t)input) % ALIGN_BYTES / 
sizeof(scalar_t); const int output_shift = ((uint64_t)output) % ALIGN_BYTES / sizeof(outscalar_t); @@ -672,9 +672,9 @@ cunn_SoftMaxBackward(scalar_t *gradInput, outscalar_t *output, outscalar_t *grad extern __shared__ unsigned char smem[]; auto sdata = reinterpret_cast(smem); - gradInput += blockIdx.x * classes; - output += blockIdx.x * classes; - gradOutput += blockIdx.x * classes; + gradInput += static_cast(blockIdx.x) * classes; + output += static_cast(blockIdx.x) * classes; + gradOutput += static_cast(blockIdx.x) * classes; const int shift = ((uint64_t)gradInput) % ALIGN_BYTES / sizeof(scalar_t); const int output_shift = ((uint64_t)output) % ALIGN_BYTES / sizeof(outscalar_t); @@ -963,7 +963,7 @@ Tensor masked_softmax_cuda(const Tensor& input_, const Tensor& mask_, const c10: TORCH_CHECK(mask_type_.has_value(), "Mask Type should be defined"); int64_t mask_type = mask_type_.value(); - TORCH_CHECK((mask_type == 0) || (mask_type == 1), "Mask Type should be 0 (src_mask) or 1 (src_key_padding_mask)"); + TORCH_CHECK((mask_type == 0) || (mask_type == 1) || (mask_type == 2), "Mask Type should be 0 (src_mask), 1 (src_key_padding_mask), or 2 (default_mask)"); // If input is [B, H, T, T] and mask is [B, T] // we have special fast kernel @@ -975,6 +975,7 @@ Tensor masked_softmax_cuda(const Tensor& input_, const Tensor& mask_, const c10: // TODO We should have special fast kernel for TxT mask as well // mask_type == 0 => mask_ is a src_mask bool is_TxT_mask = (mask_type == 0) && input_.dim() == 4 && mask_.dim() == 2 && input_.size(3) == mask_.size(1) && input_.size(2) == mask_.size(0) && mask_.size(0) == mask_.size(1); + // If mask_type == 2, then mask_.sizes() must equal input_.sizes() TORCH_CHECK(mask_.sizes() == input_.sizes() || is_BxT_mask || is_TxT_mask, "Mask shape should match input. mask: ", mask_.sizes(), " input: ", input_.sizes()); auto input = input_.dim() == 0 ? input_.view(1) : input_; @@ -992,7 +993,9 @@ Tensor masked_softmax_cuda(const Tensor& input_, const Tensor& mask_, const c10: // 4) dim == input.dim() - 1 // Otherwise, we fallback to vanilla softmax (where we do not support transformer_mask since converting the mask is expensive) if (softmax_elements > 1024 || softmax_elements * input.element_size() > 4096 || !mask.is_contiguous() || dim < input.dim()-1) { - TORCH_CHECK(mask.sizes() == input.sizes(), "Mask shape should match input shape; transformer_mask is not supported in the fallback case."); + if (is_BxT_mask) { + mask = mask.view({mask_.size(0), 1, 1, mask_.size(1)}).expand(input.sizes()); + } AT_DISPATCH_FLOATING_TYPES_AND2( ScalarType::Half, ScalarType::BFloat16, @@ -1061,7 +1064,7 @@ Tensor masked_softmax_backward_cuda( auto grad = grad_.contiguous(); auto output = output_.contiguous(); auto mask = mask_.contiguous(); - int64_t dim = dim_.has_value() ? dim_.value() : output.dim() - 1; + int64_t dim = dim_.has_value() ? maybe_wrap_dim(dim_.value(), output.dim()) : output.dim() - 1; grad = grad.dim() == 0 ? grad.view(1) : grad; mask = mask.dim() == 0 ? 
mask.view(1) : mask; diff --git a/aten/src/ATen/native/cuda/SparseBinaryOpIntersectionKernel.cu b/aten/src/ATen/native/cuda/SparseBinaryOpIntersectionKernel.cu new file mode 100644 index 000000000000..d34e0c62e6ab --- /dev/null +++ b/aten/src/ATen/native/cuda/SparseBinaryOpIntersectionKernel.cu @@ -0,0 +1,150 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include +#include +#include + +namespace at { +namespace native { + +namespace { + +template +struct CUDAKernelLauncher { + static void launch(TensorIteratorBase& iter, const func_t& f) { + gpu_kernel(iter, f); + } +}; + +struct MulOp { + template + static FUNCAPI scalar_t apply(scalar_t a, scalar_t b) { + return a * b; + } +}; + +template <> +FUNCAPI bool MulOp::apply(bool a, bool b) { + return a && b; +} + +template +C10_LAUNCH_BOUNDS_2(nt, vt) +__global__ void apply_kernel(int n, loop_t loop) { + constexpr int nv = nt * vt; + int idx = nv * blockIdx.x + threadIdx.x; + + #pragma unroll + for (int i = 0; i < vt; ++i) { + if (idx < n) { + loop(idx); + idx += nt; + } + } +} + +template +void launch_kernel(int64_t n, const loop_t& loop) { + TORCH_INTERNAL_ASSERT(0 <= n && n <= std::numeric_limits::max()); + if (!n) { + return; + } + + const dim3 block(nt); + const dim3 grid((n + block.x * vt - 1) / (block.x * vt)); + const auto stream = at::cuda::getCurrentCUDAStream(); + apply_kernel<<>>(n, loop); + C10_CUDA_KERNEL_LAUNCH_CHECK(); +} + +template +void binary_op_intersection_kernel( + TensorIterator& iter, + int64_t lhs_nnz_stride, + int64_t rhs_nnz_stride) { + if (!iter.can_use_32bit_indexing()) { + for (auto& sub_iter : iter.with_32bit_indexing()) { + binary_op_intersection_kernel( + sub_iter, lhs_nnz_stride, rhs_nnz_stride); + } + return; + } + + auto* RESTRICT ptr_res_values_bytes = reinterpret_cast(iter.data_ptr(0)); + const auto* RESTRICT ptr_lhs_values_bytes = reinterpret_cast(iter.data_ptr(1)); + const auto* RESTRICT ptr_lhs_select_idx_bytes = reinterpret_cast(iter.data_ptr(2)); + const auto* RESTRICT ptr_rhs_values_bytes = reinterpret_cast(iter.data_ptr(3)); + const auto* RESTRICT ptr_rhs_select_idx_bytes = reinterpret_cast(iter.data_ptr(4)); + + auto offset_calc = make_offset_calculator<5>(iter); + auto loop = [=] FUNCAPI (int i) { + auto offsets = offset_calc.get(i); + + auto* RESTRICT ptr_res_values = reinterpret_cast(ptr_res_values_bytes + offsets[0]); + const auto* RESTRICT ptr_lhs_values = reinterpret_cast(ptr_lhs_values_bytes + offsets[1]); + const auto lhs_nnz_idx = *reinterpret_cast(ptr_lhs_select_idx_bytes + offsets[2]); + const auto* RESTRICT ptr_rhs_values = reinterpret_cast(ptr_rhs_values_bytes + offsets[3]); + const auto rhs_nnz_idx = *reinterpret_cast(ptr_rhs_select_idx_bytes + offsets[4]); + + *ptr_res_values = binary_op_t::apply( + *(ptr_lhs_values + lhs_nnz_idx * lhs_nnz_stride), + *(ptr_rhs_values + rhs_nnz_idx * rhs_nnz_stride)); + }; + + launch_kernel(iter.numel(), loop); +} + + +template +struct CUDAValueSelectionIntersectionKernel { + static Tensor apply( + const Tensor& lhs_values, + const Tensor& lhs_select_idx, + const Tensor& rhs_values, + const Tensor& rhs_select_idx) { + auto iter = make_value_selection_intersection_iter( + lhs_values, + lhs_select_idx, + rhs_values, + rhs_select_idx); + auto res_values = iter.tensor(0); + + // If res_values is empty, we can return it right away. + // Otherwise floating point issues with OffsetCalculator. 
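binary_op_intersection_kernel above gathers, for every output element, one value row from each operand via a per-element nnz index and applies binary_op_t (MulOp here, with a bool specialization using logical and). A host-side reference of that gather-and-apply step, assuming flat float buffers rather than the offset-calculator/byte-pointer plumbing used by the kernel:

    #include <cstdint>

    // Host-side reference of the gather-and-apply step (multiplication case).
    // The CUDA kernel does the same per element, but through an offset
    // calculator over byte pointers and a templated binary_op_t::apply.
    void mul_intersection_reference(
        float* res_values,
        const float* lhs_values, const int64_t* lhs_select_idx, int64_t lhs_nnz_stride,
        const float* rhs_values, const int64_t* rhs_select_idx, int64_t rhs_nnz_stride,
        int64_t n) {
      for (int64_t i = 0; i < n; ++i) {
        const int64_t lhs_nnz = lhs_select_idx[i];
        const int64_t rhs_nnz = rhs_select_idx[i];
        res_values[i] = lhs_values[lhs_nnz * lhs_nnz_stride] *
                        rhs_values[rhs_nnz * rhs_nnz_stride];
      }
    }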
+ if (!res_values.numel()) { + return res_values; + } + + const auto lhs_nnz_stride = lhs_values.stride(0); + const auto rhs_nnz_stride = rhs_values.stride(0); + + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( + ScalarType::Bool, ScalarType::Half, ScalarType::BFloat16, res_values.scalar_type(), + "binary_op_intersection_cpu", [&] { + AT_DISPATCH_INDEX_TYPES(lhs_select_idx.scalar_type(), + "binary_op_intersection_cpu", [&] { + binary_op_intersection_kernel( + iter, lhs_nnz_stride, rhs_nnz_stride); + }); + }); + + return res_values; + } +}; + +void mul_sparse_sparse_out_cuda_kernel( + Tensor& result, + const Tensor& x, + const Tensor& y) { + using CUDAValueSelectionMulKernel = CUDAValueSelectionIntersectionKernel; + _sparse_binary_op_intersection_kernel_out( + result, x, y + ); +} + +} + +REGISTER_CUDA_DISPATCH(mul_sparse_sparse_out_stub, &mul_sparse_sparse_out_cuda_kernel); + +}} diff --git a/aten/src/ATen/native/cuda/SummaryOps.cu b/aten/src/ATen/native/cuda/SummaryOps.cu index 5476682d7c4d..3383c38ac9ac 100644 --- a/aten/src/ATen/native/cuda/SummaryOps.cu +++ b/aten/src/ATen/native/cuda/SummaryOps.cu @@ -15,7 +15,7 @@ #include #include #include -#include +#include #endif namespace at { @@ -271,7 +271,7 @@ bool CUDA_tensor_histogram( detail::TensorInfo pInfo(nullptr, 0, {}, {}); Tensor partial_output; if (memType == CUDAHistogramMemoryType::MULTI_BLOCK) { - partial_output = native::zeros( + partial_output = at::zeros( {grid.x, nbins}, optTypeMetaToScalarType(a.options().dtype_opt()), a.options().layout_opt(), @@ -313,7 +313,7 @@ Tensor _bincount_cuda_template( AT_ERROR("minlength should be >= 0"); } if (self.dim() == 1 && self.numel() == 0) { - return native::zeros( + return at::zeros( {minlength}, kLong, c10::nullopt /* layout */, @@ -327,8 +327,8 @@ Tensor _bincount_cuda_template( } bool has_weights = weights.defined(); - if (has_weights && weights.size(0) != self.size(0)) { - AT_ERROR("input and weights should have the same length"); + if (has_weights && (weights.dim() != 1 || weights.size(0) != self.size(0))) { + AT_ERROR("weights should be 1-d and have the same length as input"); } const int64_t nbins = @@ -342,7 +342,7 @@ Tensor _bincount_cuda_template( // alloc output counter on GPU Tensor output; if (has_weights) { - output = native::zeros( + output = at::zeros( {nbins}, optTypeMetaToScalarType(weights.options().dtype_opt()), weights.options().layout_opt(), @@ -351,7 +351,7 @@ Tensor _bincount_cuda_template( cuda::CUDA_tensor_histogram( output, self, weights, nbins, minvalue, maxvalue); } else { - output = native::zeros( + output = at::zeros( {nbins}, kLong, c10::nullopt /* layout */, @@ -373,7 +373,7 @@ Tensor _histc_cuda_template( if (nbins <= 0) { AT_ERROR("bins must be > 0"); } - Tensor output = native::zeros( + Tensor output = at::zeros( {nbins}, self.scalar_type(), c10::nullopt /* layout */, diff --git a/aten/src/ATen/native/cuda/TensorFactories.cu b/aten/src/ATen/native/cuda/TensorFactories.cu index 6e05908b2cce..e880b21d650d 100644 --- a/aten/src/ATen/native/cuda/TensorFactories.cu +++ b/aten/src/ATen/native/cuda/TensorFactories.cu @@ -55,10 +55,6 @@ Tensor empty_cuda(IntArrayRef size, c10::optional dtype_opt, c10::op return at::detail::empty_cuda(size, dtype_opt, layout_opt, device_opt, pin_memory_opt, memory_format_opt); } -Tensor empty_symint_cuda(c10::SymIntArrayRef size, c10::optional dtype_opt, c10::optional layout_opt, c10::optional device_opt, c10::optional pin_memory_opt, c10::optional memory_format_opt) { - return at::native::empty_cuda(asIntArrayRefSlow(size), 
dtype_opt, layout_opt, device_opt, pin_memory_opt, memory_format_opt); -} - Tensor _efficientzerotensor_cuda(IntArrayRef size, c10::optional dtype, c10::optional layout, @@ -294,10 +290,10 @@ Tensor tril_indices_cuda( cuda::getApplyGrid(tril_size, dim_grid, tensor.get_device()), "unable to get dim grid"); - AT_DISPATCH_ALL_TYPES_AND(at::ScalarType::Half, tensor.scalar_type(), "tril_indices_cuda", [&] { + AT_DISPATCH_INDEX_TYPES(tensor.scalar_type(), "tril_indices_cuda", [&] { tril_indices_kernel<<< dim_grid, dim_block, 0, at::cuda::getCurrentCUDAStream()>>>( - tensor.data_ptr(), + tensor.data_ptr(), trapezoid_row_offset, m_first_row, col, @@ -372,10 +368,10 @@ Tensor triu_indices_cuda( cuda::getApplyGrid(triu_size, dim_grid, tensor.get_device()), "unable to get dim grid"); - AT_DISPATCH_ALL_TYPES_AND(at::ScalarType::Half, tensor.scalar_type(), "triu_indices_cuda", [&] { + AT_DISPATCH_INDEX_TYPES(tensor.scalar_type(), "triu_indices_cuda", [&] { triu_indices_kernel<<< dim_grid, dim_block, 0, at::cuda::getCurrentCUDAStream()>>>( - tensor.data_ptr(), + tensor.data_ptr(), std::max(0, offset), m_first_row, col, diff --git a/aten/src/ATen/native/cuda/TriangularOps.cu b/aten/src/ATen/native/cuda/TriangularOps.cu index f87d821f396c..a079ec684988 100644 --- a/aten/src/ATen/native/cuda/TriangularOps.cu +++ b/aten/src/ATen/native/cuda/TriangularOps.cu @@ -102,137 +102,9 @@ TORCH_IMPL_FUNC(triu_cuda)(const Tensor& self, int64_t k, const Tensor &result) } } -// Copy the kth diagonal of a matrix B to a vector A. -template -C10_LAUNCH_BOUNDS_1(1024) -__global__ void copy_from_diagonal_kernel( - scalar_t* a, - scalar_t* b, - std::ptrdiff_t start, - std::ptrdiff_t size, - std::ptrdiff_t strideSum, - std::ptrdiff_t strideA) { - for (std::ptrdiff_t linearIndex = blockIdx.x * blockDim.x + threadIdx.x; - linearIndex < size; - linearIndex += gridDim.x * blockDim.x) { - const std::ptrdiff_t bOffset = start + strideSum * linearIndex; - a[strideA * linearIndex] = b[bOffset]; - } -} - -// Copy vector B to the kth diagonal of a matrix A -template -C10_LAUNCH_BOUNDS_1(1024) -__global__ void copy_to_diagonal_kernel( - scalar_t* a, - scalar_t* b, - std::ptrdiff_t start, - std::ptrdiff_t size, - std::ptrdiff_t strideSum, - std::ptrdiff_t strideB) { - for (std::ptrdiff_t linearIndex = blockIdx.x * blockDim.x + threadIdx.x; - linearIndex < size; - linearIndex += gridDim.x * blockDim.x) { - const std::ptrdiff_t aOffset = start + strideSum * linearIndex; - a[aOffset] = b[strideB * linearIndex]; - } -} - -template -Tensor& apply_diag(Tensor& result, const Tensor& self, int64_t dimension) { - TORCH_CHECK( - self.dim() == 1 || self.dim() == 2, "matrix or a vector expected"); - - TensorArg result_arg{result, "result", 1}; - TensorArg self_arg{self, "self", 2}; - checkAllSameGPU(__func__, {result_arg, self_arg}); - checkSameType(__func__, result_arg, self_arg); - - int nDimension = self.dim(); - if (nDimension == 2) { - auto self_stride_0 = self.stride(0); - auto self_stride_1 = self.stride(1); - - int sz; - if (dimension > 0) { - sz = std::min(self.size(0), self.size(1) - dimension); - } else { - sz = std::min(self.size(0) + dimension, self.size(1)); - } - - at::native::resize_output(result, {sz}); - if (sz > 0) { - at::assert_no_internal_overlap(result); - auto result_stride = result.stride(0); - const dim3 threads(std::min( - int(sz), - int(at::cuda::getCurrentDeviceProperties()->maxThreadsPerBlock))); - const dim3 grid( - std::min(int(1024), ceil_div(int(sz), int(threads.x)))); - auto start = - (dimension >= 0 ? 
dimension * self_stride_1 - : -dimension * self_stride_0); - - // Kernel Launch - copy_from_diagonal_kernel - <<>>( - result.data_ptr(), - self.data_ptr(), - start, - sz, - self_stride_0 + self_stride_1, - result_stride); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } - } else { - auto n_elems = self.numel(); - auto sz = (dimension > 0) ? n_elems + dimension : n_elems - dimension; - auto self_stride = self.stride(0); - at::native::resize_output(result, {sz, sz}); - result.zero_(); - if (sz > 0) { - at::assert_no_internal_overlap(result); - auto result_stride_0 = result.stride(0); - auto result_stride_1 = result.stride(1); - const dim3 threads(std::min( - int(sz), at::cuda::getCurrentDeviceProperties()->maxThreadsPerBlock)); - const dim3 grid( - std::min(int(1024), ceil_div(int(sz), int(threads.x)))); - auto start = - (dimension >= 0 ? dimension * result_stride_1 - : -dimension * result_stride_0); - - // Kernel Launch - copy_to_diagonal_kernel - <<>>( - result.data_ptr(), - self.data_ptr(), - start, - n_elems, - result_stride_0 + result_stride_1, - self_stride); - C10_CUDA_KERNEL_LAUNCH_CHECK(); - } - } - - return result; -} - -Tensor& diag_cuda_out(const Tensor& self, int64_t dimension, Tensor& result) { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4( - kComplexHalf, ScalarType::Half, ScalarType::BFloat16, ScalarType::Bool, - self.scalar_type(), "diag_cuda", - [&] { - apply_diag(result, self, dimension); - }); - return result; -} - Tensor trace_cuda(const Tensor& self) { TORCH_CHECK(self.dim() == 2, "expected a matrix"); - int dimension = 0; - auto result = at::diag(self, dimension); - return result.sum(); + return self.diagonal().sum(); } } // namespace native diff --git a/aten/src/ATen/native/cuda/UnaryComplexKernels.cu b/aten/src/ATen/native/cuda/UnaryComplexKernels.cu index 0589c3ba4f0d..a04194b1117e 100644 --- a/aten/src/ATen/native/cuda/UnaryComplexKernels.cu +++ b/aten/src/ATen/native/cuda/UnaryComplexKernels.cu @@ -1,6 +1,7 @@ #define TORCH_ASSERT_NO_OPERATORS #include #include +#include #include #include #include @@ -58,22 +59,10 @@ void angle_kernel_cuda(TensorIteratorBase& iter) { } } -// We manually overload conj because std::conj does not work types other than c10::complex. 
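In the UnaryComplexKernels.cu hunk that starts here, the conj_wrapper overload pair is folded into an AT_DISPATCH_SWITCH: real, bool and integral dtypes become a direct copy, complex dtypes apply std::conj, and complex-half keeps its jiterator path. An illustrative host-side equivalent of the per-element dispatch (not the ATen macros):

    #include <complex>
    #include <type_traits>

    // Per-element behaviour of the rewritten conj kernel: identity for
    // non-complex values, std::conj for complex values. Host-side sketch only.
    template <typename scalar_t>
    scalar_t conj_reference(scalar_t v) {
      if constexpr (std::is_same<scalar_t, std::complex<float>>::value ||
                    std::is_same<scalar_t, std::complex<double>>::value) {
        return std::conj(v);
      } else {
        return v;  // conj is a no-op for real, bool and integral types
      }
    }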
-template -__host__ __device__ static inline scalar_t conj_wrapper(scalar_t v) { - return v; -} - -template -__host__ __device__ static inline c10::complex conj_wrapper(c10::complex v) { - return std::conj(v); -} - // NB: Ignores the negative bit on tensors const char conj_name[] = "conj_kernel"; void conj_kernel_cuda(TensorIteratorBase& iter) { - auto common_dtype = iter.common_dtype(); - if (common_dtype == kComplexHalf) { + auto conj_chalf = [&] { using scalar_t = c10::complex; #if AT_USE_JITERATOR() static const auto conj_string = jiterator_stringify( @@ -85,17 +74,23 @@ void conj_kernel_cuda(TensorIteratorBase& iter) { jitted_gpu_kernel(iter, conj_string); #else gpu_kernel(iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t { - return conj_wrapper(a); + return std::conj(a); }); #endif - } else { - AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( - kBool, kBFloat16, kHalf, iter.common_dtype(), "conj_cuda", [&]() { - gpu_kernel(iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t { - return conj_wrapper(a); - }); - }); - } + }; + + AT_DISPATCH_SWITCH(iter.common_dtype(), "conj_cuda", + AT_DISPATCH_CASE_ALL_TYPES_AND3(kBool, kBFloat16, kHalf, [&] { + // Conj is a no-op for non-complex types + direct_copy_kernel_cuda(iter); + }) + AT_DISPATCH_CASE_COMPLEX_TYPES([&] { + gpu_kernel(iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t { + return std::conj(a); + }); + }) + AT_DISPATCH_CASE(kComplexHalf, conj_chalf) + ); } REGISTER_DISPATCH(angle_stub, &angle_kernel_cuda); diff --git a/aten/src/ATen/native/cuda/UnaryFractionKernels.cu b/aten/src/ATen/native/cuda/UnaryFractionKernels.cu index 87aa784b7d5d..ae4d4a01aa00 100644 --- a/aten/src/ATen/native/cuda/UnaryFractionKernels.cu +++ b/aten/src/ATen/native/cuda/UnaryFractionKernels.cu @@ -122,7 +122,7 @@ __host__ __device__ static inline c10::complex nearbyint_wrapper(c10::com } #pragma push -#pragma diag_suppress 177 // Function was declared but never referenced +#pragma nv_diag_suppress 177 // Function was declared but never referenced __host__ __device__ static inline c10::complex nearbyint_wrapper(c10::complex a) { return c10::complex(::nearbyint(static_cast(a.real())), ::nearbyint(static_cast(a.imag()))); } diff --git a/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu b/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu index 0cb0d9f238cf..2481fd602896 100644 --- a/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu +++ b/aten/src/ATen/native/cuda/UnarySpecialOpsKernel.cu @@ -151,7 +151,9 @@ void sigmoid_kernel_cuda(TensorIteratorBase& iter) { } else { AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, common_dtype, "sigmoid_cuda", [&]() { gpu_kernel(iter, []GPU_LAMBDA(scalar_t a) -> scalar_t { - return scalar_t{1} / (scalar_t{1} + std::exp(-a)); + using opmath_t = at::opmath_type; + const auto one = opmath_t{1}; + return static_cast(one/(one + std::exp(-opmath_t{a}))); }); }); } @@ -179,8 +181,9 @@ void sinc_kernel_cuda(TensorIteratorBase& iter) { return scalar_t(1); } else { // NVCC says constexpr var is not accessible from device - scalar_t product = c10::detail::pi() * a; - return std::sin(product) / product; + using opmath_t = at::opmath_type; + opmath_t product = c10::detail::pi() * opmath_t{a}; + return static_cast(std::sin(product) / product); } }); }); diff --git a/aten/src/ATen/native/cuda/UnfoldBackwardKernel.cu b/aten/src/ATen/native/cuda/UnfoldBackwardKernel.cu index 90f5238d0180..d75de2a6e90f 100644 --- a/aten/src/ATen/native/cuda/UnfoldBackwardKernel.cu +++ b/aten/src/ATen/native/cuda/UnfoldBackwardKernel.cu @@ -1,6 +1,7 @@ 
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include +#include #include #include #include @@ -57,8 +58,7 @@ void _unfold_backward_internal_kernel( int64_t grad_in_dim_stride, int64_t grad_in_last_dim_stride, int64_t grad_in_dim_size, - int64_t grad_out_dim_stride, - bool is_step_ge_size + int64_t grad_out_dim_stride ) { if (iter.numel() == 0) { return; @@ -73,8 +73,7 @@ void _unfold_backward_internal_kernel( grad_in_dim_stride, grad_in_last_dim_stride, grad_in_dim_size, - grad_out_dim_stride, - is_step_ge_size + grad_out_dim_stride ); } return; @@ -84,63 +83,39 @@ void _unfold_backward_internal_kernel( char* __restrict__ grad_in_ptr = reinterpret_cast(iter.data_ptr(1)); char* __restrict__ idx_dim_ptr = reinterpret_cast(iter.data_ptr(2)); - if (is_step_ge_size) { - char* __restrict__ idx_last_dim_ptr = reinterpret_cast(iter.data_ptr(3)); + auto offset_calc = make_offset_calculator<3>(iter); - auto offset_calc = make_offset_calculator<4>(iter); + // The algorithm is: for each index in grad_out find + // the elements contributing to it and sum them up. + // Note: the algorithm does not require any synchronization. + auto loop = [=]C10_DEVICE(int i) { + auto offsets = offset_calc.get(i); - // this loop simply copies the data - // from proper places in grad_out to grad_in - auto loop = [=]C10_DEVICE(int i) { - auto offsets = offset_calc.get(i); + auto* __restrict__ grad_out_data = reinterpret_cast(grad_out_ptr + offsets[0]); + auto* __restrict__ grad_in_data = reinterpret_cast(grad_in_ptr + offsets[1]); - auto* __restrict__ grad_out_data = reinterpret_cast(grad_out_ptr + offsets[0]); - auto* __restrict__ grad_in_data = reinterpret_cast(grad_in_ptr + offsets[1]); + auto idx_dim = *reinterpret_cast(idx_dim_ptr + offsets[2]); - auto idx_dim = *reinterpret_cast(idx_dim_ptr + offsets[2]); - auto idx_last_dim = *reinterpret_cast(idx_last_dim_ptr + offsets[3]); - - auto grad_out_idx_dim = idx_dim * step + idx_last_dim; - grad_out_data[grad_out_idx_dim * grad_out_dim_stride] = *grad_in_data; - }; - - _launch_unfold_backward_kernel(iter.numel(), loop); - } - else { - auto offset_calc = make_offset_calculator<3>(iter); - - // The algorithm is: for each index in grad_out find - // the elements contributing to it and sum them up. - // Note: the algorithm does not require any synchronization. - auto loop = [=]C10_DEVICE(int i) { - auto offsets = offset_calc.get(i); - - auto* __restrict__ grad_out_data = reinterpret_cast(grad_out_ptr + offsets[0]); - auto* __restrict__ grad_in_data = reinterpret_cast(grad_in_ptr + offsets[1]); - - auto idx_dim = *reinterpret_cast(idx_dim_ptr + offsets[2]); - - // left_fold potentially intersecting with idx_dim - // is either (idx_dim - size) / step or the next integer. - int64_t left_fold_idx = (idx_dim > size) ? (idx_dim - size) / step : 0; - if (!(left_fold_idx * step <= idx_dim && idx_dim < left_fold_idx * step + size)) { - ++left_fold_idx; - } + // left_fold potentially intersecting with idx_dim + // is either (idx_dim - size) / step or the next integer. + int64_t left_fold_idx = (idx_dim > size) ? (idx_dim - size) / step : 0; + if (!(left_fold_idx * step <= idx_dim && idx_dim < left_fold_idx * step + size)) { + ++left_fold_idx; + } - auto right_fold_idx = idx_dim / step; - right_fold_idx = (right_fold_idx >= grad_in_dim_size) ? - (grad_in_dim_size - 1) : right_fold_idx; + auto right_fold_idx = idx_dim / step; + right_fold_idx = (right_fold_idx >= grad_in_dim_size) ? 
+ (grad_in_dim_size - 1) : right_fold_idx; - for (auto fold_idx = left_fold_idx; fold_idx <= right_fold_idx; ++fold_idx) { - auto idx_last_dim = idx_dim - fold_idx * step; - *grad_out_data += grad_in_data[fold_idx * grad_in_dim_stride - + idx_last_dim * grad_in_last_dim_stride]; - } + for (auto fold_idx = left_fold_idx; fold_idx <= right_fold_idx; ++fold_idx) { + auto idx_last_dim = idx_dim - fold_idx * step; + *grad_out_data += grad_in_data[fold_idx * grad_in_dim_stride + + idx_last_dim * grad_in_last_dim_stride]; + } - }; + }; - _launch_unfold_backward_kernel(iter.numel(), loop); - } + _launch_unfold_backward_kernel(iter.numel(), loop); } void unfold_backward_cuda_kernel( @@ -160,16 +135,8 @@ void unfold_backward_cuda_kernel( auto grad_out_dim_stride = ensure_nonempty_stride(grad_out, dim); - auto is_step_ge_size = (step >= size); - - TensorIterator iter = - is_step_ge_size ? - _make_unfold_backward_iter_over_grad_in( - grad_out, grad_in, dim, size, step - ) : - _make_unfold_backward_iter_over_grad_out( - grad_out, grad_in, dim, size, step - ); + TensorIterator iter = _make_unfold_backward_iter_over_grad_out( + grad_out, grad_in, dim, size, step); AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( at::ScalarType::Half, at::ScalarType::Bool, at::ScalarType::BFloat16, @@ -182,8 +149,7 @@ void unfold_backward_cuda_kernel( grad_in_dim_stride, grad_in_last_dim_stride, grad_in_dim_size, - grad_out_dim_stride, - is_step_ge_size + grad_out_dim_stride ); } ); diff --git a/aten/src/ATen/native/cuda/UpSampleNearest2d.cu b/aten/src/ATen/native/cuda/UpSampleNearest2d.cu index 8aa4f68aeda6..f223655daca1 100644 --- a/aten/src/ATen/native/cuda/UpSampleNearest2d.cu +++ b/aten/src/ATen/native/cuda/UpSampleNearest2d.cu @@ -94,13 +94,13 @@ __global__ void upsample_nearest2d_nhwc_out_frame( float width_scale, const size_t out_numel) { - const int index = blockIdx.x * blockDim.x + threadIdx.x; + const int64_t index = blockIdx.x * blockDim.x + threadIdx.x; if (index < out_numel) { - const int c = index % channels; - const int w2 = (index / channels) % width2; - const int h2 = (index / channels / width2) % height2; - const int n = index / channels / width2 / height2; + const auto c = index % channels; + const auto w2 = (index / channels) % width2; + const auto h2 = (index / channels / width2) % height2; + const auto n = index / channels / width2 / height2; const size_t h1 = height1 == height2 ? h2 : nn_compute_source_index_fn(height_scale, h2, height1); const size_t w1 = width1 == width2 ? 
w2 : nn_compute_source_index_fn(width_scale, w2, width1); @@ -240,13 +240,13 @@ static void upsample_nearest2d_out_cuda_template( output.is_contiguous(memory_format)) { at::Tensor input = input_.contiguous(at::MemoryFormat::ChannelsLast); - TORCH_CHECK(input.numel() < std::numeric_limits::max(), - "upsample_nearest_nhwc only supports input tensors with less than INT_MAX elements"); - TORCH_CHECK(output.numel() < std::numeric_limits::max(), - "upsample_nearest_nhwc only supports output tensors with less than INT_MAX elements"); + TORCH_CHECK(input.numel() < std::numeric_limits::max(), + "upsample_nearest_nhwc only supports input tensors with less than 2^63 - 1 elements"); + TORCH_CHECK(output.numel() < std::numeric_limits::max(), + "upsample_nearest_nhwc only supports output tensors with less than 2^63 - 1 elements"); - const int num_kernels = output.numel(); - const int num_threads = std::min(at::cuda::getCurrentDeviceProperties()->maxThreadsPerBlock, 1024); + const int64_t num_kernels = output.numel(); + const int64_t num_threads = std::min(at::cuda::getCurrentDeviceProperties()->maxThreadsPerBlock, 1024); AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::Half, ScalarType::Byte, input.scalar_type(), "upsample_nearest2d_nhwc_out_frame", [&] { const scalar_t* idata = input.data_ptr(); diff --git a/aten/src/ATen/native/cuda/UpSampleNearest3d.cu b/aten/src/ATen/native/cuda/UpSampleNearest3d.cu index 1a4afa012d78..58f14ad491a6 100644 --- a/aten/src/ATen/native/cuda/UpSampleNearest3d.cu +++ b/aten/src/ATen/native/cuda/UpSampleNearest3d.cu @@ -337,52 +337,5 @@ TORCH_IMPL_FUNC(_upsample_nearest_exact3d_backward_out_cuda) ( using at::native::upsample::compute_output_size; using at::native::upsample_cuda::get_scale_value; -Tensor upsample_nearest3d_cuda( - const Tensor& input, - at::OptionalIntArrayRef output_size, - c10::optional> scale_factors) { - auto osize = compute_output_size(input.sizes(), output_size, scale_factors); - auto scale_d = get_scale_value(scale_factors, 0); - auto scale_h = get_scale_value(scale_factors, 1); - auto scale_w = get_scale_value(scale_factors, 2); - return at::upsample_nearest3d(input, osize, scale_d, scale_h, scale_w); -} - -Tensor _upsample_nearest_exact3d_cuda( - const Tensor& input, - at::OptionalIntArrayRef output_size, - c10::optional> scale_factors) { - auto osize = compute_output_size(input.sizes(), output_size, scale_factors); - auto scale_d = get_scale_value(scale_factors, 0); - auto scale_h = get_scale_value(scale_factors, 1); - auto scale_w = get_scale_value(scale_factors, 2); - return at::_upsample_nearest_exact3d(input, osize, scale_d, scale_h, scale_w); -} - -// when structured kernels can handle QuantizedCPU, update these overloads to be CompositeExplicitAutograd -Tensor upsample_nearest3d_backward_cuda( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_d = get_scale_value(scale_factors, 0); - auto scale_h = get_scale_value(scale_factors, 1); - auto scale_w = get_scale_value(scale_factors, 2); - return at::upsample_nearest3d_backward(grad_output, osize, input_size, scale_d, scale_h, scale_w); -} - -Tensor _upsample_nearest_exact3d_backward_cuda( - const Tensor& grad_output, - at::OptionalIntArrayRef output_size, - IntArrayRef input_size, - c10::optional> scale_factors) { - auto osize = compute_output_size(input_size, output_size, scale_factors); - auto scale_d = get_scale_value(scale_factors, 
0); - auto scale_h = get_scale_value(scale_factors, 1); - auto scale_w = get_scale_value(scale_factors, 2); - return at::_upsample_nearest_exact3d_backward(grad_output, osize, input_size, scale_d, scale_h, scale_w); -} - } // namespace native } // namespace at diff --git a/aten/src/ATen/native/cuda/block_reduce.cuh b/aten/src/ATen/native/cuda/block_reduce.cuh index e01cd0b060f5..fa75c71f8aca 100644 --- a/aten/src/ATen/native/cuda/block_reduce.cuh +++ b/aten/src/ATen/native/cuda/block_reduce.cuh @@ -29,24 +29,43 @@ __inline__ __device__ T WarpReduceSum(T val) { return val; } +struct Block1D { + static __forceinline__ __device__ int Tid() { return threadIdx.x; } + + static __forceinline__ __device__ int Warps() { + return blockDim.x / C10_WARP_SIZE; + } +}; + +struct Block2D { + static __forceinline__ __device__ int Tid() { + return threadIdx.x + threadIdx.y * blockDim.x; + } + + static __forceinline__ __device__ int Warps() { + return blockDim.x * blockDim.y / C10_WARP_SIZE; + } +}; + // Sums `val` across all threads in a block. // +// Warning: the return value is only valid for thread 0. // Assumptions: -// - Thread blocks are an 1D set of threads (indexed with `threadIdx.x` only) // - The size of each block should be a multiple of `C10_WARP_SIZE` // - `shared` should be a pointer to shared memory with size of, at least, // `sizeof(T) * number_of_warps` -template +template __inline__ __device__ T BlockReduceSum(T val, T* shared) { - const int lid = threadIdx.x % C10_WARP_SIZE; - const int wid = threadIdx.x / C10_WARP_SIZE; + const int tid = B::Tid(); + const int lid = tid % C10_WARP_SIZE; + const int wid = tid / C10_WARP_SIZE; val = WarpReduceSum(val); - __syncthreads(); + __syncthreads(); // prevent races when BlockReduces are called in a row. if (lid == 0) { shared[wid] = val; } __syncthreads(); - val = (threadIdx.x < blockDim.x / C10_WARP_SIZE) ? shared[lid] : T(0); + val = (tid < B::Warps()) ? shared[lid] : T(0); if (wid == 0) { val = WarpReduceSum(val); } @@ -62,19 +81,19 @@ __inline__ __device__ T WarpReduce(T val, const ReduceOp& op) { return val; } -template +template __inline__ __device__ T BlockReduce(T val, const ReduceOp& op, const T& identity_element, T* shared) { - const int lid = threadIdx.x % C10_WARP_SIZE; - const int wid = threadIdx.x / C10_WARP_SIZE; + const int tid = B::Tid(); + const int lid = tid % C10_WARP_SIZE; + const int wid = tid / C10_WARP_SIZE; val = WarpReduce(val, op); - __syncthreads(); + __syncthreads(); // prevent races when BlockReduces are called in a row. if (lid == 0) { shared[wid] = val; } __syncthreads(); - val = (threadIdx.x < blockDim.x / C10_WARP_SIZE) ? shared[lid] - : identity_element; + val = (tid < B::Warps()) ? 
shared[lid] : identity_element; if (wid == 0) { val = WarpReduce(val, op); } diff --git a/aten/src/ATen/native/cuda/fused_adam_amsgrad_impl.cu b/aten/src/ATen/native/cuda/fused_adam_amsgrad_impl.cu new file mode 100644 index 000000000000..f394899e24bd --- /dev/null +++ b/aten/src/ATen/native/cuda/fused_adam_amsgrad_impl.cu @@ -0,0 +1,52 @@ +#include + +#include +#include +#include +#include +#include + +namespace at { namespace native { + +void _fused_adam_cuda_impl_( + at::TensorList params, + at::TensorList grads, + at::TensorList exp_avgs, + at::TensorList exp_avg_sqs, + at::TensorList max_exp_avg_sqs, + at::TensorList state_steps, + const double lr, + const double beta1, + const double beta2, + const double weight_decay, + const double eps, + const bool amsgrad, + const bool maximize, + const c10::optional& grad_scale, + const c10::optional& found_inf +) { + std::vector> tensor_lists{ + params.vec(), grads.vec(), exp_avgs.vec(), exp_avg_sqs.vec(), max_exp_avg_sqs.vec() }; + + float* grad_scale_ptr = grad_scale.has_value() ? grad_scale->data_ptr() : nullptr; + float* found_inf_ptr = found_inf.has_value() ? found_inf->data_ptr() : nullptr; + + AT_DISPATCH_FLOATING_TYPES_AND2(kHalf, kBFloat16, params[0].scalar_type(), + "fused_adam_kernel_cuda", [&]() { + multi_tensor_apply_for_fused_optimizer<5>( + tensor_lists, + state_steps, + FusedAdamMathFunctor(), + lr, + beta1, + beta2, + weight_decay, + eps, + maximize, + /* amsgrad */true, + grad_scale_ptr, + found_inf_ptr); + }); +} + +} } // namespace at::native diff --git a/aten/src/ATen/native/cuda/fused_adam_amsgrad_impl.cuh b/aten/src/ATen/native/cuda/fused_adam_amsgrad_impl.cuh new file mode 100644 index 000000000000..46e893e564d9 --- /dev/null +++ b/aten/src/ATen/native/cuda/fused_adam_amsgrad_impl.cuh @@ -0,0 +1,24 @@ +#pragma once +#include + +namespace at { namespace native { + +void _fused_adam_cuda_impl_( + at::TensorList params, + at::TensorList grads, + at::TensorList exp_avgs, + at::TensorList exp_avg_sqs, + at::TensorList max_exp_avg_sqs, + at::TensorList state_steps, + const double lr, + const double beta1, + const double beta2, + const double weight_decay, + const double eps, + const bool amsgrad, + const bool maximize, + const c10::optional& grad_scale, + const c10::optional& found_inf +); + +} } // namespace at::native diff --git a/aten/src/ATen/native/cuda/fused_adam_impl.cu b/aten/src/ATen/native/cuda/fused_adam_impl.cu new file mode 100644 index 000000000000..3674f83b20b1 --- /dev/null +++ b/aten/src/ATen/native/cuda/fused_adam_impl.cu @@ -0,0 +1,51 @@ +#include + +#include +#include +#include +#include +#include + +namespace at { namespace native { + +void _fused_adam_cuda_impl_( + at::TensorList params, + at::TensorList grads, + at::TensorList exp_avgs, + at::TensorList exp_avg_sqs, + at::TensorList state_steps, + const double lr, + const double beta1, + const double beta2, + const double weight_decay, + const double eps, + const bool amsgrad, + const bool maximize, + const c10::optional& grad_scale, + const c10::optional& found_inf +) { + std::vector> tensor_lists{ + params.vec(), grads.vec(), exp_avgs.vec(), exp_avg_sqs.vec() }; + + float* grad_scale_ptr = grad_scale.has_value() ? grad_scale->data_ptr() : nullptr; + float* found_inf_ptr = found_inf.has_value() ? 
found_inf->data_ptr() : nullptr; + + AT_DISPATCH_FLOATING_TYPES_AND2(kHalf, kBFloat16, params[0].scalar_type(), + "fused_adam_kernel_cuda", [&]() { + multi_tensor_apply_for_fused_optimizer<4>( + tensor_lists, + state_steps, + FusedAdamMathFunctor(), + lr, + beta1, + beta2, + weight_decay, + eps, + maximize, + /* amsgrad */false, + grad_scale_ptr, + found_inf_ptr); + }); +} + +} } // namespace at::native diff --git a/aten/src/ATen/native/cuda/fused_adam_impl.cuh b/aten/src/ATen/native/cuda/fused_adam_impl.cuh new file mode 100644 index 000000000000..a76ba566970f --- /dev/null +++ b/aten/src/ATen/native/cuda/fused_adam_impl.cuh @@ -0,0 +1,23 @@ +#pragma once +#include + +namespace at { namespace native { + +void _fused_adam_cuda_impl_( + at::TensorList params, + at::TensorList grads, + at::TensorList exp_avgs, + at::TensorList exp_avg_sqs, + at::TensorList state_steps, + const double lr, + const double beta1, + const double beta2, + const double weight_decay, + const double eps, + const bool amsgrad, + const bool maximize, + const c10::optional& grad_scale, + const c10::optional& found_inf +); + +} } // namespace at::native diff --git a/aten/src/ATen/native/cuda/fused_adam_utils.cuh b/aten/src/ATen/native/cuda/fused_adam_utils.cuh new file mode 100644 index 000000000000..8d7c410915c1 --- /dev/null +++ b/aten/src/ATen/native/cuda/fused_adam_utils.cuh @@ -0,0 +1,166 @@ +#pragma once +#include +#include +#include +#include + + +namespace at { namespace native { + +namespace { + +constexpr uint8_t kParamIdx = 0; +constexpr uint8_t kGradIdx = 1; +constexpr uint8_t kExpAvgIdx = 2; +constexpr uint8_t kExpAvgSqIdx = 3; +constexpr uint8_t kMaxExpAvgSqIdx = 4; + +template +C10_DEVICE __forceinline__ void adam_math( + scalar_type r_args[depth][kILP], + const float* step_count, + const double lr, + const double beta1, + const double beta2, + const double weight_decay, + const double eps, + const bool maximize, + const bool amsgrad, + const float* grad_scale_ptr, + const float* found_inf_ptr +) { +#pragma unroll + for (int ii = 0; ii < kILP; ii++) { + // Load values. + opmath_t param = static_cast(r_args[kParamIdx][ii]); + opmath_t grad = static_cast(r_args[kGradIdx][ii]); + if (grad_scale_ptr) { + grad /= (static_cast(*grad_scale_ptr)); + } + const opmath_t grad_to_store = grad; + if (maximize) { + grad = -grad; + } + opmath_t exp_avg = static_cast(r_args[kExpAvgIdx][ii]); + opmath_t exp_avg_sq = static_cast(r_args[kExpAvgSqIdx][ii]); + opmath_t max_exp_avg_sq; + if (amsgrad) { + max_exp_avg_sq = static_cast(r_args[kMaxExpAvgSqIdx][ii]); + } + + // Update param, grad, 1st and 2nd order momentum. + if (weight_decay != 0) { + grad += param * weight_decay; + } + // todo(crcrpar): use lerp + // ref: https://developer.nvidia.com/blog/lerp-faster-cuda/ + exp_avg = beta1 * exp_avg + (1 - beta1) * grad; + exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad; + + if (amsgrad) { + max_exp_avg_sq = std::max(max_exp_avg_sq, exp_avg_sq); + } + + const opmath_t bias_correction1 = 1 - at::native::pow_(beta1, *step_count); + const opmath_t bias_correction2 = 1 - at::native::pow_(beta2, *step_count); + + const opmath_t step_size = lr / bias_correction1; + + const opmath_t bias_correction2_sqrt = std::sqrt(bias_correction2); + + opmath_t denom; + if (amsgrad) { + denom = (std::sqrt(max_exp_avg_sq) / bias_correction2_sqrt) + eps; + } else { + denom = (std::sqrt(exp_avg_sq) / bias_correction2_sqrt) + eps; + } + + param -= step_size * exp_avg / denom; + + // Store results. 
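adam_math above is the standard Adam update with bias correction (the bias-correction terms reuse the pow_ helpers factored into Pow.cuh earlier in this patch), plus optional weight decay, maximize, AMSGrad and AMP grad unscaling. Leaving those options out, a host-side reference of one scalar step looks roughly like:

    #include <cmath>

    // One scalar Adam step, mirroring the math in adam_math (double throughout;
    // maximize/AMSGrad/grad-scale handling omitted).
    void adam_step_reference(
        double& param, double grad, double& exp_avg, double& exp_avg_sq,
        double step, double lr, double beta1, double beta2,
        double weight_decay, double eps) {
      if (weight_decay != 0) {
        grad += param * weight_decay;
      }
      exp_avg = beta1 * exp_avg + (1 - beta1) * grad;
      exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad;
      const double bias_correction1 = 1 - std::pow(beta1, step);
      const double bias_correction2 = 1 - std::pow(beta2, step);
      const double step_size = lr / bias_correction1;
      const double denom = std::sqrt(exp_avg_sq) / std::sqrt(bias_correction2) + eps;
      param -= step_size * exp_avg / denom;
    }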
+ r_args[kParamIdx][ii] = param; + if (grad_scale_ptr) { + r_args[kGradIdx][ii] = grad_to_store; + } + r_args[kExpAvgIdx][ii] = exp_avg; + r_args[kExpAvgSqIdx][ii] = exp_avg_sq; + if (amsgrad) { + r_args[kMaxExpAvgSqIdx][ii] = max_exp_avg_sq; + } + } +} + +// [note: Conditional Gradient Store when `optimizer.step` is called by GradScaler] +// When a user is training their model(s) with an FP16 AMP recipe, +// parameter updates are done via `grad_scaler.step(optimizer)` instead of `optimizer.step()`. +// For most optimizers, GradScaler unscales gradients on behalf of those optimizers. +// Also, before `.step`, it makes sure that all the gradients involved are finite, which incurs a device sync. +// On the other hand, fused optimizers set their member variable of `_step_supports_amp_scaling` to `True` +// in order to remove the device sync above. This means that fused optimizers have to have +// their CUDA kernels (a) unscale gradients and (b) skip parameter updates accordingly. +// To be functionally on par with `torch.optim` optimizers and `_multi_tensor` ones, +// the kernel below writes out gradients only when `grad_scale_ptr != nullptr. +template +struct FusedAdamMathFunctor { + static_assert(depth == 4 || depth == 5, "depth of 4 for Adam, depth of 5 for Adam with AMSGrad."); + using opmath_t = at::opmath_type; + C10_DEVICE __forceinline__ void operator()( + int chunk_size, + FusedOptimizerTensorListMetadata& tl, + const double lr, + const double beta1, + const double beta2, + const double weight_decay, + const double eps, + const bool maximize, + const bool amsgrad, + const float* grad_scale_ptr, + const float* found_inf_ptr + ) { + int tensor_loc = tl.block_to_tensor[blockIdx.x]; + int chunk_idx = tl.block_to_chunk[blockIdx.x]; + int n = tl.numel_for_tensor[tensor_loc]; + + if (found_inf_ptr && *found_inf_ptr == 1) { + return; + } + float *step_count = reinterpret_cast(tl.state_steps_addresses[tensor_loc]); + + scalar_type* args[depth]; + const bool all_aligned{init_args(args, tl, chunk_idx, chunk_size, tensor_loc)}; + n -= chunk_idx * chunk_size; + scalar_type r_args[depth][kILP]; + + if ((n % kILP == 0) && (chunk_size % kILP == 0) && all_aligned) { + for (int i_start = threadIdx.x; i_start * kILP < n && i_start * kILP < chunk_size; i_start += blockDim.x) { +#pragma unroll + for (int i = 0; i < depth; i++) { + load_store(r_args[i], args[i], 0, i_start); + } + adam_math( + r_args, step_count, lr, beta1, beta2, weight_decay, eps, maximize, amsgrad, grad_scale_ptr, found_inf_ptr); +#pragma unroll + for (int i = 0; i < depth; i++) { + if (i != kGradIdx || grad_scale_ptr) { + load_store(args[i], r_args[i], i_start, 0); + } + } + } + } else { + for (int i_start = 0; i_start < n && i_start < chunk_size; i_start += blockDim.x * kILP) { + load_args(r_args, args, i_start, chunk_size, n); + adam_math( + r_args, step_count, lr, beta1, beta2, weight_decay, eps, maximize, amsgrad, grad_scale_ptr, found_inf_ptr); +#pragma unroll + for (int i = 0; i < depth; i++) { + if (i != kGradIdx || grad_scale_ptr) { + store_args(args[i], r_args[i], i_start, chunk_size, n); + } + } + } + } + } +}; +} // namespace + +}} // namespace at::native diff --git a/aten/src/ATen/native/cuda/im2col.cuh b/aten/src/ATen/native/cuda/im2col.cuh index 391b8c6d83af..06eef13208c6 100644 --- a/aten/src/ATen/native/cuda/im2col.cuh +++ b/aten/src/ATen/native/cuda/im2col.cuh @@ -1,9 +1,8 @@ #pragma once +#include #include #include -#include -#include #include @@ -103,6 +102,60 @@ void im2col( C10_CUDA_KERNEL_LAUNCH_CHECK(); } 
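// col2im is the adjoint of im2col: every entry of the column buffer is added back
// into the image pixel it was originally read from, so overlapping patches
// accumulate. The col2im_device helper added below computes this as a per-pixel
// gather, which avoids atomics. For orientation only, a rough CPU reference in the
// scatter formulation -- a sketch with made-up names, not code from this tree:
#include <algorithm>
#include <cstdint>

void col2im_cpu_reference(const float* data_col, int64_t channels,
                          int64_t height, int64_t width,
                          int64_t kernel_h, int64_t kernel_w,
                          int64_t pad_h, int64_t pad_w,
                          int64_t stride_h, int64_t stride_w,
                          int64_t dilation_h, int64_t dilation_w,
                          float* data_im) {
  const int64_t height_col =
      (height + 2 * pad_h - (dilation_h * (kernel_h - 1) + 1)) / stride_h + 1;
  const int64_t width_col =
      (width + 2 * pad_w - (dilation_w * (kernel_w - 1) + 1)) / stride_w + 1;
  std::fill(data_im, data_im + channels * height * width, 0.f);
  // Scatter every column entry back into the image; overlapping patches accumulate.
  for (int64_t c = 0; c < channels * kernel_h * kernel_w; ++c) {
    const int64_t w_off = c % kernel_w;
    const int64_t h_off = (c / kernel_w) % kernel_h;
    const int64_t c_im = c / (kernel_h * kernel_w);
    for (int64_t h = 0; h < height_col; ++h) {
      for (int64_t w = 0; w < width_col; ++w) {
        const int64_t h_im = h * stride_h - pad_h + h_off * dilation_h;
        const int64_t w_im = w * stride_w - pad_w + w_off * dilation_w;
        if (h_im >= 0 && h_im < height && w_im >= 0 && w_im < width) {
          data_im[(c_im * height + h_im) * width + w_im] +=
              data_col[(c * height_col + h) * width_col + w];
        }
      }
    }
  }
}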
+template +__forceinline__ __device__ void col2im_device( + const int64_t index, + const dt* data_col, + const int64_t height, + const int64_t width, + const int64_t channels, + const int64_t kernel_h, + const int64_t kernel_w, + const int64_t pad_height, + const int64_t pad_width, + const int64_t stride_height, + const int64_t stride_width, + const int64_t dilation_height, + const int64_t dilation_width, + const int64_t height_col, + const int64_t width_col, + dt* data_im) { + accT val = static_cast(0); + const int64_t w_im = index % width + pad_width; + const int64_t h_im = (index / width) % height + pad_height; + const int64_t c_im = index / (width * height); + int64_t kernel_extent_w = (kernel_w - 1) * dilation_width + 1; + int64_t kernel_extent_h = (kernel_h - 1) * dilation_height + 1; + // compute the start and end of the output + const int64_t w_col_start = (w_im < kernel_extent_w) + ? 0 + : (w_im - kernel_extent_w) / stride_width + 1; + const int64_t w_col_end = ::min(w_im / stride_width + 1, width_col); + const int64_t h_col_start = (h_im < kernel_extent_h) + ? 0 + : (h_im - kernel_extent_h) / stride_height + 1; + const int64_t h_col_end = ::min(h_im / stride_height + 1, height_col); + + // TODO: use LCM of stride and dilation to avoid unnecessary loops + for (int64_t h_col = h_col_start; h_col < h_col_end; h_col += 1) { + for (int64_t w_col = w_col_start; w_col < w_col_end; w_col += 1) { + int64_t h_k = (h_im - h_col * stride_height); + int64_t w_k = (w_im - w_col * stride_width); + if (h_k % dilation_height == 0 && w_k % dilation_width == 0) { + h_k /= dilation_height; + w_k /= dilation_width; + int64_t data_col_index = + (((c_im * kernel_h + h_k) * kernel_w + w_k) * height_col + + h_col) * + width_col + + w_col; + val += data_col[data_col_index]; + } + } + } + data_im[index] = static_cast
(val); +} + template C10_LAUNCH_BOUNDS_1(512) __global__ void col2im_kernel( @@ -123,40 +176,23 @@ __global__ void col2im_kernel( const int64_t width_col, dt* data_im) { CUDA_KERNEL_LOOP(index, n) { - accT val = static_cast(0); - const int64_t w_im = index % width + pad_width; - const int64_t h_im = (index / width) % height + pad_height; - const int64_t c_im = index / (width * height); - int64_t kernel_extent_w = (kernel_w - 1) * dilation_width + 1; - int64_t kernel_extent_h = (kernel_h - 1) * dilation_height + 1; - // compute the start and end of the output - const int64_t w_col_start = (w_im < kernel_extent_w) - ? 0 - : (w_im - kernel_extent_w) / stride_width + 1; - const int64_t w_col_end = ::min(w_im / stride_width + 1, width_col); - const int64_t h_col_start = (h_im < kernel_extent_h) - ? 0 - : (h_im - kernel_extent_h) / stride_height + 1; - const int64_t h_col_end = ::min(h_im / stride_height + 1, height_col); - - // TODO: use LCM of stride and dilation to avoid unnecessary loops - for (int64_t h_col = h_col_start; h_col < h_col_end; h_col += 1) { - for (int64_t w_col = w_col_start; w_col < w_col_end; w_col += 1) { - int64_t h_k = (h_im - h_col * stride_height); - int64_t w_k = (w_im - w_col * stride_width); - if (h_k % dilation_height == 0 && w_k % dilation_width == 0) { - h_k /= dilation_height; - w_k /= dilation_width; - int64_t data_col_index = - (((c_im * kernel_h + h_k) * kernel_w + w_k) * height_col + - h_col) * - width_col + - w_col; - val += data_col[data_col_index]; - } - } - } - data_im[index] = static_cast
(val); + col2im_device( + index, + data_col, + height, + width, + channels, + kernel_h, + kernel_w, + pad_height, + pad_width, + stride_height, + stride_width, + dilation_height, + dilation_width, + height_col, + width_col, + data_im); } } @@ -203,5 +239,107 @@ void col2im( C10_CUDA_KERNEL_LAUNCH_CHECK(); } +template +C10_LAUNCH_BOUNDS_1(512) +__global__ void col2im_batched_kernel( + const int64_t n, + const dt* data_col, + const int64_t col_batch_stride, + const int64_t nbatch, + const int64_t height, + const int64_t width, + const int64_t channels, + const int64_t kernel_h, + const int64_t kernel_w, + const int64_t pad_height, + const int64_t pad_width, + const int64_t stride_height, + const int64_t stride_width, + const int64_t dilation_height, + const int64_t dilation_width, + const int64_t height_col, + const int64_t width_col, + dt* data_im, + const int64_t im_batch_stride) { + using accT = at::acc_type; + const auto im_numel = n * nbatch; + + CUDA_KERNEL_LOOP_TYPE(index, im_numel, int64_t) { + const auto ibatch = index / n; + const auto slice_index = index % n; + + col2im_device( + slice_index, + data_col + ibatch * col_batch_stride, + height, + width, + channels, + kernel_h, + kernel_w, + pad_height, + pad_width, + stride_height, + stride_width, + dilation_height, + dilation_width, + height_col, + width_col, + data_im + ibatch * im_batch_stride); + } +} + +template +void col2im_batched( + cudaStream_t stream, + const dt* data_col, + const int64_t col_batch_stride, + const int64_t nbatch, + const int64_t channels, + const int64_t height, + const int64_t width, + const int64_t height_col, + const int64_t width_col, + const int64_t patch_height, + const int64_t patch_width, + const int64_t pad_height, + const int64_t pad_width, + const int64_t stride_height, + const int64_t stride_width, + const int64_t dilation_height, + const int64_t dilation_width, + dt* data_im, + const int64_t im_batch_stride) { + const int64_t num_kernels = channels * height * width; + const int64_t output_numel = nbatch * num_kernels; + if (output_numel == 0) { + return; // No work to do + } + + // To avoid involving atomic operations, we will launch one kernel per + // bottom dimension, and then in the kernel add up the top dimensions. + // CUDA_NUM_THREADS = 1024 + col2im_batched_kernel<<>>( + num_kernels, + data_col, + col_batch_stride, + nbatch, + height, + width, + channels, + patch_height, + patch_width, + pad_height, + pad_width, + stride_height, + stride_width, + dilation_height, + dilation_width, + height_col, + width_col, + data_im, + im_batch_stride); + C10_CUDA_KERNEL_LAUNCH_CHECK(); +} + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/cuda/jit_utils.cpp b/aten/src/ATen/native/cuda/jit_utils.cpp index 673ea9f476e4..a1266fb1b504 100644 --- a/aten/src/ATen/native/cuda/jit_utils.cpp +++ b/aten/src/ATen/native/cuda/jit_utils.cpp @@ -3,10 +3,10 @@ #include #include #include -#include #include #include #include +#include #include #include #include @@ -40,7 +40,148 @@ namespace at { namespace cuda { namespace jit { +// hiprtc already includes some traits, so this removes duplicate definitions of +// integral_constant, is_same, is_integral, enable_if, is_floating_point, is_arithmetic. +// Copied from aten/src/ATen/cuda/llvm_basic.cpp, then modified as above. +// If not compiling for ROCm, return the original get_traits_string(). 
+std::string get_traits_string_but_hiprtc_safe() { +#ifdef USE_ROCM + return R"ESCAPE( +namespace std { + +template +_Tp&& __declval(int); +template +_Tp __declval(long); +template +decltype(__declval<_Tp>(0)) declval() noexcept; + +template struct remove_const {typedef _Tp type;}; +template struct remove_const {typedef _Tp type;}; +template using remove_const_t = typename remove_const<_Tp>::type; + +template struct remove_volatile {typedef _Tp type;}; +template struct remove_volatile {typedef _Tp type;}; +template using remove_volatile_t = typename remove_volatile<_Tp>::type; + +template struct remove_cv +{typedef typename remove_volatile::type>::type type;}; +template using remove_cv_t = typename remove_cv<_Tp>::type; + +template struct __libcpp_is_floating_point : public false_type {}; +template <> struct __libcpp_is_floating_point : public true_type {}; +template <> struct __libcpp_is_floating_point : public true_type {}; +template <> struct __libcpp_is_floating_point : public true_type {}; + +template +inline constexpr bool is_arithmetic_v = is_arithmetic<_Tp>::value; + +template +struct __numeric_type +{ + static void __test(...); + static float __test(float); + static double __test(char); + static double __test(int); + static double __test(unsigned); + static double __test(long); + static double __test(unsigned long); + static double __test(long long); + static double __test(unsigned long long); + static double __test(double); + static long double __test(long double); + + typedef decltype(__test(declval<_Tp>())) type; + static const bool value = !is_same::value; +}; + +template <> +struct __numeric_type +{ + static const bool value = true; +}; + +// __promote + +template ::value && + __numeric_type<_A2>::value && + __numeric_type<_A3>::value> +class __promote_imp +{ +public: + static const bool value = false; +}; + +template +class __promote_imp<_A1, _A2, _A3, true> +{ +private: + typedef typename __promote_imp<_A1>::type __type1; + typedef typename __promote_imp<_A2>::type __type2; + typedef typename __promote_imp<_A3>::type __type3; +public: + typedef decltype(__type1() + __type2() + __type3()) type; + static const bool value = true; +}; + +template +class __promote_imp<_A1, _A2, void, true> +{ +private: + typedef typename __promote_imp<_A1>::type __type1; + typedef typename __promote_imp<_A2>::type __type2; +public: + typedef decltype(__type1() + __type2()) type; + static const bool value = true; +}; + +template +class __promote_imp<_A1, void, void, true> +{ +public: + typedef typename __numeric_type<_A1>::type type; + static const bool value = true; +}; + +template +class __promote : public __promote_imp<_A1, _A2, _A3> {}; + +} // namespace std +)ESCAPE"; +#else + return get_traits_string(); +#endif +} + +#ifdef USE_ROCM +const std::string jit_preamble = R"ESCAPE( +#pragma clang force_cuda_host_device begin +)ESCAPE"; +const std::string jit_epilogue = R"ESCAPE( +#pragma clang force_cuda_host_device end +)ESCAPE"; +#else +const std::string jit_preamble; +const std::string jit_epilogue; +#endif + const std::string jit_common_types = R"ESCAPE( + #ifdef __HIPCC__ + #define ERROR_UNSUPPORTED_CAST ; + // corresponds to aten/src/ATen/native/cuda/thread_constants.h + #define CUDA_OR_ROCM_NUM_THREADS 256 + // corresponds to aten/src/ATen/cuda/detail/OffsetCalculator.cuh + #define MAX_DIMS 16 + #ifndef __forceinline__ + #define __forceinline__ inline __attribute__((always_inline)) + #endif + #else + //TODO use _assert_fail, because assert is disabled in non-debug builds + #define 
ERROR_UNSUPPORTED_CAST assert(false); + #define CUDA_OR_ROCM_NUM_THREADS 128 + #define MAX_DIMS 25 + #endif #define POS_INFINITY __int_as_float(0x7f800000) #define INFINITY POS_INFINITY #define NEG_INFINITY __int_as_float(0xff800000) @@ -54,11 +195,9 @@ const std::string jit_common_types = R"ESCAPE( static_assert(sizeof(int64_t) == 8, "expected size does not match"); static_assert(sizeof(uint32_t) == 4, "expected size does not match"); static_assert(sizeof(int8_t) == 1, "expected size does not match"); - constexpr int num_threads = 128; + constexpr int num_threads = CUDA_OR_ROCM_NUM_THREADS; constexpr int thread_work_size = 4; // TODO: make template substitution once we decide where those vars live constexpr int block_work_size = thread_work_size * num_threads; - //TODO use _assert_fail, because assert is disabled in non-debug builds - #define ERROR_UNSUPPORTED_CAST assert(false); ${traits_string} ${cmath_string} @@ -146,15 +285,22 @@ struct alignas(2) Half { Half() = default; inline __host__ __device__ Half(float value){ +#ifdef __HIPCC__ + x = __half_as_short(__float2half(value)); +#else asm("{ cvt.rn.f16.f32 %0, %1;}\n" : "=h"(x) : "f"(value)); +#endif } inline __host__ __device__ operator float() const{ +#ifdef __HIPCC__ + return __half2float(*reinterpret_cast(&x)); +#else float val; asm("{ cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(x)); // do we need const cast here? //asm("{ cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(__HALF_TO_CUS(x))); return val; +#endif } - }; } )ESCAPE"; @@ -201,9 +347,18 @@ struct alignas(2) BFloat16 { } inline __host__ __device__ operator float() const{ +#ifdef __HIPCC__ + union + { + uint32_t int32; + float fp32; + } u = {uint32_t(x) << 16}; + return u.fp32; +#else float val; asm("{ mov.b32 %0, {0,%1};}\n" : "=f"(val) : "h"(x)); //do we need const cast here? 
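// The HIP branch above widens bfloat16 by placing the 16 stored bits into the
// upper half of an IEEE-754 binary32 word. A host-side sketch of the same
// conversion (illustrative only; memcpy is used instead of the union to stay
// well-defined outside device code):
#include <cstdint>
#include <cstring>

inline float bf16_bits_to_float(uint16_t x) {
  const uint32_t bits = static_cast<uint32_t>(x) << 16;  // bf16 = top 16 bits of f32
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}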
return val; +#endif } }; @@ -450,7 +605,7 @@ const std::string offset_calc_template = R"ESCAPE( } #pragma unroll - for (int dim = 0; dim < 25; ++dim) { + for (int dim = 0; dim < MAX_DIMS; ++dim) { if (dim == dims) { break; } @@ -469,9 +624,9 @@ const std::string offset_calc_template = R"ESCAPE( } int dims; - IntDivider sizes_[25]; + IntDivider sizes_[MAX_DIMS]; // NOTE: this approach will not support nInputs == 0 - ${index_type} strides_[25][NARGS]; + ${index_type} strides_[MAX_DIMS][NARGS]; }; @@ -501,7 +656,7 @@ const std::string jit_code_template = R"ESCAPE( int idx = blockIdx.x; int remaining = numel - block_work_size * idx; - auto thread_idx = threadIdx.x; + int thread_idx = threadIdx.x; #pragma unroll for (int j = 0; j < thread_work_size; j++){ @@ -592,7 +747,7 @@ const std::string jit_vectorized_code_template = R"ESCAPE( constexpr int vec_size = ${vec_size}; using scalar_t = ${scalar_type}; int remaining = N - block_work_size * blockIdx.x; - auto thread_idx = threadIdx.x; + int thread_idx = threadIdx.x; int idx = blockIdx.x; ${declare_load_arrays} ${declare_store_arrays} @@ -651,6 +806,49 @@ const std::string jit_vectorized_code_template = R"ESCAPE( } )ESCAPE"; +static void replace_all(std::string& s, const std::string& to_replace, const std::string& replace_with) { + std::ostringstream oss; + std::size_t pos = 0; + std::size_t prev_pos = pos; + + while (true) { + prev_pos = pos; + pos = s.find(to_replace, pos); + if (pos == std::string::npos) + break; + oss << s.substr(prev_pos, pos - prev_pos); + oss << replace_with; + pos += to_replace.size(); + } + + oss << s.substr(prev_pos); + s = oss.str(); +} + +// hipify replaces certain device math functions, e.g., std::max -> ::max +// See torch/utils/hipify/cuda_to_hip_mappings.py. +// Replace them back. Search for " ::" to avoid duplicate replacements. +static std::string unhipify_math_functions(const std::string &original) { + static std::vector> mappings = { + {" std::max", " ::max"}, + {" std::min", " ::min"}, + {" std::ceil", " ::ceil"}, + {" std::floor", " ::floor"}, + {" std::exp", " ::exp"}, + {" std::log", " ::log"}, + {" std::pow", " ::pow"}, + {" std::fabs", " ::fabs"}, + {" std::fmod", " ::fmod"}, + {" std::remainder", " ::remainder"}, + {" std::frexp", " ::frexp"} + }; + std::string ret = original; + for (const auto& mapping : mappings) { + replace_all(ret, mapping.second, mapping.first); + } + return ret; +} + // The following is copied from fused_kernel.cpp // TODO: refactor codegenOutputQuery into its own file // that can be included by both files @@ -668,7 +866,12 @@ void codegenOutputQuery( int& nvrtc_major, int& nvrtc_minor, bool& compile_to_sass) { - +#ifdef USE_ROCM + AT_CUDA_NVRTC_CHECK(nvrtc().nvrtcVersion(&nvrtc_major, &nvrtc_minor)); + cuda_major = prop->major; + cuda_minor = prop->minor; + compile_to_sass = false; +#else AT_CUDA_NVRTC_CHECK(nvrtc().nvrtcVersion(&nvrtc_major, &nvrtc_minor)); TORCH_CHECK( nvrtc_major >= 6, "NVRTC versions less than 6 are not supported. Is: ", nvrtc_major); @@ -690,6 +893,8 @@ void codegenOutputQuery( max_dev_version = CUDAVersion(7, 5); } else if (nvrtc_version == CUDAVersion(11, 0)) { // 11.0 supports 3-8.0 max_dev_version = CUDAVersion(8, 0); + } else if (nvrtc_major == 11 && nvrtc_minor < 8) { + max_dev_version = CUDAVersion(8, 6); } else { // If the driver version is unknown (i.e. 
newer than this code) // assume the driver supports this device @@ -711,6 +916,7 @@ void codegenOutputQuery( // compile to sass is not allowed prior to CUDA 11.1 compile_to_sass = false; #endif +#endif } // TODO: another copy paste from jit, refactor so it's usable from both @@ -722,7 +928,7 @@ void __inline__ initializeCudaContext() { AT_CUDA_DRIVER_CHECK(at::globalContext().getNVRTC().cuCtxGetCurrent(&pctx)); if (!pctx) { std::unique_lock cudaFreeMutexLock( - *(c10::cuda::CUDACachingAllocator::getFreeMutex())); + *(c10::cuda::getFreeMutex())); cudaFree(nullptr); } } @@ -764,7 +970,7 @@ constexpr int thread_work_size = THREAD_WORK_SIZE; std::string generate_code( int nInputs, int nOutputs, - const std::string& func, + const std::string& func_, const std::string& name, const std::string& f_inputs_type, const std::string& compute_type, @@ -776,6 +982,7 @@ std::string generate_code( bool vectorized, int vec_size, bool return_by_ref) { + std::string func = func_; at::jit::TemplateEnv env; env.s("index_type", "unsigned int"); @@ -887,11 +1094,16 @@ std::string generate_code( f_inputs_type == "std::complex" || result_type == "std::complex" || f_inputs_type == "std::complex" || result_type == "std::complex") { // complex depends on complex and Half dtypes. - env.s("traits_string", get_traits_string()); + env.s("traits_string", get_traits_string_but_hiprtc_safe()); env.s("complex_body_string", get_complex_body_string()); env.s("complex_math_string", get_complex_math_string()); +#ifdef USE_ROCM + // unhipify math functions, but only if std::complex is used. + func = unhipify_math_functions(func); + env.s("functor", func); +#endif } else if (dynamic_casting) { - env.s("traits_string", get_traits_string()); + env.s("traits_string", get_traits_string_but_hiprtc_safe()); env.s("complex_body_string", get_complex_body_string()); env.s("complex_math_string", ""); } else { @@ -948,7 +1160,8 @@ std::string generate_code( } env.s("store_outputs", store_outputs.str()); - static auto cuda_template = at::jit::CodeTemplate(jit_common_types + offset_calc_template + jit_code_template); + static auto cuda_template = at::jit::CodeTemplate( + jit_preamble + jit_common_types + offset_calc_template + jit_code_template + jit_epilogue); const auto code = cuda_template.format(env); return code; } @@ -1014,7 +1227,8 @@ std::string generate_code( } env.s("store_unrolled_outputs", store_unrolled_outputs.str()); - static auto cuda_template = at::jit::CodeTemplate(jit_common_types + jit_vectorized_code_template); + static auto cuda_template = at::jit::CodeTemplate( + jit_preamble + jit_common_types + jit_vectorized_code_template + jit_epilogue); const auto code = cuda_template.format(env); return code; } @@ -1114,7 +1328,7 @@ std::string generate_reduction_code( std::string generate_reduction_code( int nOutputs, - const std::string& func, + const std::string& func_, const std::string& name, const int vt0, const std::string& f_inputs_type, @@ -1124,6 +1338,7 @@ std::string generate_reduction_code( bool vectorized, int vec_size, int max_threads_codegen) { + std::string func = func_; at::jit::TemplateEnv env; env.s("index_type", "unsigned int"); env.s("scalar_type", f_inputs_type); @@ -1149,10 +1364,14 @@ std::string generate_reduction_code( f_inputs_type == "std::complex" || f_inputs_type == "std::complex" ) { // complex depends on complex and Half dtypes. 
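// The ROCm-only lines added in this hunk call unhipify_math_functions on the user
// functor (as generate_code does above), turning the hipified " ::max" / " ::pow"
// spellings back into " std::max" / " std::pow" so the complex math headers
// resolve. A standalone sketch of that substitution (illustrative only -- the
// helpers above are file-static, so the replacement loop is re-implemented here):
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
  std::string functor = "T combine(T a, T b) { return ::max(a, ::pow(b, T{2})); }";
  const std::vector<std::pair<std::string, std::string>> mappings = {
      {" ::max", " std::max"}, {" ::pow", " std::pow"}};
  for (const auto& m : mappings) {
    std::size_t pos = 0;
    while ((pos = functor.find(m.first, pos)) != std::string::npos) {
      functor.replace(pos, m.first.size(), m.second);
      pos += m.second.size();
    }
  }
  std::cout << functor << "\n";
  // prints: T combine(T a, T b) { return std::max(a, std::pow(b, T{2})); }
}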
- env.s("traits_string", get_traits_string()); + env.s("traits_string", get_traits_string_but_hiprtc_safe()); env.s("complex_body_string", get_complex_body_string()); env.s("complex_math_string", get_complex_math_string()); env.s("complex", std::to_string(1)); +#ifdef USE_ROCM + // unhipify math functions, but only if std::complex is used. + func = unhipify_math_functions(func); +#endif } else { env.s("traits_string", ""); env.s("complex_body_string", ""); @@ -1168,7 +1387,7 @@ std::string generate_reduction_code( env.s("functor", func); env.s("output_vec_size", std::to_string(vec_size)); static auto cuda_template = at::jit::CodeTemplate( - jit_common_types + offset_calc_template + get_reduction_template()); + jit_preamble + jit_common_types + offset_calc_template + get_reduction_template() + jit_epilogue); const auto code = cuda_template.format(env); return code; } @@ -1312,6 +1531,9 @@ NvrtcFunction jit_pwise_function( AT_CUDA_NVRTC_CHECK(nvrtc.nvrtcCreateProgram( &program, code.c_str(), nullptr, 0, nullptr, nullptr)); +#ifdef USE_ROCM + std::vector args = {"--std=c++14"}; +#else // Constructs nvrtc build arguments // CUDA 11.1 allows going directly to SASS (sm_) instead of PTX (compute_) // which gives better backwards compatibility to work on older driver, @@ -1326,6 +1548,7 @@ NvrtcFunction jit_pwise_function( // NOLINTNEXTLINE(cppcoreguidelines-init-variables) std::vector args = { "--std=c++14", compute.c_str(), "-default-device"}; +#endif #ifndef NDEBUG // Add line info to generated kernels diff --git a/aten/src/ATen/native/cuda/jit_utils.h b/aten/src/ATen/native/cuda/jit_utils.h index 13aa723db275..8206f67316e1 100644 --- a/aten/src/ATen/native/cuda/jit_utils.h +++ b/aten/src/ATen/native/cuda/jit_utils.h @@ -8,7 +8,6 @@ #include #include #include -#include namespace at { namespace cuda { namespace jit { diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu index 96d700c761eb..693524818fb4 100644 --- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu +++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu @@ -25,6 +25,8 @@ #endif #include +#include + namespace at { namespace native { @@ -33,6 +35,7 @@ namespace { constexpr int kCUDANumThreads = 256; constexpr int kColwiseReduceTileSize = 32; +constexpr unsigned int kWarpSize = 32; constexpr int vec_size = 4; //we could make it dependent on dtype, but that would lead to different results between float and low-p types // aligned vector generates vectorized load/store on CUDA (copy-pasted from MemoryAccess.cuh) @@ -555,8 +558,108 @@ __global__ void GammaBetaBackwardCUDAKernel1( } } +template +__global__ void GammaBetaBackwardCUDAKernel_32x32( + int64_t M, + int64_t N, + const T* dY, + const T* X, + const T_ACC* mean, + const T_ACC* rstd, + T* dg, + T* db) { + alignas(sizeof(double)) extern __shared__ char s_data1[]; + T_ACC* s_data_typed = reinterpret_cast(&s_data1); + T_ACC* s_dg; + T_ACC* s_db; + + T_ACC dg_sum = 0; + T_ACC db_sum = 0; + + const int64_t j = blockIdx.x * blockDim.x + threadIdx.x; + + if (j < N) { + constexpr int unroll_factor = 8; + int laneId = threadIdx.x & 0x1f; + + T_ACC mean_reg, mean_reg_tmp; + T_ACC rstd_reg, rstd_reg_tmp; + T dY_reg; + T X_reg; + + // Main loop + int bcounter; + for (bcounter = 0; bcounter < M / (blockDim.y * unroll_factor); + bcounter++) { + int offset = (bcounter * blockDim.y + threadIdx.y) * unroll_factor; + + if (laneId < unroll_factor) { + mean_reg_tmp = mean[offset + laneId]; + rstd_reg_tmp = rstd[offset + laneId]; + } +#if 
!defined(USE_ROCM) + // Volta and newer architectures allow lane divergence within a warp. + __syncwarp(); +#endif + + #pragma unroll + for (int ii = 0; ii < unroll_factor; ++ii) { + dY_reg = dY[(offset + ii) * N + j]; + X_reg = X[(offset + ii) * N + j]; + mean_reg = WARP_SHFL(mean_reg_tmp, ii, kWarpSize); + rstd_reg = WARP_SHFL(rstd_reg_tmp, ii, kWarpSize); + dg_sum += dY_reg * (X_reg - mean_reg) * rstd_reg; + db_sum += dY_reg; + } + } + + // Remainder loop + int offset = (bcounter * blockDim.y + threadIdx.y) * unroll_factor; + for (int ii = 0; ii < unroll_factor; ii++) { + if ((offset + ii) < M) { + mean_reg = mean[offset + ii]; + rstd_reg = rstd[offset + ii]; + dY_reg = dY[(offset + ii) * N + j]; + X_reg = X[(offset + ii) * N + j]; + dg_sum += dY_reg * (X_reg - mean_reg) * rstd_reg; + db_sum += dY_reg; + } + } + + // This kernel uses a block of (32 x 32) and gets called when M; N + // divide by 32. We can use warp shuffles for the final reduction + // step. This removes 4 shmem loads and stores with their + // corresponding __syncthreads() + + // This greatly reduces bank conflicts at the expense of a little + // extra shared memory. It does not impact occupancy + int padded_bx = (1 + blockDim.x); + s_dg = s_data_typed; + s_db = s_data_typed + (padded_bx * blockDim.y); + s_dg[threadIdx.y * padded_bx + threadIdx.x] = dg_sum; + s_db[threadIdx.y * padded_bx + threadIdx.x] = db_sum; + __syncthreads(); + + // Load transposed so that a warp holds an entire column + T_ACC reg_dg = s_dg[threadIdx.x * padded_bx + threadIdx.y]; + T_ACC reg_db = s_db[threadIdx.x * padded_bx + threadIdx.y]; + for (int delta = 16; delta >= 1; delta /= 2) { + reg_dg += WARP_SHFL_XOR(reg_dg, delta, kWarpSize); + reg_db += WARP_SHFL_XOR(reg_db, delta, kWarpSize); + } + if (threadIdx.x == 0) { + const int64_t j = blockIdx.x * blockDim.x + threadIdx.y; + if (dg) { + dg[j] = reg_dg; + } + if (db) { + db[j] = reg_db; + } + } + } +} template __global__ void GammaBetaBackwardCUDAKernel( @@ -569,66 +672,75 @@ __global__ void GammaBetaBackwardCUDAKernel( T* dg, T* db) { alignas(sizeof(double)) extern __shared__ char s_data1[]; - T_ACC * s_data_typed = reinterpret_cast(&s_data1); + T_ACC* s_data_typed = reinterpret_cast(&s_data1); + T_ACC* s_dg; + T_ACC* s_db; + const int64_t j = blockIdx.x * blockDim.x + threadIdx.x; - constexpr int unroll = 8; - T dYs[unroll]; - T Xs[unroll]; - T_ACC * means = s_data_typed; - T_ACC * rstds = s_data_typed + unroll * blockDim.y; + T_ACC dg_sum = 0; T_ACC db_sum = 0; + if (j < N) { + constexpr int unroll_factor = 8; + + T_ACC mean_reg; + T_ACC rstd_reg; + T dY_reg; + T X_reg; + + // Main Loop int bcounter; - for (bcounter = 0; bcounter < M/(blockDim.y * unroll); bcounter++){ - int offset = (bcounter * blockDim.y + threadIdx.y) * unroll; - #pragma unroll - for (int ii=0; ii=1; offset /= 2){ + + for (int offset = blockDim.y / 2; offset >= 1; offset /= 2) { if (threadIdx.y < offset) { - s_data_typed[threadIdx.y * blockDim.x + threadIdx.x] += s_data_typed[(threadIdx.y + offset) * blockDim.x + threadIdx.x]; - s_data_typed[blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x] += - s_data_typed[blockDim.x * blockDim.y + (threadIdx.y + offset) * blockDim.x + threadIdx.x]; - } + s_dg[threadIdx.y * blockDim.x + threadIdx.x] += + s_dg[(threadIdx.y + offset) * blockDim.x + threadIdx.x]; + s_db[threadIdx.y * blockDim.x + threadIdx.x] += + s_db[(threadIdx.y + offset) * blockDim.x + threadIdx.x]; + } __syncthreads(); } + if (threadIdx.y == 0) { if (dg) { - dg[j] = s_data_typed[threadIdx.x]; + dg[j] 
= s_dg[threadIdx.x]; } if (db) { - db[j] = s_data_typed[threadIdx.x + blockDim.x * blockDim.y]; + db[j] = s_db[threadIdx.x]; } } } @@ -684,7 +796,7 @@ void LayerNormKernelImplInternal( auto can_vectorize = [&](const T * ptr, int alignment){uint64_t addr = reinterpret_cast(ptr); return addr % alignment == 0;}; constexpr int num_vec_elems = vec_size; constexpr int alignment = num_vec_elems * sizeof(T); - if ((std::is_same::value || std::is_same::value) && + if ((std::is_same::value || std::is_same::value || std::is_same::value) && N <= 1ULL << std::numeric_limits::digits && N % num_vec_elems == 0 && can_vectorize(X_data, alignment) && can_vectorize(Y_data, alignment)) { launch_vectorized_layer_norm_kernel(static_cast(N), M, eps, X_data, gamma_data, beta_data, Y_data, mean_data, rstd_data); @@ -722,6 +834,201 @@ void LayerNormKernelImpl( }); } +template __device__ +void cuLoadWriteStridedInputs( + const int i1_block, + const int thr_load_row_off, + const int thr_load_col_off, + const int i2_off, + const int row_stride, + T_ACC* warp_buf1, + T_ACC* warp_buf2, + const T* input, + const T* dout, + const int i1_end, + const int64_t N, + const T_ACC* __restrict__ mean, + const T_ACC* __restrict__ rstd) +{ + int i1 = i1_block+thr_load_row_off; + if (i1 < i1_end) { + T curr_mean = mean[i1]; + T curr_rstd = rstd[i1]; + for (int k = 0; k < blockDim.y; ++k) { + int i2 = i2_off + k; + int load_idx = i1*N+i2; + int write_idx = thr_load_row_off*row_stride+thr_load_col_off+k; + if (i2(input[load_idx]); + T curr_dout = static_cast(dout[load_idx]); + warp_buf1[write_idx] = curr_dout; + warp_buf2[write_idx] = curr_dout * (curr_input - curr_mean) * curr_rstd; + } else { + warp_buf1[write_idx] = T(0); + warp_buf2[write_idx] = T(0); + } + } + } else { + for (int k = 0; k < blockDim.y; ++k) { + int write_idx = thr_load_row_off*row_stride+thr_load_col_off+k; + warp_buf1[write_idx] = T(0); + warp_buf2[write_idx] = T(0); + } + } +} + +template __device__ +void cuLoadAddStridedInputs( + const int i1_block, + const int thr_load_row_off, + const int thr_load_col_off, + const int i2_off, + const int row_stride, + T_ACC* warp_buf1, + T_ACC* warp_buf2, + const T* input, + const T* dout, + const int i1_end, + const int64_t N, + const T_ACC* __restrict__ mean, + const T_ACC* __restrict__ rstd) +{ + int i1 = i1_block+thr_load_row_off; + if (i1 < i1_end) { + T_ACC curr_mean = mean[i1]; + T_ACC curr_rstd = rstd[i1]; + for (int k = 0; k < blockDim.y; ++k) { + int i2 = i2_off + k; + int load_idx = i1*N+i2; + int write_idx = thr_load_row_off*row_stride+thr_load_col_off+k; + if (i2(input[load_idx]); + T_ACC curr_dout = static_cast(dout[load_idx]); + warp_buf1[write_idx] += curr_dout; + warp_buf2[write_idx] += curr_dout * (curr_input - curr_mean) * curr_rstd; + } + } + } +} + +template __global__ +void cuComputePartGradGammaBeta( + const T* __restrict__ dout, + const T* __restrict__ input, + const int64_t M, + const int64_t N, + const T_ACC* __restrict__ mean, + const T_ACC* __restrict__ rstd, + T_ACC* part_grad_gamma, + T_ACC* part_grad_beta) +{ + const int numsegs_M = (M+blockDim.y*blockDim.y-1) / (blockDim.y*blockDim.y); + const int segs_per_block = (numsegs_M + gridDim.y - 1) / gridDim.y; + const int i1_beg = blockIdx.y * segs_per_block * blockDim.y*blockDim.y; + const int i1_beg_plus_one = (blockIdx.y+1) * segs_per_block * blockDim.y*blockDim.y; + const int i1_end = i1_beg_plus_one < M ? 
i1_beg_plus_one : M; + const int row_stride = blockDim.x+1; + const int thr_load_col_off = (threadIdx.x*blockDim.y)&(blockDim.x-1); + const int thr_load_row_off = (threadIdx.x*blockDim.y)/blockDim.x + threadIdx.y*blockDim.y; + const int i2_off = blockIdx.x * blockDim.x + thr_load_col_off; + alignas(sizeof(double)) extern __shared__ char shared[]; + T_ACC * buf = reinterpret_cast(&shared); // buf has at least blockDim.x * blockDim.y * blockDim.y + (blockDim.y - 1)*(blockDim.x/blockDim.y) elements + T_ACC* warp_buf1 = (T_ACC*)buf; + T_ACC* warp_buf2 = warp_buf1 + blockDim.y * blockDim.y * row_stride; + // compute partial sums from strided inputs + // do this to increase number of loads in flight + cuLoadWriteStridedInputs(i1_beg,thr_load_row_off,thr_load_col_off,i2_off,row_stride,warp_buf1,warp_buf2,input,dout,i1_end,N,mean,rstd); + for (int i1_block = i1_beg+blockDim.y*blockDim.y; i1_block < i1_end; i1_block+=blockDim.y*blockDim.y) { + cuLoadAddStridedInputs(i1_block,thr_load_row_off,thr_load_col_off,i2_off,row_stride,warp_buf1,warp_buf2,input,dout,i1_end,N,mean,rstd); + } + __syncthreads(); + // inter-warp reductions + // sum within each warp + T_ACC acc1 = T_ACC(0); + T_ACC acc2 = T_ACC(0); + for (int k = 0; k < blockDim.y; ++k) { + int row1 = threadIdx.y + k*blockDim.y; + int idx1 = row1*row_stride + threadIdx.x; + acc1 += warp_buf1[idx1]; + acc2 += warp_buf2[idx1]; + } + warp_buf1[threadIdx.y*row_stride+threadIdx.x] = acc1; + warp_buf2[threadIdx.y*row_stride+threadIdx.x] = acc2; + __syncthreads(); + // sum all warps + for (int offset = blockDim.y/2; offset > 1; offset /= 2) { + if (threadIdx.y < offset) { + int row1 = threadIdx.y; + int row2 = threadIdx.y + offset; + int idx1 = row1*row_stride + threadIdx.x; + int idx2 = row2*row_stride + threadIdx.x; + warp_buf1[idx1] += warp_buf1[idx2]; + warp_buf2[idx1] += warp_buf2[idx2]; + } + __syncthreads(); + } + int i2 = blockIdx.x * blockDim.x + threadIdx.x; + if (threadIdx.y == 0 && i2 < N) { + int row1 = threadIdx.y; + int row2 = threadIdx.y + 1; + int idx1 = row1*row_stride + threadIdx.x; + int idx2 = row2*row_stride + threadIdx.x; + part_grad_beta[blockIdx.y*N+i2] = warp_buf1[idx1] + warp_buf1[idx2]; + part_grad_gamma[blockIdx.y*N+i2] = warp_buf2[idx1] + warp_buf2[idx2]; + } +} + +template __global__ +void cuComputeGradGammaBeta( + const T_ACC* part_grad_gamma, + const T_ACC* part_grad_beta, + const int part_size, + const int64_t M, + const int64_t N, + T* grad_gamma, + T* grad_beta) +{ + // sum partial gradients for gamma and beta + alignas(sizeof(double)) extern __shared__ char shared[]; + T_ACC * buf = reinterpret_cast(&shared); + int i2 = blockIdx.x * blockDim.x + threadIdx.x; + if (i2 < N) { + // each warp does sequential reductions until reduced part_size is num_warps + int num_warp_reductions = part_size / blockDim.y; + T_ACC sum_gamma = T_ACC(0); + T_ACC sum_beta = T_ACC(0); + const T_ACC* part_grad_gamma_ptr = part_grad_gamma + threadIdx.y * num_warp_reductions * N + i2; + const T_ACC* part_grad_beta_ptr = part_grad_beta + threadIdx.y * num_warp_reductions * N + i2; + for (int warp_offset = 0; warp_offset < num_warp_reductions; ++warp_offset) { + sum_gamma += part_grad_gamma_ptr[warp_offset*N]; + sum_beta += part_grad_beta_ptr[warp_offset*N]; + } + // inter-warp reductions + const int nbsize3 = blockDim.x * blockDim.y / 2; + for (int offset = blockDim.y/2; offset >= 1; offset /= 2) { + // top half write to shared memory + if (threadIdx.y >= offset && threadIdx.y < 2*offset) { + const int write_idx = (threadIdx.y - offset) * 
blockDim.x + threadIdx.x; + buf[write_idx] = sum_gamma; + buf[write_idx+nbsize3] = sum_beta; + } + __syncthreads(); + // bottom half sums + if (threadIdx.y < offset) { + const int read_idx = threadIdx.y * blockDim.x + threadIdx.x; + sum_gamma += buf[read_idx]; + sum_beta += buf[read_idx+nbsize3]; + } + __syncthreads(); + } + // write out fully summed gradients + if (threadIdx.y == 0) { + grad_gamma[i2] = sum_gamma; + grad_beta[i2] = sum_beta; + } + } +} + template void LayerNormBackwardKernelImplInternal( const Tensor& dY, @@ -750,8 +1057,8 @@ void LayerNormBackwardKernelImplInternal( gamma.defined() ? gamma.template data_ptr() : nullptr; T* dX_data = dX->defined() ? dX->template data_ptr() : nullptr; cudaStream_t cuda_stream = at::cuda::getCurrentCUDAStream(); + const int warp_size = at::cuda::warp_size(); if (dX_data != nullptr) { - const int warp_size = at::cuda::warp_size(); const dim3 blocks(M); int nshared = (num_threads()/warp_size) * sizeof(T_ACC); layer_norm_grad_input_kernel<<>>(dY_data, @@ -763,7 +1070,8 @@ void LayerNormBackwardKernelImplInternal( T* dgamma_data = dgamma->defined() ? dgamma->template data_ptr() : nullptr; T* dbeta_data = dbeta->defined() ? dbeta->template data_ptr() : nullptr; - if (M < 512) { + + if (M < 128) { // For small batch size, do colwise reduce directly. const int64_t B = (N + kCUDANumThreads - 1) / kCUDANumThreads; GammaBetaBackwardSimpleCUDAKernel @@ -778,19 +1086,77 @@ void LayerNormBackwardKernelImplInternal( dbeta_data); C10_CUDA_KERNEL_LAUNCH_CHECK(); } else { - dim3 threads{16, 32}; - int blocks = (N + threads.x-1)/threads.x; - GammaBetaBackwardCUDAKernel - <<>>( - M, - N, - dY_data, - X_data, - mean_data, - rstd_data, - dgamma_data, - dbeta_data); +#if defined(USE_ROCM) + // For small batch size, do colwise reduce directly. + const int part_size = warp_size; + const dim3 threads2(warp_size, 4, 1); + const dim3 blocks2((N + threads2.x - 1) / threads2.x, part_size, 1); + const int nshared2_a = 2 * sizeof(T_ACC) * threads2.y * threads2.y * (threads2.x + 1); + const int nshared2_b = threads2.x * threads2.y * sizeof(T_ACC); + const int nshared2 = nshared2_a > nshared2_b ? nshared2_a : nshared2_b; + + const auto part_grad_dtype = at::toAccumulateType(X.scalar_type(), true); + Tensor part_grad_gamma = at::empty({part_size,N}, gamma.options().dtype(part_grad_dtype)); + Tensor part_grad_beta = at::native::empty_like(part_grad_gamma); + cuComputePartGradGammaBeta<<>>( + dY_data, + X_data, + M,N, + mean_data, + rstd_data, + part_grad_gamma.template data_ptr(), + part_grad_beta.template data_ptr()); C10_CUDA_KERNEL_LAUNCH_CHECK(); + + const dim3 threads3(warp_size, 8, 1); // Optimization for ROCm + const dim3 blocks3((N + threads2.x - 1) / threads2.x, 1, 1); + const int nshared3 = threads3.x * threads3.y * sizeof(T); + cuComputeGradGammaBeta<<>>( + part_grad_gamma.template data_ptr(), + part_grad_beta.template data_ptr(), + part_size, + M,N, + dgamma_data, + dbeta_data); + C10_CUDA_KERNEL_LAUNCH_CHECK(); +#else + if ((M % kWarpSize == 0) && (N % kWarpSize == 0)) { + // This implementation relies on warp primitives and requires that M and N divide + // exactly to warp size. + dim3 threads{kWarpSize, kWarpSize}; + int blocks = (N + threads.x - 1) / threads.x; + + // If M and N divide by 32, we can use warp shuffles for the final reduction. That requires + // transposing values in shared memory, so we apply a padding to reduce bank conflicts. 
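// The loop inside GammaBetaBackwardCUDAKernel_32x32 reduces each transposed column
// with a butterfly of XOR shuffles. A minimal standalone sketch of that pattern
// (illustrative; assumes a CUDA build where WARP_SHFL_XOR maps to __shfl_xor_sync
// and the warp size is 32):
__device__ float warp_butterfly_sum(float v) {
  for (int delta = 16; delta >= 1; delta /= 2) {
    v += __shfl_xor_sync(0xffffffff, v, delta, 32);  // exchange with lane ^ delta
  }
  return v;  // every lane now holds the sum over the full warp
}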
+ size_t shmem_sz = 2 * sizeof(T_ACC) * (threads.x + 1) * threads.y; + GammaBetaBackwardCUDAKernel_32x32 + <<>>( + M, + N, + dY_data, + X_data, + mean_data, + rstd_data, + dgamma_data, + dbeta_data); + C10_CUDA_KERNEL_LAUNCH_CHECK(); + } else { + dim3 threads{16, 32}; + int blocks = (N + threads.x - 1) / threads.x; + size_t shmem_sz = 2 * sizeof(T_ACC) * threads.x * threads.y; + GammaBetaBackwardCUDAKernel + <<>>( + M, + N, + dY_data, + X_data, + mean_data, + rstd_data, + dgamma_data, + dbeta_data); + C10_CUDA_KERNEL_LAUNCH_CHECK(); + } +#endif } } } diff --git a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp index 320c799f23bc..22de5012f11f 100644 --- a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp +++ b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp @@ -24,7 +24,6 @@ #include #else #include -#include #include #include #include @@ -33,7 +32,6 @@ #include #include #include -#include #include #include #endif @@ -115,20 +113,6 @@ void magmaLuNoPivBatched( magma_int_t m, magma_int_t n, scalar_t** dA_array, magma_int_t ldda, magma_int_t* info_array, magma_int_t batchsize, const MAGMAQueue& magma_queue); -template -inline magma_int_t magmaGetriOptimalBlocksize(magma_int_t n); - -template -void magmaGetri( - magma_int_t n, scalar_t* dA, magma_int_t ldda, magma_int_t* ipiv, scalar_t* dwork, - magma_int_t lwork, magma_int_t* info); - -template -void magmaGetriBatched( - magma_int_t n, scalar_t** dA_array, magma_int_t ldda, - magma_int_t** ipiv_array, scalar_t** dinvA_array, magma_int_t lddia, - magma_int_t* info_array, magma_int_t batchsize, const MAGMAQueue& magma_queue); - template void magmaCholeskySolve( magma_uplo_t uplo, magma_int_t n, magma_int_t nrhs, scalar_t* dA, magma_int_t ldda, @@ -400,154 +384,6 @@ void magmaLuNoPivBatched>( AT_CUDA_CHECK(cudaGetLastError()); } -template<> -inline magma_int_t magmaGetriOptimalBlocksize(magma_int_t n) { - return magma_get_dgetri_nb(n); -} - -template<> -inline magma_int_t magmaGetriOptimalBlocksize(magma_int_t n) { - return magma_get_sgetri_nb(n); -} - -template <> -inline magma_int_t magmaGetriOptimalBlocksize>( - magma_int_t n) { - return magma_get_zgetri_nb(n); -} - -template <> -inline magma_int_t magmaGetriOptimalBlocksize>( - magma_int_t n) { - return magma_get_cgetri_nb(n); -} - -template<> -void magmaGetri( - magma_int_t n, double* dA, magma_int_t ldda, magma_int_t* ipiv, double* dwork, - magma_int_t lwork, magma_int_t* info) { - MagmaStreamSyncGuard guard; - magma_dgetri_gpu(n, dA, ldda, ipiv, dwork, lwork, info); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template<> -void magmaGetri( - magma_int_t n, float* dA, magma_int_t ldda, magma_int_t* ipiv, float* dwork, - magma_int_t lwork, magma_int_t* info) { - MagmaStreamSyncGuard guard; - magma_sgetri_gpu(n, dA, ldda, ipiv, dwork, lwork, info); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template <> -void magmaGetri>( - magma_int_t n, - c10::complex* dA, - magma_int_t ldda, - magma_int_t* ipiv, - c10::complex* dwork, - magma_int_t lwork, - magma_int_t* info) { - MagmaStreamSyncGuard guard; - magma_zgetri_gpu( - n, - reinterpret_cast(dA), - ldda, - ipiv, - reinterpret_cast(dwork), - lwork, - info); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template <> -void magmaGetri>( - magma_int_t n, - c10::complex* dA, - magma_int_t ldda, - magma_int_t* ipiv, - c10::complex* dwork, - magma_int_t lwork, - magma_int_t* info) { - MagmaStreamSyncGuard guard; - magma_cgetri_gpu( - n, - reinterpret_cast(dA), - ldda, - ipiv, - 
reinterpret_cast(dwork), - lwork, - info); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template<> -void magmaGetriBatched( - magma_int_t n, double** dA_array, magma_int_t ldda, - magma_int_t** ipiv_array, double** dinvA_array, magma_int_t lddia, - magma_int_t* info_array, magma_int_t batchsize, const MAGMAQueue& magma_queue) { - magma_dgetri_outofplace_batched(n, dA_array, ldda, ipiv_array, dinvA_array, lddia, info_array, batchsize, magma_queue.get_queue()); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template<> -void magmaGetriBatched( - magma_int_t n, float** dA_array, magma_int_t ldda, - magma_int_t** ipiv_array, float** dinvA_array, magma_int_t lddia, - magma_int_t* info_array, magma_int_t batchsize, const MAGMAQueue& magma_queue) { - magma_sgetri_outofplace_batched(n, dA_array, ldda, ipiv_array, dinvA_array, lddia, info_array, batchsize, magma_queue.get_queue()); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template <> -void magmaGetriBatched>( - magma_int_t n, - c10::complex** dA_array, - magma_int_t ldda, - magma_int_t** ipiv_array, - c10::complex** dinvA_array, - magma_int_t lddia, - magma_int_t* info_array, - magma_int_t batchsize, - const MAGMAQueue& magma_queue) { - magma_zgetri_outofplace_batched( - n, - reinterpret_cast(dA_array), - ldda, - ipiv_array, - reinterpret_cast(dinvA_array), - lddia, - info_array, - batchsize, - magma_queue.get_queue()); - AT_CUDA_CHECK(cudaGetLastError()); -} - -template <> -void magmaGetriBatched>( - magma_int_t n, - c10::complex** dA_array, - magma_int_t ldda, - magma_int_t** ipiv_array, - c10::complex** dinvA_array, - magma_int_t lddia, - magma_int_t* info_array, - magma_int_t batchsize, - const MAGMAQueue& magma_queue) { - magma_cgetri_outofplace_batched( - n, - reinterpret_cast(dA_array), - ldda, - ipiv_array, - reinterpret_cast(dinvA_array), - lddia, - info_array, - batchsize, - magma_queue.get_queue()); - AT_CUDA_CHECK(cudaGetLastError()); -} - template<> void magmaCholeskySolve( magma_uplo_t uplo, magma_int_t n, magma_int_t nrhs, double* dA, magma_int_t ldda, @@ -1319,156 +1155,6 @@ void ldl_solve_kernel( REGISTER_CUDA_DISPATCH(ldl_factor_stub, &ldl_factor_kernel) REGISTER_CUDA_DISPATCH(ldl_solve_stub, &ldl_solve_kernel) -// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ inverse ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -/* -Computes the inverse of n-by-n matrix 'self', it is saved to 'self_inv'. -'infos' is an int Tensor containing error codes for each matrix in the batched input. -'infos_lu' is for holding magmaLU errors, and 'infos_getri' is for holding magmaGetri errors -For more information see MAGMA's documentation for GETRI and GETRF routines. -*/ -template -static void apply_batched_inverse(Tensor& self, Tensor& self_inv, Tensor& infos_lu, Tensor& infos_getri) { -#if !AT_MAGMA_ENABLED() -AT_ERROR("inverse: MAGMA library not found in " - "compilation. 
Please rebuild with MAGMA."); -#else - auto self_data = self.data_ptr(); - auto self_mat_stride = matrixStride(self); - auto self_inv_data = self_inv.data_ptr(); - auto self_inv_mat_stride = matrixStride(self_inv); - - auto infos_lu_data = infos_lu.data_ptr(); - auto infos_getri_data = infos_getri.data_ptr(); - - magma_int_t batch_size = magma_int_cast(batchCount(self), "batchCount"); - // MAGMA does not work with batch_size == 0, let's return early in this case - if (batch_size == 0) { - return; - } - - magma_int_t n = magma_int_cast(self.size(-2), "self.size(-2)"); - magma_int_t lda = std::max(1, n); - - magma_int_t* ipiv_data; - magma_int_t** ipiv_array; - scalar_t** self_array; - scalar_t** self_inv_array; - - ALLOCATE_ARRAY(ipiv_data, magma_int_t, batch_size * lda); - ALLOCATE_ARRAY(ipiv_array, magma_int_t*, batch_size); - ALLOCATE_ARRAY(self_array, scalar_t*, batch_size); - ALLOCATE_ARRAY(self_inv_array, scalar_t*, batch_size); - - // Set up the created arrays - for (int64_t i = 0; i < batch_size; i++) { - self_array[i] = &self_data[i * self_mat_stride]; - self_inv_array[i] = &self_inv_data[i * self_inv_mat_stride]; - ipiv_array[i] = &ipiv_data[i * n]; - } - // magmaLuBatched leaves ipiv_data values unwritten for singular matrices. - // Initialize to avoid memory access violations inside magma kernels (gh-51930). - std::fill_n(ipiv_data, batch_size * n, 1); - - MAGMAQueue magma_queue(self.get_device()); - magmaLuBatched( - n, n, self_array, lda, ipiv_array, infos_lu_data, - batch_size, magma_queue); - - constexpr int64_t batch_limit = 65535; - // Compute as many batches of 65535 possible - // The number of "mini"-batches are floor(batch_size / batch_limit) - // and these cover floor(batch_size / batch_limit) * batch_limit matrix solves - int64_t mini_batches = batch_size / batch_limit, mini_idx; - for (mini_idx = 0; mini_idx < mini_batches * batch_limit; mini_idx += batch_limit) { - scalar_t** self_array_cur = &self_array[mini_idx]; - scalar_t** self_inv_array_cur = &self_inv_array[mini_idx]; - magma_int_t** ipiv_array_cur = &ipiv_array[mini_idx]; - magma_int_t* info_array_cur_getri = &infos_getri_data[mini_idx]; - - magmaGetriBatched( - n, self_array_cur, lda, ipiv_array_cur, self_inv_array_cur, - lda, info_array_cur_getri, batch_limit, magma_queue); - } - - // Compute whatever is left = batch_size - floor(batch_size / batch_limit) * batch_limit - // which concisely is equal to batch_size % batch_limit - if (batch_size % batch_limit != 0) { - magmaGetriBatched( - n, &self_array[mini_idx], lda, &ipiv_array[mini_idx], &self_inv_array[mini_idx], - lda, &infos_getri_data[mini_idx], batch_size % batch_limit, magma_queue); - } -#endif -} - -template -static void apply_single_inverse(Tensor& self, Tensor& info_lu, Tensor& info_getri) { -#if !AT_MAGMA_ENABLED() -AT_ERROR("inverse: MAGMA library not found in " - "compilation. 
Please rebuild with MAGMA."); -#else - auto self_data = self.data_ptr(); - magma_int_t n = magma_int_cast(self.size(-2), "self.size(-2)"); - magma_int_t lda = std::max(1, n); - magma_int_t lwork = n * magmaGetriOptimalBlocksize(n); - - // magmaLu and magmaGetri requires info argument to live on CPU - // but info_lu and info_getri tensors are on the same device as self - magma_int_t info_lu_cpu = 0; - magma_int_t info_getri_cpu = 0; - - Tensor ipiv = at::empty({lda}, at::kInt); - Tensor dwork = at::empty({lwork}, self.options()); - magmaLu(n, n, self_data, lda, ipiv.data_ptr(), &info_lu_cpu); - magmaGetri( - n, self_data, lda, ipiv.data_ptr(), dwork.data_ptr(), lwork, &info_getri_cpu); - info_lu.fill_(info_lu_cpu); - info_getri.fill_(info_getri_cpu); -#endif -} - - -// This is a type dispatching helper function for 'apply_batched_inverse' and 'singleCheckErrors' -Tensor& _linalg_inv_out_helper_cuda_legacy(Tensor& result, Tensor& infos_lu, Tensor& infos_getri) { - // assuming result is in column major order and contains the matrices to invert - if (result.dim() > 2) { - auto input_working_copy = cloneBatchedColumnMajor(result); - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cuda", [&]{ - apply_batched_inverse( - input_working_copy, result, infos_lu, infos_getri); - }); - } else { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cuda", [&]{ - apply_single_inverse(result, infos_lu, infos_getri); - }); - } - return result; -} - -// This is a MAGMA/cuSOLVER dispatching helper function -Tensor& _linalg_inv_out_helper_cuda(Tensor &result, Tensor& infos_lu, Tensor& infos_getri) { - // This function calculates the inverse matrix in-place - // result should be in column major order and contain matrices to invert -#ifdef USE_CUSOLVER - auto preferred_backend = at::globalContext().linalgPreferredBackend(); - switch (preferred_backend) { - case at::LinalgBackend::Cusolver: - return _linalg_inv_out_helper_cuda_lib(result, infos_lu, infos_getri); // cusolver or cublas - case at::LinalgBackend::Magma: - return _linalg_inv_out_helper_cuda_legacy(result, infos_lu, infos_getri); // magma-cuda - default: - if (batchCount(result) <= 2 || !use_magma_) { - return _linalg_inv_out_helper_cuda_lib(result, infos_lu, infos_getri); // cusolver or cublas - } else { - return _linalg_inv_out_helper_cuda_legacy(result, infos_lu, infos_getri); // magma-cuda - } - } -#else - return _linalg_inv_out_helper_cuda_legacy(result, infos_lu, infos_getri); // magma-cuda -#endif - return result; -} - // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cholesky_solve ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ template @@ -1874,6 +1560,9 @@ static void apply_lu_factor_batched_magma(const Tensor& input, const Tensor& piv input_array[i] = &input_data[i * input_matrix_stride]; } + // needed to run lu tests in parallel, see https://github.com/pytorch/pytorch/issues/82894 for examples + // of failures + c10::cuda::device_synchronize(); MAGMAQueue magma_queue(input.get_device()); if (compute_pivots) { @@ -1928,7 +1617,12 @@ static void lu_factor(const Tensor& input, const Tensor& pivots, const Tensor& i const auto preferred_backend = at::globalContext().linalgPreferredBackend(); #ifdef USE_CUSOLVER const auto lu_factor_cusolver = [batch_size, m, n](const Tensor& input, const Tensor& pivots, const Tensor& infos, bool compute_pivots) { - if (batch_size == 1 || m != n || m >= 512) { + // In CUDA 10.2, lu_factor_looped_cusolver does not finish the computations when the input + // matrix is exactly singular. 
The returned pivots contain garbage. This breaks linalg.det + // Now, batched_cublas does not handle rectangular matrices, so we still dispatch to + // looped_cusolver even if m != n. + constexpr bool looped_correct = CUSOLVER_VERSION >= 11100; + if (m != n || (looped_correct && (batch_size == 1 || m >= 512))) { lu_factor_looped_cusolver(input, pivots, infos, compute_pivots); } else { lu_factor_batched_cublas(input, pivots, infos, compute_pivots); @@ -2344,96 +2038,6 @@ void linalg_eigh_kernel(const Tensor& eigenvalues, const Tensor& eigenvectors, c REGISTER_CUDA_DISPATCH(linalg_eigh_stub, &linalg_eigh_kernel); -// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ eig ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -// magmaEig uses a hybrid CPU-GPU algorithm, which takes and return CPU -// memory. So, we accept a GPU tensor, copy it to CPU memory, and later copy -// the returned values from CPU to GPU. See also magmaSymeig, which uses a -// similar approach. - -template -static void apply_eig(const Tensor& self, bool eigenvectors, Tensor& out_eigvals, Tensor& out_eigvecs, - int* info_ptr) { -#if !AT_MAGMA_ENABLED() -TORCH_CHECK(false, "Calling torch.eig on a CUDA tensor requires compiling PyTorch with MAGMA. " - "Either transfer the tensor to the CPU before calling torch.eig or recompile with MAGMA."); -#else - TORCH_INTERNAL_ASSERT(self.device() == at::kCPU, "Internal error: apply_eig needs a CPU tensor"); - using value_t = typename c10::scalar_value_type::type; - magma_vec_t jobvr = eigenvectors ? MagmaVec : MagmaNoVec; - magma_int_t n = magma_int_cast(self.size(-1), "n"); - auto self_data = self.data_ptr(); - - auto out_eigvals_data = out_eigvals.data_ptr(); - scalar_t *wr = out_eigvals_data; - - scalar_t *vr_data = NULL; - magma_int_t ldvr = 1; - if (jobvr == MagmaVec) - { - vr_data = out_eigvecs.data_ptr(); - ldvr = n; - } - - value_t *rwork_data = nullptr; - if (isComplexType(at::typeMetaToScalarType(self.dtype()))) { - ALLOCATE_ARRAY(rwork_data, value_t, n*2); - } - - if (n > 0) { - // call magmaEig once to get the optimal size of work_data - scalar_t wkopt; - magma_int_t info; - magmaEig(MagmaNoVec, jobvr, n, self_data, n, wr, NULL, 1, vr_data, ldvr, &wkopt, -1, rwork_data, &info); - magma_int_t lwork = static_cast(real_impl(wkopt)); - - // call it a 2nd time to to the actual work - scalar_t *work_data = nullptr; - ALLOCATE_ARRAY(work_data, scalar_t, lwork); - magmaEig(MagmaNoVec, jobvr, n, self_data, n, wr, NULL, 1, vr_data, ldvr, work_data, lwork, rwork_data, &info); - *info_ptr = info; - } -#endif -} - -/* - * Internal helper; like eig_cuda but: - * 1. assume that self is a square matrix of side "n" - * 2. return CPU tensors (because this is what magmaEig returns), which will be copied to GPU memory - * by the caller - */ -std::tuple eig_kernel_impl(const Tensor& self, bool& eigenvectors) { - int64_t n = self.size(-1); - // copy self to pinned CPU memory - auto self_working_copy = at::empty_strided( - {n, n}, // square matrix - {1, n}, // column-ordered, as magmaEig expects - at::TensorOptions(at::kCPU).dtype(self.dtype()).pinned_memory(true)); - self_working_copy.copy_(self); - - // tensors holding the results. We use empty_strided to make them column-ordered - auto options = self.options().device(at::kCPU).memory_format(LEGACY_CONTIGUOUS_MEMORY_FORMAT); - Tensor out_eigvals; - if (isComplexType(at::typeMetaToScalarType(self.dtype()))) { - out_eigvals = at::empty({n}, options); - } else { - out_eigvals = at::empty_strided({n, 2}, {1, n}, options); - } - auto out_eigvecs = eigenvectors - ? 
at::empty_strided({n, n}, {1, n}, options) - : Tensor(); - - auto infos = at::zeros({}, self_working_copy.options().dtype(kInt)); - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(self.scalar_type(), "eig_cuda", [&]{ - apply_eig(self_working_copy, eigenvectors, out_eigvals, out_eigvecs, infos.data_ptr()); - }); - at::_linalg_check_errors(infos, "eig", /*is_matrix*/true); - - return std::tuple(out_eigvals, out_eigvecs); -} - -REGISTER_CUDA_DISPATCH(eig_stub, &eig_kernel_impl); - // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ linalg_eig ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /* @@ -2869,7 +2473,7 @@ static void lu_solve_kernel(const Tensor& LU, const Tensor& pivots, const Tensor .add_output(perm) .add_input(*pivots_) .build(); - unpack_pivots_stub(pivots_->device().type(), iter, n); + unpack_pivots_stub(pivots_->device().type(), iter, n, n); if (trans == TransposeType::NoTranspose) { // Get the inverse permutation @@ -3189,73 +2793,12 @@ void lstsq_kernel(const Tensor& a, Tensor& b, Tensor& /*rank*/, Tensor& /*singul REGISTER_CUDA_DISPATCH(lstsq_stub, &lstsq_kernel); -// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ legacy_lstsq ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -std::tuple legacy_lstsq_cuda(const Tensor &B, const Tensor &A) { - TORCH_WARN_ONCE( - "torch.lstsq is deprecated in favor of torch.linalg.lstsq and will be removed in a future PyTorch release.\n", - "torch.linalg.lstsq has reversed arguments and does not return the QR decomposition in " - "the returned tuple (although it returns other information about the problem).\n", - "To get the qr decomposition consider using torch.linalg.qr.\n", - "The returned solution in torch.lstsq stored the residuals of the solution in the ", - "last m - n columns of the returned value whenever m > n. In torch.linalg.lstsq, the ", - "residuals in the field 'residuals' of the returned named tuple.\n", - "The unpacking of the solution, as in\n", - "X, _ = torch.lstsq(B, A).solution[:A.size(1)]\n", - "should be replaced with\n", - "X = torch.linalg.lstsq(A, B).solution" - ); - -#if !AT_MAGMA_ENABLED() - TORCH_CHECK(false, "solve: MAGMA library not found in " - "compilation. Please rebuild with MAGMA."); -#else - const auto dtype = A.scalar_type(); - TORCH_CHECK(B.scalar_type() == dtype, "exepected A and B dtypes to match but found ", - dtype, " and ", B.scalar_type()); - TORCH_CHECK(A.numel() > 0 && A.dim() == 2, "A should be (non-empty) 2 dimensional"); - TORCH_CHECK(B.numel() > 0 && B.dim() == 2, "B should be (non-empty) 2 dimensional"); - auto a_sizes = A.sizes(); - auto b_sizes = B.sizes(); - TORCH_CHECK(a_sizes[0] == b_sizes[0], "Expected A and b to have same size " - "at dim 0, but A has ", a_sizes[0], " rows and B has ", b_sizes[0], " rows"); - TORCH_CHECK(a_sizes[0] >= a_sizes[1], "Expected A with shape (m x n) to have " - "m >= n. 
The case for m < n is not implemented yet."); - - Tensor A_working = cloneBatchedColumnMajor(A); - Tensor B_working = cloneBatchedColumnMajor(B); - - int64_t m = a_sizes[0]; - int64_t n = a_sizes[1]; - int64_t nrhs = b_sizes[1]; - - int info; - AT_DISPATCH_FLOATING_TYPES(A.scalar_type(), "legacy_lstsq_cuda", [&] { - scalar_t *a_data = A_working.data_ptr(); - scalar_t *b_data = B_working.data_ptr(); - scalar_t wkopt; - magmaGels(MagmaNoTrans, m, n, nrhs, a_data, m, b_data, m, &wkopt, -1, &info); - - const auto hwork_size = static_cast(wkopt); - scalar_t *hwork = nullptr; - ALLOCATE_ARRAY(hwork, scalar_t, hwork_size); - - magmaGels(MagmaNoTrans, m, n, nrhs, a_data, m, b_data, m, hwork, hwork_size, &info); - }); - - TORCH_CHECK(info == 0, "MAGMA gels : Argument %d : illegal value", -info); - return std::tuple(B_working, A_working); -#endif // AT_MAGMA_ENABLED() -} - #if defined(BUILD_LAZY_CUDA_LINALG) struct DispatchInitializer { DispatchInitializer() { cuda::detail::LinalgDispatch disp{ _symeig_helper_cuda, - _cholesky_solve_helper_cuda, - legacy_lstsq_cuda, - _linalg_inv_out_helper_cuda}; + _cholesky_solve_helper_cuda}; cuda::detail::registerLinalgDispatch(disp); }; } initializer; diff --git a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp index d3109d866a59..89c1246a32d1 100644 --- a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp +++ b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp @@ -237,6 +237,9 @@ void apply_ldl_solve_cusolver( auto pivots_ = pivots.to(kLong); auto pivots_data = pivots_.data_ptr(); + // needed to run ldl_solve tests in parallel + // see https://github.com/pytorch/pytorch/issues/82894 for examples of failures + c10::cuda::device_synchronize(); auto handle = at::cuda::getCurrentCUDASolverDnHandle(); auto datatype = at::cuda::solver::get_cusolver_datatype(); size_t worksize_device = 0; @@ -471,101 +474,6 @@ inline static Tensor column_major_identity_matrix_like(const Tensor& self) { return at::ones(size_slice, self.options()).diag_embed().mT(); } -template -inline static void _apply_single_inverse_helper(scalar_t* self_ptr, scalar_t* self_inv_ptr, int* ipiv_ptr, int* info_getrf_ptr, int* info_getrs_ptr, int n, int lda) { - // self_inv_ptr should already be an identity matrix - - auto handle = at::cuda::getCurrentCUDASolverDnHandle(); - at::cuda::solver::getrf(handle, n, n, self_ptr, lda, ipiv_ptr, info_getrf_ptr); - at::cuda::solver::getrs(handle, n, n, self_ptr, lda, ipiv_ptr, self_inv_ptr, lda, info_getrs_ptr, CUBLAS_OP_N); -} - -template -static void apply_batched_inverse_lib(Tensor& self, Tensor& self_inv, Tensor& infos_getrf, Tensor& infos_getrs) { - const int batch_size = cuda_int_cast(batchCount(self), "batchCount"); - const int n = cuda_int_cast(self.size(-2), "self.size(-2)"); - const int lda = std::max(1, n); - - auto self_data = self.data_ptr(); - auto self_mat_stride = matrixStride(self); - auto self_inv_data = self_inv.data_ptr(); - auto self_inv_mat_stride = matrixStride(self_inv); - - auto infos_getrf_data = infos_getrf.data_ptr(); - auto infos_getrs_data = infos_getrs.data_ptr(); - - auto& allocator = *::c10::cuda::CUDACachingAllocator::get(); - - // Heuristic: For small batch size or large matrix size, we use for-loop to iterate over the batches instead of - // calling the batched cublas routine. 
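Both the inverse path being removed here and the lu_factor change earlier in this file choose between a per-matrix loop and a batched kernel. A minimal stand-alone sketch of the lu_factor predicate added above, assuming only the thresholds visible in the diff (the Backend enum and choose_lu_backend name are illustrative, not ATen symbols):

#include <cstdint>

// Sketch of the heuristic: rectangular matrices must use the looped cuSOLVER path
// (batched cuBLAS only handles square inputs), and when CUSOLVER_VERSION >= 11100
// the looped path is also preferred for a single matrix or for large matrices.
enum class Backend { LoopedCusolver, BatchedCublas };

constexpr long kCusolverVersion = 11100;  // illustrative stand-in for CUSOLVER_VERSION

Backend choose_lu_backend(std::int64_t m, std::int64_t n, std::int64_t batch_size) {
  constexpr bool looped_correct = kCusolverVersion >= 11100;
  if (m != n || (looped_correct && (batch_size == 1 || m >= 512))) {
    return Backend::LoopedCusolver;
  }
  return Backend::BatchedCublas;
}
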
- if (batch_size <= 8 || /* batch_size > 8 && */ n >= 512) { - for (int64_t i = 0; i < batch_size; i++) { - auto dataPtr = allocator.allocate(sizeof(int) * lda); - int* pivot = reinterpret_cast(dataPtr.get()); - - int* infos_getrf_working_ptr = &infos_getrf_data[i]; - int* infos_getrs_working_ptr = &infos_getrs_data[i]; - - _apply_single_inverse_helper( - &self_data[i * self_mat_stride], &self_inv_data[i * self_inv_mat_stride], pivot, infos_getrf_working_ptr, infos_getrs_working_ptr, n, lda); - } - } else { - // cublas batched kernels require input be "device array of device pointers" - Tensor self_array = at::arange( - reinterpret_cast(self_data), - reinterpret_cast(&self_data[(batch_size-1) * self_mat_stride]) + 1, - static_cast(self_mat_stride * sizeof(scalar_t)), self.options().dtype(at::kLong)); - Tensor self_inv_array = at::arange( - reinterpret_cast(self_inv_data), - reinterpret_cast(&self_inv_data[(batch_size-1) * self_inv_mat_stride]) + 1, - static_cast(self_inv_mat_stride * sizeof(scalar_t)), self.options().dtype(at::kLong)); - - auto dataPtr = allocator.allocate(sizeof(int)*batch_size*lda); - int* ipiv_array = reinterpret_cast(dataPtr.get()); - - at::cuda::blas::getrfBatched(n, reinterpret_cast(self_array.data_ptr()), lda, - ipiv_array, infos_getrf_data, batch_size); - - at::cuda::blas::getriBatched(n, reinterpret_cast(self_array.data_ptr()), lda, - ipiv_array, reinterpret_cast(self_inv_array.data_ptr()), lda, infos_getrs_data, batch_size); - } -} - -template -static void apply_single_inverse_lib(const Tensor& self, Tensor& self_inv, Tensor& infos_getrf, Tensor& infos_getrs) { - int n = cuda_int_cast(self.size(-2), "self.size(-2)"); - int lda = std::max(1, n); - - Tensor ipiv = at::empty({lda}, self.options().dtype(at::kInt)); - - _apply_single_inverse_helper( - self.data_ptr(), self_inv.data_ptr(), ipiv.data_ptr(), infos_getrf.data_ptr(), infos_getrs.data_ptr(), n, lda); -} - -// This is a type dispatching helper function for 'apply_batched_inverse_lib' and 'apply_single_inverse_lib' -Tensor& _linalg_inv_out_helper_cuda_lib(Tensor& result, Tensor& infos_getrf, Tensor& infos_getrs) { - // assuming result is in column major order and contains the matrices to invert - Tensor input_working_copy = cloneBatchedColumnMajor(result); - - // for getrf + getrs (cusolver path) - // result should be filled with identity matrices - result.zero_(); - result.diagonal(/*offset=*/0, /*dim1=*/-2, /*dim2=*/-1).fill_(1); - - if (result.dim() > 2) { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cuda", [&]{ - apply_batched_inverse_lib( - input_working_copy, result, infos_getrf, infos_getrs); - }); - } else { - AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(result.scalar_type(), "linalg_inv_out_cuda", [&]{ - apply_single_inverse_lib(input_working_copy, result, infos_getrf, infos_getrs); - }); - } - - return result; -} - // call cusolver gesvd function to calculate svd template inline static void apply_svd_cusolver_gesvd(const Tensor& A, const Tensor& U, const Tensor& S, const Tensor& V, @@ -748,23 +656,21 @@ inline static void apply_svd_cusolver_gesvdjBatched(const Tensor& A, const Tenso using value_t = typename c10::scalar_value_type::type; int m = cuda_int_cast(A.size(-2), "m"); int n = cuda_int_cast(A.size(-1), "n"); - int k = std::min(m, n); int batchsize = cuda_int_cast(batchCount(A), "batch size"); + int lda = A.stride(-1); + int ldu = compute_uv ? U.stride(-1) : m; + int ldv = compute_uv ? 
V.stride(-1) : n; // Need to pass allocated memory to the function, otherwise it fails auto& allocator = *::c10::cuda::CUDACachingAllocator::get(); - auto dataPtr_U = !compute_uv ? allocator.allocate(sizeof(scalar_t) * batchsize * m * k) : c10::DataPtr{}; - auto dataPtr_V = !compute_uv ? allocator.allocate(sizeof(scalar_t) * batchsize * n * k) : c10::DataPtr{}; + auto dataPtr_U = !compute_uv ? allocator.allocate(sizeof(scalar_t) * batchsize * m * ldu) : c10::DataPtr{}; + auto dataPtr_V = !compute_uv ? allocator.allocate(sizeof(scalar_t) * batchsize * n * ldv) : c10::DataPtr{}; auto A_data = A.data_ptr(); auto U_data = compute_uv ? U.data_ptr() : reinterpret_cast(dataPtr_U.get()); auto S_data = S.data_ptr(); auto V_data = compute_uv ? V.data_ptr() : reinterpret_cast(dataPtr_V.get()); - int lda = A.stride(-1); - int ldu = compute_uv ? U.stride(-1) : m; - int ldv = compute_uv ? V.stride(-1) : n; - TORCH_INTERNAL_ASSERT(m <= 32 && n <= 32, "gesvdjBatched requires both matrix dimensions not greater than 32, but got " "m = ", m, " n = ", n); @@ -787,10 +693,42 @@ inline static void apply_svd_cusolver_gesvdjBatched(const Tensor& A, const Tenso TORCH_CUSOLVER_CHECK(cusolverDnDestroyGesvdjInfo(gesvdj_params)); } -inline static void svd_cusolver_gesvdjBatched(const Tensor& A, const Tensor& U, const Tensor& S, const Tensor& V, const Tensor& infos, bool compute_uv) { +inline static void svd_cusolver_gesvdjBatched(const Tensor& A, const Tensor& U, const Tensor& S, const Tensor& V, const Tensor& infos, bool full_matrices, bool compute_uv) { + auto m = A.size(-2); + auto n = A.size(-1); + auto k = std::min(m, n); + // The kernel assumes full_matrices == true + // If full_matrices == false and m != n, we create auxiliary tensors of the right size and copy the results back + auto U_ = U; + auto V_ = V; + if (compute_uv && !full_matrices) { + auto sizes = A.sizes().vec(); + if (m > n) { + // Size of U with full_matrices == True + sizes.end()[-1] = m; + // U, V should be a batch of Fortran contiguous arrays + U_ = U.new_empty(sizes).mT(); + } else if (m < n) { + // Size of V with full_matrices == True + sizes.end()[-2] = n; + V_ = V.new_empty(sizes).mT(); + } + } + // Here U_ and V_ are batches of F-contig square matrices + AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(A.scalar_type(), "svd_cuda_gesvdjBatched", [&] { - apply_svd_cusolver_gesvdjBatched(A, U, S, V, infos, compute_uv); + apply_svd_cusolver_gesvdjBatched(A, U_, S, V_, infos, compute_uv); }); + + // Copy the result back if we created any new matrix + if (compute_uv && !full_matrices) { + if (!U_.is_alias_of(U)) { + U.copy_(U_.narrow(-1, 0, k)); + } + if (!V_.is_alias_of(V)) { + V.copy_(V_.narrow(-1, 0, k)); + } + } } template @@ -924,21 +862,23 @@ void svd_cusolver(const Tensor& A, const Tensor& V, const Tensor& info) { // Here U and V are F-contig whenever they are defined (i.e. 
whenever compute_uv=true) - const auto batch_size = batchCount(A); const auto m = A.size(-2); const auto n = A.size(-1); const auto k = std::min(m, n); static const char* check_svd_doc = "Check doc at https://pytorch.org/docs/stable/generated/torch.linalg.svd.html"; - // The default heuristic is to use gesvdj driver + // The default heuristic is to use the gesvdj driver const auto driver_v = driver.value_or("gesvdj"); if (driver_v == "gesvd") { svd_cusolver_gesvd(A, U, S, V, info, full_matrices, compute_uv); } else if (driver_v == "gesvdj") { - if (m <= 32 && n <= 32 && batch_size > 1 && (full_matrices || m == n)) { - svd_cusolver_gesvdjBatched(cloneBatchedColumnMajor(A), U, S, V, info, compute_uv); + // See the benchmarks in + // https://github.com/pytorch/pytorch/pull/88502#issuecomment-1303860789 + // The m <= 32 && n <= 32 restrictions come from the limitations of the cusolver backend. See the cusolver docs + if (m <= 32 && n <= 32) { + svd_cusolver_gesvdjBatched(cloneBatchedColumnMajor(A), U, S, V, info, full_matrices, compute_uv); } else { // gesvdj driver may be numerically unstable for large sized matrix svd_cusolver_gesvdj(cloneBatchedColumnMajor(A), U, S, V, info, full_matrices, compute_uv); diff --git a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.h b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.h index adee8cc9eb4e..532919e83ebd 100644 --- a/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.h +++ b/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.h @@ -59,10 +59,6 @@ void lu_solve_batched_cublas(const Tensor& LU, const Tensor& pivots, const Tenso #ifdef USE_CUSOLVER -// entrance of calculations of `inverse` using cusolver getrf + getrs, cublas getrfBatched + getriBatched -Tensor _inverse_helper_cuda_lib(const Tensor& self); -Tensor& _linalg_inv_out_helper_cuda_lib(Tensor& result, Tensor& infos_getrf, Tensor& infos_getrs); - // entrance of calculations of `svd` using cusolver gesvdj and gesvdjBatched void svd_cusolver(const Tensor& A, const bool full_matrices, const bool compute_uv, const c10::optional& driver, const Tensor& U, const Tensor& S, const Tensor& V, const Tensor& info); @@ -90,8 +86,6 @@ namespace cuda { namespace detail { struct LinalgDispatch { std::tuple (*symeig_helper)(const Tensor& self, bool eigenvectors, bool upper); Tensor (*cholesky_solve_helper)(const Tensor& self, const Tensor& A, bool upper); - std::tuple (*legacy_lstsq)(const Tensor &B, const Tensor &A); - Tensor& (*inv_out_helper)(Tensor &result, Tensor& infos_lu, Tensor& infos_getri); }; C10_EXPORT void registerLinalgDispatch(const LinalgDispatch&); }} // namespace cuda::detail diff --git a/aten/src/ATen/native/cuda/reduction_template.cuh b/aten/src/ATen/native/cuda/reduction_template.cuh index 4d9d559d8ec8..a38edb538256 100644 --- a/aten/src/ATen/native/cuda/reduction_template.cuh +++ b/aten/src/ATen/native/cuda/reduction_template.cuh @@ -4,11 +4,22 @@ namespace cuda { const std::string reduction_template_0 = R"ESCAPE( #define C10_HOST_DEVICE __host__ __device__ #define C10_DEVICE __device__ + #if defined(__clang__) && defined(__HIP__) + #ifndef __forceinline__ + #define __forceinline__ inline __attribute__((always_inline)) + #endif + // until ROCm support for kernel asserts is restored + #define assert(expr) (static_cast(0)) + #endif template __device__ __forceinline__ T WARP_SHFL_DOWN(T value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff) { + #if defined(__clang__) && defined(__HIP__) + return __shfl_down(value, delta, width); + #else 
return __shfl_down_sync(mask, value, delta, width); + #endif } @@ -17,8 +28,13 @@ const std::string reduction_template_0 = R"ESCAPE( __device__ __forceinline__ std::complex WARP_SHFL_DOWN(std::complex value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff) { return std::complex( + #if defined(__clang__) && defined(__HIP__) + __shfl_down(value.real(), delta, width), + __shfl_down(value.imag(), delta, width)); + #else __shfl_down_sync(mask, value.real(), delta, width), __shfl_down_sync(mask, value.imag(), delta, width)); + #endif } #endif diff --git a/aten/src/ATen/native/cuda/vol2col.cuh b/aten/src/ATen/native/cuda/vol2col.cuh index 7ab719bc819e..51dbe1c74405 100644 --- a/aten/src/ATen/native/cuda/vol2col.cuh +++ b/aten/src/ATen/native/cuda/vol2col.cuh @@ -15,7 +15,7 @@ using namespace at::cuda::detail; // Kernel for fast unfold+copy on volumes template __global__ void vol2col_kernel( - const int n, + const int64_t n, const T* data_vol, const int depth, const int height, @@ -37,16 +37,16 @@ __global__ void vol2col_kernel( const int width_col, T* data_col) { CUDA_KERNEL_LOOP(index, n) { - int w_out = index % width_col; + auto w_out = index % width_col; index /= width_col; - int h_out = index % height_col; + auto h_out = index % height_col; index /= height_col; - int t_out = index % depth_col; - int channel_in = index / depth_col; - int channel_out = channel_in * ksize_t * ksize_h * ksize_w; - int t_in = t_out * stride_t - pad_t; - int h_in = h_out * stride_h - pad_h; - int w_in = w_out * stride_w - pad_w; + auto t_out = index % depth_col; + auto channel_in = index / depth_col; + auto channel_out = channel_in * ksize_t * ksize_h * ksize_w; + auto t_in = t_out * stride_t - pad_t; + auto h_in = h_out * stride_h - pad_h; + auto w_in = w_out * stride_w - pad_w; data_col += ((channel_out * depth_col + t_out) * height_col + h_out) * width_col + w_out; @@ -54,9 +54,9 @@ __global__ void vol2col_kernel( for (int i = 0; i < ksize_t; ++i) { for (int j = 0; j < ksize_h; ++j) { for (int k = 0; k < ksize_w; ++k) { - int t = t_in + i * dilation_t; - int h = h_in + j * dilation_h; - int w = w_in + k * dilation_w; + auto t = t_in + i * dilation_t; + auto h = h_in + j * dilation_h; + auto w = w_in + k * dilation_w; *data_col = (t >= 0 && h >= 0 && w >= 0 && t < depth && h < height && w < width) ? 
data_vol @@ -126,7 +126,7 @@ void vol2col( template __global__ void vol2im_kernel( - const unsigned n, + const int64_t n, const T* data_col, const unsigned depth, const unsigned height, @@ -150,30 +150,30 @@ __global__ void vol2im_kernel( T* data_vol) { CUDA_KERNEL_LOOP(index, n) { accT val = static_cast(0); - const unsigned w_im = index % width + pad_w; - const unsigned h_im = (index / width) % height + pad_h; - const unsigned t_im = (index / width / height) % depth + pad_t; - const unsigned c_im = index / (width * height * depth); - unsigned kernel_extent_w = (kernel_w - 1) * dilation_w + 1; - unsigned kernel_extent_h = (kernel_h - 1) * dilation_h + 1; - unsigned kernel_extent_t = (kernel_t - 1) * dilation_t + 1; + const auto w_im = index % width + pad_w; + const auto h_im = (index / width) % height + pad_h; + const auto t_im = (index / width / height) % depth + pad_t; + const auto c_im = index / (width * height * depth); + auto kernel_extent_w = (kernel_w - 1) * dilation_w + 1; + auto kernel_extent_h = (kernel_h - 1) * dilation_h + 1; + auto kernel_extent_t = (kernel_t - 1) * dilation_t + 1; // compute the start and end of the output - const unsigned w_col_start = + const auto w_col_start = (w_im < kernel_extent_w) ? 0 : (w_im - kernel_extent_w) / stride_w + 1; - const unsigned w_col_end = std::min(w_im / stride_w + 1, width_col); - const unsigned h_col_start = + const auto w_col_end = std::min(w_im / stride_w + 1, width_col); + const auto h_col_start = (h_im < kernel_extent_h) ? 0 : (h_im - kernel_extent_h) / stride_h + 1; - const unsigned h_col_end = std::min(h_im / stride_h + 1, height_col); - const unsigned t_col_start = + const auto h_col_end = std::min(h_im / stride_h + 1, height_col); + const auto t_col_start = (t_im < kernel_extent_t) ? 0 : (t_im - kernel_extent_t) / stride_t + 1; - const unsigned t_col_end = std::min(t_im / stride_t + 1, depth_col); + const auto t_col_end = std::min(t_im / stride_t + 1, depth_col); // TODO: use LCM of stride and dilation to avoid unnecessary loops for (unsigned t_col = t_col_start; t_col < t_col_end; t_col += 1) { for (unsigned h_col = h_col_start; h_col < h_col_end; h_col += 1) { for (unsigned w_col = w_col_start; w_col < w_col_end; w_col += 1) { - unsigned t_k = (t_im - t_col * stride_t); - unsigned h_k = (h_im - h_col * stride_h); - unsigned w_k = (w_im - w_col * stride_w); + uint64_t t_k = (t_im - t_col * stride_t); + uint64_t h_k = (h_im - h_col * stride_h); + uint64_t w_k = (w_im - w_col * stride_w); if (t_k % dilation_t == 0 && h_k % dilation_h == 0 && w_k % dilation_w == 0) { t_k /= dilation_t; diff --git a/aten/src/ATen/native/cudnn/AffineGridGenerator.cpp b/aten/src/ATen/native/cudnn/AffineGridGenerator.cpp index 50fc37ba76da..bfc7184e9303 100644 --- a/aten/src/ATen/native/cudnn/AffineGridGenerator.cpp +++ b/aten/src/ATen/native/cudnn/AffineGridGenerator.cpp @@ -1,8 +1,17 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + #if !AT_CUDNN_ENABLED() namespace at { namespace native { diff --git a/aten/src/ATen/native/cudnn/BatchNorm.cpp b/aten/src/ATen/native/cudnn/BatchNorm.cpp index 1c70aa353b51..f1f275e63885 100644 --- a/aten/src/ATen/native/cudnn/BatchNorm.cpp +++ b/aten/src/ATen/native/cudnn/BatchNorm.cpp @@ -1,5 +1,5 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include @@ -32,6 +32,16 @@ std::tuple cudnn_batch_norm_backward( #include +#ifndef 
AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/cudnn/ConvPlaceholders.cpp b/aten/src/ATen/native/cudnn/ConvPlaceholders.cpp index 0474b1bf1448..feb679d57d78 100644 --- a/aten/src/ATen/native/cudnn/ConvPlaceholders.cpp +++ b/aten/src/ATen/native/cudnn/ConvPlaceholders.cpp @@ -1,6 +1,16 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include // for the definition of AT_CUDNN_ENABLED -#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/cudnn/ConvShared.cpp b/aten/src/ATen/native/cudnn/ConvShared.cpp index 9f921faf0320..2c4d77c6f617 100644 --- a/aten/src/ATen/native/cudnn/ConvShared.cpp +++ b/aten/src/ATen/native/cudnn/ConvShared.cpp @@ -1,12 +1,30 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include // for the definition of AT_CUDNN_ENABLED +#include #include +#include +#include +#include #include #if AT_CUDNN_ENABLED() #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#endif + // NOTE [cuDNN API version] // // ConvPlaceholders.cpp contains placeholder implementation of cudnn @@ -436,7 +454,7 @@ Tensor cudnn_convolution_relu( bool allow_tf32 = ctx.allowTF32CuDNN(); auto _bias = bias_t.has_value() ? bias_t.value() - : at::native::zeros( + : at::zeros( {output_t.size(1)}, optTypeMetaToScalarType(output_t.options().dtype_opt()), output_t.options().layout_opt(), @@ -514,7 +532,7 @@ Tensor cudnn_convolution_add_relu( auto _alpha = alpha.has_value() ? alpha.value().to() : 1.0; auto _bias = bias_t.has_value() ? bias_t.value() - : at::native::zeros( + : at::zeros( {output_t.size(1)}, optTypeMetaToScalarType(output_t.options().dtype_opt()), output_t.options().layout_opt(), diff --git a/aten/src/ATen/native/cudnn/ConvShared.h b/aten/src/ATen/native/cudnn/ConvShared.h index fbcf667f40fc..9a576de285ce 100644 --- a/aten/src/ATen/native/cudnn/ConvShared.h +++ b/aten/src/ATen/native/cudnn/ConvShared.h @@ -1,4 +1,5 @@ -#include +#pragma once +#include #include #include diff --git a/aten/src/ATen/native/cudnn/Conv_v7.cpp b/aten/src/ATen/native/cudnn/Conv_v7.cpp index 63968fd2072f..f5c7af79a740 100644 --- a/aten/src/ATen/native/cudnn/Conv_v7.cpp +++ b/aten/src/ATen/native/cudnn/Conv_v7.cpp @@ -1,15 +1,21 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include // for the definition of AT_CUDNN_ENABLED #if AT_CUDNN_ENABLED() #include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#include +#endif #include #include -#include -#include -#include -#include #include #include #include @@ -199,10 +205,11 @@ size_t getMaxWorkspaceSize( { size_t max_ws_size = 0; size_t max_block_size = 0; - size_t tmp_bytes = 0; // Only used for filling pointer parameters that aren't used later const auto device = c10::cuda::current_device(); - c10::cuda::CUDACachingAllocator::cacheInfo(device, &tmp_bytes, &max_block_size); + // For the native allocator, retrieves the size of the largest unused block. + // For cudaMallocAsync, see c10/cuda/CUDAMallocAsync.cpp:cacheInfo for details. 
+ c10::cuda::CUDACachingAllocator::cacheInfo(device, &max_block_size); for (const auto i : c10::irange(n_algo)) { cudnnStatus_t err; diff --git a/aten/src/ATen/native/cudnn/Conv_v8.cpp b/aten/src/ATen/native/cudnn/Conv_v8.cpp index 2ad8d4ffe37c..11fe5be8298e 100644 --- a/aten/src/ATen/native/cudnn/Conv_v8.cpp +++ b/aten/src/ATen/native/cudnn/Conv_v8.cpp @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include // for the definition of AT_CUDNN_ENABLED #if AT_CUDNN_ENABLED() @@ -10,7 +11,7 @@ #include #include #include -#include +#include #include #include #include @@ -26,6 +27,12 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + namespace at { namespace native { namespace { @@ -47,9 +54,12 @@ uint8_t getAlignment(const Tensor &t) { return alignment; } -cudnn_frontend::Tensor getTensorDescriptorWithTypeVirtual(const Tensor &t, const int64_t id, const uint8_t alignment, const cudnnDataType_t dataType, const bool _virtual) { +cudnn_frontend::Tensor getTensorDescriptorWithTypeVirtual(const Tensor &t, const int64_t id, const uint8_t alignment, const cudnnDataType_t dataType, const at::MemoryFormat memory_format, const bool _virtual) { auto sizes = t.sizes(); auto strides = t.strides(); + bool channels_last = memory_format == at::MemoryFormat::ChannelsLast || + memory_format == at::MemoryFormat::ChannelsLast3d; + fixSizeOneDimStride(sizes.size(), &sizes[0], (int64_t *) &strides[0], channels_last); auto r = cudnn_frontend::TensorBuilder() .setDim(sizes.size(), sizes.data()) .setStrides(strides.size(), strides.data()) @@ -61,8 +71,8 @@ cudnn_frontend::Tensor getTensorDescriptorWithTypeVirtual(const Tensor &t, const return r; } -cudnn_frontend::Tensor getTensorDescriptor(const Tensor &t, const int64_t id, const uint8_t alignment) { - return getTensorDescriptorWithTypeVirtual(t, id, alignment, getCudnnDataType(t), false); +cudnn_frontend::Tensor getTensorDescriptor(const Tensor &t, const int64_t id, const uint8_t alignment, const at::MemoryFormat memory_format) { + return getTensorDescriptorWithTypeVirtual(t, id, alignment, getCudnnDataType(t), memory_format, false); } cudnn_frontend::ConvDesc_v8 getConvDescriptor(cudnnDataType_t dataType, IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, const at::ScalarType scalar_type) { @@ -152,7 +162,8 @@ BenchmarkCache benchmark_cache_fus // would not be a POD anymore. 
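The context line above ("would not be a POD anymore") refers to the benchmark-cache key: setCacheKey below zeroes it with memset and the cache compares it bytewise, so every field, including the newly keyed memory format, has to keep the struct trivially copyable. A hedged, self-contained illustration of that constraint (ToyKey and its helpers are invented for this sketch, not ATen code):

#include <cstdint>
#include <cstring>

// A cache key that is compared as raw bytes must stay a POD: memset fixes the
// padding bytes and memcmp then gives a deterministic equality, which would break
// if the struct held std::string or other non-trivially-copyable members.
struct ToyKey {
  std::int64_t input_sizes[4];
  std::int32_t memory_format;  // stored as a plain integral value
};

ToyKey make_key() {
  ToyKey key;
  std::memset(&key, 0, sizeof(key));  // zero padding so bytewise comparison is well defined
  return key;
}

bool keys_equal(const ToyKey& a, const ToyKey& b) {
  return std::memcmp(&a, &b, sizeof(ToyKey)) == 0;
}
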
void setCacheKey(CacheKey& key, const cudnnBackendDescriptorType_t operation, const Tensor& y, const Tensor& x, const Tensor& w, const IntArrayRef padding, const IntArrayRef stride, const IntArrayRef dilation, int64_t groups, bool deterministic, bool allow_tf32) { memset(&key, 0, sizeof(key)); - setConvolutionParams(&key.params, x, w, padding, stride, dilation, groups, deterministic, allow_tf32, x.suggest_memory_format()); + at::MemoryFormat memory_format = cudnn_conv_suggest_memory_format(x, w); + setConvolutionParams(&key.params, x, w, padding, stride, dilation, groups, deterministic, allow_tf32, memory_format); key.operation = operation; key.x_alignment = getAlignment(x); key.y_alignment = getAlignment(y); @@ -161,7 +172,8 @@ void setCacheKey(CacheKey& key, const cudnnBackendDescriptorType_t operation, co void setCacheKeyFused(CacheKeyFused& key, const Tensor& y, const Tensor& x, const Tensor& w, const Tensor& z, const Tensor& b, const float alpha, const IntArrayRef padding, const IntArrayRef stride, const IntArrayRef dilation, int64_t groups, bool deterministic, bool allow_tf32) { memset(&key, 0, sizeof(key)); - setConvolutionParams(&key.params, x, w, padding, stride, dilation, groups, deterministic, allow_tf32, x.suggest_memory_format()); + at::MemoryFormat memory_format = cudnn_conv_suggest_memory_format(x, w); + setConvolutionParams(&key.params, x, w, padding, stride, dilation, groups, deterministic, allow_tf32, memory_format); key.x_alignment = getAlignment(x); key.y_alignment = getAlignment(y); key.w_alignment = getAlignment(w); @@ -200,9 +212,9 @@ void run_conv_plan_fused(cudnnHandle_t handle, const Tensor& x, const Tensor& y, auto build_opgraph(const cudnnHandle_t handle, const cudnnBackendDescriptorType_t desc, const Tensor& x, const Tensor& y, const Tensor& w, const CacheKey& key, const IntArrayRef padding, const IntArrayRef stride, const IntArrayRef dilation) { auto op = cudnn_frontend::OperationBuilder(desc) - .setxDesc(getTensorDescriptor(x, 'x', key.x_alignment)) - .setyDesc(getTensorDescriptor(y, 'y', key.y_alignment)) - .setwDesc(getTensorDescriptor(w, 'w', key.w_alignment)) + .setxDesc(getTensorDescriptor(x, 'x', key.x_alignment, key.params.memory_format)) + .setyDesc(getTensorDescriptor(y, 'y', key.y_alignment, key.params.memory_format)) + .setwDesc(getTensorDescriptor(w, 'w', key.w_alignment, key.params.memory_format)) .setcDesc(getConvDescriptor(key.params.dataType, padding, stride, dilation, x.scalar_type())) .build(); std::array ops = {&op}; @@ -232,33 +244,33 @@ auto build_opgraph_fused(const cudnnHandle_t handle, const Tensor & x, const Ten const float alpha1 = 1.0; const float alpha2 = alpha; auto conv_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR) - .setxDesc(getTensorDescriptor(x, 'x', key.x_alignment)) + .setxDesc(getTensorDescriptor(x, 'x', key.x_alignment, key.params.memory_format)) // virtual output of conv - .setyDesc(getTensorDescriptorWithTypeVirtual(y, 'C', key.y_alignment, precision, true)) - .setwDesc(getTensorDescriptor(w, 'w', key.w_alignment)) + .setyDesc(getTensorDescriptorWithTypeVirtual(y, 'C', key.y_alignment, precision, key.params.memory_format, true)) + .setwDesc(getTensorDescriptor(w, 'w', key.w_alignment, key.params.memory_format)) .setAlpha(alpha1) .setcDesc(convDesc) .build(); auto add_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) .setxDesc(conv_op.getOutputTensor()) - .setbDesc(getTensorDescriptor(z, 'z', key.z_alignment)) + 
.setbDesc(getTensorDescriptor(z, 'z', key.z_alignment, key.params.memory_format)) // another virtual output (of add) - .setyDesc(getTensorDescriptorWithTypeVirtual(y, 'A', key.y_alignment, precision, true)) + .setyDesc(getTensorDescriptorWithTypeVirtual(y, 'A', key.y_alignment, precision, key.params.memory_format, true)) .setpwDesc(addDesc) .setAlpha(alpha1) .setAlpha2(alpha2) .build(); auto add_bias_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) .setxDesc(add_op.getOutputTensor()) - .setbDesc(getTensorDescriptor(b, 'b', key.b_alignment)) + .setbDesc(getTensorDescriptor(b, 'b', key.b_alignment, key.params.memory_format)) // another virtual output (of add bias) - .setyDesc(getTensorDescriptorWithTypeVirtual(y, 'B', key.y_alignment, precision, true)) + .setyDesc(getTensorDescriptorWithTypeVirtual(y, 'B', key.y_alignment, precision, key.params.memory_format, true)) .setpwDesc(addBiasDesc) .build(); auto act_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR) .setxDesc(add_bias_op.getOutputTensor()) // final output is in original datatype - .setyDesc(getTensorDescriptor(y, 'y', key.y_alignment)) + .setyDesc(getTensorDescriptor(y, 'y', key.y_alignment, key.params.memory_format)) .setpwDesc(actDesc) .build(); std::array ops = {&conv_op, &add_op, &add_bias_op, &act_op}; @@ -300,8 +312,7 @@ size_t get_available_workspace() { int device; C10_CUDA_CHECK(cudaGetDevice(&device)); size_t max_block_size = 0; - size_t tmp_bytes = 0; // Only used for filling pointer parameters that aren't used later - c10::cuda::CUDACachingAllocator::cacheInfo(device, &tmp_bytes, &max_block_size); + c10::cuda::CUDACachingAllocator::cacheInfo(device, &max_block_size); return max_block_size; } @@ -654,7 +665,7 @@ void raw_cudnn_convolution_add_relu_out( bool allow_tf32) { if (output.numel() == 0) { return; } if (at::native::cudnnv8_enabled_check_debug()) { - auto bias_ = bias.view({1, bias.numel(), 1, 1}); + auto bias_ = input.ndimension() == 4 ? 
bias.view({1, bias.numel(), 1, 1}) : bias.view({1, bias.numel(), 1, 1, 1}); run_fused_conv(input, output, weight, z, bias_, alpha, stride, padding, dilation, groups, benchmark, deterministic, allow_tf32); diff --git a/aten/src/ATen/native/cudnn/GridSampler.cpp b/aten/src/ATen/native/cudnn/GridSampler.cpp index b22d25cbff97..8697b89c399a 100644 --- a/aten/src/ATen/native/cudnn/GridSampler.cpp +++ b/aten/src/ATen/native/cudnn/GridSampler.cpp @@ -1,9 +1,18 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + #if !AT_CUDNN_ENABLED() namespace at { namespace native { diff --git a/aten/src/ATen/native/cudnn/LossCTC.cpp b/aten/src/ATen/native/cudnn/LossCTC.cpp index 37c5277428b7..a741816424a7 100644 --- a/aten/src/ATen/native/cudnn/LossCTC.cpp +++ b/aten/src/ATen/native/cudnn/LossCTC.cpp @@ -1,11 +1,23 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #if AT_CUDNN_ENABLED() #include #endif +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + #if (!AT_CUDNN_ENABLED()) || (CUDNN_VERSION < 7600) namespace at { namespace native { @@ -21,10 +33,30 @@ bool _use_cudnn_ctc_loss( return false; } +bool _use_cudnn_ctc_loss_tensor( + const Tensor& log_probs, + const Tensor& targets, + const Tensor& input_lengths, + const Tensor& target_lengths, + int64_t BLANK) { + return false; +} + std::tuple _cudnn_ctc_loss(const Tensor& log_probs, const Tensor& targets, IntArrayRef input_lengths, IntArrayRef target_lengths, int64_t BLANK, bool deterministic, bool zero_infinity) { AT_ERROR("cudnn_ctc_loss: ATen not compiled with cuDNN >= 7 support"); } +std::tuple _cudnn_ctc_loss_tensor( + const Tensor& log_probs, + const Tensor& targets, + const Tensor& input_lengths, + const Tensor& target_lengths, + int64_t BLANK, + bool deterministic, + bool zero_infinity) { + AT_ERROR("cudnn_ctc_loss: ATen not compiled with cuDNN >= 7 support"); +} + }} #else // AT_CUDNN_ENABLED @@ -68,6 +100,20 @@ bool _use_cudnn_ctc_loss( return use_cudnn; } +bool _use_cudnn_ctc_loss_tensor( + const Tensor& log_probs, + const Tensor& targets, + const Tensor& input_lengths, + const Tensor& target_lengths, + int64_t BLANK) { + Tensor ilc = input_lengths.to(Device(at::kCPU), at::kLong).contiguous(); + Tensor tlc = target_lengths.to(Device(at::kCPU), at::kLong).contiguous(); + IntArrayRef il(ilc.data_ptr(), ilc.numel()); + IntArrayRef tl(tlc.data_ptr(), tlc.numel()); + return at::_use_cudnn_ctc_loss( + log_probs, targets, il, tl, BLANK); +} + std::tuple _cudnn_ctc_loss(const Tensor& log_probs_t, const Tensor& targets_t, IntArrayRef input_lengths_, IntArrayRef target_lengths_, int64_t BLANK, bool deterministic, bool zero_infinity) { (void)zero_infinity; // only used for backward const CheckedFrom c = "cudnn_ctc_loss"; @@ -138,6 +184,21 @@ std::tuple _cudnn_ctc_loss(const Tensor& log_probs_t, const Tens return std::make_tuple(costs, grad); } +std::tuple _cudnn_ctc_loss_tensor( + const Tensor& log_probs, + const Tensor& targets, + const Tensor& input_lengths, + const Tensor& target_lengths, + int64_t BLANK, + bool deterministic, + bool zero_infinity) { + Tensor ilc = input_lengths.to(Device(at::kCPU), at::kLong).contiguous(); + Tensor tlc = target_lengths.to(Device(at::kCPU), at::kLong).contiguous(); + IntArrayRef il(ilc.data_ptr(), ilc.numel()); + IntArrayRef tl(tlc.data_ptr(), 
tlc.numel()); + return at::_cudnn_ctc_loss( + log_probs, targets, il, tl, BLANK, deterministic, zero_infinity); +} }} // namespace at::native diff --git a/aten/src/ATen/native/cudnn/RNN.cpp b/aten/src/ATen/native/cudnn/RNN.cpp index 29430b38e74e..426243392b6f 100644 --- a/aten/src/ATen/native/cudnn/RNN.cpp +++ b/aten/src/ATen/native/cudnn/RNN.cpp @@ -1,18 +1,32 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include #include #include -#include #include #include -#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #if !AT_CUDNN_ENABLED() namespace at { namespace native { @@ -56,7 +70,7 @@ Tensor _cudnn_init_dropout_state(double dropout, bool train, int64_t dropout_see c10::optional device, c10::optional pin_memory) { // See [Note: hacky wrapper removal for TensorOptions] - TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); + TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory); AT_ERROR("_cudnn_init_dropout_state: ATen not compiled with cuDNN support"); } diff --git a/aten/src/ATen/native/group_norm.cpp b/aten/src/ATen/native/group_norm.cpp index db1d82f84fef..22ff9ea5f0e8 100644 --- a/aten/src/ATen/native/group_norm.cpp +++ b/aten/src/ATen/native/group_norm.cpp @@ -1,26 +1,37 @@ -#include -#include -#include -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include +#include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#endif + #include #include -#include #include #include namespace at { + namespace native { +template void check_group_norm_inputs( const Tensor& input, const Tensor& weight, const Tensor& bias, - int64_t C, + T C, int64_t num_groups) { TORCH_CHECK( num_groups > 0, @@ -34,14 +45,14 @@ void check_group_norm_inputs( "num_groups=", num_groups); TORCH_CHECK( - !weight.defined() || (weight.dim() == 1 && weight.numel() == C), + !weight.defined() || (weight.dim() == 1 && at::symint::numel(weight) == C), "Expected weight to be a vector of size equal to the number of ", "channels in input, but got weight of shape ", weight.sizes(), " and input of shape ", input.sizes()); TORCH_CHECK( - !bias.defined() || (bias.dim() == 1 && bias.numel() == C), + !bias.defined() || (bias.dim() == 1 && at::symint::numel(bias) == C), "Expected bias to be a vector of size equal to the number of ", "channels in input, but got bias of shape ", weight.sizes(), @@ -162,24 +173,24 @@ Tensor group_norm( const Tensor& weight = *weight_maybe_owned; const Tensor& bias = c10::value_or_else(bias_opt, [] { return Tensor(); }); - const int64_t N = input.size(0); - const int64_t C = input.size(1); + const auto N = input.sym_size(0); + const auto C = input.sym_size(1); check_group_norm_inputs(input, weight, bias, C, num_groups); - const auto input_shape = input.sizes(); - const int64_t HxW = - c10::multiply_integers(input_shape.cbegin() + 2, input_shape.cend()); + const auto input_shape = input.sym_sizes(); + const auto HxW = + c10::multiply_integers(input_shape.slice(2)); const Tensor kEmpty; auto memory_format = input.suggest_memory_format(); - const auto& X = input.device().is_cpu() ? + const auto& X = input.device().is_cpu() || input.device().is_xpu() ? 
input.contiguous(memory_format) : input.contiguous(); const auto& gamma = weight.defined() ? weight.contiguous() : kEmpty; const auto& beta = bias.defined() ? bias.contiguous() : kEmpty; - TORCH_CHECK(!gamma.defined() || gamma.numel() == C); - TORCH_CHECK(!beta.defined() || beta.numel() == C); + TORCH_CHECK(!gamma.defined() || gamma.sym_numel() == C); + TORCH_CHECK(!beta.defined() || beta.sym_numel() == C); return std::get<0>( - at::native_group_norm(X, gamma, beta, N, C, HxW, num_groups, eps)); + at::native_group_norm_symint(X, gamma, beta, N, C, HxW, num_groups, eps)); } DEFINE_DISPATCH(GroupNormKernel); @@ -224,8 +235,10 @@ std::tuple math_group_norm( } else if (bias.defined()) { out = out.add(bias.view(affine_param_shape)); } - at::Tensor mean = std::get<1>(outputs).view({N, group}); - at::Tensor rstd = std::get<2>(outputs).view({N, group}); + // convert mean/std to have the same dtype as input. + // This follows the same behavior as the CPU and CUDA kernels. + at::Tensor mean = std::get<1>(outputs).to(c10::TensorOptions().dtype(input.scalar_type())).view({N, group}); + at::Tensor rstd = std::get<2>(outputs).to(c10::TensorOptions().dtype(input.scalar_type())).view({N, group}); return std::make_tuple(out, mean, rstd); } } // namespace native diff --git a/aten/src/ATen/native/im2col.h b/aten/src/ATen/native/im2col.h index c3daed3d4ffc..ecbb7ab0b35d 100644 --- a/aten/src/ATen/native/im2col.h +++ b/aten/src/ATen/native/im2col.h @@ -1,6 +1,6 @@ #pragma once -#include +#include #include #include #include diff --git a/aten/src/ATen/native/im2col_shape_check.h b/aten/src/ATen/native/im2col_shape_check.h index 45fc96ea8443..d6c95465da26 100644 --- a/aten/src/ATen/native/im2col_shape_check.h +++ b/aten/src/ATen/native/im2col_shape_check.h @@ -1,6 +1,7 @@ #pragma once #include #include +#include namespace at { namespace native { @@ -36,6 +37,13 @@ static inline void col2im_shape_check( dilation_height, " dilation_width: ", dilation_width); + TORCH_CHECK( + pad_width >= 0 && pad_height >= 0, + "padding should be non-negative, but got pad_height: ", + pad_height, + " pad_width: ", + pad_width); + int64_t ndim = input.ndimension(); // allow dim=0 only the batch dimension. 
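The TORCH_CHECK added just above rejects negative padding in col2im up front instead of letting it surface later as a confusing output-size error. A hedged stand-alone sketch of the same validation, using plain exceptions instead of TORCH_CHECK (the function name is illustrative):

#include <cstdint>
#include <stdexcept>
#include <string>

// Fail fast on negative padding, mirroring the message format of the new check.
void check_padding_nonnegative(std::int64_t pad_height, std::int64_t pad_width) {
  if (pad_width < 0 || pad_height < 0) {
    throw std::invalid_argument(
        "padding should be non-negative, but got pad_height: " + std::to_string(pad_height) +
        " pad_width: " + std::to_string(pad_width));
  }
}
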
@@ -218,7 +226,7 @@ static inline void im2col_shape_check( output_height, ", ", output_width, - "), which is too small (non-positive)."); + "), but its components must be at least one."); } } diff --git a/aten/src/ATen/native/layer_norm.cpp b/aten/src/ATen/native/layer_norm.cpp index e4ea0ac8fe21..8269a4d3af9e 100644 --- a/aten/src/ATen/native/layer_norm.cpp +++ b/aten/src/ATen/native/layer_norm.cpp @@ -1,17 +1,26 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include -#include -#include -#include -#include +#include #include #include -#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif #include -#include -#include #include #include @@ -37,8 +46,7 @@ void layer_norm_with_mean_rstd_out( for (const auto idx : c10::irange(axis)) { stat_shape.emplace_back(input_shape[idx]); } - for (const auto idx : c10::irange(axis, input.dim())) { - (void)idx; // Suppress unused variable warning + for (const auto idx C10_UNUSED : c10::irange(axis, input.dim())) { stat_shape.emplace_back(1); } @@ -167,9 +175,9 @@ std::tuple layer_norm_backward_cpu( return std::make_tuple(std::move(dX), std::move(dgamma), std::move(dbeta)); } -Tensor layer_norm( +Tensor layer_norm_symint( const Tensor& input, - IntArrayRef normalized_shape, const c10::optional& weight_opt /* optional */, const c10::optional& bias_opt /* optional */, + c10::SymIntArrayRef normalized_shape, const c10::optional& weight_opt /* optional */, const c10::optional& bias_opt /* optional */, double eps, bool /* cudnn_enable, deprecated */) { // See [Note: hacky wrapper removal for optional tensor] @@ -178,8 +186,7 @@ Tensor layer_norm( c10::MaybeOwned bias_maybe_owned = at::borrow_from_optional_tensor(bias_opt); const Tensor& bias = *bias_maybe_owned; - - return std::get<0>(at::native_layer_norm(input, normalized_shape, weight, bias, eps)); + return std::get<0>(at::native_layer_norm_symint(input, normalized_shape, weight, bias, eps)); } DEFINE_DISPATCH(LayerNormKernel); @@ -216,7 +223,7 @@ std::tuple math_native_layer_norm( at::empty_like(input, c10::TensorOptions().dtype(result_type)) ); } - at::Tensor input_reshaped = input.view({1, M, -1}); + at::Tensor input_reshaped = input.reshape({1, M, -1}); // Unlike Batch Normalization, which applies scalar scale and bias for each // entire channel/plane with the affine option, Layer Normalization applies // per-element scale and bias. E.g. 
For input {N, C, H, W}, weight for @@ -239,8 +246,7 @@ std::tuple math_native_layer_norm( for (const auto idx : c10::irange(axis)) { stat_shape.push_back(input_shape[idx]); } - for (const auto idx : c10::irange(axis, input.dim())) { - (void)idx; // Suppress unused variable + for (const auto idx C10_UNUSED : c10::irange(axis, input.dim())) { stat_shape.push_back(1); } mean = mean.view(stat_shape); diff --git a/aten/src/ATen/native/metal/MetalAten.mm b/aten/src/ATen/native/metal/MetalAten.mm index c1c34217c374..f100f473f055 100644 --- a/aten/src/ATen/native/metal/MetalAten.mm +++ b/aten/src/ATen/native/metal/MetalAten.mm @@ -70,12 +70,13 @@ #pragma mark - ATen Ops Tensor empty( - IntArrayRef size, + c10::SymIntArrayRef sym_size, optional dtype, optional layout, optional device, optional pin_memory, c10::optional memory_format) { + auto size = c10::asIntArrayRefSlow(sym_size); TORCH_CHECK( !pin_memory.has_value(), "'pin_memory' argument is incompatible with Metal tensor"); diff --git a/aten/src/ATen/native/metal/MetalContext.mm b/aten/src/ATen/native/metal/MetalContext.mm index 51423f59785a..c9571757f246 100644 --- a/aten/src/ATen/native/metal/MetalContext.mm +++ b/aten/src/ATen/native/metal/MetalContext.mm @@ -89,7 +89,7 @@ - (BOOL)available { constants { TORCH_CHECK(_library, "Failed to load Metal shaders"); std::string kernelStr = kernel; - for (auto i = 0; i < constants.count; ++i) { + for (NSUInteger i = 0; i < constants.count; ++i) { kernelStr += "_" + std::string([constants[i] stringValue].UTF8String); } std::lock_guard g(_pipelineCacheMutex); @@ -100,7 +100,7 @@ - (BOOL)available { MTLFunctionConstantValues* constantValues = [MTLFunctionConstantValues new]; NSUInteger ushortArgIndex = 0; NSUInteger floatArgIndex = 12; - for (auto i = 0; i < constants.count; ++i) { + for (NSUInteger i = 0; i < constants.count; ++i) { NSNumber* constant = constants[i]; const char* type = constant.objCType; if (strcmp(type, @encode(NSUInteger)) == 0 || diff --git a/aten/src/ATen/native/metal/MetalConvParams.h b/aten/src/ATen/native/metal/MetalConvParams.h index f4cc1a2c5fa8..7b0bfc9670a1 100644 --- a/aten/src/ATen/native/metal/MetalConvParams.h +++ b/aten/src/ATen/native/metal/MetalConvParams.h @@ -22,7 +22,7 @@ struct Conv2DParams final { } bool isDepthwise() const { - // Currently, only channel multipler of 1 is supported + // Currently, only channel multiplier of 1 is supported // i.e. 
inputFeatureChannels == outputFeatureChannels return G > 1 && IC == 1 && OC == G && OC == C; } diff --git a/aten/src/ATen/native/metal/MetalTensorImpl.h b/aten/src/ATen/native/metal/MetalTensorImpl.h index 799f7ef3bd11..2fb87b2f4f89 100644 --- a/aten/src/ATen/native/metal/MetalTensorImpl.h +++ b/aten/src/ATen/native/metal/MetalTensorImpl.h @@ -31,6 +31,10 @@ struct TORCH_API MetalTensorImpl : public OpaqueTensorImpl { return strides_; } + c10::SymIntArrayRef sym_strides_custom() const override { + return c10::fromIntArrayRefKnownNonNegative(strides_); + } + bool is_contiguous_custom(c10::MemoryFormat memory_format) const override { return true; } diff --git a/aten/src/ATen/native/metal/mpscnn/MPSCNNConvOp.mm b/aten/src/ATen/native/metal/mpscnn/MPSCNNConvOp.mm index adf9e1b75c2d..bf4136aed5db 100644 --- a/aten/src/ATen/native/metal/mpscnn/MPSCNNConvOp.mm +++ b/aten/src/ATen/native/metal/mpscnn/MPSCNNConvOp.mm @@ -75,10 +75,10 @@ + (MPSCNNConvOp*)conv2d:(const Conv2DParams&)params using namespace at::native::metal::mpscnn; TORCH_CHECK( params.DX == params.DY == 1, "Dilated convolution is not supported yet."); - const int64_t oC = params.OC; - const int64_t iC = params.C; - const int64_t kH = params.KH; - const int64_t kW = params.KW; + const NSUInteger oC = params.OC; + const NSUInteger iC = params.C; + const NSUInteger kH = params.KH; + const NSUInteger kW = params.KW; MPSCNNNeuron* neuron = at::native::metal::neuron(t); MPSCNNConvolutionDescriptor* desc = nil; if (params.isDepthwise()) { @@ -149,7 +149,7 @@ + (MPSCNNConvOp*)conv2d:(const Conv2DParams&)params offset.z = 0; [conv setOffset:offset]; - TORCH_CHECK(conv.inputFeatureChannels == params.IC * params.G); + TORCH_CHECK(static_cast(conv.inputFeatureChannels) == params.IC * params.G); TORCH_CHECK(oC % conv.groups == 0); TORCH_CHECK(conv.outputFeatureChannels == oC); TORCH_CHECK(conv.kernelWidth == kW); diff --git a/aten/src/ATen/native/metal/mpscnn/MPSImageWrapper.mm b/aten/src/ATen/native/metal/mpscnn/MPSImageWrapper.mm index d5a9632d26c9..14c98f99cff0 100644 --- a/aten/src/ATen/native/metal/mpscnn/MPSImageWrapper.mm +++ b/aten/src/ATen/native/metal/mpscnn/MPSImageWrapper.mm @@ -23,6 +23,9 @@ + (instancetype)newWithMPSImageWrapper:(MPSImageWrapper*)wrapper { - (void)dealloc { _imageWrapper = nullptr; +#if !__has_feature(objc_arc) + [super dealloc]; +#endif } - (void)beginSynchronization { diff --git a/aten/src/ATen/native/metal/ops/MetalConcat.mm b/aten/src/ATen/native/metal/ops/MetalConcat.mm index c43bf055fa2e..8c28568d3101 100644 --- a/aten/src/ATen/native/metal/ops/MetalConcat.mm +++ b/aten/src/ATen/native/metal/ops/MetalConcat.mm @@ -16,13 +16,11 @@ namespace native { namespace metal { -Tensor cat_batch(const TensorList tensors, MetalTensorImplStorage& mt) { - at::Tensor tensor = tensors[0]; +Tensor cat_batch(const Tensor& tensor, const ITensorListRef& tensors, MetalTensorImplStorage& mt) { MetalCommandBuffer* commandBuffer = getCommandBuffer(tensor); MPSImage* Y = mt.texture()->image(); ushort cat_dim4_pointer = 0; - for (int i = 0; i < tensors.size(); ++i) { - const auto& t = tensors[i]; + for (const auto& t : tensors) { MPSImage* X = imageFromTensor(t); MetalCommandBuffer* Xcb = getCommandBuffer(t); TORCH_CHECK( @@ -55,8 +53,7 @@ Tensor cat_batch(const TensorList tensors, MetalTensorImplStorage& mt) { return output; } -Tensor cat_feature(const TensorList tensors, MetalTensorImplStorage& mt) { - at::Tensor tensor = tensors[0]; +Tensor cat_feature(const Tensor& tensor, const ITensorListRef& tensors, MetalTensorImplStorage& mt) { 
MetalCommandBuffer* commandBuffer = getCommandBuffer(tensor); MPSImage* Y = mt.texture()->image(); ushort channel_offset = 0; @@ -68,9 +65,9 @@ Tensor cat_feature(const TensorList tensors, MetalTensorImplStorage& mt) { tt.texture()->allocateTemporaryStorage(temp_size, commandBuffer); MPSImage* T = tt.texture()->image(); - for (int i = 0; i < tensors.size(); ++i) { - MPSImage* X = imageFromTensor(tensors[i]); - MetalCommandBuffer* Xcb = getCommandBuffer(tensors[i]); + for (const auto& t : tensors) { + MPSImage* X = imageFromTensor(t); + MetalCommandBuffer* Xcb = getCommandBuffer(t); TORCH_CHECK( [commandBuffer isEqual:Xcb], @"inputs have different Metal command buffers"); @@ -165,15 +162,15 @@ Tensor cat_feature(const TensorList tensors, MetalTensorImplStorage& mt) { return output; } -Tensor cat(const TensorList tensors, int64_t dim) { +Tensor cat(const ITensorListRef& tensors, int64_t dim) { TORCH_CHECK( dim == 0 || dim == 1, "Metal cat is implemented only for batch dimension"); int64_t cat_dim_size = 0; - at::Tensor tensor = tensors[0]; + TORCH_CHECK(!tensors.empty(), "cat expected a non-empty list of Tensor"); + at::Tensor tensor = *tensors.begin(); MetalCommandBuffer* commandBuffer = getCommandBuffer(tensor); - for (int i = 0; i < tensors.size(); ++i) { - const auto& t = tensors[i]; + for (const auto& t : tensors) { TORCH_CHECK(t.dim() == 4, "Metal cat expects 4 dimensional inputs"); TORCH_CHECK(t.is_metal(), "Metal cat expects metal tensors"); @@ -197,9 +194,9 @@ Tensor cat(const TensorList tensors, int64_t dim) { mt.texture()->allocateTemporaryStorage(result_size, commandBuffer); if (dim == 1) { - return cat_feature(tensors, mt); + return cat_feature(tensor, tensors, mt); } - return cat_batch(tensors, mt); + return cat_batch(tensor, tensors, mt); } TORCH_LIBRARY_IMPL(aten, Metal, m) { diff --git a/aten/src/ATen/native/metal/ops/MetalConvolution.mm b/aten/src/ATen/native/metal/ops/MetalConvolution.mm index 2e1503f67076..46295abefae9 100644 --- a/aten/src/ATen/native/metal/ops/MetalConvolution.mm +++ b/aten/src/ATen/native/metal/ops/MetalConvolution.mm @@ -106,7 +106,9 @@ Tensor conv2d_prepack_run( } // namespace prepack TORCH_LIBRARY_IMPL(aten, Metal, m) { - m.impl(TORCH_SELECTIVE_NAME("aten::conv2d"), TORCH_FN(conv2d)); + // NB: this didn't actually do anything; need to generalize this to + // work for general convolution and register to aten::convolution + // m.impl(TORCH_SELECTIVE_NAME("aten::conv2d"), TORCH_FN(conv2d)); }; TORCH_LIBRARY_IMPL(metal_prepack, Metal, m) { diff --git a/aten/src/ATen/native/metal/ops/MetalHardshrink.mm b/aten/src/ATen/native/metal/ops/MetalHardshrink.mm index 972768070407..4de506cb6526 100644 --- a/aten/src/ATen/native/metal/ops/MetalHardshrink.mm +++ b/aten/src/ATen/native/metal/ops/MetalHardshrink.mm @@ -15,6 +15,8 @@ using MetalTensorImpl = at::MetalTensorImpl; +// NB: this is currently unused, but I've left it because in principle +// it's useful Tensor& hardshrink_(Tensor& input, const at::Scalar& lambda=0.5) { float l = lambda.toFloat(); MPSImage* X = imageFromTensor(input); @@ -84,7 +86,6 @@ Tensor hardshrink(const at::Tensor& input, const at::Scalar& lambda=0.5) { } TORCH_LIBRARY_IMPL(aten, Metal, m) { - m.impl(TORCH_SELECTIVE_NAME("aten::hardshrink_"), TORCH_FN(hardshrink_)); m.impl(TORCH_SELECTIVE_NAME("aten::hardshrink"), TORCH_FN(hardshrink)); }; diff --git a/aten/src/ATen/native/metal/ops/MetalPadding.mm b/aten/src/ATen/native/metal/ops/MetalPadding.mm index 4edd4a04bbde..748fa8f4b653 100644 --- 
a/aten/src/ATen/native/metal/ops/MetalPadding.mm +++ b/aten/src/ATen/native/metal/ops/MetalPadding.mm @@ -35,7 +35,7 @@ Tensor reflection_pad2d(const Tensor& input, IntArrayRef padding) { } std::vector output_size(input_dim); - for (size_t d = 0; d < input_dim; ++d) { + for (int d = 0; d < input_dim; ++d) { if (d == input_dim - 1) { output_size[d] = input_size[d] + pad_right + pad_left; } diff --git a/aten/src/ATen/native/metal/ops/MetalReshape.mm b/aten/src/ATen/native/metal/ops/MetalReshape.mm index 37842ee3be59..551e336a9be1 100644 --- a/aten/src/ATen/native/metal/ops/MetalReshape.mm +++ b/aten/src/ATen/native/metal/ops/MetalReshape.mm @@ -16,7 +16,8 @@ namespace metal { API_AVAILABLE(ios(11.0), macos(10.13)) -Tensor view(const Tensor& input, IntArrayRef size) { +Tensor view(const Tensor& input, c10::SymIntArrayRef sym_size) { + auto size = c10::asIntArrayRefSlow(sym_size); TORCH_CHECK(input.is_metal()); auto inferred_size = at::infer_size(size, input.numel()); auto stride = @@ -63,7 +64,7 @@ Tensor view(const Tensor& input, IntArrayRef size) { Tensor reshape(const Tensor& input, IntArrayRef shape) { TORCH_CHECK(input.is_metal()); - return view(input, shape); + return view(input, c10::fromIntArrayRefSlow(shape)); } Tensor flatten_using_ints( diff --git a/aten/src/ATen/native/miopen/BatchNorm_miopen.cpp b/aten/src/ATen/native/miopen/BatchNorm_miopen.cpp index 28e20e90b299..91f3f01764da 100644 --- a/aten/src/ATen/native/miopen/BatchNorm_miopen.cpp +++ b/aten/src/ATen/native/miopen/BatchNorm_miopen.cpp @@ -1,7 +1,16 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + // TODO: Remove the condition on AT_ROCM_ENABLED entirely, // don't build this file as part of CPU build. #include diff --git a/aten/src/ATen/native/miopen/Conv_miopen.cpp b/aten/src/ATen/native/miopen/Conv_miopen.cpp index 61eb209d5adc..677a711ce7a6 100644 --- a/aten/src/ATen/native/miopen/Conv_miopen.cpp +++ b/aten/src/ATen/native/miopen/Conv_miopen.cpp @@ -1,8 +1,25 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + // TODO: Remove the condition on AT_ROCM_ENABLED entirely, // don't build this file as part of CPU build. 
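Several files in this patch (the cuDNN, MIOpen, and normalization sources above and below) receive the same include restructuring. The concrete header names are not legible in this hunk, so the following is a hedged reconstruction of the usual pattern rather than the exact lines of any one file:

#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/core/Tensor.h>

#ifndef AT_PER_OPERATOR_HEADERS
// monolithic headers: every operator declaration pulled into the translation unit
#include <ATen/Functions.h>
#include <ATen/NativeFunctions.h>
#else
// per-operator headers: include only the ops this file actually calls, which keeps
// rebuilds small; the two below are illustrative examples
#include <ATen/ops/empty.h>
#include <ATen/ops/zeros.h>
#endif

The TORCH_ASSERT_ONLY_METHOD_OPERATORS define at the top of each file is meant to enforce that only method-style operator calls or per-operator headers are used, so an accidental dependency on the monolithic headers fails at compile time.
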
#include @@ -102,6 +119,20 @@ std::tuple miopen_depthwise_convolution_backwa AT_ERROR("miopen_depthwise_convolution_backward: ATen not compiled with MIOpen support"); } + +at::Tensor miopen_convolution_add_relu( + const at::Tensor& input, const at::Tensor& weight, const at::Tensor& z, + const c10::optional& alpha, const c10::optional& bias, IntArrayRef stride, + IntArrayRef padding, IntArrayRef dilation, int64_t groups) { + AT_ERROR("miopen_convolution_add_relu: ATen not compiled with MIOpen support"); +} + +at::Tensor miopen_convolution_relu( + const at::Tensor& input, const at::Tensor& weight, const c10::optional& bias, + IntArrayRef stride, IntArrayRef padding, IntArrayRef dilation, int64_t groups) { + AT_ERROR("miopen_convolution_relu: ATen not compiled with MIOpen support"); +} + }} #else // AT_ROCM_ENABLED @@ -1449,6 +1480,219 @@ Tensor miopen_convolution_transpose( return output_t; } +// MIOpen fused convolution bias activation forward +void raw_miopen_convolution_relu_out( + const Tensor& output, + const Tensor& input, + const Tensor& weight, + const Tensor& bias, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation, + int64_t groups, + bool benchmark, + bool deterministic) { + + auto dataType = getMiopenDataType(input); + miopenConvolutionMode_t c_mode = miopenConvolution; + + ConvolutionArgs args{ input, output, weight }; + args.handle = getMiopenHandle(); + setConvolutionParams(&args.params, args.handle, input, weight, padding, stride, dilation, groups, deterministic); + args.idesc.set(input); + args.wdesc.set(weight, input.suggest_memory_format(), 0); + args.odesc.set(output); + args.cdesc.set(dataType, c_mode, input.dim() - 2, args.params.padding, args.params.stride, args.params.dilation, args.params.groups); + + TensorDescriptor bdesc; + bdesc.set(bias.expand({1, bias.size(0)}), output.dim()); + + // Create the fusion plan + miopenFusionPlanDescriptor_t fusePlanDesc; + miopenFusionOpDescriptor_t convoOp; + miopenFusionOpDescriptor_t biasOp; + miopenFusionOpDescriptor_t activOp; + MIOPEN_CHECK(miopenCreateFusionPlan(&fusePlanDesc, miopenVerticalFusion, args.idesc.desc())); + MIOPEN_CHECK(miopenCreateOpConvForward(fusePlanDesc, &convoOp, args.cdesc.desc(), args.wdesc.desc())); + MIOPEN_CHECK(miopenCreateOpBiasForward(fusePlanDesc, &biasOp, bdesc.desc())); + MIOPEN_CHECK(miopenCreateOpActivationForward(fusePlanDesc, &activOp, miopenActivationRELU)); + + // compile fusion plan + MIOPEN_CHECK(miopenCompileFusionPlan(args.handle, fusePlanDesc)); + + // Set the Args + float alpha = static_cast(1); + float beta = static_cast(0); + float activ_alpha = static_cast(0); + float activ_beta = static_cast(0); + float activ_gamma = static_cast(0); + miopenOperatorArgs_t fusionArgs; + MIOPEN_CHECK(miopenCreateOperatorArgs(&fusionArgs)); + MIOPEN_CHECK(miopenSetOpArgsConvForward(fusionArgs, convoOp, &alpha, &beta, weight.data_ptr())); + MIOPEN_CHECK(miopenSetOpArgsBiasForward(fusionArgs, biasOp, &alpha, &beta, bias.data_ptr())); + MIOPEN_CHECK(miopenSetOpArgsActivForward(fusionArgs, activOp, &alpha, &beta, activ_alpha, activ_beta, activ_gamma)); + + miopenExecuteFusionPlan(args.handle, fusePlanDesc, args.idesc.desc(), input.data_ptr(), args.odesc.desc(), output.data_ptr(), fusionArgs); + + // Cleanup + miopenDestroyFusionPlan(fusePlanDesc); +} + +static at::Tensor self_or_new_memory_format(at::Tensor& self, at::MemoryFormat memory_format) { + if (self.is_contiguous(memory_format)) { + return self; + } + return at::empty_like(self, self.options(), memory_format); +} + +Tensor 
miopen_convolution_add_relu( + const Tensor& input, + const Tensor& weight, + const Tensor& z, + const c10::optional& alpha, + const c10::optional& bias, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation, + int64_t groups) { + + // MIOpen does not support fusion of add, the alpha2 * z step of the below cuDNN function: + // y = act ( alpha1 * conv(x) + alpha2 * z + bias ) + + auto memory_format = input.suggest_memory_format(); + + auto& ctx = at::globalContext(); + bool benchmark = ctx.benchmarkCuDNN(); + + TensorArg input_arg { input, "input", 1 }, + weight_arg { weight, "weight", 2 }; + auto output = miopen_convolution_forward( + "miopen_convolution_add_relu", + input_arg, + weight_arg, + padding, + stride, + dilation, + groups, + benchmark, + false // deterministic + ); + + auto contig_output = self_or_new_memory_format(output, memory_format); + + if (!output.is_same(contig_output)) { + contig_output.copy_(output); + } + + auto _alpha = alpha.has_value() ? alpha.value().to() : 1.0; + auto _bias = bias.has_value() + ? bias.value() + : at::zeros( + {contig_output.size(1)}, + optTypeMetaToScalarType(contig_output.options().dtype_opt()), + contig_output.options().layout_opt(), + contig_output.options().device_opt(), + contig_output.options().pinned_memory_opt()); + + at::Tensor alpha_mul_z_add_bias = at::native::reshape_bias(input.dim(), _bias).add(z, _alpha); + contig_output.add_(alpha_mul_z_add_bias); + contig_output.relu_(); + + return contig_output; +} + +Tensor miopen_convolution_relu( + const Tensor& input, + const Tensor& weight, + const c10::optional& bias, + IntArrayRef stride, + IntArrayRef padding, + IntArrayRef dilation, + int64_t groups) { + + auto memory_format = input.suggest_memory_format(); + + auto& ctx = at::globalContext(); + bool benchmark = ctx.benchmarkCuDNN(); + + // MIOpen currently only supports MemoryFormat::Contiguous and fp32 and 2d + if (input.suggest_memory_format() == at::MemoryFormat::Contiguous + && input.scalar_type() == at::kFloat + && input.ndimension() == 4) { + + // FuseFrozenConvAddRelu performs some tensor shape checking + Tensor output_t = at::detail::empty_cuda( + conv_output_size( + input.sizes(), weight.sizes(), padding, stride, dilation), + input.options().memory_format(input.suggest_memory_format())); + if (output_t.numel() == 0) { + return output_t; + } + + auto _bias = bias.has_value() + ? bias.value() + : at::zeros( + {output_t.size(1)}, + optTypeMetaToScalarType(output_t.options().dtype_opt()), + output_t.options().layout_opt(), + output_t.options().device_opt(), + output_t.options().pinned_memory_opt()); + + raw_miopen_convolution_relu_out( + output_t, + input, + weight, + _bias, + stride, + padding, + dilation, + groups, + benchmark, // benchmark + false // deterministic + ); + + return output_t; + } + else { + // fallback + + TensorArg input_arg { input, "input", 1 }, + weight_arg { weight, "weight", 2 }; + auto output = miopen_convolution_forward( + "miopen_convolution_relu", + input_arg, + weight_arg, + padding, + stride, + dilation, + groups, + benchmark, + false // deterministic + ); + + auto contig_output = self_or_new_memory_format(output, memory_format); + + if (!output.is_same(contig_output)) { + contig_output.copy_(output); + } + + auto _bias = bias.has_value() + ? 
bias.value() + : at::zeros( + {contig_output.size(1)}, + optTypeMetaToScalarType(contig_output.options().dtype_opt()), + contig_output.options().layout_opt(), + contig_output.options().device_opt(), + contig_output.options().pinned_memory_opt()); + + at::Tensor reshaped_bias = at::native::reshape_bias(input.dim(), _bias); + contig_output.add_(reshaped_bias); + contig_output.relu_(); + + return contig_output; + } +} + REGISTER_CUDA_DISPATCH(miopen_convolution_backward_stub, &miopen_convolution_backward); REGISTER_CUDA_DISPATCH(miopen_convolution_transpose_backward_stub, &miopen_convolution_transpose_backward); REGISTER_CUDA_DISPATCH(miopen_depthwise_convolution_backward_stub, &miopen_depthwise_convolution_backward); diff --git a/aten/src/ATen/native/miopen/RNN_miopen.cpp b/aten/src/ATen/native/miopen/RNN_miopen.cpp index b5a63dd803d1..b61794487a41 100644 --- a/aten/src/ATen/native/miopen/RNN_miopen.cpp +++ b/aten/src/ATen/native/miopen/RNN_miopen.cpp @@ -1,15 +1,28 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include +#include #include #include #include -#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#endif + #if !AT_ROCM_ENABLED() namespace at { namespace native { diff --git a/aten/src/ATen/native/mkl/LinearAlgebra.cpp b/aten/src/ATen/native/mkl/LinearAlgebra.cpp index 2790f1e8b3f2..a47afe97648f 100644 --- a/aten/src/ATen/native/mkl/LinearAlgebra.cpp +++ b/aten/src/ATen/native/mkl/LinearAlgebra.cpp @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_NO_OPERATORS #include #include diff --git a/aten/src/ATen/native/mkl/LinearAlgebra.h b/aten/src/ATen/native/mkl/LinearAlgebra.h index a536c193524e..a3bbd8285320 100644 --- a/aten/src/ATen/native/mkl/LinearAlgebra.h +++ b/aten/src/ATen/native/mkl/LinearAlgebra.h @@ -1,5 +1,6 @@ -#include +#pragma once #include +#include namespace at { namespace native { diff --git a/aten/src/ATen/native/mkl/SparseBlasImpl.cpp b/aten/src/ATen/native/mkl/SparseBlasImpl.cpp index 3e1f7a5771a1..a2ed1af23795 100644 --- a/aten/src/ATen/native/mkl/SparseBlasImpl.cpp +++ b/aten/src/ATen/native/mkl/SparseBlasImpl.cpp @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include #include @@ -351,30 +352,132 @@ void addmm_out_sparse_csr( const Scalar& beta, const Scalar& alpha, const Tensor& result) { - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(mat1.dim() == 2 && mat2.dim() == 2 && result.dim() == 2); - if ((mat1.layout() == kSparseCsr || mat1.layout() == kSparseBsr) && - mat2.layout() == kStrided && result.layout() == kStrided) { - return addmm_dense_result(mat1, mat2, beta, alpha, result); + TORCH_INTERNAL_ASSERT_DEBUG_ONLY( + mat1.dim() == 2 && mat2.dim() == 2 && result.dim() == 2); + TORCH_INTERNAL_ASSERT( + !((mat1.layout() == kStrided) && (mat2.layout() == kStrided) && + (result.layout() == kStrided)), + "Expected at least one sparse input"); + + // Layout checks are nested mat1, mat2, result + // Conditions are ordered strided, csr, csc, bsr, bsc. + // Valid combinations terminate in a return + // Invalid combinations are omitted and will fall though to the TORCH check + // generating an informative error message + if (mat1.layout() == kStrided) { + if (mat2.layout() == kSparseCsr) { + if (result.layout() == kStrided) { + // TODO: Add native CSC support via cuSPARSE if supported. 
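        // Editor's note (explanatory addition, not in the original change):
        // this branch computes result^T = beta * result^T + alpha * (mat2^T @ mat1^T),
        // relying on (A @ B)^T == B^T @ A^T so that the sparse operand lands in
        // the first argument slot addmm_dense_result expects. Transposing and
        // re-converting the CSR operand with to_sparse_csr() is the expensive
        // part, while result.transpose(0, 1) is only a view, so the values are
        // written straight into result.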
+ return addmm_dense_result( + mat2.transpose(0, 1).to_sparse_csr(), + mat1.transpose(0, 1), + beta, + alpha, + result.transpose(0, 1)); + } + } + if (mat2.layout() == kSparseCsc) { + if (result.layout() == kStrided) { + return addmm_dense_result( + mat2.transpose(-2, -1), + mat1.transpose(-2, -1), + beta, + alpha, + result.transpose(-2, -1)); + } + } + if (mat2.layout() == kSparseBsc) { + if (result.layout() == kStrided) { + return addmm_dense_result( + mat2.transpose(-2, -1), + mat1.transpose(-2, -1), + beta, + alpha, + result.transpose(-2, -1)); + } + } } - if (mat1.layout() == kStrided && mat2.is_sparse_csr() && - result.layout() == kStrided) { - // TODO: Use MKL's transposition flags instead of this costly conversion to - // CSR - return addmm_dense_result( - mat2.transpose(0, 1).to_sparse_csr(), - mat1.transpose(0, 1), - beta, - alpha, - result.transpose(0, 1)); + if (mat1.layout() == kSparseCsr) { + if (mat2.layout() == kStrided) { + if (result.layout() == kStrided) { + return addmm_dense_result(mat1, mat2, beta, alpha, result); + } + } + if (mat2.layout() == kSparseCsr) { + if (result.layout() == kStrided) { + return addmm_sparse_input_dense_result(mat1, mat2, beta, alpha, result); + } + if (result.layout() == kSparseCsr) { + return addmm_sparse_result(mat1, mat2, beta, alpha, result); + } + } + if (mat2.layout() == kSparseCsc) { + if (result.layout() == kStrided) { + // TODO: CSR @ CSC kernel would be very fast due to format alignment + return addmm_sparse_input_dense_result( + mat1, mat2.to_sparse_csr(), beta, alpha, result); + } + if (result.layout() == kSparseCsr) { + // TODO: CSR @ CSC kernel would be very fast due to format alignment + return addmm_sparse_result( + mat1, mat2.to_sparse_csr(), beta, alpha, result); + } + } } - if (mat1.is_sparse_csr() && mat2.is_sparse_csr() && result.layout() == kStrided) { - return addmm_sparse_input_dense_result(mat1, mat2, beta, alpha, result); + if (mat1.layout() == kSparseCsc) { + if (mat2.layout() == kStrided) { + if (result.layout() == kStrided) { + // TODO: avoid csc->csr conversion with native csc support + return addmm_dense_result( + mat1.to_sparse_csr(), mat2, beta, alpha, result); + } + } + if (mat2.layout() == kSparseCsr) { + if (result.layout() == kSparseCsr) { + // TODO: avoid csc->csr conversion with native csc support + return addmm_sparse_result( + mat1.to_sparse_csr(), mat2, beta, alpha, result); + } + } + if (mat2.layout() == kSparseCsc) { + if (result.layout() == kStrided) { + return addmm_sparse_input_dense_result( + mat2.transpose(-2, -1), + mat1.transpose(-2, -1), + beta, + alpha, + result.transpose(-2, -1)); + } + if (result.layout() == kSparseCsr) { + // TODO avoid csc->csr + return addmm_sparse_result( + mat1.to_sparse_csr(), mat2.to_sparse_csr(), beta, alpha, result); + } + if (result.layout() == kSparseCsc) { + return addmm_sparse_result( + mat2.transpose(-2, -1), + mat1.transpose(-2, -1), + beta, + alpha, + result.transpose(-2, -1)); + } + } } - if (mat1.is_sparse_csr() && mat2.is_sparse_csr() && result.is_sparse_csr()) { - return addmm_sparse_result(mat1, mat2, beta, alpha, result); + if (mat1.layout() == kSparseBsr) { + if (mat2.layout() == kStrided) { + if (result.layout() == kStrided) { + return addmm_dense_result(mat1, mat2, beta, alpha, result); + } + } } - TORCH_CHECK(false, "addmm: computation on CPU is not implemented for ", - result.layout(), " + ", mat1.layout(), " @ ", mat2.layout()); + TORCH_CHECK( + false, + "addmm: computation on CPU is not implemented for ", + result.layout(), + " + ", + 
mat1.layout(), + " @ ", + mat2.layout()); } /* diff --git a/aten/src/ATen/native/mkl/SparseCsrLinearAlgebra.cpp b/aten/src/ATen/native/mkl/SparseCsrLinearAlgebra.cpp index bf84d583dbde..8081de65facf 100644 --- a/aten/src/ATen/native/mkl/SparseCsrLinearAlgebra.cpp +++ b/aten/src/ATen/native/mkl/SparseCsrLinearAlgebra.cpp @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include // Don't compile with MKL for MSVC/macos since linking the sparse MKL routines diff --git a/aten/src/ATen/native/mkl/SparseCsrLinearAlgebra.h b/aten/src/ATen/native/mkl/SparseCsrLinearAlgebra.h index 74f3c62215fd..480282e3b3ed 100644 --- a/aten/src/ATen/native/mkl/SparseCsrLinearAlgebra.h +++ b/aten/src/ATen/native/mkl/SparseCsrLinearAlgebra.h @@ -1,4 +1,5 @@ -#include +#pragma once +#include #include namespace at { diff --git a/aten/src/ATen/native/mkl/SpectralOps.cpp b/aten/src/ATen/native/mkl/SpectralOps.cpp index 470c3a48e5e0..cb00ce99d82e 100644 --- a/aten/src/ATen/native/mkl/SpectralOps.cpp +++ b/aten/src/ATen/native/mkl/SpectralOps.cpp @@ -1,13 +1,25 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include #include -#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + #if AT_MKL_ENABLED() || AT_POCKETFFT_ENABLED() #include +#include namespace at { namespace native { // In real-to-complex transform, MKL FFT only fills half of the values due to @@ -313,16 +325,9 @@ Tensor _fft_c2c_mkl(const Tensor& self, IntArrayRef dim, int64_t normalization, }} #elif AT_MKL_ENABLED() -#include -#include #include -#include -#include - -#include #include -#include #include #include diff --git a/aten/src/ATen/native/mkldnn/BinaryOps.cpp b/aten/src/ATen/native/mkldnn/BinaryOps.cpp index b842c425a919..3b68c60a9d68 100644 --- a/aten/src/ATen/native/mkldnn/BinaryOps.cpp +++ b/aten/src/ATen/native/mkldnn/BinaryOps.cpp @@ -1,7 +1,15 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include + +#ifndef AT_PER_OPERATOR_HEADERS #include +#else +#include +#include +#include +#endif #if !AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/Conv.cpp b/aten/src/ATen/native/mkldnn/Conv.cpp index 0096a1cda674..3d8188c003e1 100644 --- a/aten/src/ATen/native/mkldnn/Conv.cpp +++ b/aten/src/ATen/native/mkldnn/Conv.cpp @@ -1,7 +1,20 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include + +#ifndef AT_PER_OPERATOR_HEADERS #include -#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif #if !AT_MKLDNN_ENABLED() @@ -39,7 +52,6 @@ REGISTER_NO_CPU_DISPATCH(mkldnn_convolution_backward_stub); #include #include -#include namespace at { namespace native { @@ -155,41 +167,34 @@ static void check_shape_forward(const Tensor& input, // but weight/bias and grad_weight/grad_bias are always CPU tensor. // -Tensor mkldnn_convolution( - const Tensor& input, - const Tensor& weight, - const c10::optional& bias_opt, - IntArrayRef padding, +static inline at::MemoryFormat mkldnn_convolution_memory_format(int64_t dims, bool is_channels_last) { + auto memory_format = at::MemoryFormat::Contiguous; + if (is_channels_last) { + memory_format = dims == 4 ? 
at::MemoryFormat::ChannelsLast : at::MemoryFormat::ChannelsLast3d; + } + return memory_format; +} + +void _mkldnn_convolution_out ( + const Tensor& input_t, + const Tensor& weight_t, + const Tensor& bias, + std::vector& output_sizes, + ideep::tensor& y, IntArrayRef stride, IntArrayRef dilation, - int64_t groups) { - // See [Note: hacky wrapper removal for optional tensor] - c10::MaybeOwned bias_maybe_owned = at::borrow_from_optional_tensor(bias_opt); - const Tensor& bias = *bias_maybe_owned; - - if (input.scalar_type() == ScalarType::BFloat16) { - TORCH_CHECK(mkldnn_bf16_device_check(), - "mkldnn_convolution: bf16 path needs the cpu support avx512bw, avx512vl and avx512dq"); - } - - check_shape_forward(input, weight, bias, padding, stride, dilation, groups); - - bool is_channels_last = input.suggest_memory_format() == at::MemoryFormat::ChannelsLast; - - auto output_sizes = conv_output_size(input.sizes(), weight.sizes(), padding, stride, dilation); - auto output = at::empty({0}, input.options()); - + IntArrayRef padding, + int64_t groups, + bool is_channels_last, + const ideep::attr_t& op_attr) { + auto memory_format = mkldnn_convolution_memory_format(input_t.ndimension(), is_channels_last); + auto input = input_t.is_mkldnn() ? input_t : input_t.contiguous(memory_format); + auto weight = weight_t.is_mkldnn() ? weight_t : weight_t.contiguous(memory_format); const ideep::tensor x = itensor_from_tensor(input); const ideep::tensor w = itensor_from_tensor(weight); - - ideep::tensor y; - if (is_channels_last) { - output.resize_(output_sizes, input.suggest_memory_format()); - y = itensor_from_tensor(output); - } if (bias.defined()) { const ideep::tensor b = itensor_from_tensor(bias); - ideep::convolution_forward::compute( + ideep::convolution_forward::compute_v3( x, w, b, @@ -199,9 +204,11 @@ Tensor mkldnn_convolution( {dilation.begin(), dilation.end()}, {padding.begin(), padding.end()}, {padding.begin(), padding.end()}, - groups); + groups, + is_channels_last, + op_attr); } else { - ideep::convolution_forward::compute( + ideep::convolution_forward::compute_v3( x, w, {output_sizes.cbegin(), output_sizes.cend()}, @@ -210,24 +217,392 @@ Tensor mkldnn_convolution( {dilation.begin(), dilation.end()}, {padding.begin(), padding.end()}, {padding.begin(), padding.end()}, - groups); + groups, + is_channels_last, + op_attr); + } +} + +Tensor _mkldnn_convolution( + const Tensor& input_t, + const Tensor& weight_t, + const c10::optional& bias_opt, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + int64_t groups, + bool use_channels_last, + c10::string_view attr = "none", + torch::List> scalars = + torch::List>(), + c10::optional algorithm = c10::nullopt) { + ideep::attr_t op_attr = ideep::attr_t(); + if (attr != "none") { + auto it = fusion_unary_attr_map().find(attr); + TORCH_CHECK( + it != fusion_unary_attr_map().end(), "Fusion behavior undefined."); + op_attr = it->second(scalars, algorithm); + } + // See [Note: hacky wrapper removal for optional tensor] + c10::MaybeOwned bias_maybe_owned = at::borrow_from_optional_tensor(bias_opt); + const Tensor& bias = *bias_maybe_owned; + + if (input_t.scalar_type() == ScalarType::BFloat16) { + TORCH_CHECK(mkldnn_bf16_device_check(), + "mkldnn_convolution: bf16 path needs the cpu support avx512bw, avx512vl and avx512dq"); } - if (input.is_mkldnn()) { - return MKLDNNTensor(y, input.options()); - } else if (!is_channels_last) { - return mkldnn_to_dense(MKLDNNTensor(y, input.options())); + check_shape_forward(input_t, weight_t, bias, padding, stride, 
dilation, groups); + + auto memory_format = + mkldnn_convolution_memory_format(input_t.ndimension(), use_channels_last); + + auto output_sizes = conv_output_size(input_t.sizes(), weight_t.sizes(), padding, stride, dilation); + auto output = at::empty({0}, input_t.options()); + ideep::tensor y; + if (use_channels_last) { + output.resize_(output_sizes, memory_format); + y = itensor_from_tensor(output); + } + _mkldnn_convolution_out( + input_t, + weight_t, + bias, + output_sizes, + y, + stride, + dilation, + padding, + groups, + use_channels_last, + op_attr); + + if (input_t.is_mkldnn()) { + return MKLDNNTensor(y, input_t.options()); + } else if (!use_channels_last) { + return mkldnn_to_dense(MKLDNNTensor(y, input_t.options())); } else { TORCH_INTERNAL_ASSERT(y.get_desc().is_nhwc()); return output; } } +Tensor mkldnn_convolution( + const Tensor& input_t, + const Tensor& weight_t, + const c10::optional& bias_opt, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + int64_t groups) { + bool use_channels_last = mkldnn_conv_use_channels_last(input_t, weight_t); + return _mkldnn_convolution( + input_t, + weight_t, + bias_opt, + padding, + stride, + dilation, + groups, + use_channels_last); +} + +Tensor mkldnn_convolution_pointwise( + const Tensor& input_t, + const Tensor& weight_t, + const c10::optional& bias_opt, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + int64_t groups, + c10::string_view attr, + torch::List> scalars, + c10::optional algorithm) { + c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset); + bool use_channels_last = + weight_t.is_mkldnn() || mkldnn_conv_use_channels_last(input_t, weight_t); + return _mkldnn_convolution( + input_t, + weight_t, + bias_opt, + padding, + stride, + dilation, + groups, + use_channels_last, + attr, + scalars, + algorithm); +} + +// Fuse convolution+binary_op+unary_op for good performance, which doing such +// operation: output=unary_op(binary_op(conv(input_t, ...), other_t, alpha)). +// The binary_attr means which binary_op is, it can be "add", or +// other binary operation. the unary_attr means which unary_op is, +// it can be "relu" or other unary operation, if it is none, meaning that +// there doesn't have a unary post op. unary_scalars and unary_algorithm +// are the parameters of the unary op, such as "hardtanh" has scalar parameters, +// "gelu" has algorithm parameters. +Tensor mkldnn_convolution_pointwise_binary( + const Tensor& input_t, + const Tensor& other_t, + const Tensor& weight_t, + const c10::optional& bias_opt, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + int64_t groups, + c10::string_view binary_attr, + c10::optional alpha, + c10::optional unary_attr, + torch::List> unary_scalars, + c10::optional unary_algorithm) { + TORCH_CHECK( + input_t.ndimension() == 4 || input_t.ndimension() == 5, + "mkldnn_convolution_pointwise_binary: currently only support 2d and 3d") + TORCH_CHECK( + !alpha.has_value() || alpha.value().to() == 1.0, + "mkldnn_convolution_pointwise_binary: the alpha value should be none or 1.0"); + + c10::MaybeOwned bias_maybe_owned = + at::borrow_from_optional_tensor(bias_opt); + const Tensor& bias = *bias_maybe_owned; + + // Make sure inputs have same type(device, layout, dtype), device is cpu and + // dtype is float or bfloat16. 
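  // Editor's aside (equivalence note, not part of the original patch): in eager
  // terms this operator computes
  //   output = unary_op(binary_op(at::convolution(input_t, weight_t, bias, ...), other_t)),
  // so binary_attr == "add" with unary_attr == "relu" matches
  //   at::relu(at::convolution(input_t, weight_t, bias, stride, padding, dilation,
  //                            /*transposed=*/false, /*output_padding=*/0, groups) + other_t),
  // which is exactly what the non-fused fallback further down produces via
  // at::native::add_relu_.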
+ check_mkldnn_binary_fusion_inputs(input_t, other_t, weight_t, bias); + + check_shape_forward( + input_t, weight_t, bias, padding, stride, dilation, groups); + + auto output_sizes = conv_output_size( + input_t.sizes(), weight_t.sizes(), padding, stride, dilation); + // TODO: support broadcast binary fusion. + TORCH_CHECK( + output_sizes == other_t.sizes(), + "Binary Fusion's inputs should have same shape"); + // Only calling fusion path for channels_last path. + // TODO: OneDNN doesn't optimize well for groups > 1 case, it will be enabled + // at next OneDNN release. + bool use_channels_last = + weight_t.is_mkldnn() || mkldnn_conv_use_channels_last(input_t, weight_t); + bool can_be_fused = groups == 1 && use_channels_last; + + c10::string_view unary_attr_value = "none"; + ideep::algorithm unary_alg; + if (unary_attr.has_value()) { + auto it_unary = fusion_unary_alg_map().find(unary_attr.value()); + // Now, we only support conv+binary+relu. + TORCH_CHECK( + it_unary != fusion_unary_alg_map().end(), + "Unary Fusion behavior undefined."); + unary_attr_value = unary_attr.value(); + unary_alg = it_unary->second; + } + auto it_binary = fusion_binary_alg_map().find(binary_attr); + TORCH_CHECK( + it_binary != fusion_binary_alg_map().end(), + "Binary Fusion behavior undefined."); + c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset); + if (can_be_fused) { + auto memory_format = + mkldnn_convolution_memory_format(input_t.ndimension(), true); + auto input = input_t.contiguous(memory_format); + auto weight = + weight_t.is_mkldnn() ? weight_t : weight_t.contiguous(memory_format); + auto other = other_t.contiguous(memory_format); + auto output = at::empty_like(other); + const ideep::tensor x = itensor_from_tensor(input); + const ideep::tensor w = itensor_from_tensor(weight); + const ideep::tensor z = itensor_from_tensor(other); + ideep::tensor y = itensor_from_tensor(output); + auto output_size = other.sizes().vec(); + ideep::tag format_tag = ideep::tag::nhwc; + if (input_t.ndimension() == 5) { + format_tag = ideep::tag::ndhwc; + } + auto other_desc = ideep::tensor::desc( + output_size, get_mkldnn_dtype(weight.scalar_type()), format_tag); + + ideep::attr_t op_attr; + ideep::post_ops po; + po.append_binary(it_binary->second, other_desc); + if (unary_attr_value != "none") { + po.append_eltwise(1.0, unary_alg, 0.f, 0.f); + } + op_attr.set_post_ops(po); + + if (bias.defined()) { + const ideep::tensor b = itensor_from_tensor(bias); + ideep::convolution_forward::compute_binary( + x, + z, + w, + b, + output_size, + y, + {stride.begin(), stride.end()}, + {dilation.begin(), dilation.end()}, + {padding.begin(), padding.end()}, + {padding.begin(), padding.end()}, + groups, + /* is_channels_last */ true, + op_attr); + } else { + ideep::convolution_forward::compute_binary( + x, + z, + w, + output_size, + y, + {stride.begin(), stride.end()}, + {dilation.begin(), dilation.end()}, + {padding.begin(), padding.end()}, + {padding.begin(), padding.end()}, + groups, + /* is_channels_last */ true, + op_attr); + } + return output; + } else { + // Fallback case, if inputs are not channels last or have different dtype, + // OneDNN fusion may have performance regression. 
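  // Editor's aside (not part of the original patch): the fused branch above
  // expresses the extra work as oneDNN post-ops, a binary post-op on the conv
  // destination (add/sub/mul/div against other_t's descriptor), optionally
  // followed by an eltwise ReLU. The append_eltwise arguments appear to be
  // (scale, algorithm, alpha, beta), inferred from the call sites in this file
  // rather than from ideep documentation, so (1.0, relu, 0.f, 0.f) is a plain
  // ReLU with no clamping.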
+ Tensor output; + if (weight_t.is_mkldnn()) { + output = _mkldnn_convolution( + input_t, weight_t, bias, padding, stride, dilation, groups, true); + } else { + output = at::convolution( + input_t, weight_t, bias, stride, padding, dilation, false, 0, groups); + } + if (binary_attr == "add" && unary_attr_value != "none") { + output = at::native::add_relu_(output, other_t); + return output; + } + if (binary_attr == "add") { + output.add_(other_t); + } else if (binary_attr == "sub") { + output.sub_(other_t); + } else if (binary_attr == "mul") { + output.mul_(other_t); + } else { + output.div_(other_t); + } + if (unary_attr_value != "none") { + output.relu_(); + } + return output; + } +} + +// Fuse convolution+binary_op+unary_op for good performance, which doing +// such operation: other_t=unary_op(binary_op(conv(input_t, ...), other_t, +// alpha)). The binary_attr means which binary_op is, it can be "add", or other +// binary operation. the unary_attr means which unary_op is, it can be "relu" or +// other unary operation, if it is none, meaning that there doesn't have a unary +// post op. unary_scalars and unary_algorithm are the parameters of the unary +// op, such as "hardtanh" has scalar parameters "gelu" has algorithm parameters. + +Tensor& mkldnn_convolution_pointwise_binary_( + const Tensor& input_t, + Tensor& other_t, + const Tensor& weight_t, + const c10::optional& bias_opt, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + int64_t groups, + c10::string_view binary_attr, + c10::optional alpha, + c10::optional unary_attr, + torch::List> unary_scalars, + c10::optional unary_algorithm) { + // other_t += convolution(...), other_t = unary(other_t) + TORCH_CHECK( + input_t.ndimension() == 4 || input_t.ndimension() == 5, + "mkldnn_convolution_add_: currently only support 2d and 3d") + TORCH_CHECK( + binary_attr == "add", + "mkldnn_convolution_pointwise_binary_: only support binary op fusion") + TORCH_CHECK( + !alpha.has_value() || alpha.value().to() == 1.0, + "mkldnn_convolution_pointwise_binary: the alpha value for the binary op should be none(meaning 1.0) or 1.0"); + TORCH_CHECK( + !unary_attr.has_value() || unary_attr.value() == "relu", + "mkldnn_convolution_pointwise_binary: only support none or relu unary op fusion after binary op"); + + c10::MaybeOwned bias_maybe_owned = + at::borrow_from_optional_tensor(bias_opt); + const Tensor& bias = *bias_maybe_owned; + + // Make sure inputs have same type(device, layout, dtype), device is cpu and + // dtype is float or bfloat16. + check_mkldnn_binary_fusion_inputs(input_t, other_t, weight_t, bias); + + check_shape_forward( + input_t, weight_t, bias, padding, stride, dilation, groups); + + auto output_sizes = conv_output_size( + input_t.sizes(), weight_t.sizes(), padding, stride, dilation); + TORCH_CHECK( + output_sizes == other_t.sizes(), + "Add Fusion's inputs should have same shape"); + // Only calling fusion path for channels_last path and the output is contiguous tensor(channels_last). 
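  // Editor's aside (not part of the original patch): this in-place variant
  // accumulates into other_t, i.e. other_t = relu(other_t + conv(...)) when the
  // "relu" unary post-op is requested and other_t += conv(...) otherwise. On
  // the fused path below that choice maps to ideep::attr_t::residual(), which
  // as used here bundles sum and ReLU post-ops, versus ideep::attr_t::fuse_sum()
  // for the sum alone; the non-fused path reaches the same result with
  // at::native::add_relu_ and Tensor::add_.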
+ bool can_be_fused = (weight_t.is_mkldnn() || + mkldnn_conv_use_channels_last(input_t, weight_t)) && + (other_t.is_contiguous(at::MemoryFormat::ChannelsLast) || + other_t.is_contiguous(at::MemoryFormat::ChannelsLast3d)); + c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset); + if (can_be_fused) { + ideep::tensor y = itensor_from_tensor(other_t); + ideep::attr_t op_attr; + if (unary_attr.has_value()) { + op_attr = ideep::attr_t::residual(); + } else { + op_attr = ideep::attr_t::fuse_sum(); + } + _mkldnn_convolution_out( + input_t, + weight_t, + bias, + output_sizes, + y, + stride, + dilation, + padding, + groups, + true, + op_attr); + } else { + // Fallback case, if inputs are not channels last or have different dtype, + // OneDNN fusion may have performance regression. + Tensor output; + if (weight_t.is_mkldnn()) { + output = _mkldnn_convolution( + input_t, weight_t, bias, padding, stride, dilation, groups, true); + } else { + output = at::convolution( + input_t, weight_t, bias, stride, padding, dilation, false, 0, groups); + } + if (unary_attr.has_value()) { + other_t = at::native::add_relu_(other_t, output); + } else { + other_t.add_(output); + } + } + return other_t; +} + Tensor mkldnn_convolution_backward_input( - IntArrayRef input_size, const Tensor& grad_output, const Tensor& weight, - IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups, bool bias_defined) -{ - bool is_channels_last = grad_output.suggest_memory_format() == at::MemoryFormat::ChannelsLast; + IntArrayRef input_size, + const Tensor& grad_output, + const Tensor& weight, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + int64_t groups, + bool bias_defined, + bool is_channels_last) { auto grad_input = at::empty({0}, grad_output.options()); auto grad_y = itensor_from_tensor(grad_output); @@ -235,10 +610,11 @@ Tensor mkldnn_convolution_backward_input( ideep::tensor grad_x; if (is_channels_last) { - grad_input.resize_(input_size, grad_output.suggest_memory_format()); + auto memory_format = mkldnn_convolution_memory_format(grad_output.ndimension(), is_channels_last); + grad_input.resize_(input_size, memory_format); grad_x = itensor_from_tensor(grad_input); } - ideep::convolution_backward_data::compute( + ideep::convolution_backward_data::compute_v2( grad_y, w, input_size.vec(), @@ -247,7 +623,8 @@ Tensor mkldnn_convolution_backward_input( dilation.vec(), padding.vec(), padding.vec(), - groups); + groups, + is_channels_last); if (grad_output.is_mkldnn()) { return MKLDNNTensor(grad_x, grad_output.options()); @@ -260,17 +637,21 @@ Tensor mkldnn_convolution_backward_input( } std::tuple mkldnn_convolution_backward_weights( - IntArrayRef weight_size, const Tensor& grad_output, const Tensor& input, - IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups, bool bias_defined) -{ - bool is_channels_last = grad_output.suggest_memory_format() == at::MemoryFormat::ChannelsLast; - + IntArrayRef weight_size, + const Tensor& grad_output, + const Tensor& input, + IntArrayRef padding, + IntArrayRef stride, + IntArrayRef dilation, + int64_t groups, + bool bias_defined, + bool is_channels_last) { const ideep::tensor grad_y = itensor_from_tensor(grad_output); const ideep::tensor x = itensor_from_tensor(input); ideep::tensor grad_w, grad_b; if (bias_defined) { - ideep::convolution_backward_weights::compute( + ideep::convolution_backward_weights::compute_v2( x, grad_y, weight_size.vec(), @@ -280,9 +661,10 @@ std::tuple mkldnn_convolution_backward_weights( 
dilation.vec(), padding.vec(), padding.vec(), - groups); + groups, + is_channels_last); } else { - ideep::convolution_backward_weights::compute( + ideep::convolution_backward_weights::compute_v2( x, grad_y, weight_size.vec(), @@ -291,7 +673,8 @@ std::tuple mkldnn_convolution_backward_weights( dilation.vec(), padding.vec(), padding.vec(), - groups); + groups, + is_channels_last); } if (!is_channels_last) { @@ -306,20 +689,23 @@ std::tuple mkldnn_convolution_backward_weights( } std::tuple mkldnn_convolution_backward( - const Tensor& input, const Tensor& grad_output_t, const Tensor& weight, + const Tensor& input_t, const Tensor& grad_output_t, const Tensor& weight_t, IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups, std::array output_mask) { - auto memory_format = input.suggest_memory_format(); + bool is_channels_last = mkldnn_conv_use_channels_last(input_t, weight_t); + auto memory_format = mkldnn_convolution_memory_format(input_t.ndimension(), is_channels_last); Tensor grad_output = grad_output_t.is_mkldnn() ? grad_output_t : grad_output_t.contiguous(memory_format); + Tensor input = input_t.is_mkldnn() ? input_t : input_t.contiguous(memory_format); + Tensor weight = weight_t.is_mkldnn() ? weight_t : weight_t.contiguous(memory_format); Tensor grad_input, grad_weight, grad_bias; if (output_mask[0]) { grad_input = mkldnn_convolution_backward_input( - input.sizes(), grad_output, weight, padding, stride, dilation, groups, output_mask[2]); + input.sizes(), grad_output, weight, padding, stride, dilation, groups, output_mask[2], is_channels_last); } if (output_mask[1] || output_mask[2]) { std::tie(grad_weight, grad_bias) = mkldnn_convolution_backward_weights( - weight.sizes(), grad_output, input, padding, stride, dilation, groups, output_mask[2]); + weight.sizes(), grad_output, input, padding, stride, dilation, groups, output_mask[2], is_channels_last); } return std::make_tuple(grad_input, grad_weight, grad_bias); @@ -327,6 +713,29 @@ std::tuple mkldnn_convolution_backward( REGISTER_ALL_CPU_DISPATCH(mkldnn_convolution_backward_stub, &mkldnn_convolution_backward); +TORCH_LIBRARY_IMPL(mkldnn, CPU, m) { + m.impl( + TORCH_SELECTIVE_NAME("mkldnn::_convolution_pointwise"), + TORCH_FN(mkldnn_convolution_pointwise)); + m.impl( + TORCH_SELECTIVE_NAME("mkldnn::_convolution_pointwise.binary"), + TORCH_FN(mkldnn_convolution_pointwise_binary)); + m.impl( + TORCH_SELECTIVE_NAME("mkldnn::_convolution_pointwise_.binary"), + TORCH_FN(mkldnn_convolution_pointwise_binary_)); +} + +TORCH_LIBRARY_IMPL(mkldnn, MkldnnCPU, m) { + m.impl( + TORCH_SELECTIVE_NAME("mkldnn::_convolution_pointwise"), + TORCH_FN(mkldnn_convolution_pointwise)); + m.impl( + TORCH_SELECTIVE_NAME("mkldnn::_convolution_pointwise.binary"), + TORCH_FN(mkldnn_convolution_pointwise_binary)); + m.impl( + TORCH_SELECTIVE_NAME("mkldnn::_convolution_pointwise_.binary"), + TORCH_FN(mkldnn_convolution_pointwise_binary_)); +} }} // namespace at::native #endif diff --git a/aten/src/ATen/native/mkldnn/Copy.cpp b/aten/src/ATen/native/mkldnn/Copy.cpp index eb45bad99264..088353f06b45 100644 --- a/aten/src/ATen/native/mkldnn/Copy.cpp +++ b/aten/src/ATen/native/mkldnn/Copy.cpp @@ -1,6 +1,12 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include + +#ifndef AT_PER_OPERATOR_HEADERS #include +#else +#include +#endif #if !AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/Gelu.cpp b/aten/src/ATen/native/mkldnn/Gelu.cpp index 1d2a67251513..71ab0b92f545 100644 --- a/aten/src/ATen/native/mkldnn/Gelu.cpp +++ 
b/aten/src/ATen/native/mkldnn/Gelu.cpp @@ -1,8 +1,15 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif + #if !AT_MKLDNN_ENABLED() namespace at { namespace native { diff --git a/aten/src/ATen/native/mkldnn/IDeepRegistration.cpp b/aten/src/ATen/native/mkldnn/IDeepRegistration.cpp index b99527480bf2..97f8f8951959 100644 --- a/aten/src/ATen/native/mkldnn/IDeepRegistration.cpp +++ b/aten/src/ATen/native/mkldnn/IDeepRegistration.cpp @@ -1,5 +1,6 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include +#include #if AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/Linear.cpp b/aten/src/ATen/native/mkldnn/Linear.cpp index 0138190de78a..894e54eefb1c 100644 --- a/aten/src/ATen/native/mkldnn/Linear.cpp +++ b/aten/src/ATen/native/mkldnn/Linear.cpp @@ -1,6 +1,23 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include #include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif #if !AT_MKLDNN_ENABLED() @@ -160,7 +177,260 @@ std::tuple mkldnn_linear_backward( return std::tuple{grad_input, grad_weight, grad_bias}; } +Tensor mkldnn_linear_pointwise( + const Tensor& input_t, + const Tensor& weight_t, + const c10::optional& bias_opt, + c10::string_view attr, + torch::List> scalars, + c10::optional algorithm) { + auto input = input_t.contiguous(); + auto input_size = input.sizes(); + + const int64_t dim = input.dim(); + auto input_reshaped = + dim == 2 ? input : input.reshape({-1, input.size(input.dim() - 1)}); + + std::vector output_size(input_size.begin(), input_size.end() - 1); + output_size.push_back(weight_t.size(0)); + auto output = at::empty(output_size, input.options()); + + if (dim != 2) { + std::vector output_size_reshaped = {input_reshaped.size(0), + weight_t.size(0)}; + output = output.reshape(output_size_reshaped); + } + + c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset); + ideep::tensor mkldnn_output = itensor_from_tensor(output); + + c10::MaybeOwned bias_maybe_owned = + at::borrow_from_optional_tensor(bias_opt); + const Tensor& bias = *bias_maybe_owned; + + const ideep::tensor mkldnn_input = itensor_view_from_dense(input_reshaped); + + c10::optional mkldnn_bias{c10::nullopt}; + if (bias.defined()) { + mkldnn_bias = itensor_from_tensor(bias); + } + const ideep::tensor w = itensor_from_tensor(weight_t); + + auto it = fusion_unary_attr_map().find(attr); + TORCH_CHECK( + it != fusion_unary_attr_map().end(), "Fusion behavior undefined."); + ideep::attr_t op_attr = it->second(scalars, algorithm); + + if (mkldnn_bias.has_value()) { + ideep::inner_product_forward::compute( + mkldnn_input, + w, + mkldnn_bias.value(), + mkldnn_output, + ideep::scale_t(), + ideep::scale_t(), + ideep::scale_t(), + op_attr); + } else { + ideep::inner_product_forward::compute( + mkldnn_input, + w, + mkldnn_output, + ideep::scale_t(), + ideep::scale_t(), + ideep::scale_t(), + op_attr); + } + + if (dim != 2) { + output = output.reshape(output_size); + } + + return output; +} + +Tensor mkldnn_linear_pointwise_binary( + const Tensor& input_t, + const Tensor& other_t, + const Tensor& weight_t, + const c10::optional& bias_opt, + c10::string_view attr) { + c10::MaybeOwned bias_maybe_owned = + at::borrow_from_optional_tensor(bias_opt); + const Tensor& bias = *bias_maybe_owned; + // Make sure inputs have same type(device, layout, 
dtype), device is cpu and + // dtype is float or bfloat16. + check_mkldnn_binary_fusion_inputs(input_t, other_t, weight_t, bias); + + auto input = input_t.contiguous(); + + auto it_binary = fusion_binary_alg_map().find(attr); + TORCH_CHECK( + it_binary != fusion_binary_alg_map().end(), "Fusion behavior undefined."); + + auto input_size = input.sizes(); + + const int64_t dim = input.dim(); + auto input_reshaped = + dim == 2 ? input : input.reshape({-1, input.size(input.dim() - 1)}); + + std::vector output_size(input_size.begin(), input_size.end() - 1); + output_size.push_back(weight_t.size(0)); + auto output = at::empty(output_size, input.options()); + auto other_reshaped = other_t.contiguous(); + + if (dim != 2) { + std::vector output_size_reshaped = { + input_reshaped.size(0), weight_t.size(0)}; + output = output.reshape(output_size_reshaped); + other_reshaped = other_reshaped.reshape(output_size_reshaped); + } + + TORCH_CHECK( + output.sizes() == other_reshaped.sizes(), + "linear_binary_run expects the size of output and other tensor to be the same"); + + c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset); + ideep::tensor mkldnn_output = itensor_from_tensor(output); + const ideep::tensor mkldnn_other = itensor_from_tensor(other_reshaped); + const ideep::tensor mkldnn_input = itensor_view_from_dense(input_reshaped); + + c10::optional mkldnn_bias{c10::nullopt}; + if (bias.defined()) { + mkldnn_bias = itensor_from_tensor(bias); + } + const ideep::tensor w = itensor_from_tensor(weight_t); + + auto other_desc = mkldnn_other.get_desc(); + auto op_attr = ideep::attr_t::fuse_binary(it_binary->second, other_desc); + + if (mkldnn_bias.has_value()) { + ideep::inner_product_forward::compute_binary( + mkldnn_input, + mkldnn_other, + w, + mkldnn_bias.value(), + mkldnn_output, + op_attr); + } else { + ideep::inner_product_forward::compute_binary( + mkldnn_input, mkldnn_other, w, mkldnn_output, op_attr); + } + + if (dim != 2) { + output = output.reshape(output_size); + } + + return output; +} + +TORCH_LIBRARY_IMPL(mkldnn, CPU, m) { + m.impl( + TORCH_SELECTIVE_NAME("mkldnn::_linear_pointwise"), + TORCH_FN(mkldnn_linear_pointwise)); + m.impl( + TORCH_SELECTIVE_NAME("mkldnn::_linear_pointwise.binary"), + TORCH_FN(mkldnn_linear_pointwise_binary)); +} + } // namespace native } // namespace at #endif // AT_MKLDNN_ENABLED + +#if AT_MKL_ENABLED() && AT_MKLDNN_ENABLED() +#include + +namespace at { +namespace native { + +Tensor mkl_linear( + const Tensor& self, + const Tensor& mkl_weight_t, + const Tensor& origin_weight_t, + const c10::optional& bias_opt, + const int64_t prepack_batch_size) { + c10::MaybeOwned bias_maybe_owned = + at::borrow_from_optional_tensor(bias_opt); + const Tensor& bias = *bias_maybe_owned; + TORCH_CHECK( + self.options().type_equal(origin_weight_t.options()), + "Input type (", + self.toString(), + ") and weight type (", + origin_weight_t.toString(), + ") should be the same"); + TORCH_CHECK( + !bias.defined() || (self.options().type_equal(bias.options())), + "Input type (", + self.toString(), + ") and bias type (", + bias.toString(), + ") should be the same"); + TORCH_CHECK( + mkl_weight_t.scalar_type() == origin_weight_t.scalar_type() && + origin_weight_t.scalar_type() == kFloat, + "mkl_linear: weight dtype should be float"); + + c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset); + auto input_size = self.sizes(); + std::vector output_size(input_size.begin(), input_size.end() - 1); + output_size.push_back(origin_weight_t.size(0)); + auto output = 
at::empty(output_size, self.options()); + int64_t M = self.numel() / self.size(self.dim() - 1); + if (M == prepack_batch_size && mkl_weight_t.is_mkldnn()) { + auto self_ = self.is_contiguous() ? self : self.contiguous(); + auto K = origin_weight_t.size(1); + auto N = origin_weight_t.size(0); + const ideep::tensor& w = itensor_from_mkldnn(mkl_weight_t); + auto in_ptr = self_.data_ptr(); + auto weight_ptr = (float*)(w.get_data_handle()); + auto out_ptr = output.data_ptr(); + if (bias.defined()) { + auto bias_ = bias.is_contiguous() ? bias : bias.contiguous(); + auto bias_ptr = bias_.data_ptr(); +#ifdef _OPENMP +#if (_OPENMP >= 201307) +#pragma omp parallel for simd schedule( \ + static) if (omp_get_max_threads() > 1 && !omp_in_parallel()) +#else +#pragma omp parallel for schedule( \ + static) if (omp_get_max_threads() > 1 && !omp_in_parallel()) +#endif +#endif + for (int64_t i = 0; i < M; ++i) { + memcpy(out_ptr + i * N, bias_ptr, sizeof(float) * N); + } + } + cblas_sgemm_compute( + CblasRowMajor, + CblasNoTrans, + CblasPacked, + M, + N, + K, + in_ptr, + K, + weight_ptr, + K, + bias.defined() ? 1.f : 0.f, + out_ptr, + N); + } else { + output = at::linear_out(output, self, origin_weight_t, bias_opt); + } + return output; +} + +TORCH_LIBRARY_IMPL(mkl, CPU, m) { + m.impl(TORCH_SELECTIVE_NAME("mkl::_mkl_linear"), TORCH_FN(mkl_linear)); +} + +TORCH_LIBRARY_IMPL(mkl, MkldnnCPU, m) { + m.impl(TORCH_SELECTIVE_NAME("mkl::_mkl_linear"), TORCH_FN(mkl_linear)); +} + +} // namespace native +} // namespace at + +#endif // AT_MKL_ENABLED && AT_MKLDNN_ENABLED diff --git a/aten/src/ATen/native/mkldnn/MKLDNNCommon.h b/aten/src/ATen/native/mkldnn/MKLDNNCommon.h index a86d1c4b722c..783195361570 100644 --- a/aten/src/ATen/native/mkldnn/MKLDNNCommon.h +++ b/aten/src/ATen/native/mkldnn/MKLDNNCommon.h @@ -1,6 +1,6 @@ #pragma once -#include +#include #include #if AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/MKLDNNConversions.cpp b/aten/src/ATen/native/mkldnn/MKLDNNConversions.cpp index fbfb329a5e93..d643fae22ca2 100644 --- a/aten/src/ATen/native/mkldnn/MKLDNNConversions.cpp +++ b/aten/src/ATen/native/mkldnn/MKLDNNConversions.cpp @@ -1,9 +1,23 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include +#include #include #include #include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { @@ -75,7 +89,8 @@ Tensor mkldnn_reorder_conv2d_weight( IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, - int64_t groups) { + int64_t groups, + c10::OptionalArrayRef input_size) { if (self.scalar_type() == ScalarType::BFloat16) { TORCH_CHECK(mkldnn_bf16_device_check(), "mkldnn_reorder_conv2d_weight: bf16 path needs the cpu support avx512bw, avx512vl and avx512dq"); @@ -93,16 +108,28 @@ Tensor mkldnn_reorder_conv2d_weight( w.reshape({wdims[0] * wdims[1], wdims[2], wdims[3], wdims[4]}); } - auto desc = - ideep::convolution_forward::expected_weights_desc( - w.get_dims(), - w.get_data_type(), - {stride.begin(), stride.end()}, - {padding.begin(), padding.end()}, - {padding.begin(), padding.end()}, - {dilation.begin(), dilation.end()}, - groups, - ideep::algorithm::convolution_direct); + ideep::dims src_dims = ideep::dims(); + bool is_channels_last = false; + if (input_size.has_value()) { + src_dims = input_size.value().vec(); + // if has input size, we always use channels last. 
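  // Editor's aside (not part of the original patch): in mkl_linear above, the
  // prepacked fast path only triggers when the runtime row count M
  // (numel / last dimension) equals the batch size the weight was packed for.
  // The bias is first memcpy'd into every output row, and cblas_sgemm_compute
  // is then called with beta = 1.0f so the GEMM accumulates onto it,
  // out = input @ weight^T + bias; without a bias, beta = 0.0f simply
  // overwrites the buffer. CblasPacked marks the B operand as the buffer
  // produced by cblas_sgemm_pack (see mkl_reorder_linear_weight later in this
  // patch). Otherwise the code falls back to at::linear_out.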
+ is_channels_last = true; + } + + auto desc = ideep::convolution_forward::expected_weights_desc( + w.get_dims(), + w.get_data_type(), + {stride.begin(), stride.end()}, + {padding.begin(), padding.end()}, + {padding.begin(), padding.end()}, + {dilation.begin(), dilation.end()}, + groups, + ideep::algorithm::convolution_direct, + ideep::prop_kind::forward, + w.get_data_type(), + src_dims, + ideep::attr_t(), + is_channels_last); ideep::tensor result; result.init(desc); result.feed_from(w); @@ -156,7 +183,8 @@ Tensor mkldnn_reorder_conv2d_weight( IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, - int64_t groups) { + int64_t groups, + c10::OptionalArrayRef input_size) { TORCH_CHECK(false, "mkldnn_reorder_conv2d_weight: MKL-DNN build is disabled"); } @@ -171,4 +199,48 @@ Tensor mkldnn_reorder_conv3d_weight( #endif // AT_MKLDNN_ENABLED() +#if AT_MKL_ENABLED() && AT_MKLDNN_ENABLED() +#include + +Tensor mkl_reorder_linear_weight( + const Tensor& weight, + const int64_t batch_size) { + TORCH_CHECK( + weight.scalar_type() == ScalarType::Float, + "reorder_linear_weight: weight's dtype should be float"); + c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset); + auto M = batch_size; + auto N = weight.size(0); + auto K = weight.size(1); + int64_t pack_size = + (int64_t)(cblas_sgemm_pack_get_size(CblasBMatrix, M, N, K) / sizeof(float) + 1); + auto packed_weight = empty_mkldnn( + {pack_size, 1}, + weight.scalar_type(), + weight.options().layout_opt(), + weight.options().device_opt(), + weight.options().pinned_memory_opt()); + ideep::tensor& mkl_weight = itensor_from_mkldnn(packed_weight); + ideep::tensor& orig_w = itensor_from_mkldnn(weight); + cblas_sgemm_pack( + CblasRowMajor, + CblasBMatrix, + CblasTrans, + M, + N, + K, + 1.0f, + (float*)(orig_w.get_data_handle()), + K, + (float*)(mkl_weight.get_data_handle())); + return packed_weight; +} + +TORCH_LIBRARY_IMPL(mkl, MkldnnCPU, m) { + m.impl( + TORCH_SELECTIVE_NAME("mkl::_mkl_reorder_linear_weight"), + TORCH_FN(mkl_reorder_linear_weight)); +} + +#endif // AT_MKL_ENABLED && AT_MKLDNN_ENABLED }} diff --git a/aten/src/ATen/native/mkldnn/Matmul.cpp b/aten/src/ATen/native/mkldnn/Matmul.cpp index 9b07dbfcee5f..383d29659230 100644 --- a/aten/src/ATen/native/mkldnn/Matmul.cpp +++ b/aten/src/ATen/native/mkldnn/Matmul.cpp @@ -1,7 +1,9 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include +#include #include + #if !AT_MKLDNN_ENABLED() namespace at { @@ -127,11 +129,24 @@ void mkldnn_matmul( (mat1.dim() == 2 && mat2.dim() == 1) || // aten::mv (mat1.dim() == 1 && mat2.dim() == 1), // aten::dot "mkldnn_matmul: unsupported dims for mat and mat2"); - TORCH_CHECK(mat1.scalar_type() == at::kBFloat16 && - mat2.scalar_type() == at::kBFloat16 && - result.scalar_type() == at::kBFloat16, "mkldnn_matmul: only enabled for bf16 path"); + TORCH_CHECK(mkldnn_bf16_device_check(), - "mkldnn_matmul: mkldnn_matmul bf16 path needs the cpu support avx512bw, avx512vl and avx512dq"); + "mkldnn_matmul: mkldnn_matmul bf16 path needs the cpu support avx512bw, avx512vl and avx512dq, or AWS Graviton3"); + +#if defined(__aarch64__) + if (mkldnn_bf16_device_check_arm()) { + //onednn fastmath mode can leverage bf16 HW even for the fp32 input, e.g. 
Arm Neoverse V1 + //so, don't restrict the mkldnn_matmul only for bf16 inputs, allow it for float as well + TORCH_CHECK((mat1.scalar_type() == mat2.scalar_type()) && (mat1.scalar_type() == result.scalar_type()) && + ((mat1.scalar_type() == at::kFloat) || (mat1.scalar_type() == at::kBFloat16)), + "mkldnn_matmul: only enabled for fp32 and bf16 path"); + } else +#endif + { + TORCH_CHECK(mat1.scalar_type() == at::kBFloat16 && + mat2.scalar_type() == at::kBFloat16 && + result.scalar_type() == at::kBFloat16, "mkldnn_matmul: only enabled for bf16 path"); + } auto mat1_unsqueezed = mat1.dim() == 1 ? mat1.unsqueeze(0) : mat1; auto mat2_unsqueezed = mat2.dim() == 1 ? mat2.unsqueeze(1) : mat2; @@ -209,14 +224,29 @@ bool use_mkldnn_bf16_matmul( const Tensor& mat1, const Tensor& mat2, const Tensor& result) { - return ( - use_mkldnn_bf16_matmul() && - mat1.scalar_type() == kBFloat16 && - mat2.scalar_type() == kBFloat16 && - (!result.defined() || result.scalar_type() == kBFloat16) && - mat1.numel() != 0 && - mat2.numel() != 0 && - checksize(mat1, mat2)); +#if defined(__aarch64__) + if (mkldnn_bf16_device_check_arm()) { + //onednn fastmath mode can leverage bf16 HW even for the fp32 input, e.g. Arm Neoverse V1 + //so, don't restrict the mkldnn_matmul only for bf16 inputs, allow it for float as well + return ( + use_mkldnn_bf16_matmul() && + (mat1.scalar_type() == mat2.scalar_type()) && (!result.defined() || (mat1.scalar_type() == result.scalar_type())) && + ((mat1.scalar_type() == kFloat) || (mat1.scalar_type() == kBFloat16)) && + mat1.numel() != 0 && + mat2.numel() != 0 && + checksize(mat1, mat2)); + } else +#endif + { + return ( + use_mkldnn_bf16_matmul() && + mat1.scalar_type() == kBFloat16 && + mat2.scalar_type() == kBFloat16 && + (!result.defined() || result.scalar_type() == kBFloat16) && + mat1.numel() != 0 && + mat2.numel() != 0 && + checksize(mat1, mat2)); + } } } // namespace native diff --git a/aten/src/ATen/native/mkldnn/Matmul.h b/aten/src/ATen/native/mkldnn/Matmul.h index 63426714933b..999cae99a7e0 100644 --- a/aten/src/ATen/native/mkldnn/Matmul.h +++ b/aten/src/ATen/native/mkldnn/Matmul.h @@ -1,6 +1,6 @@ #pragma once -#include +#include #include #include // For TransposeType diff --git a/aten/src/ATen/native/mkldnn/MkldnnTensorMath.cpp b/aten/src/ATen/native/mkldnn/MkldnnTensorMath.cpp index 71d23a34425e..c12db6d6b7e9 100644 --- a/aten/src/ATen/native/mkldnn/MkldnnTensorMath.cpp +++ b/aten/src/ATen/native/mkldnn/MkldnnTensorMath.cpp @@ -1,10 +1,16 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + #if !AT_MKLDNN_ENABLED() namespace at { diff --git a/aten/src/ATen/native/mkldnn/Normalization.cpp b/aten/src/ATen/native/mkldnn/Normalization.cpp index 83750ae51263..d0171865fac6 100644 --- a/aten/src/ATen/native/mkldnn/Normalization.cpp +++ b/aten/src/ATen/native/mkldnn/Normalization.cpp @@ -1,8 +1,17 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#include +#include +#endif + #if !AT_MKLDNN_ENABLED() namespace at { @@ -32,6 +41,23 @@ std::tuple mkldnn_layer_norm_last_index_weight_bias_f32( TORCH_CHECK(false, "mkldnn_layer_norm_last_index_weight_bias_f32: ATen not compiled with MKLDNN support"); } +std::tuple _mkldnn_batch_norm_legit( + const Tensor& input, const c10::optional& weight_opt, const c10::optional& bias_opt, Tensor& running_mean, Tensor& 
running_var, + bool train, + double momentum, + double eps) { + TORCH_CHECK(false, "_mkldnn_batch_norm_legit: ATen not compiled with MKLDNN support"); +} + + +std::tuple _mkldnn_batch_norm_legit_no_stats( + const Tensor& input, const c10::optional& weight_opt, const c10::optional& bias_opt, + bool train, + double momentum, + double eps) { + TORCH_CHECK(false, "_mkldnn_batch_norm_legit_no_stats: ATen not compiled with MKLDNN support"); +} + } // namespace native } // namespace at @@ -164,6 +190,25 @@ std::tuple mkldnn_batch_norm( } } + +std::tuple _mkldnn_batch_norm_legit( + const Tensor& input, const c10::optional& weight_opt, const c10::optional& bias_opt, Tensor& running_mean, Tensor& running_var, + bool train, + double momentum, + double eps) { + return mkldnn_batch_norm(input, weight_opt, bias_opt, running_mean, running_var, train, momentum, eps); +} + + +std::tuple _mkldnn_batch_norm_legit_no_stats( + const Tensor& input, const c10::optional& weight_opt, const c10::optional& bias_opt, + bool train, + double momentum, + double eps) { + return mkldnn_batch_norm(input, weight_opt, bias_opt, Tensor(), Tensor(), train, momentum, eps); +} + + std::tuple mkldnn_batch_norm_backward(const Tensor& grad_output, const Tensor& input, const c10::optional& weight_opt, const c10::optional& running_mean_opt, const c10::optional& running_var_opt, const c10::optional& save_mean_opt, const c10::optional& save_invstd_opt, bool train, diff --git a/aten/src/ATen/native/mkldnn/Pooling.cpp b/aten/src/ATen/native/mkldnn/Pooling.cpp index 80cfa2efcc10..30ff49f49dd3 100644 --- a/aten/src/ATen/native/mkldnn/Pooling.cpp +++ b/aten/src/ATen/native/mkldnn/Pooling.cpp @@ -1,12 +1,28 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #if !AT_MKLDNN_ENABLED() @@ -502,7 +518,7 @@ Tensor mkldnn_adaptive_avg_pool2d( /*padding*/ {0, 0}, /*dilation*/ {1, 1}, /*ceil_mode*/ false, - /*algo*/ ideep::algorithm::pooling_avg); + /*algo*/ ideep::algorithm::pooling_avg_exclude_padding); } Tensor& mkldnn_adaptive_avg_pool2d_out_stub(const Tensor& input, diff --git a/aten/src/ATen/native/mkldnn/Prelu.cpp b/aten/src/ATen/native/mkldnn/Prelu.cpp index acc78211d83c..dc7d239da7b6 100644 --- a/aten/src/ATen/native/mkldnn/Prelu.cpp +++ b/aten/src/ATen/native/mkldnn/Prelu.cpp @@ -17,7 +17,7 @@ std::tuple mkldnn_prelu_backward(const Tensor& grad_output, cons }} -#else // AT_MKLDNN_EBABLED +#else // AT_MKLDNN_ENABLED #include #include @@ -76,4 +76,4 @@ std::tuple mkldnn_prelu_backward(const Tensor& grad_output, cons } }} -#endif // AT_MKLDNN_EBABLED +#endif // AT_MKLDNN_ENABLED diff --git a/aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp b/aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp index 44447441f6da..8841d65a2e78 100644 --- a/aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp +++ b/aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp @@ -34,6 +34,17 @@ TORCH_LIBRARY(mkldnn, m) { // NOLINTNEXTLINE(performance-move-const-arg,cppcoreguidelines-avoid-magic-numbers) std::move(std::get<7>(state))); }); + + m.def(TORCH_SELECTIVE_SCHEMA( + "mkldnn::_linear_pointwise(Tensor X, Tensor W, Tensor? B, str attr, Scalar?[] scalars, str? 
algorithm) -> Tensor Y")); + m.def(TORCH_SELECTIVE_SCHEMA( + "mkldnn::_linear_pointwise.binary(Tensor X, Tensor other, Tensor W, Tensor? B, str attr) -> Tensor Y")); + m.def(TORCH_SELECTIVE_SCHEMA( + "mkldnn::_convolution_pointwise(Tensor X, Tensor W, Tensor? B, int[] padding, int[] stride, int[] dilation, int groups, str attr, Scalar?[] scalars, str? algorithm) -> Tensor Y")); + m.def(TORCH_SELECTIVE_SCHEMA( + "mkldnn::_convolution_pointwise.binary(Tensor X, Tensor other, Tensor W, Tensor? B, int[] padding, int[] stride, int[] dilation, int groups, str binary_attr, Scalar? alpha, str? unary_attr, Scalar?[] unary_scalars, str? unary_algorithm) -> Tensor Y")); + m.def(TORCH_SELECTIVE_SCHEMA( + "mkldnn::_convolution_pointwise_.binary(Tensor X, Tensor(a!) other, Tensor W, Tensor? B, int[] padding, int[] stride, int[] dilation, int groups, str binary_attr, Scalar? alpha, str? unary_attr, Scalar?[] unary_scalars, str? unary_algorithm) -> Tensor(a!) Y")); } TORCH_LIBRARY(mkldnn_prepacked, m) { @@ -58,3 +69,22 @@ TORCH_LIBRARY_IMPL(mkldnn_prepacked, CPU, m) { } // namespace at #endif // AT_MKLDNN_ENABLED() + +#if AT_MKL_ENABLED() && AT_MKLDNN_ENABLED() + +namespace at { +namespace native { +namespace mkl { + +TORCH_LIBRARY(mkl, m) { + m.def(TORCH_SELECTIVE_SCHEMA( + "mkl::_mkl_reorder_linear_weight(Tensor X, int batch_size) -> Tensor")); + m.def(TORCH_SELECTIVE_SCHEMA( + "mkl::_mkl_linear(Tensor X, Tensor MKL_W, Tensor ORI_W, Tensor? B, int batch_size) -> Tensor")); +} + +} // namespace mkl +} // namespace native +} // namespace at + +#endif // AT_MKL_ENABLED && AT_MKLDNN_ENABLED diff --git a/aten/src/ATen/native/mkldnn/Relu.cpp b/aten/src/ATen/native/mkldnn/Relu.cpp index 517fa6aa6444..ace99f7706e0 100644 --- a/aten/src/ATen/native/mkldnn/Relu.cpp +++ b/aten/src/ATen/native/mkldnn/Relu.cpp @@ -1,7 +1,13 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include // for mkldnn_relu, mkldnn_... 
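// Editor's note on the schemas registered above (not part of the original
// patch): "Tensor?" marks an optional tensor argument, "Scalar?[]" a list of
// optional scalars, and "Tensor(a!) other" declares that the operator writes
// `other` in place and returns a tensor aliasing it. That is why the in-place
// fusion op mkldnn::_convolution_pointwise_.binary returns Tensor(a!) while
// the out-of-place variants return a fresh Tensor Y.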
+#include // for mkldnn_relu_backward +#endif #if !AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/SoftMax.cpp b/aten/src/ATen/native/mkldnn/SoftMax.cpp index 743584544ef9..d49643ee1ad3 100644 --- a/aten/src/ATen/native/mkldnn/SoftMax.cpp +++ b/aten/src/ATen/native/mkldnn/SoftMax.cpp @@ -1,6 +1,12 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include + +#ifndef AT_PER_OPERATOR_HEADERS #include +#else +#include // for mkldnn_softmax +#endif #if !AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/TensorFactories.cpp b/aten/src/ATen/native/mkldnn/TensorFactories.cpp index a944d4db19b6..65a22aa74ed5 100644 --- a/aten/src/ATen/native/mkldnn/TensorFactories.cpp +++ b/aten/src/ATen/native/mkldnn/TensorFactories.cpp @@ -1,10 +1,14 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -namespace at { namespace native { +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif -Tensor empty_symint_mkldnn(c10::SymIntArrayRef sizes, c10::optional dtype, c10::optional layout, c10::optional device, c10::optional pin_memory, c10::optional optional_memory_format) { - return at::native::empty_mkldnn(c10::asIntArrayRefSlow(sizes), dtype, layout, device, pin_memory, optional_memory_format); -} +namespace at { namespace native { #if AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/TensorShape.cpp b/aten/src/ATen/native/mkldnn/TensorShape.cpp index ec3c58eda77f..1e54aae9d660 100644 --- a/aten/src/ATen/native/mkldnn/TensorShape.cpp +++ b/aten/src/ATen/native/mkldnn/TensorShape.cpp @@ -1,9 +1,19 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include -#include +#include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#include +#include +#endif + #if !AT_MKLDNN_ENABLED() namespace at { @@ -69,6 +79,9 @@ Tensor mkldnn_clone(const Tensor& self, c10::optional optiona } Tensor mkldnn_transpose(const Tensor& self, int64_t dim0, int64_t dim1) { + auto ndims = self.dim(); + dim0 = maybe_wrap_dim(dim0, ndims); + dim1 = maybe_wrap_dim(dim1, ndims); const ideep::tensor& x = itensor_from_mkldnn(self); ideep::tensor y; std::vector axes(x.ndims()); diff --git a/aten/src/ATen/native/mkldnn/UnaryOps.cpp b/aten/src/ATen/native/mkldnn/UnaryOps.cpp index f4a1a76c69b1..7f57d99ac176 100644 --- a/aten/src/ATen/native/mkldnn/UnaryOps.cpp +++ b/aten/src/ATen/native/mkldnn/UnaryOps.cpp @@ -1,6 +1,13 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include + +#ifndef AT_PER_OPERATOR_HEADERS #include +#else +#include // for mkldnn_sigmoid, mkldnn_... 
+#include // for mkldnn_tanh, mkldnn_tanh_ +#endif #if !AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/mkldnn/Utils.cpp b/aten/src/ATen/native/mkldnn/Utils.cpp index 62aeee407808..2c9bcc016e47 100644 --- a/aten/src/ATen/native/mkldnn/Utils.cpp +++ b/aten/src/ATen/native/mkldnn/Utils.cpp @@ -1,3 +1,4 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include #include @@ -32,4 +33,136 @@ std::vector pool_output_sizes( return output_size; } +void check_mkldnn_binary_fusion_inputs( + const Tensor& input, + const Tensor& other, + const Tensor& weight, + const Tensor& bias) { + if (!weight.is_mkldnn()) { + TORCH_CHECK( + input.options().type_equal(weight.options()), + "Input type (", + input.toString(), + ") and weight type (", + weight.toString(), + ") should be the same"); + } else { + TORCH_CHECK( + input.scalar_type() == input.scalar_type(), + "mkldnn pointwise binary: input dtype and weight dtype should be the same"); + } + TORCH_CHECK( + input.options().type_equal(other.options()), + "Input type (", + input.toString(), + ") and other type (", + other.toString(), + ") should be the same"); + TORCH_CHECK( + !bias.defined() || (input.options().type_equal(bias.options())), + "Input type (", + input.toString(), + ") and bias type (", + bias.toString(), + ") should be the same"); + TORCH_CHECK( + input.device().is_cpu(), + "mkldnn pointwise binary fusion: input's device should be CPU"); + TORCH_CHECK( + input.scalar_type() == ScalarType::Float || + input.scalar_type() == ScalarType::BFloat16, + "mkldnn pointwise binary: input's dtype should be float or bfloat16"); + if (input.scalar_type() == ScalarType::BFloat16) { + TORCH_CHECK( + mkldnn_bf16_device_check(), + "mkldnn pointwise binary: bf16 path needs the cpu support avx512bw, avx512vl and avx512dq"); + } +} + +#if AT_MKLDNN_ENABLED() + +#define ATTR_FUNC(NAME) \ + [](torch::List> scalars, \ + c10::optional algorithm) { \ + return ideep::attr_t::fuse_##NAME(); \ + } + +AttrFunction attr_func_leaky_relu = + [](torch::List> scalars, + c10::optional algorithm) { + TORCH_CHECK( + scalars.size() == 1 && + scalars[0].get().toOptional().has_value(), + "leaky_relu is expected to have one scalar input: negative_slope"); + auto alpha_value = + scalars[0].get().toOptional().value().to(); + return ideep::attr_t::fuse_relu(1.0, alpha_value); + }; + +AttrFunction attr_func_hardtanh = + [](torch::List> scalars, + c10::optional algorithm) { + TORCH_CHECK( + scalars.size() == 2 && + scalars[0].get().toOptional().has_value() && + scalars[1].get().toOptional().has_value(), + "hardtanh is expected to have two scalar input: min_val and max_val"); + + auto lower_bound_value = + scalars[0].get().toOptional().value().to(); + auto upper_bound_value = + scalars[1].get().toOptional().value().to(); + return ideep::attr_t::fuse_clamp(lower_bound_value, upper_bound_value); + }; + +AttrFunction attr_func_gelu = [](torch::List> scalars, + c10::optional algorithm) { + TORCH_CHECK( + algorithm.has_value(), + "gelu is expected to have one str input: algorithm"); + dnnl::algorithm gelu_type; + if (algorithm.value() == "none") { + gelu_type = dnnl::algorithm::eltwise_gelu_erf; + } else if (algorithm.value() == "tanh") { + gelu_type = dnnl::algorithm::eltwise_gelu_tanh; + } else { + TORCH_INTERNAL_ASSERT( + false, "Unsupported gelu algorithm: ", algorithm.value()); + } + + return ideep::attr_t::fuse_gelu(1.0, 0.f, 0.f, gelu_type); +}; + +const std::map& fusion_unary_attr_map() { + static const std::map fusion_attr_map{ + {"relu", ATTR_FUNC(relu)}, + {"sigmoid", 
ATTR_FUNC(sigmoid)}, + {"tanh", ATTR_FUNC(tanh)}, + {"swish", ATTR_FUNC(swish)}, + {"hardswish", ATTR_FUNC(hardswish)}, + {"leaky_relu", attr_func_leaky_relu}, + {"hardtanh", attr_func_hardtanh}, + {"gelu", attr_func_gelu}, + }; + return fusion_attr_map; +}; + +const std::map& fusion_unary_alg_map() { + static const std::map fusion_attr_map{ + {"relu", {ideep::algorithm::eltwise_relu}}, + }; + return fusion_attr_map; +}; + +const std::map& fusion_binary_alg_map() { + static const std::map fusion_attr_map{ + {"add", {ideep::algorithm::binary_add}}, + {"sub", {ideep::algorithm::binary_sub}}, + {"mul", {ideep::algorithm::binary_mul}}, + {"div", {ideep::algorithm::binary_div}}, + }; + return fusion_attr_map; +}; + +#endif // AT_MKLDNN_ENABLED() }} diff --git a/aten/src/ATen/native/mkldnn/Utils.h b/aten/src/ATen/native/mkldnn/Utils.h index a27b842be04b..a25be13c46da 100644 --- a/aten/src/ATen/native/mkldnn/Utils.h +++ b/aten/src/ATen/native/mkldnn/Utils.h @@ -1,11 +1,15 @@ #pragma once -#include +#include +#include +#include #include -#include -#include #include +#include +#if AT_MKLDNN_ENABLED() +#include +#endif // AT_MKLDNN_ENABLED() namespace at { namespace native { @@ -22,11 +26,41 @@ std::vector pool_output_sizes( IntArrayRef padding_r, IntArrayRef dilation, bool ceil_mode); + +void check_mkldnn_binary_fusion_inputs( + const Tensor& input, + const Tensor& other, + const Tensor& weight, + const Tensor& bias); + +#if AT_MKLDNN_ENABLED() + +using AttrFunction = std::function>, + c10::optional)>; + +const std::map& fusion_unary_attr_map(); + +const std::map& fusion_unary_alg_map(); + +const std::map& fusion_binary_alg_map(); + +#endif // AT_MKLDNN_ENABLED() }; inline bool mkldnn_bf16_device_check() { - return cpuinfo_initialize() && cpuinfo_has_x86_avx512bw() - && cpuinfo_has_x86_avx512vl() && cpuinfo_has_x86_avx512dq(); + return cpuinfo_initialize() && ((cpuinfo_has_x86_avx512bw() + && cpuinfo_has_x86_avx512vl() && cpuinfo_has_x86_avx512dq()) || (cpuinfo_has_arm_bf16())); +} + +#if defined(__aarch64__) +inline bool mkldnn_bf16_device_check_arm() { + return (cpuinfo_initialize() && cpuinfo_has_arm_bf16()); +} +#else +constexpr bool mkldnn_bf16_device_check_arm() { + return false; } +#endif } diff --git a/aten/src/ATen/native/mps/Copy.h b/aten/src/ATen/native/mps/Copy.h index 1a4465e73538..4ffa73d039ad 100644 --- a/aten/src/ATen/native/mps/Copy.h +++ b/aten/src/ATen/native/mps/Copy.h @@ -1,20 +1,7 @@ // Copyright © 2022 Apple Inc. 
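
Stepping back to the fusion tables added in mkldnn/Utils.{h,cpp} above: each entry of fusion_unary_attr_map() is an AttrFunction that turns an attr name plus optional scalars/algorithm into an ideep post-op attribute. The sketch below shows how a pointwise kernel is expected to consume the map. It assumes the template parameters stripped by the diff rendering are List<optional<Scalar>> keyed lookups with an optional<string_view> algorithm, and the helper name is the editor's own, not from the patch.

#include <ATen/native/mkldnn/Utils.h>
#include <ATen/core/List.h>

// Hypothetical lookup helper; the real callers live in the pointwise kernels.
ideep::attr_t unary_post_op_from_attr(
    c10::string_view attr,
    torch::List<c10::optional<at::Scalar>> scalars,
    c10::optional<c10::string_view> algorithm) {
  const auto& table = at::native::fusion_unary_attr_map();
  auto it = table.find(attr);
  TORCH_CHECK(it != table.end(), "unsupported unary post-op: ", attr);
  // e.g. attr == "leaky_relu" expects one scalar (negative_slope),
  //      attr == "gelu" expects algorithm to be "none" or "tanh".
  return it->second(scalars, algorithm);
}
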
#pragma once -#include - -#include -#include -#include -#include -#include -#include - -#ifdef __OBJC__ -#include -#include -#include -#endif +#include namespace at { namespace native { diff --git a/aten/src/ATen/native/mps/MPSGraphVenturaOps.h b/aten/src/ATen/native/mps/MPSGraphVenturaOps.h new file mode 100644 index 000000000000..b77db66795cf --- /dev/null +++ b/aten/src/ATen/native/mps/MPSGraphVenturaOps.h @@ -0,0 +1,17 @@ +#pragma once +#include + +// TODO: Remove me when moved to MacOS 13 +@interface MPSGraph (VenturaOps) +- (MPSGraphTensor *)cumulativeSumWithTensor:(MPSGraphTensor *)tensor + axis:(NSInteger)axis + name:(NSString *)name; + +- (MPSGraphTensor *)sortWithTensor:(MPSGraphTensor *)tensor + axis:(NSInteger)axis + name:(NSString *)name; + +- (MPSGraphTensor *)argSortWithTensor:(MPSGraphTensor *)tensor + axis:(NSInteger)axis + name:(NSString *)name; +@end diff --git a/aten/src/ATen/native/mps/OperationUtils.h b/aten/src/ATen/native/mps/OperationUtils.h index 32aede7fc5e0..93b014124339 100644 --- a/aten/src/ATen/native/mps/OperationUtils.h +++ b/aten/src/ATen/native/mps/OperationUtils.h @@ -37,6 +37,20 @@ struct TORCH_CUDA_CPP_API MPSGeneratorImpl : public c10::GeneratorImpl { const Generator& getDefaultMPSGenerator(); +struct MPSScalar { + id getMTLBuffer() const { return __builtin_bit_cast(id, buffer.get()); } + + size_t size = 0; + ScalarType type = ScalarType::Undefined; + c10::DataPtr buffer; // stores MTLBuffer (frees buffer if MPSScalar instance goes out of scope) + union { + float f; // MPS doesn't support 'double' + at::Half h; + int64_t i; + bool b; + } value {}; +}; + void runMPSGraph( MPSStream* mpsStream, MPSGraph* mpsGraph, @@ -45,10 +59,10 @@ void runMPSGraph( MPSDataType getMPSDataType(ScalarType scalar_type); MPSDataType getMPSScalarType(ScalarType scalar_type); +MPSScalar getMPSScalar(const Scalar& scalar, ScalarType type); std::string getMPSTypeString(ScalarType scalar_type); std::string getMPSShapeString(MPSShape* shape); std::string getTensorsStringKey(const TensorList& tensors, bool use_scalar_value = false); -double getMPSScalarValue(const Tensor& t); std::string getArrayRefString(const IntArrayRef s); // use has_storage() on the returned tensor to determine if src actually is a view Tensor gatherViewTensor(const at::Tensor& src, at::Tensor& dst); @@ -87,7 +101,7 @@ void resize_tensor(Tensor* output); MPSGraphTensor* trunc_tensor(MPSGraph* mpsGraph, MPSGraphTensor* inputTensor); MPSGraphTensor* castMPSTensor(MPSGraph *mpsGraph, MPSGraphTensor* tensor, ScalarType toType); MPSGraphTensorData *getMPSGraphTensorData(MPSGraph* mpsGraph, MPSStream* mpsStream, const Tensor& tensor); -MPSGraphTensorData* getMPSGraphTensorFromScalar(MPSStream* mpsStream, const Scalar& scalar, MPSDataType dataType); +MPSGraphTensorData* getMPSGraphTensorFromScalar(MPSStream* mpsStream, MPSScalar& scalar); MPSGraph* make_mps_graph(); void printTensorNDArray(const Tensor& t); @@ -95,6 +109,7 @@ void printTensorNDArray(const Tensor& t); MPSGraphTensor* mpsGraphUnrankedPlaceHolder(MPSGraph *mpsGraph, MPSDataType dataType); MPSGraphTensor* mpsGraphRankedPlaceHolder(MPSGraph *mpsGraph, MPSDataType dataType, MPSShape* mpsShape); MPSGraphTensor* mpsGraphRankedPlaceHolder(MPSGraph *mpsGraph, const Tensor& tensor); +MPSGraphTensor* mpsGraphScalarPlaceHolder(MPSGraph *mpsGraph, MPSDataType dataType); MPSGraphTensor* mpsGraphScalarPlaceHolder(MPSGraph *mpsGraph, const Scalar& scalar); string get_mem_format_string(c10::MemoryFormat memory_format); @@ -190,6 +205,11 @@ struct MPSGraphCache 
return result; } + template + inline T* CreateCachedGraphAs(const std::string& key, CreateCachedGraphBlock createCacheBlock, void* view_ptr = nullptr) { + return static_cast(CreateCachedGraph(key, createCacheBlock, view_ptr)); + } + MPSCachedGraph* LookUp(const std::string& key) const { __block MPSCachedGraph* result = nullptr; @@ -244,6 +264,7 @@ struct MPSGraphCache }; + } // namespace mps } // namespace native } // namespace at diff --git a/aten/src/ATen/native/mps/OperationUtils.mm b/aten/src/ATen/native/mps/OperationUtils.mm index 65c30a0b39ed..f41484b27b14 100644 --- a/aten/src/ATen/native/mps/OperationUtils.mm +++ b/aten/src/ATen/native/mps/OperationUtils.mm @@ -12,6 +12,7 @@ this->set_current_seed(random); return random; } + uint64_t MPSGeneratorImpl::current_seed() const { return seed_; } @@ -61,7 +62,7 @@ } void runMPSGraph(MPSStream* mpsStream, MPSGraph* mpsGraph, NSDictionary* feeds, NSDictionary* results) { - mpsStream->executeMPSGraph(mpsGraph, feeds, results); + mpsStream->executeMPSGraph(mpsGraph, feeds, results, SyncType::COMMIT_ADAPTIVE); } MPSDataType getMPSDataType(ScalarType scalar_type) { @@ -163,7 +164,7 @@ MPSDataType getMPSScalarType(ScalarType scalar_type) { str += getMPSTypeString(tensor.scalar_type()) + "["; // if tensor is a scalar if (tensor.dim() == 0) { - str += (use_scalar_value ? std::to_string(getMPSScalarValue(tensor)) : "Scalar"); + str += (use_scalar_value ? std::to_string(tensor.item().to()) : "Scalar"); } else { const NSString* ns_shape_key = [[getMPSShape(tensor) valueForKey:@"description"] componentsJoinedByString:@","]; str += std::string(ns_shape_key.UTF8String); @@ -176,26 +177,8 @@ MPSDataType getMPSScalarType(ScalarType scalar_type) { return str; } -double getMPSScalarValue(const Tensor& t) { - assert (t.dim() == 0); // only applicable for scalar types - auto other_value = t.item(); - return other_value.to(); -} - MPSShape* getMPSShape(const Tensor& t) { - const int sz = t.dim(); - const int sz_ = (sz > 0) ? sz : 1; - - NSNumber* numbers[sz_]; - - for (int i = 0; i < sz_; i++) - { - NSInteger sz_i = (i < sz) ? t.size(i) : 1; - - NSNumber* number = [NSNumber numberWithInteger:sz_i]; - numbers[i] = number; - } - return [NSArray arrayWithObjects:numbers count:sz_]; + return getMPSShape(t.sizes()); } MPSShape* getMPSShape(c10::MaybeOwned t) { @@ -207,16 +190,14 @@ MPSDataType getMPSScalarType(ScalarType scalar_type) { const int sz = sizes.size(); const int sz_ = (sz > 0) ? sz : 1; - NSNumber* numbers[sz_]; + std::vector numbers(sz_); - for (int i = 0; i < sz_; i++) - { + for (int i = 0; i < sz_; i++) { NSInteger sz_i = (i < sz) ? 
sizes[i] : 1; - NSNumber* number = [NSNumber numberWithInteger:sz_i]; numbers[i] = number; } - return [NSArray arrayWithObjects:numbers count:sz_]; + return [NSArray arrayWithObjects:numbers.data() count:numbers.size()]; } void printTensorNDArray(const Tensor& t) { @@ -250,9 +231,9 @@ void printTensorNDArray(const Tensor& t) { // use "_tensor" from Placeholder to retain view's output during its usage in other ops _tensor = gatherViewTensor(src, emptyShell); if (!_tensor.has_storage()) { - // if we cannot gather, we make the the tensor contiguous implicitly, and keep + // if we cannot gather, we make the tensor contiguous implicitly, and keep // it in placeholder to be able to retrieve it when we return from constructor - _tensor = src.contiguous(); + _tensor = src.clone(MemoryFormat::Contiguous); } srcBuf = getMTLBufferStorage(_tensor); } @@ -297,46 +278,36 @@ void printTensorNDArray(const Tensor& t) { return result; } -MPSGraphTensorData* getMPSGraphTensorFromScalar(MPSStream* mpsStream, const Scalar& scalar, MPSDataType dataType) { - union { - float f; // MPS doesn't support 'double' - at::Half h; - int64_t i; - bool b; - } v; - switch (dataType) { - case MPSDataTypeFloat32: - v.f = scalar.to(); - break; - case MPSDataTypeFloat16: - v.h = scalar.to(); - break; - case MPSDataTypeInt64: - v.i = scalar.to(); - break; - case MPSDataTypeInt32: - v.i = scalar.to(); - break; - case MPSDataTypeInt16: - v.i = scalar.to(); - break; - case MPSDataTypeInt8: - v.i = scalar.to(); - break; - case MPSDataTypeUInt8: - v.i = scalar.to(); - break; - case MPSDataTypeBool: - v.b = scalar.to(); - break; +MPSScalar getMPSScalar(const Scalar& scalar, ScalarType type) { + switch (type) { + case ScalarType::Double: + case ScalarType::Float: return {.value.f = scalar.to() , .size = sizeof(float) , .type = type}; + case ScalarType::Half: return {.value.h = scalar.to(), .size = sizeof(short) , .type = type}; + case ScalarType::Long: return {.value.i = scalar.to() , .size = sizeof(int64_t), .type = type}; + case ScalarType::Int: return {.value.i = scalar.to() , .size = sizeof(int32_t), .type = type}; + case ScalarType::Short: return {.value.i = scalar.to() , .size = sizeof(int16_t), .type = type}; + case ScalarType::Char: return {.value.i = scalar.to() , .size = sizeof(int8_t) , .type = type}; + case ScalarType::Byte: return {.value.i = scalar.to() , .size = sizeof(uint8_t), .type = type}; + case ScalarType::Bool: return {.value.b = scalar.to() , .size = sizeof(bool) , .type = type}; default: - TORCH_INTERNAL_ASSERT(false, "Unsupported scalar type on MPS backend.") + TORCH_INTERNAL_ASSERT(false, "Unsupported scalar type '", type, "' on MPS backend."); } +} - MPSNDArrayDescriptor *tensorDesc = [MPSNDArrayDescriptor descriptorWithDataType:dataType shape:@[@1]]; - MPSNDArray *tensorNDArray = [[[MPSNDArray alloc] initWithDevice:mpsStream->device() descriptor:tensorDesc] autorelease]; - [tensorNDArray writeBytes:&v strideBytes:nil]; - MPSGraphTensorData* result = [[[MPSGraphTensorData alloc] initWithMPSNDArray:tensorNDArray] autorelease]; +MPSGraphTensorData* getMPSGraphTensorFromScalar(MPSStream* mpsStream, MPSScalar& scalar) { + MPSGraphTensorData *result = nullptr; + // Scalar pools are only supported on devices with unified memory + if (mpsStream->device().hasUnifiedMemory) { + scalar.buffer = at::mps::allocate_scalar_buffer(&scalar.value, scalar.size); + result = [[[MPSGraphTensorData alloc] initWithMTLBuffer: scalar.getMTLBuffer() + shape: @[@1] + dataType: getMPSScalarType(scalar.type)] autorelease]; + } else { + 
MPSNDArrayDescriptor *tensorDesc = [MPSNDArrayDescriptor descriptorWithDataType:getMPSScalarType(scalar.type) shape:@[@1]]; + MPSNDArray *tensorNDArray = [[[MPSNDArray alloc] initWithDevice:mpsStream->device() descriptor:tensorDesc] autorelease]; + [tensorNDArray writeBytes:&scalar.value strideBytes:nil]; + result = [[[MPSGraphTensorData alloc] initWithMPSNDArray:tensorNDArray] autorelease]; + } return result; } @@ -368,6 +339,12 @@ void resize_tensor(Tensor* output) { name:nil]; } +MPSGraphTensor* mpsGraphScalarPlaceHolder(MPSGraph *mpsGraph, MPSDataType dataType) { + return [mpsGraph placeholderWithShape:@[@1] + dataType:dataType + name:nil]; +} + MPSGraphTensor* mpsGraphScalarPlaceHolder(MPSGraph *mpsGraph, const Scalar& scalar) { return [mpsGraph placeholderWithShape:@[@1] dataType:getMPSScalarType(scalar.type()) @@ -411,4 +388,4 @@ void executeMPSAllocatorCallback(void* ptr, EventType event) override { } } // namespace mps } // namespace native -} // namespace at +} // namespace at \ No newline at end of file diff --git a/aten/src/ATen/native/mps/TensorFactory.cpp b/aten/src/ATen/native/mps/TensorFactory.cpp index d280da4d9c65..2f4c04024536 100644 --- a/aten/src/ATen/native/mps/TensorFactory.cpp +++ b/aten/src/ATen/native/mps/TensorFactory.cpp @@ -5,6 +5,7 @@ #include #include #include +#include #include #include #include @@ -71,17 +72,6 @@ Tensor empty_mps( return at::detail::empty_mps(size, dtype_opt, layout_opt, device_opt, pin_memory_opt, memory_format_opt); } -Tensor empty_symint_mps( - c10::SymIntArrayRef size, - c10::optional dtype_opt, - c10::optional layout_opt, - c10::optional device_opt, - c10::optional pin_memory_opt, - c10::optional memory_format_opt) { - - return at::native::empty_mps(c10::asIntArrayRefSlow(size), dtype_opt, layout_opt, device_opt, pin_memory_opt, memory_format_opt); -} - Tensor empty_strided_mps( IntArrayRef size, IntArrayRef stride, diff --git a/aten/src/ATen/native/mps/operations/Activation.mm b/aten/src/ATen/native/mps/operations/Activation.mm index e929a41be2ce..618a00f33787 100644 --- a/aten/src/ATen/native/mps/operations/Activation.mm +++ b/aten/src/ATen/native/mps/operations/Activation.mm @@ -777,16 +777,17 @@ Tensor relu_mps(const Tensor& self) { MPSGraphTensor* normcdf (MPSGraph* mpsGraph, MPSGraphTensor *inputTensor) { // (1.0f + erf(x*SQRT1_2)) * 0.5f * x; + auto dataType = [inputTensor dataType]; const float SQRT1_2 = 0.707106781186547524400844362104849039f; - MPSGraphTensor *sqrt1_2 = [mpsGraph constantWithScalar:SQRT1_2 - shape:@[@1] - dataType:MPSDataTypeFloat32]; - MPSGraphTensor *onef = [mpsGraph constantWithScalar:1.0f - shape:@[@1] - dataType:MPSDataTypeFloat32]; - MPSGraphTensor *halff = [mpsGraph constantWithScalar:0.5f - shape:@[@1] - dataType:MPSDataTypeFloat32]; + MPSGraphTensor *sqrt1_2 = [mpsGraph constantWithScalar: SQRT1_2 + shape: @[@1] + dataType: dataType]; + MPSGraphTensor *onef = [mpsGraph constantWithScalar: 1.0f + shape: @[@1] + dataType: dataType]; + MPSGraphTensor *halff = [mpsGraph constantWithScalar: 0.5f + shape: @[@1] + dataType: dataType]; MPSGraphTensor *erfTensor = [mpsGraph multiplicationWithPrimaryTensor: inputTensor secondaryTensor: sqrt1_2 @@ -807,6 +808,7 @@ Tensor relu_mps(const Tensor& self) { ) { using namespace mps; TORCH_CHECK(output.is_mps()); + TORCH_CHECK(c10::isFloatingType(self.scalar_type()), "GELU is only implemented for floating types"); // Empty output if(output.numel() == 0) @@ -899,6 +901,7 @@ Tensor relu_mps(const Tensor& self) { CachedGraph *newCachedGraph = nil; @autoreleasepool { + 
auto dataType = getMPSDataType(self.scalar_type()); MPSGraph* mpsGraph = make_mps_graph(); newCachedGraph = new CachedGraph(mpsGraph); @@ -906,15 +909,15 @@ Tensor relu_mps(const Tensor& self) { getMPSDataType(grad.scalar_type()), getMPSShape(grad)); MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, - getMPSDataType(self.scalar_type()), + dataType, getMPSShape(self)); MPSGraphTensor* cdf = normcdf(mpsGraph, inputTensor); - MPSGraphTensor *halff = [mpsGraph constantWithScalar:-0.5f - shape:@[@1] - dataType:MPSDataTypeFloat32]; - MPSGraphTensor *betaf = [mpsGraph constantWithScalar:kBeta - shape:@[@1] - dataType:MPSDataTypeFloat32]; + MPSGraphTensor *halff = [mpsGraph constantWithScalar: -0.5f + shape: @[@1] + dataType: dataType]; + MPSGraphTensor *betaf = [mpsGraph constantWithScalar :kBeta + shape :@[@1] + dataType:dataType]; MPSGraphTensor *pdfMul = [mpsGraph squareWithTensor : inputTensor name : nil]; pdfMul = [mpsGraph multiplicationWithPrimaryTensor : pdfMul @@ -1456,19 +1459,20 @@ Tensor glu_backward_mps (const Tensor& grad_output, if(result.numel() == 0) return; - auto beta_f = beta.to(); - struct CachedGraph : public MPSCachedGraph { CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) {} MPSGraphTensor *inputTensor_ = nil; MPSGraphTensor *betaTensor_ = nil; + MPSGraphTensor *thresholdTensor_ = nil; MPSGraphTensor *outputTensor_ = nil; }; MPSGraphCache* cache_ = MPSGraphCache::getInstance(); MPSStream* stream = getCurrentMPSStream(); + MPSScalar beta_scalar = getMPSScalar(beta, ScalarType::Float); + MPSScalar threshold_scalar = getMPSScalar(threshold, ScalarType::Float); @autoreleasepool { string key = "softplus_out_mps:" + getTensorsStringKey({self}); @@ -1484,7 +1488,9 @@ Tensor glu_backward_mps (const Tensor& grad_output, newCachedGraph = new CachedGraph(mpsGraph); MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self); - MPSGraphTensor* betaTensor = mpsGraphScalarPlaceHolder(mpsGraph, beta); + MPSGraphTensor* betaTensor = mpsGraphScalarPlaceHolder(mpsGraph, getMPSDataType(ScalarType::Float)); + + MPSGraphTensor* thresholdTensor = mpsGraphScalarPlaceHolder(mpsGraph, getMPSDataType(ScalarType::Float)); MPSGraphTensor* reluTensor = [mpsGraph reLUWithTensor:inputTensor name:nil]; @@ -1497,9 +1503,6 @@ Tensor glu_backward_mps (const Tensor& grad_output, MPSGraphTensor* bxTensor = [mpsGraph multiplicationWithPrimaryTensor:inputTensor secondaryTensor:betaTensor name:nil]; - MPSGraphTensor* thresholdTensor = [mpsGraph constantWithScalar:threshold.to() - shape:@[@1] - dataType:getMPSDataType(self.scalar_type())]; MPSGraphTensor* predicateTensor = [mpsGraph greaterThanWithPrimaryTensor:bxTensor secondaryTensor:thresholdTensor name:nil]; @@ -1522,6 +1525,7 @@ Tensor glu_backward_mps (const Tensor& grad_output, newCachedGraph->inputTensor_ = inputTensor; newCachedGraph->betaTensor_ = betaTensor; + newCachedGraph->thresholdTensor_ = thresholdTensor; newCachedGraph->outputTensor_ = outputTensor; } return newCachedGraph; @@ -1534,7 +1538,8 @@ Tensor glu_backward_mps (const Tensor& grad_output, // Create dictionary of inputs and outputs NSDictionary* feeds = @{ selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(), - cachedGraph->betaTensor_ : getMPSGraphTensorFromScalar(stream, beta_f, MPSDataTypeFloat32) + cachedGraph->betaTensor_ : getMPSGraphTensorFromScalar(stream, beta_scalar), + cachedGraph->thresholdTensor_ : getMPSGraphTensorFromScalar(stream, threshold_scalar), }; NSDictionary* results = @{ outputPlaceholder.getMPSGraphTensor() 
: outputPlaceholder.getMPSGraphTensorData() @@ -1557,7 +1562,8 @@ Tensor glu_backward_mps (const Tensor& grad_output, if(grad_input.numel() == 0) return; - auto beta_f = beta.to(); + MPSScalar beta_scalar = getMPSScalar(beta, ScalarType::Float); + MPSScalar threshold_scalar = getMPSScalar(threshold, ScalarType::Float); struct CachedGraph : public MPSCachedGraph { @@ -1565,6 +1571,7 @@ Tensor glu_backward_mps (const Tensor& grad_output, MPSGraphTensor *gradOutputTensor_ = nil; MPSGraphTensor *inputTensor_ = nil; MPSGraphTensor *betaTensor_ = nil; + MPSGraphTensor *thresholdTensor_ = nil; MPSGraphTensor *outputTensor_ = nil; }; @@ -1588,7 +1595,9 @@ Tensor glu_backward_mps (const Tensor& grad_output, MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self); - MPSGraphTensor* betaTensor = mpsGraphScalarPlaceHolder(mpsGraph, beta); + MPSGraphTensor* betaTensor = mpsGraphScalarPlaceHolder(mpsGraph, getMPSScalarType(ScalarType::Float)); + + MPSGraphTensor* thresholdTensor = mpsGraphScalarPlaceHolder(mpsGraph, getMPSScalarType(ScalarType::Float)); MPSGraphTensor* unitTensor = [mpsGraph constantWithScalar:1.0 shape:@[@1] @@ -1607,9 +1616,6 @@ Tensor glu_backward_mps (const Tensor& grad_output, rTensor = [mpsGraph divisionWithPrimaryTensor:rTensor secondaryTensor:unitExpBxTensor name:nil]; - MPSGraphTensor* thresholdTensor = [mpsGraph constantWithScalar:threshold.to() - shape:@[@1] - dataType:getMPSDataType(self.scalar_type())]; MPSGraphTensor* predicateTensor = [mpsGraph greaterThanWithPrimaryTensor:bxTensor secondaryTensor:thresholdTensor name:nil]; @@ -1621,6 +1627,7 @@ Tensor glu_backward_mps (const Tensor& grad_output, newCachedGraph->gradOutputTensor_ = gradOutputTensor; newCachedGraph->inputTensor_ = inputTensor; newCachedGraph->betaTensor_ = betaTensor; + newCachedGraph->thresholdTensor_ = thresholdTensor; newCachedGraph->outputTensor_ = outputTensor; } return newCachedGraph; @@ -1635,7 +1642,8 @@ Tensor glu_backward_mps (const Tensor& grad_output, NSDictionary* feeds = @{ gradOutputPlaceholder.getMPSGraphTensor() : gradOutputPlaceholder.getMPSGraphTensorData(), selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(), - cachedGraph->betaTensor_ : getMPSGraphTensorFromScalar(stream, beta_f, MPSDataTypeFloat32) + cachedGraph->betaTensor_ : getMPSGraphTensorFromScalar(stream, beta_scalar), + cachedGraph->thresholdTensor_ : getMPSGraphTensorFromScalar(stream, threshold_scalar), }; NSDictionary* results = @{ gradInputPlaceholder.getMPSGraphTensor() : gradInputPlaceholder.getMPSGraphTensorData() @@ -2194,5 +2202,257 @@ Tensor prelu_mps(const Tensor& self, const Tensor& weight_) { return grad_input; } +Tensor& hardswish_out_mps(const Tensor& self, Tensor& output) { + using namespace mps; + using CachedGraph = MPSUnaryCachedGraph; + + TORCH_CHECK(self.is_mps()); + + if (output.numel() == 0) { + return output; + } + + MPSGraphCache* cache_ = MPSGraphCache::getInstance(); + + MPSStream* stream = at::mps::getCurrentMPSStream(); + + @autoreleasepool { + string key = "hardswish_out_mps" + getTensorsStringKey({self}); + CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + if (!cachedGraph) { + MPSCachedGraph* tmpCachedGraph = + cache_->CreateCachedGraph(key, ^MPSCachedGraph*() { + CachedGraph* newCachedGraph = nil; + @autoreleasepool { + MPSGraph* mpsGraph = make_mps_graph(); + newCachedGraph = new CachedGraph(mpsGraph); + MPSGraphTensor* inputTensor = + mpsGraphRankedPlaceHolder(mpsGraph, self); + + MPSGraphTensor* zeroTensor = [mpsGraph + 
constantWithScalar:0.0f + shape:@[ @1 ] + dataType:getMPSDataType(self.scalar_type())]; + + MPSGraphTensor* threeTensor = [mpsGraph + constantWithScalar:3.0f + shape:@[ @1 ] + dataType:getMPSDataType(self.scalar_type())]; + + MPSGraphTensor* negativeThreeTensor = [mpsGraph + constantWithScalar:-3.0f + shape:@[ @1 ] + dataType:getMPSDataType(self.scalar_type())]; + + MPSGraphTensor* sixTensor = [mpsGraph + constantWithScalar:6.0f + shape:@[ @1 ] + dataType:getMPSDataType(self.scalar_type())]; + + MPSGraphTensor* lessThanMinPredicateTensor = [mpsGraph + lessThanOrEqualToWithPrimaryTensor:inputTensor + secondaryTensor:negativeThreeTensor + name:nil]; + + MPSGraphTensor* lessThanMaxPredicateTensor = + [mpsGraph lessThanWithPrimaryTensor:inputTensor + secondaryTensor:threeTensor + name:nil]; + + MPSGraphTensor* inputPlusThreeTensor = + [mpsGraph additionWithPrimaryTensor:inputTensor + secondaryTensor:threeTensor + name:nil]; + + MPSGraphTensor* inputDivSixTensor = + [mpsGraph divisionWithPrimaryTensor:inputPlusThreeTensor + secondaryTensor:sixTensor + name:nil]; + + MPSGraphTensor* weightedTensor = + [mpsGraph multiplicationWithPrimaryTensor:inputTensor + secondaryTensor:inputDivSixTensor + name:nil]; + + MPSGraphTensor* tempTensor = + [mpsGraph selectWithPredicateTensor:lessThanMaxPredicateTensor + truePredicateTensor:weightedTensor + falsePredicateTensor:inputTensor + name:nil]; + + MPSGraphTensor* outputTensor = + [mpsGraph selectWithPredicateTensor:lessThanMinPredicateTensor + truePredicateTensor:zeroTensor + falsePredicateTensor:tempTensor + name:nil]; + newCachedGraph->inputTensor_ = inputTensor; + newCachedGraph->outputTensor_ = outputTensor; + } + return newCachedGraph; + }); + cachedGraph = static_cast(tmpCachedGraph); + } + Placeholder selfPlaceholder = Placeholder(cachedGraph->inputTensor_, self); + Placeholder outputPlaceholder = + Placeholder(cachedGraph->outputTensor_, output); + + // Create dictionary of inputs and outputs + NSDictionary* feeds = @{ + selfPlaceholder.getMPSGraphTensor() : + selfPlaceholder.getMPSGraphTensorData() + }; + + NSDictionary* results = @{ + outputPlaceholder.getMPSGraphTensor() : + outputPlaceholder.getMPSGraphTensorData() + }; + + runMPSGraph(stream, cachedGraph->graph(), feeds, results); + } + return output; +} + +Tensor hardswish_mps(const Tensor& self) { + using namespace mps; + Tensor output = at::empty_like(self, self.suggest_memory_format()); + + return hardswish_out_mps(self, output); +} + +Tensor& hardswish_mps_(Tensor& self) { + using namespace mps; + Tensor& output = self; + + return hardswish_out_mps(self, output); +} + +Tensor hardswish_backward_mps(const Tensor& grad_output, const Tensor& self) { + using namespace mps; + + if (grad_output.numel() == 0) { + return grad_output; + } + + Tensor grad_input = at::empty_like(self, self.suggest_memory_format()); + + struct CachedGraph : public MPSCachedGraph { + CachedGraph(MPSGraph* graph) : MPSCachedGraph(graph) {} + MPSGraphTensor* gradOutputTensor_ = nil; + MPSGraphTensor* inputTensor_ = nil; + MPSGraphTensor* gradInputTensor_ = nil; + }; + + MPSGraphCache* cache_ = MPSGraphCache::getInstance(); + + MPSStream* stream = at::mps::getCurrentMPSStream(); + + @autoreleasepool { + string key = "hardswish_backward_mps" + getTensorsStringKey({self}); + CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + if (!cachedGraph) { + MPSCachedGraph* tmpCachedGraph = + cache_->CreateCachedGraph(key, ^MPSCachedGraph*() { + CachedGraph* newCachedGraph = nil; + @autoreleasepool { + MPSGraph* mpsGraph = 
make_mps_graph(); + newCachedGraph = new CachedGraph(mpsGraph); + MPSGraphTensor* gradOutputTensor = + mpsGraphRankedPlaceHolder(mpsGraph, grad_output); + MPSGraphTensor* inputTensor = + mpsGraphRankedPlaceHolder(mpsGraph, self); + + MPSGraphTensor* zeroTensor = [mpsGraph + constantWithScalar:0.0f + shape:@[ @1 ] + dataType:getMPSDataType(grad_output.scalar_type())]; + + MPSGraphTensor* unitTensor = [mpsGraph + constantWithScalar:1.0f + shape:@[ @1 ] + dataType:getMPSDataType(grad_output.scalar_type())]; + + MPSGraphTensor* threeTensor = [mpsGraph + constantWithScalar:3.0f + shape:@[ @1 ] + dataType:getMPSDataType(grad_output.scalar_type())]; + + MPSGraphTensor* negativeThreeTensor = [mpsGraph + constantWithScalar:-3.0f + shape:@[ @1 ] + dataType:getMPSDataType(grad_output.scalar_type())]; + + MPSGraphTensor* halfTensor = [mpsGraph + constantWithScalar:0.5f + shape:@[ @1 ] + dataType:getMPSDataType(grad_output.scalar_type())]; + + MPSGraphTensor* tempTensor = + [mpsGraph divisionWithPrimaryTensor:inputTensor + secondaryTensor:threeTensor + name:nil]; + + MPSGraphTensor* weightedTensor = + [mpsGraph additionWithPrimaryTensor:tempTensor + secondaryTensor:halfTensor + name:nil]; + + MPSGraphTensor* lessThanMinPredicateTensor = [mpsGraph + lessThanOrEqualToWithPrimaryTensor:inputTensor + secondaryTensor:negativeThreeTensor + name:nil]; + + MPSGraphTensor* lessThanMaxPredicateTensor = + [mpsGraph lessThanWithPrimaryTensor:inputTensor + secondaryTensor:threeTensor + name:nil]; + + MPSGraphTensor* lessThanMaxGradTensor = + [mpsGraph selectWithPredicateTensor:lessThanMaxPredicateTensor + truePredicateTensor:weightedTensor + falsePredicateTensor:unitTensor + name:nil]; + + MPSGraphTensor* gradTensor = + [mpsGraph selectWithPredicateTensor:lessThanMinPredicateTensor + truePredicateTensor:zeroTensor + falsePredicateTensor:lessThanMaxGradTensor + name:nil]; + MPSGraphTensor* gradInputTensor = + [mpsGraph multiplicationWithPrimaryTensor:gradTensor + secondaryTensor:gradOutputTensor + name:nil]; + + newCachedGraph->gradOutputTensor_ = gradOutputTensor; + newCachedGraph->inputTensor_ = inputTensor; + newCachedGraph->gradInputTensor_ = gradInputTensor; + } + return newCachedGraph; + }); + cachedGraph = static_cast(tmpCachedGraph); + } + + Placeholder gradOutputPlaceholder = + Placeholder(cachedGraph->gradOutputTensor_, grad_output); + Placeholder selfPlaceholder = Placeholder(cachedGraph->inputTensor_, self); + Placeholder gradInputPlaceholder = + Placeholder(cachedGraph->gradInputTensor_, grad_input); + + // Create dictionary of inputs and outputs + NSDictionary* feeds = @{ + gradOutputPlaceholder.getMPSGraphTensor() : + gradOutputPlaceholder.getMPSGraphTensorData(), + selfPlaceholder.getMPSGraphTensor() : + selfPlaceholder.getMPSGraphTensorData() + }; + + NSDictionary* results = @{ + gradInputPlaceholder.getMPSGraphTensor() : + gradInputPlaceholder.getMPSGraphTensorData() + }; + + runMPSGraph(stream, cachedGraph->graph(), feeds, results); + } + return grad_input; +} } // namespace native } // namespace at diff --git a/aten/src/ATen/native/mps/operations/AdaptivePooling.mm b/aten/src/ATen/native/mps/operations/AdaptivePooling.mm index 1d58de2902cf..e13deb805bb6 100644 --- a/aten/src/ATen/native/mps/operations/AdaptivePooling.mm +++ b/aten/src/ATen/native/mps/operations/AdaptivePooling.mm @@ -19,11 +19,27 @@ int64_t &strideH, int64_t &strideW, int64_t &kernel_sizeH, int64_t &kernel_sizeW) { - strideH = (int64_t) (isizeH / osizeH); - strideW = (int64_t) (isizeW / osizeW); + TORCH_CHECK((isizeH >= osizeH && 
isizeW >= osizeW) || (isizeH <= osizeH && isizeW <= osizeW), + "Adaptive pool MPS: Input height and width must both be greather than or equal to, or lesser than, output height and width") + + TORCH_CHECK((!(isizeH <= osizeH && isizeW <= osizeW) || (osizeH % isizeH == 0 && osizeW % isizeW == 0)), + "Adaptive pool MPS: If output is larger than input, output sizes must be multiples of input sizes") + + if(isizeH >= osizeH) { + strideH = (int64_t) (isizeH / osizeH); + strideW = (int64_t) (isizeW / osizeW); + + kernel_sizeH = isizeH - (osizeH-1) * strideH; + kernel_sizeW = isizeW - (osizeW-1) * strideW; + } + else { + strideH = (int64_t) (osizeH / isizeH); + strideW = (int64_t) (osizeW / isizeW); + + kernel_sizeH = osizeH - (isizeH-1) * strideH; + kernel_sizeW = osizeW - (isizeW-1) * strideW; + } - kernel_sizeH = isizeH - (osizeH-1) * strideH; - kernel_sizeW = isizeW - (osizeW-1) * strideW; } // Adaptive average pooling @@ -71,13 +87,33 @@ strideH, strideW, kernel_sizeH, kernel_sizeW); - output = at::avg_pool2d(input, - IntArrayRef({kernel_sizeH, kernel_sizeW}), - IntArrayRef({strideH, strideW}), - IntArrayRef({0, 0}), - false, - true, - c10::nullopt); + if(isizeH >= osizeH) { + output = at::avg_pool2d(input, + IntArrayRef({kernel_sizeH, kernel_sizeW}), + IntArrayRef({strideH, strideW}), + IntArrayRef({0, 0}), + false, + true, + c10::nullopt); + } else { + Tensor phony_grad = at::ones_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); + auto input_sizes = input.sizes(); + std::vector phony_shape{input_sizes.begin(), input_sizes.end() -2}; + phony_shape.push_back(output_size[0]); + phony_shape.push_back(output_size[1]); + phony_grad.resize_(IntArrayRef(phony_shape)); + output = at::avg_pool2d_backward(input, + phony_grad, + IntArrayRef({kernel_sizeH, kernel_sizeW}), + IntArrayRef({strideH, strideW}), + IntArrayRef({0, 0}), + false, + true, + c10::nullopt); + // Multiply output by kernel size + output = at::mul(output, kernel_sizeH*kernel_sizeW); + } + return output; } @@ -138,15 +174,27 @@ strideH, strideW, kernel_sizeH, kernel_sizeW); auto gradInput = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - if (gradInput.numel() != 0) - gradInput = at::avg_pool2d_backward(gradOutput, - input, - IntArrayRef({kernel_sizeH, kernel_sizeW}), - IntArrayRef({strideH, strideW}), - IntArrayRef({0, 0}), - false, - true, - c10::nullopt); + if (gradInput.numel() != 0) { + if(isizeH >= osizeH) { + gradInput = at::avg_pool2d_backward(gradOutput, + input, + IntArrayRef({kernel_sizeH, kernel_sizeW}), + IntArrayRef({strideH, strideW}), + IntArrayRef({0, 0}), + false, + true, + c10::nullopt); + } else { + gradInput = at::avg_pool2d(gradOutput, + IntArrayRef({kernel_sizeH, kernel_sizeW}), + IntArrayRef({strideH, strideW}), + IntArrayRef({0, 0}), + false, + true, + c10::nullopt); + gradInput = at::mul(gradInput, kernel_sizeH*kernel_sizeW); + } + } return gradInput; diff --git a/aten/src/ATen/native/mps/operations/BinaryOps.mm b/aten/src/ATen/native/mps/operations/BinaryOps.mm index b619307ef8aa..a246bb0c50f0 100644 --- a/aten/src/ATen/native/mps/operations/BinaryOps.mm +++ b/aten/src/ATen/native/mps/operations/BinaryOps.mm @@ -27,10 +27,6 @@ void binaryOpTensor(const Tensor& self, const Tensor& other, const Scalar& alpha, const Tensor& output_, std::string op_name, BinaryOpBlock binaryBlock) { - // it's possible to receive empty tensors here - if (self.numel() == 0 || other.numel() == 0) { - return; - } MPSStream* mpsStream = getCurrentMPSStream(); const bool is_self_scalar = self.dim() == 0; @@ -41,6 +37,11 @@ void 
binaryOpTensor(const Tensor& self, const Tensor& other, const Scalar& alpha output_.resize_(new_size); } + // it's possible to receive empty tensors here + if (self.numel() == 0 || other.numel() == 0) { + return; + } + Tensor output = output_; bool needsCopyToOutput = false; @@ -72,16 +73,37 @@ void binaryOpTensor(const Tensor& self, const Tensor& other, const Scalar& alpha // this type inference is only required at the time of graph creation const ScalarType common_dtype = c10::promoteTypes(self.scalar_type(), other.scalar_type()); - if (self.scalar_type() != common_dtype) { - primaryCastTensor = castMPSTensor(mpsGraph, newCachedGraph->primaryTensor, common_dtype); + + // Condition - + // 1. Division operation + // 2. Inputs are not float + bool div_condition = op_name.rfind("div", 0) == 0 + && (!(common_dtype == ScalarType::Float || common_dtype == ScalarType::Half)); + + auto compute_type = ScalarType::Float; + + if(div_condition) { + + if(output_.scalar_type() == ScalarType::Float || output_.scalar_type() == ScalarType::Half) + compute_type = output_.scalar_type(); + + primaryCastTensor = castMPSTensor(mpsGraph, newCachedGraph->primaryTensor, compute_type); + secondaryCastTensor = castMPSTensor(mpsGraph, newCachedGraph->secondaryTensor, compute_type); } - if (other.scalar_type() != common_dtype) { - secondaryCastTensor = castMPSTensor(mpsGraph, newCachedGraph->secondaryTensor, common_dtype); + else { + if (self.scalar_type() != common_dtype) { + primaryCastTensor = castMPSTensor(mpsGraph, newCachedGraph->primaryTensor, common_dtype); + } + if (other.scalar_type() != common_dtype) { + secondaryCastTensor = castMPSTensor(mpsGraph, newCachedGraph->secondaryTensor, common_dtype); + } } newCachedGraph->outputTensor = binaryBlock(newCachedGraph, primaryCastTensor, secondaryCastTensor); // Cast output tensor to an expected type if needed, which addresses discrepancy when int64 scalar is added to int32 tensor // Output tensor should have been promoted but it remains an int32 tensor - if (output_.scalar_type() != common_dtype) { + + if ((div_condition && compute_type != output_.scalar_type()) || + output_.scalar_type() != common_dtype) { newCachedGraph->outputTensor = castMPSTensor(mpsGraph, newCachedGraph->outputTensor, output_.scalar_type()); } } @@ -93,22 +115,29 @@ void binaryOpTensor(const Tensor& self, const Tensor& other, const Scalar& alpha NSMutableDictionary *feeds = [[NSMutableDictionary new] autorelease]; Placeholder selfPlaceholder; Placeholder otherPlaceholder; + MPSScalar self_scalar; + MPSScalar other_scalar; + MPSScalar alpha_scalar; if (is_self_scalar && !self.is_mps()) { - feeds[cachedGraph->primaryTensor] = getMPSGraphTensorFromScalar(mpsStream, self.item(), getMPSScalarType(self.scalar_type())); + self_scalar = getMPSScalar(self.item(), self.scalar_type()); + feeds[cachedGraph->primaryTensor] = getMPSGraphTensorFromScalar(mpsStream, self_scalar); } else { selfPlaceholder = Placeholder(cachedGraph->primaryTensor, self); feeds[selfPlaceholder.getMPSGraphTensor()] = selfPlaceholder.getMPSGraphTensorData(); } if (is_other_scalar && !other.is_mps()) { - feeds[cachedGraph->secondaryTensor] = getMPSGraphTensorFromScalar(mpsStream, other.item(), getMPSScalarType(other.scalar_type())); + other_scalar = getMPSScalar(other.item(), other.scalar_type()); + feeds[cachedGraph->secondaryTensor] = getMPSGraphTensorFromScalar(mpsStream, other_scalar); } else { otherPlaceholder = Placeholder(cachedGraph->secondaryTensor, other); feeds[otherPlaceholder.getMPSGraphTensor()] = 
otherPlaceholder.getMPSGraphTensorData(); } + // 'cachedGraph->alphaTensor' is not nil only if add_sub_template() was called with an alpha value != 1.0 if (cachedGraph->alphaTensor) { - feeds[cachedGraph->alphaTensor] = getMPSGraphTensorFromScalar(mpsStream, alpha, getMPSScalarType(other.scalar_type())); + alpha_scalar = getMPSScalar(alpha, other.scalar_type()); + feeds[cachedGraph->alphaTensor] = getMPSGraphTensorFromScalar(mpsStream, alpha_scalar); } Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor, needsCopyToOutput ? output : output_); @@ -138,7 +167,11 @@ void div_mode_template(const Tensor& self, const Tensor& other, MPSGraphTensor* divTensor = [mpsGraph divisionWithPrimaryTensor:primaryCastTensor secondaryTensor:secondaryCastTensor name:nil]; - if (!rounding_mode.has_value()) { + // Rounding is a no-op for integral types, and also a reasonable workaround + // For MPSGraph bug on Apple Silicon, that throws `Function floorOp_i64 was not found in the library` + // See https://github.com/pytorch/pytorch/issues/84995 + bool isFloatOutput = ([divTensor dataType] & MPSDataTypeFloatBit) != 0; + if (!rounding_mode.has_value() || !isFloatOutput) { return divTensor; } else if (*rounding_mode == "trunc") { return trunc_tensor(mpsGraph, divTensor); @@ -203,6 +236,9 @@ void add_sub_template(const Tensor& self, const Tensor& other, const Scalar& alp #define CREATE_MPS_STRUCTURED_BINARY_OP_FUNC(func_out, func_stub, other_type) \ TORCH_IMPL_FUNC(func_out) (const Tensor& self, const other_type& other, const Tensor& output) { \ + TORCH_CHECK(!(self.scalar_type() == ScalarType::Long && \ + (std::string(#func_stub) == "power" || std::string(#func_stub) == "atan2")), \ + "MPS does not support ", #func_stub, " op with int64 input") \ mps::binaryOp##other_type(self, other, Scalar(1.0), output, #func_stub, \ ^BinaryOpFn(cachedGraph, primaryCastTensor, secondaryCastTensor) { \ MPSGraph* mpsGraph = cachedGraph->graph(); \ diff --git a/aten/src/ATen/native/mps/operations/BitwiseBinaryOps.mm b/aten/src/ATen/native/mps/operations/BitwiseOps.mm similarity index 79% rename from aten/src/ATen/native/mps/operations/BitwiseBinaryOps.mm rename to aten/src/ATen/native/mps/operations/BitwiseOps.mm index 62d25c3e97d9..5b57693296b1 100644 --- a/aten/src/ATen/native/mps/operations/BitwiseBinaryOps.mm +++ b/aten/src/ATen/native/mps/operations/BitwiseOps.mm @@ -73,6 +73,15 @@ kernel void bitwise_xor_scalar(constant uint& length [[buffer(0)]], out[offset] = a[offset] ^ b; }} +kernel void bitwise_not(constant uint& length [[buffer(0)]], + device {0} *out [[buffer(1)]], + device {1} *a [[buffer(2)]], + uint offset [[thread_position_in_grid]]) {{ + if (offset >= length) {{ + return; + }} + out[offset] = ~a[offset]; +}} )METAL"; @@ -113,8 +122,10 @@ kernel void bitwise_xor_scalar(constant uint& length [[buffer(0)]], return it->second; } NSError *error = nil; + MTLCompileOptions *options = [[MTLCompileOptions new] autorelease]; + [options setLanguageVersion: MTLLanguageVersion2_3]; auto rc = [device newLibraryWithSource:[NSString stringWithUTF8String:fmt::format(BITWISE_OPS_TEMPLATE, t1, t2, t3).c_str()] - options:nil + options:options error:&error]; TORCH_CHECK(rc != nil && error == nil, "Failed to compile library: ", [[error localizedDescription] UTF8String]); libMap[key] = rc; @@ -161,6 +172,9 @@ void handle_tensor_tensor_binary_op(const at::Tensor& self, const at::Tensor& ot getMetalType(other), kernel_name); uint32_t length = output.numel(); + if (length == 0) { + return; + } 
dispatch_sync(stream->queue(), ^(){ id buffer = stream->commandBuffer(); id commandEncoder = [buffer computeCommandEncoder]; @@ -191,6 +205,9 @@ void handle_tensor_scalar_binary_op(const at::Tensor& self, const at::Scalar& ot kernel_name); uint64_t sval = other.to(); uint32_t length = output.numel(); + if (length == 0) { + return; + } dispatch_sync(stream->queue(), ^(){ id buffer = stream->commandBuffer(); id commandEncoder = [buffer computeCommandEncoder]; @@ -262,12 +279,69 @@ void handle_tensor_scalar_binary_op(const at::Tensor& self, const at::Scalar& ot return _bitwise_op_out_mps(self, other, output, "xor"); } +at::Tensor& bitwise_not_out_mps (const at::Tensor& self, at::Tensor& output_) { + // Handle boolean tensor using logical not + if (self.scalar_type() == c10::ScalarType::Bool) { + return at::native::logical_not_out_mps(self, output_); + } + + at::Tensor output = output_; + bool needs_output_copy = false; + + at::native::resize_output(output, self.sizes()); + if (!output.is_contiguous()) { + output = output.contiguous(); + needs_output_copy = true; + } + if (self.dim() == 0) { + if (self.scalar_type() == c10::ScalarType::Byte) { + // Unsigned types need a special handling to keep result of operation in 0..255 output + output.fill_(c10::Scalar(static_cast(~self.item()))); + } else { + output.fill_(c10::Scalar(~self.item())); + } + return output_; + } + uint32_t length = output.numel(); + if (length == 0) { + return output_; + } + using namespace at::mps; + MPSStream* stream = getCurrentMPSStream(); + id cplState = getCPLState(MPSDevice::getInstance()->device(), + getMetalType(output), + getMetalType(self), + getMetalType(self), + "bitwise_not"); + dispatch_sync(stream->queue(), ^(){ + id buffer = stream->commandBuffer(); + id commandEncoder = [buffer computeCommandEncoder]; + + id outBuf = __builtin_bit_cast(id, output.storage().data()); + id selfBuf = __builtin_bit_cast(id, self.storage().data()); + + [commandEncoder pushDebugGroup:@"Dispatch bitwise_not kernel"]; + [commandEncoder setComputePipelineState:cplState]; + [commandEncoder setBytes:&length length:sizeof(length) atIndex:0]; + [commandEncoder setBuffer:outBuf offset:output.storage_offset()*output.itemsize() atIndex:1]; + [commandEncoder setBuffer:selfBuf offset:self.storage_offset()*self.itemsize() atIndex:2]; + dispatch1DJob(commandEncoder, cplState, length); + [commandEncoder endEncoding]; + stream->commit(true); + }); + if (needs_output_copy) { + output_.copy_(output); + } + return output_; +} + TORCH_LIBRARY_IMPL(aten, MPS, m) { m.impl("bitwise_and.Tensor_out", bitwise_and_out_mps); m.impl("bitwise_or.Tensor_out", bitwise_or_out_mps); m.impl("bitwise_xor.Tensor_out", bitwise_xor_out_mps); + m.impl("bitwise_not.out", bitwise_not_out_mps); } } // anonymous namespace diff --git a/aten/src/ATen/native/mps/operations/Blas.mm b/aten/src/ATen/native/mps/operations/Blas.mm index 7ab34ac31401..20a3ec5eb6db 100644 --- a/aten/src/ATen/native/mps/operations/Blas.mm +++ b/aten/src/ATen/native/mps/operations/Blas.mm @@ -51,13 +51,36 @@ Tensor dot_mps( MPSGraphTensor *selfTensor = mps::mpsGraphRankedPlaceHolder(mpsGraph, self); MPSGraphTensor *otherTensor = mps::mpsGraphRankedPlaceHolder(mpsGraph, other); - MPSGraphTensor *dot = [mpsGraph multiplicationWithPrimaryTensor: selfTensor - secondaryTensor: otherTensor + MPSGraphTensor *castSelf = nil; + MPSGraphTensor *castOther = nil; + + if(self.scalar_type() == ScalarType::Short || self.scalar_type() == ScalarType::Byte + || self.scalar_type() == ScalarType::Char) { + castSelf = 
[mpsGraph castTensor:selfTensor + toType:MPSDataTypeInt32 + name:@"castSelfTensor"]; + castOther = [mpsGraph castTensor:otherTensor + toType:MPSDataTypeInt32 + name:@"castOtherTensor"]; + } else { + castSelf = selfTensor; + castOther = otherTensor; + } + + MPSGraphTensor *dot = [mpsGraph multiplicationWithPrimaryTensor: castSelf + secondaryTensor: castOther name: @"multiplication"]; MPSGraphTensor *dotProductTensor = [mpsGraph reductionSumWithTensor: dot axes: nil name: @"dotProduct"]; + + if(self.scalar_type() == ScalarType::Short || self.scalar_type() == ScalarType::Byte + || self.scalar_type() == ScalarType::Char) + dotProductTensor = [mpsGraph castTensor:dotProductTensor + toType:getMPSDataType(self.scalar_type()) + name:@"castDotProductTensor"]; + newCachedGraph->selfTensor_ = selfTensor; newCachedGraph->otherTensor_ = otherTensor; newCachedGraph->outputTensor_ = dotProductTensor; diff --git a/aten/src/ATen/native/mps/operations/ConstantOps.mm b/aten/src/ATen/native/mps/operations/ConstantOps.mm index 0cfd7ccc2ff5..a5ddd82a229e 100644 --- a/aten/src/ATen/native/mps/operations/ConstantOps.mm +++ b/aten/src/ATen/native/mps/operations/ConstantOps.mm @@ -35,11 +35,15 @@ MPSGraph *mpsGraph = make_mps_graph(); newCachedGraph = new CachedGraph(mpsGraph); auto isBool = self.scalar_type() == c10::ScalarType::Bool; - auto dataType = (!isBool) ? getMPSScalarType(self.scalar_type()) : MPSDataTypeInt8; + auto isUInt8 = self.scalar_type() == c10::ScalarType::Byte; + auto dataType = !isUInt8 ? !isBool ? getMPSScalarType(self.scalar_type()) : MPSDataTypeInt8 : MPSDataTypeUInt32; // constantWithScalar does not work for boolTypes on MacOS-12.[34] // workaround by filing it as int8 tensor and than casting to bool // See https://github.com/pytorch/pytorch/issues/82427 - MPSGraphTensor* inputTensor = [mpsGraph constantWithScalar:value.toDouble() + // constantWithScalar does not work for UInt8 Types on MacOS-12.[34]/Ventura preview + // workaround by filing it as uint32 tensor and than casting to uint8 + // See https://github.com/pytorch/pytorch/issues/83692 + MPSGraphTensor* inputTensor = [mpsGraph constantWithScalar: value.toDouble() shape:getMPSShape(self) dataType:dataType]; MPSGraphTensor* outputTensor = [mpsGraph identityWithTensor:inputTensor @@ -49,6 +53,11 @@ toType:MPSDataTypeBool name:@"constWithBool-workaround"]; } + if (isUInt8) { + outputTensor = [mpsGraph castTensor:outputTensor + toType:MPSDataTypeUInt8 + name:@"constWithUInt8-workaround"]; + } newCachedGraph->outputTensor_ = outputTensor; } diff --git a/aten/src/ATen/native/mps/operations/Convolution.mm b/aten/src/ATen/native/mps/operations/Convolution.mm index 2c74dcf07667..88bad9a5872a 100644 --- a/aten/src/ATen/native/mps/operations/Convolution.mm +++ b/aten/src/ATen/native/mps/operations/Convolution.mm @@ -39,6 +39,19 @@ void fill_conv_desc(MPSGraphConvolution2DOpDescriptor* descriptor_, descriptor_.groups = groups; } +static +MPSShape* get_mps_conv_shape(const Tensor& tensor, bool is_channels_last) { + if (is_channels_last) { + const auto tensorSizes = tensor.sizes(); + const NSUInteger N = tensorSizes[0]; + const NSUInteger C = tensorSizes[1]; + const NSUInteger H = tensorSizes[2]; + const NSUInteger W = tensorSizes[3]; + return @[@(N), @(H), @(W), @(C)]; + } + return at::native::mps::getMPSShape(tensor); +} + Tensor _mps_convolution( const Tensor& input_t, const Tensor& weight_t, @@ -47,6 +60,8 @@ Tensor _mps_convolution( IntArrayRef stride, IntArrayRef dilation, int64_t groups) { + TORCH_CHECK(input_t.dim() < 5, "Conv3D is not 
supported on MPS"); + namespace native_mps = at::native::mps; CheckedFrom c = "mps_convolution"; TensorArg input { input_t, "input", 1 }, @@ -124,19 +139,7 @@ Tensor _mps_convolution( + mps::getTensorsStringKey({input_t, weight_t}) + ":" + to_string(bias_defined) + ":" + bias_shape_key; CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); - MPSShape* inputShape = nil; - - if (is_channels_last) { - const auto inputSizes = input_t.sizes(); - const NSUInteger N = inputSizes[0]; - const NSUInteger C = inputSizes[1]; - const NSUInteger H = inputSizes[2]; - const NSUInteger W = inputSizes[3]; - inputShape = @[@(N), @(H), @(W), @(C)]; - } else { - inputShape = native_mps::getMPSShape(input_t); - } - + MPSShape* inputShape = get_mps_conv_shape(input_t, is_channels_last); if(!cachedGraph) { native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { @@ -147,8 +150,8 @@ Tensor _mps_convolution( newCachedGraph = new CachedGraph(mpsGraph); MPSGraphConvolution2DOpDescriptor *descriptor_ = [[MPSGraphConvolution2DOpDescriptor new] autorelease]; - fill_conv_desc(descriptor_, stride[0], stride[1], - dilation[0], dilation[1], + fill_conv_desc(descriptor_, stride[1], stride[0], + dilation[1], dilation[0], padding[1], padding[0], memory_format, groups); @@ -229,7 +232,7 @@ Tensor mps_convolution_backward_input( c10::nullopt, kMPS, c10::nullopt, - memory_format); + c10::nullopt); // Avoid "grad_input" when this is being used as transposed convolution TensorArg grad_input{ grad_input_t, "result", 0 }; @@ -264,9 +267,7 @@ Tensor mps_convolution_backward_input( } MPSShape* mps_input_shape = getMPSShape(input_size); - NSString* ns_shape_key = [[mps_input_shape valueForKey:@"description"] componentsJoinedByString:@","]; - string key = "mps_convolution_backward_input:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" + to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(padding[0]) + ":" + to_string(padding[1]) + ":" @@ -285,8 +286,8 @@ Tensor mps_convolution_backward_input( newCachedGraph = new CachedGraph(mpsGraph); MPSGraphConvolution2DOpDescriptor *descriptor_ = [[MPSGraphConvolution2DOpDescriptor new] autorelease]; - fill_conv_desc(descriptor_, stride[0], stride[1], - dilation[0], dilation[1], + fill_conv_desc(descriptor_, stride[1], stride[0], + dilation[1], dilation[0], padding[1], padding[0], memory_format, groups); @@ -333,6 +334,9 @@ Tensor mps_convolution_backward_weights( using namespace mps; CheckedFrom c = "mps_convolution_backward_weights"; auto memory_format = input_t.suggest_memory_format(); + bool is_channels_last = (memory_format == at::MemoryFormat::ChannelsLast); + MPSShape* inputShape = get_mps_conv_shape(input_t, is_channels_last); + MPSShape* gradOutputShape = get_mps_conv_shape(grad_output_t, is_channels_last); // For uniformity with everything else, although it seems grad_weight // would be unambiguous too. 
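
One recurring fix in Convolution.mm above is the argument order passed to fill_conv_desc: PyTorch stores stride/dilation as {height, width}, while the calls now pass stride[1]/dilation[1] before stride[0]/dilation[0], which suggests the descriptor wants its X (width) component first. A tiny sketch of that reordering; the parameter convention is the editor's reading of the swap, not something stated in the patch.

#include <array>
#include <c10/util/ArrayRef.h>

// Reorder a PyTorch-style {H, W} pair into the (X, Y) order the 2D conv
// descriptor appears to take. Purely illustrative; the real code simply
// swaps the indices at each fill_conv_desc() call site.
std::array<int64_t, 2> hw_to_xy(c10::ArrayRef<int64_t> hw) {
  return {/*x=*/hw[1], /*y=*/hw[0]};
}
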
@@ -342,7 +346,7 @@ Tensor mps_convolution_backward_weights( checkAllSameType(c, {grad_output, input}); checkAllSameGPU(c, {grad_output, input}); - auto grad_weight_t = at::empty(weight_size, grad_output_t.options(), memory_format); + auto grad_weight_t = at::empty(weight_size, grad_output_t.options(), c10::nullopt); TensorArg grad_weight{ grad_weight_t, "result", 0 }; convolution_shape_check(c, input, grad_weight, grad_output, padding, stride, dilation, groups); @@ -375,9 +379,7 @@ Tensor mps_convolution_backward_weights( } MPSShape* mps_weight_shape = getMPSShape(weight_size); - NSString* ns_shape_key = [[mps_weight_shape valueForKey:@"description"] componentsJoinedByString:@","]; - string key = "mps_convolution_backward_weights:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" + to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(padding[0]) + ":" + to_string(padding[1]) + ":" @@ -396,13 +398,13 @@ Tensor mps_convolution_backward_weights( newCachedGraph = new CachedGraph(mpsGraph); MPSGraphConvolution2DOpDescriptor *descriptor_ = [[MPSGraphConvolution2DOpDescriptor new] autorelease]; - fill_conv_desc(descriptor_, stride[0], stride[1], - dilation[0], dilation[1], + fill_conv_desc(descriptor_, stride[1], stride[0], + dilation[1], dilation[0], padding[1], padding[0], memory_format, groups); - MPSGraphTensor* gradOutputTensor = native_mps::mpsGraphRankedPlaceHolder(mpsGraph, grad_output_t); - MPSGraphTensor* inputTensor = native_mps::mpsGraphRankedPlaceHolder(mpsGraph, input_t); + MPSGraphTensor* gradOutputTensor = native_mps::mpsGraphRankedPlaceHolder(mpsGraph, native_mps::getMPSScalarType(grad_output_t.scalar_type()), gradOutputShape); + MPSGraphTensor* inputTensor = native_mps::mpsGraphRankedPlaceHolder(mpsGraph, native_mps::getMPSScalarType(input_t.scalar_type()), inputShape); MPSGraphTensor* gradWeightTensor = [mpsGraph convolution2DWeightsGradientWithIncomingGradientTensor:gradOutputTensor sourceTensor:inputTensor @@ -419,8 +421,8 @@ Tensor mps_convolution_backward_weights( cachedGraph = static_cast(tmpCachedGraph); } - auto gradOutputPlaceholder = Placeholder(cachedGraph->gradOutputTensor_, grad_output_t); - auto inputPlaceholder = Placeholder(cachedGraph->inputTensor_, input_t); + auto gradOutputPlaceholder = Placeholder(cachedGraph->gradOutputTensor_, grad_output_t, gradOutputShape); + auto inputPlaceholder = Placeholder(cachedGraph->inputTensor_, input_t, inputShape); auto outputPlaceholder = Placeholder(cachedGraph->gradWeightTensor_, grad_weight_t); NSDictionary *feeds = @{ diff --git a/aten/src/ATen/native/mps/operations/Copy.mm b/aten/src/ATen/native/mps/operations/Copy.mm index 3c2ab0d6c2f8..99183d21030c 100644 --- a/aten/src/ATen/native/mps/operations/Copy.mm +++ b/aten/src/ATen/native/mps/operations/Copy.mm @@ -33,10 +33,29 @@ return (void*)alignedAddress; } +/** + * Computes number of elements one needs to transfer to preserve all the elements + */ +size_t compute_strided_size(const at::Tensor& t) { + size_t rc = 1; + if (t.numel() == 0) { + return 0; + } + for(const auto i: c10::irange(t.dim())) { + assert(t.size(i) > 0); + rc += (t.size(i) - 1) * t.stride(i); + } + return rc; +} + +bool is_strided_contiguous(const at::Tensor& t) { + return compute_strided_size(t) == t.numel(); +} + // Copy sourceBuffer into destBuffer, casting sourceBuffer to src.scalar_type(). // The shapes and dtypes are taken from dst and src, but their storage pointers are not used. 
void copy_cast_mps(at::Tensor& dst, const at::Tensor& src, - id destBuffer, id sourceBuffer) { + id destBuffer, id sourceBuffer, bool non_blocking = true) { using namespace mps; struct CachedGraph : public MPSCachedGraph @@ -84,6 +103,8 @@ void copy_cast_mps(at::Tensor& dst, const at::Tensor& src, NSDictionary* feeds = @{cachedGraph->inputTensor_: srcData}; NSDictionary* results = @{cachedGraph->outputTensor_: dstData}; runMPSGraph(stream, cachedGraph->graph(), feeds, results); + if (!non_blocking) + stream->synchronize(SyncType::COMMIT_AND_WAIT); } } @@ -113,38 +134,51 @@ void copy_cast_mps(at::Tensor& dst, const at::Tensor& src, src = src_; } id sourceBuffer = getMTLBufferStorage(src); - size_t src_total_size = src_.is_view() ? at::detail::computeStorageNbytesContiguous(src.sizes(), src.element_size(), src.storage_offset()) : - src.nbytes(); - size_t size_to_copy = src.nbytes(); - - // In case of dtype change, first convert src inplace - if (src_.dtype() != dst_.dtype()) { - copy_cast_mps(dst, src, sourceBuffer, sourceBuffer); - // Use the element size of dst to calculate the total size after casting - size_to_copy = (size_to_copy / src.element_size()) * dst.element_size(); - } - - // If there's anything wrong with source, we shouldn't return dst_ silently and must error out. - TORCH_INTERNAL_ASSERT(sourceBuffer && size_to_copy > 0); - TORCH_INTERNAL_ASSERT(src_total_size >= storage_byte_offset); - TORCH_INTERNAL_ASSERT(dst.nbytes() >= (dst.storage_offset() * dst.element_size())); + size_t dst_tensor_nbytes = dst.nbytes(); @autoreleasepool { MTLResourceOptions options = MTLResourceOptionCPUCacheModeDefault | MTLResourceStorageModeShared; NSUInteger alignedLength = 0; void* host_dst = dst.storage().data(); - void* alignedPtr = pageAlignedBlockPtr(host_dst, (NSUInteger)src_total_size, &alignedLength); + void* alignedPtr = pageAlignedBlockPtr(host_dst, (NSUInteger)dst_tensor_nbytes, &alignedLength); + NSUInteger destOffset = (uintptr_t(host_dst) - uintptr_t(alignedPtr)); + // 4 bytes alignment required on macos for blits. + TORCH_INTERNAL_ASSERT(destOffset % 4 == 0, "Unaligned blit request"); + id destBuffer = [device newBufferWithBytesNoCopy:alignedPtr length:alignedLength options:options deallocator:nil]; - NSUInteger destOffset = uintptr_t(host_dst) - uintptr_t(alignedPtr); - // 4 bytes alignment required on macos for blits. - TORCH_INTERNAL_ASSERT(destOffset % 4 == 0, "Unaligned blit request"); + id tmpBuffer = sourceBuffer; + Tensor tmp; + bool needsBlit = true; + if (src_.dtype() != dst.dtype()) { + if (destOffset == 0 && storage_byte_offset == 0) { + // Return the casted tensor directly if there's no destination offset + needsBlit = false; + tmpBuffer = destBuffer; + } else if (src.element_size() < dst.element_size()) { + tmp = at::native::empty_mps(dst.sizes(), dst.scalar_type(), c10::nullopt, kMPS); + tmpBuffer = getMTLBufferStorage(tmp); + } + } - stream->copy_and_sync(sourceBuffer, destBuffer, size_to_copy, storage_byte_offset, destOffset, non_blocking); - [destBuffer release]; + size_t size_to_copy = src.nbytes(); + // In case of dtype change, first convert src inplace + if (src_.dtype() != dst.dtype()) { + copy_cast_mps(dst, src, tmpBuffer, sourceBuffer, non_blocking); + } + + if (needsBlit) { + size_to_copy = (size_to_copy / src.element_size()) * dst.element_size(); + + // If there's anything wrong with source, we shouldn't return dst_ silently and must error out. 
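The MPS-to-CPU path above wraps the destination host memory in a no-copy Metal buffer, which requires a page-aligned base address, and then blits into it at destOffset, which the patch asserts is 4-byte aligned. A hedged sketch of the page-alignment arithmetic behind pageAlignedBlockPtr, using only POSIX calls (the body is illustrative, not the patch's implementation):

    #include <cstddef>
    #include <cstdint>
    #include <unistd.h>   // sysconf

    // Round a host pointer down to the start of its page and round the length
    // up so [aligned, aligned + len) still covers the original block; the
    // caller then blits at offset (ptr - aligned).
    void* page_aligned_block(const void* ptr, size_t size, size_t* aligned_len) {
      const uintptr_t page = static_cast<uintptr_t>(sysconf(_SC_PAGESIZE));
      const uintptr_t addr = reinterpret_cast<uintptr_t>(ptr);
      const uintptr_t aligned = addr & ~(page - 1);               // page start
      const uintptr_t end = (addr + size + page - 1) & ~(page - 1);
      *aligned_len = static_cast<size_t>(end - aligned);
      return reinterpret_cast<void*>(aligned);
    }

The destination offset used for the blit is then addr - aligned, which is what the 4-byte alignment check above guards.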
+ TORCH_INTERNAL_ASSERT(sourceBuffer && dst_tensor_nbytes > 0); + + stream->copy_and_sync(tmpBuffer, destBuffer, size_to_copy, storage_byte_offset, destOffset, non_blocking); + [destBuffer release]; + } } if (!dst.is_same(dst_)) { dst_.copy_(dst, non_blocking); @@ -153,55 +187,60 @@ void copy_cast_mps(at::Tensor& dst, const at::Tensor& src, return dst_; } -static at::Tensor& copy_to_mps_(at::Tensor& dst_, const at::Tensor& src_, bool non_blocking) +// Copies tensor from cpu to mps backed by identical strided-contiguous data +static void copy_to_mps_stride_contig(at::Tensor& dst, const at::Tensor& src, bool non_blocking) { MPSStream* stream = getCurrentMPSStream(); - Tensor dst; - Tensor src; - id device = MPSDevice::getInstance()->device(); - auto dst_byte_offset = dst_.storage_offset() * dst_.itemsize(); - id destBuffer = getMTLBufferStorage(dst_); - uint64_t src_total_size = 0; - - if (src_.is_view()) { - src = src_.to(dst_.dtype()).expand_as(dst_).contiguous(); - // Get the actual size of a View (takes into account the storage offset) - // For View tensors, the storage offset can be bigger than what's being reported by nbytes - src_total_size = at::detail::computeStorageNbytesContiguous(src.sizes(), src.element_size(), src.storage_offset()); - } else { - src = src_; - if (src.dtype() != dst_.dtype()) { - // In case of dtype change, perform conversion on source device - src = src.to(dst_.dtype()); - } - src_total_size = src.nbytes(); - } - + auto dst_byte_offset = dst.storage_offset() * dst.itemsize(); + auto src_byte_offset = src.storage_offset() * src.itemsize(); + id destBuffer = getMTLBufferStorage(dst); const size_t size_to_copy = src.nbytes(); - const void* host_src = src.storage().data(); - TORCH_INTERNAL_ASSERT(src_total_size >= (src.storage_offset() * src.element_size())); - TORCH_INTERNAL_ASSERT(dst_.nbytes() >= dst_byte_offset); + const void* host_src = static_cast(src.storage().data()) + src_byte_offset; + + TORCH_INTERNAL_ASSERT(src.dtype() == dst.dtype() && src.strides() == dst.strides() && is_strided_contiguous(src)); - NSUInteger sourceOffset = 0; @autoreleasepool { MTLResourceOptions options = MTLResourceOptionCPUCacheModeDefault | MTLResourceStorageModeShared; NSUInteger alignedLength = 0; + NSUInteger sourceOffset = 0; - void* alignedPtr = pageAlignedBlockPtr(host_src, (NSUInteger)src_total_size, &alignedLength); + void* alignedPtr = pageAlignedBlockPtr(host_src, (NSUInteger)size_to_copy, &alignedLength); id sourceBuffer = [device newBufferWithBytesNoCopy:alignedPtr length:alignedLength options:options deallocator:nil]; sourceOffset = uintptr_t(host_src) - uintptr_t(alignedPtr); - if (src_.is_view() || !src_.is_contiguous()) - sourceOffset += src_.storage_offset() * src_.itemsize(); stream->copy_and_sync(sourceBuffer, destBuffer, size_to_copy, sourceOffset, dst_byte_offset, non_blocking); [sourceBuffer release]; } +} - return dst_; +static at::Tensor& copy_to_mps_(at::Tensor& dst_, const at::Tensor& src_, bool non_blocking) +{ + // Typecast to dst_ if needed and expand, which is a no-op + Tensor src = (src_.dtype() != dst_.dtype() ? 
src_.to(dst_.dtype()) : src_).expand_as(dst_); + + // If src is not contiguously strided it must be cloned + // It does not mean that tensor is contiguous, but rather + // that it could be represented as 1d view + if (!is_strided_contiguous(src)) { + src = src.clone(); + TORCH_INTERNAL_ASSERT(is_strided_contiguous(src)); + } + Tensor dst = dst_; + bool needs_copy = false; + // If src and dst_ strides do not match, it means that + // either dst_ is not representable as 1d view or its stride order is different + // in that case create an empty storage like src, copy it to device and then do + // reshaping on the device + if (src.strides() != dst_.strides()) { + needs_copy = true; + dst = at::empty_like(src, at::device(at::kMPS)); + } + copy_to_mps_stride_contig(dst, src, non_blocking && !needs_copy); + return needs_copy? dst_.copy_(dst) : dst_; } void copy_blit_mps(void* dst, const void* src, size_t size) { @@ -235,17 +274,29 @@ void copy_blit_mps(void* dst, const void* src, size_t size) { } else { src = src_; } + id destBuffer = getMTLBufferStorage(dst_); + id sourceBuffer = getMTLBufferStorage(src); + // Scatter to `dst` if the memory is not contiguous // If the memory is not contiguous, it means that the tensor has strides and we would not be // able to do the copy using a single blit if (!dst_.is_contiguous()) { - return scatterViewTensor(src, dst_); + Tensor tmp; + if (src.dtype() != dst_.dtype()) { + id tmpBuffer = sourceBuffer; + if (src.element_size() < dst_.element_size()) { + tmp = at::native::empty_mps(dst_.sizes(), dst_.scalar_type(), c10::nullopt, kMPS); + tmpBuffer = getMTLBufferStorage(tmp); + } + + copy_cast_mps(dst_, src, tmpBuffer, sourceBuffer); + } + + return scatterViewTensor((src.dtype() != dst_.dtype() && tmp.has_storage()) ? tmp : src, dst_); } src._set_conj(src_.is_conj()); src._set_neg(src_.is_neg()); - id destBuffer = getMTLBufferStorage(dst_); - id sourceBuffer = getMTLBufferStorage(src); const size_t src_size = src.nbytes(); if (src.dtype() == dst_.dtype()) { MPSStream* stream = getCurrentMPSStream(); diff --git a/aten/src/ATen/native/mps/operations/Distributions.mm b/aten/src/ATen/native/mps/operations/Distributions.mm index 999b1cc79d5b..99d01c6825b3 100644 --- a/aten/src/ATen/native/mps/operations/Distributions.mm +++ b/aten/src/ATen/native/mps/operations/Distributions.mm @@ -3,429 +3,302 @@ #include #include #include +#include namespace at { namespace native { +namespace mps { -Tensor& uniform_mps_(Tensor& input, double from, double to, c10::optional gen_) +struct RandomCachedGraph : public MPSCachedGraph { - using namespace mps; - - if (input.numel() == 0) { - return input; + RandomCachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) { + // initialize Philox state values (only required once when graph is created) + const auto seed = c10::detail::getNonDeterministicRandom(); + const auto subsequence = c10::detail::getNonDeterministicRandom(); + philoxState = at::Philox4_32(seed, subsequence); + // the two last state values are the Philox keys which are initialized once only + stateValues[5] = static_cast(seed); + stateValues[6] = static_cast(seed >> 32); + } + // Only relevant for multinomial + MPSGraphTensor *probTensor = nil; + MPSGraphTensor *resultTensor = nil; + MPSGraphTensor *stateTensor = nil; + // used for Normal distributions only + MPSGraphTensor *meanTensor = nil, *stdTensor = nil; + // we initialize and keep the philox's state in the graph. This would + // guarantee producing new random values each time the same graph is reused. 
+ at::Philox4_32 philoxState; + std::array stateValues = {1}; + + void updatePhiloxCounters() { + // calling philoxState() would call operator() of philox_engine class to + // get each of the four newly generated counter values (see PhiloxRNGEngine.h). + for (int i = 1; i <= 4; i++) + stateValues[i] = philoxState(); + } +}; + +typedef MPSGraphTensor* (^RandomOpBlock)(RandomCachedGraph*, MPSGraphTensor*); +#define RandomOpFn(graph, randomTensor) MPSGraphTensor* (mps::RandomCachedGraph* graph, MPSGraphTensor* randomTensor) + +// for Uniform distributions with scalar from (val1) and to (val2) intervals +// for Normal distributions with scalar mean (val1) and std (val2) values +template +Tensor& random_mps_impl(Tensor& self, scalar_t val1, scalar_t val2, + const c10::optional& mean_opt, + const c10::optional& std_opt, + MPSGraphRandomDistribution distribution, + std::string op_name, RandomOpBlock randomBlock) +{ + if (self.numel() == 0) { + return self; } - double delta = (to - from); - AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "check_uniform_bounds", [&] { - const auto min = static_cast(std::numeric_limits::lowest()); - const auto max = static_cast(std::numeric_limits::max()); - TORCH_CHECK(from <= to, "uniform_ expects to return a [from, to) range, but found from=", from, " > to=", to); - TORCH_CHECK((to - from) <= std::numeric_limits::max(), - "uniform_ expects to-from <= std::numeric_limits<", toString(input.scalar_type()), - ">::max(), but found to=", to, " and from=", from, - " which result in to-from to exceed the limit"); - from = std::min(std::max(from, min), max); - to = std::max(std::min(to, max), min); - }); - - struct CachedGraph : public MPSCachedGraph - { - CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) {} - MPSGraphTensor *outputTensor_ = nil; - }; - MPSGraphCache* cache_ = MPSGraphCache::getInstance(); - MPSStream* stream = getCurrentMPSStream(); - uint64_t seed_ = c10::detail::getNonDeterministicRandom(true); @autoreleasepool { - MPSShape* input_shape = getMPSShape(input); - string key = "uniform_mps_" + getTensorsStringKey(input) + ":" + to_string(from) + ":" + to_string(to) + ":" + to_string(seed_); - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); - - if(!cachedGraph) { - MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { + string key = op_name + getTensorsStringKey({self}) + ":" + to_string(val1) + ":" + to_string(val2); + auto cachedGraph = cache_->LookUpAs(key); - CachedGraph *newCachedGraph = nil; + if (!cachedGraph) { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ MPSCachedGraph * () { + RandomCachedGraph *newCachedGraph = nil; @autoreleasepool { MPSGraph* mpsGraph = make_mps_graph(); - newCachedGraph = new CachedGraph(mpsGraph); - - // TODO: right now taking the default seed. 
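RandomCachedGraph keeps a 7-element int32 state for the graph's Philox-based RNG: the diff initializes element 0 to 1, refreshes elements 1 through 4 (the counter words) from the Philox engine in updatePhiloxCounters before every run, and stores the two seed-derived key words in elements 5 and 6. A minimal sketch of that bookkeeping without any MPS types (std::mt19937 is only a stand-in for at::Philox4_32 here):

    #include <array>
    #include <cstdint>
    #include <random>

    // Stand-in for the cached graph's RNG state: [flag, c0..c3, key_lo, key_hi].
    struct PhiloxStateSketch {
      std::array<int32_t, 7> state{1};   // element 0 = 1, the rest start at zero
      std::mt19937 engine;               // stand-in for the Philox engine

      explicit PhiloxStateSketch(uint64_t seed) : engine(static_cast<uint32_t>(seed)) {
        state[5] = static_cast<int32_t>(seed);         // key, low word
        state[6] = static_cast<int32_t>(seed >> 32);   // key, high word
      }

      // Called before every run of the same cached graph so it keeps
      // producing fresh random values.
      void update_counters() {
        for (int i = 1; i <= 4; ++i)
          state[i] = static_cast<int32_t>(engine());
      }
    };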
Extend it to be extracted from the - // MPSGenerator - MPSGraphTensor* randomTensor = [mpsGraph randomUniformTensorWithShape:input_shape - seed:seed_ - name:nil]; - MPSGraphTensor* deltaTensor = [mpsGraph constantWithScalar:delta - shape:input_shape - dataType:MPSDataTypeFloat32]; - MPSGraphTensor* fromTensor = [mpsGraph constantWithScalar:from - shape:input_shape - dataType:MPSDataTypeFloat32]; - MPSGraphTensor* mulTensor = [mpsGraph multiplicationWithPrimaryTensor:randomTensor - secondaryTensor:deltaTensor - name:nil]; - MPSGraphTensor* outputTensor = [mpsGraph additionWithPrimaryTensor:mulTensor - secondaryTensor:fromTensor - name:nil]; - newCachedGraph->outputTensor_ = outputTensor; - + newCachedGraph = new RandomCachedGraph(mpsGraph); + newCachedGraph->stateTensor = mpsGraphRankedPlaceHolder(mpsGraph, MPSDataTypeInt32, @[@7]); + + // FP16, FP32 and Int32 are the only data types supported for distributions on MPS backend. + const MPSDataType inputDataType = [&] { + // only for random_mps, we pass interval range of type int64_t + if (std::is_same::value) + return MPSDataTypeInt32; + else + return (self.scalar_type() == ScalarType::Half) ? MPSDataTypeFloat16 : MPSDataTypeFloat32; + }(); + const MPSDataType outputDataType = (std::is_same::value) ? MPSDataTypeBool : inputDataType; + + MPSGraphRandomOpDescriptor *desc = [MPSGraphRandomOpDescriptor descriptorWithDistribution: distribution + dataType: inputDataType]; + if (distribution == MPSGraphRandomDistributionUniform) { + if (inputDataType == MPSDataTypeInt32) { + desc.minInteger = static_cast(val1); + desc.maxInteger = static_cast(val2); + } else { + desc.min = static_cast(val1); + desc.max = static_cast(val2); + } + } else if (distribution == MPSGraphRandomDistributionNormal) { + desc.mean = static_cast(val1); + desc.standardDeviation = static_cast(val2); + } + // we don't use the output state tensor from the MPSGraph API as it requires reading back from GPU to CPU. + // Instead, we keep the Philox state in the cached graph and use the PyTorch's philox_engine to maintain + // the counters, and feed them to the graph manually + NSArray *resultTensors = [mpsGraph randomTensorWithShape: getMPSShape(self) + descriptor: desc + stateTensor: newCachedGraph->stateTensor + name: nil]; + newCachedGraph->resultTensor = randomBlock ? randomBlock(newCachedGraph, resultTensors[0]) : resultTensors[0]; + // results will be cast if self's scalar type isn't directly supported by MPS backend. 
+ if (getMPSDataType(self.scalar_type()) != outputDataType) + newCachedGraph->resultTensor = castMPSTensor(mpsGraph, newCachedGraph->resultTensor, self.scalar_type()); } return newCachedGraph; }); - cachedGraph = static_cast(tmpCachedGraph); + } + // update the Philox state values on each run of the same graph + cachedGraph->updatePhiloxCounters(); + // feed the updated state values to the graph + MPSNDArrayDescriptor *stateDesc = [MPSNDArrayDescriptor descriptorWithDataType: MPSDataTypeInt32 shape: @[@7]]; + MPSNDArray *stateNDArray = [[[MPSNDArray alloc] initWithDevice: stream->device() descriptor: stateDesc] autorelease]; + [stateNDArray writeBytes: &cachedGraph->stateValues[0] strideBytes: nil]; + MPSGraphTensorData* stateTensorData = [[[MPSGraphTensorData alloc] initWithMPSNDArray: stateNDArray] autorelease]; + + Placeholder meanPlaceholder, stdPlaceholder; + NSMutableDictionary *feeds = [[NSMutableDictionary new] autorelease]; + feeds[cachedGraph->stateTensor] = stateTensorData; + + if (cachedGraph->stdTensor) { + const Tensor& stdTensor = *(at::borrow_from_optional_tensor(std_opt)); + stdPlaceholder = Placeholder(cachedGraph->stdTensor, stdTensor); + feeds[stdPlaceholder.getMPSGraphTensor()] = stdPlaceholder.getMPSGraphTensorData(); + } + if (cachedGraph->meanTensor) { + const Tensor& meanTensor = *(at::borrow_from_optional_tensor(mean_opt)); + meanPlaceholder = Placeholder(cachedGraph->meanTensor, meanTensor); + feeds[meanPlaceholder.getMPSGraphTensor()] = meanPlaceholder.getMPSGraphTensorData(); } - auto outputPlaceholder = Placeholder(cachedGraph->outputTensor_, input); - NSDictionary *feeds = nil; - NSDictionary* results = @{ - outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() + Placeholder outputPlaceholder = Placeholder(cachedGraph->resultTensor, self); + NSDictionary *results = @{ + outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData(), }; runMPSGraph(stream, cachedGraph->graph(), feeds, results); + } + + return self; +} +Tensor& normal_mps_impl(Tensor& self, double mean_s, double std_s, + const c10::optional& mean_opt, + const c10::optional& std_opt, + std::string op_name) +{ + const Tensor& std_t = *(at::borrow_from_optional_tensor(std_opt)); + const Tensor& mean_t = *(at::borrow_from_optional_tensor(mean_opt)); + + TORCH_CHECK(std_s >= 0.0, op_name, " expects std >= 0.0, but found std=", std_s); + if (std_t.defined()) { + TORCH_CHECK(!std_t.is_complex(), op_name, " expects standard deviation to be non-complex"); + if (mean_t.defined()) + TORCH_CHECK(mean_t.numel() == std_t.numel(), op_name, ": mean and std must have same number of elements") } - return input; + RandomOpBlock random_op_block = ^RandomOpFn(cachedGraph, randomTensor) { + MPSGraph* mpsGraph = cachedGraph->graph(); + MPSGraphTensor* resultTensor = randomTensor; + + if (std_t.defined()) { + cachedGraph->stdTensor = mpsGraphRankedPlaceHolder(mpsGraph, std_t); + resultTensor = [mpsGraph multiplicationWithPrimaryTensor: randomTensor + secondaryTensor: cachedGraph->stdTensor + name: nil]; + } + if (mean_t.defined()) { + cachedGraph->meanTensor = mpsGraphRankedPlaceHolder(mpsGraph, mean_t); + return [mpsGraph additionWithPrimaryTensor: resultTensor + secondaryTensor: cachedGraph->meanTensor + name: nil]; + } + return resultTensor; + }; + return random_mps_impl(self, mean_s, std_s, mean_opt, std_opt, + MPSGraphRandomDistributionNormal, + op_name + getTensorsStringKey({mean_t, std_t}), random_op_block); + +} + +Tensor& bernoulli_mps_impl(Tensor& self, const 
Tensor& prob_t, std::string op_name) +{ + TORCH_CHECK(prob_t.is_same_size(self), op_name, ": probability and self tensor should be of the same shape") + + RandomOpBlock random_op_block = ^RandomOpFn(cachedGraph, randomTensor) { + MPSGraph* mpsGraph = cachedGraph->graph(); + cachedGraph->stdTensor = mpsGraphRankedPlaceHolder(mpsGraph, prob_t); + return [mpsGraph lessThanWithPrimaryTensor: randomTensor + secondaryTensor: cachedGraph->stdTensor + name: nil]; + }; + // Bernoulli generates binary output so we use bool type + return mps::random_mps_impl(self, 0.0, 1.0, c10::nullopt, prob_t, + MPSGraphRandomDistributionUniform, + op_name + getTensorsStringKey({prob_t}), random_op_block); +} + +} // namespace mps + +Tensor& uniform_mps_(Tensor& self, double from, double to, c10::optional gen) { + AT_DISPATCH_FLOATING_TYPES_AND_HALF(self.scalar_type(), "check_uniform_bounds", [&] { + const auto min = static_cast(std::numeric_limits::lowest()); + const auto max = static_cast(std::numeric_limits::max()); + TORCH_CHECK(from <= to, "uniform_ expects to return a [from, to) range, but found from=", from, " > to=", to); + TORCH_CHECK((to - from) <= std::numeric_limits::max(), + "uniform_ expects to-from <= std::numeric_limits<", toString(self.scalar_type()), + ">::max(), but found to=", to, " and from=", from, + " which result in to-from to exceed the limit"); + from = std::min(std::max(from, min), max); + to = std::max(std::min(to, max), min); + }); + + return mps::random_mps_impl(self, from, to, c10::nullopt, c10::nullopt, + MPSGraphRandomDistributionUniform, __func__, nullptr); } Tensor& normal_mps_(Tensor& self, double mean, double std, c10::optional gen) { - if (self.numel() == 0) - return self; - TORCH_CHECK(std >= 0.0, "normal_mps_ expects std >= 0.0, but found std=", std); - - Tensor mean_t = empty_mps( - self.sizes(), - self.scalar_type(), - c10::nullopt, - kMPS, - c10::nullopt, - c10::nullopt); - mean_t.fill_(mean); - - Tensor std_t = empty_mps( - self.sizes(), - self.scalar_type(), - c10::nullopt, - kMPS, - c10::nullopt, - c10::nullopt); - std_t.fill_(std); - - return normal_mps_out(mean_t, std_t, gen, self); + return mps::normal_mps_impl(self, mean, std, c10::nullopt, c10::nullopt, __func__); } Tensor normal_mps(const Tensor& mean, double std, c10::optional gen) { - Tensor output = empty_mps( - mean.sizes(), - mean.scalar_type(), - c10::nullopt, - kMPS, - c10::nullopt, - c10::nullopt); - - Tensor std_t = empty_mps( - output.sizes(), - output.scalar_type(), - c10::nullopt, - kMPS, - c10::nullopt, - c10::nullopt); - std_t.fill_(std); - - return normal_mps_out(mean, std_t, gen, output); + Tensor self = empty_mps(mean.sizes(), mean.scalar_type(), c10::nullopt, kMPS); + return mps::normal_mps_impl(self, 0.0, std, mean, c10::nullopt, __func__); } Tensor normal_mps(double mean, const Tensor& std, c10::optional gen) { - Tensor output = empty_mps( - std.sizes(), - std.scalar_type(), - c10::nullopt, - kMPS, - c10::nullopt, - c10::nullopt); - - Tensor mean_t = empty_mps( - output.sizes(), - output.scalar_type(), - c10::nullopt, - kMPS, - c10::nullopt, - c10::nullopt); - mean_t.fill_(mean); - - return normal_mps_out(mean_t, std, gen, output); + Tensor self = empty_mps(std.sizes(), std.scalar_type(), c10::nullopt, kMPS); + // when there's no tensor-type mean, we cannot pass scalar mean value due to the order of + // multiply/add ops in random computation. So we create a mean tensor instead. 
+ Tensor mean_t = at::full_like(self, Scalar(mean)); + return mps::normal_mps_impl(self, 0.0, 1.0, mean_t, std, __func__); } Tensor normal_mps(const Tensor& mean, const Tensor& std, c10::optional gen) { auto shape = at::infer_size(mean.sizes(), std.sizes()); - - Tensor output = empty_mps( - shape, - mean.scalar_type(), - c10::nullopt, - kMPS, - c10::nullopt, - c10::nullopt); - - return normal_mps_out(mean, std, gen, output); + Tensor self = empty_mps(shape, mean.scalar_type(), c10::nullopt, kMPS); + return mps::normal_mps_impl(self, 0.0, 1.0, mean, std, __func__); } -Tensor& normal_mps_out(const Tensor& mean, double std, c10::optional gen, Tensor& output) { - TORCH_CHECK(std >= 0.0, "normal_mps_out expects std >= 0.0, but found std=", std); - - Tensor std_t = empty_mps( - output.sizes(), - output.scalar_type(), - c10::nullopt, - kMPS, - c10::nullopt, - c10::nullopt); - std_t.fill_(std); - - return normal_mps_out(mean, std_t, gen, output); - +Tensor& normal_mps_out(const Tensor& mean, double std, c10::optional gen, Tensor& self) { + return mps::normal_mps_impl(self, 0.0, std, mean, c10::nullopt, __func__); } -Tensor& normal_mps_out(double mean, const Tensor& std, c10::optional gen, Tensor& output) { - Tensor mean_t = empty_mps( - output.sizes(), - output.scalar_type(), - c10::nullopt, - kMPS, - c10::nullopt, - c10::nullopt); - mean_t.fill_(mean); - - return normal_mps_out(mean_t, std, gen, output); - +Tensor& normal_mps_out(double mean, const Tensor& std, c10::optional gen, Tensor& self) { + // when there's no tensor-type mean, we cannot pass scalar mean value due to the order of + // multiply/add ops in random computation. So we create a mean tensor instead. + Tensor mean_t = at::full_like(self, Scalar(mean)); + return mps::normal_mps_impl(self, 0.0, 1.0, mean_t, std, __func__); } -Tensor& normal_mps_out(const Tensor& mean, const Tensor& std, c10::optional gen, Tensor& output) { - TORCH_CHECK(!std.is_complex(), "normal expects standard deviation to be non-complex"); - // Check that mean and std have same number of elements +Tensor& normal_mps_out(const Tensor& mean, const Tensor& std, c10::optional gen, Tensor& self) { TORCH_CHECK(mean.numel() == std.numel(), "normal_mps_out: mean and std must have same number of elements") - - using namespace mps; - - struct CachedGraph : public MPSCachedGraph - { - CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) {} - MPSGraphTensor* outputTensor_ = nil; - MPSGraphTensor* meanTensor_ = nil; - MPSGraphTensor* stdTensor_ = nil; - }; - - MPSGraphCache* cache_ = MPSGraphCache::getInstance(); - - MPSStream* stream = getCurrentMPSStream(); - uint64_t seed_ = c10::detail::getNonDeterministicRandom(true); - - @autoreleasepool { - MPSShape* input_shape = getMPSShape(output); - string key = "normal_mps_out:" + getMPSShapeString(input_shape) + ":" + getMPSTypeString(output.scalar_type()) + ":" + to_string(seed_); - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); - - if(!cachedGraph) { - MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { - - CachedGraph *newCachedGraph = nil; - - @autoreleasepool { - MPSGraph* mpsGraph = make_mps_graph(); - newCachedGraph = new CachedGraph(mpsGraph); - - MPSGraphRandomOpDescriptor* desc = [[MPSGraphRandomOpDescriptor new] autorelease]; - desc.distribution = MPSGraphRandomDistributionNormal; - desc.dataType = getMPSDataType(output.scalar_type()); - desc.mean = 0.0; - desc.standardDeviation = 1.0; - - MPSGraphTensor* meanTensor = mpsGraphRankedPlaceHolder(mpsGraph, 
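The normal path composes the result as an affine transform of a unit normal draw, result = z * std + mean with z ~ N(0, 1), which is why the scalar-mean overloads above materialize the mean as a full_like tensor whenever std is a tensor: the per-element multiply happens first, so the scalar mean can no longer be folded into the descriptor. A tiny CPU-side sketch of that transform (illustrative only):

    #include <random>

    // Affine transform applied to a unit-normal sample: z * std + mean.
    double normal_affine_sketch(double mean, double std, std::mt19937& gen) {
      std::normal_distribution<double> unit(0.0, 1.0);
      return unit(gen) * std + mean;
    }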
getMPSDataType(output.scalar_type()), input_shape); - MPSGraphTensor* stdTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(output.scalar_type()), input_shape); - - // TODO: right now taking the default seed. Extend it to be extracted from the - // MPSGenerator - MPSGraphTensor* randomTensor = [mpsGraph randomTensorWithShape:input_shape - descriptor:desc - seed:seed_ - name:nil]; - MPSGraphTensor* scaleTensor = [mpsGraph multiplicationWithPrimaryTensor:randomTensor - secondaryTensor:stdTensor - name:nil]; - MPSGraphTensor* outputTensor = [mpsGraph additionWithPrimaryTensor:scaleTensor - secondaryTensor:meanTensor - name:nil]; - newCachedGraph->meanTensor_ = meanTensor; - newCachedGraph->stdTensor_ = stdTensor; - newCachedGraph->outputTensor_ = outputTensor; - - } - return newCachedGraph; - }); - cachedGraph = static_cast(tmpCachedGraph); - } - - auto meanPlaceholder = Placeholder(cachedGraph->meanTensor_, mean); - auto stdPlaceholder = Placeholder(cachedGraph->stdTensor_, std); - auto outputPlaceholder = Placeholder(cachedGraph->outputTensor_, output); - NSDictionary *feeds = @{ - meanPlaceholder.getMPSGraphTensor() : meanPlaceholder.getMPSGraphTensorData(), - stdPlaceholder.getMPSGraphTensor() : stdPlaceholder.getMPSGraphTensorData() - }; - NSDictionary* results = @{ - outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() - }; - - runMPSGraph(stream, cachedGraph->graph(), feeds, results); - - } - - return output; + return mps::normal_mps_impl(self, 0.0, 1.0, mean, std, __func__); } Tensor& bernoulli_out_mps(const Tensor& p_, c10::optional gen, Tensor& result) { result.resize_(p_.sizes()); - return bernoulli_mps_(result, p_, gen); + return mps::bernoulli_mps_impl(result, p_, __func__); } Tensor& bernoulli_mps_(Tensor& self, double p, c10::optional gen) { - TORCH_CHECK(0 <= p && p <= 1, "bernoulli_mps_ expects p to be in [0, 1], but got p=", p); - Tensor p_t = empty_mps( - self.sizes(), - self.scalar_type(), - c10::nullopt, - kMPS, - c10::nullopt, - c10::nullopt); - p_t.fill_(p); - - return bernoulli_mps_(self, p_t, gen); + TORCH_CHECK(0.0 <= p && p <= 1.0, "bernoulli_mps_ expects p to be in [0, 1], but got p=", p); + Tensor prob_t = at::full_like(self, Scalar(p)); + return mps::bernoulli_mps_impl(self, prob_t, __func__); } Tensor& bernoulli_mps_(Tensor& self, const Tensor& p_, c10::optional gen) { - TORCH_CHECK(self.is_same_size(p_), "bernoulli_mps_: probability and self tensor should be of the same shape") - - using namespace mps; - - MPSStream* stream = getCurrentMPSStream(); - uint64_t seed_ = c10::detail::getNonDeterministicRandom(true); - - @autoreleasepool { - MPSShape* input_shape = getMPSShape(self); - - auto mps_dtype = getMPSDataType(p_.scalar_type()); - - MPSGraph* mpsGraph = make_mps_graph(); - - MPSGraphTensor* probTensor = mpsGraphRankedPlaceHolder(mpsGraph, mps_dtype, input_shape); - - // TODO: right now taking the default seed. 
Extend it to be extracted from the - // MPSGenerator - MPSGraphTensor* randomTensor = [mpsGraph randomUniformTensorWithShape:input_shape - seed:seed_ - name:nil]; - MPSGraphTensor* outputTensor = [mpsGraph lessThanWithPrimaryTensor:randomTensor - secondaryTensor:probTensor - name:nil]; - - auto probPlaceholder = Placeholder(probTensor, p_); - auto outputPlaceholder = Placeholder(outputTensor, self); - NSDictionary *feeds = @{ - probPlaceholder.getMPSGraphTensor() : probPlaceholder.getMPSGraphTensorData(), - }; - NSDictionary* results = @{ - outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() - }; - - runMPSGraph(stream, mpsGraph, feeds, results); - } - - return self; - + return mps::bernoulli_mps_impl(self, p_, __func__); } -// Taken from ATen/native/DistributionTemplates.h -#define CHECK_OUT_OF_BOUNDS(var, name, min, max, dtype) \ - TORCH_CHECK(var >= min && var <= max, name , " is out of bounds for ", dtype); \ - -#define WARN_OUT_OF_BOUNDS(var, name, digits, dtype) \ - if (var < -(1LL << digits) || var > (1LL << digits)) { \ - TORCH_WARN(name , " is out of bounds [-(2^", digits, "), 2^", digits, "]. ", \ - "Due to precision limitations ", dtype, " can support discrete uniform distribution only within this range. ", \ - "This warning will become an error in version 1.7 release, please fix the code in advance"); \ - } - -// Modified from ATen/native/DistributionTemplates.h -static void check_from_to_in_range(int64_t from, int64_t to_inc, ScalarType scalar_type) { - const auto dtype = scalarTypeToTypeMeta(scalar_type); - if (isFloatingType(scalar_type)) { - AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, scalar_type, "check_random_fp_bounds", [&] { - const auto min = static_cast(std::numeric_limits::lowest()); - const auto max = static_cast(std::numeric_limits::max()); - CHECK_OUT_OF_BOUNDS(from, "from", min, max, dtype); - CHECK_OUT_OF_BOUNDS(to_inc, "to - 1", min, max, dtype); - - constexpr auto digits = std::numeric_limits::digits; - WARN_OUT_OF_BOUNDS(from, "from", digits, dtype); - WARN_OUT_OF_BOUNDS(to_inc, "to - 1", digits, dtype); - }); - } else if (isIntegralType(scalar_type, /*includeBool=*/true)) { - AT_DISPATCH_INTEGRAL_TYPES_AND(at::ScalarType::Bool, scalar_type, "check_random_integral_bounds", [&]() { - const auto min = static_cast(std::numeric_limits::lowest()); - const auto max = static_cast(std::numeric_limits::max()); - CHECK_OUT_OF_BOUNDS(from, "from", min, max, dtype); - CHECK_OUT_OF_BOUNDS(to_inc, "to - 1", min, max, dtype); - }); - } else { - TORCH_CHECK(false, "check_random_bounds handles only integral, floating-point and boolean types"); - } -} - - // random_.from -Tensor& random_mps_ - (Tensor& self, - int64_t from, - optional to_opt, - c10::optional gen) { - - using namespace mps; - - MPSStream* stream = getCurrentMPSStream(); - uint64_t seed_ = c10::detail::getNonDeterministicRandom(true); - +Tensor& random_mps_(Tensor& self, int64_t from, c10::optional to_opt, c10::optional gen) { auto input_dtype = self.scalar_type(); + int64_t to = 0; - int64_t to; - - if(to_opt.has_value()) { + if (to_opt.has_value()) { // [from, to) to = *to_opt; TORCH_CHECK(from < to, "random_mps_ expects 'from' to be less than 'to', but got from=", from, " >= to=", to); if (isFloatingType(input_dtype)) { - // TODO: what is "random_update_from_to"? 
AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, input_dtype, "random_update_from_to", [&] { from = templates::update_from(from); to = templates::update_to(to); TORCH_CHECK(from < to, "random_mps_ expects 'from' casted to dtype to be less than 'to' casted to dtype, but got from=", from, " >= to=", to); }); - check_from_to_in_range(from, to - 1, input_dtype); + templates::check_from_to_in_range(from, to - 1, self.dtype()); } - } - else if (from != std::numeric_limits::lowest()) { + } else if (from != std::numeric_limits::lowest()) { // [from, std::numeric_limits::max()] - to = 0; - if(isFloatingType(input_dtype)) { + if (isFloatingType(input_dtype)) { AT_DISPATCH_FLOATING_TYPES_AND2(at::ScalarType::Half, at::ScalarType::BFloat16, input_dtype, "random_from_to_range_calc", [&] { constexpr int64_t scalar_t_max = static_cast(1) << std::numeric_limits::digits; to = scalar_t_max > std::numeric_limits::max() ? std::numeric_limits::max() : static_cast(scalar_t_max); from = templates::update_from(from); TORCH_CHECK(from < to, "random_mps_ expects 'from' casted to dtype to be less than or equal to 'to' casted to dtype, but got from=", from, " > to=", to); }); - } - else if(isIntegralType(input_dtype, /*includeBool=*/true)) { + } else if (isIntegralType(input_dtype, /*includeBool=*/true)) { AT_DISPATCH_INTEGRAL_TYPES_AND(at::ScalarType::Bool, input_dtype, "random_from_to_range_calc", [&] { if (std::is_same::value) { to = static_cast(true); @@ -437,124 +310,294 @@ static void check_from_to_in_range(int64_t from, int64_t to_inc, ScalarType scal else { TORCH_CHECK(false, "random_mps_ handles only integral, floating-point and boolean types"); } - check_from_to_in_range(from, to, input_dtype); + templates::check_from_to_in_range(from, to, self.dtype()); } else { // [std::numeric_limits::lowest(), std::numeric_limits::max()] // range = 2^64 - // TODO - how to implement this? + // TODO - should we error out in case max is beyond MPS limit (INT32_MAX)? TORCH_CHECK(false, "random_mps_ currently does not handle the lowest() -> max() range"); - - } - - @autoreleasepool { - MPSShape* input_shape = getMPSShape(self); - - MPSGraph* mpsGraph = make_mps_graph(); - - MPSGraphRandomOpDescriptor* descriptor = [MPSGraphRandomOpDescriptor descriptorWithDistribution:MPSGraphRandomDistributionUniform - dataType:MPSDataTypeInt32]; - descriptor.minInteger = from; - descriptor.maxInteger = to - 1; - - // TODO: right now taking the default seed. 
Extend it to be extracted from the - // MPSGenerator - MPSGraphTensor* randomTensor = [mpsGraph randomTensorWithShape:input_shape - descriptor:descriptor - seed:seed_ - name:nil]; - - MPSGraphTensor* outputTensor = nil; - - if(input_dtype != ScalarType::Int) - outputTensor = [mpsGraph castTensor:randomTensor - toType:getMPSDataType(input_dtype) - name:@"outputTensor"]; - else - outputTensor = randomTensor; - - auto outputPlaceholder = Placeholder(outputTensor, self); - NSDictionary *feeds = nil; - NSDictionary* results = @{ - outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() - }; - - runMPSGraph(stream, mpsGraph, feeds, results); } - return self; - + return mps::random_mps_impl(self, from, to - 1, c10::nullopt, c10::nullopt, + MPSGraphRandomDistributionUniform, __func__, nullptr); } -Tensor& random_mps_ - (Tensor& self, - int64_t to, - c10::optional gen) { - +Tensor& random_mps_(Tensor& self, int64_t to, c10::optional gen) { return random_mps_(self, 0, to, gen); } // Exponential distribution - Tensor& exponential_mps_(Tensor& self, double lambda, c10::optional gen) { + TORCH_CHECK(lambda > 0, "exponential_mps_: lambda must be greater than zero") - using namespace mps; + mps::RandomOpBlock random_op_block = ^RandomOpFn(cachedGraph, randomTensor) { + MPSGraph* mpsGraph = cachedGraph->graph(); + MPSGraphTensor* unitTensor = [mpsGraph constantWithScalar: 1.0f + dataType: randomTensor.dataType]; + MPSGraphTensor* minusLambdaTensor = [mpsGraph constantWithScalar: -lambda + dataType: randomTensor.dataType]; + MPSGraphTensor* subtractTensor = [mpsGraph subtractionWithPrimaryTensor: unitTensor + secondaryTensor: randomTensor + name: nil]; + MPSGraphTensor* logTensor = [mpsGraph logarithmWithTensor: subtractTensor + name: nil]; + return [mpsGraph divisionWithPrimaryTensor: logTensor + secondaryTensor: minusLambdaTensor + name: nil]; + }; + return mps::random_mps_impl(self, 0.0, 1.0, c10::nullopt, c10::nullopt, + MPSGraphRandomDistributionUniform, + "exponential_mps_:" + std::to_string(lambda), random_op_block); +} - if (self.numel() == 0) { - return self; - } +Tensor& multinomial_with_replacement_mps_kernel( + const Tensor& self, + const int64_t n_sample, + c10::optional generator, + Tensor& result) { - TORCH_CHECK(lambda > 0, "exponential_mps_: lambda must be greater than zero") + using namespace mps; - struct CachedGraph : public MPSCachedGraph - { - CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) {} - MPSGraphTensor *outputTensor_ = nil; - }; + int inputSize = self.dim(); + int numDist = + inputSize == 1 ? 1 : self.size(0); + int numCategories = + inputSize == 1 ? self.size(0) : self.size(1); + + // Restructure data for 2d + auto self_v = inputSize == 1 ? self.view({numDist, numCategories}) : self; + auto result_v = inputSize == 1 ? result.view({numDist, n_sample}) : result; MPSStream* stream = getCurrentMPSStream(); - uint64_t seed_ = c10::detail::getNonDeterministicRandom(true); + MPSGraphCache* cache_ = MPSGraphCache::getInstance(); @autoreleasepool { - MPSShape* self_shape = getMPSShape(self); - - MPSGraph* mpsGraph = make_mps_graph(); - // TODO: right now taking the default seed. 
Extend it to be extracted from the - // MPSGenerator - MPSGraphTensor* randomTensor = [mpsGraph randomUniformTensorWithShape:self_shape - seed:seed_ - name:nil]; - MPSGraphTensor* unitTensor = [mpsGraph constantWithScalar:1.0f - dataType:MPSDataTypeFloat32]; - MPSGraphTensor* minusLambdaTensor = [mpsGraph constantWithScalar:-lambda - dataType:MPSDataTypeFloat32]; - MPSGraphTensor* subtractTensor = [mpsGraph subtractionWithPrimaryTensor:unitTensor - secondaryTensor:randomTensor - name:nil]; - MPSGraphTensor* logTensor = [mpsGraph logarithmWithTensor:subtractTensor - name:nil]; - MPSGraphTensor* outputTensor = [mpsGraph divisionWithPrimaryTensor:logTensor - secondaryTensor:minusLambdaTensor + string key = "multinomial_with_replacement:" + getTensorsStringKey({self}) + ":" + to_string(n_sample); + auto cachedGraph = cache_->LookUpAs(key); + if (!cachedGraph) { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ MPSCachedGraph * () { + RandomCachedGraph *newCachedGraph = nil; + @autoreleasepool { + MPSShape* prob_shape = getMPSShape(self_v); + MPSGraph* mpsGraph = make_mps_graph(); + newCachedGraph = new RandomCachedGraph(mpsGraph); + newCachedGraph->stateTensor = mpsGraphRankedPlaceHolder(mpsGraph, MPSDataTypeInt32, @[@7]); + + auto prob_dtype = getMPSDataType(self_v.scalar_type()); + + // This is probability weights + newCachedGraph->probTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(self_v.scalar_type()), prob_shape); + + MPSGraphTensor *sumProbs = [mpsGraph reductionSumWithTensor:newCachedGraph->probTensor + axis:-1 + name:nil]; + + MPSGraphTensor *normalizedProbs = [mpsGraph divisionWithPrimaryTensor:newCachedGraph->probTensor + secondaryTensor:sumProbs + name:nil]; + + auto ns_numCategories = [NSNumber numberWithInt:numCategories]; + auto ns_numDist = [NSNumber numberWithInt:numDist]; + auto ns_n_sample = [NSNumber numberWithInt:n_sample]; + + MPSGraphTensor *ones = [mpsGraph constantWithScalar:1.0f + shape:@[ns_numCategories, ns_numCategories] + dataType:prob_dtype]; + auto zeroTensor = [mpsGraph constantWithScalar: 0.0f + dataType: MPSDataTypeInt32]; + auto minusOneTensor = [mpsGraph constantWithScalar: -1.0f + dataType: MPSDataTypeInt32]; + + MPSGraphTensor *upperTriangle = [mpsGraph bandPartWithTensor:ones + numLowerTensor:zeroTensor + numUpperTensor:minusOneTensor name:nil]; + MPSGraphTensor *upperProbRange = [mpsGraph matrixMultiplicationWithPrimaryTensor:normalizedProbs + secondaryTensor:upperTriangle + name:nil]; - if(getMPSDataType(self.scalar_type()) != MPSDataTypeFloat32) - outputTensor = [mpsGraph castTensor:outputTensor - toType:getMPSDataType(self.scalar_type()) - name:@"output"]; + MPSGraphTensor *lowerProbRange = [mpsGraph subtractionWithPrimaryTensor:upperProbRange + secondaryTensor:normalizedProbs + name:nil]; - auto outputPlaceholder = Placeholder(outputTensor, self); - NSDictionary *feeds = nil; + upperProbRange = [mpsGraph reshapeTensor:upperProbRange + withShape:@[ns_numDist, @1, ns_numCategories] + name:nil]; + lowerProbRange = [mpsGraph reshapeTensor:lowerProbRange + withShape:@[ns_numDist, @1, ns_numCategories] + name:nil]; + + MPSGraphRandomOpDescriptor *descriptor = [MPSGraphRandomOpDescriptor descriptorWithDistribution:MPSGraphRandomDistributionUniform + dataType:prob_dtype]; + NSArray *generatorTensors = [mpsGraph randomTensorWithShape:@[ns_numDist, ns_n_sample, @1] + descriptor:descriptor + stateTensor:newCachedGraph->stateTensor + name:nil]; + MPSGraphTensor *randomTensor = generatorTensors[0]; + + auto broadcastShape = @[ns_numDist 
,ns_n_sample, ns_numCategories]; + int broadcastShapeVals[3] = {numDist, static_cast(n_sample), numCategories}; + MPSGraphTensor *broadcastShapeTensor = [mpsGraph constantWithData:[NSData dataWithBytes:broadcastShapeVals length:sizeof(int) * broadcastShape.count] + shape:@[[NSNumber numberWithUnsignedInteger:broadcastShape.count]] + dataType:MPSDataTypeUInt32]; + + MPSGraphTensor *samplesTensor = [mpsGraph broadcastTensor:randomTensor + toShape:broadcastShape + name:nil]; + MPSGraphTensor *sampleAbove = [mpsGraph greaterThanWithPrimaryTensor:samplesTensor + secondaryTensor:lowerProbRange + name:nil]; + MPSGraphTensor *sampleBelow = [mpsGraph lessThanWithPrimaryTensor:samplesTensor + secondaryTensor:upperProbRange + name:nil]; + MPSGraphTensor *sampleWithin = [mpsGraph logicalANDWithPrimaryTensor:sampleAbove + secondaryTensor:sampleBelow + name:nil]; + MPSGraphTensor *sampleMask = [mpsGraph castTensor:sampleWithin + toType:MPSDataTypeInt32 + name:@"sampleMask"]; + MPSGraphTensor *categoriesTensor = [mpsGraph coordinateAlongAxis:-1 + withShapeTensor:broadcastShapeTensor + name:nil]; + MPSGraphTensor *binnedSamplesTensor = [mpsGraph multiplicationWithPrimaryTensor:categoriesTensor + secondaryTensor:sampleMask + name:nil]; + MPSGraphTensor *reducedTensor = [mpsGraph reductionSumWithTensor:binnedSamplesTensor + axis:-1 + name:nil]; + MPSGraphTensor *reshapeTensor = [mpsGraph reshapeTensor:reducedTensor + withShape:@[ns_numDist ,ns_n_sample] + name:nil]; + newCachedGraph->resultTensor = [mpsGraph castTensor:reshapeTensor + toType:getMPSDataType(result.scalar_type()) + name:@"resultTensor"]; + } + return newCachedGraph; + }); + } + // update the Philox state values on each run of the same graph + cachedGraph->updatePhiloxCounters(); + // feed the updated state values to the graph + MPSNDArrayDescriptor *stateDesc = [MPSNDArrayDescriptor descriptorWithDataType: MPSDataTypeInt32 shape: @[@7]]; + MPSNDArray *stateNDArray = [[[MPSNDArray alloc] initWithDevice: stream->device() descriptor: stateDesc] autorelease]; + [stateNDArray writeBytes: &cachedGraph->stateValues[0] strideBytes: nil]; + MPSGraphTensorData* stateTensorData = [[[MPSGraphTensorData alloc] initWithMPSNDArray: stateNDArray] autorelease]; + + auto probPlaceholder = Placeholder(cachedGraph->probTensor, self_v); + auto outputPlaceholder = Placeholder(cachedGraph->resultTensor, result_v); + NSDictionary *feeds = @{ + cachedGraph->stateTensor : stateTensorData, + probPlaceholder.getMPSGraphTensor() : probPlaceholder.getMPSGraphTensorData() + }; NSDictionary* results = @{ outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() }; - runMPSGraph(stream, mpsGraph, feeds, results); + runMPSGraph(stream, cachedGraph->graph(), feeds, results); + } + + return result; + +} +/* The largest consecutive integer representable in float32 (2^24) */ +constexpr int64_t FLOAT32_MAX_CONSECUTIVE_INT = 1 << (FLT_MANT_DIG); + +Tensor& multinomial_out_mps(const Tensor& self, + int64_t n_sample, + bool with_replacement, + c10::optional gen, + Tensor& result) { + + TORCH_CHECK( + result.device() == self.device(), + "multinomial arguments must have the same device"); + TORCH_CHECK( + self.dim() > 0 && self.dim() <= 2, "prob_dist must be 1 or 2 dim"); + TORCH_CHECK( + at::isFloatingType(self.scalar_type()), + "multinomial only supports floating-point dtypes for input, got: ", + self.scalar_type()); + TORCH_CHECK(result.scalar_type() == ScalarType::Long, + "multinomial expects Long tensor out, got: ", result.scalar_type()); + 
TORCH_CHECK(n_sample > 0, "cannot sample n_sample <= 0 samples"); + int64_t n_categories = self.size(-1); + TORCH_CHECK(with_replacement || (n_sample <= n_categories), + "cannot sample n_sample > prob_dist.size(-1) samples without replacement"); + // Since the index tensor is float, numCategories cannot exceed max + // float integer precision + TORCH_CHECK( + n_categories <= FLOAT32_MAX_CONSECUTIVE_INT, + "number of categories cannot exceed 2^24"); + + if (self.dim() == 1) { + result.resize_({n_sample}); + } else { + const int64_t n_dist = self.size(0); + result.resize_({n_dist, n_sample}); + } + if (result.numel() == 0) { + return result; } - return self; + // Fast-path for no replacement (or if only one sample draw). + // Reference: + // https://github.com/pytorch/pytorch/issues/11931#issuecomment-625882503 + if (!with_replacement || n_sample == 1) { + // Sanity checks on `self`. + auto is_valid = ((self.max() < INFINITY) & (self.min() >= 0)).item(); + TORCH_CHECK( + is_valid.to(), + "probability tensor contains either `inf`, `nan` or element < 0"); + // NOLINTNEXTLINE(cppcoreguidelines-init-variables) + bool zero_prob_condition; + if (self.dim() == 1){ + zero_prob_condition = (self.sum() == 0).item().to(); + } else { + zero_prob_condition = (self.sum(1) == 0).sum().item().to(); + } + TORCH_CHECK( + !zero_prob_condition, + "invalid multinomial distribution (sum of probabilities <= 0)"); + + // The algorithm is from gumbel softmax. + // s = argmax( logp - log(-log(eps)) ) where eps ~ U(0, 1) + // Here we can apply exp to the formula which will not affect result of + // argmax or topk. Then we have + // s = argmax( p / (-log(eps)) ) where eps ~ U(0, 1). + // We can also simplify the formula above by + // s = argmax( p / q ) where q ~ Exp(1) + Tensor q = at::empty_like(self).exponential_(1, gen); + // In theory the probability to generate 0 from exponential distribution is + // 0. However, on CUDA side there is a protection to avoid 0s, but on CPU + // side, there is a very low probability to generate 0 from + // exponential. The probability is about 2^(-DBL_MANT_DIG). We just + // ignore it here, but there may be some risk to get invalid output on CPU. 
+ at::div_out(q, self, q); + if (n_sample == 1) { + at::argmax_out(result, q, /*dim=*/-1, /*keepdim=*/true); + } else { + Tensor vals = at::empty(result.sizes(), self.options()); + at::topk_out(vals, result, q, n_sample); + } + return result; + } + + result = multinomial_with_replacement_mps_kernel(const_cast(self), n_sample, gen, result); + + return result; +} +Tensor multinomial_mps( + const Tensor& self, + int64_t n_sample, + bool with_replacement, + c10::optional gen) { + Tensor result = at::empty({0}, self.options().dtype(kLong)); + multinomial_out_mps(self, n_sample, with_replacement, gen, result); + return result; } } // namespace native diff --git a/aten/src/ATen/native/mps/operations/Eye.mm b/aten/src/ATen/native/mps/operations/Eye.mm index 45b3fdf68b07..6b72c0686caa 100644 --- a/aten/src/ATen/native/mps/operations/Eye.mm +++ b/aten/src/ATen/native/mps/operations/Eye.mm @@ -70,9 +70,9 @@ @autoreleasepool { // A key is used to identify the MPSGraph which was created once, and can be reused if the parameters, data types etc match the earlier created MPSGraph string key = "eye_out_mps:" + getTensorsStringKey({result}); - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + CachedGraph* cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @@ -94,7 +94,6 @@ } return newCachedGraph; }); - cachedGraph = static_cast(tmpCachedGraph); } // Create placeholders which use the keys of the CachedGraph to create inputs and outputs of the operation diff --git a/aten/src/ATen/native/mps/operations/Indexing.h b/aten/src/ATen/native/mps/operations/Indexing.h new file mode 100644 index 000000000000..e769a7121d50 --- /dev/null +++ b/aten/src/ATen/native/mps/operations/Indexing.h @@ -0,0 +1,39 @@ +// Copyright © 2022 Apple Inc. + +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace at::mps; + +namespace at { +namespace native { +namespace mps { + +std::string getBitSizeString(ScalarType scalar_type) { + size_t scalarBitSize = c10::elementSize(scalar_type) * 8; + TORCH_CHECK(scalarBitSize <= 64, "Unsupported data type: ", getMPSTypeString(scalar_type)); + return std::to_string(scalarBitSize) + "bit"; + +} + +std::string getIndexFunctionName(ScalarType scalar_type, bool index_select, bool accumulate) { + std::string indexFunction = index_select ? "index_select_" : + (accumulate && (scalar_type != kBool)) ? "index_put_accumulate_" : "index_put_"; + + indexFunction += getBitSizeString(scalar_type); + if (accumulate) { + TORCH_CHECK(scalar_type == ScalarType::Float || scalar_type == ScalarType::Int, "Unsupported data type for accumulate case: ", getMPSTypeString(scalar_type)); + string dtypeString = (scalar_type == ScalarType::Float) ? "_float" : "_int"; + indexFunction += dtypeString; + } + return indexFunction; +} +} +} +} diff --git a/aten/src/ATen/native/mps/operations/Indexing.mm b/aten/src/ATen/native/mps/operations/Indexing.mm index 7acb2fdba422..78e93fc99175 100644 --- a/aten/src/ATen/native/mps/operations/Indexing.mm +++ b/aten/src/ATen/native/mps/operations/Indexing.mm @@ -1,5 +1,4 @@ // Copyright © 2022 Apple Inc. 
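The fast path quoted above relies on the exponential-race form of the Gumbel-max trick spelled out in the comment: drawing q_i ~ Exp(1) independently and taking argmax(p_i / q_i) selects index i with probability p_i / sum(p), so a single argmax (or topk for several draws without replacement) replaces an explicit CDF search. A minimal CPU sketch of one draw, assuming a non-negative weight vector with positive sum (names are illustrative):

    #include <cstddef>
    #include <random>
    #include <vector>

    // One multinomial draw via the exponential-race identity:
    // argmax_i (p_i / q_i) with q_i ~ Exp(1) picks i with probability p_i / sum(p).
    size_t multinomial_one_sample_sketch(const std::vector<double>& weights,
                                         std::mt19937& gen) {
      std::exponential_distribution<double> exp1(1.0);
      size_t best = 0;
      double best_ratio = -1.0;
      for (size_t i = 0; i < weights.size(); ++i) {
        const double ratio = weights[i] / exp1(gen);
        if (ratio > best_ratio) {
          best_ratio = ratio;
          best = i;
        }
      }
      return best;
    }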
- #include #include #include @@ -12,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -20,6 +20,7 @@ #include #include #include +#include #ifdef __OBJC__ #include @@ -28,6 +29,199 @@ namespace at { namespace native { +static +bool dispatchIndexKernel(TensorIteratorBase& iter, + IntArrayRef index_size, + IntArrayRef index_stride, + bool index_select, + bool accumulate) { + using namespace mps; + + if (iter.numel() == 0) + return true; + + const Tensor& inputTensor = iter.tensor(1); + Tensor outputTensor = iter.tensor(0); + id inputBuffer = getMTLBufferStorage(inputTensor); + id outputBuffer = getMTLBufferStorage(outputTensor); + MPSStream* mpsStream = getCurrentMPSStream(); + id device = MPSDevice::getInstance()->device(); + + dispatch_sync(mpsStream->queue(), ^(){ + @autoreleasepool { + NSError* error = nil; + constexpr uint32_t nOffsets = 3; + const int64_t num_indices = index_size.size(); + const uint32_t numThreads = iter.numel(); + const uint32_t nDim = iter.ndim(); + const IntArrayRef& iterShape = iter.shape(); + std::vector iterShapeData(iterShape.size()); + std::vector> strides(nDim); + + for (const auto i: c10::irange(iterShape.size())) { + TORCH_CHECK(i <= UINT32_MAX); + iterShapeData[i] = (uint32_t)(iterShape[i]); + } + + for (const auto i: c10::irange(nDim)) { + for (const auto offset: c10::irange(nOffsets)) { + strides[i][offset] = iter.strides(offset)[i]; + } + } + + MTLSize gridSize = MTLSizeMake(numThreads, 1, 1); + id commandBuffer = mpsStream->commandBuffer(); + id computeEncoder = [commandBuffer computeCommandEncoder]; + id kernelDataOffsetsFunction = MPSDevice::getInstance()->metalIndexingFunction("kernel_index_offsets", nil); + id kernelDataOffsetsPSO = [[device newComputePipelineStateWithFunction: kernelDataOffsetsFunction + error: &error] autorelease]; + id kernelDataOffsets = [[device newBufferWithLength: numThreads * sizeof(simd_uint3) + options: 0] autorelease]; + TORCH_CHECK(kernelDataOffsetsPSO, "Failed to created pipeline state object, error: ", [[error description] UTF8String]); + + [computeEncoder setComputePipelineState:kernelDataOffsetsPSO]; + [computeEncoder setBytes:strides.data() length:sizeof(uint32_t) * nDim * nOffsets atIndex:0]; + [computeEncoder setBuffer:kernelDataOffsets offset:0 atIndex:1]; + [computeEncoder setBytes:iterShapeData.data() length:sizeof(uint32_t) * iterShape.size() atIndex:2]; + [computeEncoder setBytes:&nDim length:sizeof(uint32_t) atIndex:3]; + [computeEncoder setBytes:&nOffsets length:sizeof(uint32_t) atIndex:4]; + + NSUInteger kernelOffsetsTGSize = kernelDataOffsetsPSO.maxTotalThreadsPerThreadgroup; + if (kernelOffsetsTGSize > numThreads) + kernelOffsetsTGSize = numThreads; + + MTLSize kernelOffsetsThreadGroupSize = MTLSizeMake(kernelOffsetsTGSize, 1, 1); + [computeEncoder dispatchThreads: gridSize + threadsPerThreadgroup: kernelOffsetsThreadGroupSize]; + + MTLFunctionConstantValues* constantValues = [[MTLFunctionConstantValues new] autorelease]; + [constantValues setConstantValue: &num_indices type:MTLDataTypeUInt atIndex:0]; + + std::string indexFunction = getIndexFunctionName(inputTensor.scalar_type(), index_select, accumulate); + id indexKernelFunction = MPSDevice::getInstance()->metalIndexingFunction(indexFunction, constantValues); + id argumentEncoder = [[indexKernelFunction newArgumentEncoderWithBufferIndex:0] autorelease]; + NSUInteger argumentBufferLength = argumentEncoder.encodedLength; + id indexAB = [[device newBufferWithLength:argumentBufferLength options:0] autorelease]; + 
[argumentEncoder setArgumentBuffer:indexAB offset:0]; + + for (uint32_t idx = 0; idx < num_indices; idx++) { + const Tensor& indexTensor = iter.tensor(idx+2); + [argumentEncoder setBuffer: getMTLBufferStorage(indexTensor) + offset: indexTensor.storage_offset() * indexTensor.element_size() + atIndex: idx]; + TORCH_CHECK(indexTensor.scalar_type() == ScalarType::Long, "index(): Expected dtype int64 for Index"); + } + + // FIXME: PSO needs to be cached + id indexSelectPSO = [[device newComputePipelineStateWithFunction: indexKernelFunction + error: &error] autorelease]; + TORCH_CHECK(indexSelectPSO, "Failed to created pipeline state object, error: ", [[error description] UTF8String]); + + for (uint32_t idx = 0; idx < num_indices; idx++) { + const Tensor& indexTensor = iter.tensor(idx+2); + [computeEncoder useResource:getMTLBufferStorage(indexTensor) usage:MTLResourceUsageRead]; + } + + [computeEncoder setComputePipelineState:indexSelectPSO]; + [computeEncoder setBuffer:indexAB offset:0 atIndex:0]; + [computeEncoder setBytes:index_size.data() length:sizeof(index_size[0]) * index_size.size() atIndex:1]; + [computeEncoder setBytes:index_stride.data() length:sizeof(index_stride[0]) * index_stride.size() atIndex:2]; + [computeEncoder setBuffer:kernelDataOffsets offset:0 atIndex:3]; + [computeEncoder setBuffer:inputBuffer offset:inputTensor.storage_offset() * inputTensor.element_size() atIndex:4]; + [computeEncoder setBuffer:outputBuffer offset:outputTensor.storage_offset() * outputTensor.element_size() atIndex:5]; + + NSUInteger tgSize = indexSelectPSO.maxTotalThreadsPerThreadgroup; + if (tgSize > numThreads) + tgSize = numThreads; + + MTLSize threadGroupSize = MTLSizeMake(tgSize, 1, 1); + [computeEncoder dispatchThreads: gridSize + threadsPerThreadgroup: threadGroupSize]; + + [computeEncoder endEncoding]; + mpsStream->commit(true); + } + }); + + return true; +} + +static void validateInputData(const TensorIteratorBase& iter, IntArrayRef index_size, IntArrayRef index_stride, const std::string& op, bool accumulate) { + using namespace mps; + + int64_t num_indices = index_size.size(); + TORCH_CHECK(num_indices <= 16, "Current limit allows up to 16 indices to be used in MPS indexing kernels"); + + AT_ASSERT(num_indices == index_stride.size()); + AT_ASSERT(num_indices == iter.ntensors() - 2); + const Tensor& inputTensor = iter.tensor(1); + + if (accumulate) { + // No atomic support for the rest of dtypes + TORCH_CHECK(inputTensor.scalar_type() == ScalarType::Float || + inputTensor.scalar_type() == ScalarType::Int || + inputTensor.scalar_type() == ScalarType::Bool); + } else { + TORCH_CHECK(c10::isIntegralType(inputTensor.scalar_type(), /*includesBool=*/true) || + inputTensor.scalar_type() == ScalarType::Float || + inputTensor.scalar_type() == ScalarType::Half, + getMPSTypeString(inputTensor.scalar_type()) + std::string(" not supported for index.Tensor_out")); + } +} + +void index_kernel_mps(TensorIteratorBase& iter, IntArrayRef index_size, IntArrayRef index_stride) { + using namespace mps; + @autoreleasepool { + validateInputData(iter, index_size, index_stride, "index.Tensor_out", /*accumulate=*/false); + dispatchIndexKernel(iter, index_size, index_stride, /*index_select=*/true, /*accumulate=*/false); + } +} + +void index_put_kernel_mps(TensorIterator& iter, IntArrayRef index_size, IntArrayRef index_stride, bool accumulate) { + using namespace mps; + @autoreleasepool { + validateInputData(iter, index_size, index_stride, "index_put_impl", accumulate); + dispatchIndexKernel(iter, index_size, index_stride, 
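dispatchIndexKernel first runs the small kernel_index_offsets compute kernel, which turns each linear thread id into per-operand byte offsets using the iterator shape and the three packed stride arrays, and the index_select / index_put kernel then consumes those offsets through the argument buffer set up above. A plain C++ sketch of the offset arithmetic that step performs for a single thread and the three operands (output, input, indices); names and layout here are illustrative:

    #include <array>
    #include <cstdint>
    #include <vector>

    // Decompose a linear thread index into per-dimension coordinates using the
    // iterator shape, and accumulate each operand's byte offset from its
    // per-dimension strides (strides[dim][operand], as packed in the patch).
    std::array<uint32_t, 3> data_offsets_sketch(
        uint32_t thread_index,
        const std::vector<uint32_t>& shape,
        const std::vector<std::array<uint32_t, 3>>& strides) {
      std::array<uint32_t, 3> offsets{0, 0, 0};
      uint32_t remaining = thread_index;
      for (size_t dim = 0; dim < shape.size(); ++dim) {
        const uint32_t coord = remaining % shape[dim];
        remaining /= shape[dim];
        for (size_t op = 0; op < 3; ++op)
          offsets[op] += coord * strides[dim][op];
      }
      return offsets;
    }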
/*index_select=*/false, accumulate); + } +} + +static Tensor & masked_select_out_mps_impl(Tensor & result, const Tensor & self, const Tensor & mask) { + NoNamesGuard guard; + + TORCH_CHECK(mask.scalar_type() == ScalarType::Byte || mask.scalar_type() == ScalarType::Bool, + "masked_select: expected BoolTensor or ByteTensor for mask"); + TORCH_CHECK(self.scalar_type() == result.scalar_type(), + "masked_select(): self and result must have the same scalar type"); + + auto mask_temp = (mask.dim() == 0) + ? c10::MaybeOwned::owned(mask.unsqueeze(0)) + : c10::MaybeOwned::borrowed(mask); + auto self_temp = (self.dim() == 0) + ? c10::MaybeOwned::owned(self.unsqueeze(0)) + : c10::MaybeOwned::borrowed(self); + + // Cannot reassign to mask_temp and self_temp here! if they are + // owning and expand_outplace returns a borrow, the returned borrow + // would dangle. + auto mask_self_expanded = expand_outplace(*mask_temp, *self_temp); + at::index_out( + result, *std::get<1>(mask_self_expanded), + c10::List>({*std::move(std::get<0>(mask_self_expanded))})); + + return result; +} + +Tensor masked_select_mps(const Tensor & self, const Tensor & mask) { + namedinference::compute_broadcast_outnames(self, mask); + Tensor result = at::empty({0}, self.options()); + return masked_select_out_mps_impl(result, self, mask); +} + +Tensor & masked_select_out_mps(const Tensor & self, const Tensor & mask, Tensor & result) { + namedinference::compute_broadcast_outnames(self, mask); + return masked_select_out_mps_impl(result, self, mask); +} + Tensor flip_mps(const Tensor& self, IntArrayRef dims) { using namespace mps; @@ -42,7 +236,7 @@ Tensor flip_mps(const Tensor& self, IntArrayRef dims) { auto total_dims = self.dim(); // It wraps the dims and checks that there are no repeated dims auto flip_dims_b = at::dim_list_to_bitset(dims, total_dims); - NSMutableArray * ns_dims = [NSMutableArray new]; + NSMutableArray * ns_dims = [[NSMutableArray new] autorelease]; for (const auto i : c10::irange(total_dims)) { if(flip_dims_b[i] && self.size(i) > 1 && self.stride(i) != 0) { @@ -58,12 +252,7 @@ Tensor flip_mps(const Tensor& self, IntArrayRef dims) { MPSStream* stream = getCurrentMPSStream(); - struct CachedGraph : public MPSCachedGraph - { - CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) {} - MPSGraphTensor* inputTensor_ = nil; - MPSGraphTensor* outputTensor_ = nil; - }; + using CachedGraph = mps::MPSUnaryCachedGraph; MPSGraphCache* cache_ = MPSGraphCache::getInstance(); @@ -71,9 +260,9 @@ Tensor flip_mps(const Tensor& self, IntArrayRef dims) { NSString* ns_dims_key = [[ns_dims valueForKey:@"description"] componentsJoinedByString:@","]; // A key is used to identify the MPSGraph which was created once, and can be reused if the parameters, data types etc match the earlier created MPSGraph string key = "flip_mps:" + getTensorsStringKey({self}) + ":" + string([ns_dims_key UTF8String]); - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + auto cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @@ -90,7 +279,6 @@ Tensor flip_mps(const Tensor& self, IntArrayRef dims) { } return newCachedGraph; }); - cachedGraph = static_cast(tmpCachedGraph); } // Create placeholders which use the keys of the CachedGraph to create inputs and outputs of the operation @@ -147,10 +335,10 @@ Tensor flip_mps(const Tensor& self, IntArrayRef dims) { 
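A note on the MPS indexing kernels and the masked_select path added above: index_kernel_mps and index_put_kernel_mps are registered through index_stub / index_put_stub, and masked_select_mps is routed through at::index_out after expanding the mask against self. The following is a minimal Python-level sketch of the behaviour these kernels are intended to back; it is illustrative only and assumes an MPS-capable build where these registrations are active.

import torch

if torch.backends.mps.is_available():
    dev = torch.device("mps")
    x = torch.arange(6.0, device=dev).reshape(2, 3)
    mask = x > 2.0                                    # Bool mask, broadcastable against x
    print(torch.masked_select(x, mask))               # 1-D tensor of the selected elements
    rows = torch.tensor([1], device=dev)              # advanced indices must be int64
    cols = torch.tensor([0, 2], device=dev)
    x.index_put_((rows, cols), torch.tensor([9.0, 9.0], device=dev))  # writes x[1, 0] and x[1, 2]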
@autoreleasepool { string key = "index_add_mps_out" + getTensorsStringKey({self, index, source}) + ":" + std::to_string(dim); - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + CachedGraph* cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @autoreleasepool { @@ -178,19 +366,19 @@ Tensor flip_mps(const Tensor& self, IntArrayRef dims) { } return newCachedGraph; }); - cachedGraph = static_cast(tmpCachedGraph); } Placeholder selfPlaceholder = Placeholder(cachedGraph->inputTensor_, self); Placeholder indexPlaceholder = Placeholder(cachedGraph->indexTensor_, index); Placeholder sourcePlaceholder = Placeholder(cachedGraph->sourceTensor_, source); Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor_, result); + MPSScalar alpha_scalar = getMPSScalar(alpha_f, source.scalar_type()); NSDictionary* feeds = @{ selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(), indexPlaceholder.getMPSGraphTensor() : indexPlaceholder.getMPSGraphTensorData(), sourcePlaceholder.getMPSGraphTensor() : sourcePlaceholder.getMPSGraphTensorData(), - cachedGraph->alphaTensor_ : getMPSGraphTensorFromScalar(stream, alpha_f, MPSDataTypeFloat32) + cachedGraph->alphaTensor_ : getMPSGraphTensorFromScalar(stream, alpha_scalar), }; NSDictionary* results = @{ outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() @@ -265,10 +453,10 @@ Tensor index_select_mps(const Tensor & self, @autoreleasepool { string key = "index_select_out_mps" + getTensorsStringKey({self, index}) + ":" + std::to_string(dim); - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + CachedGraph* cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @autoreleasepool { @@ -290,7 +478,6 @@ Tensor index_select_mps(const Tensor & self, } return newCachedGraph; }); - cachedGraph = static_cast(tmpCachedGraph); } Placeholder selfPlaceholder = Placeholder(cachedGraph->inputTensor_, self); @@ -335,9 +522,9 @@ Tensor index_select_mps(const Tensor & self, MPSStream* stream = getCurrentMPSStream(); @autoreleasepool { string key = "masked_fill" + getTensorsStringKey({self, mask}) + ":" + std::to_string(value.toDouble()); - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + CachedGraph* cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @@ -371,7 +558,6 @@ Tensor index_select_mps(const Tensor & self, } return newCachedGraph; }); - cachedGraph = static_cast(tmpCachedGraph); } Placeholder selfPlaceholder = Placeholder(cachedGraph->inputTensor_, self); @@ -420,7 +606,7 @@ Tensor embedding_dense_backward_mps( int64_t D = incoming_gradient_shape[num_incoming_gradient_dims - 1]; c10::SmallVector outgoing_gradient_shape{num_weights, D}; Tensor outgoing_gradient = at::native::empty_mps( - IntArrayRef(outgoing_gradient_shape.data(), outgoing_gradient_shape.size()), + IntArrayRef(outgoing_gradient_shape), grad_.scalar_type(), c10::nullopt, kMPS, @@ -435,10 +621,10 @@ Tensor 
embedding_dense_backward_mps( @autoreleasepool { string key = "edb_mps:" + native_mps::getMPSTypeString(grad_.scalar_type()) + ":indices" + std::to_string(num_indices_dims) + ":num_weights" + std::to_string(num_weights) + ":padding_idx" + std::to_string(padding_idx) + ":scaled" + std::to_string(scale_grad_by_freq); - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + CachedGraph* cachedGraph = cache_->LookUpAs(key); // Initialize once if configuration not found in cache if(!cachedGraph) { - native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ native_mps::MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @@ -450,17 +636,20 @@ Tensor embedding_dense_backward_mps( MPSGraphTensor* indicesTensor = native_mps::mpsGraphUnrankedPlaceHolder(mpsGraph, native_mps::getMPSDataType(indices.scalar_type())); - MPSGraphTensor *reshapedIndicesTensor = [mpsGraph expandDimsOfTensor:indicesTensor - axes:@[@-1] - name:nil]; + MPSGraphTensor* reshapedIndicesTensor = indicesTensor; - MPSGraphTensor *outgoingGradTensor; - outgoingGradTensor = [mpsGraph scatterNDWithUpdatesTensor:incomingGradTensor - indicesTensor:reshapedIndicesTensor - shape:native_mps::getMPSShape(IntArrayRef(outgoing_gradient_shape.data(), outgoing_gradient_shape.size())) - batchDimensions:0 - mode:MPSGraphScatterModeAdd - name:@"edb"]; + if (num_indices_dims != 0) { + reshapedIndicesTensor = [mpsGraph expandDimsOfTensor: indicesTensor + axes: @[@-1] + name: nil]; + } + + auto outgoingGradTensor = [mpsGraph scatterNDWithUpdatesTensor: incomingGradTensor + indicesTensor: reshapedIndicesTensor + shape: native_mps::getMPSShape(IntArrayRef(outgoing_gradient_shape)) + batchDimensions: 0 + mode: MPSGraphScatterModeAdd + name: @"edb"]; newCachedGraph->incomingGradTensor_ = incomingGradTensor; newCachedGraph->indicesTensor_ = indicesTensor; @@ -469,7 +658,6 @@ Tensor embedding_dense_backward_mps( } return newCachedGraph; }); - cachedGraph = static_cast(tmpCachedGraph); } auto incomingGradPlaceholder = native_mps::Placeholder(cachedGraph->incomingGradTensor_, grad_); auto indicesPlaceholder = native_mps::Placeholder(cachedGraph->indicesTensor_, indices); @@ -494,5 +682,7 @@ Tensor embedding_dense_backward_mps( return masked_fill__mps(self, mask, value.item()); } -} -} +REGISTER_DISPATCH(index_stub, &index_kernel_mps); +REGISTER_DISPATCH(index_put_stub, &index_put_kernel_mps); +} // native +} // at diff --git a/aten/src/ATen/native/mps/operations/Linear.mm b/aten/src/ATen/native/mps/operations/Linear.mm index 34b933d44461..ddaa6ce97963 100644 --- a/aten/src/ATen/native/mps/operations/Linear.mm +++ b/aten/src/ATen/native/mps/operations/Linear.mm @@ -18,13 +18,15 @@ Tensor _mps_linear( const Tensor& input, - const Tensor& weight, + const Tensor& weight_arg, const c10::optional& bias_opt) { // wT = transpose(weight); // y=x*wT+b using namespace mps; + auto weight = (weight_arg.dim() == 1) ? 
weight_arg.view({1, weight_arg.size(0)}) : weight_arg; + TORCH_CHECK(input.scalar_type() == ScalarType::Double || input.scalar_type() == ScalarType::Float || input.scalar_type() == ScalarType::Half, "MPS device does not support linear for non-float inputs"); @@ -150,7 +152,15 @@ Tensor _mps_linear( mps::runMPSGraph(stream, cachedGraph->graph(), feeds, results); } - return output; + // Shave off '1' present at the end of the shape + if(weight_arg.dim() == 1) { + // Number of elements in new output shape + auto output_sizes = output.sizes(); + std::vector out_shape(output_sizes.begin(), output_sizes.end()-1); + return output.view(IntArrayRef(out_shape)); + } + else + return output; } Tensor _mps_linear_backward_input( @@ -361,10 +371,10 @@ Tensor _mps_linear_backward_input( const Tensor& weight, std::array output_mask) { Tensor grad_input, grad_weight, grad_bias; if (output_mask[0]) { - grad_input = at::_mps_linear_backward_input(input.sizes(), grad_output, weight); + grad_input = _mps_linear_backward_input(input.sizes(), grad_output, weight); } if (output_mask[1] || output_mask[2]) { - std::tie(grad_weight, grad_bias) = at::_mps_linear_backward_weights(grad_output, input, weight, output_mask[2]); + std::tie(grad_weight, grad_bias) = _mps_linear_backward_weights(grad_output, input, weight, output_mask[2]); } return std::tuple{grad_input, grad_weight, grad_bias}; } diff --git a/aten/src/ATen/native/mps/operations/LossOps.mm b/aten/src/ATen/native/mps/operations/LossOps.mm index cc112265a3a8..3430af0434de 100644 --- a/aten/src/ATen/native/mps/operations/LossOps.mm +++ b/aten/src/ATen/native/mps/operations/LossOps.mm @@ -455,7 +455,7 @@ void nllnd_loss_backward_impl( auto totalWeightPlaceholder = Placeholder(cachedGraph->totalWeightTensor_, total_weight); auto gradInputPlaceholder = Placeholder(cachedGraph->gradInputTensor_, grad_input); - NSMutableDictionary* feeds = [[NSMutableDictionary alloc] initWithCapacity: 4]; + NSMutableDictionary* feeds = [[[NSMutableDictionary alloc] initWithCapacity: 4] autorelease]; feeds[inputPlaceholder.getMPSGraphTensor()] = inputPlaceholder.getMPSGraphTensorData(); feeds[targetPlaceholder.getMPSGraphTensor()] = targetPlaceholder.getMPSGraphTensorData(); feeds[totalWeightPlaceholder.getMPSGraphTensor()] = totalWeightPlaceholder.getMPSGraphTensorData(); @@ -697,7 +697,7 @@ void nllnd_loss_backward_impl( Placeholder totalWeightsPlaceholder = Placeholder(cachedGraph->totalWeightTensor_, total_weight); // Create dictionary of inputs and outputs - NSMutableDictionary* feeds = [[NSMutableDictionary alloc] initWithCapacity: 4]; + NSMutableDictionary* feeds = [[[NSMutableDictionary alloc] initWithCapacity: 4] autorelease]; feeds[selfPlaceholder.getMPSGraphTensor()] = selfPlaceholder.getMPSGraphTensorData(); feeds[targetPlaceholder.getMPSGraphTensor()] = targetPlaceholder.getMPSGraphTensorData(); feeds[batchSizePlaceholder.getMPSGraphTensor()] = batchSizePlaceholder.getMPSGraphTensorData(); diff --git a/aten/src/ATen/native/mps/operations/Normalization.mm b/aten/src/ATen/native/mps/operations/Normalization.mm index 2e026b9acb46..49f1e0538463 100644 --- a/aten/src/ATen/native/mps/operations/Normalization.mm +++ b/aten/src/ATen/native/mps/operations/Normalization.mm @@ -411,6 +411,54 @@ Check if running mean exists (maybe do this check before making graph) return std::make_tuple(output, save_mean, save_var); } +std::tuple _batch_norm_legit_mps + (const Tensor& self, + const c10::optional& weight_opt, + const c10::optional& bias_opt, + Tensor& running_mean, + Tensor& 
running_var, + bool train, + double momentum, + double epsilon) { + + return batch_norm_mps(self, weight_opt, bias_opt, running_mean, running_var, train, momentum, epsilon); +} + +std::tuple _batch_norm_legit_no_stats_mps + (const Tensor& self, + const c10::optional& weight_opt, + const c10::optional& bias_opt, + bool train, + double momentum, + double epsilon) { + + return batch_norm_mps(self, weight_opt, bias_opt, Tensor(), Tensor(), train, momentum, epsilon); +} + +std::tuple _batch_norm_legit_mps_out + (const Tensor& self, + const c10::optional& weight_opt, + const c10::optional& bias_opt, + Tensor& running_mean, + Tensor& running_var, + bool train, double momentum, double epsilon, + Tensor& output, + Tensor& save_mean, + Tensor& save_var) { + return batch_norm_mps_out(self, weight_opt, bias_opt, running_mean, running_var, train, momentum, epsilon, output, save_mean, save_var); +} + +std::tuple _batch_norm_legit_no_stats_mps_out + (const Tensor& self, + const c10::optional& weight_opt, + const c10::optional& bias_opt, + bool train, double momentum, double epsilon, + Tensor& output, + Tensor& save_mean, + Tensor& save_var) { + return batch_norm_mps_out(self, weight_opt, bias_opt, Tensor(), Tensor(), train, momentum, epsilon, output, save_mean, save_var); +} + string get_mem_string(c10::MemoryFormat memory_format) { string mem_format_key; switch(memory_format) { @@ -823,7 +871,7 @@ string get_mem_string(c10::MemoryFormat memory_format) { const int normalized_ndim = normalized_shape.size(); // NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions) const int axis = input_ndim - normalized_ndim; - at::Tensor input_reshaped = input.view({1, M, -1}); + at::Tensor input_reshaped = input.reshape({1, M, -1}); // Unlike Batch Normalization, which applies scalar scale and bias for each // entire channel/plane with the affine option, Layer Normalization applies // per-element scale and bias. E.g. For input {N, C, H, W}, weight for diff --git a/aten/src/ATen/native/mps/operations/Pad.mm b/aten/src/ATen/native/mps/operations/Pad.mm new file mode 100644 index 000000000000..63a26e66288b --- /dev/null +++ b/aten/src/ATen/native/mps/operations/Pad.mm @@ -0,0 +1,306 @@ +// Copyright © 2022 Apple Inc. + +#include +#include + +namespace at { +namespace native { +namespace mps { + +// Pad operations (1D/2D/3D forward and backward) +Tensor& pad_out_template(Tensor &output, const Tensor &input_, IntArrayRef padding, + const c10::optional& grad_output_opt, + MPSGraphPaddingMode mode, double constantValue, const string op_name) +{ + const int padding_size = (int) padding.size(); + const int padding_dim = padding_size / 2; // either 1D, 2D, or 3D + + TORCH_CHECK(padding_size == 2 || padding_size == 4 || padding_size == 6, + "invalid padding argument of size ", padding_size); + + const Tensor& grad_output_ = *(at::borrow_from_optional_tensor(grad_output_opt)); + const bool is_backward_pass = grad_output_.defined(); + + int64_t nbatch = 1; + int64_t ndims = input_.ndimension(); + // number of input dims with ConstantPad could be less than 2 + int dim_w = ndims > 1 ? 
padding_dim : 0; + int dim_h = padding_dim - 1; + int dim_d = padding_dim - 2; + int dim_slices = 0; + + if (!is_backward_pass && ndims > 1) { + bool valid_dims = input_.size(1) != 0 && input_.size(padding_dim) != 0; + TORCH_CHECK((ndims == 1 + padding_dim && valid_dims) || + (ndims == 2 + padding_dim && valid_dims && input_.size(1 + padding_dim) != 0), + "3D or 4D (batch mode) tensor expected for input, but got: ", input_); + } + + if (ndims == 2 + padding_dim) { + nbatch = input_.size(0); + dim_w++; + dim_h++; + dim_d++; + dim_slices++; + } + + int64_t pad_l = padding[0]; + int64_t pad_r = padding[1]; + int64_t pad_t = padding_dim > 1 ? padding[2] : 0; + int64_t pad_b = padding_dim > 1 ? padding[3] : 0; + int64_t pad_front = padding_dim > 2 ? padding[4] : 0; + int64_t pad_back = padding_dim > 2 ? padding[5] : 0; + + int64_t nplane = input_.size(dim_slices); + int64_t input_w = input_.size(dim_w); + int64_t output_w = input_w + pad_l + pad_r; + int64_t input_h = padding_dim > 1 ? input_.size(dim_h) : 0; + int64_t output_h = padding_dim > 1 ? input_h + pad_t + pad_b : 0; + int64_t input_d = padding_dim > 2 ? input_.size(dim_d) : 0; + int64_t output_d = padding_dim > 2 ? input_d + pad_front + pad_back : 0; + + Tensor grad_output, input = input_; + + if (!is_backward_pass) { + TORCH_CHECK(pad_l < input_w && pad_r < input_w, + "Argument #4: Padding size should be less than the corresponding " + "input dimension, but got: padding (", pad_l, ", ", pad_r, + ") at dimension ", dim_w, " of input ", ndims); + + if (padding_dim > 1) { + TORCH_CHECK(pad_t < input_h && pad_b < input_h, + "Argument #6: Padding size should be less than the corresponding " + "input dimension, but got: padding (", pad_t, ", ", pad_b, + ") at dimension ", dim_h, " of input ", ndims); + } + TORCH_CHECK(output_w >= 1 || output_h >= padding_dim - 1, + "input (H: ", input_h, ", W: ", input_w, ") is too small. Calculated " + "output H: ", output_h, " W: ", output_w); + + if (ndims == 1 + padding_dim) { + if (padding_dim == 3) + output.resize_({nplane, output_d, output_h, output_w}); + else if (padding_dim == 2) + output.resize_({nplane, output_h, output_w}); + else + output.resize_({nplane, output_w}); + } else { + if (padding_dim == 3) + output.resize_({nbatch, nplane, output_d, output_h, output_w}); + else if (padding_dim == 2) + output.resize_({nbatch, nplane, output_h, output_w}); + else if (ndims > 1) + output.resize_({nbatch, nplane, output_w}); + else + output.resize_({output_w}); + } + if (output.numel() == 0 || input_.numel() == 0) + return output; + input = input_.contiguous(); + } else { + TORCH_CHECK(output_w == grad_output_.size(dim_w), + "gradOutput width unexpected. Expected: ", output_w, ", Got: ", grad_output_.size(dim_w)); + if (padding_dim > 1) { + TORCH_CHECK(output_h == grad_output_.size(dim_h), + "gradOutput height unexpected. 
Expected: ", output_h, ", Got: ", grad_output_.size(dim_h)); + } + grad_output = grad_output_.contiguous(); + } + + std::vector leftPadVec(ndims, @(0)); + std::vector rightPadVec(ndims, @(0)); + leftPadVec [ndims - 1] = @(pad_l); + rightPadVec[ndims - 1] = @(pad_r); + if (padding_dim >= 2) { + leftPadVec [ndims - 2] = @(pad_t); + rightPadVec[ndims - 2] = @(pad_b); + } + if (padding_dim >= 3) { + leftPadVec [ndims - 3] = @(pad_front); + rightPadVec[ndims - 3] = @(pad_back); + } + MPSShape *leftPadding = [NSArray arrayWithObjects:leftPadVec.data() count:ndims]; + MPSShape *rightPadding = [NSArray arrayWithObjects:rightPadVec.data() count:ndims]; + + struct CachedGraph : public MPSCachedGraph { + CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) { } + MPSGraphTensor *inputTensor = nil, *outputTensor = nil; + MPSGraphTensor *gradOutputTensor = nil; + }; + MPSGraphCache* cache_ = MPSGraphCache::getInstance(); + + @autoreleasepool { + string key = op_name + getTensorsStringKey({input, grad_output}) + + ":L" + to_string(pad_l) + ":R" + to_string(pad_r) + + ":T" + to_string(pad_t) + ":B" + to_string(pad_b) + + ":F" + to_string(pad_front) + ":K" + to_string(pad_back); + + CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + if(!cachedGraph) { + cachedGraph = static_cast(cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { + CachedGraph *newCachedGraph = nil; + @autoreleasepool { + MPSGraph* mpsGraph = make_mps_graph(); + newCachedGraph = new CachedGraph(mpsGraph); + newCachedGraph->inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input); + if (!is_backward_pass) { + newCachedGraph->outputTensor = [mpsGraph padTensor:newCachedGraph->inputTensor + withPaddingMode:mode + leftPadding:leftPadding + rightPadding:rightPadding + constantValue:constantValue + name:nil]; + } else { + newCachedGraph->gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output); + newCachedGraph->outputTensor = [mpsGraph padGradientWithIncomingGradientTensor:newCachedGraph->gradOutputTensor + sourceTensor:newCachedGraph->inputTensor + paddingMode:mode + leftPadding:leftPadding + rightPadding:rightPadding + name:nil]; + } + } + return newCachedGraph; + })); + } + Placeholder inputPlaceholder = Placeholder(cachedGraph->inputTensor, input); + Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor, output); + + NSMutableDictionary *feeds = [[NSMutableDictionary new] autorelease]; + feeds[inputPlaceholder.getMPSGraphTensor()] = inputPlaceholder.getMPSGraphTensorData(); + if (is_backward_pass) { + Placeholder gradOutputPlaceholder = Placeholder(cachedGraph->gradOutputTensor, grad_output); + feeds[gradOutputPlaceholder.getMPSGraphTensor()] = gradOutputPlaceholder.getMPSGraphTensorData(); + } + NSDictionary* results = @{ + outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() + }; + runMPSGraph(getCurrentMPSStream(), cachedGraph->graph(), feeds, results); + } + return output; +} +} // namespace mps + +// 1D Reflection and Replication Padding +TORCH_IMPL_FUNC(reflection_pad1d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeReflect, 0.0, "reflection_pad1d_out_mps"); +} + +TORCH_IMPL_FUNC(reflection_pad1d_backward_out_mps) +(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, + 
MPSGraphPaddingModeReflect, 0.0, "reflection_pad1d_backward_out_mps"); +} + +TORCH_IMPL_FUNC(replication_pad1d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad1d_out_mps"); +} + +TORCH_IMPL_FUNC(replication_pad1d_backward_out_mps) +(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, + MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad1d_backward_out_mps"); +} + +// 2D Reflection and Replication Padding +Tensor& reflection_pad2d_out_mps(const Tensor& input, IntArrayRef padding, Tensor& output) +{ + return mps::pad_out_template(output, input, padding, c10::nullopt, MPSGraphPaddingModeReflect, 0.0, __func__); +} + +Tensor reflection_pad2d_mps(const Tensor& input, IntArrayRef padding) +{ + Tensor output = at::empty({0}, input.options()); + return mps::pad_out_template(output, input, padding, c10::nullopt, MPSGraphPaddingModeReflect, 0.0, __func__); +} + +Tensor& reflection_pad2d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeReflect, 0.0, __func__); +} + +Tensor reflection_pad2d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) +{ + auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeReflect, 0.0, __func__); +} + +TORCH_IMPL_FUNC(replication_pad2d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad2d_out_mps"); +} + +Tensor& replication_pad2d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); +} + +Tensor replication_pad2d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) +{ + auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); +} + +// 3D Reflection and Replication Padding +TORCH_IMPL_FUNC(reflection_pad3d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeReflect, 0.0, "reflection_pad3d_out_mps"); +} + +TORCH_IMPL_FUNC(reflection_pad3d_backward_out_mps) +(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, + MPSGraphPaddingModeReflect, 0.0, "reflection_pad3d_backward_out_mps"); +} + +TORCH_IMPL_FUNC(replication_pad3d_out_mps) +(const Tensor& input, IntArrayRef padding, const Tensor& output) +{ + mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, + MPSGraphPaddingModeClampToEdge, 0.0, 
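A note on the padding support added in Pad.mm above: pad_out_template services 1D/2D/3D reflection and replication padding (forward and backward) as well as constant padding, and constant padding of more than 3 dimensions falls back to the generic implementation with a one-time warning. A short Python sketch of the corresponding user-facing calls, illustrative only and assuming these kernels are dispatched for the "mps" device:

import torch
import torch.nn.functional as F

if torch.backends.mps.is_available():
    x = torch.randn(1, 2, 4, 4, device="mps")
    F.pad(x, (1, 1, 1, 1), mode="reflect")               # reflection_pad2d
    F.pad(x, (1, 1, 1, 1), mode="replicate")             # replication_pad2d
    F.pad(x, (1, 1, 1, 1), mode="constant", value=0.5)   # constant_pad_nd, 2 padded dims
    y = torch.randn(1, 1, 2, 2, 2, 2, device="mps")
    F.pad(y, (1, 1) * 4, mode="constant", value=0.0)     # > 3 padded dims: warns once, falls back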
"replication_pad3d_out_mps"); +} + +Tensor& replication_pad3d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) +{ + grad_input.resize_as_(input).zero_(); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); +} + +Tensor replication_pad3d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) +{ + auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); + return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); +} + +// backward pass is exlicitly handled in autograd by negating the "pad" argument +Tensor constant_pad_nd_mps(const Tensor& self, IntArrayRef pad, const Scalar& value) +{ + if (pad.size() > 6) { + TORCH_WARN_ONCE("MPS: The constant padding of more than 3 dimensions is not currently supported natively. ", + "It uses View Ops default implementation to run. This may have performance implications."); + return at::native::constant_pad_nd(self, pad, value); + } + Tensor output = at::empty({0}, self.options()); + return mps::pad_out_template(output, self, pad, c10::nullopt, MPSGraphPaddingModeConstant, value.toDouble(), __func__); +} + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/mps/operations/PointwiseOps.mm b/aten/src/ATen/native/mps/operations/PointwiseOps.mm index 261749bd269f..8da6b94dd856 100644 --- a/aten/src/ATen/native/mps/operations/PointwiseOps.mm +++ b/aten/src/ATen/native/mps/operations/PointwiseOps.mm @@ -36,10 +36,10 @@ @autoreleasepool { string key = op_name + getTensorsStringKey({self, tensor1, tensor2}, false); - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + CachedGraph* cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ MPSCachedGraph * () { CachedGraph* newCachedGraph = nil; @autoreleasepool { @@ -72,7 +72,6 @@ } return newCachedGraph; }); - cachedGraph = static_cast(tmpCachedGraph); } // Inputs as placeholders @@ -80,13 +79,14 @@ Placeholder tensor1Placeholder = Placeholder(cachedGraph->firstTensor, tensor1); Placeholder tensor2Placeholder = Placeholder(cachedGraph->secondTensor, tensor2); Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor, output); + MPSScalar value_scalar = getMPSScalar(value_opt, self.scalar_type()); // Create dictionary of inputs and outputs NSDictionary* feeds = @{ selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(), tensor1Placeholder.getMPSGraphTensor() : tensor1Placeholder.getMPSGraphTensorData(), tensor2Placeholder.getMPSGraphTensor() : tensor2Placeholder.getMPSGraphTensorData(), - cachedGraph->valueTensor : getMPSGraphTensorFromScalar(mpsStream, value_opt, getMPSScalarType(self.scalar_type())), + cachedGraph->valueTensor : getMPSGraphTensorFromScalar(mpsStream, value_scalar), }; NSDictionary* results = @{ diff --git a/aten/src/ATen/native/mps/operations/RangeFactories.mm b/aten/src/ATen/native/mps/operations/RangeFactories.mm index 2d97e01fea13..403ae4748f0f 100644 --- a/aten/src/ATen/native/mps/operations/RangeFactories.mm +++ b/aten/src/ATen/native/mps/operations/RangeFactories.mm @@ -95,7 +95,7 @@ auto stream = getCurrentMPSStream(); auto mpsDataType = getMPSDataType(result.scalar_type()); @autoreleasepool { - string key = "arange_mps_out:" + getTensorsStringKey({result}) + 
":" + to_string(size); + string key = "arange_mps_out" + getTensorsStringKey({result}) + ":" + to_string(size); auto cachedGraph = static_cast(cache_->LookUp(key)); if (!cachedGraph) { auto *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ MPSCachedGraph *() { @@ -106,8 +106,10 @@ } Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor, r); NSMutableDictionary *feeds = [[NSMutableDictionary new] autorelease]; - feeds[cachedGraph->startTensor] = getMPSGraphTensorFromScalar(stream, start, mpsDataType); - feeds[cachedGraph->multiplyTensor] = getMPSGraphTensorFromScalar(stream, Scalar(step), mpsDataType); + MPSScalar startScalar = getMPSScalar(start, result.scalar_type()); + feeds[cachedGraph->startTensor] = getMPSGraphTensorFromScalar(stream, startScalar); + MPSScalar stepScalar = getMPSScalar(step, result.scalar_type()); + feeds[cachedGraph->multiplyTensor] = getMPSGraphTensorFromScalar(stream, stepScalar); NSDictionary* results = @{ outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() @@ -167,13 +169,16 @@ } NSMutableDictionary *feeds = [[NSMutableDictionary new] autorelease]; - auto multiplyScalar = (end.to() - start.to()) / ((double)steps - 1.0f); + auto multiply = (end.to() - start.to()) / ((double)steps - 1.0f); Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor, r); // Create dictionary of inputs and outputs - feeds[cachedGraph->startTensor] = getMPSGraphTensorFromScalar(stream, start, MPSDataTypeFloat32); - feeds[cachedGraph->endTensor] = getMPSGraphTensorFromScalar(stream, end, MPSDataTypeFloat32); - feeds[cachedGraph->multiplyTensor] = getMPSGraphTensorFromScalar(stream, Scalar(multiplyScalar), MPSDataTypeFloat32); + MPSScalar startScalar = getMPSScalar(start, ScalarType::Float); + feeds[cachedGraph->startTensor] = getMPSGraphTensorFromScalar(stream, startScalar); + MPSScalar endScalar = getMPSScalar(end, ScalarType::Float); + feeds[cachedGraph->endTensor] = getMPSGraphTensorFromScalar(stream, endScalar); + MPSScalar multiplyScalar = getMPSScalar(multiply, ScalarType::Float); + feeds[cachedGraph->multiplyTensor] = getMPSGraphTensorFromScalar(stream, multiplyScalar); NSDictionary* results = @{ outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() diff --git a/aten/src/ATen/native/mps/operations/ReduceOps.mm b/aten/src/ATen/native/mps/operations/ReduceOps.mm index d6e510a06e32..d905107b8ffd 100644 --- a/aten/src/ATen/native/mps/operations/ReduceOps.mm +++ b/aten/src/ATen/native/mps/operations/ReduceOps.mm @@ -9,6 +9,7 @@ #include #include #include +#include namespace at { namespace native { @@ -26,7 +27,8 @@ SUM, PROD, MEAN, - COUNT_NONZERO + COUNT_NONZERO, + TRACE }; @@ -138,7 +140,7 @@ void set_axes_and_shapes(const Tensor& input_t, } void reduction_out_mps - (const Tensor& input_t, + (const Tensor& input_tensor, OptionalIntArrayRef opt_dim, bool keepdim, c10::optional dtype, @@ -146,6 +148,8 @@ void set_axes_and_shapes(const Tensor& input_t, MPSReductionType reduction_type, const std::string& func_name) { + auto input_t = (input_tensor.sizes().size() == 0) ? 
input_tensor.view({1}) : input_tensor; + IntArrayRef input_shape = input_t.sizes(); if (opt_dim.has_value()) { @@ -183,7 +187,7 @@ void set_axes_and_shapes(const Tensor& input_t, auto cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ native_mps::MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @@ -236,6 +240,14 @@ void set_axes_and_shapes(const Tensor& input_t, castOutputTensor = [mpsGraph reductionMinimumWithTensor:inputTensor axes:axes name:nil]; + } else if(reduction_type == MPSReductionType::TRACE) { + MPSGraphTensor *bandPartWithTensor = [mpsGraph bandPartWithTensor:inputTensor + numLower:0 + numUpper:0 + name:nil]; + castOutputTensor = [mpsGraph reductionSumWithTensor:bandPartWithTensor + axes:@[@0, @1] + name:nil]; } MPSGraphTensor* outputTensor = nil; @@ -252,15 +264,15 @@ void set_axes_and_shapes(const Tensor& input_t, } return newCachedGraph; }); - cachedGraph = tmpCachedGraph->as(); } auto inputPlaceholder = native_mps::Placeholder(); - if(apparent_input_shape) + if (apparent_input_shape) { inputPlaceholder = native_mps::Placeholder(cachedGraph->inputTensor_, input_t, apparent_input_shape); - else + } else { inputPlaceholder = native_mps::Placeholder(cachedGraph->inputTensor_, input_t); + } auto outputPlaceholder = native_mps::Placeholder(cachedGraph->outputTensor_, output_t, apparent_output_shape); NSDictionary *feeds = @{ inputPlaceholder.getMPSGraphTensor() : inputPlaceholder.getMPSGraphTensorData(), @@ -284,6 +296,26 @@ void set_axes_and_shapes(const Tensor& input_t, reduction_out_mps(input_t, opt_dim, keepdim, dtype, output_t, MPSReductionType::SUM, "sum_out_mps"); } +Tensor trace_mps_out(const Tensor& self) { + + Tensor output_t = at::native::empty_mps( + {}, + self.scalar_type(), + c10::nullopt, + kMPS, + c10::nullopt, + c10::nullopt); + + std::vector dims(self.dim()); + std::iota(dims.begin(), dims.end(), 0); + + reduction_out_mps(self, IntArrayRef(dims), false, c10::nullopt, const_cast(output_t), MPSReductionType::TRACE, "trace_mps_out"); + + return output_t; + + +} + TORCH_IMPL_FUNC(prod_out_mps) (const Tensor& input_t, int64_t dim, @@ -299,7 +331,7 @@ void set_axes_and_shapes(const Tensor& input_t, // Taken from ReduceOps.cpp inline ScalarType get_dtype_from_self( const Tensor& self, - const optional& dtype, + const c10::optional& dtype, bool promote_integers) { if (dtype.has_value()) { return dtype.value(); @@ -331,12 +363,8 @@ inline ScalarType get_dtype_from_self( Tensor prod_mps(const Tensor &self, c10::optional opt_dtype) { - auto num_dims = self.dim(); - - int64_t dims[num_dims]; - - for(int i = 0; i < num_dims; i++) - dims[i] = i; + std::vector dims(self.dim()); + std::iota(dims.begin(), dims.end(), 0); Tensor output_t = at::native::empty_mps( {}, @@ -346,7 +374,7 @@ Tensor prod_mps(const Tensor &self, c10::optional opt_dtype) { c10::nullopt, c10::nullopt); - reduction_out_mps(self, IntArrayRef(dims, num_dims), false, opt_dtype, const_cast(output_t), MPSReductionType::PROD, "prod_mps"); + reduction_out_mps(self, IntArrayRef(dims), false, opt_dtype, const_cast(output_t), MPSReductionType::PROD, "prod_mps"); return output_t; } @@ -360,13 +388,13 @@ Tensor count_nonzero_mps(const Tensor& self, IntArrayRef dims){ set_axes_and_shapes(self, dims, axes, apparent_input_shape, apparent_output_shape, output_shape); - int64_t* raw_output_shape = (int64_t *)malloc([output_shape count] * 
sizeof(int64_t)); - for(int i=0; i < [output_shape count]; i++) { + std::vector raw_output_shape([output_shape count]); + for(auto i: c10::irange(raw_output_shape.size())) { raw_output_shape[i] = [output_shape[i] longValue]; } Tensor output_t = at::native::empty_mps( - IntArrayRef(raw_output_shape, [output_shape count]), + IntArrayRef(raw_output_shape), ScalarType::Long, c10::nullopt, kMPS, @@ -375,8 +403,6 @@ Tensor count_nonzero_mps(const Tensor& self, IntArrayRef dims){ reduction_out_mps(self, dims, false, self.scalar_type(), const_cast(output_t), MPSReductionType::COUNT_NONZERO, "count_nonzero_mps"); - free(raw_output_shape); - return output_t; } @@ -391,14 +417,17 @@ Tensor count_nonzero_mps(const Tensor& self, IntArrayRef dims){ } TORCH_IMPL_FUNC(norm_out_mps) -(const Tensor& input_t, +(const Tensor& input_tensor, const OptionalScalarRef opt_p, IntArrayRef dim, bool keepdim, const Tensor& output_t) { - if (input_t.numel() == 0) + if (input_tensor.numel() == 0) return; + + auto input_t = (input_tensor.sizes().size() == 0) ? input_tensor.view({1}) : input_tensor; + IntArrayRef input_shape = input_t.sizes(); for(int i = 0; i < dim.size(); i++) { @@ -452,7 +481,7 @@ Tensor count_nonzero_mps(const Tensor& self, IntArrayRef dims){ auto cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ native_mps::MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @@ -522,7 +551,6 @@ Tensor count_nonzero_mps(const Tensor& self, IntArrayRef dims){ } return newCachedGraph; }); - cachedGraph = tmpCachedGraph->as(); } auto inputPlaceholder = native_mps::Placeholder(); @@ -584,7 +612,7 @@ Tensor std_var_common_impl_mps( NSMutableArray *axes = nil; NSMutableArray *apparent_output_shape = nil; NSMutableArray *apparent_input_shape = nil; - int64_t* output_shape = nil; + std::vector output_shape; if ((!keepdim && !use_dim) || (!keepdim && use_dim && dim_value.size() <= 0)) { @@ -624,7 +652,6 @@ Tensor std_var_common_impl_mps( axes); num_output_dims = (num_input_dims >= num_reduce_dims) ? (num_input_dims - num_reduce_dims) : 0; //num_input_dims; - output_shape = (int64_t *)malloc(num_output_dims * sizeof(int64_t)); unsigned int curr_i = 0; for (int i = 0; i < num_input_dims; i++) @@ -639,13 +666,17 @@ Tensor std_var_common_impl_mps( } } if (found) continue; - output_shape[curr_i] = input_shape[i]; + output_shape.push_back(input_shape[i]); curr_i += 1; + // End loop when output shape is filled + if (curr_i == num_output_dims) + break; } for(int i = 0; i < num_reduce_dims; i++) { - correction_n *= input_shape[dim_value[i]]; + auto wrap_dim = maybe_wrap_dim(dim_value[i], input_shape.size()); + correction_n *= input_shape[wrap_dim]; } // (3, 4, 5) --> (3, 5) } @@ -662,10 +693,9 @@ Tensor std_var_common_impl_mps( input_shape, axes); num_output_dims = num_input_dims; - output_shape = (int64_t *)malloc(num_output_dims * sizeof(int64_t)); for (int i = 0; i < num_input_dims; i++) { - output_shape[i] = (int64_t) 1; + output_shape.push_back((int64_t) 1); correction_n *= input_shape[i]; } // scalar --> vector case [[1.0034567]] @@ -685,21 +715,22 @@ Tensor std_var_common_impl_mps( axes); num_output_dims = num_input_dims;//(num_input_dims >= num_reduce_dims) ? 
(num_input_dims - num_reduce_dims) : 0; - output_shape = (int64_t *)malloc(num_output_dims * sizeof(int64_t)); for(int i = 0; i < num_reduce_dims; i++) { - correction_n *= input_shape[dim_value[i]]; + auto wrap_dim = maybe_wrap_dim(dim_value[i], input_shape.size()); + correction_n *= input_shape[wrap_dim]; } for (int i = 0; i < num_input_dims; i++) { - output_shape[i] = [apparent_output_shape[i] longValue]; + output_shape.push_back([apparent_output_shape[i] longValue]); } } + Tensor output_t = at::native::empty_mps( - IntArrayRef(output_shape, num_output_dims), + IntArrayRef(output_shape.data(), num_output_dims), input_t.scalar_type(), c10::nullopt, kMPS, @@ -726,7 +757,7 @@ Tensor std_var_common_impl_mps( auto cachedGraph = cache_->LookUpAs(key); // Initialize once if configuration not found in cache if(!cachedGraph) { - native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ native_mps::MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @@ -761,7 +792,6 @@ Tensor std_var_common_impl_mps( } return newCachedGraph; }); - cachedGraph = static_cast(tmpCachedGraph); } auto inputPlaceholder = native_mps::Placeholder(); @@ -784,7 +814,7 @@ Tensor std_var_common_impl_mps( }; native_mps::runMPSGraph(stream, cachedGraph->graph(), feeds, results); } - free(output_shape); + return output_t; } @@ -844,7 +874,7 @@ Tensor std_mps( CachedGraph* cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ native_mps::MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @autoreleasepool { @@ -884,7 +914,6 @@ Tensor std_mps( } return newCachedGraph; }); - cachedGraph = tmpCachedGraph->as(); } auto inputPlaceholder = native_mps::Placeholder(cachedGraph->inputTensor_, input_t); @@ -919,7 +948,7 @@ Tensor std_mps( CachedGraph* cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ native_mps::MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @@ -960,7 +989,6 @@ Tensor std_mps( } return newCachedGraph; }); - cachedGraph = static_cast(tmpCachedGraph); } auto inputPlaceholder = native_mps::Placeholder(cachedGraph->inputTensor_, input_t); @@ -1015,7 +1043,7 @@ Tensor std_mps( CachedGraph* cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ native_mps::MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @autoreleasepool { @@ -1055,7 +1083,6 @@ Tensor std_mps( } return newCachedGraph; }); - cachedGraph = tmpCachedGraph->as(); } auto inputPlaceholder = native_mps::Placeholder(cachedGraph->inputTensor_, input_t); @@ -1090,7 +1117,7 @@ Tensor std_mps( CachedGraph* cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ native_mps::MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @@ -1131,7 +1158,6 @@ Tensor std_mps( } return newCachedGraph; }); - cachedGraph = tmpCachedGraph->as(); } auto inputPlaceholder = 
native_mps::Placeholder(cachedGraph->inputTensor_, input_t); @@ -1183,7 +1209,7 @@ Tensor std_mps( CachedGraph* cachedGraph = cache_->LookUpAs(key); // Initialize once if configuration not found in cache if(!cachedGraph) { - native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ native_mps::MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @@ -1210,7 +1236,6 @@ Tensor std_mps( } return newCachedGraph; }); - cachedGraph = static_cast(tmpCachedGraph); } auto inputPlaceholder = native_mps::Placeholder(cachedGraph->inputTensor_, input_t, apparent_input_shape); @@ -1294,10 +1319,10 @@ Tensor min_mps(const Tensor& input_t) { @autoreleasepool { string key = func_name + ":" + to_string(dim_) + ":" + native_mps::getMPSTypeString(input_t.scalar_type()); - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + CachedGraph* cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ native_mps::MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @@ -1347,7 +1372,6 @@ Tensor min_mps(const Tensor& input_t) { } return newCachedGraph; }); - cachedGraph = static_cast(tmpCachedGraph); } auto inputPlaceholder = native_mps::Placeholder(cachedGraph->inputTensor_, input_t); @@ -1461,7 +1485,7 @@ Tensor min_mps(const Tensor& input_t) { CachedGraph* cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ native_mps::MPSCachedGraph * () { CachedGraph *newCachedGraph = nil; @@ -1502,7 +1526,6 @@ Tensor min_mps(const Tensor& input_t) { } return newCachedGraph; }); - cachedGraph = static_cast(tmpCachedGraph); } native_mps::Placeholder inputPlaceholder = native_mps::Placeholder(); @@ -1565,8 +1588,8 @@ Tensor min_mps(const Tensor& input_t) { // Use this if keepdim is false int64_t num_output_dims = num_input_dims - 1; - int64_t* malloc_apparent_out_shape = (int64_t *)malloc(num_input_dims * sizeof(int64_t)); - int64_t* malloc_out_shape = (int64_t *)malloc(num_output_dims * sizeof(int64_t)); + std::vector vec_apparent_out_shape(num_input_dims); + std::vector vec_out_shape(num_output_dims); apparent_out_shape = [NSMutableArray arrayWithCapacity:num_input_dims]; // Counter for shape when keepdim is false @@ -1574,12 +1597,12 @@ Tensor min_mps(const Tensor& input_t) { for(int i = 0; i < num_input_dims; i++) { if(dim_ == i) { apparent_out_shape[i] = @1; - malloc_apparent_out_shape[i] = 1; + vec_apparent_out_shape[i] = 1; } else { apparent_out_shape[i] = [NSNumber numberWithInt:input_shape[i]]; - malloc_apparent_out_shape[i] = input_shape[i]; - malloc_out_shape[out_i] = input_shape[i]; + vec_apparent_out_shape[i] = input_shape[i]; + vec_out_shape[out_i] = input_shape[i]; out_i++; } } @@ -1588,30 +1611,29 @@ Tensor min_mps(const Tensor& input_t) { Tensor indices_t; if(!keepdim) { output_t = at::native::empty_mps( - IntArrayRef(malloc_out_shape, num_output_dims), + IntArrayRef(vec_out_shape), input_t.scalar_type(), c10::nullopt, kMPS, c10::nullopt, c10::nullopt); indices_t = at::native::empty_mps( - IntArrayRef(malloc_out_shape, num_output_dims), + IntArrayRef(vec_out_shape), ScalarType::Long, c10::nullopt, kMPS, c10::nullopt, c10::nullopt); - } - else { + } else { output_t = 
at::native::empty_mps( - IntArrayRef(malloc_apparent_out_shape, num_input_dims), + IntArrayRef(vec_apparent_out_shape), input_t.scalar_type(), c10::nullopt, kMPS, c10::nullopt, c10::nullopt); indices_t = at::native::empty_mps( - IntArrayRef(malloc_apparent_out_shape, num_input_dims), + IntArrayRef(vec_apparent_out_shape), ScalarType::Long, c10::nullopt, kMPS, @@ -1620,15 +1642,11 @@ Tensor min_mps(const Tensor& input_t) { } if (output_t.numel() == 0 || input_t.numel() == 0) { - free(malloc_out_shape); - free(malloc_apparent_out_shape); return std::tuple{output_t, indices_t}; } min_max_out_mps(input_t, dim, keepdim, output_t, indices_t, reduction_type, func_name); - free(malloc_out_shape); - free(malloc_apparent_out_shape); return std::tuple{output_t, indices_t}; } @@ -1650,5 +1668,319 @@ Tensor min_mps(const Tensor& input_t) { return min_max_mps(input_t, dim, keepdim, MPSReductionType::MIN, "min_mps"); } +// Median of entire tensor into scalar result +Tensor median_mps(const Tensor& input_t) { + + if(!is_macos_13_or_newer()){ + TORCH_WARN_ONCE("MPS: median op is supported natively starting from macOS 13.0. ", + "Falling back on CPU. This may have performance implications."); + return at::median(input_t.to("cpu")); + } + + TORCH_INTERNAL_ASSERT(input_t.scalar_type() != ScalarType::Long, "median not supported for Long dtype on MPS"); + + namespace native_mps = at::native::mps; + using CachedGraph = native_mps::MPSUnaryCachedGraph; + + native_mps::MPSGraphCache* cache_ = native_mps::MPSGraphCache::getInstance(); + + IntArrayRef input_shape = input_t.sizes(); + int64_t num_input_dims = input_shape.size(); + + // calculate total no. of elements in the input tensor to reduce it to one dimension + NSMutableArray *apparent_input_shape = [NSMutableArray arrayWithCapacity:1]; + int64_t num_in_elements = 1; + for(int i = 0; i < num_input_dims; i++) { + num_in_elements *= input_shape[i]; + } + + apparent_input_shape[0] = [NSNumber numberWithInt:num_in_elements]; + + Tensor output_t = at::native::empty_mps({}, input_t.scalar_type(), c10::nullopt, kMPS, c10::nullopt, c10::nullopt); + + if (output_t.numel() == 0 || num_in_elements == 0) { + return output_t; + } + + @autoreleasepool { + string key = "median_mps:"+ mps::getMPSTypeString(input_t.scalar_type()) + mps::getTensorsStringKey(input_t); + CachedGraph* cachedGraph = cache_->LookUpAs(key); + // Initialize once if configuration not found in cache + if(!cachedGraph) { + native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { + + CachedGraph *newCachedGraph = nil; + + @autoreleasepool { + MPSGraph* mpsGraph = native_mps::make_mps_graph(); + newCachedGraph = new CachedGraph(mpsGraph); + + MPSGraphTensor* inputTensor = native_mps::mpsGraphRankedPlaceHolder(mpsGraph, input_t); + + MPSGraphTensor* outputTensor = nil; + + MPSGraphTensor * reshapedTensor = [mpsGraph reshapeTensor:inputTensor + withShape:@[@-1] + name:nil]; + MPSGraphTensor * sortedTensor = [mpsGraph + sortWithTensor:reshapedTensor + axis:((NSUInteger) (int)0) + name:nil]; + + outputTensor = [mpsGraph sliceTensor:sortedTensor + dimension:0 + start:((NSUInteger) (int)((num_in_elements+1)/2 ) - 1) + length:1 + name:nil]; + + newCachedGraph->inputTensor_ = inputTensor; + newCachedGraph->outputTensor_ = outputTensor; + } + return newCachedGraph; + }); + cachedGraph = static_cast(tmpCachedGraph); + } + + auto inputPlaceholder = native_mps::Placeholder(cachedGraph->inputTensor_, input_t); + auto outputPlaceholder =
native_mps::Placeholder(cachedGraph->outputTensor_, output_t, @[@1]); + + NSDictionary *feeds = @{ + inputPlaceholder.getMPSGraphTensor() : inputPlaceholder.getMPSGraphTensorData(), + }; + + NSDictionary *results = @{ + outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() + }; + + native_mps::runMPSGraph(getCurrentMPSStream(), cachedGraph->graph(), feeds, results); + } + + return output_t; +} + + +void median_out_mps + (const Tensor& input_t, + int64_t dim, + bool keepdim, + const Tensor& output_t, + const Tensor& indices_t, + const std::string& func_name) { + + namespace native_mps = at::native::mps; + + if (output_t.numel() == 0) { + return; + } + if (input_t.numel() == 1 && input_t.dim() == 0) { + output_t.fill_(input_t); + indices_t.fill_(0); + return; + } + + // Derive from MPSCachedGraph + struct CachedGraph : public native_mps::MPSCachedGraph + { + CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) {} + MPSGraphTensor *inputTensor_ = nil; + MPSGraphTensor *outputTensor_ = nil; + MPSGraphTensor *indicesTensor_ = nil; + }; + + native_mps::MPSGraphCache* cache_ = native_mps::MPSGraphCache::getInstance(); + + int64_t dim_ = maybe_wrap_dim(dim, input_t.dim()); + + // Calculate the output shape according to keepdim=True + // If there is no dim argument, the input shape is flattened + IntArrayRef input_shape = input_t.sizes(); + int64_t num_input_dims = input_shape.size(); + NSMutableArray *apparent_out_shape = nil; + + apparent_out_shape = [NSMutableArray arrayWithCapacity:num_input_dims]; + for(int i = 0; i < num_input_dims; i++) { + if(dim_ == i) + apparent_out_shape[i] = @1; + else + apparent_out_shape[i] = [NSNumber numberWithInt:input_shape[i]]; + } + int dim_total_elements = input_shape[dim_]; + + auto stream = at::mps::getCurrentMPSStream(); + + @autoreleasepool { + string key = func_name + ":" + to_string(dim_) + ":" + native_mps::getMPSTypeString(input_t.scalar_type()); + CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); + + if(!cachedGraph) { + native_mps::MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ native_mps::MPSCachedGraph * () { + + CachedGraph *newCachedGraph = nil; + + @autoreleasepool { + MPSGraph* mpsGraph = native_mps::make_mps_graph(); + newCachedGraph = new CachedGraph(mpsGraph); + + MPSGraphTensor* inputTensor = native_mps::mpsGraphUnrankedPlaceHolder(mpsGraph, native_mps::getMPSDataType(input_t.scalar_type())); + MPSGraphTensor* outputTensor = nil; + MPSGraphTensor * sortedTensor = [mpsGraph + sortWithTensor:inputTensor + axis:((NSUInteger) (int)dim_) + name:nil]; + + outputTensor = [mpsGraph sliceTensor:sortedTensor + dimension:dim_ + start:((NSUInteger) (int)((dim_total_elements+1)/2 ) - 1) + length:1 + name:nil]; + MPSGraphTensor* argreduceOutTensor = nil; + argreduceOutTensor = [mpsGraph argSortWithTensor:inputTensor + axis:(NSInteger)dim_ + name:@"argmax_out"]; + MPSGraphTensor* argOutputTensor = [mpsGraph sliceTensor:argreduceOutTensor + dimension:dim_ + start:((NSUInteger) (int)((dim_total_elements+1)/2 ) - 1) + length:1 + name:nil]; + + newCachedGraph->inputTensor_ = inputTensor; + newCachedGraph->outputTensor_ = outputTensor; + newCachedGraph->indicesTensor_ = argOutputTensor; + } + return newCachedGraph; + }); + cachedGraph = static_cast(tmpCachedGraph); + } + + auto inputPlaceholder = native_mps::Placeholder(cachedGraph->inputTensor_, input_t); + auto outputPlaceholder = native_mps::Placeholder(cachedGraph->outputTensor_, output_t, apparent_out_shape); + auto indicesPlaceholder = 
native_mps::Placeholder(cachedGraph->indicesTensor_, indices_t, apparent_out_shape); + + NSDictionary *feeds = @{ + inputPlaceholder.getMPSGraphTensor() : inputPlaceholder.getMPSGraphTensorData(), + }; + + NSDictionary *results = @{ + outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData(), + indicesPlaceholder.getMPSGraphTensor() : indicesPlaceholder.getMPSGraphTensorData() + }; + + native_mps::runMPSGraph(stream, cachedGraph->graph(), feeds, results); + + } + +} + +// in case mps sortWithTensor do not supported on macOS +std::tuple median_from_cpu( + const Tensor& self, + int64_t dim, + bool keepdim, Tensor & valuesI, Tensor & indicesI, IntArrayRef vec_out_shape, IntArrayRef vec_apparent_out_shape) { + // Tensor a = at::median(self.to("cpu")); + Tensor values; + Tensor indices; + if (!keepdim){ + values = at::empty({vec_out_shape}, self.options()); + indices = at::empty({vec_out_shape}, self.options().dtype(kLong)); + + } + else{ + values = at::empty({vec_apparent_out_shape}, self.options()); + indices = at::empty({vec_apparent_out_shape}, self.options().dtype(kLong)); + } + at::median_out(values, indices, self, dim, keepdim); + + valuesI.copy_(values); + indicesI.copy_(indices); + return std::forward_as_tuple(valuesI, indicesI); +} + +TORCH_API ::std::tuple median_out_mps + (const at::Tensor & input_t, + int64_t dim, + bool keepdim, + at::Tensor & values, + at::Tensor & indices){ + + TORCH_INTERNAL_ASSERT(input_t.scalar_type() != ScalarType::Long, "median not supported for Long dtype on MPS"); + + namespace native_mps = at::native::mps; + int64_t dim_ = maybe_wrap_dim(dim, input_t.dim()); + native::zero_numel_check_dims(input_t, dim_, "max()"); + + // Calculate the output shape according to keepdim=True + // If there is no dim argument, the input shape is flattened + IntArrayRef input_shape = input_t.sizes(); + int64_t num_input_dims = input_shape.size(); + NSMutableArray *apparent_out_shape = nil; + // Use this if keepdim is false + int64_t num_output_dims = num_input_dims - 1; + + std::vector vec_apparent_out_shape(num_input_dims); + std::vector vec_out_shape(num_output_dims); + + apparent_out_shape = [NSMutableArray arrayWithCapacity:num_input_dims]; + // Counter for shape when keepdim is false + int out_i = 0; + for(int i = 0; i < num_input_dims; i++) { + if(dim_ == i) { + apparent_out_shape[i] = @1; + vec_apparent_out_shape[i] = 1; + } + else { + apparent_out_shape[i] = [NSNumber numberWithInt:input_shape[i]]; + vec_apparent_out_shape[i] = input_shape[i]; + vec_out_shape[out_i] = input_shape[i]; + out_i++; + } + } + + if(!keepdim) { + values = at::native::empty_mps( + IntArrayRef(vec_out_shape), + input_t.scalar_type(), + c10::nullopt, + kMPS, + c10::nullopt, + c10::nullopt); + indices = at::native::empty_mps( + IntArrayRef(vec_out_shape), + ScalarType::Long, + c10::nullopt, + kMPS, + c10::nullopt, + c10::nullopt); + } else { + values = at::native::empty_mps( + IntArrayRef(vec_apparent_out_shape), + input_t.scalar_type(), + c10::nullopt, + kMPS, + c10::nullopt, + c10::nullopt); + indices = at::native::empty_mps( + IntArrayRef(vec_apparent_out_shape), + ScalarType::Long, + c10::nullopt, + kMPS, + c10::nullopt, + c10::nullopt); + } + + if (values.numel() == 0 || input_t.numel() == 0) { + return std::tuple{values, indices}; + } + + if(!is_macos_13_or_newer()){ + TORCH_WARN_ONCE("MPS: median op is supported natively starting from macOS 13.0.", + "Falling back on CPU. 
This may have performance implications."); + return median_from_cpu(input_t.to("cpu"), dim, keepdim, values, indices, IntArrayRef(vec_out_shape),IntArrayRef(vec_apparent_out_shape) ); + } + + median_out_mps(input_t, dim, keepdim, values, indices, "median_out_mps"); + + return std::tuple{values, indices}; +} + } // native } // at diff --git a/aten/src/ATen/native/mps/operations/Repeat.mm b/aten/src/ATen/native/mps/operations/Repeat.mm index 53bcddf405cc..8b6b709da642 100644 --- a/aten/src/ATen/native/mps/operations/Repeat.mm +++ b/aten/src/ATen/native/mps/operations/Repeat.mm @@ -108,16 +108,17 @@ Tensor repeat_mps(const Tensor& self, IntArrayRef repeats) { num_repeat_dims); // Set output shape - int64_t output_shape[num_repeat_dims]; + std::vector output_shape(num_repeat_dims); bool zero_tensor = false; - for(int i = 0; i < num_repeat_dims; i++) { + for(auto i : c10::irange(num_repeat_dims)) { output_shape[i] = repeats[i] * [apparent_input_shape[i] intValue]; - if(output_shape[i] == 0) + if(output_shape[i] == 0) { zero_tensor = true; + } } Tensor output = at::native::empty_mps( - IntArrayRef(output_shape, num_repeat_dims), + IntArrayRef(output_shape), self.scalar_type(), c10::nullopt, kMPS, diff --git a/aten/src/ATen/native/mps/operations/RnnOps.mm b/aten/src/ATen/native/mps/operations/RnnOps.mm index f15e842b54b2..23a59a19fdd2 100644 --- a/aten/src/ATen/native/mps/operations/RnnOps.mm +++ b/aten/src/ATen/native/mps/operations/RnnOps.mm @@ -193,7 +193,7 @@ Placeholder recurrentKernelWeight; Placeholder bias; Placeholder recurrentBias; - NSMutableDictionary *feeds = [[NSMutableDictionary alloc] init]; + NSMutableDictionary *feeds = [[[NSMutableDictionary alloc] init] autorelease]; for (size_t i = 0; i < num_layers; i+=1) { kernelWeight = Placeholder([kernelWeightsList objectAtIndex:i], kernel_weights[i]); recurrentKernelWeight = Placeholder([recurrentKernelWeightsList objectAtIndex:i], recurrent_kernel_weights[i]); @@ -425,7 +425,7 @@ Placeholder gradientHyPlaceholder = Placeholder(cachedGraph->inputTensors_[6], grad_hy); Placeholder gradientCyPlaceholder = Placeholder(cachedGraph->inputTensors_[7], grad_cy); - NSMutableDictionary *feeds = [[NSMutableDictionary alloc] init]; + NSMutableDictionary *feeds = [[[NSMutableDictionary alloc] init] autorelease]; [feeds setObject:gradientPlaceholder.getMPSGraphTensorData() forKey:gradientPlaceholder.getMPSGraphTensor()]; [feeds setObject:gradientHyPlaceholder.getMPSGraphTensorData() forKey:gradientHyPlaceholder.getMPSGraphTensor()]; [feeds setObject:gradientCyPlaceholder.getMPSGraphTensorData() forKey:gradientCyPlaceholder.getMPSGraphTensor()]; @@ -469,7 +469,7 @@ std::vector grad_hx = {grad_state, grad_cell_state}; - NSMutableDictionary *results = [[NSMutableDictionary alloc] init]; + NSMutableDictionary *results = [[[NSMutableDictionary alloc] init] autorelease]; NSMutableArray *gradOutputArray = cachedGraph->gradOutput_; NSMutableArray *gradRecWeightsArray = cachedGraph->gradRecWeights_; NSMutableArray *gradWeightsArray = cachedGraph->gradWeights_; diff --git a/aten/src/ATen/native/mps/operations/ScatterGather.mm b/aten/src/ATen/native/mps/operations/ScatterGather.mm index c4943d1242d9..cf8d8a1fef7e 100644 --- a/aten/src/ATen/native/mps/operations/ScatterGather.mm +++ b/aten/src/ATen/native/mps/operations/ScatterGather.mm @@ -15,7 +15,7 @@ namespace native { TORCH_IMPL_FUNC(gather_out_mps) -(const Tensor & self, +(const Tensor & self_arg, int64_t dim, const Tensor & index, bool sparse_grad, @@ -24,6 +24,8 @@ using namespace mps; MPSStream* stream
= getCurrentMPSStream(); + auto self = self_arg.dim() == 0 ? self_arg.view({1}) : self_arg; + dim = at::maybe_wrap_dim(dim, self.dim()); TORCH_CHECK(!sparse_grad, "sparse_grad not supported in MPS yet") @@ -150,7 +152,7 @@ } void scatter_mps_general -(const Tensor& self, +(const Tensor& self_arg, int64_t dim, const Tensor& index, const Tensor& src, @@ -161,6 +163,8 @@ using namespace mps; MPSStream* stream = getCurrentMPSStream(); + auto self = self_arg.dim() == 0 ? self_arg.view({1}) : self_arg; + dim = at::maybe_wrap_dim(dim, self.dim()); TORCH_CHECK(index.scalar_type() == ScalarType::Long || index.scalar_type() == ScalarType::Int, "index_select(): Expected dtype int32 or int64 for index"); @@ -358,13 +362,13 @@ // 2. Flatten the values // 3. Scatter into input with add mode - int shape_data[num_input_dims]; + std::vector shape_data(num_input_dims); for(int i = 0; i < num_input_dims; i++) { shape_data[i] = {[scatterInputShape[i] intValue]}; } - MPSGraphTensor* scatterInputShapeTensor = [mpsGraph constantWithData:[NSData dataWithBytes:shape_data length:num_input_dims * sizeof(int)] + MPSGraphTensor* scatterInputShapeTensor = [mpsGraph constantWithData:[NSData dataWithBytes:shape_data.data() length:num_input_dims * sizeof(int)] shape:@[[NSNumber numberWithInt:num_input_dims]] dataType:MPSDataTypeInt32]; diff --git a/aten/src/ATen/native/mps/operations/Shape.mm b/aten/src/ATen/native/mps/operations/Shape.mm index 6bb918061c89..f491f2ff823a 100644 --- a/aten/src/ATen/native/mps/operations/Shape.mm +++ b/aten/src/ATen/native/mps/operations/Shape.mm @@ -16,288 +16,6 @@ namespace at { namespace native { -namespace mps { - -// Pad operations (1D/2D/3D forward and backward) -Tensor& pad_out_template(Tensor &output, const Tensor &input_, IntArrayRef padding, - const c10::optional& grad_output_opt, - MPSGraphPaddingMode mode, double constantValue, const string op_name) -{ - const int padding_size = (int) padding.size(); - const int padding_dim = padding_size / 2; // either 1D, 2D, or 3D - - TORCH_CHECK(padding_size == 2 || padding_size == 4 || padding_size == 6, - "invalid padding argument of size ", padding_size); - - const Tensor& grad_output_ = *(at::borrow_from_optional_tensor(grad_output_opt)); - const bool is_backward_pass = grad_output_.defined(); - - int dim_w = padding_dim, dim_h = padding_dim - 1, dim_d = padding_dim - 2, dim_slices = 0; - int64_t nbatch = 1, ndims = input_.ndimension(); - - if (!is_backward_pass) { - bool valid_dims = input_.size(1) != 0 && input_.size(padding_dim) != 0; - TORCH_CHECK((ndims == 1 + padding_dim && valid_dims) || - (ndims == 2 + padding_dim && valid_dims && input_.size(1 + padding_dim) != 0), - "3D or 4D (batch mode) tensor expected for input, but got: ", input_); - } - - if (ndims == 2 + padding_dim) { - nbatch = input_.size(0); - dim_w++; - dim_h++; - dim_d++; - dim_slices++; - } - - int64_t pad_l = padding[0]; - int64_t pad_r = padding[1]; - int64_t pad_t = padding_dim > 1 ? padding[2] : 0; - int64_t pad_b = padding_dim > 1 ? padding[3] : 0; - int64_t pad_front = padding_dim > 2 ? padding[4] : 0; - int64_t pad_back = padding_dim > 2 ? padding[5] : 0; - - int64_t nplane = input_.size(dim_slices); - int64_t input_w = input_.size(dim_w); - int64_t output_w = input_w + pad_l + pad_r; - int64_t input_h = padding_dim > 1 ? input_.size(dim_h) : 0; - int64_t output_h = padding_dim > 1 ? input_h + pad_t + pad_b : 0; - int64_t input_d = padding_dim > 2 ? input_.size(dim_d) : 0; - int64_t output_d = padding_dim > 2 ? 
input_d + pad_front + pad_back : 0; - - Tensor grad_output, input = input_; - - if (!is_backward_pass) { - TORCH_CHECK(pad_l < input_w && pad_r < input_w, - "Argument #4: Padding size should be less than the corresponding " - "input dimension, but got: padding (", pad_l, ", ", pad_r, - ") at dimension ", dim_w, " of input ", ndims); - - if (padding_dim > 1) { - TORCH_CHECK(pad_t < input_h && pad_b < input_h, - "Argument #6: Padding size should be less than the corresponding " - "input dimension, but got: padding (", pad_t, ", ", pad_b, - ") at dimension ", dim_h, " of input ", ndims); - } - TORCH_CHECK(output_w >= 1 || output_h >= padding_dim - 1, - "input (H: ", input_h, ", W: ", input_w, ") is too small. Calculated " - "output H: ", output_h, " W: ", output_w); - - if (ndims == 1 + padding_dim) { - if (padding_dim == 3) - output.resize_({nplane, output_d, output_h, output_w}); - else if (padding_dim == 2) - output.resize_({nplane, output_h, output_w}); - else - output.resize_({nplane, output_w}); - } else { - if (padding_dim == 3) - output.resize_({nbatch, nplane, output_d, output_h, output_w}); - else if (padding_dim == 2) - output.resize_({nbatch, nplane, output_h, output_w}); - else - output.resize_({nbatch, nplane, output_w}); - } - if (output.numel() == 0 || input_.numel() == 0) - return output; - input = input_.contiguous(); - } else { - TORCH_CHECK(output_w == grad_output_.size(dim_w), - "gradOutput width unexpected. Expected: ", output_w, ", Got: ", grad_output_.size(dim_w)); - if (padding_dim > 1) { - TORCH_CHECK(output_h == grad_output_.size(dim_h), - "gradOutput height unexpected. Expected: ", output_h, ", Got: ", grad_output_.size(dim_h)); - } - grad_output = grad_output_.contiguous(); - } - - const int64_t input_dim = input.dim(); - MPSShape *leftPadding = nullptr, *rightPadding = nullptr; - if (padding_dim == 3) { - leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_front), @(pad_t), @(pad_l) } count:input_dim]; - rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_back), @(pad_b), @(pad_r) } count:input_dim]; - } else if (padding_dim == 2) { - leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_t), @(pad_l) } count:input_dim]; - rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_b), @(pad_r) } count:input_dim]; - } else if (padding_dim == 1) { - leftPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_l) } count:input_dim]; - rightPadding = [NSArray arrayWithObjects:(const NSNumber*[]){ @(0), @(0), @(pad_r) } count:input_dim]; - } - - struct CachedGraph : public MPSCachedGraph { - CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) { } - MPSGraphTensor *inputTensor = nil, *outputTensor = nil; - MPSGraphTensor *gradOutputTensor = nil; - }; - MPSGraphCache* cache_ = MPSGraphCache::getInstance(); - - @autoreleasepool { - string key = op_name + getTensorsStringKey({input, grad_output}) + - ":L" + to_string(pad_l) + ":R" + to_string(pad_r) + - ":T" + to_string(pad_t) + ":B" + to_string(pad_b) + - ":F" + to_string(pad_front) + ":K" + to_string(pad_back); - - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); - if(!cachedGraph) { - cachedGraph = static_cast(cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { - CachedGraph *newCachedGraph = nil; - @autoreleasepool { - MPSGraph* mpsGraph = make_mps_graph(); - newCachedGraph = new CachedGraph(mpsGraph); - newCachedGraph->inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input); - if 
(!is_backward_pass) { - newCachedGraph->outputTensor = [mpsGraph padTensor:newCachedGraph->inputTensor - withPaddingMode:mode - leftPadding:leftPadding - rightPadding:rightPadding - constantValue:constantValue - name:nil]; - } else { - newCachedGraph->gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output); - newCachedGraph->outputTensor = [mpsGraph padGradientWithIncomingGradientTensor:newCachedGraph->gradOutputTensor - sourceTensor:newCachedGraph->inputTensor - paddingMode:mode - leftPadding:leftPadding - rightPadding:rightPadding - name:nil]; - } - } - return newCachedGraph; - })); - } - Placeholder inputPlaceholder = Placeholder(cachedGraph->inputTensor, input); - Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor, output); - - NSMutableDictionary *feeds = [[NSMutableDictionary new] autorelease]; - feeds[inputPlaceholder.getMPSGraphTensor()] = inputPlaceholder.getMPSGraphTensorData(); - if (is_backward_pass) { - Placeholder gradOutputPlaceholder = Placeholder(cachedGraph->gradOutputTensor, grad_output); - feeds[gradOutputPlaceholder.getMPSGraphTensor()] = gradOutputPlaceholder.getMPSGraphTensorData(); - } - NSDictionary* results = @{ - outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() - }; - runMPSGraph(getCurrentMPSStream(), cachedGraph->graph(), feeds, results); - } - return output; -} -} // namespace mps - -// 1D Reflection and Replication Padding -TORCH_IMPL_FUNC(reflection_pad1d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeReflect, 0.0, "reflection_pad1d_out_mps"); -} - -TORCH_IMPL_FUNC(reflection_pad1d_backward_out_mps) -(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, - MPSGraphPaddingModeReflect, 0.0, "reflection_pad1d_backward_out_mps"); -} - -TORCH_IMPL_FUNC(replication_pad1d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad1d_out_mps"); -} - -TORCH_IMPL_FUNC(replication_pad1d_backward_out_mps) -(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, - MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad1d_backward_out_mps"); -} - -// 2D Reflection and Replication Padding -Tensor& reflection_pad2d_out_mps(const Tensor& input, IntArrayRef padding, Tensor& output) -{ - return mps::pad_out_template(output, input, padding, c10::nullopt, MPSGraphPaddingModeReflect, 0.0, __func__); -} - -Tensor reflection_pad2d_mps(const Tensor& input, IntArrayRef padding) -{ - Tensor output = at::empty({0}, input.options()); - return mps::pad_out_template(output, input, padding, c10::nullopt, MPSGraphPaddingModeReflect, 0.0, __func__); -} - -Tensor& reflection_pad2d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeReflect, 0.0, __func__); -} - -Tensor reflection_pad2d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef 
padding) -{ - auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeReflect, 0.0, __func__); -} - -TORCH_IMPL_FUNC(replication_pad2d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad2d_out_mps"); -} - -Tensor& replication_pad2d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); -} - -Tensor replication_pad2d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) -{ - auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); -} - -// 3D Reflection and Replication Padding -TORCH_IMPL_FUNC(reflection_pad3d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeReflect, 0.0, "reflection_pad3d_out_mps"); -} - -TORCH_IMPL_FUNC(reflection_pad3d_backward_out_mps) -(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, const Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - mps::pad_out_template(const_cast(grad_input), input, padding, grad_output, - MPSGraphPaddingModeReflect, 0.0, "reflection_pad3d_backward_out_mps"); -} - -TORCH_IMPL_FUNC(replication_pad3d_out_mps) -(const Tensor& input, IntArrayRef padding, const Tensor& output) -{ - mps::pad_out_template(const_cast(output), input, padding, c10::nullopt, - MPSGraphPaddingModeClampToEdge, 0.0, "replication_pad3d_out_mps"); -} - -Tensor& replication_pad3d_backward_out_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding, Tensor& grad_input) -{ - grad_input.resize_as_(input).zero_(); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); -} - -Tensor replication_pad3d_backward_mps(const Tensor& grad_output, const Tensor& input, IntArrayRef padding) -{ - auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT); - return mps::pad_out_template(grad_input, input, padding, grad_output, MPSGraphPaddingModeClampToEdge, 0.0, __func__); -} - -// backward pass is exlicitly handled in autograd by negating the "pad" argument -Tensor constant_pad_nd_mps(const Tensor& self, IntArrayRef pad, const Scalar& value) -{ - Tensor output = at::empty({0}, self.options()); - return mps::pad_out_template(output, self, pad, c10::nullopt, MPSGraphPaddingModeConstant, value.toDouble(), __func__); -} // topk TORCH_IMPL_FUNC(topk_out_mps) @@ -499,7 +217,7 @@ void check_shape_except_dim(const Tensor &first, const Tensor &second, //} TORCH_IMPL_FUNC(cat_out_mps) - (ITensorListRef inputs, + (const ITensorListRef& inputs, int64_t dimension, int64_t valid, bool all_contiguous, @@ -521,7 +239,7 @@ void check_shape_except_dim(const Tensor &first, const Tensor &second, idx++; } - dimension = legacy_cat_wrap_dim(dimension, inputs); + dimension = legacy_cat_wrap_dim(dimension, materialized_inputs); // previously, size [0] tensors were the only possible empty tensors; thus, it // wasn't possible to cat empty 
tensors unless all the other tensors were @@ -671,8 +389,8 @@ void check_shape_except_dim(const Tensor &first, const Tensor &second, // Create placeholders auto len_tensor_array = inputs.size() - skipped_tensor_indices.size(); - MPSGraphTensor* inputMPSGraphTensors[len_tensor_array]; - MPSGraphTensor* castInputMPSGraphTensors[len_tensor_array]; + std::vector inputMPSGraphTensors(len_tensor_array); + std::vector castInputMPSGraphTensors(len_tensor_array); int graph_tensor_idx = 0; for(const Tensor* tensor : input_tensors) { @@ -693,7 +411,7 @@ void check_shape_except_dim(const Tensor &first, const Tensor &second, graph_tensor_idx++; } - auto inputTensorsArray = [NSArray arrayWithObjects:castInputMPSGraphTensors + auto inputTensorsArray = [NSArray arrayWithObjects:castInputMPSGraphTensors.data() count:len_tensor_array]; // Use concatTensors to concatenate MPSGraphTensor* outputTensor = [mpsGraph concatTensors:inputTensorsArray diff --git a/aten/src/ATen/native/mps/operations/TensorCompare.mm b/aten/src/ATen/native/mps/operations/TensorCompare.mm index fb3b93a602f1..44d19e99c2f6 100644 --- a/aten/src/ATen/native/mps/operations/TensorCompare.mm +++ b/aten/src/ATen/native/mps/operations/TensorCompare.mm @@ -37,6 +37,47 @@ void clamp_mps_graph(CachedGraph* cachedGraph, const Tensor& input_tensor) } } +void check_min_max_dims(const OptionalTensorRef clamp_opt, + const Tensor& input_t, + string op_name) { + + if(!clamp_opt->is_same_size(input_t)) { + + auto num_clamp_dims = clamp_opt->dim(); + auto num_input_dims = input_t.dim(); + + auto clamp_shape = clamp_opt->sizes(); + auto input_shape = input_t.sizes(); + + TORCH_CHECK(num_clamp_dims <= num_input_dims, op_name + ": clamp tensor number of dims must not be greater than that of input tensor") + + for(int i = 0; i < num_clamp_dims; i++) + // One of the indices is allowed to be 1; will be handled by broadcast + TORCH_CHECK(clamp_shape[num_clamp_dims-1-i] == input_shape[num_input_dims-1-i] || + clamp_shape[num_clamp_dims-1-i] == 1 || + input_shape[num_input_dims-1-i] == 1, + op_name + ": clamp tensor trailing shape must match input tensor") + + } +} + +void fill_new_shape(int64_t num_input_dims, + int64_t num_clamp_dims, + int64_t *new_shape, + IntArrayRef clamp_shape) { + + // Extend the shape with ones to the left + int clamp_idx = 0; + for(int i = 0; i < num_input_dims; i++) { + if(i < num_input_dims - num_clamp_dims) + new_shape[i] = 1; + else { + new_shape[i] = clamp_shape[clamp_idx]; + clamp_idx++; + } + } +} + void clamp_tensor_out_mps(const Tensor& input_t, const OptionalTensorRef min_opt, const OptionalTensorRef max_opt, @@ -48,17 +89,54 @@ void clamp_tensor_out_mps(const Tensor& input_t, TORCH_CHECK(has_min || has_max, op_name + ": either min, max or both tensors must be defined") if (has_min) - TORCH_CHECK(min_opt->is_same_size(input_t), op_name + ": min and input tensors must be of the same shape") + check_min_max_dims(min_opt, input_t, op_name); + if (has_max) - TORCH_CHECK(max_opt->is_same_size(input_t), op_name + ": max and input tensors must be of the same shape") + check_min_max_dims(max_opt, input_t, op_name); if (output_t.numel() == 0) return; + IntArrayRef new_min_shape; + IntArrayRef new_max_shape; + + auto num_min_dims = min_opt->dim(); + auto num_max_dims = max_opt->dim(); + auto num_input_dims = input_t.dim(); + + std::vector new_min_arr(num_input_dims); + std::vector new_max_arr(num_input_dims); + + if(has_min && num_min_dims < num_input_dims) { + fill_new_shape(num_input_dims, num_min_dims, new_min_arr.data(), 
min_opt->sizes()); + new_min_shape = IntArrayRef(new_min_arr); + } + + if(has_max && num_max_dims < num_input_dims) { + fill_new_shape(num_input_dims, num_max_dims, new_max_arr.data(), max_opt->sizes()); + new_max_shape = IntArrayRef(new_max_arr); + } + + Tensor min_opt_tensor; + Tensor max_opt_tensor; + + if(has_min) { + min_opt_tensor = (num_min_dims < num_input_dims) ? (*min_opt).view(new_min_shape) : *min_opt; + } + if(has_max) { + max_opt_tensor = (num_max_dims < num_input_dims) ? (*max_opt).view(new_max_shape) : *max_opt; + } + @autoreleasepool { // the optional min/max refs could affect how we build the cached graph + + auto tensor_key = has_min ? (has_max ? getTensorsStringKey({input_t, min_opt_tensor, max_opt_tensor}) + : getTensorsStringKey({input_t, min_opt_tensor})) + : (has_max ? getTensorsStringKey({input_t, max_opt_tensor}) + : getTensorsStringKey({input_t})); + string key = op_name + (has_min ? "_min" : "") + (has_max ? "_max" : "") - + "_tensor" + getTensorsStringKey({input_t}); + + "_tensor" + tensor_key; MPSGraphCache* cache_ = MPSGraphCache::getInstance(); CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); @@ -71,9 +149,9 @@ void clamp_tensor_out_mps(const Tensor& input_t, newCachedGraph = new CachedGraph(mpsGraph); if (has_min) - newCachedGraph->minTensor = mpsGraphRankedPlaceHolder(mpsGraph, *min_opt); + newCachedGraph->minTensor = mpsGraphRankedPlaceHolder(mpsGraph, min_opt_tensor); if (has_max) - newCachedGraph->maxTensor = mpsGraphRankedPlaceHolder(mpsGraph, *max_opt); + newCachedGraph->maxTensor = mpsGraphRankedPlaceHolder(mpsGraph, max_opt_tensor); clamp_mps_graph(newCachedGraph, input_t); } @@ -88,11 +166,11 @@ void clamp_tensor_out_mps(const Tensor& input_t, NSMutableDictionary *feeds = [[NSMutableDictionary new] autorelease]; feeds[inputPlaceholder.getMPSGraphTensor()] = inputPlaceholder.getMPSGraphTensorData(); if (has_min) { - auto minPlaceholder = Placeholder(cachedGraph->minTensor, *min_opt); + auto minPlaceholder = Placeholder(cachedGraph->minTensor, min_opt_tensor); feeds[minPlaceholder.getMPSGraphTensor()] = minPlaceholder.getMPSGraphTensorData(); } if (has_max) { - auto maxPlaceholder = Placeholder(cachedGraph->maxTensor, *max_opt); + auto maxPlaceholder = Placeholder(cachedGraph->maxTensor, max_opt_tensor); feeds[maxPlaceholder.getMPSGraphTensor()] = maxPlaceholder.getMPSGraphTensorData(); } @@ -302,29 +380,33 @@ Tensor where_mps(const Tensor& condition, const Tensor& self, const Tensor& other) { - bool cond_zero_shape = (condition.dim() == 0); - bool self_zero_shape = (self.dim() == 0); - bool other_zero_shape = (other.dim() == 0); - auto max_dim = std::max(condition.dim(), std::max(self.dim(), other.dim())); - auto sum_dims = condition.dim() + self.dim() + other.dim(); + // How many leading dimensions do we broadcast across for each Tensor? + int cond_num_implicit_ones = (max_dim - condition.dim()); + int self_num_implicit_ones = (max_dim - self.dim()); + int other_num_implicit_ones = (max_dim - other.dim()); - TORCH_CHECK(max_dim == 0 || !(sum_dims % max_dim), "All inputs of where should have same/compatible number of dims") - - int64_t out_arr[max_dim]; + std::vector out_arr(max_dim); // Broadcasted output shape for(int i = 0; i < max_dim; i++) { - int64_t cond_num = cond_zero_shape ? 0 : condition.size(i); - int64_t self_num = self_zero_shape ? 0 : self.size(i); - int64_t other_num = other_zero_shape ? 
0 : other.size(i); + // Use up the leading broadcast dimensions for each Tensor, then continue from the start of the "actual" shape + int64_t cond_idx = i < cond_num_implicit_ones ? 1 : (condition.size(i - cond_num_implicit_ones)); + int64_t self_idx = i < self_num_implicit_ones ? 1 : (self.size(i - self_num_implicit_ones)); + int64_t other_idx = i < other_num_implicit_ones ? 1 : (other.size(i - other_num_implicit_ones)); + + auto max_idx = std::max({cond_idx, self_idx, other_idx}); + + TORCH_CHECK(cond_idx == max_idx || cond_idx == 1 || (cond_idx == 0 && max_idx == 1), i, "'th index ", cond_idx, " of condition tensor does not match the other tensors") + TORCH_CHECK(self_idx == max_idx || self_idx == 1 || (self_idx == 0 && max_idx == 1), i, "'th index ", self_idx, " of x tensor does not match the other tensors") + TORCH_CHECK(other_idx == max_idx || other_idx == 1 || (other_idx == 0 && max_idx == 1), i, "'th index ", other_idx, " of x tensor does not match the other tensors") - out_arr[i] = std::max(cond_num, std::max(self_num, other_num)); + out_arr[i] = (cond_idx == 0 || self_idx == 0 || other_idx == 0) ? 0 : max_idx; } - Tensor ret = empty_mps(IntArrayRef(out_arr, max_dim), + Tensor ret = empty_mps(IntArrayRef(out_arr), self.scalar_type(), c10::nullopt, kMPS, diff --git a/aten/src/ATen/native/mps/operations/TriangularOps.mm b/aten/src/ATen/native/mps/operations/TriangularOps.mm index fb6e1c52ba49..c27670796499 100644 --- a/aten/src/ATen/native/mps/operations/TriangularOps.mm +++ b/aten/src/ATen/native/mps/operations/TriangularOps.mm @@ -172,197 +172,5 @@ } -Tensor& diag_mps_out(const Tensor& self, - int64_t diagonal, - Tensor &output) { - - // Do checks, resize output - IntArrayRef input_size = self.sizes(); - auto num_input_dims = input_size.size(); - // Input can only be 1D or 2D - TORCH_CHECK(num_input_dims == 1 || num_input_dims == 2, - "diag_mps_out: Input tensor must be 1D or 2D") - - if(num_input_dims == 1) { - auto n = input_size[0]; - if(diagonal > 0) - n += diagonal; - else if(diagonal < 0) - n -= diagonal; - - output.resize_({n, n}); - } - else if(num_input_dims == 2) { - auto num_diag_elements = std::min(input_size[0], input_size[1]); - if(diagonal > 0) { - TORCH_CHECK(input_size[1] - diagonal > 0, "Matrix not big enough for requested diagonal") - num_diag_elements = std::min(input_size[0], input_size[1] - diagonal); - } - else if(diagonal < 0) { - TORCH_CHECK(input_size[0] + diagonal > 0, "Matrix not big enough for requested diagonal") - num_diag_elements = std::min(input_size[0] + diagonal, input_size[1]); - } - - output.resize_({num_diag_elements}); - } - - using namespace mps; - MPSStream* stream = getCurrentMPSStream(); - - // Derive from MPSCachedGraph - struct CachedGraph : public MPSCachedGraph - { - CachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) {} - MPSGraphTensor *inputTensor_ = nil; - MPSGraphTensor *outputTensor_ = nil; - }; - - MPSGraphCache* cache_ = MPSGraphCache::getInstance(); - - @autoreleasepool { - - MPSShape* input_shape = getMPSShape(self); - MPSShape* output_shape = getMPSShape(output); - NSNumber* num_input_cols = nil; - NSNumber* num_output_cols = nil; - NSMutableArray* flat_input_shape = nil; - NSMutableArray* flat_output_shape = nil; - if(num_input_dims == 1) { - num_output_cols = output_shape[1]; - flat_output_shape = [NSMutableArray arrayWithCapacity:1]; - flat_output_shape[0] = [NSNumber numberWithInt:[output_shape[0] intValue] * [output_shape[1] intValue]]; - } - else if(num_input_dims == 2) { - num_input_cols = input_shape[1]; - 
flat_input_shape = [NSMutableArray arrayWithCapacity:1]; - flat_input_shape[0] = [NSNumber numberWithInt:[input_shape[0] intValue] * [input_shape[1] intValue]]; - } - NSString* ns_shape_key = [[input_shape valueForKey:@"description"] componentsJoinedByString:@","]; - string key = "diag_mps_out:" + getMPSTypeString(self.scalar_type()) + ":" + std::to_string(diagonal) - + ":" + string([ns_shape_key UTF8String]); - CachedGraph* cachedGraph = static_cast(cache_->LookUp(key)); - - if(!cachedGraph) { - MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ MPSCachedGraph * () { - CachedGraph *newCachedGraph = nil; - - @autoreleasepool { - MPSGraph* mpsGraph = make_mps_graph(); - newCachedGraph = new CachedGraph(mpsGraph); - - // TODO: Accept this as the flat version in 2D case - MPSGraphTensor* inputTensor = nil; - if(num_input_dims == 1) - inputTensor = mpsGraphUnrankedPlaceHolder(mpsGraph, getMPSDataType(self.scalar_type())); - else - inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(self.scalar_type()), flat_input_shape); - - MPSGraphTensor* outputTensor = nil; - - MPSGraphTensor* zeroTensor = [mpsGraph constantWithScalar:0 - dataType:MPSDataTypeInt32]; - MPSGraphTensor* numDiagElementsRange = nil; - MPSGraphTensor* diagOffset = nil; - MPSGraphTensor* rowMultiplier = nil; - MPSGraphTensor* rowIndices = nil; - MPSGraphTensor* colIndices = nil; - MPSGraphTensor* indicesTensor = nil; - - if(num_input_dims == 1) { - int shape_data[1] = {[input_shape[0] intValue]}; - MPSGraphTensor* inputShapeTensor = [mpsGraph constantWithData:[NSData dataWithBytes:shape_data length:sizeof(int)] - shape:@[@1] - dataType:MPSDataTypeInt32]; - numDiagElementsRange = [mpsGraph coordinateAlongAxisTensor: zeroTensor - withShapeTensor: inputShapeTensor - name: nil]; - diagOffset = [mpsGraph constantWithScalar:diagonal - dataType:MPSDataTypeInt32]; - rowMultiplier = [mpsGraph constantWithScalar:[num_output_cols intValue] - dataType:MPSDataTypeInt32]; - } - else { - int shape_data[1] = {[output_shape[0] intValue]}; - MPSGraphTensor* outputShapeTensor = [mpsGraph constantWithData:[NSData dataWithBytes:shape_data length:sizeof(int)] - shape:@[@1] - dataType:MPSDataTypeInt32]; - numDiagElementsRange = [mpsGraph coordinateAlongAxisTensor: zeroTensor - withShapeTensor: outputShapeTensor - name: nil]; - diagOffset = [mpsGraph constantWithScalar:diagonal - dataType:MPSDataTypeInt32]; - rowMultiplier = [mpsGraph constantWithScalar:[num_input_cols intValue] - dataType:MPSDataTypeInt32]; - } - - if(diagonal >= 0) { - rowIndices = numDiagElementsRange; - colIndices = [mpsGraph additionWithPrimaryTensor:numDiagElementsRange - secondaryTensor:diagOffset - name:nil]; - } - else { - rowIndices = [mpsGraph subtractionWithPrimaryTensor:numDiagElementsRange - secondaryTensor:diagOffset - name:nil];; - colIndices = numDiagElementsRange; - } - - indicesTensor = [mpsGraph multiplicationWithPrimaryTensor:rowIndices - secondaryTensor:rowMultiplier - name:nil]; - indicesTensor = [mpsGraph additionWithPrimaryTensor:indicesTensor - secondaryTensor:colIndices - name:nil]; - - if(num_input_dims == 1) { - // TODO: Scatter mode doesn't matter, so what should I set it to be? 
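As a rough, minimal sketch (not taken from the patch itself) of the user-visible behavior the TensorCompare.mm hunks above are aiming at, namely clamp() accepting min/max tensors that broadcast against the input and where() treating missing leading dimensions as implicit 1s, assuming a macOS build of this branch with the MPS backend available and purely illustrative tensor values:

import torch

x  = torch.randn(2, 3, 4, device="mps")
lo = torch.full((4,), -0.5, device="mps")    # fewer dims than x: trailing dims must match or be 1
hi = torch.full((3, 4), 0.5, device="mps")
y  = torch.clamp(x, min=lo, max=hi)          # previously min/max had to be the same size as x

cond = torch.tensor([True, False, True, False], device="mps")  # shape (4,)
a = torch.randn(2, 3, 4, device="mps")
b = torch.zeros(1, device="mps")                               # broadcast everywhere
z = torch.where(cond, a, b)                                    # leading dims become implicit 1s -> shape (2, 3, 4)
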
- outputTensor = [mpsGraph scatterWithUpdatesTensor:inputTensor - indicesTensor:indicesTensor - shape:flat_output_shape - axis:0 - mode:MPSGraphScatterModeAdd - name:nil]; - outputTensor = [mpsGraph reshapeTensor:outputTensor - withShape:output_shape - name:nil]; - } - else if(num_input_dims == 2) { - outputTensor = [mpsGraph gatherWithUpdatesTensor:inputTensor - indicesTensor:indicesTensor - axis:0 - batchDimensions:0 - name:nil]; - } - - newCachedGraph->inputTensor_ = inputTensor; - newCachedGraph->outputTensor_ = outputTensor; - } - return newCachedGraph; - }); - cachedGraph = static_cast(tmpCachedGraph); - } - - Placeholder selfPlaceholder = Placeholder(); - if(num_input_dims == 1) - selfPlaceholder = Placeholder(cachedGraph->inputTensor_, self); - else - selfPlaceholder = Placeholder(cachedGraph->inputTensor_, self, flat_input_shape); - - Placeholder outputPlaceholder = Placeholder(cachedGraph->outputTensor_, output); - - NSDictionary* feeds = @{ - selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData() - }; - NSDictionary* results = @{ - outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData() - }; - - runMPSGraph(stream, cachedGraph->graph(), feeds, results); - } - - return output; -} - } // namespace native } // namespace at diff --git a/aten/src/ATen/native/mps/operations/UnaryOps.mm b/aten/src/ATen/native/mps/operations/UnaryOps.mm index 2231a66fb3ac..3d641d3af82c 100644 --- a/aten/src/ATen/native/mps/operations/UnaryOps.mm +++ b/aten/src/ATen/native/mps/operations/UnaryOps.mm @@ -5,6 +5,7 @@ #include #include #include +#include #include namespace at { @@ -12,24 +13,29 @@ namespace mps { typedef MPSGraphTensor* (^UnaryOpBlock)(MPSGraph*, MPSGraphTensor*); +using is_noop_p = std::function; -void unary_op(const Tensor& self, const Tensor& output, std::string op_name, UnaryOpBlock unaryBlock) + +bool is_empty_tensor(const Tensor& self) { + return self.numel() == 0; +} + +void unary_op(const Tensor& self, const Tensor& output, std::string op_name, UnaryOpBlock unaryBlock, is_noop_p is_noop = is_empty_tensor) { - TORCH_CHECK_TYPE(self.scalar_type() != ScalarType::Long, "Operation '", op_name, "()' does not support input type 'int64' in MPS backend."); if (!output.is_same_size(self)) { output.resize_(self.sizes()); } - // Empty tensor is noop - if (self.numel() == 0) { + if (is_noop(self)) { + output.copy_(self); return; } MPSGraphCache* cache_ = MPSGraphCache::getInstance(); @autoreleasepool { - string key = op_name + getTensorsStringKey({self}, /*use_scalar_value*/ false); + string key = op_name + getTensorsStringKey({self, output}, /*use_scalar_value*/ false); auto cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ MPSCachedGraph* () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ MPSCachedGraph* () { MPSUnaryCachedGraph *newCachedGraph = nil; @autoreleasepool { MPSGraph* mpsGraph = make_mps_graph(); @@ -44,7 +50,6 @@ void unary_op(const Tensor& self, const Tensor& output, std::string op_name, Una } return newCachedGraph; }); - cachedGraph = tmpCachedGraph->as(); } Placeholder selfPlaceholder = Placeholder(cachedGraph->inputTensor_, self); @@ -61,6 +66,14 @@ void unary_op(const Tensor& self, const Tensor& output, std::string op_name, Una MPSGraphTensor* trunc_tensor(MPSGraph* mpsGraph, MPSGraphTensor* inputTensor) { + // Rounding is a no-op for integral types, and also a reasonable workaround + // For MPSGraph bug on Apple Silicon, that throws `Function 
floorOp_i64 was not found in the library` + // See https://github.com/pytorch/pytorch/issues/84995 + bool isFloatInput = ([inputTensor dataType] & MPSDataTypeFloatBit) != 0; + if (!isFloatInput) { + return inputTensor; + } + MPSGraphTensor* zeroTensor = [mpsGraph constantWithScalar:0.0 dataType:inputTensor.dataType]; MPSGraphTensor* predicateTensor = [mpsGraph lessThanWithPrimaryTensor:inputTensor @@ -80,6 +93,51 @@ void unary_op(const Tensor& self, const Tensor& output, std::string op_name, Una { return mps::trunc_tensor(mpsGraph, inputTensor); }); } +TORCH_IMPL_FUNC(signbit_out_mps) (const Tensor& self, const Tensor& output) { + mps::unary_op(self, output, "signbit_out_mps", + ^ MPSGraphTensor* (MPSGraph* mpsGraph, MPSGraphTensor* inputTensor) { + MPSGraphTensor* output; + // signbit is not implemented for int64 type. + // workaround for `Function signbitOp_i64 was not found in the library` + if ([inputTensor dataType] == MPSDataTypeInt64) { + MPSGraphTensor* zeroTensor = [mpsGraph constantWithScalar:0.0 dataType:inputTensor.dataType]; + output = [mpsGraph lessThanWithPrimaryTensor:inputTensor + secondaryTensor:zeroTensor + name:nil]; + } else { + output = [mpsGraph signbitWithTensor: inputTensor name: nil]; + } + return mps::castMPSTensor(mpsGraph, output, ScalarType::Bool); + }); +} + +TORCH_IMPL_FUNC(sign_out_mps) (const Tensor& self, const Tensor& output) { + mps::unary_op(self, output, "sign_out_mps", + ^ MPSGraphTensor* (MPSGraph* mpsGraph, MPSGraphTensor* inputTensor) { + // Sign op is not implemented in MPS as of MacOS13.0 beta, so simulate it using clamp + if ([inputTensor dataType] == MPSDataTypeInt64) { + return [mpsGraph clampWithTensor:inputTensor + minValueTensor:[mpsGraph constantWithScalar:-1 dataType:MPSDataTypeInt64] + maxValueTensor:[mpsGraph constantWithScalar:1 dataType:MPSDataTypeInt64] + name: nil]; + } + return [mpsGraph signWithTensor: inputTensor name: nil]; + }); +} + +#define CREATE_MPS_STRUCTURED_UNARY_ROUNDING_TORCH_IMPL_FUNC(func_out, func_stub) \ +TORCH_IMPL_FUNC(func_out) (const Tensor& self, const Tensor& output) { \ + mps::unary_op(self, output, #func_out, \ + ^ MPSGraphTensor* (MPSGraph* mpsGraph, MPSGraphTensor* inputTensor) \ + { return [mpsGraph func_stub##WithTensor:inputTensor name:nil]; }, \ + [](const Tensor& t) -> bool { \ + return t.numel() == 0 || isIntegralType(t.scalar_type()); \ + }); \ +} +CREATE_MPS_STRUCTURED_UNARY_ROUNDING_TORCH_IMPL_FUNC(ceil_out_mps, ceil) +CREATE_MPS_STRUCTURED_UNARY_ROUNDING_TORCH_IMPL_FUNC(floor_out_mps, floor) +CREATE_MPS_STRUCTURED_UNARY_ROUNDING_TORCH_IMPL_FUNC(round_out_mps, round) + #define CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(func_out, func_stub) \ TORCH_IMPL_FUNC(func_out) (const Tensor& self, const Tensor& output) { \ mps::unary_op(self, output, #func_out, \ @@ -101,14 +159,10 @@ void unary_op(const Tensor& self, const Tensor& output, std::string op_name, Una CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(reciprocal_out_mps, reciprocal) CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(sqrt_out_mps, squareRoot) CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(rsqrt_out_mps, reverseSquareRoot) -CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(sign_out_mps, sign) CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(neg_out_mps, negative) CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(log_out_mps, logarithm) CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(log10_out_mps, logarithmBase10) CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(log2_out_mps, logarithmBase2) -CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(ceil_out_mps, ceil) 
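Similarly, a minimal sketch of the behavior targeted by the UnaryOps.mm changes above (sign/signbit emulation for int64 inputs and the new no-op path for rounding ops on integral inputs); this again assumes an MPS-enabled macOS build of this branch, and the commented results are expected values rather than captured output:

import torch

i = torch.tensor([-2, 0, 3], dtype=torch.int64, device="mps")
torch.sign(i)     # int64 path is emulated with clamp(-1, 1): tensor([-1, 0, 1])
torch.signbit(i)  # int64 path is emulated with x < 0:        tensor([True, False, False])

f = torch.tensor([-1.5, 0.25, 2.0], device="mps")
torch.floor(f)    # floating inputs go through the regular MPSGraph floor
torch.floor(i)    # integral inputs hit the new is_noop predicate and are simply copied to the output
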
-CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(floor_out_mps, floor) -CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(round_out_mps, round) CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(erf_out_mps, erf) CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(sin_out_mps, sin) CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(cos_out_mps, cos) @@ -144,7 +198,7 @@ void unary_op(const Tensor& self, const Tensor& output, std::string op_name, Una auto cachedGraph = cache_->LookUpAs(key); if(!cachedGraph) { - MPSCachedGraph *tmpCachedGraph = cache_->CreateCachedGraph(key, ^ MPSCachedGraph* () { + cachedGraph = cache_->CreateCachedGraphAs(key, ^ MPSCachedGraph* () { MPSUnaryCachedGraph *newCachedGraph = nil; @autoreleasepool { MPSGraph* mpsGraph = make_mps_graph(); @@ -161,7 +215,6 @@ void unary_op(const Tensor& self, const Tensor& output, std::string op_name, Una } return newCachedGraph; }); - cachedGraph = tmpCachedGraph->as(); } Placeholder selfPlaceholder = Placeholder(cachedGraph->inputTensor_, self); @@ -176,5 +229,70 @@ void unary_op(const Tensor& self, const Tensor& output, std::string op_name, Una } } +TORCH_IMPL_FUNC(frac_out_mps) (const Tensor& self, const Tensor& output) { + TORCH_CHECK(isFloatingType(self.scalar_type()), "frac_out_mps is only implemented for floating types"); + mps::unary_op(self, output, "frac_out_mps", + ^ MPSGraphTensor* (MPSGraph* mpsGraph, MPSGraphTensor* inputTensor) { + auto zeroTensor = [mpsGraph constantWithScalar:0.0 + dataType:inputTensor.dataType]; + auto predicateTensor = [mpsGraph lessThanWithPrimaryTensor:inputTensor + secondaryTensor:zeroTensor + name:nil]; + auto truncTensor = [mpsGraph selectWithPredicateTensor:predicateTensor + truePredicateTensor:[mpsGraph ceilWithTensor :inputTensor name:nil] + falsePredicateTensor:[mpsGraph floorWithTensor:inputTensor name:nil] + name:nil]; + return [mpsGraph subtractionWithPrimaryTensor:inputTensor + secondaryTensor:truncTensor + name: nil]; + }); +} + +TORCH_IMPL_FUNC(expm1_out_mps) (const Tensor& self, const Tensor& output) { + mps::unary_op(self, output, "expm1_out_mps", + ^ MPSGraphTensor* (MPSGraph* mpsGraph, MPSGraphTensor* inputTensor) { + MPSGraphTensor* oneTensor = [mpsGraph constantWithScalar:1.0 + shape:@[@1] + dataType:inputTensor.dataType]; + MPSGraphTensor* ePowTensor = [mpsGraph exponentWithTensor:inputTensor + name:nil]; + return [mpsGraph subtractionWithPrimaryTensor:ePowTensor + secondaryTensor:oneTensor + name: nil]; + }); +} + + + +TORCH_IMPL_FUNC(cumsum_out_mps) +(const Tensor& self, + int64_t dim, + c10::optional dtype, + const Tensor& result) { + TORCH_CHECK(dim >=0 && dim < std::max(1LL, self.ndimension()), "Expected dim to be between 0 and ", self.ndimension(), " but got ", dim); + if (!is_macos_13_or_newer()) { + TORCH_WARN_ONCE("torch.cumsum supported by MPS on MacOS 13+, please upgrade"); + auto cpu_result = self.to(at::Device(kCPU)).cumsum(dim, dtype); + at::_copy_from_and_resize(cpu_result, result); + return; + } + auto input = dtype.has_value() ? 
self.to(dtype.value()) : self; + mps::unary_op(input, result, "cumsum_out_mp" + std::to_string(dim), + ^ MPSGraphTensor* (MPSGraph* mpsGraph, MPSGraphTensor* inputTensor) { + // cumsum is horribly broken for int8, int16 and as chances for overflow is pretty high, cast to int32 + if (isIntegralType(input.scalar_type()) && input.scalar_type() !=ScalarType::Int) { + inputTensor = mps::castMPSTensor(mpsGraph, inputTensor, result.scalar_type()); + } + auto rc = [mpsGraph cumulativeSumWithTensor: inputTensor + axis: dim + name: nil]; + if (result.scalar_type()!= input.scalar_type() || + (isIntegralType(input.scalar_type()) && input.scalar_type() !=ScalarType::Int)) { + return mps::castMPSTensor(mpsGraph, rc, result.scalar_type()); + } + return rc; + }); +} + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/mps/operations/View.mm b/aten/src/ATen/native/mps/operations/View.mm index a8a55b21d246..0e35c7b2f642 100644 --- a/aten/src/ATen/native/mps/operations/View.mm +++ b/aten/src/ATen/native/mps/operations/View.mm @@ -2,19 +2,9 @@ #include #include +#include namespace at { - -// these are from MPSAllocator -namespace mps { - // to check the requested non-aligned size of an MTL buffer - ssize_t get_requested_buffer_size(void* ptr); - // to retrieve the shape of a base tensor from a view tensor - IntArrayRef get_buffer_shape(void* ptr); - // to set the shape of a base tensor from a view tensor - void set_buffer_shape(void* ptr, const IntArrayRef& shape); -} - namespace native { namespace mps { @@ -62,9 +52,13 @@ shape: getMPSShape(src.numel()) dataType: inputType] autorelease]; } - feeds[cachedGraph->storageOffsetTensor] = getMPSGraphTensorFromScalar(stream, Scalar(storage_offset), MPSDataTypeInt32); + MPSScalar storageOffsetScalar = getMPSScalar(storage_offset, ScalarType::Int); + feeds[cachedGraph->storageOffsetTensor] = getMPSGraphTensorFromScalar(stream, storageOffsetScalar); + + std::vector strideScalars(sizes.size()); for (int i = 0; i < sizes.size(); i++) { - feeds[cachedGraph->strideTensors[i]] = getMPSGraphTensorFromScalar(stream, Scalar(strides[i]), MPSDataTypeInt32); + strideScalars[i] = getMPSScalar(strides[i], ScalarType::Int); + feeds[cachedGraph->strideTensors[i]] = getMPSGraphTensorFromScalar(stream, strideScalars[i]); } // Workaround for MPSShaderLibrary bug // TODO: Remove once https://github.com/pytorch/pytorch/issues/82305 is resolved @@ -79,7 +73,7 @@ cachedGraph->outputTensor : outputTensorData }; stream->executeMPSGraph(cachedGraph->graph(), feeds, results, - requires_sync ? SyncType::COMMIT : SyncType::NONE); + requires_sync ? SyncType::COMMIT : SyncType::COMMIT_ADAPTIVE); } return output; } @@ -144,7 +138,7 @@ withShape: @[@-1] name: nil]; if (needsScatter) { - MPSGraphTensor* scatteredTensor = [mpsGraph scatterAlongAxis: 0 + MPSGraphTensor* scatteredTensor = [mpsGraph scatterAlongAxis: (NSInteger) 0 withDataTensor: reshapedInputTensor updatesTensor: cachedGraph->updatesTensor indicesTensor: reshapedIndicesTensor @@ -201,7 +195,9 @@ // IntArrayRef wouldn't own the data, so we use a static storage static const int64_t shape_1d = 1; // self.sizes().size() could be zero - base_shape = self.sizes().size() ? self.sizes() : IntArrayRef(&shape_1d, 1); + base_shape = self.sizes().size() ? self.sizes() : + self.is_view() ? 
self._base().sizes() : IntArrayRef(&shape_1d, 1); + // base_shape will be retained in MPSAllocator until buffer gets recycled if (self.storage().data()) set_buffer_shape(self.storage().data(), base_shape); @@ -232,7 +228,7 @@ newCachedGraph->strideTensors.push_back(mpsGraphRankedPlaceHolder(mpsGraph, MPSDataTypeInt32, @[@1])); } if (needsScatter) { - newCachedGraph->updatesTensor = mpsGraphUnrankedPlaceHolder(mpsGraph, getMPSDataType(self.scalar_type())); + newCachedGraph->updatesTensor = mpsGraphUnrankedPlaceHolder(mpsGraph, inputType); } newCachedGraph->outputTensor = chainViewOperation(newCachedGraph, size, stride, storage_offset, base_shape, needsScatter, needsBoolCast); } @@ -278,7 +274,7 @@ Tensor gatherViewTensor(const at::Tensor& src, at::Tensor& dst) } // namespace mps // implementation of as_strided() op -Tensor as_strided_tensorimpl_mps(const Tensor& self, IntArrayRef size, IntArrayRef stride, optional storage_offset_) +Tensor as_strided_tensorimpl_mps(const Tensor& self, IntArrayRef size, IntArrayRef stride, c10::optional storage_offset_) { auto storage_offset = storage_offset_.value_or(self.storage_offset()); auto result = detail::make_tensor(c10::TensorImpl::VIEW, Storage(self.storage()), self.key_set(), self.dtype()); diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml index 7c43a7b25eef..e8d2b884c6d2 100644 --- a/aten/src/ATen/native/native_functions.yaml +++ b/aten/src/ATen/native/native_functions.yaml @@ -170,6 +170,9 @@ CPU: _assert_async_cpu CUDA: _assert_async_cuda + +- func: _assert_tensor_metadata(Tensor a, int[]? size=None, int[]? stride=None, ScalarType? dtype=None) -> () + - func: refine_names(Tensor(a) self, Dimname[] names) -> Tensor(a) variants: method @@ -178,20 +181,30 @@ dispatch: CUDA: _use_cudnn_ctc_loss +- func: _use_cudnn_ctc_loss.Tensor(Tensor log_probs, Tensor targets, Tensor input_lengths, Tensor target_lengths, int blank) -> bool + device_check: NoCheck # Tensor arguments allowed to be on different devices, see also _cudnn_ctc_loss + dispatch: + CUDA: _use_cudnn_ctc_loss_tensor + - func: _cudnn_ctc_loss(Tensor log_probs, Tensor targets, int[] input_lengths, int[] target_lengths, int blank, bool deterministic, bool zero_infinity) -> (Tensor, Tensor) device_check: NoCheck # log_probs is expected to be on CUDA while targets is expected to be on CPU dispatch: CUDA: _cudnn_ctc_loss autogen: _cudnn_ctc_loss.out +- func: _cudnn_ctc_loss.Tensor(Tensor log_probs, Tensor targets, Tensor input_lengths, Tensor target_lengths, int blank, bool deterministic, bool zero_infinity) -> (Tensor, Tensor) + device_check: NoCheck # log_probs is expected to be on CUDA while targets is expected to be on CPU + dispatch: + CUDA: _cudnn_ctc_loss_tensor + - func: _use_cudnn_rnn_flatten_weight() -> bool -- func: _cudnn_rnn_flatten_weight(Tensor[] weight_arr, int weight_stride0, int input_size, int mode, int hidden_size, int proj_size, int num_layers, bool batch_first, bool bidirectional) -> Tensor +- func: _cudnn_rnn_flatten_weight(Tensor[] weight_arr, int weight_stride0, SymInt input_size, int mode, SymInt hidden_size, SymInt proj_size, int num_layers, bool batch_first, bool bidirectional) -> Tensor dispatch: CUDA: _cudnn_rnn_flatten_weight autogen: _cudnn_rnn_flatten_weight.out -- func: _cudnn_rnn(Tensor input, Tensor[] weight, int weight_stride0, Tensor? weight_buf, Tensor hx, Tensor? 
cx, int mode, int hidden_size, int proj_size, int num_layers, bool batch_first, float dropout, bool train, bool bidirectional, int[] batch_sizes, Tensor? dropout_state) -> (Tensor, Tensor, Tensor, Tensor, Tensor) +- func: _cudnn_rnn(Tensor input, Tensor[] weight, int weight_stride0, Tensor? weight_buf, Tensor hx, Tensor? cx, int mode, SymInt hidden_size, SymInt proj_size, int num_layers, bool batch_first, float dropout, bool train, bool bidirectional, SymInt[] batch_sizes, Tensor? dropout_state) -> (Tensor, Tensor, Tensor, Tensor, Tensor) # rnn_tanh may or may not redispatch to _cudnn_rnn based on algorithm and build. Thus it might hit dispatch or kernel device check. # Disable dispatch time device check for consistent behavior. device_check: NoCheck @@ -199,7 +212,7 @@ CUDA: _cudnn_rnn autogen: _cudnn_rnn.out -- func: _cudnn_rnn_backward(Tensor input, Tensor[] weight, int weight_stride0, Tensor weight_buf, Tensor hx, Tensor? cx, Tensor output, Tensor? grad_output, Tensor? grad_hy, Tensor? grad_cy, int mode, int hidden_size, int proj_size, int num_layers, bool batch_first, float dropout, bool train, bool bidirectional, int[] batch_sizes, Tensor? dropout_state, Tensor reserve, bool[4] output_mask) -> (Tensor, Tensor, Tensor, Tensor[]) +- func: _cudnn_rnn_backward(Tensor input, Tensor[] weight, int weight_stride0, Tensor weight_buf, Tensor hx, Tensor? cx, Tensor output, Tensor? grad_output, Tensor? grad_hy, Tensor? grad_cy, int mode, SymInt hidden_size, SymInt proj_size, int num_layers, bool batch_first, float dropout, bool train, bool bidirectional, SymInt[] batch_sizes, Tensor? dropout_state, Tensor reserve, bool[4] output_mask) -> (Tensor, Tensor, Tensor, Tensor[]) dispatch: CUDA: _cudnn_rnn_backward autogen: _cudnn_rnn_backward.out @@ -230,12 +243,13 @@ dispatch: CPU: native_dropout_cpu CUDA: native_dropout_cuda - tags: nondeterministic_seeded + NestedTensorCPU, NestedTensorCUDA: native_dropout_nested + tags: nondeterministic_seeded, canonical autogen: native_dropout.out - func: native_dropout_backward(Tensor grad_output, Tensor mask, float scale) -> Tensor dispatch: - CPU: native_dropout_backward_cpu + CPU, NestedTensorCPU, NestedTensorCUDA: native_dropout_backward CUDA: native_dropout_backward_cuda autogen: native_dropout_backward.out @@ -252,27 +266,28 @@ - func: _shape_as_tensor(Tensor self) -> Tensor - func: dropout(Tensor input, float p, bool train) -> Tensor - dispatch: - CompositeImplicitAutograd: dropout - NestedTensorCPU, NestedTensorCUDA: dropout_nested tags: nondeterministic_seeded - func: dropout_(Tensor(a!) self, float p, bool train) -> Tensor(a!) - dispatch: - CompositeImplicitAutograd: dropout_ - NestedTensorCPU, NestedTensorCUDA: dropout_nested_ + tags: nondeterministic_seeded - func: feature_dropout(Tensor input, float p, bool train) -> Tensor + tags: nondeterministic_seeded - func: feature_dropout_(Tensor(a!) self, float p, bool train) -> Tensor(a!) + tags: nondeterministic_seeded - func: alpha_dropout(Tensor input, float p, bool train) -> Tensor + tags: nondeterministic_seeded - func: alpha_dropout_(Tensor(a!) self, float p, bool train) -> Tensor(a!) + tags: nondeterministic_seeded - func: feature_alpha_dropout(Tensor input, float p, bool train) -> Tensor + tags: nondeterministic_seeded - func: feature_alpha_dropout_(Tensor(a!) self, float p, bool train) -> Tensor(a!) 
+ tags: nondeterministic_seeded - func: abs(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -281,6 +296,7 @@ CompositeExplicitAutograd: abs SparseCPU, SparseCUDA: abs_sparse SparseCsrCPU, SparseCsrCUDA: abs_sparse_csr + tags: canonical - func: abs_(Tensor(a!) self) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -475,6 +491,7 @@ MkldnnCPU: mkldnn_add ZeroTensor: add_zerotensor NestedTensorCPU, NestedTensorCUDA: NestedTensor_add_Tensor + tags: canonical - func: add_.Tensor(Tensor(a!) self, Tensor other, *, Scalar alpha=1) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -533,6 +550,7 @@ variants: function, method dispatch: CompositeExplicitAutograd: add + tags: canonical - func: add_.Scalar(Tensor(a!) self, Scalar other, Scalar alpha=1) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -650,6 +668,7 @@ dispatch: CompositeExplicitAutograd: arange cpp_no_default_args: ['step'] + tags: canonical - func: arange.out(Scalar end, *, Tensor(a!) out) -> Tensor(a!) dispatch: @@ -779,23 +798,24 @@ - func: arctanh.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) -- func: as_strided(Tensor(a) self, int[] size, int[] stride, int? storage_offset=None) -> Tensor(a) +- func: as_strided(Tensor(a) self, SymInt[] size, SymInt[] stride, SymInt? storage_offset=None) -> Tensor(a) variants: function, method dispatch: - ZeroTensor, CPU, CUDA, Meta: as_strided_tensorimpl + ZeroTensor, CPU, CUDA: as_strided_tensorimpl + Meta: as_strided_tensorimpl_meta_symint MPS: as_strided_tensorimpl_mps QuantizedCPU, QuantizedCUDA: as_strided_qtensorimpl device_check: NoCheck device_guard: False -- func: as_strided_(Tensor(a!) self, int[] size, int[] stride, int? storage_offset=None) -> Tensor(a!) +- func: as_strided_(Tensor(a!) self, SymInt[] size, SymInt[] stride, SymInt? storage_offset=None) -> Tensor(a!) use_const_ref_for_mutable_tensors: True variants: function, method device_check: NoCheck device_guard: False tags: inplace_view dispatch: - CompositeExplicitAutogradNonFunctional: as_strided_ + CompositeExplicitAutogradNonFunctional: as_strided__symint - func: asin(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -933,6 +953,7 @@ - func: bernoulli.out(Tensor self, *, Generator? generator=None, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: function + tags: nondeterministic_seeded dispatch: CPU, CUDA: bernoulli_out MPS: bernoulli_out_mps @@ -940,6 +961,7 @@ - func: bernoulli_.Tensor(Tensor(a!) self, Tensor p, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: method + tags: nondeterministic_seeded dispatch: CPU, CUDA: bernoulli_ MPS: bernoulli_mps_ @@ -948,6 +970,7 @@ - func: bernoulli_.float(Tensor(a!) self, float p=0.5, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: method + tags: nondeterministic_seeded dispatch: CPU, CUDA: bernoulli_ MPS: bernoulli_mps_ @@ -962,6 +985,8 @@ device_check: NoCheck # TensorIterator variants: function, method tags: nondeterministic_seeded + dispatch: + CompositeExplicitAutogradNonFunctional: bernoulli - func: bilinear(Tensor input1, Tensor input2, Tensor weight, Tensor? bias=None) -> Tensor @@ -1018,6 +1043,7 @@ device_check: NoCheck # TensorIterator structured_delegate: bitwise_not.out variants: function, method + tags: canonical - func: bitwise_not_(Tensor(a!) self) -> Tensor(a!) 
device_check: NoCheck # TensorIterator @@ -1150,7 +1176,9 @@ dispatch: SparseCPU: bmm_sparse_cpu SparseCUDA: bmm_sparse_cuda - NestedTensorCPU, NestedTensorCUDA: bmm_nested + NestedTensorCPU: bmm_nested + NestedTensorCUDA: bmm_nested_cuda + tags: canonical - func: bmm.out(Tensor self, Tensor mat2, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -1167,8 +1195,10 @@ device_check: NoCheck device_guard: False -- func: broadcast_to(Tensor(a) self, int[] size) -> Tensor(a) +- func: broadcast_to(Tensor(a) self, SymInt[] size) -> Tensor(a) variants: function, method + dispatch: + CompositeImplicitAutograd: broadcast_to_symint - func: _sparse_broadcast_to(Tensor(a) self, int[] size) -> Tensor(a) variants: function @@ -1180,6 +1210,7 @@ dispatch: SparseCPU, SparseCUDA: cat_sparse QuantizedCPU: cat_quantized_cpu + tags: canonical - func: cat.out(Tensor[] tensors, int dim=0, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -1204,6 +1235,15 @@ - func: concat.names_out(Tensor[] tensors, Dimname dim, *, Tensor(a!) out) -> Tensor(a!) +# alias for torch.cat +- func: concatenate(Tensor[] tensors, int dim=0) -> Tensor + +- func: concatenate.out(Tensor[] tensors, int dim=0, *, Tensor(a!) out) -> Tensor(a!) + +- func: concatenate.names(Tensor[] tensors, Dimname dim) -> Tensor + +- func: concatenate.names_out(Tensor[] tensors, Dimname dim, *, Tensor(a!) out) -> Tensor(a!) + - func: block_diag(Tensor[] tensors) -> Tensor variants: function dispatch: @@ -1252,12 +1292,19 @@ variants: function, method device_check: NoCheck device_guard: False + dispatch: + CompositeImplicitAutograd: chunk + NestedTensorCPU, NestedTensorCUDA: chunk_nested_tensor -- func: tensor_split.sections(Tensor(a -> *) self, int sections, int dim=0) -> Tensor(a)[] +- func: tensor_split.sections(Tensor(a -> *) self, SymInt sections, int dim=0) -> Tensor(a)[] variants: function, method + dispatch: + CompositeImplicitAutograd: tensor_split_sections_symint -- func: tensor_split.indices(Tensor(a -> *) self, int[] indices, int dim=0) -> Tensor(a)[] +- func: tensor_split.indices(Tensor(a -> *) self, SymInt[] indices, int dim=0) -> Tensor(a)[] variants: function, method + dispatch: + CompositeImplicitAutograd: tensor_split_indices_symint - func: tensor_split.tensor_indices_or_sections(Tensor(a -> *) self, Tensor tensor_indices_or_sections, int dim=0) -> Tensor(a)[] variants: function, method @@ -1269,6 +1316,7 @@ structured_delegate: clamp.out dispatch: QuantizedCPU: clamp_quantized_cpu + tags: canonical - func: clamp.Tensor(Tensor self, Tensor? min=None, Tensor? max=None) -> Tensor variants: function, method @@ -1411,26 +1459,29 @@ dispatch: CPU, CUDA: polar_out -- func: constant_pad_nd(Tensor self, int[] pad, Scalar value=0) -> Tensor +- func: constant_pad_nd(Tensor self, SymInt[] pad, Scalar value=0) -> Tensor variants: function dispatch: CompositeExplicitAutograd: constant_pad_nd MPS: constant_pad_nd_mps autogen: constant_pad_nd.out + tags: canonical - func: contiguous(Tensor(a) self, *, MemoryFormat memory_format=contiguous_format) -> Tensor(a) variants: method manual_cpp_binding: True -- func: convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups) -> Tensor +- func: convolution(Tensor input, Tensor weight, Tensor? 
bias, int[] stride, SymInt[] padding, int[] dilation, bool transposed, SymInt[] output_padding, int groups) -> Tensor dispatch: CompositeExplicitAutograd: convolution autogen: convolution.out + tags: canonical -- func: convolution_backward(Tensor grad_output, Tensor input, Tensor weight, int[]? bias_sizes, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool[3] output_mask) -> (Tensor, Tensor, Tensor) +- func: convolution_backward(Tensor grad_output, Tensor input, Tensor weight, SymInt[]? bias_sizes, int[] stride, SymInt[] padding, int[] dilation, bool transposed, SymInt[] output_padding, int groups, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: CompositeExplicitAutograd, CUDA: convolution_backward autogen: convolution_backward.out + tags: canonical - func: convolution_overrideable(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups) -> Tensor dispatch: @@ -1442,7 +1493,7 @@ CompositeExplicitAutograd: convolution_backward_overrideable autogen: convolution_backward_overrideable.out -- func: _convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool benchmark, bool deterministic, bool cudnn_enabled, bool allow_tf32) -> Tensor +- func: _convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, SymInt[] padding, int[] dilation, bool transposed, SymInt[] output_padding, int groups, bool benchmark, bool deterministic, bool cudnn_enabled, bool allow_tf32) -> Tensor dispatch: CompositeExplicitAutograd: _convolution autogen: _convolution.out @@ -1451,7 +1502,7 @@ - func: _convolution_mode(Tensor input, Tensor weight, Tensor? bias, int[] stride, str padding, int[] dilation, int groups) -> Tensor -- func: _convolution_double_backward(Tensor? ggI, Tensor? ggW, Tensor? ggb, Tensor gO, Tensor weight, Tensor self, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool[3] output_mask) -> (Tensor, Tensor, Tensor) +- func: _convolution_double_backward(Tensor? ggI, Tensor? ggW, Tensor? ggb, Tensor gO, Tensor weight, Tensor self, int[] stride, SymInt[] padding, int[] dilation, bool transposed, SymInt[] output_padding, int groups, bool[3] output_mask) -> (Tensor, Tensor, Tensor) - func: conv1d(Tensor input, Tensor weight, Tensor? bias=None, int[1] stride=1, int[1] padding=0, int[1] dilation=1, int groups=1) -> Tensor @@ -1484,6 +1535,8 @@ - func: copy(Tensor self, Tensor src, bool non_blocking=False) -> Tensor variants: function + dispatch: + CompositeExplicitAutogradNonFunctional: copy - func: copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!) variants: method @@ -1494,6 +1547,7 @@ SparseCPU, SparseCUDA: copy_sparse_wrapper_ CompositeExplicitAutograd: copy_ SparseCsrCPU, SparseCsrCUDA: copy_sparse_compressed_ + NestedTensorCPU, NestedTensorCUDA: copy_nested_ autogen: copy.out - func: _copy_from(Tensor self, Tensor dst, bool non_blocking=False) -> Tensor @@ -1726,6 +1780,7 @@ device_check: NoCheck # TensorIterator dispatch: CPU, CUDA: cumsum_out + MPS: cumsum_out_mps - func: cumsum.dimname(Tensor self, Dimname dim, *, ScalarType? 
dtype=None) -> Tensor device_check: NoCheck # TensorIterator @@ -1751,6 +1806,13 @@ CPU: ctc_loss_cpu CUDA: ctc_loss_gpu autogen: _ctc_loss.out + tags: dynamic_output_shape # the shape of second output is data dependent + +- func: _ctc_loss.Tensor(Tensor log_probs, Tensor targets, Tensor input_lengths, Tensor target_lengths, int blank=0, bool zero_infinity=False) -> (Tensor, Tensor) + dispatch: + CPU, CUDA: ctc_loss_tensor + autogen: _ctc_loss.Tensor_out + tags: dynamic_output_shape # the shape of second output is data dependent - func: _ctc_loss_backward(Tensor grad, Tensor log_probs, Tensor targets, int[] input_lengths, int[] target_lengths, Tensor neg_log_likelihood, Tensor log_alpha, int blank, bool zero_infinity=False) -> Tensor dispatch: @@ -1758,10 +1820,14 @@ CUDA: ctc_loss_backward_gpu autogen: _ctc_loss_backward.out +- func: _ctc_loss_backward.Tensor(Tensor grad, Tensor log_probs, Tensor targets, Tensor input_lengths, Tensor target_lengths, Tensor neg_log_likelihood, Tensor log_alpha, int blank, bool zero_infinity=False) -> Tensor + dispatch: + CPU, CUDA: ctc_loss_backward_tensor + - func: diag_embed(Tensor self, int offset=0, int dim1=-2, int dim2=-1) -> Tensor variants: function, method dispatch: - CompositeExplicitAutograd: diag_embed + CompositeExplicitAutogradNonFunctional: diag_embed autogen: diag_embed.out - func: diagflat(Tensor self, int offset=0) -> Tensor @@ -1779,12 +1845,12 @@ - func: diagonal.Dimname(Tensor(a) self, *, Dimname outdim, Dimname dim1, Dimname dim2, int offset=0) -> Tensor(a) variants: function, method -- func: diagonal_backward(Tensor grad_output, int[] input_sizes, int offset, int dim1, int dim2) -> Tensor +- func: diagonal_backward(Tensor grad_output, SymInt[] input_sizes, int offset, int dim1, int dim2) -> Tensor variants: function device_check: NoCheck device_guard: False dispatch: - CompositeExplicitAutograd: diagonal_backward + CompositeExplicitAutograd: diagonal_backward_symint autogen: diagonal_backward.out - func: fill_diagonal_(Tensor(a!) self, Scalar fill_value, bool wrap=False) -> Tensor(a!) @@ -1824,6 +1890,8 @@ dispatch: SparseCPU, SparseCUDA: div_sparse ZeroTensor: div_zerotensor + NestedTensorCPU, NestedTensorCUDA: NestedTensor_div_Tensor + tags: canonical - func: div_.Tensor(Tensor(a!) self, Tensor other) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -1870,6 +1938,8 @@ variants: function, method dispatch: CompositeExplicitAutograd: div + NestedTensorCPU, NestedTensorCUDA: NestedTensor_div_Scalar + tags: canonical - func: div_.Scalar(Tensor(a!) self, Scalar other) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -1959,22 +2029,25 @@ dispatch: CompositeExplicitAutograd: vdot_out -- func: einsum(str equation, Tensor[] tensors) -> Tensor +- func: einsum(str equation, Tensor[] tensors, *, int[]? 
path=None) -> Tensor -- func: embedding(Tensor weight, Tensor indices, int padding_idx=-1, bool scale_grad_by_freq=False, bool sparse=False) -> Tensor +- func: embedding(Tensor weight, Tensor indices, SymInt padding_idx=-1, bool scale_grad_by_freq=False, bool sparse=False) -> Tensor dispatch: - CompositeExplicitAutograd: embedding + CompositeExplicitAutograd: embedding_symint NestedTensorCPU, NestedTensorCUDA: NestedTensor_embedding autogen: embedding.out -- func: embedding_backward(Tensor grad, Tensor indices, int num_weights, int padding_idx, bool scale_grad_by_freq, bool sparse) -> Tensor +- func: embedding_backward(Tensor grad, Tensor indices, SymInt num_weights, SymInt padding_idx, bool scale_grad_by_freq, bool sparse) -> Tensor + dispatch: + CompositeImplicitAutograd: embedding_backward_symint -- func: embedding_dense_backward(Tensor grad_output, Tensor indices, int num_weights, int padding_idx, bool scale_grad_by_freq) -> Tensor +- func: embedding_dense_backward(Tensor grad_output, Tensor indices, SymInt num_weights, SymInt padding_idx, bool scale_grad_by_freq) -> Tensor dispatch: CPU: embedding_dense_backward_cpu CUDA: embedding_dense_backward_cuda MPS: embedding_dense_backward_mps autogen: embedding_dense_backward.out + tags: canonical - func: embedding_renorm_(Tensor(a!) self, Tensor indices, float max_norm, float norm_type) -> Tensor(a!) dispatch: @@ -2021,11 +2094,15 @@ CUDA: _embedding_bag_cuda autogen: _embedding_bag.out -- func: _embedding_bag_backward(Tensor grad, Tensor indices, Tensor offsets, Tensor offset2bag, Tensor bag_size, Tensor maximum_indices, int num_weights, bool scale_grad_by_freq, int mode, bool sparse, Tensor? per_sample_weights, int padding_idx=-1) -> Tensor +- func: _embedding_bag_backward(Tensor grad, Tensor indices, Tensor offsets, Tensor offset2bag, Tensor bag_size, Tensor maximum_indices, SymInt num_weights, bool scale_grad_by_freq, int mode, bool sparse, Tensor? per_sample_weights, int padding_idx=-1) -> Tensor + dispatch: + CompositeImplicitAutograd: _embedding_bag_backward_symint -- func: _embedding_bag_sparse_backward(Tensor grad, Tensor indices, Tensor offsets, Tensor offset2bag, Tensor bag_size, int num_weights, bool scale_grad_by_freq, int mode, Tensor? per_sample_weights, int padding_idx=-1) -> Tensor +- func: _embedding_bag_sparse_backward(Tensor grad, Tensor indices, Tensor offsets, Tensor offset2bag, Tensor bag_size, SymInt num_weights, bool scale_grad_by_freq, int mode, Tensor? per_sample_weights, int padding_idx=-1) -> Tensor + dispatch: + CompositeImplicitAutograd: _embedding_bag_sparse_backward_symint -- func: _embedding_bag_dense_backward(Tensor grad, Tensor indices, Tensor offset2bag, Tensor bag_size, Tensor maximum_indices, int num_weights, bool scale_grad_by_freq, int mode, Tensor? per_sample_weights, int padding_idx=-1) -> Tensor +- func: _embedding_bag_dense_backward(Tensor grad, Tensor indices, Tensor offset2bag, Tensor bag_size, Tensor maximum_indices, SymInt num_weights, bool scale_grad_by_freq, int mode, Tensor? per_sample_weights, int padding_idx=-1) -> Tensor dispatch: CPU: _embedding_bag_dense_backward_cpu CUDA: _embedding_bag_dense_backward_cuda @@ -2041,60 +2118,35 @@ device_check: NoCheck device_guard: False dispatch: - CompositeExplicitAutograd: empty + CompositeExplicitAutograd: empty_names autogen: empty.names_out -- func: empty.memory_format(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? 
memory_format=None) -> Tensor +- func: empty.memory_format(SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor dispatch: CPU: empty_cpu CUDA: empty_cuda MPS: empty_mps - Meta: empty_meta + Meta: empty_meta_symint MkldnnCPU: empty_mkldnn SparseCPU, SparseCUDA, SparseMeta: empty_sparse SparseCsrCPU, SparseCsrCUDA: empty_sparse_compressed QuantizedCPU, QuantizedCUDA, QuantizedMeta: empty_unknown_quantized -# all calls to empty() in python used to go through the symint overload -# even if all arguments were concerete integers. -# adding symint overloads of kernels for every dispatch key allowed us -# to skip redispatching to `empty.memory_format` and hit backend kernels directly -# we recently updated signature parsing to dispath `empty()` calls in python -# to `empty.SymInt` iff there's is a symint node argument -# hopefully, we could simplify this entry soon -- func: empty.SymInt(SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor - dispatch: - CPU: empty_symint_cpu - CUDA: empty_symint_cuda - MPS: empty_symint_mps - Meta: empty_symint_meta - MkldnnCPU: empty_symint_mkldnn - SparseCPU, SparseCUDA, SparseMeta: empty_symint_sparse - SparseCsrCPU, SparseCsrCUDA: empty_symint_sparse_compressed - QuantizedCPU, QuantizedCUDA: empty_symint_unknown_quantized - autogen: empty.SymInt_out - # We do not make new_empty a composite that calls into new_empty_strided, as the strided version # is significantly more difficult to implement by different backends -- func: new_empty(Tensor self, int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor - variants: method - dispatch: - CompositeExplicitAutograd: new_empty - autogen: new_empty.out - -- func: new_empty.SymInt(Tensor self, SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: new_empty(Tensor self, SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: method dispatch: CompositeExplicitAutograd: new_empty_symint - autogen: new_empty.SymInt_out + autogen: new_empty.out -- func: new_empty_strided(Tensor self, int[] size, int[] stride, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: new_empty_strided(Tensor self, SymInt[] size, SymInt[] stride, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: method dispatch: - CompositeExplicitAutogradNonFunctional: new_empty_strided + CompositeExplicitAutogradNonFunctional: new_empty_strided_symint autogen: new_empty_strided.out -- func: new_full(Tensor self, int[] size, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: new_full(Tensor self, SymInt[] size, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: method dispatch: # NB: Although this composite mutates on the inside, it is @@ -2102,7 +2154,7 @@ CompositeExplicitAutograd: new_full autogen: new_full.out -- func: new_zeros(Tensor self, int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor +- func: new_zeros(Tensor self, SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: method dispatch: # NB: Although this composite mutates on the inside, it is @@ -2110,7 +2162,7 @@ CompositeExplicitAutograd: new_zeros autogen: new_zeros.out -- func: new_ones(Tensor self, int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: new_ones(Tensor self, SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: method dispatch: # NB: Although this composite mutates on the inside, it is @@ -2134,7 +2186,7 @@ QuantizedCPU, QuantizedCUDA: empty_per_channel_affine_quantized autogen: _empty_per_channel_affine_quantized.out -- func: resize_(Tensor(a!) self, int[] size, *, MemoryFormat? memory_format=None) -> Tensor(a!) +- func: resize_(Tensor(a!) self, SymInt[] size, *, MemoryFormat? memory_format=None) -> Tensor(a!) use_const_ref_for_mutable_tensors: True variants: method device_check: NoCheck @@ -2165,7 +2217,7 @@ QuantizedCPU, QuantizedCUDA: empty_quantized autogen: empty_quantized.out -- func: empty.out(int[] size, *, MemoryFormat? memory_format=None, Tensor(a!) out) -> Tensor(a!) +- func: empty.out(SymInt[] size, *, MemoryFormat? memory_format=None, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck device_guard: False @@ -2177,14 +2229,15 @@ QuantizedCPU, QuantizedCUDA: empty_like_quantized SparseCPU, SparseCUDA, SparseMeta: empty_like_sparse_coo SparseCsrCPU, SparseCsrCUDA: empty_like_sparse_csr + NestedTensorCPU, NestedTensorCUDA: empty_like_nested autogen: empty_like.out -- func: empty_strided(int[] size, int[] stride, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: empty_strided(SymInt[] size, SymInt[] stride, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CPU: empty_strided_cpu CUDA: empty_strided_cuda MPS: empty_strided_mps - Meta: empty_strided_meta + Meta: empty_strided_meta_symint QuantizedCPU, QuantizedCUDA: empty_strided_unknown_quantized autogen: empty_strided.out @@ -2195,6 +2248,7 @@ dispatch: SparseCPU, SparseCUDA: erf_sparse SparseCsrCPU, SparseCsrCUDA: erf_sparse_csr + tags: canonical - func: erf_(Tensor(a!) self) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -2235,6 +2289,7 @@ device_check: NoCheck # TensorIterator structured_delegate: exp.out variants: function, method + tags: canonical - func: exp_(Tensor(a!) self) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -2286,22 +2341,17 @@ structured_inherits: TensorIteratorBase dispatch: CPU, CUDA: expm1_out + MPS: expm1_out_mps SparseCPU, SparseCUDA: expm1_sparse_out SparseCsrCPU, SparseCsrCUDA: expm1_sparse_csr_out -- func: expand.SymInt(Tensor(a) self, SymInt[] size, *, bool implicit=False) -> Tensor(a) - variants: method # This is method-only to match the previous tensor API. In the future we could make this a function too. - device_check: NoCheck - device_guard: False - dispatch: - CompositeExplicitAutograd: expand_symint - -- func: expand(Tensor(a) self, int[] size, *, bool implicit=False) -> Tensor(a) +- func: expand(Tensor(a) self, SymInt[] size, *, bool implicit=False) -> Tensor(a) variants: method # This is method-only to match the previous tensor API. In the future we could make this a function too. 
device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: expand + tags: canonical - func: expand_as(Tensor(a) self, Tensor other) -> Tensor(a) variants: method # This is method-only to match the previous tensor API. In the future we could make this a function too. @@ -2351,6 +2401,7 @@ variants: function dispatch: CompositeExplicitAutograd: fill + tags: canonical - func: fill.Tensor(Tensor self, Tensor value) -> Tensor variants: function @@ -2366,6 +2417,7 @@ QuantizedCPU, QuantizedCUDA: fill_quantized_ Meta: fill_meta_ SparseCsrCPU, SparseCsrCUDA: fill_sparse_csr_ + NestedTensorCPU, NestedTensorCUDA: fill_nested_ autogen: fill.Scalar_out - func: fill_.Tensor(Tensor(a!) self, Tensor value) -> Tensor(a!) @@ -2376,6 +2428,7 @@ MPS: fill_tensor_mps_ QuantizedCPU, QuantizedCUDA: fill_quantized_ Meta: fill_meta_ + NestedTensorCPU, NestedTensorCUDA: fill_nested_ autogen: fill.Tensor_out - func: floor(Tensor self) -> Tensor @@ -2436,11 +2489,17 @@ device_check: NoCheck # TensorIterator structured_delegate: frac.out variants: function, method + dispatch: + SparseCPU, SparseCUDA: frac_sparse + SparseCsrCPU, SparseCsrCUDA: frac_sparse_csr - func: frac_(Tensor(a!) self) -> Tensor(a!) device_check: NoCheck # TensorIterator structured_delegate: frac.out variants: function, method + dispatch: + SparseCPU, SparseCUDA: frac_sparse_ + SparseCsrCPU, SparseCsrCUDA: frac_sparse_csr_ - func: frac.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -2448,6 +2507,9 @@ structured_inherits: TensorIteratorBase dispatch: CPU, CUDA: frac_out + MPS: frac_out_mps + SparseCPU, SparseCUDA: frac_sparse_out + SparseCsrCPU, SparseCsrCUDA: frac_sparse_csr_out - func: full.names(int[] size, Scalar fill_value, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor device_check: NoCheck @@ -2456,11 +2518,11 @@ CompositeExplicitAutograd: full autogen: full.names_out -- func: full(int[] size, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: full(SymInt[] size, Scalar fill_value, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: full -- func: full.out(int[] size, Scalar fill_value, *, Tensor(a!) out) -> Tensor(a!) +- func: full.out(SymInt[] size, Scalar fill_value, *, Tensor(a!) out) -> Tensor(a!) dispatch: CompositeExplicitAutograd: full_out @@ -2528,6 +2590,7 @@ CPU, QuantizedCPU: grid_sampler_2d_cpu CUDA: grid_sampler_2d_cuda autogen: grid_sampler_2d.out + tags: canonical # `grid_sampler_2d_backward` takes in `output_mask` to optimize performance for # the case where `input` doesn't require gradient. Gradient for `grid` is always @@ -2610,16 +2673,18 @@ - func: group_norm(Tensor input, int num_groups, Tensor? weight=None, Tensor? bias=None, float eps=1e-05, bool cudnn_enabled=True) -> Tensor -- func: native_group_norm(Tensor input, Tensor? weight, Tensor? bias, int N, int C, int HxW, int group, float eps) -> (Tensor, Tensor, Tensor) +- func: native_group_norm(Tensor input, Tensor? weight, Tensor? bias, SymInt N, SymInt C, SymInt HxW, int group, float eps) -> (Tensor, Tensor, Tensor) dispatch: CPU, CUDA: native_group_norm CompositeExplicitAutograd: math_group_norm autogen: native_group_norm.out + tags: canonical -- func: native_group_norm_backward(Tensor grad_out, Tensor input, Tensor mean, Tensor rstd, Tensor? 
weight, int N, int C, int HxW, int group, bool[3] output_mask) -> (Tensor, Tensor, Tensor) +- func: native_group_norm_backward(Tensor grad_out, Tensor input, Tensor mean, Tensor rstd, Tensor? weight, SymInt N, SymInt C, SymInt HxW, int group, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: CPU, CUDA: native_group_norm_backward autogen: native_group_norm_backward.out + tags: canonical # Real to complex forward FFT - func: _fft_r2c(Tensor self, int[] dim, int normalization, bool onesided) -> Tensor @@ -2648,13 +2713,13 @@ CUDA: _fft_c2r_cufft_out # Standard complex to complex FFT (forward or backward) -- func: _fft_c2c(Tensor self, int[] dim, int normalization, bool forward) -> Tensor +- func: _fft_c2c(Tensor self, SymInt[] dim, int normalization, bool forward) -> Tensor variants: function dispatch: CPU: _fft_c2c_mkl CUDA: _fft_c2c_cufft -- func: _fft_c2c.out(Tensor self, int[] dim, int normalization, bool forward, *, Tensor(a!) out) -> Tensor(a!) +- func: _fft_c2c.out(Tensor self, SymInt[] dim, int normalization, bool forward, *, Tensor(a!) out) -> Tensor(a!) variants: function dispatch: CPU: _fft_c2c_mkl_out @@ -2694,7 +2759,7 @@ precomputed: - indices -> DimVector sizes, DimVector strides dispatch: - CPU, CUDA: index_out + CPU, CUDA, MPS: index_out - func: index_copy.out(Tensor self, int dim, Tensor index, Tensor source, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -2740,22 +2805,14 @@ device_check: NoCheck # TensorIterator variants: function dispatch: - CPU, CUDA: _index_put_impl_ + CPU, CUDA, MPS: _index_put_impl_ QuantizedCPU: _index_put_impl_quantized_cpu_ + QuantizedCUDA: _index_put_impl_quantized_cuda_ autogen: _index_put_impl, _index_put_impl.out - func: instance_norm(Tensor input, Tensor? weight, Tensor? bias, Tensor? running_mean, Tensor? running_var, bool use_input_stats, float momentum, float eps, bool cudnn_enabled) -> Tensor variants: function -- func: inverse(Tensor self) -> Tensor - variants: function, method - dispatch: - CompositeExplicitAutograd: inverse - -- func: inverse.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) - dispatch: - CompositeExplicitAutograd: inverse_out - - func: isclose(Tensor self, Tensor other, float rtol=1e-05, float atol=1e-08, bool equal_nan=False) -> Tensor variants: function, method @@ -2881,22 +2938,27 @@ - func: kthvalue.dimname_out(Tensor self, int k, Dimname dim, bool keepdim=False, *, Tensor(a!) values, Tensor(b!) indices) -> (Tensor(a!) values, Tensor(b!) indices) -- func: layer_norm(Tensor input, int[] normalized_shape, Tensor? weight=None, Tensor? bias=None, float eps=1e-05, bool cudnn_enable=True) -> Tensor +- func: layer_norm(Tensor input, SymInt[] normalized_shape, Tensor? weight=None, Tensor? bias=None, float eps=1e-05, bool cudnn_enable=True) -> Tensor + dispatch: + CompositeImplicitAutograd: layer_norm_symint -- func: native_layer_norm(Tensor input, int[] normalized_shape, Tensor? weight, Tensor? bias, float eps) -> (Tensor, Tensor, Tensor) +- func: native_layer_norm(Tensor input, SymInt[] normalized_shape, Tensor? weight, Tensor? bias, float eps) -> (Tensor, Tensor, Tensor) dispatch: CPU: layer_norm_cpu CUDA: layer_norm_cuda MPS: layer_norm_mps CompositeExplicitAutograd: math_native_layer_norm + NestedTensorCPU, NestedTensorCUDA: nested_layer_norm autogen: native_layer_norm.out + tags: canonical -- func: native_layer_norm_backward(Tensor grad_out, Tensor input, int[] normalized_shape, Tensor mean, Tensor rstd, Tensor? weight, Tensor? 
bias, bool[3] output_mask) -> (Tensor, Tensor, Tensor) +- func: native_layer_norm_backward(Tensor grad_out, Tensor input, SymInt[] normalized_shape, Tensor mean, Tensor rstd, Tensor? weight, Tensor? bias, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: CPU: layer_norm_backward_cpu CUDA: layer_norm_backward_cuda MPS: layer_norm_backward_mps autogen: native_layer_norm_backward.out + tags: canonical - func: nan_to_num(Tensor self, float? nan=None, float? posinf=None, float? neginf=None) -> Tensor variants: function, method @@ -2920,10 +2982,12 @@ dispatch: CompositeImplicitAutograd: linear NestedTensorCPU, NestedTensorCUDA: nested_linear + MPS: _mps_linear - func: linear_backward(Tensor self, Tensor grad_output, Tensor weight, bool[3] output_mask) -> (Tensor, Tensor, Tensor) dispatch: NestedTensorCPU, NestedTensorCUDA: nested_linear_backward + MPS: mps_linear_backward autogen: linear_backward.out - func: linear.out(Tensor input, Tensor weight, Tensor? bias=None, *, Tensor(a!) out) -> Tensor(a!) @@ -2931,15 +2995,6 @@ dispatch: CompositeExplicitAutograd: linear_out -# TODO: Add this function to MPS dispatch key so that we avoid declaring it in -# native_functions.yaml -# https://github.com/pytorch/pytorch/issues/77394 -- func: _mps_linear(Tensor self, Tensor weight, Tensor? bias=None) -> Tensor - python_module: nn - dispatch: - MPS: _mps_linear - autogen: _mps_linear.out - - func: mkldnn_linear(Tensor self, Tensor weight, Tensor? bias=None) -> Tensor python_module: nn dispatch: @@ -2961,21 +3016,6 @@ MkldnnCPU: mkldnn_linear_backward autogen: mkldnn_linear_backward.out -- func: _mps_linear_backward_input(int[] input_size, Tensor grad_output, Tensor weight) -> Tensor - dispatch: - MPS: _mps_linear_backward_input - autogen: _mps_linear_backward_input.out - -- func: _mps_linear_backward_weights(Tensor grad_output, Tensor input, Tensor weight, bool bias_defined) -> (Tensor, Tensor) - dispatch: - MPS: _mps_linear_backward_weights - autogen: _mps_linear_backward_weights.out - -- func: mps_linear_backward(Tensor self, Tensor grad_output, Tensor weight, bool[3] output_mask) -> (Tensor, Tensor, Tensor) - dispatch: - MPS: mps_linear_backward - autogen: mps_linear_backward.out - - func: fbgemm_linear_int8_weight_fp32_activation(Tensor input, Tensor weight, Tensor packed, Tensor col_offsets, Scalar weight_scale, Scalar weight_zero_point, Tensor bias) -> Tensor - func: fbgemm_linear_int8_weight(Tensor input, Tensor weight, Tensor packed, Tensor col_offsets, Scalar weight_scale, Scalar weight_zero_point, Tensor bias) -> Tensor @@ -3014,6 +3054,7 @@ device_check: NoCheck # TensorIterator structured_delegate: log.out variants: function, method + tags: canonical - func: log_(Tensor(a!) self) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -3185,6 +3226,7 @@ - func: _log_softmax(Tensor self, int dim, bool half_to_float) -> Tensor structured_delegate: _log_softmax.out + tags: canonical - func: _log_softmax.out(Tensor self, int dim, bool half_to_float, *, Tensor(a!) out) -> Tensor(a!) 
structured: True @@ -3264,10 +3306,6 @@ CompositeImplicitAutograd: matmul_out NestedTensorCPU, NestedTensorCUDA: matmul_out_nested -- func: matrix_rank.tol(Tensor self, float tol, bool symmetric=False) -> Tensor - -- func: matrix_rank(Tensor self, bool symmetric=False) -> Tensor - # Alias to linalg.matrix_power - func: matrix_power(Tensor self, int n) -> Tensor variants: function, method @@ -3319,6 +3357,7 @@ variants: function, method dispatch: QuantizedCPU, QuantizedCUDA: qmax + tags: canonical - func: max.dim_max(Tensor self, int dim, bool keepdim=False, *, Tensor(a!) max, Tensor(b!) max_values) -> (Tensor(a!) values, Tensor(b!) indices) device_check: NoCheck # TensorIterator @@ -3336,14 +3375,17 @@ - func: max.names_dim_max(Tensor self, Dimname dim, bool keepdim=False, *, Tensor(a!) max, Tensor(b!) max_values) -> (Tensor(a!) values, Tensor(b!) indices) device_check: NoCheck # TensorIterator -- func: value_selecting_reduction_backward(Tensor grad, int dim, Tensor indices, int[] sizes, bool keepdim) -> Tensor +- func: value_selecting_reduction_backward(Tensor grad, int dim, Tensor indices, SymInt[] sizes, bool keepdim) -> Tensor variants: function device_check: NoCheck device_guard: False + dispatch: + CompositeImplicitAutograd: value_selecting_reduction_backward_symint - func: amax(Tensor self, int[1] dim=[], bool keepdim=False) -> Tensor variants: function, method structured_delegate: amax.out + tags: canonical - func: amax.out(Tensor self, int[1] dim=[], bool keepdim=False, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -3425,6 +3467,7 @@ variants: function, method dispatch: QuantizedCPU: mean_quantized_cpu + tags: canonical - func: mean.out(Tensor self, int[1]? dim, bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) structured: True @@ -3453,6 +3496,7 @@ dispatch: CPU: median_cpu CUDA: median_cuda + MPS: median_mps autogen: median.out - func: median.dim(Tensor self, int dim, bool keepdim=False) -> (Tensor values, Tensor indices) @@ -3464,6 +3508,7 @@ dispatch: CPU: median_out_cpu CUDA: median_out_cuda + MPS: median_out_mps - func: median.names_dim(Tensor self, Dimname dim, bool keepdim=False) -> (Tensor values, Tensor indices) variants: function, method @@ -3498,6 +3543,7 @@ variants: function, method dispatch: QuantizedCPU, QuantizedCUDA: qmin + tags: canonical - func: min.dim_min(Tensor self, int dim, bool keepdim=False, *, Tensor(a!) min, Tensor(b!) min_indices) -> (Tensor(a!) values, Tensor(b!) indices) device_check: NoCheck # TensorIterator @@ -3518,6 +3564,7 @@ - func: amin(Tensor self, int[1] dim=[], bool keepdim=False) -> Tensor variants: function, method structured_delegate: amin.out + tags: canonical - func: amin.out(Tensor self, int[1] dim=[], bool keepdim=False, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -3538,7 +3585,7 @@ MPS: mps_convolution_backward autogen: mps_convolution_backward.out -- func: mkldnn_convolution(Tensor self, Tensor weight, Tensor? bias, int[] padding, int[] stride, int[] dilation, int groups) -> Tensor +- func: mkldnn_convolution(Tensor self, Tensor weight, Tensor? bias, SymInt[] padding, int[] stride, int[] dilation, int groups) -> Tensor dispatch: CompositeExplicitAutograd: mkldnn_convolution autogen: mkldnn_convolution.out @@ -3553,21 +3600,29 @@ CUDA: miopen_batch_norm_backward autogen: miopen_batch_norm_backward.out -- func: miopen_convolution(Tensor self, Tensor weight, Tensor? 
bias, int[] padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic) -> Tensor +- func: miopen_convolution(Tensor self, Tensor weight, Tensor? bias, SymInt[] padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic) -> Tensor dispatch: CUDA: miopen_convolution autogen: miopen_convolution.out -- func: miopen_convolution_transpose(Tensor self, Tensor weight, Tensor? bias, int[] padding, int[] output_padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic) -> Tensor +- func: miopen_convolution_transpose(Tensor self, Tensor weight, Tensor? bias, SymInt[] padding, SymInt[] output_padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic) -> Tensor dispatch: CUDA: miopen_convolution_transpose autogen: miopen_convolution_transpose.out -- func: miopen_depthwise_convolution(Tensor self, Tensor weight, Tensor? bias, int[] padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic) -> Tensor +- func: miopen_depthwise_convolution(Tensor self, Tensor weight, Tensor? bias, SymInt[] padding, int[] stride, int[] dilation, int groups, bool benchmark, bool deterministic) -> Tensor dispatch: CUDA: miopen_depthwise_convolution autogen: miopen_depthwise_convolution.out +- func: miopen_convolution_relu(Tensor self, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, int groups) -> Tensor + dispatch: + CUDA: miopen_convolution_relu + +- func: miopen_convolution_add_relu(Tensor self, Tensor weight, Tensor z, Scalar? alpha, Tensor? bias, int[] stride, int[] padding, int[] dilation, int groups) -> Tensor + dispatch: + CUDA: miopen_convolution_add_relu + - func: miopen_rnn(Tensor input, Tensor[] weight, int weight_stride0, Tensor hx, Tensor? cx, int mode, int hidden_size, int num_layers, bool batch_first, float dropout, bool train, bool bidirectional, int[] batch_sizes, Tensor? dropout_state) -> (Tensor, Tensor, Tensor, Tensor, Tensor) dispatch: CUDA: miopen_rnn @@ -3584,6 +3639,7 @@ dispatch: SparseCPU, SparseCUDA: _sparse_mm SparseCsrCPU, SparseCsrCUDA: _sparse_csr_mm + tags: canonical - func: mm.out(Tensor self, Tensor mat2, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -3609,11 +3665,6 @@ SparseCUDA: sparse_mask_helper_cuda autogen: _sparse_mask_helper.out -- func: spmm_sum(Tensor rowptr, Tensor col, Tensor? optional_value, Tensor mat) -> Tensor - variants: function - dispatch: - CPU: spmm_sum_cpu - - func: mode(Tensor self, int dim=-1, bool keepdim=False) -> (Tensor values, Tensor indices) variants: function, method dispatch: @@ -3638,6 +3689,7 @@ MkldnnCPU: mkldnn_mul ZeroTensor: mul_zerotensor NestedTensorCPU, NestedTensorCUDA: NestedTensor_mul_Tensor + tags: canonical - func: mul_.Tensor(Tensor(a!) self, Tensor other) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -3669,6 +3721,7 @@ CompositeExplicitAutograd: mul SparseCsrCPU, SparseCsrCUDA: mul_scalar_sparse_csr NestedTensorCPU, NestedTensorCUDA: NestedTensor_mul_Scalar + tags: canonical - func: mul_.Scalar(Tensor(a!) self, Scalar other) -> Tensor(a!) 
device_check: NoCheck # TensorIterator @@ -3720,33 +3773,31 @@ dispatch: CompositeExplicitAutograd: mvlgamma_ -- func: narrow_copy(Tensor self, int dim, int start, int length) -> Tensor +- func: narrow_copy(Tensor self, int dim, SymInt start, SymInt length) -> Tensor variants: function, method dispatch: CPU: narrow_copy_dense_cpu SparseCPU, SparseCUDA: narrow_copy_sparse - CompositeExplicitAutogradNonFunctional: narrow_copy_dense + CompositeExplicitAutogradNonFunctional: narrow_copy_dense_symint tags: view_copy -- func: narrow_copy.SymInt(Tensor self, int dim, int start, SymInt length) -> Tensor - variants: function, method - dispatch: - CompositeExplicitAutograd: narrow_copy_symint - autogen: narrow_copy.SymInt_out - -- func: narrow_copy.out(Tensor self, int dim, int start, int length, *, Tensor(a!) out) -> Tensor(a!) +- func: narrow_copy.out(Tensor self, int dim, SymInt start, SymInt length, *, Tensor(a!) out) -> Tensor(a!) dispatch: CPU: narrow_copy_dense_cpu_out -- func: narrow(Tensor(a) self, int dim, int start, int length) -> Tensor(a) +- func: narrow(Tensor(a) self, int dim, SymInt start, SymInt length) -> Tensor(a) variants: function, method device_check: NoCheck device_guard: False + dispatch: + CompositeImplicitAutograd: narrow_symint -- func: narrow.Tensor(Tensor(a) self, int dim, Tensor start, int length) -> Tensor(a) +- func: narrow.Tensor(Tensor(a) self, int dim, Tensor start, SymInt length) -> Tensor(a) variants: function, method device_check: NoCheck device_guard: False + dispatch: + CompositeImplicitAutograd: narrow_tensor_symint - func: native_batch_norm(Tensor input, Tensor? weight, Tensor? bias, Tensor? running_mean, Tensor? running_var, bool training, float momentum, float eps) -> (Tensor, Tensor, Tensor) dispatch: @@ -3754,11 +3805,42 @@ CUDA: batch_norm_cuda MPS: batch_norm_mps MkldnnCPU: mkldnn_batch_norm + tags: canonical - func: native_batch_norm.out(Tensor input, Tensor? weight, Tensor? bias, Tensor? running_mean, Tensor? running_var, bool training, float momentum, float eps, *, Tensor(a!) out, Tensor(b!) save_mean, Tensor(c!) save_invstd) -> (Tensor(a!), Tensor(b!), Tensor(c!)) dispatch: CUDA: batch_norm_cuda_out MPS: batch_norm_mps_out + CPU: batch_norm_cpu_out + +# TODO: In 2 weeks, we should make native_batch_norm composite implicit so that this correct schema percolates correctly through our dispatching +- func: _native_batch_norm_legit(Tensor input, Tensor? weight, Tensor? bias, Tensor(a!) running_mean, Tensor(b!) running_var, bool training, float momentum, float eps) -> (Tensor, Tensor, Tensor) + dispatch: + CPU: _batch_norm_legit_cpu + CUDA: _batch_norm_legit_cuda + MPS: _batch_norm_legit_mps + MkldnnCPU: _mkldnn_batch_norm_legit + autogen: _native_batch_norm_legit_functional + +- func: _native_batch_norm_legit.out(Tensor input, Tensor? weight, Tensor? bias, Tensor(a!) running_mean, Tensor(b!) running_var, bool training, float momentum, float eps, *, Tensor(d!) out, Tensor(e!) save_mean, Tensor(f!) save_invstd) -> (Tensor(d!), Tensor(e!), Tensor(f!)) + dispatch: + CPU: _batch_norm_legit_cpu_out + CUDA: _batch_norm_legit_cuda_out + MPS: _batch_norm_legit_mps_out + +- func: _native_batch_norm_legit.no_stats(Tensor input, Tensor? weight, Tensor? 
bias, bool training, float momentum, float eps) -> (Tensor, Tensor, Tensor) + dispatch: + CPU: _batch_norm_legit_no_stats_cpu + CUDA: _batch_norm_legit_no_stats_cuda + MPS: _batch_norm_legit_no_stats_mps + MkldnnCPU: _mkldnn_batch_norm_legit_no_stats + tags: canonical + +- func: _native_batch_norm_legit.no_stats_out(Tensor input, Tensor? weight, Tensor? bias, bool training, float momentum, float eps, *, Tensor(a!) out, Tensor(b!) save_mean, Tensor(c!) save_invstd) -> (Tensor(a!), Tensor(b!), Tensor(c!)) + dispatch: + CPU: _batch_norm_legit_no_stats_cpu_out + CUDA: _batch_norm_legit_no_stats_cuda_out + MPS: _batch_norm_legit_no_stats_mps_out - func: batch_norm_stats(Tensor input, float eps) -> (Tensor, Tensor) dispatch: @@ -3812,7 +3894,7 @@ - func: _nnpack_available() -> bool -- func: _nnpack_spatial_convolution(Tensor input, Tensor weight, Tensor? bias, int[2] padding, int[2] stride=1) -> Tensor +- func: _nnpack_spatial_convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[2] padding, int[2] stride=1) -> Tensor variants: function dispatch: CompositeExplicitAutograd: _nnpack_spatial_convolution @@ -3825,11 +3907,11 @@ CompositeExplicitAutograd: ones autogen: ones.names_out -- func: ones(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: ones(SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: ones -- func: ones.out(int[] size, *, Tensor(a!) out) -> Tensor(a!) +- func: ones.out(SymInt[] size, *, Tensor(a!) out) -> Tensor(a!) dispatch: CompositeExplicitAutograd: ones_out @@ -3838,6 +3920,7 @@ # NB: Although this composite mutates on the inside, it is # non-differentiable so NonFunctional doesn't apply CompositeExplicitAutograd: ones_like + NestedTensorCPU, NestedTensorCUDA: ones_like autogen: ones_like.out - func: pairwise_distance(Tensor x1, Tensor x2, float p=2, float eps=1e-06, bool keepdim=False) -> Tensor @@ -3880,6 +3963,7 @@ CompositeExplicitAutograd: permute MPS: permute_mps SparseCPU, SparseCUDA: permute_sparse_coo + tags: canonical - func: movedim.intlist(Tensor(a) self, int[] source, int[] destination) -> Tensor(a) variants: function, method @@ -3971,66 +4055,81 @@ variants: function, method dispatch: CompositeExplicitAutograd: rad2deg + SparseCPU, SparseCUDA: rad2deg_sparse SparseCsrCPU, SparseCsrCUDA: rad2deg_sparse_csr - func: rad2deg_(Tensor(a!) self) -> Tensor(a!) variants: function, method dispatch: CompositeExplicitAutograd: rad2deg_ + SparseCPU, SparseCUDA: rad2deg_sparse_ SparseCsrCPU, SparseCsrCUDA: rad2deg_sparse_csr_ - func: rad2deg.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) dispatch: CompositeExplicitAutograd: rad2deg_out + SparseCPU, SparseCUDA: rad2deg_sparse_out SparseCsrCPU, SparseCsrCUDA: rad2deg_sparse_csr_out - func: deg2rad(Tensor self) -> Tensor variants: function, method dispatch: CompositeExplicitAutograd: deg2rad + SparseCPU, SparseCUDA: deg2rad_sparse + SparseCsrCPU, SparseCsrCUDA: deg2rad_sparse_csr - func: deg2rad_(Tensor(a!) self) -> Tensor(a!) variants: function, method dispatch: CompositeExplicitAutograd: deg2rad_ + SparseCPU, SparseCUDA: deg2rad_sparse_ + SparseCsrCPU, SparseCsrCUDA: deg2rad_sparse_csr_ - func: deg2rad.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) dispatch: CompositeExplicitAutograd: deg2rad_out + SparseCPU, SparseCUDA: deg2rad_sparse_out + SparseCsrCPU, SparseCsrCUDA: deg2rad_sparse_csr_out - func: scalar_tensor(Scalar s, *, ScalarType? 
dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: scalar_tensor autogen: scalar_tensor.out + tags: canonical -- func: rand.names(int[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: rand.names(SymInt[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: rand autogen: rand.names_out + tags: nondeterministic_seeded -- func: rand.generator_with_names(int[] size, *, Generator? generator, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: rand.generator_with_names(SymInt[] size, *, Generator? generator, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor device_check: NoCheck device_guard: False + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: rand autogen: rand.generator_with_names_out -- func: rand(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: rand(SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: rand -- func: rand.generator(int[] size, *, Generator? generator, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: rand.generator(SymInt[] size, *, Generator? generator, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: rand -- func: rand.out(int[] size, *, Tensor(a!) out) -> Tensor(a!) +- func: rand.out(SymInt[] size, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: rand_out -- func: rand.generator_out(int[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) +- func: rand.generator_out(SymInt[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded - func: rand_like(Tensor self, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor tags: nondeterministic_seeded @@ -4040,37 +4139,43 @@ CompositeExplicitAutograd: rand_like autogen: rand_like.out -- func: randint(int high, int[] size, *, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: randint(int high, SymInt[] size, *, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint -- func: randint.generator(int high, int[] size, *, Generator? generator, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: randint.generator(int high, SymInt[] size, *, Generator? generator, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint -- func: randint.low(int low, int high, int[] size, *, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor +- func: randint.low(int low, int high, SymInt[] size, *, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint -- func: randint.low_generator(int low, int high, int[] size, *, Generator? generator, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: randint.low_generator(int low, int high, SymInt[] size, *, Generator? generator, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint -- func: randint.out(int high, int[] size, *, Tensor(a!) out) -> Tensor(a!) +- func: randint.out(int high, SymInt[] size, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint_out -- func: randint.generator_out(int high, int[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) +- func: randint.generator_out(int high, SymInt[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint_out -- func: randint.low_out(int low, int high, int[] size, *, Tensor(a!) out) -> Tensor(a!) +- func: randint.low_out(int low, int high, SymInt[] size, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint_out -- func: randint.low_generator_out(int low, int high, int[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) +- func: randint.low_generator_out(int low, int high, SymInt[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randint_out @@ -4090,32 +4195,37 @@ CompositeExplicitAutograd: randint_like autogen: randint_like.low_dtype_out -- func: randn(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: randn(SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randn -- func: randn.generator(int[] size, *, Generator? generator, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: randn.generator(SymInt[] size, *, Generator? generator, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randn -- func: randn.names(int[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: randn.names(SymInt[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: randn autogen: randn.names_out -- func: randn.generator_with_names(int[] size, *, Generator? generator, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: randn.generator_with_names(SymInt[] size, *, Generator? generator, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor + tags: nondeterministic_seeded device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: randn autogen: randn.generator_with_names_out -- func: randn.out(int[] size, *, Tensor(a!) out) -> Tensor(a!) +- func: randn.out(SymInt[] size, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded -- func: randn.generator_out(int[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) +- func: randn.generator_out(SymInt[] size, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded - func: randn_like(Tensor self, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor tags: nondeterministic_seeded @@ -4131,14 +4241,17 @@ CompositeExplicitAutograd: randperm - func: randperm.generator(int n, *, Generator? generator, ScalarType? dtype=long, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randperm - func: randperm.out(int n, *, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: randperm_out - func: randperm.generator_out(int n, *, Generator? generator, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CPU: randperm_out_cpu CUDA: randperm_out_cuda @@ -4168,6 +4281,7 @@ device_check: NoCheck # TensorIterator structured_delegate: reciprocal.out variants: function, method + tags: canonical - func: reciprocal_(Tensor(a!) self) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -4189,6 +4303,8 @@ dispatch: SparseCPU, SparseCUDA: neg_sparse SparseCsrCPU, SparseCsrCUDA: neg_sparse_csr + NestedTensorCPU, NestedTensorCUDA: NestedTensor_neg + tags: canonical - func: neg_(Tensor(a!) self) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -4197,6 +4313,7 @@ dispatch: SparseCPU, SparseCUDA: neg_sparse_ SparseCsrCPU, SparseCsrCUDA: neg_sparse_csr_ + NestedTensorCPU, NestedTensorCUDA: NestedTensor_neg_ - func: neg.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -4217,12 +4334,13 @@ - func: negative.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) -- func: repeat(Tensor self, int[] repeats) -> Tensor +- func: repeat(Tensor self, SymInt[] repeats) -> Tensor variants: method # This is method-only to match the previous tensor API. In the future we could make this a function too. dispatch: CompositeExplicitAutograd: repeat MPS: repeat_mps autogen: repeat.out + tags: canonical - func: repeat_interleave.Tensor(Tensor repeats, *, int? output_size=None) -> Tensor variants: function @@ -4235,28 +4353,28 @@ - func: repeat_interleave.self_Tensor(Tensor self, Tensor repeats, int? dim=None, *, int? output_size=None) -> Tensor variants: function, method -- func: repeat_interleave.self_int(Tensor self, int repeats, int? dim=None, *, int? output_size=None) -> Tensor +- func: repeat_interleave.self_int(Tensor self, SymInt repeats, int? dim=None, *, int? 
output_size=None) -> Tensor variants: function, method + dispatch: + CompositeImplicitAutograd: repeat_interleave_symint -- func: reshape(Tensor(a) self, int[] shape) -> Tensor(a) +- func: reshape(Tensor(a) self, SymInt[] shape) -> Tensor(a) variants: function, method device_check: NoCheck device_guard: False - -- func: _reshape_nested(Tensor self, int[] shape) -> Tensor dispatch: - NestedTensorCPU, NestedTensorCUDA: _reshape_nested - autogen: _reshape_nested.out + CompositeImplicitAutograd: reshape_symint + CompositeImplicitAutogradNestedTensor: reshape_nested -- func: _reshape_nested_backward(Tensor self, Tensor grad) -> Tensor +- func: _reshape_copy(Tensor self, SymInt[] size) -> Tensor + variants: function dispatch: - NestedTensorCPU, NestedTensorCUDA: _reshape_nested_backward - autogen: _reshape_nested_backward.out + CompositeExplicitAutograd: _reshape_copy_symint # NOTE [ _reshape_alias ] is meant to be used in the implementation of reshape. # They are not user-facing, hence the leading underscore. Please don't use it # anywhere else. -- func: _reshape_alias(Tensor(a) self, int[] size, int[] stride) -> Tensor(a) +- func: _reshape_alias(Tensor(a) self, SymInt[] size, SymInt[] stride) -> Tensor(a) variants: function, method device_check: NoCheck device_guard: False @@ -4275,6 +4393,9 @@ variants: method device_check: NoCheck device_guard: False + dispatch: + CompositeImplicitAutograd: reshape_as + CompositeImplicitAutogradNestedTensor: reshape_as_nested - func: round(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -4326,6 +4447,7 @@ tags: nondeterministic_seeded - func: rrelu_(Tensor(a!) self, Scalar lower=0.125, Scalar upper=0.3333333333333333, bool training=False, Generator? generator=None) -> Tensor(a!) + tags: nondeterministic_seeded device_check: NoCheck # TensorIterator - func: relu(Tensor self) -> Tensor @@ -4336,7 +4458,11 @@ MPS: relu_mps MkldnnCPU: mkldnn_relu QuantizedCPU: relu_quantized_cpu + QuantizedCUDA: relu_quantized_cuda NestedTensorCPU, NestedTensorCUDA: NestedTensor_relu + SparseCPU, SparseCUDA: relu_sparse + SparseCsrCPU, SparseCsrCUDA: relu_sparse_csr + tags: canonical - func: relu_(Tensor(a!) self) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -4346,7 +4472,10 @@ MPS: relu_mps_ MkldnnCPU: mkldnn_relu_ QuantizedCPU: relu_quantized_cpu_ + QuantizedCUDA: relu_quantized_cuda_ NestedTensorCPU, NestedTensorCUDA: NestedTensor_relu_ + SparseCPU, SparseCUDA: relu_sparse_ + SparseCsrCPU, SparseCsrCUDA: relu_sparse_csr_ autogen: relu.out - func: relu6(Tensor self) -> Tensor @@ -4400,6 +4529,7 @@ QuantizedCPU: gelu_quantized_cpu QuantizedCUDA: gelu_quantized_cuda NestedTensorCPU, NestedTensorCUDA: NestedTensor_gelu + tags: canonical - func: gelu_backward.grad_input(Tensor grad_output, Tensor self, *, str approximate='none', Tensor(a!) grad_input) -> Tensor(a!) structured: True @@ -4448,6 +4578,7 @@ device_check: NoCheck # TensorIterator structured_delegate: rsqrt.out variants: function, method + tags: canonical - func: rsqrt_(Tensor(a!) self) -> Tensor(a!) 
device_check: NoCheck # TensorIterator @@ -4467,23 +4598,30 @@ device_check: NoCheck device_guard: False -- func: select.int(Tensor(a) self, int dim, int index) -> Tensor(a) +- func: select.int(Tensor(a) self, int dim, SymInt index) -> Tensor(a) variants: function, method device_check: NoCheck device_guard: False dispatch: - CompositeExplicitAutograd: select + CompositeExplicitAutograd: select_symint SparseCsrCPU, SparseCsrCUDA: select_sparse_csr NestedTensorCPU, NestedTensorCUDA: select_nested -- func: select_backward(Tensor grad_output, int[] input_sizes, int dim, int index) -> Tensor +- func: select_backward(Tensor grad_output, SymInt[] input_sizes, int dim, SymInt index) -> Tensor variants: function device_check: NoCheck device_guard: False dispatch: - CompositeExplicitAutogradNonFunctional: select_backward + CompositeExplicitAutogradNonFunctional: select_backward_symint autogen: select_backward.out +- func: _nested_select_backward(Tensor grad_output, Tensor self, int dim, SymInt index) -> Tensor + variants: function + device_check: NoCheck + device_guard: False + dispatch: + NestedTensorCPU, NestedTensorCUDA: _nested_select_backward_symint + - func: selu(Tensor self) -> Tensor device_check: NoCheck # TensorIterator @@ -4559,6 +4697,7 @@ dispatch: QuantizedCPU: sigmoid_quantized_cpu MkldnnCPU: mkldnn_sigmoid + tags: canonical - func: sigmoid_(Tensor(a!) self) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -4670,6 +4809,7 @@ variants: function, method dispatch: CompositeExplicitAutograd: detach + NestedTensorCPU, NestedTensorCUDA: detach # Like `detach()`, but modifies this `Variable` in-place. This method may # only be called on non-view `Variable`s. You can use `is_view()` to check @@ -4691,14 +4831,18 @@ device_check: NoCheck device_guard: False -- func: slice.Tensor(Tensor(a) self, int dim=0, int? start=None, int? end=None, int step=1) -> Tensor(a) +- func: slice.Tensor(Tensor(a) self, int dim=0, SymInt? start=None, SymInt? end=None, SymInt step=1) -> Tensor(a) variants: function, method device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: slice + tags: canonical + +# NOTE: The implementation of split_with_sizes bypasses the dispatcher to call this; undo +# that if adding specific implementations here! -- func: slice_backward(Tensor grad_output, int[] input_sizes, int dim, int start, int end, int step) -> Tensor +- func: slice_backward(Tensor grad_output, SymInt[] input_sizes, int dim, SymInt start, SymInt end, SymInt step) -> Tensor variants: function device_check: NoCheck device_guard: False @@ -4706,20 +4850,21 @@ CompositeExplicitAutograd: slice_backward autogen: slice_backward.out -- func: slice_scatter(Tensor self, Tensor src, int dim=0, int? start=None, int? end=None, int step=1) -> Tensor +- func: slice_scatter(Tensor self, Tensor src, int dim=0, SymInt? start=None, SymInt? 
end=None, SymInt step=1) -> Tensor variants: function, method device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: slice_scatter autogen: slice_scatter.out + tags: canonical -- func: select_scatter(Tensor self, Tensor src, int dim, int index) -> Tensor +- func: select_scatter(Tensor self, Tensor src, int dim, SymInt index) -> Tensor variants: function, method device_check: NoCheck device_guard: False dispatch: - CompositeExplicitAutograd: select_scatter + CompositeExplicitAutograd: select_scatter_symint autogen: select_scatter.out - func: diagonal_scatter(Tensor self, Tensor src, int offset=0, int dim1=0, int dim2=1) -> Tensor @@ -4730,12 +4875,12 @@ CompositeExplicitAutograd: diagonal_scatter autogen: diagonal_scatter.out -- func: as_strided_scatter(Tensor self, Tensor src, int[] size, int[] stride, int? storage_offset=None) -> Tensor +- func: as_strided_scatter(Tensor self, Tensor src, SymInt[] size, SymInt[] stride, SymInt? storage_offset=None) -> Tensor variants: function, method device_check: NoCheck device_guard: False dispatch: - CompositeExplicitAutograd: as_strided_scatter + CompositeExplicitAutograd: as_strided_scatter_symint autogen: as_strided_scatter.out - func: smm(Tensor self, Tensor mat2) -> Tensor @@ -4758,6 +4903,7 @@ dispatch: MkldnnCPU: mkldnn_softmax NestedTensorCPU, NestedTensorCUDA: softmax_nested + tags: canonical - func: _softmax.out(Tensor self, int dim, bool half_to_float, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -4778,7 +4924,7 @@ CUDA: softmax_backward_cuda_out MPS: softmax_backward_mps_out -- func: unsafe_split.Tensor(Tensor self, int split_size, int dim=0) -> Tensor[] +- func: unsafe_split.Tensor(Tensor self, SymInt split_size, int dim=0) -> Tensor[] variants: function, method device_check: NoCheck device_guard: False @@ -4786,18 +4932,20 @@ CompositeExplicitAutograd: unsafe_split autogen: unsafe_split.Tensor_out -- func: split.Tensor(Tensor(a -> *) self, int split_size, int dim=0) -> Tensor(a)[] +- func: split.Tensor(Tensor(a -> *) self, SymInt split_size, int dim=0) -> Tensor(a)[] variants: function, method device_check: NoCheck device_guard: False dispatch: CompositeExplicitAutograd: split -- func: split.sizes(Tensor(a -> *) self, int[] split_size, int dim=0) -> Tensor(a)[] +- func: split.sizes(Tensor(a -> *) self, SymInt[] split_size, int dim=0) -> Tensor(a)[] variants: function, method device_guard: False + dispatch: + CompositeImplicitAutograd: split_symint -- func: unsafe_split_with_sizes(Tensor self, int[] split_sizes, int dim=0) -> Tensor[] +- func: unsafe_split_with_sizes(Tensor self, SymInt[] split_sizes, int dim=0) -> Tensor[] variants: function, method device_check: NoCheck device_guard: False @@ -4805,7 +4953,7 @@ CompositeExplicitAutograd: unsafe_split_with_sizes autogen: unsafe_split_with_sizes.out -- func: split_with_sizes(Tensor(a -> *) self, int[] split_sizes, int dim=0) -> Tensor(a)[] +- func: split_with_sizes(Tensor(a -> *) self, SymInt[] split_sizes, int dim=0) -> Tensor(a)[] variants: function, method device_check: NoCheck device_guard: False @@ -4837,6 +4985,7 @@ dispatch: CompositeExplicitAutograd: squeeze QuantizedCPU, QuantizedCUDA: squeeze_quantized + NestedTensorCPU, NestedTensorCUDA: squeeze_nested - func: squeeze.dim(Tensor(a) self, int dim) -> Tensor(a) variants: function, method @@ -4845,6 +4994,8 @@ dispatch: CompositeExplicitAutograd: squeeze QuantizedCPU, QuantizedCUDA: squeeze_quantized + NestedTensorCPU, NestedTensorCUDA: squeeze_dim_nested + tags: canonical - func: 
squeeze.dimname(Tensor(a) self, Dimname dim) -> Tensor(a) variants: function, method @@ -4940,22 +5091,17 @@ variants: function, method dispatch: CompositeExplicitAutograd: sum + SparseCPU, SparseCUDA: sum_coo SparseCsrCPU, SparseCsrCUDA: sum_csr autogen: sum.out -- func: sum.SymInt(Tensor self, SymInt[1] dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor - device_check: NoCheck # TensorIterator - variants: function, method - dispatch: - CompositeExplicitAutograd: sum_symint - autogen: sum.SymInt_out - - func: sum.dim_IntList(Tensor self, int[1]? dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor structured_delegate: sum.IntList_out device_check: NoCheck # TensorIterator variants: function, method dispatch: NestedTensorCPU: NestedTensor_sum_dim_CPU + tags: canonical - func: sum.dim_DimnameList(Tensor self, Dimname[1] dim, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor device_check: NoCheck # TensorIterator @@ -4971,6 +5117,11 @@ - func: sum.DimnameList_out(Tensor self, Dimname[1] dim, bool keepdim=False, *, ScalarType? dtype=None, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator +# TODO: this function will be replaced once nested expand semantics have been settled on +- func: _nested_sum_backward(Tensor grad, Tensor self, int[1]? dim, bool keepdim=False) -> Tensor + dispatch: + NestedTensorCPU: _nested_sum_backward_cpu + - func: nansum(Tensor self, int[1]? dim=None, bool keepdim=False, *, ScalarType? dtype=None) -> Tensor variants: function, method dispatch: @@ -4992,6 +5143,7 @@ dispatch: SparseCPU, SparseCUDA: sqrt_sparse SparseCsrCPU, SparseCsrCUDA: sqrt_sparse_csr + tags: canonical - func: sqrt_(Tensor(a!) self) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -5161,6 +5313,8 @@ MkldnnCPU: mkldnn_tanh SparseCPU, SparseCUDA: tanh_sparse SparseCsrCPU, SparseCsrCUDA: tanh_sparse_csr + NestedTensorCPU, NestedTensorCUDA: NestedTensor_tanh + tags: canonical - func: tanh_(Tensor(a!) self) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -5170,6 +5324,7 @@ MkldnnCPU: mkldnn_tanh_ SparseCPU, SparseCUDA: tanh_sparse_ SparseCsrCPU, SparseCsrCUDA: tanh_sparse_csr_ + NestedTensorCPU, NestedTensorCUDA: NestedTensor_tanh_ - func: tanh.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -5216,12 +5371,16 @@ dispatch: CPU, CUDA: threshold_backward_out MPS: threshold_backward_out_mps + SparseCPU, SparseCUDA: threshold_backward_sparse_out + SparseCsrCPU, SparseCsrCUDA: threshold_backward_sparse_compressed_out - func: threshold_backward(Tensor grad_output, Tensor self, Scalar threshold) -> Tensor variants: function structured_delegate: threshold_backward.grad_input dispatch: MkldnnCPU: mkldnn_relu_backward + SparseCPU, SparseCUDA: threshold_backward_sparse + SparseCsrCPU, SparseCsrCUDA: threshold_backward_sparse_compressed - func: tile(Tensor self, int[] dims) -> Tensor variants: function, method @@ -5324,12 +5483,24 @@ CUDA: nested_from_padded_cuda autogen: _nested_from_padded.out +# These private functions are temporary. 
They will be updated/deleted when nested tensors switch to using SymInts for their metadata representation - func: _nested_tensor_size(Tensor self) -> Tensor variants: method dispatch: - NestedTensorCPU, NestedTensorCUDA: NestedTensor_get_nested_size_tensor + NestedTensorCPU, NestedTensorCUDA: _nested_tensor_size autogen: _nested_tensor_size.out +- func: _nested_tensor_strides(Tensor self) -> Tensor + variants: method + dispatch: + NestedTensorCPU, NestedTensorCUDA: _nested_tensor_strides + autogen: _nested_tensor_strides.out + +- func: _nested_tensor_offsets(Tensor self) -> int[] + variants: method + dispatch: + NestedTensorCPU, NestedTensorCUDA: _nested_tensor_offsets + # _nested_from_padded is not usable from Python, so # _nested_from_padded_and_nested_example is available for testing. - func: _nested_from_padded_and_nested_example(Tensor padded, Tensor nt_example) -> Tensor @@ -5337,6 +5508,22 @@ NestedTensorCPU, NestedTensorCUDA: NestedTensor_from_padded_and_nested_example autogen: _nested_from_padded_and_nested_example.out +# The input arguments' types to this functions are temporary. When nested tensors switch to using SymInts for their metadata representation +# this will need to be updated +- func: _nested_view_from_buffer(Tensor(a) self, Tensor nested_size, Tensor nested_strides, int[] offsets) -> Tensor(a) + variants: function + device_check: NoCheck + dispatch: + CPU, CUDA: _nested_view_from_buffer + +- func: _nested_view_from_buffer_copy(Tensor self, Tensor nested_size, Tensor nested_strides, int[] offsets) -> Tensor + variants: function + device_check: NoCheck + tags: view_copy + dispatch: + CompositeExplicitAutogradNonFunctional: _nested_view_from_buffer_copy + autogen: _nested_view_from_buffer_copy.out + - func: _trilinear(Tensor i1, Tensor i2, Tensor i3, int[] expand1, int[] expand2, int[] expand3, int[] sumdim, int unroll_dim=1) -> Tensor dispatch: # calls unsqueeze @@ -5429,7 +5616,7 @@ tags: dynamic_output_shape autogen: _unique2.out -- func: _unsafe_view(Tensor self, int[] size) -> Tensor +- func: _unsafe_view(Tensor self, SymInt[] size) -> Tensor dispatch: CompositeExplicitAutograd: _unsafe_view autogen: _unsafe_view.out @@ -5442,6 +5629,8 @@ CompositeExplicitAutograd: unsqueeze SparseCPU, SparseCUDA: unsqueeze_sparse QuantizedCPU, QuantizedCUDA: unsqueeze_quantized + NestedTensorCPU, NestedTensorCUDA: unsqueeze_nested + tags: canonical - func: unsqueeze_(Tensor(a!) self, int dim) -> Tensor(a!) variants: method @@ -5460,6 +5649,7 @@ - func: var.dim(Tensor self, int[1]? dim, bool unbiased=True, bool keepdim=False) -> Tensor device_check: NoCheck # TensorIterator variants: function, method + tags: canonical - func: var.correction(Tensor self, int[1]? dim, *, int? correction, bool keepdim=False) -> Tensor device_check: NoCheck # TensorIterator @@ -5525,6 +5715,7 @@ dispatch: CPU, CUDA: where MPS: where_mps + tags: canonical - func: where.self_out(Tensor condition, Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -5583,19 +5774,14 @@ CUDA: _efficientzerotensor_cuda autogen: _efficientzerotensor.out -- func: zeros(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor - dispatch: - CompositeExplicitAutograd: zeros - -- func: zeros.SymInt(SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: zeros(SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? 
pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: zeros_symint - autogen: zeros.SymInt_out -- func: zeros.out(int[] size, *, Tensor(a!) out) -> Tensor(a!) +- func: zeros.out(SymInt[] size, *, Tensor(a!) out) -> Tensor(a!) dispatch: CompositeExplicitAutograd: zeros_out - SparseCPU, SparseCUDA, SparseMeta: zeros_out + SparseCPU, SparseCUDA, SparseMeta: zeros_sparse_out - func: zeros_like(Tensor self, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor dispatch: @@ -5626,6 +5812,7 @@ autogen: _dirichlet_grad.out - func: _sample_dirichlet(Tensor self, Generator? generator=None) -> Tensor + tags: nondeterministic_seeded variants: function dispatch: CPU: _s_dirichlet_cpu @@ -5842,6 +6029,7 @@ QuantizedCPU, QuantizedCUDA: quantized_clone NestedTensorCPU, NestedTensorCUDA: clone_nested autogen: clone.out + tags: canonical - func: positive(Tensor(a) self) -> Tensor(a) variants: function, method @@ -5858,7 +6046,7 @@ variants: function, method dispatch: SparseCPU, SparseCUDA: resize_as_sparse_ - SparseCsrCPU, SparseCsrCUDA: resize_as_sparse_csr_ + SparseCsrCPU, SparseCsrCUDA: resize_as_sparse_compressed_ autogen: resize_as_sparse, resize_as_sparse.out - func: zero_(Tensor(a!) self) -> Tensor(a!) @@ -5889,6 +6077,7 @@ dispatch: SparseCPU, SparseCUDA: sub_sparse ZeroTensor: sub_zerotensor + tags: canonical - func: sub_.Tensor(Tensor(a!) self, Tensor other, *, Scalar alpha=1) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -5903,6 +6092,7 @@ variants: function, method dispatch: CompositeExplicitAutograd: sub + tags: canonical - func: sub_.Scalar(Tensor(a!) self, Scalar other, Scalar alpha=1) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -5979,6 +6169,19 @@ SparseCsrCUDA: sparse_sampled_addmm_sparse_csr_cuda SparseCsrCPU: sparse_sampled_addmm_sparse_csr_cpu +- func: spmm_reduce(Tensor input, Tensor weight, str reduce, Tensor? row_indices=None, Tensor? ccol_indices=None, Tensor? csr2csc=None) -> Tensor + python_module: sparse + +- func: _spmm_reduce(Tensor input, Tensor weight, str reduce, Tensor? row_indices=None, Tensor? ccol_indices=None, Tensor? csr2csc=None) -> (Tensor, Tensor) + python_module: sparse + dispatch: + SparseCsrCPU: _spmm_reduce_sparse_csr_cpu + +- func: _spmm_reduce_backward(Tensor input, Tensor grad_out, Tensor weight, str reduce, Tensor arg_out, Tensor row_indices, Tensor ccol_indices, Tensor csr2csc, bool[2] output_mask) -> (Tensor, Tensor) + python_module: sparse + dispatch: + SparseCsrCPU: _spmm_reduce_backward_sparse_csr_cpu + - func: addmm.out(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!) structured: True dispatch: @@ -5997,6 +6200,7 @@ SparseCPU: addmm_sparse_dense_cpu SparseCUDA: addmm_sparse_dense_cuda SparseCsrCPU, SparseCsrCUDA: addmm_sparse_compressed_dense + tags: canonical - func: addmm_(Tensor(a!) self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1) -> Tensor(a!) structured_delegate: addmm.out @@ -6155,7 +6359,9 @@ - func: sparse_coo_tensor.indices_size(Tensor indices, Tensor values, int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor -- func: _sparse_coo_tensor_unsafe(Tensor indices, Tensor values, int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: _sparse_coo_tensor_unsafe(Tensor indices, Tensor values, SymInt[] size, *, ScalarType? dtype=None, Layout? 
layout=None, Device? device=None, bool? pin_memory=None) -> Tensor + dispatch: + CompositeImplicitAutograd: _sparse_coo_tensor_unsafe_symint - func: _validate_sparse_coo_tensor_args(Tensor indices, Tensor values, int[] size) -> () @@ -6170,9 +6376,9 @@ SparseCPU, SparseCUDA, SparseMeta, Meta: new_with_dims_sparse autogen: _sparse_coo_tensor_with_dims.out -- func: _sparse_coo_tensor_with_dims_and_tensors(int sparse_dim, int dense_dim, int[] size, Tensor indices, Tensor values, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=False) -> Tensor +- func: _sparse_coo_tensor_with_dims_and_tensors(int sparse_dim, int dense_dim, SymInt[] size, Tensor indices, Tensor values, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=False) -> Tensor dispatch: - SparseCPU, SparseCUDA, SparseMeta, Meta: new_with_dims_and_tensor_sparse + SparseCPU, SparseCUDA, SparseMeta, Meta: new_with_dims_and_tensor_sparse_symint autogen: _sparse_coo_tensor_with_dims_and_tensors.out - func: sparse_resize_(Tensor(a!) self, int[] size, int sparse_dim, int dense_dim) -> Tensor(a!) @@ -6217,6 +6423,7 @@ - func: sparse_dim(Tensor self) -> int variants: method dispatch: + CPU, CUDA: sparse_dim_strided SparseCPU, SparseCUDA, SparseMeta: sparse_dim_sparse SparseCsrCPU, SparseCsrCUDA: sparse_dim_sparse_csr device_check: NoCheck @@ -6233,6 +6440,7 @@ - func: dense_dim(Tensor self) -> int variants: method dispatch: + CPU, CUDA: dense_dim_strided SparseCPU, SparseCUDA, SparseMeta: dense_dim_sparse SparseCsrCPU, SparseCsrCUDA: dense_dim_sparse_csr device_check: NoCheck @@ -6312,6 +6520,7 @@ dispatch: SparseCPU, SparseCUDA, SparseMeta: values_sparse SparseCsrCPU, SparseCsrCUDA: values_sparse_csr + NestedTensorCPU, NestedTensorCUDA: values_nested device_check: NoCheck device_guard: False @@ -6360,11 +6569,12 @@ SparseCPU, SparseCUDA: copy_sparse_ autogen: copy_sparse_to_sparse, copy_sparse_to_sparse.out +# By adding the AutogradNestedTensor this makes this function CompositeImplicit-like for nested tensors - func: unbind.int(Tensor(a -> *) self, int dim=0) -> Tensor(a)[] variants: function, method dispatch: CompositeExplicitAutograd: unbind - NestedTensorCPU, NestedTensorCUDA: NestedTensor_unbind + CompositeImplicitAutogradNestedTensor: NestedTensor_unbind - func: unbind.Dimname(Tensor(a -> *) self, Dimname dim) -> Tensor(a)[] variants: function, method @@ -6421,7 +6631,7 @@ CPU: dense_to_mkldnn autogen: to_mkldnn.out -- func: mkldnn_reorder_conv2d_weight(Tensor self, int[2] padding=0, int[2] stride=1, int[2] dilation=1, int groups=1) -> Tensor +- func: mkldnn_reorder_conv2d_weight(Tensor self, int[2] padding=0, int[2] stride=1, int[2] dilation=1, int groups=1, int[]? 
input_size=None) -> Tensor variants: function python_module: nn dispatch: @@ -6563,6 +6773,8 @@ - func: _fake_quantize_learnable_per_tensor_affine_backward(Tensor grad, Tensor self, Tensor scale, Tensor zero_point, int quant_min, int quant_max, float grad_factor=1.0) -> (Tensor, Tensor, Tensor) variants: function + dispatch: + CPU, CUDA: _fake_quantize_learnable_per_tensor_affine_backward - func: fake_quantize_per_channel_affine(Tensor self, Tensor scale, Tensor zero_point, int axis, int quant_min, int quant_max) -> Tensor device_check: NoCheck # TensorIterator @@ -6585,6 +6797,8 @@ - func: _fake_quantize_learnable_per_channel_affine_backward(Tensor grad, Tensor self, Tensor scale, Tensor zero_point, int axis, int quant_min, int quant_max, float grad_factor=1.0) -> (Tensor, Tensor, Tensor) variants: function + dispatch: + CPU, CUDA: _fake_quantize_learnable_per_channel_affine_backward - func: fused_moving_avg_obs_fake_quant(Tensor self, Tensor observer_on, Tensor fake_quant_on, Tensor(a!) running_min, Tensor(b!) running_max, Tensor(c!) scale, Tensor(d!) zero_point, float averaging_const, int quant_min, int quant_max, int ch_axis, bool per_row_fake_quant=False, bool symmetric_quant=False) -> Tensor variants: function @@ -6617,7 +6831,9 @@ device_guard: False dispatch: CompositeExplicitAutograd: _to_copy + NestedTensorCPU, NestedTensorCUDA: _to_copy_nested autogen: _to_copy.out + tags: canonical # to(Device) must not exist because all constructors of Device also works for # TensorOptions. Otherwise, an ambiguity error is thrown. @@ -6785,7 +7001,9 @@ CompositeExplicitAutograd: _pack_padded_sequence autogen: _pack_padded_sequence.out -- func: _pack_padded_sequence_backward(Tensor grad, int[] input_size, Tensor batch_sizes, bool batch_first) -> Tensor +- func: _pack_padded_sequence_backward(Tensor grad, SymInt[] input_size, Tensor batch_sizes, bool batch_first) -> Tensor + dispatch: + CompositeImplicitAutograd: _pack_padded_sequence_backward_symint - func: _pad_packed_sequence(Tensor data, Tensor batch_sizes, bool batch_first, Scalar padding_value, int total_length) -> (Tensor, Tensor) @@ -6799,21 +7017,24 @@ CPU, CUDA, Meta, MPS: set_ autogen: set.source_Storage, set.source_Storage_out -- func: set_.source_Storage_storage_offset(Tensor(a!) self, Storage source, int storage_offset, int[] size, int[] stride=[]) -> Tensor(a!) +- func: set_.source_Storage_storage_offset(Tensor(a!) self, Storage source, SymInt storage_offset, SymInt[] size, SymInt[] stride=[]) -> Tensor(a!) variants: method device_check: NoCheck device_guard: False dispatch: - CPU, Meta: set_storage_cpu_ + CPU: set_storage_cpu_ + Meta: set_storage_meta__symint CUDA: set_storage_cuda_ MPS: set_storage_mps_ QuantizedCPU, QuantizedCUDA: set_storage_quantized_ autogen: set.source_Storage_storage_offset, set.source_Storage_storage_offset_out -- func: set_.source_Tensor_storage_offset(Tensor(a!) self, Tensor source, int storage_offset, int[] size, int[] stride=[]) -> Tensor(a!) +- func: set_.source_Tensor_storage_offset(Tensor(a!) self, Tensor source, SymInt storage_offset, SymInt[] size, SymInt[] stride=[]) -> Tensor(a!) variants: method device_check: NoCheck device_guard: False + dispatch: + CompositeImplicitAutograd: set__symint - func: set_.source_Tensor(Tensor(a!) self, Tensor source) -> Tensor(a!) 
variants: method @@ -6855,7 +7076,7 @@ - func: lift_fresh_copy(Tensor self) -> Tensor tags: view_copy dispatch: - CompositeExplicitAutograd: lift_fresh_copy + CompositeExplicitAutogradNonFunctional: lift_fresh_copy autogen: lift_fresh_copy.out - func: is_set_to(Tensor self, Tensor tensor) -> bool @@ -6872,6 +7093,7 @@ CPU: masked_fill__cpu CUDA: masked_fill__cuda QuantizedCPU: masked_fill__quantized_cpu + QuantizedCUDA: masked_fill__quantized_cuda MPS: masked_fill__mps autogen: masked_fill.Scalar_out @@ -6888,6 +7110,7 @@ CPU: masked_fill__cpu CUDA: masked_fill__cuda QuantizedCPU: masked_fill__quantized_cpu + QuantizedCUDA: masked_fill__quantized_cuda MPS: masked_fill__mps autogen: masked_fill.Tensor_out @@ -6921,21 +7144,15 @@ CPU: masked_softmax_backward_cpu autogen: _masked_softmax_backward.out -- func: view.SymInt(Tensor(a) self, SymInt[] size) -> Tensor(a) - variants: method - device_check: NoCheck - device_guard: False - dispatch: - CompositeExplicitAutograd: view_symint - MkldnnCPU: mkldnn_view_symint - -- func: view(Tensor(a) self, int[] size) -> Tensor(a) +- func: view(Tensor(a) self, SymInt[] size) -> Tensor(a) variants: method device_check: NoCheck device_guard: False dispatch: - ZeroTensor, CPU, CUDA, Meta, QuantizedCPU, QuantizedCUDA, MPS: view + ZeroTensor, Meta, CPU, CUDA, QuantizedCPU, QuantizedCUDA, MPS: view MkldnnCPU: mkldnn_view + NestedTensorCPU, NestedTensorCUDA: view_nested + tags: canonical # Warning: If you want to change the name or overload name of this # operator, you might also want to change the `isBlockListedSchema` @@ -7111,6 +7328,7 @@ - func: scatter_add(Tensor self, int dim, Tensor index, Tensor src) -> Tensor structured_delegate: scatter_add.out variants: function, method + tags: canonical - func: scatter_add_(Tensor(a!) self, int dim, Tensor index, Tensor src) -> Tensor(a!) structured_delegate: scatter_add.out @@ -7181,6 +7399,7 @@ device_check: NoCheck # TensorIterator variants: method, function structured_delegate: bitwise_and.Tensor_out + tags: canonical - func: bitwise_and_.Scalar(Tensor(a!) self, Scalar other) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -7236,6 +7455,7 @@ device_check: NoCheck # TensorIterator variants: method, function structured_delegate: bitwise_or.Tensor_out + tags: canonical - func: bitwise_or_.Scalar(Tensor(a!) self, Scalar other) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -7496,6 +7716,7 @@ - func: random_.from(Tensor(a!) self, int from, int? to, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator variants: method + tags: nondeterministic_seeded dispatch: CPU, CUDA: random_ Meta: random_meta_ @@ -7504,6 +7725,7 @@ - func: random_.to(Tensor(a!) self, int to, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: random_ @@ -7513,6 +7735,7 @@ - func: random_(Tensor(a!) self, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: random_ @@ -7521,6 +7744,7 @@ - func: uniform_(Tensor(a!) self, float from=0, float to=1, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: uniform_ @@ -7531,12 +7755,14 @@ - func: cauchy_(Tensor(a!) self, float median=0, float sigma=1, *, Generator? generator=None) -> Tensor(a!) 
device_check: NoCheck # TensorIterator variants: method + tags: nondeterministic_seeded dispatch: CPU, CUDA: cauchy_ autogen: cauchy, cauchy.out - func: log_normal_(Tensor(a!) self, float mean=1, float std=2, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: log_normal_ @@ -7544,6 +7770,7 @@ - func: exponential_(Tensor(a!) self, float lambd=1, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: exponential_ @@ -7552,6 +7779,7 @@ - func: geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: geometric_ @@ -7560,20 +7788,9 @@ autogen: geometric, geometric.out - func: diag.out(Tensor self, int diagonal=0, *, Tensor(a!) out) -> Tensor(a!) - dispatch: - CPU: diag_cpu_out - CUDA: diag_cuda_out - MPS: diag_mps_out - func: diag(Tensor self, int diagonal=0) -> Tensor variants: method, function - dispatch: - CompositeExplicitAutograd: diag - -- func: diag_backward(Tensor grad, int[] input_sizes, int diagonal) -> Tensor - variants: function - device_check: NoCheck - device_guard: False - func: cross.out(Tensor self, Tensor other, int? dim=None, *, Tensor(a!) out) -> Tensor(a!) @@ -7619,12 +7836,15 @@ dispatch: CPU: trace_cpu CUDA: trace_cuda + MPS: trace_mps_out autogen: trace.out -- func: trace_backward(Tensor grad, int[] sizes) -> Tensor +- func: trace_backward(Tensor grad, SymInt[] sizes) -> Tensor variants: function device_check: NoCheck device_guard: False + dispatch: + CompositeImplicitAutograd: trace_backward_symint - func: ne.Scalar_out(Tensor self, Scalar other, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -7700,6 +7920,7 @@ variants: method, function dispatch: QuantizedCPU: eq_quantized_cpu + tags: canonical - func: eq.Tensor_out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -7732,6 +7953,7 @@ variants: method, function dispatch: QuantizedCPU: ge_quantized_cpu + tags: canonical - func: ge.Tensor_out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -7791,6 +8013,7 @@ variants: method, function dispatch: QuantizedCPU: le_quantized_cpu + tags: canonical - func: le.Tensor_out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -7850,6 +8073,7 @@ variants: method, function dispatch: QuantizedCPU: gt_quantized_cpu + tags: canonical - func: gt.Tensor_out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -7909,6 +8133,7 @@ variants: method, function dispatch: QuantizedCPU: lt_quantized_cpu + tags: canonical - func: lt.Tensor_out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -7983,21 +8208,25 @@ SparseCPU: index_select_sparse_cpu SparseCUDA: index_select_sparse_cuda MPS: index_select_mps + tags: canonical - func: index_select.dimname_out(Tensor self, Dimname dim, Tensor index, *, Tensor(a!) out) -> Tensor(a!) 
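# --- Illustrative aside (hedged sketch, not part of the native_functions.yaml diff) ---
# The hunk above tags index_select as `canonical`, and the hunk that follows moves
# index_select_backward to SymInt[] self_sizes with a CompositeImplicitAutograd
# symint kernel. A minimal sketch of the unchanged eager-mode call, assuming only
# the public torch API and arbitrary example shapes:
#
#     import torch
#
#     x = torch.arange(12.0).reshape(3, 4)            # 3x4 input
#     idx = torch.tensor([0, 2])                       # rows to keep
#     rows = torch.index_select(x, dim=0, index=idx)   # shape (2, 4)
#
# The SymInt migration changes how sizes are carried through the dispatcher and the
# backward schema; the Python-facing signature shown above stays the same.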
- func: index_select.dimname(Tensor self, Dimname dim, Tensor index) -> Tensor variants: method, function -- func: index_select_backward(Tensor grad, int[] self_sizes, int dim, Tensor index) -> Tensor +- func: index_select_backward(Tensor grad, SymInt[] self_sizes, int dim, Tensor index) -> Tensor variants: function device_check: NoCheck device_guard: False + dispatch: + CompositeImplicitAutograd: index_select_backward_symint - func: masked_select.out(Tensor self, Tensor mask, *, Tensor(a!) out) -> Tensor(a!) dispatch: CPU: masked_select_out_cpu CUDA: masked_select_out_cuda + MPS: masked_select_out_mps tags: dynamic_output_shape - func: masked_select(Tensor self, Tensor mask) -> Tensor @@ -8005,6 +8234,7 @@ dispatch: CPU: masked_select_cpu CUDA: masked_select_cuda + MPS: masked_select_mps tags: dynamic_output_shape - func: masked_select_backward(Tensor grad, Tensor input, Tensor mask) -> Tensor @@ -8023,7 +8253,7 @@ dispatch: CPU: nonzero_cpu CUDA: nonzero_cuda - tags: dynamic_output_shape + tags: dynamic_output_shape, canonical - func: nonzero_numpy(Tensor self) -> Tensor[] variants: method, function @@ -8041,6 +8271,7 @@ - func: gather(Tensor self, int dim, Tensor index, *, bool sparse_grad=False) -> Tensor variants: method, function structured_delegate: gather.out + tags: canonical - func: gather_backward(Tensor grad, Tensor self, int dim, Tensor index, bool sparse_grad) -> Tensor variants: function @@ -8090,19 +8321,10 @@ device_check: NoCheck # TensorIterator variants: method -- func: cross_entropy_loss(Tensor self, Tensor target, Tensor? weight=None, int reduction=Mean, int ignore_index=-100, float label_smoothing=0.0) -> Tensor +- func: cross_entropy_loss(Tensor self, Tensor target, Tensor? weight=None, int reduction=Mean, SymInt ignore_index=-100, float label_smoothing=0.0) -> Tensor python_module: nn - -- func: lstsq.X(Tensor self, Tensor A, *, Tensor(a!) X, Tensor(b!) qr) -> (Tensor(a!) solution, Tensor(b!) QR) dispatch: - CPU: legacy_lstsq_out - CUDA: legacy_lstsq_out_cuda - -- func: lstsq(Tensor self, Tensor A) -> (Tensor solution, Tensor QR) - variants: method, function - dispatch: - CPU: legacy_lstsq - CUDA: legacy_lstsq_cuda + CompositeImplicitAutograd: cross_entropy_loss_symint - func: triangular_solve.X(Tensor self, Tensor A, bool upper=True, bool transpose=False, bool unitriangular=False, *, Tensor(a!) X, Tensor(b!) M) -> (Tensor(a!) solution, Tensor(b!) cloned_coefficient) structured: True @@ -8149,15 +8371,6 @@ CUDA: _symeig_helper_cuda autogen: _symeig_helper.out -- func: eig.e(Tensor self, bool eigenvectors=False, *, Tensor(a!) e, Tensor(b!) v) -> (Tensor(a!) eigenvalues, Tensor(b!) eigenvectors) - dispatch: - CompositeExplicitAutograd: eig_out - -- func: eig(Tensor self, bool eigenvectors=False) -> (Tensor eigenvalues, Tensor eigenvectors) - variants: method, function - dispatch: - CompositeExplicitAutograd: eig - - func: svd.U(Tensor self, bool some=True, bool compute_uv=True, *, Tensor(a!) U, Tensor(b!) S, Tensor(c!) V) -> (Tensor(a!) U, Tensor(b!) S, Tensor(c!) V) - func: svd(Tensor self, bool some=True, bool compute_uv=True) -> (Tensor U, Tensor S, Tensor V) @@ -8271,13 +8484,16 @@ # TODO: remove dispatch section when porting TH CUDA to ATen - func: multinomial.out(Tensor self, int num_samples, bool replacement=False, *, Generator? generator=None, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CPU, CUDA: multinomial_out + MPS: multinomial_out_mps - func: multinomial(Tensor self, int num_samples, bool replacement=False, *, Generator? 
generator=None) -> Tensor variants: method, function dispatch: CPU, CUDA: multinomial + MPS: multinomial_mps tags: nondeterministic_seeded - func: lgamma.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) @@ -8405,6 +8621,7 @@ dispatch: CPU: signbit_out CUDA: signbit_out + MPS: signbit_out_mps SparseCPU, SparseCUDA: signbit_sparse_out SparseCsrCPU, SparseCsrCUDA: signbit_sparse_csr_out @@ -8681,13 +8898,6 @@ MPS: max_mps QuantizedCPU: max_quantized_cpu -# Not to be confused with binary op `max.out`. Commented because of failed CI -# FIXME: enable this -#- func: max.unary_out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) -# device_check: NoCheck # TensorIterator -# dispatch: -# CompositeExplicitAutograd: max_unary_out - - func: fmax(Tensor self, Tensor other) -> Tensor structured_delegate: fmax.out device_check: NoCheck # TensorIterator @@ -8704,6 +8914,7 @@ structured_delegate: maximum.out device_check: NoCheck # TensorIterator variants: method, function + tags: canonical - func: maximum.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -8722,10 +8933,17 @@ - func: max.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) device_check: NoCheck # TensorIterator +- func: max.unary_out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) + device_check: NoCheck # TensorIterator + dispatch: + CPU, CUDA: max_unary_out + QuantizedCPU: max_quantized_unary_out + - func: minimum(Tensor self, Tensor other) -> Tensor structured_delegate: minimum.out device_check: NoCheck # TensorIterator variants: method, function + tags: canonical - func: minimum.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) structured: True @@ -8878,7 +9096,7 @@ CPU, CUDA, Meta: unfold QuantizedCPU, QuantizedCUDA: unfold -- func: unfold_backward(Tensor grad_in, int[] input_sizes, int dim, int size, int step) -> Tensor +- func: unfold_backward(Tensor grad_in, SymInt[] input_sizes, int dim, int size, int step) -> Tensor variants: function dispatch: CPU, CUDA: unfold_backward @@ -8931,6 +9149,7 @@ variants: function, method dispatch: SparseCPU, SparseCUDA: pow_sparse_scalar + tags: canonical - func: pow_.Scalar(Tensor(a!) self, Scalar exponent) -> Tensor(a!) device_check: NoCheck # TensorIterator @@ -8964,6 +9183,7 @@ - func: normal_(Tensor(a!) self, float mean=0, float std=1, *, Generator? generator=None) -> Tensor(a!) device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded variants: method dispatch: CPU, CUDA: normal_ @@ -8977,10 +9197,12 @@ # but we can't due to overload ambiguity with normal.Tensor_float. - func: normal_functional(Tensor self, float mean=0, float std=1, *, Generator? generator=None) -> Tensor device_check: NoCheck # TensorIterator + tags: nondeterministic_seeded dispatch: CompositeExplicitAutograd: normal_functional - func: normal.Tensor_float_out(Tensor mean, float std=1, *, Generator? generator=None, Tensor(a!) out) -> Tensor(a!) + tags: nondeterministic_seeded dispatch: CPU, CUDA: normal_out MPS: normal_mps_out @@ -8998,6 +9220,7 @@ CPU, CUDA: normal_out Meta: normal_out_meta MPS: normal_mps_out + tags: nondeterministic_seeded - func: normal.float_Tensor(float mean, Tensor std, *, Generator? generator=None) -> Tensor dispatch: @@ -9011,6 +9234,7 @@ CPU, CUDA: normal_out Meta: normal_out_meta MPS: normal_mps_out + tags: nondeterministic_seeded - func: normal.Tensor_Tensor(Tensor mean, Tensor std, *, Generator? 
generator=None) -> Tensor dispatch: @@ -9019,13 +9243,15 @@ Meta: normal_meta tags: nondeterministic_seeded -- func: normal.float_float(float mean, float std, int[] size, *, Generator? generator=None, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: normal.float_float(float mean, float std, SymInt[] size, *, Generator? generator=None, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor dispatch: CompositeExplicitAutograd: normal + tags: nondeterministic_seeded -- func: normal.float_float_out(float mean, float std, int[] size, *, Generator? generator=None, Tensor(a!) out) -> Tensor(a!) +- func: normal.float_float_out(float mean, float std, SymInt[] size, *, Generator? generator=None, Tensor(a!) out) -> Tensor(a!) dispatch: CompositeExplicitAutograd: normal_out + tags: nondeterministic_seeded - func: alias(Tensor(a) self) -> Tensor(a) variants: method, function @@ -9689,6 +9915,14 @@ CUDA: foreach_tensor_addcdiv_scalarlist_cuda_ autogen: _foreach_addcdiv.ScalarList_out +- func: _foreach_addcdiv_.Tensor(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, Tensor scalars) -> () + device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices + variants: function + dispatch: + CPU: foreach_tensor_addcdiv_tensor_slow_ + CUDA: foreach_tensor_addcdiv_tensor_cuda_ + autogen: _foreach_addcdiv.Tensor_out + - func: _foreach_addcmul_.ScalarList(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, Scalar[] scalars) -> () device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices variants: function @@ -9697,6 +9931,14 @@ CUDA: foreach_tensor_addcmul_scalarlist_cuda_ autogen: _foreach_addcmul.ScalarList_out +- func: _foreach_addcmul_.Tensor(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, Tensor scalars) -> () + device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices + variants: function + dispatch: + CPU: foreach_tensor_addcmul_tensor_slow_ + CUDA: foreach_tensor_addcmul_tensor_cuda_ + autogen: _foreach_addcmul.Tensor_out + - func: _foreach_addcdiv.Scalar(Tensor[] self, Tensor[] tensor1, Tensor[] tensor2, Scalar value=1) -> Tensor[] device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices variants: function @@ -9718,6 +9960,13 @@ CPU: foreach_tensor_addcdiv_scalarlist_slow CUDA: foreach_tensor_addcdiv_scalarlist_cuda +- func: _foreach_addcdiv.Tensor(Tensor[] self, Tensor[] tensor1, Tensor[] tensor2, Tensor scalars) -> Tensor[] + device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices + variants: function + dispatch: + CPU: foreach_tensor_addcdiv_tensor_slow + CUDA: foreach_tensor_addcdiv_tensor_cuda + - func: _foreach_addcmul.ScalarList(Tensor[] self, Tensor[] tensor1, Tensor[] tensor2, Scalar[] scalars) -> Tensor[] device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices variants: function @@ -9725,6 +9974,13 @@ CPU: foreach_tensor_addcmul_scalarlist_slow CUDA: foreach_tensor_addcmul_scalarlist_cuda +- func: _foreach_addcmul.Tensor(Tensor[] self, Tensor[] tensor1, Tensor[] tensor2, Tensor scalars) -> Tensor[] + device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices + variants: function + dispatch: + CPU: foreach_tensor_addcmul_tensor_slow + CUDA: foreach_tensor_addcmul_tensor_cuda + - func: 
_foreach_maximum.List(Tensor[] self, Tensor[] other) -> Tensor[] device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices variants: function @@ -9784,17 +10040,6 @@ CPU: searchsorted_cpu CUDA: searchsorted_cuda -# [Note about _torch_cuda_cu_linker_symbol_op and torch_cuda_cu] -# This is a DUMMY function to force the linking against torch_cuda_cu on Windows. -# Otherwise, the Windows linker will optimize and not include torch_cuda_cu even when we -# want it to be included. This is similar to what we do with warp_size for torch_cuda_cpp, -# described as the solution to this issue: https://github.com/pytorch/pytorch/issues/31611 -# This op should NOT be used or exposed or edited or else Windows builds (with BUILD_SPLIT_CUDA) will break. -- func: _torch_cuda_cu_linker_symbol_op(Tensor self) -> Tensor - dispatch: - CUDA: _torch_cuda_cu_linker_symbol_op_cuda - autogen: _torch_cuda_cu_linker_symbol_op.out - - func: searchsorted.Tensor_out(Tensor sorted_sequence, Tensor self, *, bool out_int32=False, bool right=False, str? side=None, Tensor? sorter=None, Tensor(a!) out) -> Tensor(a!) dispatch: CPU: searchsorted_out_cpu @@ -9909,16 +10154,20 @@ CPU: multilabel_margin_loss_backward_cpu CUDA: multilabel_margin_loss_backward_cuda -- func: nll_loss.out(Tensor self, Tensor target, Tensor? weight=None, int reduction=Mean, int ignore_index=-100, *, Tensor(a!) out) -> Tensor(a!) +- func: nll_loss.out(Tensor self, Tensor target, Tensor? weight=None, int reduction=Mean, SymInt ignore_index=-100, *, Tensor(a!) out) -> Tensor(a!) python_module: nn -- func: nll_loss_nd(Tensor self, Tensor target, Tensor? weight=None, int reduction=Mean, int ignore_index=-100) -> Tensor +- func: nll_loss_nd(Tensor self, Tensor target, Tensor? weight=None, int reduction=Mean, SymInt ignore_index=-100) -> Tensor python_module: nn + dispatch: + CompositeImplicitAutograd: nll_loss_nd_symint -- func: nll_loss(Tensor self, Tensor target, Tensor? weight=None, int reduction=Mean, int ignore_index=-100) -> Tensor +- func: nll_loss(Tensor self, Tensor target, Tensor? weight=None, int reduction=Mean, SymInt ignore_index=-100) -> Tensor python_module: nn + dispatch: + CompositeImplicitAutograd: nll_loss_symint -- func: nll_loss_forward.output(Tensor self, Tensor target, Tensor? weight, int reduction, int ignore_index, *, Tensor(a!) output, Tensor(b!) total_weight) -> (Tensor(a!), Tensor(b!)) +- func: nll_loss_forward.output(Tensor self, Tensor target, Tensor? weight, int reduction, SymInt ignore_index, *, Tensor(a!) output, Tensor(b!) total_weight) -> (Tensor(a!), Tensor(b!)) python_module: nn structured: True dispatch: @@ -9926,11 +10175,11 @@ CUDA: nll_loss_forward_out_cuda MPS: nll_loss_forward_out_mps -- func: nll_loss_forward(Tensor self, Tensor target, Tensor? weight, int reduction, int ignore_index) -> (Tensor output, Tensor total_weight) +- func: nll_loss_forward(Tensor self, Tensor target, Tensor? weight, int reduction, SymInt ignore_index) -> (Tensor output, Tensor total_weight) python_module: nn structured_delegate: nll_loss_forward.output -- func: nll_loss_backward.grad_input(Tensor grad_output, Tensor self, Tensor target, Tensor? weight, int reduction, int ignore_index, Tensor total_weight, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: nll_loss_backward.grad_input(Tensor grad_output, Tensor self, Tensor target, Tensor? weight, int reduction, SymInt ignore_index, Tensor total_weight, *, Tensor(a!) grad_input) -> Tensor(a!) 
python_module: nn structured: True dispatch: @@ -9938,38 +10187,40 @@ CUDA: nll_loss_backward_out_cuda MPS: nll_loss_backward_out_mps -- func: nll_loss_backward(Tensor grad_output, Tensor self, Tensor target, Tensor? weight, int reduction, int ignore_index, Tensor total_weight) -> Tensor +- func: nll_loss_backward(Tensor grad_output, Tensor self, Tensor target, Tensor? weight, int reduction, SymInt ignore_index, Tensor total_weight) -> Tensor python_module: nn structured_delegate: nll_loss_backward.grad_input -- func: nll_loss2d.out(Tensor self, Tensor target, Tensor? weight=None, int reduction=Mean, int ignore_index=-100, *, Tensor(a!) out) -> Tensor(a!) +- func: nll_loss2d.out(Tensor self, Tensor target, Tensor? weight=None, int reduction=Mean, SymInt ignore_index=-100, *, Tensor(a!) out) -> Tensor(a!) python_module: nn -- func: nll_loss2d(Tensor self, Tensor target, Tensor? weight=None, int reduction=Mean, int ignore_index=-100) -> Tensor +- func: nll_loss2d(Tensor self, Tensor target, Tensor? weight=None, int reduction=Mean, SymInt ignore_index=-100) -> Tensor python_module: nn + dispatch: + CompositeImplicitAutograd: nll_loss2d_symint -- func: nll_loss2d_forward.output(Tensor self, Tensor target, Tensor? weight, int reduction, int ignore_index, *, Tensor(a!) output, Tensor(b!) total_weight) -> (Tensor(a!), Tensor(b!)) +- func: nll_loss2d_forward.output(Tensor self, Tensor target, Tensor? weight, int reduction, SymInt ignore_index, *, Tensor(a!) output, Tensor(b!) total_weight) -> (Tensor(a!), Tensor(b!)) python_module: nn dispatch: CPU: nll_loss2d_forward_out_cpu CUDA: nll_loss2d_forward_out_cuda MPS: nll_loss2d_forward_out_mps -- func: nll_loss2d_forward(Tensor self, Tensor target, Tensor? weight, int reduction, int ignore_index) -> (Tensor output, Tensor total_weight) +- func: nll_loss2d_forward(Tensor self, Tensor target, Tensor? weight, int reduction, SymInt ignore_index) -> (Tensor output, Tensor total_weight) python_module: nn dispatch: CPU: nll_loss2d_forward_cpu CUDA: nll_loss2d_forward_cuda MPS: nll_loss2d_forward_mps -- func: nll_loss2d_backward.grad_input(Tensor grad_output, Tensor self, Tensor target, Tensor? weight, int reduction, int ignore_index, Tensor total_weight, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: nll_loss2d_backward.grad_input(Tensor grad_output, Tensor self, Tensor target, Tensor? weight, int reduction, SymInt ignore_index, Tensor total_weight, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn dispatch: CPU: nll_loss2d_backward_out_cpu CUDA: nll_loss2d_backward_out_cuda MPS: nll_loss2d_backward_out_mps -- func: nll_loss2d_backward(Tensor grad_output, Tensor self, Tensor target, Tensor? weight, int reduction, int ignore_index, Tensor total_weight) -> Tensor +- func: nll_loss2d_backward(Tensor grad_output, Tensor self, Tensor target, Tensor? weight, int reduction, SymInt ignore_index, Tensor total_weight) -> Tensor python_module: nn dispatch: CPU: nll_loss2d_backward_cpu @@ -10160,6 +10411,7 @@ dispatch: CPU, CUDA, MPS: hardtanh QuantizedCPU: hardtanh_quantized_cpu + tags: canonical - func: hardtanh_backward.grad_input(Tensor grad_output, Tensor self, Scalar min_val, Scalar max_val, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn @@ -10185,23 +10437,27 @@ python_module: nn dispatch: CPU, CUDA: hardswish_out + MPS: hardswish_out_mps - func: hardswish(Tensor self) -> Tensor device_check: NoCheck # TensorIterator python_module: nn dispatch: CPU, CUDA: hardswish + MPS: hardswish_mps - func: hardswish_(Tensor(a!) self) -> Tensor(a!) 
device_check: NoCheck # TensorIterator python_module: nn dispatch: CPU, CUDA: hardswish_ + MPS: hardswish_mps_ - func: hardswish_backward(Tensor grad_output, Tensor self) -> Tensor python_module: nn dispatch: CPU, CUDA: hardswish_backward + MPS: hardswish_backward_mps autogen: hardswish_backward.out - func: leaky_relu.out(Tensor self, Scalar negative_slope=0.01, *, Tensor(a!) out) -> Tensor(a!) @@ -10220,6 +10476,7 @@ python_module: nn dispatch: QuantizedCPU: leaky_relu_quantized_cpu + tags: canonical - func: leaky_relu_backward.grad_input(Tensor grad_output, Tensor self, Scalar negative_slope, bool self_is_result, *, Tensor(a!) grad_input) -> Tensor(a!) structured: True @@ -10276,6 +10533,7 @@ - func: rrelu_with_noise.out(Tensor self, Tensor noise, Scalar lower=0.125, Scalar upper=0.3333333333333333, bool training=False, Generator? generator=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn + tags: nondeterministic_seeded dispatch: CPU: rrelu_with_noise_out_cpu CUDA: rrelu_with_noise_out_cuda @@ -10295,6 +10553,7 @@ - func: rrelu_with_noise_(Tensor(a!) self, Tensor noise, Scalar lower=0.125, Scalar upper=0.3333333333333333, bool training=False, Generator? generator=None) -> Tensor(a!) python_module: nn + tags: nondeterministic_seeded dispatch: CPU: rrelu_with_noise_cpu_ CUDA: rrelu_with_noise_cuda_ @@ -10349,7 +10608,7 @@ structured_delegate: softshrink_backward.grad_input python_module: nn -- func: adaptive_avg_pool2d.out(Tensor self, int[2] output_size, *, Tensor(a!) out) -> Tensor(a!) +- func: adaptive_avg_pool2d.out(Tensor self, SymInt[2] output_size, *, Tensor(a!) out) -> Tensor(a!) python_module: nn dispatch: CPU: adaptive_avg_pool2d_out_cpu @@ -10357,8 +10616,10 @@ MPS: adaptive_avg_pool2d_out_mps MkldnnCPU: mkldnn_adaptive_avg_pool2d_out_stub -- func: adaptive_avg_pool2d(Tensor self, int[2] output_size) -> Tensor +- func: adaptive_avg_pool2d(Tensor self, SymInt[2] output_size) -> Tensor python_module: nn + dispatch: + CompositeImplicitAutograd: adaptive_avg_pool2d_symint - func: mkldnn_adaptive_avg_pool2d(Tensor self, int[2] output_size) -> Tensor dispatch: @@ -10373,7 +10634,7 @@ MkldnnCPU: mkldnn_adaptive_avg_pool2d_backward autogen: mkldnn_adaptive_avg_pool2d_backward.out -- func: _adaptive_avg_pool2d(Tensor self, int[2] output_size) -> Tensor +- func: _adaptive_avg_pool2d(Tensor self, SymInt[2] output_size) -> Tensor dispatch: CPU: adaptive_avg_pool2d_cpu CUDA: adaptive_avg_pool2d_cuda @@ -10381,6 +10642,7 @@ QuantizedCPU: adaptive_avg_pool2d_quantized_cpu QuantizedCUDA: adaptive_avg_pool2d_quantized_cuda autogen: _adaptive_avg_pool2d.out + tags: canonical - func: _adaptive_avg_pool2d_backward(Tensor grad_output, Tensor self) -> Tensor python_module: nn @@ -10389,18 +10651,21 @@ CUDA: adaptive_avg_pool2d_backward_cuda MPS: adaptive_avg_pool2d_backward_mps autogen: _adaptive_avg_pool2d_backward.out + tags: canonical -- func: adaptive_avg_pool3d.out(Tensor self, int[3] output_size, *, Tensor(a!) out) -> Tensor(a!) +- func: adaptive_avg_pool3d.out(Tensor self, SymInt[3] output_size, *, Tensor(a!) out) -> Tensor(a!) 
python_module: nn dispatch: CPU: adaptive_avg_pool3d_out_cpu CUDA: adaptive_avg_pool3d_out_cuda QuantizedCPU: adaptive_avg_pool3d_out_quantized_cpu -- func: adaptive_avg_pool3d(Tensor self, int[3] output_size) -> Tensor +- func: adaptive_avg_pool3d(Tensor self, SymInt[3] output_size) -> Tensor python_module: nn + dispatch: + CompositeImplicitAutograd: adaptive_avg_pool3d_symint -- func: _adaptive_avg_pool3d(Tensor self, int[3] output_size) -> Tensor +- func: _adaptive_avg_pool3d(Tensor self, SymInt[3] output_size) -> Tensor dispatch: CPU: adaptive_avg_pool3d_cpu CUDA: adaptive_avg_pool3d_cuda @@ -10489,6 +10754,7 @@ dispatch: MkldnnCPU: mkldnn_avg_pool2d QuantizedCPU: avg_pool2d_quantized_cpu + tags: canonical - func: avg_pool2d_backward.grad_input(Tensor grad_output, Tensor self, int[2] kernel_size, int[2] stride, int[2] padding, bool ceil_mode, bool count_include_pad, int? divisor_override, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn @@ -10504,6 +10770,7 @@ structured_delegate: avg_pool2d_backward.grad_input dispatch: MkldnnCPU: mkldnn_avg_pool2d_backward + tags: canonical - func: avg_pool3d.out(Tensor self, int[3] kernel_size, int[3] stride=[], int[3] padding=0, bool ceil_mode=False, bool count_include_pad=True, int? divisor_override=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn @@ -10600,6 +10867,7 @@ - func: max_pool2d_with_indices(Tensor self, int[2] kernel_size, int[2] stride=[], int[2] padding=0, int[2] dilation=1, bool ceil_mode=False) -> (Tensor, Tensor) python_module: nn structured_delegate: max_pool2d_with_indices.out + tags: canonical - func: max_pool2d_with_indices_backward.grad_input(Tensor grad_output, Tensor self, int[2] kernel_size, int[2] stride, int[2] padding, int[2] dilation, bool ceil_mode, Tensor indices, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn @@ -10612,6 +10880,7 @@ - func: max_pool2d_with_indices_backward(Tensor grad_output, Tensor self, int[2] kernel_size, int[2] stride, int[2] padding, int[2] dilation, bool ceil_mode, Tensor indices) -> Tensor python_module: nn structured_delegate: max_pool2d_with_indices_backward.grad_input + tags: canonical # Return: (Tensor output, Tensor indices) - func: max_pool3d_with_indices.out(Tensor self, int[3] kernel_size, int[3] stride=[], int[3] padding=0, int[3] dilation=1, bool ceil_mode=False, *, Tensor(a!) out, Tensor(b!) indices) -> (Tensor(a!), Tensor(b!)) @@ -10663,7 +10932,7 @@ CPU: max_unpooling3d_forward_cpu CUDA: max_unpooling3d_forward_cuda -- func: reflection_pad1d.out(Tensor self, int[2] padding, *, Tensor(a!) out) -> Tensor(a!) +- func: reflection_pad1d.out(Tensor self, SymInt[2] padding, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: @@ -10672,11 +10941,11 @@ CUDA: reflection_pad1d_out_cuda MPS: reflection_pad1d_out_mps -- func: reflection_pad1d(Tensor self, int[2] padding) -> Tensor +- func: reflection_pad1d(Tensor self, SymInt[2] padding) -> Tensor python_module: nn structured_delegate: reflection_pad1d.out -- func: reflection_pad1d_backward.grad_input(Tensor grad_output, Tensor self, int[2] padding, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: reflection_pad1d_backward.grad_input(Tensor grad_output, Tensor self, SymInt[2] padding, *, Tensor(a!) grad_input) -> Tensor(a!) 
python_module: nn structured: True dispatch: @@ -10684,18 +10953,18 @@ CUDA: reflection_pad1d_backward_out_cuda MPS: reflection_pad1d_backward_out_mps -- func: reflection_pad1d_backward(Tensor grad_output, Tensor self, int[2] padding) -> Tensor +- func: reflection_pad1d_backward(Tensor grad_output, Tensor self, SymInt[2] padding) -> Tensor python_module: nn structured_delegate: reflection_pad1d_backward.grad_input -- func: reflection_pad2d.out(Tensor self, int[4] padding, *, Tensor(a!) out) -> Tensor(a!) +- func: reflection_pad2d.out(Tensor self, SymInt[4] padding, *, Tensor(a!) out) -> Tensor(a!) python_module: nn dispatch: CPU, QuantizedCPU: reflection_pad2d_out_cpu CUDA: reflection_pad2d_out_cuda MPS: reflection_pad2d_out_mps -- func: reflection_pad2d(Tensor self, int[4] padding) -> Tensor +- func: reflection_pad2d(Tensor self, SymInt[4] padding) -> Tensor python_module: nn dispatch: CPU: reflection_pad2d_cpu @@ -10703,21 +10972,21 @@ CUDA: reflection_pad2d_cuda MPS: reflection_pad2d_mps -- func: reflection_pad2d_backward.grad_input(Tensor grad_output, Tensor self, int[4] padding, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: reflection_pad2d_backward.grad_input(Tensor grad_output, Tensor self, SymInt[4] padding, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn dispatch: CPU: reflection_pad2d_backward_out_cpu CUDA: reflection_pad2d_backward_out_cuda MPS: reflection_pad2d_backward_out_mps -- func: reflection_pad2d_backward(Tensor grad_output, Tensor self, int[4] padding) -> Tensor +- func: reflection_pad2d_backward(Tensor grad_output, Tensor self, SymInt[4] padding) -> Tensor python_module: nn dispatch: CPU: reflection_pad2d_backward_cpu CUDA: reflection_pad2d_backward_cuda MPS: reflection_pad2d_backward_mps -- func: reflection_pad3d.out(Tensor self, int[6] padding, *, Tensor(a!) out) -> Tensor(a!) +- func: reflection_pad3d.out(Tensor self, SymInt[6] padding, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: @@ -10725,11 +10994,11 @@ CUDA: reflection_pad3d_out_cuda MPS: reflection_pad3d_out_mps -- func: reflection_pad3d(Tensor self, int[6] padding) -> Tensor +- func: reflection_pad3d(Tensor self, SymInt[6] padding) -> Tensor python_module: nn structured_delegate: reflection_pad3d.out -- func: reflection_pad3d_backward.grad_input(Tensor grad_output, Tensor self, int[6] padding, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: reflection_pad3d_backward.grad_input(Tensor grad_output, Tensor self, SymInt[6] padding, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn structured: True dispatch: @@ -10737,11 +11006,11 @@ CUDA: reflection_pad3d_backward_out_cuda MPS: reflection_pad3d_backward_out_mps -- func: reflection_pad3d_backward(Tensor grad_output, Tensor self, int[6] padding) -> Tensor +- func: reflection_pad3d_backward(Tensor grad_output, Tensor self, SymInt[6] padding) -> Tensor python_module: nn structured_delegate: reflection_pad3d_backward.grad_input -- func: replication_pad1d.out(Tensor self, int[2] padding, *, Tensor(a!) out) -> Tensor(a!) +- func: replication_pad1d.out(Tensor self, SymInt[2] padding, *, Tensor(a!) out) -> Tensor(a!) 
python_module: nn structured: True dispatch: @@ -10749,11 +11018,11 @@ CUDA: replication_pad1d_out_cuda MPS: replication_pad1d_out_mps -- func: replication_pad1d(Tensor self, int[2] padding) -> Tensor +- func: replication_pad1d(Tensor self, SymInt[2] padding) -> Tensor python_module: nn structured_delegate: replication_pad1d.out -- func: replication_pad1d_backward.grad_input(Tensor grad_output, Tensor self, int[2] padding, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: replication_pad1d_backward.grad_input(Tensor grad_output, Tensor self, SymInt[2] padding, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn structured: True dispatch: @@ -10761,11 +11030,11 @@ CUDA: replication_pad1d_backward_out_cuda MPS: replication_pad1d_backward_out_mps -- func: replication_pad1d_backward(Tensor grad_output, Tensor self, int[2] padding) -> Tensor +- func: replication_pad1d_backward(Tensor grad_output, Tensor self, SymInt[2] padding) -> Tensor python_module: nn structured_delegate: replication_pad1d_backward.grad_input -- func: replication_pad2d.out(Tensor self, int[4] padding, *, Tensor(a!) out) -> Tensor(a!) +- func: replication_pad2d.out(Tensor self, SymInt[4] padding, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: @@ -10773,25 +11042,25 @@ CUDA: replication_pad2d_out_cuda MPS: replication_pad2d_out_mps -- func: replication_pad2d(Tensor self, int[4] padding) -> Tensor +- func: replication_pad2d(Tensor self, SymInt[4] padding) -> Tensor python_module: nn structured_delegate: replication_pad2d.out -- func: replication_pad2d_backward.grad_input(Tensor grad_output, Tensor self, int[4] padding, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: replication_pad2d_backward.grad_input(Tensor grad_output, Tensor self, SymInt[4] padding, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn dispatch: CPU: replication_pad2d_backward_out_cpu CUDA: replication_pad2d_backward_out_cuda MPS: replication_pad2d_backward_out_mps -- func: replication_pad2d_backward(Tensor grad_output, Tensor self, int[4] padding) -> Tensor +- func: replication_pad2d_backward(Tensor grad_output, Tensor self, SymInt[4] padding) -> Tensor python_module: nn dispatch: CPU: replication_pad2d_backward_cpu CUDA: replication_pad2d_backward_cuda MPS: replication_pad2d_backward_mps -- func: replication_pad3d.out(Tensor self, int[6] padding, *, Tensor(a!) out) -> Tensor(a!) +- func: replication_pad3d.out(Tensor self, SymInt[6] padding, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: @@ -10799,207 +11068,113 @@ CUDA: replication_pad3d_out_cuda MPS: replication_pad3d_out_mps -- func: replication_pad3d(Tensor self, int[6] padding) -> Tensor +- func: replication_pad3d(Tensor self, SymInt[6] padding) -> Tensor python_module: nn structured_delegate: replication_pad3d.out -- func: replication_pad3d_backward.grad_input(Tensor grad_output, Tensor self, int[6] padding, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: replication_pad3d_backward.grad_input(Tensor grad_output, Tensor self, SymInt[6] padding, *, Tensor(a!) grad_input) -> Tensor(a!) 
python_module: nn dispatch: CPU: replication_pad3d_backward_out_cpu CUDA: replication_pad3d_backward_out_cuda MPS: replication_pad3d_backward_out_mps -- func: replication_pad3d_backward(Tensor grad_output, Tensor self, int[6] padding) -> Tensor +- func: replication_pad3d_backward(Tensor grad_output, Tensor self, SymInt[6] padding) -> Tensor python_module: nn dispatch: CPU: replication_pad3d_backward_cpu CUDA: replication_pad3d_backward_cuda MPS: replication_pad3d_backward_mps -- func: _pad_circular(Tensor self, int[] pad) -> Tensor - python_module: nn - -- func: _pad_enum(Tensor self, int[] pad, int mode, float? value=None) -> Tensor - python_module: nn - -- func: pad(Tensor self, int[] pad, str mode="constant", float? value=None) -> Tensor - python_module: nn - -- func: upsample_linear1d.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor +- func: _pad_circular(Tensor self, SymInt[] pad) -> Tensor python_module: nn dispatch: - CompositeExplicitAutograd: upsample_linear1d - autogen: upsample_linear1d.vec_out + CompositeImplicitAutograd: _pad_circular_symint -- func: upsample_linear1d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor +- func: _pad_enum(Tensor self, SymInt[] pad, int mode, float? value=None) -> Tensor python_module: nn dispatch: - CompositeExplicitAutograd: upsample_linear1d_backward - autogen: upsample_linear1d_backward.vec_out + CompositeImplicitAutograd: _pad_enum_symint -- func: upsample_bilinear2d.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor +- func: pad(Tensor self, SymInt[] pad, str mode="constant", float? value=None) -> Tensor python_module: nn dispatch: - CompositeExplicitAutograd: upsample_bilinear2d - autogen: upsample_bilinear2d.vec_out + CompositeImplicitAutograd: pad_symint -- func: upsample_bilinear2d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor +- func: upsample_linear1d.vec(Tensor input, SymInt[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn - dispatch: - CompositeExplicitAutograd: upsample_bilinear2d_backward - autogen: upsample_bilinear2d_backward.vec_out + autogen: upsample_linear1d.vec_out -- func: _upsample_bilinear2d_aa.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor +- func: upsample_bilinear2d.vec(Tensor input, SymInt[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn - dispatch: - CompositeExplicitAutograd: _upsample_bilinear2d_aa - autogen: _upsample_bilinear2d_aa.vec_out + autogen: upsample_bilinear2d.vec_out + tags: canonical -- func: _upsample_bilinear2d_aa_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor +- func: _upsample_bilinear2d_aa.vec(Tensor input, SymInt[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn - dispatch: - CompositeExplicitAutograd: _upsample_bilinear2d_aa_backward - autogen: _upsample_bilinear2d_aa_backward.vec_out + autogen: _upsample_bilinear2d_aa.vec_out -- func: upsample_trilinear3d.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor +- func: upsample_trilinear3d.vec(Tensor input, SymInt[]? output_size, bool align_corners, float[]? 
scale_factors) -> Tensor python_module: nn - dispatch: - CompositeExplicitAutograd: upsample_trilinear3d autogen: upsample_trilinear3d.vec_out -- func: upsample_trilinear3d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor +- func: upsample_bicubic2d.vec(Tensor input, SymInt[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn - dispatch: - CompositeExplicitAutograd: upsample_trilinear3d_backward - autogen: upsample_trilinear3d_backward.vec_out - -- func: upsample_bicubic2d.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor - python_module: nn - dispatch: - CompositeExplicitAutograd: upsample_bicubic2d autogen: upsample_bicubic2d.vec_out -- func: upsample_bicubic2d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor +- func: _upsample_bicubic2d_aa.vec(Tensor input, SymInt[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor python_module: nn - dispatch: - CompositeExplicitAutograd: upsample_bicubic2d_backward - autogen: upsample_bicubic2d_backward.vec_out - -- func: _upsample_bicubic2d_aa.vec(Tensor input, int[]? output_size, bool align_corners, float[]? scale_factors) -> Tensor - python_module: nn - dispatch: - CompositeExplicitAutograd: _upsample_bicubic2d_aa autogen: _upsample_bicubic2d_aa.vec_out -- func: _upsample_bicubic2d_aa_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, bool align_corners, float[]? scale_factors) -> Tensor +- func: upsample_nearest1d.vec(Tensor input, SymInt[]? output_size, float[]? scale_factors) -> Tensor python_module: nn - dispatch: - CompositeExplicitAutograd: _upsample_bicubic2d_aa_backward - autogen: _upsample_bicubic2d_aa_backward.vec_out - -- func: upsample_nearest1d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor - python_module: nn - dispatch: - CompositeExplicitAutograd: upsample_nearest1d autogen: upsample_nearest1d.vec_out -- func: _upsample_nearest_exact1d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor +- func: _upsample_nearest_exact1d.vec(Tensor input, SymInt[]? output_size, float[]? scale_factors) -> Tensor python_module: nn - dispatch: - CompositeExplicitAutograd: _upsample_nearest_exact1d autogen: _upsample_nearest_exact1d.vec_out -- func: upsample_nearest1d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor - python_module: nn - dispatch: - CompositeExplicitAutograd: upsample_nearest1d_backward - autogen: upsample_nearest1d_backward.vec_out - -- func: _upsample_nearest_exact1d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor - python_module: nn - dispatch: - CompositeExplicitAutograd: _upsample_nearest_exact1d_backward - autogen: _upsample_nearest_exact1d_backward.vec_out - -- func: upsample_nearest2d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor +- func: upsample_nearest2d.vec(Tensor input, SymInt[]? output_size, float[]? scale_factors) -> Tensor python_module: nn - dispatch: - CompositeExplicitAutograd: upsample_nearest2d autogen: upsample_nearest2d.vec_out + tags: canonical -- func: _upsample_nearest_exact2d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor +- func: _upsample_nearest_exact2d.vec(Tensor input, SymInt[]? output_size, float[]? 
scale_factors) -> Tensor python_module: nn - dispatch: - CompositeExplicitAutograd: _upsample_nearest_exact2d autogen: _upsample_nearest_exact2d.vec_out -- func: upsample_nearest2d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor - python_module: nn - dispatch: - CompositeExplicitAutograd: upsample_nearest2d_backward - autogen: upsample_nearest2d_backward.vec_out - -- func: _upsample_nearest_exact2d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor - python_module: nn - dispatch: - CompositeExplicitAutograd: _upsample_nearest_exact2d_backward - autogen: _upsample_nearest_exact2d_backward.vec_out - -- func: upsample_nearest3d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor +- func: upsample_nearest3d.vec(Tensor input, SymInt[]? output_size, float[]? scale_factors) -> Tensor python_module: nn - dispatch: - CPU: upsample_nearest3d_cpu - CUDA: upsample_nearest3d_cuda - QuantizedCPU: upsample_nearest3d_quantized_cpu autogen: upsample_nearest3d.vec_out -- func: _upsample_nearest_exact3d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor +- func: _upsample_nearest_exact3d.vec(Tensor input, SymInt[]? output_size, float[]? scale_factors) -> Tensor python_module: nn - dispatch: - CPU: _upsample_nearest_exact3d_cpu - CUDA: _upsample_nearest_exact3d_cuda - QuantizedCPU: _upsample_nearest_exact3d_quantized_cpu autogen: _upsample_nearest_exact3d.vec_out -- func: upsample_nearest3d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor - python_module: nn - dispatch: - CPU: upsample_nearest3d_backward_cpu - CUDA: upsample_nearest3d_backward_cuda - autogen: upsample_nearest3d_backward.vec_out - -- func: _upsample_nearest_exact3d_backward.vec(Tensor grad_output, int[]? output_size, int[] input_size, float[]? scale_factors) -> Tensor - python_module: nn - dispatch: - CPU: _upsample_nearest_exact3d_backward_cpu - CUDA: _upsample_nearest_exact3d_backward_cuda - autogen: _upsample_nearest_exact3d_backward.vec_out - # NOTE: all of the non-"vec" upsample overloads are only kept for backward compatibility. -- func: upsample_linear1d.out(Tensor self, int[1] output_size, bool align_corners, float? scales=None, *, Tensor(a!) out) -> Tensor(a!) +- func: upsample_linear1d.out(Tensor self, SymInt[1] output_size, bool align_corners, float? scales=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: upsample_linear1d_out_cpu CUDA: upsample_linear1d_out_cuda -- func: upsample_linear1d(Tensor self, int[1] output_size, bool align_corners, float? scales=None) -> Tensor +- func: upsample_linear1d(Tensor self, SymInt[1] output_size, bool align_corners, float? scales=None) -> Tensor python_module: nn structured_delegate: upsample_linear1d.out -- func: upsample_linear1d_backward.grad_input(Tensor grad_output, int[1] output_size, int[3] input_size, bool align_corners, float? scales=None, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: upsample_linear1d_backward.grad_input(Tensor grad_output, SymInt[1] output_size, SymInt[3] input_size, bool align_corners, float? scales=None, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: upsample_linear1d_backward_out_cpu CUDA: upsample_linear1d_backward_out_cuda -- func: upsample_linear1d_backward(Tensor grad_output, int[1] output_size, int[3] input_size, bool align_corners, float? 
scales=None) -> Tensor +- func: upsample_linear1d_backward(Tensor grad_output, SymInt[1] output_size, SymInt[3] input_size, bool align_corners, float? scales=None) -> Tensor python_module: nn structured_delegate: upsample_linear1d_backward.grad_input -- func: upsample_bilinear2d.out(Tensor self, int[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) +- func: upsample_bilinear2d.out(Tensor self, SymInt[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: @@ -11007,13 +11182,13 @@ CUDA: upsample_bilinear2d_out_cuda MPS: upsample_bilinear2d_out_mps -- func: upsample_bilinear2d(Tensor self, int[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor +- func: upsample_bilinear2d(Tensor self, SymInt[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: upsample_bilinear2d.out dispatch: QuantizedCPU: upsample_bilinear2d_quantized_cpu -- func: upsample_bilinear2d_backward.grad_input(Tensor grad_output, int[2] output_size, int[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: upsample_bilinear2d_backward.grad_input(Tensor grad_output, SymInt[2] output_size, SymInt[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn structured: True dispatch: @@ -11021,99 +11196,99 @@ CUDA: upsample_bilinear2d_backward_out_cuda MPS: upsample_bilinear2d_backward_out_mps -- func: upsample_bilinear2d_backward(Tensor grad_output, int[2] output_size, int[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor +- func: upsample_bilinear2d_backward(Tensor grad_output, SymInt[2] output_size, SymInt[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: upsample_bilinear2d_backward.grad_input -- func: _upsample_bilinear2d_aa.out(Tensor self, int[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) +- func: _upsample_bilinear2d_aa.out(Tensor self, SymInt[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: _upsample_bilinear2d_aa_out_cpu CUDA: _upsample_bilinear2d_aa_out_cuda -- func: _upsample_bilinear2d_aa(Tensor self, int[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor +- func: _upsample_bilinear2d_aa(Tensor self, SymInt[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: _upsample_bilinear2d_aa.out -- func: _upsample_bilinear2d_aa_backward.grad_input(Tensor grad_output, int[2] output_size, int[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: _upsample_bilinear2d_aa_backward.grad_input(Tensor grad_output, SymInt[2] output_size, SymInt[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) 
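# NOTE [SymInt sizes in these schemas]: the size-like arguments in this block move from
# int[] to SymInt[] so that symbolic shapes can flow through tracing without being forced
# to concrete integers. Where a dispatch entry in this file points at a `*_symint` kernel,
# the C++ kernel is expected to take c10::SymIntArrayRef rather than IntArrayRef. A
# minimal, purely illustrative sketch (the function name is hypothetical and not part of
# this patch):
#
#   at::Tensor pad_like_symint_sketch(const at::Tensor& self, c10::SymIntArrayRef pad) {
#     TORCH_CHECK(pad.size() % 2 == 0, "padding length must be even");
#     return self;  // a real kernel would compute output sizes with SymInt arithmetic
#   }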
python_module: nn structured: True dispatch: CPU: _upsample_bilinear2d_aa_backward_out_cpu CUDA: _upsample_bilinear2d_aa_backward_out_cuda -- func: _upsample_bilinear2d_aa_backward(Tensor grad_output, int[2] output_size, int[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor +- func: _upsample_bilinear2d_aa_backward(Tensor grad_output, SymInt[2] output_size, SymInt[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: _upsample_bilinear2d_aa_backward.grad_input -- func: upsample_bicubic2d.out(Tensor self, int[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) +- func: upsample_bicubic2d.out(Tensor self, SymInt[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: upsample_bicubic2d_out_cpu CUDA: upsample_bicubic2d_out_cuda -- func: upsample_bicubic2d(Tensor self, int[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor +- func: upsample_bicubic2d(Tensor self, SymInt[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: upsample_bicubic2d.out -- func: upsample_bicubic2d_backward.grad_input(Tensor grad_output, int[2] output_size, int[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: upsample_bicubic2d_backward.grad_input(Tensor grad_output, SymInt[2] output_size, SymInt[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: upsample_bicubic2d_backward_out_cpu CUDA: upsample_bicubic2d_backward_out_cuda -- func: upsample_bicubic2d_backward(Tensor grad_output, int[2] output_size, int[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor +- func: upsample_bicubic2d_backward(Tensor grad_output, SymInt[2] output_size, SymInt[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: upsample_bicubic2d_backward.grad_input -- func: _upsample_bicubic2d_aa.out(Tensor self, int[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) +- func: _upsample_bicubic2d_aa.out(Tensor self, SymInt[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: _upsample_bicubic2d_aa_out_cpu CUDA: _upsample_bicubic2d_aa_out_cuda -- func: _upsample_bicubic2d_aa(Tensor self, int[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor +- func: _upsample_bicubic2d_aa(Tensor self, SymInt[2] output_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: _upsample_bicubic2d_aa.out -- func: _upsample_bicubic2d_aa_backward.grad_input(Tensor grad_output, int[2] output_size, int[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: _upsample_bicubic2d_aa_backward.grad_input(Tensor grad_output, SymInt[2] output_size, SymInt[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None, *, Tensor(a!) 
grad_input) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: _upsample_bicubic2d_aa_backward_out_cpu CUDA: _upsample_bicubic2d_aa_backward_out_cuda -- func: _upsample_bicubic2d_aa_backward(Tensor grad_output, int[2] output_size, int[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor +- func: _upsample_bicubic2d_aa_backward(Tensor grad_output, SymInt[2] output_size, SymInt[4] input_size, bool align_corners, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: _upsample_bicubic2d_aa_backward.grad_input -- func: upsample_trilinear3d.out(Tensor self, int[3] output_size, bool align_corners, float? scales_d=None, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) +- func: upsample_trilinear3d.out(Tensor self, SymInt[3] output_size, bool align_corners, float? scales_d=None, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: upsample_trilinear3d_out_cpu CUDA: upsample_trilinear3d_out_cuda -- func: upsample_trilinear3d(Tensor self, int[3] output_size, bool align_corners, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor +- func: upsample_trilinear3d(Tensor self, SymInt[3] output_size, bool align_corners, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: upsample_trilinear3d.out -- func: upsample_trilinear3d_backward.grad_input(Tensor grad_output, int[3] output_size, int[5] input_size, bool align_corners, float? scales_d=None, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: upsample_trilinear3d_backward.grad_input(Tensor grad_output, SymInt[3] output_size, SymInt[5] input_size, bool align_corners, float? scales_d=None, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: upsample_trilinear3d_backward_out_cpu CUDA: upsample_trilinear3d_backward_out_cuda -- func: upsample_trilinear3d_backward(Tensor grad_output, int[3] output_size, int[5] input_size, bool align_corners, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor +- func: upsample_trilinear3d_backward(Tensor grad_output, SymInt[3] output_size, SymInt[5] input_size, bool align_corners, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: upsample_trilinear3d_backward.grad_input -- func: upsample_nearest1d.out(Tensor self, int[1] output_size, float? scales=None, *, Tensor(a!) out) -> Tensor(a!) +- func: upsample_nearest1d.out(Tensor self, SymInt[1] output_size, float? scales=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: @@ -11121,44 +11296,44 @@ CUDA: upsample_nearest1d_out_cuda MPS: upsample_nearest1d_out_mps -- func: _upsample_nearest_exact1d.out(Tensor self, int[1] output_size, float? scales=None, *, Tensor(a!) out) -> Tensor(a!) +- func: _upsample_nearest_exact1d.out(Tensor self, SymInt[1] output_size, float? scales=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: _upsample_nearest_exact1d_out_cpu CUDA: _upsample_nearest_exact1d_out_cuda -- func: upsample_nearest1d(Tensor self, int[1] output_size, float? scales=None) -> Tensor +- func: upsample_nearest1d(Tensor self, SymInt[1] output_size, float? 
scales=None) -> Tensor python_module: nn structured_delegate: upsample_nearest1d.out -- func: _upsample_nearest_exact1d(Tensor self, int[1] output_size, float? scales=None) -> Tensor +- func: _upsample_nearest_exact1d(Tensor self, SymInt[1] output_size, float? scales=None) -> Tensor python_module: nn structured_delegate: _upsample_nearest_exact1d.out -- func: upsample_nearest1d_backward.grad_input(Tensor grad_output, int[1] output_size, int[3] input_size, float? scales=None, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: upsample_nearest1d_backward.grad_input(Tensor grad_output, SymInt[1] output_size, SymInt[3] input_size, float? scales=None, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: upsample_nearest1d_backward_out_cpu CUDA: upsample_nearest1d_backward_out_cuda -- func: _upsample_nearest_exact1d_backward.grad_input(Tensor grad_output, int[1] output_size, int[3] input_size, float? scales=None, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: _upsample_nearest_exact1d_backward.grad_input(Tensor grad_output, SymInt[1] output_size, SymInt[3] input_size, float? scales=None, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: _upsample_nearest_exact1d_backward_out_cpu CUDA: _upsample_nearest_exact1d_backward_out_cuda -- func: upsample_nearest1d_backward(Tensor grad_output, int[1] output_size, int[3] input_size, float? scales=None) -> Tensor +- func: upsample_nearest1d_backward(Tensor grad_output, SymInt[1] output_size, SymInt[3] input_size, float? scales=None) -> Tensor python_module: nn structured_delegate: upsample_nearest1d_backward.grad_input -- func: _upsample_nearest_exact1d_backward(Tensor grad_output, int[1] output_size, int[3] input_size, float? scales=None) -> Tensor +- func: _upsample_nearest_exact1d_backward(Tensor grad_output, SymInt[1] output_size, SymInt[3] input_size, float? scales=None) -> Tensor python_module: nn structured_delegate: _upsample_nearest_exact1d_backward.grad_input -- func: upsample_nearest2d.out(Tensor self, int[2] output_size, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) +- func: upsample_nearest2d.out(Tensor self, SymInt[2] output_size, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: @@ -11166,7 +11341,7 @@ CUDA: upsample_nearest2d_out_cuda MPS: upsample_nearest2d_out_mps -- func: _upsample_nearest_exact2d.out(Tensor self, int[2] output_size, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) +- func: _upsample_nearest_exact2d.out(Tensor self, SymInt[2] output_size, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: @@ -11174,19 +11349,19 @@ CUDA: _upsample_nearest_exact2d_out_cuda MPS: _upsample_nearest_exact2d_out_mps -- func: upsample_nearest2d(Tensor self, int[2] output_size, float? scales_h=None, float? scales_w=None) -> Tensor +- func: upsample_nearest2d(Tensor self, SymInt[2] output_size, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: upsample_nearest2d.out dispatch: QuantizedCPU: upsample_nearest2d_quantized_cpu -- func: _upsample_nearest_exact2d(Tensor self, int[2] output_size, float? scales_h=None, float? scales_w=None) -> Tensor +- func: _upsample_nearest_exact2d(Tensor self, SymInt[2] output_size, float? scales_h=None, float? 
scales_w=None) -> Tensor python_module: nn structured_delegate: _upsample_nearest_exact2d.out dispatch: QuantizedCPU: _upsample_nearest_exact2d_quantized_cpu -- func: upsample_nearest2d_backward.grad_input(Tensor grad_output, int[2] output_size, int[4] input_size, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: upsample_nearest2d_backward.grad_input(Tensor grad_output, SymInt[2] output_size, SymInt[4] input_size, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn structured: True dispatch: @@ -11194,7 +11369,7 @@ CUDA: upsample_nearest2d_backward_out_cuda MPS: upsample_nearest2d_backward_out_mps -- func: _upsample_nearest_exact2d_backward.grad_input(Tensor grad_output, int[2] output_size, int[4] input_size, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: _upsample_nearest_exact2d_backward.grad_input(Tensor grad_output, SymInt[2] output_size, SymInt[4] input_size, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn structured: True dispatch: @@ -11202,59 +11377,59 @@ CUDA: _upsample_nearest_exact2d_backward_out_cuda MPS: _upsample_nearest_exact2d_backward_out_mps -- func: upsample_nearest2d_backward(Tensor grad_output, int[2] output_size, int[4] input_size, float? scales_h=None, float? scales_w=None) -> Tensor +- func: upsample_nearest2d_backward(Tensor grad_output, SymInt[2] output_size, SymInt[4] input_size, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: upsample_nearest2d_backward.grad_input -- func: _upsample_nearest_exact2d_backward(Tensor grad_output, int[2] output_size, int[4] input_size, float? scales_h=None, float? scales_w=None) -> Tensor +- func: _upsample_nearest_exact2d_backward(Tensor grad_output, SymInt[2] output_size, SymInt[4] input_size, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: _upsample_nearest_exact2d_backward.grad_input -- func: upsample_nearest3d.out(Tensor self, int[3] output_size, float? scales_d=None, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) +- func: upsample_nearest3d.out(Tensor self, SymInt[3] output_size, float? scales_d=None, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: upsample_nearest3d_out_cpu CUDA: upsample_nearest3d_out_cuda -- func: _upsample_nearest_exact3d.out(Tensor self, int[3] output_size, float? scales_d=None, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) +- func: _upsample_nearest_exact3d.out(Tensor self, SymInt[3] output_size, float? scales_d=None, float? scales_h=None, float? scales_w=None, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: _upsample_nearest_exact3d_out_cpu CUDA: _upsample_nearest_exact3d_out_cuda -- func: upsample_nearest3d(Tensor self, int[3] output_size, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor +- func: upsample_nearest3d(Tensor self, SymInt[3] output_size, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: upsample_nearest3d.out dispatch: QuantizedCPU: upsample_nearest3d_quantized_cpu -- func: _upsample_nearest_exact3d(Tensor self, int[3] output_size, float? scales_d=None, float? scales_h=None, float? 
scales_w=None) -> Tensor +- func: _upsample_nearest_exact3d(Tensor self, SymInt[3] output_size, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: _upsample_nearest_exact3d.out dispatch: QuantizedCPU: _upsample_nearest_exact3d_quantized_cpu -- func: upsample_nearest3d_backward.grad_input(Tensor grad_output, int[3] output_size, int[5] input_size, float? scales_d=None, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: upsample_nearest3d_backward.grad_input(Tensor grad_output, SymInt[3] output_size, SymInt[5] input_size, float? scales_d=None, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: upsample_nearest3d_backward_out_cpu CUDA: upsample_nearest3d_backward_out_cuda -- func: _upsample_nearest_exact3d_backward.grad_input(Tensor grad_output, int[3] output_size, int[5] input_size, float? scales_d=None, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) +- func: _upsample_nearest_exact3d_backward.grad_input(Tensor grad_output, SymInt[3] output_size, SymInt[5] input_size, float? scales_d=None, float? scales_h=None, float? scales_w=None, *, Tensor(a!) grad_input) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: _upsample_nearest_exact3d_backward_out_cpu CUDA: _upsample_nearest_exact3d_backward_out_cuda -- func: upsample_nearest3d_backward(Tensor grad_output, int[3] output_size, int[5] input_size, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor +- func: upsample_nearest3d_backward(Tensor grad_output, SymInt[3] output_size, SymInt[5] input_size, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: upsample_nearest3d_backward.grad_input -- func: _upsample_nearest_exact3d_backward(Tensor grad_output, int[3] output_size, int[5] input_size, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor +- func: _upsample_nearest_exact3d_backward(Tensor grad_output, SymInt[3] output_size, SymInt[5] input_size, float? scales_d=None, float? scales_h=None, float? scales_w=None) -> Tensor python_module: nn structured_delegate: _upsample_nearest_exact3d_backward.grad_input @@ -11311,24 +11486,24 @@ # these are the same thing, but we give them different prefixes to # make the operational distinction clear. -- func: slow_conv_transpose2d.out(Tensor self, Tensor weight, int[2] kernel_size, Tensor? bias=None, int[2] stride=1, int[2] padding=0, int[2] output_padding=0, int[2] dilation=1, *, Tensor(a!) out) -> Tensor(a!) +- func: slow_conv_transpose2d.out(Tensor self, Tensor weight, int[2] kernel_size, Tensor? bias=None, int[2] stride=1, SymInt[2] padding=0, SymInt[2] output_padding=0, int[2] dilation=1, *, Tensor(a!) out) -> Tensor(a!) python_module: nn structured: True dispatch: CPU: slow_conv_transpose2d_structured_cpu CUDA: slow_conv_transpose2d_structured_cuda -- func: slow_conv_transpose2d(Tensor self, Tensor weight, int[2] kernel_size, Tensor? bias=None, int[2] stride=1, int[2] padding=0, int[2] output_padding=0, int[2] dilation=1) -> Tensor +- func: slow_conv_transpose2d(Tensor self, Tensor weight, int[2] kernel_size, Tensor? 
bias=None, int[2] stride=1, SymInt[2] padding=0, SymInt[2] output_padding=0, int[2] dilation=1) -> Tensor python_module: nn structured_delegate: slow_conv_transpose2d.out -- func: slow_conv_transpose3d.out(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, int[3] padding=0, int[3] output_padding=0, int[3] dilation=1, *, Tensor(a!) out) -> Tensor(a!) +- func: slow_conv_transpose3d.out(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, SymInt[3] padding=0, SymInt[3] output_padding=0, int[3] dilation=1, *, Tensor(a!) out) -> Tensor(a!) python_module: nn dispatch: CPU: slow_conv_transpose3d_out_cpu CUDA: slow_conv_transpose3d_out_cuda -- func: slow_conv_transpose3d(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, int[3] padding=0, int[3] output_padding=0, int[3] dilation=1) -> Tensor +- func: slow_conv_transpose3d(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, SymInt[3] padding=0, SymInt[3] output_padding=0, int[3] dilation=1) -> Tensor python_module: nn dispatch: CPU: slow_conv_transpose3d_cpu @@ -11365,76 +11540,65 @@ CUDA: slow_conv2d_backward_cuda autogen: _slow_conv2d_backward.output_mask_out -- func: _conv_depthwise2d.out(Tensor self, Tensor weight, int[2] kernel_size, Tensor? bias, int[2] stride, int[2] padding, int[2] dilation, *, Tensor(a!) out) -> Tensor(a!) +- func: _conv_depthwise2d.out(Tensor self, Tensor weight, int[2] kernel_size, Tensor? bias, int[2] stride, SymInt[2] padding, int[2] dilation, *, Tensor(a!) out) -> Tensor(a!) use_const_ref_for_mutable_tensors: True python_module: nn dispatch: CUDA: conv_depthwise2d_cuda_out -- func: _conv_depthwise2d(Tensor self, Tensor weight, int[2] kernel_size, Tensor? bias, int[2] stride, int[2] padding, int[2] dilation) -> Tensor +- func: _conv_depthwise2d(Tensor self, Tensor weight, int[2] kernel_size, Tensor? bias, int[2] stride, SymInt[2] padding, int[2] dilation) -> Tensor python_module: nn dispatch: CUDA: conv_depthwise2d_cuda -- func: conv_depthwise3d(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias, int[3] stride, int[3] padding, int[3] dilation) -> Tensor +- func: conv_depthwise3d(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias, int[3] stride, SymInt[3] padding, int[3] dilation) -> Tensor python_module: nn dispatch: CUDA: conv_depthwise3d_cuda autogen: conv_depthwise3d.out -- func: slow_conv3d.out(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, int[3] padding=0, *, Tensor(a!) out) -> Tensor(a!) +- func: slow_conv3d.out(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, SymInt[3] padding=0, *, Tensor(a!) out) -> Tensor(a!) python_module: nn -- func: slow_conv3d(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, int[3] padding=0) -> Tensor +- func: slow_conv3d(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, SymInt[3] padding=0) -> Tensor python_module: nn -- func: slow_conv3d_forward.output(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias, int[3] stride, int[3] padding, *, Tensor(a!) output) -> Tensor(a!) +- func: slow_conv3d_forward.output(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias, int[3] stride, SymInt[3] padding, *, Tensor(a!) output) -> Tensor(a!) python_module: nn dispatch: CPU: slow_conv3d_forward_out_cpu -- func: slow_conv3d_forward(Tensor self, Tensor weight, int[3] kernel_size, Tensor? 
bias, int[3] stride, int[3] padding) -> Tensor +- func: slow_conv3d_forward(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias, int[3] stride, SymInt[3] padding) -> Tensor python_module: nn dispatch: CPU: slow_conv3d_forward_cpu -- func: slow_conv_dilated2d(Tensor self, Tensor weight, int[2] kernel_size, Tensor? bias=None, int[2] stride=1, int[2] padding=0, int[2] dilation=1) -> Tensor +- func: slow_conv_dilated2d(Tensor self, Tensor weight, int[2] kernel_size, Tensor? bias=None, int[2] stride=1, SymInt[2] padding=0, int[2] dilation=1) -> Tensor python_module: nn dispatch: CPU: slow_conv_dilated2d_cpu CUDA: slow_conv_dilated2d_cuda autogen: slow_conv_dilated2d.out -- func: slow_conv_dilated3d(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, int[3] padding=0, int[3] dilation=1) -> Tensor +- func: slow_conv_dilated3d(Tensor self, Tensor weight, int[3] kernel_size, Tensor? bias=None, int[3] stride=1, SymInt[3] padding=0, int[3] dilation=1) -> Tensor python_module: nn dispatch: CPU: slow_conv_dilated3d_cpu CUDA: slow_conv_dilated3d_cuda autogen: slow_conv_dilated3d.out -- func: col2im.out(Tensor self, int[2] output_size, int[2] kernel_size, int[2] dilation, int[2] padding, int[2] stride, *, Tensor(a!) out) -> Tensor(a!) +- func: col2im.out(Tensor self, SymInt[2] output_size, int[2] kernel_size, int[2] dilation, int[2] padding, int[2] stride, *, Tensor(a!) out) -> Tensor(a!) python_module: nn dispatch: CPU: col2im_out_cpu CUDA: col2im_out_cuda -- func: col2im(Tensor self, int[2] output_size, int[2] kernel_size, int[2] dilation, int[2] padding, int[2] stride) -> Tensor +- func: col2im(Tensor self, SymInt[2] output_size, int[2] kernel_size, int[2] dilation, int[2] padding, int[2] stride) -> Tensor python_module: nn dispatch: CPU: col2im_cpu CUDA: col2im_cuda - -- func: col2im_backward.grad_input(Tensor grad_output, int[2] kernel_size, int[2] dilation, int[2] padding, int[2] stride, *, Tensor(a!) grad_input) -> Tensor(a!) - python_module: nn - dispatch: - CPU: col2im_backward_out_cpu - CUDA: col2im_backward_out_cuda - -- func: col2im_backward(Tensor grad_output, int[2] kernel_size, int[2] dilation, int[2] padding, int[2] stride) -> Tensor - python_module: nn - dispatch: - CPU: col2im_backward_cpu - CUDA: col2im_backward_cuda + tags: canonical - func: column_stack(Tensor[] tensors) -> Tensor @@ -11452,18 +11616,6 @@ CPU: im2col_cpu CUDA: im2col_cuda -- func: im2col_backward.grad_input(Tensor grad_output, int[2] input_size, int[2] kernel_size, int[2] dilation, int[2] padding, int[2] stride, *, Tensor(a!) grad_input) -> Tensor(a!) - python_module: nn - dispatch: - CPU: im2col_backward_out_cpu - CUDA: im2col_backward_out_cuda - -- func: im2col_backward(Tensor grad_output, int[2] input_size, int[2] kernel_size, int[2] dilation, int[2] padding, int[2] stride) -> Tensor - python_module: nn - dispatch: - CPU: im2col_backward_cpu - CUDA: im2col_backward_cuda - - func: isfinite(Tensor self) -> Tensor variants: function, method device_check: NoCheck @@ -12130,8 +12282,6 @@ - func: linalg_cross.out(Tensor self, Tensor other, *, int dim=-1, Tensor(a!) out) -> Tensor(a!) python_module: linalg structured: True - precomputed: - - dim -> int dim dispatch: CPU, CUDA: linalg_cross_out @@ -12343,34 +12493,26 @@ dispatch: CPU, CUDA: linalg_householder_product_out -- func: _linalg_inv_out_helper_(Tensor(a!) self, Tensor(b!) infos_lu, Tensor(c!) infos_getri) -> Tensor(a!) 
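# NOTE [linalg_inv via structured linalg_inv_ex]: the private helper removed here is
# superseded by making linalg_inv_ex a structured op (see the entries just below), with
# its argument renamed from `self` to `A`. A minimal, purely illustrative C++ call
# against the resulting API (not part of this patch):
#
#   auto [inverse, info] = at::linalg_inv_ex(A, /*check_errors=*/false);
#   // `info` is nonzero for batch elements where the factorization failed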
- variants: function - dispatch: - CPU: _linalg_inv_out_helper_cpu - CUDA: _linalg_inv_out_helper_cuda - autogen: _linalg_inv_out_helper, _linalg_inv_out_helper.out - -- func: linalg_inv_ex(Tensor self, *, bool check_errors=False) -> (Tensor inverse, Tensor info) +- func: linalg_inv_ex(Tensor A, *, bool check_errors=False) -> (Tensor inverse, Tensor info) python_module: linalg - variants: function - dispatch: - # calls transpose_ - CompositeExplicitAutogradNonFunctional: linalg_inv_ex + structured_delegate: linalg_inv_ex.inverse -- func: linalg_inv_ex.inverse(Tensor self, *, bool check_errors=False, Tensor(a!) inverse, Tensor(b!) info) -> (Tensor(a!) inverse, Tensor(b!) info) +- func: linalg_inv_ex.inverse(Tensor A, *, bool check_errors=False, Tensor(a!) inverse, Tensor(b!) info) -> (Tensor(a!) inverse, Tensor(b!) info) python_module: linalg - variants: function + structured: True dispatch: - # calls transpose_ - CompositeExplicitAutogradNonFunctional: linalg_inv_ex_out + CPU, CUDA: linalg_inv_ex_out -- func: linalg_inv(Tensor self) -> Tensor +- func: linalg_inv(Tensor A) -> Tensor python_module: linalg - variants: function -- func: linalg_inv.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) +- func: linalg_inv.out(Tensor A, *, Tensor(a!) out) -> Tensor(a!) python_module: linalg - variants: function + +- func: inverse(Tensor self) -> Tensor + variants: function, method + +- func: inverse.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) - func: inner(Tensor self, Tensor other) -> Tensor variants: function, method @@ -12603,6 +12745,17 @@ - func: linalg_multi_dot.out(Tensor[] tensors, *, Tensor(a!) out) -> Tensor(a!) python_module: linalg +## Functions related to the `torch.nested` namespace +# Note [nested namespace binding] +# Functions in the nested python module should have their names start with +# "nested_" underscore and be bound to the desired Python name in +# torch/nested/__init__.py, and the desired C++ name in torch/csrc/api/include/torch/nested.h. +# The "nested_" names should be hidden from the user and not documented. + +- func: nested_to_padded_tensor(Tensor self, float padding, int[]? output_size=None) -> Tensor + python_module: nested + variants: function + ## Functions that are only for testing # It is undocumented and should not be used outside of tests. - func: _test_serialization_subcmul(Tensor self, Tensor other, Scalar alpha=1) -> Tensor @@ -12698,11 +12851,11 @@ variants: function python_module: nn -- func: nested_tensor(Tensor[] list, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor +- func: _nested_tensor_from_tensor_list(Tensor[] list, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor variants: function dispatch: - CompositeExplicitAutograd: nested_tensor - autogen: nested_tensor.out + CompositeExplicitAutograd: _nested_tensor_from_tensor_list + autogen: _nested_tensor_from_tensor_list.out - func: _fw_primal_copy(Tensor self, int level) -> Tensor variants: function @@ -12740,10 +12893,10 @@ CompositeExplicitAutogradNonFunctional: _neg_view_copy tags: view_copy -- func: as_strided_copy(Tensor self, int[] size, int[] stride, int? storage_offset=None) -> Tensor +- func: as_strided_copy(Tensor self, SymInt[] size, SymInt[] stride, SymInt? 
storage_offset=None) -> Tensor variants: function dispatch: - CompositeExplicitAutogradNonFunctional: as_strided_copy + CompositeExplicitAutogradNonFunctional: as_strided_copy_symint tags: view_copy - func: _sparse_broadcast_to_copy(Tensor self, int[] size) -> Tensor @@ -12758,16 +12911,10 @@ CompositeExplicitAutogradNonFunctional: diagonal_copy tags: view_copy -- func: expand_copy(Tensor self, int[] size, *, bool implicit=False) -> Tensor - variants: function - dispatch: - CompositeExplicitAutogradNonFunctional: expand_copy - tags: view_copy - -- func: expand_copy.SymInt(Tensor self, SymInt[] size, *, bool implicit=False) -> Tensor +- func: expand_copy(Tensor self, SymInt[] size, *, bool implicit=False) -> Tensor variants: function dispatch: - CompositeExplicitAutograd: expand_copy_SymInt + CompositeExplicitAutogradNonFunctional: expand_copy_symint tags: view_copy - func: permute_copy(Tensor self, int[] dims) -> Tensor @@ -12776,16 +12923,16 @@ CompositeExplicitAutogradNonFunctional: permute_copy tags: view_copy -- func: _reshape_alias_copy(Tensor self, int[] size, int[] stride) -> Tensor +- func: _reshape_alias_copy(Tensor self, SymInt[] size, SymInt[] stride) -> Tensor variants: function dispatch: - CompositeExplicitAutogradNonFunctional: _reshape_alias_copy + CompositeExplicitAutogradNonFunctional: _reshape_alias_copy_symint tags: view_copy -- func: select_copy.int(Tensor self, int dim, int index) -> Tensor +- func: select_copy.int(Tensor self, int dim, SymInt index) -> Tensor variants: function dispatch: - CompositeExplicitAutogradNonFunctional: select_copy_int + CompositeExplicitAutogradNonFunctional: select_copy_symint tags: view_copy - func: detach_copy(Tensor self) -> Tensor @@ -12794,22 +12941,22 @@ CompositeExplicitAutogradNonFunctional: detach_copy tags: view_copy -- func: slice_copy.Tensor(Tensor self, int dim=0, int? start=None, int? end=None, int step=1) -> Tensor +- func: slice_copy.Tensor(Tensor self, int dim=0, SymInt? start=None, SymInt? 
end=None, SymInt step=1) -> Tensor variants: function dispatch: - CompositeExplicitAutogradNonFunctional: slice_copy_Tensor + CompositeExplicitAutogradNonFunctional: slice_copy_Tensor_symint tags: view_copy -- func: split_copy.Tensor(Tensor self, int split_size, int dim=0) -> Tensor[] +- func: split_copy.Tensor(Tensor self, SymInt split_size, int dim=0) -> Tensor[] variants: function dispatch: - CompositeExplicitAutogradNonFunctional: split_copy_Tensor + CompositeExplicitAutogradNonFunctional: split_copy_Tensor_symint tags: view_copy -- func: split_with_sizes_copy(Tensor self, int[] split_sizes, int dim=0) -> Tensor[] +- func: split_with_sizes_copy(Tensor self, SymInt[] split_sizes, int dim=0) -> Tensor[] variants: function dispatch: - CompositeExplicitAutogradNonFunctional: split_with_sizes_copy + CompositeExplicitAutogradNonFunctional: split_with_sizes_copy_symint tags: view_copy - func: squeeze_copy(Tensor self) -> Tensor @@ -12881,14 +13028,14 @@ - func: ccol_indices_copy(Tensor self) -> Tensor variants: function dispatch: - CompositeExplicitAutograd: ccol_indices_copy + CompositeExplicitAutogradNonFunctional: ccol_indices_copy tags: view_copy autogen: ccol_indices_copy.out - func: row_indices_copy(Tensor self) -> Tensor variants: function dispatch: - CompositeExplicitAutograd: row_indices_copy + CompositeExplicitAutogradNonFunctional: row_indices_copy tags: view_copy autogen: row_indices_copy.out @@ -12898,10 +13045,10 @@ CompositeExplicitAutogradNonFunctional: unbind_copy_int tags: view_copy -- func: view_copy(Tensor self, int[] size) -> Tensor +- func: view_copy(Tensor self, SymInt[] size) -> Tensor variants: function dispatch: - CompositeExplicitAutogradNonFunctional: view_copy + CompositeExplicitAutogradNonFunctional: view_copy_symint tags: view_copy - func: view_copy.dtype(Tensor self, ScalarType dtype) -> Tensor @@ -12958,18 +13105,10 @@ CompositeExplicitAutograd: _neg_view_copy_out -- func: view_copy.SymInt(Tensor self, SymInt[] size) -> Tensor +- func: as_strided_copy.out(Tensor self, SymInt[] size, SymInt[] stride, SymInt? storage_offset=None, *, Tensor(a!) out) -> Tensor(a!) variants: function dispatch: - CompositeExplicitAutograd: view_copy_SymInt - tags: view_copy - autogen: view_copy.SymInt_out - - -- func: as_strided_copy.out(Tensor self, int[] size, int[] stride, int? storage_offset=None, *, Tensor(a!) out) -> Tensor(a!) - variants: function - dispatch: - CompositeExplicitAutograd: as_strided_copy_out + CompositeExplicitAutograd: as_strided_copy_out_symint - func: _sparse_broadcast_to_copy.out(Tensor self, int[] size, *, Tensor(a!) out) -> Tensor(a!) @@ -12984,16 +13123,10 @@ CompositeExplicitAutograd: diagonal_copy_out -- func: expand_copy.SymInt_out(Tensor self, SymInt[] size, *, bool implicit=False, Tensor(a!) out) -> Tensor(a!) +- func: expand_copy.out(Tensor self, SymInt[] size, *, bool implicit=False, Tensor(a!) out) -> Tensor(a!) variants: function dispatch: - CompositeExplicitAutograd: expand_copy_SymInt_out - - -- func: expand_copy.out(Tensor self, int[] size, *, bool implicit=False, Tensor(a!) out) -> Tensor(a!) - variants: function - dispatch: - CompositeExplicitAutograd: expand_copy_out + CompositeExplicitAutograd: expand_copy_out_symint - func: permute_copy.out(Tensor self, int[] dims, *, Tensor(a!) out) -> Tensor(a!) @@ -13002,16 +13135,16 @@ CompositeExplicitAutograd: permute_copy_out -- func: _reshape_alias_copy.out(Tensor self, int[] size, int[] stride, *, Tensor(a!) out) -> Tensor(a!) 
+- func: _reshape_alias_copy.out(Tensor self, SymInt[] size, SymInt[] stride, *, Tensor(a!) out) -> Tensor(a!) variants: function dispatch: CompositeExplicitAutograd: _reshape_alias_copy_out -- func: select_copy.int_out(Tensor self, int dim, int index, *, Tensor(a!) out) -> Tensor(a!) +- func: select_copy.int_out(Tensor self, int dim, SymInt index, *, Tensor(a!) out) -> Tensor(a!) variants: function dispatch: - CompositeExplicitAutograd: select_copy_int_out + CompositeExplicitAutograd: select_copy_symint_out - func: detach_copy.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!) @@ -13020,19 +13153,19 @@ CompositeExplicitAutograd: detach_copy_out -- func: slice_copy.Tensor_out(Tensor self, int dim=0, int? start=None, int? end=None, int step=1, *, Tensor(a!) out) -> Tensor(a!) +- func: slice_copy.Tensor_out(Tensor self, int dim=0, SymInt? start=None, SymInt? end=None, SymInt step=1, *, Tensor(a!) out) -> Tensor(a!) variants: function dispatch: CompositeExplicitAutograd: slice_copy_Tensor_out -- func: split_copy.Tensor_out(Tensor self, int split_size, int dim=0, *, Tensor(a!)[] out) -> () +- func: split_copy.Tensor_out(Tensor self, SymInt split_size, int dim=0, *, Tensor(a!)[] out) -> () variants: function dispatch: CompositeExplicitAutograd: split_copy_Tensor_out -- func: split_with_sizes_copy.out(Tensor self, int[] split_sizes, int dim=0, *, Tensor(a!)[] out) -> () +- func: split_with_sizes_copy.out(Tensor self, SymInt[] split_sizes, int dim=0, *, Tensor(a!)[] out) -> () variants: function dispatch: CompositeExplicitAutograd: split_with_sizes_copy_out @@ -13110,10 +13243,10 @@ CompositeExplicitAutograd: unbind_copy_int_out -- func: view_copy.out(Tensor self, int[] size, *, Tensor(a!) out) -> Tensor(a!) +- func: view_copy.out(Tensor self, SymInt[] size, *, Tensor(a!) out) -> Tensor(a!) variants: function dispatch: - CompositeExplicitAutograd: view_copy_out + CompositeExplicitAutograd: view_copy_out_symint - func: view_copy.dtype_out(Tensor self, ScalarType dtype, *, Tensor(a!) out) -> Tensor(a!) @@ -13133,18 +13266,17 @@ dispatch: CompositeExplicitAutograd: alias_copy_out -- func: to_padded_tensor(Tensor self, float padding, int[]? output_size=None) -> Tensor +- func: to_padded_tensor(Tensor self, float padding, SymInt[]? output_size=None) -> Tensor variants: method dispatch: NestedTensorCPU: NestedTensor_to_padded_tensor_generic NestedTensorCUDA: NestedTensor_to_padded_tensor_cuda autogen: to_padded_tensor.out -- func: _nested_tensor_layer_norm(Tensor self, Tensor? weight, Tensor? bias, float eps) -> Tensor - variants: method +- func: _nested_tensor_softmax_with_shape(Tensor self, Tensor query) -> Tensor dispatch: - NestedTensorCPU, NestedTensorCUDA: NestedTensor_layer_norm - autogen: _nested_tensor_layer_norm.out + NestedTensorCPU: NestedTensor_softmax_dropout + NestedTensorCUDA: NestedTensor_softmax_dropout_cuda # Apparently, putting "forward" in the name will cause Python bindings to be skipped, so "fwd" it is. - func: _transformer_encoder_layer_fwd(Tensor src, int embed_dim, int num_heads, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, bool use_gelu, bool norm_first, float eps, Tensor norm_weight_1, Tensor norm_bias_1, Tensor norm_weight_2, Tensor norm_bias_2, Tensor ffn_weight_1, Tensor ffn_bias_1, Tensor ffn_weight_2, Tensor ffn_bias_2, Tensor? mask=None, int? 
mask_type=None) -> Tensor @@ -13156,11 +13288,53 @@ - func: _native_multi_head_attention(Tensor query, Tensor key, Tensor value, int embed_dim, int num_head, Tensor qkv_weight, Tensor qkv_bias, Tensor proj_weight, Tensor proj_bias, Tensor? mask=None, bool need_weights=True, bool average_attn_weights=True, int? mask_type=None) -> (Tensor, Tensor) variants: function dispatch: - CPU, CUDA, NestedTensorCPU, NestedTensorCUDA: native_multi_head_attention + CPU, NestedTensorCPU: native_multi_head_attention_cpu + CUDA, NestedTensorCUDA: native_multi_head_attention_cuda autogen: _native_multi_head_attention.out -- func: _scaled_dot_product_attention(Tensor query, Tensor key, Tensor value, Tensor? attn_mask=None, float dropout_p=0.0, bool need_attn_weights=True, bool is_causal=False) -> (Tensor, Tensor) +- func: _scaled_dot_product_attention(Tensor query, Tensor key, Tensor value, Tensor? attn_mask=None, float dropout_p=0.0, bool need_attn_weights=False, bool is_causal=False) -> (Tensor, Tensor) + python_module: nn variants: function + autogen: _scaled_dot_product_attention.out + +- func: _fused_sdp_choice(Tensor query, Tensor key, Tensor value, Tensor? attn_mask=None, float dropout_p=0.0, bool need_attn_weights=False, bool is_causal=False) -> int + dispatch: + CPU, NestedTensorCPU, Meta: _fused_sdp_choice_cpp + CUDA, NestedTensorCUDA: _fused_sdp_choice_cuda + +- func: _scaled_dot_product_attention_math(Tensor query, Tensor key, Tensor value, Tensor? attn_mask=None, float dropout_p=0.0, bool need_attn_weights=False, bool is_causal=False) -> (Tensor, Tensor) + variants: function + +- func: _scaled_dot_product_flash_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool return_softmax=False, bool is_causal=False) -> (Tensor, Tensor, Tensor) + dispatch: + CUDA: _scaled_dot_product_flash_attention_cuda + NestedTensorCUDA: _scaled_dot_product_flash_attention_nestedtensor_cuda + +- func: _scaled_dot_product_efficient_attention(Tensor query, Tensor key, Tensor value, bool compute_log_sumexp, bool is_causal=False) -> (Tensor, Tensor) + dispatch: + CUDA: _scaled_dot_product_efficient_attention_cuda + NestedTensorCUDA: _scaled_dot_product_efficient_attention_nestedtensor_cuda + +- func: _scaled_dot_product_efficient_attention_backward(Tensor grad_out_, Tensor query, Tensor key, Tensor value, Tensor out, Tensor logsumexp, bool is_causal=False) -> (Tensor, Tensor, Tensor) + dispatch: + CUDA: _scaled_dot_product_efficient_attention_backward_cuda + +# Returns output, softmax_logsumexp, softmax +- func: _flash_attention_forward(Tensor query, Tensor key, Tensor value, Tensor cum_seq_q, Tensor cum_seq_k, int max_q, int max_k, bool return_softmax, float dropout_p, bool is_causal) -> (Tensor, Tensor, Tensor) + variants: function + dispatch: + CUDA: _flash_attention_forward + +# Returns output, logsumexp if compute_logsumexp +- func: _efficient_attention_forward(Tensor query, Tensor key, Tensor value, Tensor? cu_seqlens_q, Tensor? cu_seqlens_k, int? 
max_seqlen_q, bool compute_log_sumexp=False, bool causal=False) -> (Tensor, Tensor) + variants: function + dispatch: + CUDA: _efficient_attention_forward + +- func: _efficient_attention_backward(Tensor grad_out_, Tensor query, Tensor key, Tensor value, Tensor out, Tensor logsumexp, bool is_causal=False) -> (Tensor, Tensor, Tensor) + variants: function + dispatch: + CUDA: _efficient_attention_backward - func: _triton_scaled_dot_attention(Tensor q, Tensor k, Tensor v, float dropout_p=0.0) -> Tensor variants: function @@ -13792,3 +13966,11 @@ dispatch: CPU: foobar autogen: _foobar.out + +# Fused Optimizer CUDA kernels. +- func: _fused_adam_(Tensor(a!)[] self, Tensor(b!)[] grads, Tensor(c!)[] exp_avgs, Tensor(d!)[] exp_avg_sqs, Tensor(e!)[] max_exp_avg_sqs, Tensor[] state_steps, *, float lr, float beta1, float beta2, float weight_decay, float eps, bool amsgrad, bool maximize, Tensor? grad_scale=None, Tensor? found_inf=None) -> () + # Unlike "foreach" functions, lists of tensors should be guaranteed to be on the same device (for now). + variants: function + dispatch: + CUDA: _fused_adam_kernel_cuda_ + autogen: _fused_adam, _fused_adam.out diff --git a/aten/src/ATen/native/nested/NestedTensorAliases.cpp b/aten/src/ATen/native/nested/NestedTensorAliases.cpp new file mode 100644 index 000000000000..c5785297be4f --- /dev/null +++ b/aten/src/ATen/native/nested/NestedTensorAliases.cpp @@ -0,0 +1,15 @@ +#include + +namespace at { +namespace native { + +// alias for to_padded_tensor in nested namespace +Tensor nested_to_padded_tensor( + const Tensor& t, + double padding, + OptionalIntArrayRef output_size) { + return t.to_padded_tensor(padding, output_size); +} + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/nested/NestedTensorBackward.cpp b/aten/src/ATen/native/nested/NestedTensorBackward.cpp index 39016bd85e5f..51a4210a56ae 100644 --- a/aten/src/ATen/native/nested/NestedTensorBackward.cpp +++ b/aten/src/ATen/native/nested/NestedTensorBackward.cpp @@ -8,12 +8,12 @@ #include #include #include -#include +#include namespace at { namespace native { -// See Note [nested tensor matmul] TODO in NestedTensorMath.cpp +// See Note [nested tensor matmul] in NestedTensorMath.cpp std::tuple matmul_backward_nested( const Tensor& grad, const Tensor& self, @@ -66,23 +66,6 @@ std::tuple nested_linear_backward( return std::tuple{grad_input, grad_weight, grad_bias}; } -Tensor _reshape_nested_backward(const Tensor& self, const Tensor& grad) { - auto self_ptr = get_nested_tensor_impl(self); - // TODO: this is to reproduce self_ptr->opt_sizes_ - // if an accessor is provided in the future, can replace this - std::vector sizes; - for (int64_t i = 0; i < self_ptr->dim(); i++) { - c10::optional opt_size = self_ptr->opt_size(i); - if (opt_size.has_value()) { - sizes.push_back(*opt_size); - } - else { - sizes.push_back(-1); - } - } - return grad.reshape(sizes); -} - Tensor nested_softmax_backward( const Tensor& grad, const Tensor& output, @@ -123,6 +106,68 @@ Tensor nested_softmax_backward( input_dtype); } return grad_output; + +} + +// Rudimentary sum backward assuming the conditions in #82387 +Tensor _nested_sum_backward_cpu( + const Tensor& grad, + const Tensor& nested_self, + OptionalIntArrayRef opt_dims, + bool keepdim) { + auto nt_self = get_nested_tensor_impl(nested_self); + auto nt_grad = get_nested_tensor_impl(grad); + const Tensor& grad_buffer = nt_grad->get_buffer(); + const Tensor& self_buffer = nt_self->get_buffer(); + auto grad_sizes = nt_grad->get_nested_size_tensor(); + auto 
self_sizes = nt_self->get_nested_size_tensor(); + int64_t ntensors = nt_self->size(0); + const Tensor& self_grad_buffer = self_buffer.new_empty(self_buffer.sizes()); + + auto num_segments = at::prod(grad_sizes, -1); + auto segment_lengths = self_sizes.select(1, -1); + + // This logic assumes for now that + // (1) all the gradient nested tensors are contiguous + // (2) the gradient nested tensors are stored contiguously in the buffer + AT_DISPATCH_ALL_TYPES_AND2( + ScalarType::Half, ScalarType::BFloat16, self_grad_buffer.scalar_type(), "nested_sum_dim_cpu", [&]() { + auto* self_grad_data = self_grad_buffer.data_ptr(); + const auto* output_grad_data = grad_buffer.data_ptr(); + int64_t out_idx = 0, in_idx = 0; + for (const auto i : c10::irange(ntensors)) { + int64_t segments = num_segments[i].item(); + int64_t segment_length = segment_lengths[i].item(); + for (auto j = 0; j < segments; j++) { + scalar_t output_grad = output_grad_data[out_idx]; + for (auto k = 0; k < segment_length; k++) { + self_grad_data[in_idx] = output_grad; + in_idx += 1; + } + out_idx += 1; + } + } + }); + + return wrap_buffer(self_grad_buffer, self_sizes); + +} + + +Tensor _nested_select_backward_symint( + const Tensor& grad, + const Tensor& nested_self, + int64_t dim, + c10::SymInt index) { + auto nt_self = get_nested_tensor_impl(nested_self); + const Tensor& self_buffer = nt_self->get_buffer(); + const auto self_sizes = nt_self->get_nested_size_tensor(); + const Tensor& self_grad_buffer = self_buffer.new_zeros(self_buffer.sizes()); + + auto nt_grad = wrap_buffer(self_grad_buffer, self_sizes); + nt_grad.select_symint(dim, index).copy_(grad); + + return nt_grad; } } // namespace native diff --git a/aten/src/ATen/native/nested/NestedTensorBinaryOps.cpp b/aten/src/ATen/native/nested/NestedTensorBinaryOps.cpp new file mode 100644 index 000000000000..215252f91d6d --- /dev/null +++ b/aten/src/ATen/native/nested/NestedTensorBinaryOps.cpp @@ -0,0 +1,247 @@ +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +namespace at { +namespace native { + +DEFINE_DISPATCH(nested_dense_elementwise_stub); +REGISTER_NO_CPU_DISPATCH(nested_dense_elementwise_stub); + +std::pair +get_elementwise_nested_tensor_impl( + const Tensor& self, + const Tensor& other, + const std::string& op_name) { + if (self.is_nested() && !(other.is_nested())) { + TORCH_CHECK( + false, + "Expected both self and other to be nested, but got a nested self and non-nested other"); + } else if (!(self.is_nested()) && other.is_nested()) { + TORCH_CHECK( + false, + "Expected both self and other to be nested, but got a non-nested self and nested other"); + } else if (!(self.is_nested()) || !(other.is_nested())) { + TORCH_CHECK( + false, + "Expected both self and other to be nested, but got a non-nested self and non-nested other"); + } + + auto self_ptr = get_nested_tensor_impl(self); + auto other_ptr = get_nested_tensor_impl(other); + + TORCH_CHECK( + self.dim() == other.dim(), + op_name, + " does not support broadcasting when given a NestedTensor"); + TORCH_CHECK( + at::equal( + self_ptr->get_nested_size_tensor(), + other_ptr->get_nested_size_tensor()), + op_name, + " does not support broadcasting when given a NestedTensor"); + TORCH_CHECK( + at::equal( + self_ptr->get_nested_stride_tensor(), + other_ptr->get_nested_stride_tensor()), + op_name, + " requires strides to match when given NestedTensors"); + auto self_offsets = self_ptr->get_storage_offsets(); + auto 
other_offsets = other_ptr->get_storage_offsets(); + bool offsets_match = true; + for (size_t i = 0; i < self_offsets.size(); i++) { + offsets_match = offsets_match && (self_offsets[i] == other_offsets[i]); + } + TORCH_CHECK( + offsets_match, + op_name, + " requires offsets to match when given NestedTensors"); + return std::make_pair(self_ptr, other_ptr); +} + +template +Tensor NestedTensor_elementwise_Tensor( + const Tensor& self, + const Tensor& other, + const std::string& op_name, + Func f) { + // self is a scalar + if (!self.is_nested() && self.dim() == 0 && self.numel() == 1) { + auto other_impl = get_nested_tensor_impl(other); + return wrap_buffer( + f(self, other_impl->get_unsafe_storage_as_tensor()), + other_impl->get_nested_size_tensor().clone(), + other_impl->get_nested_stride_tensor().clone(), + other_impl->get_storage_offsets() + ); + } + // other is a scalar + if (!other.is_nested() && other.dim() == 0 && other.numel() == 1) { + auto self_impl = get_nested_tensor_impl(self); + return wrap_buffer( + f(self_impl->get_unsafe_storage_as_tensor(), other), + self_impl->get_nested_size_tensor().clone(), + self_impl->get_nested_stride_tensor().clone(), + self_impl->get_storage_offsets() + ); + } + // special case when other is dense + if (self.is_nested() && !other.is_nested()) { + // check for the [B, *, D], [B, 1, D] esuhm case + // TODO: this if statement is ugly and hopefully we will remove this in the near future + auto self_ptr = get_nested_tensor_impl(self); + if (self_ptr->dim() == 3 && + other.dim() == 3 && + self_ptr->size(0) == other.size(0) && + other.size(1) == 1 && + self_ptr->opt_size(2).has_value() && + self_ptr->opt_size(2).value() == other.size(2) && + self.is_cuda() && + other.is_cuda()) { + if (!nested_tensor_impl_is_contiguous(self_ptr)) { + self_ptr = get_nested_tensor_impl(self.contiguous()); + } + const auto self_buffer = self_ptr->get_buffer(); + const auto self_sizes = self_ptr->get_nested_size_tensor(); + auto result_buffer = at::empty_like(self_buffer); + auto result = wrap_buffer(result_buffer, self_sizes); + if (op_name == "add") { + nested_dense_elementwise_stub(self.device().type(), result, self, other, NESTED_DENSE_OP::ADD); + } else if (op_name == "mul") { + nested_dense_elementwise_stub(self.device().type(), result, self, other, NESTED_DENSE_OP::MUL); + } else { + TORCH_CHECK(false, "Unsupported nested dense elementwise op"); + } + return result; + } + TORCH_CHECK(false, "Expected both self and other to be nested, but got a nested self and non-nested other."); + } + + NestedTensorImpl* self_impl = nullptr; + NestedTensorImpl* other_impl = nullptr; + std::tie(self_impl, other_impl) = + get_elementwise_nested_tensor_impl(self, other, op_name); + TORCH_INTERNAL_ASSERT_DEBUG_ONLY(self_impl); + TORCH_INTERNAL_ASSERT_DEBUG_ONLY(other_impl); + return wrap_buffer( + f(self_impl->get_unsafe_storage_as_tensor(), + other_impl->get_unsafe_storage_as_tensor()), + self_impl->get_nested_size_tensor(), + self_impl->get_nested_stride_tensor(), + self_impl->get_storage_offsets()); +} + +Tensor NestedTensor_add_Tensor( + const Tensor& self, + const Tensor& other, + const Scalar& alpha) { + return NestedTensor_elementwise_Tensor( + self, other, "add", [alpha](const Tensor& b1, const Tensor& b2) { + return at::add(b1, b2, alpha); + }); +} + +Tensor NestedTensor_mul_Tensor(const Tensor& self, const Tensor& other) { + return NestedTensor_elementwise_Tensor( + self, other, "mul", [](const Tensor& b1, const Tensor& b2) { + return at::mul(b1, b2); + }); +} + +// Only usable on 
the C++ side; scalars are converted to tensors coming from Python. +Tensor NestedTensor_mul_Scalar(const Tensor& self, const Scalar& other) { + return NestedTensor_mul_Tensor(self, wrapped_scalar_tensor(other)); +} + +Tensor NestedTensor_div_Tensor(const Tensor& self, const Tensor& other) { + return NestedTensor_elementwise_Tensor( + self, other, "div", [](const Tensor& b1, const Tensor& b2) { + return at::div(b1, b2); + }); +} + +// Only usable on the C++ side; scalars are converted to tensors coming from Python. +Tensor NestedTensor_div_Scalar(const Tensor& self, const Scalar& other) { + return NestedTensor_div_Tensor(self, wrapped_scalar_tensor(other)); +} + +template +Tensor& NestedTensor_elementwise__Tensor( + Tensor& self, + const Tensor& other, + const std::string& op_name, + Func f) { + // self is a scalar + if (!self.is_nested() && self.dim() == 0 && self.numel() == 1) { + auto other_impl = get_nested_tensor_impl(other); + f(self, other_impl->get_buffer()); + return self; + } + // other is a scalar + if (!other.is_nested() && other.dim() == 0 && other.numel() == 1) { + auto self_impl = get_nested_tensor_impl(self); + f(self_impl->get_buffer(), other); + return self; + } + NestedTensorImpl* self_impl = nullptr; + NestedTensorImpl* other_impl = nullptr; + std::tie(self_impl, other_impl) = + get_elementwise_nested_tensor_impl(self, other, op_name); + TORCH_INTERNAL_ASSERT_DEBUG_ONLY(self_impl); + TORCH_INTERNAL_ASSERT_DEBUG_ONLY(other_impl); + const auto& nt_self = *self_impl; + const auto& nt_other = *other_impl; + f(nt_self.get_buffer().view({-1}), nt_other.get_buffer().view({-1})); + return self; +} + +Tensor& NestedTensor_add__Tensor( + Tensor& self, + const Tensor& other, + const Scalar& alpha) { + return NestedTensor_elementwise__Tensor( + self, other, "add_", [alpha](const Tensor& b1, const Tensor& b2) { + return b1.add_(b2, alpha); + }); +} + +Tensor& NestedTensor_mul__Tensor(Tensor& self, const Tensor& other) { + return NestedTensor_elementwise__Tensor( + self, other, "mul_", [](const Tensor& b1, const Tensor& b2) { + return b1.mul_(b2); + }); +} + +// Only usable on the C++ side; scalars are converted to tensors coming from Python. 
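For orientation, the core invariant that get_elementwise_nested_tensor_impl enforces for the binary ops above can be sketched in plain C++ over the flattened metadata; the struct and helper below are illustrative stand-ins under assumed names, not part of the patch.

#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative stand-in for the per-constituent metadata a nested tensor carries:
// one row of sizes per constituent plus that constituent's offset into the flat buffer.
struct NestedMeta {
  std::vector<std::vector<int64_t>> sizes;
  std::vector<int64_t> storage_offsets;
};

// Mirrors the checks above: nested elementwise ops require identical nested sizes
// and identical storage offsets; broadcasting across constituents is not supported.
inline void check_elementwise_compatible(const NestedMeta& a,
                                         const NestedMeta& b,
                                         const std::string& op_name) {
  if (a.sizes != b.sizes) {
    throw std::runtime_error(op_name + " does not support broadcasting when given a NestedTensor");
  }
  if (a.storage_offsets != b.storage_offsets) {
    throw std::runtime_error(op_name + " requires offsets to match when given NestedTensors");
  }
}

Once that invariant holds, NestedTensor_elementwise_Tensor simply applies f to the two flat storage tensors and re-wraps the result with self's nested sizes, strides, and offsets, as shown above.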
+Tensor& NestedTensor_mul__Scalar(Tensor& self, const Scalar& other) { + return NestedTensor_mul__Tensor(self, wrapped_scalar_tensor(other)); +} + +Tensor& fill_nested_(Tensor& self, const Scalar& value) { + const auto& self_buf = get_nested_tensor_impl(self)->get_buffer(); + self_buf.fill_(value); + return self; +} + +Tensor& fill_nested_(Tensor& self, const Tensor& value) { + const auto& self_buf = get_nested_tensor_impl(self)->get_buffer(); + self_buf.fill_(value); + return self; +} + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/nested/NestedTensorBinaryOps.h b/aten/src/ATen/native/nested/NestedTensorBinaryOps.h new file mode 100644 index 000000000000..51eeaf291911 --- /dev/null +++ b/aten/src/ATen/native/nested/NestedTensorBinaryOps.h @@ -0,0 +1,16 @@ +#pragma once + +#include +#include + +namespace at { +namespace native { + +enum class NESTED_DENSE_OP: uint8_t {ADD, MUL}; + +using nested_dense_elementwise_fn = void (*)(Tensor& result, const Tensor & self, const Tensor & other, const NESTED_DENSE_OP& op); + +DECLARE_DISPATCH(nested_dense_elementwise_fn, nested_dense_elementwise_stub); + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/nested/NestedTensorFactories.cpp b/aten/src/ATen/native/nested/NestedTensorFactories.cpp new file mode 100644 index 000000000000..b45fbb24880c --- /dev/null +++ b/aten/src/ATen/native/nested/NestedTensorFactories.cpp @@ -0,0 +1,125 @@ +#include +#include +#include + +namespace at { +namespace native { + +TensorOptions verify_empty_parameters( + const at::Tensor& self, + c10::optional dtype, + c10::optional layout, + c10::optional device, + c10::optional pin_memory, + c10::optional optional_memory_format) { + TensorOptions options_ = + TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory( + pin_memory); + + TORCH_CHECK( + !(options_.has_memory_format() && optional_memory_format.has_value()), + "Cannot set memory_format both in TensorOptions and explicit argument; please delete " + "the redundant setter."); + TensorOptions options = self.options().merge_in(options_).merge_memory_format( + optional_memory_format); + + auto memory_format = + options_.memory_format_opt().value_or(MemoryFormat::Preserve); + TORCH_CHECK( + memory_format == MemoryFormat::Preserve, + "empty_like_nested only supports memory format Preserve, but got ", + memory_format, + " instead."); + + TORCH_CHECK( + self.is_contiguous(), + "empty_like only supports contiguous memory format for Nested Tensors"); + + TORCH_CHECK( + !(options.layout() != kStrided && optional_memory_format.has_value()), + "memory format option is only supported by strided tensors"); + return options; +} + +Tensor empty_like_nested( + const Tensor& self, + c10::optional dtype, + c10::optional layout, + c10::optional device, + c10::optional pin_memory, + c10::optional optional_memory_format) { + auto options = verify_empty_parameters( + self, dtype, layout, device, pin_memory, optional_memory_format); + auto self_nt = get_nested_tensor_impl(self); + Tensor new_buffer = at::empty_like(self_nt->get_buffer(), options); + auto nested_size = self_nt->get_nested_size_tensor().clone(); + auto nested_strides = self_nt->get_nested_stride_tensor().clone(); + auto offsets = std::vector(self_nt->get_storage_offsets()); + auto tensor = detail::make_tensor_base( + new_buffer, nested_size, nested_strides, std::move(offsets)); + return tensor; +} + +// Take a Device that may not have device_index set (i.e., having it as -1 +// representing the current 
device) and return the corresponding Device +// according to the actual device at the time of this function call. No-op +// if the device_index is set. +static inline Device ensure_has_index(Device device) { + if (device.is_cpu() || device.has_index()) { + return device; + } + const c10::impl::DeviceGuardImplInterface* impl = + c10::impl::getDeviceGuardImpl(device.type()); + return impl->getDevice(); +} + +Tensor _to_copy_nested( + const Tensor& self, + c10::optional dtype, + c10::optional layout, + c10::optional device, + c10::optional pin_memory, + bool non_blocking, + c10::optional optional_memory_format) { + TORCH_CHECK( + !layout.has_value() || self.layout() == layout.value(), + "to(options) doesn't support converting to a different layout, " + "but got self.layout being ", + self.layout(), + " and options.layout set as ", + layout.value()); + auto options = + TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory( + pin_memory); + + if (options.has_device()) { + options = options.device(ensure_has_index(options.device())); + } + // memory_format is handled separately due to MemoryFormat::Preserve logic + options = self.options().merge_in(options).memory_format(c10::nullopt); + auto memory_format = optional_memory_format.value_or(MemoryFormat::Preserve); + + bool pin_out = + (non_blocking && self.is_cuda() && options.device().is_cpu() && + (options.layout() == c10::kStrided)); + + Tensor r; + r = at::empty_like(self, dtype, layout, device, pin_out, memory_format); + get_nested_tensor_impl(r)->get_buffer().copy_( + get_nested_tensor_impl(self)->get_buffer(), non_blocking); + return r; +} + +Tensor& copy_nested_(Tensor& self, const Tensor& src, bool non_blocking) { + const auto* nt_self = get_nested_tensor_impl(self); + const auto* nt_src = get_nested_tensor_impl(src); + TORCH_CHECK( + at::equal( + nt_self->get_nested_size_tensor(), nt_src->get_nested_size_tensor()), + "copy_ only supports tensors that are the same size for Nested implementations"); + nt_self->get_buffer().copy_(nt_src->get_buffer(), non_blocking); + return self; +} + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/nested/NestedTensorFactories.h b/aten/src/ATen/native/nested/NestedTensorFactories.h new file mode 100644 index 000000000000..51123f0fc119 --- /dev/null +++ b/aten/src/ATen/native/nested/NestedTensorFactories.h @@ -0,0 +1,7 @@ +#pragma once + +namespace at { +namespace native { + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/nested/NestedTensorMath.cpp b/aten/src/ATen/native/nested/NestedTensorMath.cpp index 486173a1aa67..5842c3b8b217 100644 --- a/aten/src/ATen/native/nested/NestedTensorMath.cpp +++ b/aten/src/ATen/native/nested/NestedTensorMath.cpp @@ -1,42 +1,23 @@ #include -#include #include -#include -#include -#include -#include +#include +#include +#include #include -#include -#include +#include +#include +#include +#include +#include +#include +#include + +#include namespace at { namespace native { - namespace { -template -Tensor map_nt(const Tensor& nt, Func f) { - auto* nt_impl = get_nested_tensor_impl(nt); - const auto& sizes = nt_impl->get_nested_size_tensor(); - return at::detail::make_tensor(f(nt_impl->get_buffer()), sizes); -} - -c10::optional maybe_get_consistent_last_dim_of_nested_tensor( - const NestedTensorImpl& nt) { - const auto& sizes = nt.get_nested_size_tensor(); - // The last entry in every row of sizes must be the same. 
- const auto& last_dims = sizes.select(1, -1); - const auto last_dims_accessor = last_dims.packed_accessor64(); - // REVIEW: this can't be the most efficient and concise way to - // write this check, can it? - const auto last_dim_value = last_dims_accessor[0]; - for (const auto i : c10::irange(1, last_dims.numel())) { - if (last_dims_accessor[i] != last_dim_value) { - return c10::nullopt; - } - } - return last_dim_value; -} int64_t num_bytes(IntArrayRef sizes) { // 0-dim Tensors have torch.Size of .size() 0, but carry 1 memory. @@ -53,26 +34,6 @@ int64_t num_bytes(IntArrayRef sizes) { return result; } -std::vector NestedTensor_get_max_size_from_size_tensor(const Tensor& sizes) { - if (sizes.dim() == 0) { - return {}; - } - const auto sizes_ptr = sizes.data_ptr(); - const auto sizes_size_0 = sizes.sizes()[0]; - const auto sizes_size_1 = sizes.sizes()[1]; - TORCH_INTERNAL_ASSERT(sizes_size_1 > 0); - std::vector results(sizes_size_1, 0); - for (const auto ii : c10::irange(sizes_size_0)) { - for (const auto jj : c10::irange(sizes_size_1)) { - auto val = sizes_ptr[ii * sizes_size_1 + jj]; - if (results[jj] < val) { - results[jj] = val; - } - } - } - return results; -} - Tensor pad_tensor_to_shape( const Tensor& t, IntArrayRef goal_shape, @@ -111,40 +72,17 @@ std::vector NestedTensor_unbind( if (ntensors == 0) { return result_tensors; } - const at::Tensor& buffer = self_ptr->get_buffer(); + // This returns a differentiable view of self as a regular tensor + auto buffer = self.values(); std::vector sizes = NestedTensor_get_sizes(self_ptr), strides = NestedTensor_get_strides(self_ptr); - const std::vector& offsets = self_ptr->get_offsets(); + const std::vector& offsets = self_ptr->get_storage_offsets(); for (const int64_t i: c10::irange(ntensors)){ result_tensors[i] = buffer.as_strided(sizes[i], strides[i], offsets[i]); } return result_tensors; } -Tensor& NestedTensor_relu_(Tensor& self) { - auto buffer = get_nested_tensor_impl(self)->get_buffer(); - at::relu_(buffer); - return self; -} - -Tensor NestedTensor_relu(const Tensor& self) { - return map_nt(self, at::relu); -} - -Tensor& NestedTensor_gelu_(Tensor& self, c10::string_view approximate) { - auto buffer = get_nested_tensor_impl(self)->get_buffer(); - at::gelu_(buffer, approximate); - return self; -} - -Tensor NestedTensor_gelu(const Tensor& self, c10::string_view approximate) { - return map_nt( - self, - [approximate](const Tensor& buffer) { - return at::gelu(buffer, approximate); - }); -} - Tensor NestedTensor_nested_tensor_from_mask(const Tensor& t, const Tensor& mask, bool mask_check) { TORCH_CHECK(mask.scalar_type() == at::ScalarType::Bool, "Expected mask to be of ScalarType Bool, but got ", mask.scalar_type(), " instead."); TORCH_CHECK(mask.dim() == 2, "Padding mask should be 2D"); @@ -197,7 +135,7 @@ bool NestedTensor_nested_tensor_from_mask_left_aligned(const Tensor& t, const Te return sizes.equal(nums); } -Tensor nested_tensor( +Tensor _nested_tensor_from_tensor_list( TensorList list, c10::optional dtype, c10::optional layout, @@ -229,21 +167,58 @@ Tensor nested_tensor( pin_memory); } -int64_t get_consistent_last_dim_of_nested_tensor(const NestedTensorImpl& nt) { - auto result = maybe_get_consistent_last_dim_of_nested_tensor(nt); +C10_ALWAYS_INLINE std::pair _check_nested_layer_norm_inputs( + const NestedTensorImpl& input, + IntArrayRef normalized_shape, + const Tensor& weight /* optional */, + const Tensor& bias /* optional */) { + + const size_t normalized_ndim = normalized_shape.size(); TORCH_CHECK( - result.has_value(), - "all 
tensors in NestedTensor must have the same trailing dim for Matmul but got ", - nt.get_nested_size_tensor().select(1, -1)); - return *result; -} + normalized_ndim >= 1, + "Expected normalized_shape to be at least 1-dimensional, i.e., ", + "containing at least one element, but got normalized_shape = ", + normalized_shape); + TORCH_CHECK( + !weight.defined() || weight.sizes().equals(normalized_shape), + "Expected weight to be of same shape as normalized_shape, but got ", + "weight of shape ", + weight.sizes(), + " and normalized_shape = ", + normalized_shape); + TORCH_CHECK( + !bias.defined() || bias.sizes().equals(normalized_shape), + "Expected bias to be of same shape as normalized_shape, but got ", + "bias of shape ", + bias.sizes(), + " and normalized_shape = ", + normalized_shape); + + // Check that the normalized_shape has the exact same sizes as the last dimensions from the NestedTensor input + // Also, compute M and N considering the idiosyncracies of NestedTensors + int64_t N = 1; + for (const auto i: c10::irange(normalized_ndim)) { + TORCH_CHECK( + input.opt_size(-normalized_ndim + i) != c10::nullopt, + "normalized_shape extends into irregular dimensions for the nested tensor" + ); + TORCH_CHECK( + normalized_shape[i] == *input.opt_size(-normalized_ndim + i), + "The shape at dimension ", + i, + "of normalized_shape doesn't match the input" + ); + N *= normalized_shape[i]; + } + + const int64_t M = input.numel() / N; -std::vector NestedTensor_get_max_size(const NestedTensorImpl& nt) { - return NestedTensor_get_max_size_from_size_tensor(nt.get_nested_size_tensor()); + return std::make_pair(M, N); } -Tensor NestedTensor_layer_norm( +std::tuple nested_layer_norm( const Tensor& input, + IntArrayRef normalized_shape, const c10::optional& weight_opt, const c10::optional& bias_opt, double eps) { @@ -255,8 +230,9 @@ Tensor NestedTensor_layer_norm( auto* nt_input = get_nested_tensor_impl(input); TORCH_CHECK(nested_tensor_impl_is_contiguous(nt_input)); const auto& input_buffer = nt_input->get_buffer(); - const auto last_dim = get_consistent_last_dim_of_nested_tensor(*nt_input); - const auto valid_word_num = input_buffer.numel() / last_dim; + auto M_N = _check_nested_layer_norm_inputs(*nt_input, normalized_shape, weight, bias); + auto M = M_N.first; + auto N = M_N.second; const auto weight_contig = weight.expect_contiguous(); const auto bias_contig = bias.expect_contiguous(); auto output_buffer = at::native::empty_like( @@ -271,21 +247,24 @@ Tensor NestedTensor_layer_norm( auto acc_type = at::toAccumulateType(input_buffer.scalar_type(), true); options = options.dtype(acc_type); } - Tensor mean = at::empty({valid_word_num}, options); - Tensor rstd = at::empty({valid_word_num}, options); + Tensor mean = at::empty({M}, options); + Tensor rstd = at::empty({M}, options); LayerNormKernel( input_buffer.is_cuda() ? 
kCUDA : kCPU, input_buffer, *weight_contig, *bias_contig, - valid_word_num, - last_dim, + M, + N, eps, &output_buffer, &mean, &rstd); - return at::detail::make_tensor( - std::move(output_buffer), nt_input->get_nested_size_tensor()); + return std::make_tuple( + wrap_buffer(output_buffer, nt_input->get_nested_size_tensor()), + mean, + rstd + ); } Tensor NestedTensor_from_padded_and_nested_example( @@ -441,157 +420,6 @@ Tensor NestedTensor_embedding( result_buffer.reshape({-1}), std::move(new_sizes)); } -std::pair -get_elementwise_nested_tensor_impl( - const Tensor& self, - const Tensor& other, - const std::string& op_name) { - if (self.is_nested() && !(other.is_nested())) { - TORCH_CHECK( - false, - "Expected both self and other to be nested, but got a nested self and non-nested other"); - } else if (!(self.is_nested()) && other.is_nested()) { - TORCH_CHECK( - false, - "Expected both self and other to be nested, but got a non-nested self and nested other"); - } else if (!(self.is_nested()) || !(other.is_nested())) { - TORCH_CHECK( - false, - "Expected both self and other to be nested, but got a non-nested self and non-nested other"); - } - - auto self_ptr = get_nested_tensor_impl(self); - auto other_ptr = get_nested_tensor_impl(other); - - TORCH_CHECK( - self.dim() == other.dim(), - op_name, - " does not support broadcasting when given a NestedTensor"); - TORCH_CHECK( - at::equal( - self_ptr->get_nested_size_tensor(), - other_ptr->get_nested_size_tensor()), - op_name, - " does not support broadcasting when given a NestedTensor"); - TORCH_CHECK( - nested_tensor_impl_is_contiguous(self_ptr) && - nested_tensor_impl_is_contiguous(other_ptr), - op_name, - " does not support non-contiguous NestedTensor inputs"); - return std::make_pair(self_ptr, other_ptr); -} - -template -Tensor NestedTensor_elementwise_Tensor( - const Tensor& self, - const Tensor& other, - const std::string& op_name, - Func f) { - // self is a scalar - if (!self.is_nested() && self.dim() == 0 && self.numel() == 1) { - auto other_impl = get_nested_tensor_impl(other); - return wrap_buffer( - f(self, other_impl->get_buffer()), - other_impl->get_nested_size_tensor().clone() - ); - } - // other is a scalar - if (!other.is_nested() && other.dim() == 0 && other.numel() == 1) { - auto self_impl = get_nested_tensor_impl(self); - return wrap_buffer( - f(self_impl->get_buffer(), other), - self_impl->get_nested_size_tensor().clone() - ); - } - NestedTensorImpl* self_impl = nullptr; - NestedTensorImpl* other_impl = nullptr; - std::tie(self_impl, other_impl) = - get_elementwise_nested_tensor_impl(self, other, op_name); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(self_impl); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(other_impl); - const auto& nt_self = *self_impl; - const auto& nt_other = *other_impl; - const auto& self_sizes = nt_self.get_nested_size_tensor(); - return wrap_buffer( - f(nt_self.get_buffer().reshape({-1}), - nt_other.get_buffer().reshape({-1})), - self_sizes); -} - -Tensor NestedTensor_add_Tensor( - const Tensor& self, - const Tensor& other, - const Scalar& alpha) { - return NestedTensor_elementwise_Tensor( - self, other, "add", [alpha](const Tensor& b1, const Tensor& b2) { - return at::add(b1, b2, alpha); - }); -} - -Tensor NestedTensor_mul_Tensor(const Tensor& self, const Tensor& other) { - return NestedTensor_elementwise_Tensor( - self, other, "mul", [](const Tensor& b1, const Tensor& b2) { - return at::mul(b1, b2); - }); -} - -// Only usable on the C++ side; scalars are converted to tensors coming from Python. 
-Tensor NestedTensor_mul_Scalar(const Tensor& self, const Scalar& other) { - return NestedTensor_mul_Tensor(self, wrapped_scalar_tensor(other)); -} - -template -Tensor& NestedTensor_elementwise__Tensor( - Tensor& self, - const Tensor& other, - const std::string& op_name, - Func f) { - // self is a scalar - if (!self.is_nested() && self.dim() == 0 && self.numel() == 1) { - auto other_impl = get_nested_tensor_impl(other); - f(self, other_impl->get_buffer()); - return self; - } - // other is a scalar - if (!other.is_nested() && other.dim() == 0 && other.numel() == 1) { - auto self_impl = get_nested_tensor_impl(self); - f(self_impl->get_buffer(), other); - return self; - } - NestedTensorImpl* self_impl = nullptr; - NestedTensorImpl* other_impl = nullptr; - std::tie(self_impl, other_impl) = - get_elementwise_nested_tensor_impl(self, other, op_name); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(self_impl); - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(other_impl); - const auto& nt_self = *self_impl; - const auto& nt_other = *other_impl; - f(nt_self.get_buffer().view({-1}), nt_other.get_buffer().view({-1})); - return self; -} - -Tensor& NestedTensor_add__Tensor( - Tensor& self, - const Tensor& other, - const Scalar& alpha) { - return NestedTensor_elementwise__Tensor( - self, other, "add_", [alpha](const Tensor& b1, const Tensor& b2) { - return b1.add_(b2, alpha); - }); -} - -Tensor& NestedTensor_mul__Tensor(Tensor& self, const Tensor& other) { - return NestedTensor_elementwise__Tensor( - self, other, "mul_", [](const Tensor& b1, const Tensor& b2) { - return b1.mul_(b2); - }); -} - -// Only usable on the C++ side; scalars are converted to tensors coming from Python. -Tensor& NestedTensor_mul__Scalar(Tensor& self, const Scalar& other) { - return NestedTensor_mul__Tensor(self, wrapped_scalar_tensor(other)); -} - // Very rudimentary sum_dim for prototyping with torch_scatter.segment_reduce. Tensor NestedTensor_sum_dim_CPU( const Tensor& self, @@ -666,61 +494,117 @@ Tensor NestedTensor_sum_dim_CPU( Tensor select_nested(const Tensor& self, int64_t dim, int64_t index) { auto self_ptr = get_nested_tensor_impl(self); + std::vector sizes = NestedTensor_get_sizes(self_ptr), + strides = NestedTensor_get_strides(self_ptr); + const std::vector& offsets = self_ptr->get_storage_offsets(); + const at::Tensor& buffer = self_ptr->get_unsafe_storage_as_tensor(); int64_t positive_dim = at::maybe_wrap_dim(dim, self_ptr->dim()); - TORCH_CHECK( - positive_dim == 0, - "NestedTensor can only be selected along dimension 0 ", - "got dimension ", dim, " instead." - ); int64_t ntensors = self_ptr->size(0); - TORCH_CHECK_INDEX( - index >= -ntensors && index < ntensors, - "index ", index, - " is out of bounds for dimension 0 with size ", ntensors); - int64_t positive_index = index < 0 ? index + ntensors : index; - const at::Tensor& buffer = self_ptr->get_buffer(); - std::vector sizes = NestedTensor_get_sizes(self_ptr), - strides = NestedTensor_get_strides(self_ptr); - const std::vector& offsets = self_ptr->get_offsets(); - return buffer.as_strided(sizes[positive_index], strides[positive_index], offsets[positive_index]); + TORCH_CHECK_INDEX(ntensors > 0, "You can only select when the NT is not empty."); + int64_t ndims = static_cast(sizes[0].size()); + if (positive_dim == 0) { + TORCH_CHECK_INDEX( + index >= -ntensors && index < ntensors, + "index ", + index, + " is out of bounds for dimension 0 with size ", + ntensors); + int64_t positive_index = index < 0 ? 
index + ntensors : index; + return buffer.as_strided( + sizes[positive_index], + strides[positive_index], + offsets[positive_index]); + } else { + auto new_sizes = at::empty({ntensors, ndims-1}, TensorOptions().dtype(kLong)); + auto new_strides = at::empty({ntensors, ndims-1}, TensorOptions().dtype(kLong)); + auto new_offsets = std::vector(offsets); + std::vector tensor_slices(ntensors); + for (int64_t i : c10::irange(ntensors)) { + int64_t *size_ptr = new_sizes[i].data_ptr(); + int64_t *stride_ptr = new_strides[i].data_ptr(); + + int64_t dim_idx = 0; + for (int64_t j : c10::irange(ndims)) { + if (j != dim - 1) { + size_ptr[dim_idx] = sizes[i][j]; + stride_ptr[dim_idx] = strides[i][j]; + ++dim_idx; + } else { + TORCH_CHECK_INDEX( + index >= 0 && index < sizes[i][j], + "index ", + index, + " is out of bounds for dimension ", + j, + " of the ", + i, + "th constituent tensor with size ", + sizes[i][j]); + new_offsets[i] = offsets[i] + index * strides[i][j]; + } + } + } + return create_nested_view_tensor(self, new_sizes, new_strides, std::move(new_offsets)); + } + } Tensor clone_nested( const Tensor& self, c10::optional optional_memory_format) { - auto memory_format = optional_memory_format.value_or(MemoryFormat::Preserve); - TORCH_CHECK( - memory_format == MemoryFormat::Preserve, - "clone_nested only supports memory format Preserve, but got ", - memory_format, - " instead."); - // TODO: The size doesn't necessarily need to be cloned, but it is more - // conservative. This is something we could revisit once we land a more - // efficient implementation of nested_size_tensor_. - return wrap_buffer( - get_buffer(self).clone(), get_nested_size_tensor(self).clone()); -} - -at::Tensor NestedTensor_get_nested_size_tensor(const at::Tensor& self){ - return get_nested_size_tensor(self); + auto memory_format = optional_memory_format.value_or(c10::MemoryFormat::Preserve); + auto self_ptr = get_nested_tensor_impl(self); + if (memory_format == c10::MemoryFormat::Preserve || + (memory_format == c10::MemoryFormat::Contiguous && self.is_contiguous())) { + const Tensor& buffer = self_ptr->get_unsafe_storage_as_tensor(), + sizemat = self_ptr->get_nested_size_tensor(), + stridemat = self_ptr->get_nested_stride_tensor(); + const std::vector& offsets = self_ptr->get_storage_offsets(); + // TODO: The size and the stride do not necessarily need to be cloned, + // but it is more conservative. + // This is something we could revisit once we land a more + // efficient implementation of nested_size_tensor_ and nested_stride_tensor. 
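The non-batch branch of select_nested above reduces to per-constituent stride arithmetic: drop one size/stride column and advance the storage offset. A rough standalone sketch of that bookkeeping for a single constituent, with hypothetical names, is:

#include <cstdint>
#include <vector>

// One constituent's view metadata into the shared flat buffer.
struct ConstituentView {
  std::vector<int64_t> sizes;
  std::vector<int64_t> strides;
  int64_t offset;
};

// Select `index` along constituent-local dimension `d` (the nested dim minus 1):
// the selected dimension disappears and the offset moves by index * strides[d],
// mirroring new_offsets[i] = offsets[i] + index * strides[i][j] above.
inline ConstituentView select_constituent(const ConstituentView& in,
                                          int64_t d,
                                          int64_t index) {
  ConstituentView out;
  out.offset = in.offset + index * in.strides[d];
  for (int64_t j = 0; j < static_cast<int64_t>(in.sizes.size()); ++j) {
    if (j != d) {
      out.sizes.push_back(in.sizes[j]);
      out.strides.push_back(in.strides[j]);
    }
  }
  return out;
}

For dim == 0 the patch instead returns buffer.as_strided(sizes[index], strides[index], offsets[index]) directly, since selecting along the batch dimension just picks one constituent.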
+ return wrap_buffer(buffer.clone(), sizemat.clone(), stridemat.clone(), std::vector(offsets)); + } + // actually, memory format is contiguous and self is noncontiguous + else if (memory_format == c10::MemoryFormat::Contiguous) { + const Tensor& self_buffer = self_ptr->get_unsafe_storage_as_tensor(), + sizemat = self_ptr->get_nested_size_tensor(); + Tensor output_buffer = at::empty(self.numel(), self_buffer.options()); + Tensor output = wrap_buffer(output_buffer, sizemat); + std::vector self_unbind = self.unbind(), + output_unbind = output.unbind(); + for (const int64_t i: c10::irange(self_ptr->size(0))) { + output_unbind[i].copy_(self_unbind[i]); + } + return output; + } else { + TORCH_CHECK( + false, + "Nested tensor clone supports Preserve and Contiguous memory formats, called clone with memory format: ", + memory_format); + } } -Tensor dropout_nested(const Tensor& input, double p, bool train) { +std::tuple native_dropout_nested(const Tensor& input, double p, c10::optional train) { auto input_ptr = get_nested_tensor_impl(input); - const Tensor& input_buffer = input_ptr->get_buffer(), + const Tensor& input_buffer = input_ptr-> get_unsafe_storage_as_tensor(), & sizemat = input_ptr->get_nested_size_tensor(), & stridemat = input_ptr->get_nested_stride_tensor(); - const std::vector& offsets = input_ptr->get_offsets(); - Tensor output_buffer = at::dropout(input_buffer, p, train); + const std::vector& offsets = input_ptr->get_storage_offsets(); + Tensor output_buffer, mask_buffer; + if (input_buffer.numel() == 0) { + output_buffer = input_buffer.clone(); + mask_buffer = input_buffer.clone(); + } + else { + std::tie(output_buffer, mask_buffer) = at::native_dropout(input_buffer, p, train); + } // regular tensor dropout reuses input size and stride // i.e. if input is not contiguous, then output is also discontiguous - return wrap_buffer(output_buffer, sizemat.clone(), stridemat.clone(), offsets); -} - -Tensor& dropout_nested_(Tensor& input, double p, bool train) { - Tensor input_buffer = get_buffer(input); - at::dropout_(input_buffer, p, train); - return input; + Tensor output = wrap_buffer(output_buffer, sizemat.clone(), stridemat.clone(), std::vector(offsets)), + mask = wrap_buffer(mask_buffer, sizemat.clone(), stridemat.clone(), std::vector(offsets)); + return std::make_tuple(output, mask); } Tensor softmax_nested( @@ -737,7 +621,10 @@ Tensor softmax_nested( positive_dim >= 1, "Cannot apply softmax across nested dimension 0"); // create a contiguous output - const Tensor& buffer = input_ptr->get_buffer(), + // TODO We would ideally use a empty_like here, but that is not supported + // for nested tensors yet. Since we are only using the buffer for the options + // and size it is okay to use unsafe_storage_as_tensor here. 
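The Contiguous branch of clone_nested above compacts a possibly noncontiguous nested layout into one densely packed buffer by copying constituent by constituent. A minimal standalone sketch of that compaction for 2-D constituents, using illustrative names that are not part of the patch, is:

#include <cstdint>
#include <vector>

// Pack strided 2-D constituents from `src` into a new contiguous buffer.
// sizes[i] = {rows, cols}, strides[i] = {row_stride, col_stride}, and offsets[i]
// is the start of constituent i in `src`, mirroring the unbind-and-copy loop above.
inline std::vector<float> pack_contiguous(
    const std::vector<float>& src,
    const std::vector<std::vector<int64_t>>& sizes,
    const std::vector<std::vector<int64_t>>& strides,
    const std::vector<int64_t>& offsets) {
  std::vector<float> dst;
  for (size_t i = 0; i < sizes.size(); ++i) {
    const int64_t rows = sizes[i][0], cols = sizes[i][1];
    const int64_t rs = strides[i][0], cs = strides[i][1];
    for (int64_t r = 0; r < rows; ++r) {
      for (int64_t c = 0; c < cols; ++c) {
        dst.push_back(src[offsets[i] + r * rs + c * cs]);
      }
    }
  }
  return dst;  // each constituent now has contiguous strides {cols, 1}
}

The patch itself does the same thing at the Tensor level, unbinding both the source and the freshly allocated output and calling copy_ per constituent rather than indexing manually.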
+ const Tensor& buffer = input_ptr->get_unsafe_storage_as_tensor(), & sizemat = input_ptr->get_nested_size_tensor(); Tensor output_buffer = buffer.new_empty(buffer.sizes()); Tensor output = wrap_buffer(output_buffer, sizemat.clone()); @@ -758,224 +645,6 @@ Tensor softmax_nested( return output; } -Tensor bmm_nested(const Tensor& self, const Tensor& mat2) { - if (self.is_nested() && !mat2.is_nested()) { - AT_ERROR("Expected both to be nested, but got a nested self and non-nested other"); - } - else if (!self.is_nested() && mat2.is_nested()) { - AT_ERROR("Expected both to be nested, but got a non-nested self and nested other"); - } - // dispatcher should have guaranteed that at least one is nested - auto self_ptr = get_nested_tensor_impl(self); - auto mat2_ptr = get_nested_tensor_impl(mat2); - TORCH_CHECK(self_ptr->dim() == 3, "batch1 must be a 3D tensor"); - TORCH_CHECK(mat2_ptr->dim() == 3, "batch2 must be a 3D tensor"); - int64_t ntensors = self_ptr->size(0), - ntensors2 = mat2_ptr->size(0); - TORCH_CHECK(ntensors == ntensors2, - "Expected size for the 1st dimension of batch2 tensor to be: ", ntensors, - " but got: ", ntensors2, "."); - const Tensor& self_buffer = self_ptr->get_buffer(), - & mat2_buffer = mat2_ptr->get_buffer(); - std::vector self_sizes = NestedTensor_get_sizes(self_ptr), - mat2_sizes = NestedTensor_get_sizes(mat2_ptr), - self_strides = NestedTensor_get_strides(self_ptr), - mat2_strides = NestedTensor_get_strides(mat2_ptr); - const std::vector& self_offsets = self_ptr->get_offsets(), - & mat2_offsets = mat2_ptr->get_offsets(); - // create a contiguous output - int64_t out_numel = 0; - const Tensor& self_sizemat = self_ptr->get_nested_size_tensor(); - Tensor out_sizemat = self_sizemat.new_empty(self_sizemat.sizes()); - int64_t* out_sizemat_ptr = out_sizemat.data_ptr(); - for (int64_t i = 0; i < ntensors; i++) { - const IntArrayRef& self_shape = self_sizes[i], - & mat2_shape = mat2_sizes[i]; - const int64_t& self_size0 = self_shape[0], & self_size1 = self_shape[1], - & mat2_size0 = mat2_shape[0], & mat2_size1 = mat2_shape[1]; - TORCH_CHECK(self_size1 == mat2_size0, - i, "-th nested matrices in batch cannot be multiplied (", - self_size0, "x", self_size1, " and ", - mat2_size0, "x", mat2_size1, ")"); - out_sizemat_ptr[0] = self_size0; - out_sizemat_ptr[1] = mat2_size1; - out_sizemat_ptr += 2; - out_numel += self_size0 * mat2_size1; - } - Tensor out_buffer = self_buffer.new_empty(out_numel); - Tensor output = wrap_buffer(out_buffer, out_sizemat); - // call tensor mm - // TODO: `padding nested tensor -> bmm -> remove padding` may be more efficient - // until we have specialized nested tensor bmm kernel - // useful resource: `aten/src/ATen/native/cpu/LinearAlgebra.cpp/bmm_out_or_baddbmm_` - // `aten/src/ATen/native/cuda/Blas.cpp/baddbmm_out_cuda_impl` - std::vector output_unbind = output.unbind(); - for (int64_t i = 0; i < ntensors; i++) { - at::mm_out(output_unbind[i], - self_buffer.as_strided(self_sizes[i], self_strides[i], self_offsets[i]), - mat2_buffer.as_strided(mat2_sizes[i], mat2_strides[i], mat2_offsets[i])); - } - return output; -} - -// utilities support `matmul_nested` -namespace { -// Args: -// self_sizes: the sizes of `self` in `matmul_nested` -// mat2_sizes: the sizes of `mat2` in `matmul_nested` -// buffer_op: the options for new buffer -// sizemat_op: the options for new size matrix -// Returns: -// the batch size of each input underlying tensor, i.e. 
the product of batch-dimension sizes -// the empty output nested tensor -inline std::tuple, Tensor> -matmul_nested_helper( - const std::vector& self_sizes, - const std::vector& mat2_sizes, - const c10::TensorOptions& buffer_op, - const c10::TensorOptions& sizemat_op) { - int64_t ntensors = self_sizes.size(), - ndims = self_sizes[0].size(); - std::vector batch_sizes(ntensors, 1); - Tensor sizemat = at::empty({ntensors, ndims}, sizemat_op); - int64_t* sizemat_ptr = sizemat.data_ptr(); - int64_t numel = 0; - for (int64_t i = 0; i < ntensors; i++) { - const IntArrayRef& self_size = self_sizes[i], - & mat2_size = mat2_sizes[i]; - int64_t& batch_size = batch_sizes[i]; - // batch dimensions - for (int64_t j = 0; j < ndims - 2; j++) { - const int64_t& self_sizej = self_size[j], - & mat2_sizej = mat2_size[j]; - TORCH_CHECK( - self_sizej == mat2_sizej, - "matmul: For nested tensors, no broadcasting is currently performed: ", - i, "-th nested matrices in batch at dimension ", j + 1, - " have mismatching sizes ", self_sizej, " and ", mat2_sizej); - sizemat_ptr[j] = self_sizej; - batch_size *= sizemat_ptr[j]; - } - // matrix multiplication dimensions - const int64_t& self_size0 = self_size[ndims - 2], & self_size1 = self_size[ndims - 1], - & mat2_size0 = mat2_size[ndims - 2], & mat2_size1 = mat2_size[ndims - 1]; - TORCH_CHECK( - self_size1 == mat2_size0, - "matmul: ", - i, "-th nested matrices in batch cannot be multiplied (", - self_size0, "x", self_size1, " and ", - mat2_size0, "x", mat2_size1, ")"); - sizemat_ptr[ndims - 2] = self_size0; - sizemat_ptr[ndims - 1] = mat2_size1; - sizemat_ptr += ndims; - numel += batch_size * self_size0 * mat2_size1; - } - Tensor buffer = at::empty(numel, buffer_op); - Tensor output = wrap_buffer(buffer, sizemat); - return std::make_tuple(batch_sizes, output); -} -} - -// Note [nested tensor matmul] -// This is really a generalized batched matmul dedicated to nested tensors, -// where `self` and `mat2` have same number (>= 3) of dimensions. -// The last 2 dimensions will be considered as matrix dimensions, -// so they should be matrix-multiplicable. -// The leading dimensions are considered as batch dimensions, -// and since nested tensor does not support broadcasting for now, -// for each batch dimension `self` and `mat2` must have same size. -// TODO: Should make full matmul semantics support some day -Tensor matmul_nested(const Tensor& self, const Tensor& mat2) { - if (self.is_nested() && !mat2.is_nested()) { - AT_ERROR("Expected both to be nested, but got a nested self and non-nested other"); - } - else if (!self.is_nested() && mat2.is_nested()) { - AT_ERROR("Expected both to be nested, but got a non-nested self and nested other"); - } - // dispatcher should have guaranteed that at least one is nested - auto self_ptr = get_nested_tensor_impl(self), - mat2_ptr = get_nested_tensor_impl(mat2); - int64_t self_dim = self_ptr->dim(), - mat2_dim = mat2_ptr->dim(); - TORCH_CHECK( - self_dim >= 3, - "matmul: For nested tensors, only inputs with >= 3 dims are currently supported. 1st input has rank: ", - self_dim); - TORCH_CHECK( - mat2_dim >= 3, - "matmul: For nested tensors, only inputs with >= 3 dims are currently supported. 
2nd input has rank: ", - mat2_dim); - TORCH_CHECK(self_dim == mat2_dim, "matmul: both inputs must have same rank"); - int64_t ntensors = self_ptr->size(0), - ntensors2 = mat2_ptr->size(0); - TORCH_CHECK(ntensors == ntensors2, - "matmul: Expected size for the 1st dimension of 2nd input tensor to be: ", ntensors, - " but got: ", ntensors2, "."); - const Tensor& self_buffer = self_ptr->get_buffer(), - & mat2_buffer = mat2_ptr->get_buffer(); - std::vector self_sizes = NestedTensor_get_sizes(self_ptr), - mat2_sizes = NestedTensor_get_sizes(mat2_ptr), - self_strides = NestedTensor_get_strides(self_ptr), - mat2_strides = NestedTensor_get_strides(mat2_ptr); - const std::vector& self_offsets = self_ptr->get_offsets(), - & mat2_offsets = mat2_ptr->get_offsets(); - // create a contiguous output - std::vector batch_sizes; - Tensor output; - std::tie(batch_sizes, output) = matmul_nested_helper( - self_sizes, mat2_sizes, self_buffer.options(), self_ptr->get_nested_size_tensor().options()); - // call tensor matmul - // TODO: `padding nested tensor -> bmm -> remove padding` may be more efficient - // until we have specialized nested tensor bmm kernel - // useful resource: `aten/src/ATen/native/cpu/LinearAlgebra.cpp/bmm_out_or_baddbmm_` - // `aten/src/ATen/native/cuda/Blas.cpp/baddbmm_out_cuda_impl` - std::vector output_unbind = output.unbind(); - for (int64_t i = 0; i < ntensors; i++) { - const IntArrayRef& self_size = self_sizes[i], - & mat2_size = mat2_sizes[i]; - const int64_t& batch_size = batch_sizes[i]; - if (batch_size == 1) { - at::mm_out( - output_unbind[i], - self_buffer.as_strided(self_size, self_strides[i], self_offsets[i]), - mat2_buffer.as_strided(mat2_size, mat2_strides[i], mat2_offsets[i]) - ); - } - else { - at::bmm_out( - output_unbind[i], - self_buffer.as_strided(self_size, self_strides[i], self_offsets[i]) - .reshape({batch_size, self_size[self_dim - 1 - 2], self_size[self_dim - 1 - 1]}), - mat2_buffer.as_strided(mat2_size, mat2_strides[i], mat2_offsets[i]) - .reshape({batch_size, mat2_size[self_dim - 1 - 2], mat2_size[self_dim - 1 - 1]}) - ); - } - } - return output; -} - -Tensor& matmul_out_nested(const Tensor& tensor1, const Tensor& tensor2, Tensor& result) { - // TODO: this is a very quick and dirty implementation - // should improve it to avoid the intermediate memory usage - Tensor function_result = at::matmul(tensor1, tensor2); - auto function_result_ptr = get_nested_tensor_impl(function_result); - // TODO: this is to reproduce function_result_ptr->opt_sizes_ - // if an accessor is provided in the future, can replace this - std::vector sizes; - for (int64_t i = 0; i < function_result_ptr->dim(); i++) { - c10::optional opt_size = function_result_ptr->opt_size(i); - if (opt_size.has_value()) { - sizes.push_back(*opt_size); - } - else { - sizes.push_back(-1); - } - } - result.reshape(sizes); - result.copy_(function_result); - return result; -} - Tensor transpose_nested(const Tensor& self, int64_t dim0, int64_t dim1) { auto self_ptr = get_nested_tensor_impl(self); // check input dimensions @@ -1001,10 +670,77 @@ Tensor transpose_nested(const Tensor& self, int64_t dim0, int64_t dim1) { // create transposed `sizemat` and `stridemat` Tensor sizemat_transposed = at::index_select(sizemat, 1, column_indices), stridemat_transposed = at::index_select(stridemat, 1, column_indices); - return wrap_buffer(self_ptr->get_buffer(), sizemat_transposed, stridemat_transposed, self_ptr->get_offsets()); + return create_nested_view_tensor( + self, sizemat_transposed, stridemat_transposed, 
std::vector(self_ptr->get_storage_offsets())); } -// utilities supporting `_reshape_nested` +Tensor squeeze_nested(const Tensor& self) { + TORCH_CHECK(false, + "squeeze(): For nested tensors, squeeze without the dim argument is not supported ", + "at the moment, however you can use squeeze(Tensor self, int dim) instead ", + "if you need this feature, please open an issue on github describing your use case."); + return self; +} + +Tensor squeeze_dim_nested(const Tensor& self, int64_t dim) { + auto self_ptr = get_nested_tensor_impl(self); + int64_t ndim = self_ptr->dim(); + int64_t wrapped_dim = at::maybe_wrap_dim(dim, ndim); + TORCH_CHECK(wrapped_dim > 0, + "squeeze(): For nested tensors, squeezing dimension 0 is not supported at the moment ", + "if you need this feature, please open an issue on github describing your use case."); + const Tensor& sizemat = self_ptr->get_nested_size_tensor(); + const Tensor& stridemat = self_ptr->get_nested_stride_tensor(); + // if tensor.size(dim) != 1 torch.squeeze will return the result, we do the same here + c10::optional size_dim = self_ptr->opt_size(dim); + if (!(size_dim.has_value() && size_dim.value() == 1)) { + // detach to avoid triggering throw_error_if_base_and_tensor_are_same + return self.detach(); + } + // if ndim == 2 and we pass the above if statement we should have a + // nested tensor of singleton tensors + TORCH_CHECK(ndim != 2, + "squeeze(): For nested tensors, squeezing a nested tensor of singleton tensors is not ", + "supported at the moment, if you need this feature, please open an issue on github", + "describing your use case."); + auto column_indices = sizemat.new_empty(ndim - 2); + int64_t* column_indices_ptr = column_indices.data_ptr(); + std::iota(column_indices_ptr, column_indices_ptr + wrapped_dim - 1, 0); + std::iota(column_indices_ptr + wrapped_dim - 1, column_indices_ptr + ndim - 2, wrapped_dim); + auto sizemat_squeezed = at::index_select(sizemat, 1, column_indices); + auto stridemat_squeezed = at::index_select(stridemat, 1, column_indices); + return create_nested_view_tensor( + self, sizemat_squeezed, stridemat_squeezed, std::vector(self_ptr->get_storage_offsets())); +} + +Tensor unsqueeze_nested(const Tensor& self, int64_t dim) { + auto self_ptr = get_nested_tensor_impl(self); + int64_t ndim = self_ptr->dim(); + int64_t wrapped_dim = at::maybe_wrap_dim(dim, ndim + 1); + TORCH_CHECK(wrapped_dim > 0, + "unsqueeze(): For nested tensors, unsqueezing dimension 0 is not supported at the moment ", + "if you need this feature, please open an issue on github describing your use case."); + const Tensor& sizemat = self_ptr->get_nested_size_tensor(); + const Tensor& stridemat = self_ptr->get_nested_stride_tensor(); + auto mat_dim = wrapped_dim - 1; + Tensor new_size = sizemat.new_ones({sizemat.size(0), 1}); + Tensor sizemat_unsqueezed = at::cat({sizemat.slice(1, 0, mat_dim), + new_size, + sizemat.slice(1, mat_dim, ndim)}, 1); + Tensor new_stride; + if (wrapped_dim == ndim) { + new_stride = stridemat.new_ones({stridemat.size(0), 1}); + } else { + new_stride = (stridemat.select(1, mat_dim - 1) * sizemat.select(1, mat_dim - 1)).unsqueeze(-1); + } + Tensor stridemat_unsqueezed = at::cat({stridemat.slice(1, 0, mat_dim), + new_stride, + stridemat.slice(1, mat_dim, ndim)}, 1); + return create_nested_view_tensor( + self, sizemat_unsqueezed, stridemat_unsqueezed, std::vector(self_ptr->get_storage_offsets())); +} + +// utilities supporting `view_nested` and `reshape_nested` namespace { // Args: // sizes: the sizes of original nested tensor @@ 
-1012,10 +748,10 @@ namespace { // proposed_shape: user proposed new shape // op: the options for new size and stride matrices // Returns: -// whether reshape as view is possible (i.e. old buffer can be reused) +// whether viewable // size matrix after reshape -// stride matrix after reshape (not fully populated if reshape as view is impossible) -inline std::tuple NestedTensor_reshape_size_stride( +// stride matrix after reshape (not fully populated if not viewable) +inline std::tuple NestedTensor_compute_size_stride( const std::vector& sizes, const std::vector& strides, const IntArrayRef& proposed_shape, @@ -1023,7 +759,7 @@ inline std::tuple NestedTensor_reshape_size_stride( int64_t ntensors = sizes.size(), ndims_underlying = sizes[0].size(), ndims_underlying_reshaped = proposed_shape.size() - 1; - bool reshape_as_view = true; + bool viewable = true; Tensor sizemat_reshaped = at::empty({ntensors, ndims_underlying_reshaped}, op), stridemat_reshaped = at::empty({ntensors, ndims_underlying_reshaped}, op); int64_t* sizemat_reshaped_ptr = sizemat_reshaped.data_ptr(), @@ -1033,21 +769,31 @@ inline std::tuple NestedTensor_reshape_size_stride( & stride = strides[itensor]; // compute reshaped size std::vector size_reshaped_vector(proposed_shape.begin() + 1, proposed_shape.end()); + // only allow one pre-existing dimension to have proposed shape == -1 + int64_t infer_index_old = -1; // some negative sizes remain to be infered if (ndims_underlying < ndims_underlying_reshaped) { + int64_t numel = 1, numel_reshaped = 1; // replace negative sizes for old dimensions with old sizes for (int64_t idim = 0; idim < ndims_underlying; idim++) { int64_t& size_reshaped = size_reshaped_vector[idim]; TORCH_CHECK(size_reshaped >= -1, "invalid shape dimension ", size_reshaped); if (size_reshaped == -1) { + TORCH_CHECK(infer_index_old == -1, "only one dimension can be inferred"); size_reshaped = size[idim]; + infer_index_old = idim; } + numel *= size[idim]; + numel_reshaped *= size_reshaped; } // infer negative size for new dimension int64_t infer_index = -1; for (int64_t idim = ndims_underlying; idim < ndims_underlying_reshaped; idim++) { const int64_t& size_reshaped = size_reshaped_vector[idim]; - if (size_reshaped == -1) { + if (size_reshaped >= 0) { + numel_reshaped *= size_reshaped; + } + else if (size_reshaped == -1) { if (infer_index > -1) { throw std::runtime_error("only one dimension can be inferred"); } @@ -1055,22 +801,36 @@ inline std::tuple NestedTensor_reshape_size_stride( infer_index = idim; } } - else if (size_reshaped < 0) { + else { AT_ERROR("invalid shape dimension ", size_reshaped); } } - // See Note [inference and inheritance semantics] + // See Note [Special size rule for nested tensor] TORCH_CHECK(infer_index == -1, "nested tensor does not infer shape"); + TORCH_CHECK( + numel == numel_reshaped, + "shape '", proposed_shape, "' ", + "is invalid for input of size ", numel); } // all negative sizes can be replaced else { + int64_t numel = 1, numel_reshaped = 1; for (int64_t idim = 0; idim < ndims_underlying_reshaped; idim++) { int64_t& size_reshaped = size_reshaped_vector[idim]; TORCH_CHECK(size_reshaped >= -1, "invalid shape dimension ", size_reshaped); if (size_reshaped == -1) { size_reshaped = size[idim]; } + numel *= size[idim]; + numel_reshaped *= size_reshaped; + } + for (int64_t idim = ndims_underlying_reshaped; idim < ndims_underlying; idim++) { + numel *= size[idim]; } + TORCH_CHECK( + numel == numel_reshaped, + "shape '", proposed_shape, "' ", + "is invalid for input of size ", numel); 
} IntArrayRef size_reshaped(size_reshaped_vector); // compute reshaped stride @@ -1088,7 +848,7 @@ inline std::tuple NestedTensor_reshape_size_stride( } // reshape as view is impossible else { - reshape_as_view = false; + viewable = false; // fill reshaped size into sizemat for (int64_t idim = 0; idim < ndims_underlying_reshaped; idim++) { sizemat_reshaped_ptr[idim] = size_reshaped[idim]; @@ -1096,42 +856,104 @@ inline std::tuple NestedTensor_reshape_size_stride( sizemat_reshaped_ptr += ndims_underlying_reshaped; } } - return std::make_tuple(reshape_as_view, sizemat_reshaped, stridemat_reshaped); + return std::make_tuple(viewable, sizemat_reshaped, stridemat_reshaped); } +} // namespace -// Args: -// nt_reshaped: the reshaped nested tensor to receive copies -// buffer: the original nested tensor buffer -// sizes: the original nested tensor sizes (may have gone through collapsing or splitting) -// strides: the original nested tensor strides (may have gone through collapsing or splitting) -// offsets: the original nested tensor offsets (may have gone through collapsing or splitting) -inline void NestedTensor_reshape_copy( - Tensor& nt_reshaped, +// Note [Special size rule for nested tensor] +// Instead of infering size, -1 means "inherit the old size", so: +// * negative size is legal for a ragged dimension +// * however, we only allow one -1 +// In principle we could still infer a dimension, +// we are designing a better semantics to include both inheritance and inference +Tensor view_nested(const Tensor& self, IntArrayRef proposed_shape) { + TORCH_CHECK( + proposed_shape.size() > 0, + "shape '[]' is invalid for a nested tensor"); + auto self_ptr = get_nested_tensor_impl(self); + // basic information before reshaping + int64_t ntensors = self_ptr->size(0); + TORCH_CHECK( + ntensors > 0, + "empty nested tensor cannot be reshaped"); + // basic information after reshaping + int64_t ntensors_reshaped = proposed_shape[0]; + TORCH_CHECK( + ntensors == ntensors_reshaped, + "view: For now nested view cannot change or infer the implicit batch dimension"); + std::vector sizes = NestedTensor_get_sizes(self_ptr), + strides = NestedTensor_get_strides(self_ptr); + // reshaping underlying tensor dimensions does not change offset + // determine reshaped size and stride + const Tensor& sizemat = self_ptr->get_nested_size_tensor(); + bool viewable; + Tensor sizemat_reshaped, stridemat_reshaped; + std::tie(viewable, sizemat_reshaped, stridemat_reshaped) = NestedTensor_compute_size_stride( + sizes, strides, proposed_shape, sizemat.options()); + TORCH_CHECK( + viewable, + "view size is not compatible with input tensor's size and stride " + "(at least one dimension spans across two contiguous subspaces). " + "Use .reshape(...) 
instead."); + return create_nested_view_tensor(self, sizemat_reshaped, stridemat_reshaped, std::vector(self_ptr->get_storage_offsets())); +} + /** + * Create a buffer tensor that is a view of self + * + * This serves as the boundary between nested and non nested tensor + * view conversions + * + * @return Returns a new non nested tensor that + * aliases the same storage as self + */ +Tensor values_nested(const Tensor& self) { + TORCH_INTERNAL_ASSERT(self.is_nested(), "Can only create a buffer from Nested Tensor"); + auto* nt_self = get_nested_tensor_impl(self); + return nt_self->get_unsafe_storage_as_tensor(); +} + +/** + * Create a nested tensor that is a view of a buffer + * + * This serves as the boundary between non nested tensor and nested + * view conversions + * + * @return Returns a nested tensor that + * aliases the same storage as buffer + */ +Tensor _nested_view_from_buffer( const Tensor& buffer, - const std::vector& sizes, - const std::vector& strides, - const std::vector& offsets) { - auto nt_reshaped_ptr = get_nested_tensor_impl(nt_reshaped); - const Tensor& buffer_reshaped = nt_reshaped_ptr->get_buffer(); - std::vector sizes_reshaped = NestedTensor_get_sizes(nt_reshaped_ptr), - strides_reshaped = NestedTensor_get_strides(nt_reshaped_ptr); - const std::vector& offsets_reshaped = nt_reshaped_ptr->get_offsets(); - for (int64_t i = 0; i < nt_reshaped_ptr->size(0); i++) { - buffer_reshaped.as_strided(sizes_reshaped[i], strides_reshaped[i], offsets_reshaped[i]).copy_( - // TODO: can we avoid allocating new memory for `buffer...reshape` - // I did not find anything like reshape_out - buffer.as_strided(sizes[i], strides[i], offsets[i]).reshape(sizes_reshaped[i])); - } + const Tensor& nested_size_tensor, + const Tensor& nested_stride_tensor, + IntArrayRef offsets) { + TORCH_INTERNAL_ASSERT( + !buffer.is_nested(), + "Can only a create Nested Tensor from a normal tensor buffer"); + TORCH_INTERNAL_ASSERT(buffer.dim() == 1, "The input buffer must be flat"); + TORCH_INTERNAL_ASSERT(nested_size_tensor.dim() == 2, "Expected the nested size tensor to be two dimensional."); + uint64_t num_elements_nested_size = at::prod(nested_size_tensor, 1).sum().item(); + uint64_t buffer_storage_size = buffer.storage().nbytes()/buffer.dtype().itemsize(); + TORCH_INTERNAL_ASSERT( + buffer_storage_size == num_elements_nested_size, + "The number of elements in the buffer must equal the nested tensor size but buffer size: ", + buffer_storage_size, + " and nested tensor size: ", + num_elements_nested_size, + "."); + + TORCH_INTERNAL_ASSERT(nested_stride_tensor.dim() == 2, "Expected the nested stride tensor to be two dimensional."); + TORCH_INTERNAL_ASSERT(nested_size_tensor.size(0) == nested_stride_tensor.size(0), "Expected the first dimension of nested size and nested stride tensor to be equal."); + TORCH_INTERNAL_ASSERT(nested_stride_tensor.size(0) == (int64_t)offsets.size(), "Expected the first dimension of nested stride tensor to equal the length of offsets."); + return at::detail::make_tensor( + c10::TensorImpl::VIEW, + buffer, + nested_size_tensor, + nested_stride_tensor, + std::vector(offsets.begin(), offsets.end())); } -} // namespace -// Special rules for reshape(nested tensor): -// 1. Only 1 regular dimension can be collapsed with -// or splitted from the implicit batch dimension -// 2. 
Instead of infering size, -1 means "inherit the old size", so: -// * negative size is legal for a ragged dimension -// * multiple sizes can be -1 -Tensor _reshape_nested(const Tensor& self, IntArrayRef proposed_shape) { +// See Note [Special size rule for nested tensor] +Tensor reshape_nested(const Tensor& self, IntArrayRef proposed_shape) { TORCH_CHECK( proposed_shape.size() > 0, "shape '[]' is invalid for a nested tensor"); @@ -1142,38 +964,43 @@ Tensor _reshape_nested(const Tensor& self, IntArrayRef proposed_shape) { ntensors > 0, "empty nested tensor cannot be reshaped"); // basic information after reshaping - int64_t ntensors_reshaped; - if (proposed_shape[0] >= 0) { - ntensors_reshaped = proposed_shape[0]; - } - else if (proposed_shape[0] == -1) { - ntensors_reshaped = ntensors; - } - else { - AT_ERROR("invalid shape dimension ", proposed_shape[0]); - } + int64_t ntensors_reshaped = proposed_shape[0]; TORCH_CHECK( ntensors == ntensors_reshaped, - "for now reshape cannot change the implicit batch dimension"); + "reshape: For now nested reshape cannot change or infer the implicit batch dimension"); std::vector sizes = NestedTensor_get_sizes(self_ptr), strides = NestedTensor_get_strides(self_ptr); - const std::vector& offsets = self_ptr->get_offsets(); // reshaping underlying tensor dimensions does not change offset // determine reshaped size and stride - const Tensor& buffer = self_ptr->get_buffer(), - & sizemat = self_ptr->get_nested_size_tensor(); - bool reshape_as_view; + const Tensor& sizemat = self_ptr->get_nested_size_tensor(); + bool viewable{false}; Tensor sizemat_reshaped, stridemat_reshaped; - std::tie(reshape_as_view, sizemat_reshaped, stridemat_reshaped) = NestedTensor_reshape_size_stride( + std::tie(viewable, sizemat_reshaped, stridemat_reshaped) = NestedTensor_compute_size_stride( sizes, strides, proposed_shape, sizemat.options()); - if (reshape_as_view) { - return wrap_buffer(buffer, sizemat_reshaped, stridemat_reshaped, offsets); + if (viewable) { + return self.view(proposed_shape); } - Tensor buffer_reshaped = buffer.new_empty(buffer.sizes()); - Tensor output = wrap_buffer(buffer_reshaped, sizemat_reshaped); - NestedTensor_reshape_copy(output, - buffer, sizes, strides, offsets); - return output; + else { + return self.clone(at::MemoryFormat::Contiguous).view(proposed_shape); + } +} + +Tensor reshape_as_nested(const Tensor& self, const Tensor& other) { + auto other_ptr = get_nested_tensor_impl(other); + // TODO: this is to reproduce other_ptr->opt_sizes_ + // if an accessor is provided in the future, can replace this + std::vector sizes; + for (int64_t i = 0; i < other_ptr->dim(); i++) { + c10::optional opt_size = other_ptr->opt_size(i); + if (opt_size.has_value()) { + sizes.push_back(*opt_size); + } + else { + sizes.push_back(-1); + } + } + // reshape with other.opt_sizes_ + return self.reshape(sizes); } } // namespace native diff --git a/aten/src/ATen/native/nested/NestedTensorMath.h b/aten/src/ATen/native/nested/NestedTensorMath.h index 11b94b65a4e5..954fa807f183 100644 --- a/aten/src/ATen/native/nested/NestedTensorMath.h +++ b/aten/src/ATen/native/nested/NestedTensorMath.h @@ -1,261 +1,22 @@ #pragma once -#include -#include +#include #include - -#include +#include namespace at { namespace native { -struct NestedTensorImpl; - -// TODO: cache this and only do it once per NestedTensor -int64_t get_consistent_last_dim_of_nested_tensor(const NestedTensorImpl& nt); - -inline at::Tensor wrap_buffer(at::Tensor buffer, at::Tensor nested_size_tensor) { - 
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(buffer.is_contiguous(), "Given buffer must be contiguous."); - return at::detail::make_tensor( - std::move(buffer), std::move(nested_size_tensor)); -} - -inline at::Tensor wrap_buffer( - at::Tensor buffer, at::Tensor nested_size_tensor, - at::Tensor nested_stride_tensor, const std::vector& offsets) { - TORCH_INTERNAL_ASSERT_DEBUG_ONLY(buffer.is_contiguous(), "Given buffer must be contiguous."); - return at::detail::make_tensor( - std::move(buffer), std::move(nested_size_tensor), - std::move(nested_stride_tensor), offsets); -} - -inline at::Tensor get_buffer(const at::Tensor& tensor) { - return get_nested_tensor_impl(tensor)->get_buffer(); -} - -// The sizes of the underlying tensors -inline std::vector NestedTensor_get_sizes(const NestedTensorImpl* self_ptr) { - int64_t ntensors = self_ptr->size(0); - std::vector sizes(ntensors); - if (ntensors == 0) { - return sizes; - } - const Tensor& sizemat = self_ptr->get_nested_size_tensor(); - int64_t orig_dim = sizemat.size(1); - // nesting scalars has empty sizes - if (orig_dim == 0) { - return sizes; - } - const int64_t* sizemat_ptr = sizemat.data_ptr(); - - for(const auto i: c10::irange(ntensors)){ - sizes[i] = IntArrayRef(sizemat_ptr, sizemat_ptr + orig_dim); - sizemat_ptr += orig_dim; - } - return sizes; -} - -inline std::vector NestedTensor_get_sizes(const at::Tensor& self) { - const NestedTensorImpl* self_ptr = get_nested_tensor_impl(self); - return NestedTensor_get_sizes(self_ptr); -} - -// The strides of the underlying tensors -inline std::vector NestedTensor_get_strides(const NestedTensorImpl* self_ptr) { - int64_t ntensors = self_ptr->size(0); - std::vector strides(ntensors); - if (ntensors == 0) { - return strides; - } - const Tensor& stridemat = self_ptr->get_nested_stride_tensor(); - int64_t orig_dim = stridemat.size(1); - // nesting scalars has empty strides - if (orig_dim == 0) { - return strides; - } - const int64_t* stridemat_ptr = stridemat.data_ptr(); - for(const auto i: c10::irange(ntensors)) { - strides[i] = IntArrayRef(stridemat_ptr, stridemat_ptr + orig_dim); - stridemat_ptr += orig_dim; - } - return strides; -} - -inline std::vector NestedTensor_get_strides(const at::Tensor& self) { - const NestedTensorImpl* self_ptr = get_nested_tensor_impl(self); - return NestedTensor_get_strides(self_ptr); -} - -TORCH_API std::vector NestedTensor_get_max_size( - const NestedTensorImpl& nt); TORCH_API Tensor NestedTensor_to_padded_tensor_generic( const Tensor& t, double padding, OptionalIntArrayRef output_size); -namespace impl { - -template -struct NestedNode { - NestedNode() = delete; - explicit NestedNode(std::vector&& children) - : _is_leaf(false), _children(children) {} - explicit NestedNode(TensorList children) - : _is_leaf(false), _children(children.vec()) {} - // NestedNode(NestedNode&) = delete; - // NestedNode(const NestedNode&) = delete; - // NestedNode& operator=(NestedNode) = delete; - explicit NestedNode(T payload) : _is_leaf(true), _payload(payload) {} - inline bool is_leaf() const { - return _is_leaf; - } - inline size_t degree() const { - return _children.size(); - } - inline const std::vector unbind() const { - return _children; - } - inline T children(size_t i) const { - return _children[i]; - } - inline const T& payload() const { - return _payload; - } - inline T& payload() { - return _payload; - } - - private: - bool _is_leaf; - std::vector _children; - T _payload; -}; - -using TensorNode = NestedNode; - -template -class _map; - -template -class _map> { - public: - static A 
function_one( - F&& fn, - const Args&... nested_node) { - return std::forward(fn)(nested_node...); - } - // NOTE: We must move F to avoid copying objects if it is a lambda with - // captures. - static NestedNode function( - F&& fn, - const NestedNode&... nested_node) { - size_t degree = 0; - bool all_leaf = true; - c10::guts::tuple_map( - std::forward_as_tuple(nested_node...), [&all_leaf, °ree](auto n) { - all_leaf = all_leaf && (n.is_leaf()); - if (degree > 1 && n.degree() > 1) { - TORCH_CHECK(degree == n.degree(), "NestedNodes must match in degree."); - } - if (n.degree() > degree) { - degree = n.degree(); - } - return nullptr; - }); - // All NestedNodes just wrap regular objects. - if (all_leaf) { - return NestedNode(std::forward(fn)(nested_node.payload()...)); - } - // Some NestedNodes wrap regular Tensors, some NestedTensors and some other types. - std::vector result; - for (size_t i = 0; i < degree; i++) { - std::tuple children = c10::guts::tuple_map( - std::forward_as_tuple(nested_node...), [&i](auto a) { - static_assert( - c10::guts::is_instantiation_of::value, - "Internal error."); - // Broadcast regular arguments across NestedTensor constituents. - // This could be a Tensor, integer or anything else really. - if (a.is_leaf()) { - return a.payload(); - } - // Broadcast NestedTensors with one constituent. - if (a.degree() == 1 && !a.is_leaf()) { - return a.children(0); - } - TORCH_CHECK(a.degree() > 0, "Internal assert."); - return a.children(i); - }); - c10::guts::apply( - [&result, &fn](Args... filtered) { - result.emplace_back(function_one(std::forward(fn), filtered...)); - }, - std::move(children)); - } - return NestedNode(std::move(result)); - } -}; - -// TODO: Add static assert to verify lambda arguments match nested_node types -template -static inline NestedNode< - typename c10::guts::infer_function_traits::type::return_type> -map(F&& fn, const NestedNode&... nested_node) { - return _map< - F, - typename c10::guts::infer_function_traits::type::return_type, - typename c10::guts::infer_function_traits::type::parameter_types>:: - function(std::forward(fn), nested_node...); -} - -inline TensorNode get_nested_tensor_structure(at::Tensor tensor) { - if (get_nested_tensor_impl_or_null(tensor) == nullptr) { - return TensorNode(std::move(tensor)); - } - return TensorNode(tensor.unbind()); -} - -inline Tensor wrap_tensor_node( - TensorNode tensor_node, - c10::optional dtype, - c10::optional layout, - c10::optional device, - c10::optional pin_memory) { - TORCH_CHECK( - !tensor_node.is_leaf(), "Expected TensorNode to wrap a list of Tensors."); - TensorOptions options_ = - TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory( - pin_memory); - if (tensor_node.degree() == 0) { - return wrap_buffer(ones({0}, dtype, layout, device), ones({})); - } - std::vector sizes; - std::vector flat_tensors; - for (const auto i : c10::irange(tensor_node.degree())) { - flat_tensors.push_back( - tensor_node.children(i).reshape(-1).contiguous()); - sizes.push_back( - tensor(c10::IntArrayRef(tensor_node.children(i).sizes()))); - } - - TensorOptions options = flat_tensors[0].options().merge_in(options_); - - return wrap_buffer( - at::cat(flat_tensors).to(options), at::native::stack(sizes)); -} - -} // namespace impl - -// This function is meant to ease rapid operator coverage for -// NestedTensor kernels. It is not meant to be efficient. Use it judiciously. -template -inline at::Tensor map_nested_tensor(F&& fn, A... 
a) { - return wrap_tensor_node( - impl::map(std::forward(fn), impl::get_nested_tensor_structure(a)...), - c10::nullopt, - c10::nullopt, - c10::nullopt, - c10::nullopt); +template +Tensor map_nt(const Tensor& nt, Func f) { + auto* nt_impl = get_nested_tensor_impl(nt); + const auto& sizes = nt_impl->get_nested_size_tensor(); + return at::detail::make_tensor(f(nt_impl->get_buffer()), sizes); } } // namespace native diff --git a/aten/src/ATen/native/nested/NestedTensorMatmul.cpp b/aten/src/ATen/native/nested/NestedTensorMatmul.cpp new file mode 100644 index 000000000000..c8cfa124330d --- /dev/null +++ b/aten/src/ATen/native/nested/NestedTensorMatmul.cpp @@ -0,0 +1,352 @@ +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +namespace at { +namespace native { + +Tensor bmm_nested(const Tensor& self, const Tensor& mat2) { + if (self.is_nested() && !mat2.is_nested()) { + AT_ERROR("Expected both to be nested, but got a nested self and non-nested other"); + } + else if (!self.is_nested() && mat2.is_nested()) { + AT_ERROR("Expected both to be nested, but got a non-nested self and nested other"); + } + // dispatcher should have guaranteed that at least one is nested + auto self_ptr = get_nested_tensor_impl(self); + auto mat2_ptr = get_nested_tensor_impl(mat2); + TORCH_CHECK(self_ptr->dim() == 3, "batch1 must be a 3D tensor"); + TORCH_CHECK(mat2_ptr->dim() == 3, "batch2 must be a 3D tensor"); + int64_t ntensors = self_ptr->size(0), + ntensors2 = mat2_ptr->size(0); + TORCH_CHECK(ntensors == ntensors2, + "Expected size for the 1st dimension of batch2 tensor to be: ", ntensors, + " but got: ", ntensors2, "."); + const Tensor& self_buffer = self_ptr->get_unsafe_storage_as_tensor(), + & mat2_buffer = mat2_ptr->get_unsafe_storage_as_tensor(); + std::vector self_sizes = NestedTensor_get_sizes(self_ptr), + mat2_sizes = NestedTensor_get_sizes(mat2_ptr), + self_strides = NestedTensor_get_strides(self_ptr), + mat2_strides = NestedTensor_get_strides(mat2_ptr); + const std::vector& self_offsets = self_ptr->get_storage_offsets(), + & mat2_offsets = mat2_ptr->get_storage_offsets(); + // create a contiguous output + int64_t out_numel = 0; + const Tensor& self_sizemat = self_ptr->get_nested_size_tensor(); + Tensor out_sizemat = self_sizemat.new_empty(self_sizemat.sizes()); + int64_t* out_sizemat_ptr = out_sizemat.data_ptr(); + for (int64_t i = 0; i < ntensors; i++) { + const IntArrayRef& self_shape = self_sizes[i], + & mat2_shape = mat2_sizes[i]; + const int64_t& self_size0 = self_shape[0], & self_size1 = self_shape[1], + & mat2_size0 = mat2_shape[0], & mat2_size1 = mat2_shape[1]; + TORCH_CHECK(self_size1 == mat2_size0, + i, "-th nested matrices in batch cannot be multiplied (", + self_size0, "x", self_size1, " and ", + mat2_size0, "x", mat2_size1, ")"); + out_sizemat_ptr[0] = self_size0; + out_sizemat_ptr[1] = mat2_size1; + out_sizemat_ptr += 2; + out_numel += self_size0 * mat2_size1; + } + Tensor out_buffer = self_buffer.new_empty(out_numel); + Tensor output = wrap_buffer(out_buffer, out_sizemat); + // call tensor mm + // TODO: `padding nested tensor -> bmm -> remove padding` may be more efficient + // until we have specialized nested tensor bmm kernel + // useful resource: `aten/src/ATen/native/cpu/LinearAlgebra.cpp/bmm_out_or_baddbmm_` + // `aten/src/ATen/native/cuda/Blas.cpp/baddbmm_out_cuda_impl` + std::vector output_unbind = output.unbind(); + for (int64_t i = 0; i < ntensors; i++) { + at::mm_out(output_unbind[i], + 
self_buffer.as_strided(self_sizes[i], self_strides[i], self_offsets[i]), + mat2_buffer.as_strided(mat2_sizes[i], mat2_strides[i], mat2_offsets[i])); + } + return output; +} + +// utilities support `matmul_nested` +namespace { +// Args: +// self_sizes: the sizes of `self` in `matmul_nested` +// mat2_sizes: the sizes of `mat2` in `matmul_nested` +// buffer_op: the options for new buffer +// sizemat_op: the options for new size matrix +// Returns: +// the batch size of each input underlying tensor, i.e. the product of batch-dimension sizes +// the empty output nested tensor +inline std::tuple, Tensor> +matmul_nested_helper( + const std::vector& self_sizes, + const std::vector& mat2_sizes, + const c10::TensorOptions& buffer_op, + const c10::TensorOptions& sizemat_op) { + int64_t ntensors = self_sizes.size(), + ndims = self_sizes[0].size(); + std::vector batch_sizes(ntensors, 1); + Tensor sizemat = at::empty({ntensors, ndims}, sizemat_op); + int64_t* sizemat_ptr = sizemat.data_ptr(); + int64_t numel = 0; + for (int64_t i = 0; i < ntensors; i++) { + const IntArrayRef& self_size = self_sizes[i], + & mat2_size = mat2_sizes[i]; + int64_t& batch_size = batch_sizes[i]; + // batch dimensions + for (int64_t j = 0; j < ndims - 2; j++) { + const int64_t& self_sizej = self_size[j], + & mat2_sizej = mat2_size[j]; + TORCH_CHECK( + self_sizej == mat2_sizej, + "matmul: For nested tensors, no broadcasting is currently performed: ", + i, "-th nested matrices in batch at dimension ", j + 1, + " have mismatching sizes ", self_sizej, " and ", mat2_sizej); + sizemat_ptr[j] = self_sizej; + batch_size *= sizemat_ptr[j]; + } + // matrix multiplication dimensions + const int64_t& self_size0 = self_size[ndims - 2], & self_size1 = self_size[ndims - 1], + & mat2_size0 = mat2_size[ndims - 2], & mat2_size1 = mat2_size[ndims - 1]; + TORCH_CHECK( + self_size1 == mat2_size0, + "matmul: ", + i, "-th nested matrices in batch cannot be multiplied (", + self_size0, "x", self_size1, " and ", + mat2_size0, "x", mat2_size1, ")"); + sizemat_ptr[ndims - 2] = self_size0; + sizemat_ptr[ndims - 1] = mat2_size1; + sizemat_ptr += ndims; + numel += batch_size * self_size0 * mat2_size1; + } + Tensor buffer = at::empty(numel, buffer_op); + Tensor output = wrap_buffer(buffer, sizemat); + return std::make_tuple(batch_sizes, output); +} +} + +Tensor matmul_with_bmm_nested(const Tensor& self, const Tensor& mat2) { + // Tensor self = self_.contiguous(); + // Tensor mat2 = mat2_.contiguous(); + // self [N, n_heads, *, head_dim] + // mat2 [N, n_heads, head_dim, *] + const auto self_ptr = get_nested_tensor_impl(self); + const auto mat2_ptr = get_nested_tensor_impl(mat2); + // metadata for self + std::vector self_sizes = NestedTensor_get_sizes(self_ptr); + std::vector self_strides = NestedTensor_get_strides(self_ptr); + std::vector self_offsets = self_ptr->get_storage_offsets(); + auto opt = self_ptr->get_nested_size_tensor().options(); + + // metadata for mat2 + std::vector mat2_sizes = NestedTensor_get_sizes(mat2_ptr); + std::vector mat2_strides = NestedTensor_get_strides(mat2_ptr); + std::vector mat2_offsets = mat2_ptr->get_storage_offsets(); + auto opt2 = mat2_ptr->get_nested_size_tensor().options(); + + int64_t N = self_sizes.size(); + int64_t n_heads = self_sizes[0][0]; + + // viewed metadata for self + auto self_new_sizes = at::empty({N * n_heads, 2}, opt); + int64_t* self_new_sizes_ptr = self_new_sizes.data_ptr(); + + auto self_new_strides = at::empty({N * n_heads, 2}, opt); + int64_t* self_new_strides_ptr = self_new_strides.data_ptr(); + 
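// The loop below collapses the first two dimensions without copying data:
// constituent i of self has shape (n_heads, seq_i, head_dim), and head j of
// constituent i becomes its own 2-D (seq_i x head_dim) entry whose storage
// offset is self_offsets[i] + j * self_stride_i[0]. The same is done for mat2,
// producing N * n_heads rows of sizes/strides/offsets for the two views.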
std::vector self_new_offsets; + + // viewed metadata for mat2 + auto mat2_new_sizes = at::empty({N * n_heads, 2}, opt2); + int64_t* mat2_new_sizes_ptr = mat2_new_sizes.data_ptr(); + + auto mat2_new_strides = at::empty({N * n_heads, 2}, opt2); + int64_t* mat2_new_strides_ptr = mat2_new_strides.data_ptr(); + std::vector mat2_new_offsets; + + for (int64_t i = 0; i < N; i++) { + const IntArrayRef& self_size_i = self_sizes[i]; + const IntArrayRef& self_stride_i = self_strides[i]; + int64_t self_offset = self_offsets[i]; + + const IntArrayRef& mat2_size_i = mat2_sizes[i]; + const IntArrayRef& mat2_stride_i = mat2_strides[i]; + int64_t mat2_offset = mat2_offsets[i]; + for (int64_t j = 0; j < n_heads; j++) { + auto idx = (i * n_heads + j) * 2; + self_new_sizes_ptr[idx] = self_size_i[1]; + self_new_sizes_ptr[idx + 1] = self_size_i[2]; + self_new_strides_ptr[idx] = self_stride_i[1]; + self_new_strides_ptr[idx + 1] = self_stride_i[2]; + self_new_offsets.push_back(self_offset); + self_offset += self_stride_i[0]; + + mat2_new_sizes_ptr[idx] = mat2_size_i[1]; + mat2_new_sizes_ptr[idx + 1] = mat2_size_i[2]; + mat2_new_strides_ptr[idx] = mat2_stride_i[1]; + mat2_new_strides_ptr[idx + 1] = mat2_stride_i[2]; + mat2_new_offsets.push_back(mat2_offset); + mat2_offset += mat2_stride_i[0]; + } + } + + + // view self as [N * n_heads, *, head_dim] (collapse first 2 dims) + auto viewed_self = create_nested_view_tensor( + self, self_new_sizes, self_new_strides, std::vector(self_new_offsets)); + + // view mat2 as [N * n_heads, head_dim, *] (collapse first 2_dims) + auto viewed_mat2 = create_nested_view_tensor( + mat2, mat2_new_sizes, mat2_new_strides, std::vector(mat2_new_offsets)); + + // output [N * n_heads, *, *] + auto bmm_output = at::bmm(viewed_self, viewed_mat2); + + // generate metadata for viewing output as [N, n_heads, *, *] + // output of bmm should be contiguous so stride calculations should hold + auto out_new_sizes = at::empty({N, 3}, opt); + auto out_new_strides = at::empty({N, 3}, opt); + std::vector out_new_offsets; + + int64_t* out_new_sizes_ptr = out_new_sizes.data_ptr(); + int64_t* out_new_strides_ptr = out_new_strides.data_ptr(); + + int64_t out_offset = 0; + for (int64_t i = 0; i < N; i++) { + out_new_offsets.push_back(out_offset); + const IntArrayRef& self_size_i = self_sizes[i]; + const IntArrayRef& mat2_size_i = mat2_sizes[i]; + auto idx = i * 3; + out_new_sizes_ptr[idx] = n_heads; + out_new_sizes_ptr[idx + 1] = self_size_i[1]; + out_new_sizes_ptr[idx + 2] = mat2_size_i[2]; + out_new_strides_ptr[idx] = self_size_i[1] * mat2_size_i[2]; + out_new_strides_ptr[idx + 1] = mat2_size_i[2]; + out_new_strides_ptr[idx + 2] = 1; + out_offset += n_heads * (self_size_i[1] * mat2_size_i[2]); + } + + auto viewed_out = create_nested_view_tensor( + bmm_output, out_new_sizes, out_new_strides, std::vector(out_new_offsets)); + + return viewed_out; + +} + +// Note [nested tensor matmul] +// This is really a generalized batched matmul dedicated to nested tensors, +// where `self` and `mat2` have same number (>= 3) of dimensions. +// The last 2 dimensions will be considered as matrix dimensions, +// so they should be matrix-multiplicable. +// The leading dimensions are considered as batch dimensions, +// and since nested tensor does not support broadcasting for now, +// for each batch dimension `self` and `mat2` must have same size. 
+// TODO: Should make full matmul semantics support some day +Tensor matmul_nested(const Tensor& self, const Tensor& mat2) { + if (self.is_nested() && !mat2.is_nested()) { + AT_ERROR("Expected both to be nested, but got a nested self and non-nested other"); + } + else if (!self.is_nested() && mat2.is_nested()) { + AT_ERROR("Expected both to be nested, but got a non-nested self and nested other"); + } + // to_padded_tensor only supports contiguous inputs + auto self_contig = self.contiguous(); + auto mat2_contig = mat2.contiguous(); + // dispatcher should have guaranteed that at least one is nested + const auto self_ptr = get_nested_tensor_impl(self_contig); + const auto mat2_ptr = get_nested_tensor_impl(mat2_contig); + int64_t self_dim = self_ptr->dim(), + mat2_dim = mat2_ptr->dim(); + TORCH_CHECK( + self_dim >= 3, + "matmul: For nested tensors, only inputs with >= 3 dims are currently supported. 1st input has rank: ", + self_dim); + TORCH_CHECK( + mat2_dim >= 3, + "matmul: For nested tensors, only inputs with >= 3 dims are currently supported. 2nd input has rank: ", + mat2_dim); + TORCH_CHECK(self_dim == mat2_dim, "matmul: both inputs must have the same rank"); + int64_t ntensors = self_ptr->size(0), + ntensors2 = mat2_ptr->size(0); + TORCH_CHECK(ntensors == ntensors2, + "matmul: Expected size for the 1st dimension of 2nd input tensor to be: ", ntensors, + " but got: ", ntensors2, "."); + // Ensure batch dimensions have the same sizes (no broadcasting). + const auto& self_sizes = self_ptr->get_nested_size_tensor(); + const auto& mat2_sizes = mat2_ptr->get_nested_size_tensor(); + const auto& self_batch_sizes = self_sizes.narrow(1, 0, self_dim-3); + const auto& mat2_batch_sizes = mat2_sizes.narrow(1, 0, mat2_dim-3); + TORCH_CHECK(at::equal(self_batch_sizes, mat2_batch_sizes), + "matmul: For nested tensors, batch dimensions must have the same sizes, ", + "no broadcasting is currently performed. 
Got batch shapes for self ", + self_batch_sizes, + " and batch shapes for mat2 ", + mat2_batch_sizes); + // Ensure last dim of self and second last dim of mat2 have the same size + const auto& self_dim_size = self_sizes.select(1, -1); + const auto& mat2_dim_size = mat2_sizes.select(1, -2); + TORCH_CHECK(at::equal(self_dim_size, mat2_dim_size), + "matmul: Nested tensors cannot be matrix multiplied, last dimension of self has sizes", + self_dim_size, + "second last dimension of mat2 has sizes", + mat2_dim_size); + + // use bmm inference-only fast path for [N, n_heads, *, head_dim] [N, n_heads, head_dim, *] + if (self.is_cuda() && + self_dim == 4 && self.is_contiguous() && + mat2_dim == 4 && mat2.is_contiguous() && + !(GradMode::is_enabled() && (self.requires_grad() || mat2.requires_grad()))) { + auto n_heads = self_sizes.select(0, 1).select(0, 0).item(); + auto self_first_dim_n_heads = at::all(self_sizes.select(1, 0) == n_heads).item(); + auto mat2_first_dim_n_heads = at::all(mat2_sizes.select(1, 0) == n_heads).item(); + if (self_first_dim_n_heads && mat2_first_dim_n_heads) { + return matmul_with_bmm_nested(self, mat2); + } + } + + // Construct output size from input sizes + Tensor output_sizes = self_sizes.clone(); + // The last entry in every row of output_sizes should be last column of mat2_sizes + output_sizes.index_put_({at::indexing::Slice(), -1}, mat2_sizes.select(1, -1).clone()); + + auto self_padded = self_contig.to_padded_tensor(0.); + auto mat2_padded = mat2_contig.to_padded_tensor(0.); + auto output_padded = at::matmul(self_padded, mat2_padded); + auto output_nested = nested_from_padded_generic(output_padded, output_sizes); + return output_nested; +} + +Tensor& matmul_out_nested(const Tensor& tensor1, const Tensor& tensor2, Tensor& result) { + // TODO: this is a very quick and dirty implementation + // should improve it to avoid the intermediate memory usage + Tensor function_result = at::matmul(tensor1, tensor2); + auto function_result_ptr = get_nested_tensor_impl(function_result); + // TODO: this is to reproduce function_result_ptr->opt_sizes_ + // if an accessor is provided in the future, can replace this + std::vector sizes; + for (int64_t i = 0; i < function_result_ptr->dim(); i++) { + c10::optional opt_size = function_result_ptr->opt_size(i); + if (opt_size.has_value()) { + sizes.push_back(*opt_size); + } + else { + sizes.push_back(-1); + } + } + result.reshape(sizes); + result.copy_(function_result); + return result; +} + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.cpp b/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.cpp index d33decc22433..95c762ccc8ed 100644 --- a/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.cpp +++ b/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.cpp @@ -3,9 +3,11 @@ #include #include #include -#include +#include + #include #include +#include namespace at { namespace native { @@ -138,44 +140,56 @@ Tensor NestedTensor_add_NestedTensor_in_place( return self; } -void NestedTensor_softmax_dropout(const Tensor& query, Tensor& attn_scores) { +Tensor NestedTensor_softmax_dropout(const Tensor& self, const Tensor& query) { const auto* query_nt = get_nested_tensor_impl_or_null(query); TORCH_INTERNAL_ASSERT(query_nt != nullptr); TORCH_INTERNAL_ASSERT(nested_tensor_impl_is_contiguous(query_nt)); const Tensor& sizes = query_nt->get_nested_size_tensor(); const auto num_tensors = sizes.sizes()[0]; - const auto max_seq_len = attn_scores.sizes()[2]; + + 
auto output = at::empty_like(self,{}, at::MemoryFormat::Contiguous); + TORCH_INTERNAL_ASSERT(output.is_contiguous()); + + const auto max_seq_len = self.sizes()[2]; for (int64_t i = 0; i < num_tensors; i++) { auto seq_len = sizes.index({i, 0}).item(); - auto subseq = attn_scores.index( + auto subseq = self.index( {i, indexing::Slice(), indexing::Slice(0, seq_len), indexing::Slice(0, seq_len)}); auto subscores = at::softmax(subseq, subseq.dim() - 1); - attn_scores.index_put_( + output.index_put_( {i, indexing::Slice(), indexing::Slice(0, seq_len), indexing::Slice(0, seq_len)}, subscores); - attn_scores.index_put_( + output.index_put_( {i, indexing::Slice(), indexing::Slice(0, seq_len), indexing::Slice(seq_len, max_seq_len)}, 0); - attn_scores.index_put_( + output.index_put_( {i, indexing::Slice(), indexing::Slice(seq_len, max_seq_len), indexing::Slice(0, max_seq_len)}, 0); } + return output; } +Tensor NestedTensor_softmax_dropout_cuda(const Tensor& self, const Tensor& query) { + c10::optional attn_mask; + + attn_mask = NestedTensor_to_mask(query, 2, self.size(2)); + attn_mask = attn_mask->to(query.device(), /*non-blocking=*/true); + return _masked_softmax(self, *attn_mask, self.dim() - 1, /*mask type */ 1 ); // NestedTensor_to_mask produces a BxT mask +} Tensor NestedTensor_batch_offsets_from_size_tensor( const Tensor& sizes, @@ -196,8 +210,10 @@ Tensor NestedTensor_batch_offsets_from_size_tensor( return offsets; } + Tensor NestedTensor_to_mask(const Tensor& nt, c10::optional mask_dim, c10::optional mask_dim_length) { auto* nt_impl = get_nested_tensor_impl(nt); + TORCH_CHECK(nested_tensor_impl_is_contiguous(nt_impl), "to_mask only works on contiguous NestedTensors."); TORCH_CHECK( !mask_dim || *mask_dim < nt.dim(), "Requested mask dimension ", @@ -229,5 +245,6 @@ Tensor NestedTensor_to_mask(const Tensor& nt, c10::optional mask_dim, c } return result; } + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.h b/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.h index 96ecfe91c3dd..0f623f896d0f 100644 --- a/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.h +++ b/aten/src/ATen/native/nested/NestedTensorTransformerFunctions.h @@ -50,8 +50,6 @@ Tensor NestedTensor_from_padded_tensor_cpu( const Tensor& padded, const NestedTensorImpl& nt); -void NestedTensor_softmax_dropout(const Tensor& query, Tensor& attn_scores); - Tensor NestedTensor_to_mask(const Tensor& nt, c10::optional mask_dim, c10::optional mask_dim_length); template @@ -85,5 +83,21 @@ void add_padding_kernelLauncher( const std::vector& output_sizes, const int batch_size, const int output_batch_size); + +TORCH_API Tensor flash_attention_helper( + const Tensor& query, + const Tensor& key, + const Tensor& value, + double dropout_p, + bool need_atten_weights, + bool is_causal); + +TORCH_API std::tuple mem_efficient_helper_nested_unpacked( + const Tensor& query, + const Tensor& key, + const Tensor& value, + double dropout_p, + bool need_atten_weights, + bool is_causal); } // namespace native } // namespace at diff --git a/aten/src/ATen/native/nested/NestedTensorUnaryOps.cpp b/aten/src/ATen/native/nested/NestedTensorUnaryOps.cpp new file mode 100644 index 000000000000..6be7239775ea --- /dev/null +++ b/aten/src/ATen/native/nested/NestedTensorUnaryOps.cpp @@ -0,0 +1,74 @@ +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +namespace at { +namespace native { + +Tensor& 
NestedTensor_relu_(Tensor& self) { + auto self_ptr = get_nested_tensor_impl(self); + check_numel_equals_buffer_size(self_ptr); + auto buffer = self_ptr->get_buffer(); + at::relu_(buffer); + return self; +} + +Tensor NestedTensor_relu(const Tensor& self) { + return map_nt(self, at::relu); +} + +Tensor& NestedTensor_gelu_(Tensor& self, c10::string_view approximate) { + auto self_ptr = get_nested_tensor_impl(self); + check_numel_equals_buffer_size(self_ptr); + auto buffer = self_ptr->get_buffer(); + at::gelu_(buffer, approximate); + return self; +} + +Tensor NestedTensor_gelu(const Tensor& self, c10::string_view approximate) { + return map_nt( + self, + [approximate](const Tensor& buffer) { + return at::gelu(buffer, approximate); + }); +} + +Tensor& NestedTensor_tanh_(Tensor& self) { + auto self_ptr = get_nested_tensor_impl(self); + check_numel_equals_buffer_size(self_ptr); + auto buffer = self_ptr->get_buffer(); + at::tanh_(buffer); + return self; +} + +Tensor NestedTensor_tanh(const Tensor& self) { + return map_nt(self, at::tanh); +} + +Tensor& NestedTensor_neg_(Tensor& self) { + auto self_ptr = get_nested_tensor_impl(self); + check_numel_equals_buffer_size(self_ptr); + auto buffer = self_ptr->get_buffer(); + at::neg_(buffer); + return self; +} + +Tensor NestedTensor_neg(const Tensor& self) { + return map_nt(self, at::neg); +} + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/nested/NestedTensorUtils.cpp b/aten/src/ATen/native/nested/NestedTensorUtils.cpp new file mode 100644 index 000000000000..50ca7db6cb6b --- /dev/null +++ b/aten/src/ATen/native/nested/NestedTensorUtils.cpp @@ -0,0 +1,112 @@ +#include +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#include +#include +#endif + +namespace at { +namespace native { + +/** + * Thin wrapper around get_nested_size_tensor that is registered as a native function + * + * @return The nested tensors' size tensor. + */ +at::Tensor _nested_tensor_size(const at::Tensor& self) { + return get_nested_size_tensor(self); +} + +at::Tensor _nested_tensor_strides(const at::Tensor& self){ + return get_nested_tensor_impl(self) -> get_nested_stride_tensor(); +} +std::vector _nested_tensor_offsets(const at::Tensor& self){ + return get_nested_tensor_impl(self) -> get_storage_offsets(); +} + +// Helper functions for getting information about a nested tensor's shape. 
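// The nested size tensor is an (ntensors x ndim-1) int64 matrix whose i-th row
// holds the shape of constituent i. As an illustrative example, nesting tensors
// of shape 3x4 and 5x4 gives the size matrix [[3, 4], [5, 4]]; the column-wise
// maximum computed below is {5, 4}, i.e. the smallest padded shape that can
// hold every constituent.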
+std::vector NestedTensor_get_max_size_from_size_tensor( + const Tensor& sizes) { + if (sizes.dim() == 0) { + return {}; + } + const auto sizes_ptr = sizes.data_ptr(); + const auto sizes_size_0 = sizes.sizes()[0]; + const auto sizes_size_1 = sizes.sizes()[1]; + TORCH_INTERNAL_ASSERT(sizes_size_1 > 0); + std::vector results(sizes_size_1, 0); + for (const auto ii : c10::irange(sizes_size_0)) { + for (const auto jj : c10::irange(sizes_size_1)) { + auto val = sizes_ptr[ii * sizes_size_1 + jj]; + if (results[jj] < val) { + results[jj] = val; + } + } + } + return results; +} + +std::vector NestedTensor_get_max_size(const NestedTensorImpl& nt) { + return NestedTensor_get_max_size_from_size_tensor( + nt.get_nested_size_tensor()); +} + +int64_t get_consistent_last_dim_of_nested_tensor(const NestedTensorImpl& nt) { + c10::optional last_dim = nt.opt_size(-1); + TORCH_CHECK( + last_dim != c10::nullopt, + "Expected all tensors in nested tensor to have the same trailing dimension, instead last dimension equals: ", + nt.get_nested_size_tensor().select(1, -1)); + return *last_dim; +} + +std::vector chunk_nested_tensor(const Tensor& self, int64_t chunks, int64_t dim) { + int64_t ndim = self.dim(); + if (ndim == 0) { + TORCH_CHECK_INDEX(false, "chunk() cannot be applied to a 0-dim tensor."); + } + dim = maybe_wrap_dim(dim, ndim); + TORCH_CHECK(self.dim() - 1 == dim, + "Chunk for nested tensors is currently only supported for the last dimension."); + TORCH_CHECK(chunks > 0,"chunk expects `chunks` to be greater than 0, got: ", chunks); + TORCH_CHECK(self.is_contiguous(), "chunk expects `self` to be contiguous."); + auto self_impl = get_nested_tensor_impl(self); + const int64_t last_dim_size = get_consistent_last_dim_of_nested_tensor(*self_impl); + TORCH_CHECK(last_dim_size % chunks == 0, + "Chunk for nested tensors is only supported for nested tensors with trailing dimension divisible by chunks, got: ", + last_dim_size, " % ", chunks, " != 0"); + int64_t n_tensors = self.size(0); + int64_t split_size = last_dim_size / chunks; + std::vector splits(chunks); + const auto& sizes = self_impl->get_nested_size_tensor(); + const auto& strides = self_impl->get_nested_stride_tensor(); + const std::vector& offsets = self_impl->get_storage_offsets(); + // Account for the implicit batch dim + --dim; + int64_t tensor_dim = sizes.size(1); + for (const auto split_idx : c10::irange(chunks)) { + auto new_sizes = sizes.clone() ; + auto new_strides = strides.clone(); + // This copys offsets so we are safe to move + auto new_offsets = std::vector(offsets); + int64_t *size_ptr = new_sizes.data_ptr(); + // Get start val for each split + int64_t start_val = split_idx * split_size; + for (int64_t i : c10::irange(n_tensors)) { + const int64_t index = i * tensor_dim + dim; + new_offsets[i] = offsets[i] + start_val; + size_ptr[index] = split_size; + } + splits[split_idx] = create_nested_view_tensor(self, new_sizes, new_strides, std::move(new_offsets)); + } + return splits; +} + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/nested/NestedTensorUtils.h b/aten/src/ATen/native/nested/NestedTensorUtils.h new file mode 100644 index 000000000000..6590db9116e0 --- /dev/null +++ b/aten/src/ATen/native/nested/NestedTensorUtils.h @@ -0,0 +1,423 @@ +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS + +#include +#include +#else +#include +#include +#include +#include +#include +#include +#endif + +#include + +namespace at { +namespace native 
{ +struct NestedTensorImpl; + +// The following functions are used to construct nested tensors from buffers and +// metadata. + +inline at::Tensor wrap_buffer( + at::Tensor buffer, + at::Tensor nested_size_tensor) { + TORCH_INTERNAL_ASSERT_DEBUG_ONLY( + buffer.is_contiguous(), "Given buffer must be contiguous."); + return at::detail::make_tensor( + std::move(buffer), std::move(nested_size_tensor)); +} + +inline at::Tensor wrap_buffer( + at::Tensor buffer, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets) { + TORCH_INTERNAL_ASSERT_DEBUG_ONLY( + buffer.is_contiguous(), "Given buffer must be contiguous."); + return at::detail::make_tensor( + std::move(buffer), + std::move(nested_size_tensor), + std::move(nested_stride_tensor), + std::move(offsets)); +} + +inline at::Tensor wrap_buffer( + at::Tensor buffer, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + const std::vector& offsets) { + std::vector offsets_copy(offsets); + return wrap_buffer( + buffer, + nested_size_tensor, + nested_stride_tensor, + std::move(offsets_copy)); +} + +inline at::Tensor get_buffer(const at::Tensor& tensor) { + return get_nested_tensor_impl(tensor)->get_buffer(); +} + +/** + * Create a new nested tensor that is a view of a base nested tensor + * + * create_view_tensor calls a specialized constructor that copys the + * the keys from base onto the new view tensor being created. + * The storage is shared between the base and the returned view tensor + * + * All callers of this helper must: + * - Only return a view of the input + * - Must be explicit and define a derivative + * + * @param base Base tensor to construct view from. + * @param nested_size_tensor View tensors' sizes. + * @param nested_stride_tensor View tensors' strides. + * @param offsets View tensors' offsets. + * @return A newly constructed view tensor + */ +inline at::Tensor create_nested_view_tensor( + const at::Tensor& base, + at::Tensor nested_size_tensor, + at::Tensor nested_stride_tensor, + std::vector&& offsets) { + TORCH_INTERNAL_ASSERT( + base.is_nested(), + "This function can only be used to create nested tensor views"); + TORCH_INTERNAL_ASSERT( + c10::impl::tls_local_dispatch_key_set().excluded_.has( + c10::DispatchKey::AutogradFunctionality), + "Creating a non differentiable nested tensor view in a CompositeImplicit function is not allowed."); + return at::detail::make_tensor( + c10::TensorImpl::VIEW, + base, + nested_size_tensor, + nested_stride_tensor, + std::move(offsets)); +} +// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +// Helper functions for getting information about a nested tensor's shape. 
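// get_consistent_last_dim_of_nested_tensor (implemented in NestedTensorUtils.cpp)
// returns the trailing dimension only when it is identical across all
// constituents: for example, nesting a 3x7 and a 5x7 tensor gives a consistent
// last dimension of 7, while nesting 3x7 and 5x6 throws. chunk_nested_tensor
// relies on this invariant.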
+ +int64_t get_consistent_last_dim_of_nested_tensor(const NestedTensorImpl& nt); + +// The sizes of the underlying tensors +inline std::vector NestedTensor_get_sizes( + const NestedTensorImpl* self_ptr) { + int64_t ntensors = self_ptr->size(0); + std::vector sizes(ntensors); + if (ntensors == 0) { + return sizes; + } + const Tensor& sizemat = self_ptr->get_nested_size_tensor(); + int64_t orig_dim = sizemat.size(1); + // nesting scalars has empty sizes + if (orig_dim == 0) { + return sizes; + } + const int64_t* sizemat_ptr = sizemat.data_ptr(); + + for (const auto i : c10::irange(ntensors)) { + sizes[i] = IntArrayRef(sizemat_ptr, sizemat_ptr + orig_dim); + sizemat_ptr += orig_dim; + } + return sizes; +} + +TORCH_API std::vector NestedTensor_get_max_size( + const NestedTensorImpl& nt); + +std::vector NestedTensor_get_max_size_from_size_tensor( + const Tensor& sizes); + +inline std::vector NestedTensor_get_sizes(const at::Tensor& self) { + const NestedTensorImpl* self_ptr = get_nested_tensor_impl(self); + return NestedTensor_get_sizes(self_ptr); +} +// The strides of the underlying tensors +inline std::vector NestedTensor_get_strides( + const NestedTensorImpl* self_ptr) { + int64_t ntensors = self_ptr->size(0); + std::vector strides(ntensors); + if (ntensors == 0) { + return strides; + } + const Tensor& stridemat = self_ptr->get_nested_stride_tensor(); + int64_t orig_dim = stridemat.size(1); + // nesting scalars has empty strides + if (orig_dim == 0) { + return strides; + } + const int64_t* stridemat_ptr = stridemat.data_ptr(); + for (const auto i : c10::irange(ntensors)) { + strides[i] = IntArrayRef(stridemat_ptr, stridemat_ptr + orig_dim); + stridemat_ptr += orig_dim; + } + return strides; +} + +inline std::vector NestedTensor_get_strides( + const at::Tensor& self) { + const NestedTensorImpl* self_ptr = get_nested_tensor_impl(self); + return NestedTensor_get_strides(self_ptr); +} + +inline void check_numel_equals_buffer_size(const at::Tensor& self) { + auto self_impl = get_nested_tensor_impl(self); + TORCH_CHECK( + self.numel() == self_impl->get_buffer_size(), + "Number of elements in nested tensor must match number of elements in buffer."); +} + +inline void check_numel_equals_buffer_size(const NestedTensorImpl* self_ptr) { + TORCH_CHECK( + self_ptr->numel() == self_ptr->get_buffer_size(), + "Number of elements in nested tensor must match number of elements in buffer."); +} +// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +// Data structures and functions for generically applying a function on a nested +// tensor. 
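// The NestedNode/map machinery below applies a function to each constituent and
// rewraps the results into a new nested tensor. A minimal usage sketch (the op
// name is hypothetical, not part of this patch):
//
//   Tensor NestedTensor_cos(const Tensor& self) {
//     return map_nested_tensor(
//         [](const Tensor& t) { return t.cos(); }, self);
//   }
//
// map_nested_tensor unbinds self into its constituents, applies the lambda to
// each, and re-packs the outputs with wrap_tensor_node; the unary ops above use
// map_nt instead, which applies the function once to the underlying buffer.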
+namespace impl { + +template +struct NestedNode { + NestedNode() = delete; + explicit NestedNode(std::vector&& children) + : _is_leaf(false), _children(children) {} + explicit NestedNode(TensorList children) + : _is_leaf(false), _children(children.vec()) {} + // NestedNode(NestedNode&) = delete; + // NestedNode(const NestedNode&) = delete; + // NestedNode& operator=(NestedNode) = delete; + explicit NestedNode(T payload) : _is_leaf(true), _payload(payload) {} + inline bool is_leaf() const { + return _is_leaf; + } + inline size_t degree() const { + return _children.size(); + } + inline const std::vector unbind() const { + return _children; + } + inline T children(size_t i) const { + return _children[i]; + } + inline const T& payload() const { + return _payload; + } + inline T& payload() { + return _payload; + } + + private: + bool _is_leaf; + std::vector _children; + T _payload; +}; + +using TensorNode = NestedNode; + +template +class _map; + +template +class _map> { + public: + static A function_one(F&& fn, const Args&... nested_node) { + return std::forward(fn)(nested_node...); + } + // NOTE: We must move F to avoid copying objects if it is a lambda with + // captures. + static NestedNode function( + F&& fn, + const NestedNode&... nested_node) { + size_t degree = 0; + bool all_leaf = true; + c10::guts::tuple_map( + std::forward_as_tuple(nested_node...), [&all_leaf, °ree](auto n) { + all_leaf = all_leaf && (n.is_leaf()); + if (degree > 1 && n.degree() > 1) { + TORCH_CHECK( + degree == n.degree(), "NestedNodes must match in degree."); + } + if (n.degree() > degree) { + degree = n.degree(); + } + return nullptr; + }); + // All NestedNodes just wrap regular objects. + if (all_leaf) { + return NestedNode(std::forward(fn)(nested_node.payload()...)); + } + // Some NestedNodes wrap regular Tensors, some NestedTensors and some other + // types. + std::vector result; + for (size_t i = 0; i < degree; i++) { + std::tuple children = c10::guts::tuple_map( + std::forward_as_tuple(nested_node...), [&i](auto a) { + static_assert( + c10::guts::is_instantiation_of::value, + "Internal error."); + // Broadcast regular arguments across NestedTensor constituents. + // This could be a Tensor, integer or anything else really. + if (a.is_leaf()) { + return a.payload(); + } + // Broadcast NestedTensors with one constituent. + if (a.degree() == 1 && !a.is_leaf()) { + return a.children(0); + } + TORCH_CHECK(a.degree() > 0, "Internal assert."); + return a.children(i); + }); + c10::guts::apply( + [&result, &fn](Args... filtered) { + result.emplace_back(function_one(std::forward(fn), filtered...)); + }, + std::move(children)); + } + return NestedNode(std::move(result)); + } +}; + +// TODO: Add static assert to verify lambda arguments match nested_node types +template +static inline NestedNode< + typename c10::guts::infer_function_traits::type::return_type> +map(F&& fn, const NestedNode&... 
nested_node) { + return _map< + F, + typename c10::guts::infer_function_traits::type::return_type, + typename c10::guts::infer_function_traits::type::parameter_types>:: + function(std::forward(fn), nested_node...); +} + +inline TensorNode get_nested_tensor_structure(at::Tensor tensor) { + if (get_nested_tensor_impl_or_null(tensor) == nullptr) { + return TensorNode(std::move(tensor)); + } + return TensorNode(tensor.unbind()); +} + +inline Tensor wrap_tensor_node( + TensorNode tensor_node, + c10::optional dtype, + c10::optional layout, + c10::optional device, + c10::optional pin_memory) { + TORCH_CHECK( + !tensor_node.is_leaf(), "Expected TensorNode to wrap a list of Tensors."); + TensorOptions options_ = + TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory( + pin_memory); + if (tensor_node.degree() == 0) { + return wrap_buffer(ones({0}, dtype, layout, device), ones({})); + } + + // Fast path: if all tensors are on CPU, have contiguous memory, and the same + // dtype, copying can be done much faster. + bool all_tensors_cpu = true; + bool all_tensors_contiguous = true; + bool all_tensors_same_dtype = true; + auto first_dtype = tensor_node.children(0).dtype(); + std::vector start_offsets(tensor_node.degree()); + start_offsets[0] = 0; + long total_size = 0; + for (const auto i : c10::irange(tensor_node.degree())) { + all_tensors_cpu = all_tensors_cpu && tensor_node.children(i).is_cpu(); + all_tensors_contiguous = + all_tensors_contiguous && tensor_node.children(i).is_contiguous(); + all_tensors_same_dtype = all_tensors_same_dtype && + (first_dtype == tensor_node.children(i).dtype()); + if (!(all_tensors_cpu && all_tensors_contiguous && + all_tensors_same_dtype)) { + break; + } + if (i > 0) { + start_offsets[i] = + start_offsets[i - 1] + tensor_node.children(i - 1).numel(); + } + total_size += tensor_node.children(i).numel(); + } + + TensorOptions options; + Tensor nt_buffer, nt_sizes; + if (all_tensors_cpu && all_tensors_contiguous && all_tensors_same_dtype) { + nt_buffer = at::empty({total_size}, tensor_node.children(0).options()); + nt_sizes = at::empty( + {static_cast(tensor_node.degree()), + static_cast(tensor_node.children(0).sizes().size())}, + TensorOptions().dtype(kLong)); + AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3( + at::ScalarType::Half, + at::ScalarType::Bool, + at::ScalarType::BFloat16, + c10::typeMetaToScalarType(first_dtype), + "create_nt_buffer", + [&]() { + at::parallel_for( + 0, tensor_node.degree(), 1, [&](int64_t begin, int64_t end) { + for (int64_t i = begin; i < end; ++i) { + // Only try copying memory if there is more than 0 elements + // for a certain tensor + if (tensor_node.children(i).numel() > 0) { + memcpy( + nt_buffer.data_ptr() + start_offsets[i], + tensor_node.children(i).data_ptr(), + tensor_node.children(i).numel() * sizeof(scalar_t)); + } + } + }); + }); + long sizes_offset = 0; + for (size_t i = 0; i < tensor_node.degree(); ++i) { + auto tensor_sizes = tensor_node.children(i).sizes(); + for (size_t j = 0; j < tensor_sizes.size(); ++j) { + nt_sizes.data_ptr()[sizes_offset++] = tensor_sizes[j]; + } + } + options = nt_buffer.options().merge_in(options_); + } else { // Slow path + std::vector flat_tensors; + std::vector sizes; + for (const auto i : c10::irange(tensor_node.degree())) { + flat_tensors.push_back(tensor_node.children(i).reshape(-1).contiguous()); + sizes.push_back( + tensor(c10::IntArrayRef(tensor_node.children(i).sizes()))); + } + options = flat_tensors[0].options().merge_in(options_); + nt_buffer = at::cat(flat_tensors); + 
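// Slow path, taken when the constituents are not all contiguous CPU tensors of
// a single dtype: each constituent was flattened to 1-D and concatenated into
// one buffer above; its shape is recorded as a row of the nested size matrix,
// which is stacked below and handed to wrap_buffer together with the buffer.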
nt_sizes = at::native::stack(sizes); + } + + return wrap_buffer(nt_buffer.to(options), nt_sizes); +} + +} // namespace impl + +// This function is meant to ease rapid operator coverage for +// NestedTensor kernels. It is not meant to be efficient. Use it judiciously. +template +inline at::Tensor map_nested_tensor(F&& fn, A... a) { + return wrap_tensor_node( + impl::map(std::forward(fn), impl::get_nested_tensor_structure(a)...), + c10::nullopt, + c10::nullopt, + c10::nullopt, + c10::nullopt); +} + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/nested/cuda/NestedTensorBinaryOps.cu b/aten/src/ATen/native/nested/cuda/NestedTensorBinaryOps.cu new file mode 100644 index 000000000000..678e62f5a81c --- /dev/null +++ b/aten/src/ATen/native/nested/cuda/NestedTensorBinaryOps.cu @@ -0,0 +1,120 @@ +#include + +#include + +#include +#include + +#include +#include +#include +#include +#include + +#include +#include + + +#include + +#define BLOCK_DIM 256 + +namespace at { +namespace native { + + +// only for nested [B, *, D], dense [B, 1, D] +template +__global__ void op_dense_esuhm( + const T* input, + const T* dense, + T* output, + int64_t embedding_dim, + const int64_t* offsets, + const func_t& f) +{ + // each batch is handled by a block + const int64_t batch_idx = blockIdx.x; + const int64_t grain_size = blockDim.x; + const int64_t tid = threadIdx.x; + const int64_t range = offsets[batch_idx + 1] - offsets[batch_idx]; + // each thread handles (embedding_dim // grain_size + (embedding_dim % grain_size <= tid)) elems + // of the dense embedding + for (int64_t idx = tid; idx < embedding_dim; idx += grain_size) { + const T dense_elem = dense[batch_idx * embedding_dim + idx]; + for (int64_t nested_idx = idx; nested_idx < range; nested_idx += embedding_dim) { + output[offsets[batch_idx] + nested_idx] = f(input[offsets[batch_idx] + nested_idx], dense_elem); + } + } +} + +template +void nested_op_dense_kernelLauncher( + const T* input, // [sum(*) x embedding_dim] + const T* dense, // [batch_size x embedding_dim] + T* output, // [sum(*) x embedding_dim] + int64_t batch_size, + int64_t embedding_dim, + const int64_t* input_offsets, // [batch_size] + func_t f) +{ + dim3 grid; + grid.x = batch_size; + const auto stream = at::cuda::getDefaultCUDAStream(); + + op_dense_esuhm<<>>( + input, + dense, + output, + embedding_dim, + input_offsets, + f); +} + +template +void _nested_op_dense_esuhm_kernel(Tensor& result, const Tensor& self, const Tensor& other, func_t f) { + auto self_ptr = get_nested_tensor_impl(self); + auto result_ptr = get_nested_tensor_impl(result); + + const auto self_buffer = self_ptr->get_buffer(); + const auto offsets = self_ptr->get_storage_offsets(); + const auto batch_size = other.size(0); + const auto embedding_size = other.size(2); + + auto result_buffer = result_ptr->get_buffer(); + auto result_offsets = at::cat({at::tensor(offsets), at::tensor(self_ptr->numel())}); + result_offsets = result_offsets.to(kCUDA); + + const scalar_t* self_data_ptr = self_buffer.data_ptr(); + const scalar_t* other_data_ptr = other.data_ptr(); + scalar_t* result_data_ptr = result_buffer.data_ptr(); + int64_t* result_offsets_ptr = result_offsets.data_ptr(); + + nested_op_dense_kernelLauncher( + self_data_ptr, + other_data_ptr, + result_data_ptr, + batch_size, + embedding_size, + result_offsets_ptr, + f); +} + +void _nested_op_dense_esuhm_cuda(Tensor& result, const Tensor& self, const Tensor& other, const NESTED_DENSE_OP& op) { + AT_DISPATCH_ALL_TYPES_AND2( + ScalarType::Half, 
ScalarType::BFloat16, self.scalar_type(), "_nested_op_dense_esuhm", [&]() { + switch (op) { + case NESTED_DENSE_OP::ADD : + _nested_op_dense_esuhm_kernel(result, self, other, [] __host__ __device__ (scalar_t a, scalar_t b) -> scalar_t { return a + b; }); + break; + case NESTED_DENSE_OP::MUL : + _nested_op_dense_esuhm_kernel(result, self, other, [] __host__ __device__ (scalar_t a, scalar_t b) -> scalar_t { return a * b; }); + break; + } + }); +} + +REGISTER_CUDA_DISPATCH(nested_dense_elementwise_stub, &_nested_op_dense_esuhm_cuda); + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/nested/cuda/NestedTensorMatmul.cu b/aten/src/ATen/native/nested/cuda/NestedTensorMatmul.cu new file mode 100644 index 000000000000..22cf38f85020 --- /dev/null +++ b/aten/src/ATen/native/nested/cuda/NestedTensorMatmul.cu @@ -0,0 +1,416 @@ +#include + +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include +#include + +#ifndef USE_ROCM +#ifndef _WIN32 +#include +#include +#include +#endif +#endif + +#include + +#define BLOCK_DIM 256 +#define GRID_DIM_Y 16 + +namespace at { +namespace native { + +#ifndef USE_ROCM +#ifndef _WIN32 +namespace { + +template < + typename scalar_t, + unsigned int kPad, + typename LayoutA, + typename LayoutB, + typename OpClass, + typename Arch, + typename ThreadBlockShape, + typename WarpShape, + typename InstructionShape> +void gemm_grouped_cuda_internal( + const std::vector& lda, + const std::vector& ldb, + const std::vector& ldd, + const std::vector& aptr, + const std::vector& bptr, + const std::vector& dptr, + const std::vector& gemm_sizes, + const int problem_count, + at::Device& device) { + using Element = scalar_t; + using ElementAcc = float; + + using GemmConfiguration = + typename cutlass::gemm::device::DefaultGemmConfiguration< + OpClass, + Arch, + Element, + Element, + Element, + ElementAcc>; + + using GemmKernel = typename cutlass::gemm::kernel::DefaultGemmGrouped< + Element, + LayoutA, + cutlass::ComplexTransform::kNone, + kPad, + Element, + LayoutB, + cutlass::ComplexTransform::kNone, + kPad, + Element, + cutlass::layout::RowMajor, + ElementAcc, + OpClass, + Arch, + ThreadBlockShape, + WarpShape, + InstructionShape, + typename GemmConfiguration::EpilogueOutputOp, + cutlass::gemm::threadblock::GemmBatchedIdentityThreadblockSwizzle, + GemmConfiguration::kStages>::GemmKernel; + + using GemmGrouped = typename cutlass::gemm::device::GemmGrouped; + using EpilogueOutputOp = typename GemmGrouped::GemmKernel::Epilogue::OutputOp; + typename EpilogueOutputOp::Params epilogue_op(/*alpha*/ 1, /*beta*/ 0); + + const int64_t gemm_coord_size = + problem_count * ((int64_t)sizeof(cutlass::gemm::GemmCoord)); + // Number of gmm args not including *problem_sizes + at::Tensor gmm_args = at::empty( + {problem_count * 6 + gemm_coord_size}, + at::TensorOptions().dtype(at::kLong).pinned_memory(true)); + + // Obtain pointers for each argument (on host) + int64_t* lda_data = gmm_args.data_ptr(); // Base pointer + int64_t* ldb_data = lda_data + problem_count; + int64_t* ldd_data = lda_data + 2 * problem_count; + int64_t* ptr_a_data = lda_data + 3 * problem_count; + int64_t* ptr_b_data = lda_data + 4 * problem_count; + int64_t* ptr_d_data = lda_data + 5 * problem_count; + cutlass::gemm::GemmCoord* problem_sizes_data = + reinterpret_cast(lda_data + 6 * problem_count); + + // Set arguments into gmm_args from input args + for (int i = 0; i < problem_count; ++i) { + problem_sizes_data[i] = gemm_sizes[i]; + 
lda_data[i] = lda[i]; + ldb_data[i] = ldb[i]; + ldd_data[i] = ldd[i]; + ptr_a_data[i] = reinterpret_cast(aptr[i]); + ptr_b_data[i] = reinterpret_cast(bptr[i]); + ptr_d_data[i] = reinterpret_cast(dptr[i]); + } + const int threadblock_count = + GemmGrouped::sufficient(problem_sizes_data, problem_count); + + // Transfer arguments to GPU + gmm_args = gmm_args.to(device, true); + + // Obtain pointers for each of arguments (on GPU) + lda_data = gmm_args.data_ptr(); // Base pointer + ldb_data = lda_data + problem_count; + ldd_data = lda_data + 2 * problem_count; + ptr_a_data = lda_data + 3 * problem_count; + ptr_b_data = lda_data + 4 * problem_count; + ptr_d_data = lda_data + 5 * problem_count; + problem_sizes_data = + reinterpret_cast(lda_data + 6 * problem_count); + + // Create GemmGrouped::Arguments using the arguments prepared above + typename GemmGrouped::Arguments args( + problem_sizes_data, + problem_count, + threadblock_count, + epilogue_op, + reinterpret_cast(ptr_a_data), + reinterpret_cast(ptr_b_data), + reinterpret_cast(ptr_d_data), + reinterpret_cast(ptr_d_data), + lda_data, + ldb_data, + ldd_data, + ldd_data); + + GemmGrouped gemm; + cutlass::Status status = + gemm.initialize(args, nullptr, at::cuda::getCurrentCUDAStream()); + TORCH_CHECK( + status != cutlass::Status::kErrorWorkspaceNull, + "Failed to initialize CUTLASS Grouped GEMM kernel due to workspace."); + TORCH_CHECK( + status != cutlass::Status::kErrorInternal, + "Failed to initialize CUTLASS Grouped GEMM kernel due to internal error."); + TORCH_CHECK( + status == cutlass::Status::kSuccess, + "Failed to initialize CUTLASS Grouped GEMM kernel."); + + // Run CUTLASS group GEMM + status = gemm.run(at::cuda::getCurrentCUDAStream()); + TORCH_CHECK( + status == cutlass::Status::kSuccess, + "Failed to run CUTLASS Grouped GEMM kernel."); + + C10_CUDA_KERNEL_LAUNCH_CHECK(); +} + +template +bool group_gemm_dispatch( + at::Device device, + const std::vector& aptr, + const std::vector& bptr, + const std::vector& dptr, + const std::vector& lda, + const std::vector& ldb, + const std::vector& ldd, + std::vector gemm_sizes, + int64_t ntensors) { + return false; +} + +template <> +bool group_gemm_dispatch( + at::Device device, + const std::vector& aptr, + const std::vector& bptr, + const std::vector& dptr, + const std::vector& lda, + const std::vector& ldb, + const std::vector& ldd, + std::vector gemm_sizes, + int64_t ntensors) { + + gemm_grouped_cuda_internal< + float, + 1, + cutlass::layout::RowMajor, + cutlass::layout::RowMajor, + cutlass::arch::OpClassSimt, + cutlass::arch::Sm80, + cutlass::gemm::GemmShape<128, 128, 8>, + cutlass::gemm::GemmShape<64, 32, 8>, + cutlass::gemm::GemmShape<1, 1, 1>>( + lda, ldb, ldd, aptr, bptr, dptr, gemm_sizes, ntensors, device); + return true; +} + +template <> +bool group_gemm_dispatch( + at::Device device, + const std::vector& aptr_, + const std::vector& bptr_, + const std::vector& dptr_, + const std::vector& lda, + const std::vector& ldb, + const std::vector& ldd, + std::vector gemm_sizes, + int64_t ntensors) { + + // Check alignment + bool all_pad_8 = true; + for (int i = 0; i < ntensors; i++) { + all_pad_8 = all_pad_8 && (gemm_sizes[i].n() % 8 == 0); + all_pad_8 = all_pad_8 && (gemm_sizes[i].k() % 8 == 0); + + // Not sure if this is a requirement, on the safe side + all_pad_8 = all_pad_8 && (lda[i] % 8 == 0); + all_pad_8 = all_pad_8 && (ldb[i] % 8 == 0); + all_pad_8 = all_pad_8 && (ldd[i] % 8 == 0); + } + + std::vector aptr; + std::vector bptr; + std::vector dptr; + for (int64_t i = 0; i < ntensors; 
i++) { + aptr.push_back(reinterpret_cast(aptr_[i])); + bptr.push_back(reinterpret_cast(bptr_[i])); + dptr.push_back(reinterpret_cast(dptr_[i])); + } + if (all_pad_8) { + gemm_grouped_cuda_internal< + cutlass::half_t, + 8, + cutlass::layout::RowMajor, + cutlass::layout::RowMajor, + cutlass::arch::OpClassTensorOp, + cutlass::arch::Sm80, + cutlass::gemm::GemmShape<128, 128, 32>, + cutlass::gemm::GemmShape<64, 64, 32>, + cutlass::gemm::GemmShape<16, 8, 16>>( + lda, ldb, ldd, aptr, bptr, dptr, gemm_sizes, ntensors, device); + return true; + } else { + gemm_grouped_cuda_internal< + cutlass::half_t, + 1, + cutlass::layout::RowMajor, + cutlass::layout::RowMajor, + cutlass::arch::OpClassSimt, + cutlass::arch::Sm80, + cutlass::gemm::GemmShape<128, 128, 8>, + cutlass::gemm::GemmShape<64, 32, 8>, + cutlass::gemm::GemmShape<1, 1, 1>>( + lda, ldb, ldd, aptr, bptr, dptr, gemm_sizes, ntensors, device); + return true; + } + // Did not perform GEMM + return false; +} + +} // namespace + +#endif +#endif + +Tensor bmm_nested_cuda(const Tensor& self, const Tensor& mat2) { + if (self.is_nested() && !mat2.is_nested()) { + AT_ERROR( + "Expected both to be nested, but got a nested self and non-nested other"); + } else if (!self.is_nested() && mat2.is_nested()) { + AT_ERROR( + "Expected both to be nested, but got a non-nested self and nested other"); + } + // dispatcher should have guaranteed that at least one is nested + auto self_ptr = get_nested_tensor_impl(self); + auto mat2_ptr = get_nested_tensor_impl(mat2); + TORCH_CHECK(self_ptr->dim() == 3, "batch1 must be a 3D tensor"); + TORCH_CHECK(mat2_ptr->dim() == 3, "batch2 must be a 3D tensor"); + int64_t ntensors = self_ptr->size(0), ntensors2 = mat2_ptr->size(0); + TORCH_CHECK( + ntensors == ntensors2, + "Expected size for the 1st dimension of batch2 tensor to be: ", + ntensors, + " but got: ", + ntensors2, + "."); + + // create a contiguous output + const Tensor& self_sizemat = self_ptr->get_nested_size_tensor(); + Tensor out_sizemat = self_sizemat.new_empty(self_sizemat.sizes()); + int64_t* out_sizemat_ptr = out_sizemat.data_ptr(); + + std::vector self_sizes = NestedTensor_get_sizes(self_ptr); + std::vector mat2_sizes = NestedTensor_get_sizes(mat2_ptr); + + int64_t out_numel = 0; + for (int64_t i = 0; i < ntensors; i++) { + const IntArrayRef &self_shape = self_sizes[i], &mat2_shape = mat2_sizes[i]; + const int64_t &self_size0 = self_shape[0], &self_size1 = self_shape[1], + &mat2_size0 = mat2_shape[0], &mat2_size1 = mat2_shape[1]; + TORCH_CHECK( + self_size1 == mat2_size0, + i, + "-th nested matrices in batch cannot be multiplied (", + self_size0, + "x", + self_size1, + " and ", + mat2_size0, + "x", + mat2_size1, + ")"); + out_sizemat_ptr[0] = self_size0; + out_sizemat_ptr[1] = mat2_size1; + out_sizemat_ptr += 2; + out_numel += self_size0 * mat2_size1; + } + const Tensor &self_buffer = self_ptr->get_unsafe_storage_as_tensor(); + const Tensor &mat2_buffer = mat2_ptr->get_unsafe_storage_as_tensor(); + Tensor out_buffer = self_buffer.new_empty(out_numel); + Tensor output = wrap_buffer(out_buffer, out_sizemat); + auto out_ptr = get_nested_tensor_impl(output); + + std::vector self_strides = NestedTensor_get_strides(self_ptr); + std::vector mat2_strides = NestedTensor_get_strides(mat2_ptr); + const std::vector& self_offsets = self_ptr->get_storage_offsets(); + const std::vector& mat2_offsets = mat2_ptr->get_storage_offsets(); + const std::vector& out_offsets = out_ptr->get_storage_offsets(); + +#ifndef USE_ROCM +#ifndef _WIN32 + bool success = false; + 
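// The dispatch block below attempts to run all ntensors matrix multiplications
// as one CUTLASS grouped GEMM (fp32 via a SIMT kernel; fp16 via tensor cores
// when every problem size and leading dimension is a multiple of 8, otherwise
// SIMT). The fast path is only taken when both inputs are contiguous, every
// constituent is row-major (innermost stride 1), and the device is sm8x; in all
// other cases, or on ROCm/Windows builds, the code falls back to a
// per-constituent at::mm_out over strided views of the buffers.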
AT_DISPATCH_FLOATING_TYPES_AND_HALF( + self.scalar_type(), "group_gemm_dispatch", [&] { + std::vector aptr(ntensors); + std::vector bptr(ntensors); + std::vector dptr(ntensors); + std::vector lda(ntensors); + std::vector ldb(ntensors); + std::vector ldd(ntensors); + std::vector gemm_sizes; + bool all_row_major = true; + for (int64_t i = 0; i < ntensors; i++) { + const IntArrayRef& self_shape = self_sizes[i]; + const IntArrayRef& mat2_shape = mat2_sizes[i]; + const int64_t &self_size0 = self_shape[0]; + const int64_t &self_size1 = self_shape[1]; + const int64_t &mat2_size0 = mat2_shape[0]; + const int64_t &mat2_size1 = mat2_shape[1]; + gemm_sizes.push_back( + cutlass::gemm::GemmCoord(self_size0, mat2_size1, self_size1)); + aptr[i] = self_buffer.data_ptr() + self_offsets[i]; + bptr[i] = mat2_buffer.data_ptr() + mat2_offsets[i]; + dptr[i] = out_buffer.data_ptr() + out_offsets[i]; + all_row_major = all_row_major && (self_strides[i][1] == 1); + all_row_major = all_row_major && (mat2_strides[i][1] == 1); + lda[i] = self_strides[i][0]; + ldb[i] = mat2_strides[i][0]; + ldd[i] = mat2_size1; + } + auto dprops = at::cuda::getCurrentDeviceProperties(); + bool is_sm8x = dprops->major == 8 && dprops->minor >= 0; + if (all_row_major && + self.is_contiguous() && + mat2.is_contiguous() && + is_sm8x) { + success = group_gemm_dispatch( + output.device(), + aptr, + bptr, + dptr, + lda, + ldb, + ldd, + gemm_sizes, + ntensors); + } + }); + if (success) { + return output; + } +#endif +#endif + + std::vector output_unbind = output.unbind(); + for (int64_t i = 0; i < ntensors; i++) { + at::mm_out( + output_unbind[i], + self_buffer.as_strided(self_sizes[i], self_strides[i], self_offsets[i]), + mat2_buffer.as_strided( + mat2_sizes[i], mat2_strides[i], mat2_offsets[i])); + } + return output; +} + +} // namespace native +} // namespace at diff --git a/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cpp b/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cpp index d89e5c5763d7..9c72454560d3 100644 --- a/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cpp +++ b/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cpp @@ -1,7 +1,11 @@ +#include +#include #include +#include #include #include +#include #ifndef AT_PER_OPERATOR_HEADERS #include @@ -9,9 +13,13 @@ #include #endif +#include #include #include +#include +#include +#include namespace at { namespace native { namespace { @@ -36,16 +44,16 @@ Tensor nested_from_padded_cuda( const Tensor& sizes, bool do_transform_0213) { if (padded.dim() > 1 && padded.dim() < 5) { + // Instead of erroring call the generic version + if(!(padded.dim() == 4 && do_transform_0213) && !(padded.dim() == 3 && !do_transform_0213)){ + return at::native::nested_from_padded_generic(padded, sizes, do_transform_0213); + } if (padded.dtype() != kFloat && padded.dtype() != kHalf) { TORCH_WARN_ONCE( "nested_from_padded CUDA kernels only support fp32/fp16; falling " "back to slower generic kernel"); return at::native::nested_from_padded_generic(padded, sizes, do_transform_0213); } - TORCH_CHECK( - (padded.dim() == 4 && do_transform_0213) || - (padded.dim() == 3 && !do_transform_0213), - "padded tensor size error"); Tensor target_offsets = NestedTensor_batch_offsets_from_size_tensor(sizes, 0); Tensor padded_sizes_tensor = at::tensor(padded.sizes()); @@ -60,45 +68,46 @@ Tensor nested_from_padded_cuda( auto input_size_ptr = output_size_ptr + target_size_sizes.numel(); auto offsets_ptr = input_size_ptr + padded_sizes_tensor.numel(); + Tensor 
padded_contiguous = padded.contiguous(); if (padded.dtype() == kFloat) { if (do_transform_0213) { remove_padding_transform0213_kernelLauncher( - padded.data_ptr(), + padded_contiguous.data_ptr(), output.data_ptr(), offsets_ptr, input_size_ptr, output_size_ptr, - padded.dim() - 2, - padded.sizes()[0]); + padded_contiguous.dim() - 2, + padded_contiguous.sizes()[0]); } else { remove_padding_kernelLauncher( - padded.data_ptr(), + padded_contiguous.data_ptr(), output.data_ptr(), offsets_ptr, input_size_ptr, output_size_ptr, - padded.dim() - 1, - padded.sizes()[0]); + padded_contiguous.dim() - 1, + padded_contiguous.sizes()[0]); } } else if (padded.dtype() == kHalf) { if (do_transform_0213) { remove_padding_transform0213_kernelLauncher( - padded.data_ptr(), + padded_contiguous.data_ptr(), output.data_ptr(), offsets_ptr, input_size_ptr, output_size_ptr, - padded.dim() - 2, - padded.sizes()[0]); + padded_contiguous.dim() - 2, + padded_contiguous.sizes()[0]); } else { remove_padding_kernelLauncher( - padded.data_ptr(), + padded_contiguous.data_ptr(), output.data_ptr(), offsets_ptr, input_size_ptr, output_size_ptr, - padded.dim() - 1, - padded.sizes()[0]); + padded_contiguous.dim() - 1, + padded_contiguous.sizes()[0]); } } else { AT_ERROR("Only support fp32/fp16 for padded input"); @@ -143,8 +152,8 @@ Tensor NestedTensor_to_padded_tensor_cuda( if (t_dim == 3 && nt_input->opt_size(2) && (*nt_input->opt_size(2) > 0) && !(output_size.has_value())) { Tensor nt_sizes = nt_input->get_nested_size_tensor(); - Tensor sizes_dim1 = at::native::narrow(nt_sizes, 1, 0, 1); - Tensor sizes_dim2 = at::native::narrow(nt_sizes, 1, 1, 1); + Tensor sizes_dim1 = at::native::narrow_symint(nt_sizes, 1, 0, 1); + Tensor sizes_dim2 = at::native::narrow_symint(nt_sizes, 1, 1, 1); Tensor result = at::detail::make_tensor( nt_input->get_buffer(), sizes_dim1 * sizes_dim2[0]); TORCH_INTERNAL_ASSERT_DEBUG_ONLY(result.dim() == 2); @@ -204,5 +213,342 @@ Tensor NestedTensor_to_padded_tensor_cuda( } return NestedTensor_to_padded_tensor_generic(t, padding, output_size); } + +namespace{ + +/** + * This function is used to calculate two pieces of metadata that are needed + * for use with flash-attention and efficient_attention kernels. They are the + * cumulative sequence_length over a batch of sequences and the maximum sequence + * length. 
+ * + * @return A tuple of the cumulative sequence lengths, the maximum sequence length, + * and the sum of all sequence lengths (the last element of the cumulative sequence lengths) + */ +std::tuple cumulative_and_max_seq_len(Tensor qkv) { + TORCH_CHECK( + qkv.is_nested(), + "QKV must be nested for flash cumulative_seq_len calculation.") + auto* nt_impl = get_nested_tensor_impl(qkv); + const auto& sizes = nt_impl->get_nested_size_tensor(); + auto size_tensor_stride = sizes.stride(0); + + const int64_t batch_size = qkv.size(0); + auto cumulative_seqlen = at::zeros( + {batch_size + 1}, TensorOptions().device(at::kCPU).dtype(at::kInt)); + + auto* sizes_ptr = sizes.data_ptr(); + auto* cumulative_seqlen_ptr = cumulative_seqlen.data_ptr(); + + int32_t sum = 0; + int64_t max_seqlen = -1; + cumulative_seqlen_ptr[0] = sum; + for (const auto i : c10::irange(batch_size)) { + // Calculate the cumulative sum of the sequence lengths + auto current_seq_len = sizes_ptr[(i * size_tensor_stride)]; + sum += current_seq_len; + cumulative_seqlen_ptr[i + 1] = sum; + + // Find the max element while we traverse + max_seqlen = std::max(max_seqlen, current_seq_len); + } + // Send to GPU; this is a pretty lightweight calculation for normal batch sizes, + // but maybe it needs to be done on the GPU + cumulative_seqlen = cumulative_seqlen.to(TensorOptions().device(at::kCUDA)); + return std::tuple{cumulative_seqlen, max_seqlen, sum}; +} + +/** + * This function checks if a nested tensor is valid for + * use with the flash-attention and efficient_attention kernels without + * needing to call contiguous on the nested tensor input. + * It checks that the storage offsets' adjacent differences are a constant multiple + * of the numel of the previous tensor in the nested tensor and that the strides are monotonically decreasing. + * This check is done after calling transpose on the nested tensor. 
+ * + * @return A boolean indicating whether contiguous needs to be called on the input + */ +bool is_safe_to_get_storage_as_tensor(const NestedTensorImpl* tensor) { + const auto& tensor_offsets = tensor->get_storage_offsets(); + const Tensor& tensor_sizes = tensor->get_nested_size_tensor(); + const Tensor& tensor_strides = tensor->get_nested_stride_tensor(); + + const int64_t n_tensors = tensor_strides.size(0); + const int64_t n_dims = tensor_strides.size(1); + + if (n_tensors <= 1) { + return true; + } + + int64_t* previous_tensor_stride = tensor_strides.data_ptr(); + // Check initially that they are in strictly descending order + for (int i{1}; i < n_dims; i++) { + if (previous_tensor_stride[i - 1] <= previous_tensor_stride[i]) { + return false; + } + } + // Check that each tensor i in the nested tensor has the same strides + auto tensor_stride_0 = tensor_strides.stride(0); + + for (int i{1}; i < n_tensors; i++) { + for (const int64_t j : c10::irange(n_dims)) { + if (previous_tensor_stride[j] != + previous_tensor_stride[i * tensor_stride_0 + j]) { + return false; + } + } + } + // Check that the offset deltas are a constant multiple of the previous numels + const int64_t* tensor_size_ptr = tensor_sizes.data_ptr(); + const int64_t* tensor_stride_ptr = tensor_strides.data_ptr(); + + int64_t numel_0 = (tensor_size_ptr[0] * tensor_stride_ptr[0]); + TORCH_INTERNAL_ASSERT(numel_0 > 0, "numels must be positive!"); + + int64_t offset_constant = (tensor_offsets[1] - tensor_offsets[0]) / numel_0; + for (int64_t i = 2; i < n_tensors; i++) { + // TODO: When 0 seq_len nested tensors are allowed we need to guard against this + int64_t previous_numel = tensor_size_ptr[(i - 1) * tensor_stride_0] * tensor_stride_ptr[(i - 1) * tensor_stride_0]; + TORCH_INTERNAL_ASSERT(previous_numel > 0, "numels must be positive!"); + int64_t current_offset_constant = (tensor_offsets[i] - tensor_offsets[i - 1]) / previous_numel; + if (current_offset_constant != offset_constant) { + return false; + } + } + // Congrats you made it! 
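// Worked example of the offset check above (hypothetical layout): three components
// whose size[0] * stride[0] is 6 each, stored at offsets {0, 6, 12}, give
// offset_constant = (6 - 0) / 6 = 1 and (12 - 6) / 6 = 1, so the buffer can be
// viewed directly with as_strided. Offsets {0, 6, 18} would give
// (18 - 6) / 6 = 2 != 1, and the caller falls back to contiguous().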
+ return true; +} + +} // namespace + +std::tuple _scaled_dot_product_flash_attention_nestedtensor_cuda( + const Tensor& query, + const Tensor& key, + const Tensor& value, + double dropout_p, + bool return_softmax, + bool is_causal) { + TORCH_CHECK(false, "There are currently cuda memory errors being returned from this path.") + // Query (Batch x Num_heads x {Q_seq_len} x Dim_per_head) + // Key (Batch x Num_heads x {KV_seq_len} x Dim_per_head) + // Value (Batch x Num_heads x {KV_seq_len} x Dim_per_head) + const int64_t num_heads = query.size(1); + const int64_t head_dim = query.size(3); + + // Query -> Query (Batch x {Q_seq_len} x Num_heads x Dim_per_head) + // Key -> Key (Batch x {KV_seq_len} x Num_heads x Dim_per_head) + // Value -> Value (Batch x {KV_seq_len} x Num_heads x Dim_per_head) + Tensor q_t = query.transpose(1, 2).contiguous(); + Tensor k_t = key.transpose(1, 2).contiguous(); + Tensor v_t = value.transpose(1, 2).contiguous(); + + // K and V have to have the same Nnz, should probably torch_check + // assume in order to not iterate over v + + auto cumulative_and_max_q = cumulative_and_max_seq_len(q_t); + auto cumulative_and_max_k = cumulative_and_max_seq_len(k_t); + + Tensor cumulative_sequence_length_q = std::get<0>(cumulative_and_max_q); + Tensor cumulative_sequence_length_k = std::get<0>(cumulative_and_max_k); + + const int64_t max_seqlen_batch_q = std::get<1>(cumulative_and_max_q); + const int64_t max_seqlen_batch_k = std::get<1>(cumulative_and_max_k); + + const int64_t Nnz_q = cumulative_sequence_length_q[-1].item(); + const int64_t Nnz_kv = cumulative_sequence_length_k[-1].item(); + + auto query_buffer_reshaped = + get_buffer(q_t).view({Nnz_q, num_heads, head_dim}); + auto key_buffer_reshaped = + get_buffer(k_t).view({Nnz_kv, num_heads, head_dim}); + auto value_buffer_reshaped = + get_buffer(v_t).view({Nnz_kv, num_heads, head_dim}); + + auto attention_and_lse_and_softmax = + at::_flash_attention_forward( + query_buffer_reshaped, + key_buffer_reshaped, + value_buffer_reshaped, + cumulative_sequence_length_q, + cumulative_sequence_length_k, + max_seqlen_batch_q, + max_seqlen_batch_k, + return_softmax, + dropout_p, + is_causal); + // Reshape output to convert nnz to batch_size and seq_len + Tensor attention = std::get<0>(attention_and_lse_and_softmax); + attention = wrap_buffer(attention.view(-1), get_nested_size_tensor(q_t).clone()).transpose(1,2); + return std::tie(attention, std::get<1>(attention_and_lse_and_softmax), std::get<2>(attention_and_lse_and_softmax)); +} + +std::tuple _scaled_dot_product_efficient_attention_nestedtensor_cuda( + const Tensor& query, + const Tensor& key, + const Tensor& value, + bool compute_log_sumexp, + bool is_causal) { + // Query (Batch x Num_heads x {Q_seq_len} x Dim_per_head) + // Key (Batch x Num_heads x {KV_seq_len} x Dim_per_head) + // Value (Batch x Num_heads x {KV_seq_len} x Dim_per_head) + const int64_t num_heads = query.size(1); + const int64_t head_dim = query.size(3); + + Tensor q_t = query.transpose(1, 2); + Tensor k_t = key.transpose(1, 2); + Tensor v_t = value.transpose(1, 2); + + auto cumulative_and_max_q_and_nnz_q = cumulative_and_max_seq_len(q_t); + auto cumulative_and_max_k_and_nnz_k = cumulative_and_max_seq_len(k_t); + + // K and V have to have the same Nnz, should probably torch_check + // assume in order to not iterate over v + + Tensor cumulative_sequence_length_q = std::get<0>(cumulative_and_max_q_and_nnz_q); + Tensor cumulative_sequence_length_k = std::get<0>(cumulative_and_max_k_and_nnz_k); + + const int64_t 
max_seqlen_batch_q = std::get<1>(cumulative_and_max_q_and_nnz_q); + + const int64_t Nnz_q = std::get<2>(cumulative_and_max_q_and_nnz_q); + const int64_t Nnz_kv = std::get<2>(cumulative_and_max_k_and_nnz_k); + + Tensor query_buffer_reshaped; + Tensor key_buffer_reshaped; + Tensor value_buffer_reshaped; + + const auto* query_impl = get_nested_tensor_impl(q_t); + const auto* key_impl = get_nested_tensor_impl(k_t); + const auto* value_impl = get_nested_tensor_impl(v_t); + + // If the physical layout of the NestedTensor's storage + // is not: batch, {seq_len}, num_heads, head_dim then we need + // to call contiguous + if (!q_t.is_contiguous() && !is_safe_to_get_storage_as_tensor(query_impl)) { + q_t = q_t.contiguous(); + query_impl = get_nested_tensor_impl(q_t); + } + if (!k_t.is_contiguous() && !is_safe_to_get_storage_as_tensor(key_impl)) { + k_t = k_t.contiguous(); + key_impl = get_nested_tensor_impl(k_t); + } + if (!v_t.is_contiguous() && !is_safe_to_get_storage_as_tensor(value_impl)) { + v_t = v_t.contiguous(); + value_impl = get_nested_tensor_impl(v_t); + } + + Tensor q_storage_as_tensor = + get_nested_tensor_impl(q_t)->get_unsafe_storage_as_tensor(); + Tensor k_storage_as_tensor = + get_nested_tensor_impl(k_t)->get_unsafe_storage_as_tensor(); + Tensor v_storage_as_tensor = + get_nested_tensor_impl(v_t)->get_unsafe_storage_as_tensor(); + + auto query_stride_tensor = query_impl->get_nested_stride_tensor(); + auto key_stride_tensor = key_impl->get_nested_stride_tensor(); + auto value_stride_tensor = value_impl->get_nested_stride_tensor(); + + const int64_t head_dim_stride = 1; + + const int64_t* q_strides = query_stride_tensor.data_ptr(); + const int64_t nnz_q_stride = q_strides[0]; + const int64_t head_q_stride = q_strides[1]; + + const int64_t* k_strides = key_stride_tensor.data_ptr(); + const int64_t nnz_k_stride = k_strides[0]; + const int64_t head_k_stride = k_strides[1]; + + const int64_t* v_strides = value_stride_tensor.data_ptr(); + const int64_t nnz_v_stride = v_strides[0]; + const int64_t head_v_stride = v_strides[1]; + + query_buffer_reshaped = q_storage_as_tensor.as_strided( + {Nnz_q, num_heads, head_dim}, + {nnz_q_stride, head_q_stride, head_dim_stride}, + query_impl->get_storage_offsets()[0]); + key_buffer_reshaped = k_storage_as_tensor.as_strided( + {Nnz_kv, num_heads, head_dim}, + {nnz_k_stride, head_k_stride, head_dim_stride}, + key_impl->get_storage_offsets()[0]); + value_buffer_reshaped = v_storage_as_tensor.as_strided( + {Nnz_kv, num_heads, head_dim}, + {nnz_v_stride, head_v_stride, head_dim_stride}, + value_impl->get_storage_offsets()[0]); + std::tuple attention_and_logsumexp= + at::_efficient_attention_forward( + query_buffer_reshaped.unsqueeze(0), + key_buffer_reshaped.unsqueeze(0), + value_buffer_reshaped.unsqueeze(0), + cumulative_sequence_length_q, + cumulative_sequence_length_k, + max_seqlen_batch_q, + compute_log_sumexp, + is_causal); + // Reshape output to convert nnz to batch_size and seq_len + Tensor attention = std::get<0>(attention_and_logsumexp); + attention = + wrap_buffer(attention.view(-1), get_nested_size_tensor(q_t).clone()) + .transpose(1, 2); + return std::tie(attention, std::get<1>(attention_and_logsumexp)); +} + +Tensor flash_attention_helper( + const Tensor& query, + const Tensor& key, + const Tensor& value, + double dropout_p, + bool need_atten_weights, + bool is_causal) { + // Query is of size (batch_size x ragged_seq_len x (3 or 1) x n_heads x + // head_did + int64_t head_dim{query.size(-1)}; + int64_t num_heads{query.size(-2)}; + + auto 
cumulative_and_max_q_and_nnz_q = cumulative_and_max_seq_len(query); + Tensor cumulative_sequence_length_q = std::get<0>(cumulative_and_max_q_and_nnz_q); + int64_t max_seqlen_batch_q = std::get<1>(cumulative_and_max_q_and_nnz_q); + + TORCH_CHECK( + query.is_same(key) && query.is_same(value), + "Key and Value must be the same tensor"); + + int64_t Nnz_q = std::get<2>(cumulative_and_max_q_and_nnz_q); + + // For the packed case we need to set the output size for dim 2 to 1 + auto atten_size = get_nested_size_tensor(query).clone(); + atten_size.index({at::indexing::Slice(), 1}) = 1; + + auto qkv_buffer_reshaped = get_buffer(query) + .view({Nnz_q, 3, num_heads, head_dim}) + .transpose(0, 1) + .contiguous(); + + auto q = qkv_buffer_reshaped[0]; + auto k = qkv_buffer_reshaped[1]; + auto v = qkv_buffer_reshaped[2]; + + TORCH_CHECK(q.is_contiguous()); + TORCH_CHECK(k.is_contiguous()); + TORCH_CHECK(v.is_contiguous()); + + // If we are passing in query, key, value all the same tensors then we have + // packed them into one tensor and need to slice for flash attention + Tensor attention = + std::get<0>(at::_flash_attention_forward( + q, + k, + v, + cumulative_sequence_length_q, + cumulative_sequence_length_q, + max_seqlen_batch_q, + max_seqlen_batch_q, + false /*return_softmax*/, + dropout_p, + is_causal)); + // Output of flash_attention is a regular tensor; let's wrap it back up to + // form a nested tensor + + return wrap_buffer(attention.view(-1), atten_size); +} + } // namespace native } // namespace at diff --git a/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu b/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu index e8eb164bf4e7..56cac2a89803 100644 --- a/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu +++ b/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu @@ -15,6 +15,17 @@ #include #include +#include + +#ifndef USE_ROCM +#ifndef _WIN32 +#include +#include +#include +#endif +#endif + +#include #define BLOCK_DIM 256 #define GRID_DIM_Y 16 @@ -146,7 +157,7 @@ void remove_padding_kernelLauncher( dim3 grid; grid.x = batch_size; grid.y = GRID_DIM_Y; - at::cuda::CUDAStream stream = at::cuda::getDefaultCUDAStream(); + at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream(); if (output_dim == 2) { remove_padding_2<<>>( input, @@ -180,7 +191,7 @@ void remove_padding_transform0213_kernelLauncher( dim3 grid; grid.x = batch_size; grid.y = GRID_DIM_Y; - at::cuda::CUDAStream stream = at::cuda::getDefaultCUDAStream(); + at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream(); TORCH_CHECK( output_dim == 2, "remove padding transform0213 only support output dim == 2"); @@ -338,7 +349,8 @@ __global__ void add_padding_3( const int i0 = i / (output_sizes_2 * output_sizes_3); const int i1 = (i % (output_sizes_2 * output_sizes_3)) / output_sizes_3; const int i2 = i % output_sizes_3; - if (batch_id < batch_size && i0 < sizes_i[0] && i1 < sizes_i[1] && i2 < sizes_i[2]) { + if (batch_id < batch_size && i0 < sizes_i[0] && i1 < sizes_i[1] && + i2 < sizes_i[2]) { const int offset = offsets[batch_id]; const int input_offset = offset + i0 * (sizes_i[1] * sizes_i[2]) + i1 * sizes_i[2] + i2; @@ -352,7 +364,8 @@ const int i0 = i / (output_sizes_2 * output_sizes_3); const int i1 = (i % (output_sizes_2 * output_sizes_3)) / output_sizes_3; const int i2 = i % output_sizes_3; - if (batch_id < batch_size && i0 < sizes_i[0] && 
i1 < sizes_i[1] && + i2 < sizes_i[2]) { const int offset = offsets[batch_id]; const int input_offset = offset + i0 * (sizes_i[1] * sizes_i[2]) + i1 * sizes_i[2] + i2; @@ -374,7 +387,7 @@ void add_padding_kernelLauncher( const std::vector& output_sizes, const int batch_size, const int output_batch_size) { - at::cuda::CUDAStream stream = at::cuda::getDefaultCUDAStream(); + at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream(); dim3 grid; grid.x = output_batch_size; grid.y = GRID_DIM_Y; diff --git a/aten/src/ATen/native/prim_native_functions.cpp b/aten/src/ATen/native/prim_native_functions.cpp index 8f82345c1905..4e79c112d7fc 100644 --- a/aten/src/ATen/native/prim_native_functions.cpp +++ b/aten/src/ATen/native/prim_native_functions.cpp @@ -1,4 +1,11 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/AffineQuantizer.cpp b/aten/src/ATen/native/quantized/AffineQuantizer.cpp index e2fa8f65adc6..dbda6ebd5f90 100644 --- a/aten/src/ATen/native/quantized/AffineQuantizer.cpp +++ b/aten/src/ATen/native/quantized/AffineQuantizer.cpp @@ -97,6 +97,21 @@ void checkSameSize( " only works with Tensors with the same shape"); } +void checkPerChannelParamsSize( + const Tensor& rtensor, + int64_t axis, + const Tensor& scales, + const Tensor& zero_points +) { + int64_t channel = rtensor.size(axis); + TORCH_CHECK( + channel == int64_t(scales.numel()), + "length of scales must equal to channel, expected ", channel, " got, ", scales.numel()); + TORCH_CHECK( + channel == int64_t(zero_points.numel()), + "length of zero_points must equal to channel expected ", channel, " got, ", zero_points.numel()); +} + } // anonymous namespace Tensor& quantize_tensor_per_tensor_affine( @@ -156,13 +171,7 @@ Tensor& quantize_tensor_per_channel_affine( "Expected: [0, ", rtensor.dim(), ")"); - int64_t channel = rtensor.size(axis); - TORCH_CHECK( - channel == int64_t(scales.numel()), - "length of scales must equal to channel"); - TORCH_CHECK( - channel == int64_t(zero_points.numel()), - "length of zero_points must equal to channel"); + checkPerChannelParamsSize(rtensor, axis, scales, zero_points); quantize_tensor_per_channel_affine_stub( rtensor.device().type(), rtensor, qtensor, scales, zero_points, axis); @@ -195,13 +204,7 @@ Tensor& quantize_tensor_per_channel_float_qparams( "Expected: [0, ", rtensor.dim(), ")"); - int64_t channel = rtensor.size(axis); - TORCH_CHECK( - channel == int64_t(scales.numel()), - "length of scales must equal to channel"); - TORCH_CHECK( - channel == int64_t(zero_points.numel()), - "length of zero_points must equal to channel"); + checkPerChannelParamsSize(rtensor, axis, scales, zero_points); quantize_tensor_per_channel_float_qparams_stub( rtensor.device().type(), rtensor, qtensor, scales, zero_points, axis); @@ -260,13 +263,7 @@ Tensor& dequantize_tensor_per_channel_affine( " Expected: [0, ", qtensor.dim(), ")"); - int64_t channel = qtensor.size(axis); - TORCH_CHECK( - channel == int64_t(scales.numel()), - "length of scales must equal to channel"); - TORCH_CHECK( - channel == int64_t(zero_points.numel()), - "length of zero_points must equal to channel"); + checkPerChannelParamsSize(rtensor, axis, scales, zero_points); dequantize_tensor_per_channel_affine_stub( qtensor.device().type(), qtensor, rtensor, scales, zero_points, axis); @@ -297,13 +294,7 @@ Tensor& dequantize_tensor_per_channel_float_qparams( " Expected: [0, ", qtensor.dim(), 
")"); - int64_t channel = qtensor.size(axis); - TORCH_CHECK( - channel == int64_t(scales.numel()), - "length of scales must equal to channel"); - TORCH_CHECK( - channel == int64_t(zero_points.numel()), - "length of zero_points must equal to channel"); + checkPerChannelParamsSize(rtensor, axis, scales, zero_points); dequantize_tensor_per_channel_float_qparams_stub( qtensor.device().type(), qtensor, rtensor, scales, zero_points, axis); diff --git a/aten/src/ATen/native/quantized/AffineQuantizer.h b/aten/src/ATen/native/quantized/AffineQuantizer.h index cd39e3424066..1ff342a643c3 100644 --- a/aten/src/ATen/native/quantized/AffineQuantizer.h +++ b/aten/src/ATen/native/quantized/AffineQuantizer.h @@ -1,6 +1,7 @@ #pragma once -#include +#include +#include #include #include diff --git a/aten/src/ATen/native/quantized/AffineQuantizerBase.cpp b/aten/src/ATen/native/quantized/AffineQuantizerBase.cpp index e40f8ef1fdb0..5d02d9e04ed7 100644 --- a/aten/src/ATen/native/quantized/AffineQuantizerBase.cpp +++ b/aten/src/ATen/native/quantized/AffineQuantizerBase.cpp @@ -71,6 +71,33 @@ void quantize_vec( (float)scale, (int32_t)zero_point, precision}); } +#if defined(__ARM_NEON__) || defined(__aarch64__) +// For use when compiling FBGEMM on aarch64 but still supporting x86 +// intrinsics via simde +template +T quantize_val_arm( + const float scale, + const int32_t zero_point, + const float value) { + constexpr int32_t qmin = std::numeric_limits::min(); + constexpr int32_t qmax = std::numeric_limits::max(); + float inv_scale = 1.0f / scale; + auto r = zero_point + static_cast(std::nearbyint(value * inv_scale)); + r = std::max(r, qmin); + r = std::min(r, qmax); + return static_cast(r); +} + +template uint8_t quantize_val_arm( + const float scale, + const int32_t zero_point, + const float value); +template int8_t quantize_val_arm( + const float scale, + const int32_t zero_point, + const float value); +#endif + template inline float dequantize_val(double scale, int64_t zero_point, T value) { // NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init) diff --git a/aten/src/ATen/native/quantized/FakeQuantAffine.h b/aten/src/ATen/native/quantized/FakeQuantAffine.h index 3b1dbf608c13..1fb7cfbb0e72 100644 --- a/aten/src/ATen/native/quantized/FakeQuantAffine.h +++ b/aten/src/ATen/native/quantized/FakeQuantAffine.h @@ -1,6 +1,7 @@ #pragma once -#include +#include +#include #include namespace at { diff --git a/aten/src/ATen/native/quantized/FakeQuantPerTensorAffine.cpp b/aten/src/ATen/native/quantized/FakeQuantPerTensorAffine.cpp index 700b3b14b180..aac039f0e03e 100644 --- a/aten/src/ATen/native/quantized/FakeQuantPerTensorAffine.cpp +++ b/aten/src/ATen/native/quantized/FakeQuantPerTensorAffine.cpp @@ -122,10 +122,10 @@ Tensor fake_quantize_per_tensor_affine_cachemask_backward( const Tensor& dY, const Tensor& mask) { TORCH_CHECK(mask.scalar_type() == ScalarType::Bool); - TORCH_CHECK(mask.numel() == dY.numel(), + TORCH_CHECK(mask.sym_numel() == dY.sym_numel(), "`mask` and `dY` are not the same size: ", - "`mask` is size ", mask.numel(), " and `dY` is size ", dY.numel()); - if (dY.numel() <= 0) { + "`mask` is size ", mask.sym_numel(), " and `dY` is size ", dY.sym_numel()); + if (dY.sym_numel() <= 0) { return dY; } // Note: no additional kernels needed, since mask is pre-computed diff --git a/aten/src/ATen/native/quantized/IndexKernel.h b/aten/src/ATen/native/quantized/IndexKernel.h index 69f12472bea0..0e240b5a8e9a 100644 --- a/aten/src/ATen/native/quantized/IndexKernel.h +++ b/aten/src/ATen/native/quantized/IndexKernel.h @@ 
-5,9 +5,10 @@ namespace at { namespace native { using masked_fill_kernel_quantized_fn = void(*)(TensorIterator& iter, const Scalar& value, double scale, int zero_point); using index_put_kernel_quantized_fn = void(*)(TensorIterator& iter, IntArrayRef index_size, IntArrayRef index_stride, bool accumulate, double scale, int zero_point); + DECLARE_DISPATCH(masked_fill_kernel_quantized_fn, masked_fill_kernel_quantized_stub); DECLARE_DISPATCH(index_put_kernel_quantized_fn, index_put_kernel_quantized_stub); -// TODO: implement index_put_kernel_quantized_cuda in cuda/IndexKernel.cu and put CUDA kernel in a stub + } // native } // at diff --git a/aten/src/ATen/native/quantized/PackedParams.h b/aten/src/ATen/native/quantized/PackedParams.h index 64d8ec840c46..179fcce23dfe 100644 --- a/aten/src/ATen/native/quantized/PackedParams.h +++ b/aten/src/ATen/native/quantized/PackedParams.h @@ -1,6 +1,6 @@ #pragma once -#include +#include #include struct LinearPackedParamsBase : public torch::jit::CustomClassHolder { diff --git a/aten/src/ATen/native/quantized/QTensor.cpp b/aten/src/ATen/native/quantized/QTensor.cpp index b2737f578fbf..b3ff8bd8b327 100644 --- a/aten/src/ATen/native/quantized/QTensor.cpp +++ b/aten/src/ATen/native/quantized/QTensor.cpp @@ -330,6 +330,12 @@ std::tuple choose_qparams_optimized( const double ratio, int64_t bit_width) { + if (numel < 0 || numel > input_tensor.numel()) { + TORCH_CHECK(false, "numel is out of the bound of input tensor"); + } + + TORCH_CHECK(numel <= input_tensor.numel(), "numel ", numel, + " greater than input_tensor.numel() ", input_tensor.numel()); const float* input_row = input_tensor.data_ptr(); float xmin = *std::min_element(input_row, input_row + numel); float xmax = *std::max_element(input_row, input_row + numel); diff --git a/aten/src/ATen/native/quantized/README.md b/aten/src/ATen/native/quantized/README.md index 62c4a8a1f9e1..f042881a8ceb 100644 --- a/aten/src/ATen/native/quantized/README.md +++ b/aten/src/ATen/native/quantized/README.md @@ -171,7 +171,8 @@ def quantized_xand(qa, qb): return ops.quantized.xand(qa, qb) ``` -**Note:** If writing new pytorch functions that use quantized kernels, it is strongly encouraged to place them in the `torch/nn/quantized/functional.py`. +**Note:** If writing new pytorch functions that use quantized kernels, +it is strongly encouraged to place them in the `torch/ao/nn/quantized/functional.py`. 
### C++ diff --git a/aten/src/ATen/native/quantized/TensorAdvancedIndexing.cpp b/aten/src/ATen/native/quantized/TensorAdvancedIndexing.cpp index 904a8942eed9..4bfa5acaa263 100644 --- a/aten/src/ATen/native/quantized/TensorAdvancedIndexing.cpp +++ b/aten/src/ATen/native/quantized/TensorAdvancedIndexing.cpp @@ -5,11 +5,13 @@ #include #include #include +#include namespace at { namespace native { DEFINE_DISPATCH(masked_fill_kernel_quantized_stub); DEFINE_DISPATCH(index_put_kernel_quantized_stub); +DEFINE_DISPATCH(index_put_with_sort_quantized_stub); namespace { static TensorIterator make_index_put_iterator(const AdvancedIndex& info, const Tensor& value) { @@ -76,6 +78,51 @@ Tensor & masked_fill__quantized_cpu(Tensor& self, const Tensor & mask, const Ten return self; } +Tensor & masked_fill_impl_quantized_cuda(Tensor& self, const Tensor & mask, const Scalar& value) { + TORCH_CHECK(self.device() == mask.device(), "expected self and mask to be on the same device, but got mask on ", + mask.device(), " and self on ", self.device()); + TORCH_CHECK(mask.scalar_type() == kByte || mask.scalar_type() == kBool, + "expected mask dtype to be Bool but got ", mask.scalar_type()); + TORCH_CHECK(self.qscheme() == c10::kPerTensorAffine, "masked_fill__quantized_cpu for quantized tensors is currently only supported for per tensor quantized tensors"); + + auto maybe_outnames = namedinference::broadcast_to_outnames(self, mask, "masked_fill_"); + + if (at::has_internal_overlap(self) == MemOverlap::Yes) { + TORCH_WARN( + "Use of masked_fill_ on expanded tensors is deprecated. " + "Please clone() the tensor before performing this operation. " + "This also applies to advanced indexing e.g. tensor[mask] = scalar"); + } + at::assert_no_partial_overlap(self, mask); + + c10::MaybeOwned b_mask = expand_inplace(self, mask, "masked_fill_"); + + auto iter = TensorIteratorConfig() + .set_check_mem_overlap(false) + .check_all_same_dtype(false) + .resize_outputs(false) + .add_output(self) + .add_input(self) + .add_input(*b_mask) + .build(); + + masked_fill_kernel_quantized_stub(iter.device_type(), iter, value, self.q_scale(), self.q_zero_point()); + namedinference::propagate_names_if_nonempty(self, maybe_outnames); + return self; +} + +Tensor & masked_fill__quantized_cuda(Tensor& self, const Tensor & mask, const Scalar& value) { + TORCH_CHECK(!self.device().is_cpu(), "masked_fill_: Expected inputs to be on same device") + return masked_fill_impl_quantized_cuda(self, mask, value); +} + +Tensor & masked_fill__quantized_cuda(Tensor& self, const Tensor & mask, const Tensor & value) { + TORCH_CHECK(value.dim() == 0, "masked_fill_ only supports a 0-dimensional value tensor, but got tensor " + "with ", value.dim(), " dimension(s)."); + TORCH_CHECK(!self.device().is_cpu(), "masked_fill_: Expected inputs to be on same device") + return masked_fill_impl_quantized_cuda(self, mask, value.item()); +} + Tensor& _index_put_impl_quantized_cpu_(Tensor & self, const torch::List>& indices, const Tensor & value, const bool accumulate, const bool unsafe) { TORCH_CHECK_INDEX(indices.size() <= (size_t)self.dim(), "too many indices for tensor of dimension ", self.dim(), " (got ", indices.size(), ")"); TORCH_CHECK(!value.is_quantized(), "Value argument for quantized input_put should not be quantized"); @@ -112,5 +159,49 @@ Tensor& _index_put_impl_quantized_cpu_(Tensor & self, const torch::List>& indices, const Tensor & value, const bool accumulate, const bool unsafe) { + TORCH_CHECK_INDEX(indices.size() <= (size_t)self.dim(), "too many indices for tensor 
of dimension ", self.dim(), " (got ", indices.size(), ")"); + TORCH_CHECK(!value.is_quantized(), "Value argument for quantized input_put should not be quantized"); + TORCH_CHECK(self.qscheme() == c10::kPerTensorAffine, "index_put for quantized tensors is currently only supported for per tensor quantized tensors"); + TORCH_CHECK(!accumulate, "index_put for quantized tensors is currently only supported for accumulate=False"); + + if (at::has_internal_overlap(self) == MemOverlap::Yes) { + TORCH_WARN( + "Use of index_put_ on expanded tensors is deprecated. " + "Please clone() the tensor before performing this operation. " + "This also applies to advanced indexing e.g. tensor[indices] = tensor"); + } + + auto masked_fill_dispatch = canDispatchToMaskedFill(self, indices, value); + if (std::get<0>(masked_fill_dispatch)) { + return self.masked_fill_(std::get<1>(masked_fill_dispatch), value.item()); + } + + auto value_ = value; + if (value.device() != self.device() && value.numel() == 1 && value.dim() == 0) { + value_ = value.to(self.device()); + } + TORCH_CHECK(value.device() == self.device(), "expected device ", self.device(), " but got device ", value.device(), " for value tensor"); + + at::assert_no_overlap(self, value); + // NOLINTNEXTLINE(performance-implicit-conversion-in-loop) + for (const c10::optional& index: indices) { + if (index.has_value()) { + at::assert_no_overlap(self, *index); + } + } + + // See Note [Enabling Deterministic Operations] + if (self.device().type() == DeviceType::CUDA && globalContext().deterministicAlgorithms()) { + index_put_with_sort_quantized_stub(self.device().type(), self, indices, value_, self.q_scale(), self.q_zero_point(), unsafe); + return self; + } + + auto info = make_info(self, indices); + auto iter = make_index_put_iterator(info, value_); + index_put_kernel_quantized_stub(iter.device_type(), iter, info.indexed_sizes, info.indexed_strides, accumulate, self.q_scale(), self.q_zero_point()); + return self; +} + } } diff --git a/aten/src/ATen/native/quantized/TensorCompare.cpp b/aten/src/ATen/native/quantized/TensorCompare.cpp index 08a104257f4e..747f8bfe4d30 100644 --- a/aten/src/ATen/native/quantized/TensorCompare.cpp +++ b/aten/src/ATen/native/quantized/TensorCompare.cpp @@ -14,6 +14,19 @@ Tensor max_quantized_cpu(const Tensor& self) { return std::get<0>(self.reshape({-1}).max(/*dim=*/0)); } +Tensor& max_quantized_unary_out(const Tensor& self, Tensor& out) { + // TODO this implementation is inefficient for now. 
+ TORCH_CHECK(self.device() == out.device()); + + TORCH_CHECK(canCast( + typeMetaToScalarType(self.dtype()), + typeMetaToScalarType(out.dtype()))); + Tensor temp = max_quantized_cpu(self); + at::native::resize_output(out, temp.sizes()); + out.copy_(temp); + return out; +} + Tensor min_quantized_cpu(const Tensor& self) { return std::get<0>(self.reshape({-1}).min(/*dim=*/0)); } diff --git a/aten/src/ATen/native/quantized/TensorFactories.cpp b/aten/src/ATen/native/quantized/TensorFactories.cpp index 66c48f4ce752..aa0fef5df9dc 100644 --- a/aten/src/ATen/native/quantized/TensorFactories.cpp +++ b/aten/src/ATen/native/quantized/TensorFactories.cpp @@ -66,16 +66,6 @@ Tensor empty_per_channel_affine_quantized( quantizer); } -Tensor empty_symint_unknown_quantized( - c10::SymIntArrayRef size, - c10::optional dtype, - c10::optional layout, - c10::optional device, - c10::optional pin_memory, - c10::optional optional_memory_format) { - return at::native::empty_unknown_quantized(c10::asIntArrayRefSlow(size), dtype, layout, device, pin_memory, optional_memory_format); -} - Tensor empty_unknown_quantized( IntArrayRef size, c10::optional dtype, diff --git a/aten/src/ATen/native/quantized/cpu/AdaptiveAveragePooling.cpp b/aten/src/ATen/native/quantized/cpu/AdaptiveAveragePooling.cpp index 4d6e0f79db95..1317817902cf 100644 --- a/aten/src/ATen/native/quantized/cpu/AdaptiveAveragePooling.cpp +++ b/aten/src/ATen/native/quantized/cpu/AdaptiveAveragePooling.cpp @@ -1,8 +1,20 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + #include #include diff --git a/aten/src/ATen/native/quantized/cpu/AveragePool2d.cpp b/aten/src/ATen/native/quantized/cpu/AveragePool2d.cpp index 264707c25a8f..bb72a2010ca3 100644 --- a/aten/src/ATen/native/quantized/cpu/AveragePool2d.cpp +++ b/aten/src/ATen/native/quantized/cpu/AveragePool2d.cpp @@ -1,5 +1,7 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include #include #include @@ -7,6 +9,14 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + #include #include diff --git a/aten/src/ATen/native/quantized/cpu/AveragePool3d.cpp b/aten/src/ATen/native/quantized/cpu/AveragePool3d.cpp index 35580bfc50d8..93534b70c2c0 100644 --- a/aten/src/ATen/native/quantized/cpu/AveragePool3d.cpp +++ b/aten/src/ATen/native/quantized/cpu/AveragePool3d.cpp @@ -1,17 +1,21 @@ -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include #include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + #include -#include -#include -#include #include namespace at { diff --git a/aten/src/ATen/native/quantized/cpu/BinaryOps.cpp b/aten/src/ATen/native/quantized/cpu/BinaryOps.cpp index 1e3f1d3ddb0f..58a7036bdd7e 100644 --- a/aten/src/ATen/native/quantized/cpu/BinaryOps.cpp +++ b/aten/src/ATen/native/quantized/cpu/BinaryOps.cpp @@ -1,8 +1,9 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include +#include #include -#include -#include -#include #include #include #include @@ -10,6 +11,18 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + +#include + namespace at { namespace native { @@ -23,10 +36,10 @@ namespace { inline void 
check_inputs(const Tensor& qa, const Tensor& qb) { TORCH_CHECK( qa.qscheme() == kPerTensorAffine, - "Only per tensor quantization is suported in Add."); + "Only per tensor quantization is supported in Add."); TORCH_CHECK( qa.qscheme() == qb.qscheme(), - "Both inputs to Add must have the same quantization shceme."); + "Both inputs to Add must have the same quantization scheme."); TORCH_CHECK( qa.scalar_type() == qb.scalar_type(), "Add operands should have same data type."); diff --git a/aten/src/ATen/native/quantized/cpu/BinaryOps.h b/aten/src/ATen/native/quantized/cpu/BinaryOps.h index ada78c59f95c..cf86a13c139a 100644 --- a/aten/src/ATen/native/quantized/cpu/BinaryOps.h +++ b/aten/src/ATen/native/quantized/cpu/BinaryOps.h @@ -1,4 +1,4 @@ -#include +#include namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cpu/ChannelShuffle.cpp b/aten/src/ATen/native/quantized/cpu/ChannelShuffle.cpp index e0b455a7300b..bb42b4edbe7a 100644 --- a/aten/src/ATen/native/quantized/cpu/ChannelShuffle.cpp +++ b/aten/src/ATen/native/quantized/cpu/ChannelShuffle.cpp @@ -1,16 +1,16 @@ -#include -#include -#include -#include -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include -#include #include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cpu/EmbeddingPackedParams.h b/aten/src/ATen/native/quantized/cpu/EmbeddingPackedParams.h index 945c8edf7c75..140b716df269 100644 --- a/aten/src/ATen/native/quantized/cpu/EmbeddingPackedParams.h +++ b/aten/src/ATen/native/quantized/cpu/EmbeddingPackedParams.h @@ -1,6 +1,6 @@ #pragma once -#include +#include #include struct EmbeddingPackedParamsBase : public torch::jit::CustomClassHolder { diff --git a/aten/src/ATen/native/quantized/cpu/IntReprQuant.cpp b/aten/src/ATen/native/quantized/cpu/IntReprQuant.cpp index b3735ddb236d..9867a8f48a9e 100644 --- a/aten/src/ATen/native/quantized/cpu/IntReprQuant.cpp +++ b/aten/src/ATen/native/quantized/cpu/IntReprQuant.cpp @@ -1,10 +1,20 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cpu/LinearUnpackImpl.cpp b/aten/src/ATen/native/quantized/cpu/LinearUnpackImpl.cpp index 89465a8c5208..c9387eb0ebb1 100644 --- a/aten/src/ATen/native/quantized/cpu/LinearUnpackImpl.cpp +++ b/aten/src/ATen/native/quantized/cpu/LinearUnpackImpl.cpp @@ -1,4 +1,6 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include @@ -7,6 +9,16 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + int register_linear_params(); #ifdef USE_FBGEMM diff --git a/aten/src/ATen/native/quantized/cpu/MakePerTensorQuantizedTensor.cpp b/aten/src/ATen/native/quantized/cpu/MakePerTensorQuantizedTensor.cpp index a321de08b994..a0a4342c4e00 100644 --- a/aten/src/ATen/native/quantized/cpu/MakePerTensorQuantizedTensor.cpp +++ b/aten/src/ATen/native/quantized/cpu/MakePerTensorQuantizedTensor.cpp @@ -1,7 +1,14 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include +#include #include + +#ifndef AT_PER_OPERATOR_HEADERS #include +#else +#include +#endif namespace at { namespace native { diff --git 
a/aten/src/ATen/native/quantized/cpu/Normalization.cpp b/aten/src/ATen/native/quantized/cpu/Normalization.cpp index a5be594bdf39..2918c6530538 100644 --- a/aten/src/ATen/native/quantized/cpu/Normalization.cpp +++ b/aten/src/ATen/native/quantized/cpu/Normalization.cpp @@ -1,12 +1,20 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + #include -#include namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cpu/OnednnUtils.h b/aten/src/ATen/native/quantized/cpu/OnednnUtils.h index 6ad70356b3e0..533d83361f05 100644 --- a/aten/src/ATen/native/quantized/cpu/OnednnUtils.h +++ b/aten/src/ATen/native/quantized/cpu/OnednnUtils.h @@ -4,8 +4,168 @@ #if AT_MKLDNN_ENABLED() #include #include -#include -#include +#include +#include + +#include + +using PrimitiveCacheKey = std::tuple< + double, // input_scale + int64_t, // input_zero_point + std::vector, // input_shape + double, // output_scale + int64_t, // output_zero_point + int64_t>; // OMP_number_of_threads + +enum CacheKeyIndex { + InputScale, + InputZeroPoint, + InputShape, + OutputScale, + OutputZeroPoint, + NumOfThreads, +}; + +// Base class of primitive cache +struct PrimitiveCache { + PrimitiveCacheKey key; + + bool hit(const PrimitiveCacheKey& key) { + return this->key == key; + } +}; + +using LinearParams = ideep::matmul_forward_params; +using Conv = dnnl::convolution_forward; +using ConvDesc = dnnl::convolution_forward::primitive_desc; +using ConvParams = ideep::convolution_forward_params; +using Deconv = dnnl::deconvolution_forward; +using DeconvDesc = dnnl::deconvolution_forward::primitive_desc; +using DeconvParams = ideep::deconv_forward_params; + +struct LinearPrimitiveCache : PrimitiveCache { + LinearPrimitiveCache() {} + + LinearPrimitiveCache( + const PrimitiveCacheKey& key, + const LinearParams& param) { + this->key = key; + this->param = param; + } + + LinearPrimitiveCache( + const PrimitiveCacheKey& key, + const LinearParams& param, + const ideep::tensor& bias) { + this->key = key; + this->param = param; + if (!bias.is_empty()) { + expected_bias = + bias.reorder_if_differ_in(param.pd.bias_desc(), param.bias_attr); + } + } + + LinearParams param; + ideep::tensor expected_bias; + + // For dynamic qlinear, scale and zero point + // are set at execution time. So we only need to compare + // the rest part of key. 
+ bool hit_dynamic(const PrimitiveCacheKey& new_key) { + auto cached_input_shape = std::get(this->key); + auto new_input_shape = std::get(new_key); + return ( + cached_input_shape == new_input_shape && + std::get(this->key) == std::get(new_key)); + } + + LinearParams& get_param() { + return param; + } + + ideep::tensor& get_expected_bias() { + return expected_bias; + } +}; + +struct ConvPrimitiveCache : PrimitiveCache { + ConvPrimitiveCache() {} + + ConvPrimitiveCache(const PrimitiveCacheKey& key, + const ConvDesc& conv_desc, + const ideep::tensor& bias, + const ideep::attr_t bias_attr) { + this->key = key; + this->primitive_desc = conv_desc; + this->primitive = Conv(this->primitive_desc); + // Construct tensor of input zero point + ideep::tensor::desc input_zp_desc = {{1}, ideep::data_type::s32, {1}}; + this->input_zp_tensor.init(input_zp_desc, ideep::engine::cpu_engine()); + auto zp_data_ptr = reinterpret_cast(this->input_zp_tensor.get_data_handle()); + zp_data_ptr[0] = std::get(key); + // Construct expected bias + this->expected_bias = bias.reorder_if_differ_in(conv_desc.bias_desc(), bias_attr); + } + + ConvDesc primitive_desc; + Conv primitive; + ideep::tensor input_zp_tensor; + ideep::tensor expected_bias; + + inline ConvDesc& get_primitive_desc() { + return primitive_desc; + } + + inline Conv& get_primitive() { + return primitive; + } + + inline ideep::tensor& get_src_zp_tensor() { + return input_zp_tensor; + } + + inline ideep::tensor& get_bias() { + return expected_bias; + } +}; + +struct DeconvPrimitiveCache : PrimitiveCache { + DeconvPrimitiveCache() {} + + DeconvPrimitiveCache(const PrimitiveCacheKey& key, + const DeconvDesc& deconv_desc, + const ideep::tensor& bias, + const ideep::attr_t bias_attr, + const ideep::tensor& input_zero_point) { + this->key = key; + this->primitive_desc = deconv_desc; + this->primitive = Deconv(this->primitive_desc); + this->input_zp_tensor = std::move(input_zero_point); + // Construct expected bias + this->expected_bias = bias.reorder_if_differ_in(deconv_desc.bias_desc(), bias_attr); + } + + DeconvDesc primitive_desc; + Deconv primitive; + ideep::tensor input_zp_tensor; + ideep::tensor expected_bias; + + inline DeconvDesc& get_primitive_desc() { + return primitive_desc; + } + + inline Deconv& get_primitive() { + return primitive; + } + + inline ideep::tensor& get_src_zp_tensor() { + return input_zp_tensor; + } + + inline ideep::tensor& get_bias() { + return expected_bias; + } +}; struct PackedLinearWeightsOnednn : public LinearPackedParamsBase { PackedLinearWeightsOnednn( @@ -16,7 +176,9 @@ struct PackedLinearWeightsOnednn : public LinearPackedParamsBase { : weight_(std::move(weight)), bias_(std::move(bias)), orig_weight_(std::move(orig_weight)), - orig_bias_(std::move(orig_bias)) {} + orig_bias_(std::move(orig_bias)) { + cache_initialized_flag = std::make_unique(); + } std::unique_ptr weight_; c10::optional bias_; at::Tensor orig_weight_; @@ -45,6 +207,9 @@ struct PackedLinearWeightsOnednn : public LinearPackedParamsBase { c10::optional bias); private: + LinearPrimitiveCache prim_cache; + std::unique_ptr cache_initialized_flag; + template at::Tensor apply_impl( at::Tensor input, @@ -53,6 +218,10 @@ struct PackedLinearWeightsOnednn : public LinearPackedParamsBase { template at::Tensor apply_dynamic_impl(at::Tensor input, bool reduce_range=false); + + LinearPrimitiveCache& get_cache() { + return prim_cache; + } }; template @@ -68,16 +237,18 @@ struct PackedConvWeightsOnednn : public ConvPackedParamsBase { torch::List dilation, int64_t groups, 
uint8_t transpose) - : weight_(std::move(weight)), - bias_(std::move(bias)), - orig_weight_(std::move(orig_weight)), - orig_bias_(std::move(orig_bias)), - stride_(std::move(stride)), - padding_(std::move(padding)), - output_padding_(std::move(output_padding)), - dilation_(std::move(dilation)), - groups_(groups), - transpose_(transpose) {} + : weight_(std::move(weight)), + bias_(std::move(bias)), + orig_weight_(std::move(orig_weight)), + orig_bias_(std::move(orig_bias)), + stride_(std::move(stride)), + padding_(std::move(padding)), + output_padding_(std::move(output_padding)), + dilation_(std::move(dilation)), + groups_(groups), + transpose_(transpose) { + cache_initialized_flag = std::make_unique(); + } std::unique_ptr weight_; c10::optional bias_; @@ -141,11 +312,90 @@ struct PackedConvWeightsOnednn : public ConvPackedParamsBase { } private: + ConvPrimitiveCache conv_prim_cache; + DeconvPrimitiveCache deconv_prim_cache; + std::unique_ptr cache_initialized_flag; + template at::Tensor apply_impl( const at::Tensor& input, double output_scale, int64_t output_zero_point); + + ConvPrimitiveCache& get_conv_cache() { + assert(!transpose()); + return conv_prim_cache; + } + + DeconvPrimitiveCache& get_deconv_cache() { + assert(transpose()); + return deconv_prim_cache; + } }; +namespace onednn_utils { + +// Try to reorder tensor to expected desc at runtime +// Do it in a `try...catch...` manner to avoid oneDNN's errors +// TODO: Move it to third_party/ideep +static void try_reorder( + ideep::tensor& t, + const ideep::tensor::desc&& desc, + ideep::scale_t scales) { + if (t.get_desc() != desc) { + try { + t = t.reorder_if_differ_in(desc); + } catch (...) { + ideep::tensor&& plain = t.to_public(nullptr, t.get_data_type()); + t = plain.reorder_if_differ_in(desc); + } + t.set_scale(scales); + } +} + +// ONEDNN requires symmetric quantization of weight +// Use this util function to check. +static bool is_weight_symmetric_quant( + const at::Tensor& weight, + bool is_transposed_conv) { + bool is_symmetric = true; + const auto qtype = weight.qscheme(); + if (qtype == c10::kPerTensorAffine) { + is_symmetric &= (weight.q_zero_point() == 0); + } else if (qtype == c10::kPerChannelAffine) { + if (is_transposed_conv) { + // This case is currently not supported in PyTorch + // but we do not want to raise an error in this util function. + is_symmetric = false; + } else { + auto output_channels = weight.size(0); + for (int i = 0; i < output_channels; ++i) { + auto zp = weight.q_per_channel_zero_points()[i].item(); + is_symmetric &= (zp == 0); + } + } + } else { + // This case is currently not supported in PyTorch + // but we do not want to raise an error in this util function. 
+ is_symmetric = false; + } + return is_symmetric; +} + +// Check if onednn should be used w.r.t fbgemm +static bool should_use_onednn_quant( + const at::Tensor& weight, + bool is_transposed_conv, + int groups, + torch::List output_padding) { + bool vnni_available = cpuinfo_has_x86_avx512vnni(); + bool w_sym_quant = + is_weight_symmetric_quant(weight, is_transposed_conv); + bool opad_all_zero = + std::all_of(output_padding.begin(), output_padding.end(), [](int i) { return i==0; }); + return vnni_available && (groups <= 100) && w_sym_quant && opad_all_zero; +} + +} // onednn_utils + #endif // #if AT_MKLDNN_ENABLED() diff --git a/aten/src/ATen/native/quantized/cpu/Pooling.cpp b/aten/src/ATen/native/quantized/cpu/Pooling.cpp index 16ee1c566e3b..0153dd68d735 100644 --- a/aten/src/ATen/native/quantized/cpu/Pooling.cpp +++ b/aten/src/ATen/native/quantized/cpu/Pooling.cpp @@ -1,10 +1,10 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include #include #include -#include -#include #include #include #include @@ -12,6 +12,17 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + #include #include diff --git a/aten/src/ATen/native/quantized/cpu/QnnpackUtils.h b/aten/src/ATen/native/quantized/cpu/QnnpackUtils.h index 799d159114c7..9c6c721657cb 100644 --- a/aten/src/ATen/native/quantized/cpu/QnnpackUtils.h +++ b/aten/src/ATen/native/quantized/cpu/QnnpackUtils.h @@ -1,7 +1,7 @@ #pragma once #ifdef USE_PYTORCH_QNNPACK -#include +#include #include #include #include @@ -9,6 +9,12 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + #include struct QnnpackOperatorDeleter { @@ -266,8 +272,9 @@ struct PackedConvWeightsQnnp : public ConvPackedParamsBase { void* zero_buffer = malloc(zero_size); if (zero_buffer == nullptr) { pytorch_qnnp_delete_operator(convolution); - pytorch_qnnp_log_error( - "failed to allocate %zu bytes for zero padding", zero_size); + TORCH_INTERNAL_ASSERT( + false, "failed to allocate %zu bytes for zero padding", + zero_size); } // Need to set to input zero point // memset(zero_buffer, input_zero_point, zero_size); diff --git a/aten/src/ATen/native/quantized/cpu/QuantUtils.h b/aten/src/ATen/native/quantized/cpu/QuantUtils.h index f53efab900be..85bcaa1a69fd 100644 --- a/aten/src/ATen/native/quantized/cpu/QuantUtils.h +++ b/aten/src/ATen/native/quantized/cpu/QuantUtils.h @@ -1,10 +1,21 @@ #pragma once -#include +#include +#include +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + namespace quant_utils { namespace { float RawUint16ToFp16(unsigned short value) { diff --git a/aten/src/ATen/native/quantized/cpu/QuantizedOps.h b/aten/src/ATen/native/quantized/cpu/QuantizedOps.h index 506f0e46e573..8cba2f8cdd94 100644 --- a/aten/src/ATen/native/quantized/cpu/QuantizedOps.h +++ b/aten/src/ATen/native/quantized/cpu/QuantizedOps.h @@ -1,7 +1,10 @@ -#include +#pragma once +#include +#include +#include +#include #include #include -#include namespace at { namespace native { @@ -143,7 +146,7 @@ using qupsample_bilinear2d_fn = void (*)( c10::optional scales_w); using qcat_nhwc_fn = Tensor (*)( - const c10::List& qxs, + const MaterializedITensorListRef& qxs, int64_t dim, double scale, int64_t zero_point); diff --git a/aten/src/ATen/native/quantized/cpu/ReduceOps.cpp b/aten/src/ATen/native/quantized/cpu/ReduceOps.cpp index e7f78b29bbf0..c2d18693b9ea 
100644 --- a/aten/src/ATen/native/quantized/cpu/ReduceOps.cpp +++ b/aten/src/ATen/native/quantized/cpu/ReduceOps.cpp @@ -1,11 +1,24 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include -#include -#include #include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include // for _empty_affine_q... +#include // for mean +#include // for mean_out_quanti... +#include // for quantize_per_te... +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cpu/Sorting.cpp b/aten/src/ATen/native/quantized/cpu/Sorting.cpp index 7419f6f7e617..9389261ac1e8 100644 --- a/aten/src/ATen/native/quantized/cpu/Sorting.cpp +++ b/aten/src/ATen/native/quantized/cpu/Sorting.cpp @@ -1,13 +1,17 @@ -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include -#include -#include -#include #include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cpu/TensorOperators.cpp b/aten/src/ATen/native/quantized/cpu/TensorOperators.cpp index 05a5a4521938..97799b3b8d42 100644 --- a/aten/src/ATen/native/quantized/cpu/TensorOperators.cpp +++ b/aten/src/ATen/native/quantized/cpu/TensorOperators.cpp @@ -1,11 +1,29 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include #include -#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cpu/TensorShape.cpp b/aten/src/ATen/native/quantized/cpu/TensorShape.cpp index 172ad041a610..b4b519020246 100644 --- a/aten/src/ATen/native/quantized/cpu/TensorShape.cpp +++ b/aten/src/ATen/native/quantized/cpu/TensorShape.cpp @@ -1,13 +1,27 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include +#include #include #include #include #include -#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + #include #include @@ -19,7 +33,7 @@ DEFINE_DISPATCH(qcat_relu_nhwc_stub); namespace { -bool is_cat_nhwc_fast_path(const c10::List& qxs, int dim) { +bool is_cat_nhwc_fast_path(const MaterializedITensorListRef& qxs, int64_t dim) { TORCH_CHECK(qxs.size() > 0); bool is_fast_path = dim == 1; // NOLINTNEXTLINE(performance-implicit-conversion-in-loop) @@ -35,21 +49,21 @@ bool is_valid_quantization_scheme(const Tensor& t) { return (qtype == kPerTensorAffine) || (qtype == kPerTensorSymmetric); } -bool all_inputs_sharing_qparams(TensorList qxs) { +bool all_inputs_sharing_qparams(const MaterializedITensorListRef& qxs) { bool is_valid = true; for (const auto i : c10::irange(1, qxs.size())) { - is_valid |= qxs[0].is_quantized(); - is_valid |= qxs[i].is_quantized() == qxs[0].is_quantized(); - is_valid |= qxs[i].qscheme() == qxs[0].qscheme(); - is_valid |= qxs[i].dtype() == qxs[0].dtype(); - if (qxs[0].qscheme() == kPerTensorAffine) { - is_valid |= qxs[i].q_scale() == qxs[0].q_scale(); - is_valid |= qxs[i].q_zero_point() == qxs[0].q_zero_point(); - } else if (qxs[0].qscheme() == kPerChannelAffine) { - is_valid |= qxs[i].q_per_channel_scales().equal(qxs[0].q_per_channel_scales()); - is_valid |= 
qxs[i].q_per_channel_zero_points().equal(qxs[0].q_per_channel_zero_points()); + is_valid |= qxs[0].get().is_quantized(); + is_valid |= qxs[i].get().is_quantized() == qxs[0].get().is_quantized(); + is_valid |= qxs[i].get().qscheme() == qxs[0].get().qscheme(); + is_valid |= qxs[i].get().dtype() == qxs[0].get().dtype(); + if (qxs[0].get().qscheme() == kPerTensorAffine) { + is_valid |= qxs[i].get().q_scale() == qxs[0].get().q_scale(); + is_valid |= qxs[i].get().q_zero_point() == qxs[0].get().q_zero_point(); + } else if (qxs[0].get().qscheme() == kPerChannelAffine) { + is_valid |= qxs[i].get().q_per_channel_scales().equal(qxs[0].get().q_per_channel_scales()); + is_valid |= qxs[i].get().q_per_channel_zero_points().equal(qxs[0].get().q_per_channel_zero_points()); } else { - TORCH_CHECK(false, "Unrecognized qscheme:", toString(qxs[0].qscheme())); + TORCH_CHECK(false, "Unrecognized qscheme:", toString(qxs[0].get().qscheme())); } } return is_valid; @@ -61,7 +75,7 @@ bool all_inputs_sharing_qparams(TensorList qxs) { */ template Tensor quantized_cat_impl( - const c10::List& qxs, + const MaterializedITensorListRef& qxs, int64_t dim, double scale, int64_t zero_point) { @@ -73,8 +87,8 @@ Tensor quantized_cat_impl( } } - const auto x_dtype = qxs.get(0).scalar_type(); - const auto x_qscheme = qxs.get(0).qscheme(); + const auto x_dtype = qxs[0].get().scalar_type(); + const auto x_qscheme = qxs[0].get().qscheme(); std::vector xs; xs.reserve(qxs.size()); // NOLINTNEXTLINE(performance-implicit-conversion-in-loop) @@ -99,6 +113,15 @@ Tensor quantized_cat_impl( return qy; } +template +Tensor quantized_cat_impl( + ITensorListRef qxs, + int64_t dim, + double scale, + int64_t zero_point) { + return quantized_cat_impl(qxs.materialize(), dim, scale, zero_point); +} + template Tensor qcat( const c10::List& qxs, @@ -134,28 +157,29 @@ TORCH_LIBRARY_IMPL(quantized, QuantizedCPU, m) { m.impl(TORCH_SELECTIVE_NAME("quantized::cat_relu_out"), TORCH_FN(qcat_out)); } -Tensor cat_quantized_cpu(TensorList qxs, int64_t dim) { - TORCH_CHECK(is_valid_quantization_scheme(qxs[0]), +Tensor cat_quantized_cpu(const ITensorListRef& qxs, int64_t dim) { + auto materialized = qxs.materialize(); + TORCH_CHECK(is_valid_quantization_scheme(materialized[0]), "Only per-tensor quantization is supported in 'cat'!"); TORCH_CHECK( - all_inputs_sharing_qparams(qxs), + all_inputs_sharing_qparams(materialized), "All inputs should share the same quantization parameters."); - check_cat_no_zero_dim(qxs); - dim = legacy_cat_wrap_dim(dim, qxs); - double _scale = qxs[0].q_scale(); - int64_t _zero_point = qxs[0].q_zero_point(); - return quantized_cat_impl(c10::List(qxs), dim, _scale, _zero_point); + check_cat_no_zero_dim(materialized); + dim = legacy_cat_wrap_dim(dim, materialized); + double _scale = materialized[0].get().q_scale(); + int64_t _zero_point = materialized[0].get().q_zero_point(); + return quantized_cat_impl(materialized, dim, _scale, _zero_point); } -Tensor& cat_out_quantized_cpu(TensorList qxs, int64_t dim, Tensor& out) { - TORCH_CHECK(is_valid_quantization_scheme(qxs[0]), +Tensor& cat_out_quantized_cpu(const ITensorListRef& qxs, int64_t dim, Tensor& out) { + auto materialized = qxs.materialize(); + TORCH_CHECK(is_valid_quantization_scheme(materialized[0]), "Only per-tensor quantization is supported in 'cat'!") TORCH_CHECK(is_valid_quantization_scheme(out), "Only per-tensor quantization is supported in 'cat'!") - check_cat_no_zero_dim(qxs); - dim = legacy_cat_wrap_dim(dim, qxs); - auto out_ = quantized_cat_impl(c10::List(qxs), dim, 
out.q_scale(), - out.q_zero_point()); + check_cat_no_zero_dim(materialized); + dim = legacy_cat_wrap_dim(dim, materialized); + auto out_ = quantized_cat_impl(qxs, dim, out.q_scale(), out.q_zero_point()); at::native::copy_(out, out_, /*non_blocking=*/false); return out; } diff --git a/aten/src/ATen/native/quantized/cpu/UpSampleBilinear2d.cpp b/aten/src/ATen/native/quantized/cpu/UpSampleBilinear2d.cpp index ff8800228435..ac0fb23eb4c3 100644 --- a/aten/src/ATen/native/quantized/cpu/UpSampleBilinear2d.cpp +++ b/aten/src/ATen/native/quantized/cpu/UpSampleBilinear2d.cpp @@ -1,4 +1,6 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include @@ -6,10 +8,15 @@ #include #include -#include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + #include -#include namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cpu/UpSampleNearest2d.cpp b/aten/src/ATen/native/quantized/cpu/UpSampleNearest2d.cpp index 9f8b065576df..abe6dfd22586 100644 --- a/aten/src/ATen/native/quantized/cpu/UpSampleNearest2d.cpp +++ b/aten/src/ATen/native/quantized/cpu/UpSampleNearest2d.cpp @@ -1,11 +1,19 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include -#include -#include -#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + #include #include diff --git a/aten/src/ATen/native/quantized/cpu/UpSampleNearest3d.cpp b/aten/src/ATen/native/quantized/cpu/UpSampleNearest3d.cpp index ba723d707ee9..4b4c63eb7c3d 100644 --- a/aten/src/ATen/native/quantized/cpu/UpSampleNearest3d.cpp +++ b/aten/src/ATen/native/quantized/cpu/UpSampleNearest3d.cpp @@ -1,8 +1,16 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include -#include -#include -#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif #include @@ -230,27 +238,5 @@ Tensor _upsample_nearest_exact3d_quantized_cpu( input, osize, scale_d, scale_h, scale_w); } -Tensor upsample_nearest3d_quantized_cpu( - const Tensor& input, - at::OptionalIntArrayRef output_size, - c10::optional> scale_factors) { - auto osize = compute_output_size(input.sizes(), output_size, scale_factors); - auto scale_d = get_scale_value(scale_factors, 0); - auto scale_h = get_scale_value(scale_factors, 1); - auto scale_w = get_scale_value(scale_factors, 2); - return upsample_nearest3d_quantized_cpu(input, osize, scale_d, scale_h, scale_w); -} - -Tensor _upsample_nearest_exact3d_quantized_cpu( - const Tensor& input, - at::OptionalIntArrayRef output_size, - c10::optional> scale_factors) { - auto osize = compute_output_size(input.sizes(), output_size, scale_factors); - auto scale_d = get_scale_value(scale_factors, 0); - auto scale_h = get_scale_value(scale_factors, 1); - auto scale_w = get_scale_value(scale_factors, 2); - return _upsample_nearest_exact3d_quantized_cpu(input, osize, scale_d, scale_h, scale_w); -} - } // namespace native } // namespace at diff --git a/aten/src/ATen/native/quantized/cpu/XnnpackUtils.h b/aten/src/ATen/native/quantized/cpu/XnnpackUtils.h index 78f325263f4f..12e4fbbf1e76 100644 --- a/aten/src/ATen/native/quantized/cpu/XnnpackUtils.h +++ b/aten/src/ATen/native/quantized/cpu/XnnpackUtils.h @@ -3,7 +3,7 @@ #ifdef USE_XNNPACK #include -#include +#include #include using xnnpack_operator = at::native::xnnpack::Operator; diff --git a/aten/src/ATen/native/quantized/cpu/conv_serialization.h 
b/aten/src/ATen/native/quantized/cpu/conv_serialization.h index 9e4edb8f9a88..e9d833c9fc22 100644 --- a/aten/src/ATen/native/quantized/cpu/conv_serialization.h +++ b/aten/src/ATen/native/quantized/cpu/conv_serialization.h @@ -1,11 +1,19 @@ #pragma once -#include +#include #include #include #include #include #include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + #include @@ -330,6 +338,37 @@ c10::intrusive_ptr> deserialize_conv( auto& ctx = at::globalContext(); +#ifdef USE_FBGEMM + if (ctx.qEngine() == at::QEngine::X86) { +#if AT_MKLDNN_ENABLED() + bool use_onednn = onednn_utils::should_use_onednn_quant( + weight.value(), transpose, groups, output_padding); + if (use_onednn) { + return PackedConvWeightsOnednn::prepack( + weight.value(), + bias, + stride, + padding, + output_padding, + dilation, + groups, + transpose + ); + } +#endif + return PackedConvWeight::prepack( + weight.value(), + bias, + stride, + padding, + output_padding, + dilation, + groups, + transpose + ); + } // x86 +#endif + #ifdef USE_FBGEMM if (ctx.qEngine() == at::QEngine::FBGEMM) { return PackedConvWeight::prepack( diff --git a/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp b/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp index 33d8bd88b858..8af21bbc7df8 100644 --- a/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp +++ b/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp @@ -1,4 +1,10 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include +#include +#include +#include #include #include #include @@ -14,6 +20,12 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + int register_linear_params(); int register_embedding_params(); @@ -445,7 +457,8 @@ int register_linear_params() { bias = std::move(std::get<1>(state)); #ifdef USE_FBGEMM - if (at::globalContext().qEngine() == at::QEngine::FBGEMM) { + if (at::globalContext().qEngine() == at::QEngine::FBGEMM || + at::globalContext().qEngine() == at::QEngine::X86) { if (weight.scalar_type() == at::kQInt8) { return PackedLinearWeight::prepack( std::move(weight), std::move(bias)); @@ -547,6 +560,7 @@ int register_embedding_params() { return PackedEmbeddingBagWeight::prepack(weight); }) .def("bit_rate", &EmbeddingPackedParamsBase::bit_rate) + .def("unpack", &EmbeddingPackedParamsBase::unpack) .def("version", &EmbeddingPackedParamsBase::version); return 0; diff --git a/aten/src/ATen/native/quantized/cpu/fused_obs_fake_quant.cpp b/aten/src/ATen/native/quantized/cpu/fused_obs_fake_quant.cpp index 5fd73c58ed33..77c60141b065 100644 --- a/aten/src/ATen/native/quantized/cpu/fused_obs_fake_quant.cpp +++ b/aten/src/ATen/native/quantized/cpu/fused_obs_fake_quant.cpp @@ -1,9 +1,24 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#include +#endif + #ifdef USE_FBGEMM #include #endif @@ -221,7 +236,7 @@ at::Tensor fused_moving_avg_obs_fake_quant( const int64_t ch_axis, bool per_row_fake_quant, bool symmetric_quant) { - if (self.numel() == 0) { + if (self.sym_numel() == 0) { return self.clone(); } const auto res = at::_fused_moving_avg_obs_fq_helper( diff --git a/aten/src/ATen/native/quantized/cpu/init_qnnpack.cpp b/aten/src/ATen/native/quantized/cpu/init_qnnpack.cpp index b4a524566605..82fb217e46fa 100644 --- a/aten/src/ATen/native/quantized/cpu/init_qnnpack.cpp +++ 
b/aten/src/ATen/native/quantized/cpu/init_qnnpack.cpp @@ -1,8 +1,7 @@ #ifdef USE_PYTORCH_QNNPACK #include -#include -#include +#include #include #include diff --git a/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp b/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp index d1293cc29f27..a1f8f0d7c245 100644 --- a/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp +++ b/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp @@ -1,4 +1,6 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include @@ -15,6 +17,13 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif + #include #ifdef USE_FBGEMM #include @@ -47,7 +56,7 @@ void check_tensor_memory_format(const Tensor& ref, const Tensor& other) { template Tensor qcat_nhwc_kernel( - const c10::List& qxs, + const MaterializedITensorListRef& qxs, int64_t dim, double scale, int64_t zero_point) { @@ -110,7 +119,7 @@ Tensor qcat_nhwc_kernel( c10::nullopt); // N, H, and W are explicitly captured here because there's a bug in GCC5 - // which causes an internal compiler error if they're not + // and clang5 which causes an internal compiler error if they're not AT_DISPATCH_QINT_TYPES(output.scalar_type(), "qcat_nhwc", [&, N, H, W]() { using Vec = Vectorized; at::parallel_for(0, N * H * W, 0, [&](int64_t begin, int64_t end) { @@ -2747,18 +2756,26 @@ void quantized_normalize_kernel( dq = (dq - layer_mean_div_scale_xVec) * gamma_p_vec + beta_vec; - qVec::quantize(dqXVec, y_scale, y_zp, y_inv_scale) - .store(Y_ptr + vecStartIdx); } + qVec::quantize(dqXVec, y_scale, y_zp, y_inv_scale) + .store(Y_ptr + vecStartIdx); } - for (int64_t remIdx = chEndIdx - kNonVecRemInChannel; - remIdx < chEndIdx; - remIdx++) { - auto qXVal = X_ptr[remIdx]; - float dqXVal = at::native::dequantize_val(x_fake_scale, x_zp, qXVal); - float dqY = - (dqXVal - layer_mean_div_scale_x) * gamma_p + beta; - Y_ptr[remIdx] = at::native::quantize_val(y_scale, y_zp, dqY); + + // Remainder + if (kNonVecRemInChannel > 0) { + int64_t remIdx = chEndIdx - kNonVecRemInChannel; + auto qXVec = qVec::loadu(X_ptr + remIdx, kNonVecRemInChannel); + auto dqXVec = qXVec.dequantize(x_fake_scale_vec, x_zp_vec, + x_fake_scale_zp_neg_premul_vec); + int validDqvecLen = (kNonVecRemInChannel - 1) / fVec::size() + 1; + for (int i = 0; i < validDqvecLen; ++i) { + auto &dq = dqXVec[i]; + dq = + (dq - layer_mean_div_scale_xVec) * + gamma_p_vec + beta_vec; + } + qVec::quantize(dqXVec, y_scale, y_zp, y_inv_scale) + .store(Y_ptr + remIdx, kNonVecRemInChannel); } } // chIdx @@ -3782,8 +3799,8 @@ void quantize_tensor_per_channel_impl( // channels_last contig. // If axis = 0 and channels_last contig, implementation for channels // first (NCHW) works. 
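The quantized_normalize_kernel hunk above replaces the scalar element-by-element remainder loop with a single counted (partial) vector load, rescale, and store over the tail channels. A minimal standalone sketch of that tail-handling idea, using a hypothetical fixed-width Vec helper in place of ATen's Vectorized types:

#include <algorithm>
#include <cstdint>

// Hypothetical fixed-width "vector" supporting counted loads/stores,
// standing in for at::vec::Vectorized in this sketch.
struct Vec {
  static constexpr int64_t size() { return 8; }
  float lane[8];
  static Vec loadu(const float* p, int64_t count = size()) {
    Vec v{};
    std::copy(p, p + count, v.lane);
    return v;
  }
  void store(float* p, int64_t count = size()) const {
    std::copy(lane, lane + count, p);
  }
};

// Scale a buffer: full vectors first, then one partial op for the tail,
// instead of falling back to a per-element scalar loop.
void scale_buffer(const float* x, float* y, int64_t n, float alpha) {
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    Vec v = Vec::loadu(x + i);
    for (auto& l : v.lane) l *= alpha;
    v.store(y + i);
  }
  const int64_t rem = n - i;          // remainder, 0 <= rem < Vec::size()
  if (rem > 0) {
    Vec v = Vec::loadu(x + i, rem);   // counted load touches only rem elements
    for (int64_t k = 0; k < rem; ++k) v.lane[k] *= alpha;
    v.store(y + i, rem);              // counted store writes only rem elements
  }
}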
- for (const auto b : c10::irange(batches)) { - for (const auto e : c10::irange(elements_per_channel)) { + for (const auto b C10_UNUSED : c10::irange(batches)) { + for (const auto e C10_UNUSED : c10::irange(elements_per_channel)) { uint32_t c = 0; while (c + 8 < channels) { const int16x8_t vzero_point = vld1q_s16(&zero_points_int16t[c]); @@ -3813,8 +3830,8 @@ void quantize_tensor_per_channel_impl( } } } else { - for (const auto b : c10::irange(batches)) { - for (const auto c : c10::irange(channels)) { + for (const auto b C10_UNUSED : c10::irange(batches)) { + for (const auto c C10_UNUSED : c10::irange(channels)) { uint32_t e = 0; const int16x8_t vzero_point = vdupq_n_s16(zero_points_int16t[c]); const float32x4_t vinv_scale = vdupq_n_f32(inv_scales[c]); diff --git a/aten/src/ATen/native/quantized/cpu/qclamp.cpp b/aten/src/ATen/native/quantized/cpu/qclamp.cpp index 21570fd436ea..10f8c4bd7d23 100644 --- a/aten/src/ATen/native/quantized/cpu/qclamp.cpp +++ b/aten/src/ATen/native/quantized/cpu/qclamp.cpp @@ -1,15 +1,24 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include -#include -#include +#include #include #include #include -#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + #include namespace at { diff --git a/aten/src/ATen/native/quantized/cpu/qconv.cpp b/aten/src/ATen/native/quantized/cpu/qconv.cpp index 873d983a4820..b6fa57b9e3ed 100644 --- a/aten/src/ATen/native/quantized/cpu/qconv.cpp +++ b/aten/src/ATen/native/quantized/cpu/qconv.cpp @@ -1,9 +1,13 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include #include #include -#include +#include +#include +#include #include +#include #include #include #include @@ -15,6 +19,19 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#endif + #include namespace { @@ -113,7 +130,7 @@ at::SmallVector MakeDeConvOutputShape( ", output padding: ", output_padding[idx], ", dilation: ", dilation[idx]) TORCH_CHECK(output_shape[idx + 2] < kReasonableMaxDim, - "Output dimension is beyound reasonable maximum for ", idx, + "Output dimension is beyond reasonable maximum for ", idx, " axis;" " kernel: ", kernel[idx], ", stride: ", stride[idx], @@ -1227,49 +1244,98 @@ at::Tensor PackedConvWeightsOnednn::apply_impl( const ideep::dims& dilates = dilation().vec(); const ideep::dims& padding_l = padding().vec(); const ideep::dims& padding_r = padding().vec(); - const ideep::scale_t& src_scales = ideep::scale_t(1, 1.0/act.q_scale()); // Scales of ONEDNN and PyTorch are reciprocal + double input_scale = act.q_scale(); + int64_t input_zp = act.q_zero_point(); + // Scales of ONEDNN and PyTorch are reciprocal + const ideep::scale_t& src_scales = ideep::scale_t(1, 1.0/input_scale); const ideep::scale_t& weights_scales = weights.get_scale(); - const ideep::scale_t& dst_scales = ideep::scale_t(weights_scales.size(), 1.0/output_scale); // Scales of ONEDNN and PyTorch are reciprocal - const ideep::zero_point_t src_zero_points = ideep::zero_point_t(1, act.q_zero_point()); + int64_t scale_size = weights_scales.size(); + double inv_output_scale = 1.0/output_scale; + const ideep::zero_point_t src_zero_points = ideep::zero_point_t(1, input_zp); const ideep::zero_point_t dst_zero_points = ideep::zero_point_t(1, output_zero_point); ideep::attr_t op_attr = kReluFused ? 
ideep::attr_t::fuse_relu() : ideep::attr_t(); - op_attr.set_zero_points(DNNL_ARG_SRC, ideep::utils::tensor_zp_mask(1), {DNNL_RUNTIME_S32_VAL}); // runtime src zero point - if (with_bias) { - // Bias might be modified outside (e.g. by quantization bias correction). - // If so, update the prepacked bias as well. - if (bias_.value().get_data_handle() != orig_bias_.value().data_ptr()) { - bias_.value().init(bias_.value().get_desc(), orig_bias_.value().data_ptr()); - } - const auto& b = bias_.value(); - if (transpose()) { + // Since src zero point is unknown, set runtime value here + op_attr.set_zero_points(DNNL_ARG_SRC, ideep::utils::tensor_zp_mask(1), {DNNL_RUNTIME_S32_VAL}); + + // Bias might be modified outside (e.g. by quantization bias correction). + // If so, update the prepacked bias as well. + if (with_bias && bias_.value().get_data_handle() != orig_bias_.value().data_ptr()) { + bias_.value().init(bias_.value().get_desc(), orig_bias_.value().data_ptr()); + } + const auto& b = with_bias ? bias_.value() : ideep::tensor(); + int num_threads = at::get_num_threads(); + if (transpose()) { + // Primitive cache is initialized when called for the first time + // and won't be updated afterwards. + PrimitiveCacheKey cache_key = std::make_tuple( + input_scale, input_zp, src_dims, output_scale, output_zero_point, num_threads); + c10::call_once(*cache_initialized_flag, [&](){ + DeconvParams params; + ideep::convolution_transpose_forward::prepare( + params, src, weights, b, dst_dims, dst, + strides, padding_l, padding_r, dilates, groups(), + src_scales, weights_scales, ideep::scale_t(scale_size, inv_output_scale), + src_zero_points, dst_zero_points, op_attr, + dnnl::algorithm::deconvolution_direct, + dnnl::prop_kind::forward_inference, + ideep::u8s8, ideep::engine::cpu_engine()); + get_deconv_cache() = DeconvPrimitiveCache( + cache_key, params.pd, b, params.bias_attr, params.input_zero_point); + onednn_utils::try_reorder( + weights, (ideep::tensor::desc)params.pd.weights_desc(), weights_scales); + }); + if (get_deconv_cache().hit(cache_key)) { + Deconv& primitive = get_deconv_cache().get_primitive(); + DeconvDesc& pd = get_deconv_cache().get_primitive_desc(); + auto& src_zp_tensor = get_deconv_cache().get_src_zp_tensor(); + auto& expected_bias = get_deconv_cache().get_bias(); + ideep::convolution_transpose_forward::compute( + pd, primitive, src, weights, expected_bias, dst, src_zp_tensor, groups()); + } else { ideep::convolution_transpose_forward::compute_v2( src, weights, b, dst_dims, dst, strides, padding_l, padding_r, dilates, - groups(), src_scales, weights_scales, dst_scales, src_zero_points, dst_zero_points, - op_attr, dnnl::algorithm::deconvolution_direct, dnnl::prop_kind::forward_inference, - ideep::u8s8, ideep::engine::cpu_engine()); - } else { - ideep::convolution_forward::compute_v2( - src, weights, b, dst_dims, dst, - strides, dilates, padding_l, padding_r, groups(), - src_scales, weights_scales, dst_scales, src_zero_points, dst_zero_points, - op_attr, dnnl::algorithm::convolution_direct, dnnl::prop_kind::forward_inference, + groups(), src_scales, weights_scales, + ideep::scale_t(scale_size, inv_output_scale), + src_zero_points, dst_zero_points, op_attr, + dnnl::algorithm::deconvolution_direct, + dnnl::prop_kind::forward_inference, ideep::u8s8, ideep::engine::cpu_engine()); } - } else { - if (transpose()) { - ideep::convolution_transpose_forward::compute_v2( - src, weights, dst_dims, dst, - strides, padding_l, padding_r, dilates, - groups(), src_scales, weights_scales, dst_scales, 
src_zero_points, dst_zero_points, - op_attr, dnnl::algorithm::deconvolution_direct, dnnl::prop_kind::forward_inference, - ideep::u8s8, ideep::engine::cpu_engine()); + } else { // not transposed + PrimitiveCacheKey cache_key = std::make_tuple( + input_scale, input_zp, src_dims, output_scale, output_zero_point, num_threads); + c10::call_once(*cache_initialized_flag, [&](){ + src.set_zero_point(src_zero_points); + dst.set_zero_point(dst_zero_points); + ConvParams params; + ideep::convolution_forward::prepare( + params, src, weights, b, dst_dims, dst, + strides, dilates, padding_l, padding_r, groups(), + src_scales, weights_scales, ideep::scale_t(scale_size, inv_output_scale), + op_attr, dnnl::algorithm::convolution_direct, + dnnl::prop_kind::forward_inference, + ideep::u8s8, ideep::engine::cpu_engine()); + get_conv_cache() = ConvPrimitiveCache(cache_key, params.pd, b, params.bias_attr); + onednn_utils::try_reorder( + weights, (ideep::tensor::desc)params.pd.weights_desc(), weights_scales); + }); + // If hit, use cached data. If miss, fall back to normal path. + if (get_conv_cache().hit(cache_key)) { + ConvDesc& pd = get_conv_cache().get_primitive_desc(); + Conv& primitive = get_conv_cache().get_primitive(); + auto& src_zp_tensor = get_conv_cache().get_src_zp_tensor(); + auto& expected_bias = get_conv_cache().get_bias(); + ideep::convolution_forward::compute( + pd, primitive, src, weights, expected_bias, dst, src_zp_tensor, groups()); } else { ideep::convolution_forward::compute_v2( - src, weights, dst_dims, dst, + src, weights, b, dst_dims, dst, strides, dilates, padding_l, padding_r, groups(), - src_scales, weights_scales, dst_scales, src_zero_points, dst_zero_points, - op_attr, dnnl::algorithm::convolution_direct, dnnl::prop_kind::forward_inference, + src_scales, weights_scales, ideep::scale_t(scale_size, inv_output_scale), + src_zero_points, dst_zero_points, op_attr, + dnnl::algorithm::convolution_direct, + dnnl::prop_kind::forward_inference, ideep::u8s8, ideep::engine::cpu_engine()); } } diff --git a/aten/src/ATen/native/quantized/cpu/qconv_dynamic.cpp b/aten/src/ATen/native/quantized/cpu/qconv_dynamic.cpp index f4783484aaf8..26a2855a0fbb 100644 --- a/aten/src/ATen/native/quantized/cpu/qconv_dynamic.cpp +++ b/aten/src/ATen/native/quantized/cpu/qconv_dynamic.cpp @@ -1,8 +1,8 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include -#include -#include +#include +#include #include #include #include @@ -11,9 +11,15 @@ #include #include #include -#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include // for dequantize +#include +#endif + #ifdef USE_FBGEMM template diff --git a/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp b/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp index fd31c2e70883..9d2f1a96c31b 100644 --- a/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp +++ b/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp @@ -1,16 +1,24 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include +#include +#include +#include #include #include #include #include #include #include -#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + #include #ifdef USE_FBGEMM @@ -437,7 +445,7 @@ c10::intrusive_ptr> PackedConvWeightsOnednn< exp_wgt.init(w_desc); exp_wgt.set_scale(wgt_scales); // Also for feed_from() exp_wgt.feed_from(wgt, transpose); // expect wgt to be in [OC IC KH KW] format - ideep::tensor * packed_weight_p = new ideep::tensor(exp_wgt); + ideep::tensor * packed_weight_p = new 
ideep::tensor(std::move(exp_wgt)); packed_weight_p->set_scale(wgt_scales); packed_weight_p->set_zero_point(wgt_zero_points); std::unique_ptr weight_ptr(packed_weight_p); @@ -521,6 +529,21 @@ class QConvPackWeightInt8 final { int64_t groups, bool transpose) { auto& ctx = at::globalContext(); +#ifdef USE_FBGEMM + if (ctx.qEngine() == at::QEngine::X86) { +#if AT_MKLDNN_ENABLED() + bool use_onednn = onednn_utils::should_use_onednn_quant( + weight, transpose, groups, output_padding); + if (use_onednn) { + return PackedConvWeightsOnednn::prepack( + weight, bias, stride, padding, output_padding, dilation, groups, transpose); + } +#endif + return PackedConvWeight::prepack( + weight, bias, stride, padding, output_padding, dilation, groups, transpose); + } // x86 +#endif // defined(USE_FBGEMM) || AT_MKLDNN_ENABLED() + #ifdef USE_FBGEMM if (ctx.qEngine() == at::QEngine::FBGEMM) { return PackedConvWeight::prepack( @@ -598,6 +621,25 @@ class QConv1dPackWeightInt8 final { padding = quant_utils::MakeArgForConv1d(padding, 0); output_padding = quant_utils::MakeArgForConv1d(output_padding, 0); dilation = quant_utils::MakeArgForConv1d(dilation, 1); + +#ifdef USE_FBGEMM + if (ctx.qEngine() == at::QEngine::X86) { +#if AT_MKLDNN_ENABLED() + bool use_onednn = onednn_utils::should_use_onednn_quant( + weight, transpose, groups, output_padding); + if (use_onednn) { + return PackedConvWeightsOnednn<2>::prepack( + weight, bias, stride, padding, output_padding, dilation, groups, + transpose); + } +#endif + return PackedConvWeight<2>::prepack( + weight, bias, stride, padding, output_padding, dilation, groups, + transpose); + + } // x86 +#endif + #ifdef USE_FBGEMM if (ctx.qEngine() == at::QEngine::FBGEMM) { return PackedConvWeight<2>::prepack( diff --git a/aten/src/ATen/native/quantized/cpu/qconv_unpack_impl.cpp b/aten/src/ATen/native/quantized/cpu/qconv_unpack_impl.cpp index ad32d9b16a20..8af8d62f2f8a 100644 --- a/aten/src/ATen/native/quantized/cpu/qconv_unpack_impl.cpp +++ b/aten/src/ATen/native/quantized/cpu/qconv_unpack_impl.cpp @@ -126,7 +126,7 @@ template std::tuple> PackedConvWeightsOnednn< kSpatialDim>::unpack() { return std::tuple>( - orig_weight_, orig_bias_); + orig_weight_.clone(), orig_bias_); } template std::tuple> PackedConvWeightsOnednn< diff --git a/aten/src/ATen/native/quantized/cpu/qelu.cpp b/aten/src/ATen/native/quantized/cpu/qelu.cpp index ba921efcc91e..f8b66781f2e9 100644 --- a/aten/src/ATen/native/quantized/cpu/qelu.cpp +++ b/aten/src/ATen/native/quantized/cpu/qelu.cpp @@ -1,9 +1,15 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include -#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cpu/qembeddingbag.cpp b/aten/src/ATen/native/quantized/cpu/qembeddingbag.cpp index ac6cce628064..e2703bb93fb4 100644 --- a/aten/src/ATen/native/quantized/cpu/qembeddingbag.cpp +++ b/aten/src/ATen/native/quantized/cpu/qembeddingbag.cpp @@ -1,4 +1,5 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include @@ -9,10 +10,20 @@ #endif #include +#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + int register_embedding_params(); namespace { diff --git a/aten/src/ATen/native/quantized/cpu/qembeddingbag.h b/aten/src/ATen/native/quantized/cpu/qembeddingbag.h index 301b025322a3..86ed0f530f9c 100644 --- a/aten/src/ATen/native/quantized/cpu/qembeddingbag.h +++ 
b/aten/src/ATen/native/quantized/cpu/qembeddingbag.h @@ -1,4 +1,6 @@ -#include +#pragma once +#include +#include namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp b/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp index 748e89fc182d..dab19e0908e3 100644 --- a/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp +++ b/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.cpp @@ -1,12 +1,24 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include +#include +#include #include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + #include int register_embedding_params(); @@ -254,9 +266,10 @@ Tensor& qembeddingbag_byte_prepack_out(Tensor& output, const Tensor& weight) { } #else - const auto weight_data = weight_contig->scalar_type() == at::ScalarType::Half - ? weight_contig->to(at::ScalarType::Float).data_ptr() - : weight_contig->data_ptr(); + const Tensor& float_weight = weight_contig->scalar_type() == at::ScalarType::Half + ? weight_contig->to(at::ScalarType::Float) + : *weight_contig; + const auto weight_data = float_weight.data_ptr(); constexpr float kEpsilon = 1e-8f; for (auto row : c10::irange(embedding_rows)) { const float* input_row = weight_data + row * embedding_cols; diff --git a/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.h b/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.h index c52cbae4f2c8..a18ec214ebad 100644 --- a/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.h +++ b/aten/src/ATen/native/quantized/cpu/qembeddingbag_prepack.h @@ -1,7 +1,7 @@ -#include +#pragma once +#include -namespace at { -namespace native { +namespace at { namespace native { Tensor& qembeddingbag_byte_prepack_out(Tensor& output, const Tensor& weight); diff --git a/aten/src/ATen/native/quantized/cpu/qembeddingbag_unpack.cpp b/aten/src/ATen/native/quantized/cpu/qembeddingbag_unpack.cpp index 68e7c4fdaca2..d0c62d686135 100644 --- a/aten/src/ATen/native/quantized/cpu/qembeddingbag_unpack.cpp +++ b/aten/src/ATen/native/quantized/cpu/qembeddingbag_unpack.cpp @@ -1,10 +1,21 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + int register_embedding_params(); at::Tensor PackedEmbeddingBagWeight::unpack() { diff --git a/aten/src/ATen/native/quantized/cpu/qgelu.cpp b/aten/src/ATen/native/quantized/cpu/qgelu.cpp index 05901b556e47..f9a3c32343df 100644 --- a/aten/src/ATen/native/quantized/cpu/qgelu.cpp +++ b/aten/src/ATen/native/quantized/cpu/qgelu.cpp @@ -1,15 +1,12 @@ -#include -#include -#include -#include -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include -#include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cpu/qhardsigmoid.cpp b/aten/src/ATen/native/quantized/cpu/qhardsigmoid.cpp index 6059671eb067..aa37e51e7ea1 100644 --- a/aten/src/ATen/native/quantized/cpu/qhardsigmoid.cpp +++ b/aten/src/ATen/native/quantized/cpu/qhardsigmoid.cpp @@ -1,12 +1,19 @@ -#include -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + 
#include namespace at { diff --git a/aten/src/ATen/native/quantized/cpu/qhardswish.cpp b/aten/src/ATen/native/quantized/cpu/qhardswish.cpp index 7f2431de86ec..bf4e0d988295 100644 --- a/aten/src/ATen/native/quantized/cpu/qhardswish.cpp +++ b/aten/src/ATen/native/quantized/cpu/qhardswish.cpp @@ -1,12 +1,18 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include -#include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + #include namespace at { diff --git a/aten/src/ATen/native/quantized/cpu/qlinear.cpp b/aten/src/ATen/native/quantized/cpu/qlinear.cpp index 0e51b9867607..111b5eb5f139 100644 --- a/aten/src/ATen/native/quantized/cpu/qlinear.cpp +++ b/aten/src/ATen/native/quantized/cpu/qlinear.cpp @@ -1,6 +1,8 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include -#include +#include #include #include #include @@ -8,9 +10,20 @@ #include #include #include -#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include // for _empty_affine_q... +#include // for empty_affine_qu... +#include // for empty +#include // for quantize_per_ch... +#include // for quantize_per_te... +#include +#endif + #include #include @@ -629,10 +642,13 @@ at::Tensor PackedLinearWeightsOnednn::apply_impl( ideep::attr_t op_attr = ReluFused ? ideep::attr_t::fuse_relu() : ideep::attr_t(); ideep::tensor x(input_desc, input_contig->data_ptr()); auto dst_dims = {M, N}; - const ideep::scale_t& src_scales = ideep::scale_t(1, 1.0/input.q_scale()); + double input_scale = input.q_scale(); + int64_t input_zero_point = input.q_zero_point(); + const ideep::scale_t& src_scales = ideep::scale_t(1, 1.0/input_scale); const ideep::scale_t& weights_scales = w.get_scale(); - const ideep::scale_t& dst_scales = ideep::scale_t(1, 1.0/output_scale); // Scales of ONEDNN and PyTorch are reciprocal - const ideep::zero_point_t& src_zero_point = ideep::zero_point_t(1, input.q_zero_point()); + // Scales of ONEDNN and PyTorch are reciprocal + const ideep::scale_t& dst_scales = ideep::scale_t(1, 1.0/output_scale); + const ideep::zero_point_t& src_zero_point = ideep::zero_point_t(1, input_zero_point); const ideep::zero_point_t& dst_zero_point = ideep::zero_point_t(1, output_zero_point); // Compute: Use ideep::matmul_forward to support asymmetric quantization // Allocate output Tensor @@ -644,20 +660,39 @@ at::Tensor PackedLinearWeightsOnednn::apply_impl( if (output.numel() == 0) { return output; } - ideep::tensor y({dst_dims, ideep::tensor::data_type::u8, {output.strides().cbegin(), output.strides().cend()}}, + ideep::tensor y({dst_dims, ideep::tensor::data_type::u8, + {output.strides().cbegin(), output.strides().cend()}}, output.data_ptr()); - if (bias_.has_value()) { + bool with_bias = bias_.has_value(); + if (with_bias) { // Bias might be modified outside (e.g. by quantization bias correction). // If so, update the prepacked bias as well. if (bias_.value().get_data_handle() != orig_bias_.value().data_ptr()) { bias_.value().init(bias_.value().get_desc(), orig_bias_.value().data_ptr()); } - const auto& b = bias_.value(); - ideep::matmul_forward::compute_v2(x, w, b, y, 1.0f, 1.0f, src_scales, weights_scales, dst_scales, - src_zero_point, dst_zero_point, op_attr); + } + const auto& b = with_bias ? bias_.value() : ideep::tensor(); + // Primitive cache is initialized when called for the first time + // and won't be updated afterwards. 
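The qlinear.cpp hunk continuing just below adds a one-shot primitive cache: the expensive oneDNN prepare step runs once under c10::call_once, keyed by input scale/zero point, input dims, output quantization parameters, and thread count, and later calls reuse the prepared primitive only while the key still matches. A condensed sketch of that caching pattern with the ideep specifics abstracted away (CacheKey, Primitive, and prepare_primitive are stand-ins, not the real API):

#include <cstdint>
#include <mutex>
#include <tuple>
#include <vector>

using CacheKey = std::tuple<double /*scale*/, int64_t /*zero_point*/,
                            std::vector<int64_t> /*dims*/, int /*threads*/>;

struct Primitive { /* prepared kernel state would live here */ };

// Stand-in for the expensive ideep::*_forward::prepare() call.
Primitive prepare_primitive(const CacheKey&) { return Primitive{}; }

struct PrimitiveCache {
  CacheKey key;
  Primitive prim;
  bool hit(const CacheKey& k) const { return k == key; }
};

// Mirrors the structure of apply_impl: prepare once, reuse on key match,
// otherwise fall back to the uncached path (the compute_v2 analogue).
void run(const CacheKey& key, std::once_flag& flag, PrimitiveCache& cache) {
  std::call_once(flag, [&] { cache = PrimitiveCache{key, prepare_primitive(key)}; });
  if (cache.hit(key)) {
    // fast path: execute the cached primitive
  } else {
    // slow path: build and run a primitive for this call only
  }
}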
+ int num_threads = at::get_num_threads(); + PrimitiveCacheKey cache_key = std::make_tuple( + input_scale, input_zero_point, input_dims, output_scale, output_zero_point, num_threads); + c10::call_once(*cache_initialized_flag, [&](){ + LinearParams params; + ideep::matmul_forward::prepare( + params, x, w, b, y, 1.0f, 1.0f, + src_scales, weights_scales, dst_scales, + src_zero_point, dst_zero_point, op_attr); + get_cache() = LinearPrimitiveCache(cache_key, params); + onednn_utils::try_reorder( + w, (ideep::tensor::desc)params.pd.weights_desc(), weights_scales); + }); + if (get_cache().hit(cache_key)) { + LinearParams& params = get_cache().get_param(); + ideep::matmul_forward::compute(params, x, w, b, y); } else { - ideep::matmul_forward::compute_v2(x, w, y, 1.0f, 1.0f, src_scales, weights_scales, dst_scales, - src_zero_point, dst_zero_point, op_attr); + ideep::matmul_forward::compute_v2(x, w, b, y, 1.0f, 1.0f, src_scales, weights_scales, + dst_scales, src_zero_point, dst_zero_point, op_attr); } auto out_sizes = input.sizes().vec(); out_sizes.back() = N; diff --git a/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp b/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp index df529a6612f9..537d0f492f8f 100644 --- a/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp +++ b/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp @@ -1,6 +1,7 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include -#include #include #include #include @@ -9,7 +10,14 @@ #include #include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#include +#include +#endif #include @@ -236,7 +244,7 @@ at::Tensor PackedLinearWeightsQnnp::apply_dynamic_impl( at::Tensor input, bool reduce_range) { if (reduce_range) { - TORCH_WARN("Currently, qnnpack incorrectly ignores reduce_range when it is set to true; this may change in a future release."); + TORCH_WARN_ONCE("Currently, qnnpack incorrectly ignores reduce_range when it is set to true; this may change in a future release."); } using at::Tensor; @@ -415,14 +423,21 @@ at::Tensor& PackedLinearWeightFp16::apply_dynamic_impl( // Resize output Tensor output.resize_(output_sizes); - // Call the fp16 gemm interface - fbgemm::cblas_gemm_compute( - fbgemm::matrix_op_t::NoTranspose, - M, - input_ptr, - packed_weight_fp16, - 0.0f, - output.data_ptr()); + int num_tasks = at::get_num_threads(); + at::parallel_for(0, num_tasks, 1, [&](int64_t begin, int64_t end) { + for (const auto task_id : c10::irange(begin, end)) { + // Call the fp16 gemm interface + fbgemm::cblas_gemm_compute( + /*transa=*/fbgemm::matrix_op_t::NoTranspose, + /*m=*/static_cast(M), + /*A=*/input_ptr, + /*Bp=*/packed_weight_fp16, + /*beta=*/0.0f, + /*C=*/output.data_ptr(), + /*thread_id=*/static_cast(task_id), + /*num_threads=*/num_tasks); + } + }); // Add bias term if (bias_.has_value()) { @@ -496,10 +511,21 @@ at::Tensor PackedLinearWeightsOnednn::apply_dynamic_impl( x.init(input_desc, input_contig.data_ptr()); // Find quantization parameters float x_max = 0, x_min = 0; - if (input.numel() > 0) { - x_min = input_contig.min().item(); - x_max = input_contig.max().item(); +#ifdef USE_FBGEMM + // Use FBGEMM's FindMinMax if available since it's faster + fbgemm::FindMinMax( + /*m=*/input_contig.data_ptr(), + /*min=*/&x_min, + /*max=*/&x_max, + /*len=*/input.numel()); +#else + if (input_contig.numel() > 0) { + Tensor t_min, t_max; + std::tie(t_min, t_max) = at::aminmax(input_contig); + x_max = t_max.item(); + x_min = t_min.item(); } +#endif const int 
precision = 8; auto q_params = quant_utils::ChooseQuantizationParams( /*min=*/x_min, @@ -524,18 +550,37 @@ at::Tensor PackedLinearWeightsOnednn::apply_dynamic_impl( ideep::tensor y({dst_dims, ideep::tensor::data_type::f32, {output.strides().cbegin(), output.strides().cend()}}, output.data_ptr()); - if (bias_.has_value()) { + bool with_bias = bias_.has_value(); + if (with_bias) { // Bias might be modified outside (e.g. by quantization bias correction). // If so, update the prepacked bias as well. if (bias_.value().get_data_handle() != orig_bias_.value().data_ptr()) { bias_.value().init(bias_.value().get_desc(), orig_bias_.value().data_ptr()); } - const ideep::tensor b = bias_.value(); - ideep::matmul_forward::compute_v2(x, w, b, y, 1.0f, 1.0f, - src_scales, weights_scales, ideep::scale_t(), - src_zero_point, ideep::zero_point_t(), op_attr); + } + const auto& b = with_bias ? bias_.value() : ideep::tensor(); + // Primitive cache is initialized when called for the first time + // and won't be updated afterwards. + int num_threads = at::get_num_threads(); + PrimitiveCacheKey cache_key = std::make_tuple( + q_params.scale, q_params.zero_point, input_dims, 1.0, 0, num_threads); + c10::call_once(*cache_initialized_flag, [&](){ + LinearParams params; + ideep::matmul_forward::prepare( + params, x, w, b, y, 1.0f, 1.0f, + src_scales, weights_scales, ideep::scale_t(), + src_zero_point, ideep::zero_point_t(), op_attr); + get_cache() = LinearPrimitiveCache(cache_key, params); + onednn_utils::try_reorder( + w, (ideep::tensor::desc)params.pd.weights_desc(), weights_scales); + }); + if (get_cache().hit_dynamic(cache_key)) { + LinearParams& params = get_cache().get_param(); + ideep::matmul_forward::compute_dynamic( + params, x, w, b, y, 1.0f, 1.0f, src_scales, weights_scales, + ideep::scale_t(), src_zero_point, ideep::zero_point_t()); } else { - ideep::matmul_forward::compute_v2(x, w, y, 1.0f, 1.0f, + ideep::matmul_forward::compute_v2(x, w, b, y, 1.0f, 1.0f, src_scales, weights_scales, ideep::scale_t(), src_zero_point, ideep::zero_point_t(), op_attr); } diff --git a/aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp b/aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp index b4f0f4c41f41..36523bbd1b9b 100644 --- a/aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp +++ b/aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp @@ -1,4 +1,6 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include @@ -9,9 +11,19 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif + #include #include +#include #include int register_linear_params(); @@ -238,7 +250,7 @@ c10::intrusive_ptr PackedLinearWeightsOnednn::prepack( dnnl::memory::data_type::u8); ideep::tensor exp_wgt(w_desc); exp_wgt.feed_from(wgt); - ideep::tensor * packed_weight_p = new ideep::tensor(exp_wgt); + ideep::tensor * packed_weight_p = new ideep::tensor(std::move(exp_wgt)); packed_weight_p->set_scale(wgt_scales); packed_weight_p->set_zero_point(wgt_zero_points); std::unique_ptr weight_ptr(packed_weight_p); @@ -288,7 +300,8 @@ class QLinearPackWeightInt8 final { auto& ctx = at::globalContext(); #ifdef USE_FBGEMM - if (ctx.qEngine() == at::QEngine::FBGEMM) { + if (ctx.qEngine() == at::QEngine::FBGEMM || + ctx.qEngine() == at::QEngine::X86) { return PackedLinearWeight::prepack(std::move(weight), std::move(bias)); } #endif @@ -320,7 +333,8 @@ class QLinearPackWeightFp16 final { // temporarily convert weight back to fp32, needs to be fixed // 
after fbgemm fixes the interface for their prepacking op (take fp16 input0 weight = weight.to(ScalarType::Float); - if (ctx.qEngine() == at::QEngine::FBGEMM) { + if (ctx.qEngine() == at::QEngine::FBGEMM || + ctx.qEngine() == at::QEngine::X86) { return PackedLinearWeightFp16::prepack( std::move(weight), std::move(bias)); } diff --git a/aten/src/ATen/native/quantized/cpu/qmatmul.cpp b/aten/src/ATen/native/quantized/cpu/qmatmul.cpp index c1e5041a5734..4da714e0bcf0 100644 --- a/aten/src/ATen/native/quantized/cpu/qmatmul.cpp +++ b/aten/src/ATen/native/quantized/cpu/qmatmul.cpp @@ -21,7 +21,7 @@ inline void check_inputs(const Tensor& qa, const Tensor& qb) { "MatMul operands should have same data type."); TORCH_CHECK( qa.qscheme() == kPerTensorAffine || qa.qscheme() == kPerTensorSymmetric, - "Only per-tensor quantization is suported in Matmul."); + "Only per-tensor quantization is supported in Matmul."); TORCH_CHECK( qa.qscheme() == qb.qscheme(), "Both inputs to Matmul must have the same quantization scheme."); @@ -45,7 +45,7 @@ Tensor qmatmul( " and ", b_num_dims, " provided)"); TORCH_CHECK( num_dims >= 2, - "Quantized Matmul currently only suports operands which are at least 2-dimensional. (", + "Quantized Matmul currently only supports operands which are at least 2-dimensional. (", num_dims, " provided)"); const int64_t m = qa.size(num_dims - 2); diff --git a/aten/src/ATen/native/quantized/cpu/qmul.cpp b/aten/src/ATen/native/quantized/cpu/qmul.cpp index 7015df9ea654..aa6ad0e724f5 100644 --- a/aten/src/ATen/native/quantized/cpu/qmul.cpp +++ b/aten/src/ATen/native/quantized/cpu/qmul.cpp @@ -1,9 +1,28 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include #include #include +#include +#include +#include #include +#include +#include #include +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#endif #include @@ -21,7 +40,7 @@ inline void check_inputs(const Tensor& qa, const Tensor& qb) { TORCH_CHECK(qa.scalar_type() == qb.scalar_type(), "Mul operands should have same data type."); TORCH_CHECK(qa.qscheme() == qb.qscheme(), - "Both inputs to Mul must have the same quantization shceme."); + "Both inputs to Mul must have the same quantization scheme."); } // Note: out is assumed to be the same size as self and other. 
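Several prepack hunks above route the new X86 engine through the existing FBGEMM packing path, and the conv variants first consult onednn_utils::should_use_onednn_quant to decide whether to prepack for oneDNN instead. A schematic sketch of that selection order (the enum and helper below are simplified placeholders for at::QEngine and the real onednn_utils check):

#include <stdexcept>

enum class QEngine { FBGEMM, QNNPACK, ONEDNN, X86 };
enum class Backend { Fbgemm, Onednn, Qnnpack };

// Placeholder for onednn_utils::should_use_onednn_quant(); the real helper
// inspects the weight, transpose flag, groups, and output padding.
bool should_use_onednn_quant() { return false; }

// Roughly the decision order used by the patched prepack entry points:
// X86 first (may fall through to oneDNN for conv), then the engine-specific paths.
Backend pick_backend(QEngine engine) {
  if (engine == QEngine::X86) {
    return should_use_onednn_quant() ? Backend::Onednn : Backend::Fbgemm;
  }
  if (engine == QEngine::FBGEMM) {
    return Backend::Fbgemm;
  }
  if (engine == QEngine::QNNPACK) {
    return Backend::Qnnpack;
  }
  if (engine == QEngine::ONEDNN) {
    return Backend::Onednn;
  }
  throw std::invalid_argument("unsupported quantized engine");
}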
@@ -37,6 +56,124 @@ Tensor _mul_out(Tensor& out, const Tensor& self, const Tensor& other) { return out; } +#ifdef USE_XNNPACK +template +Tensor _mul_out_xnnpack( + const Tensor& self, + const Tensor& other, + double output_scale, + int64_t output_zero_point) { + using underlying_t = typename scalar_t::underlying; + + const string func_name = "xnnp_mul()"; + TORCH_CHECK(self.ndimension() > 0, func_name, ": Got empty input tensor."); + TORCH_CHECK( + at::native::xnnpack::available(), func_name, ": XNNPACK is not available") + + // using qa memory format for qb to allow xnnpack kernel to flatten all the + // dims + auto qa_mem_format = self.suggest_memory_format(); + Tensor self_contig = self.contiguous(qa_mem_format); + Tensor other_contig = other.contiguous(qa_mem_format); + + Tensor out = at::native::empty_affine_quantized( + at::infer_size_dimvector(self_contig.sizes(), other_contig.sizes()), + self.scalar_type(), + c10::nullopt /* layout */, + kCPU, + c10::nullopt /* pin_memory */, + output_scale, + output_zero_point, + qa_mem_format); + + if (self_contig.size(0) == 0) { + return out; + } + + int64_t self_zero_point = self_contig.q_zero_point(); + double self_scale = self_contig.q_scale(); + int64_t other_zero_point = other_contig.q_zero_point(); + double other_scale = other_contig.q_scale(); + + int64_t output_min = std::numeric_limits::min(); + int64_t output_max = std::numeric_limits::max(); + + if(ReLUFused) { + /* + * FIXME: use acticationLimits() + * With , MSVC runs into "error C3862: indetifier activationLimits not + * found". + */ + constexpr int64_t qmin = std::numeric_limits::min(); + constexpr int64_t qmax = std::numeric_limits::max(); + int64_t qvalue = static_cast(output_zero_point); + qvalue = std::max(qvalue, qmin); + output_min = static_cast(std::min(qvalue, qmax)); + } + + xnn_operator_t xnnp_op = nullptr; + xnnpack_operator xnnp_qmul_operator; + + // create xnnpack multiply operator ... 
+ auto status = xnn_create_multiply_nd_qs8( + self_zero_point, + self_scale, + other_zero_point, + other_scale, + static_cast(output_zero_point), + static_cast(output_scale), + output_min, + output_max, + 0, + &xnnp_op); + + TORCH_CHECK( + status == xnn_status_success, + func_name, + ": xnn create operator failed(", + status, + ")!"); + xnnp_qmul_operator = xnnpack_operator(xnnp_op); + + + const auto self_shape = xnnp_utils::get_mem_format_aware_shape(self_contig); + const auto other_shape = xnnp_utils::get_mem_format_aware_shape(other_contig); + + // set up operator + status = xnn_setup_multiply_nd_qs8( + xnnp_qmul_operator.get(), + self_shape.size(), + self_shape.data(), + other_shape.size(), + other_shape.data(), + reinterpret_cast(self_contig.data_ptr()), + reinterpret_cast(other_contig.data_ptr()), + reinterpret_cast(out.data_ptr()), + caffe2::pthreadpool_()); + + TORCH_CHECK( + status == xnn_status_success, + func_name, + ": xnn setup operator failed(", + status, + ")!"); + + // Run the operator + status = xnn_run_operator( + xnnp_qmul_operator.get(), /* xnn_operator_t op */ + caffe2::pthreadpool_()); /* pthreadpool_t threadpool */ + TORCH_CHECK( + status == xnn_status_success, + func_name, + ": xnn run operator failed(", + status, + ")"); + + return out; +} + +#endif // use XNNPACK + template Tensor _mul_scalar_out(Tensor& out, const Tensor& self, const Scalar& other) { int64_t self_zero_point = self.q_zero_point(); @@ -100,19 +237,27 @@ Tensor _mul_scalar_out(Tensor& out, const Tensor& self, const Scalar& other) { }); return out; -} + } template class QMul final { public: static Tensor run(Tensor qa, Tensor qb, double scale, int64_t zero_point) { check_inputs(qa, qb); +#ifdef USE_XNNPACK + int64_t q_max = std::numeric_limits::max(); + if (zero_point < q_max && qa.scalar_type() == kQInt8) { + return _mul_out_xnnpack(qa, qb, scale, zero_point); + } +#endif // USE_XNNPACK + auto qc = at::_empty_affine_quantized( infer_size_dimvector(qa.sizes(), qb.sizes()), at::device(kCPU).dtype(qa.scalar_type()), scale, zero_point, qa.suggest_memory_format()); + return _mul_out(qc, qa, qb); } }; @@ -169,7 +314,7 @@ class QMulScalarTensor final { static Tensor run(Tensor qa, Tensor b) { TORCH_CHECK(qa.qscheme() == kPerTensorAffine || qa.qscheme() == kPerTensorSymmetric, - "Only per tensor quantization is suported in Mul."); + "Only per tensor quantization is supported in Mul."); auto qc = at::empty_like(qa, qa.suggest_memory_format()); return _mul_scalar_out(qc, qa, b.item()); } diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/CMakeLists.txt b/aten/src/ATen/native/quantized/cpu/qnnpack/CMakeLists.txt index 2c9ec7aa1e3a..8b5b82453a95 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/CMakeLists.txt +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/CMakeLists.txt @@ -175,7 +175,6 @@ set(PYTORCH_QNNPACK_EXEC_SRCS src/deconv-run.cc src/fc-run.cc src/fc-dynamic-run.cc - src/pack_block_sparse.cc src/indirection.c src/operator-run.c) diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/bench/q8gemm_sparse.cc b/aten/src/ATen/native/quantized/cpu/qnnpack/bench/q8gemm_sparse.cc index cb45912ed152..eabf62fe9410 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/bench/q8gemm_sparse.cc +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/bench/q8gemm_sparse.cc @@ -254,14 +254,13 @@ class Q8GEMMSparse : public benchmark::Fixture { colBlockSize(), sparsity(), kernel_zero_points.data()); - bcsr_matrix_ = - qnnpack::generateBlockCSRMatrix( - k_.data(), - nc(), - kc(), - rowBlockSize(), - 
colBlockSize(), - kernel_zero_points.data()); + bcsr_matrix_ = qnnpack::generateBlockCSRMatrix( + k_.data(), + nc(), + kc(), + rowBlockSize(), + colBlockSize(), + kernel_zero_points.data()); std::vector dequantization_scales(num_zero_points_kernel, 0.75f); c_.resize(mc() * nc()); std::fill(c_.begin(), c_.end(), 0xA5); @@ -466,13 +465,14 @@ BENCHMARK_TEMPLATE_DEFINE_F(Q8GEMMSparse_Op, 4x8c1x4_prepacked__aarch32_neon, 4, for (uint32_t n = 0, channel_offset = 0; n < nc(); n += nr(), channel_offset += nr()) { const uint32_t nrr = min(nc() - n, nr()); - pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA__aarch32_neon( + pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w32__aarch32_neon( mrr, nrr, a_packed.data() + (m >> 2) * (k_blocks << 2) * mr(), bcsr_matrix_->values.data(), - bcsr_matrix_->row_values.data() + n, - bcsr_matrix_->col_indices.data(), + static_cast(bcsr_matrix_->row_values_data_ptr()) + + n, + static_cast(bcsr_matrix_->col_indices_data_ptr()), b() + n, c() + m * nc() + n, nc(), @@ -512,13 +512,14 @@ BENCHMARK_TEMPLATE_DEFINE_F(Q8GEMMSparse_Op, 4x8c8x1_prepacked__aarch32_neon, 4, for (uint32_t n = 0, channel_offset = 0; n < nc(); n += nr(), channel_offset += nr()) { const uint32_t nrr = min(nc() - n, nr()); - pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA__aarch32_neon( + pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w32__aarch32_neon( mrr, nrr, a_packed.data() + (m >> 2) * (k_blocks << 2) * mr(), bcsr_matrix_->values.data(), - bcsr_matrix_->row_values.data() + (n >> 3), - bcsr_matrix_->col_indices.data(), + static_cast(bcsr_matrix_->row_values_data_ptr()) + + (n >> 3), + static_cast(bcsr_matrix_->col_indices_data_ptr()), b() + n, c() + m * nc() + n, nc(), @@ -585,13 +586,13 @@ BENCHMARK_TEMPLATE_DEFINE_F(Q8GEMMSparse_Op, 8x8c1x4_prepacked__aarch64_neon, 8, for (uint32_t n = 0, channel_offset = 0; n < nc(); n += nr(), channel_offset += nr()) { const uint32_t nrr = min(nc() - n, nr()); - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA__aarch64_neon( + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w32__aarch64_neon( mrr, nrr, a_packed.data() + (m >> 3) * (k_blocks << 2) * mr(), bcsr_matrix_->values.data(), - bcsr_matrix_->row_values.data(), - bcsr_matrix_->col_indices.data(), + static_cast(bcsr_matrix_->row_values_data_ptr()), + static_cast(bcsr_matrix_->col_indices_data_ptr()), b() + n, c() + m * nc() + n, nc(), @@ -630,13 +631,13 @@ BENCHMARK_TEMPLATE_DEFINE_F(Q8GEMMSparse_Op, 8x8c8x1_prepacked__aarch64_neon, 8, for (uint32_t n = 0, channel_offset = 0; n < nc(); n += nr(), channel_offset += nr()) { const uint32_t nrr = min(nc() - n, nr()); - pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA__aarch64_neon( + pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w32__aarch64_neon( mrr, nrr, a_packed.data() + (m >> 3) * (k_blocks << 2) * mr(), bcsr_matrix_->values.data(), - bcsr_matrix_->row_values.data(), - bcsr_matrix_->col_indices.data(), + static_cast(bcsr_matrix_->row_values_data_ptr()), + static_cast(bcsr_matrix_->col_indices_data_ptr()), b() + n, c() + m * nc() + n, nc(), diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/buckbuild.bzl b/aten/src/ATen/native/quantized/cpu/qnnpack/buckbuild.bzl index 5c1c316678e1..f981cce9726d 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/buckbuild.bzl +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/buckbuild.bzl @@ -272,7 +272,6 @@ def define_qnnpack(third_party, labels = []): "src/max-pooling.c", "src/operator-delete.c", "src/operator-run.c", - "src/pack_block_sparse.cc", "src/sigmoid.c", "src/softargmax.c", "src/tanh.c", 
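The q8gemm_sparse benchmark updates above stop reading the row_values/col_indices members directly and instead go through virtual row_values_data_ptr()/col_indices_data_ptr() accessors returning const void*, which the caller casts back to the index width it packed with. A stripped-down sketch of that type-erased accessor pattern, independent of the real qnnpack::BCSRMatrix declared in the pack_block_sparse.h hunk further below:

#include <cstdint>
#include <vector>

// Type-erased base: callers that only need raw pointers (e.g. the sparse GEMM
// ukernels) can stay unaware of the concrete index width.
struct IndexedMatrix {
  virtual ~IndexedMatrix() = default;
  virtual const void* row_values_data_ptr() const = 0;
  virtual const void* col_indices_data_ptr() const = 0;
};

// Concrete storage, templated on the index dtype (uint8_t/uint16_t/uint32_t).
template <typename IndexT>
struct TypedIndexedMatrix : IndexedMatrix {
  std::vector<IndexT> row_values;
  std::vector<IndexT> col_indices;
  const void* row_values_data_ptr() const override { return row_values.data(); }
  const void* col_indices_data_ptr() const override { return col_indices.data(); }
};

// Caller side, as in the updated benchmarks: cast back to the width the matrix
// is (here: assumed to be) packed with.
const uint32_t* row_values_w32(const IndexedMatrix& m) {
  return static_cast<const uint32_t*>(m.row_values_data_ptr());
}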
diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/cmake/DownloadGoogleTest.cmake b/aten/src/ATen/native/quantized/cpu/qnnpack/cmake/DownloadGoogleTest.cmake index 4a86d641e412..66b2232b5925 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/cmake/DownloadGoogleTest.cmake +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/cmake/DownloadGoogleTest.cmake @@ -11,7 +11,7 @@ project(googletest-download NONE) include(ExternalProject) ExternalProject_Add(googletest URL https://github.com/google/googletest/archive/release-1.10.0.zip - URL_HASH SHA256=f3ed3b58511efd272eb074a3a6d6fb79d7c2e6a0e374323d1e6bcbcc1ef141bf + URL_HASH SHA256=94c634d499558a76fa649edb13721dce6e98fb1e7018dfaeba3cd7a083945e91 SOURCE_DIR "${CONFU_DEPENDENCIES_SOURCE_DIR}/googletest" BINARY_DIR "${CONFU_DEPENDENCIES_BINARY_DIR}/googletest" CONFIGURE_COMMAND "" diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/CMakeLists.txt b/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/CMakeLists.txt index f19d6c61f33f..e763e4e3ba93 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/CMakeLists.txt +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/CMakeLists.txt @@ -63,7 +63,7 @@ set_target_properties(clog PROPERTIES C_EXTENSIONS NO) CLOG_TARGET_RUNTIME_LIBRARY(clog) set_target_properties(clog PROPERTIES PUBLIC_HEADER include/clog.h) -target_include_directories(clog BEFORE PUBLIC include) +target_include_directories(clog PUBLIC $ $) if(CLOG_LOG_TO_STDIO) target_compile_definitions(clog PRIVATE CLOG_LOG_TO_STDIO=1) else() @@ -73,7 +73,10 @@ if(ANDROID AND NOT CLOG_LOG_TO_STDIO) target_link_libraries(clog PRIVATE log) endif() +add_library(cpuinfo::clog ALIAS clog) + install(TARGETS clog + EXPORT cpuinfo-targets LIBRARY DESTINATION "${CMAKE_INSTALL_LIBDIR}" ARCHIVE DESTINATION "${CMAKE_INSTALL_LIBDIR}" PUBLIC_HEADER DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}") diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/cmake/DownloadGoogleTest.cmake b/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/cmake/DownloadGoogleTest.cmake index 4a86d641e412..66b2232b5925 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/cmake/DownloadGoogleTest.cmake +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/cmake/DownloadGoogleTest.cmake @@ -11,7 +11,7 @@ project(googletest-download NONE) include(ExternalProject) ExternalProject_Add(googletest URL https://github.com/google/googletest/archive/release-1.10.0.zip - URL_HASH SHA256=f3ed3b58511efd272eb074a3a6d6fb79d7c2e6a0e374323d1e6bcbcc1ef141bf + URL_HASH SHA256=94c634d499558a76fa649edb13721dce6e98fb1e7018dfaeba3cd7a083945e91 SOURCE_DIR "${CONFU_DEPENDENCIES_SOURCE_DIR}/googletest" BINARY_DIR "${CONFU_DEPENDENCIES_BINARY_DIR}/googletest" CONFIGURE_COMMAND "" diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/include/pack_block_sparse.h b/aten/src/ATen/native/quantized/cpu/qnnpack/include/pack_block_sparse.h index bfaa19e564b4..4770b30638ad 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/include/pack_block_sparse.h +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/include/pack_block_sparse.h @@ -15,9 +15,14 @@ #ifndef _WIN32 #include #endif +#include #include #include +#ifdef QNNPACK_BCSRMATRIX_DEBUG +#include +#endif // QNNPACK_BCSRMATRIX_DEBUG + namespace qnnpack { template @@ -70,13 +75,20 @@ struct OwnedOrBorrowedVector { owned(false) {} }; -typedef struct BCSRMatrix { - OwnedOrBorrowedVector col_indices; - OwnedOrBorrowedVector row_values; +struct BCSRMatrix { OwnedOrBorrowedVector values; uint32_t col_block_size; 
// input features block size uint32_t row_block_size; // output features block size - void print() const; + enum pytorch_qnnp_sparse_matrix_indices_dtype indices_dtype; + virtual ~BCSRMatrix() = default; + // Return void for the data ptrs because it doesn't require knowing the + // underlying TypedBCSRMatrix indices dtype and that's how it's passed + // into the qnnpack fully connected sparse op + virtual const void* col_indices_data_ptr() const = 0; + virtual const void* row_values_data_ptr() const = 0; +#ifdef QNNPACK_BCSRMATRIX_DEBUG + virtual void print() const = 0; +#endif // QNNPACK_BCSRMATRIX_DEBUG /* * Unpack from BCSR to Dense * - Each value and zero point converted to int8_t by subtracting 128 @@ -84,29 +96,288 @@ typedef struct BCSRMatrix { * - dst should be able to hold num_rows * num_cols elements * - zero_points should hold num_rows zero points */ + virtual void unpack( + int8_t* dst, + const int64_t num_rows, + const int64_t num_cols, + const uint8_t* zero_points) const = 0; + virtual uint32_t max_index() const = 0; +}; + +template +struct TypedBCSRMatrix : BCSRMatrix { + OwnedOrBorrowedVector col_indices; + OwnedOrBorrowedVector row_values; + TypedBCSRMatrix(); + const void* col_indices_data_ptr() const override; + const void* row_values_data_ptr() const override; +#ifdef QNNPACK_BCSRMATRIX_DEBUG + void print() const override; +#endif // QNNPACK_BCSRMATRIX_DEBUG void unpack( int8_t* dst, const int64_t num_rows, const int64_t num_cols, - const uint8_t* zero_points) const; -} BCSRMatrix; + const uint8_t* zero_points) const override; + uint32_t max_index() const override; + + ~TypedBCSRMatrix() override = default; +}; +template std::unique_ptr generateBlockCSRMatrix( const uint8_t* a, const size_t N, const size_t K, const uint32_t row_block_size, const uint32_t col_block_size, - const uint8_t* zero_points); + const uint8_t* zero_points) { + assert(K > 0); + std::unique_ptr> bcsr_mat = + std::make_unique>(); + auto& row_values = bcsr_mat->row_values.vector(); + auto& col_indices = bcsr_mat->col_indices.vector(); + auto& values = bcsr_mat->values.vector(); + + const uint32_t num_row_blocks = (N + row_block_size - 1) / row_block_size; + // K must be > 0 + const uint32_t num_col_blocks = (K + col_block_size - 1) / col_block_size; + row_values.reserve(num_row_blocks); + uint32_t num_nnz_blocks{0}; + row_values.push_back(num_nnz_blocks); + for (uint32_t i = 0; i < num_row_blocks; ++i) { + for (uint32_t j = 0; j < num_col_blocks; ++j) { + bool block_zero{true}; + for (uint32_t ib = 0; ib < row_block_size; ++ib) { + uint32_t row_index = i * row_block_size + ib; + if PYTORCH_QNNP_UNLIKELY(row_index >= N) { + break; + } + for (uint32_t jb = 0; jb < col_block_size; ++jb) { + uint32_t col_index = j * col_block_size + jb; + if PYTORCH_QNNP_UNLIKELY(col_index >= K) { + goto block_scanned; + } + if (*(a + row_index * K + col_index) != zero_points[row_index]) { + block_zero = false; + goto block_scanned; + } + } + } +block_scanned: + if (!block_zero) { + col_indices.push_back(j); + num_nnz_blocks++; + for (uint32_t ib = 0; ib < row_block_size; ++ib) { + uint32_t row_index = i * row_block_size + ib; + if PYTORCH_QNNP_UNLIKELY(row_index >= N) { + for (; row_index < (num_row_blocks * row_block_size); row_index++) { + for (uint32_t jb = 0; jb < col_block_size; ++jb) { + values.push_back(zero_points[N-1]); + } + } + break; + } + for (uint32_t jb = 0; jb < col_block_size; ++jb) { + uint32_t col_index = j * col_block_size + jb; + if PYTORCH_QNNP_UNLIKELY(col_index >= K) { + 
values.push_back(zero_points[row_index]); + } else { + uint8_t val = *(a + row_index * K + col_index); + values.push_back(val); + } + } + } + } + } + row_values.push_back(num_nnz_blocks); + } + bcsr_mat->row_block_size = row_block_size; + bcsr_mat->col_block_size = col_block_size; + return bcsr_mat; +} + +template std::unique_ptr generateBlockCSRMatrix( - uint32_t* col_indices, - uint32_t* row_values, + INDICES_DTYPE* col_indices, + INDICES_DTYPE* row_values, uint8_t* values, const int64_t col_indices_size, const int64_t row_values_size, const int64_t values_size, const int64_t row_block_size, - const int64_t col_block_size); + const int64_t col_block_size) { + std::unique_ptr> bcsr_mat = + std::make_unique>(); + bcsr_mat->col_indices = + OwnedOrBorrowedVector(col_indices, col_indices_size); + bcsr_mat->row_values = + OwnedOrBorrowedVector(row_values, row_values_size); + bcsr_mat->values = OwnedOrBorrowedVector(values, values_size); + bcsr_mat->row_block_size = row_block_size; + bcsr_mat->col_block_size = col_block_size; + return bcsr_mat; +} + +template +struct IndicesDtypeEnumTrait { + static_assert( + sizeof(INDICES_DTYPE) == 0, + "Invalid dtype for IndicesDtypeEnumTrait"); +}; + +template <> +struct IndicesDtypeEnumTrait { + const static pytorch_qnnp_sparse_matrix_indices_dtype dtype = + pytorch_qnnp_sparse_matrix_indices_dtype_uint32_t; +}; + +template <> +struct IndicesDtypeEnumTrait { + const static pytorch_qnnp_sparse_matrix_indices_dtype dtype = + pytorch_qnnp_sparse_matrix_indices_dtype_uint16_t; +}; + +template <> +struct IndicesDtypeEnumTrait { + const static pytorch_qnnp_sparse_matrix_indices_dtype dtype = + pytorch_qnnp_sparse_matrix_indices_dtype_uint8_t; +}; + +template +TypedBCSRMatrix::TypedBCSRMatrix() { + indices_dtype = IndicesDtypeEnumTrait::dtype; +} + +template +const void* TypedBCSRMatrix::col_indices_data_ptr() const { + return static_cast(col_indices.data()); +} + +template +const void* TypedBCSRMatrix::row_values_data_ptr() const { + return static_cast(row_values.data()); +} + +#ifdef QNNPACK_BCSRMATRIX_DEBUG +template +void TypedBCSRMatrix::print() const { + std::cout << "row block size:" << row_block_size << std::endl; + std::cout << "col block size:" << col_block_size << std::endl; + std::cout << "row ptr\n"; + std::cout + << "indices dtype: uint" + << static_cast< + std::underlying_type_t>( + indices_dtype) + << "_t" << std::endl; + for (uint32_t i = 0; i < row_values.size(); i++) { + std::cout << (uint32_t)row_values[i] << ", "; + } + std::cout << std::endl; + std::cout << "col indices\n"; + for (uint32_t i = 0; i < col_indices.size(); i++) { + std::cout << (uint32_t)col_indices[i] << ", "; + } + std::cout << std::endl; + std::cout << "Actual values\n"; + for (uint32_t i = 0; i < values.size(); i++) { + std::cout << (uint32_t)values[i] << ", "; + } + std::cout << std::endl; +} +#endif // QNNPACK_BCSRMATRIX_DEBUG + +template +void TypedBCSRMatrix::unpack( + int8_t* dst, + const int64_t num_rows, + const int64_t num_cols, + const uint8_t* zero_points) const { + for (int64_t i = 0; i < num_rows; i++) { + memset( + dst + i * num_cols, + static_cast(static_cast(zero_points[i]) - 128), + num_cols * sizeof(int8_t)); + } + + const int64_t num_block_rows = static_cast(row_values.size()) - 1; + const int64_t block_size = (int64_t)row_block_size * col_block_size; + int64_t weight_values_num = 0; + for (int64_t block_row_num = 0; block_row_num < num_block_rows; + block_row_num++) { + const int64_t num_blocks_in_current_block_row = + row_values[block_row_num + 1] - 
row_values[block_row_num]; + for (int64_t k = 0; k < num_blocks_in_current_block_row; + k++) { // iterate over each block in the row + const int64_t block_start_row_num = block_row_num * row_block_size; + const int64_t block_start_col_num = + (int64_t)(col_indices[weight_values_num / block_size]) * + col_block_size; + for (int64_t l = 0; l < block_size; + l++) { // iterate over each value in the block + const int64_t row_num = block_start_row_num + l / col_block_size; + const int64_t col_num = block_start_col_num + l % col_block_size; + if (row_num < num_rows && col_num < num_cols) { + dst[row_num * num_cols + col_num] = static_cast( + static_cast(values[weight_values_num]) - 128); + } + weight_values_num++; + } + } + } +} + +template +uint32_t TypedBCSRMatrix::max_index() const { + return static_cast(std::max( + *std::max_element( + row_values.data(), row_values.data() + row_values.size()), + *std::max_element( + col_indices.data(), col_indices.data() + col_indices.size()))); +} + +/** + * Given a BCSRMatrix (bcsr_) and a block of code enclosed in { } + * (dispatch_body), run the block of code with the following in scope + * 1) The BCSRMatrix's underlying TypedBCSRMatrix, called typed_bcsr + * 2) The TypedBCSRMatrix's indices data type, called INDICES_DTYPE + */ +#define QNNPACK_BCSRMATRIX_DISPATCH_INDICES_DTYPE(bcsr_, dispatch_body) \ + [&bcsr = bcsr_]() { \ + switch (bcsr->indices_dtype) { \ + case pytorch_qnnp_sparse_matrix_indices_dtype_uint32_t: { \ + using INDICES_DTYPE = uint32_t; \ + const qnnpack::TypedBCSRMatrix* typed_bcsr = \ + static_cast*>( \ + bcsr.get()); \ + return [&typed_bcsr]() dispatch_body(); \ + } \ + case pytorch_qnnp_sparse_matrix_indices_dtype_uint16_t: { \ + using INDICES_DTYPE = uint16_t; \ + const qnnpack::TypedBCSRMatrix* typed_bcsr = \ + static_cast*>( \ + bcsr.get()); \ + return [&typed_bcsr]() dispatch_body(); \ + } \ + case pytorch_qnnp_sparse_matrix_indices_dtype_uint8_t: { \ + using INDICES_DTYPE = uint8_t; \ + const qnnpack::TypedBCSRMatrix* typed_bcsr = \ + static_cast*>( \ + bcsr.get()); \ + return [&typed_bcsr]() dispatch_body(); \ + } \ + case pytorch_qnnp_sparse_matrix_indices_dtype_invalid: { \ + assert(false); \ + } \ + } \ + /* Throw exception to avoid the following errors: */ \ + /* - "non-void lambda does not return a value in all control paths" */ \ + /* - "control reaches end of non-void function" */ \ + /* Throwing exception from within invalid case alone does not fix these */ \ + throw std::invalid_argument( \ + "Invalid indices dtype in QNNPACK_BCSRMATRIX_DISPATCH_INDICES_DTYPE"); \ + }() } // namespace qnnpack diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/include/pytorch_qnnpack.h b/aten/src/ATen/native/quantized/cpu/qnnpack/include/pytorch_qnnpack.h index 07666ea09605..c518104153e5 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/include/pytorch_qnnpack.h +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/include/pytorch_qnnpack.h @@ -32,6 +32,13 @@ enum pytorch_qnnp_status { pytorch_qnnp_status_out_of_memory = 5, }; +enum pytorch_qnnp_sparse_matrix_indices_dtype { + pytorch_qnnp_sparse_matrix_indices_dtype_invalid = 0, + pytorch_qnnp_sparse_matrix_indices_dtype_uint8_t = 8, + pytorch_qnnp_sparse_matrix_indices_dtype_uint16_t = 16, + pytorch_qnnp_sparse_matrix_indices_dtype_uint32_t = 32, +}; + enum pytorch_qnnp_status pytorch_qnnp_initialize(void); enum pytorch_qnnp_status pytorch_qnnp_deinitialize(void); @@ -168,11 +175,12 @@ enum pytorch_qnnp_status pytorch_qnnp_create_fully_connected_sparse_dq_nc_q8( size_t 
output_channels, uint8_t input_zero_point, const uint8_t* kernel_zero_points, - const uint32_t* kernel_col_indices, - const uint32_t* kernel_row_values, + const void* kernel_col_indices, + const void* kernel_row_values, const uint8_t* kernel_values, const uint32_t kernel_row_block_size, const uint32_t kernel_col_block_size, + enum pytorch_qnnp_sparse_matrix_indices_dtype kernel_indices_dtype, uint8_t output_zero_point, uint8_t output_min, uint8_t output_max, diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/fully-connected-sparse.c b/aten/src/ATen/native/quantized/cpu/qnnpack/src/fully-connected-sparse.c index 4feadadf9796..71226ab5250e 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/fully-connected-sparse.c +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/fully-connected-sparse.c @@ -26,11 +26,12 @@ enum pytorch_qnnp_status pytorch_qnnp_create_fully_connected_sparse_dq_nc_q8( size_t output_channels, uint8_t input_zero_point, const uint8_t* kernel_zero_points, - const uint32_t* kernel_col_indices, - const uint32_t* kernel_row_values, + const void* kernel_col_indices, + const void* kernel_row_values, const uint8_t* kernel_values, const uint32_t kernel_row_block_size, const uint32_t kernel_col_block_size, + enum pytorch_qnnp_sparse_matrix_indices_dtype kernel_indices_dtype, uint8_t output_zero_point, uint8_t output_min, uint8_t output_max, @@ -77,8 +78,34 @@ enum pytorch_qnnp_status pytorch_qnnp_create_fully_connected_sparse_dq_nc_q8( goto error; } } - fully_connected->sparse_matrix.col_indices = kernel_col_indices; - fully_connected->sparse_matrix.row_values = kernel_row_values; + + fully_connected->sparse_matrix.indices_dtype = kernel_indices_dtype; + switch (kernel_indices_dtype) { + case pytorch_qnnp_sparse_matrix_indices_dtype_uint32_t: + fully_connected->sparse_matrix.col_indices_w32 = + (const uint32_t*)kernel_col_indices; + fully_connected->sparse_matrix.row_values_w32 = + (const uint32_t*)kernel_row_values; + break; + case pytorch_qnnp_sparse_matrix_indices_dtype_uint16_t: + fully_connected->sparse_matrix.col_indices_w16 = + (const uint16_t*)kernel_col_indices; + fully_connected->sparse_matrix.row_values_w16 = + (const uint16_t*)kernel_row_values; + break; + case pytorch_qnnp_sparse_matrix_indices_dtype_uint8_t: + fully_connected->sparse_matrix.col_indices_w8 = + (const uint8_t*)kernel_col_indices; + fully_connected->sparse_matrix.row_values_w8 = + (const uint8_t*)kernel_row_values; + break; + case pytorch_qnnp_sparse_matrix_indices_dtype_invalid: + status = pytorch_qnnp_status_invalid_parameter; + pytorch_qnnp_log_error( + "Invalid indices dtype specified for qnnpack fully connected sparse"); + goto error; + } + fully_connected->sparse_matrix.values = kernel_values; fully_connected->sparse_matrix.row_block_size = kernel_row_block_size; fully_connected->sparse_matrix.col_block_size = kernel_col_block_size; diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/init.c b/aten/src/ATen/native/quantized/cpu/qnnpack/src/init.c index 8768349d8587..b2ea18c669c6 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/init.c +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/init.c @@ -61,7 +61,9 @@ static void init(void) { }; pytorch_qnnp_params.q8gemm_sparse_c1x4 = (struct pytorch_q8gemm_sparse_parameters){ .gemm_dq = NULL, - .packedA_gemm_dq = pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA__aarch32_neon, + .packedA_w32_gemm_dq = pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w32__aarch32_neon, + .packedA_w16_gemm_dq = 
pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w16__aarch32_neon, + .packedA_w8_gemm_dq = pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w8__aarch32_neon, .packA = pytorch_q8gemm_sparse_packA_ukernel_4x4__aarch32_neon, .mr = 4, .nr = 8, @@ -73,7 +75,9 @@ static void init(void) { }; pytorch_qnnp_params.q8gemm_sparse_c8x1 = (struct pytorch_q8gemm_sparse_parameters){ .gemm_dq = NULL, - .packedA_gemm_dq = pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA__aarch32_neon, + .packedA_w32_gemm_dq = pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w32__aarch32_neon, + .packedA_w16_gemm_dq = pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w16__aarch32_neon, + .packedA_w8_gemm_dq = pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w8__aarch32_neon, .packA = pytorch_q8gemm_sparse_packA_ukernel_4x4__aarch32_neon, .mr = 4, .nr = 8, @@ -169,7 +173,9 @@ static void init(void) { #elif CPUINFO_ARCH_ARM64 pytorch_qnnp_params.q8gemm_sparse_c1x4 = (struct pytorch_q8gemm_sparse_parameters){ .gemm_dq = NULL, - .packedA_gemm_dq = pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA__aarch64_neon, + .packedA_w32_gemm_dq = pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w32__aarch64_neon, + .packedA_w16_gemm_dq = pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w16__aarch64_neon, + .packedA_w8_gemm_dq = pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w8__aarch64_neon, .packA = pytorch_q8gemm_sparse_packA_ukernel_8x4__aarch64_neon, .mr = 8, .nr = 8, @@ -181,7 +187,9 @@ static void init(void) { }; pytorch_qnnp_params.q8gemm_sparse_c8x1 = (struct pytorch_q8gemm_sparse_parameters){ .gemm_dq = NULL, - .packedA_gemm_dq = pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA__aarch64_neon, + .packedA_w32_gemm_dq = pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w32__aarch64_neon, + .packedA_w16_gemm_dq = pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w16__aarch64_neon, + .packedA_w8_gemm_dq = pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w8__aarch64_neon, .packA = pytorch_q8gemm_sparse_packA_ukernel_8x4__aarch64_neon, .mr = 8, .nr = 8, @@ -265,7 +273,9 @@ static void init(void) { }; pytorch_qnnp_params.q8gemm_sparse_c1x4 = (struct pytorch_q8gemm_sparse_parameters){ .gemm_dq = NULL, - .packedA_gemm_dq = pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2, + .packedA_w32_gemm_dq = pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2, + .packedA_w16_gemm_dq = pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2, + .packedA_w8_gemm_dq = pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2, .packA = pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, .mr = 8, .nr = 4, @@ -277,7 +287,9 @@ static void init(void) { }; pytorch_qnnp_params.q8gemm_sparse_c8x1 = (struct pytorch_q8gemm_sparse_parameters){ .gemm_dq = NULL, - .packedA_gemm_dq = NULL, + .packedA_w32_gemm_dq = NULL, + .packedA_w16_gemm_dq = NULL, + .packedA_w8_gemm_dq = NULL, .packA = NULL, .mr = 4, .nr = 8, diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/operator-run.c b/aten/src/ATen/native/quantized/cpu/qnnpack/src/operator-run.c index b1757ebb7ec9..a9a8858fe2b1 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/operator-run.c +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/operator-run.c @@ -128,14 +128,28 @@ struct q8gemm_prepackA_sparse_dq_context { size_t a_packed_stride; size_t log2_mr; size_t log2_row_block_size; - const uint32_t* kernel_col_indices; - const uint32_t* kernel_row_values; + union { + const uint32_t* kernel_col_indices_w32; + const uint16_t* kernel_col_indices_w16; + const uint8_t* kernel_col_indices_w8; + 
}; + union { + const uint32_t* kernel_row_values_w32; + const uint16_t* kernel_row_values_w16; + const uint8_t* kernel_row_values_w8; + }; + enum pytorch_qnnp_sparse_matrix_indices_dtype kernel_indices_dtype; const uint8_t* kernel_values; const float* bias; float* c; // can be float or uint8)t size_t c_stride; struct pytorch_qnnp_conv_dynamic_quantization_params quantization_params; - const pytorch_q8gemm_dq_sparse_packedA_ukernel_function ukernel; + union { + // Not const because assigned after context is initialized + pytorch_q8gemm_dq_sparse_packedA_w32_ukernel_function ukernel_w32; + pytorch_q8gemm_dq_sparse_packedA_w16_ukernel_function ukernel_w16; + pytorch_q8gemm_dq_sparse_packedA_w8_ukernel_function ukernel_w8; + }; const pytorch_q8gemm_sparse_packA_ukernel_function prepack_ukernel; }; @@ -172,26 +186,66 @@ static void compute_q8gemm_prepacked_sparse_dq( size_t pixel_range, size_t mr_block_size, size_t nr_block_size) { - const uint8_t* restrict a_packed = context->a_packed; const size_t mr_packed_block_start = ((mr_block_start >> context->log2_mr) * context->a_packed_stride); - float* restrict c = (float*)context->c; + const uint8_t* restrict a_packed = context->a_packed + mr_packed_block_start; const size_t c_stride = context->c_stride; - - size_t output_channel_index = nr_block_start; - context->ukernel( - mr_block_size, - nr_block_size, - a_packed + mr_packed_block_start, - context->kernel_values, - context->kernel_row_values + - (nr_block_start >> context->log2_row_block_size), - context->kernel_col_indices, - context->bias + nr_block_start, - c + mr_block_start * c_stride + nr_block_start, - c_stride, - output_channel_index, - &context->quantization_params); + float* restrict c = + ((float*)context->c) + mr_block_start * c_stride + nr_block_start; + const size_t kernel_row_values_shift = + nr_block_start >> context->log2_row_block_size; + const float* bias = context->bias + nr_block_start; + const size_t output_channel_index = nr_block_start; + + switch (context->kernel_indices_dtype) { + case pytorch_qnnp_sparse_matrix_indices_dtype_uint32_t: + context->ukernel_w32( + mr_block_size, + nr_block_size, + a_packed, + context->kernel_values, + context->kernel_row_values_w32 + kernel_row_values_shift, + context->kernel_col_indices_w32, + bias, + c, + c_stride, + output_channel_index, + &context->quantization_params); + break; + case pytorch_qnnp_sparse_matrix_indices_dtype_uint16_t: + context->ukernel_w16( + mr_block_size, + nr_block_size, + a_packed, + context->kernel_values, + context->kernel_row_values_w16 + kernel_row_values_shift, + context->kernel_col_indices_w16, + bias, + c, + c_stride, + output_channel_index, + &context->quantization_params); + break; + case pytorch_qnnp_sparse_matrix_indices_dtype_uint8_t: + context->ukernel_w8( + mr_block_size, + nr_block_size, + a_packed, + context->kernel_values, + context->kernel_row_values_w8 + kernel_row_values_shift, + context->kernel_col_indices_w8, + bias, + c, + c_stride, + output_channel_index, + &context->quantization_params); + break; + case pytorch_qnnp_sparse_matrix_indices_dtype_invalid: + pytorch_qnnp_log_error( + "Invalid indices dtype specified for " + "operator-run compute_q8gemm_prepacked_sparse_dq"); + assert(false); + } } struct q8sum_rows_context { @@ -1094,7 +1148,8 @@ enum pytorch_qnnp_status pytorch_qnnp_run_operator( const size_t group_output_channels = op->group_output_channels; uint32_t mr, log2_mr, nr, kr, log2_row_block_size; pytorch_q8gemm_sparse_packA_ukernel_function prepack_kernel; - 
pytorch_q8gemm_dq_sparse_packedA_ukernel_function compute_kernel; + struct pytorch_q8gemm_sparse_parameters* pytorch_q8gemm_sparse_params = + NULL; // used to assign ukernel if (op->sparse_matrix.row_block_size == 1 && op->sparse_matrix.col_block_size == 4) { mr = pytorch_qnnp_params.q8gemm_sparse_c1x4.mr; @@ -1102,9 +1157,8 @@ enum pytorch_qnnp_status pytorch_qnnp_run_operator( log2_row_block_size = 0; nr = pytorch_qnnp_params.q8gemm_sparse_c1x4.nr; kr = pytorch_qnnp_params.q8gemm_sparse_c1x4.kr; - compute_kernel = - pytorch_qnnp_params.q8gemm_sparse_c1x4.packedA_gemm_dq; prepack_kernel = pytorch_qnnp_params.q8gemm_sparse_c1x4.packA; + pytorch_q8gemm_sparse_params = &pytorch_qnnp_params.q8gemm_sparse_c1x4; } else if (op->sparse_matrix.row_block_size == 8 && op->sparse_matrix.col_block_size == 1) { mr = pytorch_qnnp_params.q8gemm_sparse_c8x1.mr; @@ -1112,9 +1166,8 @@ enum pytorch_qnnp_status pytorch_qnnp_run_operator( log2_row_block_size = 3; nr = pytorch_qnnp_params.q8gemm_sparse_c8x1.nr; kr = pytorch_qnnp_params.q8gemm_sparse_c8x1.kr; - compute_kernel = - pytorch_qnnp_params.q8gemm_sparse_c8x1.packedA_gemm_dq; prepack_kernel = pytorch_qnnp_params.q8gemm_sparse_c8x1.packA; + pytorch_q8gemm_sparse_params = &pytorch_qnnp_params.q8gemm_sparse_c8x1; } else { return pytorch_qnnp_status_invalid_parameter; } @@ -1132,24 +1185,56 @@ enum pytorch_qnnp_status pytorch_qnnp_run_operator( } struct q8gemm_prepackA_sparse_dq_context - q8gemm_prepack_sparse_dq_context = { - .k = group_input_channels, - .a = op->input, - .a_stride = op->input_pixel_stride, - .a_packed = op->prepacked_a, - .a_packed_stride = k_stride * mr, - .log2_mr = log2_mr, - .log2_row_block_size = log2_row_block_size, - .kernel_col_indices = op->sparse_matrix.col_indices, - .kernel_row_values = op->sparse_matrix.row_values, - .kernel_values = op->sparse_matrix.values, - .bias = (const float*)op->bias, - .c = (float*)op->output, - .c_stride = op->output_pixel_stride, - .quantization_params = op->dynamic_conv_quantization_params, - .ukernel = compute_kernel, - .prepack_ukernel = prepack_kernel, - }; + q8gemm_prepack_sparse_dq_context = { + .k = group_input_channels, + .a = op->input, + .a_stride = op->input_pixel_stride, + .a_packed = op->prepacked_a, + .a_packed_stride = k_stride * mr, + .log2_mr = log2_mr, + .log2_row_block_size = log2_row_block_size, + .kernel_indices_dtype = op->sparse_matrix.indices_dtype, + .kernel_values = op->sparse_matrix.values, + .bias = (const float*)op->bias, + .c = (float*)op->output, + .c_stride = op->output_pixel_stride, + .quantization_params = op->dynamic_conv_quantization_params, + .prepack_ukernel = prepack_kernel, + // kernel_col_indices, kernel_row_values, and ukernel assigned + // below + }; + + switch (q8gemm_prepack_sparse_dq_context.kernel_indices_dtype) { + case pytorch_qnnp_sparse_matrix_indices_dtype_uint32_t: + q8gemm_prepack_sparse_dq_context.kernel_col_indices_w32 = + op->sparse_matrix.col_indices_w32; + q8gemm_prepack_sparse_dq_context.kernel_row_values_w32 = + op->sparse_matrix.row_values_w32; + q8gemm_prepack_sparse_dq_context.ukernel_w32 = + pytorch_q8gemm_sparse_params->packedA_w32_gemm_dq; + break; + case pytorch_qnnp_sparse_matrix_indices_dtype_uint16_t: + q8gemm_prepack_sparse_dq_context.kernel_col_indices_w16 = + op->sparse_matrix.col_indices_w16; + q8gemm_prepack_sparse_dq_context.kernel_row_values_w16 = + op->sparse_matrix.row_values_w16; + q8gemm_prepack_sparse_dq_context.ukernel_w16 = + pytorch_q8gemm_sparse_params->packedA_w16_gemm_dq; + break; + case 
pytorch_qnnp_sparse_matrix_indices_dtype_uint8_t: + q8gemm_prepack_sparse_dq_context.kernel_col_indices_w8 = + op->sparse_matrix.col_indices_w8; + q8gemm_prepack_sparse_dq_context.kernel_row_values_w8 = + op->sparse_matrix.row_values_w8; + q8gemm_prepack_sparse_dq_context.ukernel_w8 = + pytorch_q8gemm_sparse_params->packedA_w8_gemm_dq; + break; + case pytorch_qnnp_sparse_matrix_indices_dtype_invalid: + pytorch_qnnp_log_error( + "Invalid indices dtype specified for " + "operator-run pytorch_qnnp_ukernel_type_gemm_prepackA_sparse_dq"); + return pytorch_qnnp_status_invalid_parameter; + } // This batch size is not the actual batch size of the op // The batch size is modified in fully-connected-sparse.c diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/pack_block_sparse.cc b/aten/src/ATen/native/quantized/cpu/qnnpack/src/pack_block_sparse.cc deleted file mode 100644 index c837f55cda85..000000000000 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/pack_block_sparse.cc +++ /dev/null @@ -1,170 +0,0 @@ -/* - * Copyright (c) Facebook, Inc. and its affiliates. - * All rights reserved. - * - * This source code is licensed under the BSD-style license found in the - * LICENSE file in the root directory of this source tree. - */ -#include -#include -#include -#include -#include - -#include - -namespace qnnpack { -std::unique_ptr generateBlockCSRMatrix( - const uint8_t* a, - const size_t N, - const size_t K, - const uint32_t row_block_size, - const uint32_t col_block_size, - const uint8_t* zero_points) { - assert(K > 0); - std::unique_ptr bcsr_mat_ptr = std::make_unique(); - auto& bcsr_mat = *bcsr_mat_ptr; - auto& row_values = bcsr_mat.row_values.vector(); - auto& col_indices = bcsr_mat.col_indices.vector(); - auto& values = bcsr_mat.values.vector(); - - const uint32_t num_row_blocks = (N + row_block_size - 1) / row_block_size; - // K must be > 0 - const uint32_t num_col_blocks = (K + col_block_size - 1) / col_block_size; - - row_values.reserve(num_row_blocks); - uint32_t num_nnz_blocks{0}; - row_values.push_back(num_nnz_blocks); - for (uint32_t i = 0; i < num_row_blocks; ++i) { - for (uint32_t j = 0; j < num_col_blocks; ++j) { - bool block_zero{true}; - for (uint32_t ib = 0; ib < row_block_size; ++ib) { - uint32_t row_index = i * row_block_size + ib; - if PYTORCH_QNNP_UNLIKELY(row_index >= N) { - break; - } - for (uint32_t jb = 0; jb < col_block_size; ++jb) { - uint32_t col_index = j * col_block_size + jb; - if PYTORCH_QNNP_UNLIKELY(col_index >= K) { - goto block_scanned; - } - if (*(a + row_index * K + col_index) != zero_points[row_index]) { - block_zero = false; - goto block_scanned; - } - } - } -block_scanned: - if (!block_zero) { - col_indices.push_back(j); - num_nnz_blocks++; - for (uint32_t ib = 0; ib < row_block_size; ++ib) { - uint32_t row_index = i * row_block_size + ib; - if PYTORCH_QNNP_UNLIKELY(row_index >= N) { - for (; row_index < (num_row_blocks * row_block_size); row_index++) { - for (uint32_t jb = 0; jb < col_block_size; ++jb) { - values.push_back(zero_points[N-1]); - } - } - break; - } - for (uint32_t jb = 0; jb < col_block_size; ++jb) { - uint32_t col_index = j * col_block_size + jb; - if PYTORCH_QNNP_UNLIKELY(col_index >= K) { - values.push_back(zero_points[row_index]); - } else { - uint8_t val = *(a + row_index * K + col_index); - values.push_back(val); - } - } - } - } - } - row_values.push_back(num_nnz_blocks); - } - bcsr_mat.row_block_size = row_block_size; - bcsr_mat.col_block_size = col_block_size; - return bcsr_mat_ptr; -} - -std::unique_ptr generateBlockCSRMatrix( 
- uint32_t* col_indices, - uint32_t* row_values, - uint8_t* values, - const int64_t col_indices_size, - const int64_t row_values_size, - const int64_t values_size, - const int64_t row_block_size, - const int64_t col_block_size) { - std::unique_ptr bcsr_mat_ptr = std::make_unique(); - BCSRMatrix& bcsr_mat = *bcsr_mat_ptr; - bcsr_mat.col_indices = - OwnedOrBorrowedVector(col_indices, col_indices_size); - bcsr_mat.row_values = - OwnedOrBorrowedVector(row_values, row_values_size); - bcsr_mat.values = OwnedOrBorrowedVector(values, values_size); - bcsr_mat.row_block_size = row_block_size; - bcsr_mat.col_block_size = col_block_size; - return bcsr_mat_ptr; -} - -void BCSRMatrix::print() const { - std::cout << "row block size:" << row_block_size << std::endl; - std::cout << "col block size:" << col_block_size << std::endl; - std::cout << "row ptr\n"; - for (int i = 0; i < row_values.size(); i++) { - std::cout << row_values[i] << ", "; - } - std::cout << std::endl; - std::cout << "col indices\n"; - for (int i = 0; i < col_indices.size(); i++) { - std::cout << col_indices[i] << ", "; - } - std::cout << std::endl; - std::cout << "Actual values\n"; - for (int i = 0; i < values.size(); i++) { - std::cout << (uint32_t)values[i] << ", "; - } - std::cout << std::endl; -} - -void BCSRMatrix::unpack( - int8_t* dst, - const int64_t num_rows, - const int64_t num_cols, - const uint8_t* zero_points) const { - for (int64_t i = 0; i < num_rows; i++) { - memset( - dst + i * num_cols, - static_cast(static_cast(zero_points[i]) - 128), - num_cols * sizeof(int8_t)); - } - - const int64_t num_block_rows = static_cast(row_values.size()) - 1; - const int64_t block_size = (int64_t)row_block_size * col_block_size; - int64_t weight_values_num = 0; - for (int64_t block_row_num = 0; block_row_num < num_block_rows; - block_row_num++) { - const int64_t num_blocks_in_current_block_row = - row_values[block_row_num + 1] - row_values[block_row_num]; - for (int64_t k = 0; k < num_blocks_in_current_block_row; - k++) { // iterate over each block in the row - const int64_t block_start_row_num = block_row_num * row_block_size; - const int64_t block_start_col_num = - (int64_t)(col_indices[weight_values_num / block_size]) * - col_block_size; - for (int64_t l = 0; l < block_size; - l++) { // iterate over each value in the block - const int64_t row_num = block_start_row_num + l / col_block_size; - const int64_t col_num = block_start_col_num + l % col_block_size; - if (row_num < num_rows && col_num < num_cols) { - dst[row_num * num_cols + col_num] = static_cast( - static_cast(values[weight_values_num]) - 128); - } - weight_values_num++; - } - } - } -} - -} // namsepace qnnpack diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm/4x4c2-sse2.c b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm/4x4c2-sse2.c index 0b2da5a62bed..398496e08115 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm/4x4c2-sse2.c +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm/4x4c2-sse2.c @@ -327,14 +327,15 @@ void pytorch_q8gemm_ukernel_4x4c2__sse2( (uint32_t)_mm_cvtsi128_si32(_mm_unpackhi_epi32(vout, vout)); *((uint32_t*)c3) = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(vout, 12)); } else { + typedef PYTORCH_QNNP_UNALIGNED uint16_t unaligned_uint16_t; if (nr >= 2) { - *((uint16_t*)c0) = (uint16_t)_mm_extract_epi16(vout, 0); + *((unaligned_uint16_t*)c0) = (uint16_t)_mm_extract_epi16(vout, 0); c0 += 2; - *((uint16_t*)c1) = (uint16_t)_mm_extract_epi16(vout, 2); + *((unaligned_uint16_t*)c1) = (uint16_t)_mm_extract_epi16(vout, 
2); c1 += 2; - *((uint16_t*)c2) = (uint16_t)_mm_extract_epi16(vout, 4); + *((unaligned_uint16_t*)c2) = (uint16_t)_mm_extract_epi16(vout, 4); c2 += 2; - *((uint16_t*)c3) = (uint16_t)_mm_extract_epi16(vout, 6); + *((unaligned_uint16_t*)c3) = (uint16_t)_mm_extract_epi16(vout, 6); c3 += 2; vout = _mm_srli_epi32(vout, 16); nr -= 2; diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/4x8c1x4-dq-packedA-aarch32-neon.S b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/4x8c1x4-dq-packedA-aarch32-neon.S index 1d545734f6d4..5b796bb2563c 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/4x8c1x4-dq-packedA-aarch32-neon.S +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/4x8c1x4-dq-packedA-aarch32-neon.S @@ -9,6 +9,12 @@ #include #include +#ifndef __APPLE__ +#define NDEF_APPLE_SYMBOLS .arch armv7-a; .fpu neon +#else +#define NDEF_APPLE_SYMBOLS +#endif + # r0 mr # r1 nr # r2 packed_a @@ -60,7 +66,397 @@ # |----------------| # -# void pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA__aarch32_neon( +# void pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w##W_INDEX_DTYPE_NUM_BITS##__aarch32_neon( +# size_t mr, +# size_t nr, +# const uint8_t* a_packed, +# const uint8_t* packed_w, +# const uint##W_INDEX_DTYPE_NUM_BITS##_t* w_row_ptr, +# const uint##W_INDEX_DTYPE_NUM_BITS##_t* w_block_ids_ptr, +# const float* b, +# uint8_t* restrict c, +# size_t c_stride, +# size_t output_channel_index, +# const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) +#define MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_1X4_UKERNEL_4X8_PACKEDA__AARCH32_NEON(W_INDEX_DTYPE_NUM_BITS, W_INDEX_DTYPE_NUM_BYTES_ARG, W_INDEX_DTYPE_LOG_NUM_BYTES_ARG, LOAD_INDEX_INSTRUCTION) ;\ + BEGIN_FUNCTION pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w##W_INDEX_DTYPE_NUM_BITS##__aarch32_neon ;\ + .arm ;\ + NDEF_APPLE_SYMBOLS ;\ + ;\ + PUSH {r4, r5, r6, r7, r8, r9, r10, r11, lr} ;\ + VPUSH {d8-d15} ;\ + ;\ + /* Store nr in r11 as well for late user. */ ;\ + MOV r11, r1 ;\ + /* Load output channel index */ ;\ + LDR r5, [sp, 120] ;\ + /* Load quantization params */ ;\ + /* - r7 = quantization_params */ ;\ + LDR r7, [sp, 124] ;\ + /* Load input_zero_point */ ;\ + VLD1.8 {d16[]}, [r7] ;\ + ADD r7, r7, 4 ;\ + /* Load pointer to per channel zero points array */ ;\ + LDR r4, [r7] ;\ + /* Add output_channel_index to the b_zero_point pointer */ ;\ + ADD r4, r4, r5 ;\ + ;\ + /* We enter the loop if r1 is atleast 1. */ ;\ + /* r1 = r1 - 1 will happen in the epilogue */ ;\ + /* of the loop */ ;\ + CMP r1, 1 ;\ + BLO _7_w##W_INDEX_DTYPE_NUM_BITS ;\ + ;\ + /* Load w_row_ptr + n */ ;\ + LDR r5, [sp, 100] ;\ + /* r7 = blocks_id_ptr */ ;\ + LDR r7, [sp, 104] ;\ + ;\ + .p2align 5 ;\ + _0_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + VEOR q10, q10, q10 ;\ + VLD1.8 {d17[]}, [r4]! ;\ + /* ip = w_row_ptr[n], lr = w_row_ptr[n+1] */ ;\ + /* r5 = r5 + W_INDEX_DTYPE_NUM_BYTES_ARG to point to next n */ ;\ + LOAD_INDEX_INSTRUCTION ip, [r5], W_INDEX_DTYPE_NUM_BYTES_ARG ;\ + LOAD_INDEX_INSTRUCTION lr, [r5] ;\ + /* r6 = temp_packed_w = packed_w + w_row_ptr[n] * 4 */ ;\ + /* This points to the first block of nonzero value */ ;\ + /* for the nth row. 
*/ ;\ + ADD r6, r3, ip, LSL #2 ;\ + /* r9 = temp_w_block_ids_ptr = w_block_ids_ptr (r7) + w_row_ptr[n] */ ;\ + /* LSL for when elements are >1 byte */ ;\ + /* (4 bytes: LSL #2, 2 bytes: LSL #1, 1 byte: LSL #0) */ ;\ + /* This points to the block id of the first block */ ;\ + /* It should contain lr - ip number of block ids */ ;\ + ADD r9, r7, ip, LSL W_INDEX_DTYPE_LOG_NUM_BYTES_ARG ;\ + /* r8 = num_blocks that needs to be processed */ ;\ + SUB r8, lr, ip ;\ + SUBS r8, r8, 2 ;\ + BLO _1_w##W_INDEX_DTYPE_NUM_BITS ;\ + ;\ + k_loop_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + /* Load 2 non zero blocks of weights. Each block = 1x4. */ ;\ + VLD1.8 {d0}, [r6]! ;\ + ;\ + /* ip = block_id_ptr[0] */ ;\ + /* lr = block_id_ptr[1] */ ;\ + LOAD_INDEX_INSTRUCTION ip, [r9], W_INDEX_DTYPE_NUM_BYTES_ARG ;\ + LOAD_INDEX_INSTRUCTION lr, [r9], W_INDEX_DTYPE_NUM_BYTES_ARG ;\ + ;\ + /* Add offset to r2 */ ;\ + /* Shift by 4 because each packed block is a block of 4x4 */ ;\ + /* which 16 bytes */ ;\ + ADD r10, r2, ip, LSL #4 ;\ + /* q9 = vxb */ ;\ + VSUBL.U8 q0, d0, d17 ;\ + ;\ + /* d2, d3 = 4x4 transposed */ ;\ + VLD1.8 {d2}, [r10]! ;\ + VLD1.8 {d3}, [r10] ;\ + ;\ + ADD r10, r2, lr, LSL #4 ;\ + ;\ + VSUBL.U8 q4, d2, d16 /* vxa0_t */ ;\ + ;\ + /* d4, d5 = next 4x4 transposed */ ;\ + VLD1.8 {d4}, [r10]! ;\ + VLD1.8 {d5}, [r10] ;\ + ;\ + VSUBL.U8 q5, d3, d16 /* vxa1_t */ ;\ + VSUBL.U8 q6, d4, d16 /* vxa4_t */ ;\ + VSUBL.U8 q7, d5, d16 /* vxa5_t */ ;\ + ;\ + /* q4, q5 = 4x4 block (16 values each of 16 bits) */ ;\ + /* q6, q7 = 4x4 block (16 values each of 16 bits) */ ;\ + ;\ + VMLAL.S16 q10, d8, d0[0] ;\ + VMLAL.S16 q10, d9, d0[1] ;\ + VMLAL.S16 q10, d10, d0[2] ;\ + VMLAL.S16 q10, d11, d0[3] ;\ + VMLAL.S16 q10, d12, d1[0] ;\ + VMLAL.S16 q10, d13, d1[1] ;\ + VMLAL.S16 q10, d14, d1[2] ;\ + VMLAL.S16 q10, d15, d1[3] ;\ + ;\ + SUBS r8, r8, 2 ;\ + ;\ + BHS k_loop_w##W_INDEX_DTYPE_NUM_BITS ;\ + _1_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + CMP r8, -2 ;\ + BEQ _2_w##W_INDEX_DTYPE_NUM_BITS ;\ + ;\ + /* Load last nonzero block */ ;\ + /* For this we will load 4 8 bit values as one 32 bit value */ ;\ + VLD1.32 {d0[]}, [r6]! ;\ + /* q9 = vxb */ ;\ + VSUBL.U8 q0, d0, d17 ;\ + ;\ + /* ip = block_id_ptr[0] */ ;\ + LOAD_INDEX_INSTRUCTION ip, [r9] ;\ + ;\ + /* Add offset to r2 */ ;\ + /* Shift by 4 because each packed block is a block of 4x4 */ ;\ + /* which 16 bytes */ ;\ + ADD r10, r2, ip, LSL #4 ;\ + ;\ + VLD1.8 {d2}, [r10]! ;\ + VLD1.8 {d3}, [r10] ;\ + ;\ + VSUBL.U8 q4, d2, d16 /* vxa0_t */ ;\ + VSUBL.U8 q5, d3, d16 /* vxa1_t */ ;\ + ;\ + VMLAL.S16 q10, d8, d0[0] ;\ + VMLAL.S16 q10, d9, d0[1] ;\ + VMLAL.S16 q10, d10, d0[2] ;\ + VMLAL.S16 q10, d11, d0[3] ;\ + ;\ + .p2align 4 ;\ + _2_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + /* Store result on stack */ ;\ + ;\ + /* -12 because TOS - 4, TOS - 8, and TOS - 12, store mr, nr and pointer to weight zp */ ;\ + /* + 128 bytes of buffer when nr = 1 */ ;\ + /* This is needed because after processing all nrs we will */ ;\ + /* load 128 bytes from stack. This is for q10, q11 for max nr of 4 */ ;\ + /* Thus we will load accumulators back in q0, q1, q2, q3, q4, q5, q6, q7 */ ;\ + /* When nr < 4, extra q values will be fetched from stack which may overlap */ ;\ + /* with other parts of stack storing local variables. To avoid that we just */ ;\ + /* create a buffer of 128 bytes inbetween to make sure pointer increment */ ;\ + /* never produces address that is beyond the stack frame of this function. 
*/ ;\ + SUB r9, sp, 140 ;\ + /* Each iteration produce 4 values each of 4 bytes */ ;\ + /* Thus 4 x 4 = 16 bytes 2^4 */ ;\ + /* In this implementation, first value will be stored at */ ;\ + /* 1st value: sp - 12 - r1 * 16 */ ;\ + /* 2nd value: sp - 12 - (r1 - 1) * 16 */ ;\ + /* and so on. */ ;\ + SUB r9, r9, r1, LSL #4 ;\ + VST1.32 {q10}, [r9] ;\ + ;\ + /* Check if nr >=1 */ ;\ + SUBS r1, r1, 1 ;\ + BHI _0_w##W_INDEX_DTYPE_NUM_BITS ;\ + _3_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + /* First load all the accumulators from stack */ ;\ + /* Load nr */ ;\ + SUB r9, sp, 140 ;\ + SUB r9, r9, r11, LSL #4 ;\ + /* Now load q8-q15 */ ;\ + /* This is 8x4 block (nrxmr) */ ;\ + /* We will transpose this to 4x8 (mrxnr) */ ;\ + /* q8, q12 : x00, x10, x20, x30; x04, x14, x24, x34 */ ;\ + /* q9, q13 : x01, x11, x21, x31; x05, x15, x25, x35 */ ;\ + /* q10, q14 : x02, x12, x22, x32; x06, x16, x26, x36 */ ;\ + /* q11, q15 : x03, x13, x23, x33; x07, x17, x27, x37 */ ;\ + VLD1.32 {q8}, [r9]! ;\ + VLD1.32 {q9}, [r9]! ;\ + VLD1.32 {q10}, [r9]! ;\ + VLD1.32 {q11}, [r9]! ;\ + VLD1.32 {q12}, [r9]! ;\ + VLD1.32 {q13}, [r9]! ;\ + VLD1.32 {q14}, [r9]! ;\ + VLD1.32 {q15}, [r9] ;\ + ;\ + /*# Now transpose q8-11 */ ;\ + /* VTRN.32 q8, q9 */ ;\ + /* VTRN.32 q10, q11 */ ;\ + /* q8 : X00, x01, x20, x21 */ ;\ + /* q9 : X10, x11, x30, x31 */ ;\ + /* q10: X02, x03, x22, x23 */ ;\ + /* q11: X12, x13, x32, x33 */ ;\ + /* VSWP d16, d17 */ ;\ + /* q8 : x20, x21, x00, x01 */ ;\ + /* VEXT.32 q6, q8, q10, 2 */ ;\ + /* q6 : x00, x01, x02, x03 */ ;\ + /* VEXT.32 q10, q10, q8, 2 */ ;\ + /* q10: x22, x23, x20, x21 */ ;\ + /* VSWP d20, d21 */ ;\ + /* VMOV q8, q6 */ ;\ + /* q8 : X00, x01, x02, x03 */ ;\ + /* q10: x20, x21, x22, x23 */ ;\ + /* VSWP d18, d19 */ ;\ + /* q9 : x30, x31, x10, x11 */ ;\ + /* VEXT.32 q6, q9, q11, 2 */ ;\ + /* q6 : x10, x11, x12, x13 */ ;\ + /* VEXT.32 q11, q11, q9, 2 */ ;\ + /* q11: x32, x33, x30, x31 */ ;\ + /* VSWP d22, d23 */ ;\ + /* VMOV q9, q6 */ ;\ + /* q9 : x10, x11, x12, x13 */ ;\ + /* q11: x30, x31, x32, x33 */ ;\ + /* Thus we have */ ;\ + /* q8 : X00, x01, x02, x03 */ ;\ + /* q9 : X10, x11, x12, x13 */ ;\ + /* q10: X20, x21, x22, x23 */ ;\ + /* q11: X30, x31, x32, x33 */ ;\ + /* Now we can do the same for q4-q7 */ ;\ + /* q12: X04, X05, X06, X07 */ ;\ + /* q13: X14, X15, X16, X17 */ ;\ + /* q14: X24, X25, X26, X27 */ ;\ + /* q15: X34, X35, X36, X37 */ ;\ + ;\ + VTRN.32 q8, q9 ;\ + VTRN.32 q10, q11 ;\ + VSWP d16, d17 ;\ + VEXT.32 q6, q8, q10, 2 ;\ + VEXT.32 q10, q10, q8, 2 ;\ + VSWP d20, d21 ;\ + VMOV q8, q6 ;\ + VSWP d18, d19 ;\ + VEXT.32 q6, q9, q11, 2 ;\ + VEXT.32 q11, q11, q9, 2 ;\ + VSWP d22, d23 ;\ + VMOV q9, q6 ;\ + ;\ + VTRN.32 q12, q13 ;\ + VTRN.32 q14, q15 ;\ + VSWP d24, d25 ;\ + VEXT.32 q6, q12, q14, 2 ;\ + VEXT.32 q14, q14, q12, 2 ;\ + VSWP d28, d29 ;\ + VMOV q12, q6 ;\ + VSWP d26, d27 ;\ + VEXT.32 q6, q13, q15, 2 ;\ + VEXT.32 q15, q15, q13, 2 ;\ + VSWP d30, d31 ;\ + VMOV q13, q6 ;\ + ;\ + /* Load output channel index */ ;\ + LDR r5, [sp, 120] ;\ + /* Load quantization params */ ;\ + /* - r7 = quantization_params */ ;\ + LDR r7, [sp, 124] ;\ + ADD r7, r7, 8 ;\ + /* Load pointer to per channel requant scale */ ;\ + LDR r7, [r7] ;\ + /* Now r7 has the base_addr + offset for multipliers */ ;\ + ADD r7, r7, r5, LSL #2 ;\ + ;\ + LDR r6, [sp, 108] ;\ + /* Load q6: vmultiplier_c0123 */ ;\ + VLD1.32 {d12, d13}, [r7]! ;\ + /* Load q7: vmultiplier_c4567 */ ;\ + VLD1.32 {d14, d15}, [r7] ;\ + VCVT.F32.S32 q8, q8 ;\ + VCVT.F32.S32 q9, q9 ;\ + VCVT.F32.S32 q10, q10 ;\ + VLD1.32 {q0}, [r6]! 
;\ + VLD1.32 {q1}, [r6] ;\ + ;\ + VCVT.F32.S32 q11, q11 ;\ + VCVT.F32.S32 q12, q12 ;\ + VCVT.F32.S32 q13, q13 ;\ + VCVT.F32.S32 q14, q14 ;\ + VCVT.F32.S32 q15, q15 ;\ + ;\ + VMUL.F32 q8, q8, q6 ;\ + VMUL.F32 q9, q9, q6 ;\ + VMUL.F32 q10, q10, q6 ;\ + VMUL.F32 q11, q11, q6 ;\ + VMUL.F32 q12, q12, q7 ;\ + VMUL.F32 q13, q13, q7 ;\ + VMUL.F32 q14, q14, q7 ;\ + VMUL.F32 q15, q15, q7 ;\ + ;\ + VADD.F32 q8, q8, q0 ;\ + VADD.F32 q9, q9, q0 ;\ + VADD.F32 q10, q10, q0 ;\ + VADD.F32 q11, q11, q0 ;\ + VADD.F32 q12, q12, q1 ;\ + VADD.F32 q13, q13, q1 ;\ + VADD.F32 q14, q14, q1 ;\ + VADD.F32 q15, q15, q1 ;\ + ;\ + /* Load c, c_stride: */ ;\ + /* - r1 = c */ ;\ + /* - r9 = c_stride */ ;\ + LDR r1, [sp, 112] ;\ + LDR r9, [sp, 116] ;\ + LSL r9, r9, 2 ;\ + ;\ + /* r1 = c0 = c pointer */ ;\ + ;\ + CMP r0, 2 ;\ + /* r2 = c1 */ ;\ + ADD r2, r1, r9 ;\ + MOVLO r2, r1 ;\ + ;\ + /* r3 = c2 */ ;\ + ADD r3, r2, r9 ;\ + MOVLS r3, r2 ;\ + ;\ + CMP r0, 4 ;\ + /* r4 = c3 */ ;\ + ADD r4, r3, r9 ;\ + MOVNE r4, r3 ;\ + ;\ + CMP r11, 8 ;\ + BNE _4_w##W_INDEX_DTYPE_NUM_BITS ;\ + ;\ + VST1.32 {q8}, [r1]! ;\ + VST1.32 {q9}, [r2]! ;\ + VST1.32 {q10}, [r3]! ;\ + VST1.32 {q11}, [r4]! ;\ + VST1.32 {q12}, [r1] ;\ + VST1.32 {q13}, [r2] ;\ + VST1.32 {q14}, [r3] ;\ + VST1.32 {q15}, [r4] ;\ + ;\ + VPOP {d8-d15} ;\ + POP {r4, r5, r6, r7, r8, r9, r10, r11, lr} ;\ + BX lr ;\ + ;\ + .p2align 3 ;\ + _4_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + CMP r11, 4 ;\ + BLO _5_w##W_INDEX_DTYPE_NUM_BITS ;\ + ;\ + VST1.32 {q8}, [r1]! ;\ + VST1.32 {q9}, [r2]! ;\ + VST1.32 {q10}, [r3]! ;\ + VST1.32 {q11}, [r4]! ;\ + ;\ + SUB r11, 4 ;\ + ;\ + VMOV.32 q8, q12 ;\ + VMOV.32 q9, q13 ;\ + VMOV.32 q10, q14 ;\ + VMOV.32 q11, q15 ;\ + ;\ + _5_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + CMP r11, 2 ;\ + BLO _6_w##W_INDEX_DTYPE_NUM_BITS ;\ + ;\ + VST1.32 {d16}, [r1]! ;\ + VST1.32 {d18}, [r2]! ;\ + VST1.32 {d20}, [r3]! ;\ + VST1.32 {d22}, [r4]! ;\ + ;\ + SUB r11, 2 ;\ + ;\ + VEXT.32 q8, q8, 2 ;\ + VEXT.32 q9, q9, 2 ;\ + VEXT.32 q10, q10, 2 ;\ + VEXT.32 q11, q11, 2 ;\ + ;\ + _6_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + TEQ r11, 0 ;\ + BEQ _7_w##W_INDEX_DTYPE_NUM_BITS ;\ + ;\ + VST1.32 {d16[0]}, [r1] ;\ + VST1.32 {d18[0]}, [r2] ;\ + VST1.32 {d20[0]}, [r3] ;\ + VST1.32 {d22[0]}, [r4] ;\ + ;\ + _7_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + VPOP {d8-d15} ;\ + POP {r4, r5, r6, r7, r8, r9, r10, r11, lr} ;\ + BX lr ;\ + ;\ + END_FUNCTION pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w##W_INDEX_DTYPE_NUM_BITS##__aarch32_neon + +# void pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w32__aarch32_neon( # size_t mr, # size_t nr, # const uint8_t* a_packed, @@ -72,385 +468,39 @@ # size_t c_stride, # size_t output_channel_index, # const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) -BEGIN_FUNCTION pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA__aarch32_neon - .arm -#ifndef __APPLE__ - .arch armv7-a - .fpu neon -#endif - - PUSH {r4, r5, r6, r7, r8, r9, r10, r11, lr} - VPUSH {d8-d15} - - # Store nr in r11 as well for late user. - MOV r11, r1 - # Load output channel index - LDR r5, [sp, 120] - # Load quantization params - # - r7 = quantization_params - LDR r7, [sp, 124] - # Load input_zero_point - VLD1.8 {d16[]}, [r7] - ADD r7, r7, 4 - # Load pointer to per channel zero points array - LDR r4, [r7] - # Add output_channel_index to the b_zero_point pointer - ADD r4, r4, r5 - - # We enter the loop if r1 is atleast 1. 
- # r1 = r1 - 1 will happen in the epilogue - # of the loop - CMP r1, 1 - BLO 7f - - # Load w_row_ptr + n - LDR r5, [sp, 100] - # r7 = blocks_id_ptr - LDR r7, [sp, 104] - - .p2align 5 -0: - VEOR q10, q10, q10 - VLD1.8 {d17[]}, [r4]! - # ip = w_row_ptr[n], lr = w_row_ptr[n+1] - # r5 = r5 + 4 to point to next n - LDR ip, [r5], #4 - LDR lr, [r5] - # r6 = temp_packed_w = packed_w + w_row_ptr[n] * 4 - # This points to the first block of nonzero value - # for the nth row. - ADD r6, r3, ip, LSL #2 - # r9 = temp_w_block_ids_ptr = w_block_ids_ptr (r7) + w_row_ptr[n] - # LSL2 because each element is 4 bytes - # This points to the block id of the first block - # It should contain lr - ip number of block ids - ADD r9, r7, ip, LSL #2 - # r8 = num_blocks that needs to be processed - SUB r8, lr, ip - SUBS r8, r8, 2 - BLO 1f - -k_loop: - # Load 2 non zero blocks of weights. Each block = 1x4. - VLD1.8 {d0}, [r6]! - - #ip = block_id_ptr[0] - #lr = block_id_ptr[1] - LDR ip, [r9], #4 - LDR lr, [r9], #4 - - # Add offset to r2 - # Shift by 4 because each packed block is a block of 4x4 - # which 16 bytes - ADD r10, r2, ip, LSL #4 - # q9 = vxb - VSUBL.U8 q0, d0, d17 - - # d2, d3 = 4x4 transposed - VLD1.8 {d2}, [r10]! - VLD1.8 {d3}, [r10] - - ADD r10, r2, lr, LSL #4 - - VSUBL.U8 q4, d2, d16 // vxa0_t - - # d4, d5 = next 4x4 transposed - VLD1.8 {d4}, [r10]! - VLD1.8 {d5}, [r10] - - VSUBL.U8 q5, d3, d16 // vxa1_t - VSUBL.U8 q6, d4, d16 // vxa4_t - VSUBL.U8 q7, d5, d16 // vxa5_t - - # q4, q5 = 4x4 block (16 values each of 16 bits) - # q6, q7 = 4x4 block (16 values each of 16 bits) - - VMLAL.S16 q10, d8, d0[0] - VMLAL.S16 q10, d9, d0[1] - VMLAL.S16 q10, d10, d0[2] - VMLAL.S16 q10, d11, d0[3] - VMLAL.S16 q10, d12, d1[0] - VMLAL.S16 q10, d13, d1[1] - VMLAL.S16 q10, d14, d1[2] - VMLAL.S16 q10, d15, d1[3] - - SUBS r8, r8, 2 - - BHS k_loop -1: - CMP r8, -2 - BEQ 2f - - # Load last nonzero block - # For this we will load 4 8 bit values as one 32 bit value - VLD1.32 {d0[]}, [r6]! - # q9 = vxb - VSUBL.U8 q0, d0, d17 - - #ip = block_id_ptr[0] - LDR ip, [r9] - - # Add offset to r2 - # Shift by 4 because each packed block is a block of 4x4 - # which 16 bytes - ADD r10, r2, ip, LSL #4 - - VLD1.8 {d2}, [r10]! - VLD1.8 {d3}, [r10] - - VSUBL.U8 q4, d2, d16 // vxa0_t - VSUBL.U8 q5, d3, d16 // vxa1_t - - VMLAL.S16 q10, d8, d0[0] - VMLAL.S16 q10, d9, d0[1] - VMLAL.S16 q10, d10, d0[2] - VMLAL.S16 q10, d11, d0[3] - - .p2align 4 -2: - # Store result on stack - - # -12 because TOS - 4, TOS - 8, and TOS - 12, store mr, nr and pointer to weight zp - # + 128 bytes of buffer when nr = 1 - # This is needed because after processing all nrs we will - # load 128 bytes from stack. This is for q10, q11 for max nr of 4 - # Thus we will load accumulators back in q0, q1, q2, q3, q4, q5, q6, q7 - # When nr < 4, extra q values will be fetched from stack which may overlap - # with other parts of stack storing local variables. To avoid that we just - # create a buffer of 128 bytes inbetween to make sure pointer increment - # never produces address that is beyond the stack frame of this function. - SUB r9, sp, 140 - # Each iteration produce 4 values each of 4 bytes - # Thus 4 x 4 = 16 bytes 2^4 - # In this implementation, first value will be stored at - # 1st value: sp - 12 - r1 * 16 - # 2nd value: sp - 12 - (r1 - 1) * 16 - # and so on. 
- SUB r9, r9, r1, LSL #4 - VST1.32 {q10}, [r9] - - # Check if nr >=1 - SUBS r1, r1, 1 - BHI 0b -3: - # First load all the accumulators from stack - # Load nr - SUB r9, sp, 140 - SUB r9, r9, r11, LSL #4 - # Now load q8-q15 - # This is 8x4 block (nrxmr) - # We will transpose this to 4x8 (mrxnr) - # q8, q12 : x00, x10, x20, x30; x04, x14, x24, x34 - # q9, q13 : x01, x11, x21, x31; x05, x15, x25, x35 - # q10, q14 : x02, x12, x22, x32; x06, x16, x26, x36 - # q11, q15 : x03, x13, x23, x33; x07, x17, x27, x37 - VLD1.32 {q8}, [r9]! - VLD1.32 {q9}, [r9]! - VLD1.32 {q10}, [r9]! - VLD1.32 {q11}, [r9]! - VLD1.32 {q12}, [r9]! - VLD1.32 {q13}, [r9]! - VLD1.32 {q14}, [r9]! - VLD1.32 {q15}, [r9] - - ## Now transpose q8-11 - # VTRN.32 q8, q9 - # VTRN.32 q10, q11 - # q8 : X00, x01, x20, x21 - # q9 : X10, x11, x30, x31 - # q10: X02, x03, x22, x23 - # q11: X12, x13, x32, x33 - # VSWP d16, d17 - # q8 : x20, x21, x00, x01 - # VEXT.32 q6, q8, q10, 2 - # q6 : x00, x01, x02, x03 - # VEXT.32 q10, q10, q8, 2 - # q10: x22, x23, x20, x21 - # VSWP d20, d21 - # VMOV q8, q6 - # q8 : X00, x01, x02, x03 - # q10: x20, x21, x22, x23 - # VSWP d18, d19 - # q9 : x30, x31, x10, x11 - # VEXT.32 q6, q9, q11, 2 - # q6 : x10, x11, x12, x13 - # VEXT.32 q11, q11, q9, 2 - # q11: x32, x33, x30, x31 - # VSWP d22, d23 - # VMOV q9, q6 - # q9 : x10, x11, x12, x13 - # q11: x30, x31, x32, x33 - # Thus we have - # q8 : X00, x01, x02, x03 - # q9 : X10, x11, x12, x13 - # q10: X20, x21, x22, x23 - # q11: X30, x31, x32, x33 - # Now we can do the same for q4-q7 - # q12: X04, X05, X06, X07 - # q13: X14, X15, X16, X17 - # q14: X24, X25, X26, X27 - # q15: X34, X35, X36, X37 - - VTRN.32 q8, q9 - VTRN.32 q10, q11 - VSWP d16, d17 - VEXT.32 q6, q8, q10, 2 - VEXT.32 q10, q10, q8, 2 - VSWP d20, d21 - VMOV q8, q6 - VSWP d18, d19 - VEXT.32 q6, q9, q11, 2 - VEXT.32 q11, q11, q9, 2 - VSWP d22, d23 - VMOV q9, q6 - - VTRN.32 q12, q13 - VTRN.32 q14, q15 - VSWP d24, d25 - VEXT.32 q6, q12, q14, 2 - VEXT.32 q14, q14, q12, 2 - VSWP d28, d29 - VMOV q12, q6 - VSWP d26, d27 - VEXT.32 q6, q13, q15, 2 - VEXT.32 q15, q15, q13, 2 - VSWP d30, d31 - VMOV q13, q6 +MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_1X4_UKERNEL_4X8_PACKEDA__AARCH32_NEON(32, #4, #2, LDR) - # Load output channel index - LDR r5, [sp, 120] - # Load quantization params - # - r7 = quantization_params - LDR r7, [sp, 124] - ADD r7, r7, 8 - # Load pointer to per channel requant scale - LDR r7, [r7] - # Now r7 has the base_addr + offset for multipliers - ADD r7, r7, r5, LSL #2 - - LDR r6, [sp, 108] - # Load q6: vmultiplier_c0123 - VLD1.32 {d12, d13}, [r7]! - # Load q7: vmultiplier_c4567 - VLD1.32 {d14, d15}, [r7] - VCVT.F32.S32 q8, q8 - VCVT.F32.S32 q9, q9 - VCVT.F32.S32 q10, q10 - VLD1.32 {q0}, [r6]! 
- VLD1.32 {q1}, [r6] - - VCVT.F32.S32 q11, q11 - VCVT.F32.S32 q12, q12 - VCVT.F32.S32 q13, q13 - VCVT.F32.S32 q14, q14 - VCVT.F32.S32 q15, q15 - - VMUL.F32 q8, q8, q6 - VMUL.F32 q9, q9, q6 - VMUL.F32 q10, q10, q6 - VMUL.F32 q11, q11, q6 - VMUL.F32 q12, q12, q7 - VMUL.F32 q13, q13, q7 - VMUL.F32 q14, q14, q7 - VMUL.F32 q15, q15, q7 - - VADD.F32 q8, q8, q0 - VADD.F32 q9, q9, q0 - VADD.F32 q10, q10, q0 - VADD.F32 q11, q11, q0 - VADD.F32 q12, q12, q1 - VADD.F32 q13, q13, q1 - VADD.F32 q14, q14, q1 - VADD.F32 q15, q15, q1 - - # Load c, c_stride: - # - r1 = c - # - r9 = c_stride - LDR r1, [sp, 112] - LDR r9, [sp, 116] - LSL r9, r9, 2 - - # r1 = c0 = c pointer - - CMP r0, 2 - # r2 = c1 - ADD r2, r1, r9 - MOVLO r2, r1 - - # r3 = c2 - ADD r3, r2, r9 - MOVLS r3, r2 - - CMP r0, 4 - # r4 = c3 - ADD r4, r3, r9 - MOVNE r4, r3 - - CMP r11, 8 - BNE 4f - - VST1.32 {q8}, [r1]! - VST1.32 {q9}, [r2]! - VST1.32 {q10}, [r3]! - VST1.32 {q11}, [r4]! - VST1.32 {q12}, [r1] - VST1.32 {q13}, [r2] - VST1.32 {q14}, [r3] - VST1.32 {q15}, [r4] - - VPOP {d8-d15} - POP {r4, r5, r6, r7, r8, r9, r10, r11, lr} - BX lr - - .p2align 3 -4: - CMP r11, 4 - BLO 5f - - VST1.32 {q8}, [r1]! - VST1.32 {q9}, [r2]! - VST1.32 {q10}, [r3]! - VST1.32 {q11}, [r4]! - - SUB r11, 4 - - VMOV.32 q8, q12 - VMOV.32 q9, q13 - VMOV.32 q10, q14 - VMOV.32 q11, q15 - -5: - CMP r11, 2 - BLO 6f - - VST1.32 {d16}, [r1]! - VST1.32 {d18}, [r2]! - VST1.32 {d20}, [r3]! - VST1.32 {d22}, [r4]! - - SUB r11, 2 - - VEXT.32 q8, q8, 2 - VEXT.32 q9, q9, 2 - VEXT.32 q10, q10, 2 - VEXT.32 q11, q11, 2 - -6: - TEQ r11, 0 - BEQ 7f - - VST1.32 {d16[0]}, [r1] - VST1.32 {d18[0]}, [r2] - VST1.32 {d20[0]}, [r3] - VST1.32 {d22[0]}, [r4] - -7: - VPOP {d8-d15} - POP {r4, r5, r6, r7, r8, r9, r10, r11, lr} - BX lr +# void pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w16__aarch32_neon( +# size_t mr, +# size_t nr, +# const uint8_t* a_packed, +# const uint8_t* packed_w, +# const uint16_t* w_row_ptr, +# const uint16_t* w_block_ids_ptr, +# const float* b, +# uint8_t* restrict c, +# size_t c_stride, +# size_t output_channel_index, +# const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) +MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_1X4_UKERNEL_4X8_PACKEDA__AARCH32_NEON(16, #2, #1, LDRH) -END_FUNCTION pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA__aarch32_neon +# void pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w8__aarch32_neon( +# size_t mr, +# size_t nr, +# const uint8_t* a_packed, +# const uint8_t* packed_w, +# const uint8_t* w_row_ptr, +# const uint8_t* w_block_ids_ptr, +# const float* b, +# uint8_t* restrict c, +# size_t c_stride, +# size_t output_channel_index, +# const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) +MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_1X4_UKERNEL_4X8_PACKEDA__AARCH32_NEON(8, #1, #0, LDRB) #ifdef __ELF__ .section ".note.GNU-stack","",%progbits #endif + +#undef NDEF_APPLE_SYMBOLS +#undef MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_1X4_UKERNEL_4X8_PACKEDA__AARCH32_NEON diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/4x8c8x1-dq-packedA-aarch32-neon.S b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/4x8c8x1-dq-packedA-aarch32-neon.S index 109307d082d1..dd829f80e373 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/4x8c8x1-dq-packedA-aarch32-neon.S +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/4x8c8x1-dq-packedA-aarch32-neon.S @@ -9,6 +9,12 @@ #include #include +#ifndef __APPLE__ +#define NDEF_APPLE_SYMBOLS .arch armv7-a; 
.fpu neon +#else +#define NDEF_APPLE_SYMBOLS +#endif + # r0 mr # r1 nr # r2 packed_a @@ -60,7 +66,306 @@ # |----------------| # -# void pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA__aarch32_neon( +# void pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w##W_INDEX_DTYPE_NUM_BITS##__aarch32_neon( +# size_t mr, +# size_t nr, +# const uint8_t* a_packed, +# const uint8_t* packed_w, +# const uint##W_INDEX_DTYPE_NUM_BITS##_t* w_row_ptr, +# const uint##W_INDEX_DTYPE_NUM_BITS##_t* w_block_ids_ptr, +# const float* b, +# uint8_t* restrict c, +# size_t c_stride, +# size_t output_channel_index, +# const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) +#define MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_8X1_UKERNEL_4X8_PACKEDA__AARCH32_NEON(W_INDEX_DTYPE_NUM_BITS, W_INDEX_DTYPE_NUM_BYTES_ARG, W_INDEX_DTYPE_LOG_NUM_BYTES_ARG, LOAD_INDEX_INSTRUCTION) ;\ + BEGIN_FUNCTION pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w##W_INDEX_DTYPE_NUM_BITS##__aarch32_neon ;\ + .arm ;\ + NDEF_APPLE_SYMBOLS ;\ + ;\ + PUSH {r4, r5, r6, r7, r8, r9, r10, r11, lr} ;\ + VPUSH {d8-d15} ;\ + ;\ + /* Store nr in r11 as well for late user. */ ;\ + MOV r11, r1 ;\ + /* Load output channel index */ ;\ + LDR r5, [sp, 120] ;\ + /* Load quantization params */ ;\ + /* - r7 = quantization_params */ ;\ + LDR r7, [sp, 124] ;\ + /* Load input_zero_point */ ;\ + VLD1.8 {d14[]}, [r7] ;\ + ADD r7, r7, 4 ;\ + /* Load pointer to per channel zero points array */ ;\ + LDR r4, [r7] ;\ + /* Add output_channel_index to the b_zero_point pointer */ ;\ + ADD r4, r4, r5 ;\ + ;\ + /* Load w_row_ptr + n */ ;\ + LDR r5, [sp, 100] ;\ + /* r7 = blocks_id_ptr */ ;\ + LDR r7, [sp, 104] ;\ + ;\ + VEOR q8, q8, q8 ;\ + VEOR q9, q9, q9 ;\ + VEOR q10, q10, q10 ;\ + VEOR q11, q11, q11 ;\ + VEOR q12, q12, q12 ;\ + VEOR q13, q13, q13 ;\ + VEOR q14, q14, q14 ;\ + VEOR q15, q15, q15 ;\ + VLD1.8 {d15}, [r4] ;\ + /* ip = w_row_ptr[n], lr = w_row_ptr[n+1] */ ;\ + /* r5 = r5 + W_INDEX_DTYPE_NUM_BYTES_ARG to point to next n */ ;\ + LOAD_INDEX_INSTRUCTION ip, [r5], W_INDEX_DTYPE_NUM_BYTES_ARG ;\ + LOAD_INDEX_INSTRUCTION lr, [r5] ;\ + /* r6 = temp_packed_w = packed_w + w_row_ptr[n] * 8 */ ;\ + /* * 8 because each block contains 8 values */ ;\ + /* This points to the first block of nonzero value */ ;\ + /* for the nth row. */ ;\ + ADD r6, r3, ip, LSL #3 ;\ + /* r9 = temp_w_block_ids_ptr = w_block_ids_ptr (r7) + w_row_ptr[n] */ ;\ + /* LSL for when elements are >1 byte */ ;\ + /* (4 bytes: LSL #2, 2 bytes: LSL #1, 1 byte: LSL #0) */ ;\ + /* This points to the col block id of the first block */ ;\ + /* It should contain lr - ip number of block ids */ ;\ + /* Note that in this kernel sparsity pattern is 8x1. */ ;\ + /* Thus each block contains only 1 k as opposed to */ ;\ + /* 1x4 where each block contains 4 k. */ ;\ + ADD r9, r7, ip, LSL W_INDEX_DTYPE_LOG_NUM_BYTES_ARG ;\ + /* r8 = num_blocks that needs to be processed */ ;\ + SUB r8, lr, ip ;\ + SUBS r8, r8, 2 ;\ + BLO _1_w##W_INDEX_DTYPE_NUM_BITS ;\ + ;\ + .p2align 5 ;\ + k_loop_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + /* Load 2 non zero blocks of weights. Each block = 8x1. */ ;\ + VLD1.8 {d0}, [r6]! ;\ + VLD1.8 {d2}, [r6]! 
;\ + ;\ + /* ip = block_id_ptr[0] */ ;\ + /* lr = block_id_ptr[1] */ ;\ + LOAD_INDEX_INSTRUCTION ip, [r9], W_INDEX_DTYPE_NUM_BYTES_ARG ;\ + LOAD_INDEX_INSTRUCTION lr, [r9], W_INDEX_DTYPE_NUM_BYTES_ARG ;\ + ;\ + /* Add offset to r2 */ ;\ + /* Shift by 4 because each packed block is a block of 4x1 */ ;\ + /* which 4 bytes */ ;\ + ADD r10, r2, ip, LSL #2 ;\ + /* q9 = vxb */ ;\ + VSUBL.U8 q0, d0, d15 ;\ + VSUBL.U8 q1, d2, d15 ;\ + ;\ + /* d4 = 4x1 transposed */ ;\ + VLD1.32 {d4[]}, [r10] ;\ + ;\ + ADD r10, r2, lr, LSL #2 ;\ + ;\ + VSUBL.U8 q2, d4, d14 /* vxa0_t */ ;\ + ;\ + /* d5 = next 4x1 transposed */ ;\ + VLD1.32 {d6[]}, [r10] ;\ + ;\ + VSUBL.U8 q3, d6, d14 /* vxa1_t */ ;\ + ;\ + /* q0 = d0, d1 = 8x1 block of weight for k */ ;\ + /* q1 = d2, d3 = 8x1 block of weight for k + 1 */ ;\ + /* q2's d4 = 4x1 block of activation for k */ ;\ + /* q3's d6 = 4x1 block of activation for k + 1 */ ;\ + ;\ + /* Generate 4x8 block as two 4x4 blocks */ ;\ + ;\ + VMLAL.S16 q8, d0, d4[0] ;\ + VMLAL.S16 q9, d1, d4[0] ;\ + VMLAL.S16 q10, d0, d4[1] ;\ + VMLAL.S16 q11, d1, d4[1] ;\ + VMLAL.S16 q12, d0, d4[2] ;\ + VMLAL.S16 q13, d1, d4[2] ;\ + VMLAL.S16 q14, d0, d4[3] ;\ + VMLAL.S16 q15, d1, d4[3] ;\ + ;\ + VMLAL.S16 q8, d2, d6[0] ;\ + VMLAL.S16 q9, d3, d6[0] ;\ + VMLAL.S16 q10, d2, d6[1] ;\ + VMLAL.S16 q11, d3, d6[1] ;\ + VMLAL.S16 q12, d2, d6[2] ;\ + VMLAL.S16 q13, d3, d6[2] ;\ + VMLAL.S16 q14, d2, d6[3] ;\ + VMLAL.S16 q15, d3, d6[3] ;\ + ;\ + SUBS r8, r8, 2 ;\ + ;\ + BHS k_loop_w##W_INDEX_DTYPE_NUM_BITS ;\ + _1_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + CMP r8, -2 ;\ + BEQ _3_w##W_INDEX_DTYPE_NUM_BITS ;\ + ;\ + /* Load last nonzero block */ ;\ + /* For this we will load 4 8 bit values as one 32 bit value */ ;\ + VLD1.8 {d0}, [r6] ;\ + /* q9 = vxb */ ;\ + VSUBL.U8 q0, d0, d15 ;\ + ;\ + /* ip = block_id_ptr[0] */ ;\ + LOAD_INDEX_INSTRUCTION ip, [r9] ;\ + ;\ + /* Add offset to r2 */ ;\ + /* Shift by 4 because each packed block is a block of 4x1 */ ;\ + /* which 4 bytes */ ;\ + ADD r10, r2, ip, LSL #2 ;\ + ;\ + VLD1.32 {d4[]}, [r10]! ;\ + ;\ + VSUBL.U8 q2, d4, d14 /* vxa0_t */ ;\ + ;\ + VMLAL.S16 q8, d0, d4[0] ;\ + VMLAL.S16 q9, d1, d4[0] ;\ + VMLAL.S16 q10, d0, d4[1] ;\ + VMLAL.S16 q11, d1, d4[1] ;\ + VMLAL.S16 q12, d0, d4[2] ;\ + VMLAL.S16 q13, d1, d4[2] ;\ + VMLAL.S16 q14, d0, d4[3] ;\ + VMLAL.S16 q15, d1, d4[3] ;\ + ;\ + ;\ + .p2align 4 ;\ + _3_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + /* Load output channel index */ ;\ + LDR r5, [sp, 120] ;\ + /* Load quantization params */ ;\ + /* - r7 = quantization_params */ ;\ + LDR r7, [sp, 124] ;\ + ADD r7, r7, 8 ;\ + /* Load pointer to per channel requant scale */ ;\ + LDR r7, [r7] ;\ + /* Now r7 has the base_addr + offset for multipliers */ ;\ + ADD r7, r7, r5, LSL #2 ;\ + ;\ + LDR r6, [sp, 108] ;\ + /* Load q6: vmultiplier_c0123 */ ;\ + VLD1.32 {d12, d13}, [r7]! ;\ + /* Load q7: vmultiplier_c4567 */ ;\ + VLD1.32 {d14, d15}, [r7] ;\ + VCVT.F32.S32 q8, q8 ;\ + VCVT.F32.S32 q9, q9 ;\ + VCVT.F32.S32 q10, q10 ;\ + VLD1.32 {q0}, [r6]! 
;\ + VLD1.32 {q1}, [r6] ;\ + ;\ + VCVT.F32.S32 q11, q11 ;\ + VCVT.F32.S32 q12, q12 ;\ + VCVT.F32.S32 q13, q13 ;\ + VCVT.F32.S32 q14, q14 ;\ + VCVT.F32.S32 q15, q15 ;\ + ;\ + VMUL.F32 q8, q8, q6 ;\ + VMUL.F32 q9, q9, q7 ;\ + VMUL.F32 q10, q10, q6 ;\ + VMUL.F32 q11, q11, q7 ;\ + VMUL.F32 q12, q12, q6 ;\ + VMUL.F32 q13, q13, q7 ;\ + VMUL.F32 q14, q14, q6 ;\ + VMUL.F32 q15, q15, q7 ;\ + ;\ + VADD.F32 q8, q8, q0 ;\ + VADD.F32 q9, q9, q1 ;\ + VADD.F32 q10, q10, q0 ;\ + VADD.F32 q11, q11, q1 ;\ + VADD.F32 q12, q12, q0 ;\ + VADD.F32 q13, q13, q1 ;\ + VADD.F32 q14, q14, q0 ;\ + VADD.F32 q15, q15, q1 ;\ + ;\ + /* Load c, c_stride: */ ;\ + /* - r1 = c */ ;\ + /* - r9 = c_stride */ ;\ + LDR r1, [sp, 112] ;\ + LDR r9, [sp, 116] ;\ + LSL r9, r9, 2 ;\ + ;\ + /* r1 = c0 = c pointer */ ;\ + ;\ + CMP r0, 2 ;\ + /* r2 = c1 */ ;\ + ADD r2, r1, r9 ;\ + MOVLO r2, r1 ;\ + ;\ + /* r3 = c2 */ ;\ + ADD r3, r2, r9 ;\ + MOVLS r3, r2 ;\ + ;\ + CMP r0, 4 ;\ + /* r4 = c3 */ ;\ + ADD r4, r3, r9 ;\ + MOVNE r4, r3 ;\ + ;\ + CMP r11, 8 ;\ + BNE _4_w##W_INDEX_DTYPE_NUM_BITS ;\ + ;\ + VST1.32 {q8}, [r1]! ;\ + VST1.32 {q10}, [r2]! ;\ + VST1.32 {q12}, [r3]! ;\ + VST1.32 {q14}, [r4]! ;\ + VST1.32 {q9}, [r1] ;\ + VST1.32 {q11}, [r2] ;\ + VST1.32 {q13}, [r3] ;\ + VST1.32 {q15}, [r4] ;\ + ;\ + VPOP {d8-d15} ;\ + POP {r4, r5, r6, r7, r8, r9, r10, r11, lr} ;\ + BX lr ;\ + ;\ + .p2align 3 ;\ + _4_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + CMP r11, 4 ;\ + BLO _5_w##W_INDEX_DTYPE_NUM_BITS ;\ + ;\ + VST1.32 {q8}, [r1]! ;\ + VST1.32 {q10}, [r2]! ;\ + VST1.32 {q12}, [r3]! ;\ + VST1.32 {q14}, [r4]! ;\ + ;\ + SUB r11, 4 ;\ + ;\ + VMOV.32 q8, q9 ;\ + VMOV.32 q10, q11 ;\ + VMOV.32 q12, q13 ;\ + VMOV.32 q14, q15 ;\ + ;\ + _5_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + CMP r11, 2 ;\ + BLO _6_w##W_INDEX_DTYPE_NUM_BITS ;\ + ;\ + VST1.32 {d16}, [r1]! ;\ + VST1.32 {d20}, [r2]! ;\ + VST1.32 {d24}, [r3]! ;\ + VST1.32 {d28}, [r4]! ;\ + ;\ + SUB r11, 2 ;\ + ;\ + VEXT.32 q8, q8, 2 ;\ + VEXT.32 q10, q10, 2 ;\ + VEXT.32 q12, q12, 2 ;\ + VEXT.32 q14, q14, 2 ;\ + ;\ + _6_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + TEQ r11, 0 ;\ + BEQ _7_w##W_INDEX_DTYPE_NUM_BITS ;\ + ;\ + VST1.32 {d16[0]}, [r1] ;\ + VST1.32 {d20[0]}, [r2] ;\ + VST1.32 {d24[0]}, [r3] ;\ + VST1.32 {d28[0]}, [r4] ;\ + ;\ + _7_w##W_INDEX_DTYPE_NUM_BITS##: ;\ + VPOP {d8-d15} ;\ + POP {r4, r5, r6, r7, r8, r9, r10, r11, lr} ;\ + BX lr ;\ + ;\ + END_FUNCTION pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w##W_INDEX_DTYPE_NUM_BITS##__aarch32_neon + +# void pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w32__aarch32_neon( # size_t mr, # size_t nr, # const uint8_t* a_packed, @@ -72,294 +377,39 @@ # size_t c_stride, # size_t output_channel_index, # const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) -BEGIN_FUNCTION pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA__aarch32_neon - .arm -#ifndef __APPLE__ - .arch armv7-a - .fpu neon -#endif - - PUSH {r4, r5, r6, r7, r8, r9, r10, r11, lr} - VPUSH {d8-d15} - - # Store nr in r11 as well for late user. 
- MOV r11, r1 - # Load output channel index - LDR r5, [sp, 120] - # Load quantization params - # - r7 = quantization_params - LDR r7, [sp, 124] - # Load input_zero_point - VLD1.8 {d14[]}, [r7] - ADD r7, r7, 4 - # Load pointer to per channel zero points array - LDR r4, [r7] - # Add output_channel_index to the b_zero_point pointer - ADD r4, r4, r5 - - # Load w_row_ptr + n - LDR r5, [sp, 100] - # r7 = blocks_id_ptr - LDR r7, [sp, 104] - - VEOR q8, q8, q8 - VEOR q9, q9, q9 - VEOR q10, q10, q10 - VEOR q11, q11, q11 - VEOR q12, q12, q12 - VEOR q13, q13, q13 - VEOR q14, q14, q14 - VEOR q15, q15, q15 - VLD1.8 {d15}, [r4] - # ip = w_row_ptr[n], lr = w_row_ptr[n+1] - # r5 = r5 + 4 to point to next n - LDR ip, [r5], #4 - LDR lr, [r5] - # r6 = temp_packed_w = packed_w + w_row_ptr[n] * 8 - # * 8 because each block contains 8 values - # This points to the first block of nonzero value - # for the nth row. - ADD r6, r3, ip, LSL #3 - # r9 = temp_w_block_ids_ptr = w_block_ids_ptr (r7) + w_row_ptr[n] - # LSL2 because each element is 4 bytes because blocks ids are uint32_t pointer - # This points to the col block id of the first block - # It should contain lr - ip number of block ids - # Note that in this kernel sparsity pattern is 8x1. - # Thus each block contains only 1 k as opposed to - # 1x4 where each block contains 4 k. - ADD r9, r7, ip, LSL #2 - # r8 = num_blocks that needs to be processed - SUB r8, lr, ip - SUBS r8, r8, 2 - BLO 1f - - .p2align 5 -k_loop: - # Load 2 non zero blocks of weights. Each block = 8x1. - VLD1.8 {d0}, [r6]! - VLD1.8 {d2}, [r6]! - - #ip = block_id_ptr[0] - #lr = block_id_ptr[1] - LDR ip, [r9], #4 - LDR lr, [r9], #4 - - # Add offset to r2 - # Shift by 4 because each packed block is a block of 4x1 - # which 4 bytes - ADD r10, r2, ip, LSL #2 - # q9 = vxb - VSUBL.U8 q0, d0, d15 - VSUBL.U8 q1, d2, d15 - - # d4 = 4x1 transposed - VLD1.32 {d4[]}, [r10] - - ADD r10, r2, lr, LSL #2 - - VSUBL.U8 q2, d4, d14 // vxa0_t - - # d5 = next 4x1 transposed - VLD1.32 {d6[]}, [r10] - - VSUBL.U8 q3, d6, d14 // vxa1_t - - # q0 = d0, d1 = 8x1 block of weight for k - # q1 = d2, d3 = 8x1 block of weight for k + 1 - # q2's d4 = 4x1 block of activation for k - # q3's d6 = 4x1 block of activation for k + 1 - - # Generate 4x8 block as two 4x4 blocks - - VMLAL.S16 q8, d0, d4[0] - VMLAL.S16 q9, d1, d4[0] - VMLAL.S16 q10, d0, d4[1] - VMLAL.S16 q11, d1, d4[1] - VMLAL.S16 q12, d0, d4[2] - VMLAL.S16 q13, d1, d4[2] - VMLAL.S16 q14, d0, d4[3] - VMLAL.S16 q15, d1, d4[3] - - VMLAL.S16 q8, d2, d6[0] - VMLAL.S16 q9, d3, d6[0] - VMLAL.S16 q10, d2, d6[1] - VMLAL.S16 q11, d3, d6[1] - VMLAL.S16 q12, d2, d6[2] - VMLAL.S16 q13, d3, d6[2] - VMLAL.S16 q14, d2, d6[3] - VMLAL.S16 q15, d3, d6[3] - - SUBS r8, r8, 2 - - BHS k_loop -1: - CMP r8, -2 - BEQ 3f - - # Load last nonzero block - # For this we will load 4 8 bit values as one 32 bit value - VLD1.8 {d0}, [r6] - # q9 = vxb - VSUBL.U8 q0, d0, d15 - - #ip = block_id_ptr[0] - LDR ip, [r9] - - # Add offset to r2 - # Shift by 4 because each packed block is a block of 4x1 - # which 4 bytes - ADD r10, r2, ip, LSL #2 - - VLD1.32 {d4[]}, [r10]! 
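The w_row_ptr / w_block_ids_ptr bookkeeping above is a block-CSR scheme: for a given row, the nonzero weight blocks occupy the index range [w_row_ptr[n], w_row_ptr[n+1]) in packed_w, and w_block_ids_ptr records which column block each of them came from. A rough scalar C sketch of that traversal, under the assumption of 8 stored values per block as in this 8x1 kernel (names are illustrative):

#include <stddef.h>
#include <stdint.h>

/* Illustrative block-CSR traversal matching the w_row_ptr / w_block_ids_ptr
 * scheme: row_ptr bounds the nonzero blocks of one block-row, block_ids
 * names the column block each of them belongs to.  BLOCK_VALUES is the
 * number of weight values stored per nonzero block (8 for the 8x1 kernel). */
enum { BLOCK_VALUES = 8 };

static void visit_row_blocks(
    uint32_t row,                    /* block-row being processed           */
    const uint32_t* row_ptr,         /* length: num_block_rows + 1          */
    const uint32_t* block_ids,       /* column block id per nonzero block   */
    const uint8_t* packed_w,         /* nonzero blocks, stored contiguously */
    void (*mac)(uint32_t col_block, const uint8_t* w_block)) {
  const uint32_t begin = row_ptr[row];
  const uint32_t end = row_ptr[row + 1];   /* end - begin = num_blocks      */
  for (uint32_t j = begin; j < end; j++) {
    /* Hand the packed weights of this block, plus the activation column
     * block they must be multiplied with, to the multiply-accumulate step. */
    mac(block_ids[j], packed_w + (size_t)j * BLOCK_VALUES);
  }
}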
- - VSUBL.U8 q2, d4, d14 // vxa0_t - - VMLAL.S16 q8, d0, d4[0] - VMLAL.S16 q9, d1, d4[0] - VMLAL.S16 q10, d0, d4[1] - VMLAL.S16 q11, d1, d4[1] - VMLAL.S16 q12, d0, d4[2] - VMLAL.S16 q13, d1, d4[2] - VMLAL.S16 q14, d0, d4[3] - VMLAL.S16 q15, d1, d4[3] - - - .p2align 4 -3: - # Load output channel index - LDR r5, [sp, 120] - # Load quantization params - # - r7 = quantization_params - LDR r7, [sp, 124] - ADD r7, r7, 8 - # Load pointer to per channel requant scale - LDR r7, [r7] - # Now r7 has the base_addr + offset for multipliers - ADD r7, r7, r5, LSL #2 - - LDR r6, [sp, 108] - # Load q6: vmultiplier_c0123 - VLD1.32 {d12, d13}, [r7]! - # Load q7: vmultiplier_c4567 - VLD1.32 {d14, d15}, [r7] - VCVT.F32.S32 q8, q8 - VCVT.F32.S32 q9, q9 - VCVT.F32.S32 q10, q10 - VLD1.32 {q0}, [r6]! - VLD1.32 {q1}, [r6] +MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_8X1_UKERNEL_4X8_PACKEDA__AARCH32_NEON(32, #4, #2, LDR) - VCVT.F32.S32 q11, q11 - VCVT.F32.S32 q12, q12 - VCVT.F32.S32 q13, q13 - VCVT.F32.S32 q14, q14 - VCVT.F32.S32 q15, q15 - - VMUL.F32 q8, q8, q6 - VMUL.F32 q9, q9, q7 - VMUL.F32 q10, q10, q6 - VMUL.F32 q11, q11, q7 - VMUL.F32 q12, q12, q6 - VMUL.F32 q13, q13, q7 - VMUL.F32 q14, q14, q6 - VMUL.F32 q15, q15, q7 - - VADD.F32 q8, q8, q0 - VADD.F32 q9, q9, q1 - VADD.F32 q10, q10, q0 - VADD.F32 q11, q11, q1 - VADD.F32 q12, q12, q0 - VADD.F32 q13, q13, q1 - VADD.F32 q14, q14, q0 - VADD.F32 q15, q15, q1 - - # Load c, c_stride: - # - r1 = c - # - r9 = c_stride - LDR r1, [sp, 112] - LDR r9, [sp, 116] - LSL r9, r9, 2 - - # r1 = c0 = c pointer - - CMP r0, 2 - # r2 = c1 - ADD r2, r1, r9 - MOVLO r2, r1 - - # r3 = c2 - ADD r3, r2, r9 - MOVLS r3, r2 - - CMP r0, 4 - # r4 = c3 - ADD r4, r3, r9 - MOVNE r4, r3 - - CMP r11, 8 - BNE 4f - - VST1.32 {q8}, [r1]! - VST1.32 {q10}, [r2]! - VST1.32 {q12}, [r3]! - VST1.32 {q14}, [r4]! - VST1.32 {q9}, [r1] - VST1.32 {q11}, [r2] - VST1.32 {q13}, [r3] - VST1.32 {q15}, [r4] - - VPOP {d8-d15} - POP {r4, r5, r6, r7, r8, r9, r10, r11, lr} - BX lr - - .p2align 3 -4: - CMP r11, 4 - BLO 5f - - VST1.32 {q8}, [r1]! - VST1.32 {q10}, [r2]! - VST1.32 {q12}, [r3]! - VST1.32 {q14}, [r4]! - - SUB r11, 4 - - VMOV.32 q8, q9 - VMOV.32 q10, q11 - VMOV.32 q12, q13 - VMOV.32 q14, q15 - -5: - CMP r11, 2 - BLO 6f - - VST1.32 {d16}, [r1]! - VST1.32 {d20}, [r2]! - VST1.32 {d24}, [r3]! - VST1.32 {d28}, [r4]! 
- - SUB r11, 2 - - VEXT.32 q8, q8, 2 - VEXT.32 q10, q10, 2 - VEXT.32 q12, q12, 2 - VEXT.32 q14, q14, 2 - -6: - TEQ r11, 0 - BEQ 7f - - VST1.32 {d16[0]}, [r1] - VST1.32 {d20[0]}, [r2] - VST1.32 {d24[0]}, [r3] - VST1.32 {d28[0]}, [r4] - -7: - VPOP {d8-d15} - POP {r4, r5, r6, r7, r8, r9, r10, r11, lr} - BX lr +# void pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w16__aarch32_neon( +# size_t mr, +# size_t nr, +# const uint8_t* a_packed, +# const uint8_t* packed_w, +# const uint16_t* w_row_ptr, +# const uint16_t* w_block_ids_ptr, +# const float* b, +# uint8_t* restrict c, +# size_t c_stride, +# size_t output_channel_index, +# const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) +MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_8X1_UKERNEL_4X8_PACKEDA__AARCH32_NEON(16, #2, #1, LDRH) -END_FUNCTION pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA__aarch32_neon +# void pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w8__aarch32_neon( +# size_t mr, +# size_t nr, +# const uint8_t* a_packed, +# const uint8_t* packed_w, +# const uint8_t* w_row_ptr, +# const uint8_t* w_block_ids_ptr, +# const float* b, +# uint8_t* restrict c, +# size_t c_stride, +# size_t output_channel_index, +# const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) +MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_8X1_UKERNEL_4X8_PACKEDA__AARCH32_NEON(8, #1, #0, LDRB) #ifdef __ELF__ .section ".note.GNU-stack","",%progbits #endif + +#undef NDEF_APPLE_SYMBOLS +#undef MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_8X1_UKERNEL_4X8_PACKEDA__AARCH32_NEON diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x4c1x4-dq-packedA-sse2.c b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x4c1x4-dq-packedA-sse2.c index 98376e3d2cdb..768574ba6f51 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x4c1x4-dq-packedA-sse2.c +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x4c1x4-dq-packedA-sse2.c @@ -1,434 +1,17 @@ -/* - * Copyright (c) Facebook, Inc. and its affiliates. - * All rights reserved. - * - * This source code is licensed under the BSD-style license found in the - * LICENSE file in the root directory of this source tree. 
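The w32, w16 and w8 instantiations above differ only in the width of the CSR indices: w_row_ptr and w_block_ids_ptr are read as uint32_t, uint16_t or uint8_t, which is why the macro takes LDR, LDRH or LDRB together with the matching byte and shift arguments. A hedged C sketch of how a packer might choose the narrowest usable width; the selection rule below is an assumption for illustration, not something this patch specifies:

#include <stdint.h>

/* Width of the CSR index type needed to address `max_index` entries.
 * Narrower indices shrink w_row_ptr / w_block_ids_ptr and the memory
 * traffic of the sparse kernels; the kernel variant (w8/w16/w32) has to
 * match the width the data was packed with. */
static int w_index_bits(uint32_t max_index) {
  if (max_index <= UINT8_MAX) {
    return 8;   /* ..._w8__...  variants: LDRB, 1-byte indices */
  } else if (max_index <= UINT16_MAX) {
    return 16;  /* ..._w16__... variants: LDRH, 2-byte indices */
  }
  return 32;    /* ..._w32__... variants: LDR,  4-byte indices */
}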
- */ - -#include - -#include -#include - -#include "8x4c1x4-packed-sse2.h" - -#define CONVERT_TO_FP_AND_TRANSPOSE(a, b, c, d, t_a, t_b, t_c, t_d) \ - a_ps = _mm_cvtepi32_ps(a); \ - b_ps = _mm_cvtepi32_ps(b); \ - c_ps = _mm_cvtepi32_ps(c); \ - d_ps = _mm_cvtepi32_ps(d); \ - tmp0 = _mm_shuffle_ps(a_ps, b_ps, _MM_SHUFFLE(1, 0, 1, 0)); \ - tmp1 = _mm_shuffle_ps(a_ps, b_ps, _MM_SHUFFLE(3, 2, 3, 2)); \ - tmp2 = _mm_shuffle_ps(c_ps, d_ps, _MM_SHUFFLE(1, 0, 1, 0)); \ - tmp3 = _mm_shuffle_ps(c_ps, d_ps, _MM_SHUFFLE(3, 2, 3, 2)); \ - t_a = _mm_shuffle_ps(tmp0, tmp2, _MM_SHUFFLE(2, 0, 2, 0)); \ - t_b = _mm_shuffle_ps(tmp0, tmp2, _MM_SHUFFLE(3, 1, 3, 1)); \ - t_c = _mm_shuffle_ps(tmp1, tmp3, _MM_SHUFFLE(2, 0, 2, 0)); \ - t_d = _mm_shuffle_ps(tmp1, tmp3, _MM_SHUFFLE(3, 1, 3, 1)); - -void pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2( - size_t mr, - size_t nr, - const uint8_t* a_packed, - const uint8_t* packed_w, - const uint32_t* w_row_ptr, - const uint32_t* w_block_ids_ptr, - const float* b, - float* c, - size_t c_stride, - size_t output_channel_index, - const struct pytorch_qnnp_conv_dynamic_quantization_params - quantization_params[RESTRICT_STATIC 1]) { - - const __m128i va_zero_point = _mm_set1_epi16(quantization_params->input_zero_point); - const __m128 vbias = _mm_load_ps(b); - const __m128i vzero = _mm_setzero_si128(); - - // Packed A format. - // 8kx4m blocks for alls blocks given 4 rows (4m) are placed in contiguous memory. - // Original A - // --------- K ----------- -- (K + 4 - 1) / 4 -- - // | | | | - // | | (M + 8 - 1)/8 | - // | | Packed | | - // M | => |-------------------| - // | | Thus Packed A has (K + 4 - 1)/4 * (M + 8 -1)/8 blocks - // | | - // |---------------------| - // - // Each 8 x 4 blocks is transposed and stored. - // Each of the (K + 4 - 1)/4 blocks for a given group of 8 m blocks - // are stored adjacent in memory - // Thus, each block: - // |----8m-----|----8m-----| - // 4k | | ..... - // |-----------|-----------| - // This locality helps in loading 8kx8m blocks of activations - // Note when M is not multiple of 8, the rest can contain arbitrary - // data in packed A as we will not be writing those out. - // This wil be taken care by just copying the appropriate valid data - - __m128i vacc_low[4]; - __m128i vacc_high[4]; - const __m128 vmultiplier = - _mm_loadu_ps(&quantization_params->multipliers[output_channel_index]); - for (int32_t n = 0; n < nr; n++) { - vacc_low[n] = _mm_setzero_si128(); - vacc_high[n] = _mm_setzero_si128(); - const int16_t b_zero_point = - (int16_t)(uint16_t)quantization_params->kernel_zero_points[ - output_channel_index + n]; - - int32_t num_blocks = w_row_ptr[n+1] - w_row_ptr[n]; - // Offset into compressed values. - // w_row_ptr[0] is the block offset in the compressed values. - // Where the corresponding row of the weight matrix starts. - const uint8_t* temp_packed_w = packed_w + w_row_ptr[n] * COL_BLOCK_SIZE; - // Similarly w_row_ptr[0] is also the block offset where - // corresponding row's block column ids start. - // Per row # of block column ids = # of block values - const uint32_t* temp_w_block_ids_ptr = w_block_ids_ptr + w_row_ptr[n]; - while (num_blocks > 1) { - // Load two 1x4 uint8 blocks 2 ints - const uint8_t* b_ptr = temp_packed_w; - // This is not perf optimal since this will result in - // register spills. 
We probably should work with output block - // of 1x4 instead of 1x8 - // But doing is this way because mostly this how we will - // do it for ARM and this reference code helps establish - // the baseline for functional correctness. - const int16_t b_0 = (int16_t)((uint16_t)(b_ptr[0])); - const int16_t b_1 = (int16_t)((uint16_t)(b_ptr[1])); - const int16_t b_2 = (int16_t)((uint16_t)(b_ptr[2])); - const int16_t b_3 = (int16_t)((uint16_t)(b_ptr[3])); - const int16_t b_4 = (int16_t)((uint16_t)(b_ptr[4])); - const int16_t b_5 = (int16_t)((uint16_t)(b_ptr[5])); - const int16_t b_6 = (int16_t)((uint16_t)(b_ptr[6])); - const int16_t b_7 = (int16_t)((uint16_t)(b_ptr[7])); - // Now we will load 8kx1(broadcast 8) weight values - const __m128i vxb0 = _mm_set1_epi16((b_0 - b_zero_point)); - const __m128i vxb1 = _mm_set1_epi16((b_1 - b_zero_point)); - const __m128i vxb2 = _mm_set1_epi16((b_2 - b_zero_point)); - const __m128i vxb3 = _mm_set1_epi16((b_3 - b_zero_point)); - const __m128i vxb4 = _mm_set1_epi16((b_4 - b_zero_point)); - const __m128i vxb5 = _mm_set1_epi16((b_5 - b_zero_point)); - const __m128i vxb6 = _mm_set1_epi16((b_6 - b_zero_point)); - const __m128i vxb7 = _mm_set1_epi16((b_7 - b_zero_point)); - - // Load activation blocks. In this kernel we assume - // a mat is already transposed. K x M - // 1. Load 8 1x8 registers = 8k x 8m - - // Load column id of the first 1x4 block - int32_t col_block_id_0 = temp_w_block_ids_ptr[0]; - // Load column id of the second 1x4 block - int32_t col_block_id_1 = temp_w_block_ids_ptr[1]; - const __m128i va0 = - _mm_loadl_epi64((const __m128i*) (a_packed + - col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 0)); - const __m128i va1 = - _mm_loadl_epi64((const __m128i*) (a_packed + - col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 1)); - const __m128i va2 = - _mm_loadl_epi64((const __m128i*) (a_packed + - col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 2)); - const __m128i va3 = - _mm_loadl_epi64((const __m128i*) (a_packed + - col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 3)); - const __m128i va4 = - _mm_loadl_epi64((const __m128i*) (a_packed + - col_block_id_1 * PACKED_A_BLOCK_SIZE + MR * 0)); - const __m128i va5 = - _mm_loadl_epi64((const __m128i*) (a_packed + - col_block_id_1 * PACKED_A_BLOCK_SIZE + MR * 1)); - const __m128i va6 = - _mm_loadl_epi64((const __m128i*) (a_packed + - col_block_id_1 * PACKED_A_BLOCK_SIZE + MR * 2)); - const __m128i va7 = - _mm_loadl_epi64((const __m128i*) (a_packed + - col_block_id_1 * PACKED_A_BLOCK_SIZE + MR * 3)); - - const __m128i vxa0 = - sub_zero_point(_mm_unpacklo_epi8(va0, vzero), va_zero_point); - const __m128i vxa1 = - sub_zero_point(_mm_unpacklo_epi8(va1, vzero), va_zero_point); - const __m128i vxa2 = - sub_zero_point(_mm_unpacklo_epi8(va2, vzero), va_zero_point); - const __m128i vxa3 = - sub_zero_point(_mm_unpacklo_epi8(va3, vzero), va_zero_point); - const __m128i vxa4 = - sub_zero_point(_mm_unpacklo_epi8(va4, vzero), va_zero_point); - const __m128i vxa5 = - sub_zero_point(_mm_unpacklo_epi8(va5, vzero), va_zero_point); - const __m128i vxa6 = - sub_zero_point(_mm_unpacklo_epi8(va6, vzero), va_zero_point); - const __m128i vxa7 = - sub_zero_point(_mm_unpacklo_epi8(va7, vzero), va_zero_point); - - // acc += a0 * b0; - __m128i vacc_low_16bits = _mm_mullo_epi16(vxa0, vxb0); - __m128i vacc_high_16bits = _mm_mulhi_epi16(vxa0, vxb0); - vacc_low[n] = _mm_add_epi32(vacc_low[n], - _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); - vacc_high[n] = _mm_add_epi32(vacc_high[n], - _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); - // acc += 
a1 * b1; - vacc_low_16bits = _mm_mullo_epi16(vxa1, vxb1); - vacc_high_16bits = _mm_mulhi_epi16(vxa1, vxb1); - vacc_low[n] = _mm_add_epi32(vacc_low[n], - _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); - vacc_high[n] = _mm_add_epi32(vacc_high[n], - _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); - // acc += a2 * b2; - vacc_low_16bits = _mm_mullo_epi16(vxa2, vxb2); - vacc_high_16bits = _mm_mulhi_epi16(vxa2, vxb2); - vacc_low[n] = _mm_add_epi32(vacc_low[n], - _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); - vacc_high[n] = _mm_add_epi32(vacc_high[n], - _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); - // acc += a3 * b3; - vacc_low_16bits = _mm_mullo_epi16(vxa3, vxb3); - vacc_high_16bits = _mm_mulhi_epi16(vxa3, vxb3); - vacc_low[n] = _mm_add_epi32(vacc_low[n], - _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); - vacc_high[n] = _mm_add_epi32(vacc_high[n], - _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); - // acc += a4 * b4; - vacc_low_16bits = _mm_mullo_epi16(vxa4, vxb4); - vacc_high_16bits = _mm_mulhi_epi16(vxa4, vxb4); - vacc_low[n] = _mm_add_epi32(vacc_low[n], - _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); - vacc_high[n] = _mm_add_epi32(vacc_high[n], - _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); - // acc += a5 * b5; - vacc_low_16bits = _mm_mullo_epi16(vxa5, vxb5); - vacc_high_16bits = _mm_mulhi_epi16(vxa5, vxb5); - vacc_low[n] = _mm_add_epi32(vacc_low[n], - _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); - vacc_high[n] = _mm_add_epi32(vacc_high[n], - _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); - // acc += a6 * b6; - vacc_low_16bits = _mm_mullo_epi16(vxa6, vxb6); - vacc_high_16bits = _mm_mulhi_epi16(vxa6, vxb6); - vacc_low[n] = _mm_add_epi32(vacc_low[n], - _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); - vacc_high[n] = _mm_add_epi32(vacc_high[n], - _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); - // acc += a7 * b7; - vacc_low_16bits = _mm_mullo_epi16(vxa7, vxb7); - vacc_high_16bits = _mm_mulhi_epi16(vxa7, vxb7); - vacc_low[n] = _mm_add_epi32(vacc_low[n], - _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); - vacc_high[n] = _mm_add_epi32(vacc_high[n], - _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); - - // Now we have 1x8 m acculated 32 bit values in vacc_low[n](4) and vacc_high[n](4) - - temp_packed_w = temp_packed_w + COL_BLOCK_SIZE * 2; - temp_w_block_ids_ptr += 2; - num_blocks -= 2; - } - if (num_blocks > 0) { - // Load two 1x4 uint8 blocks 2 ints - const uint8_t* b_ptr = temp_packed_w; - const int16_t b_0 = (int16_t)((uint16_t)(b_ptr[0])); - const int16_t b_1 = (int16_t)((uint16_t)(b_ptr[1])); - const int16_t b_2 = (int16_t)((uint16_t)(b_ptr[2])); - const int16_t b_3 = (int16_t)((uint16_t)(b_ptr[3])); - // Now we will load 8kx1(broadcast 8) weight values - const __m128i vxb0 = _mm_set1_epi16((b_0 - b_zero_point)); - const __m128i vxb1 = _mm_set1_epi16((b_1 - b_zero_point)); - const __m128i vxb2 = _mm_set1_epi16((b_2 - b_zero_point)); - const __m128i vxb3 = _mm_set1_epi16((b_3 - b_zero_point)); - - // Then load transformed weight blocks - // 1. Load 4 1x8 registers = 4k x 8m - // Thus have 4x8 (4k x 8m) activations a0, a1, a2, a3 - // Each a containing 8 m values. 
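Each of the acc += a_i * b_i steps in this SSE2 code combines _mm_mullo_epi16 and _mm_mulhi_epi16 with an unpack to widen the signed 16-bit products to 32 bits before adding them to the accumulators. The single-lane scalar equivalent, as a small sketch:

#include <stdint.h>

/* Scalar equivalent of one SSE2 lane of the accumulation step:
 * mullo/mulhi produce the low and high 16 bits of the signed 16x16
 * product, the unpack step splices them back into a 32-bit lane, and
 * the result is added to the accumulator. */
static int32_t mac_lane(int32_t acc, int16_t xa, int16_t xb) {
  const int32_t product = (int32_t)xa * (int32_t)xb;             /* exact 16x16 -> 32 */
  const uint16_t lo = (uint16_t)((uint32_t)product & 0xFFFFu);   /* _mm_mullo_epi16   */
  const uint16_t hi = (uint16_t)((uint32_t)product >> 16);       /* _mm_mulhi_epi16   */
  const int32_t widened = (int32_t)((uint32_t)lo | ((uint32_t)hi << 16));
  return acc + widened;                                          /* _mm_add_epi32     */
}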
- - // Load column id of the first 1x4 block - int32_t col_block_id_0 = temp_w_block_ids_ptr[0]; - const __m128i va0 = - _mm_loadl_epi64((const __m128i*) (a_packed + - col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 0)); - const __m128i va1 = - _mm_loadl_epi64((const __m128i*) (a_packed + - col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 1)); - const __m128i va2 = - _mm_loadl_epi64((const __m128i*) (a_packed + - col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 2)); - const __m128i va3 = - _mm_loadl_epi64((const __m128i*) (a_packed + - col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 3)); - const __m128i vxa0 = - sub_zero_point(_mm_unpacklo_epi8(va0, vzero), va_zero_point); - const __m128i vxa1 = - sub_zero_point(_mm_unpacklo_epi8(va1, vzero), va_zero_point); - const __m128i vxa2 = - sub_zero_point(_mm_unpacklo_epi8(va2, vzero), va_zero_point); - const __m128i vxa3 = - sub_zero_point(_mm_unpacklo_epi8(va3, vzero), va_zero_point); - - // acc += a0 * b0; - __m128i vacc_low_16bits = _mm_mullo_epi16(vxa0, vxb0); - __m128i vacc_high_16bits = _mm_mulhi_epi16(vxa0, vxb0); - vacc_low[n] = _mm_add_epi32(vacc_low[n], - _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); - vacc_high[n] = _mm_add_epi32(vacc_high[n], - _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); - // acc += a1 * b1; - vacc_low_16bits = _mm_mullo_epi16(vxa1, vxb1); - vacc_high_16bits = _mm_mulhi_epi16(vxa1, vxb1); - vacc_low[n] = _mm_add_epi32(vacc_low[n], - _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); - vacc_high[n] = _mm_add_epi32(vacc_high[n], - _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); - // acc += a2 * b2; - vacc_low_16bits = _mm_mullo_epi16(vxa2, vxb2); - vacc_high_16bits = _mm_mulhi_epi16(vxa2, vxb2); - vacc_low[n] = _mm_add_epi32(vacc_low[n], - _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); - vacc_high[n] = _mm_add_epi32(vacc_high[n], - _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); - // acc += a3 * b3; - vacc_low_16bits = _mm_mullo_epi16(vxa3, vxb3); - vacc_high_16bits = _mm_mulhi_epi16(vxa3, vxb3); - vacc_low[n] = _mm_add_epi32(vacc_low[n], - _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); - vacc_high[n] = _mm_add_epi32(vacc_high[n], - _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); - - // Now we have 1x8 m acculated 32 bit values in vacc_low[n](4) and vacc_high[n](4) - } - } - - __m128 vout[8]; - __m128 a_ps, b_ps, c_ps, d_ps, tmp0, tmp1, tmp2, tmp3; - - // Transform low half of 4x8 result - // That is 4x4 block (4n x 4m) - // Convert to FP and transpose: 4m x 4n - CONVERT_TO_FP_AND_TRANSPOSE(vacc_low[0], - vacc_low[1], - vacc_low[2], - vacc_low[3], - vout[0], - vout[1], - vout[2], - vout[3]) - CONVERT_TO_FP_AND_TRANSPOSE(vacc_high[0], - vacc_high[1], - vacc_high[2], - vacc_high[3], - vout[4], - vout[5], - vout[6], - vout[7]) - - vout[0] = _mm_mul_ps(vmultiplier, vout[0]); - vout[1] = _mm_mul_ps(vmultiplier, vout[1]); - vout[2] = _mm_mul_ps(vmultiplier, vout[2]); - vout[3] = _mm_mul_ps(vmultiplier, vout[3]); - vout[4] = _mm_mul_ps(vmultiplier, vout[4]); - vout[5] = _mm_mul_ps(vmultiplier, vout[5]); - vout[6] = _mm_mul_ps(vmultiplier, vout[6]); - vout[7] = _mm_mul_ps(vmultiplier, vout[7]); - - vout[0] = _mm_add_ps(vout[0], vbias); - vout[1] = _mm_add_ps(vout[1], vbias); - vout[2] = _mm_add_ps(vout[2], vbias); - vout[3] = _mm_add_ps(vout[3], vbias); - vout[4] = _mm_add_ps(vout[4], vbias); - vout[5] = _mm_add_ps(vout[5], vbias); - vout[6] = _mm_add_ps(vout[6], vbias); - vout[7] = _mm_add_ps(vout[7], vbias); - - float* c0 = c; - float* c1 = c0 + c_stride; - if (mr < 2) { - c1 
= c0; - vout[1] = vout[0]; - } - float* c2 = c1 + c_stride; - if (mr < 3) { - c2 = c0; - vout[2] = vout[0]; - } - float* c3 = c2 + c_stride; - if (mr < 4) { - c3 = c0; - vout[3] = vout[0]; - } - float* c4 = c3 + c_stride; - if (mr < 5) { - c4 = c0; - vout[4] = vout[0]; - } - float* c5 = c4 + c_stride; - if (mr < 6) { - c5 = c0; - vout[5] = vout[0]; - } - float* c6 = c5 + c_stride; - if (mr < 7) { - c6 = c0; - vout[6] = vout[0]; - } - float* c7 = c6 + c_stride; - if (mr < 8) { - c7 = c0; - vout[7] = vout[0]; - } - - if (nr == 4) { - _mm_storeu_ps(c0, vout[0]); - _mm_storeu_ps(c1, vout[1]); - _mm_storeu_ps(c2, vout[2]); - _mm_storeu_ps(c3, vout[3]); - _mm_storeu_ps(c4, vout[4]); - _mm_storeu_ps(c5, vout[5]); - _mm_storeu_ps(c6, vout[6]); - _mm_storeu_ps(c7, vout[7]); - } else { - if (nr >= 2) { - _mm_storel_pi((__m64*)c0, vout[0]); - _mm_storel_pi((__m64*)c1, vout[1]); - _mm_storel_pi((__m64*)c2, vout[2]); - _mm_storel_pi((__m64*)c3, vout[3]); - _mm_storel_pi((__m64*)c4, vout[4]); - _mm_storel_pi((__m64*)c5, vout[5]); - _mm_storel_pi((__m64*)c6, vout[6]); - _mm_storel_pi((__m64*)c7, vout[7]); - - nr -= 2; - - c0 += 2; - c1 += 2; - c2 += 2; - c3 += 2; - c4 += 2; - c5 += 2; - c6 += 2; - c7 += 2; - vout[0] = _mm_shuffle_ps(vout[0], vout[0], _MM_SHUFFLE(2, 2, 2, 2)); - vout[1] = _mm_shuffle_ps(vout[1], vout[1], _MM_SHUFFLE(2, 2, 2, 2)); - vout[2] = _mm_shuffle_ps(vout[2], vout[2], _MM_SHUFFLE(2, 2, 2, 2)); - vout[3] = _mm_shuffle_ps(vout[3], vout[3], _MM_SHUFFLE(2, 2, 2, 2)); - vout[4] = _mm_shuffle_ps(vout[4], vout[4], _MM_SHUFFLE(2, 2, 2, 2)); - vout[5] = _mm_shuffle_ps(vout[5], vout[5], _MM_SHUFFLE(2, 2, 2, 2)); - vout[6] = _mm_shuffle_ps(vout[6], vout[6], _MM_SHUFFLE(2, 2, 2, 2)); - vout[7] = _mm_shuffle_ps(vout[7], vout[7], _MM_SHUFFLE(2, 2, 2, 2)); - } - if (nr != 0) { - *c0 = _mm_cvtss_f32(vout[0]); - *c1 = _mm_cvtss_f32(vout[1]); - *c2 = _mm_cvtss_f32(vout[2]); - *c3 = _mm_cvtss_f32(vout[3]); - *c4 = _mm_cvtss_f32(vout[4]); - *c5 = _mm_cvtss_f32(vout[5]); - *c6 = _mm_cvtss_f32(vout[6]); - *c7 = _mm_cvtss_f32(vout[7]); - } - } -} +#define KERNEL_NAME pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2 +#define W_INDEX_DTYPE uint32_t +#include "8x4c1x4-dq-packedA-sse2.h" +#undef KERNEL_NAME +#undef W_INDEX_DTYPE + +#define KERNEL_NAME pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2 +#define W_INDEX_DTYPE uint16_t +#include "8x4c1x4-dq-packedA-sse2.h" +#undef KERNEL_NAME +#undef W_INDEX_DTYPE + +#define KERNEL_NAME pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2 +#define W_INDEX_DTYPE uint8_t +#include "8x4c1x4-dq-packedA-sse2.h" +#undef KERNEL_NAME +#undef W_INDEX_DTYPE diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x4c1x4-dq-packedA-sse2.h b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x4c1x4-dq-packedA-sse2.h new file mode 100644 index 000000000000..5503d6718172 --- /dev/null +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x4c1x4-dq-packedA-sse2.h @@ -0,0 +1,435 @@ +/* + * Copyright (c) Facebook, Inc. and its affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. 
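After this rewrite the .c file only instantiates the shared body: it defines KERNEL_NAME and W_INDEX_DTYPE, includes 8x4c1x4-dq-packedA-sse2.h, and undefines both, once per index width. A minimal sketch of that macro-plus-include template pattern with generic names (not the QNNPACK files):

/* ---- sum_impl.h : shared body, parameterized by KERNEL_NAME / INDEX_T ---- */
#include <stddef.h>

INDEX_T KERNEL_NAME(const INDEX_T* values, size_t n) {
  INDEX_T total = 0;
  for (size_t i = 0; i < n; i++) {
    total += values[i];   /* same body, different element type per instantiation */
  }
  return total;
}

/* ---- sums.c : one instantiation per element width, mirroring the
 *      KERNEL_NAME / W_INDEX_DTYPE + #include scheme used in the patch ---- */
#include <stdint.h>

#define KERNEL_NAME sum_u32
#define INDEX_T uint32_t
#include "sum_impl.h"
#undef KERNEL_NAME
#undef INDEX_T

#define KERNEL_NAME sum_u16
#define INDEX_T uint16_t
#include "sum_impl.h"
#undef KERNEL_NAME
#undef INDEX_T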
+ */ + +#include + +#include +#include + +#include "8x4c1x4-packed-sse2.h" + +#define CONVERT_TO_FP_AND_TRANSPOSE(a, b, c, d, t_a, t_b, t_c, t_d) \ + a_ps = _mm_cvtepi32_ps(a); \ + b_ps = _mm_cvtepi32_ps(b); \ + c_ps = _mm_cvtepi32_ps(c); \ + d_ps = _mm_cvtepi32_ps(d); \ + tmp0 = _mm_shuffle_ps(a_ps, b_ps, _MM_SHUFFLE(1, 0, 1, 0)); \ + tmp1 = _mm_shuffle_ps(a_ps, b_ps, _MM_SHUFFLE(3, 2, 3, 2)); \ + tmp2 = _mm_shuffle_ps(c_ps, d_ps, _MM_SHUFFLE(1, 0, 1, 0)); \ + tmp3 = _mm_shuffle_ps(c_ps, d_ps, _MM_SHUFFLE(3, 2, 3, 2)); \ + t_a = _mm_shuffle_ps(tmp0, tmp2, _MM_SHUFFLE(2, 0, 2, 0)); \ + t_b = _mm_shuffle_ps(tmp0, tmp2, _MM_SHUFFLE(3, 1, 3, 1)); \ + t_c = _mm_shuffle_ps(tmp1, tmp3, _MM_SHUFFLE(2, 0, 2, 0)); \ + t_d = _mm_shuffle_ps(tmp1, tmp3, _MM_SHUFFLE(3, 1, 3, 1)); + +// KERNEL_NAME and W_INDEX_DTYPE macros are defined in +// https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x4c1x4-dq-packedA-sse2.c +void KERNEL_NAME( + size_t mr, + size_t nr, + const uint8_t* a_packed, + const uint8_t* packed_w, + const W_INDEX_DTYPE* w_row_ptr, + const W_INDEX_DTYPE* w_block_ids_ptr, + const float* b, + float* c, + size_t c_stride, + size_t output_channel_index, + const struct pytorch_qnnp_conv_dynamic_quantization_params + quantization_params[RESTRICT_STATIC 1]) { + const __m128i va_zero_point = _mm_set1_epi16(quantization_params->input_zero_point); + const __m128 vbias = _mm_load_ps(b); + const __m128i vzero = _mm_setzero_si128(); + + // Packed A format. + // 8kx4m blocks for alls blocks given 4 rows (4m) are placed in contiguous memory. + // Original A + // --------- K ----------- -- (K + 4 - 1) / 4 -- + // | | | | + // | | (M + 8 - 1)/8 | + // | | Packed | | + // M | => |-------------------| + // | | Thus Packed A has (K + 4 - 1)/4 * (M + 8 -1)/8 blocks + // | | + // |---------------------| + // + // Each 8 x 4 blocks is transposed and stored. + // Each of the (K + 4 - 1)/4 blocks for a given group of 8 m blocks + // are stored adjacent in memory + // Thus, each block: + // |----8m-----|----8m-----| + // 4k | | ..... + // |-----------|-----------| + // This locality helps in loading 8kx8m blocks of activations + // Note when M is not multiple of 8, the rest can contain arbitrary + // data in packed A as we will not be writing those out. + // This wil be taken care by just copying the appropriate valid data + + __m128i vacc_low[4]; + __m128i vacc_high[4]; + const __m128 vmultiplier = + _mm_loadu_ps(&quantization_params->multipliers[output_channel_index]); + for (int32_t n = 0; n < nr; n++) { + vacc_low[n] = _mm_setzero_si128(); + vacc_high[n] = _mm_setzero_si128(); + const int16_t b_zero_point = + (int16_t)(uint16_t)quantization_params->kernel_zero_points[ + output_channel_index + n]; + + int32_t num_blocks = w_row_ptr[n+1] - w_row_ptr[n]; + // Offset into compressed values. + // w_row_ptr[0] is the block offset in the compressed values. + // Where the corresponding row of the weight matrix starts. + const uint8_t* temp_packed_w = packed_w + w_row_ptr[n] * COL_BLOCK_SIZE; + // Similarly w_row_ptr[0] is also the block offset where + // corresponding row's block column ids start. + // Per row # of block column ids = # of block values + const W_INDEX_DTYPE* temp_w_block_ids_ptr = w_block_ids_ptr + w_row_ptr[n]; + while (num_blocks > 1) { + // Load two 1x4 uint8 blocks 2 ints + const uint8_t* b_ptr = temp_packed_w; + // This is not perf optimal since this will result in + // register spills. 
We probably should work with output block + // of 1x4 instead of 1x8 + // But doing is this way because mostly this how we will + // do it for ARM and this reference code helps establish + // the baseline for functional correctness. + const int16_t b_0 = (int16_t)((uint16_t)(b_ptr[0])); + const int16_t b_1 = (int16_t)((uint16_t)(b_ptr[1])); + const int16_t b_2 = (int16_t)((uint16_t)(b_ptr[2])); + const int16_t b_3 = (int16_t)((uint16_t)(b_ptr[3])); + const int16_t b_4 = (int16_t)((uint16_t)(b_ptr[4])); + const int16_t b_5 = (int16_t)((uint16_t)(b_ptr[5])); + const int16_t b_6 = (int16_t)((uint16_t)(b_ptr[6])); + const int16_t b_7 = (int16_t)((uint16_t)(b_ptr[7])); + // Now we will load 8kx1(broadcast 8) weight values + const __m128i vxb0 = _mm_set1_epi16((b_0 - b_zero_point)); + const __m128i vxb1 = _mm_set1_epi16((b_1 - b_zero_point)); + const __m128i vxb2 = _mm_set1_epi16((b_2 - b_zero_point)); + const __m128i vxb3 = _mm_set1_epi16((b_3 - b_zero_point)); + const __m128i vxb4 = _mm_set1_epi16((b_4 - b_zero_point)); + const __m128i vxb5 = _mm_set1_epi16((b_5 - b_zero_point)); + const __m128i vxb6 = _mm_set1_epi16((b_6 - b_zero_point)); + const __m128i vxb7 = _mm_set1_epi16((b_7 - b_zero_point)); + + // Load activation blocks. In this kernel we assume + // a mat is already transposed. K x M + // 1. Load 8 1x8 registers = 8k x 8m + + // Load column id of the first 1x4 block + int32_t col_block_id_0 = temp_w_block_ids_ptr[0]; + // Load column id of the second 1x4 block + int32_t col_block_id_1 = temp_w_block_ids_ptr[1]; + const __m128i va0 = + _mm_loadl_epi64((const __m128i*) (a_packed + + col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 0)); + const __m128i va1 = + _mm_loadl_epi64((const __m128i*) (a_packed + + col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 1)); + const __m128i va2 = + _mm_loadl_epi64((const __m128i*) (a_packed + + col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 2)); + const __m128i va3 = + _mm_loadl_epi64((const __m128i*) (a_packed + + col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 3)); + const __m128i va4 = + _mm_loadl_epi64((const __m128i*) (a_packed + + col_block_id_1 * PACKED_A_BLOCK_SIZE + MR * 0)); + const __m128i va5 = + _mm_loadl_epi64((const __m128i*) (a_packed + + col_block_id_1 * PACKED_A_BLOCK_SIZE + MR * 1)); + const __m128i va6 = + _mm_loadl_epi64((const __m128i*) (a_packed + + col_block_id_1 * PACKED_A_BLOCK_SIZE + MR * 2)); + const __m128i va7 = + _mm_loadl_epi64((const __m128i*) (a_packed + + col_block_id_1 * PACKED_A_BLOCK_SIZE + MR * 3)); + + const __m128i vxa0 = + sub_zero_point(_mm_unpacklo_epi8(va0, vzero), va_zero_point); + const __m128i vxa1 = + sub_zero_point(_mm_unpacklo_epi8(va1, vzero), va_zero_point); + const __m128i vxa2 = + sub_zero_point(_mm_unpacklo_epi8(va2, vzero), va_zero_point); + const __m128i vxa3 = + sub_zero_point(_mm_unpacklo_epi8(va3, vzero), va_zero_point); + const __m128i vxa4 = + sub_zero_point(_mm_unpacklo_epi8(va4, vzero), va_zero_point); + const __m128i vxa5 = + sub_zero_point(_mm_unpacklo_epi8(va5, vzero), va_zero_point); + const __m128i vxa6 = + sub_zero_point(_mm_unpacklo_epi8(va6, vzero), va_zero_point); + const __m128i vxa7 = + sub_zero_point(_mm_unpacklo_epi8(va7, vzero), va_zero_point); + + // acc += a0 * b0; + __m128i vacc_low_16bits = _mm_mullo_epi16(vxa0, vxb0); + __m128i vacc_high_16bits = _mm_mulhi_epi16(vxa0, vxb0); + vacc_low[n] = _mm_add_epi32(vacc_low[n], + _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); + vacc_high[n] = _mm_add_epi32(vacc_high[n], + _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); + // acc += 
a1 * b1; + vacc_low_16bits = _mm_mullo_epi16(vxa1, vxb1); + vacc_high_16bits = _mm_mulhi_epi16(vxa1, vxb1); + vacc_low[n] = _mm_add_epi32(vacc_low[n], + _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); + vacc_high[n] = _mm_add_epi32(vacc_high[n], + _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); + // acc += a2 * b2; + vacc_low_16bits = _mm_mullo_epi16(vxa2, vxb2); + vacc_high_16bits = _mm_mulhi_epi16(vxa2, vxb2); + vacc_low[n] = _mm_add_epi32(vacc_low[n], + _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); + vacc_high[n] = _mm_add_epi32(vacc_high[n], + _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); + // acc += a3 * b3; + vacc_low_16bits = _mm_mullo_epi16(vxa3, vxb3); + vacc_high_16bits = _mm_mulhi_epi16(vxa3, vxb3); + vacc_low[n] = _mm_add_epi32(vacc_low[n], + _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); + vacc_high[n] = _mm_add_epi32(vacc_high[n], + _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); + // acc += a4 * b4; + vacc_low_16bits = _mm_mullo_epi16(vxa4, vxb4); + vacc_high_16bits = _mm_mulhi_epi16(vxa4, vxb4); + vacc_low[n] = _mm_add_epi32(vacc_low[n], + _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); + vacc_high[n] = _mm_add_epi32(vacc_high[n], + _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); + // acc += a5 * b5; + vacc_low_16bits = _mm_mullo_epi16(vxa5, vxb5); + vacc_high_16bits = _mm_mulhi_epi16(vxa5, vxb5); + vacc_low[n] = _mm_add_epi32(vacc_low[n], + _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); + vacc_high[n] = _mm_add_epi32(vacc_high[n], + _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); + // acc += a6 * b6; + vacc_low_16bits = _mm_mullo_epi16(vxa6, vxb6); + vacc_high_16bits = _mm_mulhi_epi16(vxa6, vxb6); + vacc_low[n] = _mm_add_epi32(vacc_low[n], + _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); + vacc_high[n] = _mm_add_epi32(vacc_high[n], + _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); + // acc += a7 * b7; + vacc_low_16bits = _mm_mullo_epi16(vxa7, vxb7); + vacc_high_16bits = _mm_mulhi_epi16(vxa7, vxb7); + vacc_low[n] = _mm_add_epi32(vacc_low[n], + _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); + vacc_high[n] = _mm_add_epi32(vacc_high[n], + _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); + + // Now we have 1x8 m acculated 32 bit values in vacc_low[n](4) and vacc_high[n](4) + + temp_packed_w = temp_packed_w + COL_BLOCK_SIZE * 2; + temp_w_block_ids_ptr += 2; + num_blocks -= 2; + } + if (num_blocks > 0) { + // Load two 1x4 uint8 blocks 2 ints + const uint8_t* b_ptr = temp_packed_w; + const int16_t b_0 = (int16_t)((uint16_t)(b_ptr[0])); + const int16_t b_1 = (int16_t)((uint16_t)(b_ptr[1])); + const int16_t b_2 = (int16_t)((uint16_t)(b_ptr[2])); + const int16_t b_3 = (int16_t)((uint16_t)(b_ptr[3])); + // Now we will load 8kx1(broadcast 8) weight values + const __m128i vxb0 = _mm_set1_epi16((b_0 - b_zero_point)); + const __m128i vxb1 = _mm_set1_epi16((b_1 - b_zero_point)); + const __m128i vxb2 = _mm_set1_epi16((b_2 - b_zero_point)); + const __m128i vxb3 = _mm_set1_epi16((b_3 - b_zero_point)); + + // Then load transformed weight blocks + // 1. Load 4 1x8 registers = 4k x 8m + // Thus have 4x8 (4k x 8m) activations a0, a1, a2, a3 + // Each a containing 8 m values. 
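The va0..va7 loads index packed A by column block: within one block the eight m values of a k line are contiguous and successive k lines are MR bytes apart, which is why the load address is a_packed + col_block_id * PACKED_A_BLOCK_SIZE + MR * k_sub. A small C sketch of that addressing, assuming PACKED_A_BLOCK_SIZE equals MR * 4 as the 4k x 8m layout comment suggests (the real constants live in 8x4c1x4-packed-sse2.h, so treat these values as illustrative):

#include <stddef.h>
#include <stdint.h>

enum {
  MR_SKETCH = 8,                                           /* 8 m values per k line      */
  KBLOCK_SKETCH = 4,                                       /* 4 k lines per packed block */
  PACKED_A_BLOCK_SIZE_SKETCH = MR_SKETCH * KBLOCK_SKETCH   /* bytes per packed block     */
};

/* Returns the packed-A element holding activation (k, m), illustrating the
 * a_packed + col_block_id * PACKED_A_BLOCK_SIZE + MR * k_sub loads above. */
static uint8_t packed_a_at(const uint8_t* a_packed, size_t k, size_t m) {
  const size_t col_block_id = k / KBLOCK_SKETCH;   /* which 4k x 8m block     */
  const size_t k_sub = k % KBLOCK_SKETCH;          /* k line inside the block */
  return a_packed[col_block_id * PACKED_A_BLOCK_SIZE_SKETCH
                  + k_sub * MR_SKETCH + m];        /* m runs contiguously     */
}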
+ + // Load column id of the first 1x4 block + int32_t col_block_id_0 = temp_w_block_ids_ptr[0]; + const __m128i va0 = + _mm_loadl_epi64((const __m128i*) (a_packed + + col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 0)); + const __m128i va1 = + _mm_loadl_epi64((const __m128i*) (a_packed + + col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 1)); + const __m128i va2 = + _mm_loadl_epi64((const __m128i*) (a_packed + + col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 2)); + const __m128i va3 = + _mm_loadl_epi64((const __m128i*) (a_packed + + col_block_id_0 * PACKED_A_BLOCK_SIZE + MR * 3)); + const __m128i vxa0 = + sub_zero_point(_mm_unpacklo_epi8(va0, vzero), va_zero_point); + const __m128i vxa1 = + sub_zero_point(_mm_unpacklo_epi8(va1, vzero), va_zero_point); + const __m128i vxa2 = + sub_zero_point(_mm_unpacklo_epi8(va2, vzero), va_zero_point); + const __m128i vxa3 = + sub_zero_point(_mm_unpacklo_epi8(va3, vzero), va_zero_point); + + // acc += a0 * b0; + __m128i vacc_low_16bits = _mm_mullo_epi16(vxa0, vxb0); + __m128i vacc_high_16bits = _mm_mulhi_epi16(vxa0, vxb0); + vacc_low[n] = _mm_add_epi32(vacc_low[n], + _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); + vacc_high[n] = _mm_add_epi32(vacc_high[n], + _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); + // acc += a1 * b1; + vacc_low_16bits = _mm_mullo_epi16(vxa1, vxb1); + vacc_high_16bits = _mm_mulhi_epi16(vxa1, vxb1); + vacc_low[n] = _mm_add_epi32(vacc_low[n], + _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); + vacc_high[n] = _mm_add_epi32(vacc_high[n], + _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); + // acc += a2 * b2; + vacc_low_16bits = _mm_mullo_epi16(vxa2, vxb2); + vacc_high_16bits = _mm_mulhi_epi16(vxa2, vxb2); + vacc_low[n] = _mm_add_epi32(vacc_low[n], + _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); + vacc_high[n] = _mm_add_epi32(vacc_high[n], + _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); + // acc += a3 * b3; + vacc_low_16bits = _mm_mullo_epi16(vxa3, vxb3); + vacc_high_16bits = _mm_mulhi_epi16(vxa3, vxb3); + vacc_low[n] = _mm_add_epi32(vacc_low[n], + _mm_unpacklo_epi16(vacc_low_16bits, vacc_high_16bits)); + vacc_high[n] = _mm_add_epi32(vacc_high[n], + _mm_unpackhi_epi16(vacc_low_16bits, vacc_high_16bits)); + + // Now we have 1x8 m acculated 32 bit values in vacc_low[n](4) and vacc_high[n](4) + } + } + + __m128 vout[8]; + __m128 a_ps, b_ps, c_ps, d_ps, tmp0, tmp1, tmp2, tmp3; + + // Transform low half of 4x8 result + // That is 4x4 block (4n x 4m) + // Convert to FP and transpose: 4m x 4n + CONVERT_TO_FP_AND_TRANSPOSE(vacc_low[0], + vacc_low[1], + vacc_low[2], + vacc_low[3], + vout[0], + vout[1], + vout[2], + vout[3]) + CONVERT_TO_FP_AND_TRANSPOSE(vacc_high[0], + vacc_high[1], + vacc_high[2], + vacc_high[3], + vout[4], + vout[5], + vout[6], + vout[7]) + + vout[0] = _mm_mul_ps(vmultiplier, vout[0]); + vout[1] = _mm_mul_ps(vmultiplier, vout[1]); + vout[2] = _mm_mul_ps(vmultiplier, vout[2]); + vout[3] = _mm_mul_ps(vmultiplier, vout[3]); + vout[4] = _mm_mul_ps(vmultiplier, vout[4]); + vout[5] = _mm_mul_ps(vmultiplier, vout[5]); + vout[6] = _mm_mul_ps(vmultiplier, vout[6]); + vout[7] = _mm_mul_ps(vmultiplier, vout[7]); + + vout[0] = _mm_add_ps(vout[0], vbias); + vout[1] = _mm_add_ps(vout[1], vbias); + vout[2] = _mm_add_ps(vout[2], vbias); + vout[3] = _mm_add_ps(vout[3], vbias); + vout[4] = _mm_add_ps(vout[4], vbias); + vout[5] = _mm_add_ps(vout[5], vbias); + vout[6] = _mm_add_ps(vout[6], vbias); + vout[7] = _mm_add_ps(vout[7], vbias); + + float* c0 = c; + float* c1 = c0 + c_stride; + if (mr < 2) { + c1 
= c0; + vout[1] = vout[0]; + } + float* c2 = c1 + c_stride; + if (mr < 3) { + c2 = c0; + vout[2] = vout[0]; + } + float* c3 = c2 + c_stride; + if (mr < 4) { + c3 = c0; + vout[3] = vout[0]; + } + float* c4 = c3 + c_stride; + if (mr < 5) { + c4 = c0; + vout[4] = vout[0]; + } + float* c5 = c4 + c_stride; + if (mr < 6) { + c5 = c0; + vout[5] = vout[0]; + } + float* c6 = c5 + c_stride; + if (mr < 7) { + c6 = c0; + vout[6] = vout[0]; + } + float* c7 = c6 + c_stride; + if (mr < 8) { + c7 = c0; + vout[7] = vout[0]; + } + + if (nr == 4) { + _mm_storeu_ps(c0, vout[0]); + _mm_storeu_ps(c1, vout[1]); + _mm_storeu_ps(c2, vout[2]); + _mm_storeu_ps(c3, vout[3]); + _mm_storeu_ps(c4, vout[4]); + _mm_storeu_ps(c5, vout[5]); + _mm_storeu_ps(c6, vout[6]); + _mm_storeu_ps(c7, vout[7]); + } else { + if (nr >= 2) { + _mm_storel_pi((__m64*)c0, vout[0]); + _mm_storel_pi((__m64*)c1, vout[1]); + _mm_storel_pi((__m64*)c2, vout[2]); + _mm_storel_pi((__m64*)c3, vout[3]); + _mm_storel_pi((__m64*)c4, vout[4]); + _mm_storel_pi((__m64*)c5, vout[5]); + _mm_storel_pi((__m64*)c6, vout[6]); + _mm_storel_pi((__m64*)c7, vout[7]); + + nr -= 2; + + c0 += 2; + c1 += 2; + c2 += 2; + c3 += 2; + c4 += 2; + c5 += 2; + c6 += 2; + c7 += 2; + vout[0] = _mm_shuffle_ps(vout[0], vout[0], _MM_SHUFFLE(2, 2, 2, 2)); + vout[1] = _mm_shuffle_ps(vout[1], vout[1], _MM_SHUFFLE(2, 2, 2, 2)); + vout[2] = _mm_shuffle_ps(vout[2], vout[2], _MM_SHUFFLE(2, 2, 2, 2)); + vout[3] = _mm_shuffle_ps(vout[3], vout[3], _MM_SHUFFLE(2, 2, 2, 2)); + vout[4] = _mm_shuffle_ps(vout[4], vout[4], _MM_SHUFFLE(2, 2, 2, 2)); + vout[5] = _mm_shuffle_ps(vout[5], vout[5], _MM_SHUFFLE(2, 2, 2, 2)); + vout[6] = _mm_shuffle_ps(vout[6], vout[6], _MM_SHUFFLE(2, 2, 2, 2)); + vout[7] = _mm_shuffle_ps(vout[7], vout[7], _MM_SHUFFLE(2, 2, 2, 2)); + } + if (nr != 0) { + *c0 = _mm_cvtss_f32(vout[0]); + *c1 = _mm_cvtss_f32(vout[1]); + *c2 = _mm_cvtss_f32(vout[2]); + *c3 = _mm_cvtss_f32(vout[3]); + *c4 = _mm_cvtss_f32(vout[4]); + *c5 = _mm_cvtss_f32(vout[5]); + *c6 = _mm_cvtss_f32(vout[6]); + *c7 = _mm_cvtss_f32(vout[7]); + } + } +} diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x8c1x4-dq-packedA-aarch64-neon.S b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x8c1x4-dq-packedA-aarch64-neon.S index 375581ec3fec..aca408e89757 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x8c1x4-dq-packedA-aarch64-neon.S +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x8c1x4-dq-packedA-aarch64-neon.S @@ -8,6 +8,24 @@ #include +#ifndef IGNORE_CODE_ALIGN_DIRECTIVES +#define NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_5 .p2align 5 +#define NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_4 .p2align 4 +#define NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_3 .p2align 3 +#else +#define NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_5 +#define NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_4 +#define NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_3 +#endif + +# Macro for separating instructions. 
For most builds, ; can be used, but for +# ARM64 + Mach, ; begins a comment, and %% is used to separate instructions +#if defined(__MACH__) +#define XX %% +#else +#define XX ; +#endif + .macro TRANSPOSE_4X4_S32 vin0, vin1, vin2, vin3, temp0, temp1, temp2, temp3 TRN1 \temp0\().4s, \vin0\().4s, \vin1\().4s TRN2 \temp1\().4s, \vin0\().4s, \vin1\().4s @@ -30,7 +48,460 @@ # |params | 16 # |-----------| -# void pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA__aarch32_neon( +# void pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w##W_INDEX_DTYPE_NUM_BITS##__aarch64_neon( +# size_t mr, +# size_t nr, +# const uint8_t* a_packed, +# const uint8_t* packed_w, +# const uint##W_INDEX_DTYPE_NUM_BITS##_t* w_row_ptr, +# const uint##W_INDEX_DTYPE_NUM_BITS##_t* w_block_ids_ptr, +# const float* b, +# uint8_t* restrict c, +# size_t c_stride, +# size_t output_channel_index, +# const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) +#define MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_1X4_UKERNEL_8X8_PACKEDA__AARCH64_NEON(W_INDEX_DTYPE_NUM_BITS, W_INDEX_DTYPE_NUM_BYTES_ARG, W_INDEX_DTYPE_LOG_NUM_BYTES_ARG, LOAD_INDEX_INSTRUCTION) XX\ + BEGIN_FUNCTION pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w##W_INDEX_DTYPE_NUM_BITS##__aarch64_neon XX\ + XX\ + STP d15, d14, [sp, -16] XX\ + STP d13, d12, [sp, -32] XX\ + STP d11, d10, [sp, -48] XX\ + STP d9, d8, [sp, -64] XX\ + XX\ + MOV x11, x1 XX\ + /* Load output channel index */ XX\ + LDR x10, [sp, 8] XX\ + /* Load params */ XX\ + LDR x8, [sp, 16] XX\ + XX\ + /* Load a_zero_point */ XX\ + LD1R {v24.8b}, [x8] XX\ + ADD x8, x8, 8 XX\ + XX\ + /* Load pointer to per channel zero points array */ XX\ + LDR x17, [x8], 8 XX\ + XX\ + /* Load pointer to per channel multiplier */ XX\ + LDR x13, [x8] XX\ + XX\ + /* Add offset to the base pointer */ XX\ + ADD x17, x17, x10 XX\ + /* Mul by 4 to get byte offset for multiplier */ XX\ + LSL x10, x10, 2 XX\ + /* Add offset to the base pointer for multiplier */ XX\ + ADD x13, x13, x10 XX\ + XX\ + /* Load b_zero_point */ XX\ + LD1 {v25.8b}, [x17] XX\ + /* Load multiplier c0123 */ XX\ + LD1 {v26.4s}, [x13], 16 XX\ + /* Load multiplier c4567 */ XX\ + LD1 {v30.4s}, [x13] XX\ + XX\ + EOR x12, x12, x12 XX\ + EOR x13, x13, x13 XX\ + XX\ + CMP x1, 1 XX\ + B.LO _7_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_5 XX\ + _0_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + /* v8 := zero */ XX\ + EOR v8.16b, v8.16b, v8.16b XX\ + /* v9 := zero */ XX\ + EOR v9.16b, v9.16b, v9.16b XX\ + XX\ + DUP v29.8b, v25.b[0] XX\ + /* w12 = w_row_ptr[n], x13 = w_row_ptr[n+1] */ XX\ + /* x4 = x4 + W_INDEX_DTYPE_NUM_BYTES_ARG to point to next n */ XX\ + LOAD_INDEX_INSTRUCTION w12, [x4], W_INDEX_DTYPE_NUM_BYTES_ARG XX\ + LOAD_INDEX_INSTRUCTION w13, [x4] XX\ + /* x10 = temp_packed_w = packed_w + w_row_ptr[n] * 4 */ XX\ + /* This points to the first block of nonzero value */ XX\ + /* for the nth row. 
*/ XX\ + ADD x10, x3, x12, LSL #2 XX\ + /* x9 = temp_w_block_ids_ptr = w_block_ids_ptr (x5) + w_row_ptr[n] */ XX\ + /* LSL for when elements are >1 byte */ XX\ + /* (4 bytes: LSL #2, 2 bytes: LSL #1, 1 byte: LSL #0) */ XX\ + /* This points to the block id of the first block */ XX\ + /* It should contain x13 - x12 number of block ids */ XX\ + ADD x9, x5, x12, LSL W_INDEX_DTYPE_LOG_NUM_BYTES_ARG XX\ + /* x8 = num_blocks that needs to be processed */ XX\ + SUB x8, x13, x12 XX\ + SUBS x8, x8, 2 XX\ + B.LO _1_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + k_loop_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + /* b0-7 (channel 0) */ XX\ + LD1 {v10.8b}, [x10], 8 XX\ + USUBL v10.8h, v10.8b, v29.8b XX\ + XX\ + /* x12 = block_id_ptr[0] */ XX\ + /* x13 = block_id_ptr[1] */ XX\ + LOAD_INDEX_INSTRUCTION w12, [x9], W_INDEX_DTYPE_NUM_BYTES_ARG XX\ + LOAD_INDEX_INSTRUCTION w13, [x9], W_INDEX_DTYPE_NUM_BYTES_ARG XX\ + /* Add offset to x2 */ XX\ + /* Shift by 5 because each packed block is a block of 8x4 */ XX\ + /* which 32 bytes */ XX\ + ADD x16, x2, x12, LSL #5 XX\ + ADD x17, x2, x13, LSL #5 XX\ + XX\ + LD1 {v0.8b}, [x16], 8 XX\ + LD1 {v1.8b}, [x16], 8 XX\ + LD1 {v2.8b}, [x16], 8 XX\ + LD1 {v3.8b}, [x16] XX\ + LD1 {v4.8b}, [x17], 8 XX\ + LD1 {v5.8b}, [x17], 8 XX\ + LD1 {v6.8b}, [x17], 8 XX\ + LD1 {v7.8b}, [x17] XX\ + XX\ + USUBL v0.8h, v0.8b, v24.8b XX\ + USUBL v1.8h, v1.8b, v24.8b XX\ + USUBL v2.8h, v2.8b, v24.8b XX\ + USUBL v3.8h, v3.8b, v24.8b XX\ + USUBL v4.8h, v4.8b, v24.8b XX\ + USUBL v5.8h, v5.8b, v24.8b XX\ + USUBL v6.8h, v6.8b, v24.8b XX\ + USUBL v7.8h, v7.8b, v24.8b XX\ + XX\ + SMLAL v8.4s, v0.4h, v10.h[0] XX\ + SMLAL2 v9.4s, v0.8h, v10.h[0] XX\ + SMLAL v8.4s, v1.4h, v10.h[1] XX\ + SMLAL2 v9.4s, v1.8h, v10.h[1] XX\ + SMLAL v8.4s, v2.4h, v10.h[2] XX\ + SMLAL2 v9.4s, v2.8h, v10.h[2] XX\ + SMLAL v8.4s, v3.4h, v10.h[3] XX\ + SMLAL2 v9.4s, v3.8h, v10.h[3] XX\ + SMLAL v8.4s, v4.4h, v10.h[4] XX\ + SMLAL2 v9.4s, v4.8h, v10.h[4] XX\ + SMLAL v8.4s, v5.4h, v10.h[5] XX\ + SMLAL2 v9.4s, v5.8h, v10.h[5] XX\ + SMLAL v8.4s, v6.4h, v10.h[6] XX\ + SMLAL2 v9.4s, v6.8h, v10.h[6] XX\ + SUBS x8, x8, 2 XX\ + SMLAL v8.4s, v7.4h, v10.h[7] XX\ + SMLAL2 v9.4s, v7.8h, v10.h[7] XX\ + XX\ + XX\ + B.HS k_loop_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + _1_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + CMP x8, -2 XX\ + B.EQ _2_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + /* b0-7 (channel 0) */ XX\ + LD1R {v10.4s}, [x10] XX\ + USUBL v10.8h, v10.8b, v29.8b XX\ + XX\ + /* x12 = block_id_ptr[0] */ XX\ + LOAD_INDEX_INSTRUCTION w12, [x9] XX\ + /* Add offset to x2 */ XX\ + /* Shift by 5 because each packed block is a block of 8x4 */ XX\ + /* which 32 bytes */ XX\ + ADD x16, x2, x12, LSL #5 XX\ + XX\ + LD1 {v0.8b}, [x16], 8 XX\ + LD1 {v1.8b}, [x16], 8 XX\ + LD1 {v2.8b}, [x16], 8 XX\ + LD1 {v3.8b}, [x16] XX\ + XX\ + USUBL v0.8h, v0.8b, v24.8b XX\ + USUBL v1.8h, v1.8b, v24.8b XX\ + USUBL v2.8h, v2.8b, v24.8b XX\ + USUBL v3.8h, v3.8b, v24.8b XX\ + XX\ + SMLAL v8.4s, v0.4h, v10.h[0] XX\ + SMLAL2 v9.4s, v0.8h, v10.h[0] XX\ + SMLAL v8.4s, v1.4h, v10.h[1] XX\ + SMLAL2 v9.4s, v1.8h, v10.h[1] XX\ + SMLAL v8.4s, v2.4h, v10.h[2] XX\ + SMLAL2 v9.4s, v2.8h, v10.h[2] XX\ + SMLAL v8.4s, v3.4h, v10.h[3] XX\ + SMLAL2 v9.4s, v3.8h, v10.h[3] XX\ + XX\ + NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_4 XX\ + _2_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + /* Store result on stack */ XX\ + XX\ + /* -64 because all d8-d15 are on stack */ XX\ + /* + 256 bytes of buffer when nr = 1 */ XX\ + /* 256 because we are doing 8x8 block with each value being 4 bytes */ XX\ + /* Thus 64 * 4 = 256 */ XX\ + /* 256 + 64 = 320 */ 
XX\ + /* This is needed because after processing all nrs we will */ XX\ + /* load 256 bytes from stack. */ XX\ + /* Thus we will load accumulators back in v8, v9, v10, v11, v12, v13, v14, v15 */ XX\ + /* v16, v17, v18, v19, v20, v21, v22, v23 */ XX\ + /* When nr < 8, say nr = 1, extra v values will be fetched from stack which may overlap */ XX\ + /* with other parts of stack storing local variables. To avoid that we just */ XX\ + /* create a buffer of 256 bytes inbetween to make sure pointer increment */ XX\ + /* never produces address that is beyond the stack frame of this function. */ XX\ + SUB x9, sp, 320 XX\ + /* Each iteration produce 8 values each of 4 bytes */ XX\ + /* Thus 8 x 4 = 32 bytes 2^5 */ XX\ + /* In this implementation, first value will be stored at */ XX\ + /* 1st value: sp - 64 - r1 * 32 */ XX\ + /* 2nd value: sp - 12 - (r1 - 1) * 32 */ XX\ + /* and so on. */ XX\ + SUB x9, x9, x1, LSL #5 XX\ + ST1 {v8.4s}, [x9], 16 XX\ + ST1 {v9.4s}, [x9] XX\ + XX\ + /* Shift zero point vector by 8 to load */ XX\ + /* zero point of the next channel */ XX\ + SRI v25.2d, v25.2d, #8 XX\ + /* Check if nr >=1 */ XX\ + SUBS x1, x1, 1 XX\ + BHI _0_w##W_INDEX_DTYPE_NUM_BITS XX\ + _3_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + /* First load all the accumulators from stack */ XX\ + /* Load nr */ XX\ + SUB x9, sp, 320 XX\ + SUB x9, x9, x11, LSL #5 XX\ + /* Now load v8-v15 */ XX\ + /* This is 8x4 block (nrxmr) */ XX\ + /* We will transpose this to 4x8 (mrxnr) */ XX\ + /* v8, v9 : x00, x10, x20, x30; x40, x50, x60, x70 */ XX\ + /* v10, v11 : x01, x11, x21, x31; x41, x51, x61, x71 */ XX\ + /* v12, v13 : x02, x12, x22, x32; x42, x52, x62, x72 */ XX\ + /* v14, v15 : x03, x13, x23, x33; x43, x53, x63, x73 */ XX\ + /* */ XX\ + /* v16, v17 : x04, x14, x24, x34; x44, x54, x64, x74 */ XX\ + /* v18, v19 : x05, x15, x25, x35; x45, x55, x65, x75 */ XX\ + /* v20, v21 : x06, x16, x26, x36; x46, x56, x66, x76 */ XX\ + /* v22, v23 : x07, x17, x27, x37; x47, x57, x67, x77 */ XX\ + LD1 {v8.4s}, [x9], 16 XX\ + LD1 {v9.4s}, [x9], 16 XX\ + LD1 {v10.4s}, [x9], 16 XX\ + LD1 {v11.4s}, [x9], 16 XX\ + LD1 {v12.4s}, [x9], 16 XX\ + LD1 {v13.4s}, [x9], 16 XX\ + LD1 {v14.4s}, [x9], 16 XX\ + LD1 {v15.4s}, [x9], 16 XX\ + LD1 {v16.4s}, [x9], 16 XX\ + LD1 {v17.4s}, [x9], 16 XX\ + LD1 {v18.4s}, [x9], 16 XX\ + LD1 {v19.4s}, [x9], 16 XX\ + LD1 {v20.4s}, [x9], 16 XX\ + LD1 {v21.4s}, [x9], 16 XX\ + LD1 {v22.4s}, [x9], 16 XX\ + LD1 {v23.4s}, [x9] XX\ + XX\ + /* We can tranpose one 4x4 block using macro */ XX\ + /* TRANSPOSE_4X4_S32 v8, v10, v12, v14, v0, v1, v2, v3 */ XX\ + /* After this we have */ XX\ + /* v8 : x00, x01, x02, x03 */ XX\ + /* v10 : x10, x11, x12, x13 */ XX\ + /* v12 : x20, x21, x22, x23 */ XX\ + /* v14 : x30, x31, x32, x33 */ XX\ + /* Then using */ XX\ + /* TRANSPOSE_4X4_S32 v16, v18, v20, v22, v4, v5, v6, v7 */ XX\ + /* We get */ XX\ + /* v16 : x04, x05, x06, x07 */ XX\ + /* v18 : x14, x15, x16, x17 */ XX\ + /* v20 : x24, x25, x26, x27 */ XX\ + /* v22 : x34, x35, x36, x37 */ XX\ + /* Similarly we can transpose other two 4x4 blocks and we get */ XX\ + /* tranposed 8x8 */ XX\ + XX\ + TRANSPOSE_4X4_S32 v8, v10, v12, v14, v0, v1, v2, v3 XX\ + TRANSPOSE_4X4_S32 v16, v18, v20, v22, v4, v5, v6, v7 XX\ + TRANSPOSE_4X4_S32 v9, v11, v13, v15, v0, v1, v2, v3 XX\ + TRANSPOSE_4X4_S32 v17, v19, v21, v23, v4, v5, v6, v7 XX\ + XX\ + /* row 0: v8, v16 */ XX\ + /* row 1: v10, v18 */ XX\ + /* row 2: v12, v20 */ XX\ + /* row 3: v14, v22 */ XX\ + /* row 4: v9, v17 */ XX\ + /* row 5: v11, v19 */ XX\ + /* row 6: v13, v21 */ XX\ + /* row 7: v15, v23 */ 
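The four TRANSPOSE_4X4_S32 invocations turn the per-channel accumulator layout reloaded from the stack into row-major output rows, two 4x4 tiles at a time. A plain C sketch of the 4x4 int32 transpose the macro implements with TRN and ZIP instructions:

#include <stdint.h>

/* Scalar equivalent of TRANSPOSE_4X4_S32: out[i][j] = in[j][i].
 * Applying it to both 4x4 tiles of a register pair yields one transposed
 * half of the 8x8 result. */
static void transpose_4x4_s32(const int32_t in[4][4], int32_t out[4][4]) {
  for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 4; j++) {
      out[i][j] = in[j][i];
    }
  }
}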
XX\ + XX\ + /* Load c_stride & params */ XX\ + LDR x16, [sp] XX\ + LSL x16, x16, 2 XX\ + LD1 {v24.4s}, [x6], 16 XX\ + LD1 {v25.4s}, [x6] XX\ + XX\ + SCVTF v8.4s, v8.4s XX\ + SCVTF v9.4s, v9.4s XX\ + SCVTF v10.4s, v10.4s XX\ + SCVTF v11.4s, v11.4s XX\ + SCVTF v12.4s, v12.4s XX\ + SCVTF v13.4s, v13.4s XX\ + SCVTF v14.4s, v14.4s XX\ + SCVTF v15.4s, v15.4s XX\ + SCVTF v16.4s, v16.4s XX\ + SCVTF v17.4s, v17.4s XX\ + SCVTF v18.4s, v18.4s XX\ + SCVTF v19.4s, v19.4s XX\ + SCVTF v20.4s, v20.4s XX\ + SCVTF v21.4s, v21.4s XX\ + SCVTF v22.4s, v22.4s XX\ + SCVTF v23.4s, v23.4s XX\ + XX\ + FMUL v8.4s, v8.4s, v26.4s XX\ + FMUL v16.4s, v16.4s, v30.4s XX\ + FMUL v10.4s, v10.4s, v26.4s XX\ + FMUL v18.4s, v18.4s, v30.4s XX\ + FMUL v12.4s, v12.4s, v26.4s XX\ + FMUL v20.4s, v20.4s, v30.4s XX\ + FMUL v14.4s, v14.4s, v26.4s XX\ + FMUL v22.4s, v22.4s, v30.4s XX\ + FMUL v9.4s, v9.4s, v26.4s XX\ + FMUL v17.4s, v17.4s, v30.4s XX\ + FMUL v11.4s, v11.4s, v26.4s XX\ + FMUL v19.4s, v19.4s, v30.4s XX\ + FMUL v13.4s, v13.4s, v26.4s XX\ + FMUL v21.4s, v21.4s, v30.4s XX\ + FMUL v15.4s, v15.4s, v26.4s XX\ + FMUL v23.4s, v23.4s, v30.4s XX\ + XX\ + FADD v8.4s, v8.4s, v24.4s XX\ + FADD v16.4s, v16.4s, v25.4s XX\ + FADD v10.4s, v10.4s, v24.4s XX\ + FADD v18.4s, v18.4s, v25.4s XX\ + FADD v12.4s, v12.4s, v24.4s XX\ + FADD v20.4s, v20.4s, v25.4s XX\ + FADD v14.4s, v14.4s, v24.4s XX\ + FADD v22.4s, v22.4s, v25.4s XX\ + FADD v9.4s, v9.4s, v24.4s XX\ + FADD v17.4s, v17.4s, v25.4s XX\ + FADD v11.4s, v11.4s, v24.4s XX\ + FADD v19.4s, v19.4s, v25.4s XX\ + FADD v13.4s, v13.4s, v24.4s XX\ + FADD v21.4s, v21.4s, v25.4s XX\ + FADD v15.4s, v15.4s, v24.4s XX\ + FADD v23.4s, v23.4s, v25.4s XX\ + XX\ + /* Compute c0-c7 */ XX\ + XX\ + ADD x9, x7, x16 XX\ + CMP x0, 2 XX\ + CSEL x9, x7, x9, LO XX\ + XX\ + ADD x10, x9, x16 XX\ + CSEL x10, x9, x10, LS XX\ + XX\ + ADD x8, x10, x16 XX\ + CMP x0, 4 XX\ + CSEL x8, x10, x8, LO XX\ + XX\ + ADD x12, x8, x16 XX\ + CSEL x12, x8, x12, LS XX\ + XX\ + ADD x13, x12, x16 XX\ + CMP x0, 6 XX\ + CSEL x13, x12, x13, LO XX\ + XX\ + ADD x14, x13, x16 XX\ + CSEL x14, x13, x14, LS XX\ + XX\ + ADD x15, x14, x16 XX\ + CMP x0, 8 XX\ + CSEL x15, x14, x15, NE XX\ + XX\ + CMP x11, 8 XX\ + B.NE _4_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + ST1 {v8.4s}, [x7], 16 XX\ + ST1 {v16.4s}, [x7] XX\ + ST1 {v10.4s}, [x9], 16 XX\ + ST1 {v18.4s}, [x9] XX\ + ST1 {v12.4s}, [x10], 16 XX\ + ST1 {v20.4s}, [x10] XX\ + ST1 {v14.4s}, [x8], 16 XX\ + ST1 {v22.4s}, [x8] XX\ + ST1 {v9.4s}, [x12], 16 XX\ + ST1 {v17.4s}, [x12] XX\ + ST1 {v11.4s}, [x13], 16 XX\ + ST1 {v19.4s}, [x13] XX\ + ST1 {v13.4s}, [x14], 16 XX\ + ST1 {v21.4s}, [x14] XX\ + ST1 {v15.4s}, [x15], 16 XX\ + ST1 {v23.4s}, [x15] XX\ + XX\ + LDP d9, d8, [sp, -64] XX\ + LDP d11, d10, [sp, -48] XX\ + LDP d13, d12, [sp, -32] XX\ + LDP d15, d14, [sp, -16] XX\ + XX\ + RET XX\ + XX\ + NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_3 XX\ + _4_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + CMP x11, 4 XX\ + B.LO _5_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + ST1 {v8.4s}, [x7], 16 XX\ + ST1 {v10.4s}, [x9], 16 XX\ + ST1 {v12.4s}, [x10], 16 XX\ + ST1 {v14.4s}, [x8], 16 XX\ + ST1 {v9.4s}, [x12], 16 XX\ + ST1 {v11.4s}, [x13], 16 XX\ + ST1 {v13.4s}, [x14], 16 XX\ + ST1 {v15.4s}, [x15], 16 XX\ + XX\ + SUB x11, x11, 4 XX\ + XX\ + MOV v8.16b, v16.16b XX\ + MOV v10.16b, v18.16b XX\ + MOV v12.16b, v20.16b XX\ + MOV v14.16b, v22.16b XX\ + MOV v9.16b, v17.16b XX\ + MOV v11.16b, v19.16b XX\ + MOV v13.16b, v21.16b XX\ + MOV v15.16b, v23.16b XX\ + XX\ + _5_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + CMP x11, 2 XX\ + B.LO _6_w##W_INDEX_DTYPE_NUM_BITS XX\ 
+ XX\ + ST1 {v8.2s}, [x7], 8 XX\ + ST1 {v10.2s}, [x9], 8 XX\ + ST1 {v12.2s}, [x10], 8 XX\ + ST1 {v14.2s}, [x8], 8 XX\ + ST1 {v9.2s}, [x12], 8 XX\ + ST1 {v11.2s}, [x13], 8 XX\ + ST1 {v13.2s}, [x14], 8 XX\ + ST1 {v15.2s}, [x15], 8 XX\ + XX\ + SUB x11, x11, 2 XX\ + XX\ + EXT v8.16b, v8.16b, v8.16b, 8 XX\ + EXT v10.16b, v10.16b, v10.16b, 8 XX\ + EXT v12.16b, v12.16b, v12.16b, 8 XX\ + EXT v14.16b, v14.16b, v14.16b, 8 XX\ + EXT v9.16b, v9.16b, v9.16b, 8 XX\ + EXT v11.16b, v11.16b, v11.16b, 8 XX\ + EXT v13.16b, v13.16b, v13.16b, 8 XX\ + EXT v15.16b, v15.16b, v15.16b, 8 XX\ + XX\ + _6_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + CMP x11, 1 XX\ + B.LO _7_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + ST1 {v8.s}[0], [x7] XX\ + ST1 {v10.s}[0], [x9] XX\ + ST1 {v12.s}[0], [x10] XX\ + ST1 {v14.s}[0], [x8] XX\ + ST1 {v9.s}[0], [x12] XX\ + ST1 {v11.s}[0], [x13] XX\ + ST1 {v13.s}[0], [x14] XX\ + ST1 {v15.s}[0], [x15] XX\ + XX\ + _7_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + LDP d9, d8, [sp, -64] XX\ + LDP d11, d10, [sp, -48] XX\ + LDP d13, d12, [sp, -32] XX\ + LDP d15, d14, [sp, -16] XX\ + XX\ + RET XX\ + XX\ + END_FUNCTION pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w##W_INDEX_DTYPE_NUM_BITS##__aarch64_neon + +# void pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w32__aarch64_neon( # size_t mr, # size_t nr, # const uint8_t* a_packed, @@ -42,451 +513,42 @@ # size_t c_stride, # size_t output_channel_index, # const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) -BEGIN_FUNCTION pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA__aarch64_neon - - STP d15, d14, [sp, -16] - STP d13, d12, [sp, -32] - STP d11, d10, [sp, -48] - STP d9, d8, [sp, -64] - - MOV x11, x1 - # Load output channel index - LDR x10, [sp, 8] - # Load params - LDR x8, [sp, 16] - - # Load a_zero_point - LD1R {v24.8b}, [x8] - ADD x8, x8, 8 - - # Load pointer to per channel zero points array - LDR x17, [x8], 8 - - # Load pointer to per channel multiplier - LDR x13, [x8] - - # Add offset to the base pointer - ADD x17, x17, x10 - # Mul by 4 to get byte offset for multiplier - LSL x10, x10, 2 - # Add offset to the base pointer for multiplier - ADD x13, x13, x10 - - # Load b_zero_point - LD1 {v25.8b}, [x17] - # Load multiplier c0123 - LD1 {v26.4s}, [x13], 16 - # Load multiplier c4567 - LD1 {v30.4s}, [x13] - - EOR x12, x12, x12 - EOR x13, x13, x13 - - CMP x1, 1 - B.LO 7f - -#ifndef IGNORE_CODE_ALIGN_DIRECTIVES - .p2align 5 -#endif -0: - # v8 := zero - EOR v8.16b, v8.16b, v8.16b - # v9 := zero - EOR v9.16b, v9.16b, v9.16b - - DUP v29.8b, v25.b[0] - # w12 = w_row_ptr[n], x13 = w_row_ptr[n+1] - # x4 = x4 + 4 to point to next n - LDR w12, [x4], #4 - LDR w13, [x4] - # x10 = temp_packed_w = packed_w + w_row_ptr[n] * 4 - # This points to the first block of nonzero value - # for the nth row. 
- ADD x10, x3, x12, LSL #2 - # x9 = temp_w_block_ids_ptr = w_block_ids_ptr (x5) + w_row_ptr[n] - # LSL2 because each element is 4 bytes - # This points to the block id of the first block - # It should contain x13 - x12 number of block ids - ADD x9, x5, x12, LSL #2 - # x8 = num_blocks that needs to be processed - SUB x8, x13, x12 - SUBS x8, x8, 2 - B.LO 1f - -k_loop: - // b0-7 (channel 0) - LD1 {v10.8b}, [x10], 8 - USUBL v10.8h, v10.8b, v29.8b - - #x12 = block_id_ptr[0] - #x13 = block_id_ptr[1] - LDR w12, [x9], #4 - LDR w13, [x9], #4 - # Add offset to x2 - # Shift by 5 because each packed block is a block of 8x4 - # which 32 bytes - ADD x16, x2, x12, LSL #5 - ADD x17, x2, x13, LSL #5 - - LD1 {v0.8b}, [x16], 8 - LD1 {v1.8b}, [x16], 8 - LD1 {v2.8b}, [x16], 8 - LD1 {v3.8b}, [x16] - LD1 {v4.8b}, [x17], 8 - LD1 {v5.8b}, [x17], 8 - LD1 {v6.8b}, [x17], 8 - LD1 {v7.8b}, [x17] - - USUBL v0.8h, v0.8b, v24.8b - USUBL v1.8h, v1.8b, v24.8b - USUBL v2.8h, v2.8b, v24.8b - USUBL v3.8h, v3.8b, v24.8b - USUBL v4.8h, v4.8b, v24.8b - USUBL v5.8h, v5.8b, v24.8b - USUBL v6.8h, v6.8b, v24.8b - USUBL v7.8h, v7.8b, v24.8b - - SMLAL v8.4s, v0.4h, v10.h[0] - SMLAL2 v9.4s, v0.8h, v10.h[0] - SMLAL v8.4s, v1.4h, v10.h[1] - SMLAL2 v9.4s, v1.8h, v10.h[1] - SMLAL v8.4s, v2.4h, v10.h[2] - SMLAL2 v9.4s, v2.8h, v10.h[2] - SMLAL v8.4s, v3.4h, v10.h[3] - SMLAL2 v9.4s, v3.8h, v10.h[3] - SMLAL v8.4s, v4.4h, v10.h[4] - SMLAL2 v9.4s, v4.8h, v10.h[4] - SMLAL v8.4s, v5.4h, v10.h[5] - SMLAL2 v9.4s, v5.8h, v10.h[5] - SMLAL v8.4s, v6.4h, v10.h[6] - SMLAL2 v9.4s, v6.8h, v10.h[6] - SUBS x8, x8, 2 - SMLAL v8.4s, v7.4h, v10.h[7] - SMLAL2 v9.4s, v7.8h, v10.h[7] - - - B.HS k_loop - -1: - CMP x8, -2 - B.EQ 2f - - // b0-7 (channel 0) - LD1R {v10.4s}, [x10] - USUBL v10.8h, v10.8b, v29.8b - - #x12 = block_id_ptr[0] - LDR w12, [x9] - # Add offset to x2 - # Shift by 5 because each packed block is a block of 8x4 - # which 32 bytes - ADD x16, x2, x12, LSL #5 - - LD1 {v0.8b}, [x16], 8 - LD1 {v1.8b}, [x16], 8 - LD1 {v2.8b}, [x16], 8 - LD1 {v3.8b}, [x16] - - USUBL v0.8h, v0.8b, v24.8b - USUBL v1.8h, v1.8b, v24.8b - USUBL v2.8h, v2.8b, v24.8b - USUBL v3.8h, v3.8b, v24.8b - - SMLAL v8.4s, v0.4h, v10.h[0] - SMLAL2 v9.4s, v0.8h, v10.h[0] - SMLAL v8.4s, v1.4h, v10.h[1] - SMLAL2 v9.4s, v1.8h, v10.h[1] - SMLAL v8.4s, v2.4h, v10.h[2] - SMLAL2 v9.4s, v2.8h, v10.h[2] - SMLAL v8.4s, v3.4h, v10.h[3] - SMLAL2 v9.4s, v3.8h, v10.h[3] - -#ifndef IGNORE_CODE_ALIGN_DIRECTIVES - .p2align 4 -#endif -2: - # Store result on stack - - # -64 because all d8-d15 are on stack - # + 256 bytes of buffer when nr = 1 - # 256 because we are doing 8x8 block with each value being 4 bytes - # Thus 64 * 4 = 256 - # 256 + 64 = 320 - # This is needed because after processing all nrs we will - # load 256 bytes from stack. - # Thus we will load accumulators back in v8, v9, v10, v11, v12, v13, v14, v15 - # v16, v17, v18, v19, v20, v21, v22, v23 - # When nr < 8, say nr = 1, extra v values will be fetched from stack which may overlap - # with other parts of stack storing local variables. To avoid that we just - # create a buffer of 256 bytes inbetween to make sure pointer increment - # never produces address that is beyond the stack frame of this function. - SUB x9, sp, 320 - # Each iteration produce 8 values each of 4 bytes - # Thus 8 x 4 = 32 bytes 2^5 - # In this implementation, first value will be stored at - # 1st value: sp - 64 - r1 * 32 - # 2nd value: sp - 12 - (r1 - 1) * 32 - # and so on. 
- SUB x9, x9, x1, LSL #5 - ST1 {v8.4s}, [x9], 16 - ST1 {v9.4s}, [x9] - - # Shift zero point vector by 8 to load - # zero point of the next channel - SRI v25.2d, v25.2d, #8 - # Check if nr >=1 - SUBS x1, x1, 1 - BHI 0b -3: - # First load all the accumulators from stack - # Load nr - SUB x9, sp, 320 - SUB x9, x9, x11, LSL #5 - # Now load v8-v15 - # This is 8x4 block (nrxmr) - # We will transpose this to 4x8 (mrxnr) - # v8, v9 : x00, x10, x20, x30; x40, x50, x60, x70 - # v10, v11 : x01, x11, x21, x31; x41, x51, x61, x71 - # v12, v13 : x02, x12, x22, x32; x42, x52, x62, x72 - # v14, v15 : x03, x13, x23, x33; x43, x53, x63, x73 - # - # v16, v17 : x04, x14, x24, x34; x44, x54, x64, x74 - # v18, v19 : x05, x15, x25, x35; x45, x55, x65, x75 - # v20, v21 : x06, x16, x26, x36; x46, x56, x66, x76 - # v22, v23 : x07, x17, x27, x37; x47, x57, x67, x77 - LD1 {v8.4s}, [x9], 16 - LD1 {v9.4s}, [x9], 16 - LD1 {v10.4s}, [x9], 16 - LD1 {v11.4s}, [x9], 16 - LD1 {v12.4s}, [x9], 16 - LD1 {v13.4s}, [x9], 16 - LD1 {v14.4s}, [x9], 16 - LD1 {v15.4s}, [x9], 16 - LD1 {v16.4s}, [x9], 16 - LD1 {v17.4s}, [x9], 16 - LD1 {v18.4s}, [x9], 16 - LD1 {v19.4s}, [x9], 16 - LD1 {v20.4s}, [x9], 16 - LD1 {v21.4s}, [x9], 16 - LD1 {v22.4s}, [x9], 16 - LD1 {v23.4s}, [x9] - - # We can tranpose one 4x4 block using macro - # TRANSPOSE_4X4_S32 v8, v10, v12, v14, v0, v1, v2, v3 - # After this we have - # v8 : x00, x01, x02, x03 - # v10 : x10, x11, x12, x13 - # v12 : x20, x21, x22, x23 - # v14 : x30, x31, x32, x33 - # Then using - # TRANSPOSE_4X4_S32 v16, v18, v20, v22, v4, v5, v6, v7 - # We get - # v16 : x04, x05, x06, x07 - # v18 : x14, x15, x16, x17 - # v20 : x24, x25, x26, x27 - # v22 : x34, x35, x36, x37 - # Similarly we can transpose other two 4x4 blocks and we get - # tranposed 8x8 - - TRANSPOSE_4X4_S32 v8, v10, v12, v14, v0, v1, v2, v3 - TRANSPOSE_4X4_S32 v16, v18, v20, v22, v4, v5, v6, v7 - TRANSPOSE_4X4_S32 v9, v11, v13, v15, v0, v1, v2, v3 - TRANSPOSE_4X4_S32 v17, v19, v21, v23, v4, v5, v6, v7 - - # row 0: v8, v16 - # row 1: v10, v18 - # row 2: v12, v20 - # row 3: v14, v22 - # row 4: v9, v17 - # row 5: v11, v19 - # row 6: v13, v21 - # row 7: v15, v23 +MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_1X4_UKERNEL_8X8_PACKEDA__AARCH64_NEON(32, #4, #2, LDR) - # Load c_stride & params - LDR x16, [sp] - LSL x16, x16, 2 - LD1 {v24.4s}, [x6], 16 - LD1 {v25.4s}, [x6] - - SCVTF v8.4s, v8.4s - SCVTF v9.4s, v9.4s - SCVTF v10.4s, v10.4s - SCVTF v11.4s, v11.4s - SCVTF v12.4s, v12.4s - SCVTF v13.4s, v13.4s - SCVTF v14.4s, v14.4s - SCVTF v15.4s, v15.4s - SCVTF v16.4s, v16.4s - SCVTF v17.4s, v17.4s - SCVTF v18.4s, v18.4s - SCVTF v19.4s, v19.4s - SCVTF v20.4s, v20.4s - SCVTF v21.4s, v21.4s - SCVTF v22.4s, v22.4s - SCVTF v23.4s, v23.4s - - FMUL v8.4s, v8.4s, v26.4s - FMUL v16.4s, v16.4s, v30.4s - FMUL v10.4s, v10.4s, v26.4s - FMUL v18.4s, v18.4s, v30.4s - FMUL v12.4s, v12.4s, v26.4s - FMUL v20.4s, v20.4s, v30.4s - FMUL v14.4s, v14.4s, v26.4s - FMUL v22.4s, v22.4s, v30.4s - FMUL v9.4s, v9.4s, v26.4s - FMUL v17.4s, v17.4s, v30.4s - FMUL v11.4s, v11.4s, v26.4s - FMUL v19.4s, v19.4s, v30.4s - FMUL v13.4s, v13.4s, v26.4s - FMUL v21.4s, v21.4s, v30.4s - FMUL v15.4s, v15.4s, v26.4s - FMUL v23.4s, v23.4s, v30.4s - - FADD v8.4s, v8.4s, v24.4s - FADD v16.4s, v16.4s, v25.4s - FADD v10.4s, v10.4s, v24.4s - FADD v18.4s, v18.4s, v25.4s - FADD v12.4s, v12.4s, v24.4s - FADD v20.4s, v20.4s, v25.4s - FADD v14.4s, v14.4s, v24.4s - FADD v22.4s, v22.4s, v25.4s - FADD v9.4s, v9.4s, v24.4s - FADD v17.4s, v17.4s, v25.4s - FADD v11.4s, v11.4s, v24.4s - FADD v19.4s, v19.4s, v25.4s - FADD 
v13.4s, v13.4s, v24.4s - FADD v21.4s, v21.4s, v25.4s - FADD v15.4s, v15.4s, v24.4s - FADD v23.4s, v23.4s, v25.4s - - // Compute c0-c7 - - ADD x9, x7, x16 - CMP x0, 2 - CSEL x9, x7, x9, LO - - ADD x10, x9, x16 - CSEL x10, x9, x10, LS - - ADD x8, x10, x16 - CMP x0, 4 - CSEL x8, x10, x8, LO - - ADD x12, x8, x16 - CSEL x12, x8, x12, LS - - ADD x13, x12, x16 - CMP x0, 6 - CSEL x13, x12, x13, LO - - ADD x14, x13, x16 - CSEL x14, x13, x14, LS - - ADD x15, x14, x16 - CMP x0, 8 - CSEL x15, x14, x15, NE - - CMP x11, 8 - B.NE 4f - - ST1 {v8.4s}, [x7], 16 - ST1 {v16.4s}, [x7] - ST1 {v10.4s}, [x9], 16 - ST1 {v18.4s}, [x9] - ST1 {v12.4s}, [x10], 16 - ST1 {v20.4s}, [x10] - ST1 {v14.4s}, [x8], 16 - ST1 {v22.4s}, [x8] - ST1 {v9.4s}, [x12], 16 - ST1 {v17.4s}, [x12] - ST1 {v11.4s}, [x13], 16 - ST1 {v19.4s}, [x13] - ST1 {v13.4s}, [x14], 16 - ST1 {v21.4s}, [x14] - ST1 {v15.4s}, [x15], 16 - ST1 {v23.4s}, [x15] - - LDP d9, d8, [sp, -64] - LDP d11, d10, [sp, -48] - LDP d13, d12, [sp, -32] - LDP d15, d14, [sp, -16] - - RET - -#ifndef IGNORE_CODE_ALIGN_DIRECTIVES - .p2align 3 -#endif -4: - CMP x11, 4 - B.LO 5f - - ST1 {v8.4s}, [x7], 16 - ST1 {v10.4s}, [x9], 16 - ST1 {v12.4s}, [x10], 16 - ST1 {v14.4s}, [x8], 16 - ST1 {v9.4s}, [x12], 16 - ST1 {v11.4s}, [x13], 16 - ST1 {v13.4s}, [x14], 16 - ST1 {v15.4s}, [x15], 16 - - SUB x11, x11, 4 - - MOV v8.16b, v16.16b - MOV v10.16b, v18.16b - MOV v12.16b, v20.16b - MOV v14.16b, v22.16b - MOV v9.16b, v17.16b - MOV v11.16b, v19.16b - MOV v13.16b, v21.16b - MOV v15.16b, v23.16b - -5: - CMP x11, 2 - B.LO 6f - - ST1 {v8.2s}, [x7], 8 - ST1 {v10.2s}, [x9], 8 - ST1 {v12.2s}, [x10], 8 - ST1 {v14.2s}, [x8], 8 - ST1 {v9.2s}, [x12], 8 - ST1 {v11.2s}, [x13], 8 - ST1 {v13.2s}, [x14], 8 - ST1 {v15.2s}, [x15], 8 - - SUB x11, x11, 2 - - EXT v8.16b, v8.16b, v8.16b, 8 - EXT v10.16b, v10.16b, v10.16b, 8 - EXT v12.16b, v12.16b, v12.16b, 8 - EXT v14.16b, v14.16b, v14.16b, 8 - EXT v9.16b, v9.16b, v9.16b, 8 - EXT v11.16b, v11.16b, v11.16b, 8 - EXT v13.16b, v13.16b, v13.16b, 8 - EXT v15.16b, v15.16b, v15.16b, 8 - -6: - CMP x11, 1 - B.LO 7f - - ST1 {v8.s}[0], [x7] - ST1 {v10.s}[0], [x9] - ST1 {v12.s}[0], [x10] - ST1 {v14.s}[0], [x8] - ST1 {v9.s}[0], [x12] - ST1 {v11.s}[0], [x13] - ST1 {v13.s}[0], [x14] - ST1 {v15.s}[0], [x15] - -7: - LDP d9, d8, [sp, -64] - LDP d11, d10, [sp, -48] - LDP d13, d12, [sp, -32] - LDP d15, d14, [sp, -16] - - RET +# void pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w16__aarch64_neon( +# size_t mr, +# size_t nr, +# const uint8_t* a_packed, +# const uint8_t* packed_w, +# const uint16_t* w_row_ptr, +# const uint16_t* w_block_ids_ptr, +# const float* b, +# uint8_t* restrict c, +# size_t c_stride, +# size_t output_channel_index, +# const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) +MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_1X4_UKERNEL_8X8_PACKEDA__AARCH64_NEON(16, #2, #1, LDRH) -END_FUNCTION pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA__aarch64_neon +# void pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w8__aarch64_neon( +# size_t mr, +# size_t nr, +# const uint8_t* a_packed, +# const uint8_t* packed_w, +# const uint8_t* w_row_ptr, +# const uint8_t* w_block_ids_ptr, +# const float* b, +# uint8_t* restrict c, +# size_t c_stride, +# size_t output_channel_index, +# const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) +MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_1X4_UKERNEL_8X8_PACKEDA__AARCH64_NEON(8, #1, #0, LDRB) #ifdef __ELF__ .section ".note.GNU-stack","",%progbits #endif + +#undef 
NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_5 +#undef NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_4 +#undef NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_3 +#undef MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_1X4_UKERNEL_8X8_PACKEDA__AARCH64_NEON +#undef XX diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x8c8x1-dq-packedA-aarch64-neon.S b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x8c8x1-dq-packedA-aarch64-neon.S index 5bb470b2521b..2ba033c57c83 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x8c8x1-dq-packedA-aarch64-neon.S +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/q8gemm_sparse/8x8c8x1-dq-packedA-aarch64-neon.S @@ -8,6 +8,24 @@ #include +#ifndef IGNORE_CODE_ALIGN_DIRECTIVES +#define NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_5 .p2align 5 +#define NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_4 .p2align 4 +#define NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_3 .p2align 3 +#else +#define NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_5 +#define NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_4 +#define NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_3 +#endif + +# Macro for separating instructions. For most builds, ; can be used, but for +# ARM64 + Mach, ; begins a comment, and %% is used to separate instructions +#if defined(__MACH__) +#define XX %% +#else +#define XX ; +#endif + # params # c_stride @@ -19,7 +37,389 @@ # |params | 16 # |-----------| -# void pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA__aarch32_neon( +# void pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w##W_INDEX_DTYPE_NUM_BITS##__aarch64_neon( +# size_t mr, +# size_t nr, +# const uint8_t* a_packed, +# const uint8_t* packed_w, +# const uint##W_INDEX_DTYPE_NUM_BITS##_t* w_row_ptr, +# const uint##W_INDEX_DTYPE_NUM_BITS##_t* w_block_ids_ptr, +# const float* b, +# uint8_t* restrict c, +# size_t c_stride, +# size_t output_channel_index, +# const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) +#define MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_8X1_UKERNEL_8X8_PACKEDA__AARCH64_NEON(W_INDEX_DTYPE_NUM_BITS, W_INDEX_DTYPE_NUM_BYTES_ARG, W_INDEX_DTYPE_LOG_NUM_BYTES_ARG, LOAD_INDEX_INSTRUCTION) XX\ + BEGIN_FUNCTION pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w##W_INDEX_DTYPE_NUM_BITS##__aarch64_neon XX\ + XX\ + STP d15, d14, [sp, -16] XX\ + STP d13, d12, [sp, -32] XX\ + STP d11, d10, [sp, -48] XX\ + STP d9, d8, [sp, -64] XX\ + XX\ + MOV x11, x1 XX\ + /* Load output channel index */ XX\ + LDR x10, [sp, 8] XX\ + /* Load params */ XX\ + LDR x8, [sp, 16] XX\ + XX\ + /* Load a_zero_point */ XX\ + LD1R {v24.8b}, [x8] XX\ + ADD x8, x8, 8 XX\ + XX\ + /* Load pointer to per channel zero points array */ XX\ + LDR x17, [x8], 8 XX\ + XX\ + /* Load pointer to per channel multiplier */ XX\ + LDR x13, [x8] XX\ + XX\ + /* Add offset to the base pointer */ XX\ + ADD x17, x17, x10 XX\ + /* Mul by 4 to get byte offset for multiplier */ XX\ + LSL x10, x10, 2 XX\ + /* Add offset to the base pointer for multiplier */ XX\ + ADD x13, x13, x10 XX\ + XX\ + /* Load b_zero_point */ XX\ + LD1 {v25.8b}, [x17] XX\ + /* Load multiplier c0123 */ XX\ + LD1 {v26.4s}, [x13], 16 XX\ + /* Load multiplier c4567 */ XX\ + LD1 {v30.4s}, [x13] XX\ + XX\ + EOR x12, x12, x12 XX\ + EOR x13, x13, x13 XX\ + XX\ + EOR v8.16b, v8.16b, v8.16b XX\ + EOR v9.16b, v9.16b, v9.16b XX\ + EOR v10.16b, v10.16b, v10.16b XX\ + EOR v11.16b, v11.16b, v11.16b XX\ + EOR v12.16b, v12.16b, v12.16b XX\ + EOR v13.16b, v13.16b, v13.16b XX\ + EOR v14.16b, v14.16b, v14.16b XX\ + EOR v15.16b, v15.16b, v15.16b XX\ + EOR v16.16b, 
v16.16b, v16.16b XX\ + EOR v17.16b, v17.16b, v17.16b XX\ + EOR v18.16b, v18.16b, v18.16b XX\ + EOR v19.16b, v19.16b, v19.16b XX\ + EOR v20.16b, v20.16b, v20.16b XX\ + EOR v21.16b, v21.16b, v21.16b XX\ + EOR v22.16b, v22.16b, v22.16b XX\ + EOR v23.16b, v23.16b, v23.16b XX\ + XX\ + /* w12 = w_row_ptr[n], x13 = w_row_ptr[n+1] */ XX\ + /* x4 = x4 + W_INDEX_DTYPE_NUM_BYTES_ARG to point to next n */ XX\ + LOAD_INDEX_INSTRUCTION w12, [x4], W_INDEX_DTYPE_NUM_BYTES_ARG XX\ + LOAD_INDEX_INSTRUCTION w13, [x4] XX\ + /* x10 = temp_packed_w = packed_w + w_row_ptr[n] * 8 */ XX\ + /* This points to the first block of nonzero value */ XX\ + /* for the nth row. */ XX\ + ADD x10, x3, x12, LSL #3 XX\ + /* x9 = temp_w_block_ids_ptr = w_block_ids_ptr (x5) + w_row_ptr[n] */ XX\ + /* LSL for when elements are >1 byte */ XX\ + /* (4 bytes: LSL #2, 2 bytes: LSL #1, 1 byte: LSL #0) */ XX\ + /* This points to the block id of the first block */ XX\ + /* It should contain x13 - x12 number of block ids */ XX\ + ADD x9, x5, x12, LSL W_INDEX_DTYPE_LOG_NUM_BYTES_ARG XX\ + /* x8 = num_blocks that needs to be processed */ XX\ + SUB x8, x13, x12 XX\ + SUBS x8, x8, 2 XX\ + B.LO _1_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_5 XX\ + k_loop_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + /* k_loop processes two k values */ XX\ + /* Load two 8x1 blocks */ XX\ + LD1 {v0.8b}, [x10], 8 XX\ + LD1 {v1.8b}, [x10], 8 XX\ + USUBL v0.8h, v0.8b, v25.8b XX\ + USUBL v1.8h, v1.8b, v25.8b XX\ + XX\ + /* x12 = block_id_ptr[0] */ XX\ + /* x13 = block_id_ptr[1] */ XX\ + LOAD_INDEX_INSTRUCTION w12, [x9], W_INDEX_DTYPE_NUM_BYTES_ARG XX\ + LOAD_INDEX_INSTRUCTION w13, [x9], W_INDEX_DTYPE_NUM_BYTES_ARG XX\ + /* Add offset to x2 */ XX\ + /* Shift by 3 because each packed block is a block of 8x1 */ XX\ + /* which 8 bytes */ XX\ + ADD x16, x2, x12, LSL #3 XX\ + ADD x17, x2, x13, LSL #3 XX\ + XX\ + /* Load two 8x1 blocks of activation */ XX\ + /* First 8x1 for first channel */ XX\ + /* second 8x1 for next channel */ XX\ + LD1 {v2.8b}, [x16] XX\ + LD1 {v3.8b}, [x17] XX\ + XX\ + USUBL v2.8h, v2.8b, v24.8b XX\ + USUBL v3.8h, v3.8b, v24.8b XX\ + XX\ + /* First channel */ XX\ + SMLAL v8.4s, v0.4h, v2.h[0] XX\ + SMLAL2 v9.4s, v0.8h, v2.h[0] XX\ + SMLAL v10.4s, v0.4h, v2.h[1] XX\ + SMLAL2 v11.4s, v0.8h, v2.h[1] XX\ + SMLAL v12.4s, v0.4h, v2.h[2] XX\ + SMLAL2 v13.4s, v0.8h, v2.h[2] XX\ + SMLAL v14.4s, v0.4h, v2.h[3] XX\ + SMLAL2 v15.4s, v0.8h, v2.h[3] XX\ + SMLAL v16.4s, v0.4h, v2.h[4] XX\ + SMLAL2 v17.4s, v0.8h, v2.h[4] XX\ + SMLAL v18.4s, v0.4h, v2.h[5] XX\ + SMLAL2 v19.4s, v0.8h, v2.h[5] XX\ + SMLAL v20.4s, v0.4h, v2.h[6] XX\ + SMLAL2 v21.4s, v0.8h, v2.h[6] XX\ + SMLAL v22.4s, v0.4h, v2.h[7] XX\ + SMLAL2 v23.4s, v0.8h, v2.h[7] XX\ + XX\ + SUBS x8, x8, 2 XX\ + /* Second channel */ XX\ + SMLAL v8.4s, v1.4h, v3.h[0] XX\ + SMLAL2 v9.4s, v1.8h, v3.h[0] XX\ + SMLAL v10.4s, v1.4h, v3.h[1] XX\ + SMLAL2 v11.4s, v1.8h, v3.h[1] XX\ + SMLAL v12.4s, v1.4h, v3.h[2] XX\ + SMLAL2 v13.4s, v1.8h, v3.h[2] XX\ + SMLAL v14.4s, v1.4h, v3.h[3] XX\ + SMLAL2 v15.4s, v1.8h, v3.h[3] XX\ + SMLAL v16.4s, v1.4h, v3.h[4] XX\ + SMLAL2 v17.4s, v1.8h, v3.h[4] XX\ + SMLAL v18.4s, v1.4h, v3.h[5] XX\ + SMLAL2 v19.4s, v1.8h, v3.h[5] XX\ + SMLAL v20.4s, v1.4h, v3.h[6] XX\ + SMLAL2 v21.4s, v1.8h, v3.h[6] XX\ + SMLAL v22.4s, v1.4h, v3.h[7] XX\ + SMLAL2 v23.4s, v1.8h, v3.h[7] XX\ + XX\ + B.HS k_loop_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + _1_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + CMP x8, -2 XX\ + B.EQ _3_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + LD1 {v0.8b}, [x10] XX\ + USUBL v0.8h, 
v0.8b, v25.8b XX\ + XX\ + /* x12 = block_id_ptr[0] */ XX\ + LOAD_INDEX_INSTRUCTION w12, [x9] XX\ + /* Add offset to x2 */ XX\ + ADD x16, x2, x12, LSL #3 XX\ + XX\ + LD1 {v2.8b}, [x16] XX\ + USUBL v2.8h, v2.8b, v24.8b XX\ + XX\ + SMLAL v8.4s, v0.4h, v2.h[0] XX\ + SMLAL2 v9.4s, v0.8h, v2.h[0] XX\ + SMLAL v10.4s, v0.4h, v2.h[1] XX\ + SMLAL2 v11.4s, v0.8h, v2.h[1] XX\ + SMLAL v12.4s, v0.4h, v2.h[2] XX\ + SMLAL2 v13.4s, v0.8h, v2.h[2] XX\ + SMLAL v14.4s, v0.4h, v2.h[3] XX\ + SMLAL2 v15.4s, v0.8h, v2.h[3] XX\ + SMLAL v16.4s, v0.4h, v2.h[4] XX\ + SMLAL2 v17.4s, v0.8h, v2.h[4] XX\ + SMLAL v18.4s, v0.4h, v2.h[5] XX\ + SMLAL2 v19.4s, v0.8h, v2.h[5] XX\ + SMLAL v20.4s, v0.4h, v2.h[6] XX\ + SMLAL2 v21.4s, v0.8h, v2.h[6] XX\ + SMLAL v22.4s, v0.4h, v2.h[7] XX\ + SMLAL2 v23.4s, v0.8h, v2.h[7] XX\ + XX\ + NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_4 XX\ + _3_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + /* row 0: v8, v9 */ XX\ + /* row 1: v10, v11 */ XX\ + /* row 2: v12, v13 */ XX\ + /* row 3: v14, v15 */ XX\ + /* row 4: v16, v17 */ XX\ + /* row 5: v18, v19 */ XX\ + /* row 6: v20, v21 */ XX\ + /* row 7: v22, v23 */ XX\ + XX\ + /* Load c_stride & params */ XX\ + LDR x16, [sp] XX\ + LSL x16, x16, 2 XX\ + LD1 {v24.4s}, [x6], 16 XX\ + LD1 {v25.4s}, [x6] XX\ + XX\ + SCVTF v8.4s, v8.4s XX\ + SCVTF v9.4s, v9.4s XX\ + SCVTF v10.4s, v10.4s XX\ + SCVTF v11.4s, v11.4s XX\ + SCVTF v12.4s, v12.4s XX\ + SCVTF v13.4s, v13.4s XX\ + SCVTF v14.4s, v14.4s XX\ + SCVTF v15.4s, v15.4s XX\ + SCVTF v16.4s, v16.4s XX\ + SCVTF v17.4s, v17.4s XX\ + SCVTF v18.4s, v18.4s XX\ + SCVTF v19.4s, v19.4s XX\ + SCVTF v20.4s, v20.4s XX\ + SCVTF v21.4s, v21.4s XX\ + SCVTF v22.4s, v22.4s XX\ + SCVTF v23.4s, v23.4s XX\ + XX\ + FMUL v8.4s, v8.4s, v26.4s XX\ + FMUL v9.4s, v9.4s, v30.4s XX\ + FMUL v10.4s, v10.4s, v26.4s XX\ + FMUL v11.4s, v11.4s, v30.4s XX\ + FMUL v12.4s, v12.4s, v26.4s XX\ + FMUL v13.4s, v13.4s, v30.4s XX\ + FMUL v14.4s, v14.4s, v26.4s XX\ + FMUL v15.4s, v15.4s, v30.4s XX\ + FMUL v16.4s, v16.4s, v26.4s XX\ + FMUL v17.4s, v17.4s, v30.4s XX\ + FMUL v18.4s, v18.4s, v26.4s XX\ + FMUL v19.4s, v19.4s, v30.4s XX\ + FMUL v20.4s, v20.4s, v26.4s XX\ + FMUL v21.4s, v21.4s, v30.4s XX\ + FMUL v22.4s, v22.4s, v26.4s XX\ + FMUL v23.4s, v23.4s, v30.4s XX\ + XX\ + FADD v8.4s, v8.4s, v24.4s XX\ + FADD v9.4s, v9.4s, v25.4s XX\ + FADD v10.4s, v10.4s, v24.4s XX\ + FADD v11.4s, v11.4s, v25.4s XX\ + FADD v12.4s, v12.4s, v24.4s XX\ + FADD v13.4s, v13.4s, v25.4s XX\ + FADD v14.4s, v14.4s, v24.4s XX\ + FADD v15.4s, v15.4s, v25.4s XX\ + FADD v16.4s, v16.4s, v24.4s XX\ + FADD v17.4s, v17.4s, v25.4s XX\ + FADD v18.4s, v18.4s, v24.4s XX\ + FADD v19.4s, v19.4s, v25.4s XX\ + FADD v20.4s, v20.4s, v24.4s XX\ + FADD v21.4s, v21.4s, v25.4s XX\ + FADD v22.4s, v22.4s, v24.4s XX\ + FADD v23.4s, v23.4s, v25.4s XX\ + XX\ + /* Compute c0-c7 */ XX\ + XX\ + ADD x9, x7, x16 XX\ + CMP x0, 2 XX\ + CSEL x9, x7, x9, LO XX\ + XX\ + ADD x10, x9, x16 XX\ + CSEL x10, x9, x10, LS XX\ + XX\ + ADD x8, x10, x16 XX\ + CMP x0, 4 XX\ + CSEL x8, x10, x8, LO XX\ + XX\ + ADD x12, x8, x16 XX\ + CSEL x12, x8, x12, LS XX\ + XX\ + ADD x13, x12, x16 XX\ + CMP x0, 6 XX\ + CSEL x13, x12, x13, LO XX\ + XX\ + ADD x14, x13, x16 XX\ + CSEL x14, x13, x14, LS XX\ + XX\ + ADD x15, x14, x16 XX\ + CMP x0, 8 XX\ + CSEL x15, x14, x15, NE XX\ + XX\ + CMP x11, 8 XX\ + B.NE _4_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + ST1 {v8.4s}, [x7], 16 XX\ + ST1 {v9.4s}, [x7] XX\ + ST1 {v10.4s}, [x9], 16 XX\ + ST1 {v11.4s}, [x9] XX\ + ST1 {v12.4s}, [x10], 16 XX\ + ST1 {v13.4s}, [x10] XX\ + ST1 {v14.4s}, [x8], 16 XX\ + ST1 {v15.4s}, [x8] XX\ 
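Per element, the USUBL/SMLAL pairs in the k_loop above (and in the single-block remainder path) compute a zero-point-corrected multiply-accumulate: each 8x1 weight block is corrected by the per-output-channel kernel zero points in v25, each packed activation tile by the activation zero point in v24, and the products land in the sixteen int32 accumulators. A scalar C++ sketch, illustrative only, with descriptive stand-in names:

#include <cstdint>

// What one nonzero 8x1 block contributes to the 8x8 accumulator tile.
void accumulate_8x1_block(
    int32_t acc[8][8],              // v8-v23, indexed here as acc[m][n]
    const uint8_t w_block[8],       // one 8x1 block from packed_w (8 output channels, one k)
    const uint8_t a_block[8],       // matching packed-A tile (8 rows m, same k)
    const uint8_t w_zero_point[8],  // per-output-channel zero points (v25)
    uint8_t a_zero_point) {         // activation zero point (v24)
  for (int m = 0; m < 8; ++m) {
    const int32_t a = static_cast<int32_t>(a_block[m]) - a_zero_point;        // USUBL v2, v24
    for (int n = 0; n < 8; ++n) {
      const int32_t w = static_cast<int32_t>(w_block[n]) - w_zero_point[n];   // USUBL v0, v25
      acc[m][n] += w * a;                                                     // SMLAL / SMLAL2
    }
  }
}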
+ ST1 {v16.4s}, [x12], 16 XX\ + ST1 {v17.4s}, [x12] XX\ + ST1 {v18.4s}, [x13], 16 XX\ + ST1 {v19.4s}, [x13] XX\ + ST1 {v20.4s}, [x14], 16 XX\ + ST1 {v21.4s}, [x14] XX\ + ST1 {v22.4s}, [x15], 16 XX\ + ST1 {v23.4s}, [x15] XX\ + XX\ + LDP d9, d8, [sp, -64] XX\ + LDP d11, d10, [sp, -48] XX\ + LDP d13, d12, [sp, -32] XX\ + LDP d15, d14, [sp, -16] XX\ + XX\ + RET XX\ + XX\ + NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_3 XX\ + _4_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + CMP x11, 4 XX\ + B.LO _5_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + ST1 {v8.4s}, [x7], 16 XX\ + ST1 {v10.4s}, [x9], 16 XX\ + ST1 {v12.4s}, [x10], 16 XX\ + ST1 {v14.4s}, [x8], 16 XX\ + ST1 {v16.4s}, [x12], 16 XX\ + ST1 {v18.4s}, [x13], 16 XX\ + ST1 {v20.4s}, [x14], 16 XX\ + ST1 {v22.4s}, [x15], 16 XX\ + XX\ + SUB x11, x11, 4 XX\ + XX\ + MOV v8.16b, v9.16b XX\ + MOV v10.16b, v11.16b XX\ + MOV v12.16b, v13.16b XX\ + MOV v14.16b, v15.16b XX\ + MOV v16.16b, v17.16b XX\ + MOV v18.16b, v19.16b XX\ + MOV v20.16b, v21.16b XX\ + MOV v22.16b, v23.16b XX\ + XX\ + _5_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + CMP x11, 2 XX\ + B.LO _6_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + ST1 {v8.2s}, [x7], 8 XX\ + ST1 {v10.2s}, [x9], 8 XX\ + ST1 {v12.2s}, [x10], 8 XX\ + ST1 {v14.2s}, [x8], 8 XX\ + ST1 {v16.2s}, [x12], 8 XX\ + ST1 {v18.2s}, [x13], 8 XX\ + ST1 {v20.2s}, [x14], 8 XX\ + ST1 {v22.2s}, [x15], 8 XX\ + XX\ + SUB x11, x11, 2 XX\ + XX\ + EXT v8.16b, v8.16b, v8.16b, 8 XX\ + EXT v10.16b, v10.16b, v10.16b, 8 XX\ + EXT v12.16b, v12.16b, v12.16b, 8 XX\ + EXT v14.16b, v14.16b, v14.16b, 8 XX\ + EXT v16.16b, v16.16b, v16.16b, 8 XX\ + EXT v18.16b, v18.16b, v18.16b, 8 XX\ + EXT v20.16b, v20.16b, v20.16b, 8 XX\ + EXT v22.16b, v22.16b, v22.16b, 8 XX\ + XX\ + _6_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + CMP x11, 1 XX\ + B.LO _7_w##W_INDEX_DTYPE_NUM_BITS XX\ + XX\ + ST1 {v8.s}[0], [x7] XX\ + ST1 {v10.s}[0], [x9] XX\ + ST1 {v12.s}[0], [x10] XX\ + ST1 {v14.s}[0], [x8] XX\ + ST1 {v16.s}[0], [x12] XX\ + ST1 {v18.s}[0], [x13] XX\ + ST1 {v20.s}[0], [x14] XX\ + ST1 {v22.s}[0], [x15] XX\ + XX\ + _7_w##W_INDEX_DTYPE_NUM_BITS##: XX\ + LDP d9, d8, [sp, -64] XX\ + LDP d11, d10, [sp, -48] XX\ + LDP d13, d12, [sp, -32] XX\ + LDP d15, d14, [sp, -16] XX\ + XX\ + RET XX\ + XX\ + END_FUNCTION pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w##W_INDEX_DTYPE_NUM_BITS##__aarch64_neon + +# void pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w32__aarch64_neon( # size_t mr, # size_t nr, # const uint8_t* a_packed, @@ -31,380 +431,42 @@ # size_t c_stride, # size_t output_channel_index, # const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) -BEGIN_FUNCTION pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA__aarch64_neon - - STP d15, d14, [sp, -16] - STP d13, d12, [sp, -32] - STP d11, d10, [sp, -48] - STP d9, d8, [sp, -64] - - MOV x11, x1 - # Load output channel index - LDR x10, [sp, 8] - # Load params - LDR x8, [sp, 16] - - # Load a_zero_point - LD1R {v24.8b}, [x8] - ADD x8, x8, 8 - - # Load pointer to per channel zero points array - LDR x17, [x8], 8 - - # Load pointer to per channel multiplier - LDR x13, [x8] - - # Add offset to the base pointer - ADD x17, x17, x10 - # Mul by 4 to get byte offset for multiplier - LSL x10, x10, 2 - # Add offset to the base pointer for multiplier - ADD x13, x13, x10 - - # Load b_zero_point - LD1 {v25.8b}, [x17] - # Load multiplier c0123 - LD1 {v26.4s}, [x13], 16 - # Load multiplier c4567 - LD1 {v30.4s}, [x13] - - EOR x12, x12, x12 - EOR x13, x13, x13 - - EOR v8.16b, v8.16b, v8.16b - EOR v9.16b, v9.16b, v9.16b - EOR v10.16b, v10.16b, v10.16b - 
EOR v11.16b, v11.16b, v11.16b - EOR v12.16b, v12.16b, v12.16b - EOR v13.16b, v13.16b, v13.16b - EOR v14.16b, v14.16b, v14.16b - EOR v15.16b, v15.16b, v15.16b - EOR v16.16b, v16.16b, v16.16b - EOR v17.16b, v17.16b, v17.16b - EOR v18.16b, v18.16b, v18.16b - EOR v19.16b, v19.16b, v19.16b - EOR v20.16b, v20.16b, v20.16b - EOR v21.16b, v21.16b, v21.16b - EOR v22.16b, v22.16b, v22.16b - EOR v23.16b, v23.16b, v23.16b - - # w12 = w_row_ptr[n], x13 = w_row_ptr[n+1] - # x4 = x4 + 4 to point to next n - LDR w12, [x4], #4 - LDR w13, [x4] - # x10 = temp_packed_w = packed_w + w_row_ptr[n] * 8 - # This points to the first block of nonzero value - # for the nth row. - ADD x10, x3, x12, LSL #3 - # x9 = temp_w_block_ids_ptr = w_block_ids_ptr (x5) + w_row_ptr[n] - # LSL2 because each element is 4 bytes - # This points to the block id of the first block - # It should contain x13 - x12 number of block ids - ADD x9, x5, x12, LSL #2 - # x8 = num_blocks that needs to be processed - SUB x8, x13, x12 - SUBS x8, x8, 2 - B.LO 1f - -#ifndef IGNORE_CODE_ALIGN_DIRECTIVES - .p2align 5 -#endif -k_loop: - # k_loop processes two k values - # Load two 8x1 blocks - LD1 {v0.8b}, [x10], 8 - LD1 {v1.8b}, [x10], 8 - USUBL v0.8h, v0.8b, v25.8b - USUBL v1.8h, v1.8b, v25.8b - - #x12 = block_id_ptr[0] - #x13 = block_id_ptr[1] - LDR w12, [x9], #4 - LDR w13, [x9], #4 - # Add offset to x2 - # Shift by 3 because each packed block is a block of 8x1 - # which 8 bytes - ADD x16, x2, x12, LSL #3 - ADD x17, x2, x13, LSL #3 - - # Load two 8x1 blocks of activation - # First 8x1 for first channel - # second 8x1 for next channel - LD1 {v2.8b}, [x16] - LD1 {v3.8b}, [x17] - - USUBL v2.8h, v2.8b, v24.8b - USUBL v3.8h, v3.8b, v24.8b - - # First channel - SMLAL v8.4s, v0.4h, v2.h[0] - SMLAL2 v9.4s, v0.8h, v2.h[0] - SMLAL v10.4s, v0.4h, v2.h[1] - SMLAL2 v11.4s, v0.8h, v2.h[1] - SMLAL v12.4s, v0.4h, v2.h[2] - SMLAL2 v13.4s, v0.8h, v2.h[2] - SMLAL v14.4s, v0.4h, v2.h[3] - SMLAL2 v15.4s, v0.8h, v2.h[3] - SMLAL v16.4s, v0.4h, v2.h[4] - SMLAL2 v17.4s, v0.8h, v2.h[4] - SMLAL v18.4s, v0.4h, v2.h[5] - SMLAL2 v19.4s, v0.8h, v2.h[5] - SMLAL v20.4s, v0.4h, v2.h[6] - SMLAL2 v21.4s, v0.8h, v2.h[6] - SMLAL v22.4s, v0.4h, v2.h[7] - SMLAL2 v23.4s, v0.8h, v2.h[7] - - SUBS x8, x8, 2 - # Second channel - SMLAL v8.4s, v1.4h, v3.h[0] - SMLAL2 v9.4s, v1.8h, v3.h[0] - SMLAL v10.4s, v1.4h, v3.h[1] - SMLAL2 v11.4s, v1.8h, v3.h[1] - SMLAL v12.4s, v1.4h, v3.h[2] - SMLAL2 v13.4s, v1.8h, v3.h[2] - SMLAL v14.4s, v1.4h, v3.h[3] - SMLAL2 v15.4s, v1.8h, v3.h[3] - SMLAL v16.4s, v1.4h, v3.h[4] - SMLAL2 v17.4s, v1.8h, v3.h[4] - SMLAL v18.4s, v1.4h, v3.h[5] - SMLAL2 v19.4s, v1.8h, v3.h[5] - SMLAL v20.4s, v1.4h, v3.h[6] - SMLAL2 v21.4s, v1.8h, v3.h[6] - SMLAL v22.4s, v1.4h, v3.h[7] - SMLAL2 v23.4s, v1.8h, v3.h[7] - - B.HS k_loop - -1: - CMP x8, -2 - B.EQ 3f - - LD1 {v0.8b}, [x10] - USUBL v0.8h, v0.8b, v25.8b - - #x12 = block_id_ptr[0] - LDR w12, [x9] - # Add offset to x2 - ADD x16, x2, x12, LSL #3 - - LD1 {v2.8b}, [x16] - USUBL v2.8h, v2.8b, v24.8b - - SMLAL v8.4s, v0.4h, v2.h[0] - SMLAL2 v9.4s, v0.8h, v2.h[0] - SMLAL v10.4s, v0.4h, v2.h[1] - SMLAL2 v11.4s, v0.8h, v2.h[1] - SMLAL v12.4s, v0.4h, v2.h[2] - SMLAL2 v13.4s, v0.8h, v2.h[2] - SMLAL v14.4s, v0.4h, v2.h[3] - SMLAL2 v15.4s, v0.8h, v2.h[3] - SMLAL v16.4s, v0.4h, v2.h[4] - SMLAL2 v17.4s, v0.8h, v2.h[4] - SMLAL v18.4s, v0.4h, v2.h[5] - SMLAL2 v19.4s, v0.8h, v2.h[5] - SMLAL v20.4s, v0.4h, v2.h[6] - SMLAL2 v21.4s, v0.8h, v2.h[6] - SMLAL v22.4s, v0.4h, v2.h[7] - SMLAL2 v23.4s, v0.8h, v2.h[7] - -#ifndef IGNORE_CODE_ALIGN_DIRECTIVES - 
.p2align 4 -#endif -3: - # row 0: v8, v9 - # row 1: v10, v11 - # row 2: v12, v13 - # row 3: v14, v15 - # row 4: v16, v17 - # row 5: v18, v19 - # row 6: v20, v21 - # row 7: v22, v23 - - # Load c_stride & params - LDR x16, [sp] - LSL x16, x16, 2 - LD1 {v24.4s}, [x6], 16 - LD1 {v25.4s}, [x6] - - SCVTF v8.4s, v8.4s - SCVTF v9.4s, v9.4s - SCVTF v10.4s, v10.4s - SCVTF v11.4s, v11.4s - SCVTF v12.4s, v12.4s - SCVTF v13.4s, v13.4s - SCVTF v14.4s, v14.4s - SCVTF v15.4s, v15.4s - SCVTF v16.4s, v16.4s - SCVTF v17.4s, v17.4s - SCVTF v18.4s, v18.4s - SCVTF v19.4s, v19.4s - SCVTF v20.4s, v20.4s - SCVTF v21.4s, v21.4s - SCVTF v22.4s, v22.4s - SCVTF v23.4s, v23.4s - - FMUL v8.4s, v8.4s, v26.4s - FMUL v9.4s, v9.4s, v30.4s - FMUL v10.4s, v10.4s, v26.4s - FMUL v11.4s, v11.4s, v30.4s - FMUL v12.4s, v12.4s, v26.4s - FMUL v13.4s, v13.4s, v30.4s - FMUL v14.4s, v14.4s, v26.4s - FMUL v15.4s, v15.4s, v30.4s - FMUL v16.4s, v16.4s, v26.4s - FMUL v17.4s, v17.4s, v30.4s - FMUL v18.4s, v18.4s, v26.4s - FMUL v19.4s, v19.4s, v30.4s - FMUL v20.4s, v20.4s, v26.4s - FMUL v21.4s, v21.4s, v30.4s - FMUL v22.4s, v22.4s, v26.4s - FMUL v23.4s, v23.4s, v30.4s - - FADD v8.4s, v8.4s, v24.4s - FADD v9.4s, v9.4s, v25.4s - FADD v10.4s, v10.4s, v24.4s - FADD v11.4s, v11.4s, v25.4s - FADD v12.4s, v12.4s, v24.4s - FADD v13.4s, v13.4s, v25.4s - FADD v14.4s, v14.4s, v24.4s - FADD v15.4s, v15.4s, v25.4s - FADD v16.4s, v16.4s, v24.4s - FADD v17.4s, v17.4s, v25.4s - FADD v18.4s, v18.4s, v24.4s - FADD v19.4s, v19.4s, v25.4s - FADD v20.4s, v20.4s, v24.4s - FADD v21.4s, v21.4s, v25.4s - FADD v22.4s, v22.4s, v24.4s - FADD v23.4s, v23.4s, v25.4s +MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_8X1_UKERNEL_8X8_PACKEDA__AARCH64_NEON(32, #4, #2, LDR) - // Compute c0-c7 - - ADD x9, x7, x16 - CMP x0, 2 - CSEL x9, x7, x9, LO - - ADD x10, x9, x16 - CSEL x10, x9, x10, LS - - ADD x8, x10, x16 - CMP x0, 4 - CSEL x8, x10, x8, LO - - ADD x12, x8, x16 - CSEL x12, x8, x12, LS - - ADD x13, x12, x16 - CMP x0, 6 - CSEL x13, x12, x13, LO - - ADD x14, x13, x16 - CSEL x14, x13, x14, LS - - ADD x15, x14, x16 - CMP x0, 8 - CSEL x15, x14, x15, NE - - CMP x11, 8 - B.NE 4f - - ST1 {v8.4s}, [x7], 16 - ST1 {v9.4s}, [x7] - ST1 {v10.4s}, [x9], 16 - ST1 {v11.4s}, [x9] - ST1 {v12.4s}, [x10], 16 - ST1 {v13.4s}, [x10] - ST1 {v14.4s}, [x8], 16 - ST1 {v15.4s}, [x8] - ST1 {v16.4s}, [x12], 16 - ST1 {v17.4s}, [x12] - ST1 {v18.4s}, [x13], 16 - ST1 {v19.4s}, [x13] - ST1 {v20.4s}, [x14], 16 - ST1 {v21.4s}, [x14] - ST1 {v22.4s}, [x15], 16 - ST1 {v23.4s}, [x15] - - LDP d9, d8, [sp, -64] - LDP d11, d10, [sp, -48] - LDP d13, d12, [sp, -32] - LDP d15, d14, [sp, -16] - - RET - -#ifndef IGNORE_CODE_ALIGN_DIRECTIVES - .p2align 3 -#endif -4: - CMP x11, 4 - B.LO 5f - - ST1 {v8.4s}, [x7], 16 - ST1 {v10.4s}, [x9], 16 - ST1 {v12.4s}, [x10], 16 - ST1 {v14.4s}, [x8], 16 - ST1 {v16.4s}, [x12], 16 - ST1 {v18.4s}, [x13], 16 - ST1 {v20.4s}, [x14], 16 - ST1 {v22.4s}, [x15], 16 - - SUB x11, x11, 4 - - MOV v8.16b, v9.16b - MOV v10.16b, v11.16b - MOV v12.16b, v13.16b - MOV v14.16b, v15.16b - MOV v16.16b, v17.16b - MOV v18.16b, v19.16b - MOV v20.16b, v21.16b - MOV v22.16b, v23.16b - -5: - CMP x11, 2 - B.LO 6f - - ST1 {v8.2s}, [x7], 8 - ST1 {v10.2s}, [x9], 8 - ST1 {v12.2s}, [x10], 8 - ST1 {v14.2s}, [x8], 8 - ST1 {v16.2s}, [x12], 8 - ST1 {v18.2s}, [x13], 8 - ST1 {v20.2s}, [x14], 8 - ST1 {v22.2s}, [x15], 8 - - SUB x11, x11, 2 - - EXT v8.16b, v8.16b, v8.16b, 8 - EXT v10.16b, v10.16b, v10.16b, 8 - EXT v12.16b, v12.16b, v12.16b, 8 - EXT v14.16b, v14.16b, v14.16b, 8 - EXT v16.16b, v16.16b, v16.16b, 8 - EXT v18.16b, v18.16b, v18.16b, 8 - 
EXT v20.16b, v20.16b, v20.16b, 8 - EXT v22.16b, v22.16b, v22.16b, 8 - -6: - CMP x11, 1 - B.LO 7f - - ST1 {v8.s}[0], [x7] - ST1 {v10.s}[0], [x9] - ST1 {v12.s}[0], [x10] - ST1 {v14.s}[0], [x8] - ST1 {v16.s}[0], [x12] - ST1 {v18.s}[0], [x13] - ST1 {v20.s}[0], [x14] - ST1 {v22.s}[0], [x15] - -7: - LDP d9, d8, [sp, -64] - LDP d11, d10, [sp, -48] - LDP d13, d12, [sp, -32] - LDP d15, d14, [sp, -16] - - RET +# void pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w16__aarch64_neon( +# size_t mr, +# size_t nr, +# const uint8_t* a_packed, +# const uint8_t* packed_w, +# const uint16_t* w_row_ptr, +# const uint16_t* w_block_ids_ptr, +# const float* b, +# uint8_t* restrict c, +# size_t c_stride, +# size_t output_channel_index, +# const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) +MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_8X1_UKERNEL_8X8_PACKEDA__AARCH64_NEON(16, #2, #1, LDRH) -END_FUNCTION pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA__aarch64_neon +# void pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w8__aarch64_neon( +# size_t mr, +# size_t nr, +# const uint8_t* a_packed, +# const uint8_t* packed_w, +# const uint8_t* w_row_ptr, +# const uint8_t* w_block_ids_ptr, +# const float* b, +# uint8_t* restrict c, +# size_t c_stride, +# size_t output_channel_index, +# const union pytorch_qnnp_conv_dynamic_quantization_params quantization_params[restrict static 1]) +MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_8X1_UKERNEL_8X8_PACKEDA__AARCH64_NEON(8, #1, #0, LDRB) #ifdef __ELF__ .section ".note.GNU-stack","",%progbits #endif + +#undef NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_5 +#undef NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_4 +#undef NDEF_IGNORE_CODE_ALIGN_DIRECTIVES_P2ALIGN_3 +#undef MAKE_PYTORCH_Q8GEMM_DQ_SPARSE_8X1_UKERNEL_8X8_PACKEDA__AARCH64_NEON +#undef XX diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/common.h b/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/common.h index 14bcc01d21ed..fbfaa85904c7 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/common.h +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/common.h @@ -80,3 +80,15 @@ #if defined(_MSC_VER) #define __builtin_prefetch #endif + +#if defined(__GNUC__) + #define PYTORCH_QNNP_UNALIGNED __attribute__((__aligned__(1))) +#elif defined(_MSC_VER) + #if defined(_M_IX86) + #define PYTORCH_QNNP_UNALIGNED + #else + #define PYTORCH_QNNP_UNALIGNED __unaligned + #endif +#else + #error "Platform-specific implementation of PYTORCH_QNNP_UNALIGNED required" +#endif diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/operator.h b/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/operator.h index 44e702a7e412..a6e2dbe24f81 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/operator.h +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/operator.h @@ -38,11 +38,20 @@ enum pytorch_qnnp_ukernel_type { }; typedef struct { - const uint32_t* col_indices; - const uint32_t* row_values; + union { + const uint32_t* col_indices_w32; + const uint16_t* col_indices_w16; + const uint8_t* col_indices_w8; + }; + union { + const uint32_t* row_values_w32; + const uint16_t* row_values_w16; + const uint8_t* row_values_w8; + }; const uint8_t* values; uint32_t row_block_size; uint32_t col_block_size; + enum pytorch_qnnp_sparse_matrix_indices_dtype indices_dtype; } sparse_matrix_t; struct pytorch_qnnp_operator { diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/params.h b/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/params.h 
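The sparse_matrix_t change above replaces the fixed uint32_t col_indices/row_values pointers with width-tagged union members plus an indices_dtype tag, so consumers have to dispatch on the tag before touching the arrays. A hypothetical helper sketching that dispatch is shown below; it is not code from this patch, and the ..._uint16_t/_uint8_t enumerator spellings are assumed by analogy with pytorch_qnnp_sparse_matrix_indices_dtype_uint32_t, which the updated testers pass further down.

// Hypothetical dispatch helper over the new width-tagged union (sketch only).
static inline const void* sparse_matrix_col_indices(const sparse_matrix_t* m) {
  switch (m->indices_dtype) {
    case pytorch_qnnp_sparse_matrix_indices_dtype_uint32_t:
      return m->col_indices_w32;
    case pytorch_qnnp_sparse_matrix_indices_dtype_uint16_t:  // assumed spelling
      return m->col_indices_w16;
    case pytorch_qnnp_sparse_matrix_indices_dtype_uint8_t:   // assumed spelling
      return m->col_indices_w8;
    default:
      return nullptr;  // invalid or uninitialized tag
  }
}

The same tag presumably decides which of the packedA_w32/w16/w8_gemm_dq ukernel pointers added to pytorch_q8gemm_sparse_parameters gets invoked at run time.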
index 1fb607e3f195..04536dafcef9 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/params.h +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/params.h @@ -331,7 +331,7 @@ typedef void (*pytorch_q8gemm_dq_sparse_ukernel_function)( size_t output_channel_index, const struct pytorch_qnnp_conv_dynamic_quantization_params* quantization_params); -typedef void (*pytorch_q8gemm_dq_sparse_packedA_ukernel_function)( +typedef void (*pytorch_q8gemm_dq_sparse_packedA_w32_ukernel_function)( size_t mr, size_t nr, const uint8_t* a_packed, @@ -344,6 +344,32 @@ typedef void (*pytorch_q8gemm_dq_sparse_packedA_ukernel_function)( size_t output_channel_index, const struct pytorch_qnnp_conv_dynamic_quantization_params* quantization_params); +typedef void (*pytorch_q8gemm_dq_sparse_packedA_w16_ukernel_function)( + size_t mr, + size_t nr, + const uint8_t* a_packed, + const uint8_t* packed_w, + const uint16_t* w_row_ptr, + const uint16_t* w_block_ids_ptr, + const float* bias, + float* c, + size_t c_stride, + size_t output_channel_index, + const struct pytorch_qnnp_conv_dynamic_quantization_params* quantization_params); + +typedef void (*pytorch_q8gemm_dq_sparse_packedA_w8_ukernel_function)( + size_t mr, + size_t nr, + const uint8_t* a_packed, + const uint8_t* packed_w, + const uint8_t* w_row_ptr, + const uint8_t* w_block_ids_ptr, + const float* bias, + float* c, + size_t c_stride, + size_t output_channel_index, + const struct pytorch_qnnp_conv_dynamic_quantization_params* quantization_params); + typedef void (*pytorch_q8gemm_sparse_packA_ukernel_function)( const size_t mr, const size_t K, @@ -545,7 +571,11 @@ struct pytorch_q8conv_parameters { struct pytorch_q8gemm_sparse_parameters { pytorch_q8gemm_dq_sparse_ukernel_function gemm_dq; - pytorch_q8gemm_dq_sparse_packedA_ukernel_function packedA_gemm_dq; + // w32, w16, and w8 refer to variants of the kernel which use uint32_t, + // uint16_t, and uint8_t datatype for row values/col indices respectively + pytorch_q8gemm_dq_sparse_packedA_w32_ukernel_function packedA_w32_gemm_dq; + pytorch_q8gemm_dq_sparse_packedA_w16_ukernel_function packedA_w16_gemm_dq; + pytorch_q8gemm_dq_sparse_packedA_w8_ukernel_function packedA_w8_gemm_dq; pytorch_q8gemm_sparse_packA_ukernel_function packA; uint8_t mr; uint8_t nr; diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/q8gemm_sparse.h b/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/q8gemm_sparse.h index 572b7cfe54a7..a4079f9bde0b 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/q8gemm_sparse.h +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/src/qnnpack/q8gemm_sparse.h @@ -61,32 +61,72 @@ DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_UKERNEL_FUNCTION(pytorch_q8ge DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_UKERNEL_FUNCTION(pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4__aarch64_neon) DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_UKERNEL_FUNCTION(pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4__sse2) -#define DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION(fn_name) \ - PYTORCH_QNNP_INTERNAL void fn_name( \ - size_t mr, \ - size_t nr, \ - const uint8_t* a_packed, \ - const uint8_t* packed_w, \ - const uint32_t* w_row_ptr, \ - const uint32_t* w_block_ids_ptr, \ - const float* b, \ - float* c, \ - size_t c_stride, \ - size_t output_channel_index, \ - const struct pytorch_qnnp_conv_dynamic_quantization_params* quantization_params); - +#define DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( \ + fn_name, 
w_index_dtype) \ + PYTORCH_QNNP_INTERNAL void fn_name( \ + size_t mr, \ + size_t nr, \ + const uint8_t* a_packed, \ + const uint8_t* packed_w, \ + const w_index_dtype* w_row_ptr, \ + const w_index_dtype* w_block_ids_ptr, \ + const float* b, \ + float* c, \ + size_t c_stride, \ + size_t output_channel_index, \ + const struct pytorch_qnnp_conv_dynamic_quantization_params* \ + quantization_params); + +// w32, w16, and w8 refer to variants of the kernel which use uint32_t, +// uint16_t, and uint8_t datatype for row values/col indices respectively +DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( + pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w32__aarch32_neon, + uint32_t) +DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( + pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w16__aarch32_neon, + uint16_t) +DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( + pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w8__aarch32_neon, + uint8_t) +DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( + pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w32__aarch32_neon, + uint32_t) +DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( + pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w16__aarch32_neon, + uint16_t) +DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( + pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w8__aarch32_neon, + uint8_t) +DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__aarch32_neon, + uint32_t) +DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w32__aarch64_neon, + uint32_t) +DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w16__aarch64_neon, + uint16_t) +DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w8__aarch64_neon, + uint8_t) DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( - pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA__aarch32_neon) + pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w32__aarch64_neon, + uint32_t) DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( - pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA__aarch32_neon) + pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w16__aarch64_neon, + uint16_t) DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__aarch32_neon) + pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w8__aarch64_neon, + uint8_t) DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA__aarch64_neon) + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2, + uint32_t) DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( - pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA__aarch64_neon) + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2, + uint16_t) DECLARE_PYTORCH_Q8GEMM_DYNAMIC_QUANTIZATION_SPARSE_PACKEDA_UKERNEL_FUNCTION( - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2) + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2, + uint8_t) #define DECLARE_PYTORCH_Q8GEMM_PARSE_PACKA_UKERNEL_FUNCTION(fn_name) \ PYTORCH_QNNP_INTERNAL void 
fn_name( \ diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-sparse-operator-tester.h b/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-sparse-operator-tester.h index 575c0a17bceb..b1338df41f18 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-sparse-operator-tester.h +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-sparse-operator-tester.h @@ -241,13 +241,13 @@ class FullyConnectedSparseOperatorTester { } while (max_elem == min_elem); std::unique_ptr bcsr_matrix = - qnnpack::generateBlockCSRMatrix( - kernel.data(), - outputChannels(), - inputChannels(), - rowBlockSize(), - colBlockSize(), - kernelZeroPoints.data()); + qnnpack::generateBlockCSRMatrix( + kernel.data(), + outputChannels(), + inputChannels(), + rowBlockSize(), + colBlockSize(), + kernelZeroPoints.data()); std::fill(output.begin(), output.end(), 0xA5); std::fill(output_dynamic.begin(), output_dynamic.end(), 0.0f); @@ -320,11 +320,12 @@ class FullyConnectedSparseOperatorTester { outputChannels(), inputZeroPoint, kernelZeroPoints.data(), - bcsr_matrix->col_indices.data(), - bcsr_matrix->row_values.data(), + bcsr_matrix->col_indices_data_ptr(), + bcsr_matrix->row_values_data_ptr(), bcsr_matrix->values.data(), bcsr_matrix->row_block_size, bcsr_matrix->col_block_size, + pytorch_qnnp_sparse_matrix_indices_dtype_uint32_t, outputZeroPoint, qmin(), qmax(), @@ -441,13 +442,13 @@ class FullyConnectedSparseOperatorTester { min_elem = *std::min_element(kernel.cbegin(), kernel.cend()); } while (max_elem == min_elem); std::unique_ptr bcsr_matrix = - qnnpack::generateBlockCSRMatrix( - kernel.data(), - outputChannels(), - inputChannels(), - rowBlockSize(), - colBlockSize(), - kernelZeroPoints.data()); + qnnpack::generateBlockCSRMatrix( + kernel.data(), + outputChannels(), + inputChannels(), + rowBlockSize(), + colBlockSize(), + kernelZeroPoints.data()); std::fill(output.begin(), output.end(), 0xA5); std::fill(output_dynamic.begin(), output_dynamic.end(), 0.0f); @@ -520,11 +521,12 @@ class FullyConnectedSparseOperatorTester { outputChannels(), inputZeroPoint, kernelZeroPoints.data(), - bcsr_matrix->col_indices.data(), - bcsr_matrix->row_values.data(), + bcsr_matrix->col_indices_data_ptr(), + bcsr_matrix->row_values_data_ptr(), bcsr_matrix->values.data(), bcsr_matrix->row_block_size, bcsr_matrix->col_block_size, + pytorch_qnnp_sparse_matrix_indices_dtype_uint32_t, outputZeroPoint, qmin(), qmax(), diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h b/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h index 25e7bb670653..53eb9ed33830 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h @@ -279,7 +279,7 @@ class GemmBlockSparseMicrokernelTester { } while (max_elem == min_elem); std::unique_ptr bcsr_matrix = - qnnpack::generateBlockCSRMatrix( + qnnpack::generateBlockCSRMatrix( b.data(), n(), k(), @@ -332,8 +332,8 @@ class GemmBlockSparseMicrokernelTester { aPtr, aStride() * sizeof(uint8_t), bcsr_matrix->values.data(), - bcsr_matrix->row_values.data(), - bcsr_matrix->col_indices.data(), + static_cast(bcsr_matrix->row_values_data_ptr()), + static_cast(bcsr_matrix->col_indices_data_ptr()), bias.data(), c.data(), cStride(), @@ -355,9 +355,10 @@ class GemmBlockSparseMicrokernelTester { } } + template void test_packed( 
pytorch_q8gemm_sparse_packA_ukernel_function packa, - pytorch_q8gemm_dq_sparse_packedA_ukernel_function qgemm) const { + GEMM_UKERNEL_DTYPE qgemm) const { ASSERT_LE(m(), mr()); ASSERT_LE(n(), nr()); @@ -405,13 +406,13 @@ class GemmBlockSparseMicrokernelTester { min_elem = *std::min_element(b.cbegin(), b.cend()); } while (max_elem == min_elem); std::unique_ptr bcsr_matrix = - qnnpack::generateBlockCSRMatrix( - b.data(), - n(), - k(), - rowBlockSize(), - colBlockSize(), - kernel_zero_points.data()); + qnnpack::generateBlockCSRMatrix( + b.data(), + n(), + k(), + rowBlockSize(), + colBlockSize(), + kernel_zero_points.data()); ASSERT_NE( *std::max_element(a.cbegin(), a.cend()), @@ -465,8 +466,10 @@ class GemmBlockSparseMicrokernelTester { n(), a_packed.data(), bcsr_matrix->values.data(), - bcsr_matrix->row_values.data(), - bcsr_matrix->col_indices.data(), + static_cast( + bcsr_matrix->row_values_data_ptr()), + static_cast( + bcsr_matrix->col_indices_data_ptr()), bias.data(), c.data(), cStride(), diff --git a/aten/src/ATen/native/quantized/cpu/qnnpack/test/q8gemm_sparse.cc b/aten/src/ATen/native/quantized/cpu/qnnpack/test/q8gemm_sparse.cc index 42467e2d2952..49f970c1dabc 100644 --- a/aten/src/ATen/native/quantized/cpu/qnnpack/test/q8gemm_sparse.cc +++ b/aten/src/ATen/native/quantized/cpu/qnnpack/test/q8gemm_sparse.cc @@ -16,25 +16,31 @@ #define TEST_PACKED_ROW_BLOCK_SIZEXCOL_BLOCK_SIZE_SPARSE_OP(MR, \ NR, row_block_size, col_block_size, \ - prepacking_kernel, compute_kernel) \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_4) { \ + prepacking_kernel, compute_kernel_w32, compute_kernel_w16, compute_kernel_w8) \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_4) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ - .mr(MR) \ - .nr(NR) \ - .m(MR) \ - .n(NR) \ - .k(3) \ - .rowBlockSize(row_block_size) \ - .colBlockSize(col_block_size) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ + .mr(MR) \ + .nr(NR) \ + .m(MR) \ + .n(NR) \ + .k(3) \ + .rowBlockSize(row_block_size) \ + .colBlockSize(col_block_size); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_4_strided_a) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_4_strided_a) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -42,15 +48,21 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(3) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .aStride(37) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .aStride(37); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_4_strided_c) { \ +TEST(Q8GEMM__##MR ## x ##NR ## 
c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_4_strided_c) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -58,47 +70,65 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(3) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .cStride(17) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .cStride(17); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_4_qmin128) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_4_qmin128) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ - .mr(MR) \ - .nr(NR) \ - .m(MR) \ - .n(NR) \ - .k(3) \ - .rowBlockSize(row_block_size) \ - .colBlockSize(col_block_size) \ - .qmin(128) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ + .mr(MR) \ + .nr(NR) \ + .m(MR) \ + .n(NR) \ + .k(3) \ + .rowBlockSize(row_block_size) \ + .colBlockSize(col_block_size) \ + .qmin(128); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_4_qmax128) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_4_qmax128) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ - .mr(MR) \ - .nr(NR) \ - .m(MR) \ - .n(NR) \ - .k(3) \ - .rowBlockSize(row_block_size) \ - .colBlockSize(col_block_size) \ - .qmax(128) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ + .mr(MR) \ + .nr(NR) \ + .m(MR) \ + .n(NR) \ + .k(3) \ + .rowBlockSize(row_block_size) \ + .colBlockSize(col_block_size) \ + .qmax(128); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_4_azp0) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_4_azp0) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -106,15 +136,21 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(3) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .aZeroPoint(0) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .aZeroPoint(0); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ 
\ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_4_bzp0) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_4_bzp0) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -122,15 +158,21 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(3) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .bZeroPoint(0) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .bZeroPoint(0); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_4_nozp) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_4_nozp) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -139,30 +181,42 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ .aZeroPoint(0) \ - .bZeroPoint(0) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .bZeroPoint(0); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_8) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_8) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ - .mr(MR) \ - .nr(NR) \ - .m(MR) \ - .n(NR) \ - .k(5) \ - .rowBlockSize(row_block_size) \ - .colBlockSize(col_block_size) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ + .mr(MR) \ + .nr(NR) \ + .m(MR) \ + .n(NR) \ + .k(5) \ + .rowBlockSize(row_block_size) \ + .colBlockSize(col_block_size); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_8_strided_a) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_8_strided_a) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -170,15 +224,21 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(5) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .aStride(37) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .aStride(37); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + 
compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_8_strided_c) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_8_strided_c) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -186,47 +246,65 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(5) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .cStride(17) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .cStride(17); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_8_qmin128) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_8_qmin128) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ - .mr(MR) \ - .nr(NR) \ - .m(MR) \ - .n(NR) \ - .k(5) \ - .rowBlockSize(row_block_size) \ - .colBlockSize(col_block_size) \ - .qmin(128) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ + .mr(MR) \ + .nr(NR) \ + .m(MR) \ + .n(NR) \ + .k(5) \ + .rowBlockSize(row_block_size) \ + .colBlockSize(col_block_size) \ + .qmin(128); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_8_qmax128) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_8_qmax128) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ - .mr(MR) \ - .nr(NR) \ - .m(MR) \ - .n(NR) \ - .k(5) \ - .rowBlockSize(row_block_size) \ - .colBlockSize(col_block_size) \ - .qmax(128) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ + .mr(MR) \ + .nr(NR) \ + .m(MR) \ + .n(NR) \ + .k(5) \ + .rowBlockSize(row_block_size) \ + .colBlockSize(col_block_size) \ + .qmax(128); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_8_azp0) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_8_azp0) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -234,15 +312,21 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(5) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .aZeroPoint(0) \ - .test_packed( \ - prepacking_kernel, \ - 
compute_kernel); \ + .aZeroPoint(0); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_8_bzp0) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_8_bzp0) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -250,15 +334,21 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(5) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .bZeroPoint(0) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .bZeroPoint(0); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_lt_8_nozp) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_lt_8_nozp) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -267,30 +357,42 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ .aZeroPoint(0) \ - .bZeroPoint(0) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .bZeroPoint(0); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_eq_8) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_eq_8) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ - .mr(MR) \ - .nr(NR) \ - .m(MR) \ - .n(NR) \ - .k(8) \ - .rowBlockSize(row_block_size) \ - .colBlockSize(col_block_size) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ + .mr(MR) \ + .nr(NR) \ + .m(MR) \ + .n(NR) \ + .k(8) \ + .rowBlockSize(row_block_size) \ + .colBlockSize(col_block_size); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_eq_8_strided_a) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_eq_8_strided_a) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -298,15 +400,21 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(8) \ .rowBlockSize(row_block_size) \ 
.colBlockSize(col_block_size) \ - .aStride(37) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .aStride(37); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_eq_8_strided_c) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_eq_8_strided_c) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -314,47 +422,65 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(8) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .cStride(17) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .cStride(17); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_eq_8_qmin128) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_eq_8_qmin128) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ - .mr(MR) \ - .nr(NR) \ - .m(MR) \ - .n(NR) \ - .k(8) \ - .rowBlockSize(row_block_size) \ - .colBlockSize(col_block_size) \ - .qmin(128) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ + .mr(MR) \ + .nr(NR) \ + .m(MR) \ + .n(NR) \ + .k(8) \ + .rowBlockSize(row_block_size) \ + .colBlockSize(col_block_size) \ + .qmin(128); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_eq_8_qmax128) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_eq_8_qmax128) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ - .mr(MR) \ - .nr(NR) \ - .m(MR) \ - .n(NR) \ - .k(8) \ - .rowBlockSize(row_block_size) \ - .colBlockSize(col_block_size) \ - .qmax(128) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ + .mr(MR) \ + .nr(NR) \ + .m(MR) \ + .n(NR) \ + .k(8) \ + .rowBlockSize(row_block_size) \ + .colBlockSize(col_block_size) \ + .qmax(128); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_eq_8_azp0) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_eq_8_azp0) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ 
.nr(NR) \ .m(MR) \ @@ -362,15 +488,21 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(8) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .aZeroPoint(0) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .aZeroPoint(0); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_eq_8_bzp0) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_eq_8_bzp0) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -378,15 +510,21 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(8) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .bZeroPoint(0) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .bZeroPoint(0); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_eq_8_nozp) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_eq_8_nozp) { \ TEST_REQUIRES_ARM_NEON; \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -395,33 +533,45 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ .aZeroPoint(0) \ - .bZeroPoint(0) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .bZeroPoint(0); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_gt_8) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_gt_8) { \ TEST_REQUIRES_ARM_NEON; \ for (size_t k = 9; k < 16; k++) { \ - GemmBlockSparseMicrokernelTester() \ - .mr(MR) \ - .nr(NR) \ - .m(MR) \ - .n(NR) \ - .k(k) \ - .rowBlockSize(row_block_size) \ - .colBlockSize(col_block_size) \ - .test_packed( \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ + .mr(MR) \ + .nr(NR) \ + .m(MR) \ + .n(NR) \ + .k(k) \ + .rowBlockSize(row_block_size) \ + .colBlockSize(col_block_size); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ prepacking_kernel, \ - compute_kernel); \ + compute_kernel_w8); \ } \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_gt_8_strided_a) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_gt_8_strided_a) { \ TEST_REQUIRES_ARM_NEON; \ for (size_t k = 9; k < 16; 
k++) { \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -429,17 +579,23 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(k) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .aStride(37) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .aStride(37); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_gt_8_strided_c) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_gt_8_strided_c) { \ TEST_REQUIRES_ARM_NEON; \ for (size_t k = 9; k < 16; k++) { \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -447,17 +603,23 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(k) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .cStride(17) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .cStride(17); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_gt_8_azp0) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_gt_8_azp0) { \ TEST_REQUIRES_ARM_NEON; \ for (size_t k = 9; k < 16; k++) { \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -465,17 +627,23 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(k) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .aZeroPoint(0) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .aZeroPoint(0); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_gt_8_bzp0) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_gt_8_bzp0) { \ TEST_REQUIRES_ARM_NEON; \ for (size_t k = 9; k < 16; k++) { \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -483,17 +651,23 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(k) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .bZeroPoint(0) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .bZeroPoint(0); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x 
##col_block_size ## __AARCH3232_NEON, packedA_k_gt_8_nozp) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_gt_8_nozp) { \ TEST_REQUIRES_ARM_NEON; \ for (size_t k = 9; k < 16; k++) { \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -502,19 +676,25 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ .aZeroPoint(0) \ - .bZeroPoint(0) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .bZeroPoint(0); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_gt_8_subtile) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_gt_8_subtile) { \ TEST_REQUIRES_ARM_NEON; \ for (size_t k = 9; k < 16; k++) { \ for (uint32_t m = 1; m <= MR; m++) { \ for (uint32_t n = 1; n <= NR; n++) { \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(m) \ @@ -522,36 +702,48 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(k) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .iterations(3) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .iterations(3); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ } \ } \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_div_8) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_div_8) { \ TEST_REQUIRES_ARM_NEON; \ for (size_t k = 16; k < 128; k += 8) { \ - GemmBlockSparseMicrokernelTester() \ - .mr(MR) \ - .nr(NR) \ - .m(MR) \ - .n(NR) \ - .k(k) \ - .rowBlockSize(row_block_size) \ - .colBlockSize(col_block_size) \ - .test_packed( \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ + .mr(MR) \ + .nr(NR) \ + .m(MR) \ + .n(NR) \ + .k(k) \ + .rowBlockSize(row_block_size) \ + .colBlockSize(col_block_size); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ prepacking_kernel, \ - compute_kernel); \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_div_8_strided_a) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_div_8_strided_a) { \ TEST_REQUIRES_ARM_NEON; \ for (size_t k = 16; k < 128; k += 8) { \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -559,17 +751,23 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(k) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .aStride(171) \ - .test_packed( \ - prepacking_kernel, 
\ - compute_kernel); \ + .aStride(171); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_div_8_strided_c) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_div_8_strided_c) { \ TEST_REQUIRES_ARM_NEON; \ for (size_t k = 16; k < 128; k += 8) { \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(MR) \ @@ -577,19 +775,25 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(k) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .cStride(17) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .cStride(17); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ } \ \ -TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH3232_NEON, packedA_k_div_8_subtile) { \ +TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARCH32_NEON, packedA_k_div_8_subtile) { \ TEST_REQUIRES_ARM_NEON; \ for (size_t k = 16; k < 128; k += 24) { \ for (uint32_t m = 1; m <= MR; m++) { \ for (uint32_t n = 1; n <= NR; n++) { \ - GemmBlockSparseMicrokernelTester() \ + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() \ .mr(MR) \ .nr(NR) \ .m(m) \ @@ -597,33 +801,43 @@ TEST(Q8GEMM__##MR ## x ##NR ## c##row_block_size ## x ##col_block_size ## __AARC .k(k) \ .rowBlockSize(row_block_size) \ .colBlockSize(col_block_size) \ - .iterations(3) \ - .test_packed( \ - prepacking_kernel, \ - compute_kernel); \ + .iterations(3); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w32); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w16); \ + tester.test_packed( \ + prepacking_kernel, \ + compute_kernel_w8); \ } \ } \ } \ } -#define TEST_PACKED_1x4_SPARSE_OP(MR, NR, prepacking_kernel, compute_kernel) \ +#define TEST_PACKED_1x4_SPARSE_OP(MR, NR, prepacking_kernel, compute_kernel_w32, compute_kernel_w16, compute_kernel_w8) \ TEST_PACKED_ROW_BLOCK_SIZEXCOL_BLOCK_SIZE_SPARSE_OP(MR, \ - NR, 1, 4, prepacking_kernel, compute_kernel) -#define TEST_PACKED_8x1_SPARSE_OP(MR, NR, prepacking_kernel, compute_kernel) \ + NR, 1, 4, prepacking_kernel, compute_kernel_w32, compute_kernel_w16, compute_kernel_w8) +#define TEST_PACKED_8x1_SPARSE_OP(MR, NR, prepacking_kernel, compute_kernel_w32, compute_kernel_w16, compute_kernel_w8) \ TEST_PACKED_ROW_BLOCK_SIZEXCOL_BLOCK_SIZE_SPARSE_OP(MR, \ - NR, 8, 1, prepacking_kernel, compute_kernel) + NR, 8, 1, prepacking_kernel, compute_kernel_w32, compute_kernel_w16, compute_kernel_w8) #if CPUINFO_ARCH_ARM TEST_PACKED_1x4_SPARSE_OP( 4, 8, pytorch_q8gemm_sparse_packA_ukernel_4x4__aarch32_neon, - pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA__aarch32_neon) + pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w32__aarch32_neon, + pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w16__aarch32_neon, + pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA_w8__aarch32_neon) TEST_PACKED_8x1_SPARSE_OP( 4, 8, pytorch_q8gemm_sparse_packA_ukernel_4x4__aarch32_neon, - 
pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA__aarch32_neon) + pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w32__aarch32_neon, + pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w16__aarch32_neon, + pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA_w8__aarch32_neon) #endif @@ -633,12 +847,16 @@ TEST_PACKED_1x4_SPARSE_OP( 8, 8, pytorch_q8gemm_sparse_packA_ukernel_8x4__aarch64_neon, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA__aarch64_neon) + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w32__aarch64_neon, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w16__aarch64_neon, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x8_packedA_w8__aarch64_neon) TEST_PACKED_8x1_SPARSE_OP( 8, 8, pytorch_q8gemm_sparse_packA_ukernel_8x4__aarch64_neon, - pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA__aarch64_neon) + pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w32__aarch64_neon, + pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w16__aarch64_neon, + pytorch_q8gemm_dq_sparse_8x1_ukernel_8x8_packedA_w8__aarch64_neon) #endif @@ -646,367 +864,613 @@ TEST_PACKED_8x1_SPARSE_OP( TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_4) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester().mr(8).nr(4).m(8).n(4).k(3).test_packed( + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() + .mr(8) + .nr(4) + .m(8) + .n(4) + .k(3); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_4_strided_a) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(3) - .aStride(37) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .aStride(37); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_4_strided_c) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(3) - .cStride(17) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .cStride(17); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_4_qmin128) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester().mr(8).nr(4).m(8).n(4).k(3).qmin(128).test_packed( + 
GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() + .mr(8) + .nr(4) + .m(8) + .n(4) + .k(3) + .qmin(128); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_4_qmax128) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester().mr(8).nr(4).m(8).n(4).k(3).qmax(128).test_packed( + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() + .mr(8) + .nr(4) + .m(8) + .n(4) + .k(3) + .qmax(128); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_4_azp0) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(3) - .aZeroPoint(0) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .aZeroPoint(0); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_4_bzp0) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(3) - .bZeroPoint(0) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .bZeroPoint(0); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_4_nozp) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(3) .aZeroPoint(0) - .bZeroPoint(0) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .bZeroPoint(0); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + 
pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_8) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester().mr(8).nr(4).m(8).n(4).k(5).test_packed( + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() + .mr(8) + .nr(4) + .m(8) + .n(4) + .k(5); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_8_strided_a) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(5) - .aStride(37) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .aStride(37); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_8_strided_c) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(5) - .cStride(17) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .cStride(17); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_8_qmin128) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester().mr(8).nr(4).m(8).n(4).k(5).qmin(128).test_packed( + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() + .mr(8) + .nr(4) + .m(8) + .n(4) + .k(5) + .qmin(128); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_8_qmax128) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester().mr(8).nr(4).m(8).n(4).k(5).qmax(128).test_packed( + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() + .mr(8) + .nr(4) + .m(8) + .n(4) + .k(5) + .qmax(128); + 
tester.test_packed( pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_8_azp0) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(5) - .aZeroPoint(0) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .aZeroPoint(0); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_8_bzp0) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(5) - .bZeroPoint(0) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .bZeroPoint(0); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_lt_8_nozp) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(5) .aZeroPoint(0) - .bZeroPoint(0) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .bZeroPoint(0); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_eq_8) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester().mr(8).nr(4).m(8).n(4).k(8).test_packed( + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() + .mr(8) + .nr(4) + .m(8) + .n(4) + .k(8); + tester.test_packed( pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } 
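The hunks above and below repeat one mechanical refactor: the tester used to be a temporary whose builder chain ended in a single test_packed() call, and it is now bound to a named variable so the same configuration can be replayed against the three _w32/_w16/_w8 compute-kernel variants. The following minimal, self-contained sketch illustrates that pattern only; MiniTester, its setters, and the kernel names here are hypothetical stand-ins, not the actual GemmBlockSparseMicrokernelTester or QNNPACK kernel APIs.

    #include <cstdio>
    #include <cstddef>

    typedef void (*KernelFn)();

    // Hypothetical stand-ins for the prepacking and compute microkernels.
    void prepack_kernel()     { std::puts("  prepack"); }
    void compute_kernel_w32() { std::puts("  compute (w32)"); }
    void compute_kernel_w16() { std::puts("  compute (w16)"); }
    void compute_kernel_w8()  { std::puts("  compute (w8)"); }

    class MiniTester {
     public:
      MiniTester& k(std::size_t value)       { k_ = value; return *this; }
      MiniTester& aStride(std::size_t value) { a_stride_ = value; return *this; }
      // Runs one prepack/compute pair against the stored configuration.
      void test_packed(KernelFn prepack, KernelFn compute) const {
        std::printf("k=%zu aStride=%zu:\n", k_, a_stride_);
        prepack();
        compute();
      }
     private:
      std::size_t k_ = 8;
      std::size_t a_stride_ = 0;
    };

    int main() {
      // Old shape of the tests: the builder chain ends in a single
      // test_packed() call, so covering another kernel variant means
      // repeating the whole chain.
      MiniTester().k(8).aStride(37).test_packed(prepack_kernel, compute_kernel_w32);

      // New shape: bind the configured tester to a name once, then replay it
      // against every compute-kernel variant.
      MiniTester tester = MiniTester().k(8).aStride(37);
      const KernelFn variants[] = {compute_kernel_w32, compute_kernel_w16, compute_kernel_w8};
      for (KernelFn compute : variants) {
        tester.test_packed(prepack_kernel, compute);
      }
      return 0;
    }

Binding the builder result to an lvalue is what makes the reuse possible: in this sketch test_packed() returns void, mirroring how the diff never chains past it, so a chained temporary can only ever drive one kernel.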
TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_eq_8_strided_a) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(8) - .aStride(37) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .aStride(37); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_eq_8_strided_c) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(8) - .cStride(17) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .cStride(17); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_eq_8_qmin128) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester().mr(8).nr(4).m(8).n(4).k(8).qmin(128).test_packed( + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() + .mr(8) + .nr(4) + .m(8) + .n(4) + .k(8) + .qmin(128); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_eq_8_qmax128) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester().mr(8).nr(4).m(8).n(4).k(8).qmax(128).test_packed( + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() + .mr(8) + .nr(4) + .m(8) + .n(4) + .k(8) + .qmax(128); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_eq_8_azp0) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(8) - .aZeroPoint(0) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .aZeroPoint(0); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + 
pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_eq_8_bzp0) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(8) - .bZeroPoint(0) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .bZeroPoint(0); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_eq_8_nozp) { TEST_REQUIRES_X86_SSE2; - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(8) .aZeroPoint(0) - .bZeroPoint(0) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .bZeroPoint(0); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_gt_8) { TEST_REQUIRES_X86_SSE2; for (size_t k = 9; k < 16; k++) { - GemmBlockSparseMicrokernelTester().mr(8).nr(4).m(8).n(4).k(k).test_packed( + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() + .mr(8) + .nr(4) + .m(8) + .n(4) + .k(k); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_gt_8_strided_a) { TEST_REQUIRES_X86_SSE2; for (size_t k = 9; k < 16; k++) { - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(k) - .aStride(37) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .aStride(37); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_gt_8_strided_c) { 
TEST_REQUIRES_X86_SSE2; for (size_t k = 9; k < 16; k++) { - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(k) - .cStride(17) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .cStride(17); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_gt_8_azp0) { TEST_REQUIRES_X86_SSE2; for (size_t k = 9; k < 16; k++) { - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(k) - .aZeroPoint(0) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .aZeroPoint(0); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_gt_8_bzp0) { TEST_REQUIRES_X86_SSE2; for (size_t k = 9; k < 16; k++) { - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(k) - .bZeroPoint(0) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .bZeroPoint(0); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_gt_8_nozp) { TEST_REQUIRES_X86_SSE2; for (size_t k = 9; k < 16; k++) { - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(k) .aZeroPoint(0) - .bZeroPoint(0) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .bZeroPoint(0); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } } @@ -1015,16 +1479,22 @@ TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_gt_8_subtile) { for (size_t k = 9; k < 16; k++) { for (uint32_t m = 1; m <= 8; m++) { for (uint32_t n = 1; n <= 4; n++) { - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) 
.m(m) .n(n) .k(k) - .iterations(3) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .iterations(3); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } } } @@ -1033,41 +1503,65 @@ TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_gt_8_subtile) { TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_div_8) { TEST_REQUIRES_X86_SSE2; for (size_t k = 16; k < 128; k += 8) { - GemmBlockSparseMicrokernelTester().mr(8).nr(4).m(8).n(4).k(k).test_packed( + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() + .mr(8) + .nr(4) + .m(8) + .n(4) + .k(k); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_div_8_strided_a) { TEST_REQUIRES_X86_SSE2; for (size_t k = 16; k < 128; k += 8) { - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(k) - .aStride(171) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .aStride(171); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } } TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_div_8_strided_c) { TEST_REQUIRES_X86_SSE2; for (size_t k = 16; k < 128; k += 8) { - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(8) .n(4) .k(k) - .cStride(17) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .cStride(17); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } } @@ -1076,16 +1570,22 @@ TEST(Q8GEMM_8x4c1x4__SSE2, packedA_k_div_8_subtile) { for (size_t k = 16; k < 128; k += 24) { for (uint32_t m = 1; m <= 8; m++) { for (uint32_t n = 1; n <= 4; n++) { - GemmBlockSparseMicrokernelTester() + GemmBlockSparseMicrokernelTester tester = GemmBlockSparseMicrokernelTester() .mr(8) .nr(4) .m(m) .n(n) .k(k) - .iterations(3) - .test_packed( - pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, - 
pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA__sse2); + .iterations(3); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w32__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w16__sse2); + tester.test_packed( + pytorch_q8gemm_sparse_packA_ukernel_8x4__sse2, + pytorch_q8gemm_dq_sparse_1x4_ukernel_8x4_packedA_w8__sse2); } } } diff --git a/aten/src/ATen/native/quantized/cpu/qnormalization.cpp b/aten/src/ATen/native/quantized/cpu/qnormalization.cpp index ddfbad8917f7..f9b94ec4e49d 100644 --- a/aten/src/ATen/native/quantized/cpu/qnormalization.cpp +++ b/aten/src/ATen/native/quantized/cpu/qnormalization.cpp @@ -1,11 +1,16 @@ -#include +#include #include #include -#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + #include #include diff --git a/aten/src/ATen/native/quantized/cpu/qrelu.cpp b/aten/src/ATen/native/quantized/cpu/qrelu.cpp index e4ca887fb674..fcdfb0e9260c 100644 --- a/aten/src/ATen/native/quantized/cpu/qrelu.cpp +++ b/aten/src/ATen/native/quantized/cpu/qrelu.cpp @@ -1,6 +1,8 @@ -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include +#include #include #include #include @@ -10,6 +12,19 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#include +#include +#endif + #include namespace at { diff --git a/aten/src/ATen/native/quantized/cpu/qsigmoid.cpp b/aten/src/ATen/native/quantized/cpu/qsigmoid.cpp index 354590e211c7..862d2bad49dd 100644 --- a/aten/src/ATen/native/quantized/cpu/qsigmoid.cpp +++ b/aten/src/ATen/native/quantized/cpu/qsigmoid.cpp @@ -1,15 +1,22 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include -#include -#include -#include #include #include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + #include namespace at { diff --git a/aten/src/ATen/native/quantized/cpu/qtanh.cpp b/aten/src/ATen/native/quantized/cpu/qtanh.cpp index fde8f41630df..5dc3e759ede1 100644 --- a/aten/src/ATen/native/quantized/cpu/qtanh.cpp +++ b/aten/src/ATen/native/quantized/cpu/qtanh.cpp @@ -1,16 +1,19 @@ -#include -#include -#include -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include #include #include -#include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cpu/qthreshold.cpp b/aten/src/ATen/native/quantized/cpu/qthreshold.cpp index 6c1f10356d98..c2b03638c0ea 100644 --- a/aten/src/ATen/native/quantized/cpu/qthreshold.cpp +++ b/aten/src/ATen/native/quantized/cpu/qthreshold.cpp @@ -1,9 +1,16 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + #include namespace at { diff --git a/aten/src/ATen/native/quantized/cuda/Activation.cpp b/aten/src/ATen/native/quantized/cuda/Activation.cpp index 3154c2a75dd0..3a9e400fa81b 100644 --- a/aten/src/ATen/native/quantized/cuda/Activation.cpp +++ b/aten/src/ATen/native/quantized/cuda/Activation.cpp @@ -1,5 +1,6 @@ #include #include +#include namespace at { namespace native { @@ -17,5 +18,13 @@ 
Tensor gelu_quantized_cuda(const Tensor& qx, c10::string_view approximate) { return at::quantize_per_tensor(result_fp32, qx.q_scale(), qx.q_zero_point(), qx.scalar_type()); } +Tensor relu_quantized_cuda(const Tensor& self) { + auto zero_point = self.q_zero_point(); + auto int_repr = self.int_repr(); + auto mask = (int_repr > zero_point); + const auto relu_int_repr = at::where(mask, int_repr, zero_point); + return at::_make_per_tensor_quantized_tensor(relu_int_repr, self.q_scale(), zero_point); +} + } // namespace at::native } // namespace at diff --git a/aten/src/ATen/native/quantized/cuda/Activation.cu b/aten/src/ATen/native/quantized/cuda/Activation.cu new file mode 100644 index 000000000000..9e3e3ba13ea6 --- /dev/null +++ b/aten/src/ATen/native/quantized/cuda/Activation.cu @@ -0,0 +1,21 @@ +#include +#include +#include + +namespace at { +namespace native { + +Tensor& relu_quantized_cuda_(Tensor& self) { + const auto zero_point = self.q_zero_point(); + AT_DISPATCH_QINT_TYPES( + self.scalar_type(), "qrelu_cuda", [&]() { + auto iter = TensorIterator::unary_op(self, self); + gpu_kernel(iter, [zero_point] GPU_LAMBDA(scalar_t value) -> scalar_t { + return scalar_t(std::max(value.val_, zero_point)); + }); + }); + return self; +} + +} // namespace at::native +} // namespace at diff --git a/aten/src/ATen/native/quantized/cuda/AffineQuantizer.cu b/aten/src/ATen/native/quantized/cuda/AffineQuantizer.cu index 6f251fc33502..c60dc57f9226 100644 --- a/aten/src/ATen/native/quantized/cuda/AffineQuantizer.cu +++ b/aten/src/ATen/native/quantized/cuda/AffineQuantizer.cu @@ -1,10 +1,20 @@ -#include -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { namespace { diff --git a/aten/src/ATen/native/quantized/cuda/EmbeddingBag.cu b/aten/src/ATen/native/quantized/cuda/EmbeddingBag.cu index 55b0b0d4f36d..0580c47b8c62 100644 --- a/aten/src/ATen/native/quantized/cuda/EmbeddingBag.cu +++ b/aten/src/ATen/native/quantized/cuda/EmbeddingBag.cu @@ -1,4 +1,6 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include #include #include @@ -7,6 +9,16 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cuda/FakeQuantizeCore.cu b/aten/src/ATen/native/quantized/cuda/FakeQuantizeCore.cu index e85622b3d4fa..3d340a303afb 100644 --- a/aten/src/ATen/native/quantized/cuda/FakeQuantizeCore.cu +++ b/aten/src/ATen/native/quantized/cuda/FakeQuantizeCore.cu @@ -1,7 +1,7 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include #include -#include #include #include #include diff --git a/aten/src/ATen/native/quantized/cuda/FusedObsFakeQuant.cu b/aten/src/ATen/native/quantized/cuda/FusedObsFakeQuant.cu index a448a7cca215..d75a10c0db89 100644 --- a/aten/src/ATen/native/quantized/cuda/FusedObsFakeQuant.cu +++ b/aten/src/ATen/native/quantized/cuda/FusedObsFakeQuant.cu @@ -1,9 +1,21 @@ -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#include +#include +#include +#include +#include +#endif + #include namespace at { diff --git a/aten/src/ATen/native/quantized/cuda/IntReprQuant.cu 
b/aten/src/ATen/native/quantized/cuda/IntReprQuant.cu index 497b94d020f3..082244ca0c85 100644 --- a/aten/src/ATen/native/quantized/cuda/IntReprQuant.cu +++ b/aten/src/ATen/native/quantized/cuda/IntReprQuant.cu @@ -1,8 +1,17 @@ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include #include -#include -#include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cuda/MakePerTensorQuantizedTensor.cu b/aten/src/ATen/native/quantized/cuda/MakePerTensorQuantizedTensor.cu index 82fc77735a94..ce5a54ceec16 100644 --- a/aten/src/ATen/native/quantized/cuda/MakePerTensorQuantizedTensor.cu +++ b/aten/src/ATen/native/quantized/cuda/MakePerTensorQuantizedTensor.cu @@ -1,7 +1,20 @@ -#include -#include +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS +#include +#include +#include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + namespace at { namespace native { diff --git a/aten/src/ATen/native/quantized/cudnn/BinaryOps.cpp b/aten/src/ATen/native/quantized/cudnn/BinaryOps.cpp index dce78c4bb294..fbb46b4b0174 100644 --- a/aten/src/ATen/native/quantized/cudnn/BinaryOps.cpp +++ b/aten/src/ATen/native/quantized/cudnn/BinaryOps.cpp @@ -18,6 +18,13 @@ #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#endif + #include namespace at { @@ -64,10 +71,10 @@ std::unordered_map> PackedConvWeightCudnn< int64_t groups, bool transpose) { // TODO: need to check out to implement groups for conv operator in Conv.cpp - TORCH_CHECK(groups == 1, "Quantized cudnn conv2d is currenty limited to groups = 1; received groups =", groups); + TORCH_CHECK(groups == 1, "Quantized cudnn conv2d is currently limited to groups = 1; received groups =", groups); TORCH_CHECK(weight.qscheme() == c10::kPerTensorAffine, "Unsupported qscheme: ", toString(weight.qscheme())); TORCH_CHECK( kSpatialDim == 2, // 1D is packed as 2d, hence we don't need other checks diff --git a/aten/src/ATen/native/quantized/cudnn/utils.h b/aten/src/ATen/native/quantized/cudnn/utils.h index 1a58e8f38456..4e5f663efa16 100644 --- a/aten/src/ATen/native/quantized/cudnn/utils.h +++ b/aten/src/ATen/native/quantized/cudnn/utils.h @@ -19,6 +19,12 @@ This file contains some of the auxiliary functions used by both Conv.cpp & Linea #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#endif + struct PackedLinearWeightCudnn : public LinearPackedParamsBase { PackedLinearWeightCudnn( at::Tensor orig_weight, @@ -207,7 +213,11 @@ uint8_t getAlignment(const at::Tensor &t) { // alignment are in bytes uint8_t alignment = 1; uintptr_t address = reinterpret_cast(t.data_ptr()); - while (address % alignment == 0 && alignment < 16) alignment *= 2; + for (; alignment < 16; alignment *= 2) { + if (address % (alignment * 2)) { + return alignment; + } + } return alignment; } diff --git a/aten/src/ATen/native/quantized/qconv_unpack.cpp b/aten/src/ATen/native/quantized/qconv_unpack.cpp index 41f4754e8f1b..90e210ebe227 100644 --- a/aten/src/ATen/native/quantized/qconv_unpack.cpp +++ b/aten/src/ATen/native/quantized/qconv_unpack.cpp @@ -7,9 +7,12 @@ The implementations for the unpack functions can be found in /cpu/qconv_unpack_i and /cudnn/ConvUnpackImpl.cpp, for cudnn. 
*/ +#define TORCH_ASSERT_ONLY_METHOD_OPERATORS #include -#include +#include +#include +#include #include #include #include @@ -17,6 +20,15 @@ and /cudnn/ConvUnpackImpl.cpp, for cudnn. #include #include +#ifndef AT_PER_OPERATOR_HEADERS +#include +#else +#include +#include +#include +#endif + + namespace at { namespace native { namespace { @@ -36,7 +48,8 @@ class QConvUnpackWeightsInt8 final { auto& ctx = at::globalContext(); #ifdef USE_FBGEMM - if (ctx.qEngine() == at::QEngine::FBGEMM) { + if (ctx.qEngine() == at::QEngine::FBGEMM || + ctx.qEngine() == at::QEngine::X86) { return packed_weight->unpack(); } #endif @@ -72,7 +85,8 @@ class QConv1dUnpackWeightsInt8 final { at::Tensor weight; c10::optional bias; #ifdef USE_FBGEMM - if (ctx.qEngine() == at::QEngine::FBGEMM) { + if (ctx.qEngine() == at::QEngine::FBGEMM || + ctx.qEngine() == at::QEngine::X86) { std::tie(weight, bias) = packed_weight->unpack(); weight = weight.squeeze_(quant_utils::kConv1dSqueezeDim + 2); return std::tuple>(weight, bias); diff --git a/aten/src/ATen/native/sparse/Macros.h b/aten/src/ATen/native/sparse/Macros.h new file mode 100644 index 000000000000..7dac5b04e6f8 --- /dev/null +++ b/aten/src/ATen/native/sparse/Macros.h @@ -0,0 +1,19 @@ +#pragma once + +#if defined(__CUDACC__) || defined(__HIPCC__) +#define GPUCC +#define FUNCAPI __host__ __device__ +#define INLINE __forceinline__ +#else +#define FUNCAPI +#define INLINE inline +#endif + +#if defined(_WIN32) || defined(_WIN64) +// Temporarily disable __restrict on Windows, +// as it turns out not all MSVC versions are aware of it. +// #define RESTRICT __restrict +#define RESTRICT +#else +#define RESTRICT __restrict__ +#endif diff --git a/aten/src/ATen/native/sparse/SparseBinaryOpIntersectionCommon.h b/aten/src/ATen/native/sparse/SparseBinaryOpIntersectionCommon.h new file mode 100644 index 000000000000..08ba4de68cac --- /dev/null +++ b/aten/src/ATen/native/sparse/SparseBinaryOpIntersectionCommon.h @@ -0,0 +1,585 @@ +#pragma once + +#include +#include +#include +#include +#include +#include + +#ifndef AT_PER_OPERATOR_HEADERS +#include +#include +#else +#include +#include +#include +#include +#include +#endif + +#ifdef GPUCC +#define NAME "sparse_binary_op_intersection_cuda" +#else +#define NAME "sparse_binary_op_intersection_cpu" +#endif + +#define CALL(...) __VA_ARGS__(); +#define EXPAND(b, n, ...) \ + if (b) { \ + using index_t ## n = int32_t; \ + __VA_ARGS__ \ + } \ + else { \ + using index_t ## n = int64_t; \ + __VA_ARGS__ \ + } +#define BOOL_TO_INDEX_TYPE1(b0, ...) \ + EXPAND(b0, 0, CALL(__VA_ARGS__)) +#define BOOL_TO_INDEX_TYPE2(b1, b0, ...) \ + EXPAND(b1, 1, BOOL_TO_INDEX_TYPE1(b0, __VA_ARGS__)) +#define BOOL_TO_INDEX_TYPE3(b2, b1, b0, ...) \ + EXPAND(b2, 2, BOOL_TO_INDEX_TYPE2(b1, b0, __VA_ARGS__)) + +namespace at { +namespace native { + +namespace { + +using at::sparse::get_sparse_impl; + +// ForwardIt: only legacy random access iterator is supported. +template +static FUNCAPI INLINE +ForwardIt find_bound(ForwardIt first, ForwardIt last, const T& value) { + ForwardIt RESTRICT it; + typename std::iterator_traits::difference_type count, step; + // NOTE: std::distance(first, last) compiles but produces wrong results on CUDA, + // so only legacy random access iterators are safe in this code. + count = last - first; + + while (count > 0) { + it = first; + step = count / 2; + // avoiding std::advance(it, step), + // although it does work unlike std::distance on CUDA. + it += step; + // The decision which separates finding a lower bound vs an upper bound. 
+ // Note that a lower bound is a value at *it with the smallest index + // such that *it >= value if such value exists, or last if it does not. + // Similarly, an upper bound is a value at *it with the smallest index + // such that *it > value if such value exists, or last if it does not. + // If is_lower = true and *it < value, then we know that *it and values + // preceding *it cannot contain a lower bound, so we adjust the initial iterator range + // from [first, first + count] to [first + step + 1, first + count - (step + 1)], + // where +1 skips the element at which we have just evaluated *it < value. + // Similar logic holds when is_lower = false. + if (is_lower ? *it < value : value >= *it) { + first = ++it; + count -= step + 1; + } + else { + count = step; + } + } + return first; +} + +template